WO2009067741A1

WO2009067741A1 - Bandwidth compression of parametric soundfield representations for transmission and storage

Info

Publication number: WO2009067741A1
Application number: PCT/AU2008/001748
Authority: WO
Inventors: Dipanjan Sen
Original assignee: Acouity Pty Ltd
Priority date: 2007-11-27
Filing date: 2008-11-27
Publication date: 2009-06-04

Abstract

The invention relates to the coding and subsequent decoding of parametric representation of soundfields. The spatially and temporally sampled pressure field in a compact three-dimensional target area can be parameterised primarily by decomposition onto orthogonal basis functions and secondarily by taking advantage of spatial and temporal correlations between the first set of primary parameters. Both primary and secondary parameters are subsequently coded using perceptual psychoacoustic thresholds. Additionally, spatial and information theoretic analysis of the parameters provide an optimal updating rate of the parameters as well as the maximal order required. The signal dependent order and update rate of the parametric representation forms part of the encoded bitstream. The psychoacoustic thresholds are a function of both the spatial distribution of acoustic energy (outside of the target area) as well as the frequency distribution of the sound impinging on the ear of a listener inside the target listening area. The calculated psychoacoustic thresholds reflect the precision required to represent the primary and/or secondary parameters for transparent perception of both sounds and their spatial location. This facilitates various quantization techniques to encode the parameters. The coded parameters may be stored or transmitted to a receiver unit. At the receiver unit, the coded parameters are dequantized and an adaptive synthesis system generates loudspeaker feeds to allow the reconstruction of a perceptually transparent acoustic field. The decoder is able to adjust to the differing and time-varying update rates as well as as the time-varying order of the parametric representation. The synthesis system adapts to the number and geometric configuration of the loudspeakers at the receiver location.

Description

Bandwidth Compression of Parametric Soundfield Representations for transmission and storage

Field of the invention The present invention relates to the field of two and three-dimensional soundfield representations. In particular, the invention provides techniques and apparatus for the efficient representation of the soundfields. The representation facilitates subsequent transmission and/or storage of soundfields for its eventual reproduction and re-synthesis at an alternate location and time to which it was originally recorded. The invention has applications in the fields of audio distribution, archiving, streaming, surveillance, digital gaming and telepresence.

Background of the invention and prior art

The demand for high quality audio, its efficient storage and transmission remains at an all time high. This demand is from consumers, various industry and research sectors and professional studios. Streaming audio and video over the Internet, cheaper storage media, cheaper bandwidth over wired and wireless transmission medium and the technology of digital audio coding have all fuelled this seemingly insatiable demand for high quality audio. While two channel (stereo) audio has been the most prolific format, the deployment of multi-speaker home theatre systems has fuelled consumer desire for accurate spatial rendering of audio. Accurate spatial rendering would allow realistic perception of the ambience of the recording location as well as the ability to localise the source of sounds as present at the original recording location. The first step (refer to Figure 2B) towards facilitating this rendering of "immersive audio" at arbitrary locations (from living rooms to cars) is the spatial sampling of the original acoustic environment using a large number of microphones distributed geometrically in the vicinity to the area of interest. The second step is the representation of the acoustic field, sampled by the microphones in the previous step. The third step is the coding (or bandwidth reduction) of the acoustic field representation. The fourth step is the transmission (or storage) of this coded representation to the location of the consumer. The fifth step is the reception (or retrieval) and decoding of the acoustic field representation. The final step is the rendering of the audio through multiple loudspeakers whose individual "feeds" have to be derived from the acoustic field representation.

The invention described herein involves the coding (and associated decoding) of the acoustic field representation (third step in the previous paragraph) without which the transmission or storage of the acoustic field would be prohibitive due to the sheer bandwidth required.

The acoustic field representation of choice in this invention, are the coefficients which result from projecting the pressure field onto orthogonal basis function. One such decomposition result in the Fouher-Bessel (FB) coefficients which are the direct result of solving the three dimensional spherical wave-equation - entrenched in the fundamental physics of acoustic wave propagation. The theoretical basis for this representation can be found in numerous acoustic books including, "Skudrzyk, E., The foundations of Acoustics, Springer-Verlag, 1971 ". The discrete and infinite sequence of Fouher-Bessel coefficients are a complete representation of the acoustic field (or the pressure distribution at every spatial location at the recording venue) and are independent of the type, number and geometrical configuration of the microphones used to sample the acoustic field.

A similar representation forms the basis of Ambisonics technology. However, the representation is not as complete as the FB representation, due to the fact that unlike the FB decomposition, the Ambisonics representation ignores the pressure field variation as a function of radial distance and represents only the angular (azimuth and elevation) distribution of the pressure field. It is for this reason that the present invention uses the FB coefficients. However the use of Ambisonics representations (as well as other representations) of arbitrary order could just as well have been used to demonstrate the invention and would be in the scope of the present invention. A characteristic of the FB representation and its use in representing acoustic fields that also sets it apart from currently deployed audio coding (such as MP3, AAC, MPEG-Surround, etc), distribution and recording technologies is that the representation is completely independent of the acoustic conditions at the location of the consumer. The use of the FB representation thus requires the synthesis equipment to adapt to the local acoustic conditions as well as the type, number and geometrical configuration of the loudspeakers at the listening venue. In contrast, current audio distribution and audio coding systems assume that the playback loudspeakers are geometrically positioned in a standard configuration such as the ITU standardised 5.1 channel arrangement [ITU-R BS.775-1] or a pair of headphones. This decoupling between the analysis and synthesis implicit in the use of FB representation means that the acoustic field can be re-synthesised at any arbitrary locations and with arbitrary speaker arrangements - as long as necessary steps are taken to adjust to the local conditions by the synthesis equipment.

Current audio coding systems that attempt to capture and resynthesise spatial acoustics are known as Spatial Audio Coders (SAC). They are predicated upon an assumption of maintaining a standard spatial configuration of loudspeakers (such as the ITU standardised 5.1 channel arrangement) at the listening environment. This assumption allows SAC systems to use a reference loudspeaker feed for encoding and use a low-bandwidth side information to facilitate up-mixing to multiple channels of audio. The side information typically consists of such information as Inter-channel Level Differences (ICLD), Inter- channel Time Difference (ICTD) and Inter-channel coherence (ICC) which represent level and time differences between the reference loudspeaker feed and the other loudspeaker channels. These are only guaranteed to reflect the correct ICLD, ICTD and ICC at the listening environment if the consumer has adhered to the original assumption of a standardised spatial loudspeaker configuration (such as the ITU standardised 5.1 channel arrangement). The single monophonic channel which acts as the reference is usually encoded using psychoacoustic principles. To apply psychoacoustic principles to the monophonic signal (for encoding), an implicit assumption is made that the monophonic signal is representative of the signal that will impinge upon the listener's ears. When the rendering is not through headphones, this is indeed a contentious assumption, since, in reality, the signal that will impinge upon the listener's ears will be a superposition of the acoustic output from multiple loudspeakers filtered through a system whose impulse response is a function of the location of the speakers relative to the listener and the surrounding objects which cause various reflections and diffractions of the acoustic wavefronts. In contrast, the coding of the FB coefficients in the present invention, makes no assumptions about listening conditions and configurations. Further, SAC systems and general mono and stereo audio coders (such as AAC and earlier MP3 systems) are focused on coding the pressure signal as observed at one discrete point in space. In SAC, the discrete point is the location of the reference loudspeaker feed that is usually coded using psychoacoustic principles. For headphone based rendering and AAC type coders, the discrete points are the left and right ears of the listener. In sharp contrast, and fundamentally different is the coding of the acoustic field as per this invention which is principally an attempt to code the pressure signal at every point in space at the recording environment.

It is acknowledged that if the assumption of a certain loudspeaker geometric configuration (along with radiation patterns and number of loudspeakers) and acoustic conditions, as used in the SAC encoders could be adhered to at the listening environment - the coding efficiency would be unparalleled. However, the probability of this is little and the cost in terms of acoustic rendering enormous. With having lost all acoustic information but the /V-channels of loudspeaker feeds, it is a tall task to recreate the soundfield for arbitrary speaker locations and acoustic environments. This invention is thus an attempt to increase acoustic fidelity at a cost of increased bandwidth.

Practical constraints in obtaining an FB representation, such as the number of microphones for spatial sampling of the acoustic field (during recording) and the number of loudspeakers available at the consumer location to render the acoustic field will mean that the accuracy of the acoustic field will be maintained only in compact regions of space and only over a finite range of frequencies. However, the fundamental difference of coding an entire acoustic field as opposed to the pressure signal at a discrete location in space requires entirely different approaches for coding the FB coefficients.

The next three sections will provide a background on the state-of-the-art Perceptual Audio Coders, Psychoacoustic Models, and Multi-Channel SAC. Perceptual Audio Coders

Perceptual coding of audio takes advantage of the masking properties of the human peripheral auditory system which is known to tolerate the presence of noise in the simultaneous presence of a desirable audio signal. The time varying detection thresholds of the noise can be computed using computational psychoacoustic models (described below). Quantisation of the audio signal is then carried out ensuring that the quantisation noise is kept below this threshold of masking.

In practice, the perceptual coder is followed by an entropy coder (such as Huffman coding) which further minimises the amount of data required to represent the audio signal by taking advantage of the redundancies present in the digital signal. Examples of such coders include AAC and MP3 systems as well as recent SAC coders. Psvchoacoustic models used in audio codings Masking effects are predominantly caused by the peripheral mechanisms of the human auditory system. The psychoacoustic models used in audio coders strive to model this peripheral physiology by calculating the frequency response of the pressure signals to mimic the electro-mechanical response along the length of the cochlea. These psychoacoustic models are limited to mono-aural (single ear) perception. While it is inevitable that all auditory signals are processed by this peripheral mechanism, it cannot be doubted that perception that requires binaural hearing (such as auditory localisation) must be processed at higher levels of the auditory pathway - requiring alternate models to predict their behaviour.

An example is a model of Binaural Masking Level Difference (BMLD), which is often used in addition to the peripheral psychoacoustic models (simultaneous masking and temporal masking) in stereo (two channel) audio coding. BMLD is the effect whereby sensitivity is improved, lowering masking thresholds, due to stimuli perception in both years.

The prediction of BMLD in stereo coding, however, requires some knowledge of the temporal shift between the signals arriving at each ear. This shift is easily known if the final mode of delivery is via headphones. However, if the final mode of delivery is via loudspeakers, an assumption has to be made about the relative position of the listener with respect to the loudspeakers. A symmetrical assumption is usually made - ensuring that the left and right channels arrive at the listener's ears at the same time.

Problems associated with this (symmetrical assumption) include the possibility of the listener not being situated at the centre of the speakers as well as the fact that each channel will in fact leak to either ear - the effect of which is governed by the distance of the speakers from the listener as well as the head related transfer function of the listener.

BMLD is, however, only one perceptual effect that is attributed to binaural hearing. Perception of the localisation of auditory sources, perception of reverberation and room size, envelopment and other similar "auditory spatial features" all require binaural hearing processed beyond the cochlea in the auditory pathway.

Pertinent to this invention is the ability to localise sound sources and the effect of 'spatial release from masking' (SRM) which is the effect by which the ability to detect sounds increases with increased spatial separation of sound sources. We label both of these effects under the term "acuity" in this document and attribute them to the limited resolution of spatial hearing. The current invention aims to take advantage of limited spatial acuity by limiting the spatial resolution of the representation when an increased resolution in the representation provides no benefit in terms of aiding the listener to localise sounds. To conclude, two types of psychoacoustic models apply for the perception of spatial audio. The first is the traditional mono-aural simultaneous and temporal masking models that have very little to do with spatial localisation and the second is a psychoacoustic model that predicts the spatial acuity of hearing in the presence of contending acoustic sources at various spatial positions. The present invention describes novel computational techniques to model both peripheral auditory phenomena as well as spatial hearing phenomena - to facilitate the coding of acoustic fields. Multichannel audio coding & Spatial Audio Coding

Multichannel spatial audio coding (SAC) has been addressed by Dolby's AC3 algorithm, other Dolby technology and most recently standardised by MPEG (MPEG-surround). In all cases, Auditory Masking (perceptual coding) is used in combination with sub-band or transform coding. An exception is the DTS format [ETSI TS 102 114 V1.2.1] from DTS inc., which only uses information theoretic predictive quantisation (Adaptive Delta Pulse Code Modulation) to code individual channels providing a much lower compression ratio than the other perceptual coders). In all SAC systems and delivery methods, an assumption is made about the configuration of the loudspeaker geometry at the point of delivery. The principle behind this method of coding is that perception of the soundfield should be maintained as long as the relative delay, level and coherence between the multiple loudspeakers is reproduced during the playback of the multiple channels.

All of the techniques relies on the following assumptions: (i) an assumption of the geometry and configuration of the loud speakers used in final mode of delivery to the listeners (e.g. 5.1 speaker layout) (ii) an assumption that one of the channels represent an accurate pressure signal that will be incident on one of the two ears of the listener - this allows the encoder to apply psychoacoustic models of Auditory Masking.

If the assumptions about the final mode of delivery made by the encoder are not adhered to during the playback process, the perception of the decoded audio will be suboptimal. A primary manifestation of this sub-optimality will be the inaccurate spatial perception of sound sources. The above limitations illustrates that there is a need for alternative technology which:

(i) will also allow listeners to have the ability to move within a soundfield and perceive the same sensation as they would if they moved in exactly the same manner in the original soundfield (within a target area), (ii) makes no implicit assumption about the geometry of the playback transducers, but is able to re-synthesize the original pressure field in the close vicinity of the listener with as much precision as is required to enable the listener to perceive exactly the same auditory sensation as they would if they were to be immersed at a certain target location in the original soundfield.

(iii) provides adequate bandwidth compression to ensure efficient and inexpensive transmission/storage, while maintaining temporal and spatial fidelity of the perception of the audio. Associated with the creation of such technology will be the requirement to manufacture inexpensive synthesizing equipment for the consumer. While, the technology of resynthesizing the soundfield using multiple loudspeakers is available, the cost of such equipment is prohibitive.

Another problem in current multi-channel coding and playback technology is that they only cater for sound localisation in the horizontal (or 2D) plane. In MPEG-surround, this limitation is due to the coding of the ICTD, ICLD and ICC cues from horizontally placed loudspeaker feeds - which only cater for how humans localise sound in the horizontal plane. Again, this limitation is due to the encoder being forced to make simplistic assumptions on the listening environment's speaker layout and the total system (from recording, encoding, decoding and resynthesis algorithms).

Thus, there exists a need in the art for a method of reducing the amount of data required to accurately represent a three-dimensional soundfield and thereby facilitate transmission and storage for the average consumer. The technology will be required to recreate the soundfield for arbitrary number and type of loudspeakers and arbitrary acoustic conditions and the listener's venue.

Further, in situations where the number of microphones are limited, resulting in an imprecise soundfield recording and thereby an alternate perception to that of the original soundfield, there is a need for an encoder which ensures that the ensuing quantisation noise does not result in further deviations in perception.

The object of the invention is to address the problems in the art as discussed above along with other needs in the art which will become apparent to those skilled in the art from this disclosure.

Any reference herein to known prior art does not, unless the contrary indication appears, constitute an admission that such prior art is commonly known by those skilled in the art to which the invention relates, at the priority date of this application.

Summary of the Invention

The object of the present invention is the bandwidth compression (encoding) of acoustic field representations. The acoustic field is the scalar pressure variation at every point in space, in a compact area. This object is achieved by the invention described within this disclosure. In accordance with the first aspect of the invention, the encoding is achieved by the use of a parametric representation of the acoustic field and exploiting the statistical redundancies amongst the parameters along with computational models of human auditory acuity which dictate the lower limit of precision required of the parameters. This comprises the steps of: (i) Deriving a parametric representation of the soundfield from a spatial sampling of the soundfield achieved by multi-microphone transducers. The parametric representation is independent of the microphone type, number and position and completely describes the pressure field in the target area, (ii) Selecting a finite subset of the parameters from the potentially infinite number of parameters in the previous step, (iii) The encoding and quantization of the finite set of parameters using information theoretic principles and limits of human audition and perception. The encoding is independent of any listening conditions (including the listening room impulse response, number, type and geometrical configuration of loudspeakers). The only dependence is that the synthesis apparatus will strive to recreate the acoustic field at the listening venue with high accuracy. This assumption of rendering an accurate soundfield, facilitates the computation of psychoacoustic thresholds reflecting signal dependent limits of both human spectro-temporal resolution as well as limits of human auditory spatial acuity. Further details are disclosed in the following discussion and accompanying figures.

In another aspect of this invention the parametric acoustic field representation is further transformed to a lower dimensionality representation that has physical and statistical properties that are more amenable to coding.

A further aspect of this invention involves the decoding and dequantising of the encoded soundfield representation. The decoder strives to re-synthesize a faithful acoustic field, maintaining perceptual transparency such that the listener perceives an identical sensation of the soundfield that was present at the recording venue. In a preferred embodiment, the decoder adapts to the acoustic reflective, diffractive and diffusive conditions at the listening environment. More preferably, the decoder adapts to the number, type, positions and radiation patters of the loudspeakers at the listening environment. When either of the above two embodiments are not possible due to the lack of information on acoustic environment and loudspeakers, the decoder will have default settings that provide optimal synthesis based on user defined descriptions of room type, number, geometrical configuration and type of loudspeakers. In a multi-descriptive or scalable embodiment, the decoder will adapt to the available bandwidth and provide a lower accuracy/quality synthesis when the consumer does not have access to the complete bandwidth required for perceptual transparency.

In another embodiment of this invention, the synthesis apparatus (which incorporates the decoder) transmits information about the listening environment back to the encoder. It is preferred that in this embodiment, the encoder adapts to the listening environment information from the synthesis apparatus by estimating the synthetic soundfield (rendered at the listening environment) at the encoder - allowing a more accurate estimate of thresholds of audition, acuity and perception. The increased accuracy of the thresholds allows the encoder to optimize the quantization resulting in a further reduction in required bandwidth and/or increase in perceptual quality at the synthesis environment. In this embodiment, the communications is in real-time, two-way, one-to-one (as opposed to one-to-many and one-way) mode.

In various embodiments of this invention, the coding can be lossy and/or scalable. In the scalable embodiment, the encoder has the ability to select from a plurality of bit-rates, where-in higher bit-rates increase the perceptual accuracy/quality and lower bit-rates decrease the perceptual accuracy/quality. The selection of the bit-rates is controlled by an end-user or automatically controlled by channel or storage media limitations. In another embodiment, the synthesis apparatus may carry out further processing to enable noise-cancellation and conditioning.

All embodiments of the encoder and decoder will have physical realization incorporating appropriate analogue and digital hardware. The input to the encoding apparatus will include appropriate connectors to allow the connection to multiple microphones, analogue to digital devices, CPU and memory and associated glue logic. The output of the encoder will either be transmitted/streamed onto telecommunication networks or stored on media such as CDs and DVDs. Similarly the input to the decoder will be the encoded stream, and the output will be signals that represent the input to multiple loudspeakers. The contents of the following discussion and the drawings are set forth as examples only and should not be understood to represent the limitations of upon the scope of the invention.

Summary of the drawings

In order that the invention may be readily understood and put into practical effect, reference will now be made to the accompanying drawings, in which: Fig. 1 is a block diagram showing the encoder according to a first aspect of the present invention. Fig. 2A is a block diagram showing the method according to a first aspect of the present invention. Fig. 2B depicts the complete system - showing microphone and loudspeaker apparatus. Fig. 3 is a block diagram showing the principles of the operations of the psychoacoustic models used in the lossy codec of the first aspect of the present invention. Fig. 4 is a block diagram showing concepts of the 3D soundfield, in terms of sources and the receiver or listener. XXX - might go? Fig. 5 is a flowchart of the encoder according to the first aspect of the present invention. Fig. 6 is a block diagram showing of the method of the scalable encoder/decoder according to a third aspect of the present invention. Fig. 7 is a block diagram showing the method of the encoder/decoder when the decoder is able to transmit back to the encoder some information about the listening environment in a real-time two way one-to-one communications, according to a fourth aspect of the present invention. Fig. 8 is a flowchart of the decoder.

Detailed description of the invention

As used throughout the disclosure, the following terms, unless otherwise indicated shall be understood to have the following meanings. "Soundfield" refers to the scalar acoustic field which describes the dynamic pressure as a function of space and varies with time.

"Recording location" refers to the venue at which original acoustic field is to be recorded. "Listener location" refers to the venue at which the acoustic field has to be reconstructed or synthesized.

"Target space" refers to a compact volume in space that is targeted for maximum accuracy in recording and rendering the soundfield. This is shown in Figure 2B (item 200). "Parametric soundfield representation" refers to a finite or infinite set of parameters which describe the continuous dynamic pressure distribution in a target space.

"Coding" refers to bandwidth compression. Multi-channel Soundfield Audio Coding The present invention is concerned with the coding of soundfield representations and more specifically parametric representations of soundfields such that a decoder at an alternate time and location can synthesize a perceptually transparent soundfield to the one that was originally recorded.

The coding of parametric representations of the soundfield is advantageous in terms of spatial perception and fidelity as well the ability to code independently of the physical configurations of the listening environment in contrast to existing technology - which constrains the listening environment to strict speaker layouts (such as 5.1 configurations or stereo).

A further distinction of this present invention from the prior art is the concept of perceptual coding of soundfields, where, unlike previous definitions, the 'soundfield' is not represented by pressure signals at distinct points in space. The concept of the 'soundfield' as per this invention is the pressure signal in all points in space within a target region. In particular, in a parametric representation of the soundfield there is no ready and implicit access to the pressure signal which is representative of the signal incident on one of the two ears of the listener - although these may be derived under some broad assumptions about the location and mobility of the listener. In most (if not all) monophonic, stereo as well as multi-channel audio coders which purport to be soundfield coders, involve the quantisation of pressure signals (/?_z-(t,/) ) directly. Here p_t{t,f) defines the pressure signal as a function of time t, and frequency f, (where f is the temporal frequency as opposed to spatial frequency defined later in the document) and i represents the i^th acoustic channel (typically from the i^th microphone or the i^th loudspeaker feed, located at a certain fixed point in space - but could also be the acoustic output from a "matrixing" process).

The derivation and analysis of p_t(t,f) usually involves the use of Short Term Fourier Transform techniques using either sub-band filter-bank techniques or transforms such as Discrete Fourier Transforms (DFT), Wavelet Transforms (WT) or Modified Discrete Cosine Transforms (MDCT).

The reason for the frequency analysis is to facilitate psychoacoustic models of simultaneous masking which require an approximate decomposition of the cochlear response along its length - which can be approximated by a frequency analysis of the pressure signal, p_t(t) .

In contrast, the technology disclosed in this invention, involves the coding and quantisation of a finite time varying parameter set or coefficients {a_Q(t),a_x(t),..., a_k(t)} - which as a combination can be used to calculate the pressure signal as a function of time t and all point in continuous space (in a compact target volume - depicted in Figure 2B, item 200) given in Cartesian coordinates by χ,y,z . In other words, p(t,x,y, z) = J^r{{a₀(t),a₁(t),...,a_k(t),x,y, z}) , [1 ] where, the nature and characteristic of the function T(^) is exemplified (but by no means unique) by Equations [3] and [5], which are set out below.

In comparison, the definition of the soundfield in some prior art methods is multiple pressure signals at fixed points in space p_t(t) , where i represents the i^th microphone or loudspeaker, located at a singular distinct point in space. There is, in these prior art methods, no implicit mechanism to compute the pressure signal at any other positions in space other than the points at which the microphones or loudspeakers are located.

We thus differentiate the definition of the soundfield as used in the present invention as a "parametric soundfield representation" in comparison to the definition of the soundfield used in some prior art methods.

In our definition, a finite set of parameters describe the dynamic pressure at all three dimensional points in the soundfield, as a function of time as per Equation 1. The distinction is of the utmost importance - if for nothing else than the fact the psychoacoustic models used in prior art methods for perceptual coding assumes that at least one of the i^th signals (or the matrixed output) represents the pressure signal incident on the ear of the listener.

In contrast, our parametric representation requires further processing through T(^) to compute the pressure at any point in the continuum of the target space in the soundfield.

A further distinction between the prior art and the present invention are the use of psychoacoustic models used in the current invention which predict the maximum allowable deviation of the soundfield parametric representation

and are differentiated from psychoacoustic models used in traditional prior-art perceptual coders which rely on the pressure signal incident at one ear of the listening subject to predict the maximum allowable deviation of p_t(t,f) .

Further, the psychoacoustic analysis in the present invention provides a mechanism to control the spatial resolution of the soundfield by controlling the number of parameters to a subset or superset of the complete set

{a_Q(t, f ),a_x(t, f ),..., a_k(t,f)} .

A later section will reveal the details of the psychoacoustic modelling used in the current invention. A final distinction of this invention in comparison to current multi-channel coding and playback technology is the fact that current systems only allow for sound localisation in the horizontal (or 2D) plane. In MPEG-Surround, this limitation is due to the coding of the ICTD, ICLD and ICC cues from horizontally placed loudspeaker feeds - which only cater for how humans localise sound in the horizontal plane. In general, however, the limitation is again due to the encoder being forced to make simplistic assumptions on the listening environment's speaker layout and the total system (from recording, encoding, decoding and resynthesis algorithms) not being flexible enough to render the 3D soundfield for any arbitrary speaker positions. Soundfield Recording, Parametric Representation, Synthesis and Quality The acoustic soundfield has been defined above as the time varying dynamic pressure variations in a target region of space, p(x,y,z,t) , where x, y, z are the three dimensional spatial variables in Cartesian coordinates and t is time. Alternatively, spherical coordinates may also be used to describe the acoustic field p(r,θ,ψ,t) , where r,θ,ψ are the radius, elevation and azimuth angles, respectively.

To record the soundfield at discretely sampled time and all space (within a target volume) such that the soundfield may be reconstructed to arbitrary precision at another location and time would require an inifinite number of microphones. Instead, the following paragraphs show a parametric representation of the soundfield can be achieved using the continuity imposed by acoustic wave propagation, such that the soundfield can be described at a continuum of spatial locations and discrete time using a finite set of coefficients. The fundamental claim of this invention is then to code these coefficients culminating in an efficient representation of the three-dimensional acoustic field. The acoustic field is constrained by having to satisfy the wave equation which, in spherical coordinates is given by: dp

V²p(r,θ,φ,t) = \^- + 1 d ■ a dp 1 d a2 p 1 dp

I O O O O O L J r or 3r r² sinθ dθ dθ r² sin² θ dώ² c² dt² where c is the speed of sound.

Solutions to the above wave equation may be written in various ways including in terms of spherical Bessel functions and spherical harmonics. For example, in the case, where acoustic sources are outside region of interest (i.e acoustical energy is incident upon the region from the outside of the region), the solution is given by ex) n p{r, θ,φ,k) = ∑ ∑ A: {k)j_n {kr) Y_n ^m {θ,φ) , [3]

B=0 »!=-B where j_n(») is the spherical Bessel function of the first kind of order n , k = — c is spatial frequency and the Y™ (θ,φ) are the spherical harmonics, defined by,

where P™ (•) is the associated Legendre function and i = V^I . Similarly, for the exterior case, the solution is given by ex) n p{r,θ,φ,k) = ∑ ∑ B: {k)h_n {kr) Y_n ^m {θ,φ) , [5]

Ti=O m=- n where h_n(») is the spherical Hankel function given by, h_n(kr) = j_n(kryriy_n(kr) , [6] and y_n(») is the spherical Bessel function of the second kind, both of order n .

In the context of this invention, what is important about Equations [3] and [5] is that these solutions to the Wave equation are in the form of infinite order polynomials with infinite coefficients A% and B™ .

The process of taking the continuous function, p(r,θ,φ,t) and expressing it in terms of infinite order polynomials is akin to Taylor series and Fourier series analysis. From this perspective, the above analysis can be likened to projecting the soundfield onto the orthogonal basis functions. It can be shown that the series expansion using the spherical harmonic basis functions represent a compact yet complete solution while representing a comprehensive and physically based framework for soundfield analysis and synthesis. The arbitrary order Ambisonics representations are similar to the above but offer less completeness as they ignore the variation with respect to r . It is for this reason that we have chosen to adopt these coefficients to exemplify the concept of our invention - to code a generic parametric representation of the soundfield rather than pressure signals measured at specific points in the soundfield. We however point out that the concept can easily be applied to any number of parametric representations of the soundfield and would be considered within the bounds of this present invention.

In the context of efficient representation of soundfields, the above decomposition is problematic in that both Equations [3] and [5] require an infinite number of coefficients (A™ anόB™ ) to represent the soundfield. However, for small regions (small r and/or frequency k ), both the series in Equations [3] and [5] may be truncated to order n = N with little error. This is because only the low-order spherical Bessel functions have significant values for small values of kr .

The soundfield is then fairly accurately represented by [N + l)² coefficients A™ orB™ in a small space r and spatial frequencies k .

Limiting the order to n = N also limits the order of the Spherical Harmonics which essentially limits the spatial angular accuracy with which the soundfield is represented. However, one of the central themes of the present invention is the observation that only limited precision is required for equivalent perception of the soundfield. This limited precision or tolerance for noise by human auditory perception is due to various sources of internal noise (at various stages in the auditory pathway) in the human auditory neurophysiology. Further, the invention does not impose a limit on N in any way. The essence of the invention is to not introduce any further perceptual deviations beyond that imposed due to practical constraints (such as the number of available transducers) during the coding process.

A significant amount of research has allowed us to relate the order N , to perceptual sensitivity and acuity of the resulting synthesised soundfield. In particular, the encoder can adapt to the acoustic conditions and choose to vary the order N when it is deemed that an increase in the order will not provide any extra perceptual clarity or resolution. For example, in the case of a single omnidirectional acoustic source (with no particular directivity), only the n = 0^th order representation would suffice. Similarly, in the case of a point source, the limits of human acuity would apply, providing an upper limit for N (depending on the distance of the source). The framework for adapting the order in a time varying way is also a unique aspect of this invention.

As discussed above, there are various soundfields microphones that are available to the audio recording professionals. One well known one is the 1^st order Ambisonics soundfield microphone. The array microphones, whether configured in a sphere or a plane, whether spaced regularly or randomly, all strive to sample the dynamic pressure as a function of time and space. The accuracy with which this can capture the soundfield in a small area (say the size of a human head) depends on the configuration, type and number of microphones.

The current invention is not concerned with optimizing soundfield microphone technology - but rather to use any microphone array (as depicted in Fig. 2B, item 205), calculate a parametric representation of the soundfield (item 215) and encode the parameters (item 270 in Fig. 1 and Fig 2) such that the quantisation error does not introduce any perceptible difference in the synthesised soundfield at the decoded and synthesized output through an arbitrary configuration comprised of a plurality of loudspeakers. In another aspect of this invention, perceptible distortions traded-off to lower the bit-rate produced by the encoder.

What remains then is to describe how the coefficients are calculated from the soundfield microphone outputs, quantised and decoded to synthesise the soundfield at an arbitrary time and location. This is described in the following sections.

In order that the invention be carried out, reference to the following embodiment of the invention is now made in which the following steps are taken. Preparing the Soundfield

We start with the soundfield that is being recorded. This step is a recommendation for the recording configuration and is largely independent of the actual encoding process described later. The first step is to identify a target listening volume within the soundfield.

This is depicted in item 200 (in Fig 2A & 2B). The aim of the complete system

(from recording to synthesis) is to replicate the soundfield within this area.

Recognizing that it is likely that during re-synthesis, a listener will be positioned within this target area, an attempt should be made to replicate the same conditions at the recording area. Our recommendation is to position a Head and Torso Simulator (HATS) or a live spectator inside the target area.

The 3D soundfield is also shown in Figure 4 where the target listening area has been arbitrarily centered at the origin of the three axes of the

Cartesian coordinate planes. The target area is shown by a sphere around the origin. Two audio sources (P1 and P2) are shown to be producing sound from two different positions - affecting the soundfield in the listening area.

Recording the soundfield

The next step involves the sampling of the soundfield both in space and time using a microphone array (soundfield microphone). The microphone array should be positioned in the vicinity of the target area described above. The microphone array can be any one of various soundfield microphones described in various patents and publications. The configuration, shape, type and number of microphone are not critical to this invention. It is however recognized that the number, configuration of the microphones as well the vicinity of the array to the target location restric the accuracy with which the soundfield is captured.

In terms of this invention, the essential requirement is to record the spatial position r_z,0_z,ø_z of each microphone module relative to the target area. We will refer to the recordings from the z^th microphone by/?_z-(t) . We will assume there are M total microphones (i.e z = O,1,...,M -1). The recording process is shown in block 205 in the block diagram of Figure 2A & 2B. The outputs of the block are the microphone signals p_t(t) , shown as 210 in Fig. 2A.

Transforming the recorded signals into a primary parametric soundfield representation

The next step is to convert the plurality of microphone signals p_t(t) to a parametric representation of the soundfield. This is shown as block 215 (Figures 2A & 2B).

To exemplify our concept, we have chosen the parametric representation described in Equations [3] and [5] above.

Let Pi(f) represent the pressure distribution as a function of frequency for the z^th microphone. In practice, this conversion to the frequency domain can be carried out using the DFT, WT, MDCT or similar and variant techniques, known to those familiar in the art. Equation [3] may then be expressed as a matrix equation, for each frequency/ as follows:

Comparing Equation [7] to Equation [3], it can be observed that the elements of the matrix [Λ] are given by j_n (kη)Y™ (0_z-,ø_z-) . Thus, the pressure field at the M arbitrary microphone positions is defined by the (N + lf coefficients A™ .

If the recording is constrained such that M > (N + lf , (i.e the number of microphones is at least as many as the number of coefficients) then various numerical methods for inverting the matrix exists, providing methods of finding optimal values for the coefficients .4™ . The various techniques for Matrix inversion will be familiar to those acquainted in the art.

Three things need to be pointed out in the above derivation. The first is the observation that the above derivation is for omni-directional microphone transducers - alternative transducers such as cardioid microphones can be accommodated with a change in the elements of the matrix [Λ] but will not be discussed here as it is not essential to the invention.

Second, is the observation that the above has been derived for the case where there are no sources within the target volume of the soundfield - meaning all audio sources are outside of the sphere containing the target volume and the microphones.

Finally, and most importantly, the observation that the coefficients .4™ , derived in this fashion, capture the soundfield in its entirety (within the bounds imposed by the truncation of the infinite series which limits spatial resolution and physical configuration of the microphone array) allowing the pressure at any spatial point within the vicinity of the microphone array to be defined. It also has to be pointed out that various other methods may also be employed to compute the A% coefficients, including methods that do not require the conversion of the pressure signals to the frequency domain (computed entirely in the time domain). Similarly, other parameterizations may also be possible with the aim of defining the pressure variation in a target listening area using a finite number of time varying coefficients. Since the primary aspect of this invention is the coding of these coefficients (and not the parameterization itself), any such parameterization is considered to be within the scope of the invention.

The actual encoding flowchart in Figure 5 shows buffering of the pressure signals from the microphones before their conversion to the A™ coefficients. This indicates block based processing and the buffering involves storing a complete time frame of time domain pressure signals before processing them. The next time frame typically involves overlapped data from the previous frame to ensure smooth re-synthesis at the decoder show in the decoder flowchart of Figure 9.

In addition, windowing is carried out on each frame to ensure optimal time-frequency resolution. The size and type of the window is signal dependent.

The buffering, framing and windowing is no means unique to this invention and is carried out in most speech and audio coders and is a familiar concept to those familiar to the field. If it were not for the quantization steps in between, the overlap-add process ideally leads to perfect reconstruction of the input pressure signals.

Transforming the primary parametric soundfield representation into a secondary parametric soundfield representation

In one embodiment of the invention, the parameters .4™ (or a subset), from the previous step are further analyzed and transformed to a secondary set of parameters of reduced dimensionality, which are better amenable to coding. This step is shown as block 120 in Fig. 1 and block 220 in Fig. 2A. The simplest reduction in dimensionality is achieved by recognition of some simple mathematical relationships. For example, some of the coefficients are completely real (whenever m=0) while others have equality (within a multiplicative constant) between real and imaginary parts.

Beyond the simplest of relationships described above, the coefficients portray signal dependent information content. Each coefficient (other than the 0^th order coefficient which carries omni-directional information) has directional properties and thus represents the acoustic energy in a certain direction - akin to beam patterns. Coefficients of higher order represent finer directionality and thus act to increase resolution of the soundfield representation.

Most auditory events (especially speech and music events) have spectro-temporal periodicities and structure. These correlations are also reflected in the spectro-temporal distributions of the A% coefficients. In addition since the acoustic signal typically propagates from one position to another, there is significantly increased temporal correlation between the coefficients themselves. These make the coefficients amenable to predictive coding. A/?^th order linearly prediction can be applied to individual coefficients across time as follows:

AZ lf,t] = a_lA: if,t-l}+<x₂A: if,t-2} + -- + a_pA: if,t-p}, [8] allowing the quantization of the coefficients {a_x,...,a_p} and the residual error.

Such prediction coding techniques are familiar to those acquainted in the art and can also be applied across coefficients (in the m and n dimensions). The end product is always a set of coefficients that have less statistical correlation (essentially removing redundancies in the parameters), reduced dimensionality and lower dynamic range making them suitable for quantization purposes.

Further methods of exploiting spatial characteristics of the soundfield in conjunction with limited spatial acuity of hearing are described in the next section. Deriving the psvchoacoustic thresholds In this step, shown in the flowchart of Fig. 3 (and in less detail as part of the complete encoder in Fig. 1 ), computational psychoacoustic models and related analysis are used to predict the maximal deviation that is allowable (above which the effect of the deviation would be perceptible) or equivalently minimal precision required to represent the parameters, A% . (Once this is known, the thresholds can be reflected to the secondary parameter set a_k derived in the previous step).

The principle behind the psychoacoustic models and analysis is to estimate the listening conditions in the target volume positioned around the listener. In one embodiment, we limit our consideration of listener movement to be within the target volume - i.e only guaranteeing optimal perception while the listener's ears are located within the target volume - irrespective of listener position and orientation within that volume. It is important to realize that Inter- aural characteristics change with listener movement and orientation - making the analysis of parameters such as Inter-aural Level and Time differences (ILD and ITD) unusable for coding purposes. In an alternative embodiment, allowing lower bit-rates, the listener is assumed to be stationary with a certain fixed orientation. In this embodiment, the analysis of ILD, ITD and other directional characteristics such as differences in acuity in the front and back of the listener can be used in the coding process.

In every embodiment, we only use the most conservative of thresholds. This is because, while, under certain circumstances it is possible that more noise can be tolerated by the human auditory sense, there is no guarantee that the listener can be constrained to those circumstances. These circumstances include the level of the signal that the listener has chosen for playback, their location and orientation, other modes of stimuli that the listener is being presented with (visual stimuli for instance), movement of their head, etc. This conservative threshold analysis is used in every embodiment except in the embodiment of the "two way communication" of the present invention where the decoder transmits some or all of the listener's time varying information back to the encoder. The pertinent information transmitted from the decoder include listener position/orientation, sound level, room acoustics, etc. The psychoacoustic models used in the present invention are aimed at exploiting two limitations of human hearing - limited ability to perceive distortions/noise (distributed across time and frequency - attributed to simultaneous and temporal masking) and the limited ability to detect the direction of sound sources. Both of these limitations are signal dependent, whereby a sound source with a certain time, frequency and spatial distribution is able to affect the detection of competing sounds with different time, frequency and spatial distribution.

We first focus on the first of these limitations: masking thresholds. These noise thresholds are amenable to exploitation (allowing the introduction of quantization noise) in current audio coders (mono, stereo and SAC) due to the availability of a pressure signal that is presumed (however wrongly) to be representative of the stimuli at one or both of the listener's ears. This is not implicitly available in the present invention as the entire soundfield (of interest) is represented by the parameters, A% . One aspect of the current invention is a unique methodology for reflecting the auditory noise thresholds to the parametric {A™ ) domain.

The peripheral masking threshold can be represented as a maximal noise pressure variation, n (r,θ,ψ,k) , (where the "^*" represents the threshold) on the recorded pressure field representation p (r, θ, φ, k) which is different from the actual soundfield p(r,θ,φ,k) which existed at the place (and time) of the recording. The difference between p(r,θ,ψ,k) and p(r,θ,ψ,k) is due to the limited number of microphone transducers (and their configuration) as well as the truncation of the infinite series in Equation [3]. It is important to note that the psychoacoustic model works on p(r,θ,φ,k) and not on p(r,θ,φ,k) (which is not accessible).

The first step involves approximating the pressure signal at various points surrounding the listener. If a Head and Torso Simulator (HATS) was used during the recording then all that is required are approximate pinnae positions in three dimensional space relative to the centre of the target listening area. Alternate techniques to compensate for the auditory "shadow" of a human head may also be used if HATS was not used during the original recording.

Using Equation 3, and the parametric representation, A^ , the pressure signals can easily be computed at the various possible spatial locations where the listener could possible place their pinnae. Standard psychoacoustic masking models (familiar to those acquainted in the art and described in various MPEG standards and patents) are then used to derive the noise thresholds at each of these points. The models use various experimental data (in the form of Tables in Figure 3) including the absolute threshold of hearing to calculate the thresholds, n^* (r,θ,φ,k) , for each i^th position. The most conservative of the thresholds are the reflected back to the parametric {{A™ }) domain using

Equation 7 (and using the same matrix inversion technique that was used to calculate the coefficients). At this point, we have the first of two thresholds A™ (k) that will be used to determine the bit allocation for each of the coefficients, A™(k) .

The second limitation of human audition, that of spatial acuity is uniquely exploited in this invention by controlling the order N as well as the update rate of the FB coefficients reflecting the required resolution and information content of the coefficients, in a time varying way. As mentioned in the previous section, the FB coefficients A% carry signal dependent information content, due to the directional properties of the Spherical Harmonics, Y™ [θ,φ] . Each coefficient

(other than the 0^th order coefficient which carries omni-directional information) reflects acoustic energy from a certain direction - akin to beam patterns. Coefficients of higher order represent finer directionality and thus act to increase resolution of the complete soundfield representation. However, when there is little to no acoustic energy from a certain direction, the relevant coefficient will carry little information, requiring a relatively slower update rate. We do this by not sending these coefficients over a number of frames. At the decoder, the coefficients are simply recreated by a weighted averaging across past and future frames.

Similarly when the acoustic sources are spatially broad, there is no requirement for higher order coefficients. Lower order coefficients suffice to recreate the soundfield.

In the case where there are two acoustic sources in close proximity, higher order coefficients are only warranted if the two sources are wider apart than the minimum acuity of human spatial hearing. A representation using higher order coefficients will be lost due to the spatial limitations of our hearing. In the embodiment where an assumption of fixed head orientation can be made, we use an auditory acuity of between 2°- 10° depending on the azimuthal direction of the sources ( the lowest being when the sources were deemed to be in front of the listener). In the alternate embodiment where an assumption of the orientation and location of the listener's head cannot be made, we always use an auditory acuity of 2°. For sources above 30° elevation and below -30° we use use an acuity of 7°. We exploit these effects by carrying out:

(i) An analysis of the spatial distribution of the acoustic energy in the soundfield by converting the FB coefficients to pressure signals on a sphere around the target volume. The three dimensional energy distributions around the listener and tracking their evolution over time and space provide the angular separation and strength of sound sources. This in turn allows a decision on whether the separation warrants higher order parametric representations. (ii) Calculating the mutual information between coefficients of the same and different orders to justify the use of higher order coefficients. The outcome from this step along with the previous step is a decision on the minimum order required for the representation, N. (iii) An analysis of the information content of each coefficient as compared to previous frames. This allows a decision on the update rate of each coefficient, r£ . Coefficients that do not change rapidly are conducive to being updated slowly and/or using predictive coding for their representation. The spatial analysis in Step (i) also provides a mechanism to model the spatial release from masking whereby a dominant source in close proximity to a neighbouring weaker source is able to limit its detectability (or equivalently increasing the threshold of noise in the vicinity of the weaker source). The smallest of these positional masking thresholds are then reflected to the FB coefficient domain, providing a second set of thresholds, A™₂ (k) .

A third set of thresholds is obtained by imposing spreading functions across coefficients of the same order (and frequency) as well as neighbouring orders. The spreading functions are largely derived from empirically observed sensitivity from experiments whereby noise was systematically added to the coefficients and listeners asked if they could detect the noise. The width and depth of the spreading functions are inversely proportional to the order of the coefficients. The end product is a third set of thresholds, represented by

The three sets of thresholds A" (k) , A™^* (k) and A™ (k) compared to

each other, for each frequency k, and the lowest (most conservative) Jζ (k) , threshold at each frequency is sent to the quantisation block. Also sent to the quantisation block are the order of representation N, for the current frame and the update rate for each coefficient, r^ .

Quantising the soundfield Quantisation involves scalar and vector quantisation of individual coefficients and vector quantisation for groups of parameters. A multitude of alternate techniques are available and should be familiar to those acquainted in the art. The current invention is not critical to a particular technique of quantisation and any technique should therefore be considered to be within the scope of the invention. We have exemplified our implementation with the simplest of possibilities - scalar quantisation on all the parameter space A™ in

accordance with the perceptual threshold A^ (k) , calculated in the previous step.

The encoded bitstream consists of a 8 bit positive integer depicting the order N, required to represent the soundfield for each frame, as derived in the previous step. This automatically indicates that there are (N + 1)² coefficients in the representation. We use a simple binary bit to indicate whether each coefficient at each frequency requires updating. For 8192 sized frames, there are 4096 complex frequency bins - requiring 4096 bits for each coefficient. For each coefficient that requires updating in one frame, L_k bits are assigned to each of the parameters A%(k) , ensuring that the quantisation noise

does not exceed the perceptual thresholds A^ (k) . The bit allocation information as well as the quantised coefficients composes the rest of the bitstream.

Alternative forms of quantisation including vector quantisation techniques will ensure even greater efficiency and coding gain. The block diagram in Figure 5 shows the quantisation process and the psychoacoustic model working in a recursive mode ensuring the most efficient use of bits.

This recursive mode of operation recognises that the perceptual threshold Jζ (k) , is not "set in stone" and is a function of the signal and introduced noise at different locations in the soundfield. This mode of operation thus increases efficiency or coding gain at the cost of increased complexity.

The resulting (quantised) bitstream is further compressed using Huffman coding. Other entropy coding techniques may also be used to reduce the redundancies and repetition in the bitstream and should be considered within the scope of the present invention.

Storing or transmitting the soundfield representation

The quantized digital data can be stored in various media for archival or transmitted and streamed over various channels as applicable. This is shown as block 240 in Fig. 2A. Decoding the soundfield representation

The first task of the decoder process (shown as block 250 in Figure 2A and as a flowchart in Figure 8) is to de-quantise the bitstream. This involves taking into account the entropy coding that was carried out on the bitstream during the encoding process, recognition of the bitstream and finally evaluating the parameters -M^m(£;)f from the bitstream (the hat indicates noisy versions of

the parameter set, |^™(A;)| ). In Figure 2A, the psychoacoustic model (block

245) is referenced to perform the de-quantisation. This just indicates that the concepts used to compute the thresholds, order of the representation and update rate of each coefficient is required at the time of the decoding. The ultimate step of the decoder is however to re-synthesise the soundfield accurately at an alternate location and time. Equation [3] is again used for this purpose. Given the set of coefficients!™ , and L loudspeakers positioned arbitrarily, the L loudspeaker feeds g^t) is computed using an equation of the following form:

where, the vector on the left defines the Fourier transform of the L speaker feeds, the 1™ are the quantisation noise contaminated A™ coefficients that form the parametric representation of the soundfield, and elements of the matrix [rj are functions of the positions of the loudspeakers and their radiation characteristics.

It should be pointed out that L and N are independent. Thus different matrices [rj are used for different orders of the soundfield representation. This is especially important since the encoder dictates the order of the representation which thus varies from frame to frame. Thus, as long as a set of g_t(t) is generated, it is possible to use the correct windowed-overlap-add method to resynthesise the signal that feeds the loudspeakers. The use of this time varying order to represent the soundfield and the ability to recreate the loudspeaker feeds from it is a novel aspect of this invention. Further enhancements such as compensation for room acoustics at the listening environment and noise cancellation may also be carried out as shown in block 255 in Figure 2A. Scalable Coder

In a second aspect of the present invention, the method is modified somewhat in that it involves staggered parameterisation of the soundfield such that a lower number of quantised parameters (i.e a smaller subset of A™ ) is able to de-quantise the soundfield with a potential penalty in quality (and localisation perception). This is most easily achievable by forcing lower order, (N), representation of the soundfield. This way, the end-user may trade-off bit- rate versus spatial resolution. This ability to scale the encoder by a seamless change in order of representation, is an unique contribution of this invention. Other hierarchical parameterisations and quantisation strategies, are shown in Fig 6, and should be considered within the scope of this invention. Two way communications

In the third aspect of the present invention, the invention will be in the situation where there is two-way one-to-one communication between the encoder and the decoder. In this case the decoder can send information to the encoder about such things as the layout of the speakers at the playback venue, position and orientation of the listener relative to the speakers, room acoustics and sound level that the listener has chosen - or a subset of these. This information will enable the lossy encoding system to optimise its functionality allowing further efficiency and coding gain of the soundfield representation. This is shown in Figure 7.

Alternative embodiments of the invention also include devices that embody the invention. These devices may include automatic means for determining the layout of the listening environment, or they might interact with the environment directly, or though a system, to determine the layout of the listening environment. Alternatively, the listening environment may be relatively fixed, such as in the case of headphones, in which case a set predetermined representation of the listening environment is provided by the playback device. The scope and ambit of this invention are broader than the disclosure contained herein. Any person skilled in the art will appreciate that variations may be made to the above embodiments and yet remain within the scope of the present invention

Claims

Our claims are:

1 ) A method for coding a parametric representation of a soundfield (defined as the pressure variations at a continuum of points within a target volume) comprising the steps of: a. Deriving a parametric representation of the soundfield from a plurality (at least four) acoustic signals representative of the pressure signals at different locations within the soundfield. In doing so, the parameters can be an infinite set of coefficients. b. Selecting a finite subset of coefficients from the infinite set of coefficients c. Encoding and quantising the finite set of coefficients using an information theoretic analytical basis.

2) The method according to Claim 1 , wherein the step of deriving the parametric soundfield representation involves transforming signals recorded by at least four microphones into a parametric soundfield representation.

3) The method according to Claims 1 or 2, whereby the parametric soundfield representation is further transformed such that the resultant output has lower dimensionality, lower correlation, lower dynamic range and/or has statistical characteristics that are more amenable to coding.

4) The method of Claim 3 whereby the transformation includes linear prediction and similar whitening techniques. 5) The method according to Claim 1 , whereby the step of encoding and quantising involves the further steps of: a. Converting the parametric representation of the soundfield to representative pressure variations at locations within a target volume. b. Analysis of the pressure variations in the previous step using computational psychoacoustic models. c. Using psychoacoustic models to derive threshold levels as a function of frequency, that are representative of the maximum noise levels that can remain undetected by the human auditory system. d. Transformation of the threshold levels derived in the previous step such that the output of the transformation represents the maximum distortion allowable for each of the parameters (as a function of frequency) originally used to represent the soundfield. e. Ensuring that the encoding and quantising process in Claim 1 does not result in distortions of the parameters that exceed the thresholds calculated in the previous step. Preferably, the distortions introduced by quantising the parametric soundfield and a subsequent decoding and synthesis method should not result in a synthesised soundfield that exceeds the psychoacoustic thresholds in order to ensure the transparent perception of the soundfield as originally recorded. 6) The method according to Claim 1 , whereby the step of encoding and quantising includes the further steps of a. Converting the parametric representation of the soundfield to representative pressure variations at locations on a sphere surrounding a target volume. b. The use of the pressure variations in the previous step to estimate spatial statistical and deterministic source characteristics including spatial energy distribution, source separation, angle of arrival and relative strength/level. c. The use of limitations in human auditory acuity and the information in the previous step to determine a lower bound on the order required to represent the soundfield. d. The previous step be carried out in two alternative modes (that define the encoding modes): a conservative mode whereby an assumption is made that the listener is likely to change head orientation resulting in the use of a conservative model of human auditory acuity limitations; and a more aggressive mode whereby an assumption is made that the listener is likely to keep a fixed head orientation - resulting in a more aggressive model of auditory acuity. e. The encoding process in Claim 1 can truncate the number of parameters to reflect the maximum order required as determined in Step 6c, and the quantising process in Claim 1 transmits the order as an integer along with the rest of the parameters. f. The use of the spatial data derived in Step 6b in addition to models of known human limitations of detecting sounds in close proximity to other (a phenomena known broadly as Spatial release from masking), to derive thresholds of pressure-deviations that can be tolerated in regions surrounding high acoustic energy locations. g. Transformation of the threshold levels derived in the previous step such that the output of the transformation represents the maximum distortion allowable for each of the parameters (as a function of frequency) originally used to represent the soundfield. h. Ensuring that the encoding and quantising process in Claim 1 does not result in distortions of the parameters that exceed the thresholds calculated in the previous step. The method according to Claim 1 , whereby the step of encoding and quantising includes the further steps of a. Sensitivity analysis of parameters to derive the amount of allowable distortions (as a function of frequency) that will go undetected if added to the parameters. b. Ensuring that the encoding and quantising process in Claim 1 does not result in distortions of the parameters that exceed the thresholds calculated in the previous step. 8) The method according to Claim 1 , whereby the step of encoding and quantising includes the further step of information theoretic analysis of the parameters to determine how often the parameters are required to be updated. 9) The method according to Claim 8, whereby the information theoretic analysis includes Mutual Information analysis of coefficients of lower order with higher order. 10)The method according to Claim 8, whereby the information theoretic analysis includes cross-correlation analysis of parameters with previous and future frames to determine their update rate.

11 ) The method according to Claim 6, whereby the analysis of the lower bound on the order required to represent the soundfield, is refined by the analysis according to Claims 8, 9 and 10.

12)The method according to Claim 1 , whereby the step of encoding and quantising includes the further steps of: a. combining the thresholds derived according to Steps 5d, 6g and 7a to a combined threshold that represents the most conservative threshold b. Ensuring that the encoding and quantising process in Claim 1 does not result in distortions of the parameters that exceed the thresholds calculated in the previous step.

13)The method according to Claim 1 , whereby the step of quantising includes the further steps of: a. Encoding the order derived according to Claim 11 and/or Claim 6 using an integer. b. Encoding the update rate of each of the parameters. c. Allocating appropriate number of bits to each of the parameters or a group of parameters to ensure that the resultant quantisation noise in the parameter representation does not exceed the thresholds calculated by the method in

Claim 12. 14) The method and apparatus that allows the encoded and quantised parametric soundfield representation to be delivered to a listening environment via storage media (CD, DVD, Hard-disk, etc) or via communications channels (wireless, Internet, etc). 15) A method whereby the soundfield is recreated comprising the steps of: a. Recognising the delivered and encoded soundfield representation from the bitstream. b. Decoding and dequantising the compressed soundfield information. c. Creating loudspeaker feeds based on the decoded and dequantised soundfield in the previous step in a manner that ensures a perceptually identical soundfield representation as recorded originally. Preferably this is done in accordance with loudspeaker radiation patterns, number, geometric orientation and room acoustics at the listener's playback environment.

16) The method of Claim 1 , whereby the steps of encoding and quantising includes the step of receiving information (in either realtime or a-phori) about the conditions at the listening environment. The encoding and quantising steps are then able to use this information to further reduce the bit-rate and/or ensure better spatial accuracy to the listener. The extra information includes (but not limited to): a. Room acoustics information, b. Listener position and orientation. c. Loudspeaker radiation pattern, number and/or geometrical configuration. d. Loudness level being used by the listener.

17) The method according to Claim 16, whereby the encoder is able to readjust the noise thresholds, order of representation and/or the update rate of the coefficients to optimise the encoded output. A method for carrying such re-adjustment is to compute the performance of the total system by estimating the synthetic soundfield (rendered at the listening environment) at the encoder and thus estimate more precisely the deviations that can be tolerated for transparent perception. 18) The method according to Claim 16 when the mode of communication is either a two-way and one-to-one (as opposed to one-to-many - requiring only a single encoder but many decoders), requiring real time encoding and decoding at both ends of the communication, or a the listening environment is strictly constrained to a variety of "user-defined" conditions that the encoding process catered for.

19) The method according to Claim 1 , whereby the encoding is carried out in a scalable but seamless fashion. In this embodiment, the encoder can be made to produce a certain bit-rate from a plurality of possible bit-rates, wherein higher bit-rates increase perceptual transparency and decrease perceptual transparency. The scaling is controlled either by channel conditions or by user control. 20)The scalable method according to Claim 19, whereby the scalable bit-rate is achieved simply by choosing coefficients of various orders. 21 ) The method according to Claim 15, whereby the soundfield is able to be recreated seamlessly from an encoded soundfield representation whose order varies with time as dictated by the encoding process.

22) The method according to Claim 15, whereby the soundfield is able to be recreated seamlessly from an encoded soundfield representation where the update rates of different parameters are not the same - requiring interpolation, repetition and weighted averaging during frames when parameters which have not been updated.

23) The apparatus comprising of audio equipment, microphones, loudspeakers, Computers, Memory, Hard-disks, etc to perform any or all of the Claims 1 through 22.