EP2451196A1 - Method and apparatus for generating and for decoding sound field data including ambisonics sound field data of an order higher than three - Google Patents
- Publication number
- EP2451196A1 (application EP10306212A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- ambisonics
- order
- value
- coefficients
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S7/00—Indicating arrangements; Control arrangements, e.g. balance control
- H04S7/30—Control circuits for electronic adaptation of the sound field
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/03—Application of parametric coding in stereophonic audio systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S2420/00—Techniques used in stereophonic systems covered by H04S but not provided for in its groups
- H04S2420/11—Application of ambisonics in stereophonic audio systems
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04S—STEREOPHONIC SYSTEMS
- H04S3/00—Systems employing more than two channels, e.g. quadraphonic
- H04S3/008—Systems employing more than two channels, e.g. quadraphonic in which the audio signals are in digital form, i.e. employing more than two discrete digital channels
Definitions
- This payload header extends the RTP header of Fig. 1 by a 2-octet extended sequence number and a 2-octet extended time stamp. Furthermore, one octet for flags and a reserved field, followed by a 3-octet SMPTE time stamp and a 4-octet offset value, are proposed therein.
- the 32-bit aligned payload data is following the header data.
- a problem to be solved by the invention is to provide a data structure (i.e. a protocol layer) for 3D higher-order Ambisonics sound field description formats, which can be used for real-time transmission over Ethernet.
- This problem is solved by the encoding method disclosed in claim 1 and the decoding method disclosed in claim 3. Apparatuses which utilise these methods are disclosed in claims 2 and 4, respectively.
- the data structures described below facilitate real-time transmission of 3D sound field descriptions over Ethernet. From the content of additional metadata the transmitted 3D sound field can be adapted at receiver side to the available headphones or the number and positions of loudspeakers, for regular as well as for irregular set-ups. No regular loudspeaker set-ups including a large number of loudspeakers are required like in WFS.
- the sound quality level can be adapted to the available sound reproduction system, e.g. by mapping a 3D Ambisonics sound field description onto a 2D loudspeaker set-up.
- the inventive data structure considers single microphones or microphone arrays as well as virtual acoustical sources with different accuracies and sample rates.
- moving sources, i.e. sources with time-dependent spatial positions, are covered by Ambisonics descriptions inherently.
- the Ambisonics header information level is adaptable between a simple and an encoder related mode.
- the latter one enables fast decoder modifications. This is useful especially for real-time applications.
- the proposed data structure is extendable for classical audio scene descriptions, i.e. sound sources and their positions.
- the inventive Ambisonics processing is based on linear operators, i.e. the Ambisonics channels data can be packed and transmitted singly or in an assembled manner as a matrix.
- the inventive encoding method is suited for generating sound field data including Ambisonics sound field data of an order higher than three, said method including the steps:
- the inventive encoder apparatus is suited for generating sound field data including Ambisonics sound field data of an order higher than three, said apparatus including:
- the inventive decoding method is suited for decoding sound field data that were encoded according to the above encoding method using one or two or more of said paths, said method including the steps:
- the inventive decoder apparatus is suited for decoding sound field data that were encoded according to the above encoding method using one or two or more of said paths, said apparatus including:
- in a first step or multiplier 33, all s source signals x(k) at each sample time kT, i.e. virtual single sources as well as microphone array sources, are multiplied with a matrix Ξ defined in Eq.(1).
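As an illustration of this multiplication step, the following sketch encodes s sources with a simplified 2D (circular-harmonic) mode matrix. The patent's actual matrix Ξ of Eq.(1) uses spherical harmonics and is not reproduced in this extract, so the function name and layout below are assumptions, not the patented method.

```python
import math

def encode_sources_2d(samples, azimuths, order):
    """Illustrative 2D Ambisonics encoding: multiply the s source samples
    x(k) with a circular-harmonic mode matrix (a simplified stand-in for
    the matrix Xi of Eq.(1)).

    samples:  s source sample values at one sample time kT
    azimuths: s source azimuth angles in radians
    returns:  2*order + 1 Ambisonics coefficients d(k)
    """
    d = [sum(samples)]  # order-0 (omnidirectional) component
    for m in range(1, order + 1):
        # one cosine and one sine component per circular-harmonic order m
        d.append(sum(x * math.cos(m * a) for x, a in zip(samples, azimuths)))
        d.append(sum(x * math.sin(m * a) for x, a in zip(samples, azimuths)))
    return d
```

A single unit-amplitude source at azimuth 0 yields coefficients [1.0, 1.0, 0.0] for order 1, matching the expected 2·N+1 channel count.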
- Fig. 3 shows a block diagram of an Ambisonics encoder for these four cases at production side. The required functions are represented by corresponding steps or stages in front of the transmission. All processing steps are clocked by a frequency that is made in stage 38 synchronous with the sample frequency 1/T.
- a controller 37 receives a mode selection signal and the value of order N , and controls an optional multiplexer 36 that receives the filter responses and the output signal of multiplier 33, and outputs the inventive data structure frames 39.
- Multiplier 33 represents a directional encoder providing corresponding coefficients and outputs the unfiltered vector data d ( k ), the order N value, and parameter Norm .
- An array response filter 42 ('Filter 1' in Fig. 4 ) only for the microphone sources data can be arranged at decoder side.
- the unfiltered vector data d ( k ), the order N value, and parameter Norm are assembled in a combiner 340 with radii data R S ( t ), and are fed to an optional multiplexer 36.
- Radii data R S ( t ) represent the distances of the audio sources of the S input signals x ( k ), and refer to microphones as well as to artificially generated virtual sound sources.
- the coefficients vector data d ( k ) pass through an array response filter 341 for the microphone sources (filter 2).
- the filtering compensates the microphone-array response and is based on Bessel or Hankel functions. Basically, the signals from the output vectors d ( k ) are filtered.
- the other inputs serve as parameters for the filter, e.g. parameter R is used for the term k * r .
- the filtering is relevant only for microphones that have the individual radius R m . Such radii are taken into consideration in the term k * r of the Bessel or Hankel functions. Normally, the amplitude response of the filter starts with a lowpass characteristic but increases for higher frequencies.
- the filtering is performed in dependency from the Ambisonics order N , the order n and the radii R m values, so as to compensate for non-linear frequency dependency.
- a subsequent normalisation step or stage 351 for spherical waves data provides filtered coefficients A ( k ). It is assumed that there is also a corresponding filter at reproduction side (filter 431 in Fig. 4 ).
- the filtered and normalised coefficients A ( k ), parameter Norm and the order N value are fed to multiplexer 36.
- the coefficients vector data d ( k ) pass through an array response filter 342 for the microphone sources (filter 3).
- the filtering is performed in dependency from said Ambisonics order N , said order n , the radii R m values and a radius R ref value representing the average radius R ref of the loudspeakers at decoder side as described in the below section "Radius R ref (RREF)", so as to compensate for non-linear frequency dependency.
- a filter for spherical waves data is also arranged at reproduction side. Then the average radius R ref of the loudspeakers has to be considered already in filter 342.
- a subsequent normalisation step or stage 352 for spherical waves data provides filtered coefficients A ( k ).
- Step/stage 352 can include a distance coding like that described in connection with Fig. 4 .
- the filtered coefficients A(k) from step/stage 352, parameter Norm, the order N value and radius value R_ref are fed to multiplexer 36.
- the coefficients vector data d ( k ) pass through an array response filter 343 for the microphone sources (filter 4).
- the filtering is performed in dependency from the Ambisonics order N , the radii R m values and a Plane Wave parameter.
- a subsequent normalisation step or stage 353 for plane waves data provides parameter Norm , the order N value and a flag for Plane Wave to multiplexer 36.
- the Ambisonics encoder can code the output signals 361 in any one of these paths, in any two of these paths, or in more than two of these paths.
- the normalisation steps or stages 351 to 353 can use a normalisation or scaling as described below in section "Ambisonics Normalisation/Scaling Format (ANSF)".
- the Ambisonics decoder depicted in Fig. 4 parses the incoming data structures in a parser 41 in order to detect the case type and to provide the data for performing the appropriate functions.
- An example for such parser is disclosed in WO 2009/106637 A1 .
- Unfiltered vector data d(k), order value N, parameter Norm and all radii data R_S(t) are parsed. These values pass through an array response filter 42 (Filter 1), which filters (as described for Fig. 3) the received d(k) data under consideration of all radii R_S(t).
- the resulting filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 431 for all loudspeaker radii R_LS and order N, and pass thereafter together with loudspeaker direction values Ω_l (representing the directions of the LS loudspeakers 46), value N and parameter Norm through an optional multiplexer 44 to a panning or pseudo inverse step or stage 45.
- Distance coding means taking into account Bessel or Hankel functions with radii parameter in term k * r for plane or spherical waves.
- Filtered coefficients A ( k ), parameter Norm and order value N are parsed.
- the filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 432 for all loudspeaker radii R_LS and order N, and pass thereafter together with loudspeaker direction values Ω_l, value N and parameter Norm through multiplexer 44 to the panning or pseudo inverse step or stage 45.
- Spherical waves on AE and AD sides are assumed.
- Filtered coefficients A ( k ), order value N , parameter Norm and radius value R ref are parsed.
- the filtered coefficients A(k) are distance coded (DC) in a distance coding step or stage 432 for all loudspeaker radii R_LS and order N under consideration of radius R_ref, and pass thereafter together with loudspeaker direction values Ω_l, value N and parameter Norm through multiplexer 44 to the panning or pseudo inverse step or stage 45.
- Spherical waves on AE and AD sides are assumed.
- Filtered coefficients A ( k ), order value N , parameter Norm and a flag for Plane Waves are parsed.
- the filtered coefficients A(k) together with loudspeaker direction values Ω_l, value N and parameter Norm pass through multiplexer 44 to the panning or pseudo inverse step or stage 45. Plane waves on AE and AD sides are assumed.
- a mode selector 47 selects in multiplexer 44 the corresponding path or paths a) to d) which was or were used at encoder side.
- Decoder 45, which represents a panning or a mode matching operation including a pseudo inverse, inverts the matrix Ξ operation of the Ambisonics encoder in Fig. 3, applies this operation to the filtered coefficients A(k) or the filtered and distance coded coefficients A'(k), respectively, in dependency on the parameter Norm, the order value N and the loudspeaker direction values Ω_l, and provides the l loudspeaker signals for a loudspeaker array 46.
- Parser 41 also provides synchronisation information that is used for re-synchronisation of a clock 48.
- the invention specifies a packet-based streaming format for encapsulating spatial sound field descriptions based on Ambisonics into an extended real-time transport protocol, in particular RTP, for real-time streaming of spatial audio scenes.
- the focus is on a standalone spatial (2D/3D) audio real-time application, e.g. a transmission of a live concert or a live sport event via IP. This requires a specific spatial audio layer including time stamps and possibly synchronisation information.
- the Ambisonics real-time stream can be used together with an RTP layer.
- alternative RTP layers with or without extended headers are described below.
- EASF: Extended Ambisonics Streaming Format
- Ethernet transmissions are performed in data packets with a typical packet length, called 'path MTU', of up to 1500 or 9000 bytes.
- a 'frame' represents a dedicated time interval within which a typical number of packets is transmitted. For example, in 1080p video mode a frame contains 1080 data packets, each of which describes one line of a complete video frame.
- a transmission should be frame based.
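The frame-based packetisation described above can be sketched as follows; the function name and the header-length parameter are illustrative, not part of the patent.

```python
def fragment_payload(payload: bytes, path_mtu: int, header_len: int):
    """Split a frame's payload into packets that each fit within the
    path MTU (e.g. 1500 bytes, or 9000 for 'jumbo frames'), leaving
    room for the per-packet header. Illustrative sketch only."""
    max_data = path_mtu - header_len
    if max_data <= 0:
        raise ValueError("header does not fit into the path MTU")
    # slice the payload into consecutive chunks of at most max_data bytes
    return [payload[i:i + max_data] for i in range(0, len(payload), max_data)]
```

With a 1500-byte MTU and a 40-byte header, a 3000-byte frame payload would be split into three packets (1460, 1460 and 80 bytes of data).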
- Case 1 requires a transmission of each time-dependent radius R_S(t). This is an option if filter processing is to be performed in the decoder. However, in the following the focus is on Cases 2-4, in which the filtered coefficients A(k) are transmitted. This allows a higher bandwidth because the transmission remains independent of all source positions, i.e. it is better suited for Ambisonics.
- for standalone audio transmission, the protocol contains the following header data structure.
- Payload Type 7 bits
- the payload type is defined for an Audio standalone transmission as EASF.
- the film format is chosen, e.g. DPX.
- Sequence Number 16 bits The LSB bits for the sequence number. It increments by one for each RTP data packet sent, and may be used by the receiver for detecting packet loss and for restoring the packet sequence. The initial value of the sequence number is random (i.e. unpredictable) in order to make known-plaintext attacks on encryption more difficult.
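A minimal sketch of handling such 16-bit sequence numbers (wraparound, random initial value, loss detection); the helper names are assumptions, not part of the specification:

```python
import random

def next_expected(seq: int) -> int:
    """16-bit RTP sequence numbers wrap from 65535 back to 0."""
    return (seq + 1) & 0xFFFF

def initial_sequence_number() -> int:
    """Random (unpredictable) initial value, as the text suggests,
    to hinder known-plaintext attacks on encryption."""
    return random.getrandbits(16)

def lost_packets(prev_seq: int, cur_seq: int) -> int:
    """Number of packets missing between two received sequence numbers,
    taking wraparound into account (sketch; no reordering handling)."""
    return (cur_seq - prev_seq - 1) & 0xFFFF
```

For example, receiving sequence number 1 directly after 65534 implies two lost packets (65535 and 0).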
- Timestamp 32 bits
- the timestamp denotes the sampling instant of the frame to which the RTP packet belongs. Packets belonging to the same frame must have the same timestamp.
- RTP payload header extension: According to the invention, the fields of the known RTP header keep their usual meaning, but that header is amended as follows.
- RTP Payload Frame Status (PLFS) - 2 bit: The frame status describes which type of data will follow the extended RTP header in the payload block:

  PLFS code | Payload type
  00 | Ambisonics coefficients
  01 | Frame end (+ Ambisonics coefficients)
  10 | Frame begin (+ Metadata)
  11 | Metadata

  I.e., in the first packet of a frame, instead of audio data, additional metadata can be transmitted. In case of Ambisonics transmission, the metadata contains source and Ambisonics encoder related information (production side information) required for the decoding process.
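Decoding the 2-bit PLFS field might look as follows; placing PLFS in the two most significant bits of its byte is an illustrative assumption, since the text does not fix the bit position here:

```python
# Payload types for the 2-bit PLFS field, as listed in the table above.
PLFS_TYPES = {
    0b00: "Ambisonics coefficients",
    0b01: "Frame end (+ Ambisonics coefficients)",
    0b10: "Frame begin (+ Metadata)",
    0b11: "Metadata",
}

def parse_plfs(first_byte: int) -> str:
    """Decode the 2-bit PLFS field; this sketch assumes PLFS occupies
    the two most significant bits of the first extension byte."""
    return PLFS_TYPES[(first_byte >> 6) & 0b11]
```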
- Time Code/Sync Frequency (TCSF) - 30 bit unsigned integer
- the following SMPTE time code or the synchronisation is based on a specific clock frequency, the Time Code/Sync Frequency TCSF.
- the TCSF is defined as a 30 bit integer field. The value is represented in Hz and leads to a frequency range from 0 to 1073.741824 MHz, wherein a value of 0 Hz is signalling that no time code is available.
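A small sketch of validating the 30-bit TCSF field under the stated convention (0 Hz signals that no time code is available); the function name is illustrative:

```python
def check_tcsf(tcsf: int):
    """TCSF is a 30-bit unsigned integer in Hz; 0 signals that no time
    code is available. The maximum representable value is 2**30 - 1 Hz,
    i.e. roughly the 1073.74 MHz upper end stated in the text."""
    if not 0 <= tcsf < 2 ** 30:
        raise ValueError("TCSF must fit into 30 bits")
    return None if tcsf == 0 else tcsf  # None: no time code available
```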
- the selection in data field AST facilitates not only a separation within Ambisonics (cf. the example provided below in connection with Fig. 9) but also the parallel transmission of differently encoded audio source signals (Ambisonics and/or PCM data + position data), i.e. the inventive protocol can be complemented e.g. for PCM data.
- the below-described SMPTE Time Code/Clock Sync Info (STCSI) facilitates the temporally correct assignment of the audio signal sources.
- the general Ambisonics header is transmitted only in the first data packet of a frame and the individual Ambisonics header is transmitted in all other data packets.
- in an alternative mode, the general Ambisonics header is also available in every data packet in front of the individual Ambisonics header. This mode enables a modification of the parameters in each data packet, i.e. in real-time. It can be useful for real-time applications where no or only small buffers are available. However, this mode decreases the available bandwidth.
- Different sources can generate audio signals at the same time.
- Known protocols are based on a separate transmission of the sound sources, i.e. every data frame refers to a single temporal section in which, depending on the sampling frequency, several samples can be contained. Therefore, in known protocols, different source signals occurring at the same time instant will use the same time stamp and the same frame number. This poses no problem for offline processing, i.e. non-real-time processing.
- the transmitted data are buffered and assembled later on. However, this does not work for real-time processing in which a small latency is demanded.
- the data field XAH facilitates carrying the header along continuously, and the parser 41 in Fig. 4 can switch back and forth block-by-block (or Ethernet packet-by-packet or frame-by-frame) between different audio source types.
- Distinguishing between general header and individual header facilitates a real-time adaptation.
- if STS is cleared, the value in the 24 bit field STCSI (see below) represents the SMPTE time code. If STS is set, field STCSI contains user-specific synchronisation information.
- the packet offset describes the distance in bytes between the first payload octet of the first data packet in a frame relative to the first payload octet in the current data packet.
- PAO(HIGH) represents the 32 MSBs and PAO(LOW) represents the 32 LSBs.
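Splitting and recombining the 64-bit packet offset into its PAO(HIGH)/PAO(LOW) halves can be sketched as (helper names illustrative):

```python
def split_pao(offset: int):
    """Split a 64-bit packet byte offset into PAO(HIGH) (32 MSBs)
    and PAO(LOW) (32 LSBs)."""
    return (offset >> 32) & 0xFFFFFFFF, offset & 0xFFFFFFFF

def join_pao(high: int, low: int) -> int:
    """Recombine the two 32-bit fields into the 64-bit byte offset."""
    return (high << 32) | low
```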
- Ambisonics payload data and Ambisonics header data shall be fragmented such that the resulting RTP data packet is smaller than the 'path MTU' mentioned above.
- the path MTU is a 'jumbo frame' of e.g. 9000 bytes.
- a small individual Ambisonics header is sent in front of each data packet.
- a general header contains source and encoder related information that can be useful for the Ambisonics decoder. It contains information that is valid for all data packets within a frame, and for small frames and/or data packets it can be sent once at the beginning of a frame. Especially for real-time applications where the packet information is changing frequently, it can be advantageous to send the general header with each data packet.
- Table 1:

  AFT code | Format
  00 | B-Format order
  01 | numerical upward
  10 | numerical downward
  11 | Reserved

  Degree n | Order m | Channel
  0 |  0 | W
  1 |  1 | X
  1 | -1 | Y
  1 |  0 | Z
  2 |  0 | R
  2 |  1 | S
  2 | -1 | T
  2 |  2 | U
  2 | -2 | V
  3 |  0 | K
  3 |  1 | L
  3 | -1 | M
  3 |  2 | N
  3 | -2 | O
  3 |  3 | P
  3 | -3 | Q
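The degree/order-to-channel assignment of Table 1 can be captured as a lookup table (illustrative Python, mirroring the table entries; the constant name is an assumption):

```python
# (degree n, order m) -> B-format channel letter, up to degree 3,
# i.e. the 16 channels of the extended B-format.
BFORMAT_CHANNELS = {
    (0, 0): "W",
    (1, 1): "X", (1, -1): "Y", (1, 0): "Z",
    (2, 0): "R", (2, 1): "S", (2, -1): "T", (2, 2): "U", (2, -2): "V",
    (3, 0): "K", (3, 1): "L", (3, -1): "M", (3, 2): "N",
    (3, -2): "O", (3, 3): "P", (3, -3): "Q",
}
```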
- the sequence of each matrix column in Eq.(1) from top to bottom represents a numerical upward order type.
- a degree value always starts with 0 and runs up to Ambisonics Order N .
- the sequence starts with lowest order - N and runs up to order + N .
- the downward type uses for each degree the reversed order.
- the Ambisonics order describes the quality of the Ambisonics encoding and decoding via Ξ.
- An order up to 255 should be sufficient.
- the order is distinguished in horizontal and vertical direction. In case of 2D, only AHO has a value greater than '0'.
- a mixed order can have different AHO and AVO values.
- Ambisonics Normalisation/Scaling Format (ANSF) - 3 bit Identifies different normalisation formats, typically used for Ambisonics.
- the normalisation corresponds to the orthogonality relationship between Y_n^m and (Y_{n'}^{m'})^*.
- additional normalisation principles exist, e.g. Furse-Malham.
- the Furse-Malham formulation facilitates a normalisation of the coefficients to get maximum values of ⁇ 1, which yields an optimal dynamic range.
- the scaling factors are fixed over one frame. The scaling factors will be transmitted only once in front of the Ambisonics coefficients.
- ANSF code | Format
  000 | Orthonormal
  001 | Schmidt semi-normalised
  010 | 4π normalised
  011 | Unnormalised
  100 | Furse-Malham
  101 | Dedicated scaling
  11x | Reserved
- the reference radius R ref value of the loudspeakers in mm is required in case of spherical waves.
- in the term k·r, k = 2πf/c is the wave number for the audible frequencies f, with speed of sound c ≈ 340 m/s.
- This code defines the word length as well as the format (integer/floating point) of the transmitted Ambisonics coefficients A ( k ).
- the sample format enables an adaptation to different value ranges.
- nine sample formats are predefined:

  ASF code | Format
  0000 | Unsigned integer 8 bit
  0001 | Signed integer 8 bit
  0010 | Signed integer 16 bit
  0011 | Signed integer 24 bit
  0100 | Signed integer 32 bit
  0101 | Signed integer 64 bit
  0110 | Float 32 bit (binary single prec.)
  0111 | Float 64 bit (binary double prec.)
  1000 | Float 128 bit (binary quad prec.)
  1001-1111 | Reserved
- AIB: If ASF is specified as an integer format, the number AIB of invalid bits can mask the lowest bits within the ASF integer word. AIB is coded as a 5 bit unsigned integer value, so that up to 31 bits can be marked as invalid. Valid bits start at the MSB. Note that the AIB value must be less than the ASF integer word length.
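Masking the AIB lowest (invalid) bits of an integer sample, as described above, might be sketched as (function name illustrative):

```python
def mask_invalid_bits(sample: int, aib: int, word_length: int) -> int:
    """Clear the AIB lowest (invalid) bits of an unsigned integer sample;
    valid bits start at the MSB. AIB is a 5-bit value (0..31) and must
    be smaller than the ASF integer word length."""
    if not 0 <= aib < word_length or aib > 31:
        raise ValueError("AIB out of range")
    # keep the word_length-bit window, then clear the aib lowest bits
    mask = ((1 << word_length) - 1) & ~((1 << aib) - 1)
    return sample & mask
```

For an 8-bit sample with AIB = 4, the lower nibble is zeroed while the upper nibble is preserved.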
- the rate at which the input data x i ( k ) are sampled is coded as an unsigned integer.
- FSM: If FSM is cleared, the following 31 bits for FS represent the file size in bytes. If FSM is set, FS represents the total number of data packets in the actual frame.
- the frame size number FS is to be interpreted in view of the FSM flag's value. Depending on the application, the frame size can vary from frame to frame.
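A sketch of packing the FSM flag together with the 31-bit FS value into one 32-bit word; placing FSM in the most significant bit is an assumption, since the exact bit layout is not spelled out here:

```python
def pack_fsm_fs(fsm: bool, fs: int) -> int:
    """Pack the FSM flag and the 31-bit FS value into one 32-bit word
    (FSM in the MSB is an assumed layout).

    FSM cleared: FS is the file size in bytes.
    FSM set:     FS is the total number of data packets in the frame."""
    if not 0 <= fs < 2 ** 31:
        raise ValueError("FS must fit into 31 bits")
    return (int(fsm) << 31) | fs

def unpack_fsm_fs(word: int):
    """Recover (FSM, FS) from the packed 32-bit word."""
    return bool(word >> 31), word & 0x7FFFFFFF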
- a 'frame' can contain several equal-length packets, wherein the last packet can have a different length that is described in the individual Ambisonics header. Every packet may use such a header for describing, at the end of a frame, length values that differ from prior packet lengths.
- bits in front of APL are reserved. This enables an extension of the individual header, e.g. by packet related flags, and a 32 bit alignment for the following Ambisonics coefficients.
- the maximum length is 65535.
- the payload data type is defined in the data field PLFS (RTP Payload Frame Status), cf. Fig. 5 .
- 'pure' Ambisonics data or 'pure' metadata can be arranged.
- the transmission processing operates in a sequential manner, i.e. at each transmission clock step (which is totally different from the sampling rate) only 32 or 64 bits of a data packet can be dealt with.
- the number of considered Ambisonics samples in one data packet is related to one concatenated sample time or to a group of concatenated sample times.
- the following examples of payload data show different dimensions, orders, and Ambisonics coefficients based on the encoder/decoder cases 2 to 4 of Fig. 3 .
- the first index x of A( x , y ) describes the sequence number for a specific order, whereas the second index y stands for the sample time k in a data packet.
- SMPTE MXF and XML are pre-defined.
- AMT code | Format
  0x00 | SMPTE MXF
  0x80 | XML
  0x01-0x7F | Reserved
  0x81-0xFF | Reserved
- This data field is followed by specific metadata. If possible the metadata descriptions should be kept simple in order to get only one metadata packet in the 'begin packet' of a frame. However, the packet length in bytes is the same as for Ambisonics coefficients. If the amount of metadata will exceed this packet length, the metadata has to be fragmented into several packets which shall be inserted between packets with Ambisonics coefficients. If the metadata amount in bytes in one packet is less than the regular packet length, the remaining packet bytes are to be padded with '0' or stuffing bits.
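The fragmentation-and-padding rule for metadata can be sketched as follows (function name illustrative; zero padding is used, one of the two options the text allows):

```python
def pad_metadata(metadata: bytes, packet_length: int):
    """Fragment metadata into packets of the regular packet length and
    pad the last packet with '0' bytes, as the text requires."""
    packets = [metadata[i:i + packet_length]
               for i in range(0, len(metadata), packet_length)] or [b""]
    # pad the (possibly only) last packet up to the regular packet length
    packets[-1] = packets[-1].ljust(packet_length, b"\x00")
    return packets
```

Metadata that fits into one packet yields a single zero-padded packet; larger metadata is split into several full packets plus a padded remainder.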
- the encapsulated CRC word at the end of each Ethernet packet should be used.
- the content addressable memories CAM detect all protocol data which will lead to a decision about how the received data are to be processed in the following steps or stages, and the registers REG store information about the length of the payload data.
- the parser evaluates the header data in a hierarchical manner and can be implemented in hardware or software, according to any real-time requirements.
- spherical waves SPW or plane waves PW can occur together in one application, e.g. the worldwide live broadcast of a concert in 3D format, wherein all receiving units are arranged in cinemas.
- the individual signals are to be transmitted separately so that a correct presentation can be facilitated.
- the parser can distinguish this and supply two separate 'distance coding' units with the corresponding data items.
- the inventive Ambisonics decoder depicted in Fig. 4 can process all these signals, whereas in the prior art several decoders would be required. I.e., considering the Ambisonics wave type facilitates the advantages described above.
Abstract
Audio signal datastreams for 2D presentation are channel oriented. Due to 3D video in cinema and broadcasting, spatial or 3D audio becomes attractive. Ambisonics coding/decoding provides a sound field description that is independent from any specific loudspeaker set-up. The inventive system facilitates real-time transmission of Ambisonics of an order higher than '3' as well as single microphone signals. The transmitted 3D sound field can be adapted at receiver side to the available positions of loudspeakers. The Ambisonics header information level is adaptable between a simple and an encoder related mode enabling fast decoder modifications. The Ambisonics processing is based on linear operators, i.e. the Ambisonics channels data can be packed and transmitted singly or in an assembled manner as a matrix.
Description
- The invention relates to a method and to an apparatus for generating and for decoding sound field data including Ambisonics sound field data of an order higher than three, wherein for encoding and for decoding different processing paths can be used.
- Traditional audio data signal transport streams for 2D presentation are channel oriented. 2D presentations include formats like stereo or surround sound, and are based on audio container formats like WAV and BWF (Broadcast Wave Format). The wave format WAV is described in Microsoft, "Multiple Channel Audio Data and WAVE Files", updated March 7, 2007, http://www.microsoft.com/whdc/device/audio/multichaud.mspx , and in http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/WAVE.html , last update 19 June 2006.
- Improved surround systems require an increasing number of loudspeakers or audio channels, which leads to an extension of these audio container formats.
- Due to the upcoming 3D video activities in cinema and broadcasting, spatial or 3D audio becomes more and more attractive. Nevertheless, descriptions of spatial audio scenes are significantly more complex than in existing 2D surround sound systems. Well-known descriptions are based on Wave Field Synthesis (WFS, cf. WO2004/047485 A1 ) as well as on Ambisonics, which was already developed in the early 1970s: http://en.wikipedia.org/wiki/Ambisonics .
- WFS combines a high number of spherical sound sources for emulating plane waves from different directions. Therefore, a lot of loudspeakers or audio channels are required. A description contains a number of source signals as well as their specific positions.
- Ambisonics, however, uses specific coefficients based on spherical harmonics for providing a sound field description that is independent from any specific loudspeaker set-up. This leads to a description which does not require information about loudspeaker positions during sound field recording or generation of synthetic scenes. The reproduction accuracy in an Ambisonics system can be modified by its order N. The 'higher-order Ambisonics' (HOA) description considers an order of more than one, and the focus in this application is on HOA.
- By that order the number of required audio information channels can be determined for a 2D or a 3D system, because this depends on the number of spherical harmonic bases. The number O of channels is for 2D: O = 2·N + 1, and for 3D: O = (N+1)². Besides true 2D or 3D cases, 'mixed orders' have different orders in 2D (x-y plane only) and 3D (additionally z axis).
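The two channel-count formulas can be captured in a small helper (illustrative, not part of the patent text):

```python
def num_channels(order: int, dimension: str) -> int:
    """Number O of Ambisonics channels for a given order N.

    2D: O = 2*N + 1   (circular harmonics)
    3D: O = (N + 1)**2 (spherical harmonics)
    """
    if dimension == "2D":
        return 2 * order + 1
    if dimension == "3D":
        return (order + 1) ** 2
    raise ValueError("dimension must be '2D' or '3D'")
```

This reproduces the figures quoted elsewhere in the text: the first-order B-Format needs 3 channels (2D) or 4 channels (3D), and the extended B-format with order 3 needs at most 16 channels.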
- The first-order B-format uses three channels for 2D and four channels for 3D. The first-order B-format has been extended to the higher-order B-format. Depending on O, a horizontal (2D), a full-sphere (3D), or a mixed sound field type description can be generated. By ignoring the appropriate channels, this B-format is backward compatible, i.e. a 2D Ambisonics receiver is able to decode the 2D components from a 3D Ambisonics sound field. The extended B-format for HOA considers orders up to three only, which corresponds to a maximum of 16 channels.
- The older UHJ-format was introduced to enable mono and stereo compatibility. The G-format was introduced to reproduce sound scenarios in 5.1 environments.
- However, all these existing formats do not consider orders of more than three.
- The WAVE_FORMAT_EXTENSIBLE format is an extension of the above-mentioned WAV format. One application is the use of the Ambisonics B-format in the WAVEX description: "Wave Format Extensible and the .amb suffix or WAVEX and Ambisonics", http://mchapman.com/amb/wavex .
- As mentioned above, known Ambisonics formats do not consider orders of more than three.
- Wave-based audio format descriptions are used in different applications. An environment which is very important today and will become even more important in the future is internet applications based on Ethernet transmission protocols. However, a data structure for Ambisonics transmission that is able to use the above-mentioned B-format as well as additional features like the Ambisonics order and the coefficients' bit lengths in an efficient manner is not yet known to the applicant.
- Another aspect is that in the B-format, plane waves are always assumed for the sound sources. For a higher quality of the acoustic wave field reproduction, a more realistic view should emulate the sound sources as spherical waves. However, spherical waves introduce more complex frequency dependencies than plane waves.
- Furthermore, a transmission of video content is in many cases combined with audio content transmission. Existing streaming data structures, e.g. for cinema applications, consider 2D surround sound only, for example WAV or AIFF (Audio Interchange File Format).
- A combined real-time transmission of video and audio format that is based on an extended 'Real-Time Protocol' (RTP) has been published in H. Schulzrinne et al., "RFC3550-RTP: A Transport Protocol for Real-Time Applications", Columbia University, http://www.faqs.org/rfcs/rfc3550.html, July 2003, in particular sections 5.1 and 5.3.1. The standard RTP header uses 12 data octets (8-bit data fields) in every RTP packet as depicted in
Fig. 1. In EP 1936908 A1 an extension for such an RTP header is proposed, for additionally encapsulating an extended RTP header and DPX (Digital Moving-Picture Exchange) data, AIFF/BWF audio data, or metadata, as depicted in Fig. 2. - This payload header extends the RTP header of
Fig. 1 by a 2-octet extended sequence number and a 2-octet extended time stamp. Furthermore, one octet for flags and a reserved field, followed by a 3-octet SMPTE time stamp and a 4-octet offset value, is proposed therein. The 32-bit aligned payload data follows the header data. - A problem to be solved by the invention is to provide a data structure (i.e. a protocol layer) for 3D higher-order Ambisonics sound field description formats, which can be used for real-time transmission over Ethernet. This problem is solved by the encoding method disclosed in
claim 1 and the decoding method disclosed in claim 3. Apparatuses which utilise these methods are disclosed in further claims. - The data structures described below facilitate real-time transmission of 3D sound field descriptions over Ethernet. From the content of additional metadata the transmitted 3D sound field can be adapted at receiver side to the available headphones or to the number and positions of loudspeakers, for regular as well as for irregular set-ups. Unlike in WFS, no regular loudspeaker set-ups including a large number of loudspeakers are required.
- Advantageously, in the inventive transmission data structure the sound quality level can be adapted to the available sound reproduction system, e.g. by mapping a 3D Ambisonics sound field description onto a 2D loudspeaker set-up. Advantageously, the inventive format enables Ambisonics orders up to N=255, whereas known Ambisonics formats allow orders up to N=3 only.
- Further, the inventive data structure considers single microphones or microphone arrays as well as virtual acoustical sources with different accuracies and sample rates. Advantageously, moving sources (i.e. sources with time-dependent spatial positions) are considered in the Ambisonics descriptions inherently.
- The Ambisonics header information level is adaptable between a simple mode and an encoder-related mode. The latter enables fast decoder modifications, which is useful especially for real-time applications.
- The proposed data structure is extendable for classical audio scene descriptions, i.e. sound sources and their positions.
- Generally, the inventive Ambisonics processing is based on linear operators, i.e. the Ambisonics channel data can be packed and transmitted individually or in an assembled manner as a matrix.
- In principle, the inventive encoding method is suited for generating sound field data including Ambisonics sound field data of an order higher than three, said method including the steps:
- receiving S input signals x (k) from a microphone array including M microphones, and/or from one or more virtual sound sources;
- multiplying said input signals x (k) with a matrix Ψ, so as to provide coefficients vector data d (k), an Ambisonics order value N and a normalisation parameter Norm;
- processing said coefficients vector data d (k), value N and parameter Norm in one or two or more of the following four paths:
- a) combining said coefficients vector data d (k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);
- b) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A(k), said parameter Norm and said order N value;
- c) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said radius Rref value;
- d) based on plane waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency, followed by normalising for plane waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said Plane Wave parameter;
- in case a processing took place in two or more of said paths, multiplexing the corresponding data;
- output of data frames including said provided data and values.
- In principle the inventive encoder apparatus is suited for generating sound field data including Ambisonics sound field data of an order higher than three, said apparatus including:
- means being adapted for multiplying S input signals x (k), which are received from a microphone array including M microphones and/or from one or more virtual sound sources, with a matrix Ψ, so as to provide coefficients vector data d (k), an Ambisonics order value N and a normalisation parameter Norm;
- means being adapted for processing said coefficients vector data d (k), value N and parameter Norm in one or two or more of the following four paths:
- a) combining said coefficients vector data d(k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);
- b) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm and said order N value;
- c) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said radius Rref value;
- d) based on plane waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency, followed by normalising for plane waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said Plane Wave parameter;
- a multiplexer means for multiplexing the corresponding data in case a processing took place in two or more of said paths, which multiplexer means provide data frames including said provided data and values.
- In principle, the inventive decoding method is suited for decoding sound field data that were encoded according to the above encoding method using one or two or more of said paths, said method including the steps:
- parsing the incoming encoded data, determining the type or types a) to d) of said paths used for said encoding and providing the further data required for a decoding according to the encoding path type or types;
- performing a corresponding decoding processing for one or two or more of the paths a) to d):
- a) based on spherical waves, filtering the received coefficients vector data d (k) in dependency from said radii data RS so as to provide filtered coefficients A (k),
and distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - b) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - c) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - d) based on plane waves, providing said filtered coefficients A (k), order value N, parameter Norm and a flag for Plane Waves;
- in case a processing took place in two or more of said paths, multiplexing the corresponding data, wherein the selected path or paths are determined based on parameter Norm, order value N and said Plane Waves flag;
- decoding said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ω l , so as to provide loudspeaker signals for a loudspeaker array.
- In principle the inventive decoder apparatus is suited for decoding sound field data that were encoded according to the above encoding method using one or two or more of said paths, said apparatus including:
- means being adapted for parsing the incoming encoded data, and for determining the type or types a) to d) of said paths used for said encoding and for providing the further data required for a decoding according to the encoding path type or types;
- means being adapted for performing a corresponding decoding processing for one or two or more of the paths a) to d):
- a) based on spherical waves, filtering the received coefficients vector data d (k) in dependency from said radii data RS so as to provide filtered coefficients A (k),
and distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A'(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - b) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - c) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm; - d) based on plane waves, providing said filtered coefficients A (k), order value N, parameter Norm and a flag for Plane Waves;
- multiplexing means which, in case a processing took place in two or more of said paths, select the corresponding data to be combined, based on parameter Norm, order value N and said Plane Waves flag;
- decoding means which decode said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ω l , so as to provide loudspeaker signals for a loudspeaker array.
- Advantageous additional embodiments of the invention are disclosed in the respective dependent claims.
- Exemplary embodiments of the invention are described with reference to the accompanying drawings, which show in:
- Fig. 1
- Known RTP header format;
- Fig. 2
- Known extended RTP header format encapsulating DPX data, audio data or metadata;
- Fig. 3
- Ambisonics encoder facilitating different applications at production side before Ambisonics coefficients and metadata are transmitted;
- Fig. 4
- Ambisonics decoder facilitating different applications at reproduction side following reception of Ambisonics coefficients and metadata;
- Fig. 5
- RTP payload header extension for Ambisonics data according to the invention;
- Fig. 6
- General Ambisonics data header;
- Fig. 7
- Individual Ambisonics data header;
- Fig. 8
- Ambisonics metadata;
- Fig. 9
- Ambisonics receiver parser.
- At first, different scenarios for sound recording or production as well as for reproduction are considered in order to derive the inventive Ethernet/IP based streaming data format. The description of these scenarios is based at production side on an Ambisonics encoder (AE) and at reproduction side on an Ambisonics decoder (AD).
- In an Ambisonics encoder as shown in
Fig. 3 there are two different kinds of possible input signals: - a
microphone array 31 including M microphones, i.e. real sound sources; - V
virtual sources 32, i.e. synthetic sounds. - For an HOA description of a source not only the time-dependent source signal s(t) is required but also its position, which may move around and is time-dependent, too. The source position can be described by its spherical coordinates, i.e. the radius rS from the origin to the source and the angles (Θ S , Φ S ) = Ω S , where Θ S denotes the inclination and Φ S denotes the azimuth angle in the x,y plane.
- In a first step or
multiplier 33, all S source signals x (k) at each sample time kT, i.e. virtual single sources as well as microphone array sources, are multiplied with a matrix Ψ defined in Eq.(1). - Matrix Ψ with O rows and S columns performs a direction coding because Ψ contains the spherical harmonics
of order n in the range 0,...,N, with the values of m running from -n to +n.
- Matrix Ψ is used to output a vector d (k) of O Ambisonics signals for every sample time instant k, as defined in Eq. (2) and Eq. (3):
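As an aside, this directional encoding can be sketched for the simplest non-trivial case. The following Python sketch assumes first-order (N=1) real spherical harmonics in ACN channel ordering with SN3D scaling, a convention chosen here purely for illustration (the text does not mandate one), and computes d(k) = Ψ·x(k) for plane-wave source directions:

```python
import math

def sph_harm_first_order(azimuth: float, inclination: float) -> list:
    """Real first-order spherical harmonics, ACN order (W, Y, Z, X) with SN3D
    scaling. This convention is an assumption for illustration only."""
    elevation = math.pi / 2 - inclination  # inclination is measured from the z axis
    return [
        1.0,                                      # n=0, m=0  (W)
        math.sin(azimuth) * math.cos(elevation),  # n=1, m=-1 (Y)
        math.sin(elevation),                      # n=1, m=0  (Z)
        math.cos(azimuth) * math.cos(elevation),  # n=1, m=+1 (X)
    ]

def encode(sources, samples):
    """d(k) = Psi * x(k): matrix Psi has O = (N+1)^2 = 4 rows and S columns,
    one column of spherical harmonics per source direction."""
    psi_columns = [sph_harm_first_order(az, inc) for az, inc in sources]
    return [sum(col[o] * x for col, x in zip(psi_columns, samples))
            for o in range(4)]

# A single unit-amplitude source straight ahead (azimuth 0, inclination pi/2):
d = encode([(0.0, math.pi / 2)], [1.0])
print(d)  # [1.0, 0.0, 0.0, 1.0] -> only the W and X channels are excited
```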
These signals represent the complete sound field description that has to be transmitted to the reproduction side. Vector d (k) contains the directional information only. However, the distances of all sources over a specific frequency range are to be considered, too, and the frequency behaviour or dependency is non-linear. Therefore additional filters 341 to 343 described below are used.
- The pressure of a sound field p(r,Θ,Φ,kω) can be calculated as follows: p(r,Θ,Φ,kω) = Σ(n=0..N) Σ(m=-n..+n) A_n^m(kω)·j_n(kω·r)·Y_n^m(Θ,Φ), wherein j_n denote the spherical Bessel functions of the first kind and Y_n^m the spherical harmonics.
- All these dependencies lead to the following four cases that are to be considered for an extended transmission of Ambisonics coefficients based on RTP.
Fig. 3 shows a block diagram of an Ambisonics encoder for these four cases at production side. The required functions are represented by corresponding steps or stages in front of the transmission. All processing steps are clocked by a frequency that is made synchronous in stage 38 with the sample frequency 1/T. A controller 37 receives a mode selection signal and the value of order N, and controls an optional multiplexer 36 that receives the filter responses and the output signal of multiplier 33, and outputs the inventive data structure frames 39. Multiplier 33 represents a directional encoder providing corresponding coefficients and outputs the unfiltered vector data d (k), the order N value, and parameter Norm. - An array response filter 42 ('Filter 1' in
Fig. 4) for the microphone sources data only can be arranged at decoder side. The unfiltered vector data d (k), the order N value, and parameter Norm are assembled in a combiner 340 with radii data RS (t), and are fed to an optional multiplexer 36. Radii data RS (t) represent the distances of the audio sources of the S input signals x(k), and refer to microphones as well as to artificially generated virtual sound sources. - The coefficients vector data d (k) pass through an
array response filter 341 for the microphone sources (filter 2). The filtering compensates for the microphone-array response and is based on Bessel or Hankel functions. Basically, the signals from the output vectors d (k) are filtered. The other inputs serve as parameters for the filter, e.g. parameter R is used for the term k*r. The filtering is relevant only for microphones that have the individual radius Rm. Such radii are taken into consideration in the term k*r of the Bessel or Hankel functions. Normally, the amplitude response of the filter starts with a lowpass characteristic but increases for higher frequencies. The filtering is performed in dependency from the Ambisonics order N, the order n and the radii Rm values, so as to compensate for non-linear frequency dependency. A subsequent normalisation step or stage 351 for spherical waves data provides filtered coefficients A (k). It is assumed that there is also a corresponding filter at reproduction side (filter 431 in Fig. 4). The filtered and normalised coefficients A (k), parameter Norm and the order N value are fed to multiplexer 36. - The coefficients vector data d (k) pass through an array response filter 342 for the microphone sources (filter 3). The filtering is performed in dependency from said Ambisonics order N, said order n, the radii Rm values and a radius Rref value representing the average radius Rref of the loudspeakers at decoder side as described in the below section "Radius Rref (RREF)", so as to compensate for non-linear frequency dependency. In case microphone signals are used, a filter for spherical waves data is also arranged at reproduction side. Then the average radius Rref of the loudspeakers has to be considered already in filter 342. A subsequent normalisation step or
stage 352 for spherical waves data provides filtered coefficients A (k). Step/stage 352 can include a distance coding like that described in connection with Fig. 4. The filtered coefficients A (k) from step/stage 352, parameter Norm, the order N value and radius value Rref are fed to multiplexer 36. - The coefficients vector data d (k) pass through an
array response filter 343 for the microphone sources (filter 4). The filtering is performed in dependency from the Ambisonics order N, the radii Rm values and a Plane Wave parameter. A subsequent normalisation step or stage 353 for plane waves data provides the filtered coefficients A (k), parameter Norm, the order N value and a flag for Plane Wave to multiplexer 36. - The Ambisonics encoder can code the output signals 361 in any one of these paths, in any two of these paths, or in more than two of these paths.
The normalisation steps or stages 351 to 353 can use a normalisation or scaling as described below in section "Ambisonics Normalisation/Scaling Format (ANSF)". - Following transmission of the values mentioned above, e.g. via an Ethernet connection, at reproduction side the Ambisonics decoder depicted in
Fig. 4 parses the incoming data structures in a parser 41 in order to detect the case type and to provide the data for performing the appropriate functions. An example for such a parser is disclosed in WO 2009/106637 A1. - Unfiltered vector data d (k), order value N, parameter Norm and each radii data RS (t) are parsed. These values pass through an array response filter 42 (Filter 1) for filtering (a filtering as described in
Fig. 3) the received d (k) data under consideration of all radii RS (t). The resulting filtered coefficients A (k) are distance coded (DC) in a distance coding step or stage 431 for all loudspeaker radii RLS and order N, and pass thereafter together with loudspeaker direction values Ωl (representing the directions of the LS loudspeakers 46), value N and parameter Norm through an optional multiplexer 44 to a panning or pseudo inverse step or stage 45. Distance coding means taking into account Bessel or Hankel functions with the radii parameter in the term k*r for plane or spherical waves. Examples of distance coding are published in M.A. Poletti, "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics", J. Audio Eng. Soc., vol. 53, no. 11, November 2005, e.g. in equations (31) and (32), and in J. Daniel, "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonic Format", AES 23rd Int. Conf., Copenhagen, Denmark, 23-25 May 2003. - Filtered coefficients A (k), parameter Norm and order value N are parsed. The filtered coefficients A (k) are distance coded (DC) in a distance coding step or
stage 432 for all loudspeaker radii RLS and order N, and pass thereafter together with loudspeaker direction values Ω l , value N and parameter Norm through multiplexer 44 to the panning or pseudo inverse step or stage 45. Spherical waves on AE and AD sides are assumed. - Filtered coefficients A(k), order value N, parameter Norm and radius value Rref are parsed. The filtered coefficients A (k) are distance coded (DC) in a distance coding step or
stage 432 for all loudspeaker radii RLS and order N under consideration of radius Rref, and pass thereafter together with loudspeaker direction values Ω l , value N and parameter Norm through multiplexer 44 to the panning or pseudo inverse step or stage 45. Spherical waves on AE and AD sides are assumed. - Filtered coefficients A (k), order value N, parameter Norm and a flag for Plane Waves are parsed. The filtered coefficients A (k) together with loudspeaker direction values Ω l , value N and parameter Norm pass through multiplexer 44 to the panning or pseudo inverse step or
stage 45. Plane waves on AE and AD sides are assumed. - Based on parameter Norm, order value N and the Plane Waves flag, a
mode selector 47 selects in multiplexer 44 the corresponding path or paths a) to d) which was or were used at encoder side. Decoder 45, which represents a panning or a mode matching operation including a pseudo inverse, inverts the matrix Ψ operation of the Ambisonics encoder in Fig. 3, and applies this operation to the filtered coefficients A(k) or the filtered and distance coded coefficients A'(k), respectively, in dependency from the parameter Norm, order value N and the loudspeaker direction values Ω l , and provides the l loudspeaker signals for a loudspeaker array 46. The matrix Ψ operation is inverted for cases 1-3 by w l (k)=D·A'(k), and for case 4 by w l (k)=D·A(k). Parser 41 also provides synchronisation information that is used for re-synchronisation of a clock 48. - The invention specifies a packet-based streaming format for encapsulating spatial sound field descriptions based on Ambisonics into an extended real-time transport protocol, in particular RTP, for real-time streaming of spatial audio scenes. The focus is on a standalone spatial (2D/3D) audio real-time application, e.g. a transmission of a live concert or a live sport event via IP. This requires a specific spatial audio layer including time stamps and possibly synchronisation information. The Ambisonics real-time stream can be used together with an RTP layer. In addition, alternative RTP layers with or without extended headers are described below.
- In general, for a spatial audio transmission a sound field description in Ambisonics can be used in which possible sound source positions are inherently encoded. An alternative is the transmission of the source signals together with their time-dependent or time-independent positions. A switching possibility between these two alternatives is provided, too, but the directly following section will focus on Ambisonics.
- Ethernet transmissions (e.g. via internet) are performed in data packets with a typical packet length called 'path MTU' of up to 1500 or 9000 bytes. In case Ambisonics sound fields are to be transmitted via Ethernet, such relatively small data packets are not large enough. Therefore, several packets can be combined in larger containers named 'frames'. Such a frame represents a dedicated time interval within which a typical number of packets is transmitted. For example in video applications, in 1080p video mode a frame contains 1080 data packets, each of which describes one line of a complete video frame. Especially for real-time applications, even for audio (where low latency and low packet loss are important), a transmission should be frame based.
- Because Ambisonics supports a sound field description independent of positions but with an adaptable quality, different amounts of data per packet or frame are possible. However, the number of octets in a data packet shall always be the same within a frame, except for the last packet. In principle, the RTP sequence number is to be incremented with each packet.
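A sketch of this fragmentation rule (the payload size per packet is a free choice below the path MTU; the concrete numbers used here are illustrative, not mandated by the text):

```python
import math

def fragment(frame_payload: bytes, packet_payload_size: int):
    """Split one frame into equally sized packets; only the last one may be
    shorter. Returns (sequence_number, chunk) pairs with consecutive numbers."""
    count = max(1, math.ceil(len(frame_payload) / packet_payload_size))
    return [(seq, frame_payload[seq * packet_payload_size:(seq + 1) * packet_payload_size])
            for seq in range(count)]

# 1452 is an illustrative payload size below a 1500-byte path MTU.
packets = fragment(bytes(10000), 1452)
print(len(packets), len(packets[0][1]), len(packets[-1][1]))  # 7 1452 1288
```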
- With regard to
Fig. 3 and Fig. 4, Case 1 requires a transmission of the time-dependent radii RS (t). This is an option if the filter processing is to be performed in the decoder. However, in the following section the focus is on Cases 2-4, in which the filtered coefficients A (k) are transmitted. This allows a higher bandwidth because the transmission remains independent from all source positions, i.e. it is better suited for Ambisonics.
A standard RTP header (cf.Fig. 1 ) containing the following bit fields:
Version (V) - 2 bit
RTP Version (default is V=2)
Padding (P) - 1 bit
If set, a data packet will contain several additional padding bytes. These are always located at the end following the payload. The last padding byte contains a count of how many padding bytes are to be ignored.
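This padding rule can be sketched as follows; the 4-octet alignment is an assumed example value, and, as in RFC 3550, the count octet itself is included in the number of padding bytes:

```python
def pad(payload: bytes, align: int = 4) -> bytes:
    """Append padding so the packet length is a multiple of 'align'.
    The last padding octet carries the number of padding octets, itself included."""
    n = -len(payload) % align
    if n == 0:
        n = align                 # at least one octet is needed to hold the count
    return payload + bytes(n - 1) + bytes([n])

def unpad(packet: bytes) -> bytes:
    """Strip the padding indicated by the last octet (used when the P bit is set)."""
    return packet[:-packet[-1]]

padded = pad(b"abcde")
print(len(padded), unpad(padded))  # 8 b'abcde'
```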
Extension (X) - 1 bit
If set, the fixed header is followed by exactly one header extension.
CSRC count (CC) - 4 bit
The number of contributing source (CSRC) identifiers following the fixed header.
Marker (M) - 1 bit
In general, the marker bit can be defined by a profile. Here, it signalises the end of a frame, i.e. it is set for the last data packet. For other packets it must be cleared.
Payload Type (PT) - 7 bits
The payload type is defined as EASF for an audio standalone transmission. For a combined transmission with uncompressed video the film format is chosen, e.g. DPX.
Sequence Number - 16 bits
The least significant bits of the sequence number. It increments by one for each RTP data packet sent, and may be used by the receiver for detecting packet loss and for restoring the packet sequence. The initial value of the sequence number is random (i.e. unpredictable) in order to make known-plaintext attacks on encryption more difficult. The standard 16-bit sequence number is augmented with another 16 bits in the payload header in order to avoid problems due to wrap-around when operating at high data rates.
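A sketch of how a receiver may combine the two 16-bit fields into one 32-bit sequence number (the field placement follows the description above: the standard RTP field carries the low bits, the payload header the high bits):

```python
def full_sequence_number(rtp_seq: int, ext_seq: int) -> int:
    """Combine the 16-bit RTP sequence number (LSBs) with the 16-bit
    extended sequence number from the payload header (MSBs) into 32 bits."""
    return (ext_seq << 16) | (rtp_seq & 0xFFFF)

# The standard field wraps from 0xFFFF to 0x0000 while the extension increments:
print(hex(full_sequence_number(0xFFFF, 0x0001)))  # 0x1ffff
print(hex(full_sequence_number(0x0000, 0x0002)))  # 0x20000
```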
Timestamp - 32 bits
The timestamp denotes the sampling instant of the frame to which the RTP packet belongs. Packets belonging to the same frame must have the same timestamp.
RTP payload header extension
According to the invention, the fields of the known RTP header keep their usual meaning, but that header is amended as follows:
RTP Payload Frame Status (PLFS) - 2 bit
The frame status describes which type of data will follow the extended RTP header in the payload block:

PLFS code   Payload type
00          Ambisonics coefficients
01          Frame end (+ Ambisonics coefficients)
10          Frame begin (+ Metadata)
11          Metadata

- Time Code/Sync Frequency (TCSF) - 30 bit unsigned integer
The following SMPTE time code or the synchronisation is based on a specific clock frequency, the Time Code/Sync Frequency TCSF. In order to support a large range of frequencies, the TCSF is defined as a 30 bit integer field. The value is represented in Hz and leads to a frequency range from 0 to 2^30-1 Hz (about 1073.74 MHz), wherein a value of 0 Hz signals that no time code is available.
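Assuming that the 2-bit PLFS field and the 30-bit TCSF field share one 32-bit word, with PLFS in the two most significant bits (an assumption about the exact bit placement, which the text does not state), the packing can be sketched as:

```python
PLFS_AMBISONICS_COEFFS = 0b00
PLFS_FRAME_END         = 0b01   # frame end (+ Ambisonics coefficients)
PLFS_FRAME_BEGIN       = 0b10   # frame begin (+ metadata)
PLFS_METADATA          = 0b11

def pack_plfs_tcsf(plfs: int, tcsf_hz: int) -> int:
    """Pack PLFS (2 bit) and TCSF (30 bit, clock frequency in Hz) into 32 bits."""
    assert 0 <= plfs <= 0b11 and 0 <= tcsf_hz < (1 << 30)
    return (plfs << 30) | tcsf_hz

def unpack_plfs_tcsf(word: int):
    return word >> 30, word & ((1 << 30) - 1)

# 48 kHz sync clock in a frame-begin packet; TCSF == 0 would mean "no time code".
word = pack_plfs_tcsf(PLFS_FRAME_BEGIN, 48000)
print(unpack_plfs_tcsf(word))  # (2, 48000)
```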
- The transmission of audio content is possible in different modes: in the form of Ambisonics sound field descriptions, or as sampled audio sources including their positions. The following table shows the AST values and their meaning.
AST code   Possible sources
00         Sound field
01         Sound sources + fixed positions
10         Sound sources + time-dependent positions
11         Reserved

- The selection in data field AST facilitates not only a separation within Ambisonics (cf. the example provided below in connection with
Fig. 9) but also the parallel transmission of differently encoded audio source signals (Ambisonics and/or PCM data + position data), i.e. the inventive protocol can be complemented e.g. for PCM data. The below-described SMPTE Time Code/Clock Sync Info (STCSI) facilitates the temporally correct assignment of the audio signal sources. - The dimension in case of existing and extendable formats is described as follows:
ADIM code   Dimension
0           2D
1           3D

- If XAH is cleared, the general Ambisonics header is transmitted only in the first data packet of a frame and the individual Ambisonics header is transmitted in all other data packets.
- If XAH is set, the general Ambisonics header shall also be available in every data packet in front of the individual Ambisonics header. This mode enables a modification of the parameters in each data packet, i.e. in real-time. It can be useful for real-time applications where no or only small buffers are available. However, this mode decreases the available bandwidth.
- Different sources can generate audio signals at the same time. Known protocols are based on a separate transmission of the sound sources, i.e. every data frame refers to a single temporal section in which, depending on the sampling frequency, several samples can be contained. Therefore, in known protocols, different source signals occurring at the same time instant will use the same time stamp and the same frame number. This poses no problem for offline processing, i.e. non-real-time processing: the transmitted data are buffered and assembled later on. However, this does not work for real-time processing in which a small latency is demanded. In the inventive protocol, the data field XAH allows the header to be carried along continuously, and the
parser 41 in Fig. 4 can switch back and forth block-by-block (or Ethernet packet-by-packet or frame-by-frame) between different audio source types. - Distinguishing between general header and individual header facilitates a real-time adaptation.
- If STS is cleared, the value in the 24 bit field STCSI (see below) represents the SMPTE time code. If STS is set, field STCSI contains user-specific synchronisation information.
- Reserved bits for future applications concerning the SMPTE time code or clock synchronisation.
- Identifies the SMPTE time code (hh:mm:ss:frfr = 6:6:6:6 bit), or synchronisation information for the local clocks of each source and sink. That synchronisation information format is user-dependent. It appears that this kind of synchronisation has not been used before for Ambisonics and video synchronisation.
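The 6:6:6:6 bit split of the 24-bit STCSI field stated above can be sketched as a simple pack/unpack pair (an illustrative helper, not part of the specification):

```python
# Sketch of packing an SMPTE time code (hh:mm:ss:frame) into the 24-bit
# STCSI field using the 6:6:6:6 bit split given in the text.

def pack_stcsi(hh: int, mm: int, ss: int, fr: int) -> int:
    for v in (hh, mm, ss, fr):
        if not 0 <= v < 64:              # each sub-field is 6 bits wide
            raise ValueError("sub-field exceeds 6 bits")
    return (hh << 18) | (mm << 12) | (ss << 6) | fr

def unpack_stcsi(word: int):
    return ((word >> 18) & 0x3F, (word >> 12) & 0x3F,
            (word >> 6) & 0x3F, word & 0x3F)

# A time code of 12:34:56, frame 07, survives a round trip:
assert unpack_stcsi(pack_stcsi(12, 34, 56, 7)) == (12, 34, 56, 7)
```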
- In a current frame the packet offset describes the distance in bytes from the first payload octet of the first data packet in the frame to the first payload octet of the current data packet. PAO(HIGH) represents the 32 MSBs and PAO(LOW) represents the 32 LSBs.
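Recombining the 64-bit packet offset from its two 32-bit halves, as described above, can be sketched as follows (illustrative helper names):

```python
# The byte distance is split into PAO(HIGH) (32 MSBs) and PAO(LOW)
# (32 LSBs); these helpers recombine and split a 64-bit offset.

def packet_offset(pao_high: int, pao_low: int) -> int:
    return (pao_high << 32) | (pao_low & 0xFFFFFFFF)

def split_offset(offset: int):
    return (offset >> 32) & 0xFFFFFFFF, offset & 0xFFFFFFFF

# An offset just beyond the 4 GB boundary needs the HIGH half:
offset = packet_offset(0x1, 0x200)
assert offset == 0x100000200
```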
- The above known and extended RTP header data are depicted in
Fig. 5 . PAO(LOW) is followed by the Ambisonics payload data. - Ambisonics payload data and Ambisonics header data shall be fragmented such that the resulting RTP data packet is smaller than the 'path MTU' mentioned above. In case of 10GE transmission the path MTU is a 'jumbo frame' of e.g. 9000 bytes. There are two types of Ambisonics headers. A small individual Ambisonics header is sent in front of each data packet. A general header contains source and encoder related information that can be useful for the Ambisonics decoder. It contains information that is valid for all data packets within a frame, and for small frames and/or data packets it can be sent once at the beginning of a frame. Especially for real-time applications where the packet information is changing frequently, it can be advantageous to send the general header with each data packet.
- The endianness used for the transmitted Ambisonics data.
AE code | Endianness
---|---
0 | Big Endian
1 | Little Endian

- Identifies the length of the complete header in bytes.
- Traditionally, Ambisonics assumes that all audio sources and loudspeakers provide plane waves for modelling the sound field. A typical example is the B-format. However, an extended Ambisonics sound field description with higher quality requires also a modelling with spherical waves. Therefore, the AWT field considers both possibilities.
AWT code | Wave type
---|---
0 | Plane wave
1 | Spherical wave

- Identifies the sequence in which the Ambisonics coefficients are transmitted. Up to 4 order types can be addressed. The different formats depend on the order and indexing in Eq. (1), i.e. how the spherical harmonics are ordered in a column of W. The existing Ambisonics B-format uses a specific sequence of Ambisonics coefficients according to Table 1, wherein K to Z denote known B-Format channels. In case of 3D the coefficients are transmitted from top to bottom in Table 1.
E.g. for degree n=2, the sequence will be WXYZRSTUV.

AFT code | Format
---|---
00 | B-Format order
01 | numerical upward
10 | numerical downward
11 | Reserved

Table 1:

Degree n | Order m | Channel
---|---|---
0 | 0 | W
1 | 1 | X
1 | -1 | Y
1 | 0 | Z
2 | 0 | R
2 | 1 | S
2 | -1 | T
2 | 2 | U
2 | -2 | V
3 | 0 | K
3 | 1 | L
3 | -1 | M
3 | 2 | N
3 | -2 | O
3 | 3 | P
3 | -3 | Q

- As an alternative, the sequence of each matrix column in Eq. (1) from top to bottom represents a numerical upward order type. The degree value always starts with 0 and runs up to the Ambisonics order N. For each degree n, the sequence starts with the lowest order -n and runs up to order +n. The downward type uses the reversed order for each degree.
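The numerical upward and downward order types described above can be sketched as follows (an illustrative helper; the B-format sequence follows its own fixed channel list from Table 1 and is not generated here):

```python
# For each degree n = 0..N, the upward type lists the orders m from
# lowest (-n) to highest (+n); the downward type reverses them per degree.

def numerical_order(N: int, downward: bool = False):
    seq = []
    for n in range(N + 1):
        orders = list(range(-n, n + 1))   # m runs from -n to +n
        if downward:
            orders.reverse()
        seq.extend((n, m) for m in orders)
    return seq

# Order N = 1, upward: degree 0, then degree 1 as m = -1, 0, +1
assert numerical_order(1) == [(0, 0), (1, -1), (1, 0), (1, 1)]
```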
- The Ambisonics order describes the quality of the Ambisonics encoding and decoding via Ψ. An order up to 255 should be sufficient. According to the audio dimension, the order is distinguished in horizontal and vertical direction.
In case of 2D, only AHO has a value greater than '0'. A mixed order can have different AHO and AVO values. - For possible extension of order related issues, these reserved bits are considered in front of AHO and AVO.
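Using the standard Ambisonics component counts (2N+1 coefficients for 2D, (N+1)^2 for full 3D of order N), the ADIM/AHO/AVO fields imply a payload size as sketched below. The helper is illustrative and deliberately leaves mixed-order (AHO ≠ AVO) layouts open, since the document does not specify their coefficient count:

```python
# Number of Ambisonics components implied by the dimension and order
# fields, using the standard counts for 2D and full 3D representations.

def component_count(adim: int, aho: int, avo: int) -> int:
    if adim == 0:                  # 2D: only the horizontal order is used
        return 2 * aho + 1
    if aho == avo:                 # full 3D of order N = AHO = AVO
        return (aho + 1) ** 2
    raise NotImplementedError("mixed-order layouts need a dedicated rule")

assert component_count(0, 3, 0) == 7    # 2D, order 3
assert component_count(1, 3, 3) == 16   # 3D, order 3: (3+1)^2
```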
- Ambisonics Normalisation/Scaling Format (ANSF) - 3 bit Identifies different normalisation formats typically used for Ambisonics. The normalisation corresponds to the orthogonality relationship between the spherical harmonics.
In case of dedicated scaling the scaling factors are fixed over one frame. The scaling factors will be transmitted only once in front of the Ambisonics coefficients.

ANF code | Format
---|---
000 | Orthonormal
001 | Schmidt semi-normalised
010 | 4π normalised
011 | Unnormalised
100 | Furse-Malham
101 | Dedicated scaling
11x | Reserved

- The reference radius Rref value of the loudspeakers in mm is required in case of spherical waves. The maximum radius depends on the acoustic wavelength λ, which can be calculated from the audible frequencies f (fLOW = 20 Hz to fHI = 20 kHz) and the speed of sound c = 340 m/s. Thus for the radius Rref, values from 17000 mm down to 17 mm are required, and a word length of 16 bit is sufficient for that.
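The Rref bounds quoted above follow directly from λ = c / f; a small arithmetic check:

```python
# Acoustic wavelength in millimetres for c = 340 m/s; over the audible
# band 20 Hz .. 20 kHz this spans 17000 mm down to 17 mm, so a 16-bit
# unsigned millimetre field (max 65535) is sufficient for Rref.

C = 340.0                          # speed of sound in m/s

def wavelength_mm(f_hz: float) -> float:
    return C * 1000.0 / f_hz       # metres converted to millimetres

assert wavelength_mm(20.0) == 17000.0
assert wavelength_mm(20000.0) == 17.0
assert wavelength_mm(20.0) <= 65535    # fits the 16-bit field
```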
- This code defines the word length as well as the format (integer/floating point) of the transmitted Ambisonics coefficients A (k). The sample format enables an adaptation to different value ranges. In the following table nine sample formats are predefined:
ASF code | Format
---|---
0000 | Unsigned integer 8 bit
0001 | Signed integer 8 bit
0010 | Signed integer 16 bit
0011 | Signed integer 24 bit
0100 | Signed integer 32 bit
0101 | Signed integer 64 bit
0110 | Float 32 bit (binary single prec.)
0111 | Float 64 bit (binary double prec.)
1000 | Float 128 bit (binary quad prec.)
1001-1111 | Reserved

- If ASF specifies an integer format, the number AIB of invalid bits masks the lowest bits within the ASF integer. AIB is coded as a 5 bit unsigned integer value, so that up to 31 bits can be marked as invalid. Valid bits start at the MSB. Note that the AIB value shall be less than the ASF integer word length.
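The AIB rule above can be sketched as a masking helper (illustrative names; the sample value is treated as an unsigned bit pattern):

```python
# For integer sample formats, the lowest AIB bits of each coefficient
# are invalid and can be masked off; valid bits start at the MSB.

def mask_invalid_bits(sample: int, word_length: int, aib: int) -> int:
    if not 0 <= aib < word_length:   # AIB must stay below the word length
        raise ValueError("invalid AIB for this sample format")
    mask = ((1 << word_length) - 1) & ~((1 << aib) - 1)
    return sample & mask

# 16-bit sample with 2 invalid LSBs: the low two bits are cleared
assert mask_invalid_bits(0x1237, 16, 2) == 0x1234
```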
- The rate at which the input data xi (k) are sampled. The value in Hz is coded as an unsigned integer.
- If FSM is cleared, the following 31 bits for FS represent the file size in bytes. If FSM is set, FS represents the total number of data packets in the actual frame.
- The frame size number FS is to be interpreted in view of the FSM flag's value. Depending on the application, the frame size can vary from frame to frame.
- As mentioned above, a frame represents a unit of several data packets. It is assumed that for uncompressed data all packets except the last one have the same length. Then the frame size in bytes can be calculated as #bytes per frame = (FS-1)*packet size + last packet size.
- Basic Ethernet applications normally use MTU sizes of 1500 bytes. Modern 10 Gigabit Ethernet applications consider larger MTUs (e.g. 'jumbo frames' with 9000 to 16000 bytes). To enable data sets larger than 2^32 bytes (4 GB), the frame size should be specified as a number of data packets. I.e., if a data packet contains 9000 bytes, the maximum frame size would be greater than 35 Tbyte.
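The frame-size arithmetic from the two paragraphs above as a small sketch:

```python
# All packets except the last share one length, and interpreting FS as a
# packet count (FSM set) lifts the 2**32-byte limit of a byte-based field.

def frame_bytes(fs_packets: int, packet_size: int, last_packet_size: int) -> int:
    # #bytes per frame = (FS - 1) * packet size + last packet size
    return (fs_packets - 1) * packet_size + last_packet_size

assert frame_bytes(4, 9000, 1200) == 28200
# With 9000-byte jumbo packets, 2**32 packets address more than 35 Tbyte:
assert (2**32) * 9000 > 35 * 2**40
```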
- The general Ambisonics data header in the Ambisonics payload data is depicted in
Fig. 6 . A 'frame' can contain several equal-length packets, wherein the last packet can have a different length that is described in the individual Ambisonics header. Every packet may use such a header for describing length values that differ from the preceding packet lengths. - The bits in front of APL are reserved. This enables an extension of the individual header, e.g. by packet related flags, and a 32 bit alignment for the following Ambisonics coefficients.
- Defines the MTU length for each individual data packet in bytes. The maximum length is 65535.
- This individual Ambisonics header is depicted in
Fig. 7 . If applied, the two data fields RSRVD and APL will follow data field FS in Fig. 6 . APL contains the length of the following Ethernet packet which contains payload data (Ambisonics components). - As mentioned above, the payload data type is defined in the data field PLFS (RTP Payload Frame Status), cf.
Fig. 5 . Following the general Ambisonics header (if present) and the individual Ambisonics header, 'pure' Ambisonics data or 'pure' metadata can be arranged. - Due to the time dependency of the input samples x(kT)= x(k) and of the directions and radii RS (t), it is important to perform the Ambisonics encoding and decoding with regard to the specific sample time kT, or even simpler at k.
- However, when considering a protocol based transmission, the transmission processing operates in a sequential manner, i.e. at each transmission clock step (which is totally different from the sampling rate) only 32 or 64 bits of a data packet can be dealt with. The number of considered Ambisonics samples in one data packet is related to one concatenated sample time or to a group of concatenated sample times.
- Normally, all Ambisonics coefficients have the same length across all data packets in a frame. However, if the general Ambisonics header is inserted in a normal data packet, the data parameters can be modified within a frame.
- The following examples of payload data show different dimensions, orders, and Ambisonics coefficients based on the encoder/
decoder cases 2 to 4 of Fig. 3 . The first index x of A(x,y) describes the sequence number for a specific order, whereas the second index y stands for the sample time k in a data packet.
Example 1: ADIM=1, AHO=AVO=3, ASF=2
Example 2: ADIM=1, AHO=AVO=2, ASF=3, AIB=2
Example 3: ADIM=1, AHO=AVO=2, ASF=4, AIB=7
Example 4: ADIM=1, AHO=AVO=1, ASF=4
Example 5: ADIM=1, AHO=AVO=1, ASF=7

- If PLFS is set to '10' (binary, i.e. decimal 2), metadata are transmitted instead of Ambisonics coefficients. For metadata, different formats exist, of which some are considered below. Thus, in front of the concrete metadata content, a metadata type defines the specific format as depicted in
Fig. 8 . The first two data fields RSRVD and APL are like in Fig. 7 . - The types SMPTE MXF and XML are pre-defined.
AMT code | Format
---|---
0x00 | SMPTE MXF
0x80 | XML
0x01-0x7F | Reserved
0x81-0xFF | Reserved

- Reserved bits for future applications concerning metadata.
- This data field is followed by the specific metadata. If possible, the metadata descriptions should be kept simple in order to get only one metadata packet in the 'begin packet' of a frame. However, the packet length in bytes is the same as for Ambisonics coefficients. If the amount of metadata exceeds this packet length, the metadata have to be fragmented into several packets which shall be inserted between packets with Ambisonics coefficients. If the metadata amount in bytes in one packet is less than the regular packet length, the remaining packet bytes are to be padded with '0' or stuffing bits.
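The fragmentation and zero-padding rule above can be sketched as follows (illustrative helper names):

```python
# Metadata are cut into chunks of the regular packet length; a short
# final chunk is padded with zero bytes up to that length.

def fragment_metadata(metadata: bytes, packet_len: int):
    packets = []
    for i in range(0, len(metadata), packet_len):
        chunk = metadata[i:i + packet_len]
        packets.append(chunk.ljust(packet_len, b"\x00"))  # '0' stuffing
    return packets

# 10 bytes of metadata at a 4-byte packet length yield three packets,
# the last one padded with two zero bytes.
pkts = fragment_metadata(b"\x01" * 10, 4)
assert len(pkts) == 3
```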
- For channel coding purposes the encapsulated CRC word at the end of each Ethernet packet should be used.
- At the production side as shown in
Fig. 3 , four different Cases are considered by the above-mentioned data structure (i.e. three cases where A (k) data are transmitted and one case where d (k) data are transmitted). The question is how to detect the Ambisonics Encoding/Decoding mode at reproduction or receiver side. The Case chosen at production side can be derived in parser 41 in Fig. 4 from the bit fields RREF and AFT. The following table shows the values for RREF and AFT and their meaning:

Mode | Payload data
---|---
2 | filtered A (k), RREF=0, AFT = Spherical Wave
3 | filtered A (k), RREF≠0, AFT = Spherical Wave
4 | filtered A (k), AFT = B-format/Plane Wave, RREF is obsolete

Complementing figures 3 and 4 , in Fig. 9 the parser 41 of the Ambisonics decoder in Fig. 4 is shown in more detail. For collecting corresponding data items from an Ambisonics data stream ADSTR, the parser can use registers REG and content addressable memories CAM. The content addressable memories CAM detect all protocol data which will lead to a decision about how the received data are to be processed in the following steps or stages, and the registers REG store information about the length and/or the payload data. The parser evaluates the header data in a hierarchical manner and can be implemented in hardware or software, according to any real-time requirements. - Several audio signals are generated and transmitted as spherical waves SPW or plane waves PW, e.g. the worldwide live broadcast of a concert in 3D format, wherein all receiving units are arranged in cinemas. In such a case the individual signals are to be transmitted separately so that a correct presentation can be facilitated. By a corresponding arrangement of the protocol (Ambisonics Wave Type AWT described above) the parser can distinguish this and supply two separate 'distance coding' units with the corresponding data items. The inventive Ambisonics decoder depicted in
Fig. 4 can process all these signals, whereas in the prior art several decoders would be required. I.e., considering the Ambisonics wave type facilitates the advantages described above.
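The mode detection from RREF and AFT described by the table above can be sketched as a small parser dispatch (names and constants here are illustrative, not taken from the document):

```python
# Recovering the encoding Case chosen at production side from the bit
# fields RREF and AFT alone, following the Mode table.

SPHERICAL, PLANE = 1, 0          # assumed wave-type values for this sketch

def detect_mode(rref: int, aft_wave_type: int) -> int:
    if aft_wave_type == PLANE:
        return 4                 # plane-wave payload; RREF is obsolete
    return 2 if rref == 0 else 3 # spherical: RREF distinguishes modes 2 and 3

assert detect_mode(0, SPHERICAL) == 2
assert detect_mode(1700, SPHERICAL) == 3
```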
Claims (8)
- Method for generating sound field data including Ambisonics sound field data of an order higher than three, said method including the steps:- receiving S input signals x (k) from a microphone array (31) including m microphones, and/or from one or more virtual sound sources (32);- multiplying (33) said input signals x (k) with a matrix Ψ,
so as to get coefficients vector data d (k) representing coded directional information of N Ambisonics signals for every sample time instant k;- processing said coefficients vector data d (k), value N and parameter Norm in one or two or more of the following four paths:a) combining (340) said coefficients vector data d (k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said input signals x(k);b) based on spherical waves, array response filtering (341) said coefficients vector data d (k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate for non-linear frequency dependency, followed by normalising (351) for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm and said order N value;c) based on spherical waves, array response filtering (342) said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as to compensate for non-linear frequency dependency, followed by normalising (352) for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said radius Rref value;d) based on plane waves, array response filtering (343) said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii Rm values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency, followed by normalising (353) for plane waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said Plane Wave parameter;- in case a processing took place in two or more of said paths, multiplexing (36) the corresponding data;- output (361) of data frames (39) including said provided data and values. 
- Apparatus for generating sound field data including Ambisonics sound field data of an order higher than three, said apparatus including:- means (33) being adapted for multiplying S input signals x (k), which are received from a microphone array (31) including m microphones and/or from one or more virtual sound sources (32), with a matrix Ψ,- means (340,341,351,342,352,343,353) being adapted for processing said coefficients vector data d (k), value N and parameter Norm in one or two or more of the following four paths:a) combining said coefficients vector data d (k), said value N and said parameter Norm with radii data RS representing the distances of the sources of said S input signals x(k);b) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N and radii Rm values, said radii Rm values representing individual microphone radii in a microphone array, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm and said order N value;c) based on spherical waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii RM values and a radius Rref value, said radius Rref value representing a mean radius of loudspeakers arranged at decoder side, so as to compensate for non-linear frequency dependency, followed by normalising for spherical waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, and said radius Rref value;d) based on plane waves, array response filtering said coefficients vector data d (k) in dependency from said Ambisonics order N, said radii RM values and a Plane Wave parameter, so as to compensate for non-linear frequency dependency, followed by normalising for plane waves data, so as to provide filtered coefficients A (k), said parameter Norm, said order N value, 
and said Plane Wave parameter;- a multiplexer means (36) for multiplexing the corresponding data in case a processing took place in two or more of said paths, which multiplexer means provide data frames (39) including said provided data and values.
- Method for decoding sound field data that were encoded according to claim 1 using one or two or more of said paths, said method including the steps:- parsing (41) the incoming encoded data, determining the type or types a) to d) of said paths used for said encoding and providing the further data required for a decoding according to the encoding path type or types;- performing a corresponding decoding processing for one or two or more of the paths a) to d):a) based on spherical waves, filtering (42) the received coefficients vector data d (k) in dependency from said radii data RS so as to provide filtered coefficients A (k),
and distance coding (431) said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;b) based on spherical waves, distance coding (432) said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;c) based on spherical waves, distance coding (433) said filtered coefficients A (k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;d) based on plane waves, providing said filtered coefficients A (k), order value N, parameter Norm and a flag for Plane Waves;- in case a processing took place in two or more of said paths, multiplexing (44) the corresponding data, wherein the selected (47) path or paths are determined based on parameter Norm, order value N and said Plane Waves flag;- decoding (45) said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ω l , so as to provide loudspeaker signals for a loudspeaker array (46). - Apparatus for decoding sound field data that were encoded according to claim 1 using one or two or more of said paths, said apparatus including:- means (41) being adapted for parsing the incoming encoded data, and for determining the type or types a) to d) of said paths used for said encoding and for providing the further data required for a decoding according to the encoding path type or types;- means (42,431,432,433) being adapted for performing a corresponding decoding processing for one or two or more of the paths a) to d) :a) based on spherical waves, filtering the received coefficients vector data d (k) in dependency from said radii data RS so as to provide filtered coefficients A (k), and distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;b) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and for all radii Rl of loudspeakers to be used for a presentation of the decoded signals,
and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;c) based on spherical waves, distance coding said filtered coefficients A (k) in dependency from said order value N and said radius value Rref for all radii Rl of loudspeakers to be used for a presentation of the decoded signals, and providing the distance encoded filtered coefficients A '(k) together with loudspeaker direction values Ω l , value N and parameter Norm;d) based on plane waves, providing said filtered coefficients A(k), order value N, parameter Norm and a flag for Plane Waves;- multiplexing means (44) which, in case a processing took place in two or more of said paths, select the corresponding data to be combined, based on parameter Norm, order value N and said Plane Waves flag;- decoding means (45) which decode said distance encoded filtered coefficients A'(k) or said filtered coefficients A(k), respectively, in dependency from said parameter Norm, said order value N and said loudspeaker direction values Ω l , so as to provide loudspeaker signals for a loudspeaker array (46). - Method according to claim 3, or apparatus according to claim 4, wherein said parser (41) includes registers (REG) and content addressable memories (CAM) for collecting data items from the decoder input data by evaluating header data in a hierarchical manner, and wherein said content addressable memories (CAM) detect all protocol data which will lead to a decision about how the received data are to be processed in the decoding, and said registers (REG) store data item length information and/or information about payload data.
- Method according to claim 5, or apparatus according to claim 5, wherein said parser (41) provides data for two or more individual audio signals by distinguishing Ambisonics plane wave and spherical wave types (AWT).
- Method according to one of claims 1, 3 and 5, or apparatus according to one of claims 2, 4 and 5, wherein said Ambisonics sound field data are transferred using Ethernet or internet or a protocol network.
- Data structure for Ambisonics audio signal data which can be encoded according to claim 1, said data structure including:- a data field determining plane wave and spherical wave Ambisonics;- a data field determining the Ambisonics order types B-Format order, numerical upward order, numerical downward order;- a data field determining the channel in dependency from the degree n and the order m;- a data field determining horizontal or vertical order of the coefficients in the Ambisonics matrix.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10306212A EP2451196A1 (en) | 2010-11-05 | 2010-11-05 | Method and apparatus for generating and for decoding sound field data including ambisonics sound field data of an order higher than three |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2451196A1 true EP2451196A1 (en) | 2012-05-09 |
Family
ID=43585582
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP10306212A Withdrawn EP2451196A1 (en) | 2010-11-05 | 2010-11-05 | Method and apparatus for generating and for decoding sound field data including ambisonics sound field data of an order higher than three |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP2451196A1 (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013545391A (en) * | 2010-11-05 | 2013-12-19 | トムソン ライセンシング | Data structure for higher-order ambisonics audio data |
WO2014012945A1 (en) * | 2012-07-16 | 2014-01-23 | Thomson Licensing | Method and device for rendering an audio soundfield representation for audio playback |
EP2733963A1 (en) | 2012-11-14 | 2014-05-21 | Thomson Licensing | Method and apparatus for facilitating listening to a sound signal for matrixed sound signals |
WO2014124261A1 (en) * | 2013-02-08 | 2014-08-14 | Qualcomm Incorporated | Signaling audio rendering information in a bitstream |
DE102013223201B3 (en) * | 2013-11-14 | 2015-05-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and device for compressing and decompressing sound field data of a region |
WO2015104166A1 (en) * | 2014-01-08 | 2015-07-16 | Thomson Licensing | Method and apparatus for improving the coding of side information required for coding a higher order ambisonics representation of a sound field |
WO2015130765A1 (en) * | 2014-02-25 | 2015-09-03 | Qualcomm Incorporated | Order format signaling for higher-order ambisonic audio data |
EP3002960A1 (en) * | 2014-10-04 | 2016-04-06 | Patents Factory Ltd. Sp. z o.o. | System and method for generating surround sound |
US9483228B2 (en) | 2013-08-26 | 2016-11-01 | Dolby Laboratories Licensing Corporation | Live engine |
US9609452B2 (en) | 2013-02-08 | 2017-03-28 | Qualcomm Incorporated | Obtaining sparseness information for higher order ambisonic audio renderers |
CN106796794A (en) * | 2014-10-07 | 2017-05-31 | 高通股份有限公司 | The normalization of environment high-order ambiophony voice data |
US9883310B2 (en) | 2013-02-08 | 2018-01-30 | Qualcomm Incorporated | Obtaining symmetry information for higher order ambisonic audio renderers |
CN109448742A (en) * | 2012-12-12 | 2019-03-08 | 杜比国际公司 | The method and apparatus that the high-order ambiophony of sound field is indicated to carry out compression and decompression |
US10334387B2 (en) | 2015-06-25 | 2019-06-25 | Dolby Laboratories Licensing Corporation | Audio panning transformation system and method |
US10356484B2 (en) | 2013-03-15 | 2019-07-16 | Samsung Electronics Co., Ltd. | Data transmitting apparatus, data receiving apparatus, data transceiving system, method for transmitting data, and method for receiving data |
TWI666931B (en) * | 2013-03-15 | 2019-07-21 | 三星電子股份有限公司 | Data transmitting apparatus, data receiving apparatus and data transceiving system |
CN111460883A (en) * | 2020-01-22 | 2020-07-28 | 电子科技大学 | Video behavior automatic description method based on deep reinforcement learning |
CN112216292A (en) * | 2014-06-27 | 2021-01-12 | 杜比国际公司 | Method and apparatus for decoding a compressed HOA sound representation of a sound or sound field |
CN112908349A (en) * | 2014-06-27 | 2021-06-04 | 杜比国际公司 | Method and apparatus for determining a minimum number of integer bits required to represent non-differential gain values for compression of a representation of a HOA data frame |
US11234091B2 (en) | 2012-05-14 | 2022-01-25 | Dolby Laboratories Licensing Corporation | Method and apparatus for compressing and decompressing a Higher Order Ambisonics signal representation |
US11962990B2 (en) | 2013-05-29 | 2024-04-16 | Qualcomm Incorporated | Reordering of foreground audio objects in the ambisonics domain |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004047485A1 (en) | 2002-11-21 | 2004-06-03 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Audio playback system and method for playing back an audio signal |
EP1936908A1 (en) | 2006-12-19 | 2008-06-25 | Deutsche Thomson OHG | Method, apparatus and data container for transferring high resolution audio/video data in a high speed IP network |
WO2009106637A1 (en) | 2008-02-28 | 2009-09-03 | Thomson Licensing | Hardware-based parser for packet-oriented protocols |
Non-Patent Citations (5)
Title |
---|
"Spherical harmonics", 28 June 2011 (2011-06-28), XP002646194, Retrieved from the Internet <URL:http://en.wikipedia.org/wiki/Spherical_harmonics> [retrieved on 20110628] * |
J.DANIEL: "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonic Format", AES 23RD INTERNATIONAL CONFERENCE, vol. 23, 23 May 2003 (2003-05-23), XP002647040 * |
J.DANIEL: "Spatial Sound Encoding Including Near Field Effect: Introducing Distance Coding Filters and a Viable, New Ambisonic Format", AES 23rd Intl. Conf., vol. 23, 25 May 2003 (2003-05-25) |
M.A.POLETTI: "Three-Dimensional Surround Sound Systems Based on Spherical Harmonics", J.AUDIO ENG.SOC., vol. 53, no. 11, November 2005 (2005-11-01) |
MICROSOFT, MULTIPLE CHANNEL AUDIO DATA AND WAVE FILES, 7 March 2007 (2007-03-07) |
Cited By (75)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2013545391A (en) * | 2010-11-05 | 2013-12-19 | トムソン ライセンシング | Data structure for higher-order ambisonics audio data |
US9241216B2 (en) | 2010-11-05 | 2016-01-19 | Thomson Licensing | Data structure for higher order ambisonics audio data |
TWI823073B (en) * | 2012-05-14 | 2023-11-21 | 瑞典商杜比國際公司 | Method and apparatus for compressing and decompressing a higher order ambisonics signal representation and non-transitory computer readable medium |
US11234091B2 (en) | 2012-05-14 | 2022-01-25 | Dolby Laboratories Licensing Corporation | Method and apparatus for compressing and decompressing a Higher Order Ambisonics signal representation |
US11792591B2 (en) | 2012-05-14 | 2023-10-17 | Dolby Laboratories Licensing Corporation | Method and apparatus for compressing and decompressing a higher order Ambisonics signal representation |
CN107071685A (en) * | 2012-07-16 | 2017-08-18 | Dolby International AB | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN106658342A (en) * | 2012-07-16 | 2017-05-10 | 杜比国际公司 | Method and device for rendering an audio soundfield representation for audio playback |
EP4284026A3 (en) * | 2012-07-16 | 2024-02-21 | Dolby International AB | Method and device for rendering an audio soundfield representation |
WO2014012945A1 (en) * | 2012-07-16 | 2014-01-23 | Thomson Licensing | Method and device for rendering an audio soundfield representation for audio playback |
US11743669B2 (en) | 2012-07-16 | 2023-08-29 | Dolby Laboratories Licensing Corporation | Method and device for decoding a higher-order ambisonics (HOA) representation of an audio soundfield |
US11451920B2 (en) | 2012-07-16 | 2022-09-20 | Dolby Laboratories Licensing Corporation | Method and device for decoding a higher-order ambisonics (HOA) representation of an audio soundfield |
CN106658343B (en) * | 2012-07-16 | 2018-10-19 | Dolby International AB | Method and apparatus for rendering an audio soundfield representation for audio playback |
EP4013072A1 (en) * | 2012-07-16 | 2022-06-15 | Dolby International AB | Method and device for rendering an audio soundfield representation |
US10075799B2 (en) | 2012-07-16 | 2018-09-11 | Dolby Laboratories Licensing Corporation | Method and device for rendering an audio soundfield representation |
US10939220B2 (en) | 2012-07-16 | 2021-03-02 | Dolby Laboratories Licensing Corporation | Method and device for decoding a higher-order ambisonics (HOA) representation of an audio soundfield |
US10595145B2 (en) | 2012-07-16 | 2020-03-17 | Dolby Laboratories Licensing Corporation | Method and device for decoding a higher-order ambisonics (HOA) representation of an audio soundfield |
CN104584588B (en) * | 2012-07-16 | 2017-03-29 | Dolby International AB | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN106658343A (en) * | 2012-07-16 | 2017-05-10 | Dolby International AB | Method and device for rendering an audio soundfield representation for audio playback |
CN104584588A (en) * | 2012-07-16 | 2015-04-29 | 汤姆逊许可公司 | Method and device for rendering an audio soundfield representation for audio playback |
CN107071685B (en) * | 2012-07-16 | 2020-02-14 | 杜比国际公司 | Method and apparatus for rendering an audio soundfield representation for audio playback |
US9712938B2 (en) | 2012-07-16 | 2017-07-18 | Dolby Laboratories Licensing Corporation | Method and device for rendering an audio soundfield representation for audio playback |
US10306393B2 (en) | 2012-07-16 | 2019-05-28 | Dolby Laboratories Licensing Corporation | Method and device for rendering an audio soundfield representation |
CN107071686A (en) * | 2012-07-16 | 2017-08-18 | Dolby International AB | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN107071687A (en) * | 2012-07-16 | 2017-08-18 | Dolby International AB | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN106658342B (en) * | 2012-07-16 | 2020-02-14 | 杜比国际公司 | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN107071687B (en) * | 2012-07-16 | 2020-02-14 | 杜比国际公司 | Method and apparatus for rendering an audio soundfield representation for audio playback |
CN107071686B (en) * | 2012-07-16 | 2020-02-14 | 杜比国际公司 | Method and apparatus for rendering an audio soundfield representation for audio playback |
US9961470B2 (en) | 2012-07-16 | 2018-05-01 | Dolby Laboratories Licensing Corporation | Method and device for rendering an audio soundfield representation |
US9723424B2 (en) | 2012-11-14 | 2017-08-01 | Dolby Laboratories Licensing Corporation | Making available a sound signal for higher order ambisonics signals |
EP2733963A1 (en) | 2012-11-14 | 2014-05-21 | Thomson Licensing | Method and apparatus for facilitating listening to a sound signal for matrixed sound signals |
WO2014075934A1 (en) | 2012-11-14 | 2014-05-22 | Thomson Licensing | Making available a sound signal for higher order ambisonics signals |
CN109448742A (en) * | 2012-12-12 | 2019-03-08 | Dolby International AB | Method and apparatus for compressing and decompressing a higher order ambisonics representation of a sound field |
CN109448742B (en) * | 2012-12-12 | 2023-09-01 | 杜比国际公司 | Method and apparatus for compressing and decompressing higher order ambisonic representations of a sound field |
CN104981869B (en) * | 2013-02-08 | 2019-04-26 | 高通股份有限公司 | Audio spatial cue is indicated with signal in bit stream |
US9883310B2 (en) | 2013-02-08 | 2018-01-30 | Qualcomm Incorporated | Obtaining symmetry information for higher order ambisonic audio renderers |
US9609452B2 (en) | 2013-02-08 | 2017-03-28 | Qualcomm Incorporated | Obtaining sparseness information for higher order ambisonic audio renderers |
RU2661775C2 (en) * | 2013-02-08 | 2018-07-19 | Квэлкомм Инкорпорейтед | Transmission of audio rendering signal in bitstream |
WO2014124261A1 (en) * | 2013-02-08 | 2014-08-14 | Qualcomm Incorporated | Signaling audio rendering information in a bitstream |
CN104981869A (en) * | 2013-02-08 | 2015-10-14 | 高通股份有限公司 | Signaling audio rendering information in a bitstream |
US9870778B2 (en) | 2013-02-08 | 2018-01-16 | Qualcomm Incorporated | Obtaining sparseness information for higher order ambisonic audio renderers |
US10178489B2 (en) | 2013-02-08 | 2019-01-08 | Qualcomm Incorporated | Signaling audio rendering information in a bitstream |
TWI666931B (en) * | 2013-03-15 | 2019-07-21 | 三星電子股份有限公司 | Data transmitting apparatus, data receiving apparatus and data transceiving system |
US10356484B2 (en) | 2013-03-15 | 2019-07-16 | Samsung Electronics Co., Ltd. | Data transmitting apparatus, data receiving apparatus, data transceiving system, method for transmitting data, and method for receiving data |
US11962990B2 (en) | 2013-05-29 | 2024-04-16 | Qualcomm Incorporated | Reordering of foreground audio objects in the ambisonics domain |
US9483228B2 (en) | 2013-08-26 | 2016-11-01 | Dolby Laboratories Licensing Corporation | Live engine |
WO2015071148A1 (en) | 2013-11-14 | 2015-05-21 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and device for compressing and decompressing sound field data of an area |
DE102013223201B3 (en) * | 2013-11-14 | 2015-05-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method and device for compressing and decompressing sound field data of a region |
EP4089675A1 (en) * | 2014-01-08 | 2022-11-16 | Dolby International AB | Method and apparatus for improving the coding of side information required for coding a higher order ambisonics representation of a sound field |
US11211078B2 (en) | 2014-01-08 | 2021-12-28 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations |
CN111179951A (en) * | 2014-01-08 | 2020-05-19 | 杜比国际公司 | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
CN111182443A (en) * | 2014-01-08 | 2020-05-19 | 杜比国际公司 | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
CN111179955A (en) * | 2014-01-08 | 2020-05-19 | 杜比国际公司 | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
US10714112B2 (en) | 2014-01-08 | 2020-07-14 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order Ambisonics representations |
US10147437B2 (en) | 2014-01-08 | 2018-12-04 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations |
EP3648102A1 (en) * | 2014-01-08 | 2020-05-06 | Dolby International AB | Method and apparatus for improving the coding of side information required for coding a higher order ambisonics representation of a sound field |
CN111028849A (en) * | 2014-01-08 | 2020-04-17 | 杜比国际公司 | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
CN111179955B (en) * | 2014-01-08 | 2024-04-09 | Dolby International AB | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
CN111182443B (en) * | 2014-01-08 | 2021-10-22 | 杜比国际公司 | Method and apparatus for decoding a bitstream comprising an encoded HOA representation |
US10424312B2 (en) | 2014-01-08 | 2019-09-24 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations |
CN105981100A (en) * | 2014-01-08 | 2016-09-28 | 杜比国际公司 | Method and apparatus for improving the coding of side information required for coding a higher order ambisonics representation of a sound field |
CN111028849B (en) * | 2014-01-08 | 2024-03-01 | Dolby International AB | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
CN111179951B (en) * | 2014-01-08 | 2024-03-01 | Dolby International AB | Method and apparatus for decoding a bitstream comprising an encoded HOA representation, and medium |
US11869523B2 (en) | 2014-01-08 | 2024-01-09 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations |
US11488614B2 (en) | 2014-01-08 | 2022-11-01 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded Higher Order Ambisonics representations |
US9990934B2 (en) | 2014-01-08 | 2018-06-05 | Dolby Laboratories Licensing Corporation | Method and apparatus for improving the coding of side information required for coding a Higher Order Ambisonics representation of a sound field |
US10553233B2 (en) | 2014-01-08 | 2020-02-04 | Dolby Laboratories Licensing Corporation | Method and apparatus for decoding a bitstream including encoded higher order ambisonics representations |
WO2015104166A1 (en) * | 2014-01-08 | 2015-07-16 | Thomson Licensing | Method and apparatus for improving the coding of side information required for coding a higher order ambisonics representation of a sound field |
WO2015130765A1 (en) * | 2014-02-25 | 2015-09-03 | Qualcomm Incorporated | Order format signaling for higher-order ambisonic audio data |
CN112216292A (en) * | 2014-06-27 | 2021-01-12 | 杜比国际公司 | Method and apparatus for decoding a compressed HOA sound representation of a sound or sound field |
CN112908349A (en) * | 2014-06-27 | 2021-06-04 | 杜比国际公司 | Method and apparatus for determining a minimum number of integer bits required to represent non-differential gain values for compression of a representation of a HOA data frame |
EP3002960A1 (en) * | 2014-10-04 | 2016-04-06 | Patents Factory Ltd. Sp. z o.o. | System and method for generating surround sound |
CN106796794A (en) * | 2014-10-07 | 2017-05-31 | 高通股份有限公司 | The normalization of environment high-order ambiophony voice data |
US10334387B2 (en) | 2015-06-25 | 2019-06-25 | Dolby Laboratories Licensing Corporation | Audio panning transformation system and method |
CN111460883B (en) * | 2020-01-22 | 2022-05-03 | University of Electronic Science and Technology of China | Automatic video behavior description method based on deep reinforcement learning |
CN111460883A (en) * | 2020-01-22 | 2020-07-28 | University of Electronic Science and Technology of China | Automatic video behavior description method based on deep reinforcement learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2451196A1 (en) | Method and apparatus for generating and for decoding sound field data including ambisonics sound field data of an order higher than three | |
EP3175446B1 (en) | Audio processing systems and methods | |
TWI476761B (en) | Audio encoding method and system for generating a unified bitstream decodable by decoders implementing different decoding protocols | |
EP3800898B1 (en) | Data processor and transport of user control data to audio decoders and renderers | |
JP4787442B2 (en) | System and method for providing interactive audio in a multi-channel audio environment | |
CN111837182B (en) | Method and apparatus for generating or decoding a bitstream comprising an immersive audio signal | |
EP1949693B1 (en) | Method and apparatus for processing/transmitting bit-stream, and method and apparatus for receiving/processing bit-stream | |
JP7207447B2 (en) | Receiving device, receiving method, transmitting device and transmitting method | |
JP6908168B2 (en) | Receiver, receiver, transmitter and transmit method | |
JP7310849B2 (en) | Receiving device and receiving method | |
WO2020152394A1 (en) | Audio representation and associated rendering | |
CN106375778B (en) | Method for transmitting three-dimensional audio program code stream conforming to digital movie specification | |
JP6699564B2 (en) | Transmission device, transmission method, reception device, and reception method | |
KR101531510B1 (en) | Receiving system and method of processing audio data | |
CN114448955B (en) | Digital audio network transmission method, device, equipment and storage medium | |
WO2021255327A1 (en) | Managing network jitter for multiple audio streams |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20121110 |