GB2626042A - 6DOF rendering of microphone-array captured audio - Google Patents


Info

Publication number
GB2626042A
GB2626042A GB2300299.1A GB202300299A
Authority
GB
United Kingdom
Prior art keywords
higher order
source
order ambisonics
audio
signal
Prior art date
Legal status
Pending
Application number
GB2300299.1A
Inventor
Jussi Artturi Leppänen
Sujeet Shyamsundar Mate
Lauros Pajunen
Mikko-Ville Ilari Laitinen
Current Assignee
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to GB2300299.1A priority Critical patent/GB2626042A/en
Priority to PCT/EP2023/086101 priority patent/WO2024149567A1/en
Publication of GB2626042A publication Critical patent/GB2626042A/en

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S7/303 Tracking of listener position or orientation
    • H04S7/304 For headphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 Application of ambisonics in stereophonic audio systems

Landscapes

  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Signal Processing (AREA)
  • Stereophonic System (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

A method comprising: obtaining higher order ambisonic audio sources associated with positions in an environment, obtaining a listener position in the environment, generating spatial metadata by processing first order channels of all audio sources not being used for signal interpolation, selecting a plurality of sources to be active sources, determining, from the active sources, a previous source used for signal interpolation, determining, from the active sources, a current source to be used for signal interpolation, generating a current interpolated signal by processing all channels of the current source, performing signal interpolation by generating spatial metadata by processing first order channels of the previous source, crossfading, after a time delay, between the previous interpolated signal and the current interpolated signal, and ceasing generation of the previous interpolated signal; and generating the spatialised audio output based on the interpolated signal. The interpolated signals may be generated using a short-time Fourier transform.

Description

6DOF RENDERING OF MICROPHONE-ARRAY CAPTURED AUDIO
Field
The present application relates to apparatus and methods for audio rendering with 6-degrees-of-freedom systems of microphone-array captured audio.
Background
Spatial audio capture approaches attempt to capture an audio environment or audio scene such that the audio environment or audio scene can be perceptually recreated to a listener in an effective manner, and furthermore may permit a listener to move and/or rotate within the recreated audio environment. To capture and record spatial sound linearly at one position in the recording space, a high-end microphone array is needed. One such microphone is the spherical 32-microphone Eigenmike. From the high-end microphone array higher-order Ambisonics (HOA) signals can be obtained and used for rendering. With the HOA audio signals, the spatial audio can be rendered so that sounds arriving from different directions are satisfactorily separated in a reasonable auditory bandwidth. In some systems multiple microphone locations enable a multi-point HOA (MPHOA) capture system where there are multiple HOA audio signals at locations within an audio scene. In some embodiments even a basic microphone array supporting up to first order ambisonics may also be used for recording the audio scene. In some other embodiments, the audio scene may comprise two or more synthetic FOA or HOA sources.
Audio rendering, where the captured audio signals are presented to a listener, can be part of a virtual reality (VR) or augmented reality (AR) system. The audio rendering furthermore can be performed as part of a VR or AR system where the listener can freely move within the environment or audio scene and rotate their head, which is known as a 6 degrees of freedom (6DoF) configuration. Furthermore the audio rendering can be Multi-Point HOA (MPHOA) audio rendering where the audio scene comprises multiple HOA audio signal recordings which are rendered to a user in a 6DoF manner. That is, the user is able to listen to the recorded scene from positions that may be other than the positions of the recorded HOA sources.
Summary
There is provided according to a first aspect a method for generating a spatialized audio output comprising: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
The processing may be a short time Fourier transform.
Crossfading may be for a second time period.
The first time period may be a first number of time frames, the first number of time frames based on a short time Fourier transform priming delay and the second time period may be a second defined number of processing frames.
The first number of time frames may be 6 frames and the second number of time frames may be 12 frames.
The method may further comprise determining spatial metadata by: analysing the processing on all channel signals of the at least one current higher order ambisonics audio source and at least one channel signal of others of the at least one determined active source; and analysing the continued processing on at least one channel signal of the at least one previous active higher order ambisonics audio source. Generating the spatialized audio output may be further based on the determined spatial metadata.
The at least one respective channel signal of the determined at least one previous higher order ambisonics audio source may be a subset of the channels.
The at least one channel signal of the determined at least one previous higher order ambisonics audio source may be the first four channel signals of the determined at least one previous higher order ambisonics audio source.
Determining at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may comprise: determining an area within which the current listener position is located, the area defined by the vertex positions of at least three higher order ambisonics audio sources; and selecting the at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one current higher order ambisonics audio sources being those whose positions define the area vertices.
According to a second aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising means configured to: obtain at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtain a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determine at least one source as an active source; determine at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determine at least one previous higher order ambisonics audio source associated with signal interpolation determination; perform a processing on all channel signals of the at least one current higher order ambisonics audio source; perform a processing on at least one channel signal of others of the at least one determined active source; determine a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continue a processing on all channel signals of the at least one previous active higher order ambisonics audio source; process at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continue a processing on at least one channel signal of the others of the at least one determined active source; process all channel signals of at least one of the current higher order ambisonics audio source; perform a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfade between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stop the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generate the spatialized audio output based on the determined at least one signal interpolation.
The processing may be a short time Fourier transform.
Crossfading may be for a second time period.
The first time period may be a first number of time frames, the first number of time frames based on a short time Fourier transform priming delay and the second time period may be a second defined number of processing frames.
The first number of time frames may be 6 frames and the second number of time frames may be 12 frames.
The means may further be configured to determine spatial metadata by being configured to: analyse the processing on all channel signals of the at least one current higher order ambisonics audio source and at least one channel signal of others of the at least one determined active source; and analyse the continued processing on at least one channel signal of the at least one previous active higher order ambisonics audio source. The generating of the spatialized audio output may be further based on the determined spatial metadata.
The at least one respective channel signal of the determined at least one previous higher order ambisonics audio source may be a subset of the channels.
The at least one channel signal of the determined at least one previous higher order ambisonics audio source may be the first four channel signals of the determined at least one previous higher order ambisonics audio source.
The means configured to determine at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may be configured to: determine an area within which the current listener position is located, the area defined by the vertex positions of at least three higher order ambisonics audio sources; and select the at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one current higher order ambisonics audio sources being those whose positions define the area vertices.
According to a third aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to perform: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
The processing may be a short time Fourier transform.
Crossfading may be for a second time period.
The first time period may be a first number of time frames, the first number of time frames based on a short time Fourier transform priming delay and the second time period may be a second defined number of processing frames.
The first number of time frames may be 6 frames and the second number of time frames may be 12 frames.
The apparatus may be further caused to perform determining spatial metadata by: analysing the processing on all channel signals of the at least one current higher order ambisonics audio source and at least one channel signal of others of the at least one determined active source; and analysing the continued processing on at least one channel signal of the at least one previous active higher order ambisonics audio source. The generating of the spatialized audio output may be further based on the determined spatial metadata.
The at least one respective channel signal of the determined at least one previous higher order ambisonics audio source may be a subset of the channels.
The at least one channel signal of the determined at least one previous higher order ambisonics audio source may be the first four channel signals of the determined at least one previous higher order ambisonics audio source.
The apparatus caused to perform determining at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may be caused to perform: determining an area within which the current listener position is located, the area defined by the vertex positions of at least three higher order ambisonics audio sources; and selecting the at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one current higher order ambisonics audio sources being those whose positions define the area vertices.
According to a fourth aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising: obtaining circuitry configured to obtain at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining circuitry configured to obtain a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining circuitry configured to determine at least one source as an active source; determining circuitry configured to determine at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining circuitry configured to determine at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing circuitry configured to perform a processing on all channel signals of the at least one current higher order ambisonics audio source; performing circuitry configured to perform a processing on at least one channel signal of others of the at least one determined active source; determining circuitry configured to determine a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing circuitry configured to continue a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing circuitry configured to process at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing circuitry configured to continue a processing on at least one channel signal of the others of the at least one determined active source; processing circuitry configured to process all channel signals of at least one of the current higher order ambisonics audio source; performing circuitry configured to perform a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all the channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping circuitry configured to stop the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating circuitry configured to generate the spatialized audio output based on the determined at least one signal interpolation.
According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
According to a seventh aspect there is provided an apparatus, for generating a spatialized audio output, comprising: means for obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; means for obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; means for determining at least one source as an active source; means for determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; means for determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; means for performing a processing on all channel signals of the at least one current higher order ambisonics audio source; means for performing a processing on at least one channel signal of others of the at least one determined active source; means for determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; means for continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; means for processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; means for continuing a processing on at least one channel signal of the others of the at least one determined active source; means for processing on all channel signals of at least one of the current higher order ambisonics audio source; means for performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and means for stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and means for generating the spatialized audio output based on the determined at least one signal interpolation.
According to an eighth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining of the difference; crossfading between the continued processing on all the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining of the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
An apparatus comprising means for performing the actions of the method as described above.
An apparatus configured to perform the actions of the method as described above.
A computer program comprising program instructions for causing a computer to perform the method as described above.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically a system of apparatus showing the audio rendering or reproduction of an example audio scene and within which a user can move within the audio scene according to some embodiments;
Figure 2 shows schematically an example audio scene comprising reproduction of an audio scene where a user moves within an area determined by higher order ambisonic audio signal sources;
Figure 3 shows schematically the example audio scene as shown in Figure 2 and wherein sources can be identified as full order ambisonic audio sources or first order ambisonic audio sources;
Figure 4 shows schematically the example audio scene as shown in Figures 2 or 3, wherein the listener moves and a cross-fade is implemented;
Figure 5 shows schematically the example audio scene as shown in Figure 4, following the cross-fade;
Figure 6 shows apparatus suitable for implementing some embodiments wherein a capture apparatus can be separate from the rendering apparatus elements;
Figure 7 shows an example flow diagram of the operation of the example apparatus according to some embodiments; and
Figure 8 shows schematically an example device suitable for implementing the apparatus shown.
Embodiments of the Application
The concept as discussed herein in further detail with respect to the following embodiments is related to the rendering of audio scenes wherein the audio scene was captured based on linear or parametric spatial audio methods and with two or more microphone-arrays corresponding to different positions at the recording space (or in other words with audio signal sets which are captured at respective signal set positions in the recording space). Furthermore the concept is related to attempting to lower the computational complexity of the spatial analysis required for MPHOA processing.
As discussed above, 6DoF is presently commonplace in virtual reality, such as VR games, where movement within the audio scene is straightforward to render as all spatial information is readily available (i.e., the position of each sound source as well as the audio signal of each source separately).
In the following examples the audio signal sets are generated by microphones (or microphone-arrays). For example a microphone arrangement may comprise one or more microphones and generate for the audio signal set one or more audio signals. In some embodiments the audio signal set comprises audio signals which are virtual or generated audio signals (for example a virtual speaker audio signal with an associated virtual speaker location). In some embodiments the microphone-arrays are furthermore separate from or physically located away from any processing apparatus, however this does not preclude examples where the microphones are located on the processing apparatus or are physically connected to the processing apparatus.
With respect to Figure 1 an example apparatus is shown which can be configured to implement MPHOA processing according to some embodiments. In some embodiments the apparatus is part of a suitable MPEG-I Audio reference audio renderer.
In some embodiments the apparatus 101 comprises a pre-processor 103.
The pre-processor 103 is configured to receive the head related impulse responses (HRIRs) 100 and the higher order ambisonics microphone positions 104 and generate the head related transfer functions (HRTFs) 110 and the determined microphone triangles T 108.
In some embodiments the apparatus 101 comprises a position pre-processor 105 configured to receive the position of the listener pL 104, the higher order ambisonics microphone positions 106 and the triangles T 108, and from these generate TA(j) 112, the active triangle for frame j; wc(j, k) 128, the chosen interpolation weights for subframe k of frame j; and mc(j) 126, the chosen HOA source for frame j.
In some embodiments the apparatus 101 comprises a spatial analyser 107 configured to receive sESD(i, j) 102, the input time domain HOA signals in Equivalent Spatial Domain representation, from the inputs and TA(j) 112, the active triangle for frame j, from the position pre-processor 105. From these the spatial analyser 107 is configured to generate the metadata θ(i, j, k, b) 116, the azimuth for HOA source i, frame j, subframe k and frequency bin b; φ(i, j, k, b) 118, the elevation for HOA source i, frame j, subframe k and frequency bin b; r(i, j, k, b) 120, the direct-to-total energy ratio for HOA source i, frame j, subframe k and frequency bin b; e(i, j, k, b) 122, the energy for HOA source i, frame j, subframe k and frequency bin b; and the signals S(i, j, k, b) 114.
In some embodiments the apparatus 101 comprises a spatial metadata interpolator 111 configured to receive θ(i, j, k, b) 116, φ(i, j, k, b) 118, r(i, j, k, b) 120 and e(i, j, k, b) 122 from the spatial analyser 107, and wc(j, k) 128 from the position pre-processor 105, and from these generate the interpolated metadata θ(j, k, b) 134, φ(j, k, b) 136, r(j, k, b) 138 and e(j, k, b) 132.
In some embodiments the apparatus 101 comprises a signal interpolator 109 configured to receive mc(j) 126 from the position pre-processor 105, S(i, j, k, b) 114 and e(i, j, k, b) 122 from the spatial analyser 107, and the interpolated energy e(j, k, b) 132 from the spatial metadata interpolator 111, and from these generate the interpolated signal S(j, k, b) 130.
In some embodiments the apparatus 101 comprises a mixer 113 configured to receive the interpolated signal S(j, k, b) 130 from the signal interpolator 109, and the interpolated metadata θ(j, k, b) 134, φ(j, k, b) 136, r(j, k, b) 138 and e(j, k, b) 132 from the spatial metadata interpolator 111. From these the mixer generates the output audio O(j, k, b) 142.
In some embodiments the apparatus 101 comprises an output processor 115 configured to receive the output audio O(j, k, b) 142, a binaural output time-frequency domain signal, from the mixer 113 and to generate the output audio signal sout(j) 144, a binaural output time domain audio signal.
The operation of the spatial analyser 107, which as described above is configured to receive sESD(i, j) 102 from the inputs and TA(j) 112 from the position pre-processor 105 and to generate the metadata θ(i, j, k, b) 116, φ(i, j, k, b) 118, r(i, j, k, b) 120 and e(i, j, k, b) 122 and the signals S(i, j, k, b) 114, is described in further detail here.
The spatial analyser 107 and the renderer are further described with respect to GB2007710.8 and EP21201766.9 as well as the MPEG-I Immersive Audio standard working draft (ISO/IEC 23090-4 WD), Section 6.6.18.
The spatial analysis block takes as input the audio input signals in Equivalent Spatial Domain (ESD) representation sESD(i, j) 102 and provides as output the spatial metadata θ(i, j, k, b) 116, φ(i, j, k, b) 118, r(i, j, k, b) 120 and e(i, j, k, b) 122, for the purposes of determining interpolated spatial metadata at the listener position, as well as a time-frequency domain signal S(i, j, k, b) 114 to be used for signal interpolation.
The signals are first converted into higher-order Ambisonics (HOA) signals as follows:

sHOA(i, j) = MESDtoHOA sESD(i, j),

where MESDtoHOA is an Nch × Nch ESD to HOA conversion matrix, i is the HOA source index and j is the frame index. The output HOA signals are then split into Nsf subframes of equal length:

[sHOA(i, j, 1) ... sHOA(i, j, Nsf)] = sHOA(i, j)

Time-frequency domain conversion is then applied for all active HOA sources i. The conversion can be performed using a suitable function such as the afSTFT function which is found for example in https://github.com/jvilkamo/afSTFT.
sHOA(i, j, k) → S(i, j, k),

where S(i, j, k) is an Nch × Nb matrix containing the time-frequency domain signals of length Nb for each HOA channel.
The afSTFT conversion is run for each channel ch separately. Thus, the more channels there are to process, the more computationally heavy the processing is.
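As an illustration of this front end, the following is a minimal numpy sketch, assuming a generic per-channel STFT callable in place of afSTFT; the names hoa_frame_to_tf, esd_to_hoa and n_subframes are illustrative and not taken from the specification:

    import numpy as np

    def hoa_frame_to_tf(s_esd, esd_to_hoa, n_subframes, stft):
        """Convert one frame of ESD signals to time-frequency HOA subframes.

        s_esd      : (n_ch, frame_len) ESD-domain samples for one HOA source.
        esd_to_hoa : (n_ch, n_ch) conversion matrix MESDtoHOA.
        stft       : callable mapping a 1-D signal to its complex frequency bins.
        """
        s_hoa = esd_to_hoa @ s_esd                        # ESD -> HOA channels
        subframes = np.split(s_hoa, n_subframes, axis=1)  # Nsf equal-length parts
        # The transform runs once per channel, so keeping a source at first
        # order (4 channels) instead of full order directly reduces the cost.
        return [np.stack([stft(ch) for ch in sf]) for sf in subframes]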
For each frequency bin b of signal S(i, j,k) the spatial analysis block calculates spatial metadata comprising direction, diffuseness and energy information. These are then passed on to the spatial metadata interpolator 111 which in some embodiments can be configured to implement interpolation in a manner similar to that described in section 6.6.18.3.4.1 "Metadata interpolation" in ISO/IEC 23090-4 WD1 as discussed in further detail below.
The spatial metadata furthermore can be calculated from a signal covariance matrix CFOA, which is obtained from the signal as follows:

CFOA(i, j, k, b) = s(i, j, k, b) sH(i, j, k, b),

where:

s(i, j, k, b) = [ Sb,1(i, j, k) × 1.0
                  Sb,2(i, j, k) × 0.5774
                  Sb,3(i, j, k) × 0.5774
                  Sb,4(i, j, k) × 0.5774 ],

where Sb,ch(i, j, k) is the value in matrix S(i, j, k) corresponding to channel ch and frequency bin b.
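For illustration, a per-bin analysis in the style described above might look as follows. The direction, ratio and energy formulas here follow common first-order (DirAC-style) practice and assume ACN channel ordering (W, Y, Z, X); they are an assumption, since the text above only states that the metadata is derived from CFOA:

    import numpy as np

    def analyse_bin(s_foa):
        """s_foa: length-4 complex vector s(i, j, k, b) for one frequency bin."""
        c = np.outer(s_foa, s_foa.conj())       # CFOA = s s^H
        energy = np.real(np.trace(c)) / 2.0     # total energy estimate
        # Intensity-like vector from the W cross-terms (ACN order assumed).
        ix, iy, iz = np.real(c[0, 3]), np.real(c[0, 1]), np.real(c[0, 2])
        azimuth = np.arctan2(iy, ix)
        elevation = np.arctan2(iz, np.hypot(ix, iy))
        ratio = np.sqrt(ix**2 + iy**2 + iz**2) / max(energy, 1e-12)
        return azimuth, elevation, min(ratio, 1.0), energy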
The signal S(ic, j, k), where ic is the index of the chosen HOA source for signal interpolation is passed on to the signal interpolation block, where a prototype binaural signal is calculated from it. In some embodiments, the prototype signal creation involves applying an EQ gain on the signal, rotating it according to listener head orientation and then multiplying it with an HOA to binaural transformation matrix. This, for example, in some embodiments can be implemented in a form similar to that described in Section 6.6.18.3.4.2 "Signal interpolation" in ISO/IEC 23090-4 WD1 as discussed in further detail below.
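A minimal sketch of this prototype-signal step, assuming a precomputed rotation matrix and HOA-to-binaural matrix (stand-ins for data a real renderer would derive from the head tracker and the HRTF set):

    import numpy as np

    def prototype_binaural(s_bin, eq_gain, rotation, hoa_to_binaural):
        """s_bin: (n_ch,) complex HOA bin of the chosen source ic."""
        rotated = rotation @ (eq_gain * s_bin)  # EQ, then head-orientation rotation
        return hoa_to_binaural @ rotated        # (2,) binaural prototype bin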
In some embodiments, in order to reduce computational complexity, the spatial analysis (and also the signal handling described above) can be implemented only for the sources for which it is needed (i.e. active sources). The determination of which HOA sources are active can be performed based on the listener's position in the scene with respect to the HOA sources in the scene and a triangulation that has been performed on the HOA source positions in a pre-processing phase.
For example, in some embodiments the active sources are the sources forming the triangle within which the listener is located.
An example scene 201 is shown in Figure 2. The audio scene 201 comprises microphones m1 203, m2 205, m3 207 and m4 209, which are arranged such that there are two triangles defined by the locations of the microphones. The scene thus comprises a first triangle defined by the 'connection' 204 between m1 203 and m2 205, the 'connection' 208 between m2 205 and m4 209 and the 'connection' 206 between m1 203 and m4 209. Furthermore the scene comprises a second triangle defined by the 'connection' 210 between m3 207 and m2 205, the 'connection' 208 between m2 205 and m4 209 and the 'connection' 212 between m4 209 and m3 207. In this example the listener pL 211 is located within the second triangle. As such the active sources are the microphones which form the vertices or corners of the second triangle: m2 205, m3 207 and m4 209.
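The active-source selection for a scene like that of Figure 2 can be sketched as follows, using a simple 2-D sign test for triangle containment; the function names are illustrative, not from the specification:

    def _side(p, a, b):
        """Signed area test: which side of segment a-b the point p lies on."""
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

    def triangle_contains(p, a, b, c):
        s1, s2, s3 = _side(p, a, b), _side(p, b, c), _side(p, c, a)
        return (s1 >= 0 and s2 >= 0 and s3 >= 0) or \
               (s1 <= 0 and s2 <= 0 and s3 <= 0)

    def find_active_triangle(listener_xy, mic_xy, triangles):
        """triangles: index triples, e.g. [(0, 1, 3), (1, 2, 3)] for Figure 2."""
        for tri in triangles:
            if triangle_contains(listener_xy, *(mic_xy[i] for i in tri)):
                return tri  # its three microphones are the active sources
        return None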
The signal interpolator 109 is further configured to select as an input audio signal an audio signal associated with the HOA source closest to the listener. The signal interpolator 109 is then configured to create an interpolated time-frequency domain audio signal. This can be implemented in some embodiments by processing the selected input signal corresponding to the closest HOA source to the user, for example by applying an equalisation (EQ) gain to the selected audio signal. For example in the situation shown in Figure 2, the chosen source is m2 205. This selection can be implemented, in some embodiments, in the manner as described in section 6.6.18.3.2.4 "Determine HOA source for signal interpolation" in ISO/IEC 23090-4 WD as is discussed in further detail below.
In the case where the listener moves such that the chosen HOA source for signal interpolation is no longer the closest HOA source to them, the signal interpolator 109 can be configured to implement a cross-fade process to smoothly transition to the new HOA source for the signal interpolation. During the cross-fade (which lasts for 12 frames), the signal interpolator 109 creates a cross-faded frequency domain signal from the previous closest HOA source and the new closest HOA source by picking frequency bands from the two signals. As the cross-fade progresses, more frequency bands are chosen from the new closest signal and fewer from the previous closest signal. The crossfading can be implemented, in some embodiments, using the example shown in section 6.6.18.3.2.5 "Crossfade" in ISO/IEC 23090-4 WD1 as discussed in further detail below.
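A sketch of such a band-picking crossfade is given below. The low-to-high picking order and the linear schedule are assumptions for illustration; the exact pattern is specified in ISO/IEC 23090-4 WD1:

    import numpy as np

    CROSSFADE_FRAMES = 12  # duration stated in the text

    def crossfade_bands(S_prev, S_new, frame_in_fade):
        """S_prev, S_new: (n_ch, n_bands) TF signals; frame_in_fade: 0..11."""
        n_bands = S_prev.shape[1]
        k = round(n_bands * (frame_in_fade + 1) / CROSSFADE_FRAMES)
        out = S_prev.copy()
        out[:, :k] = S_new[:, :k]  # k bands already taken from the new source
        return out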
To support the listener moving to a new triangle, the HOA sources comprising the new triangle are added to the list of active HOA sources, i.e. the list of HOA sources for which the STFT is run. Since, before moving to the new triangle, only the three HOA sources belonging to the previous triangle have been active for the STFT, and since the STFT used here takes 6 frames of input until the output is meaningful, the new HOA sources in the new triangle are not immediately ready for processing.
Thus, the system employs a delayed switch of the triangle. For the 6 frames, or for however long it takes the STFT to produce a meaningful output, the spatial metadata interpolator 111 is configured to perform interpolation using the HOA sources of the previous triangle. Once the output of the STFT represents the HOA sources of the new triangle, the processing resumes normal operation using the HOA sources of the new triangle. The HOA sources not part of the new triangle are dropped from the list of active HOA sources.
Thus as discussed above, for 6DoF HOA rendering, the spatial analyser 107 can be configured to calculate or determine spatial metadata for all the HOA sources required for spatial metadata interpolation, while the signal interpolator uses the best suited HOA source (e.g. the closest HOA source) for signal interpolation.
Generally, the signals are converted and processed within the frequency domain (based on STFT processing) and the processing is applied to a few selected channels for spatial metadata calculation. On the other hand, for signal interpolation, the maximum amount of information available is beneficial; consequently, STFT processing is performed for all channels of the HOA source. Furthermore, as the listener (or HOA source) moves, the best suited or determined HOA source that is currently being used for signal interpolation can change. This can result in a perceptible discontinuity or glitch and a consequent adverse impact on the subjective audio consumption experience. This is due to a processing or rendering involving a change from the previous signal interpolation HOA source (HOAprev-sigint) to a new signal interpolation HOA source (HOAcurrent-sigint). This requires the HOAprev-sigint to switch from the STFT calculation for all channels to the first order channels (first four channels) and the HOAcurrent-sigint to switch from the STFT calculation for the first order channels (first four channels) to the STFT calculation for all channels. The audible glitch or discontinuity occurs due to the delay in obtaining meaningful STFT results after the STFT processing is initiated. This delay is equal to a predefined number of audio frames.
The concept as discussed within the embodiments herein relates to 6DoF rendering of an audio scene comprising two or more HOA sources, where there is provided apparatus configured to implement a method for switching between HOA sources used for signal interpolation to achieve a seamless switch without audible glitches. In some embodiments this can be implemented by the following: initiate STFT processing of all channels for the new signal interpolation HOA source; initiate STFT processing of the first order channels for the previous signal interpolation HOA source; delay the switch until the STFT processing provides meaningful output for all channels of the new signal interpolation HOA source; and, after the switch, stop STFT processing of all channels (but continue the STFT processing of the first order channels) for the previous signal interpolation HOA source.
The method steps can in some embodiments be:
o Obtain the listener position.
o Obtain the active HOA sources based on the listener position.
o Obtain the previous HOA source used for signal interpolation (HOAprev-sigint), which was used for signal interpolation for the previous audio frame.
o Determine the current HOA source to be used for signal interpolation (HOAcurrent-sigint) from the active HOA sources.
o If the previous HOA source used for signal interpolation is different from the current HOA source used for signal interpolation:
* set the crossfade-in-progress flag to true and set the number of frames before the crossfade starts (e.g., the audio frames required for STFT priming);
* initiate the STFT calculation for the current HOA source used for signal interpolation for all available channels of the HOA source, while retaining the separate STFT calculation for the first four channels;
* initiate the STFT calculation for the first four channels of the previous HOA source used for signal interpolation, while retaining the separate STFT calculation for all available channels of the HOA source.
o If the number of audio frames before the crossfade starts is greater than zero, then delay the crossfade until the audio frame count before the crossfade is equal to zero.
o After the number of frames before the crossfade is zero, perform the crossfade between the previous HOA source used for signal interpolation and the current HOA source used for signal interpolation (a condensed sketch of this switching logic is given below).
In some embodiments the determining of a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination can be implemented by comparing the at least one current higher order ambisonics audio source with the at least one previous higher order ambisonics audio source associated with signal interpolation determination.
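The following condensed state machine illustrates these steps with the example values from the text (6 priming frames, 12 crossfade frames); the three hook callables stand in for per-source afSTFT state management and are not part of the specification:

    PRIMING_FRAMES, CROSSFADE_FRAMES = 6, 12

    class SigIntSwitcher:
        def __init__(self, current, start_full, start_foa, stop_full):
            self.current, self.previous = current, None
            self.delay = self.fade = 0
            self.start_full = start_full  # begin all-channel STFT for a source
            self.start_foa = start_foa    # begin first-order (4 ch) STFT
            self.stop_full = stop_full    # stop the all-channel STFT

        def on_frame(self, chosen):
            """Returns (source to interpolate from, source to crossfade in)."""
            if chosen != self.current and self.previous is None:
                self.previous, self.current = self.current, chosen
                self.delay, self.fade = PRIMING_FRAMES, CROSSFADE_FRAMES
                self.start_full(self.current)   # all channels; needs priming
                self.start_foa(self.previous)   # keep metadata path alive
            if self.previous is not None:
                if self.delay > 0:
                    self.delay -= 1             # STFT still priming: no switch
                    return self.previous, None
                if self.fade > 0:
                    self.fade -= 1              # crossfade previous -> current
                    return self.previous, self.current
                self.stop_full(self.previous)   # fade done: drop full order
                self.previous = None
            return self.current, None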
Additionally, in some embodiments, where it is determined based on the comparison that the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination are the same, then the processing continues.
STFT priming in some embodiments refers to the number of audio frames that need to be processed before the STFT processing delivers meaningful results.
In some embodiments, the switch from HOAprev-sigint to HOAcurrent-sigint is performed as follows:
* Initiate the STFT calculation for HOAprev-sigint for the first order channels or any other subset of channels.
* Continue the STFT calculation for HOAcurrent-sigint for the first order channels or any other subset of channels.
* Perform signal interpolation based on the first order or any other subset of lower order channels.
* Initiate the STFT calculation for HOAcurrent-sigint for all channels.
* Perform signal interpolation based on the STFT information from all channels after meaningful STFT results start appearing (e.g., the delay is defined in terms of a number of audio frames).
In some embodiments, during the transition from HOAprev-sigint to HOAcurrent-sigint, the signal interpolation is performed such that a candidate HOAcurrent-sigint is added to the active HOA sources list and the STFT calculation is initiated for all channels prior to it being determined as the HOAcurrent-sigint, by predicting listener movement based on past position information or any other suitable information.
In some further embodiments of the invention, signal interpolation is implemented during the transition from HOAprev-sigint to HOAcurrent-sigint at any order that is intermediate between all channels and the first order channels.
As discussed above, the apparatus implementing MPHOA processing such as shown in Figure 1 is configured to render to a listener an audio scene comprising microphone-array (higher-order Ambisonics) audio signals as inputs. The rendering can be implemented in some embodiments with the listener having 6DoF in their movement. That is, the listener is allowed to move around in the audio scene or environment and have a position which does not coincide with the position of a microphone. The apparatus in such embodiments renders to the listener a binaural (or other multichannel format) audio signal that sounds how the audio scene is expected to sound from the listener's position. This is not trivial since there is no direct recording of the audio scene at the listener's position and therefore the apparatus is configured to implement a method which infers what the scene would sound like at the listener position. Thus, the apparatus is configured to analyse (HOA) microphone signals near the listener position to estimate a (binaural or multichannel) microphone audio signal at the listener position.
As such the apparatus as shown in Figure 1 and described briefly above is further described herein with respect to some embodiments.
The pre-processor 103 as discussed above takes as inputs the positions of the HOA microphone arrays 104 and a set of HRIR filters 100. The audio scene can then be segmented into triangle sections T by performing Delaunay triangulation. An example is shown and has been described above with respect to the example audio scene shown in Figure 2. The triangulation can be used later in the processing to determine which HOA sources surround the listener and are to be used for generating the binaural signal at the listener position.
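The triangulation step can be reproduced, for example, with SciPy's Delaunay implementation; the microphone positions below are illustrative only:

    import numpy as np
    from scipy.spatial import Delaunay

    # x-y positions of the HOA microphones (example values)
    mic_xy = np.array([[0.0, 0.0], [2.0, 0.0], [4.0, 0.0], [2.0, 2.0]])
    tri = Delaunay(mic_xy)
    print(tri.simplices)            # index triples: the triangle sections T
    # tri.find_simplex(p) later returns the triangle containing a point p,
    # which is how the active triangle around the listener can be found.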
Furthermore the pre-processor 103 can be configured to sample the HRIR filters 100 at a uniform grid of directions and convert these into frequency domain HRTFs 110. This is performed as the MPHOA processing is implemented in the frequency domain. The pre-processor is configured to perform these operations during the initialization of the apparatus. Thus the pre-processor 103 is employed in some embodiments once for each audio scene. In some embodiments the position pre-processor 105 can be configured, for every frame of audio, to determine the active triangle TA 112 that is used for processing at the spatial analysis block. The active triangle TA 112 is the triangle from the available triangle sections T 108 which surrounds the listener (or in other words the triangle which the listener is located or positioned in).
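As a minimal illustrative sketch of this pre-processing step (in Python, with scipy used as a stand-in for the renderer's own triangulation; the microphone coordinates are invented for the example), the one-off Delaunay segmentation and the per-frame active-triangle lookup can be outlined as follows.

    import numpy as np
    from scipy.spatial import Delaunay

    # x-y positions of the HOA microphone arrays (illustrative values).
    mic_positions = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0], [6.0, 3.0]])

    # Pre-processor: segment the audio scene into triangle sections T (run once).
    tri = Delaunay(mic_positions)

    # Position pre-processor: for each audio frame, find the active triangle
    # T_A that surrounds the current listener position.
    listener_xy = np.array([2.5, 1.0])
    simplex = tri.find_simplex(listener_xy)
    if simplex >= 0:                              # -1 would mean outside the mesh
        active_sources = tri.simplices[simplex]   # indices of the 3 HOA sources
        print("active triangle sources:", active_sources)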
Furthermore, the position pre-processor 105 in some embodiments is configured to determine or select a "chosen" HOA source for signal interpolation. In some embodiments the "chosen" HOA source is the source that is determined to be closest to the listener position. The position pre-processor 105 furthermore in some embodiments is also configured to determine interpolation weights w_c(j,k) 128; these are weights referring to the HOA sources in the active triangle. The closer the HOA source is to the user, the higher the weighting factor. The weighting factors can in some embodiments be obtained by calculating the barycentric coordinates for the triangle by solving

T_xy w = p_L,xy

where p_L,xy = [p_x p_y 1]^T is the listener position on the x-y plane. The T_xy matrix contains the coordinates of the HOA sources in the active triangle on the x-y plane. The barycentric coordinates can then in some embodiments be used as the weighting factors.
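A worked sketch of this weight calculation (Python/numpy; the triangle and listener coordinates are invented for the example) solves the homogeneous system above directly.

    import numpy as np

    # Columns hold [x; y; 1] for each HOA source of the active triangle.
    T_xy = np.array([[0.0, 4.0, 2.0],
                     [0.0, 0.0, 3.0],
                     [1.0, 1.0, 1.0]])
    p_L = np.array([2.5, 1.0, 1.0])   # listener position on the x-y plane

    w = np.linalg.solve(T_xy, p_L)    # barycentric coordinates = interpolation weights
    print(w, w.sum())                 # non-negative inside the triangle; sum to 1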
In some embodiments, when there is no cross-fade in progress or the listener is not traversing to a new triangle, the chosen HOA source can be marked or indicated by setting a 'FullOrder' indicator value, with the rest of the HOA sources comprising the triangle that the listener is in being indicated by setting a 'FOA' indicator value. For example as shown in Figure 3, which shows the audio scene 201 such as also shown in Figure 2 but with the microphone m2 identified as the "FullOrder" microphone 305, and microphones m3 identified as a "FOA" 307 and m4 identified as a "FOA" 309 microphone.
With respect to the spatial analyser 107, this takes as inputs the input signals in Equivalent Spatial Domain (ESD) representation s_ESD(i,j) 102 and provides as outputs the spatial metadata θ(i,j,k,b) 116, φ(i,j,k,b) 118, r(i,j,k,b) 120 and e(i,j,k,b) 122, for the purposes of determining the interpolated spatial metadata θ̂(j,k,b) 134, φ̂(j,k,b) 136, r̂(j,k,b) 138 and ê(j,k,b) 132 at the listener position, as well as a time-frequency domain signal S(i,j,k,b) 114 to be used for signal interpolation.
In some embodiments the input signals in Equivalent Spatial Domain (ESD) representation s_ESD(i,j) 102 are first converted into higher-order Ambisonics (HOA) signals as follows:

s_HOA(i,j) = M_ESDtoHOA s_ESD(i,j)

where M_ESDtoHOA is an N_ch x N_ch ESD to HOA conversion matrix, i is the HOA source index and j is the frame index. The output HOA signals are then split into N_sf subframes of equal length:

[s_HOA(i,j,1) ... s_HOA(i,j,N_sf)] = s_HOA(i,j)

Time-frequency domain conversion is then applied for all active HOA sources i. For sources additionally marked as 'FOA', the conversion is performed for the first four channels only: s_HOA(i,j,k) → S(i,j,k), where S(i,j,k) is a 4 x N_b matrix containing the time-frequency domain signals of length N_b for the first four HOA channels.
For sources additionally marked as 'FullOrder', the conversion is performed for all channels. For both cases, the conversion can be performed using the function afSTFT: s_HOA(i,j,k) → S(i,j,k), where S(i,j,k) is an N_ch x N_b matrix containing the time-frequency domain signals of length N_b for each HOA channel. The afSTFT processing is run for each channel ch separately. Thus, the more channels there are to process, the more computationally heavy the processing is.
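As an illustrative sketch of this channel-count bookkeeping (Python, with scipy's generic STFT standing in for afSTFT, which is alias-free and not reproduced here), the cost difference between a 'FOA' and a 'FullOrder' conversion is simply the number of rows transformed.

    import numpy as np
    from scipy.signal import stft

    order = 3
    n_ch = (order + 1) ** 2               # 16 channels for 3rd-order HOA
    frame = np.random.randn(n_ch, 1024)   # one audio frame of one HOA source

    def to_tf(signal, n_channels):
        # Transform only the first n_channels rows; cost grows with the count.
        _, _, S = stft(signal[:n_channels], nperseg=256)
        return S

    S_full = to_tf(frame, n_ch)   # 'FullOrder' source: all channels
    S_foa = to_tf(frame, 4)       # 'FOA' source: first four channels only
    print(S_full.shape, S_foa.shape)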
For each frequency bin b of signal S(i,j,k) the spatial analyser 107 can be configured to calculate spatial metadata comprising direction, diffuseness and energy information. These are then passed on to the spatial metadata interpolator 111. The spatial metadata is calculated from a signal covariance matrix C_FOA, which is obtained from the signal as follows:

C_FOA(i,j,k,b) = s(i,j,k,b) s^H(i,j,k,b)

where

s(i,j,k,b) = [ S_b,1(i,j,k) * 1.0 ; S_b,2(i,j,k) * 0.5774 ; S_b,3(i,j,k) * 0.5774 ; S_b,4(i,j,k) * 0.5774 ]

and S_b,ch(i,j,k) is the value in matrix S(i,j,k) corresponding to channel ch and frequency bin b.
Spatial metadata is then calculated for each frequency bin of each active HOA source from the covariance matrix. This includes direction information, diffuseness information as well as energy:

[ θ(i,j,k,b) ; φ(i,j,k,b) ; r(i,j,k,b) ; e(i,j,k,b) ]

where θ(i,j,k,b) 116 is the azimuth, φ(i,j,k,b) 118 is the elevation, r(i,j,k,b) 120 is the direct-to-total energy ratio and e(i,j,k,b) 122 is the energy for HOA source i, for frame j (subframe k) and frequency bin b. These are obtained as follows. First an intensity vector is calculated from the covariance matrix:

i(i,j,k,b) = Re{ [ C_12(i,j,k,b) ; C_13(i,j,k,b) ; C_14(i,j,k,b) ] }

Then the energy:

e(i,j,k,b) = (1/2) * sum_{l=1..4} C_ll(i,j,k,b)

And the rest of the spatial metadata:

θ(i,j,k,b) = atan2( i_2(i,j,k,b), i_1(i,j,k,b) )
φ(i,j,k,b) = atan2( i_3(i,j,k,b), sqrt( i_1(i,j,k,b)^2 + i_2(i,j,k,b)^2 ) )
r(i,j,k,b) = ||i(i,j,k,b)|| / e(i,j,k,b)

The signal S(i_c,j,k), where i_c is the index of the chosen HOA source for signal interpolation, is passed on to the signal interpolation block, where a prototype binaural signal is calculated from it.
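As an illustrative, non-limiting sketch of these equations (Python/numpy, assuming a W, X, Y, Z channel ordering and the 0.5774 weighting above; the exact normalization used by the renderer may differ, so the ratio below does not reach exactly 1 even for a pure plane wave):

    import numpy as np

    def spatial_metadata(C):
        # C: 4x4 FOA covariance matrix for one time-frequency tile.
        i_vec = np.real(C[0, 1:4])                 # intensity from Re{C12, C13, C14}
        e = 0.5 * np.real(np.trace(C))             # energy
        azi = np.arctan2(i_vec[1], i_vec[0])       # azimuth
        ele = np.arctan2(i_vec[2], np.hypot(i_vec[0], i_vec[1]))       # elevation
        r = min(np.linalg.norm(i_vec) / max(e, 1e-12), 1.0)            # ratio
        return azi, ele, r, e

    # Example tile: a plane wave from azimuth 30 degrees, elevation 0.
    w = 0.5774
    s = np.array([1.0, w * np.cos(np.pi / 6), w * np.sin(np.pi / 6), 0.0])
    C = np.outer(s, s)
    azi, ele, r, e = spatial_metadata(C)
    print(np.degrees(azi), np.degrees(ele), r, e)  # ~30.0, 0.0, ~0.87, ~0.67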
The spatial metadata interpolator 111 is, as indicated earlier, configured to take the metadata related to the HOA sources of the active triangle (calculated by the spatial analyser 107) and create interpolated metadata, that is, metadata at the listener position. The spatial metadata interpolator 111 thus is configured to describe the sound field at the listener position (what it should sound like at the listener position, which frequencies are coming from which direction at which energy etc.).
The output of the spatial metadata interpolator 111 is interpolated metadata which is a weighted sum of the spatial metadata of the HOA sources of the active triangle. The weights for the weighted interpolation can be the weights w_c(j,k) 128 calculated in the position pre-processor 105.
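A minimal sketch of this weighted interpolation follows (Python/numpy; the values are invented, and the circular averaging of angles is an assumption about how directional metadata is combined, since a plain weighted sum of raw angles misbehaves at the ±180 degree wrap).

    import numpy as np

    w = np.array([0.21, 0.46, 0.33])        # weights for the 3 active sources
    e = np.array([0.9, 1.4, 1.1])           # per-source energy for one bin
    azi = np.radians([10.0, 40.0, -20.0])   # per-source azimuth for one bin

    e_interp = np.dot(w, e)                 # interpolated energy
    azi_interp = np.arctan2(np.dot(w, np.sin(azi)),
                            np.dot(w, np.cos(azi)))  # interpolated direction
    print(e_interp, np.degrees(azi_interp))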
The signal interpolator 109 is configured to take as an input the chosen HOA source frequency domain signal S(i_c,j,k), where i_c is the index of the chosen HOA source for signal interpolation, as part of the signals S(i,j,k,b) 114, and provides as output a prototype frequency domain signal Ŝ(j,k,b) 130. In summary, the prototype signal creation involves applying an EQ gain (based on the interpolated signal energy) on the signal, rotating it according to the listener head orientation and then multiplying it with an HOA to binaural transformation matrix. The EQ gain is calculated in some embodiments as follows:

G_eq(j,k,b) = min( ê(j,k,b) / e(m_c(j),j,k,b), G_eq,max )

where m_c(j) is the index of the chosen HOA source for frame j.
The interpolated signal is then calculated as follows:

Ŝ(j,k,b) = G_eq(j,k,b) S_b(m_c(j),j,k)

In some embodiments the listener moves such that the chosen source for signal interpolation is no longer the original closest source, and the apparatus is configured to implement a cross-fade process to smoothly transition to using the new source for the signal interpolation. During the cross-fade, which can be configured to last for a determined number of frames (for example 12 frames), the system creates a cross-faded frequency domain signal from the previous closest source and the new closest source by picking frequency bands from the two signals. As the cross-fade progresses, more frequency bands are chosen from the new closest signal and fewer from the previous closest signal.
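An illustrative sketch of these two operations follows (Python/numpy; the gain cap g_max, the bin count and the band schedule are invented for the example).

    import numpy as np

    def eq_gain(e_interp, e_chosen, g_max=4.0):
        # Gain toward the interpolated energy, capped at g_max.
        return np.minimum(e_interp / np.maximum(e_chosen, 1e-12), g_max)

    def crossfaded_bins(S_prev, S_new, fade_out_bands):
        # Pick each frequency bin from the old or the new source; the caller
        # shrinks fade_out_bands as the fade progresses, handing more bins over.
        out = S_new.copy()
        out[fade_out_bands] = S_prev[fade_out_bands]
        return out

    n_bins = 24
    S_prev = np.random.randn(n_bins) + 1j * np.random.randn(n_bins)
    S_new = np.random.randn(n_bins) + 1j * np.random.randn(n_bins)
    # Early in the fade most bins still come from the previous closest source.
    mixed = crossfaded_bins(S_prev, S_new, fade_out_bands=np.arange(18))
    print(mixed.shape, eq_gain(1.2, 0.8))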
For example in some embodiments as the cross-fade process starts, the new closest source is determined to be a 'FOA' source, which means that the signal processing and the frequency domain transform (STFT) have been implemented on the first four channels of the source only. However, for the cross-fade operation to perform properly, frequency domain transforms (STFT) for all orders of the new closest source need to be determined or calculated.
In some embodiments the new closest source is also marked or indicated as a 'FullOrder' source. So, during cross-fade, there are two 'FullOrder' indicated sources and a single 'FOA' source.
For example Figure 4 shows an example of the audio scene shown in Figures 2 and 3 wherein the listener position moves closer to the microphone m4 identified as a 'FOA' 309 microphone in Figure 3. The crossfade operation 421 thus then is configured with the microphone m2 identified as the first or current "FullOrder" microphone 305, m4 identified as the second or new "FullOrder" microphone 409 and microphone m3 identified as a "FOA" 307 microphone.
Since the frequency domain transform STFT requires a set number of time domain audio frames (for example 6 in afSTFT) as inputs before meaningful output values are generated, and since the STFT has not been run for all of the channels of the new closest source, the start of the cross-fade needs to be delayed until proper output for all channels of the new closest source is available. For this, when a cross-fade is determined to be needed, a counter is added which is incremented for each audio frame, and once it reaches the number of frames required for proper STFT output, the cross-fade is started. During the period of waiting for the cross-fade to start, the previous closest source is used for signal interpolation.
When the cross-fade is finished, normal operation resumes with the new closest HOA source being marked as 'FullOrder' and the other two HOA sources in the triangle that the listener is in being marked as 'FOA'. This can be shown by Figure 5, wherein the completion of the crossfade operation results in the microphone m2 identified as a "FOA" microphone 505, m4 identified as the "FullOrder" microphone 409 and microphone m3 identified as a "FOA" 307 microphone.
The mixer 113 is configured to receive or obtain as an input, as described above, the interpolated spatial metadata θ̂(j,k,b) 134, φ̂(j,k,b) 136, r̂(j,k,b) 138 and ê(j,k,b) 132 at the listener position as well as the prototype signal Ŝ(j,k,b) 130. Thus, at this stage there is a description of the sound field at the listener position (the interpolated spatial metadata) and a binaural signal that is an approximation of the desired output (the signal of the closest HOA source to the listener which has been EQ'd based on the interpolated signal energy). To get the final output the mixing stage creates a binaural signal from the interpolated signal such that it has the same characteristics as the interpolated metadata. For this, an optimal mixing algorithm is used which can be implemented such as that described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time-frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411, and summarised below.
First a prototype binaural signal B(j,b) is calculated from the interpolated signal:

B(j,b) = M_HOA2bin(b) * R_sh(j) * [ Ŝ_b,1(j,1) ... Ŝ_b,1(j,N_sf) ; ... ; Ŝ_b,N_ch(j,1) ... Ŝ_b,N_ch(j,N_sf) ]

where R_sh(j) is a spherical harmonics rotation matrix calculated according to the listener's head and source orientation and M_HOA2bin(b) is the Ambisonics to binaural matrix for frequency bin b.
From the prototype signal a covariance matrix is calculated:

C_x(j,b) = B(j,b) B^H(j,b)

A target covariance matrix is calculated from the interpolated metadata and the HRTFs calculated in the pre-processing step. The direct portion of the covariance matrix is:

C_y^direct(j,b) = sum_{k=1..N_sf} r̂(j,k,b) ê(j,k,b) H(b,d) H^H(b,d)

where H(b,d) is the HRTF for frequency bin b in the direction d indicated by the interpolated metadata. The diffuse portion is:

C_y^diffuse(j,b) = sum_{k=1..N_sf} (1 - r̂(j,k,b)) ê(j,k,b) C_dif(b)

where

C_dif(b) = (1/N_d) sum_{d=1..N_d} H(b,d) H^H(b,d)

And the final target covariance matrix:

C_y(j,b) = C_y^direct(j,b) + C_y^diffuse(j,b)

Mixing matrices are then obtained via the optimal mixing algorithm [MPHOA_optimal_mixing]. The output of the optimal mixing algorithm is mixing matrices (M(j,k,b) and M_r(j,k,b)) which, when applied to the binaural prototype signal, will produce an output binaural signal O(j,k,b) with a covariance matrix equal to C_y:

O(j,k,b) = M(j,k,b) * B(j-1,k,b) + M_r(j,k,b) * D(j,k,b)

where D(j,k,b) is a decorrelated time-frequency domain signal obtained from a buffer of previous binaural signals B. The output processor 115 is then configured to perform an inverse frequency domain transform (for example an inverse STFT) on the output frequency domain signal O(j,k,b) 142 to provide the final time domain output signal 144.
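By way of a non-limiting sketch of the covariance construction above (Python/numpy; the HRTF pair, ratios, energies and the diffuse-field covariance are invented for the example, and the optimal mixing solve of Vilkamo et al. itself is not reproduced):

    import numpy as np

    n_sf = 4
    B = np.random.randn(2, n_sf) + 1j * np.random.randn(2, n_sf)  # binaural prototype
    Cx = B @ B.conj().T                                           # prototype covariance

    H = np.array([0.8 + 0.1j, 0.5 - 0.2j])  # HRTF pair for the interpolated direction
    r = np.full(n_sf, 0.7)                  # interpolated direct-to-total ratios
    e = np.full(n_sf, 1.0)                  # interpolated energies
    C_dif = 0.5 * np.eye(2)                 # diffuse-field covariance (assumed)

    Cy_direct = sum(r[k] * e[k] * np.outer(H, H.conj()) for k in range(n_sf))
    Cy_diffuse = sum((1 - r[k]) * e[k] * C_dif for k in range(n_sf))
    Cy = Cy_direct + Cy_diffuse             # target covariance for the mixer
    print(np.allclose(Cy, Cy.conj().T))     # target covariance is Hermitian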
The computational complexity savings that can be achieved using this method are due to the amount of STFT processing needed for each frame. For example a conventional frequency domain transform (STFT) would be performed (assuming 3rd order HOA input signals, with 16 channels per source and three active sources) for 3 x 16 = 48 channels. In some embodiments the frequency domain processing (STFT) is performed for 4 + 4 + 16 = 24 channels, cutting the STFT computations to half. For fourth order HOA input signals, the savings are even greater (75 channels vs 33 channels).
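The channel-count arithmetic can be checked with a small sketch (Python; three active sources per triangle assumed, as in the examples above):

    def channels(order):
        return (order + 1) ** 2          # HOA channel count for a given order

    for order in (3, 4):
        conventional = 3 * channels(order)    # every source at full order
        reduced = 4 + 4 + channels(order)     # two 'FOA' sources + one 'FullOrder'
        print(order, conventional, reduced)   # 3: 48 vs 24; 4: 75 vs 33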
Furthermore in some embodiments during the cross-fade, instead of calculating FullOrder STFTs for the HOA sources between which the cross-fade is being performed, 'FOA' STFTs may be calculated. In these embodiments, during the cross-fade, signal interpolation is implemented on first order signals. This would cause a drop in quality for the duration of the cross-fade, but would have the advantage that there would be no need for the delayed start of the cross-fade. In an embodiment, determining the number of channels for which the STFT is run may be done as follows.
An alias-free short-time Fourier transform is in some embodiments employed to convert HOA format signals into the time-frequency domain. For each subframe k of audio frame j and for each HOA source i that is a part of a triangle found in the triangle track record (TRR, see clause Error! Reference source not found.) s_HOA(i,j,k), a time-frequency domain signal matrix S(i,j,k) is determined with the afSTFT forward transformation. For the HOA source that has been chosen for signal interpolation, afSTFT is run for all channels and S(i,j,k) is an N_ch x N_b matrix containing the time-frequency domain signals of length N_b for each HOA channel. For all other HOA sources found in the triangle track record, afSTFT is run for only the first four channels (FOA) and S(i,j,k) is a 4 x N_b matrix containing the time-frequency domain signals of length N_b for the first four channels. During cross-fade (6.6.18.3.2.5), for the HOA source that is the target of the cross-fade, afSTFT shall also be run for all channels.
Furthermore in some embodiments if the chosen HOA source is changed, a crossfade shall be started. For this, fade-in and fade-out weights (w_fi(j,k) and w_fo(j,k)) and fade-out bands B_fo(j) shall be calculated. The fade-in weights, fade-out weights and fade-out bands are used in metadata and signal interpolation during cross-fade (see sections Error! Reference source not found. and Error! Reference source not found.).
check_crossfade_start() {
    if (crossFadeInProgress) {
        getfade_out_bands()
    } else if (chosenSourceChanged(j)) {        /* chosen HOA source differs from frame j-1 */
        if (isTriangleSwitched) {
            for (k = 1; k <= Nsf; k++) {
                wfo(j, k) = wc(j - 1, k)
                wfi(j, k) = wc(j, k)
            }
        } else {
            for (k = 1; k <= Nsf; k++) {
                wfo(j, k) = wc(j, k)
                wfi(j, k) = wc(j, k)
            }
        }
        crossFadeInProgress = True
        framesUntilCrossFadeStart = 6
        getfade_out_bands()
    }
}

getfade_out_bands() is used to obtain the list of fade-out bands Bfo. The size of the list is decreased as the cross-fade progresses.

getfade_out_bands() {
    cfBandLow  = {0, 2, 3, 4, 5, 6, 8, 10, 12, 14, 17, 20, 23,
                  27, 32, 37, 43, 50, 57, 66, 76, 88, 101, 116}
    cfBandHigh = {1, 2, 3, 4, 5, 7, 9, 11, 13, 16, 19, 22, 26,
                  31, 36, 42, 49, 56, 65, 75, 87, 100, 115, 132}
    cfBandSwapOrder = {7, 1, 2, 22, 21, 12, 18, 0, 5, 9, 8, 14, 13,
                       23, 15, 20, 19, 4, 16, 17, 10, 11, 3, 6}

    if (framesUntilCrossFadeStart > 0) {
        framesUntilCrossFadeStart--             /* start of fade is delayed (STFT priming) */
    } else {
        ++crossFadeProgressIndx                 /* hand one more band group over */
    }
    for (cfIdx = crossFadeProgressIndx; cfIdx < cfLen; ++cfIdx) {
        bandIdx = cfBandSwapOrder[cfIdx]
        for (band = cfBandLow[bandIdx]; band <= cfBandHigh[bandIdx]; ++band) {
            Bfo(j) <- band                      /* bin still taken from the old source */
        }
    }
}

Crossfade is stopped when crossFadeProgressIndx == cfLen, by setting crossFadeInProgress = False.
With respect to Figure 6 is shown a flow diagram of the method steps for some embodiments. For example, first obtain the listener position as shown in Figure 6 by 601. The renderer receives this information from the listener position and orientation interface in terms of the audio scene coordinates.
Then obtain the active HOA sources based on the listener position as shown in Figure 6 by 603. This can be evaluated regularly, in other words for every scene state update. The HOA sources forming the triangle within which the listener is located are classified as active.
Furthermore obtain the previous HOA source used for signal interpolation as well as spatial metadata calculation during the previous scene state update or audio frame as shown in Figure 6 by 605.
Having performed the above, then determine the HOA source used for signal interpolation as well as spatial metadata calculation based on proximity of the HOA source to the listener as shown in Figure 6 by 607. For example, the HOA source closest to the listener is selected since it has the most representative audio for the listener position.
If the previous signal interpolation HOA source is different from the HOA source determined for signal interpolation in the current audio frame period or scene state update, in other words there is a change in the signal interpolation HOA source as shown in Figure 6 by 609: then initiate STFT calculation for the new signal interpolation HOA source (HOAcurrent-sigint) for all channels of the HOA source while continuing the separate STFT calculation for the first order channels of the HOA source as shown in Figure 6 by 621; and initiate a separate STFT calculation for the first order channels of the previous signal interpolation HOA source as shown in Figure 6 by 623. Continue the dual STFT calculation for the previous signal interpolation HOA source and the new signal interpolation HOA source as shown in Figure 6 by 611.
Furthermore there is a delay of the switch of the signal interpolation HOA source until meaningful output from the new STFT processing is available as shown in Figure 6 by 613.
Then perform the cross fade between the previous signal interpolation HOA source and the new signal interpolation HOA source as shown in Figure 6 by 615. Finally cease the separate first four channel STFT calculation for the previous and new signal interpolation HOA sources (HOAPrev-Sigint and HOACurrent-Sigint) after the STFT priming is ready, after the predetermined number of audio frames, as shown in Figure 6 by 617.
An example system employing some embodiments is shown in Figure 7. The figure illustrates an end to end system overview for an audio scene comprising multiple HOA sources, which is rendered according to the invention.
This invention modifies the rendering happening at the MPEG-I Audio Renderer which is located in the playback device. The renderer receives the scene description and audio bitstreams and performs rendering accordingly. The MPHOA processing described in Figure 1 and presented in this invention is performed in the MPEG-I Audio Renderer whenever the scene comprises multiple HOA sources. With respect to Figure 7 is shown schematically an example system within which embodiments are implemented.
The system can comprise a content creator 701 which can be implemented on any suitable computer or processing device. The content creator 701 comprises an (MPEG-I) encoder 711 which is configured to receive the audio scene description 700 and the audio signals or data 702. The audio scene description 700 can be provided in the MPEG-I Encoder Input Format (EIF) or in another suitable format. Generally, the audio scene description contains an acoustically relevant description of the contents of the audio scene, and contains, for example, the scene geometry as a mesh or voxel, acoustic materials, acoustic environments with reverberation parameters, positions of sound sources, and other audio element related parameters such as whether reverberation is to be rendered for an audio element or not. The MPEG-I encoder 711 is configured to output encoded data 712. The content creator 701 furthermore in some embodiments comprises a bitstream encoder 713 which is configured to receive the output 712 of the MPEG-I encoder 711 and the encoded audio signals from the MPEG-H encoder 619 and generate the bitstream 714 which can be passed to the bitstream decoder 631. The bitstream 714 in some embodiments can be streamed to end-user devices or made available for download or stored.
Additionally the system comprises a server configured to obtain the bitstream 714, store it and supply it to the player 705. In some embodiments this is implemented by a streaming server 721 which is configured to supply the audio data 722 and MPEG-I audio 6DoF metadata bitstream 724.
The relevant bitstream 724 and audio data 722 are retrieved by the player 705. In some embodiments other implementation options are feasible, such as broadcast or multicast.
The player 705 in some embodiments comprises a playback device 731 configured to obtain or receive the audio data 722 and MPEG-I audio 6DoF metadata bitstream 724, and furthermore can be configured to receive or otherwise obtain the 6DoF tracking information (listener orientation or position information) 734 from a suitable listener user interface, for example from the head mounted device (HMD) 741. These can for example be generated by sensors within the HMD 741 or from sensors in the environment sensing the orientation or position of the listener.
In some embodiments the playback device 731 comprises a bitstream parser 733 configured to obtain the encoded metadata bitstream 724 and decode it in an opposite or inverse operation to the bitstream encoder 713 and MPEG-I encoder 711 to generate audio scene description information 732 which can be passed to an MPEG-I audio renderer 735.
In some embodiments the playback device 731 comprises the MPEG-I audio renderer 735 configured to implement the rendering operations as described above and generate audio output signals which can be output to the head mounted device 741.
The playback device 731 can be implemented in different form factors depending on the application. In some embodiments the playback device is equipped with its own listener position tracking apparatus or receives the listener position information from an external apparatus. The playback device can in some embodiments also be equipped with a headphone connector to deliver the output of the rendered binaural audio to the headphones.
With respect to Figure 8 an example electronic device which may be used as the computer, encoder processor, decoder processor or any of the functional blocks described herein is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1600 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1600 comprises at least one processor or central processing unit 1607. The processor 1607 can be configured to execute various program codes such as the methods described herein.
In some embodiments the device 1600 comprises a memory 1611. In some embodiments the at least one processor 1607 is coupled to the memory 1611. The memory 1611 can be any suitable storage means. In some embodiments the memory 1611 comprises a program code section for storing program codes implementable upon the processor 1607. Furthermore in some embodiments the memory 1611 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1607 whenever needed via the memory-processor coupling.
In some embodiments the device 1600 comprises a user interface 1605. The user interface 1605 can be coupled in some embodiments to the processor 1607.
In some embodiments the processor 1607 can control the operation of the user interface 1605 and receive inputs from the user interface 1605. In some embodiments the user interface 1605 can enable a user to input commands to the device 1600, for example via a keypad. In some embodiments the user interface 1605 can enable the user to obtain information from the device 1600. For example the user interface 1605 may comprise a display configured to display information from the device 1600 to the user. The user interface 1605 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1600 and further displaying information to the user of the device 1600.
In some embodiments the device 1600 comprises an input/output port 1609.
The input/output port 1609 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1607 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
The transceiver input/output port 1609 may be configured to transmit/receive the audio signals, the bitstream and in some embodiments perform the operations and methods as described above by using the processor 1607 executing suitable code.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media, and optical media.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims (12)

  1. CLAIMS: 1. A method for generating a spatialized audio output comprising: obtaining at least two higher order ambisonics audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment, wherein the listener position is free to move within the audio environment; determining at least one source as an active source; determining at least one current higher order ambisonics audio source from the at least one determined active source based on a current listener position; determining at least one previous higher order ambisonics audio source associated with signal interpolation determination; performing a processing on all channel signals of the at least one current higher order ambisonics audio source; performing a processing on at least one channel signal of others of the at least one determined active source; determining a difference between the at least one current higher order ambisonics audio source and the at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on all channel signals of the at least one previous active higher order ambisonics audio source; processing on at least one channel signal of at least one previous higher order ambisonics audio source associated with signal interpolation determination; continuing a processing on at least one channel signal of the others of the at least one determined active source; processing on all channel signals of at least one of the current higher order ambisonics audio source; performing a signal interpolation based on: the continued processing on the channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with signal interpolation for a first time period after the determining the difference; crossfading between the continued processing on the all channel signals of the at least one of the at least one previous higher order ambisonics audio source associated with the signal interpolation and the processing on all channel signals of at least one of the determined at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources after the first time period after the determining the difference; and stopping the continuing processing on all channel signals of the at least one previous active higher order ambisonics audio source after the end of the crossfading; and generating the spatialized audio output based on the determined at least one signal interpolation.
  2. 2. The method as claimed in claim 1, wherein the processing is a short time Fourier transform.
  3. 3. The method of any of claims 1 or 2, wherein crossfading is for a second time period.
  4. 4. The method of claim 2 or 3 when dependent on claim 2, wherein the first time period is a first number of time frames, the first number of time frames based on a short time Fourier transform priming delay, and the second time period is a second defined number of processing frames.
  5. 5. The method of claim 4, wherein the first number of time frames is 6 frames and the second number of time frames is 12 frames.
  6. 6. The method of any of claims 1 to 5, further comprising determining spatial metadata by: analysing the processing on all channel signals of the at least one current higher order ambisonics audio source and at least one channel signal of others of the at least one determined active source; and analysing the continued processing on at least one channel signal of the at least one previous active higher order ambisonics audio source.
  7. 7. The method of claim 6, wherein generating the spatialized audio output is further based on the determined spatial metadata.
  8. 8. The method as claimed in any of claims 1 to 7, wherein the at least one respective channel signals of the determined at least one previous higher order ambisonics audio source is a subset of the channels.
  9. 9. The method as claimed in any of claims 1 to 8, wherein the at least one channel signal of the determined at least one previous higher order ambisonics audio source is a first four channel signals of the determined at least one previous higher order ambisonics audio source.
  10. 10. The method as claimed in any of claims 1 to 8, wherein determining at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position comprises: determining an area within which the current listener position is located, the area defined by vertex positions of at least three higher order ambisonics audio sources; and selecting the at least one current higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one current higher order ambisonics audio sources being those whose positions define the area vertices.
  11. 11. An apparatus comprising means for performing the method of any of claims 1 to 10.
  12. 12. A computer program comprising instructions, which, when executed by an apparatus, cause the apparatus to perform the method of any of claims 1 to 10.
GB2300299.1A 2023-01-09 2023-01-09 6DOF rendering of microphone-array captured audio Pending GB2626042A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB2300299.1A GB2626042A (en) 2023-01-09 2023-01-09 6DOF rendering of microphone-array captured audio
PCT/EP2023/086101 WO2024149567A1 (en) 2023-01-09 2023-12-15 6dof rendering of microphone-array captured audio

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2300299.1A GB2626042A (en) 2023-01-09 2023-01-09 6DOF rendering of microphone-array captured audio

Publications (1)

Publication Number Publication Date
GB2626042A true GB2626042A (en) 2024-07-10

Family

ID=89452606

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2300299.1A Pending GB2626042A (en) 2023-01-09 2023-01-09 6DOF rendering of microphone-array captured audio

Country Status (2)

Country Link
GB (1) GB2626042A (en)
WO (1) WO2024149567A1 (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090046864A1 (en) * 2007-03-01 2009-02-19 Genaudio, Inc. Audio spatialization and environment simulation
US20180288558A1 (en) * 2017-03-31 2018-10-04 OrbViu Inc. Methods and systems for generating view adaptive spatial audio

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11140503B2 (en) * 2019-07-03 2021-10-05 Qualcomm Incorporated Timer-based access for audio streaming and rendering
US11743670B2 (en) * 2020-12-18 2023-08-29 Qualcomm Incorporated Correlation-based rendering with multiple distributed streams accounting for an occlusion for six degree of freedom applications


Also Published As

Publication number Publication date
WO2024149567A1 (en) 2024-07-18

Similar Documents

Publication Publication Date Title
KR102622714B1 (en) Ambisonic depth extraction
US20230100071A1 (en) Rendering reverberation
KR20190028706A (en) Distance panning using near / far rendering
WO2017182714A1 (en) Merging audio signals with spatial metadata
KR20230151049A (en) Method and system for handling local transitions between listening positions in a virtual reality environment
JP2023515968A (en) Audio rendering with spatial metadata interpolation
US11350213B2 (en) Spatial audio capture
EP3643084A1 (en) Audio distance estimation for spatial audio processing
US20240171927A1 (en) Interactive Audio Rendering of a Spatial Stream
TW202105164A (en) Audio rendering for low frequency effects
CN114009065A (en) Sound field dependent rendering
US20240129683A1 (en) Associated Spatial Audio Playback
US11302339B2 (en) Spatial sound reproduction using multichannel loudspeaker systems
US20240196159A1 (en) Rendering Reverberation
KR20230121007A (en) Method and apparatus for spatial audio reproduction using directional room impulse responses interpolation
GB2626042A (en) 6DOF rendering of microphone-array captured audio
GB2627178A (en) A method and apparatus for complexity reduction in 6DOF rendering
WO2024149548A1 (en) A method and apparatus for complexity reduction in 6dof rendering
US20230143857A1 (en) Spatial Audio Reproduction by Positioning at Least Part of a Sound Field
CN114128312B (en) Audio rendering for low frequency effects
US20230179947A1 (en) Adjustment of Reverberator Based on Source Directivity
RU2803638C2 (en) Processing of spatially diffuse or large sound objects
WO2023118078A1 (en) Multi channel audio processing for upmixing/remixing/downmixing applications
KR20210069910A (en) Audio data transmitting method, audio data reproducing method, audio data transmitting device and audio data reproducing device for optimization of rendering