WO2024149557A1

WO2024149557A1 - A method and apparatus for complexity reduction in 6dof audio rendering

Info

Publication number: WO2024149557A1
Application number: PCT/EP2023/085601
Authority: WO
Inventors: Jussi Artturi LEPPÄNEN; Sujeet Shyamsundar Mate; Mikko-Ville Laitinen; Lauros PAJUNEN
Original assignee: Nokia Technologies Oy
Priority date: 2023-01-09
Filing date: 2023-12-13
Publication date: 2024-07-18

Abstract

A method for generating a spatialized audio output comprising: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation.

Description

A METHOD AND APPARATUS FOR COMPLEXITY REDUCTION IN 6DOF RENDERING Field The present application relates to apparatus and methods for complexity reduction in 6 degrees of freedom rendering and not specifically with 6 degree of freedom systems of microphone-array captured audio. Background Spatial audio capture approaches attempt to capture an audio environment or audio scene such that the audio environment or audio scene can be perceptually recreated to a listener in an effective manner and furthermore may permit a listener to move and/or rotate within the recreated audio environment. For spatial audio capture and recording spatial sound linearly at one position at the recording space, a high-end microphone array is needed. One such microphone is the spherical 32- microphone Eigenmike. From the high-end microphone array higher-order Ambisonics (HOA) signals can be obtained and used for rendering. With the HOA audio signals, the spatial audio can be rendered so that sounds arriving from different directions are satisfactorily separated in a reasonable auditory bandwidth. In some systems multiple microphone locations enable a multi-point HOA (MPHOA) capture system where there are multiple HOA audio signals at locations within an audio scene. In some embodiments even basic microphone array for up to first order ambisonics may also be used for recording the the audio scene. In some other embodiments, the audio scene may comprise two or more synthetic FOA or HOA sources. Audio rendering, where the captured audio signals are presented to a listener can be part of a virtual reality (VR) or augmented reality (AR) system. The audio rendering furthermore can be performed as part of a VR or AR where the listener can freely move within the environment or audio scene and rotate their head, which is known as a 6 degrees of freedom (6DoF) configuration. Furthmore the audio rendering can be Multi-Point HOA (MPHOA) audio rendering where the audio scene comprises multiple HOA audio signals recordings which are rendered to a user in a 6DoF manner. That is, the user is able to listen to the recorded scene from positions that may be other than the positions of the recorded HOA sources. Summary There is provided according to a first aspect a method for generating a spatialized audio output comprising: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. Determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source may comprise: performing a Short-time Fourier transform on the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate time-frequency representations of the at least one respective channel signals of determined at least one higher order active ambisonics audio source; analysing the time-frequency representations of the at least one respective channel signals of determined at least one higher order active ambisonics audio source to generate the spatial metadata. Performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position may comprise: performing a Short-time Fourier transform on the channel signals of at least one of the determined at least two higher order ambisonics audio sources to generate time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources; signal processing the time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be all of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a subset of the channels. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a first four channel signals of the determined at least one active higher order ambisonics audio source. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be a greater number of channels than the at least one respective channel signals of the determined at least one active higher order ambisonics audio source. Determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may comprise: determining an area within which the listener position is located, the area defined by vertice positions of at least three higher order ambisonics audio sources; and selecting the at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one active audio source being those whose positions define the the area vertices. The at least one of the determined at least two higher order ambisonics audio sources may be at least one of the determined at least one active higher order ambisonics audio sources. According to a second aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising means configured to: obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtain a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation. The means configured to determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be further configured to: perform a Short-time Fourier transform on the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source; and analyse the time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate the spatial metadata. The means configured to perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position may be configured to: perform a Short-time Fourier transform on the channel signals of at least one of the determined at least two higher order ambisonics audio sources to generate time- frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources; signal process the time- frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be all of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a subset of the channels. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a first four channel signals of the determined at least one active higher order ambisonics audio source. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be a greater number of channels than the at least one respective channel signals of the determined at least one active higher order ambisonics audio source. The means configured to determine at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may be configured to: determine an area within which the listener position is located, the area defined by vertice positions of at least three higher order ambisonics audio sources; and select the at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one active audio source being those whose positions define the the area vertices. The at least one of the determined at least two higher order ambisonics audio sources may be at least one of the determined at least one active higher order ambisonics audio sources According to a third aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the system at least to perform: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. The apparatus caused to perform determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be caused to perform: performing a Short-time Fourier transform on the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source; analysing the time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate the spatial metadata. The apparatus caused to perform performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position may be caused to perform: performing a Short-time Fourier transform on the channel signals of at least one of the determined at least two higher order ambisonics audio sources to generate time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources; signal processing the time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be all of the channel signals of at least one of the determined at least two higher order ambisonics audio sources. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a subset of the channels. The at least one respective channel signals of the determined at least one active higher order ambisonics audio source may be a first four channel signals of the determined at least one active higher order ambisonics audio source. The channel signals of at least one of the determined at least two higher order ambisonics audio sources may be a greater number of channels than the at least one respective channel signals of the determined at least one active higher order ambisonics audio source. The apparatus caused to perform determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position may be caused to perform: determining an area within which the listener position is located, the area defined by vertice positions of at least three higher order ambisonics audio sources; and selecting the at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one active audio source being those whose positions define the the area vertices. The at least one of the determined at least two higher order ambisonics audio sources may be at least one of the determined at least one active higher order ambisonics audio sources. According to a fourth aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising: obtaining circuitry configured to obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining circuitry configured to obtain a listener position within the audio environment; determining circuitry configured to determine at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining circuitry conifgured to determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing circuitry configured to perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating circuitry configured to generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a fifth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for generating a spatialized audio output, the apparatus caused to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a sixth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a seventh aspect there is provided an apparatus, for generating a spatialized audio output, comprising: means for obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; means for obtaining a listener position within the audio environment; means for determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; means for determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; means for performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and means for generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to an eighth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a ninth aspect there is provided a method for generating a spatialized audio output comprising: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a tenth aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising means configured to: obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtain a listener position within the audio environment; determine at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determine at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; process at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; process more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determine spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determine signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to an eleventh aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the system at least to perform: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a twelfth aspect there is provided an apparatus for generating a spatialized audio output, the apparatus comprising: obtaining circuitry configured to obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining circuitry conifugred to obtain a listener position within the audio environment; determining circuitry conifugred to determine at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining circuitry confiugrd to determine at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing circuitry configured to process at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing circuitry conifugred to process more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining circuitry configured to determine signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating circuitry conifugred to generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation According to a thirteenth aspect there is provided a computer program comprising instructions [or a computer readable medium comprising instructions] for causing an apparatus, for generating a spatialized audio output, the apparatus caused to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a fourteenth aspect there is provided a non-transitory computer readable medium comprising program instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to a fifteenth aspect there is provided an apparatus, for generating a spatialized audio output, comprising: means for obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; means for obtaining a listener position within the audio environment; means for determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; means for determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; means for processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; means for processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; means for determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; means for determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and means for generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. According to an eighth aspect there is provided a computer readable medium comprising instructions for causing an apparatus, for generating a spatialized audio output, to perform at least the following: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least two active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining at least one signal higher order ambisonics audio source from the at least two active higher order ambisonics audio source further based on the listener position; processing at least a first number of respective channel signals of the determined at least two active higher order ambisonics audio source; processing more than the first number of channel signals of the determined at least one signal higher order ambisonics audio source; determining spatial metadata based on the processed first number of respective channel signals of the determined at least one active higher order ambisonics audio source; determining signal interpolation based on the processed channel signals of the determined at least one signal higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation. An apparatus comprising means for performing the actions of the method as described above. An apparatus configured to perform the actions of the method as described above. A computer program comprising program instructions for causing a computer to perform the method as described above. A computer program product stored on a medium may cause an apparatus to perform the method as described herein. An electronic device may comprise apparatus as described herein. A chipset may comprise apparatus as described herein. Embodiments of the present application aim to address problems associated with the state of the art. Summary of the Figures For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which: Figure 1 shows schematically a system of apparatus showing the audio rendering or reproduction of an example audio scene and within which a user can move within the audio scene according to some embodiments; Figure 2 shows schematically an example audio scene comprising reproduction of an audio scene where a user moves within an area determined by higher order ambisonic audio signal sources; Figure 3 shows schematically the example audio scene as shown in Figure 2 and whierein sources can be identified as higher order ambisonic audio sources or full order ambisonic audio sources; Figure 4 shows schematically the example audio scene as shown in Figures 2 or 3, wherein the listener moves and a cross-fade implemented; Figure 5 shows schematically the example audio scene as shown in Figures 4, following the cross-fade; Figure 6 shows apparatus suitable for implementing some embodiments wherein a capture apparatus can be separate from the rendering apparatus elements. Embodiments of the Application The concept as discussed herein in further detail with respect to the following embodiments is related to the rendering of audio scenes wherein the audio scene was captured based on a linear or parametric spatial audio methods with two or more microphone-arrays corresponding to different positions at the recording space (or in other words with audio signal sets which are captured at respective signal set positions in the recording space). Furthermore the concept is related to attempting to lower computational complexity of the spatial analysis required for MPHOA processing. As discussed above 6DoF is presently commonplace in virtual reality, such as VR games, where movement at the audio scene is straightforward to render as all spatial information is readily available (i.e., the position of each sound source as well as the audio signal of each source separately). In the following examples the audio signal sets are generated by microphones (or microphone-arrays). For example a microphone arrangement may comprise one or more microphones and generate for the audio signal set one or more audio signals. In some embodiments the audio signal set comprises audio signals which are virtual or generated audio signals (for example a virtual speaker audio signal with an associated virtual speaker location). In some embodiments the microphone-arrays are furthermore separate from or physically located away from any processing apparatus, however this does not preclude examples where the microphones are located on the processing apparatus or are physically connected to the processing apparatus. With respect to Figure 1 an example apparatus is shown which can be configured to implement MPHOA processing according to some embodiments. In some embodiments the apparatus is part of a suitable MPEG-I Audio reference audio renderer. In some embodiments the apparatus 101 comprises a pre-processor 103. The pre-processor 103 is configured to receive the head related impulse responses (HRIRs) 100 and the Higher order ambisonic microphone positions ^_^104 and generate head related transfer functions (HRTFs) 110 and the determined microphone triangles ^ 108. In some embodiments the apparatus 101 comprises a position pre- processor 105 configured to receive the position of the listener ^_^ 104, the position of the Higher order ambisonic microphone positions ^_^(^) 106 and ^ 108 and from these generate ^_^(^) 112 - the active triangle for frame j, ^_^(^, ^) 128 – the chosen interpolation weights for subframe k of frame j, ^_^(^) 126 – the chosen HOA source for frame j. In some embodiments the apparatus 101 comprises a spatial analyser 107 configured to receive ^_^^^

102, the input time domain HOA signals in Equivalent Spatial Domain representation, from the inputs and ^_^(^) 112, the active triangle for frame j, from the position pre-processor 105. From these the spatial analyser 107 is configured to generate metadata ^(^, ^, ^, ^) 116, the azimuth for HOA source i, frame j, subframe k and frequency bin b, ^(^, ^, ^, ^) 118 elevation for HOA source i, frame j, subframe k and frequency bin b ^(^, ^, ^, ^) 120 direct-to- total energy ratio for HOA source i, frame j, subframe k and frequency bin b ^(^, ^, ^, ^) 122, energy for HOA source i, frame j, subframe k and frequency bin b and signals ^(^, ^, ^, ^) 114. In some embodiments the apparatus 101 comprises a spatial metadata interpolator 111 configured to receive ^(^, ^, ^, ^) 116 ^(^, ^, ^, ^) 118 ^(^, ^, ^, ^) 120 ^(^, ^, ^, ^) 122 from the spatial analyser 107, and ^_^(^, ^) from the position pre- processor 105 and from these generate interpolated metadata ^^{^}(^, ^, ^) 134 ^^(^, ^, ^) 136 ^̂(^, ^, ^) 138 ^̂(^, ^, ^) 132. In some embodiments the apparatus 101 comprises a signal interpolator 109 configured to receive ^_^ ⁽^⁾ 126 from the position pre-processor 105, ^(^, ^, ^, ^) 114 and ^(^, ^, ^, ^) 122 from the spatial analyser 107 and ^̂(^, ^, ^) 132 from the spatial metadata interpolator 111 from these generate interpolated signal ^^{^}(^, ^, ^) 130. In some embodiments the apparatus 101 comprises a mixer 113 configured to receive interpolated signal ^^{^}(^, ^, ^) 130 from the signal interpolator 109, and interpolated metadata ^^{^}(^, ^, ^) 134 ^^(^, ^, ^) 136 ^̂(^, ^, ^) 138 ^̂(^, ^, ^) 132 from the spatial metadata interpolator 111. From these the mixer generates output audio 142. In some embodiments the apparatus 101 comprises an output processor 115 configured to receive output audio O(^, ^, ^) 142, an output time-frequency domain signal (binaural), from the mixer 113 and generates output audio signals ^_^^^(^) 144, an output time domain audio signal (binaural). The operation of the spatial analyser 107 which as described above is configured to receive ^_^^^

102 from the inputs and ^_^(^) 112 from the position pre-processor 105 and generate metadata ^(^, ^, ^, ^) 116 ^(^, ^, ^, ^) 118 ^(^, ^, ^, ^) 120 ^(^, ^, ^, ^) 122 and signals ^(^, ^, ^, ^) 114. Is described in further detail here. The spatial analyser 107 and the renderer is further described with respect to GB2007710.8 and EP21201766.9 as well as the MPEG-I Immersive Audio standard working draft (ISO/IEC 23090-4 WD), Section 6.6.18. The spatial analysis block takes as input, the audio input signals in Equivalent Spatial Domain (ESD) representation ^_^^^

102 and provides as output spatial metadata for the purposes of determining interpolated spatial metadata at the listener position (^, ^, ^, ^) 116 ^(^, ^, ^, ^) 118 ^(^, ^, ^, ^) 120 ^(^, ^, ^, ^) 122 as well as a time-frequency domain signal ^(^, ^, ^, ^) 114 to be used for signal interpolation. The signals are first converted into higher-order Ambisonics (HOA) signals as follows: ^_^^^(^, ^) = ^_^^^^^^^^^_^^^(^, ^), where ^_^^^^^^^^ is a ^_^^ × ^_^^ ESD to HOA conversion matrix, ^ is the HOA Source index and ^ is the frame index. The output HOA signals are then split into ^_^^ subframes of equal length:

… ^_^^^(^, ^, ^_^^)_{] = ^^^^(^, ^)} Time-frequency domain conversion is then applied for all active HOA sources ^. The conversion can be performed using a suitable function such as the afSTFT function which is found for example in https://github.com/jvilkamo/afSTFT.

where ^⁽^, ^, ^⁾ is a ^_^^ × ^_^ matrix containing the time-frequency domain signals of length ^_^ for each HOA channel. The afSTFT conversion is run for each channel ^ℎ separately. Thus, the more channels there are to process, the more computationally heavy the processing is. For each frequency bin ^ of signal ^⁽^, ^, ^⁾ the spatial analysis block calculates spatial metadata comprising direction, diffuseness and energy information. These are then passed on to the spatial metadata interpolator 111 which can be configured to implement interpolation in a manner similar to that described in section 6.6.18.3.4.1 “Metadata interpolation” in ISO/IEC 23090-4 WD1 as explained in further detail below. The spatial metadata furthermore can be calculated from a signal covariance matrix ^_^^^, which is obtained from the signal as follows: ^_^^^ ⁽^, ^, ^, ^⁾ = ^⁽^, ^, ^, ^⁾^^{^(}^, ^, ^, ^⁾, where:

where ^_^,^^ ⁽^, ^, ^⁾ is the value in matrix ^⁽^, ^, ^⁾ corresponding to channel ^ℎ and frequency bin ^. The signal ^⁽^_^ , ^, ^⁾, where ^_^ is the index of the chosen HOA source for signal interpolation is passed on to the signal interpolation block, where a prototype binaural signal is calculated from it. In some embodiments, the prototype signal creation involves applying an EQ gain on the signal, rotating it according to listener head orientation and then multiplying it with an HOA to binaural transformation matrix. This, for example, can be implemented in a form similar to that described in Section 6.6.18.3.4.2 “Signal interpolation” in ISO/IEC 23090-4 WD1 as is further described below. In some embodiments, in order to reduce computation complexity issues, the spatial analysis (and also the signal handling described above) can be implemented for the sources for which it is needed (i.e active sources). The determination of which HOA sources are active, can be performed based on the listener’s position in the scene with respect to the HOA sources in the scene and a triangulation that has been performed on the HOA source positions in a pre- processing phase. For example in some embodiments rhe active sources are the sources comprising the triangle within which the listener is in. An example scene 201 is shown in Figure 2. The audio scene 201 comprises microphones

203, ^_^ 205, ^_^ 207, ^_^ 209, which are arranged such that there are two triangles defined by the locations of the microhpones. The scene thus comprises a first triangle is formed defined by the ‘connection’ 204 between

203 and ^_^ 205, the ‘connection’ 208 between ^_^ 205 and ^_^ 209 and the ‘connection’ 206 between

203 and ^_^ 209. Furthermore the scene comprises a second triangle is formed defined by the ‘connection’ 210 between ^_^ 207 and ^_^ 205, the ‘connection’ 208 between ^_^ 205 and ^_^ 209 and the ‘connection’ 212 between ^_^ 209 and ^_^ 207. In this example the listener

211 is located within the second triangle. As such the active sources are the microphones which form the vertices or corners of the second triangle ^_^ 205, ^_^ 207 and ^_^ 209. The signal interpolator 109 is further configured to select as an input audio signal an audio signal associated with a HOA source closest to the listener. The signal interpolator 109 is then configured to create an interpolated time-frequency domain audio signal. This can be implemented in some embodiment by processing the selected input signal corresponding to the closest HOA source to the user, for example by applying an equalisation (EQ) gain to the selected audio signal. For example in the situation shown in Figure 2, the chosen source is ^_^ 205. This selection can be implemented in the manner as described in section 6.6.18.3.2.4 “Determine HOA source for signal interpolation” in ISO/IEC 23090-4 WD as is also further described below. In the case where the listener moves such that the chosen HOA source for signal interpolation is no longer the closest HOA source to him, the signal interpolator 109 can be conifigured to implement a cross-fade process to smoothly transition to use the new HOA source for the signal interpolation. During the cross- fade (which lasts for 12 frames), the signal interpolator 109 creates a cross-faded frequency domain signal from the previous closest HOA source and the new HOA closest source by picking frequency bands from the two signals. As the cross-fade progresses, more frequency bands are chosen from the new closest signal and less from the previous closest signal. The crossfading as discussed above can employ the example method above which is further described within section 6.6.18.3.2.5 “Crossfade” in ISO/IEC 23090-4 WD1. To support the listener moving to a new triangle, the HOA sources comprising the new triangle are added to the list of active HOA sources, i.e. the list of HOA sources for which the STFT is run. Since before moving to the new triangle, only the three HOA sources belonging to the previous triangle have been active for STFT, and since the STFT used here takes 6 frames of input until the output is meaningful, the new HOA sources in the new triangle are not ready for processing. Thus, the system employs a delayed switch of the triangle. For the 6 frames or for how long it takes the STFT to produce a meaningful output, the spatial metadata interpolator 111 is configured to perform interpolation using the HOA sources of the previous triangle. Once the output of the STFT is representing the HOA sources of the new triangle then the processing resumes normal operation using the HOA sources of the new triangle. The HOA sources not part of the new triangle are dropped from the list of active HOA sources. Thus as discussed above, for 6DoF HOA rendering, the spatial analyser 107 can be configured to calculate or determine spatial metadata for all the HOA sources required for spatial metadata interpolation and the signal interpolator use the best suited HOA source (e.g. closest HOA source) for signal interpolation The computational complexity of implementing MPHOA processing is relatively high. It is of importance to attempt to lower computational complexity where possible (preferably without sacrificing audio quality). When the computational complexity is too high for the scene and rendering system, the listener may encounter glitches in audio playback. Generally, the signals are converted and processed within the frequency domain (based on a STFT processing) and the processing applied to a few selected channels for spatial metadata calculation. On the other hand, for signal interpolation, the maximum amount of information available is beneficial, consequently, STFT processing is performed for all channels of the HOA source. Consequently, when frequency domain (STFT) processing is performed for all channels of HOA sources used for MPHOA processing, there can be signicant excess computational complexity. The concept as demonstrated in some embodiments is to attempt to avoid excess computational complexity without degrading the subjective audio quality and listening experience. Thus in some embodiments for HOA sources that are used only for spatial metadata calculation but not for signal interpolation, the frequency domain (STFT) processing is not performed for the higher order channels (for example channels >4). This reduction of computational complexity enables a wider range of target addressable market of devices able to support 6DoF MPHOA rendering. In other words the concept associated with embodiments relates to apparatus and methods for achieving a complexity reduction of 6DoF rendering of an audio scene comprising two or more HOA sources. These apparatus and methods can be configured to perform determining a number of channels of a HOA source to be processed in order to achieve reduction in the computational complexity by employing frequency domain transforms and processing (STFT processing) for a variable number channels of HOA sources based on their use. For example whether the HOA source is determined to be used in either, spatial metadata calculation only or for both, spatial metadata calculation and signal interpolation. The concept can thus be summarized in some embodiments by the method steps as follows o Obtain listener position o Obtain active HOA sources based on the listener position o Determining among the active HOA sources whether an HOA source is used only for spatial metadata calculation only or for spatial metadata calculation as well as signal interpolation o Performing STFT for the first four channels of HOA sources used only for spatial metadata calculation o Performing STFT for all available channels of HOA sources used for spatial metadata calculation as well as for signal interpolation. In some embodiments, an intermediate position binaural audio rendering is generated from audio scene represented by two or more ambisonics source. The intermediate position binaural audio rendering is an audio rendering with six degrees of freedom. Furthermore in some further embodiments the method is configured to perform frequency domain transforms (STFT) only for the first N channels of M order HOA sources which are used only for spatial metadata calculation, where N<M. The value N is determined based on the number of channels used for performing spatial metadata calculation. As such the apparatus as shown in Figure 1 and described briefly above is further described herein with respect to some embodiments. The pre-processor 103 as discussed above takes as inputs, the positions of the HOA microphone arrays ^_^ 104 and a set of HRIR filters 100. The audio scene can then be segmented into triangle sections ^ by performing Delauney triangulation. An example is shown and has been described above with respect to the example audio scene shown in Figure 2. The triangulation can be used later in the processing to determine which HOA sources surround the listener and are used for generating the binaural signal at the listener position. Furthermore the pre-processor 103 can be configured to sample the HRIR filters 100 at a uniform grid of directions and convert these into frequency domain HRTFs 110. This is performed as the MPHOA processing is implemented in the frequency domain. The pre-processor is configured to perform these operations during the initialization of the apparatus. Thus in some embodiments the pre- processor 103 is employed in some embodiments once for each audio scene. In some embodiments the position pre-processor 105 can be configured, for every frame of audio, to determine the active triangle ^_^ 112, that is used for processing at the spatial analysis block. The active triangle ^_^ 112 is the triangle from the available triangles sections ^ 108 which surrounds the listener (or in other words the triangle which the listener is located or positioned in). Furthermore, the position pre-processor 105 in some embodiments is configured to determine or select a “chosen” HOA source for signal interpolation. In some embodiments the “chosen” HOA source is the source that determined to be closest to the listener position. The position pre-processor 105 furthermore in some embodiments is also configured to determine interpolation weights ^_^(^, ^) 128, these are weights referring to the HOA sources in the active triangle. The closer to the user the HOA source is, the higher the weighting factor. The weighting factors can in some embodiments be obtained by calculating the barycentric coordinates for the triangle by solving,

where ^_^,^^ = [^_^ ^_^ 1] is the listener position on the x-y plane. The ^_^^,^ value contains the coordinates of the HOA sources in the active triangle on the x-y plane. The barycentric coordinates can then in some embodiments be used as the weighting factors. In some embodiments, when there is no cross-fade in progress or the listener is not traversing to a new triangle, the chosen HOA source can be marked or indicated by setting a ‘FullOrder’ indicator value with the rest of the rest of the HOA sources comprising the triangle that the listener is in being indicated by setting a ‘FOA’ indicator value. For example as shown in Figure 3, which shows the audio scene 201 such as also shown in Figure 2 but with the microphone ^_^ identified as the “FullOrder” microphone 305, and microphones ^_^ identified as a “FOA” 307 and ^_^ identified as a “FOA” 309 microphones. With respect to the spatial analyser 107, the inputs, the input signals in Equivalent Spatial Domain (ESD) representation ^_^^^

102 and provides as output spatial metadata ^(^, ^, ^, ^) 116 ^(^, ^, ^, ^) 118 ^(^, ^, ^, ^) 120 ^(^, ^, ^, ^) 122 for the purposes of determining interpolated spatial metadata interpolated signal ^^{^}(^, ^, ^) 134 ^^(^, ^, ^) 136 ^̂(^, ^, ^) 138 ^̂(^, ^, ^) 132 at the listener position as well as a time-frequency domain signal ^(^, ^, ^, ^) 114 to be used for signal interpolation. In some embodiments the input signals in Equivalent Spatial Domain (ESD) representation ^_^^^

102 are first converted into higher-order Ambisonics (HOA) signals as follows: ^_^^^ ⁽^, ^⁾ = ^_^^^^^^^^^_^^^ ⁽^, ^⁾, where ^_^^^^^^^^ is a ^_^^ × ^_^^ ESD to HOA conversion matrix, ^ is the HOA Source index and ^ is the frame index. The output HOA signals are then split into ^_^^ subframes of equal length:

… ^_^^^(^, ^, ^_^^)] _{= ^^^^(^, ^)} Time-frequency domain conversion is then applied for all active HOA sources ^. For sources additionally marked as ‘FOA’, the conversion is performed for the first four channels only:

where ^(^, ^, ^) is a 4 × ^_^ matrix containing the time-frequency domain signals of length ^_^ for the first four HOA channels. For sources additionally marked as ‘FullOrder’, the conversion is performed for all channels. For both cases, the conversion can be performed using the function afSTFT [].

where ^⁽^, ^, ^⁾ is a ^_^^ × ^_^ matrix containing the time-frequency domain signals of length ^_^ for each HOA channel. The afSTFT processing is run for each channel ^ℎ separately. Thus, the more channels there are to process, the more computationally heavy the processing is. For each frequency bin ^ of signal

the spatial analyser 107 can be configured to calculate spatial metadata comprising direction, diffuseness and energy information. These are then passed on to the spatial metadata interpolator 111. The spatial metadata is calculated from a signal covariance matrix ^_^^^, which is obtained from the signal as follows: ^_^^^ ⁽^, ^, ^, ^⁾ = ^⁽^, ^, ^, ^⁾^^{^(}^, ^, ^, ^⁾, where: ,

where ^_^,^^(^, ^, ^) is the value in matrix ^(^, ^, ^) corresponding to channel ^ℎ and frequency bin ^. Spatial metadata is then calculated for each frequency bin of each active HOA source from the covariance matrix. This includes direction information, diffuseness information as well as energy: ^ ^, ^, é ⁽ ^, ^⁾ ê (^, ^, ^, ^)ù ^ ú ê ^⁽^, ^, ^, ^⁾ú ë ^⁽^, ^, ^, ^⁾û Where ^⁽^, ^, ^, ^⁾ 116 is the azimuth, ^⁽^, ^, ^, ^⁾ 118 is the elevation, ^⁽^, ^, ^, ^⁾ 120, the direct-to-total energy ratio and ^⁽^, ^, ^, ^⁾ 122 is the energy for HOA source ^, for frame ^ (subframe ^) and frequency bin ^. These are obtained as follows. First an intensity vector is calculated from the covariance matrix:

Then the energy:

And the rest of the spatial metadata:

The signal ^⁽^_^ , ^, ^⁾, where ^_^ is the index of the chosen HOA source for signal interpolation is passed on to the signal interpolation block, where a prototype binaural signal is calculated from it. The spatial metadata interpolator 111 is as indicated earlier configured to take the metadata related to the HOA sources of the active triangle (calculated by the spatial analyser 107) and creates interpolated metadata, that is, metadata at the listener position. The spatial metadata interpolator 111 thus is configured to describe the sound field at the listener position (what is should sound like at the listener position, which frequencies are coming from which direction at which energy etc.). The output of the spatial metadata interpolator 111 is interpolated metadata which is a weighted sum of the spatial metadata of the HOA sources of the active triangle. The weights for the weighted interpolation can be the weights ^_^(^, ^) 128 calculated in the position pre-processor 105. The signal interpolator 109 is configured to take as an input the chosen HOA source frequency domain signal ^⁽^_^ , ^, ^⁾, where ^_^ is the index of the chosen HOA source for signal interpolation as part of the signals ^(^, ^, ^, ^) 114 and provides as output a prototype frequency domain signal ^^{^(}^, ^, ^, ^⁾ 130. In summary, the prototype signal creation involves applying an EQ gain (based on the interpolated signal energy) on the signal, rotating it according to listener head orientation and then multiplying it with an HOA to binaural transformation matrix. The EQ gain is calculated in some embodiments as follows:

where ^_^(^) is the index of the chose HOA source for frame j. The interpolated signal is the calculated as follows: ^^{^} _^,^ ⁽^, ^⁾ = ^_^^ ⁽^, ^, ^⁾^_^,^ ⁽^_^ ⁽^⁾, ^, ^⁾ The mixer 113 is configured to receive or obtain as an input, as described above, the interpolated spatial metadata ^^{^}(^, ^, ^) 134 ^^(^, ^, ^) 136 ^̂(^, ^, ^) 138 ^̂(^, ^, ^) 132 at the listener position as well as the prototype signal ^(^, ^, ^, ^) 114. Thus, at this stage we have a description of the sound field at the listener position (interpolated spatial metadata) and a binaural signal that is an approximation of the output that we want (signal of the closest HOA source to the listener which has been EQ’d based on the interpolated signal energy). To get the final output the mixing stage creates a binaural signal from the interpolated signal such that it has the same characteristics as the interpolated metadata. For this, an optimal mixing algorithm is used which can be implemented such as that described in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time--frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411, and summarised below. First a prototype binaural signal ^(^, ^) is calculated from the interpolated

where ^_^^(^) is a spherical harmonics rotation matrix calculated according to the listeners head and source orientation and ^_^^^^^^^(^) is the Ambisonics to binaural matrix for frequency bin b. From the prototype signal a covariance matrix is calculated:

A target coavariance matrix is calculated from the interpolated metadata and the HRTFs calculated in the pre-processing step. Direct portion of covariance matrix:

And the final target covariance matrix: ^ ^, ^ = ^^{^^^^^^} ^, ^ ^^^^^^^ ^ ^{( )} _^ ^{( )} + ^_^ (^, ^) Mixing matrices are then obtained via the optimal mixing algorithm such as discussed in Vilkamo, J., Bäckström, T., & Kuntz, A. (2013). Optimized covariance domain framework for time--frequency processing of spatial audio. Journal of the Audio Engineering Society, 61(6), 403-411. The output of the optimal mixing algorithm is mixing matrices (^^{^(}^, ^, ^⁾ and ^^{^} _^ ⁽^, ^, ^⁾ ) which, when applied to the binaural prototype signal will produce an output binaural signal ^⁽^, ^, ^⁾ with a covariance matrix equal to ^_^ : ^⁽^, ^, ^⁾ = ^^{^(}^, ^, ^⁾ ∗ ^⁽^ − 1, ^, ^⁾ + ^^{^} _^ ⁽^, ^, ^⁾ ∗ ^⁽^, ^⁾ where ^⁽^, ^⁾ is a decorrelated time-frequency domain signal obtained from a buffer of previous binaural signals B. The output processor 115 is then configured to perform an inverse frequency domain transform (for example an inverse STFT) on the output frequency domain signal ^⁽^, ^, ^⁾ 142 to provide the final time domain output signal 144. The computational complexity savings that can be achieved using this method are due to the amount of STFT processing needed for each frame. For example a conventional frequency domain (STFT) would be performed (assuming 3^rd order HOA input signals) for (3 x 16 =) 48 channels. In some embodiments the frequency domain processing (STFT) is performed for (4 + 4 + 16 =) 24 channels, cutting the STFT computations to half. For fourth order HOA input signals, the savings are even greater (75 channels vs 33 channels). The current invention would cause the following change to the MPEG-I Audio working draft. Alias free Short-time Fourier Transform in some embodments is employed to convert HOA format signals into time-frequency domain [MPHOA_afSTFT]. For each subframe k of audio frame j and for each HOA Source i that is a part of a triangle found in the triangle track record (TRR, see section 6.4.17.3.2.3) s_HOA (i,j,k), a time-frequency domain signal matrix S(i,j,k) is determined with the afSTFT forward transformation. For the HOA source that has been chosen for signal interpolation, afSTFT is run for all channels and S(i,j,k) is a N_ch×N_b matrix containing the time-frequency domain signals of length N_b for each HOA channel. For all other HOA sources found in the triangle track record, afSTFT is run for only the first four channels (FOA) and S(i,j,k) is a 4×N_b matrix containing the time- frequency domain signals of length N_b for the first four channels. During cross- fade, the HOA source that is the target of the cross-fade, afSTFT shall also be run for all channels. With respect to Figure 4 is shown a flow diagram of the method steps for some embodiments: For example first obtain listener position as shown in Figure 4 by 401. The renderer receives this information from the listener position and orientation interface in terms of the audio scene coordinates. Then obtain active HOA sources based on the listener position as shown in Figure 4 by 403. This can be evaluated regularly, in other words for every scene state update. Thus all the HOA sources comprising the triangle the listener is in are classified as active. Furthermore then determine as shown in Figure 4 by 405 among the active HOA sources whether an HOA source is: used only for spatial metadata calculation only; or used for spatial metadata calculation as well as signal interpolation. If used only for spatial metadata, then as shown in Figure 4 by 407 then perform frequency domain transform (STFT) for the first N (e.g. four) channels. If used for spatial metadata as well as signal interpolation, then as shown in Figure 4 by 409 then perform frequency domain transforms (STFT) for all available channels (or in some embodments for a subset of channels >N) Then the method proceeds with continuing operations based on the frequency domain signals as shown in Figure 4 by 411. An example system employing some emodiments is described in Figure 5, above. The figure illustrates an end to end system overview for an audio scene comprising multiple HOA sources, which is rendered according to the above examples. The renderer receives the scene description and audio bitstreams and performs rendering accordingly. The MPHOA processing described in Figure 1 and presented in this invention is performed in the MPEG-I Audio Renderer whenever the scene comprises multiple HOA sources.The system can comprise a content creator 501 which can be implemented on any suitable computer or processing device. The content creator 501 comprises an (MPEG-I) encoder 511 which is configured to receive the audio scene description 500 and the audio signals or data 502. The audio scene description 500 can be provided in the MPEG-I Encoder Input Format (EIF) or in other suitable format. Generally, the audio scene description contains an acoustically relevant description of the contents of the audio scene, and contains, for example, the scene geometry as a mesh or voxel, acoustic materials, acoustic environments with reverberation parameters, positions of sound sources, and other audio element related parameters such as whether reverberation is to be rendered for an audio element or not. The MPEG-I encoder 511 is configured to output encoded data 512. The content creator 501 furthermore in some embodiments comprises a bitstream encoder 513 which is configured to receive the output 512 of the MPEG- I encoder 511 and the encoded audio signals from the MPEG-H encoder 511 and generate the bitstream 514. The bitstream 514 in some embodiments can be streamed to end-user devices or made available for download or stored. Additionally the system comprises a server configured to obtain the bitstream 514, and store it and supply it to the player 505. In some emobodiments this is implemented by a streaming server 521 which is configured to supply the audio data 522 and MPEG-I audio 6DoF metadata bitstream 524. The relevant bitstream 524 and audio data 522 is retrieved by the player 505. In some embodiments other implementation options are feasible such as broadcast, multicast. The player 505 in some embodiments comprises a playback device 531 configured to obtain or receive the audio data 522 and MPEG-I audio 6DoF metadata bitstream 524, and furthermore can be configured to receive or otherwise obtain the 6 DoF tracking information (listener orientation or position information) 534 from a suitable listener user interface, for example from the head mounted device (HMD) 541. These can for example be generated by sensors within the HMD 541 or from sensors in the environment sensing the orientation or position of the listener. In some embodiments the playback device 531 comprises a bitstream parser 533 configured to obtain the encoded metadata bitstream 524 and decode these in an opposite or inverse operation to the bitstream encoder 513 and mpeg I encoder 511 to generate audio scene description information 532 which can be passed to a MPEG-I audio renderer 535. In some embodiments the playback device 531 comprises the MPEG-I audio renderer 535 configured to implement the rendering operations as described above and generate audio output signals which can be output to the head mounted device 541. The playback device 531 can be implemented in different form factors depending on the application. In some embodiments the playback device is equipped with its own listener position tracking apparatus or receives the listener position information from an external apparatus. The playback device can in some embodiments be also equipped with headphone connector to deliver output of the rendered binaural audio to the headphones. With respect to Figure 6 an example electronic device which may be used as the computer, encoder processor, decoder processor or any of the functional blocks described herein is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1600 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc. In some embodiments the device 1600 comprises at least one processor or central processing unit 1607. The processor 1607 can be configured to execute various program codes such as the methods such as described herein. In some embodiments the device 1600 comprises a memory 1611. In some embodiments the at least one processor 1607 is coupled to the memory 1611. The memory 1611 can be any suitable storage means. In some embodiments the memory 1611 comprises a program code section for storing program codes implementable upon the processor 1607. Furthermore in some embodiments the memory 1611 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1607 whenever needed via the memory-processor coupling. In some embodiments the device 1600 comprises a user interface 1605. The user interface 1605 can be coupled in some embodiments to the processor 1607. In some embodiments the processor 1607 can control the operation of the user interface 1605 and receive inputs from the user interface 1605. In some embodiments the user interface 1605 can enable a user to input commands to the device 1600, for example via a keypad. In some embodiments the user interface 1605 can enable the user to obtain information from the device 1600. For example the user interface 1605 may comprise a display configured to display information from the device 1600 to the user. The user interface 1605 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1600 and further displaying information to the user of the device 1600. In some embodiments the device 1600 comprises an input/output port 1609. The input/output port 1609 in some embodiments comprises a transceiver. The transceiver in such embodiments can be coupled to the processor 1607 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling. The transceiver can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA). The transceiver input/output port 1609 may be configured to transmit/receive the audio signals, the bitstream and in some embodiments perform the operations and methods as described above by using the processor 1607 executing suitable code. In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof. The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media, and optical media. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples. Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate. Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication. The foregoing description has provided by way of exemplary and non- limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

CLAIMS: 1. A method for generating a spatialized audio output comprising: obtaining at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtaining a listener position within the audio environment; determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generating the spatialized audio output based on the determined spatial metadata and performed signal interpolation.

2. The method as claimed in claim 1, wherein determining spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source comprises: performing a Short-time Fourier transform on the at least one respective channel signals of the determined at least one active higher order ambisonics audio source to generate time-frequency representations of the at least one respective channel signals of the determined at least one active higher order ambisonics audio source; analysing the time-frequency representations of the at least one respective channel signals of the determined at least one active higher order ambisonics audio source to generate the spatial metadata.

3. The method as claimed in any of claims 1 to 2, wherein performing signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position comprises: performing a Short-time Fourier transform on the channel signals of at least one of the determined at least two higher order ambisonics audio sources to generate time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources; signal processing the time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources.

4. The method as claimed in any of claims 1 to 3, wherein the channel signals of at least one of the determined at least two higher order ambisonics audio sources are all of the channel signals of at least one of the determined at least two higher order ambisonics audio sources.

5. The method as claimed in any of claims 1 to 4, wherein the at least one respective channel signals of the determined at least one active higher order ambisonics audio source is subset of the channels.

6. The method as claimed in any of claims 1 to 5, wherein the at least one respective channel signals of the determined at least one active higher order ambisonics audio source is a first four channel signals of the determined at least one active higher order ambisonics audio source.

7. The method as claimed in any of the claims 1 to 6, wherein the channel signals of at least one of the determined at least two higher order ambisonics audio sources are a greater number of channels than the at least one respective channel signals of the determined at least one active higher order ambisonics audio source.

8. The method as claimed in any of claims 1 to 7, wherein determining at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position comprises: determining an area within which the listener position is located, the area defined by vertice positions of at least three higher order ambisonics audio sources; and selecting the at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one active audio source being those whose positions define the area vertices.

9. The method as claimed in any of claims 1 to 8, wherein the at least one of the determined at least two higher order ambisonics audio sources is at least one of the determined at least one active higher order ambisonics audio sources.

10. An apparatus for generating a spatialized audio output, the apparatus comprsing means conifigured to: obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtain a listener position within the audio environment; determine at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation.

11. The apparatus as claimed in claim 10, wherein the means configured to determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source is configured to: perform a Short-time Fourier transform on the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source; analyse the time-frequency representations of the at least one respective channel signals of determined at least one active higher order ambisonics audio source to generate the spatial metadata.

12. The apparatus as claimed in any of claims 10 or 11, wherein the means configured to perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position is configured to: perform a Short-time Fourier transform on the channel signals of at least one of the determined at least two higher order ambisonics audio sources to generate time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources; signal process the time-frequency representations of the channel signals of at least one of the determined at least two higher order ambisonics audio sources.

13. The apparatus as claimed in any of claims 10 to 12, wherein the channel signals of at least one of the determined at least two higher order ambisonics audio sources are all of the channel signals of at least one of the determined at least two higher order ambisonics audio sources.

14. The apparatus as claimed in any of claims 10 to 13, wherein the at least one respective channel signals of the determined at least one active higher order ambisonics audio source is subset of the channels.

15. The apparatus as claimed in any of claims 10 to 14, wherein the at least one respective channel signals of the determined at least one active higher order ambisonics audio source is a first four channel signals of the determined at least one active higher order ambisonics audio source.

16. The apparatus as claimed in any of the claims 10 to 15, wherein the channel signals of at least one of the determined at least two higher order ambisonics audio sources are a greater number of channels than the at least one respective channel signals of the determined at least one active higher order ambisonics audio source.

17. The apparatus as claimed in any of claims 10 to 16, wherein the means configured to determine at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position is configured to: determine an area within which the listener position is located, the area defined by vertice positions of at least three higher order ambisonics audio sources; and select the at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources, the at least one active audio source being those whose positions define the the area vertices.

18. The apparatus as claimed in any of claims 10 to 17, wherein the at least one of the determined at least two higher order ambisonics audio sources is at least one of the determined at least one active higher order ambisonics audio sources.

19. An apparatus for generating a spatialized audio output, the apparatus comprising at least one processor and at least one memory storing instructions that, when executed by the at least one processor, cause the apparatus at least to: obtain at least two higher order ambisoncs audio sources, wherein the at least two higher order ambisonics audio sources are associated with respective audio source positions within an audio environment; obtain a listener position within the audio environment; determine at least one active higher order ambisonics audio source from the at least two higher order ambisonics audio sources based on the listener position; determine spatial metadata by processing at least one respective channel signals of the determined at least one active higher order ambisonics audio source; perform signal interpolation by processing, channel signals of at least one of the determined at least two higher order ambisonics audio sources based on the listener position; and generate the spatialized audio output based on the determined spatial metadata and performed signal interpolation.

20. An apparatus configured to perform the actions of the method as claimed in any of claims 1 to 9.