EP3869507A1

EP3869507A1 - Embedding of spatial metadata in audio signals

Info

Publication number: EP3869507A1
Application number: EP21156162.6A
Authority: EP
Inventors: Mikko Olavi Heikkinen; Antti Johannes Eronen; Arto Juhani Lehtiniemi; Jussi Artturi LEPPÄNEN
Original assignee: Nokia Technologies Oy
Current assignee: Nokia Technologies Oy
Priority date: 2020-02-18
Filing date: 2021-02-10
Publication date: 2021-08-25
Anticipated expiration: 2041-02-10
Also published as: EP3869507B1

Abstract

There is provided processing a spatial audio signal. A bandwidth for embedding spatial metadata to an audio signal is determined on the basis of at least one of an adaptive gain control value for the audio signal and an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded. Then the spatial metadata is embedded to the audio on the basis of the determined bandwidth. In this way processing of the audio signal with or without support for spatial audio features from the involved hardware abstraction layers, HALs, and applications may be supported.

Description

TECHNICAL FIELD

The present invention relates to processing of spatial audio signals.

BACKGROUND

This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
It is possible to capture spatial audio by using multiple microphones and then using digital signal processing to create a model of the spatial sound scene. This information can be represented using two channels of processed audio and some metadata. Rendering to different output formats, e.g., binaural headphones or surround (5.1, 7.1) loudspeakers, is possible. Also processing effects such as audio focus and audio zooming are possible.
Modern data processing devices such as mobile phones have multiple processors. A general purpose processor can be used to run an operating system, OS, and a special purpose processor can be used to process audio signals.
Hardware abstraction layer (HAL) is a software interface that describes a hardware resource. System software is programmed to use these interfaces. Hardware vendors provide drivers for their products by implementing the HAL interfaces. It is important that the HAL interfaces do not change as that allows updating both system software and hardware independently of each other.
Spatial audio can be encoded according to an audio encoding format. Support for spatial audio features should be added into a HAL in order to process the encoded spatial audio by a processor. If audio capturing is performed by one processor and audio encoding is performed by another processor, HALs of both processors should be added support for the spatial audio features. Since HALs allow updating system software and hardware independently of each other, changes to the HALs should be limited. If applications do not support the spatial audio features, they cannot process the spatial audio and they can break.

SUMMARY

The scope of protection sought for various embodiments of the invention is set out by the independent claims. The embodiments, examples and features, if any, described in this specification that do not fall under the scope of the independent claims are to be interpreted as examples useful for understanding various embodiments of the invention.
According some aspects, there is provided the subject matter of the independent claims. Some further aspects are defined in the dependent claims. The embodiments that do not fall under the scope of the claims are to be interpreted as examples useful for understanding the disclosure."

BRIEF DESCRIPTION OF THE DRAWINGS

For a more complete understanding of example embodiments of the present invention, reference is now made to the following descriptions taken in connection with the accompanying drawings in which:

Fig. 1 shows a block diagram of an apparatus in accordance with at least some embodiments of the present invention;
Fig. 2 shows an apparatus in accordance with at least some embodiments of the present invention;
Fig. 3 shows an example of an arrangement for wireless communications comprising a plurality of apparatuses, networks and network elements;
Figs. 4 to 8 show examples of methods in accordance with at least some embodiments of the present invention;
Figs. 9 and 10 illustrate example block diagrams of apparatuses for embedding spatial metadata in accordance with at least some embodiments of the present invention;
Figs. 11 and 12 illustrate example block diagrams of apparatuses for processing an audio signal comprising embedded metadata in accordance with at least some embodiments of the present invention.

DETAILED DESCRIPTON OF SOME EXAMPLE EMBODIMENTS

The following embodiments are exemplary. Although the specification may refer to "an", "one", or "some" embodiment(s) in several locations, this does not necessarily mean that each such reference is to the same embodiment(s), or that the feature only applies to a single embodiment. Single features of different embodiments may also be combined to provide other embodiments.
In connection with processing a spatial audio signal, a bandwidth for embedding spatial metadata to an audio signal is determined on the basis of at least one of an adaptive gain control value for the audio signal and an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded. Then the spatial metadata is embedded to the audio on the basis of the determined bandwidth. In this way audio processing of the spatial audio signal is enabled even if spatial audio features are not supported. The spatial metadata is embedded to the audio signal in a transparent way so that the signal doesn't change perceptually. Since the spatial metadata is embedded to the audio signal, the audio signal may be transferred over legacy HAL interfaces and spatial audio use cases are enabled without breaking the legacy applications. In an example, the audio signal comprising the embedded metadata may be transferred over a legacy HAL from a digital signal processor (DSP) to a main central processing unit (CPU), where the audio signal may be encoded into a spatial audio format. In a further example, the audio signal comprising the embedded metadata may be transferred over a legacy HAL from a DSP to a legacy application or an application supporting spatial audio processing.
Spatial audio may be processed by processors and applications executed on system software provide they support spatial audio features that may be in accordance to a spatial audio technology. An example of spatial audio technology is Nokia's OZO audio. Spatial audio may be encoded according to an audio coding format, for example advanced audio coding (AAC), where spatial metadata is included into Data Stream Elements (DSE).
The following describes in further detail suitable apparatus and possible mechanisms for implementing some embodiments. In this regard reference is first made to Fig. 1 which shows a schematic block diagram of an exemplary apparatus or electronic device 50 depicted in Fig. 2, which may incorporate a transmitter according to an embodiment of the invention.
The electronic device 50 may for example be a communications device, wireless device, mobile terminal or user equipment of a wireless communication system. However, it would be appreciated that embodiments of the invention may be implemented within any electronic device or apparatus which may require transmission of radio frequency signals.
The apparatus 50 may comprise a housing 30 for incorporating and protecting the device. The apparatus 50 further may comprise a display 32 in the form of a liquid crystal display. In other embodiments of the invention the display may be any suitable display technology suitable to display an image or video. The apparatus 50 may further comprise a keypad 34. In other embodiments of the invention any suitable data or user interface mechanism may be employed. For example the user interface may be implemented as a virtual keyboard or data entry system as part of a touch-sensitive display. The apparatus may comprise one or more microphones 36 or any suitable audio input which may be a digital or analogue signal input. The apparatus 50 may further comprise an audio output device which in embodiments of the invention may be any one of: an earpiece 38, speaker, or an analogue audio or digital audio output connection. The apparatus 50 may also comprise a battery 40 (or in other embodiments of the invention the device may be powered by any suitable mobile energy device such as solar cell, fuel cell or clockwork generator). The term battery discussed in connection with the embodiments may also be one of these mobile energy devices. Further, the apparatus 50 may comprise a combination of different kinds of energy devices, for example a rechargeable battery and a solar cell. The apparatus may further comprise an infrared port 41 for short range line of sight communication to other devices. In other embodiments the apparatus 50 may further comprise any suitable short range communication solution such as for example a Bluetooth wireless connection or a USB/firewire wired connection.
The apparatus 50 may comprise a controller 56 or processor for controlling the apparatus 50. The controller 56 may be connected to memory 58 which in embodiments of the invention may store both data and/or may also store instructions for implementation on the controller 56. The controller 56 may further be connected to codec circuitry 54 suitable for carrying out coding and decoding of audio and/or video data or assisting in coding and decoding carried out by the controller 56.
In an example of the apparatus, the controller 56 may be a general purpose processor for running an operating system (OS). The controller may be for example an ARM CPU (Advanced RISC Machine Central Processing Unit). Examples of the operating systems comprise at least Android and iOS. A special purpose processor, for example a digital signal processor (DSP), may be connected to the general purpose processor and dedicated for audio processing.
The apparatus 50 may further comprise a card reader 48 and a smart card 46, for example a universal integrated circuit card (UICC) reader and UICC for providing user information and being suitable for providing authentication information for authentication and authorization of the user at a network.
The apparatus 50 may comprise radio interface circuitry 52 connected to the controller and suitable for generating wireless communication signals for example for communication with a cellular communications network, a wireless communications system or a wireless local area network. The apparatus 50 may further comprise an antenna 59 connected to the radio interface circuitry 52 for transmitting radio frequency signals generated at the radio interface circuitry 52 to other apparatus(es) and for receiving radio frequency signals from other apparatus(es).
In some embodiments of the invention, the apparatus 50 comprises a camera 42 capable of recording or detecting imaging.
With respect to Fig. 3, an example of a system within which embodiments of the present invention can be utilized is shown. The system 10 comprises multiple communication devices which can communicate through one or more networks. The system 10 may comprise any combination of wired and/or wireless networks including, but not limited to a wireless cellular telephone network (such as a GSM (2G, 3G, 4G, LTE, 5G), UMTS, CDMA network etc.), a wireless local area network (WLAN) such as defined by any of the IEEE 802.x standards, a Bluetooth personal area network, an Ethernet local area network, a token ring local area network, a wide area network, and the Internet.
For example, the system shown in Fig. 3 shows a mobile telephone network 11 and a representation of the internet 28. Connectivity to the internet 28 may include, but is not limited to, long range wireless connections, short range wireless connections, and various wired connections including, but not limited to, telephone lines, cable lines, power lines, and similar communication pathways.
The example communication devices shown in the system 10 may include, but are not limited to, an electronic device or apparatus 50, a combination of a personal digital assistant (PDA) and a mobile telephone 14, a PDA 16, an integrated messaging device (IMD) 18, a desktop computer 20, a notebook computer 22, a tablet computer. The apparatus 50 may be stationary or mobile when carried by an individual who is moving. The apparatus 50 may also be located in a mode of transport including, but not limited to, a car, a truck, a taxi, a bus, a train, a boat, an airplane, a bicycle, a motorcycle or any similar suitable mode of transport.
Some or further apparatus may send and receive calls and messages and communicate with service providers through a wireless connection 25 to a base station 24. The base station 24 may be connected to a network server 26 that allows communication between the mobile telephone network 11 and the internet 28. The system may include additional communication devices and communication devices of various types.
The communication devices may communicate using various transmission technologies including, but not limited to, code division multiple access (CDMA), global systems for mobile communications (GSM), universal mobile telecommunications system (UMTS), time divisional multiple access (TDMA), frequency division multiple access (FDMA), transmission control protocol-internet protocol (TCP-IP), short messaging service (SMS), multimedia messaging service (MMS), email, instant messaging service (IMS), Bluetooth, IEEE 802.11, Long Term Evolution wireless communication technique (LTE), 5G and any similar wireless communication technology. Yet some other possible transmission technologies to be mentioned here are high-speed downlink packet access (HSDPA), high-speed uplink packet access (HSUPA), LTE Advanced (LTE-A) carrier aggregation dual-carrier, and all multi-carrier technologies. A communications device involved in implementing various embodiments of the present invention may communicate using various media including, but not limited to, radio, infrared, laser, cable connections, and any suitable connection. In the following some example implementations of apparatuses utilizing the present invention will be described in more detail.
Fig. 4 illustrates an example of a method in accordance with at least some embodiments. The method provides a spatial audio signal that may be processed even if spatial audio features are not supported. The method may be performed by the apparatus described with Fig. 1., e.g. by the controller 56.
Phase 402 comprises determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of: an adaptive gain control value for the audio signal, and an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded.
Phase 404 comprises embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
Since the spatial metadata is embedded to the audio signal, support for spatial features from HAL layers and/or applications is not necessarily needed. Accordingly, the spatial metadata embedded to the audio signal is transparent HAL layers and applications that do not support spatial audio features. Therefore, while the audio signal that is embedded with metadata can be processed using the spatial metadata, processing of the audio signal without the spatial metadata is also possible.
In an example in accordance with at least some embodiments, phase 402 comprises that the audio signal is a two-channel pulse code modulation, PCM, signal. The PCM audio signal embedded with metadata can be used to render the audio signal to various output formats and also controlling parameters such as listener orientation. Examples of the output formats comprise at least a multichannel speaker signal for headphone listening.
In an example in accordance with at least some embodiments phase 404 comprises interleaving the determined bandwidth with the least significant bits of samples of the audio signal. In this way the spatial metadata may be spread among the lowest bits of the samples in a way that reduces or even minimizes the perceptual effect of embedding the spatial metadata to the audio signal. In an example, the bits for the spatial metadata can be allocated to every Nth of the samples to have an even distribution in time of modified bits.
Fig. 5 illustrates an example of a method in accordance with at least some embodiments. The method provides determining a bandwidth for embedding spatial metadata to an audio signal based on an adaptive gain control value. The method may be performed by the apparatus described with Fig. 1. e.g. by the controller 56.
Phase 502 comprises determining an adaptive gain control value for the audio signal.
Phase 504 comprises determining if the adaptive gain control value indicates a gain increase or a gain decrease.
Phase 506 comprises determining, if the adaptive gain control value fails to indicate a gain increase in phase 504, the bandwidth to be a minimum bandwidth. In this way perceptual effects of embedding the spatial metadata to the audio signal may be kept acceptable. In an example, the adaptive gain control value fails to indicate a gain increase, when the adaptive gain control value is negative, whereby adaptive gain control value indicates a gain decrease.
Phase 508 comprises determining, based on the adaptive gain control value indicating a gain increase in phase 504 and an adaptive gain control of the audio signal using the adaptive gain control value, the bandwidth per a sample of the audio signal to be a number of bits the sample has been shifted towards the most significant bit. In this way, the bandwidth for transparently carrying the spatial metadata embedded to the audio signal may be determined.
In an example phase 506 comprises that the bandwidth is 0.5 bits per sample.
In an example phase 508 comprises that the bandwidth is an amount of spatial metadata to be embedded or an upper limit that is pre-determined for samples of the audio signal.
In example phase 502 comprises that the adaptive gain control value is determined on the basis of an adaptive gain control of the audio signal. The adaptive gain control of the audio signal may output the adaptive gain control value. The adaptive gain control may cause a shift of one or more bits of a sample of the audio signal towards the least significant bits or towards the most significant bits. When the adaptive gain control value indicates a gain increase, the adaptive gain control has shifted the bits towards the most significant bits. In this case the shifted bits make room for embedding the spatial metadata to the least significant bits. Every 6dB of gain increase may correspond to a shift of one bit towards the most significant bits in a sample. When the adaptive gain control value indicates a gain decrease, the adaptive gain control has shifted the bits towards the least significant bits. In this case information is lost and the bandwidth should be kept small or even at minimum.
Fig. 6 illustrates an example of a method in accordance with at least some embodiments. The method provides embedding spatial metadata to an audio signal. The method may be performed by the apparatus described with Fig. 1. e.g. by the controller 56.
Phase 602 comprises determining a bandwidth for embedding spatial metadata to an audio signal in accordance with phase 402 of Fig. 402.
Phase 604 comprises determining whether the determined bandwidth is sufficient for the spatial metadata.
Phase 606 comprises embedding, if the bandwidth is determined to be sufficient for the spatial metadata, the spatial metadata to the audio signal.
Phase 608 comprises at least one of phases 610 and 612. Phase 610 comprises compressing, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the spatial metadata. In this way the amount of spatial metadata may be reduced. Reducing the amount of spatial metadata provides that the spatial metadata fits to the determined bandwidth. This means storing less bits of spatial metadata for each value of metadata. Storing zero bits for a value means discarding that value altogether. Storing more than zero and less than the original amount means quantizing the value to a coarser scale.
In an example the phase 610 comprises applying a lossy compression method to the spatial metadata. Reduction of the amount of spatial metadata may be based on rules. One approach to reduce the amount of metadata is to store the audio signal into audio buffers for processing. The buffers may be time domain buffers or time-frequency tiles. Analysis data may be stored for each buffered audio. The analysis data may comprise amounts or ratios of direct and ambient sound in a buffer. The analysis data may further comprise information indicating a direction of arrival for the most prominent sound source related to that the buffer. Alternatively or additionally, information indicating a use case of the audio signal may be used for the lossy compression. The information of the use case may be received from an application in application metadata. An example of the application is an application for capturing spatial audio. An example of a use case is a 2D camera capture by a mobile phone, whereby spatial metadata of in-direct sounds may be reduced more than spatial metadata of direct sounds. Alternatively or additionally, spatial metadata of sounds originating from outside a visible area of the camera may be reduced more than spatial metadata of sounds originating from within the visible area.
In an example, phase 610 comprises applying a lossless compression method to the spatial metadata. The lossless compression method may be based on entropy encoding.
Phase 612 comprises determining, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the bandwidth on the basis of a perceptual spectrum of the audio signal comprising the embedded spatial metadata. In this way the amount of metadata may be reduced to fit the bandwidth while controlling the effect of the reduction of the spatial metadata to the perception of the audio signal. In an example, the spatial metadata may be embedded using different bandwidths to the audio signal. The bandwidth that provides a perceptual spectrum close to the perceptual spectrum of the original audio signal, i.e. the audio signal without embedded spatial metadata, or at least that perceptual changes to the original audio signal are kept acceptable, may be determined to be the bandwidth for embedding spatial metadata. Provided the determined perceptual spectrum of the audio signal indicates a larger bandwidth than the bandwidth determined based on the adaptive gain control value, the bandwidth indicated by determined perceptual spectrum may be used for embedding the spatial metadata to the audio signal. Embedding the spatial metadata to the bandwidth determined in phase 612 may be facilitated if the spatial metadata is compressed in accordance with phase 610.
Fig. 7 illustrates an example of a method in accordance with at least some embodiments. The method provides determining a bandwidth for embedding spatial metadata on the basis of a perceptual spectrum of an audio signal comprising the embedded spatial metadata. The method may be used to implement phase 612 in Fig. 6, for example. The method may be performed by the apparatus described with Fig. 1. e.g. by the controller 56.
Phase 702 comprises determining a reference perceptual spectrum based on a psychoacoustical model and the audio signal without embedded spatial metadata.
Phase 704 comprises determining candidate perceptual spectra based on the psychoacoustical model and the audio signal embedded with spatial metadata using candidate bandwidths for the spatial metadata.
Phase 706 comprises determining the bandwidth for embedding spatial metadata from the one or more candidate bandwidths on the basis of evaluating a perceived quality of the audio signals comprising embedded metadata using the candidate bandwidths. The evaluating may comprise comparing powers of the candidate perceptual spectra to a power of the perceptual spectra of the audio signal without embedded metadata. For example, the evaluating may comprise calculating powers of the audio signals comprising embedded metadata at the candidate bandwidths at perceptually motivated frequency bands such as the Bark scale. The same is done for the audio signal without embedded metadata. Then, spectral comparison is performed to determine whether at some frequency bands the spatial metadata noise power exceeds the signal power. If this happens, it may be determined that spatial metadata noise power is too high and it may be determined that the bandwidth is the highest candidate bandwidth which did not cause such an exceed of spatial metadata noise signal power above the actual signal power.
In this way perceptual changes to the audio signal caused by embedding the spatial metadata to the audio signal may be kept acceptable, mitigated or even avoided.
Fig. 8 illustrates an example of a method in accordance with at least some embodiments. The method provides processing of an audio signal comprising embedded spatial metadata. The method may be performed by the apparatus described with Fig. 1. e.g. by the controller 56.
Phase 802 comprises receiving the audio signal embedded with spatial metadata.
Phase 804 comprises extracting the spatial metadata from the received audio.
Phase 806 comprises at least one of encoding the received the audio signal according to a lossy audio coding format where the extracted spatial metadata is included in metadata elements of the audio encoding format, and performing spatial synthesis on the basis of the extracted spatial metadata and the received audio signal or the received audio signal, where the spatial metadata has been removed.
In an example, phase 802 comprises receiving the audio signal after phase 404 of Fig. 4. The audio signal may be received at a HAL of a processor.
In an example phase 804 comprises that the audio signal is left unmodified. In another example phase 804 comprises that the extraction process also removes the metadata from the signal producing the original audio signal to which the metadata was embedded.
In an example phase 806 comprises that the lossy audio encoding format is Advanced Audio Coding, AAC, and the spatial metadata is included into DSE elements. An AAC bitstream may be formed based on the encoded audio signal and multiplexed to an MP4 file together with a video track.
Figs. 9 and 10 illustrate example block diagrams of apparatuses for embedding spatial metadata in accordance with at least some embodiments of the present invention. The functional blocks may perform one or more functionalities in accordance with examples of the method in Figs. 4 to 8. Fig. 9 illustrates functional blocks on a DSP 900 for encoding a spatial audio signal from captured raw microphone signals. Spatial Encoding 902 takes multiple raw microphone signals as input and produces an output of two channels of PCM audio and spatial metadata which together form a presentation of the captured spatial sound. The output PCM audio and spatial metadata can be used to render the audio to various output formats (e.g., multichannel speaker signal for home theatre listening, loudspeaker listening or a two-channel binaural signal for headphone listening) and also controlling parameters such as listener orientation. The output PCM audio can be two selected input signals or a an audio signal, e.g. processed downmix such as binaural audio, generated by processing the input signals
There is an Automatic Gain Control (AGC) 904 that is coupled to the workings of the spatial encoding. The AGC may adjust the dynamics of the produced output audio so that the perceived level of the audio is comfortable to the user. In an example, the AGC may add gain to the audio when the input signal amplitude is low and reduce gain when the input signal amplitude is very high. The AGC may operate on several separate frequency bands and may also perform as a non-linear limiter in addition to controlling the gains linearly.
The processing of the audio happens in frames of one or more samples and for each frame the AGC may have an associated overall gain control value.
When the gain value is positive, the AGC increase the signal amplitude. This means that the information content in the input signal to the AGC didn't use the full range of the sample depth and there is room to add information to the signal without removing any information from the input signal. At least some of that space will be in the lowest bits, i.e. the least significant bits, of the output samples because adding gain, i.e. multiplying a sample, causes a shift of the information towards the high bits, the most significant bits, of the sample. Every 6dB of gain corresponds to a shift of one bit towards the most significant bits in a sample.
When the gain is negative, the AGC has caused that the information content in the input signal has shifted towards the least significant bits of the signal and also some information has been lost. Empty bits have been added to the most significant bits of each sample, but that headroom cannot be used to convey information in a perceptually transparent way. The approach is to always convey spatial metadata in the least significant bits of the signal. That means that only a minimal amount of bits can be allocated for spatial metadata when the AGC gain value is negative or that the perceptual impact of the added spatial metadata has to be evaluated using a different approach, e.g., using a psychoacoustic model.
The audio and spatial metadata output from the Spatial Encoding and the gain value from the AGC are used in the Lossy Spatial Metadata Compression 906. This Lossy Spatial Metadata Compression may estimate the available bandwidth in the PCM signal for embedding spatial metadata, reduce the amount of spatial metadata to fit the available budget for each frame and embed the metadata to the audio signal. The output will be two channels of PCM audio with embedded spatial metadata. Application metadata, e.g. from an application for capturing spatial sound, may be received by the Lossy Spatial Metadata Compression.
Fig. 10 illustrates functional blocks of the Lossy Spatial Metadata Compression 906. The Lossy Spatial Metadata Compression comprises Available Metadata Bandwidth Estimator 1002 and Psychoacoustic Model 1004 that are used to provide input for the Metadata Reduction 1006 and 1008 Metadata Compression. The output from those is fed to Metadata Signal Embedding 1010. The output of the Lossy Spatial Metadata Compression is two channels of PCM audio that is perceptually identical or as close as possible to the input signal and has spatial metadata embedded to it. The Psychoacoustic Model 1004 and Metadata Compression 1008 can be considered optional. It is possible to create a satisfactory implementation of the invention without them. The availability of Psychoacoustic Model and Metadata Compression help in providing better quality.
The Available Metadata Bandwidth Estimator 1002 gets an input of at least a gain value for the current frame from the AGC. If the Available Metadata Bandwidth Estimator is operating without a Psychoacoustic Model, then it can blindly use the AGC gain value as the controlling parameter for the processing. This is the simplest realization of the estimation and the logic is as follows.
There may be a minimum value for the required bandwidth for spatial metadata. It could be, e.g., 0.5 bits per sample. If the gain value from the AGC is negative, the output from the spatial metadata bandwidth estimation will be the minimum. There is also a maximum value for the required bandwidth for spatial metadata. This value can be chosen to be equal or less than the amount of spatial metadata coming from Spatial Encoding 902 in Fig. 9. If it is known a prior that there is an upper limit to how much spatial metadata bandwidth can be embedded to the signal in a transparent way, that value should be selected as the maximum.
With the above limits defined the operation of Available Metadata Bandwidth Estimator 1002 becomes to use the minimum value when the AGC gain control value is negative. When the AGC gain control value is positive the estimate will use the formula of 1bit of extra bandwidth for every 6dB of gain to add to the bandwidth estimation based on the current AGC control value. The maximum allowed value for the spatial metadata bandwidth mentioned above will be used as a limit for the output.
The simple model described above can be extended with use of a Psychoacoustic Model 1004. Now the absolute minimum limit for the required bandwidth can stay unchanged, but the bandwidth above that can be adjusted together with an estimate from the Psychoacoustic Model. The process can be, e.g., iterative so that various amounts of spatial metadata are embedded to the signal and the highest one without significant changes to the perceived quality is selected. The procedure can be as follows: the method creates first just the spatial metadata signal to the lowest bits at a given amount and estimates its perceptual spectrum, e.g., by calculating powers at perceptually motivated frequency bands such as the Bark scale. The same is done for the content signal. Then, spectral comparison is performed to determine whether at some frequency bands the spatial metadata noise power exceeds the signal power. If this happens, then the system can determine that spatial metadata noise power is too high and it determine to use the previous amount which did not cause such an exceed of spatial metadata noise signal power above the actual signal power.
This will involve the Metadata Signal Embedding 1010 but does not need the Metadata Reduction 1006 and Compression 1008. The data to embed can be selected more randomly, if we assume that the perceptual effect will be mostly dependent on the data amount and not the data contents.
The Metadata Reduction 1006 is responsible for lossy compression of the spatial metadata. The Metadata Reduction reduces the amount of spatial metadata. The reduction of spatial metadata means storing less bits for each value. Storing zero bits for a value means discarding that value altogether. Storing more than zero and less than the original amount means quantizing the value to a coarser scale.
The Metadata Reduction 1006 may be a rule-based system for the reduction of spatial metadata. The spatial metadata coming from spatial processing can vary. One commonly used approach is to partition the signal into time-frequency (TF) tiles and then store analysis data for each tile. The values related to a tile can be, e.g., information about the amounts (or ratios) of direct and ambient sound in that tile. There can also be a value for the direction of arrival for the most prominent sound source related to that TF tile.
The rules may be used to determine which spatial metadata to trim. The application use case can affect this behaviour. Application metadata, e.g. from an application for capturing spatial sound, may be used for determining the use case. In the use case of mobile 2D camera capture, and compressing spatial metadata, it is possible to decide that spatial metadata of direct sounds that map to the visible area in the camera are less likely to become quantized than spatial metadata of direct sounds originating from outside the camera visible area. If the related PCM audio is already binaural, leaving this spatial metadata out will not affect the spatial sound image. It will affect, e.g., focus processing that is done during playback. Another approach is to choose to favour all spatial metadata that identifies direct sound and preserve bits for them, regardless of the direction of arrival. The ambient spatial metadata can be chosen to be quantized more heavily. This will affect the spatial sound image but will preserve the direct sound information better, which will affect the rendering when controlling listener orientation. Yet another approach is to combine spatial metadata across TF tiles, for example, use the same direction-of-arrival estimates for different TF tiles, such as TF tiles having the same timestamp across a range of frequencies.
The Metadata Reduction 1006 may apply an algorithm for reducing the spatial metadata. The algorithm may receive the original spatial metadata, the available budget/bandwidth for spatial metadata size and rules that may define priorities about which aspects to preserve and which to quantize. It can then, e.g., iteratively coarsen the quantization for those elements which the priorities allow until it meets the size budget.
The reduced spatial metadata will then go to Metadata Signal Embedding 1010 or optionally to Metadata Compression1008. Metadata Compression is a lossless compression. It can be implemented, e.g., using widely known entropy coding methods.
The Metadata Signal Embedding 1010 may write the reduced and optionally compressed spatial metadata to the audio signal. It may use the lowest bits of the audio signal. There can be an approach to spread the spatial metadata among the lowest bits in a way that minimizes the perceptual effect on the audio signal.
Figs. 11 and 12 illustrate example block diagrams of apparatuses for processing an audio signal comprising embedded metadata in accordance with at least some embodiments of the present invention. Fig. 11 presents an example of an application processor where an audio signal with embedded spatial metadata is received and encoded into a file format supporting spatial features. The audio passes to the application processor 1100 through one or more HAL layers 1102, which are unaware of the embedded spatial metadata. The audio is received by a Spatial Audio Format Encoder 1104. The Spatial Audio Format Encoder may extract by a Metadata Extractor 1106 the spatial metadata from the received PCM signal. The extraction can leave the signal unmodified, but another approach is that the extraction process also removes the spatial metadata from the signal producing the original signal to which the spatial metadata was embedded. After the spatial metadata extraction, the signal is fed to an AAC Encoder 1108. If the spatial metadata was present in the signal that was fed to the AAC encoder, it is no longer possible to read from the AAC signal - the lossy AAC compression will not preserve it.
The produced AAC bitstream is injected by AAC Metadata Injector 1110 with the spatial metadata that was extracted from the received PCM signal. The injection method is to add spatial metadata to DSE elements of the bitstream. The bitstream can then be multiplexed by a Muxer 1112 to an MP4 file together with a video track. The resulting file is playable on standard compliant players.
Fig. 12 presents an example of an application processor where an audio signal with embedded spatial metadata is received and used by applications that support spatial features and applications that do not support spatial features. The audio passes to the application processor 1200 through one or more HAL layers 1202, which are unaware of the embedded spatial metadata. The audio is received by a Spatial Audio Application that support spatial features and one or more Unmodified Audio Applications 1206 that do not support spatial features. The Unmodified Audio Applications are not aware of the embedded spatial metadata in the signal, and the spatial metadata does not affect their operation, whereby they function in the same way as for a signal with no added spatial metadata. The Unmodified Audio Applications will not break and since the spatial metadata is not audible in the signal, it will not cause degradation in the user experience. The Spatial Audio Application may extract by a Metadata Extractor 1208 the spatial metadata from the received PCM signal. The extraction can leave the signal unmodified, but another approach is that the extraction process also removes the spatial metadata from the signal producing the original signal to which the spatial metadata was embedded. The audio and spatial metadata can be used as input to Spatial Synthesis 1210.
An apparatus comprising at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, cause the apparatus at least to perform:

determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of:
- an adaptive gain control value for the audio signal, and
- an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.

An example in accordance with at least some embodiments is a computer program comprising computer readable program code means adapted to perform at least the following:
determining a bandwidth of an audio signal for embedding spatial metadata to the audio signal, on the basis of at least one of:

an adaptive gain control value for the audio signal, and
an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.

A memory may be a computer readable medium that may be non-transitory. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs) and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments may be implemented in software, hardware, application logic or a combination of software, hardware and application logic. The software, application logic and/or hardware may reside on memory, or any computer media. In an example embodiment, the application logic, software or an instruction set is maintained on any one of various conventional computer-readable media. In the context of this document, a "memory" or "computer-readable medium" may be any media or means that can contain, store, communicate, propagate or transport the instructions for use by or in connection with an instruction execution system, apparatus, or device, such as a computer.
Reference to, where relevant, "computer-readable storage medium", "computer program product", "tangibly embodied computer program" etc., or a "processor" or "processing circuitry" etc. should be understood to encompass not only computers having differing architectures such as single/multi-processor architectures and sequencers/parallel architectures, but also specialized circuits such as field programmable gate arrays FPGA, application specify circuits ASIC, signal processing devices and other devices. References to computer readable program code means, computer program, computer instructions, computer code etc. should be understood to express software for a programmable processor firmware such as the programmable content of a hardware device as instructions for a processor or configured or configuration settings for a fixed function device, gate array, programmable logic device, etc.
Although the above examples describe embodiments of the invention operating within an apparatus or electronic device, it would be appreciated that the invention as described above may be implemented as a part of any apparatus comprising a circuitry for processing audio signals and metadata. Thus, for example, embodiments of the invention may be implemented in a mobile phone, in a computer such as a desktop computer or a tablet computer.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits or any combination thereof. While various aspects of the invention may be illustrated and described as block diagrams or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules, field-programmable gate arrays (FPGA), application specific integrated circuits (ASIC), microcontrollers, microprocessors, a combination of such modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
As used in this application, the term "circuitry" may refer to one or more or all of the following:

(a) hardware-only circuit implementations (such as implementations in only analogue and/or digital circuitry) and
(b) combinations of hardware circuits and software, such as (as applicable):
1. (i) a combination of analogue and/or digital hardware circuit(s) with software/firmware and
2. (ii) any portions of hardware processor(s) with software (including digital signal processor(s)), software, and memory(ies) that work together to cause an apparatus, such as a mobile phone or server, to perform various functions) and
(c) hardware circuit(s) and or processor(s), such as a microprocessor(s) or a portion of a microprocessor(s), that requires software (e.g., firmware) for operation, but the software may not be present when it is not needed for operation.

This definition of circuitry applies to all uses of this term in this application, including in any claims. As a further example, as used in this application, the term circuitry also covers an implementation of merely a hardware circuit or processor (or multiple processors) or portion of a hardware circuit or processor and its (or their) accompanying software and/or firmware. The term circuitry also covers, for example and if applicable to the particular claim element, a baseband integrated circuit or processor integrated circuit for a mobile device or a similar integrated circuit in server, a cellular network device, or other computing or network device.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention.

EXAMPLES

Example 1: A method comprising:
- determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of:
  - ∘ an adaptive gain control value for the audio signal, and
  - ∘ an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
- embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
Example 2: The method according to example 1, comprising:
determining, based on the adaptive gain control value indicating a gain increase and an adaptive gain control of the audio signal using the adaptive gain control value, the bandwidth per a sample of the audio signal to be a number of bits the sample has been shifted towards the most significant bit.
Example 3: The method according to example 1 or 2, comprising:
determining, if the adaptive gain control value fails to indicate a gain increase, the bandwidth to be a minimum bandwidth.
Example 4: The method according to any of examples 1 to 3, comprising:
- compressing, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the spatial metadata; and/or
- determining, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the bandwidth on the basis of a perceptual spectrum of the audio signal comprising the embedded spatial metadata.
Example 5: The method according to any of examples 1 to 4, comprising: storing the audio signal into audio buffers; and at least one of
storing analysis data for each audio buffer,
prioritizing spatial metadata on the basis of application metadata, and combining spatial metadata across the audio buffers.
Example 6: The method according to any of examples 1 to 5, wherein the determined bandwidth comprises the least significant bits of samples of the audio signal.
Example 7: The method according to any of examples 1 to 6, comprising: interleaving the determined bandwidth with the least significant bits of samples of the audio signal.
Example 8: The method according to any of examples 1 to 7, comprising:
- determining a reference perceptual spectrum based on a psychoacoustical model and the audio signal without embedded spatial metadata;
- determining candidate perceptual spectra based on the psychoacoustical model and the audio signal embedded with spatial metadata using candidate bandwidths for the spatial metadata; and
- determining the bandwidth for embedding spatial metadata from the one or more candidate bandwidths on the basis of evaluating whether the candidate perceptual spectra exceed the reference perceptual spectrum.
Example 9: The method according to example 8, comprising:
determining the bandwidth for embedding spatial metadata to be the highest candidate bandwidth at which a perceptual spectrum, based on the psychoacoustical model and the audio signal embedded with spatial metadata, fails to exceed the reference perceptual spectrum.
Example 10: The method according to any of examples 1 to 9, comprising:
- receiving the audio signal embedded with spatial metadata.
- extracting the spatial metadata from the received audio signal; and at least one of:
  - encoding the received the audio signal according to a lossy audio coding format where the extracted spatial metadata is included in metadata elements of the audio encoding format; and
  - performing spatial synthesis on the basis of the extracted spatial metadata and the received audio signal or the received audio signal, where the spatial metadata has been removed.
Example 11: The method according to any of examples 1 to 10, wherein the audio signal is a two-channel pulse code modulation, PCM, signal.
Example 12: An apparatus comprising:
- means for determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of:
  - an adaptive gain control value for the audio signal, and
  - an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
- means for embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
Example 13: The apparatus according to example 12, comprising:
means for determining, based on the adaptive gain control value indicating a gain increase and an adaptive gain control of the audio signal using the adaptive gain control value, the bandwidth per a sample of the audio signal to be a number of bits the sample has been shifted towards the most significant bit.
Example 14: The apparatus according to example 12 or 13, comprising: means for determining, if the adaptive gain control value fails to indicate a gain increase, the bandwidth to be a minimum bandwidth.
Example 15: The apparatus according to any of examples 12 to 14, comprising:
means for compressing, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the spatial metadata; and/or means for determining, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the bandwidth on the basis of a perceptual spectrum of the audio signal comprising the embedded spatial metadata.
Example 16: The apparatus according to any of examples 12 to 15, comprising:
means for storing the audio signal into audio buffers; and at least one of means for storing analysis data for each audio buffer,
- means for prioritizing spatial metadata on the basis of application metadata, and
- means for combining spatial metadata across the audio buffers.
Example 17: The apparatus according to any of examples 12 to 16, wherein the determined bandwidth comprises the least significant bits of samples of the audio signal.
Example 18: The apparatus according to example 17, comprising:
means for interleaving the determined bandwidth with the least significant bits of samples of the audio signal
Example 19: The apparatus according to any of examples 12 to 18, means for determining a reference perceptual spectrum based on a psychoacoustical model and the audio signal without embedded spatial metadata;
means for determining candidate perceptual spectra based on the psychoacoustical model and the audio signal embedded with spatial metadata using candidate bandwidths for the spatial metadata; and
means for determining the bandwidth for embedding spatial metadata from the one or more candidate bandwidths on the basis of evaluating whether the candidate perceptual spectra exceed the reference perceptual spectrum.
Example 20: The apparatus according to example 19, comprising:
means for determining the bandwidth for embedding spatial metadata to be the highest candidate bandwidth at which a perceptual spectrum, based on the psychoacoustical model and the audio signal embedded with spatial metadata, fails to exceed the reference perceptual spectrum.
Example 21: The apparatus according to any of examples 12 to 20, comprising:
- means for receiving the audio signal embedded with spatial metadata;
- means for extracting the spatial metadata from the received audio signal; and at least one of:
  - means for encoding the received the audio signal according to a lossy audio coding format where the extracted spatial metadata is included in metadata elements of the audio encoding format; and
  - means for performing spatial synthesis on the basis of the extracted spatial metadata and the received audio signal or the received audio signal, where the spatial metadata has been removed.
Example 22: The apparatus according to any of examples 12 to 21, wherein the audio signal is a two-channel pulse code modulation, PCM, signal.
Example 23: A computer program comprising computer readable program code means adapted to perform at least the following:
determining a bandwidth of an audio signal for embedding spatial metadata to the audio signal, on the basis of at least one of:
- an adaptive gain control value for the audio signal, and
- an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
  - embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
Example 24: Apparatus comprising at least one processor; and at least one memory including computer program code; the at least one memory and the computer program code configured to, with the at least one processor, causes the apparatus to perform one or more steps of any of the examples 1 to 11.

Claims

A method comprising:
- determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of:
- an adaptive gain control value for the audio signal, and

- an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and

- embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
The method according to claim 1, comprising:
- determining, based on the adaptive gain control value indicating a gain increase and an adaptive gain control of the audio signal using the adaptive gain control value, the bandwidth per sample of the audio signal to be a number of bits the sample has been shifted towards the most significant bit.
The method according to claim 1 or 2, comprising:
- determining, if the adaptive gain control value fails to indicate a gain increase, the bandwidth to be a minimum bandwidth.
The method according to any of claims 1 to 3, comprising:
- compressing, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the spatial metadata; and/or
determining, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the bandwidth on the basis of a perceptual spectrum of the audio signal comprising the embedded spatial metadata.
An apparatus comprising:
means for determining a bandwidth for embedding spatial metadata to an audio signal, on the basis of at least one of:
∘ an adaptive gain control value for the audio signal, and

∘ an evaluation of a perceptual spectrum of the audio signal, where the spatial metadata is embedded; and
means for embedding the spatial metadata to the audio signal, on the basis of the determined bandwidth.
The apparatus according to claim 5, comprising:
- means for determining, based on the adaptive gain control value indicating a gain increase and an adaptive gain control of the audio signal using the adaptive gain control value, the bandwidth per sample of the audio signal to be a number of bits the sample has been shifted towards the most significant bit.
The apparatus according to claim 5 or 6, comprising:
- means for determining, if the adaptive gain control value fails to indicate a gain increase, the bandwidth to be a minimum bandwidth.
The apparatus according to any of claims 5 to 7, comprising:
- means for compressing, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the spatial metadata; and/or

- means for determining, if the determined bandwidth on the basis of the adaptive gain control value is insufficient, the bandwidth on the basis of a perceptual spectrum of the audio signal comprising the embedded spatial metadata.
The apparatus according to any of claims 5 to 8, comprising:
- means for storing the audio signal into audio buffers; and at least one of
- means for storing analysis data for each audio buffer,

- means for prioritizing spatial metadata on the basis of application metadata, and

- means for combining spatial metadata across the audio buffers.
The apparatus according to any of claims 5 to 9, wherein the determined bandwidth comprises the least significant bits of samples of the audio signal.
The apparatus according to claim 10, comprising:
- means for interleaving the determined bandwidth with the least significant bits of samples of the audio signal
The apparatus according to any of claims 5 to 11,
- means for determining a reference perceptual spectrum based on a psychoacoustical model and the audio signal without embedded spatial metadata;

- means for determining candidate perceptual spectra based on the psychoacoustical model and the audio signal embedded with spatial metadata using candidate bandwidths for the spatial metadata; and

- means for determining the bandwidth for embedding spatial metadata from the one or more candidate bandwidths on the basis of evaluating whether the candidate perceptual spectra exceed the reference perceptual spectrum.
The apparatus according to claim 12, comprising:
- means for determining the bandwidth for embedding spatial metadata to be the highest candidate bandwidth at which a perceptual spectrum, based on the psychoacoustical model and the audio signal embedded with spatial metadata, fails to exceed the reference perceptual spectrum.
The apparatus according to any of claims 5 to 13, comprising:
- means for receiving the audio signal embedded with spatial metadata;

- means for extracting the spatial metadata from the received audio signal; and at least one of:
- means for encoding the received the audio signal according to a lossy audio coding format where the extracted spatial metadata is included in metadata elements of the audio encoding format; and

- means for performing spatial synthesis on the basis of the extracted spatial metadata and the received audio signal or the received audio signal, where the spatial metadata has been removed.
The apparatus according to any of claims 5 to 14, wherein the audio signal is a two-channel pulse code modulation, PCM, signal.