EP3613043A1 - Ambience generation for spatial audio mixing featuring use of original and extended signal - Google Patents

Ambience generation for spatial audio mixing featuring use of original and extended signal

Info

Publication number
EP3613043A1
Authority
EP
European Patent Office
Prior art keywords
audio signal
spatially
synthesis
signal
parameter value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP18787482.1A
Other languages
German (de)
French (fr)
Other versions
EP3613043A4 (en)
Inventor
Tapani PIHLAJAKUJA
Jussi LEPPÄNEN
Antti Eronen
Arto Lehtiniemi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Publication of EP3613043A1
Publication of EP3613043A4


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/11 Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 Application of ambisonics in stereophonic audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/13 Application of wave-field synthesis in stereophonic audio systems

Definitions

  • the present application relates to apparatus and methods for ambience generation for spatial audio mixing featuring use of original and extended signal.
  • Capture of audio signals from multiple sources, and mixing of those audio signals while the sources are moving in the spatial field, requires significant effort. For example, capturing and mixing an audio signal source, such as a speaker or artist, within an audio environment such as a theatre or lecture hall, so that it can be presented to a listener with an effective audio atmosphere, requires significant investment in equipment and training.
  • a commonly implemented system is where one or more 'external' microphones, for example a Lavalier microphone worn by the user or an audio channel associated with an instrument, are mixed with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction.
  • This system is known in some areas as Spatial Audio Mixing (SAM).
  • the SAM system enables the creation of immersive sound scenes comprising "background spatial audio" or ambiance and sound objects for Virtual Reality (VR) applications.
  • the scene can be designed such that the overall spatial audio of the scene, such as a concert venue, is captured with a microphone array (such as one contained in the OZO virtual camera) and the most important sources captured using the 'external' microphones.
  • volumetric virtual sound source refers to a virtual sound source with a spatial volume, whereas a point-like virtual source is perceived from a single point in space.
  • Volumetric virtual sound sources are useful in various applications including virtual and augmented reality and computer gaming. They enable creative opportunities for sound engineers and facilitate more realistic representation of sounds with a natural size, such as large sound-emitting objects. Consider, for example, a fountain, sea, or a large machine. Such volumetric virtual sound sources are discussed in Pihlajamaki et al. Synthesis of Spatially Extended Virtual Sources with Time-Frequency Decomposition of Mono Signals, JAES 2014.
  • volumetric virtual sound sources can be implemented by creating sounds with perceived spatial extents, since the human ability to perceive the distance of sounds is poor.
  • a sound with a perceived spatial extent may be surrounding the listener or it may have a specific width.
  • An effect in human hearing called summing localization enables humans to perceive simultaneously presented coherent audio signals as a virtual sound source between the original sources. If the coherence is lower, the signals may be perceived as separate audio objects or as a spatially extended auditory effect. Coherence can be measured with the interaural cross-correlation (IACC) value between the signals. When identical signals (an IACC value of one) are played from both headphones, humans perceive an auditory event in the center of the head. When non-identical signals (an IACC value of zero) are played from both headphones, one auditory event is perceived near each ear. When the IACC value is between one and zero, humans may perceive a spatially extended or spread auditory event inside the head, with the extent varying according to the IACC value.
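  • As an illustration of the IACC measure described above, the following minimal Python sketch (an illustrative assumption, not part of the patent) estimates coherence as the maximum normalised cross-correlation between the two headphone signals over lags of up to ±1 ms.

```python
# Illustrative sketch: IACC estimated as the maximum normalised cross-
# correlation between left and right signals over lags of up to +/-1 ms.
import numpy as np

def iacc(left: np.ndarray, right: np.ndarray, fs: int = 48000) -> float:
    """Interaural cross-correlation coefficient, roughly in [0, 1]."""
    max_lag = int(0.001 * fs)                       # +/-1 ms lag range
    left = left - np.mean(left)
    right = right - np.mean(right)
    denom = np.sqrt(np.sum(left**2) * np.sum(right**2))
    if denom == 0.0:
        return 0.0
    best = 0.0
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            num = np.sum(left[lag:] * right[:len(right) - lag])
        else:
            num = np.sum(left[:len(left) + lag] * right[-lag:])
        best = max(best, abs(num) / denom)
    return best

fs = 48000
noise = np.random.randn(fs)
print(iacc(noise, noise, fs))                       # identical -> close to 1
print(iacc(noise, np.random.randn(fs), fs))         # independent -> close to 0
```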
  • one approach is to divide the signal into non-overlapping frequency bands, and then present the frequency bands at distinct spatial positions around the listener.
  • the area from which the frequency bands are presented may be used to control the perceived spatial extent. Special care needs to be taken in how the frequency bands are distributed, so that no degradation in the timbre of the sound occurs and the sound is perceived as a single spatially extended source rather than as several sound objects.
  • the audio is split into frequency bands. These bands are then rendered from a number of different directions defined by the desired spatial extent.
  • the frequency bands are divided into the different directions using what is called a low-discrepancy sequence, e.g., a Halton sequence.
  • the sequence provides random-looking uniformly-distributed frequency-component sets for the different directions.
  • a filter which selects frequency components of the original signal based on the Halton sequence.
  • we have signals for the different directions that, ideally, have frequency content (shape) similar to the original signal, but do not contain frequency components in common with each other. This results in the sound being heard as having spatial extent.
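  • A minimal sketch of this idea follows; the Halton-based assignment rule shown here (scaling each sequence value and taking the floor) is one plausible reading for illustration, not the patent's exact filter design, though the disjoint-band property it demonstrates is the one described above.

```python
# Illustrative sketch (assumed detail): distributing STFT bins across N
# directions with a Halton sequence, so each direction receives a
# random-looking, roughly uniform and mutually disjoint set of bins.
import numpy as np

def halton(n: int, base: int = 2) -> np.ndarray:
    """First n values of the base-`base` Halton (van der Corput) sequence."""
    seq = np.zeros(n)
    for i in range(1, n + 1):
        f, value, k = 1.0, 0.0, i
        while k > 0:
            f /= base
            value += f * (k % base)
            k //= base
        seq[i - 1] = value
    return seq

def band_masks(n_bins: int, n_dirs: int) -> np.ndarray:
    """Boolean masks of shape (n_dirs, n_bins); each bin goes to one direction."""
    h = halton(n_bins)                     # values in [0, 1), low discrepancy
    assignment = np.floor(h * n_dirs).astype(int)
    return np.stack([assignment == d for d in range(n_dirs)])

masks = band_masks(n_bins=1025, n_dirs=9)
assert masks.sum(axis=0).max() == 1        # bins are disjoint across directions
print(masks.sum(axis=1))                   # roughly equal bin counts per direction
```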
  • the signal content to be spatially extended can sometimes be less than optimal for performing the effect.
  • Percussive or impulsive sounds lose the "impact" in their onsets, speech and vocal content can become less intelligible, and the timbre of "peaky" sounds can change in undesired ways. These additional effects are often undesired and should be avoided.
  • an apparatus for generating at least one audio signal associated with a sound scene configured to: receive at least one audio signal; analyse the at least one audio signal to determine at least one attribute parameter; determine at least one control signal based on the at least one attribute; generate a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
  • the apparatus configured to generate a spatially extended audio signal based on the at least one control signal may be further configured to apply a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal, such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
  • the apparatus configured to apply the spatially extending synthesis may be further configured to: control the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal; control the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and control the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
  • the apparatus configured to apply the spatially extending synthesis may be further configured to control the application of the spatially extending synthesis such that the extent of synthesis is the result of a look-up table taking as inputs an original spatial extent and the fraction of the mix which is the at least one audio signal.
  • the apparatus configured to apply a spatially extending synthesis may be configured to apply at least one of: a vector base amplitude panning to the at least one audio signal; direct binaural panning to the at least one audio signal; direct assignment to channel output location to the at least one audio signal; synthesized ambisonics to the at least one audio signal; and wavefield synthesis to the at least one audio signal.
  • the apparatus configured to apply a spatially extending synthesis to the at least one audio signal may be configured to: determine a spatial extent parameter; determine at least one position associated with the at least one audio signal; determine at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
  • the apparatus may be further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.
  • the apparatus configured to receive the at least one audio signal may be further configured to receive the audio signal from at least one microphone, and wherein the apparatus may be further configured to determine a position of the microphone relative to the apparatus.
  • the apparatus configured to combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene may be configured to: determine a weighting for the at least one audio signal based on the at least one control signal; and/or determine a weighting for the spatially extended audio signal based on the at least one control signal.
  • the apparatus may be configured to generate the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one and one half based on the at least one control signal.
  • the apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter may be configured to determine at least one of: a detection of voice activity within the at least one audio signal; a determination of peakiness within the at least one audio signal; and a determination of impulsiveness within the at least one audio signal.
  • the apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter may be configured to determine a combined attribute parameter value based on at least one of: a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal; a peakiness parameter value based on the determination of peakiness within the at least one audio signal; an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal; a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
  • the at least one audio signal may be at least one of: a monophonic source audio signal; a captured audio signal from a microphone; and a synthetic audio signal.
  • the apparatus configured to determine at least one control signal based on the at least one attribute may be configured to determine the at least one control signal further based on at least one user input.
  • a method for generating at least one audio signal associated with a sound scene comprising: receiving at least one audio signal; analysing the at least one audio signal to determine at least one attribute parameter; determining at least one control signal based on the at least one attribute; generating a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
  • Generating a spatially extended audio signal based on the at least one control signal may further comprise applying a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal, such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
  • Applying the spatially extending synthesis may further comprise: controlling the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal; controlling the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and controlling the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
  • Applying the spatially extending synthesis may further comprise controlling the application of the spatially extending synthesis such that the extent of synthesis is the result of a look-up table taking as inputs an original spatial extent and the fraction of the mix which is the at least one audio signal.
  • Applying a spatially extending synthesis may comprise applying at least one of: a vector base amplitude panning to the at least one audio signal; direct binaural panning to the at least one audio signal; direct assignment to channel output location to the at least one audio signal; synthesized ambisonics to the at least one audio signal; and wavefield synthesis to the at least one audio signal.
  • Applying a spatially extending synthesis to the at least one audio signal may comprise: determining a spatial extent parameter; determining at least one position associated with the at least one audio signal; determining at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
  • the method may further comprise generating panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.
  • Receiving the at least one audio signal may further comprise receiving the audio signal from at least one microphone, and wherein the method further comprises determining a position of the microphone relative to the apparatus.
  • Combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene may comprise: determining a weighting for the at least one audio signal based on the at least one control signal; and/or determining a weighting for the spatially extended audio signal based on the at least one control signal.
  • the method may further comprise generating the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one and one half based on the at least one control signal.
  • Analysing the at least one audio signal to determine at least one attribute parameter may comprise determining at least one of: a detection of voice activity within the at least one audio signal; a determination of peakiness within the at least one audio signal; and a determination of impulsiveness within the at least one audio signal.
  • Analysing the at least one audio signal to determine at least one attribute parameter may comprise determining a combined attribute parameter value based on at least one of: a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal; a peakiness parameter value based on the determination of peakiness within the at least one audio signal; an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal; a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
  • the at least one audio signal may be at least one of: a monophonic source audio signal; a captured audio signal from a microphone; and a synthetic audio signal.
  • Determining at least one control signal based on the at least one attribute may comprise determining the at least one control signal further based on at least one user input.
  • An apparatus may comprise means for performing the method described herein.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically an example known system for spatial audio mixing featuring original and spatial extended audio signals
  • FIG 2 shows schematically the spatial extent synthesizer shown in Figure 1 in further detail according to some embodiments
  • FIG 3 shows example uses of the system shown in Figure 1 according to some embodiments.
  • Figure 4 shows schematically an example device suitable for implementing the apparatus shown in Figures 1 , 2 and 3.
  • the following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio signal generation including the generation of volumetric virtual sound sources from the capture of audio signals.
  • the volumetric virtual sound sources being suitable for processing and/or mixing for generation of immersive sound scenes.
  • a conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.
  • the concept as discussed in detail hereafter is a system that detects whether an input or captured audio signal comprises any 'problematic' attributes (e.g., impulsiveness and peakiness).
  • the input signal can be a recorded or synthetic monophonic sound with any suitable content. If the input signal is determined to be problematic, the embodiments as described herein may be configured to modify parameters associated with a spatial extent effect and further be configured to mix the spatial extent processed audio signal with the original audio signal to avoid adverse effects to the timbre of the spatial extent processing being output. This approach preserves the timbre which is generally a desired action and one of the main factors used by audio professionals in determining whether they are willing to use an effect.
  • With respect to Figure 1 there is shown an example system for controlling the spatial extent processing of an input audio signal.
  • the system in some embodiments may comprise an audio signal input.
  • the audio signal input is a mono audio signal.
  • the mono audio signal may be one from a microphone such as an external microphone.
  • the external microphone may be any microphone external or separate to a microphone array (for example a Lavalier microphone) which may capture a spatial audio signal.
  • the external microphones can be worn/carried by persons or mounted as close-up microphones for instruments or a microphone in some relevant location which the designer wishes to capture accurately.
  • a Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth.
  • the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output).
  • the close microphone may be configured to output the captured audio signals to a mixer.
  • the external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
  • the positions of the external microphones, and thus of the performers and/or the instruments being played, may be tracked by using position tags located on or associated with the microphone source.
  • the external microphone comprises or is associated with a microphone position tag.
  • the microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be freely moved in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of the user or microphone location.
  • the close microphone position tag may be configured to output this signal to a position tracker.
  • the position tracker may, for example, employ high accuracy indoor positioning (HAIP).
  • the audio signal may be a stored audio signal or a synthetic (for example a generated or significantly processed audio signal).
  • the system comprises an attribute analyser 101.
  • the attribute analyser may be configured to receive the mono audio signal from the microphone.
  • the attribute analyser 101 may be configured to comprise one or more 'problematic' attribute determiners configured to determine whether the input audio signal comprises a 'problematic' component and be configured to output an attribute parameter value which may be passed to an attribute mapper 111.
  • the attribute analyser 101 may comprise a peakiness analyser 103.
  • the peakiness analyser 103 may be configured to determine whether the input audio signal comprises a peakiness component and the degree of peakiness that the input audio signal contains.
  • Peakiness is a signal feature that measures how "spiky" the time-domain signal is.
  • a peaky signal usually has many frequency components in phase and can also be described as audibly "buzzy". This timbral quality may be lost quickly where the phase of the audio signal is 'touched' or the signal is otherwise modified. Thus special care should be taken when processing audio signals comprising significant peakiness components. Peakiness and the related phase-alignment can be analyzed using a simple hearing model as shown, for example, in Laitinen et al., "Sensitivity of Human Hearing to Changes in Phase Spectrum", JAES 2013.
  • the hearing model converts the input audio signal into neural activation patterns.
  • a peaky signal, using this model, has high time alignment in activations, and modifying or processing the signal can reduce this alignment.
  • modification or processing of an audio signal with determined peakiness components can produce more perceptible effects than the same processing applied to an audio signal with less peakiness.
  • high peakiness is considered as a problematic attribute.
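  • The patent analyses peakiness with the hearing model referenced above; the sketch below instead uses the crest factor of short frames as a deliberately crude, illustrative stand-in for how "spiky" a time-domain signal is.

```python
# Crude illustrative proxy only: the patent analyses peakiness via a hearing
# model; here the mean crest factor (peak over RMS) of short frames stands in
# for how "spiky" the time-domain signal is.
import numpy as np

def crest_factor_peakiness(x: np.ndarray, frame: int = 1024) -> float:
    """Mean per-frame crest factor; higher means peakier."""
    cfs = []
    for i in range(len(x) // frame):
        seg = x[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(seg**2))
        if rms > 0.0:
            cfs.append(np.max(np.abs(seg)) / rms)
    return float(np.mean(cfs)) if cfs else 0.0

fs = 48000
spiky = np.zeros(fs)
spiky[::240] = 1.0                                    # pulse train: in-phase components
print(crest_factor_peakiness(spiky))                  # high (around 15)
print(crest_factor_peakiness(np.random.randn(fs)))    # low (around 3-4)
```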
  • the attribute analyser 101 may comprise an impulsiveness analyser 105.
  • the impulsiveness analyser 105 may be configured to determine whether the input audio signals comprises an impulsiveness component and the degree of impulsiveness that the input audio signal contains.
  • the impulsiveness analyser 105 in some embodiments may analyse the impulsiveness of the input audio signal with a two-window detector.
  • the analyser 105 may thus be configured to generate two running rectangular time domain energy windows with equal area but with different lengths.
  • the analyser comprises a first 'long' window of 50 time samples and a second 'short' window of 5 time samples.
  • the analyser may then be configured to calculate the energy (for example by determining a sum of squared sample values) for both windows.
  • the short window energy value may be scaled to match the long window energy value.
  • in some embodiments the short window time samples are multiplied by 10 before determining the energy value, or the output energy value is scaled similarly.
  • the resulting energy values can then be directly compared. Where the energy of the short window is significantly larger (such as defined by a threshold value) than the long window energy, the signal is very likely to have an impulse detected by the short window.
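  • A sketch of this two-window detector follows; the 50- and 5-sample window lengths and the equal-area scaling by 10 come from the text above, while the threshold value and the test signal are assumptions for illustration.

```python
# Sketch of the two-window detector described above: equal-area running energy
# windows of 50 and 5 samples (short window scaled by 10); the threshold and
# the test signal are illustrative assumptions.
import numpy as np

def detect_impulses(x: np.ndarray, long_len: int = 50, short_len: int = 5,
                    threshold: float = 8.0) -> np.ndarray:
    """Indices where the short-window energy clearly exceeds the long-window
    energy, which very likely indicates an impulse."""
    scale = long_len / short_len            # equal-area scaling (10 here)
    sq = x**2
    e_long = np.convolve(sq, np.ones(long_len), mode="same")
    e_short = np.convolve(sq, np.ones(short_len) * scale, mode="same")
    return np.flatnonzero(e_short > threshold * e_long)

x = 0.003 * np.random.randn(48000)
x[10000] = 1.0                              # inject a single impulse
print(detect_impulses(x))                   # indices clustered around 10000
```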
  • the attribute analyser 101 may comprise a speech or voice activity detector 107.
  • the speech detector 107 may be configured to determine whether the input audio signals comprise speech components. Speech within the input audio signal may be problematic as human hearing is tuned to listening for it. For example, processed audio signals comprising speech require special care otherwise they can sound very unnatural and disturbing.
  • the detector 107 comprises a signal classifier and corresponding outputs from this classifier may be mapped to parameters to control the mixing process so that the output audio signal sounds spatially extended but still natural.
  • the detector comprises a deep neural network trained to classify between speech and other signal types. This training may be performed, for example, using any of the above features or by analysing the signal spectrum directly.
  • a voice activity detector or speech/music discriminator may be implemented to complement or replace the classifier described above.
  • a Voice Activity Detector (VAD) may be configured to first perform a noise reduction, calculate some features or quantities from a section of the input signal, and then apply a classification rule to classify the section as speech or non-speech. In some embodiments this classification rule is based on determining whether a value exceeds a threshold.
  • the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).
  • Some VAD methods may formulate the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise.
  • the different measures which are used in VAD methods may include spectral slope, correlation coefficient, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
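  • As a generic illustration only (the patent does not prescribe a specific rule), the sketch below implements an energy-threshold VAD whose noise estimate is adapted during non-speech frames, i.e. the feedback operation mentioned above; the frame length, ratio and smoothing factor are assumed values.

```python
# Generic illustrative VAD (the patent does not mandate this exact rule):
# per-frame energy compared to a running noise estimate; the estimate is
# adapted during non-speech frames, the feedback operation mentioned above.
import numpy as np

def simple_vad(x: np.ndarray, fs: int = 16000, frame_ms: float = 20.0,
               ratio: float = 3.0, alpha: float = 0.95):
    frame = int(fs * frame_ms / 1000.0)
    noise_energy = None
    decisions = []
    for i in range(0, len(x) - frame + 1, frame):
        e = float(np.mean(x[i:i + frame]**2))
        if noise_energy is None:
            noise_energy = e                # bootstrap from the first frame
        speech = e > ratio * noise_energy
        if not speech:                      # feedback: track the noise floor
            noise_energy = alpha * noise_energy + (1.0 - alpha) * e
        decisions.append(speech)
    return decisions

fs = 16000
sig = 0.01 * np.random.randn(fs)            # 0.5 s noise, then noise + tone
sig[fs // 2:] += 0.2 * np.sin(2 * np.pi * 200 * np.arange(fs - fs // 2) / fs)
print(simple_vad(sig, fs))                  # False frames, then True frames
```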
  • the signal classifier may comprise a percussiveness detector.
  • the percussiveness detector may be configured to perform an analysis of percussiveness using, for example, the pulse metric characterisation described in "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" (Scheirer and Slaney), available from https://www.ee.columbia.edu/~dpwe/papers/ScheiS97-mussp.pdf.
  • the system may comprise an attribute mapper 111.
  • the attribute mapper 111 may be configured to receive the output from the attribute analyser 101 (and the one or more 'problematic' attribute determiners) and be configured to determine at least one mix/control parameter based on the determined at least one attribute parameter value from the analyser 101.
  • the attribute mapper comprises at least one attribute mapper element, where each of the elements is associated with a specific attribute analyser element and is configured to output separate mix/control parameters for each element.
  • the at least one mix/control parameter may be passed to a mixer controller/extent compensator 121.
  • the attribute mapper 111 comprises elements which map the outputs of the analyser elements/detectors to a range from 0 to 1.
  • the attribute mapper 111 comprises a peakiness mapper 113 configured to receive the output of the peakiness analyser 103 and map the output to a range between 0 and 1.
  • the peakiness mapper 113 may be configured to define the peakiness attribute parameter associated with an input audio signal to be between 0 and 1, with 1 being fully peaky, i.e., the above model has full time alignment in neural activations, and 0 being "unpeaky", i.e., the neural activations are completely misaligned in time.
  • the attribute mapper comprises an impulsiveness mapper 115.
  • the impulsiveness mapper 115 may be configured to count the number of detected impulses from the impulsiveness analyser 105 for a defined time frame of consecutive windows. This number can be used as the estimate of the impulsiveness of the signal. A high impulsiveness may be considered as a problematic attribute for the signal.
  • the analyser may be configured to define the impulsiveness parameter value to be a value between 0 and 1.
  • An impulsiveness parameter value of 0 signifies that there were no detected impulses in the signal within the defined time frame, and an impulsiveness parameter value of 1 signifies that there was the maximum allowed number of impulses (for example the 'maximum' number may be 20) in the input signal within the defined time frame.
  • This 'maximum' number limit may in some embodiments be user defined or may be a predefined value that is deemed as a good limit for impulses in a certain time frame.
  • An example time frame is 20 ms.
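  • The impulse-count mapping described above can be sketched as follows; the maximum of 20 impulses follows the example in the text, and the clipping behaviour for counts above the maximum is an assumption.

```python
# Sketch of the impulsiveness mapping described above: the impulse count in a
# time frame is clipped against a maximum (20 in the example) and normalised
# to an attribute value in [0, 1].
def map_impulsiveness(impulse_count: int, max_impulses: int = 20) -> float:
    """0 -> no impulses in the frame, 1 -> maximum allowed number (or more)."""
    return min(impulse_count, max_impulses) / max_impulses

print(map_impulsiveness(0))    # 0.0, not problematic
print(map_impulsiveness(10))   # 0.5
print(map_impulsiveness(35))   # 1.0, clipped at the maximum
```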
  • the attribute mapper may comprise a speech mapper 117.
  • the speech mapper may receive the output of the speech detector 107 and output a speech or voice attribute parameter with a value between 0 and 1.
  • as signal type is a relatively constant measure, each classified signal type has a constant value to which it is mapped. This ensures that an appropriate amount of later algorithm modification is performed for each signal type. For example, input audio signals with determined speech components could have a value of 0.8 and input audio signals with determined drum components may have a value of 1.
  • the system may comprise a mixer controller/extent compensator 121.
  • the mixer controller/extent compensator 121 may be configured to receive the outputs from the attribute mapper in the form of mix/control parameters and generate suitable mix controls for a mapper 131 and/or spatial extent compensation control for a spatial extent synthesizer 141 .
  • the mixer controller/extent compensator 121 may be configured to combine multiple mapped attribute values into one 'problematic' signal estimate. This combination may be the maximum value of all of the mapped attribute values, as this would be the maximum requirement for any spatial extent algorithm modification to avoid undesired changes in signal timbre.
  • the mixer controller/extent compensator 121 is configured to generate at least one control signal based on the determined mapped attribute values, and these control signals are used to control the operation of the spatial extent synthesizer and/or the mixer.
  • the spatial extent synthesizer operation may be controlled based on the attribute values and/or the mixer is similarly controlled based on the attribute values.
  • the system may comprise a spatial extent synthesizer 141.
  • the spatial extent synthesizer 141 is configured to receive the input audio signal and generate a spatially extended audio signal.
  • the spatial extent synthesizer 141 is configured to receive a further input from the mixer controller/extent compensator 121 .
  • the spatial extent synthesizer 141 is configured in some embodiments to output the spatially extended audio signal to a mixer 131.
  • the system comprises a mixer 131 (or an adaptive combiner).
  • the mixer 131 is configured to receive the input audio signal, the spatially extended audio signal and a mixer control input from the mixer controller/extent compensator 121 .
  • the mixer is configured to combine or mix the input audio signals and the spatially extended audio signal based on the mixer control input to generate a suitable output audio signal.
  • the mixer 131 is configured to mix the original input signal and spatially extended signal together. As the spatial extent effect does not cause delay, the signals can be mixed directly together.
  • the original signal may be mixed into the centre direction of the spatially extended signal, and this direction will be the one from which the sound is perceived to come.
  • based on the output of the mixer controller, which is derived from the attribute analyser and attribute mapper, the balance between the signals is varied. If the signal is not problematic, in other words the 'combined' attribute value is low and closer to 0, then little to none of the original signal is added to the mix.
  • as the signal becomes more problematic, the mixer is configured to raise the amount of the original signal in the mix so that it is equally loud (for example a 50/50 mix) compared to the spatially extended signal.
  • This mix control attempts to reduce perceivable problems and makes the timbre of the sound more similar to the original signal.
  • it may also reduce the perceived spatial extent of the combined signal and may require extent compensation in the extent synthesis module by modifying the parameters as described hereafter.
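  • A sketch of this mix control follows; it combines the mapped attribute values with the maximum rule described above and caps the original-signal share at an equal (50/50) mix. The function names are illustrative.

```python
# Sketch of the mix control described above: the combined 'problematic'
# estimate is the maximum of the mapped attribute values (each in [0, 1]),
# raising the original-signal share from 0 up to an equal 50/50 mix.
import numpy as np

def original_fraction(peakiness: float, impulsiveness: float,
                      speech: float) -> float:
    """Fraction of the original signal in the output mix, in [0, 0.5]."""
    problematic = max(peakiness, impulsiveness, speech)   # worst-case need
    return 0.5 * problematic

def mix(original: np.ndarray, extended_channel: np.ndarray,
        orig_frac: float) -> np.ndarray:
    # the extent effect is described as delay-free, so the signals can be
    # summed directly; the original is mixed toward the centre direction
    return orig_frac * original + (1.0 - orig_frac) * extended_channel

w = original_fraction(peakiness=0.9, impulsiveness=0.2, speech=0.0)
print(w)    # 0.45 -> nearly an equal mix for a very peaky signal
```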
  • the spatial extent synthesiser 141 receives the input audio signal and spatially extends the audio signal to a defined (for example 360 degree) spatial extent using methods for spatial extent control. In other words it takes as input a mono sound source audio signal and spatial extent parameters (width, height and depth).
  • the spatial extent synthesiser 141 comprises a suitable time to frequency domain transformer.
  • the spatial extent synthesiser 141 comprises a Short-Time Fourier Transform (STFT) 401 configured to receive the audio signal and output a suitable frequency domain output.
  • the input is a time-domain signal which is processed with hop-size of 512 samples.
  • a processing frame of 1024 samples is used, and it is formed from the current 512 samples and previous 512 samples.
  • the processing frame is zero-padded to twice its length (2048 samples) and Hann windowed.
  • the Fourier transform is calculated from the windowed frame producing the Short-Time Fourier Transform (STFT) output.
  • the STFT output is symmetric, thus it is sufficient to process the positive half (1024 samples plus the DC component, totalling 1025 samples).
  • any suitable time to frequency domain transform may be used.
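  • The framing described above can be sketched as follows (hop of 512 samples, a 1024-sample frame built from the current and previous hop, zero-padding to 2048 samples, Hann windowing, 1025 retained bins); the all-zero initial history is an assumption.

```python
# Sketch of the framing described above: hop size 512, a 1024-sample frame
# built from the current and previous 512 samples, zero-padded to 2048 and
# Hann windowed; rfft keeps the positive half incl. DC, i.e. 1025 bins.
import numpy as np

def stft_frames(x: np.ndarray, hop: int = 512) -> np.ndarray:
    window = np.hanning(2 * hop)                    # 1024-sample Hann window
    prev = np.zeros(hop)                            # assumed all-zero history
    spectra = []
    for i in range(0, len(x) - hop + 1, hop):
        cur = x[i:i + hop]
        frame = np.concatenate([prev, cur]) * window
        padded = np.concatenate([frame, np.zeros(2 * hop)])  # pad to 2048
        spectra.append(np.fft.rfft(padded))         # 1025 complex bins
        prev = cur
    return np.stack(spectra)

S = stft_frames(np.random.randn(48000))
print(S.shape)                                      # (n_frames, 1025)
```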
  • the spatial extent synthesiser 141 further comprises a filter bank 403.
  • the filter bank 403 is configured to receive the output of the STFT 401 and using a set of filters generated based on a Halton sequence (and with some default parameters) generate a number of frequency bands 405.
  • Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, appear to be random for many purposes.
  • the filter bank 403 comprises a set of 9 different distribution filters, which are used to create 9 different frequency domain signals where the signals do not contain overlapping frequency components. These signals are denoted Band 1 F 405₁ to Band 9 F 405₉ in Figure 2.
  • the filtering can be implemented in the frequency domain by multiplying the STFT output with stored filter coefficients for each band.
  • the spatial extent synthesiser 141 further comprises a spatial extent input 400.
  • the spatial extent input 400 may be configured to define the spatial extent of the audio signal.
  • the spatial extent synthesiser 141 may further comprise an object position input/determiner 402.
  • the object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by the sound object processor.
  • the spatial extent synthesiser 141 may further comprise a band position determiner 404.
  • the band position determiner 404 may be configured to receive the outputs from the object position input/determiner 402 and the spatial extent input 400 and from these generate an output passed to the vector base amplitude panning processor 406.
  • the spatial extent synthesiser 141 may further comprise a vector base amplitude panning (VBAP) processor 406.
  • the VBAP 406 may be configured to generate control signals to control the panning of the frequency domain signals to desired spatial positions. Given the spatial position of the sound source (azimuth, elevation) and the desired spatial extent for the source (width in degrees), the system calculates a spatial position for each frequency domain signal. For example, if the spatial position of the sound source is zero degrees azimuth (front), and the spatial extent 90 degrees, the VBAP may position the frequency bands at azimuths 45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75 and -45 degrees. Thus, we use a linear allocation of bands around the source position, with the span defined by the spatial extent.
  • the VBAP processor 406 may therefore be used to calculate a suitable gain for the signal, given the desired loudspeaker positions.
  • the VBAP processor 406 may provide gains for a signal such that it can be spatially positioned to a suitable position. These gains may be passed to a series of multipliers 407.
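  • The band placement and panning described above can be sketched as follows; the linear allocation of nine band directions across the extent follows the text, while the 2-D pairwise VBAP solver, the 4.0 loudspeaker azimuths and the counterclockwise azimuth convention are illustrative assumptions.

```python
# Illustrative sketch: linear allocation of band directions across the spatial
# extent (as described above) plus 2-D pairwise VBAP gains for a 4.0 layout.
# Azimuths are counterclockwise, front = 0 degrees (an assumed convention).
import numpy as np

def band_azimuths(source_az: float, extent: float, n_bands: int = 9) -> np.ndarray:
    """E.g. source at 0 deg with 90 deg extent -> 45, 33.75, ..., -45 deg."""
    return source_az + np.linspace(extent / 2.0, -extent / 2.0, n_bands)

def vbap_2d(az_deg: float, spk_deg=(45.0, -45.0, 135.0, -135.0)) -> np.ndarray:
    """Energy-normalised gains panning one direction between its speaker pair."""
    p = np.array([np.cos(np.radians(az_deg)), np.sin(np.radians(az_deg))])
    gains = np.zeros(len(spk_deg))
    order = np.argsort(spk_deg)                 # speaker indices by azimuth
    for k in range(len(order)):
        i, j = order[k], order[(k + 1) % len(order)]
        L = np.array([
            [np.cos(np.radians(spk_deg[i])), np.cos(np.radians(spk_deg[j]))],
            [np.sin(np.radians(spk_deg[i])), np.sin(np.radians(spk_deg[j]))],
        ])
        g = np.linalg.solve(L, p)
        if np.all(g >= -1e-9):                  # direction lies between the pair
            gains[[i, j]] = g / np.linalg.norm(g)
            break
    return gains

for az in band_azimuths(source_az=0.0, extent=90.0):
    print(round(az, 2), np.round(vbap_2d(az), 3))
```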
  • the spatial extent synthesiser or spatially extending controller may be implementation agnostic, and any suitable implementation may be used to generate the spatially extending control.
  • the spatially extending control may implement direct binaural panning (using head-related transfer function filters for the directions), direct assignment to the output channel locations (for example direct assignment to the loudspeakers without using any panning), synthesized ambisonics, and wave-field synthesis.
  • the spatial extent synthesiser 141 may further comprise a series of multipliers 407.
  • the series of multipliers comprises multipliers 407₁ to 407₉; however any suitable number of multipliers may be used.
  • Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
  • the products of the VBAP gains and each frequency band signal may be passed to a series of output channel sum devices 409.
  • the spatial extent synthesiser 141 may further comprise a series of sum devices 409.
  • the sum devices 409 may receive the outputs from the multipliers and combine them to generate an output channel band signal 411.
  • a 4.0 loudspeaker format output is implemented with outputs for front left (Band FL F 411₁), front right (Band FR F 411₂), rear left (Band RL F 411₃), and rear right (Band RR F 411₄) channels, which are generated by sum devices 409₁, 409₂, 409₃ and 409₄ respectively.
  • other loudspeaker formats or number of channels can be supported.
  • other panning methods, such as panning laws, may be used, or the signals could be assigned directly to the closest loudspeakers.
  • the spatial extent synthesiser 141 may further comprise a series of inverse Short-Time Fourier Transforms (ISTFT) 413.
  • in Figure 2 there is an ISTFT 413₁ associated with the FL signal, an ISTFT 413₂ associated with the FR signal, an ISTFT 413₃ associated with the RL signal and an ISTFT 413₄ associated with the RR signal.
  • the mixer controller/extent compensator may be configured to compensate for the narrowing by controlling the spatial extent synthesiser 141 to produce a wider or compensated spatially extended audio signal.
  • This increase in extent may for example be performed in some embodiments by increasing the extent directly from the mixing value, so that where there is 0% original signal there is no increase in extent, and where the mix comprises 50% of the original audio signal the spatial extent synthesiser is always controlled to produce a full 360° extent spatially extended audio signal.
  • in some embodiments the increase is a linear interpolation between these values.
  • the spatial extent synthesiser 141 is configured to apply a predefined lookup table based on the current extent that has been perceptually evaluated to be correct. This lookup works so that there is a defined modified extent value for each pair of original extent and current mixing value.
  • a predefined extent value is applied to the spatial extent synthesiser associated with a specific signal type.
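  • A sketch of this compensation follows; the linear rule (no increase at 0% original signal, full 360 degrees at a 50% mix) follows the text above, while the lookup-table entries shown are invented placeholders.

```python
# Sketch of the compensation described above: extent unmodified at 0 %
# original signal, widening linearly to a full 360-degree extent at a 50 %
# original mix; the lookup-table entries are invented placeholders.
import numpy as np

def compensated_extent(original_extent: float, orig_frac: float) -> float:
    """orig_frac is the original-signal fraction of the mix, in [0, 0.5]."""
    t = float(np.clip(orig_frac / 0.5, 0.0, 1.0))
    return (1.0 - t) * original_extent + t * 360.0     # linear interpolation

# alternative: a perceptually evaluated lookup table indexed by pairs of
# (original extent, mixing value); these entries are placeholders only
LOOKUP = {(90.0, 0.25): 200.0, (90.0, 0.5): 360.0}

print(compensated_extent(90.0, 0.0))     # 90.0  -> unmodified
print(compensated_extent(90.0, 0.25))    # 225.0 -> widened
print(compensated_extent(90.0, 0.5))     # 360.0 -> fully surrounding
```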
  • Examples of this extending of the spatial extent processing are shown in Figure 3.
  • a normal spatial extent synthesis can be performed as shown in Figure 3 by the extended arc 213.
  • the output of the spatial extent synthesis may be configured to extend the spatial extension, as shown by the further extended arc 215, with the direction arrow 217 representing the original direction.
  • the spatial extent synthesis may be configured to extend the spatial extension to produce a 360 degree spatially extended audio signal as represented by the full circle 225 and the direction arrow representing the original signal.
  • Example use cases of the system shown in Figures 1 to 3, produced when a user is operating the system, are described hereafter. The user may be an audio professional who can aurally detect changes in a sound signal when it is run through an audio effect; these changes may not necessarily be perceived by a non-professional user.
  • the user may want to create a spatial extent for a trumpet track. This signal is detected as peaky. If it went through the system unmodified, it would become less sharp, which the user does not want. They therefore decide to apply the mixer so that 20% of the original signal is mixed into the output to better retain the qualities of the trumpet.
  • a further example could be where a user wants to create a surrounding spatial extent for a drumming track with congas. These sounds are detected to be clearly impulsive and would lose their impact if processed normally. Instead, the user uses the proposed system, adds 50% of the original signal into the output mix and makes the extent part completely surrounding. The resulting perceived extent is not completely surrounding but is still very wide.
  • the device may be any suitable electronics device or apparatus.
  • the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1200 may comprise a microphone 1201.
  • the microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones.
  • the microphone 1201 is separate from the apparatus and the audio signal transmitted to the apparatus by a wired or wireless coupling.
  • the microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
  • the microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals.
  • the microphones can be solid state microphones. In other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal.
  • the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
  • the microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
  • the device 1200 may further comprise an analogue-to-digital converter 1203.
  • the analogue-to-digital converter 1203 may be configured to receive the audio signals from each of the microphones 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required.
  • the analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means.
  • the analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
  • the device 1200 comprises at least one processor or central processing unit 1207.
  • the processor 1207 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1200 comprises a memory 1211.
  • the at least one processor 1207 is coupled to the memory 1211.
  • the memory 1211 can be any suitable storage means.
  • the memory 1211 comprises a program code section for storing program codes implementable upon the processor 1207.
  • the memory 1211 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1207 whenever needed via the memory-processor coupling.
  • the device 1200 comprises a user interface 1205.
  • the user interface 1205 can be coupled in some embodiments to the processor 1207.
  • the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205.
  • the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad.
  • the user interface 1205 can enable the user to obtain information from the device 1200.
  • the user interface 1205 may comprise a display configured to display information from the device 1200 to the user.
  • the user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200.
  • the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
  • the device 1200 comprises a transceiver 1209.
  • the transceiver 1209 in such embodiments can be coupled to the processor 1207 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver 1209 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver 1209 may be configured to communicate with the renderer as described herein.
  • the transceiver 1209 can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the device 1200 may be employed as at least part of the renderer.
  • the transceiver 1209 may be configured to receive the audio signals and positional information from the microphone/close microphones/position determiner as described herein, and generate a suitable audio signal rendering by using the processor 1207 executing suitable code.
  • the device 1200 may comprise a digital-to-analogue converter 1213.
  • the digital-to-analogue converter 1213 may be coupled to the processor 1207 and/or memory 1211 and be configured to convert digital representations of audio signals (such as from the processor 1207 following an audio rendering of the audio signals as described herein) to an analogue format suitable for presentation via an audio subsystem output.
  • the digital-to-analogue converter (DAC) 1213 or signal processing means can in some embodiments be any suitable DAC technology.
  • the device 1200 can comprise in some embodiments an audio subsystem output 1215.
  • An example as shown in Figure 11 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121.
  • the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output.
  • the audio subsystem output 1215 may be a connection to a multichannel speaker system.
  • the digital-to-analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device.
  • the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
  • though the device 1200 is shown having audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of these elements.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Abstract

An apparatus for generating at least one audio signal associated with a sound scene, the apparatus configured to: receive at least one audio signal; analyse the at least one audio signal to determine at least one attribute parameter; determine at least one control signal based on the at least one attribute; generate a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.

Description

AMBIENCE GENERATION FOR SPATIAL AUDIO MIXING FEATURING USE OF
ORIGINAL AND EXTENDED SIGNAL
Field
The present application relates to apparatus and methods for ambience generation for spatial audio mixing featuring use of original and extended signal.
Background
Capture of audio signals from multiple sources and mixing of audio signals when these sources are moving in the spatial field requires significant effort. For example the capture and mixing of an audio signal source such as a speaker or artist within an audio environment such as a theatre or lecture hall to be presented to a listener and produce an effective audio atmosphere requires significant investment in equipment and training.
A commonly implemented system is where one or more 'external' microphones, for example a Lavalier microphone worn by the user or an audio channel associated with an instrument, is mixed with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction. This system is known in some areas as Spatial Audio Mixing (SAM).
The SAM system enables the creation of immersive sound scenes comprising "background spatial audio" or ambience and sound objects for Virtual Reality (VR) applications. Often, the scene can be designed such that the overall spatial audio of the scene, such as a concert venue, is captured with a microphone array (such as one contained in the OZO virtual camera) and the most important sources are captured using the 'external' microphones.
One of the aspects of SAM system is the generation and use of volumetric virtual sound sources. The term volumetric virtual sound source refers to a virtual sound source with a spatial volume, whereas a point-like virtual source is perceived from a single point in space. Volumetric virtual sound sources are useful in various applications including virtual and augmented reality and computer gaming. They enable creative opportunities for sound engineers and facilitate more realistic representation of sounds with a natural size, such as large sound-emitting objects. Consider, for example, a fountain, sea, or a large machine. Such volumetric virtual sound sources are discussed in Pihlajamaki et al. Synthesis of Spatially Extended Virtual Sources with Time-Frequency Decomposition of Mono Signals, JAES 2014.
The creation of volumetric virtual sound sources can be implemented by creating sounds with perceived spatial extents, since humans are poor at perceiving the distance of sounds. A sound with a perceived spatial extent may surround the listener or it may have a specific width.
An effect in human hearing called summing localization enables humans to perceive simultaneously presented coherent audio signals as a virtual sound source located between the original sources. If the coherence is lower, the signals may be perceived as separate audio objects or as a spatially extended auditory effect. Coherence can be measured with the interaural cross-correlation (IACC) value between signals. When identical signals, where the IACC value equals one, are played from both headphones, humans perceive an auditory event in the centre of the head. When non-identical signals, where the IACC value equals zero, are played from both headphones, one auditory event is perceived near each ear. When the IACC value is between one and zero, humans may perceive a spatially extended or spread auditory event inside the head, with the extent varying according to the IACC value.
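As an illustration of the IACC measure described above, the following is a minimal Python sketch computing a normalised interaural cross-correlation over a conventional +/-1 ms lag range. The sample rate, the lag range and the omission of the usual critical-band filtering and windowing are simplifying assumptions, not details taken from this application.

    import numpy as np

    def iacc(left, right, fs=48000):
        # Remove DC and compute the normalisation term.
        left = left - np.mean(left)
        right = right - np.mean(right)
        denom = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2))
        if denom == 0.0:
            return 0.0
        # Search lags of +/- 1 ms, a common choice for IACC.
        max_lag = int(0.001 * fs)
        corr = []
        for lag in range(-max_lag, max_lag + 1):
            a = left[max(0, -lag):len(left) - max(0, lag)]
            b = right[max(0, lag):len(right) - max(0, -lag)]
            corr.append(np.sum(a * b))
        return float(np.max(np.abs(corr)) / denom)

Identical ear signals give a value of one (an auditory event in the centre of the head) and uncorrelated signals a value near zero (one event near each ear), consistent with the behaviour described above.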
To synthesize a sound source with a perceived spatial extent, one approach is to divide the signal into non-overlapping frequency bands, and then present the frequency bands at distinct spatial positions around the listener. The area from which the frequency bands are presented may be used to control the perceived spatial extent. Special care needs to be taken over how to distribute the frequency bands, such that no degradation in the timbre of the sound occurs and that the sound is perceived as a single spatially extended source rather than as several sound objects.
When spatially extending a sound source the audio is split into frequency bands. These bands are then rendered from a number of different directions defined by the desired spatial extent. The frequency bands are divided into the different directions using what is called a low-discrepancy sequence, e.g., a Halton sequence. The sequence provides random-looking, uniformly-distributed frequency-component sets for the different directions. Thus, for each direction there is a filter which selects frequency components of the original signal based on the Halton sequence. Using these filters, the signals for the different directions ideally have similar frequency content (shape) to the original signal, but do not contain common frequency components with each other. This results in the sound being heard as having spatial extent.
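The band allocation can be illustrated with a short Python sketch. The function names and the direct floor-of-Halton-value assignment below are illustrative assumptions; the actual filter design may differ.

    import numpy as np

    def halton(n, base=2):
        # First n values of the base-`base` Halton low-discrepancy sequence.
        seq = np.zeros(n)
        for i in range(1, n + 1):
            f, value, x = 1.0, 0.0, i
            while x > 0:
                f /= base
                value += f * (x % base)
                x //= base
            seq[i - 1] = value
        return seq

    def band_masks(num_bins, num_directions=9):
        # Split `num_bins` STFT bins into `num_directions` disjoint sets.
        h = halton(num_bins)  # deterministic but random-looking, in [0, 1)
        direction = np.floor(h * num_directions).astype(int)
        return [direction == d for d in range(num_directions)]

Each mask selects the frequency components for one direction; the masks are disjoint and together cover every bin, so the per-direction spectra have a similar overall shape to the original signal while sharing no common components.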
However, the signal content to be spatially extended can sometimes be less than optimal for the effect. Percussive or impulsive sounds lose the "impact" in their onsets, speech and vocal content can become less intelligible, and the timbre of "peaky" sounds can change in undesirable ways. These side effects are often undesired and should be avoided.
Summary
There is provided according to a first aspect an apparatus for generating at least one audio signal associated with a sound scene, the apparatus configured to: receive at least one audio signal; analyse the at least one audio signal to determine at least one attribute parameter; determine at least one control signal based on the at least one attribute; generate a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
The apparatus configured to generate a spatially extended audio signal based on the at least one control signal may be further configured to apply a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal, such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
The apparatus configured to apply the spatially extending synthesis may be further configured to: control the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal; control the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and control the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
The apparatus configured to apply the spatially extending synthesis may be further configured to control the application of the spatially extending synthesis such that the extent of synthesis is a result of a look up table using as an input an original spatial extent input and a fraction of the mix which is the at least one audio signal.
The apparatus configured to apply a spatially extending synthesis may be configured to apply at least one of: a vector base amplitude panning to the at least one audio signal; direct binaural panning to the at least one audio signal; direct assignment to channel output location to the at least one audio signal; synthesized ambisonics to the at least one audio signal; and wavefield synthesis to the at least one audio signal.
The apparatus configured to apply a spatially extending synthesis to the at least one audio signal may be configured to: determine a spatial extent parameter; determine at least one position associated with the at least one audio signal; determine at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
The apparatus may be further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.
The apparatus configured to receive the at least one audio signal may be further configured to receive the audio signal from at least one microphone, and wherein the apparatus may be further configured to determine a position of the microphone relative to the apparatus.
The apparatus configured to combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene may be configured to: determine a weighting for the at least one audio signal based on the at least one control signal; and/or determine a weighting for the spatially extended audio signal based on the at least one control signal.
The apparatus may be configured to generate the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one and one half based on the at least one control signal.

The apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter may be configured to determine at least one of: a detection of voice activity within the at least one audio signal; a determination of peakiness within the at least one audio signal; and a determination of impulsiveness within the at least one audio signal.
The apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter may be configured to determine a combined attribute parameter value based on at least one of: a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal; a peakiness parameter value based on the determination of peakiness within the at least one audio signal; an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal; a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
The at least one audio signal may be at least one of: a monophonic source audio signal; a captured audio signal from a microphone; and a synthetic audio signal.
The apparatus configured to determine at least one control signal based on the at least one attribute may be configured to determine the at least one control signal further based on at least one user input.
According to a second aspect there is provided a method for generating at least one audio signal associated with a sound scene, the method comprising: receiving at least one audio signal; analysing the at least one audio signal to determine at least one attribute parameter; determining at least one control signal based on the at least one attribute; generating a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
Generating a spatially extended audio signal based on the at least one control signal may further comprise applying a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal, such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
Applying the spatially extending synthesis may further comprise: controlling the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal; controlling the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and controlling the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
Applying the spatially extending synthesis may further comprise controlling the application of the spatially extending synthesis such that the extent of synthesis is a result of a look up table using as an input an original spatial extent input and a fraction of the mix which is the at least one audio signal.
Applying a spatially extending synthesis may comprise applying at least one of: a vector base amplitude panning to the at least one audio signal; direct binaural panning to the at least one audio signal; direct assignment to channel output location to the at least one audio signal; synthesized ambisonics to the at least one audio signal; and wavefield synthesis to the at least one audio signal.
Applying a spatially extending synthesis to the at least one audio signal may comprise: determining a spatial extent parameter; determining at least one position associated with the at least one audio signal; determining at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
The method may further comprise generating panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.

Receiving the at least one audio signal may further comprise receiving the audio signal from at least one microphone, and the method may further comprise determining a position of the microphone relative to the apparatus.
Combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene may comprise: determining a weighting for the at least one audio signal based on the at least one control signal; and/or determining a weighting for the spatially extended audio signal based on the at least one control signal.
The method may further comprise generating the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one and one half based on the at least one control signal.
Analysing the at least one audio signal to determine at least one attribute parameter may comprise determining at least one of: a detection of voice activity within the at least one audio signal; a determination of peakiness within the at least one audio signal; and a determination of impulsiveness within the at least one audio signal.
Analysing the at least one audio signal to determine at least one attribute parameter may comprise determining a combined attribute parameter value based on at least one of: a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal; a peakiness parameter value based on the determination of peakiness within the at least one audio signal; an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal; a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
The at least one audio signal may be at least one of: a monophonic source audio signal; a captured audio signal from a microphone; and a synthetic audio signal.
Determining at least one control signal based on the at least one attribute may comprise determining the at least one control signal further based on at least one user input.

An apparatus may comprise means for performing the method described herein.

A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example known system for spatial audio mixing featuring original and spatial extended audio signals;
Figure 2 shows schematically the spatial extent synthesizer shown in Figure 1 in further detail according to some embodiments;
Figure 3 shows example uses of the system shown in Figure 1 according to some embodiments; and
Figure 4 shows schematically an example device suitable for implementing the apparatus shown in Figures 1, 2 and 3.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective audio signal generation including the generation of volumetric virtual sound sources from the capture of audio signals. The volumetric virtual sound sources are suitable for processing and/or mixing for the generation of immersive sound scenes.
A conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.

The concept as discussed in detail hereafter is a system that detects whether an input or captured audio signal comprises any 'problematic' attributes (e.g., impulsiveness and peakiness). The input signal can be a recorded or synthetic monophonic sound with any suitable content. If the input signal is determined to be problematic, the embodiments as described herein may be configured to modify parameters associated with a spatial extent effect and further be configured to mix the spatial extent processed audio signal with the original audio signal to avoid adverse effects on the timbre of the spatial extent processing being output. This approach preserves the timbre, which is generally a desired action and one of the main factors used by audio professionals in determining whether they are willing to use an effect.
With respect to Figure 1 is shown an example system for controlling a spatial extent processing of an input audio signal.
The system in some embodiments may comprise an audio signal input. In the example shown in Figure 1 the audio signal input is a mono audio signal. The mono audio signal may be one from a microphone such as an external microphone. The external microphone may be any microphone external or separate to a microphone array which may capture a spatial audio signal (for example a Lavalier microphone). Thus the concept is applicable to any external/additional microphones, be they Lavalier microphones, hand-held microphones or other mounted microphones. The external microphones can be worn/carried by persons, mounted as close-up microphones for instruments, or placed in some relevant location which the designer wishes to capture accurately. A Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth. For other sound sources, such as musical instruments, the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output). In some embodiments the close microphone may be configured to output the captured audio signals to a mixer. The external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
In some embodiments the positions of the external microphone sources, and thus of the performers and/or the instruments being played, may be tracked by using position tags located on or associated with the microphone source. Thus for example the external microphone comprises or is associated with a microphone position tag. The microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be moved freely in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of the user or microphone location. The close microphone position tag may be configured to output this signal to a position tracker. Although the following examples show the use of the HAIP (high accuracy indoor positioning) radio frequency signal to determine the location of the close microphones, it is understood that any suitable position estimation system may be used (for example satellite-based position estimation systems, inertial position estimation, beacon based position estimation etc.).
Although in the following example the mono audio signal is determined from an external microphone, the audio signal may be a stored audio signal or a synthetic (for example a generated or significantly processed audio signal).
In some embodiments the system comprises an attribute analyser 101. The attribute analyser may be configured to receive the mono audio signal from the microphone. The attribute analyser 101 may be configured to comprise one or more 'problematic' attribute determiners configured to determine whether the input audio signal comprises a 'problematic' component and be configured to output an attribute parameter value which may be passed to an attribute mapper 111.
Thus for example as shown in Figure 1, the attribute analyser 101 may comprise a peakiness analyser 103. The peakiness analyser 103 may be configured to determine whether the input audio signal comprises a peakiness component and the degree of peakiness that the input audio signal contains. Peakiness is a signal feature that measures how "spiky" the time domain signal is. A peaky signal usually has many frequency components in phase and can also be described as audibly "buzzy". This timbral quality may be lost quickly where the phase of the audio signal is 'touched' or the signal is otherwise modified. Thus special care should be taken when processing audio signals comprising significant peakiness components. Peakiness and the related phase-alignment can be analysed using a simple hearing model as shown, for example, in Laitinen et al., "Sensitivity of Human Hearing to Changes in Phase Spectrum", JAES 2013.
The hearing model converts the input audio signal into neural activation patterns. A peaky signal, using this model, has a high time alignment in activations, and modifying or processing the signal can reduce this alignment. Thus, modifying or processing an audio signal with determined peakiness components can produce more perceptible effects than processing an audio signal with less peakiness. Thus, high peakiness is considered a problematic attribute.
Furthermore as shown in Figure 1, the attribute analyser 101 may comprise an impulsiveness analyser 105. The impulsiveness analyser 105 may be configured to determine whether the input audio signal comprises an impulsiveness component and the degree of impulsiveness that the input audio signal contains.
The impulsiveness analyser 105 in some embodiments may analyse the impulsiveness of the input audio signal with a two-window detector. The analyser 105 may thus be configured to generate two running rectangular time domain energy windows with equal area but with different lengths. For example in some embodiments the analyser comprises a first 'long' window of 50 time samples and a second 'short' window of 5 time samples. The analyser may then be configured to calculate the energy (for example by determining a sum of squared sample values) for both windows. The short window energy value may be scaled to match the long window energy value. For example in some embodiments the time samples are multiplied by 10 before determining the energy value, or the output energy value is scaled similarly. The resulting energy values can then be directly compared. Where the energy of the short window is significantly larger (as defined by a threshold value) than the long window energy, the signal is very likely to have an impulse detected by the short window.
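A minimal Python sketch of such a two-window detector is shown below; the detection threshold value is an assumption, as the text above does not specify one.

    import numpy as np

    def detect_impulses(x, long_len=50, short_len=5, threshold=2.0):
        # Running energies of a long (50-sample) and short (5-sample)
        # rectangular window over the squared signal.
        x2 = np.asarray(x, dtype=float) ** 2
        long_e = np.convolve(x2, np.ones(long_len), mode="same")
        short_e = np.convolve(x2, np.ones(short_len), mode="same")
        # Scale the short window energy by 10 so the equal-area windows
        # give comparable values, as described above.
        short_e *= long_len / short_len
        # Flag samples where the short window energy is significantly
        # larger than the long window energy.
        return short_e > threshold * long_e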
Also as shown in Figure 1, the attribute analyser 101 may comprise a speech or voice activity detector 107. The speech detector 107 may be configured to determine whether the input audio signal comprises speech components. Speech within the input audio signal may be problematic as human hearing is tuned to listening for it. For example, processed audio signals comprising speech require special care, otherwise they can sound very unnatural and disturbing. In some embodiments the detector 107 comprises a signal classifier, and the corresponding outputs from this classifier may be mapped to parameters to control the mixing process so that the output audio signal sounds spatially extended but still natural. For example in some embodiments the detector comprises a deep neural network trained to classify between speech and other signal types. This training may be performed, for example, using any of the above features or by signal spectrum analysis directly.

In some embodiments a voice activity detector or speech/music discriminator may be implemented to complement or replace the classifier described above. Thus in some embodiments a Voice Activity Detector (VAD) may be configured to first perform a noise reduction, calculate some features or quantities from a section of the input signal, and then apply a classification rule to classify the section as speech or non-speech. In some embodiments this classification rule is based on determining whether a value exceeds a threshold. In some embodiments there may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot). Some VAD methods may formulate the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise. The different measures which are used in VAD methods may include spectral slope, correlation coefficient, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.

In some embodiments the signal classifier may comprise a percussiveness detector. The percussiveness detector may be configured to perform an analysis of percussiveness using, for example, the pulse-metric characterization described in "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" (pulse-metric feature), available from https://www.ee.columbia.edu/~dpwe/papers/ScheiS97-mussp.pdf.
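As a toy stand-in for the frame-based decision rules mentioned above (a real VAD would use noise reduction, richer features and adaptive thresholds), the following sketch classifies frames by comparing their energy to a crude noise-floor estimate; every constant in it is an assumption.

    import numpy as np

    def simple_vad(x, fs=48000, frame_ms=20, energy_factor=3.0):
        frame = int(fs * frame_ms / 1000)
        frames = np.asarray(x, dtype=float)[:len(x) // frame * frame]
        frames = frames.reshape(-1, frame)
        energies = np.mean(frames ** 2, axis=1)
        # Use a low percentile of the frame energies as a noise floor.
        noise_floor = np.percentile(energies, 10)
        # A frame is classified as speech when its energy clearly
        # exceeds the estimated noise floor.
        return energies > energy_factor * noise_floor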
In some embodiments the system may comprise an attribute mapper 111. The attribute mapper 111 may be configured to receive the output from the attribute analyser 101 (and the one or more 'problematic' attribute determiners) and be configured to determine at least one mix/control parameter based on the determined at least one attribute parameter value from the analyser 101. In some embodiments the attribute mapper comprises at least one attribute mapper element, where each of the elements is associated with a specific attribute analyser element and is configured to output separate mix/control parameters for each element. The at least one mix/control parameter may be passed to a mixer controller/extent compensator 121.
Thus for example in some embodiments the attribute mapper 111 comprises elements which map the output of the analyser elements/detectors to a range from 0 to 1. For example the attribute mapper 111 comprises a peakiness mapper 113 configured to receive the output of the peakiness analyser 103 and map the output to a range between 0 and 1. Thus the peakiness mapper 113 may be configured to define the peakiness attribute parameter associated with an input audio signal to be between 0 and 1, with 1 being fully peaky, i.e., the above model has full time alignment in neural activations, and 0 being "unpeaky", i.e., the neural activations are completely misaligned in time.
Similarly the attribute mapper comprises an impulsiveness mapper 115. The impulsiveness mapper 115 may be configured to count the number of detected impulses from the impulsiveness analyser 105 for a defined time frame of consecutive windows. This number can be used as the estimate of the impulsiveness of the signal. High impulsiveness may be considered a problematic attribute for the signal. In some embodiments the analyser may be configured to define the impulsiveness parameter value to be a value between 0 and 1. An impulsiveness parameter value of 0 signifies that there were no detected impulses in the signal within the defined time frame, and an impulsiveness parameter value of 1 signifies that there was the maximum allowed number of impulses (for example the 'maximum' number may be 20) in the input signal within the defined time frame. This 'maximum' number limit may in some embodiments be user defined or may be a predefined value that is deemed a good limit for impulses in a certain time frame. An example time frame is 20 ms.
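Continuing the sketch from the impulse detector above, the mapping to a 0..1 impulsiveness value over 20 ms frames, with the maximum count of 20 from the example above, could look as follows; the frame length default and the edge-counting detail are illustrative choices.

    import numpy as np

    def impulsiveness(flags, fs=48000, frame_ms=20, max_impulses=20):
        frame = int(fs * frame_ms / 1000)
        values = []
        for i in range(len(flags) // frame):
            f = np.asarray(flags[i * frame:(i + 1) * frame], dtype=int)
            # Count rising edges so each impulse is counted once,
            # not once per flagged sample.
            count = int(np.sum(np.diff(f) == 1))
            values.append(min(count, max_impulses) / max_impulses)
        return values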
Also the attribute mapper may comprise a speech mapper 117. The speech mapper may receive the output of the speech detector 107 and output a speech or voice attribute parameter with a value between 0 and 1. In some embodiments, as the signal type is a relatively constant measure, each classified signal type has a constant value to which it is mapped. This ensures that an appropriate amount of later algorithm modification is performed for each signal type. For example, input audio signals with determined speech components could have a value of 0.8 and input audio signals with determined drum components may have a value of 1.
In some embodiments the system may comprise a mixer controller/extent compensator 121. The mixer controller/extent compensator 121 may be configured to receive the outputs from the attribute mapper in the form of mix/control parameters and generate suitable mix controls for a mixer 131 and/or a spatial extent compensation control for a spatial extent synthesizer 141. In some embodiments the mixer controller/extent compensator 121 may be configured to combine multiple mapped attribute values into one 'problematic' signal estimate. This combination may be the maximum value of all of the mapped attribute values, as this would be the maximum requirement for any spatial extent algorithm modification to avoid undesired changes in signal timbre. The mixer controller/extent compensator 121 is configured to generate at least one control signal based on the determined mapped attribute values, and these control signals are used to control the operation of the spatial extent synthesizer and/or the mixer. In other words the spatial extent synthesizer operation may be controlled based on the attribute values and/or the mixer is similarly controlled based on the attribute values.
In some embodiments the system may comprise a spatial extent synthesizer 141. The spatial extent synthesizer 141 is configured to receive the input audio signal and generate a spatially extended audio signal. In some embodiments the spatial extent synthesizer 141 is configured to receive a further input from the mixer controller/extent compensator 121. The spatial extent synthesizer 141 is configured in some embodiments to output the spatially extended audio signal to a mixer 131.
In some embodiments the system comprises a mixer 131 (or an adaptive combiner). The mixer 131 is configured to receive the input audio signal, the spatially extended audio signal and a mixer control input from the mixer controller/extent compensator 121 . The mixer is configured to combine or mix the input audio signals and the spatially extended audio signal based on the mixer control input to generate a suitable output audio signal.
The mixer 131 is configured to mix the original input signal and the spatially extended signal together. As the spatial extent effect does not cause delay, the signals can be mixed directly together. The original signal may be mixed into the centre direction of the spatially extended signal, and this direction will be the one from which the sound is perceived to come. Based on the output of the mixer controller (which is based on the attribute analyser and attribute mapper) the balance between the signals is varied. If the signal is not problematic, in other words the 'combined' attribute value is low and closer to 0, then little to none of the original signal is added to the mix. On the other hand, if the signal is very problematic, in other words the 'combined' attribute value is high and closer to 1, the mixer is configured to raise the amount of original signal in the mix so that it is equally loud (for example a 50/50 mix) compared to the spatially extended signal. This mix control attempts to reduce perceivable problems and makes the timbre of the sound more similar to the original signal. However, it may also reduce the perceived spatial extent of the combined signal and may require extent compensation in the extent synthesis module by modifying the parameters as described hereafter.
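The mixing behaviour described above can be sketched as follows. The linear mapping from the combined attribute value to the weights, and the identification of a single 'centre' channel index, are assumptions consistent with the description rather than a definitive implementation.

    import numpy as np

    def mix_weights(problem_value):
        # 0 -> purely extended output; 1 -> an equal 50/50 mix.
        w_orig = 0.5 * float(np.clip(problem_value, 0.0, 1.0))
        return w_orig, 1.0 - w_orig

    def mix(original, extended, problem_value, centre_channel=0):
        # `extended` is a (channels x samples) spatially extended signal;
        # the original mono signal is mixed into its centre direction.
        w_orig, w_ext = mix_weights(problem_value)
        out = w_ext * np.asarray(extended, dtype=float)
        out[centre_channel] += w_orig * np.asarray(original, dtype=float)
        return out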
With respect to Figure 2 an example spatial extent synthesiser 141 is shown in further detail. As described herein the spatial extent synthesiser 141 receives the input audio signal and spatially extends the audio signal to a defined (for example 360 degree) spatial extent using methods for spatial extent control. In other words it takes as input a mono sound source audio signal and spatial extent parameters (width, height and depth).
In some embodiments where the audio signal input is a time domain signal the spatial extent synthesiser 141 comprises a suitable time to frequency domain transformer. For example as shown in Figure 2 the spatial extent synthesiser 141 comprises a Short-Time Fourier Transform (STFT) 401 configured to receive the audio signal and output a suitable frequency domain output. In some embodiments the input is a time-domain signal which is processed with a hop size of 512 samples. A processing frame of 1024 samples is used, and it is formed from the current 512 samples and the previous 512 samples. The processing frame is zero-padded to twice its length (2048 samples) and Hann windowed. The Fourier transform is calculated from the windowed frame, producing the Short-Time Fourier Transform (STFT) output. The STFT output is symmetric, thus it is sufficient to process the positive half of 1024 samples including the DC component, totalling 1025 samples. Although the STFT is shown in Figure 2, any suitable time to frequency domain transform may be used.
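One STFT step with these parameters might look as follows; the window-before-padding order is the conventional one and is an assumption, as is the use of numpy's real FFT.

    import numpy as np

    def stft_step(current_512, previous_512):
        # 1024-sample processing frame from the current and previous hops.
        frame = np.concatenate([previous_512, current_512])
        # Hann window the frame, then zero-pad to twice its length (2048).
        windowed = frame * np.hanning(len(frame))
        padded = np.concatenate([windowed, np.zeros(len(frame))])
        # The real FFT keeps the positive half plus DC: 1025 bins.
        return np.fft.rfft(padded)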
In some embodiments the spatial extent synthesiser 141 further comprises a filter bank 403. The filter bank 403 is configured to receive the output of the STFT 401 and, using a set of filters generated based on a Halton sequence (and with some default parameters), generate a number of frequency bands 405. In statistics, Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, they appear to be random for many purposes. In some embodiments the filter bank 403 comprises a set of 9 different distribution filters, which are used to create 9 different frequency domain signals where the signals do not contain overlapping frequency components. These signals are denoted Band 1 F 405₁ to Band 9 F 405₉ in Figure 2. The filtering can be implemented in the frequency domain by multiplying the STFT output with stored filter coefficients for each band.
In some embodiments the spatial extent synthesiser 141 further comprises a spatial extent input 400. The spatial extent input 400 may be configured to define the spatial extent of the audio signal.
Furthermore in some embodiments the spatial extent synthesiser 141 may further comprise an object position input/determiner 402. The object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by the sound object processor.
In some embodiments the spatial extent synthesiser 141 may further comprise a band position determiner 404. The band position determiner 404 may be configured to receive the outputs from the object position input/determiner 402 and the spatial extent input 400 and from these generate an output passed to the vector base amplitude panning processor 406.
In some embodiments the spatial extent synthesiser 141 may further comprise a vector base amplitude panning (VBAP) processor 406. The VBAP 406 may be configured to generate control signals to control the panning of the frequency domain signals to desired spatial positions. Given the spatial position of the sound source (azimuth, elevation) and the desired spatial extent for the source (width in degrees), the system calculates a spatial position for each frequency domain signal. For example, if the spatial position of the sound source is zero degrees azimuth (front), and the spatial extent is 90 degrees, the VBAP may position the frequency bands at azimuths of 45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75 and -45 degrees. Thus a linear allocation of bands around the source position is used, with the span defined by the spatial extent.
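The linear allocation can be reproduced with a one-line computation; the function below is an illustrative sketch that matches the 0 degree / 90 degree example above.

    import numpy as np

    def band_azimuths(source_azimuth, extent_deg, num_bands=9):
        # Spread the bands evenly across the extent, centred on the source.
        half = extent_deg / 2.0
        return source_azimuth + np.linspace(half, -half, num_bands)

    # band_azimuths(0, 90) -> [45, 33.75, 22.5, 11.25, 0,
    #                          -11.25, -22.5, -33.75, -45]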
The VBAP processor 406 may therefore be used to calculate a suitable gain for the signal, given the desired loudspeaker positions. The VBAP processor 406 may provide gains for a signal such that it can be spatially positioned at a suitable position. These gains may be passed to a series of multipliers 407. In the following example the spatial extent synthesiser (or spatially extending controller) is implemented using a vector base amplitude panning operation. However it is understood that the spatial extent synthesis or spatially extending control may be implementation agnostic and any suitable implementation may be used to generate the spatially extending control. For example in some embodiments the spatially extending control may implement direct binaural panning (using head-related transfer function filters for the directions), direct assignment to the output channel locations (for example direct assignment to the loudspeakers without using any panning), synthesized ambisonics, or wave-field synthesis.
In some embodiments the spatial extent synthesiser 141 may further comprise a series of multipliers 407. In Figure 2 one multiplier is shown for each frequency band. Thus the series of multipliers comprises multipliers 407₁ to 407₉; however any suitable number of multipliers may be used. Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
The products of the VBAP gains and each frequency band signal may be passed to a series of output channel sum devices 409.
In some embodiments the spatial extent synthesiser 141 may further comprise a series of sum devices 409. The sum devices 409 may receive the outputs from the multipliers and combine them to generate an output channel band signal 411. In the example shown in Figure 2, a 4.0 loudspeaker format output is implemented with outputs for front left (Band FL F 411₁), front right (Band FR F 411₂), rear left (Band RL F 411₃), and rear right (Band RR F 411₄) channels, which are generated by sum devices 409₁, 409₂, 409₃ and 409₄ respectively. In some other embodiments other loudspeaker formats or numbers of channels can be supported.
Furthermore in some embodiments other panning methods can be used such as panning laws, or the signals could be assigned to the closest loudspeakers directly.
In some embodiments the spatial extent synthesiser 141 may further comprise a series of inverse Short-Time Fourier Transforms (ISTFT) 413. For example as shown in Figure 2 there is an ISTFT 413₁ associated with the FL signal, an ISTFT 413₂ associated with the FR signal, an ISTFT 413₃ associated with the RL signal and an ISTFT 413₄ associated with the RR signal. In other words the synthesiser provides N component audio signals to be played from different directions based on the spatial extent parameters. The signals are subjected to an Inverse Short-Time Fourier Transform (ISTFT) and overlap-added to produce time-domain outputs.
As adding the original signal into the mix draws the perception of the sound source towards its direction, the resulting perceived spatial extent is less than intended. This mix 'narrowing' can be compensated for by making the spatial extent wider than the originally specified or determined value. Thus, for example, where there is more original signal in the mix, the mixer controller/extent compensator may be configured to compensate for the narrowing by controlling the spatial extent synthesiser 141 to produce a wider or compensated spatially extended audio signal.
This increase in extent may, for example, be performed in some embodiments by increasing the extent directly from the mixing value, so that where there is 0% original signal there is no increase in extent, and where the mix comprises 50% of the original audio signal the spatial extent synthesiser is always controlled to produce a full 360° extent spatially extended audio signal. In some embodiments the increase is a linear interpolation between these values.
In some embodiments the spatial extent synthesiser 141 is configured to apply a predefined lookup table of extent values, based on the current extent, that have been perceptually evaluated to be correct. This lookup works so that there is a defined modified extent value for each pair of original extent and current mixing value.
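The linear compensation rule from the two preceding paragraphs can be sketched as follows (a lookup table, as just described, would simply replace the interpolation with a table read); treating the original-signal fraction as running from 0 to 0.5 is an assumption consistent with the mixing description.

    import numpy as np

    def compensated_extent(original_extent_deg, orig_fraction):
        # 0% original signal -> no widening;
        # 50% original signal -> a full 360 degree extent.
        t = float(np.clip(orig_fraction / 0.5, 0.0, 1.0))
        return (1.0 - t) * original_extent_deg + t * 360.0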
In some embodiments a predefined extent value, associated with a specific signal type, is applied to the spatial extent synthesiser.
Examples of this extending of the spatial extent processing are shown in Figure 3. For example, where the analyser determines that the input audio signal has a maximum attribute value of 0, shown by arrow 211 in Figure 3, a normal spatial extent synthesis can be performed, as shown in Figure 3 by the extended arc 213. When the input has a maximum attribute of 0.5, shown by arrow 221, the output of the spatial extent synthesis is configured to extend the spatial extension, as shown by the further extended arc 215 and the direction arrow 217 representing the original direction. When the input has a maximum attribute of 1, the spatial extent synthesis may be configured to extend the spatial extension to produce a 360 degree spatially extended audio signal, as represented by the full circle 225 and the direction arrow representing the original signal.
Example use cases of the system shown in Figures 1 to 3, which arise when a user, for example an audio professional who can aurally detect changes in a sound signal when it is run through an audio effect, operates the system, are described hereafter. These changes may not necessarily be perceived by a non-professional user.
In some embodiments the user may want to create a spatial extent for a trumpet track. This signal is detected as peaky. If it went through the system normally, it would become less sharp, which the user does not want. They therefore decide to apply the mixer so that 20% of the original signal is mixed into the output to better retain the qualities of the trumpet.
A further example could be where a user wants to create a surrounding spatial extent for a drumming track with congas. These sounds are detected to be clearly impulsive and would lose their impact if processed normally. Instead, the user uses the proposed system, adds 50% of the original signal into the output mix and makes the extent part completely surrounding. The resulting perceived extent is not completely surrounding but still very wide.
An additional use may be where the user wants to create spatially extended speech. Running speech directly through the spatial extent system creates an 'otherworldly' sound which is perceived as disturbing. A user however may use the system, as it classifies this input audio signal as speech, and instead adds 40% of the original signal into the mix to retain the naturalness of the speech.
With respect to Figure 4 an example electronic device which may be used as the mixer and/or system is shown. The device may be any suitable electronics device or apparatus. For example in some embodiments the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
The device 1200 may comprise a microphone 1201. The microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones. In some embodiments the microphone 1201 is separate from the apparatus and the audio signal is transmitted to the apparatus by a wired or wireless coupling. The microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
The microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals. In some embodiments the microphones can be solid state microphones. In other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
The device 1200 may further comprise an analogue-to-digital converter 1203. The analogue-to-digital converter 1203 may be configured to receive the audio signals from each of the microphones 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required. The analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means. The analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
In some embodiments the device 1200 comprises at least one processor or central processing unit 1207. The processor 1207 can be configured to execute various program codes such as the methods such as described herein.
In some embodiments the device 1200 comprises a memory 1211. In some embodiments the at least one processor 1207 is coupled to the memory 1211. The memory 1211 can be any suitable storage means. In some embodiments the memory 1211 comprises a program code section for storing program codes implementable upon the processor 1207. Furthermore in some embodiments the memory 1211 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1207 whenever needed via the memory-processor coupling.
In some embodiments the device 1200 comprises a user interface 1205. The user interface 1205 can be coupled in some embodiments to the processor 1207. In some embodiments the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205. In some embodiments the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad. In some embodiments the user interface 1205 can enable the user to obtain information from the device 1200. For example the user interface 1205 may comprise a display configured to display information from the device 1200 to the user. The user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200. In some embodiments the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1200 comprises a transceiver 1209. The transceiver 1209 in such embodiments can be coupled to the processor 1207 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 1209 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
For example as shown in Figure 4 the transceiver 1209 may be configured to communicate with the renderer as described herein.
The transceiver 1209 can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
In some embodiments the device 1200 may be employed as at least part of the renderer. As such the transceiver 1209 may be configured to receive the audio signals and positional information from the microphone/close microphones/position determiner as described herein, and generate a suitable audio signal rendering by using the processor 1207 executing suitable code. The device 1200 may comprise a digital-to-analogue converter 1213. The digital-to-analogue converter 1213 may be coupled to the processor 1207 and/or memory 1211 and be configured to convert digital representations of audio signals (such as from the processor 1207 following an audio rendering of the audio signals as described herein) to a suitable analogue format suitable for presentation via an audio subsystem output. The digital-to-analogue converter (DAC) 1213 or signal processing means can in some embodiments be any suitable DAC technology.
Furthermore the device 1200 can comprise in some embodiments an audio subsystem output 1215. An example as shown in Figure 4 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121. However the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output. For example the audio subsystem output 1215 may be a connection to a multichannel speaker system.
In some embodiments the digital to analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device. For example the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
Although the device 1200 is shown having both audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of the elements.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. Various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

Claims

1. An apparatus for generating at least one audio signal associated with a sound scene, the apparatus configured to:
receive at least one audio signal;
analyse the at least one audio signal to determine at least one attribute parameter;
determine at least one control signal based on the at least one attribute parameter;
generate a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and
combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
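By way of a purely illustrative, non-limiting sketch (not forming part of the claims), the receive, analyse, control, extend and combine flow of claim 1 might be arranged as below in Python. The helper names, the assumed [0, 1] range of the attribute, and the complementary weighting rule are all assumptions of this sketch, and both signals are treated as single-channel arrays of equal length for brevity.

```python
import numpy as np

def generate_scene_signal(x, analyse, spatially_extend):
    # Hypothetical pipeline sketch for claim 1; analyse() and
    # spatially_extend() stand in for the analysis and synthesis stages.
    attribute = analyse(x)                    # attribute parameter, assumed in [0, 1]
    control = attribute                       # control signal derived from the attribute
    extended = spatially_extend(x, control)   # spatially extended version of x
    w = 0.5 * control                         # assumed rule: original share in [0, 0.5]
    return w * x + (1.0 - w) * extended       # combined sound-scene signal
```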
2. The apparatus as claimed in claim 1, wherein the apparatus configured to generate the spatially extended audio signal based on the at least one control signal is further configured to apply a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
3. The apparatus as claimed in claim 2, wherein the apparatus configured to apply the spatially extending synthesis is further configured to:
control the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal;
control the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and
control the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
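As a minimal sketch of the extent compensation recited in claims 2 and 3 (illustrative only): reconciling the "unmodified at zero" and "360 degrees at one half" endpoints by interpolating from the original extent is an interpretation of this sketch, as are the function name and the clamping.

```python
def compensated_extent(original_extent_deg, original_fraction):
    # original_fraction: share of the unmodified signal in the mix, in [0, 0.5].
    # At 0 the synthesis extent is left unmodified; at 0.5 it is widened to a
    # full 360 degrees; in between the extent is linearly interpolated.
    t = min(max(original_fraction, 0.0), 0.5) / 0.5
    return (1.0 - t) * original_extent_deg + t * 360.0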
4. The apparatus as claimed in claim 2, wherein the apparatus configured to apply the spatially extending synthesis is further configured to control the application of the spatially extending synthesis such that the extent of synthesis is a result of a look-up table using as inputs an original spatial extent and a fraction of the mix which is the at least one audio signal.
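The look-up table variant of claim 4 could, purely as a sketch, use a small two-dimensional table indexed by the original extent and the original-signal fraction. Every table value below is a hypothetical placeholder, and nearest-neighbour indexing is an assumption; an implementation might equally interpolate between entries.

```python
import numpy as np

EXTENT_AXIS = np.array([0.0, 90.0, 180.0, 270.0, 360.0])   # original extent (degrees)
FRACTION_AXIS = np.array([0.0, 0.25, 0.5])                  # original-signal fraction
LUT = np.array([                                            # hypothetical compensated extents
    [  0.0, 180.0, 360.0],
    [ 90.0, 225.0, 360.0],
    [180.0, 270.0, 360.0],
    [270.0, 315.0, 360.0],
    [360.0, 360.0, 360.0],
])

def lut_extent(original_extent, fraction):
    i = int(np.abs(EXTENT_AXIS - original_extent).argmin())  # nearest table row
    j = int(np.abs(FRACTION_AXIS - fraction).argmin())       # nearest table column
    return LUT[i, j]
```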
5. The apparatus as claimed in any of claims 2 to 4, wherein the apparatus configured to apply a spatially extending synthesis is configured to apply at least one of:
a vector base amplitude panning to the at least one audio signal;
direct binaural panning to the at least one audio signal;
direct assignment to channel output location to the at least one audio signal;
synthesized ambisonics to the at least one audio signal; and
wavefield synthesis to the at least one audio signal.
6. The apparatus as claimed in claim 5, wherein the apparatus configured to apply a spatially extending synthesis to the at least one audio signal is configured to:
determine a spatial extent parameter;
determine at least one position associated with the at least one audio signal;
determine at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
7. The apparatus as claimed in claim 6, further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.
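For claims 5 to 7, one hedged way to picture the frequency-band placement and the vector base amplitude panning is the two-dimensional sketch below. The even spread of bands across the extent, the loudspeaker layout handling and the constant-power normalisation are assumptions of this sketch, not the claimed implementation; adjacent loudspeakers are assumed to be less than 180 degrees apart.

```python
import numpy as np

def band_azimuths(centre_az_deg, extent_deg, n_bands):
    # Spread the frequency bands evenly across the extent around the source position.
    offsets = np.linspace(-extent_deg / 2.0, extent_deg / 2.0, n_bands)
    return (centre_az_deg + offsets) % 360.0

def vbap_gains_2d(az_deg, speaker_az_deg):
    # Pairwise 2D VBAP: find the adjacent speaker pair enclosing the target
    # direction and solve for non-negative gains for that pair.
    target = np.array([np.cos(np.radians(az_deg)), np.sin(np.radians(az_deg))])
    speaker_az_deg = np.asarray(speaker_az_deg, dtype=float)
    gains = np.zeros(len(speaker_az_deg))
    order = np.argsort(speaker_az_deg)
    for k in range(len(order)):
        i, j = order[k], order[(k + 1) % len(order)]
        base = np.column_stack([
            [np.cos(np.radians(speaker_az_deg[i])), np.sin(np.radians(speaker_az_deg[i]))],
            [np.cos(np.radians(speaker_az_deg[j])), np.sin(np.radians(speaker_az_deg[j]))],
        ])
        g = np.linalg.solve(base, target)
        if np.all(g >= -1e-9):                       # target lies between this pair
            gains[[i, j]] = g / np.linalg.norm(g)    # constant-power normalisation
            break
    return gains
```

A caller could then compute band_azimuths(source_azimuth, compensated_extent(...), n_bands) and derive one panning vector per frequency band with vbap_gains_2d, in the spirit of claim 7.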
8. The apparatus as claimed in any of claims 1 to 7, wherein the apparatus configured to receive the at least one audio signal is further configured to receive the audio signal from at least one microphone, and wherein the apparatus is further configured to determine a position of the microphone relative to the apparatus.
9. The apparatus as claimed in any of claims 1 to 8, wherein the apparatus configured to combine the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene is configured to:
determine a weighting for the at least one audio signal based on the at least one control signal; and/or
determine a weighting for the spatially extended audio signal based on the at least one control signal.
10. The apparatus as claimed in claim 9, wherein the apparatus is configured to generate the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one half and one based on the at least one control signal.
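A minimal sketch of the weighting of claims 9 and 10 follows; the claims only bound each weight, so the complementarity of the two weights is an assumption here.

```python
def mix_weights(control):
    # control assumed to lie in [0, 1]; clamped defensively.
    w_original = 0.5 * min(max(control, 0.0), 1.0)  # bounded to [0, 0.5] per claim 10
    w_extended = 1.0 - w_original                   # bounded to [0.5, 1] per claim 10
    return w_original, w_extended
```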
11. The apparatus as claimed in any of claims 1 to 10, wherein the apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter is configured to determine at least one of:
a detection of voice activity within the at least one audio signal;
a determination of peakiness within the at least one audio signal; and
a determination of impulsiveness within the at least one audio signal.
12. The apparatus as claimed in claim 11, wherein the apparatus configured to analyse the at least one audio signal to determine at least one attribute parameter is configured to determine a combined attribute parameter value based on at least one of:
a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal;
a peakiness parameter value based on the determination of peakiness within the at least one audio signal;
an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal;
a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value;
a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and
a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
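To make claims 11 and 12 concrete, the sketch below substitutes deliberately simple stand-in measures: an energy gate for voice activity, the crest factor (peak-to-RMS ratio) for peakiness, and a frame-to-frame energy rise for impulsiveness, combined with a maximum, which is one of the options recited in claim 12. All thresholds and mappings are illustrative assumptions.

```python
import numpy as np

def frame_attributes(frame, prev_energy, eps=1e-12):
    energy = float(np.mean(frame ** 2))
    vad = 1.0 if energy > 1e-4 else 0.0                       # crude activity gate
    crest = np.max(np.abs(frame)) / np.sqrt(energy + eps)     # peak-to-RMS ratio
    peakiness = float(np.clip((crest - 1.0) / 9.0, 0.0, 1.0))
    rise = energy / (prev_energy + eps)                       # onset-style energy jump
    impulsiveness = float(np.clip(np.log10(rise + eps) / 2.0, 0.0, 1.0))
    combined = max(vad, peakiness, impulsiveness)             # 'maximum' option of claim 12
    return combined, energy  # energy is fed back as prev_energy for the next frame
```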
13. The apparatus as claimed in any of claims 1 to 12, wherein the at least one audio signal is at least one of:
a monophonic source audio signal;
a captured audio signal from a microphone; and
a synthetic audio signal.
14. The apparatus as claimed in any of claims 1 to 13, wherein the apparatus configured to determine at least one control signal based on the at least one attribute parameter is configured to determine the at least one control signal further based on at least one user input.
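Claim 14 leaves the form of the user input open; one sketch, in which a hypothetical user_weight blends the analysed attribute with a user-chosen target value, is given below.

```python
def control_from_attribute(attribute, user_target=1.0, user_weight=0.0):
    # user_weight = 0 follows the analysis alone; 1 follows the user alone.
    return (1.0 - user_weight) * attribute + user_weight * user_target
```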
15. A method for generating at least one audio signal associated with a sound scene, the method comprising:
receiving at least one audio signal;
analysing the at least one audio signal to determine at least one attribute parameter;
determining at least one control signal based on the at least one attribute parameter;
generating a spatially extended audio signal from the at least one audio signal based on the at least one control signal; and
combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene.
16. The method as claimed in claim 15, wherein generating a spatially extended audio signal based on the at least one control signal further comprises applying a spatially extending synthesis to the at least one audio signal to generate the spatially extended audio signal, wherein the spatially extending synthesis is controlled based on the at least one control signal, such that the spatial effect of the combination of the at least one audio signal and the spatially extended audio signal is compensated for.
17. The method as claimed in claim 16, wherein applying the spatially extending synthesis further comprises:
controlling the application of the spatially extending synthesis such that the synthesis is unmodified when the combination of the at least one audio signal and an associated spatially extended audio signal based on the at least one audio signal is purely the spatially extended audio signal;
controlling the application of the spatially extending synthesis such that the synthesis is increased to a 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal has an equal mix of both the at least one audio signal and the spatially extended audio signal; and
controlling the application of the spatially extending synthesis such that the synthesis is a linear interpolation between 0 and 360 degree extent when the combination of the at least one audio signal and the spatially extended audio signal based on the at least one audio signal has a mix value of the at least one audio signal between zero and one half.
18. The method as claimed in claim 16, wherein applying the spatially extending synthesis further comprises controlling the application of the spatially extending synthesis such that the extent of synthesis is a result of a look-up table using as inputs an original spatial extent and a fraction of the mix which is the at least one audio signal.
19. The method as claimed in any of claims 16 to 18, wherein applying a spatially extending synthesis comprises applying at least one of:
a vector base amplitude panning to the at least one audio signal;
direct binaural panning to the at least one audio signal;
direct assignment to channel output location to the at least one audio signal;
synthesized ambisonics to the at least one audio signal; and
wavefield synthesis to the at least one audio signal.
20. The method as claimed in claim 19, wherein applying a spatially extending synthesis to the at least one audio signal comprises:
determining a spatial extent parameter;
determining at least one position associated with the at least one audio signal;
determining at least one frequency band position based on the at least one position located within the sound scene and the spatial extent parameter.
21. The method as claimed in claim 20, further comprising generating panning vectors for the application of vector base amplitude panning to frequency bands of the at least one audio signal.
22. The method as claimed in any of claims 15 to 21, wherein receiving the at least one audio signal further comprises receiving the audio signal from at least one microphone, and wherein the method further comprises determining a position of the microphone relative to the apparatus.
23. The method as claimed in any of claims 15 to 22, wherein combining the at least one audio signal and the spatially extended audio signal based on the at least one control signal to generate the at least one audio signal associated with the sound scene comprises:
determining a weighting for the at least one audio signal based on the at least one control signal; and/or
determining a weighting for the spatially extended audio signal based on the at least one control signal.
24. The method as claimed in claim 23, further comprising generating the weighting for the at least one audio signal between zero and one half based on the at least one control signal and/or the weighting for the spatially extended audio signal between one half and one based on the at least one control signal.
25. The method as claimed in any of claims 15 to 24, wherein analysing the at least one audio signal to determine at least one attribute parameter comprises determining at least one of:
a detection of voice activity within the at least one audio signal;
a determination of peakiness within the at least one audio signal; and
a determination of impulsiveness within the at least one audio signal.
26. The method as claimed in claim 25, wherein analysing the at least one audio signal to determine at least one attribute parameter comprises determining a combined attribute parameter value based on at least one of:
a voice activity parameter value determined based on the detection of voice activity within the at least one audio signal;
a peakiness parameter value based on the determination of peakiness within the at least one audio signal;
an impulsiveness parameter value based on the determination of impulsiveness within the at least one audio signal;
a combination of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value;
a maximum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value; and
a minimum of at least two of the voice activity parameter value, the peakiness parameter value and the impulsiveness parameter value.
27. The method as claimed in any of claims 15 to 26, wherein the at least one audio signal is at least one of:
a monophonic source audio signal;
a captured audio signal from a microphone; and
a synthetic audio signal.
28. The method as claimed in any of claims 15 to 27, wherein determining at least one control signal based on the at least one attribute parameter comprises determining the at least one control signal further based on at least one user input.
EP18787482.1A 2017-04-20 2018-04-19 Ambience generation for spatial audio mixing featuring use of original and extended signal Pending EP3613043A4 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1706289.4A GB2561595A (en) 2017-04-20 2017-04-20 Ambience generation for spatial audio mixing featuring use of original and extended signal
PCT/FI2018/050273 WO2018193160A1 (en) 2017-04-20 2018-04-19 Ambience generation for spatial audio mixing featuring use of original and extended signal

Publications (2)

Publication Number Publication Date
EP3613043A1 2020-02-26
EP3613043A4 EP3613043A4 (en) 2020-12-23

Family ID: 58795819

Family Applications (1)

Application Number Title Priority Date Filing Date
EP18787482.1A Pending EP3613043A4 (en) 2017-04-20 2018-04-19 Ambience generation for spatial audio mixing featuring use of original and extended signal

Country Status (3)

Country Link
EP (1) EP3613043A4 (en)
GB (1) GB2561595A (en)
WO (1) WO2018193160A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2579348A (en) * 2018-11-16 2020-06-24 Nokia Technologies Oy Audio processing

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3772479A (en) * 1971-10-19 1973-11-13 Motorola Inc Gain modified multi-channel audio system
EP2146522A1 (en) * 2008-07-17 2010-01-20 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for generating audio output signals using object based metadata
EP2346028A1 (en) * 2009-12-17 2011-07-20 Fraunhofer-Gesellschaft zur Förderung der Angewandten Forschung e.V. An apparatus and a method for converting a first parametric spatial audio signal into a second parametric spatial audio signal
US9088858B2 (en) * 2011-01-04 2015-07-21 Dts Llc Immersive audio rendering system
EP2541542A1 (en) * 2011-06-27 2013-01-02 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for determining a measure for a perceived level of reverberation, audio processor and method for processing a signal
EP2600343A1 (en) * 2011-12-02 2013-06-05 Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. Apparatus and method for merging geometry - based spatial audio coding streams
BR112015018352A2 * 2013-02-05 2017-07-18 Koninklijke Philips Nv Audio device and method for operating an audio system
US20150127354A1 (en) * 2013-10-03 2015-05-07 Qualcomm Incorporated Near field compensation for decomposed representations of a sound field
GB2543276A (en) * 2015-10-12 2017-04-19 Nokia Technologies Oy Distributed audio capture and mixing

Also Published As

Publication number Publication date
EP3613043A4 (en) 2020-12-23
GB2561595A (en) 2018-10-24
WO2018193160A1 (en) 2018-10-25
GB201706289D0 (en) 2017-06-07

Similar Documents

Publication Publication Date Title
US10685638B2 (en) Audio scene apparatus
US10382849B2 (en) Spatial audio processing apparatus
US10645518B2 (en) Distributed audio capture and mixing
JP5149968B2 (en) Apparatus and method for generating a multi-channel signal including speech signal processing
JP5957446B2 (en) Sound processing system and method
US10873814B2 (en) Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices
US11943604B2 (en) Spatial audio processing
CN117412237A (en) Combining audio signals and spatial metadata
JP2002078100A (en) Method and system for processing stereophonic signal, and recording medium with recorded stereophonic signal processing program
CN112806030B (en) Method and apparatus for processing spatial audio signals
WO2018193163A1 (en) Enhancing loudspeaker playback using a spatial extent processed audio signal
WO2018193162A2 (en) Audio signal generation for spatial audio mixing
EP3613043A1 (en) Ambience generation for spatial audio mixing featuring use of original and extended signal
CN114631142A (en) Electronic device, method, and computer program
WO2018193161A1 (en) Spatially extending in the elevation domain by spectral extension
KR20190027398A (en) Apparatus dividing sound source and acoustic apparatus

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20191120

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20201124

RIC1 Information provided on ipc code assigned before grant

Ipc: G10L 19/008 20130101AFI20201118BHEP

Ipc: H04S 3/00 20060101ALI20201118BHEP

Ipc: H04S 5/00 20060101ALI20201118BHEP

Ipc: H04S 7/00 20060101ALI20201118BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20220920