WO2018193162A2 - Audio signal generation for spatial audio mixing - Google Patents


Info

Publication number
WO2018193162A2
Authority
WO
WIPO (PCT)
Prior art keywords
audio signal
sum
audio
environment
spatial
Prior art date
Application number
PCT/FI2018/050275
Other languages
French (fr)
Other versions
WO2018193162A3 (en)
Inventor
Antti Eronen
Arto Lehtiniemi
Tapani PIHLAJAKUJA
Jussi LEPPÄNEN
Original Assignee
Nokia Technologies Oy
Priority date
Filing date
Publication date
Application filed by Nokia Technologies Oy
Publication of WO2018193162A2 publication Critical patent/WO2018193162A2/en
Publication of WO2018193162A3 publication Critical patent/WO2018193162A3/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04R: LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04S: STEREOPHONIC SYSTEMS
    • H04R 3/005: Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H04S 7/30: Indicating arrangements; Control arrangements, e.g. balance control; Control circuits for electronic adaptation of the sound field
    • H04R 2430/20: Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H04S 2400/11: Positioning of individual sound objects, e.g. moving airplane, within a sound field
    • H04S 2400/15: Aspects of sound capture and related signal processing for recording or reproduction
    • H04S 2420/01: Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTFs] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H04S 2420/11: Application of ambisonics in stereophonic audio systems
    • H04S 2420/13: Application of wave-field synthesis in stereophonic audio systems

Definitions

  • the present application relates to apparatus and methods for audio signal generation and ambience audio signal generation for spatial audio mixing.
  • Capture of audio signals from multiple sources, and mixing of those audio signals while the sources are moving in the spatial field, requires significant effort. For example, capturing and mixing an audio signal source such as a speaker or artist within an audio environment such as a theatre or lecture hall, so as to present it to a listener with an effective audio atmosphere, requires significant investment in equipment and training.
  • A commonly implemented system is one in which one or more 'external' microphones, for example a Lavalier microphone worn by the user or an audio channel associated with an instrument, are mixed with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction.
  • This system is known in some areas as Spatial Audio Mixing (SAM).
  • the SAM system enables the creation of immersive sound scenes comprising "background spatial audio" or ambience and sound objects for Virtual Reality (VR) applications.
  • the scene can be designed such that the overall spatial audio of the scene, such as a concert venue, is captured with a microphone array (such as one contained in the OZO virtual camera) and the most important sources captured using the 'external' microphones.
  • In some situations a spatial audio capture device such as an OZO is not available, but a content producer would like to create high quality VR sound scenes with spatial ambience and high quality close-up sources. Thus there is a need for solutions which enable this.
  • Even where a designated spatial audio capture device such as an OZO device is available, problems may arise:
  • the spatial audio capture device may capture unintended audio, e.g., live mix for the audience, close mic ambience.
  • the signal-to-noise ratio at the spatial capture device may not be good enough to represent even the ambience of the scene, for example where the capture device is mounted on a moving car.
  • the spatial audio capture device may not always represent the spatial scene that is desired, even where something similar is the target.
  • an apparatus for generating an intended spatial audio field configured to: receive at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone is located in the same environment and configured to capture a sound source; analyse each audio signal to determine at least in part an ambience audio signal; generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals; and process the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
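By way of illustration, the processing chain set out above can be sketched in a few lines of Python/NumPy. This is a minimal sketch rather than the claimed implementation: it assumes time-aligned, equal-length mono signals, uses spectral flatness as a stand-in for the broader analysis step, and the function names are hypothetical.

```python
import numpy as np

def estimate_ambience_weight(x: np.ndarray) -> float:
    # Hypothetical analysis stage: spectral flatness (geometric mean over
    # arithmetic mean of the power spectrum), one of the analyses listed;
    # values near 1 suggest noise-like (ambience) content.
    power = np.abs(np.fft.rfft(x)) ** 2 + 1e-12
    return float(np.exp(np.mean(np.log(power))) / np.mean(power))

def generate_ambience_sum(mic_signals):
    # Weight each microphone signal by its ambience score, normalise the
    # weights, and sum to form the single "sum audio signal".
    # Assumes time-aligned signals of equal length.
    weights = np.array([estimate_ambience_weight(x) for x in mic_signals])
    weights = weights / weights.sum()
    return sum(w * x for w, x in zip(weights, mic_signals))
```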
  • the apparatus may further be configured to apply a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
  • the apparatus configured to generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals may be configured to generate for and apply to at least one of the at least two audio signals a weighting value before generating the sum audio signal, wherein the weighting value may be based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal.
  • the apparatus configured to generate for at least one of the at least two audio signals a weighting value may be further configured to normalise the weighting value for at least one of the at least two audio signals.
  • the apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to apply one of: vector base amplitude panning to the sum audio signal; direct binaural panning to the sum audio signal; direct assignment to channel output location to the sum audio signal; synthesized ambisonics to the sum audio signal; and wavefield synthesis to the sum audio signal.
  • the apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to: determine a spatial extent parameter; determine at least one position associated with the microphones; determine at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
  • the apparatus configured to apply vector base amplitude panning to the sum audio signal may be further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the sum audio signal.
  • the apparatus configured to generate the intended spatial audio field may be configured to generate a plurality of intended spatial audio field parts, wherein at least one part of the intended spatial audio field may be at least one of: partially overlapping a neighbouring part; non-overlapping at least one other part; contained within at least one other part; and containing at least one other part.
  • the apparatus may be configured to generate: at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
  • the first part of the environment may be a left portion of the environment with respect to the apparatus, and the second part of the environment may be a right portion of the environment with respect to the apparatus.
  • the first part of the environment may be a front portion of the environment with respect to the apparatus, and the second part of the environment may be a rear portion of the environment with respect to the apparatus.
  • the apparatus may be further configured to determine a position of the at least one microphone of the microphones relative to the apparatus.
  • the apparatus may be further configured to: receive at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; compare the at least one audio signal from the capture device to the at least one audio signal; control the generation of the sum audio signal from microphones located within the intended spatial audio field, and process the sum audio signal to generate the intended spatial audio field based on the comparison.
  • the apparatus may be further configured to mix the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
  • the apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to spatially extend the sum audio signal such that the at least one spatially extended sum audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
  • a method for generating an intended spatial audio field comprising: receiving at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone being located in the same environment and configured to capture a sound source; analysing each audio signal to determine at least in part an ambience audio signal; generating a sum audio signal from the determined ambience signal based on the at least two audio signals; and processing the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
  • the method may further comprise applying a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
  • Generating the sum audio signal may comprise: generating for at least one of the at least two audio signals a weighting value; and applying to at least one of the at least two audio signals the weighting value before generating the sum audio signal, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; a determination of content classification type within the audio signal; and at least one user generated input associated with the audio signal.
  • Generating the weighting value may further comprise normalising the weighting value for at least one of the at least two audio signals.
  • Processing the sum audio signal to spatially extend the sum audio signal may comprise applying one of: vector base amplitude panning to the sum audio signal; direct binaural panning to the sum audio signal; direct assignment to channel output location to the sum audio signal; synthesized ambisonics to the sum audio signal; and wavefield synthesis to the sum audio signal.
  • Processing the sum audio signal to spatially extend the sum audio signal may comprise: determining a spatial extent parameter; determining at least one position associated with the microphones; determining at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
  • Applying vector base amplitude panning to the sum audio signal may further comprise generating panning vectors for the application of vector base amplitude panning to frequency bands of the weighted sum.
  • Generating the intended spatial audio field may comprise generating a plurality of intended spatial audio field parts, wherein at least one part is at least one of: partially overlapping a neighbouring part; non-overlapping at least one other part; contained within at least one other part; and containing at least one other part.
  • the method may comprise: generating at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and generating at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
  • the first part of the environment may be a left portion of the environment, and the second part of the environment may be a right portion of the environment.
  • the first part of the environment may be a front portion of the environment, and the second part of the environment may be a rear portion of the environment.
  • the method may further comprise determining a position of the at least one microphone of the microphones relative to the apparatus.
  • the method may further comprise: receiving at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; comparing the at least one audio signal from the capture device to the at least one audio signal; controlling the generation of the sum audio signal from microphones located within the intended spatial audio field; and processing the sum audio signal to generate the intended spatial audio field based on the comparison.
  • the method may further comprise mixing the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
  • Processing the sum audio signal to spatially extend the sum audio signal may comprise spatially extending the sum audio signal such that the at least one spatially extended audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
  • an apparatus for generating at least one spatially extended audio signal associated with a sound scene configured to: receive at least two audio signals, wherein each audio signal is received from a separate microphone located within the sound scene; generate a sum of the at least two audio signals; and apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal, wherein the at least one spatially extended audio signal is an ambience audio signal for mixing with at least one of the at least two audio signals to generate at least one spatial audio field.
  • the apparatus may be further configured to apply a reverberation to the sum before the application of the spatially extended control.
  • the apparatus configured to generate a sum may be configured to generate for and apply to at least one of the at least two audio signals a weighting value before generating the sum, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal.
  • the apparatus configured to generate for at least one of the at least two audio signals a weighting value may be further configured to normalise the weighting value for at least one of the at least two audio signals.
  • the apparatus configured to apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may be configured to apply one of: vector base amplitude panning to the sum of the at least two audio signals; direct binaural panning to the sum of the at least two audio signals; direct assignment to channel output location to the sum of the at least two audio signals; synthesized ambisonics to the sum of the at least two audio signals; and wavefield synthesis to the sum of the at least two audio signals.
  • the apparatus configured to apply a spatially extended control to the sum of the at least two audio signals may be configured to: determine a spatial extent parameter; determine at least one position associated with the microphones; determine at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter; and generate panning vectors for the application of vector base amplitude panning to frequency bands of the sum of the at least two audio signals.
  • the apparatus may be configured to generate a plurality of audio signals, each of the plurality of audio signals being associated with a portion of the sound scene, wherein at least one portion of the sound scene is at least one of: partially overlapping a neighbouring portion; non-overlapping at least one other portion; contained within at least one other portion; and containing at least one other portion.
  • the apparatus may be configured to generate: at least one first audio signal associated with a first portion of the sound scene, the first portion of the sound scene comprising at least one sound source; and at least one second audio signal associated with a second portion of the sound scene, the second portion of the sound scene comprising at least one further sound source.
  • the first portion of the sound scene may be a left portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a right portion of the sound scene with respect to the apparatus.
  • the first portion of the sound scene may be a front portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a rear portion of the sound scene with respect to the apparatus.
  • the apparatus may be further configured to determine a position of the at least one microphone of the microphones relative to the apparatus.
  • the apparatus may be further configured to: receive at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; compare the at least one audio signal from the capture device to the at least one audio signal; control the generation of the sum of the at least two audio signals from microphones located within the sound scene, and apply the spatially extended control to the sum of the at least two audio signals to generate the at least one audio signal based on the comparison.
  • the apparatus may be located in the sound scene comprising at least one sound source and at least one of the at least two microphones is associated with the at least one sound source within the sound scene.
  • the apparatus may be further configured to mix the at least one spatially extended audio signal with at least one of the at least two audio signals to generate at least one spatial audio field.
  • the apparatus configured to apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may be configured to spatially extend the sum of the at least two audio signals such that the at least one spatially extended audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
  • a method for generating at least one spatially extended audio signal associated with a sound scene comprising: receiving at least two audio signals, wherein each audio signal is received from a separate microphone located within the sound scene; generating a sum of the at least two audio signals; and applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal, wherein the at least one spatially extended audio signal is an ambience audio signal for mixing with at least one of the at least two audio signals to generate at least one spatial audio field.
  • the method may further comprise applying a reverberation to the sum before the application of the spatially extended control.
  • Generating a sum may comprise: generating for at least one of the at least two audio signals a weighting value; and applying to at least one of the at least two audio signals the weighting value before generating the sum, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal.
  • Generating for at least one of the at least two audio signals a weighting value may further comprise normalising the weighting value for at least one of the at least two audio signals.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying a vector base amplitude panning to the sum of the at least two audio signals.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying direct binaural panning to the sum of the at least two audio signals.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying direct assignment to channel output location to the sum of the at least two audio signals.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying synthesized ambisonics to the sum of the at least two audio signals.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying wavefield synthesis to the sum of the at least two audio signals.
  • Applying a vector base amplitude panning to the sum of the at least two audio signals may comprise: determining a spatial extent parameter; determining at least one position associated with the microphones located within the sound scene; determining at least one frequency band position based on the at least one position associated with the microphones located within the sound scene and the spatial extent parameter; and generating panning vectors for the application of vector base amplitude panning to frequency bands of the sum of the at least two audio signals.
  • Generating at least one spatially extended audio signal associated with a sound scene may comprise generating a plurality of audio signals, each of the plurality of audio signals being associated with a portion of the sound scene, wherein at least one portion of the sound scene is at least one of: partially overlapping a neighbouring portion; non-overlapping at least one other portion; contained within at least one other portion; and containing at least one other portion.
  • Generating at least one spatially extended audio signal associated with a sound scene may comprise generating: at least one first audio signal associated with a first portion of the sound scene, the first portion of the sound scene comprising at least one sound source; and at least one second audio signal associated with a second portion of the sound scene, the second portion of the sound scene comprising at least one further sound source.
  • the first portion of the sound scene may be a left portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a right portion of the sound scene with respect to the apparatus.
  • the first portion of the sound scene may be a front portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a rear portion of the sound scene with respect to the apparatus.
  • the method may further comprise determining a position of the at least one microphone of the microphones relative to the apparatus.
  • the method may further comprise: receiving at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; comparing the at least one audio signal from the capture device to the at least one audio signal; controlling the generation of the sum of the at least two audio signals from microphones located within the sound scene, and applying the spatially extended control to the sum of the at least two audio signals to generate the at least one audio signal based on the comparison.
  • the method may further comprise mixing the at least one spatially extended audio signal with at least one of the at least two audio signals to generate at least one spatial audio field.
  • Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise spatially extending the sum of the at least two audio signals such that the at least one spatially extended audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
  • the apparatus may be located in the sound scene comprising at least one sound source and at least one of the at least two microphones is associated with the at least one sound source within the sound scene.
  • An apparatus may comprise means for implementing the method as described herein.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows schematically an example known capture and mixing arrangement where the external microphones and the microphone array produce the external and ambient audio signals respectively for mixing;
  • Figure 2 shows schematically an example capture and mixing arrangement where the external microphones produce both the external and ambient audio signals for mixing according to some embodiments;
  • Figure 3 shows schematically the example capture and mixing arrangement shown in Figure 2 in further detail according to some embodiments;
  • Figure 4 shows schematically the spatial extent synthesizer shown in Figures 2 and 3 in further detail according to some embodiments.
  • Figure 5 shows schematically an example device suitable for implementing the apparatus shown in Figures 2 to 4.
  • A conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.
  • Figure 1 shows an example external or close microphone and tag 101 which is configured to transmit high accuracy indoor positioning (HAIP) signals which are received by the microphone array and tag receiver 103 in order to determine the actual position of the external microphone 101 relative to the microphone array 103.
  • the actual position may be passed to a mixer 105.
  • the external microphone may furthermore generate an external audio signal 102 which is passed to the mixer 105.
  • the microphone array and tag receiver 103 may furthermore generate an ambient or spatial field audio signal 104 which is passed to the mixer 105. Having received the external microphone audio signal 102 and the microphone array audio signal 104 the mixer can in some embodiments mix the two to determine a mixed audio signal 106.
  • the mixed audio signal 106 may be generated in some embodiments based on a user input such as the positional user input 109.
  • the mixed audio signal may furthermore be passed to a renderer 107 wherein the mixed audio signal is rendered into a format suitable for outputting to a user.
  • the renderer 107 in some embodiments may be configured to use vector-base amplitude panning techniques when loudspeaker domain output is desired (e.g. 5.1 channel output) or use head-related transfer-function filtering if binaural output for headphone listening is desired.
  • Spatial audio capture technology can process audio signals captured via a microphone array into a spatial audio format. In other words, it generates an audio signal format with the capacity for spatial perception.
  • the concept may thus be embodied in a form where audio signals may be captured such that, when rendered to a user, the user can experience the sound field as if they were present at the location of the capture device.
  • Spatial audio capture can be implemented for microphone arrays found in mobile devices.
  • audio processing derived from the spatial audio capture may be employed within a presence-capturing device such as the Nokia OZO (OZO) device.
  • the audio signal is rendered into a suitable binaural form, where the spatial sensation may be created using rendering such as by head-related-transfer-function (HRTF) filtering a suitable audio signal.
  • the concept as described with respect to the embodiments herein makes it possible to capture and remix an external and environmental audio signal more effectively and produce a better quality output where the sound or sound sources are more widely distributed.
  • the concept may for example be embodied as a capture system configured to capture two or more external (speaker, instrument or other source) audio signals, and a processor configured to generate from the two or more external audio signals a spatial or environmental (audio field) audio signal.
  • Although the capture and render systems may be separate, it is understood that they may be implemented with the same apparatus or may be distributed over a series of physically separate but communication-capable apparatus.
  • a presence-capturing device such as the OZO device could be equipped with an additional interface for receiving location data and close microphone audio signals, and could be configured to perform the capture part.
  • the output of a capture part of the system may be the microphone audio signals (e.g. as a 5.1 channel downmix), the close microphone audio signals (which may furthermore be time-delay compensated to match the time of the microphone array audio signals), and the position information of the close microphones (such as a time-varying azimuth, elevation, distance with regard to the microphone array).
  • the renderer as described herein may comprise an audio playback device (for example a set of headphones), user input (for example a motion tracker), and software capable of mixing and audio rendering.
  • user input and audio rendering parts may be implemented within a computing device with display capacity such as a mobile phone, tablet computer, virtual reality headset, augmented reality headset etc.
  • mixing and rendering may be implemented within a distributed computing system, such as that known as the 'cloud'.
  • the apparatus is configured to sum signals captured by the external microphones and spatially extend this summed audio signal to cover a full spatial audio field (360 degrees).
  • the sum of external microphone signals used for ambience audio signal creation can be weighted. That is, the contribution of each external microphone audio signal used in the creation of the ambience audio signal can be weighted based on various criteria.
  • one weighting criterion may be how likely it is that the microphone audio signal contains silence or just background audio.
  • Other audio weighting criteria may be based on voice activity detection (VAD) processing, where VAD inactivity implies that a microphone is capturing background noise.
  • Further audio weighting criteria may be based on noisiness detection, which may indicate that the external microphone signal contains noise; this may be compared against analysis of harmonicity and percussiveness, which would indicate a high likelihood of an actual instrument or voice within the audio signal.
  • audio signals associated with external microphones with high scores for noise/ambience may receive a larger weighting in the sum used to create the ambience audio signal, compared to audio signals from external microphones which have high levels of detected harmonic/percussive components.
  • the ambience signal may be created differently for different parts of the sound scene.
  • for example, the ambience signal generator may generate different left and right scene ambient audio signals.
  • a more natural ambience audio signal may correspond somewhat to the directions of the external audio signal microphones/sources. For example, in a "battle of the bands" situation outdoors with two rock bands on opposite sides of the listener, it would be more natural for these two directions to have their own ambience audio signal tracks or spatial audio field parts representing their parts of the environment.
  • portions of the automatically generated ambience audio signals may be analysed and used for creating looped, long duration or 'infinitely long' artificial ambience audio signals.
  • In figure 2 a schematic view of an example capture and mixing arrangement, where the external microphones produce both the external and ambient audio signals for mixing according to some embodiments, is shown.
  • the system shown in figure 2 comprises N microphone sources. Specifically, figure 2 shows a first microphone, mic source 1, 201₁, configured to generate a first audio signal 202₁ which is passed to the spatial mixer 205 and the ambience signal generator 203.
  • the system also shows a second microphone, mic source 2, 201₂, configured to generate a second audio signal 202₂ which is passed to the spatial mixer 205 and the ambience signal generator 203.
  • the system is further shown comprising an N'th microphone, mic source N, 201ₙ, configured to generate an N'th audio signal 202ₙ which is passed to the spatial mixer 205 and the ambience signal generator 203.
  • the external microphones 201₁ to 201ₙ can be configured to capture audio signals associated with humans, instruments, or other sound sources of interest.
  • the external microphone 201 may be a Lavalier microphone.
  • the external microphones may be any microphone external or separate to a microphone array which may capture the spatial audio signal.
  • the concept is applicable to any external/additional microphones, be they Lavalier microphones, hand held microphones, mounted microphones, or any other type.
  • the external microphones can be worn/carried by persons or mounted as close-up microphones for instruments or a microphone in some relevant location which the designer wishes to capture accurately.
  • a Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth.
  • the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output).
  • the close microphone may be configured to output the captured audio signals to a mixer.
  • the external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
  • the positions of the external microphones (mic sources) 201, and thus of the performers and/or the instruments being played, may be tracked by using position tags located on or associated with the microphone sources.
  • the external microphone comprises or is associated with a microphone position tag.
  • the microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be moved freely in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of user or microphone location.
  • the close microphone position tag may be configured to output this signal to a position tracker.
  • the system comprises a spatial mixer 205.
  • the spatial mixer 205 is configured, as in the known spatial audio mixing system shown in figure 1, to receive the external microphone audio signals, and may be configured to spatially position the audio signals and mix them to create a spatial audio signal.
  • the spatial positioning may be performed based on the positioning data from the HAIP information received.
  • the positioning information may be input manually by a sound engineer, e.g., by providing azimuth/elevation/distance for each sound source or by any other suitable position tracking method.
  • the spatial mixer 205 may, from the determined position data, render a positioned monophonic sound signal at a suitable spatial location using head-related-transfer-function (HRTF) filtering when binaural audio output is desired for headphone listening.
  • the output may be a two channel L+R signal for headphone listening, and the outputs after filtering can be summed for each microphone source to create a spatial mix signal containing all the spatially positioned sources.
  • the positioned monophonic sound signal output (for each sound source) may be panned and the panned sound source audio signals summed to create the spatial mix of sources.
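As a simple illustration of the panning alternative just mentioned, the sketch below uses constant-power stereo amplitude panning as a stand-in for full HRTF rendering; the azimuth convention (+90 degrees = full left) and the function name are assumptions, not taken from the patent.

```python
import numpy as np

def pan_and_mix(sources, azimuths_deg):
    # Constant-power stereo panning: each positioned mono source is panned
    # and the panned signals are summed to create the spatial mix.
    mix = np.zeros((2, max(len(s) for s in sources)))
    for s, az in zip(sources, azimuths_deg):
        # Map azimuth in [-90, +90] degrees to a pan angle in [0, pi/2].
        theta = (np.clip(az, -90.0, 90.0) + 90.0) / 180.0 * (np.pi / 2)
        mix[0, :len(s)] += np.sin(theta) * s  # left channel gain
        mix[1, :len(s)] += np.cos(theta) * s  # right channel gain
    return mix  # two channel L+R signal for headphone or stereo listening
```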
  • the spatial mixer 205 is configured to mix at least one ambient or ambience audio signal.
  • in the known system shown in figure 1, the ambient or ambience audio signal was generated by a spatial audio signal capture apparatus comprising an array of microphones, for example a Nokia OZO apparatus.
  • the at least one ambience audio signal is generated by an ambience signal generator 203.
  • the ambience signal generator 203 is configured to receive the audio signals 202 from the external microphones 201 and from these audio signals generate at least one ambience audio signal which may be passed to the spatial mixer 205 to be mixed with the spatially processed audio signals from the external microphones.
  • the ambience signal generator 203 in some embodiments comprises a weighted sum 211.
  • the weighted sum 211 is configured to receive the audio signals 202 from the microphone sources and generate a weighted sum of the audio signals.
  • the weighted sum 211 outputs the combined audio signal to a reverberator 213; however, in some embodiments the weighted sum 211 outputs a combined audio signal to the spatial extent synthesizer 215 directly.
  • the ambience signal generator 203 comprises a reverberator 213.
  • the reverberator 213 in some embodiments is configured to receive the output from the weighted sum 211.
  • the reverberator is configured to output a reverberated audio signal to a spatial extent synthesiser 215.
  • the ambience signal generator 203 comprises a spatial extent synthesizer 215.
  • the spatial extent synthesizer 215 is configured to receive the output from the reverberator 213 (or the weighted sum 211) and generate an ambience signal 204 which is output to the spatial mixer 205.
  • Figure 3 shows the system shown in figure 2 in further detail.
  • the example in figure 3 shows a single example microphone source 201 which is configured to output an audio signal 202 to the weighted sum 211 and spatial mixer 205 (sound object processor 331).
  • the weighted sum 211 in some embodiments comprises a signal classifier/characterizer 301 which is configured to receive the audio signal 202 from the microphone source 201 and classify or characterise the audio signal, or otherwise generate parameters which may be used by a weight determiner and normalizer to determine an ambience weighting factor.
  • the signal classifier/characterizer 301 may comprise a Voice Activity Detector (VAD).
  • the VAD may in some embodiments first perform a noise reduction stage, calculate some features or quantities from a section of the input signal, and then apply a classification rule to classify the section as speech or non-speech. In some embodiments this classification rule is based on determining whether a value exceeds a threshold. In some embodiments there may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies a lot).
  • Some VAD methods may formulate the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise.
  • the different measures which are used in VAD methods may include spectral slope, correlation coefficient, log likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
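A minimal frame-based VAD along these lines might be sketched as follows; the energy feature, threshold margin, and feedback smoothing factor are illustrative assumptions rather than the patent's method.

```python
import numpy as np

def simple_vad(x, frame_len=1024, alpha=0.95, margin=3.0):
    # Minimal energy-based VAD with the feedback described above: frames
    # classified as non-speech update the running noise estimate.
    noise_energy = None
    decisions = []
    for i in range(len(x) // frame_len):
        frame = x[i * frame_len:(i + 1) * frame_len]
        energy = float(np.mean(frame ** 2))
        if noise_energy is None:
            noise_energy = energy          # bootstrap from the first frame
        is_speech = energy > margin * noise_energy
        if not is_speech:                  # feedback: refine noise estimate
            noise_energy = alpha * noise_energy + (1 - alpha) * energy
        decisions.append(is_speech)
    return decisions                       # one speech/non-speech flag per frame
```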
  • signal classifier/characterizer 301 may comprise a spectral flatness detector 313.
  • spectral flatness, as determined by the spectral flatness detector 313, is typically measured in decibels and provides a way to quantify how noise-like a sound is, as opposed to being tone-like.
  • 'tonal' in this context refers to the amount of peaks or resonant structure in a power spectrum, as opposed to the flat spectrum of white noise.
  • a high spectral flatness (approaching 1.0 for white noise) indicates that the spectrum has a similar amount of power in all spectral bands. This spectrum would sound similar to white noise, and the graph of the spectrum would appear relatively flat and smooth.
  • a low spectral flatness (approaching 0.0 for a pure tone) indicates that the spectral power is concentrated in a relatively small number of bands. This spectrum would typically sound like a mixture of sine waves, and the spectrum would appear spiky.
  • the spectral flatness is calculated by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum.
  • the spectral flatness may in some embodiments be measured within a specified sub-band, rather than across the whole band.
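The calculation described above translates directly into code. The sketch below returns both the linear ratio and its decibel form, with an optional sub-band; the epsilon guard against log(0) is an implementation assumption.

```python
import numpy as np

def spectral_flatness(x, band=None):
    # Ratio of the geometric mean to the arithmetic mean of the power
    # spectrum: approaches 1.0 for white noise, 0.0 for a pure tone.
    power = np.abs(np.fft.rfft(x)) ** 2 + 1e-12   # epsilon avoids log(0)
    if band is not None:                          # optional sub-band measure
        power = power[band[0]:band[1]]
    flatness = np.exp(np.mean(np.log(power))) / np.mean(power)
    return flatness, 10.0 * np.log10(flatness)    # linear and decibel forms
```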
  • the signal classifier/characterizer 301 may comprise a percussiveness detector 315.
  • the percussiveness detector 315 may be configured to perform an analysis of percussiveness using, for example, the pulse-metric characterization described in "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" (speech/music discrimination, pulse-metric feature), available from https://www.ee.columbia.edu/~dpwe/papers/ScheiS97-mussp.pdf.
  • the signal classifier/characterizer 301 may comprise a harmonicity detector 317.
  • the signal classifier/characterizer 301 may comprise a content classifier 319.
  • the content classifier 319 may in some embodiments determine the content of the microphone audio signals, which may be used in determining the weights.
  • a deep neural network may be trained to classify between noise/speech/music/singing, for example, using any of the above features or signal spectrum directly.
  • the signal classifier/characterizer 301 may comprise a sound engineer input 321.
  • the sound engineer input 321 may enable weights or weight adjustments to be input by a sound engineer or other user of the system. For example a sound engineer could be offered an option to make adjustments using a graphical user interface (GUI) on a digital audio workstation (DAW).
  • the classifier 301 can output the results of the analysis to a weight determiner and normaliser 303.
  • the weighted sum 211 comprises a weight determiner and normaliser 303.
  • the weight determiner and normaliser 303 can be configured to generate weightings for each of the microphone sources based on the characterisation from the classifier/characterizer 301.
  • the ambience signal may be created from microphone audio signals that represent the overall ambience instead of active direct sources.
  • the weights w(i) for each microphone audio signal may thus in some embodiments be obtained based on analysis which determines how likely it is that each microphone signal carries ambient background noise rather than dominantly capturing a close-up sound source.
  • the output from the classifier/characterizer 301 may be used to determine the weights applied for a microphone source.
  • the analyses and the weightings may be performed over time in frames, say 1 second long, to enable the weight of a microphone signal in ambience creation to change over time.
  • the input signal is analysed to determine whether there is voice activity.
  • where the VAD indicates inactivity, it is likely that the microphone captures just background noise as the audio signal.
  • in this case the weight for this microphone source w(i) may be increased. The increase may be, for example, proportional to the current weight value.
  • the input audio signal is analysed for spectral flatness to determine how close to (white) noise the input signal is, as opposed to how likely it is that the input signal is tone-like. Any signals which receive spectral flatness measures close to 1 may receive a higher weighting in the sum because they are likely to contain noise-like content.
  • the input audio signal is analysed to determine harmonicity-related features such as fundamental frequency (pitch), harmonic concentration, or harmonicity; these features may indicate that the microphone signal contains harmonic content.
  • microphone audio signals which likely contain harmonic content may receive lower weights in the summation.
  • any microphone signals which likely contain rhythmic content are likely to contain percussion or other rhythmic material rather than background ambience. These audio signals may receive lower weighting factors.
  • classification analysis may determine the content of the microphone signals, which may be used in determining the weighting values. For example, a deep neural network may be trained to classify between noise/speech/music/singing. If the classification indicates noise, the weighting for this audio signal may be increased. Where the classification indicates speech/singing/music, the weighting for this signal may be decreased.
  • the weight determiner 303 may be configured in some embodiments to determine the weighting values based on the input by a sound engineer or other user. For example, the system might calculate initial weights using the logic above, and then a sound engineer could be offered an option to make adjustments.
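Pulling the above analyses together, one plausible weight determination and normalisation step is sketched below; the feature dictionary layout, thresholds, and scaling factors are all hypothetical choices for illustration.

```python
import numpy as np

def ambience_weights(features):
    # `features` is one dict per microphone per analysis frame, e.g.
    # {'vad_active': False, 'flatness': 0.8,
    #  'harmonicity': 0.1, 'percussiveness': 0.2}
    weights = []
    for f in features:
        w = 1.0
        if not f['vad_active']:
            w *= 1.5                           # inactive VAD: likely background
        w *= 0.5 + f['flatness']               # noise-like spectra score higher
        w *= 1.0 - 0.5 * f['harmonicity']      # penalise harmonic content
        w *= 1.0 - 0.5 * f['percussiveness']   # penalise rhythmic content
        weights.append(w)
    weights = np.asarray(weights)
    return weights / weights.sum()             # normalise across microphones
```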
  • the output of these weighting values can be passed to the weighted sum 305.
  • the weighted sum 211 comprises a weighted sum processor 305 configured to receive the audio signals and the weightings associated with each audio signal. The weighted sum processor 305 may then combine the audio signals according to the weightings generated in the weight determination and normalisation module 303. The output of the weighted sum processor 305 can be passed to the digital reverberator 213.
  • the digital reverberator 213 may be configured in some embodiments to optionally add reverberation to the combined audio signal. This additional process increases the spaciousness of the ambience signal and helps to separate it from the external microphone audio signals. For this, any suitable digital reverberator method may be applied. For example the combined audio signal may be passed through various delay lines.
  • a suitable reverberator is a Schroeder digital reverberator, details of which may be found at https://ccrma.stanford.edu/~jos/pasp/Schroeder_Reverberators.html.
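A textbook Schroeder topology (parallel feedback comb filters feeding series allpass filters) can be sketched as follows; the delay lengths, gains, and wet/dry mix are typical illustrative values, not parameters taken from the patent.

```python
import numpy as np

def feedback_comb(x, delay, g):
    # y[n] = x[n] + g * y[n - delay]; x is a 1-D float array.
    y = np.copy(x)
    for n in range(delay, len(x)):
        y[n] = x[n] + g * y[n - delay]
    return y

def allpass(x, delay, g):
    # Schroeder allpass: y[n] = -g*x[n] + x[n-delay] + g*y[n-delay].
    y = np.zeros_like(x)
    for n in range(len(x)):
        xd = x[n - delay] if n >= delay else 0.0
        yd = y[n - delay] if n >= delay else 0.0
        y[n] = -g * x[n] + xd + g * yd
    return y

def schroeder_reverb(x, mix=0.3):
    # Parallel combs into series allpasses, then wet/dry blend.
    combs = sum(feedback_comb(x, d, g) for d, g in
                [(1557, 0.84), (1617, 0.83), (1491, 0.82), (1422, 0.81)])
    wet = allpass(allpass(combs, 225, 0.7), 556, 0.7)
    return (1 - mix) * x + mix * wet
```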
  • the output of the digital reverberator 213 can be passed to the spatial extent synthesiser 215.
  • the spatial extent synthesiser 215 may receive the output of the reverberator 213 or weighted sum 211 (305) and output a spatial-extent synthesised signal as the ambience signal to the spatial mixer 205, and in some embodiments to a spatial mixer processor 333.
  • the system comprises the spatial mixer 205.
  • the spatial mixer 205 in some embodiments comprises a sound object processor 331.
  • the sound object processor 331 can be configured to analyse the audio signals from the microphone sources and output these to the spatial mixer processor 333.
  • the processing may for example comprise determining the spatial positions of the microphone sources.
  • the spatial mixer 205 may receive the ambience audio signals and the processed audio signals from the microphone audio signals and be configured to mix and/or render the audio signals based on the positioning data (which may be from the HAIP information received, input manually by a sound engineer or by any other suitable position tracking method).
  • the spatial mixer processor 333 may thus from the determined position data, render a positioned monophonic sound signal at a suitable spatial location using head-related-transfer-function (HRTF) filtering when binaural audio output is desired for headphone listening.
  • the output may be a two channel L+R signal for headphone listening, and the outputs after filtering can be summed for each microphone source to create a spatial mix signal containing all the spatially positioned sources.
  • the positioned monophonic sound signal output (for each sound source) may be panned and the panned sound source audio signals summed to create the spatial mix of sources.
  • the spatial extent synthesiser 215 receives the combined (reverberated) audio signals and spatially extends the audio signal to a defined (for example 360 degree) spatial extent using methods for spatial extent control. In other words, it takes as input a mono sound source audio signal and spatial extent parameters (width, height and depth).
  • the spatial extent synthesiser 215 comprises a suitable time to frequency domain transformer.
  • the spatial extent synthesiser 215 comprises a Short-Time Fourier Transform (STFT) 401 configured to receive the audio signal and output a suitable frequency domain output.
  • the input is a time-domain signal which is processed with hop-size of 512 samples.
  • a processing frame of 1024 samples is used, and it is formed from the current 512 samples and previous 512 samples.
  • the processing frame is zero-padded to twice its length (2048 samples) and Hann windowed.
  • the Fourier transform is calculated from the windowed frame producing the Short-Time Fourier Transform (STFT) output.
  • the STFT output is symmetric; thus it is sufficient to process the positive half of 1024 samples plus the DC component, totalling 1025 samples.
  • any suitable time to frequency domain transform may be used.
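The framing just described (512-sample hop, 1024-sample frames built from the current and previous 512 samples, zero-padding to 2048, Hann windowing, 1025 retained bins) might be implemented as follows; this is a sketch of the stated parameters, not the patent's exact code.

```python
import numpy as np

def stft_frames(x, hop=512):
    # 1024-sample frames (current 512 + previous 512), zero-padded to 2048
    # and Hann windowed; keep the 1025 positive-frequency bins per frame.
    frame_len, fft_len = 2 * hop, 4 * hop        # 1024 and 2048
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = np.zeros(fft_len)
        frame[:frame_len] = x[start:start + frame_len] * window
        frames.append(np.fft.rfft(frame))        # 1025 bins including DC
    return np.array(frames)
```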
  • the spatial extent synthesiser 215 further comprises a filter bank 403.
  • the filter bank 403 is configured to receive the output of the STFT 401 and using a set of filters generated based on a Halton sequence (and with some default parameters) generate a number of frequency bands 405.
  • Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, appear to be random for many purposes.
  • the filter bank 403 comprises a set of 9 different distribution filters, which are used to create 9 different frequency domain signals where the signals do not contain overlapping frequency components. These signals are denoted Band 1 F 405₁ to Band 9 F 405₉ in figure 4.
  • the filtering can be implemented in the frequency domain by multiplying the STFT output with stored filter coefficients for each band.
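The exact bin-to-band mapping is not spelled out in the text. One plausible reading, sketched below, assigns each frequency bin to exactly one of the 9 bands using a Halton (van der Corput) sequence, which is deterministic but low-discrepancy, so the bands contain no overlapping frequency components.

```python
import numpy as np

def halton(n, base=2):
    # n-th element of the Halton (van der Corput) sequence in `base`.
    result, f = 0.0, 1.0 / base
    while n > 0:
        result += f * (n % base)
        n //= base
        f /= base
    return result

def make_band_filters(n_bins=1025, n_bands=9):
    # Each bin goes to exactly one band, so band signals do not overlap.
    filters = np.zeros((n_bands, n_bins))
    for k in range(n_bins):
        band = int(halton(k + 1) * n_bands)   # deterministic, looks random
        filters[band, k] = 1.0
    return filters  # multiply an STFT frame by filters[b] to get band b
```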
  • the spatial extent synthesiser 215 further comprises a spatial extent input 400.
  • the spatial extent input 400 may be configured to define the spatial extent of the audio signal.
  • the spatial extent synthesiser 215 may further comprise an object position input/determiner 402.
  • the object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by the sound object processor.
  • the spatial extent synthesiser 215 may further comprise a band position determiner 404.
  • the band position determiner 404 may be configured to receive the outputs from the object position input/determiner 402 and the spatial extent input 400 and from these generate an output passed to the vector base amplitude panning processor 406.
  • the spatial extent synthesiser 215 (or spatially extending controller) is implemented using a vector base amplitude panning operation.
  • the spatial extent synthesis or spatially extending control may be implementation agnostic and any suitable implementation used to generate the spatially extending control.
  • the spatially extending control may implement direct binaural panning (using Head related transfer function filters for directions), direct assignment to the output channel locations (for example direct assignment to the loudspeakers without using any panning), synthesized ambisonics, and wave-field synthesis.
  • the spatial extent synthesiser 215 may further comprise a vector based amplitude panning (VBAP) processor 406.
  • the VBAP 406 may be configured to generate control signals to control the panning of the frequency domain signals to desired spatial positions. Given the spatial position of the sound source (azimuth, elevation) and the desired spatial extent for the source (width in degrees), the system calculates a spatial position for each frequency domain signal. For example, if the spatial position of the sound source is zero degrees azimuth (front), and the spatial extent 90 degrees, the VBAP may position the frequency bands at azimuths 45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75 and -45 degrees (this positioning rule is sketched below).
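A minimal sketch of this positioning rule (the function name is illustrative): the band positions are spread evenly across the extent, centred on the source azimuth.

```python
import numpy as np

def band_azimuths(source_az, extent_deg, n_bands=9):
    """Spread n_bands evenly across the spatial extent, centred on the
    source azimuth, reproducing the example positions above."""
    half = extent_deg / 2.0
    return source_az + np.linspace(half, -half, n_bands)

# band_azimuths(0, 90)
# -> [45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75, -45]
```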
  • the VBAP processor 406 may therefore be used to calculate a suitable gain for the signal, given the desired loudspeaker positions.
  • the VBAP processor 406 may provide gains for a signal such that it can be spatially positioned to a suitable position. These gains may be passed to a series of multipliers 407 (a minimal gain calculation is sketched below).
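A minimal two-dimensional VBAP gain calculation is sketched below, assuming a horizontal ring of loudspeakers; the default layout and function name are illustrative, and a real implementation would also handle elevation:

```python
import numpy as np

def vbap_2d_gains(az_deg, speaker_az=(45.0, 135.0, 225.0, 315.0)):
    """Minimal 2D VBAP sketch: find the adjacent loudspeaker pair enclosing
    the source direction, solve the 2x2 vector-base equation for the pair
    gains, and energy-normalise the result."""
    src = np.array([np.cos(np.radians(az_deg)), np.sin(np.radians(az_deg))])
    spk = [np.array([np.cos(np.radians(a)), np.sin(np.radians(a))])
           for a in speaker_az]
    n = len(spk)
    gains = np.zeros(n)
    for i in range(n):
        j = (i + 1) % n
        base = np.column_stack([spk[i], spk[j]])   # 2x2 vector base
        g = np.linalg.solve(base, src)
        if np.all(g >= -1e-9):                     # source inside this pair
            gains[i], gains[j] = np.clip(g, 0.0, None)
            break
    return gains / np.linalg.norm(gains)           # energy normalisation
```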
  • the spatial extent synthesiser 215 may further comprise a series of multipliers 407.
  • the series of multipliers comprises multipliers 407₁ to 407₉; however any suitable number of multipliers may be used.
  • Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
  • the products of the VBAP gains and each frequency band signal may be passed to a series of output channel sum devices 409.
  • the spatial extent synthesiser 215 may further comprise a series of sum devices 409.
  • the sum devices 409 may receive the outputs from the multipliers and combine them to generate an output channel band signal 411.
  • a 4.0 loudspeaker format output is implemented with outputs for front left (Band FL F 411₁), front right (Band FR F 411₂), rear left (Band RL F 411₃), and rear right (Band RR F 411₄) channels which are generated by sum devices 409₁, 409₂, 409₃ and 409₄ respectively (the per-channel multiply-and-sum is sketched below).
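With illustrative shapes for the 9-band, 4.0-channel example, the multipliers 407 and sum devices 409 reduce to one multiply-and-sum step:

```python
import numpy as np

# Illustrative shapes: 9 band signals of 1025 bins, 4 output channels.
band_specs = np.zeros((9, 1025), dtype=complex)   # from the filter bank 403
gains = np.zeros((9, 4))                          # per-band VBAP gains

# Each band is scaled by its channel gains (multipliers 407) and the scaled
# bands are summed per channel (sum devices 409), giving (4, 1025) spectra
# that feed the per-channel ISTFTs.
out_specs = np.einsum('bf,bc->cf', band_specs, gains)
```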
  • other loudspeaker formats or number of channels can be supported.
  • other panning methods, such as panning laws, may be used, or the signals could be assigned directly to the closest loudspeakers.
  • the spatial extent synthesiser 215 may further comprise a series of inverse Short-Time Fourier Transforms (ISTFT) 413.
  • as shown in Figure 4 there is an ISTFT 413₁ associated with the FL signal, an ISTFT 413₂ associated with the FR signal, an ISTFT 413₃ associated with the RL signal output and an ISTFT 413₄ associated with the RR signal.
  • component signals may be provided for rendering and also for analysis for the purpose of ensuring even energy distributions between the components.
  • there may be more than one ambience audio signal; in other words, the ambience audio signal may be created in two or more parts.
  • microphone audio signals on the left side of a sound scene may contribute to an ambience audio signal on the left, spatially extended to 180 degrees.
  • microphone audio signals on the right side of the sound scene may contribute to the ambiance audio signal on the right, also extended to a 180 degree extent.
  • where the space is acoustically not diffuse (e.g., the capture takes place outside), the ambience corresponds somewhat to the directions of the sources. For example, if we have a battle of the bands situation outside with two rock bands on opposite sides of the listener, it is only natural that these two directions would have their own ambience tracks.
  • any suitable division of the scene may be used for creating the ambiance signal in different parts.
  • a left/right division for ambiance creation may be suitable.
  • a division into four 90 degree sectors for ambiance creation may be suitable.
  • the division may in some embodiments be controllable from a graphical user interface of a digital audio workstation (DAW) system taking care of executing at least part of the proposed functionality; a sketch of such a sector mapping follows.
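As a sketch of one such controllable division (the function name is illustrative), a microphone can be mapped to an equal angular sector by its azimuth, so each sector can receive its own ambience sum:

```python
def sector_index(mic_azimuth_deg, n_sectors=4):
    """Map a microphone azimuth to one of n_sectors equal sectors
    (e.g. four 90 degree sectors)."""
    return int((mic_azimuth_deg % 360.0) // (360.0 / n_sectors))

# sector_index(30) -> 0, sector_index(100) -> 1, sector_index(350) -> 3
```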
  • the apparatus and method utilize spatial extent synthesis to expand the scene.
  • at least a second microphone some distance away from the first microphone may be required to allow capture of ambience. More microphones may result in better approximation of the ambient signal.
  • the decision of how many external microphones to use for generating the ambiance may be at least partly based on the number of relatively acoustically homogenous portions of the scene. That is, if the sound scene varies at different locations, it is natural to place at least one microphone to capture each homogenous portion of the scene.
  • audio texture synthesis methods may be used for reusing a portion of the ambiance created this way for some other times of the audio capture.
  • an ambience signal may be created by the proposed method during a quiet section of the event, or during any other suitable portion of the event.
  • the ambiance may be stored and looped as proposed in US 9,528,852 to create ambiance for some other times in the event.
  • such pre-generated ambiance audio signal may be used in such times of the event where all microphone signals indicate that they are capturing the external sound sources rather than background sounds. This can be determined by all microphones receiving low weights for the ambiance creation.
  • the system may also in some embodiments use a weighted sum of the ambiance created from the current microphone input and the pre-generated (looped) ambiance, as sketched below.
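A minimal sketch of such a blend; the names and the default weight are illustrative, not values from the description:

```python
import numpy as np

def blend_ambience(live_amb, looped_amb, alpha=0.5):
    """Weighted sum of live-derived ambience and pre-generated (looped)
    ambience; alpha is an illustrative mixing weight."""
    n = min(len(live_amb), len(looped_amb))
    return (alpha * np.asarray(live_amb[:n])
            + (1.0 - alpha) * np.asarray(looped_amb[:n]))
```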
  • the system may perform a pre-calibration phase, during which the ambiance captured with external microphones is matched acoustically to an ambience captured using a microphone array.
  • the magnitude response or other acoustic properties of the ambiance created from external microphones may be matched to an ambience captured from a microphone array. This enables substituting a microphone array ambience more realistically with ambience captured from external microphones, and may be useful, for example, in situations where the microphone array suddenly becomes unavailable (in breakdowns, for example).
  • the device may be any suitable electronics device or apparatus.
  • the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
  • the device 1200 may comprise a microphone 1201.
  • the microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones.
  • the microphone 1201 may be separate from the apparatus, with the audio signals transmitted to the apparatus by a wired or wireless coupling.
  • the microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
  • the microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals.
  • the microphones can be solid state microphones. In other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal.
  • the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
  • the microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
  • the device 1200 may further comprise an analogue-to-digital converter 1203.
  • the analogue-to-digital converter 1203 may be configured to receive the audio signals from each of the microphones 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required.
  • the analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means.
  • the analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
  • the device 1200 comprises at least one processor or central processing unit 1207.
  • the processor 1207 can be configured to execute various program codes such as the methods such as described herein.
  • the device 1200 comprises a memory 1211.
  • the at least one processor 1207 is coupled to the memory 1211.
  • the memory 1211 can be any suitable storage means.
  • the memory 1211 comprises a program code section for storing program codes implementable upon the processor 1207.
  • the memory 1211 can further comprise a stored data section for storing data, for example data that has been processed or to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1207 whenever needed via the memory-processor coupling.
  • the device 1200 comprises a user interface 1205.
  • the user interface 1205 can be coupled in some embodiments to the processor 1207.
  • the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205.
  • the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad.
  • the user interface 1205 can enable the user to obtain information from the device 1200.
  • the user interface 1205 may comprise a display configured to display information from the device 1200 to the user.
  • the user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200.
  • the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
  • the device 1200 comprises a transceiver 1209.
  • the transceiver 1209 in such embodiments can be coupled to the processor 1207 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless communications network.
  • the transceiver 1209 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
  • the transceiver 1209 may be configured to communicate with the renderer as described herein.
  • the transceiver 1209 can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the device 1200 may be employed as at least part of the renderer.
  • the transceiver 1209 may be configured to receive the audio signals and positional information from the microphone/close microphones/position determiner as described herein, and generate a suitable audio signal rendering by using the processor 1207 executing suitable code.
  • the device 1200 may comprise a digital-to-analogue converter 1213.
  • the digital-to-analogue converter 1213 may be coupled to the processor 1207 and/or memory 1211 and be configured to convert digital representations of audio signals (such as from the processor 1207 following an audio rendering of the audio signals as described herein) to an analogue format suitable for presentation via an audio subsystem output.
  • the digital-to-analogue converter (DAC) 1213 or signal processing means can in some embodiments be any suitable DAC technology.
  • the device 1200 can comprise in some embodiments an audio subsystem output 1215.
  • An example as shown in Figure 11 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121.
  • the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output.
  • the audio subsystem output 1215 may be a connection to a multichannel speaker system.
  • the digital to analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device.
  • the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
  • although the device 1200 is shown having audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of these elements.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process.
  • Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.


Abstract

An apparatus for generating an intended spatial audio field, the apparatus configured to: receive at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone is located in the same environment and configured to capture a sound source; analyse each audio signal to determine at least in part an ambience audio signal; generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals; and process the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.

Description

AUDIO SIGNAL GENERATION FOR SPATIAL AUDIO MIXING
Field
The present application relates to apparatus and methods for audio signal generation and ambience audio signal generation for spatial audio mixing.
Background
Capture of audio signals from multiple sources and mixing of audio signals when these sources are moving in the spatial field requires significant effort. For example the capture and mixing of an audio signal source such as a speaker or artist within an audio environment such as a theatre or lecture hall to be presented to a listener and produce an effective audio atmosphere requires significant investment in equipment and training.
A commonly implemented system is one where one or more 'external' microphones, for example a Lavalier microphone worn by the user or an audio channel associated with an instrument, are mixed with a suitable spatial (or environmental or audio field) audio signal such that the produced sound comes from an intended direction. This system is known in some areas as Spatial Audio Mixing (SAM).
The SAM system enables the creation of immersive sound scenes comprising "background spatial audio" or ambiance and sound objects for Virtual Reality (VR) applications. Often, the scene can be designed such that the overall spatial audio of the scene, such as a concert venue, is captured with a microphone array (such as one contained in the OZO virtual camera) and the most important sources captured using the 'external' microphones.
However, there are scenarios where spatial audio capture apparatus such as OZO are not available, but a content producer would like to create high quality VR sound scenes with spatial ambiance and high quality close-up sources. Thus there is a need to generate solutions which enable this.
Furthermore in many live situations a designated spatial audio capture device, such as an OZO device, captures audio that is unusable for professional audio production for several possible reasons. For example the spatial audio capture device may capture unintended audio, e.g., the live mix for the audience or close microphone ambience. Furthermore in some circumstances the signal-to-noise ratio at the spatial capture device is not good enough to represent even the ambience of the scene, for example where the capture device is mounted on a moving car. Also, in some circumstances, the spatial audio capture device may not represent the spatial scene that is artistically desired, even though something similar is the target. Thus, there is a need to develop solutions which can determine these circumstances and enable the provision of alternative ambient or spatial audio signals for the spatial audio mixing and sound track creation process.
Summary
There is provided according to a first aspect an apparatus for generating an intended spatial audio field, the apparatus configured to: receive at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone is located in the same environment and configured to capture a sound source; analyse each audio signal to determine at least in part an ambience audio signal; generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals; and process the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
The apparatus may further be configured to apply a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
The apparatus configured to generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals may be configured to generate for and apply to at least one of the at least two audio signals a weighting value before generating the sum audio signal, wherein the weighting value may be based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal. The apparatus configured to generate for at least one of the at least two audio signals a weighting value may be further configured to normalise the weighting value for at least one of the at least two audio signals.
The apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to apply one of: vector base amplitude panning to the sum audio signal; direct binaural panning to the sum audio signal; direct assignment to channel output location to the sum audio signal; synthesized ambisonics to the sum audio signal; and wavefield synthesis to the sum audio signal.
The apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to: determine a spatial extent parameter; determine at least one position associated with the microphones; determine at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
The apparatus configured to apply vector base amplitude panning to the sum audio signal may be further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the sum audio signal.
The apparatus configured to generate the intended spatial audio field may be configured to generate a plurality of intended spatial audio field parts, wherein at least one part of the intended spatial audio field may be at least one of: partially overlapping a neighbouring part; non-overlapping at least one other part; contained within at least one other part; and containing at least one other part.
The apparatus may be configured to generate: at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
The first part of the environment may be a left portion of the environment with respect to the apparatus, and the second part of the environment may be a right portion of the environment with respect to the apparatus.
The first part of the environment may be a front portion of the environment with respect to the apparatus, and the second part of the environment may be a rear portion of the environment with respect to the apparatus. The apparatus may be further configured to determine a position of the at least one microphone of the microphones relative to the apparatus.
The apparatus may be further configured to: receive at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; compare the at least one audio signal from the capture device to the at least one audio signal; control the generation of the sum audio signal from microphones located within the intended spatial audio field, and process the sum audio signal to generate the intended spatial audio field based on the comparison.
The apparatus may be further configured to mix the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
The apparatus configured to process the sum audio signal to spatially extend the sum audio signal may be configured to spatially extend the sum audio signal such that the at least one spatially extended sum audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
According to a second aspect there is provided a method for generating an intended spatial audio field, the method comprising: receiving at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone being located in the same environment and configured to capture a sound source; analysing each audio signal to determine at least in part an ambience audio signal; generating a sum audio signal from the determined ambience signal based on the at least two audio signals; and processing the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
The method may further comprise applying a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
Generating the sum audio signal may comprise: generating for at least one of the at least two audio signals a weighting value; and applying to at least one of the at least two audio signals the weighting value before generating the sum audio signal, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; a determination of content classification type within the audio signal; and at least one user generated input associated with the audio signal.
Generating the weighting value may further comprise normalising the weighting value for at least one of the at least two audio signals.
Processing the sum audio signal to spatially extend the sum audio signal may comprise applying one of: vector base amplitude panning to the sum audio signal; direct binaural panning to the sum audio signal; direct assignment to channel output location to the sum audio signal; synthesized ambisonics to the sum audio signal; and wavefield synthesis to the sum audio signal.
Processing the sum audio signal to spatially extend the sum audio signal may comprise: determining a spatial extent parameter; determining at least one position associated with the microphones; determining at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
Applying vector base amplitude panning to the sum audio signal may further comprise generating panning vectors for the application of vector base amplitude panning to frequency bands of the weighted sum.
Generating the intended spatial audio field may comprise generating a plurality of intended spatial audio field parts, wherein at least one part is at least one of: partially overlapping a neighbouring part; non-overlapping at least one other part; contained within at least one other part; and containing at least one other part.
The method may comprise: generating at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and generating at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
The first part of the environment may be a left portion of the environment, and the second part of the environment may be a right portion of the environment. The first part of the environment may be a front portion of the environment, and the second part of the environment may be a rear portion of the environment.
The method may further comprise determining a position of the at least one microphone of the microphones relative to the apparatus.
The method may further comprise: receiving at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; comparing the at least one audio signal from the capture device to the at least one audio signal; controlling the generation of the sum audio signal from microphones located within the intended spatial audio field; and processing the sum audio signal to generate the intended spatial audio field based on the comparison.
The method may further comprise mixing the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
Processing the sum audio signal to spatially extend the sum audio signal comprises spatially extending the sum audio signal such that the at least one spatially extended audio signal may be one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
According to a third aspect there is provided an apparatus for generating at least one spatially extended audio signal associated with a sound scene, the apparatus configured to: receive at least two audio signals, wherein each audio signal is received from a separate microphone located within the sound scene; generate a sum of the at least two audio signals; and apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal, wherein the at least one spatially extended audio signal is an ambience audio signal for mixing with at least one of the at least two audio signals to generate at least one spatial audio field.
The apparatus may be further configured to apply a reverberation to the sum before the application of the spatially extended control.
The apparatus configured to generate a sum may be configured to generate for and apply to at least one of the at least two audio signals a weighting value before generating the sum, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal.
The apparatus configured to generate for at least one of the at least two audio signals a weighting value may be further configured to normalise the weighting value for at least one of the at least two audio signals.
The apparatus configured to apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may be configured to apply one of: vector base amplitude panning to the sum of the at least two audio signals; direct binaural panning to the sum of the at least two audio signals; direct assignment to channel output location to the sum of the at least two audio signals; synthesized ambisonics to the sum of the at least two audio signals; and wavefield synthesis to the sum of the at least two audio signals.
The apparatus configured to apply a spatially extended control to the sum of the at least two audio signals may be configured to: determine a spatial extent parameter; determine at least one position associated with the microphones; determine at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter; and generate panning vectors for the application of vector base amplitude panning to frequency bands of the sum of the at least two audio signals.
The apparatus may be configured to generate a plurality of audio signals, each of the plurality of audio signals are associated with a portion of the sound scene, wherein at least one portion of the sound scene is at least one of: partially overlapping a neighbouring portion; non-overlapping at least one other portion; contained within at least one other portion; and containing at least one other portion.
The apparatus may be configured to generate: at least one first audio signal associated with a first portion of the sound scene, the first portion of the sound scene comprising at least one sound source; and at least one second audio signal associated with a second portion of the sound scene, the second portion of the sound scene comprising at least one further sound source. The first portion of the sound scene may be a left portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a right portion of the sound scene with respect to the apparatus.
The first portion of the sound scene may be a front portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a rear portion of the sound scene with respect to the apparatus.
The apparatus may be further configured to determine a position of the at least one microphone of the microphones relative to the apparatus.
The apparatus may be further configured to: receive at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; compare the at least one audio signal from the capture device to the at least one audio signal; control the generation of the sum of the at least two audio signals from microphones located within the sound scene, and apply the spatially extended control to the sum of the at least two audio signals to generate the at least one audio signal based on the comparison.
The apparatus may be located in the sound scene comprising at least one sound source and at least one of the at least two microphones is associated with the at least one sound source within the sound scene.
The apparatus may be further configured to mix the at least one spatially extended audio signal with at least one of the at least two audio signals to generate at least one spatial audio field.
The apparatus configured to apply a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may be configured to spatially extend the sum of the at least two audio signals such that the at least one spatially extended audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
According to a fourth aspect there is provided a method for generating at least one spatially extended audio signal associated with a sound scene, the method comprising: receiving at least two audio signals, wherein each audio signal is received from a separate microphone located within the sound scene; generating a sum of the at least two audio signals; and applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal, wherein the at least one spatially extended audio signal is an ambience audio signal for mixing with at least one of the at least two audio signals to generate at least one spatial audio field.
The method may further comprise applying a reverberation to the sum before the application of the spatially extended control.
Generating a sum may comprise: generating for at least one of the at least two audio signals a weighting value; and applying to at least one of the at least two audio signals the weighting value before generating the sum, wherein the weighting value is based on at least one of: a detection of voice activity within the audio signal; a determination of spectral flatness within the audio signal; a determination of percussiveness within the audio signal; a determination of harmonicity within the audio signal; a determination of content classification type within the audio signal; a determination of silence within the audio signal; a determination of noise within the audio signal; and at least one user generated input associated with the audio signal.
Generating for at least one of the at least two audio signals a weighting value may further comprise normalising the weighting value for at least one of the at least two audio signals.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying a vector base amplitude panning to the sum of the at least two audio signals.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying direct binaural panning to the sum of the at least two audio signals.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying direct assignment to channel output location to the sum of the at least two audio signals.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying synthesized ambisonics to the sum of the at least two audio signals.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise applying wavefield synthesis to the sum of the at least two audio signals. Applying a vector base amplitude panning to the sum of the at least two audio signals may comprise: determining a spatial extent parameter; determining at least one position associated with the microphones located within the sound scene; determining at least one frequency band position based on the at least one position associated with the microphones located within the sound scene and the spatial extent parameter; and generating panning vectors for the application of vector base amplitude panning to frequency bands of the sum of the at least two audio signals.
Generating at least one spatially extended audio signal associated with a sound scene may comprise generating a plurality of audio signals, each of the plurality of audio signals are associated with a portion of the sound scene, wherein at least one portion of the sound scene is at least one of: partially overlapping a neighbouring portion; non-overlapping at least one other portion; contained within at least one other portion; and containing at least one other portion.
Generating at least one spatially extended audio signal associated with a sound scene may comprise generating: at least one first audio signal associated with a first portion of the sound scene, the first portion of the sound scene comprising at least one sound source; and at least one second audio signal associated with a second portion of the sound scene, the second portion of the sound scene comprising at least one further sound source.
The first portion of the sound scene may be a left portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a right portion of the sound scene with respect to the apparatus.
The first portion of the sound scene may be a front portion of the sound scene with respect to the apparatus, and the second portion of the sound scene may be a rear portion of the sound scene with respect to the apparatus.
The method may further comprise determining a position of the at least one microphone of the microphones relative to the apparatus.
The method may further comprise: receiving at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene; comparing the at least one audio signal from the capture device to the at least one audio signal; controlling the generation of the sum of the at least two audio signals from microphones located within the sound scene, and applying the spatially extended control to the sum of the at least two audio signals to generate the at least one audio signal based on the comparison.
The method may further comprise mixing the at least one spatially extended audio signal with at least one of the at least two audio signals to generate at least one spatial audio field.
Applying a spatially extended control to the sum of the at least two audio signals to generate the at least one spatially extended audio signal may comprise spatially extending the sum of the at least two audio signals such that the at least one spatially extended audio signal is one of: fully spatially extended to 360 degrees; and partially spatially extended up to 360 degrees.
The apparatus may be located in the sound scene comprising at least one sound source and at least one of the at least two microphones is associated with the at least one sound source within the sound scene.
An apparatus may comprise means for implementing the method as described herein.
A computer program product stored on a medium may cause an apparatus to perform the method as described herein.
An electronic device may comprise apparatus as described herein.
A chipset may comprise apparatus as described herein.
Embodiments of the present application aim to address problems associated with the state of the art.
Summary of the Figures
For a better understanding of the present application, reference will now be made by way of example to the accompanying drawings in which:
Figure 1 shows schematically an example known capture and mixing arrangement where the external microphones and the microphone array produce both external and ambient audio signals respectively for mixing;
Figure 2 shows schematically an example capture and mixing arrangement where the external microphones produce both the external and ambient audio signals for mixing according to some embodiments;
Figure 3 shows schematically the example capture and mixing arrangement shown in Figure 2 in further detail according to some embodiments;
Figure 4 shows schematically the spatial extent synthesizer shown in Figures 2 and 3 in further detail according to some embodiments; and
Figure 5 shows schematically an example device suitable for implementing the apparatus shown in Figures 2 to 4.
Embodiments of the Application
The following describes in further detail suitable apparatus and possible mechanisms for the provision of effective ambient (or ambience) audio signal generation from the capture of audio signals from multiple sources. Furthermore the following describes mixing of the ambient and external audio signals. In the following examples, audio signals and audio capture signals are described. However it would be appreciated that in some embodiments the apparatus may be part of any suitable electronic device or apparatus configured to capture an audio signal or receive the audio signals and other information signals.
A conventional approach to the capturing and mixing of sound sources with respect to an audio background or environment audio field signal would be for a professional producer to utilize an external microphone (a close or Lavalier microphone worn by the user, or a microphone attached to an instrument or some other microphone) to capture audio signals close to the sound source, and further utilize a 'background' microphone or microphone array to capture an environmental audio signal. These signals or audio tracks may then be manually mixed to produce an output audio signal such that the produced sound features the sound source coming from an intended (though not necessarily the original) direction.
With respect to Figure 1 is shown a first known example capture and mixing arrangement. Figure 1 shows an example external or close microphone and tag 101 which is configured to transmit HAIP signals which are received by the microphone array and tag receiver 103 in order to determine the actual position of the external microphone 101 relative to the microphone array 103. The actual position may be passed to a mixer 105. The external microphone may furthermore generate an external audio signal 102 which is passed to the mixer 105.
The microphone array and tag receiver 103 may furthermore generate an ambient or spatial field audio signal 104 which is passed to the mixer 105. Having received the external microphone audio signal 102 and the microphone array audio signal 104 the mixer can in some embodiments mix the two to determine a mixed audio signal 106. The mixed audio signal 106 may be generated in some embodiments based on a user input such as the positional user input 109. The mixed audio signal may furthermore be passed to a renderer 107 wherein the mixed audio signal is rendered into a format suitable for outputting to a user. The renderer 107 in some embodiments may be configured to use vector-base amplitude panning techniques when loudspeaker domain output is desired (e.g. 5.1 channel output) or use head-related transfer-function filtering if binaural output for headphone listening is desired.
However as discussed above there are scenarios where either there is no spatial audio capture apparatus (such as Nokia's OZO) available or the spatial audio capture apparatus captures audio that is unusable for professional audio production, while the goal is still to create high quality VR sound scenes with spatial ambiance and high quality close-up sources.
The concept as described herein may be considered to be an enhancement to conventional Spatial Audio Capture (SPAC) technology. Spatial audio capture technology can process audio signals captured via a microphone array into a spatial audio format; in other words it generates an audio signal format with a spatial perception capacity. The concept may thus be embodied in a form where audio signals may be captured such that, when rendered to a user, the user can experience the sound field as if they were present at the location of the capture device. Spatial audio capture can be implemented for microphone arrays found in mobile devices. In addition, audio processing derived from the spatial audio capture may be employed within a presence-capturing device such as the Nokia OZO (OZO) devices.
In the examples described herein the audio signal is rendered into a suitable binaural form, where the spatial sensation may be created using rendering such as by head-related-transfer-function (HRTF) filtering a suitable audio signal.
The concept as described with respect to the embodiments herein makes it possible to capture and remix an external and environmental audio signal more effectively and produce a better quality output where the sound or sound sources are more widely distributed. The concept may for example be embodied as a capture system configured to capture both two or more external (speaker, instrument or other source) audio signals and a processor configured to generate from the two or more external audio signals an spatial or environmental (audio field) audio signal.
Although capture and render systems may be separate, it is understood that they may be implemented with the same apparatus or may be distributed over a series of physically separate but communication capable apparatus. For example, a presence-capturing device such as the OZO device could be equipped with an additional interface for receiving location data and close microphone audio signals, and could be configured to perform the capture part. The output of a capture part of the system may be the microphone audio signals (e.g. as a 5.1 channel downmix), the close microphone audio signals (which may furthermore be time-delay compensated to match the time of the microphone array audio signals), and the position information of the close microphones (such as a time-varying azimuth, elevation, distance with regard to the microphone array).
The renderer as described herein may be an audio playback device (for example a set of headphones), user input (for example motion tracker), and software capable of mixing and audio rendering. In some embodiments the user input and audio rendering parts may be implemented within a computing device with display capacity such as a mobile phone, tablet computer, virtual reality headset, augmented reality headset etc.
Furthermore it is understood that at least some elements of the following mixing and rendering may be implemented within a distributed computing system such as known as the 'cloud'.
In the following concept the apparatus and method utilize external (close-up microphone) audio signals and spatial extent processing to create an ambiance-like signal without a spatial capture device microphone array. In some embodiments the apparatus is configured to sum the signals captured by the external microphones and spatially extend this summed audio signal to cover a full spatial audio field (360 degrees).
In some embodiments, the sum of external microphone signals used for ambiance audio signal creation can be weighted. That is, the contribution of each external microphone audio signal used in the creation of the ambiance audio signal can be weighted based on various criteria. For example, one weighting criterion may be how likely it is that the microphone audio signal contains silence or just background audio. Other audio weighting criteria may be based on voice activity detection (VAD) processing (where VAD inactivity implies that a microphone is capturing background noise). Further audio weighting criteria may include noisiness detection, which may indicate that the external microphone signal contains noise; this may be compared against analysis of harmonicity and percussiveness, which would indicate a high likelihood of an actual instrument/voice within the audio signal. Thus, for example, audio signals associated with external microphones with high scores for noise/ambiance may receive larger weighting in the sum used to create the ambiance audio signal, compared to audio signals from external microphones which have high levels of detected harmonic/percussive components. A minimal weighted-sum sketch follows.
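This sketch assumes per-microphone 'background-likeness' scores have already been computed by the analysis above; the function and argument names are illustrative, not from the description:

```python
import numpy as np

def ambience_weighted_sum(signals, background_scores):
    """Weight each external microphone signal by how background-like it is
    (e.g. high VAD inactivity or noisiness), normalise the weights, and sum
    to form a single ambience candidate signal."""
    w = np.asarray(background_scores, dtype=float)
    w = w / w.sum()                           # normalise the weights
    stacked = np.vstack(signals)              # (n_mics, n_samples)
    return (w[:, None] * stacked).sum(axis=0)
```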
In some embodiments, the ambiance signal is created differently for different parts of the sound scene. For example the ambience signal generator may generate different left and right scene ambient audio signals. Furthermore if the sound scene is acoustically not diffuse (e.g., the scene is outside), then a more natural ambience audio signal may correspond somewhat to the directions of the external audio signal microphones/sources. For example, in a "battle of the bands" situation outside with two rock bands on opposite sides of the listener, it would be more natural that these two directions would have their own ambience audio signal tracks or spatial audio field parts representing their environment parts.
In some embodiments, portions of the automatically generated ambiance audio signals may be analysed and used for creating looped, long duration or 'infinitely long' artificial ambiance audio signals.
With respect to figure 2 a schematic view of an example capture and mixing arrangement, where the external microphones produce both the external and ambient audio signals for mixing according to some embodiments, is shown. The system shown in figure 2 shows N microphone sources. Specifically figure 2 shows a first microphone, mic source 1, 201₁, configured to generate a first audio signal 202₁ which is passed to the spatial mixer 205 and the ambience signal generator 203. The system also shows a second microphone, mic source 2, 201₂, configured to generate a second audio signal 202₂ which is passed to the spatial mixer 205 and the ambience signal generator 203. Furthermore the system is shown comprising an N'th microphone source, mic source N, 201N, configured to generate an N'th audio signal 202N which is passed to the spatial mixer 205 and the ambience signal generator 203.
The external microphones 201₁ to 201N can be configured to capture audio signals associated with humans, instruments, or other sound sources of interest.
For example the external microphone 201 may be a Lavalier microphone. The external microphones may be any microphone external or separate to a microphone array which may capture the spatial audio signal. Thus the concept is applicable to any external/additional microphones, be they Lavalier microphones, hand held microphones, mounted mics, or whatever. The external microphones can be worn/carried by persons or mounted as close-up microphones for instruments or a microphone in some relevant location which the designer wishes to capture accurately. A Lavalier microphone typically comprises a small microphone worn around the ear or otherwise close to the mouth. For other sound sources, such as musical instruments, the audio signal may be provided either by a Lavalier microphone or by an internal microphone system of the instrument (e.g., pick-up microphones in the case of an electric guitar) or an internal audio output (e.g., an electric keyboard output). In some embodiments the close microphone may be configured to output the captured audio signals to a mixer. The external microphone may be connected to a transmitter unit (not shown), which wirelessly transmits the audio signal to a receiver unit (not shown).
In some embodiments the positions of the external microphones (mic sources) 201, and thus of the performers and/or the instruments being played, may be tracked by using position tags located on or associated with the microphone source. Thus for example the external microphone comprises or is associated with a microphone position tag. The microphone position tag may be configured to transmit a radio signal such that an associated receiver may determine information identifying the position or location of the close microphone. It is important to note that microphones worn by people can be freely moved in the acoustic space, and a system supporting location sensing of wearable microphones has to support continuous sensing of the user or microphone location. The close microphone position tag may be configured to output this signal to a position tracker. Although the following examples show the use of the HAIP (high accuracy indoor positioning) radio frequency signal to determine the location of the close microphones it is understood that any suitable position estimation system may be used (for example satellite-based position estimation systems, inertial position estimation, beacon based position estimation etc.).
In some embodiments the system comprises a spatial mixer 205. The spatial mixer 205 is configured, as in the known spatial audio mixing system shown in figure 1, to receive the external microphone audio signals, spatially position them, and mix them to create a spatial audio signal.
The spatial positioning may be performed based on the positioning data from the HAIP information received. Alternatively the positioning information may be input manually by a sound engineer, e.g., by providing azimuth/elevation/distance for each sound source or by any other suitable position tracking method.
The spatial mixer 205 may from the determined position data, render a positioned monophonic sound signal at a suitable spatial location using head-related-transfer-function (HRTF) filtering when binaural audio output is desired for headphone listening. The output may be a two channel L+R signal for headphone listening, and the outputs after filtering can be summed for each microphone source to create a spatial mix signal containing all the spatially positioned sources. Correspondingly, in some embodiments when creating a loudspeaker domain output, the positioned monophonic sound signal output (for each sound source) may be panned and the panned sound source audio signals summed to create the spatial mix of sources.
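As a simple illustration of the pan-and-sum operation described above, consider the following minimal Python sketch, which pans each mono source with a constant-power law and sums the results into a stereo spatial mix; the function name, the stereo output format and the -90..+90 degree azimuth convention are illustrative assumptions rather than details taken from the disclosure, and HRTF filtering for binaural output is omitted.

import numpy as np

def pan_and_sum(sources, azimuths_deg):
    # Constant-power pan each mono source to a stereo position and sum
    # the panned signals into one spatial mix (illustrative sketch).
    n = max(len(s) for s in sources)
    mix = np.zeros((n, 2))
    for sig, az in zip(sources, azimuths_deg):
        # Map azimuth -90..+90 degrees to a pan angle 0..pi/2; the
        # sin/cos law keeps total power constant across pan positions.
        theta = (az + 90.0) / 180.0 * (np.pi / 2.0)
        gain_l, gain_r = np.cos(theta), np.sin(theta)
        sig = np.asarray(sig, dtype=float)
        mix[:len(sig), 0] += gain_l * sig
        mix[:len(sig), 1] += gain_r * sig
    return mix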
Furthermore in some embodiments the spatial mixer 205 is configured to mix at least one ambient or ambience audio signal. In the known systems the ambient or ambience audio signal was generated from a spatial audio signal capture apparatus comprising an array of microphones, for example a Nokia OZO apparatus. However in the embodiments as described hereafter the at least one ambience audio signal is generated by an ambience signal generator 203.
The ambience signal generator 203 is configured to receive the audio signals 202 from the external microphones 201 and from these audio signals generate at least one ambience audio signal which may be passed to the spatial mixer 205 to be mixed with the spatially processed audio signals from the external microphones.
The ambience signal generator 203 in some embodiments comprises a weighted sum 211. The weighted sum 211 is configured to receive the audio signals 202 from the microphone sources and generate a weighted sum of the audio signals. In some embodiments the weighted sum 211 outputs the combined audio signal to a reverberator 213, however in some embodiments the weighted sum 211 outputs a combined audio signal to the spatial extent synthesizer 215 directly.
In some embodiments the ambience signal generator 203 comprises a reverberator 213. The reverberator 213 in some embodiments is configured to receive the output from the weighted sum 211. The reverberator is configured to output a reverberated audio signal to a spatial extent synthesiser 215.
In some embodiments the ambience signal generator 203 comprises a spatial extent synthesizer 215. The spatial extent synthesizer 215 is configured to receive the output from the reverberator 213 (or the weighted sum 211) and generate an ambience signal 204 which is output to the spatial mixer 205.
Figure 3 shows the system shown in figure 2 in further detail. The example in figure 3 shows a single example microphone source 201 which is configured to output an audio signal 202 to the weighted sum 211 and spatial mixer 205 (sound object processor 331).
The weighted sum 211 in some embodiments comprises a signal classifier/characterizer 301 which is configured to receive the audio signal 202 from the microphone source 201 and classify or characterise the audio signal or otherwise generate parameters which may be used by a weight determiner and normalizer to determine an ambience weighting factor.
In some embodiments the signal classifier/characterizer 301 may comprise a voice activity detector (VAD) 311 configured to categorise the audio signal as speech or non-speech. The VAD may in some embodiments first perform a noise reduction stage, calculate some features or quantities from a section of the input signal, and then apply a classification rule to classify the section as speech or non-speech. In some embodiments this classification rule is based on determining whether a value exceeds a threshold. In some embodiments there may be some feedback in this sequence, in which the VAD decision is used to improve the noise estimate in the noise reduction stage, or to adaptively vary the threshold(s). These feedback operations improve the VAD performance in non-stationary noise (i.e. when the noise varies considerably). Some VAD methods may formulate the decision rule on a frame by frame basis using instantaneous measures of the divergence distance between speech and noise. The measures used in VAD methods may include spectral slope, correlation coefficients, log likelihood ratios, cepstral and weighted cepstral distances, and modified distance measures.
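A minimal VAD of the kind outlined above may be sketched as follows; the frame length, smoothing factor and decision margin are illustrative assumptions, and the feedback of non-speech frames into the noise estimate mirrors the adaptive behaviour described in the text.

import numpy as np

def simple_vad(signal, frame_len=1024, alpha=0.95, margin=2.5):
    # Classify each frame as speech when its energy exceeds an adaptive
    # noise-floor estimate by a margin; non-speech frames feed back into
    # the noise estimate (illustrative parameter values throughout).
    noise = None
    decisions = []
    for i in range(len(signal) // frame_len):
        frame = np.asarray(signal[i * frame_len:(i + 1) * frame_len], dtype=float)
        energy = float(np.mean(frame ** 2))
        if noise is None:
            noise = energy  # bootstrap the noise estimate from frame 0
        is_speech = energy > margin * noise
        if not is_speech:
            # Feedback: only non-speech frames update the noise floor.
            noise = alpha * noise + (1.0 - alpha) * energy
        decisions.append(is_speech)
    return decisions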
In some embodiments the signal classifier/characterizer 301 may comprise a spectral flatness detector 313. The spectral flatness detector 313 computes a spectral flatness measure, typically expressed in decibels, which provides a way to quantify how noise-like a sound is, as opposed to being tone-like.
The meaning of tonal in this context is in the sense of the number of peaks or the resonant structure in a power spectrum, as opposed to the flat spectrum of white noise. A high spectral flatness (approaching 1.0 for white noise) indicates that the spectrum has a similar amount of power in all spectral bands. This spectrum would sound similar to white noise, and the graph of the spectrum would appear relatively flat and smooth. A low spectral flatness (approaching 0.0 for a pure tone) indicates that the spectral power is concentrated in a relatively small number of bands. This spectrum would typically sound like a mixture of sine waves, and the spectrum would appear spiky. In some embodiments the spectral flatness is calculated by dividing the geometric mean of the power spectrum by the arithmetic mean of the power spectrum.
The spectral flatness may in some embodiments be measured within a specified sub-band, rather than across the whole band.
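The spectral flatness computation described above (geometric mean of the power spectrum divided by its arithmetic mean, optionally restricted to a sub-band) may be sketched as follows; the bin-range arguments are an illustrative way of expressing the sub-band option.

import numpy as np

def spectral_flatness(frame, lo_bin=None, hi_bin=None):
    # Geometric mean over arithmetic mean of the power spectrum:
    # ~1.0 for white noise, ~0.0 for a pure tone. lo_bin/hi_bin
    # optionally restrict the measure to a sub-band.
    power = np.abs(np.fft.rfft(frame)) ** 2
    power = np.maximum(power[lo_bin:hi_bin], 1e-12)  # avoid log(0)
    geometric = np.exp(np.mean(np.log(power)))
    arithmetic = np.mean(power)
    return geometric / arithmetic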
In some embodiments the signal classifier/characterizer 301 may comprise a percussiveness detector 315. The percussiveness detector 315 may be configured to perform an analysis of percussiveness using, for example, the pulse-metric feature described in "Construction and Evaluation of a Robust Multifeature Speech/Music Discriminator" (Scheirer & Slaney), available from https://www.ee.columbia.edu/~dpwe/papers/ScheiS97-mussp.pdf.
In some embodiments the signal classifier/characterizer 301 may comprise a harmonicity detector 317. The harmonicity detector 317 may for example be similar to the harmonicity detector of Srinivasan & Kankanhalli, "Harmonicity and Dynamics-Based Features for Audio", which is available from http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=1326828.
In some embodiments the signal classifier/characterizer 301 may comprise a content classifier 319. The content classifier 319 may in some embodiments determine the content of the microphone audio signals, the result being used in determining the weights. For example, a deep neural network may be trained to classify between noise/speech/music/singing, for example using any of the above features or the signal spectrum directly.
In some embodiments the signal classifier/characterizer 301 may comprise a sound engineer input 321. The sound engineer input 321 may enable weights or weight adjustments to be input by a sound engineer or other user of the system. For example a sound engineer could be offered an option to make adjustments using a graphical user interface (GUI) on a digital audio workstation (DAW).
The classifier 301 can output the results of the analysis to a weight determiner and normaliser 303.
In some embodiments the weighted sum 211 comprises a weight determiner and normaliser 303. The weight determiner and normaliser 303 can be configured to generate weightings for each of the microphone sources based on the characterisation from the classifier/characterizer 301. In one example the weight determiner generates weights for each microphone audio signal so that the weighted sum processor 305 is configured to multiply each microphone signal with an equal weighting (w(i)=1/N for i=1 to N microphone sources) to create the ambience audio signal. However, in some embodiments the ambience signal may be created from microphone audio signals that represent the overall ambience instead of active direct sources. The weights w(i) for each microphone audio signal may thus in some embodiments be obtained based on analysis which determines how likely it is that each microphone signal carries ambient background noise instead of dominantly capturing a close-up sound source. The sum of w(i) for i ranging from 1 to N may in some embodiments be normalized to unity, in other words sum(w(i), i=1:N) = 1.
The output from the classifier/characterizer 301 may be used to determine the weights applied for a microphone source. In some embodiments the analyses and the weightings may be performed over time in frames, say 1 second long, to enable the weight of a microphone signal in ambiance creation to change over time. Thus, this effectively implements time multiplexing, that is, different external microphones can be used at different times for ambiance creation.
In some embodiments the input signal is analysed to determine whether there is voice activity. When the VAD indicates inactivity, it is likely that the microphone captures just background noise as the audio signal. In this case, the weight for this microphone source w(i) may be increased. The increase may be, for example, proportional to the current weight value.
Also in some embodiments the input audio signal is analysed for spectral flatness to determine how close to (white) noise the input signal is versus how likely it is that the input signal is tone-like. Any signals which receive spectral flatness measures close to 1 may receive a higher weighting in the sum because they are likely to contain noise-like content.
Furthermore in some embodiments the input audio signal is analysed to determine harmonicity related features such as fundamental frequency (pitch), harmonic concentration, or harmonicity, which may indicate that the microphone signal contains harmonic content. Microphone audio signals which likely contain harmonic content may receive lower weights in the summation.
In some embodiments the input audio signal is analysed for percussiveness; microphone signals with high percussiveness likely contain percussion or other rhythmic content rather than background ambience. These audio signals may receive lower weighting factors.
A content classification analysis may determine the content of the microphone signals, the result being used in determining the weighting values. For example, a deep neural network may be trained to classify between noise/speech/music/singing. If the classification indicates noise, the weighting for this audio signal may be increased. Where the classification indicates speech/singing/music, the weighting for this signal may be decreased.
Also the weight determiner 303 may be configured in some embodiments to determine the weighting values based on the input by a sound engineer or other user. For example, the system might calculate initial weights using the logic above, and then a sound engineer could be offered an option to make adjustments.
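Drawing the above heuristics together, the weight determination and normalisation may be sketched as follows; the feature names, thresholds and adjustment factors are illustrative assumptions, and only the overall logic (noise-like sources weighted up, source-like content weighted down, weights normalised to unity) follows the text. Sound engineer adjustments would then be applied to the returned weights before renormalising.

import numpy as np

def ambience_weights(features):
    # features: one dict per microphone source with illustrative keys
    #   'vad'        : fraction of frames with voice activity (0..1)
    #   'flatness'   : spectral flatness (0..1)
    #   'harmonic'   : harmonicity score (0..1)
    #   'percussive' : percussiveness score (0..1)
    n = len(features)
    w = np.full(n, 1.0 / n)                # start from equal weights
    for i, f in enumerate(features):
        if f['vad'] < 0.1:
            w[i] *= 1.5                    # VAD inactive: likely background
        w[i] *= 0.5 + f['flatness']        # noise-like spectra weighted up
        w[i] *= 1.0 - 0.5 * f['harmonic']  # harmonic content weighted down
        w[i] *= 1.0 - 0.5 * f['percussive']
    return w / np.sum(w)                   # normalise so sum(w) == 1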
These weighting values can be passed to the weighted sum processor 305.
In some embodiments the weighted sum 211 comprises a weighted sum processor 305 configured to receive the audio signals and the weightings associated with each audio signal. The weighted sum processor 305 may then combine the audio signals according to the weightings generated in the weight determination and normalisation module 303. The output of the weighted sum processor 305 can be passed to the digital reverberator 213. The digital reverberator 213 may be configured in some embodiments to optionally add reverberation to the combined audio signal. This additional process increases the spaciousness of the ambience signal and helps to separate it from the external microphone audio signals. For this, any suitable digital reverberation method may be applied. For example the combined audio signal may be passed through various delay lines. An example of a suitable reverberator is a Schroeder digital reverberator, details of which may be found from https://ccrma.stanford.edu/~jos/pasp/Schroeder_Reverberators.html. The output of the digital reverberator 213 can be passed to the spatial extent synthesiser 215.
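A Schroeder reverberator of the kind referenced above may be sketched as four parallel feedback comb filters followed by two series all-pass filters; the delay lengths and gains below are common textbook values chosen for sample rates around 44.1 to 48 kHz, not parameters from the disclosure.

import numpy as np

def schroeder_reverb(x):
    # Four parallel feedback comb filters followed by two series
    # all-pass filters (illustrative delays and gains).
    def comb(x, delay, g):
        y = np.array(x, dtype=float)
        for n in range(delay, len(y)):
            y[n] += g * y[n - delay]   # y[n] = x[n] + g * y[n - delay]
        return y

    def allpass(x, delay, g):
        y = np.zeros(len(x))
        for n in range(len(x)):
            xd = x[n - delay] if n >= delay else 0.0
            yd = y[n - delay] if n >= delay else 0.0
            y[n] = -g * x[n] + xd + g * yd
        return y

    y = sum(comb(x, d, 0.84) for d in (1557, 1617, 1491, 1422)) / 4.0
    y = allpass(y, 225, 0.7)
    y = allpass(y, 556, 0.7)
    return y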
The spatial extent synthesiser 215 may receive the output of the reverberator 213 or the weighted sum 211 (305) and output a spatial extent synthesised signal as the ambience signal to the spatial mixer 205 and in some embodiments a spatial mixer processor 333.
In some embodiments the system comprises the spatial mixer 205. The spatial mixer 205 in some embodiments comprises a sound object processor 331. The sound object processor 331 can be configured to analyse the audio signals from the microphone sources and output these to the spatial mixer processor 333. The processing may for example comprise determining the spatial positions of the microphone sources.
The spatial mixer 205 may receive the ambience audio signals and the processed audio signals from the microphone audio signals and be configured to mix and/or render the audio signals based on the positioning data (which may be from the HAIP information received, input manually by a sound engineer or by any other suitable position tracking method). The spatial mixer processor 333 may thus from the determined position data, render a positioned monophonic sound signal at a suitable spatial location using head-related-transfer-function (HRTF) filtering when binaural audio output is desired for headphone listening. The output may be a two channel L+R signal for headphone listening, and the outputs after filtering can be summed for each microphone source to create a spatial mix signal containing all the spatially positioned sources. Correspondingly, in some embodiments when creating a loudspeaker domain output, the positioned monophonic sound signal output (for each sound source) may be panned and the panned sound source audio signals summed to create the spatial mix of sources.
With respect to figure 4 an example spatial extent synthesiser 215 is shown in further detail. As described herein the spatial extent synthesiser 215 receives the combined (reverberated) audio signals and spatially extends the audio signal to a defined (for example 360 degree) spatial extent using methods for spatial extent control. In other words it takes as input a mono sound source audio signal and spatial extent parameters (width, height and depth).
In some embodiments where the audio signal input is a time domain signal the spatial extent synthesiser 215 comprises a suitable time to frequency domain transformer. For example as shown in figure 4 the spatial extent synthesiser 215 comprises a Short-Time Fourier Transform (STFT) 401 configured to receive the audio signal and output a suitable frequency domain output. In some embodiments the input is a time-domain signal which is processed with a hop-size of 512 samples. A processing frame of 1024 samples is used, and it is formed from the current 512 samples and the previous 512 samples. The processing frame is zero-padded to twice its length (2048 samples) and Hann windowed. The Fourier transform is calculated from the windowed frame producing the Short-Time Fourier Transform (STFT) output. The STFT output is conjugate-symmetric, thus it is sufficient to process the positive half of the spectrum, that is 1024 bins plus the DC component, totalling 1025 samples. Although the STFT is shown in figure 4 any suitable time to frequency domain transform may be used.
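The framing just described may be sketched as follows; the Hann window is applied to the 1024-sample frame before zero-padding, which is an assumption about the intended ordering of those two steps.

import numpy as np

def stft_frame(previous, current):
    # previous, current: consecutive 512-sample hops. The 1024-sample
    # processing frame is Hann windowed, zero-padded to 2048 samples
    # and transformed; rfft returns the positive half of the spectrum,
    # 1025 bins including the DC component.
    frame = np.concatenate([previous, current])      # 1024 samples
    frame = frame * np.hanning(len(frame))           # Hann window
    frame = np.concatenate([frame, np.zeros(1024)])  # zero-pad to 2048
    return np.fft.rfft(frame)                        # 1025 complex bins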
In some embodiments the spatial extent synthesiser 215 further comprises a filter bank 403. The filter bank 403 is configured to receive the output of the STFT 401 and, using a set of filters generated based on a Halton sequence (and with some default parameters), generate a number of frequency bands 405. In statistics, Halton sequences are sequences used to generate points in space for numerical methods such as Monte Carlo simulations. Although these sequences are deterministic, they are of low discrepancy, that is, they appear to be random for many purposes. In some embodiments the filter bank 403 comprises a set of 9 different distribution filters, which are used to create 9 different frequency domain signals where the signals do not contain overlapping frequency components. These signals are denoted Band 1 F 405₁ to Band 9 F 405₉ in figure 4. The filtering can be implemented in the frequency domain by multiplying the STFT output with stored filter coefficients for each band. In some embodiments the spatial extent synthesiser 215 further comprises a spatial extent input 400. The spatial extent input 400 may be configured to define the spatial extent of the audio signal.
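One plausible construction of such Halton-sequence distribution filters assigns each STFT bin to exactly one of the 9 bands, yielding binary, non-overlapping masks that interleave quasi-randomly across frequency; the exact filter design in the disclosure may differ.

import numpy as np

def halton(n, base=2):
    # First n values of the base-`base` Halton (van der Corput) sequence.
    seq = np.zeros(n)
    for i in range(1, n + 1):
        f, k, x = 1.0, i, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i - 1] = x
    return seq

def halton_band_masks(n_bins=1025, n_bands=9):
    # Assign each frequency bin to one band via the Halton sequence,
    # producing non-overlapping binary masks; multiplying the STFT
    # output by each mask yields the 9 band signals.
    band_index = np.floor(halton(n_bins) * n_bands).astype(int)
    masks = np.zeros((n_bands, n_bins))
    masks[band_index, np.arange(n_bins)] = 1.0
    return masks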
Furthermore in some embodiments the spatial extent synthesiser 215 may further comprise an object position input/determiner 402. The object position input/determiner 402 may be configured to determine the spatial position of sound sources. This information may be determined in some embodiments by the sound object processor.
In some embodiments the spatial extent synthesiser 215 may further comprise a band position determiner 404. The band position determiner 404 may be configured to receive the outputs from the object position input/determiner 402 and the spatial extent input 400 and from these generate an output passed to the vector base amplitude panning processor 406. In the following example the spatial extent synthesiser 215 (or spatially extending controller) is implemented using a vector based amplitude panning operation. However it is understood that the spatial extent synthesis or spatially extending control may be implementation agnostic and any suitable implementation used to generate the spatially extending control. For example in some embodiments the spatially extending control may implement direct binaural panning (using Head related transfer function filters for directions), direct assignment to the output channel locations (for example direct assignment to the loudspeakers without using any panning), synthesized ambisonics, and wave-field synthesis.
In some embodiments the spatial extent synthesiser 215 may further comprise a vector based amplitude panning (VBAP) processor 406. The VBAP processor 406 may be configured to generate control signals to control the panning of the frequency domain signals to desired spatial positions. Given the spatial position of the sound source (azimuth, elevation) and the desired spatial extent for the source (width in degrees), the system calculates a spatial position for each frequency domain signal. For example, if the spatial position of the sound source is zero degrees azimuth (front), and the spatial extent 90 degrees, the VBAP may position the frequency bands at azimuths 45, 33.75, 22.5, 11.25, 0, -11.25, -22.5, -33.75 and -45 degrees. Thus, we use a linear allocation of bands around the source position, with the span defined by the spatial extent. The VBAP processor 406 may therefore be used to calculate a suitable gain for the signal, given the desired loudspeaker positions. The VBAP processor 406 may provide gains for a signal such that it can be spatially positioned to a suitable position. These gains may be passed to a series of multipliers 407.
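The linear band allocation, together with a pairwise two-dimensional VBAP gain computation, may be sketched as follows; the default loudspeaker azimuths describe an illustrative 4.0 layout, and only the 45 to -45 degree example above is taken from the text.

import numpy as np

def band_azimuths(source_az, extent, n_bands=9):
    # Linear allocation of band positions around the source azimuth;
    # band_azimuths(0, 90) reproduces 45, 33.75, ..., -45 degrees.
    return np.linspace(source_az + extent / 2.0,
                       source_az - extent / 2.0, n_bands)

def vbap_gains_2d(az_deg, spk_az_deg=(45.0, 135.0, -135.0, -45.0)):
    # Pairwise 2-D VBAP: find the adjacent loudspeaker pair spanning
    # the target azimuth, invert its 2x2 vector base and normalise the
    # gains to unit power. spk_az_deg must be listed in circular order.
    def unit(a):
        r = np.radians(a)
        return np.array([np.cos(r), np.sin(r)])

    p = unit(az_deg)
    gains = np.zeros(len(spk_az_deg))
    for i in range(len(spk_az_deg)):
        j = (i + 1) % len(spk_az_deg)
        base = np.column_stack([unit(spk_az_deg[i]), unit(spk_az_deg[j])])
        g = np.linalg.solve(base, p)
        if np.all(g >= -1e-9):           # this pair spans the direction
            gains[i], gains[j] = g
            break
    return gains / np.linalg.norm(gains)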
In some embodiments the spatial extent synthesiser 215 may further comprise a series of multipliers 407. Figure 4 shows one multiplier for each frequency band; thus the series of multipliers comprises multipliers 407₁ to 407₉, however any suitable number of multipliers may be used. Each frequency domain band signal may be multiplied in the multiplier 407 with the determined VBAP gains.
The products of the VBAP gains and each frequency band signal may be passed to a series of output channel sum devices 409.
In some embodiments the spatial extent synthesiser 215 may further comprise a series of sum devices 409. The sum devices 409 may receive the outputs from the multipliers and combine them to generate an output channel band signal 411. In the example shown in figure 4, a 4.0 loudspeaker format output is implemented with outputs for front left (Band FL F 411₁), front right (Band FR F 411₂), rear left (Band RL F 411₃), and rear right (Band RR F 411₄) channels, which are generated by sum devices 409₁, 409₂, 409₃ and 409₄ respectively. In some other embodiments other loudspeaker formats or numbers of channels can be supported.
Furthermore in some embodiments other panning methods can be used such as panning laws, or the signals could be assigned to the closest loudspeakers directly.
In some embodiments the spatial extent synthesiser 215 may further comprise a series of inverse Short-Time Fourier Transforms (ISTFT) 413. For example as shown in figure 4 there is an ISTFT 413₁ associated with the FL signal, an ISTFT 413₂ associated with the FR signal, an ISTFT 413₃ associated with the RL signal and an ISTFT 413₄ associated with the RR signal. In other words the synthesiser provides N component audio signals to be played from different directions based on the spatial extent parameters. The signals are subjected to the Inverse Short-Time Fourier Transform (ISTFT) and overlap-added to produce time-domain outputs.
These component signals may be provided for rendering and also for analysis for the purpose of ensuring even energy distributions between the components.
In some embodiments, there may be more than one ambiance audio signal, or in other words the ambiance audio signal may be created in two or more parts. For example, microphone audio signals on the left side of a sound scene may contribute to an ambiance audio signal on the left, spatially extended to 180 degrees, and microphone audio signals on the right side of the sound scene may contribute to the ambiance audio signal on the right, likewise extended to a 180 degree extent.
Thus if we know that the space is acoustically not diffuse (e.g., we are outside), then it is more natural that the ambience corresponds somewhat to the directions of the sources. For example, if we have a battle of the bands situation outside with two rock bands on opposite sides of the listener, it is only natural that these two directions would have their own ambience tracks.
Any suitable division of the scene may be used for creating the ambiance signal in different parts. For example, if the microphones are located approximately in a left/right arrangement, then a left/right division for ambiance creation may be suitable. If the microphones are located in a constellation around the stage with mics in the front/back/left/right division, then a division into four 90 degree sectors for ambiance creation may be suitable. The division may in some embodiments be controllable from a graphical user interface of a digital audio workstation (DAW) system taking care of executing at least part of the proposed functionality.
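Such a division may be sketched by grouping the microphones into equal angular sectors, each sector then feeding its own ambiance signal; the azimuth convention (0 degrees front, counter-clockwise positive) and the equal-width sector layout are illustrative assumptions.

def assign_mics_to_sectors(mic_azimuths, n_sectors=2):
    # Group microphone indices into equal angular sectors:
    # n_sectors=2 gives a left/right division, n_sectors=4 gives four
    # 90-degree sectors, as in the examples above.
    width = 360.0 / n_sectors
    sectors = {k: [] for k in range(n_sectors)}
    for i, az in enumerate(mic_azimuths):
        sectors[int((az % 360.0) // width)].append(i)
    return sectors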
In some embodiments it is possible to create an ambiance audio signal by using just one microphone, since the apparatus and method utilize spatial extent synthesis to expand the scene. However, at least a second microphone some distance away from the first microphone may be required to allow capture of ambience. More microphones may result in a better approximation of the ambient signal. In general, the decision of how many external microphones to use for generating the ambiance may be at least partly based on the number of relatively acoustically homogenous portions of the scene. That is, if the sound scene varies at different locations of the scene, it is natural to place at least one microphone to capture each homogenous portion of the scene.
In some embodiments, audio texture synthesis methods (see US patent 9,528,852) may be used for reusing a portion of the ambiance created this way at other times of the audio capture. For example, an ambiance signal may be created with the proposed method during a quiet section of the event, or during any suitable portion of the event. The ambiance may be stored and looped as proposed in US 9,528,852 to create ambiance for other times in the event. In some embodiments, such a pre-generated ambiance audio signal may be used at such times of the event where all microphone signals indicate that they are capturing the external sound sources rather than background sounds. This can be determined by all microphones receiving low weights for the ambiance creation.
The system may also in some embodiments use a weighted sum of the ambiance created from current microphone input and the pre-generated (looped) ambiance.
In some embodiments, the system may perform a pre-calibration phase, during which the ambiance captured with external microphones is matched acoustically to an ambiance captured using a microphone array. For example, the magnitude response or other acoustic properties of the ambiance created from external microphones may be matched to an ambiance captured from a microphone array. This enables substituting a microphone array ambiance more realistically with ambiance captured from external microphones, and may be useful, for example, in situations where the microphone array suddenly becomes unavailable (in breakdowns, for example).
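Such matching may be sketched as per-band correction gains computed from the two ambiance signals; the band count, the equal-width band layout and the assumption of equal-length input signals are all illustrative.

import numpy as np

def match_magnitude(external_amb, array_amb, n_bands=32, eps=1e-9):
    # Per-band magnitude of each ambiance signal; the returned gains,
    # applied to the external-microphone ambiance band by band, match
    # its magnitude response to the microphone-array reference.
    def band_mags(x):
        spec = np.abs(np.fft.rfft(x))
        return np.array([b.mean() for b in np.array_split(spec, n_bands)])

    return band_mags(array_amb) / (band_mags(external_amb) + eps)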
With respect to Figure 5 an example electronic device which may be used as the mixer and/or ambience signal generator is shown. The device may be any suitable electronic device or apparatus. For example in some embodiments the device 1200 is a mobile device, user equipment, tablet computer, computer, audio playback apparatus, etc.
The device 1200 may comprise a microphone 1201. The microphone 1201 may comprise a plurality (for example a number N) of microphones. However it is understood that there may be any suitable configuration of microphones and any suitable number of microphones. In some embodiments the microphone 1201 is separate from the apparatus and the audio signal is transmitted to the apparatus by a wired or wireless coupling. The microphone 1201 may in some embodiments be the microphone array as shown in the previous figures.
The microphone may be a transducer configured to convert acoustic waves into suitable electrical audio signals. In some embodiments the microphones can be solid state microphones; in other words the microphone may be capable of capturing audio signals and outputting a suitable digital format signal. In some other embodiments the microphone 1201 can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone. The microphone can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 1203.
The device 1200 may further comprise an analogue-to-digital converter 1203.
The analogue-to-digital converter 1203 may be configured to receive the audio signals from each of the microphones 1201 and convert them into a format suitable for processing. In some embodiments where the microphone is an integrated microphone the analogue-to-digital converter is not required. The analogue-to-digital converter 1203 can be any suitable analogue-to-digital conversion or processing means. The analogue-to-digital converter 1203 may be configured to output the digital representations of the audio signal to a processor 1207 or to a memory 1211.
In some embodiments the device 1200 comprises at least one processor or central processing unit 1207. The processor 1207 can be configured to execute various program codes, such as the methods described herein.
In some embodiments the device 1200 comprises a memory 1211. In some embodiments the at least one processor 1207 is coupled to the memory 1211. The memory 1211 can be any suitable storage means. In some embodiments the memory 1211 comprises a program code section for storing program codes implementable upon the processor 1207. Furthermore in some embodiments the memory 1211 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed in accordance with the embodiments as described herein. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 1207 whenever needed via the memory-processor coupling.
In some embodiments the device 1200 comprises a user interface 1205. The user interface 1205 can be coupled in some embodiments to the processor 1207. In some embodiments the processor 1207 can control the operation of the user interface 1205 and receive inputs from the user interface 1205. In some embodiments the user interface 1205 can enable a user to input commands to the device 1200, for example via a keypad. In some embodiments the user interface 1205 can enable the user to obtain information from the device 1200. For example the user interface 1205 may comprise a display configured to display information from the device 1200 to the user. The user interface 1205 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the device 1200 and further displaying information to the user of the device 1200. In some embodiments the user interface 1205 may be the user interface for communicating with the position determiner as described herein.
In some embodiments the device 1200 comprises a transceiver 1209. The transceiver 1209 in such embodiments can be coupled to the processor 1207 and configured to enable communication with other apparatus or electronic devices, for example via a wireless communications network. The transceiver 1209 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wire or wired coupling.
For example as shown in figure 5 the transceiver 1209 may be configured to communicate with the renderer as described herein.
The transceiver 1209 can communicate with further apparatus by any suitable known communications protocol. For example in some embodiments the transceiver 1209 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or an infrared data communication pathway (IRDA).
In some embodiments the device 1200 may be employed as at least part of the renderer. As such the transceiver 1209 may be configured to receive the audio signals and positional information from the microphone/close microphones/position determiner as described herein, and generate a suitable audio signal rendering by using the processor 1207 executing suitable code. The device 1200 may comprise a digital-to-analogue converter 1213. The digital-to-analogue converter 1213 may be coupled to the processor 1207 and/or memory 1211 and be configured to convert digital representations of audio signals (such as from the processor 1207 following an audio rendering of the audio signals as described herein) to a suitable analogue format suitable for presentation via an audio subsystem output. The digital-to-analogue converter (DAC) 1213 or signal processing means can in some embodiments be any suitable DAC technology. Furthermore the device 1200 can comprise in some embodiments an audio subsystem output 1215. An example as shown in figure 5 shows the audio subsystem output 1215 as an output socket configured to enable a coupling with headphones 121. However the audio subsystem output 1215 may be any suitable audio output or a connection to an audio output. For example the audio subsystem output 1215 may be a connection to a multichannel speaker system.
In some embodiments the digital to analogue converter 1213 and audio subsystem 1215 may be implemented within a physically separate output device. For example the DAC 1213 and audio subsystem 1215 may be implemented as cordless earphones communicating with the device 1200 via the transceiver 1209.
Although the device 1200 is shown having audio capture, audio processing and audio rendering components, it would be understood that in some embodiments the device 1200 can comprise just some of these elements.
In general, the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto. While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The embodiments of this invention may be implemented by computer software executable by a data processor of the mobile device, such as in the processor entity, or by hardware, or by a combination of software and hardware. Further in this regard it should be noted that any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions. The software may be stored on such physical media as memory chips, or memory blocks implemented within the processor, magnetic media such as hard disk or floppy disks, and optical media such as for example DVD and the data variants thereof, CD. The memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory. The data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
Embodiments of the inventions may be practiced in various components such as integrated circuit modules. The design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
Programs, such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California, automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules. Once the design for a semiconductor circuit has been completed, the resultant design, in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.
The foregoing description has provided by way of exemplary and non-limiting examples a full and informative description of the exemplary embodiment of this invention. However, various modifications and adaptations may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings and the appended claims. However, all such and similar modifications of the teachings of this invention will still fall within the scope of this invention as defined in the appended claims.

CLAIMS:
1. An apparatus for generating an intended spatial audio field, the apparatus configured to:
receive at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone is located in the same environment and configured to capture a sound source;
analyse each audio signal to determine at least in part an ambience audio signal;
generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals; and
process the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
2. The apparatus as claimed in claim 1, wherein the apparatus is further configured to apply a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
3. The apparatus as claimed in any of claims 1 and 2, wherein the apparatus configured to generate a sum audio signal from the determined ambience audio signal based on the at least two audio signals is configured to generate for and apply to at least one of the at least two audio signals a weighting value before generating the sum audio signal, wherein the weighting value is based on at least one of:
a detection of voice activity within the audio signal;
a determination of spectral flatness within the audio signal;
a determination of percussiveness within the audio signal;
a determination of harmonicity within the audio signal;
a determination of content classification type within the audio signal;
a determination of silence within the audio signal;
a determination of noise within the audio signal; and
at least one user generated input associated with the audio signal.
4. The apparatus as claimed in claim 3, wherein the apparatus configured to generate for at least one of the at least two audio signals a weighting value is further configured to normalise the weighting value for at least one of the at least two audio signals.
5. The apparatus as claimed in any of claims 1 to 4, wherein the apparatus configured to process the sum audio signal to spatially extend the sum audio signal is configured to apply one of:
vector base amplitude panning to the sum audio signal;
direct binaural panning to the sum audio signal;
direct assignment to channel output location to the sum audio signal;
synthesized ambisonics to the sum audio signal; and
wavefield synthesis to the sum audio signal.
6. The apparatus as claimed in claim 5, wherein the apparatus configured to process the sum audio signal to spatially extend the sum audio signal is configured to:
determine a spatial extent parameter;
determine at least one position associated with the microphones;
determine at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
7. The apparatus as claimed in claim 6, wherein the apparatus configured to apply vector base amplitude panning to the sum audio signal is further configured to generate panning vectors for the application of vector base amplitude panning to frequency bands of the sum audio signal.
8. The apparatus as claimed in any of claims 1 to 7, wherein the apparatus configured to generate the intended spatial audio field is configured to generate a plurality of intended spatial audio field parts, wherein at least one part of the intended spatial audio field is at least one of:
partially overlapping a neighbouring part;
non-overlapping at least one other part;
contained within at least one other part; and
containing at least one other part.
9. The apparatus as claimed in any of claims 1 to 7, wherein the apparatus is configured to generate:
at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and
at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
10. The apparatus as claimed in claim 9, wherein the first part of the environment is a left portion of the environment with respect to the apparatus, and the second part of the environment is a right portion of the environment with respect to the apparatus.
11. The apparatus as claimed in claim 9, wherein the first part of the environment is a front portion of the environment with respect to the apparatus, and the second part of the environment is a rear portion of the environment with respect to the apparatus.
12. The apparatus as claimed in any of claims 1 to 11, further configured to determine a position of the at least one microphone of the microphones relative to the apparatus.
13. The apparatus as claimed in any of claims 1 to 12, further configured to:
receive at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene;
compare the at least one audio signal from the capture device to the at least one audio signal;
control the generation of the sum audio signal from microphones located within the intended spatial audio field, and process the sum audio signal to generate the intended spatial audio field based on the comparison.
14. The apparatus as claimed in any of claims 1 to 13, further configured to mix the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
15. The apparatus as claimed in any of claims 1 to 14, wherein the apparatus configured to process the sum audio signal to spatially extend the sum audio signal is configured to spatially extend the sum audio signal such that the at least one spatially extended sum audio signal is one of:
fully spatially extended to 360 degrees; and
partially spatially extended up to 360 degrees.
16. A method for generating an intended spatial audio field, the method comprising:
receiving at least two audio signals, wherein each audio signal is received from a separate microphone, each separate microphone being located in the same environment and configured to capture a sound source;
analysing each audio signal to determine at least in part an ambience audio signal;
generating a sum audio signal from the determined ambience signal based on the at least two audio signals; and
processing the sum audio signal to spatially extend the sum audio signal so as to generate the intended spatial audio field, wherein the sum audio signal comprises the ambience audio signal for the intended spatial audio field.
17. The method as claimed in claim 16, further comprising applying a reverberation to the sum audio signal before the processing of the sum audio signal to spatially extend the sum audio signal.
18. The method as claimed in any of claims 16 and 17, wherein generating the sum audio signal comprises:
generating for at least one of the at least two audio signals a weighting value; and
applying to at least one of the at least two audio signals the weighting value before generating the sum audio signal, wherein the weighting value is based on at least one of:
a detection of voice activity within the audio signal;
a determination of spectral flatness within the audio signal;
a determination of percussiveness within the audio signal;
a determination of harmonicity within the audio signal;
a determination of silence within the audio signal;
a determination of noise within the audio signal;
a determination of content classification type within the audio signal; and
at least one user generated input associated with the audio signal.
19. The method as claimed in claim 18, wherein generating the weighting value further comprises normalising the weighting value for at least one of the at least two audio signals.
20. The method as claimed in any of claims 16 to 19, wherein processing the sum audio signal to spatially extend the sum audio signal comprises applying one of:
vector base amplitude panning to the sum audio signal;
direct binaural panning to the sum audio signal;
direct assignment to channel output location to the sum audio signal;
synthesized ambisonics to the sum audio signal; and
wavefield synthesis to the sum audio signal.
21. The method as claimed in claim 20, wherein processing the sum audio signal to spatially extend the sum audio signal comprises:
determining a spatial extent parameter;
determining at least one position associated with the microphones;
determining at least one frequency band position based on the at least one position associated with the microphones and the spatial extent parameter.
22. The method as claimed in claim 21, wherein applying vector base amplitude panning to the sum audio signal further comprises generating panning vectors for the application of vector base amplitude panning to frequency bands of the weighted sum.
23. The method as claimed in any of claims 16 to 22, wherein generating the intended spatial audio field comprises generating a plurality of intended spatial audio field parts, wherein at least one part is at least one of:
partially overlapping a neighbouring part;
non-overlapping at least one other part;
contained within at least one other part; and
containing at least one other part.
24. The method as claimed in any of claims 16 to 23, comprising:
generating at least one first part of the intended spatial audio field associated with a first part of the environment, the first part of the environment comprising at least one sound source; and
generating at least one second part of the intended spatial audio field associated with a second part of the environment, the second part of the environment comprising at least one further sound source.
25. The method as claimed in claim 24, wherein the first part of the environment is a left portion of the environment, and the second part of the environment is a right portion of the environment.
26. The method as claimed in claim 24, wherein the first part of the environment is a front portion of the environment, and the second part of the environment is a rear portion of the environment.
27. The method as claimed in any of claims 16 to 26, further comprising determining a position of the at least one microphone of the microphones relative to the apparatus.
28. The method as claimed in any of claims 16 to 26, further comprising:
receiving at least one audio signal from a capture device comprising a microphone array for capturing audio signals of the sound scene;
comparing the at least one audio signal from the capture device to the at least one audio signal;
controlling the generation of the sum audio signal from microphones located within the intended spatial audio field; and
processing the sum audio signal to generate the intended spatial audio field based on the comparison.
29. The method as claimed in any of claims 16 to 28, further comprising mixing the at least one spatially extended sum audio signal with at least one of the at least two audio signals to generate the intended spatial audio field.
30. The method as claimed in any of claims 16 to 29, wherein processing the sum audio signal to spatially extend the sum audio signal comprises spatially extending the sum audio signal such that the at least one spatially extended audio signal is one of:
fully spatially extended to 360 degrees; and
partially spatially extended up to 360 degrees.
PCT/FI2018/050275 2017-04-20 2018-04-19 Audio signal generation for spatial audio mixing WO2018193162A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1706290.2A GB2561596A (en) 2017-04-20 2017-04-20 Audio signal generation for spatial audio mixing
GB1706290.2 2017-04-20

Publications (2)

Publication Number Publication Date
WO2018193162A2 true WO2018193162A2 (en) 2018-10-25
WO2018193162A3 WO2018193162A3 (en) 2018-12-06

Family

ID=58795721

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2018/050275 WO2018193162A2 (en) 2017-04-20 2018-04-19 Audio signal generation for spatial audio mixing

Country Status (2)

Country Link
GB (1) GB2561596A (en)
WO (1) WO2018193162A2 (en)


Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
GB201818959D0 (en) * 2018-11-21 2019-01-09 Nokia Technologies Oy Ambience audio representation and associated rendering

Family Cites Families (9)

Publication number Priority date Publication date Assignee Title
US20080298610A1 (en) * 2007-05-30 2008-12-04 Nokia Corporation Parameter Space Re-Panning for Spatial Audio
EP2355559B1 (en) * 2010-02-05 2013-06-19 QNX Software Systems Limited Enhanced spatialization system with satellite device
US9313599B2 (en) * 2010-11-19 2016-04-12 Nokia Technologies Oy Apparatus and method for multi-channel signal playback
US9055371B2 (en) * 2010-11-19 2015-06-09 Nokia Technologies Oy Controllable playback system offering hierarchical playback options
US9456289B2 (en) * 2010-11-19 2016-09-27 Nokia Technologies Oy Converting multi-microphone captured signals to shifted signals useful for binaural signal processing and use thereof
US9769588B2 (en) * 2012-11-20 2017-09-19 Nokia Technologies Oy Spatial audio enhancement apparatus
US10127912B2 (en) * 2012-12-10 2018-11-13 Nokia Technologies Oy Orientation based microphone selection apparatus
GB2540175A (en) * 2015-07-08 2017-01-11 Nokia Technologies Oy Spatial audio processing apparatus
GB2540225A (en) * 2015-07-08 2017-01-11 Nokia Technologies Oy Distributed audio capture and mixing control

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN113544774A (en) * 2019-03-06 2021-10-22 弗劳恩霍夫应用研究促进协会 Downmixer and downmixing method
CN113286249A (en) * 2020-02-19 2021-08-20 雅马哈株式会社 Sound signal processing method and sound signal processing device
EP3869500A1 (en) * 2020-02-19 2021-08-25 Yamaha Corporation Sound signal processing method and sound signal processing device
RU2770438C1 (en) * 2020-02-19 2022-04-18 Ямаха Корпорейшн Method for audio signal processing and audio signal processing apparatus
US11482206B2 (en) 2020-02-19 2022-10-25 Yamaha Corporation Sound signal processing method and sound signal processing device
US11900913B2 (en) 2020-02-19 2024-02-13 Yamaha Corporation Sound signal processing method and sound signal processing device

Also Published As

Publication number Publication date
GB201706290D0 (en) 2017-06-07
WO2018193162A3 (en) 2018-12-06
GB2561596A (en) 2018-10-24

Similar Documents

Publication Publication Date Title
US10685638B2 (en) Audio scene apparatus
US10382849B2 (en) Spatial audio processing apparatus
KR101341523B1 (en) Method to generate multi-channel audio signals from stereo signals
EP2486737B1 (en) System for spatial extraction of audio signals
CN106796792B (en) Apparatus and method for enhancing audio signal, sound enhancement system
EP3363017A1 (en) Distributed audio capture and mixing
CN117412237A (en) Combining audio signals and spatial metadata
US11523241B2 (en) Spatial audio processing
JP2011501486A (en) Apparatus and method for generating a multi-channel signal including speech signal processing
WO2018091776A1 (en) Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices
JP2002078100A (en) Method and system for processing stereophonic signal, and recording medium with recorded stereophonic signal processing program
JP2013527727A (en) Sound processing system and method
EP2484127B1 (en) Method, computer program and apparatus for processing audio signals
WO2018193162A2 (en) Audio signal generation for spatial audio mixing
EP3613221A1 (en) Enhancing loudspeaker playback using a spatial extent processed audio signal
EP3613043A1 (en) Ambience generation for spatial audio mixing featuring use of original and extended signal
US20230370777A1 (en) A method of outputting sound and a loudspeaker
WO2024024468A1 (en) Information processing device and method, encoding device, audio playback device, and program
WO2018193161A1 (en) Spatially extending in the elevation domain by spectral extension
CN116569566A (en) Method for outputting sound and loudspeaker

Legal Events

Date Code Title Description
NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18787042

Country of ref document: EP

Kind code of ref document: A2