WO2018091776A1 - Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices - Google Patents

Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices

Info

Publication number
WO2018091776A1
Authority
WO
WIPO (PCT)
Prior art keywords
microphone
microphones
audio signals
separated
microphone audio
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/FI2017/050778
Other languages
English (en)
French (fr)
Inventor
Juha Vilkamo
Miikka Vilermo
Mikko Tammi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nokia Technologies Oy
Original Assignee
Nokia Technologies Oy
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nokia Technologies Oy filed Critical Nokia Technologies Oy
Priority to EP17871590.0A (granted as EP3542546B1)
Priority to JP2019526614A (granted as JP7082126B2)
Priority to US16/461,606 (granted as US10873814B2)
Priority to CN201780083608.7A (granted as CN110337819B)
Publication of WO2018091776A1


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 Stereophonic arrangements
    • H04R 5/04 Circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 3/00 Circuits for transducers, loudspeakers or microphones
    • H04R 3/04 Circuits for transducers, loudspeakers or microphones for correcting frequency response
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 5/00 Stereophonic arrangements
    • H04R 5/027 Spatial or constructional arrangements of microphones, e.g. in dummy heads
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 1/00 Details of transducers, loudspeakers or microphones
    • H04R 1/20 Arrangements for obtaining desired frequency or directional characteristics
    • H04R 1/32 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only
    • H04R 1/326 Arrangements for obtaining desired frequency or directional characteristics for obtaining desired directional characteristic only for microphones
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R 2499/00 Aspects covered by H04R or H04S not otherwise provided for in their subgroups
    • H04R 2499/10 General applications
    • H04R 2499/11 Transducers incorporated or for use in hand-held devices, e.g. mobile phones, PDA's, camera's
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 2420/00 Techniques used in stereophonic systems covered by H04S but not provided for in its groups
    • H04S 2420/03 Application of parametric coding in stereophonic audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S 7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S 7/30 Control circuits for electronic adaptation of the sound field
    • H04S 7/302 Electronic adaptation of stereophonic sound system to listener position or orientation
    • H04S 7/303 Tracking of listener position or orientation
    • H04S 7/304 For headphones

Definitions

  • the present application relates to apparatus and methods for generating spatial metadata for audio signals from asymmetric devices and specifically but not exclusively asymmetric arrangements of microphones on user equipment.
  • Adaptive spatial audio capture (SPAC) methods which employ dynamic analysis of perceptually relevant spatial information from the microphone array signals (e.g. directions of the arriving sound in frequency bands) are known.
  • This information may be applied to dynamically synthesize a spatial reproduction that is perceptually similar to the original recorded sound field.
  • Linear capture methods are classical, static techniques.
  • An example is Ambisonics, which is a linear beamforming technique characterized by an intermediate signal representation in spherical harmonics.
  • Linear techniques require extensive hardware for accurate spatial sound capture. For example, the Eigenmike (a sphere with 32 high-SNR microphones) is satisfactory for linear reproduction.
  • Parametric audio signal capture (perceptual, adaptive) and spatial metadata analysis includes SPAC and any other adaptive methods, including Directional Audio Coding (DirAC), Harmonic plane wave expansion (Harpex), and other similar methods. These approaches analyse the microphone audio signals to determine spatial features such as directions of the arriving sound, typically adaptively in frequency bands. This determined parametric information enables perceptually accurate synthesis of the spatial sound. These parametric capture techniques have vastly lower SNR/hardware requirements than the linear techniques.
  • The aforementioned spatial capture methods are designed to be implemented on symmetric or near-symmetric devices. However, in many practical devices at least two of the dimensions (length, width, height) differ greatly from each other. For example, a device such as a smartphone or tablet may be flat along a certain axis close to the horizontal plane.
  • This device asymmetry poses a problem for spatial capture.
  • The main issue is that the device has a 'short' spatial axis along which, regardless of any optimization of the microphone positioning, the differential information between the microphones cannot be large.
  • Any interferers, such as microphone self-noise, device noise, wind noise or vibration noise, therefore have a pronounced effect on analysis along this axis.
  • an apparatus comprising a predetermined shape, the apparatus comprising: at least three microphones, located on or within the apparatus, wherein at least one pair from the at least three microphones comprises two microphones which are separated by a shorter distance of the predetermined shape than at least one other microphone pair of the predetermined shape; and a processor configured to: receive at least three microphone audio signals from the at least three microphones; analyse at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision; and analyse the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity, wherein the at least one of the other microphone pairs comprises two microphones separated by a longer distance along the predetermined shape in such a way that the first and the at least one of the other microphone pairs are configured to capture spatial audio signals.
  • the predetermined shape may be a physical shape of the apparatus.
  • At least one dimension of the physical shape of the apparatus may be shorter than other dimensions of the physical shape of the apparatus.
  • the two microphones which are separated by the shorter distance may be separated by the shorter distance due to the at least one dimension of the physical shape of the apparatus being shorter than other dimensions of the physical shape of the apparatus.
  • the predetermined shape may be a physical geometry of the at least three microphones.
  • the two microphones which are separated by the shorter distance may be located on a dimension other than at least one dimension of the physical shape of the apparatus shorter than other dimensions of the physical shape of the apparatus.
  • the processor configured to analyse at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may be further configured to analyse microphone audio signals from at least one of the other microphone pairs to determine the direction ambiguity decision.
  • the processor may be configured to determine a first spatial metadata part, the first spatial metadata part being the directional ambiguity decision; determine a second spatial metadata part, the second spatial metadata part being the at least one sound characteristic other than the direction ambiguity; and combine the first spatial metadata part and the second metadata part to generate spatial metadata associated with at least three microphone audio signals, and wherein the second metadata part has a greater range of values than the first metadata part.
  • the processor configured to analyse the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may be configured to determine a delay value between the at least one of the other microphone pairs.
  • the at least one sound characteristic other than the direction ambiguity may be a direction angle of the arriving sound, wherein the direction angle has ambiguous values, and wherein the direction ambiguity decision resolves the ambiguous values.
  • the processor configured to analyse the microphone audio signals from at least one of the other microphone pairs to determine the direction angle may be configured to: determine a delay value between the microphone audio signals from at least one of the other microphone pairs; normalise the delay value against a delay value for a sound wave to travel a distance between the at least one of the other microphone pairs; apply a trigonometric function to the normalised delay value or use the normalised delay value in a look up table to generate at least two ambiguous direction angle values.
  • the processor configured to apply the trigonometric function to the normalised delay value to generate the at least two ambiguous direction angle values may be configured to apply an inverse cosine function to the normalised delay value to generate the at least two ambiguous direction angle values.
  • the processor configured to analyse at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may be configured to: determine a sign of a delay value associated with a maximum correlation value between the microphone audio signals from the two microphones which are separated by the shorter distance, wherein the processor may be further configured to resolve the at least two ambiguous direction angle values based on the sign of the delay value.
  • the processor configured to determine a delay value between the microphone audio signals may be configured to: determine a plurality of correlation values for a range of delay values between the microphone audio signals; search the plurality of correlation values for a correlation value with the maximum correlation value; and select the delay value from the range of delay values associated with the correlation value with the maximum correlation value.
  • the processor configured to determine a delay value between the microphone audio signals may be configured to: determine a derivative over frequency of a phase difference between the microphone audio signals; and determine the delay value based on the derivative over frequency of the phase difference.
  • the at least one sound characteristic other than the direction ambiguity may further comprise an energy ratio associated with the direction angle of the arriving sound.
  • the at least one sound characteristic other than the direction ambiguity may further comprise a coherence associated with the direction angle of the arriving sound.
  • the processor configured to analyse at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may be configured to analyse, on a frequency-band by frequency-band basis, at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision.
  • the processor configured to analyse the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may be configured to analyse, on a frequency-band by frequency-band basis, the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity.
  • the at least three microphones may comprise four microphones
  • the processor configured to receive at least three microphone audio signals from the at least three microphones may be configured to receive four microphone audio signals from the four microphones
  • the processor configured to analyse the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may be configured to: analyse the microphone audio signals from at least two of the other microphone pairs to determine at least two delays; and determine an azimuth and elevation direction of an arriving sound from the at least two delays, and the processor configured to analyse at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision is configured to determine a direction ambiguity decision for the determined azimuth and elevation direction.
  • the direction values may be azimuth and elevation directions
  • the direction values may be any suitable direction or co-ordinate system such as for example azimuth & inclination, unit vectors, etc.
  • a method for an apparatus comprising a predetermined shape, the apparatus comprising: at least three microphones, located on or within the apparatus, wherein at least one pair from the at least three microphones comprises two microphones which are separated by a shorter distance of the predetermined shape than at least one other microphone pair of the predetermined shape, the method comprising: receiving at least three microphone audio signals from the at least three microphones; analysing at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision; and analysing the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity, wherein the at least one of the other microphone pairs comprises two microphones separated by a longer distance along the predetermined shape in such a way that the first and the at least one of the other microphone pairs are configured to capture spatial audio signals.
  • the predetermined shape may be a physical shape of the apparatus.
  • At least one dimension of the physical shape of the apparatus may be shorter than other dimensions of the physical shape of the apparatus.
  • the two microphones which are separated by the shorter distance may be separated by the shorter distance due to the at least one dimension of the physical shape of the apparatus being shorter than other dimensions of the physical shape of the apparatus.
  • the predetermined shape may be a physical geometry of the at least three microphones.
  • the two microphones which are separated by the shorter distance may be located on a dimension other than at least one dimension of the physical shape of the apparatus shorter than other dimensions of the physical shape of the apparatus.
  • Analysing at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may further comprise analysing microphone audio signals from at least one of the other microphone pairs to determine the direction ambiguity decision.
  • the method may further comprise: determining a first spatial metadata part, the first spatial metadata part being the directional ambiguity decision; determining a second spatial metadata part, the second spatial metadata part being the at least one sound characteristic other than the direction ambiguity; and combining the first spatial metadata part and the second metadata part to generate spatial metadata associated with at least three microphone audio signals, and wherein the second metadata part has a greater range of values than the first metadata part.
  • Analysing the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may comprise determining a delay value between the at least one of the other microphone pairs.
  • the at least one sound characteristic other than the direction ambiguity may be a direction angle of the arriving sound, wherein the direction angle has ambiguous values, and wherein the direction ambiguity decision resolves the ambiguous values.
  • Analysing the microphone audio signals from at least one of the other microphone pairs to determine the direction angle may further comprise: determining a delay value between the microphone audio signals from at least one of the other microphone pairs; normalising the delay value against a delay value for a sound wave to travel a distance between the at least one of the other microphone pairs; applying a trigonometric function to the normalised delay value or using the normalised delay value in a look up table to generate at least two ambiguous direction angle values.
  • Applying the trigonometric function to the normalised delay value to generate the at least two ambiguous direction angle values may comprise applying an inverse cosine function to the normalised delay value to generate the at least two ambiguous direction angle values.
  • Analysing at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may comprise: determining a sign of a delay value associated with a maximum correlation value between the microphone audio signals from the two microphones which are separated by the shorter distance, wherein the method further comprises resolving the at least two ambiguous direction angle values based on the sign of the delay value.
  • Determining a delay value between the microphone audio signals may comprise: determining a plurality of correlation values for a range of delay values between the microphone audio signals; searching the plurality of correlation values for a correlation value with the maximum correlation value; and selecting the delay value from the range of delay values associated with the correlation value with the maximum correlation value.
  • Determining a delay value between the microphone audio signals may comprise: determining a derivative over frequency of a phase difference between the microphone audio signals; and determining the delay value based on the derivative over frequency of the phase difference.
  • the at least one sound characteristic other than the direction ambiguity may further comprise an energy ratio associated with the direction angle of the arriving sound.
  • the at least one sound characteristic other than the direction ambiguity may further comprises a coherence associated with the direction angle of the arriving sound.
  • Analysing at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision may comprise analysing, on a frequency-band by frequency-band basis, at least the microphone audio signals from the two microphones which are separated by the shorter distance to determine a directional ambiguity decision.
  • Analysing the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may comprise analysing, on a frequency-band by frequency-band basis, the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity.
  • the at least three microphones may comprise four microphones, wherein receiving at least three microphone audio signals from the at least three microphones may comprise receiving four microphone audio signals from the four microphones, analysing the microphone audio signals from at least one of the other microphone pairs to determine at least one sound characteristic other than the direction ambiguity may further comprise: analysing the microphone audio signals from at least two of the other microphone pairs to determine at least two delays; and determining an azimuth and elevation direction of an arriving sound from the two delays, and analysing at least the microphone audio signals from the at least two microphones which are separated by the shorter distance to determine a directional ambiguity decision may comprise determining a direction ambiguity decision for the determined azimuth and elevation direction.
  • a computer program product stored on a medium may cause an apparatus to perform the method as described herein.
  • An electronic device may comprise apparatus as described herein.
  • a chipset may comprise apparatus as described herein.
  • Embodiments of the present application aim to address problems associated with the state of the art.
  • Figure 1 shows spatial metadata errors caused by noise affecting a known spatial audio capture system
  • Figures 2a and 2b show schematically asymmetric microphone arrangement audio capture and processing apparatus suitable for implementing some embodiments
  • Figure 3 shows schematically a three microphone asymmetric arrangement audio capture and processing apparatus suitable for implementing some embodiments
  • Figure 4 shows schematically a four microphone asymmetric arrangement audio capture and processing apparatus suitable for implementing some embodiments
  • Figure 5 shows schematically functional processing elements of the example audio capture and processing apparatus suitable for implementing some embodiments
  • Figure 6 shows schematically functional elements of the analyser as shown in Figure 5 according to some embodiments
  • Figure 7 shows a flow diagram of an axis based analysis operation as implemented within apparatus as shown in Figure 6 according to some embodiments.
  • Figure 8 shows a flow diagram of an example delay information determination operation as implemented within apparatus as shown in Figure 6 according to some embodiments.
  • SPAC (spatial audio capture) refers here to techniques that use adaptive time-frequency analysis and processing to provide high perceptual quality spatial audio reproduction from any device equipped with a microphone array, for example, Nokia OZO or a mobile phone. At least 3 microphones are required for SPAC capture in the horizontal plane, and at least 4 microphones are required for 3D capture.
  • The SPAC methods are adaptive; in other words, they use non-linear approaches to improve spatial accuracy over the state-of-the-art traditional linear capture techniques.
  • Device asymmetry (where for example at least two of the dimensions such as length, width, height differ greatly from each other) poses a problem for linear capture and for conventional parametric spatial capture.
  • The issue is primarily that the asymmetric configuration of the device creates a 'short' spatial axis. Regardless of any optimization of the microphone positioning, the differential information between the microphones along this 'short' axis is very small.
  • Directional Audio Coding (DirAC) techniques in their typical form formulate the directional estimates based on the estimated sound field intensity vector.
  • the intensity vector is estimated from an intermediate spherical harmonic signal representation.
  • the signals in the intermediate spherical harmonic signal representation are formulated based on the differences between the microphone signals. Since the amplitude of the differential information is small for the 'short' axis, the processing coefficients (or multipliers) to obtain the spherical harmonic signals at that axis have to compensate for the small amplitude. In other words, the multipliers are large in order to amplify the 'short' axis. The large multiplier or coefficient used to amplify the small amplitude also amplifies the noise.
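  • To illustrate the scale of this effect, the following sketch (not part of the patent; the spacings and frequency are assumed values) computes the relative amplitude of the differential signal between an omnidirectional microphone pair for an axially arriving plane wave, and the noise amplification implied by equalizing it to unit gain:

```python
import numpy as np

# Illustrative values (assumptions, not from the patent): a 'short' axis of
# 7 mm (device thickness) and a 'long' axis of 140 mm (device length).
c = 343.0   # speed of sound (m/s)
f = 500.0   # analysis frequency (Hz)

for label, d in [("'short' axis, d = 7 mm", 0.007),
                 ("'long' axis, d = 140 mm", 0.140)]:
    # For a plane wave arriving along the axis of two omnidirectional
    # microphones spaced d apart, the differential (subtracted) signal
    # has relative amplitude |2 sin(pi f d / c)|.
    diff_amp = abs(2.0 * np.sin(np.pi * f * d / c))
    # Equalizing the differential signal to unit gain amplifies
    # microphone self-noise by the reciprocal of that amplitude.
    noise_gain_db = 20.0 * np.log10(1.0 / diff_amp)
    print(f"{label}: differential amplitude {diff_amp:.3f}, "
          f"noise amplified by {noise_gain_db:.1f} dB")
```

  • With these assumed numbers the 'short' axis amplifies noise by roughly 24 dB at 500 Hz, while the 'long' axis needs no amplification at all, which is the instability the following bullets describe.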
  • noise in the directional estimate means that the sound reproduced using that metadata may not be accurately localized at its location.
  • the sound could be perceived as being 'blurry' and only approximately arriving from the correct direction.
  • the sound reproduction may not be able to represent the single source as a point source.
  • Figure 1 shows an example asymmetric apparatus 91 which has a 'short' dimension from front to back of the apparatus and furthermore shows noise being received from a 'noisy' axis 93 in the same direction as the 'short' dimension.
  • Any arriving sound, for example a sound represented by the loudspeaker symbol 95, which is primarily located perpendicular to the 'short' dimension, is particularly susceptible to all sources of noise, accentuating the parameter estimation errors when the spatial metadata associated with the captured sound is determined.
  • This is shown, for example, in Figure 1 by the dashed lines 97 and 99, which illustrate the large effect the noise on the 'noisy' axis 93 has on the estimated directional parameters.
  • the predetermined shape may refer to the physical shape or dimensions of the apparatus or may refer to the physical geometry of arrangement of the microphones on or in the apparatus. In some embodiments the physical shape of the apparatus may not be asymmetric, but the arrangement of the microphones on the apparatus is asymmetric.
  • a relevant capture device may be characterized by a small microphone spacing dimension.
  • Examples include a smart phone, a tablet, or a hand-held VR camera, where at least one of the dimensions of the device limits the option to have reasonable spatial separation of microphones for all axes of interest.
  • typical parametric techniques for spatial audio capture fail with the above condition. For example, DirAC (and its variants, for example Higher-Order DirAC) as well as Harpex use an intermediate B-format (or more generally: spherical harmonic) signal representation.
  • the spherical harmonic signals for one axis have a very low SNR due to the microphone distances. This noise makes the spatial analysis on that axis unstable.
  • any technique using an intermediate spherical harmonic (or similar) representation can only produce spatial reproduction below the spatial aliasing frequency. This is the frequency above which the spherical harmonic signal cannot be formed because the audio wavelength is too small with respect to the microphone spacing.
  • Above the spatial aliasing frequency, using spherical devices such as OZO, it can be possible to use acoustic shadowing information to determine directional information.
  • acoustic shadowing information may not be reliable on apparatus such as a mobile phone, where the acoustic shadowing is not prominent on all axes and may also vary depending on how the user is holding the apparatus.
  • a further benefit of the examples described herein is that they function both below and above the spatial aliasing frequency.
  • the concept may be implemented in some embodiments within a device with 3 or more microphones. With at least 3 microphones horizontal surround metadata may be analysed. With at least 4 microphones height metadata may also be analysed.
  • the spatial metadata may be information which can be utilized by the device or apparatus directly, or may be transmitted to a receiver device. An apparatus (for example an apparatus receiving the spatial metadata) may then use the spatial metadata and audio signals (which may be other than the original microphone signals) to synthesize the spatial sound to be output, for example over headphones or loudspeakers, without knowledge of the microphone locations and/or dimensions of the capture apparatus.
  • a capture device may have several microphones, but stores/transmits only two of the channels, or combines the several channels linearly or adaptively for transmission, or processes the channels (equalization, noise removal, dynamic processing, etc.) before transmitting the audio signals alongside the spatial metadata.
  • A further apparatus may then process the audio signals using the spatial metadata (and in some embodiments further inputs such as head orientation) to determine a synthesised audio output signal or signals.
  • a common factor in the embodiments described herein is that spatial metadata and some audio signals originating in one way or another from the same or similar sound field are utilized at the synthesis stage (this utilization may be either direct, or after transmission/storing/encoding, etc).
  • the core concept associated with the embodiments described herein is one where the capture device is configured to have at least one axis of capture which is selected to perform only directional ambiguity (also known as front-back) audio analysis, typically in frequency bands.
  • This axis of capture is such that the delay between audio signals generated by microphones from an arriving plane wave along that axis has a value which is smaller than the maximum delay between audio signals generated by microphones defining another capture axis.
  • An example of such an axis is shown in Figure 2a.
  • Figure 2a shows an example device 201 with the 'short' dimension axis 203.
  • The 'short' axis 203 of the device 201 (for example the thickness of a tablet device) is one along which the microphone spacing is significantly smaller than along another axis.
  • This 'short' dimension axis 203 is selected for direction ambiguity analysis only. In such a manner any selected 'short' dimension axis is prevented from contributing lower quality spatial metadata when generating accurate spatial information, while still enabling the generation of robust direction ambiguity spatial information (for example whether the sound arrives from the front or the back direction relative to that axis).
  • the direction ambiguity choice may be binary, e.g., if the sound arrives from one or the other side of the device.
  • the direction ambiguity choice may have more than two choices; nevertheless, the direction ambiguity choice is substantially more of a 'selection' parameter when compared to the fine angular determination parameter that is obtained from the delay or other analyses based on the signal analysis at the 'non-thin' axes.
  • the example apparatus or device 201 may comprise four microphones.
  • the arrangement of the microphones shown in Figure 2b is an example only of an arrangement of microphones for demonstrating the concept of the invention and it is understood that the microphones may be arranged in any suitable distribution.
  • three of the microphones are located at the 'front' of the device and one microphone is located at the 'rear' of the device 201.
  • a first of the 'front' microphones 211 may be located at one corner of the device 201.
  • a second of the 'front' microphones 213 may be located at an adjacent corner of the device 201.
  • a third of the 'front' microphones 215 may be located in the middle of the side opposite to the side between the first 211 and second 213 microphones of the device 201.
  • the 'rear' microphone 217 is shown in Figure 2b located at the same corner as the first 'front' microphone but on the opposite face to the first 'front' microphone 211. It is understood that the terms 'front' and 'rear' are relative to the user of the apparatus and as such have been chosen as examples only.
  • the arrangement of the microphones on the example device 201 is such that an arriving sound 202 towards the front of the device may be captured by the 'front' microphones as first to third audio signals at the first to third microphones respectively. Spatial metadata may then be generated by analysis of the first to third audio signals.
  • the dimensions of the microphone placement or microphone positions thus enables the selection of the type of analysis to be performed on the audio signals.
  • the distance between the microphones 211 and 215 is such that a robust analysis (for example directional analysis, and therefore the direction of the arriving sound 202 with respect to the device 201) may be performed by delay analysis of the audio signals, whereas the distance between the microphones 211 and 217 is such that a directional ambiguity (for example 'front-back') decision analysis may be performed.
  • the spatial metadata comprises at least one sound characteristic (other than direction) which may be determined from the analysis of the at least one microphone pair audio signals. For example, in some embodiments cross-correlation analysis of the microphone pair which has the largest mutual distance can be performed to determine an energy ratio parameter, which indicates the estimated proportion of the sound energy arriving from the determined 'source' direction with respect to all sound energy captured by the device in that frequency band. In some embodiments the remainder of the sound energy may be determined to be non-directional (for example reverberation sound energy).
  • the spatial metadata such as sound direction together with the energy ratio in frequency bands are parameters that express the perceptually relevant spatial information of the captured sound, and which can be utilized to perform high-quality spatial audio synthesis in a perceptual sense.
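  • As a purely illustrative sketch of what such per-band spatial metadata could look like in code (the field names and band count are assumptions for illustration, not a format defined by this application):

```python
from dataclasses import dataclass

@dataclass
class BandSpatialMetadata:
    """Per time-frame, per frequency-band spatial metadata (illustrative)."""
    band_index: int
    azimuth_deg: float     # direction of the arriving sound in the horizontal plane
    elevation_deg: float   # 0.0 when only horizontal analysis is available
    energy_ratio: float    # proportion of the band energy arriving from that
                           # direction (0..1); the remainder is treated as
                           # non-directional, e.g. reverberation

# For example, one frame of metadata could be one entry per frequency band.
frame_metadata = [BandSpatialMetadata(b, 0.0, 0.0, 0.5) for b in range(24)]
```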
  • The synthesis may be performed using a spatial audio player, for example a player as described in European patent application EP2617031A1.
  • an example device 300 is shown in which 3 microphones are placed on a device that constrains the microphone placement in at least one axis as described previously.
  • the example device 300 may, for example, represent a mobile device which has two 'forward' facing microphones, microphone 1 301 and microphone 3 305 and one 'rear' facing microphone, microphone 2 303.
  • the shape of the device is such that the distance between microphone 1 301 and microphone 2 303 is defined by distance 'c' 313 along a 'short' axis of the device, whereas the distance between microphone 1 301 and microphone 3 305 is defined by distance 'a' 311 along a 'long' axis of the device.
  • the distance between microphone 2 303 and microphone 3 305 is defined by distance 'b' 315, which is the diagonal of the 'short' and 'long' axes of the device. In other words, the distances 'a' 311 and 'c' 313 differ significantly.
  • When performing analysis on the audio signals from the microphones in order to determine the spatial metadata, microphone 1 301 and microphone 2 303 (and thus the audio signals generated by them), separated by the 'short' axis, are selected such that only directional ambiguity or 'front-back' analysis is performed on these audio signals. For example, delay analysis between the audio signals from microphone 1 301 and microphone 2 303 will result in noisy output values when determining directional information associated with a sound. However, the same delay analysis may, with fair robustness, be used to estimate whether a sound arrives first at microphone 1 301 or microphone 2 303, providing the 'front-back' directional ambiguity information.
  • the microphones (and thus the audio signals generated by the microphones) microphone 1 301 and microphone 3 305 separated by the 'long' axis may form a pair (separated by distance a) with a relatively large distance between the microphones.
  • the pair comprising microphone 1 301 and microphone 3 305 could therefore be used to detect spatial direction information with higher robustness.
  • the delay analysis between microphones 1 301 and 3 305 could provide information from which the direction of the arriving sound in the horizontal plane can be estimated.
  • the direction analysis produces a result which is ambiguous.
  • the same delay information would be obtained for a situation where the sound from the source arrives from the 'front' side or the 'back' or 'rear' side of the device at an approximately (or exactly) mirror-symmetric angle (depending on the microphone positioning and acoustic properties of the device).
  • This ambiguity can be resolved using the front-back information from the 'short' distance pair of microphone 1 301 and microphone 2 303.
  • Figure 4 furthermore shows a further example device with four microphones.
  • the 'rear' or 'back' face of the device is shown fully.
  • On the 'rear' face is located a microphone 3 405 in one corner and the display 411 centrally on the 'rear' face.
  • the 'rear' face shows two of the 'long' axes in the form of the length and width of the device.
  • the opposite 'front' face of the device 400 shows in dashed form a camera 413.
  • the 'front' face furthermore has located on it a microphone 1 401 which is located opposite the microphone 3 405 but located on the 'front' face of the device 400.
  • the distance between the microphone 1 401 and the microphone 3 405 is the thickness of the device (which is considered to be the 'short' axis of the device 400).
  • On the 'front' face, located at an adjacent corner separated by the device height, is the microphone 4 407.
  • In devices with 4 microphones, height directional information can be determined as well as the directional spatial metadata.
  • the microphone spacing is smaller for the thickness axis 421 than for the height or width axes.
  • the microphone pairing between microphone 1 401 and microphone 3 405 is such that the audio signals from this selection are to be used for delay analysis, as described earlier, for the directional ambiguity front-back analysis only.
  • FIG. 5 shows an example of internal components of the example audio capture apparatus or device shown in Figure 4 suitable for implementing some embodiments.
  • the audio capture apparatus 100 comprises the microphones (which may be defined as being microphones within a microphone array).
  • the microphone array in the example shown in Figure 5 shows microphones 401 to 407 organised in a manner similar to that shown in Figure 4.
  • the microphones 401 , 403, 405, 407 are shown configured to convert acoustic waves into suitable electrical audio signals.
  • the microphones are capable of capturing audio signals and each outputting a suitable digital signal.
  • the microphones or array of microphones can comprise any suitable microphone or audio capture means, for example a condenser microphone, capacitor microphone, electrostatic microphone, Electret condenser microphone, dynamic microphone, ribbon microphone, carbon microphone, piezoelectric microphone, or microelectrical-mechanical system (MEMS) microphone.
  • the microphones can in some embodiments output the captured audio signal to an analogue-to-digital converter (ADC) 103.
  • the audio capture apparatus 400 may further comprise an analogue-to-digital converter 103.
  • the analogue-to-digital converter 103 may be configured to receive the audio signals from each of the microphones and convert them into a format suitable for processing.
  • the microphones may comprise an ASIC where such analogue-to-digital conversions may take place in each microphone.
  • the analogue-to-digital converter 103 can be any suitable analogue-to-digital conversion or processing means.
  • the analogue-to-digital converter 103 may be configured to output the digital representations of the audio signals to a processor 107 or to a memory 111.
  • the audio capture apparatus 100 electronics can also comprise at least one processor or central processing unit 107.
  • the processor 107 can be configured to execute various program codes.
  • the implemented program codes can comprise, for example, signal delay analysis, spatial metadata processing, signal mixing, phase processing, amplitude processing, decorrelation, mid signal generation, side signal generation, time-to-frequency domain audio signal conversion, frequency- to-time domain audio signal conversions and other algorithmic routines.
  • the audio capture apparatus can further comprise a memory 111.
  • the at least one processor 107 can be coupled to the memory 111.
  • the memory 111 can be any suitable storage means.
  • the memory 111 can comprise a program code section for storing program codes implementable upon the processor 107.
  • the memory 111 can further comprise a stored data section for storing data, for example data that has been processed or is to be processed. The implemented program code stored within the program code section and the data stored within the stored data section can be retrieved by the processor 107 whenever needed via the memory-processor coupling.
  • the audio capture apparatus can also comprise a user interface 105.
  • the user interface 105 can be coupled in some embodiments to the processor (CPU) 107.
  • the processor 107 can control the operation of the user interface 105 and receive inputs from the user interface 105.
  • the user interface 105 can enable a user to input commands to the audio capture apparatus 400, for example via a keypad.
  • the user interface 105 can enable the user to obtain information from the apparatus 400.
  • the user interface 105 may comprise a display configured to display information from the apparatus 400 to the user.
  • the user interface 105 can in some embodiments comprise a touch screen or touch interface capable of both enabling information to be entered to the apparatus 400 and further displaying information to the user of the apparatus 400.
  • the audio capture apparatus 400 comprises a transceiver 109.
  • the transceiver 109 in such embodiments can be coupled to the processor 107 and configured to enable a communication with other apparatus or electronic devices, for example via a wireless or fixed line communications network.
  • the transceiver 109 or any suitable transceiver or transmitter and/or receiver means can in some embodiments be configured to communicate with other electronic devices or apparatus via a wireless or wired coupling.
  • the transceiver 109 can communicate with further apparatus by any suitable known communications protocol.
  • the transceiver 109 or transceiver means can use a suitable universal mobile telecommunications system (UMTS) protocol, a wireless local area network (WLAN) protocol such as for example IEEE 802.X, a suitable short-range radio frequency communication protocol such as Bluetooth, or infrared data communication pathway (IRDA).
  • the audio capture apparatus 400 may also comprise a digital-to-analogue converter 113.
  • the digital-to-analogue converter 113 may be coupled to the processor 107 and/or memory 111 and be configured to convert digital representations of audio signals (such as from the processor 107) to an analogue format suitable for presentation via an audio subsystem output.
  • the digital-to-analogue converter (DAC) 113 or signal processing means can in some embodiments be any suitable DAC technology.
  • the audio subsystem can comprise in some embodiments an audio subsystem output 115.
  • An example as shown in Figure 5 is a pair of speakers 1311 and 1312.
  • the speakers 131 can in some embodiments be configured to receive the output from the digital-to-analogue converter 113 and present the analogue audio signal to the user.
  • the speakers 131 can be representative of a headset, for example a set of earphones, or cordless earphones.
  • the audio capture apparatus 400 is shown operating within an environment or audio scene wherein there are multiple arriving sounds present.
  • the environment comprises a first sound 151, a vocal source such as a person talking, at a first location.
  • the environment shown in Figure 5 comprises a second sound 153, an instrumental source such as a trumpet playing, at a second location.
  • the first and second locations for the first and second sounds 151 and 153 respectively may be different.
  • the first and second sounds may generate audio signals with different spectral characteristics.
  • the audio capture apparatus 400 is shown having both audio capture and audio presentation components, it would be understood that the apparatus 400 can comprise just the audio capture elements such that only the microphones (for audio capture) are present.
  • the audio capture apparatus 400 is described as being suitable for performing the spatial audio signal processing described hereafter.
  • the audio capture components and the spatial signal processing components may also be separate.
  • the audio signals may be captured by a first apparatus comprising the microphone array and a suitable transmitter.
  • the audio signals may then be received and processed in a manner as described herein in a second apparatus comprising a receiver and processor and memory.
  • Figure 6 is a schematic block diagram illustrating processing of signals from multiple microphones to output signals on two channels. Other multi-channel reproductions are also possible. In addition to input from the microphones, input regarding head orientation can be used by the spatial synthesis.
  • the components can be arranged in various different manners.
  • the audio signals and directional metadata may be coded/stored/streamed/transmitted to the viewing device.
  • the apparatus is configured to generate a stereo or other one or more channel audio track that is transmitted alongside the spatial metadata.
  • the stereo track (or other) in some embodiments may be a combination or subset of the microphone signals.
  • the audio track can be encoded e.g. using AAC for transmission or storage, and the spatial metadata from the direction analyser 603 can be embedded to the AAC metadata.
  • the AAC (or other) audio and the spatial metadata can also be combined to a media container such as an mp4 container, possibly with a video track and other information.
  • the transmitted encoded audio and metadata being an AAC or mp4 stream or other, can be decoded to be processed by the spatial synthesizer 607.
  • the aforementioned processing may involve usage of different filter banks such as a forward and an inverse filter bank and a forward and an inverse modified discrete cosine transform (MDCT), or other necessary processes typical to audio/video encoding, multiplexing, transmission, demultiplexing and decoding.
  • the apparatus may be configured to separate direct and ambient audio parts or any other signal components for spatial synthesis to be processed separately.
  • the direct and ambient parts or any other signal components can be synthesized from the audio signals at a single unified step using for example adaptive signal mixing and decorrelation.
  • The capture device may be the device shown in Figures 3 to 5.
  • the capture device can comprise a display and a headphone connector/speaker for viewing the captured media.
  • the audio signals and directional information, or the processed audio output according to the audio signals and the directional information, can be coded/stored in the capture device.
  • the capture device may for example comprise a filter bank 601 configured to receive the multiple microphone signals and output a transformed domain signal to a spatial synthesizer 607 and to a direction analyser 603.
  • the filter bank may be any suitable filter bank implementation such as a short-time Fourier transform (STFT) or a complex QMF bank.
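  • For instance, a short-time Fourier transform can serve as such a filter bank; a minimal sketch using scipy, with an assumed sample rate, frame length, and placeholder microphone signals:

```python
import numpy as np
from scipy.signal import stft

fs = 48000                       # sample rate in Hz (assumed)
mics = np.random.randn(4, fs)    # placeholder for four one-second mic signals

# Transform each microphone signal into the time-frequency domain;
# X has shape (n_mics, n_freq_bins, n_frames).
freqs, times, X = stft(mics, fs=fs, nperseg=1024, noverlap=512, axis=-1)
```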
  • the direction analyser 603 may be configured to receive the audio signals from the filter bank and perform delay analysis in a manner such as described herein in order to determine spatial metadata associated with the audio scene. This information may be passed to the spatial synthesizer 607 and to a direction rotator 605.
  • the capture device comprises a spatial processor such as a direction rotator 605.
  • the direction rotator may be configured to receive the directional information determined within the direction analyser 603 and 'move' the directions based on a head orientation input.
  • the head orientation input may indicate a direction the user is looking and may be detected using for example a head tracker in a head mounted device, or accelerometer/mouse/touchscreen in a mobile phone, tablet, laptop etc.
  • the output 'moved' spatial metadata may be passed to the spatial synthesiser 607.
  • the spatial synthesiser 607 having received the audio signals from filterbank 601 and spatial metadata from the direction analyser 603 and the direction rotator 605 may be configured to generate or synthesise a suitable audio signal.
  • the output signals can be passed in some form (for example coded/stored/streamed/transmitted) to the viewing device.
  • the microphone signals as such are coded/stored/streamed/transmitted to the viewing device that performs the processing as described in Figure 6.
  • the output of the inverse filter bank 609 may be configured to be output by any suitable output means such as speakers/headphones/earphones.
  • With respect to Figure 7, a flow diagram showing the operation of the direction analyser 603 shown in Figure 6, or more generally of a spatial metadata analyser implemented within an example capture or processing device, is described in further detail.
  • the device (and in some embodiments the spatial metadata analyser/direction analyser) is shown selecting a first microphone arrangement associated with a 'thin' axis.
  • the first microphone arrangement may be a pair or more than two microphones which substantially define a dimension or axis.
  • the device is configured to select a dimension or an axis and from this selected dimension or axis determine which microphone audio signals to use for the later analysis. For example a dimension or axis may be chosen which does not have two microphones aligned and thus a 'synthesised' microphone may be generated by combining the audio signals.
  • estimates of a group of delays between a selection of microphones may be made, and the delay information from more than one pair may be used to determine the directional ambiguity 'front-back' choice.
  • the rule to combine the several delay estimates to obtain the directional ambiguity choice can be heuristic (using hand-tuned formulas), or optimized (e.g. using least squares optimization algorithms) based on measurement data from the devices.
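  • A minimal sketch of one such heuristic combination rule (the sign convention, weighting, and function name are illustrative assumptions, not the patent's specified method, which is left open between hand-tuned and optimized formulations):

```python
import numpy as np

def front_back_decision(delays, weights=None):
    """Combine delay estimates from several microphone pairs spanning the
    'thin' axis into one binary front-back choice.

    Hand-tuned heuristic for illustration only; the combination rule could
    equally be fitted to measurement data, e.g. by least squares.
    Sign convention (assumed): a positive delay means the sound reached
    the 'front' microphone of that pair first.
    """
    delays = np.asarray(delays, dtype=float)
    weights = np.ones_like(delays) if weights is None else np.asarray(weights)
    # Weighted vote over the signs of the individual delay estimates.
    score = float(np.dot(weights, np.sign(delays)))
    return "front" if score >= 0.0 else "back"

# Two of the three pairs agree that the sound arrived from the front.
print(front_back_decision([2.1e-5, -0.4e-5, 1.3e-5]))  # -> front
```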
  • the delay information between the audio signals from the selected microphone arrangement may be configured to be used to determine a first spatial metadata part.
  • the first spatial metadata part may for example in some embodiments be a directional ambiguity analysis (such as the front-back determination).
  • The operation of selecting a thin axis and an associated microphone arrangement, and using the delay information from the selected microphone arrangement audio signals to determine directional ambiguity information only, is shown in Figure 7 by step 701.
  • the device (and in some embodiments the spatial metadata analyser/direction analyser) is shown selecting a further microphone arrangement.
  • the further microphone arrangement may be a further pair or more than two microphones which substantially define a dimension or axis other than the 'thin' axis (i.e. the 'thick axes' or 'thick dimensions').
  • this further selection comprises all microphone axes or dimensions other than the 'thin' axis.
  • the delay information between the audio signals from the further selection may be configured to be used to determine a second spatial metadata part.
  • the second spatial metadata part may for example in some embodiments be a robust directional estimate.
  • the first spatial metadata part may further include directional ambiguity estimates (such as the front-back determination).
  • the system may then combine the first and second spatial metadata parts in order to produce robust metadata output.
  • the directional information from the further arrangement of microphone audio signals and the directional ambiguity detection from the first arrangement of microphone audio signal may generate a robust and unambiguous directional result.
  • The operation of determining a combined spatial metadata output from the first and second spatial metadata parts is shown in Figure 7 by step 705.
  • With respect to Figure 8, a first example of delay analysis suitable for use in embodiments is shown.
  • the delay analysis is performed on a single frequency band of the audio signals. It is understood that in embodiments where the analysis is performed on a band-by-band basis, these operations may be performed on a band-by-band basis also.
  • the device (and in some embodiments the spatial metadata analyser/direction analyser) is configured in some embodiments to apply a 'search' method for determining the delay between audio signals generated by pairs of microphones.
  • a cross correlation product between the audio signals captured by the pair of microphones at a set of different delays is determined.
  • the delay with the maximum cross correlation can then be selected as the estimated delay.
  • any suitable search method may be used to determine the delay with the maximum cross correlation.
  • the range of delays may include both negative and positive delays.
  • a delay is selected from the range of delays.
  • the delay is then applied to one of the microphone audio signals.
  • the delay may be applied as adjustments of the phase in the frequency domain, which is an approximation of the delay adjustment.
  • a cross-correlation product is then determined for the un-delayed microphone audio signal and the delayed microphone audio signal.
  • the method then checks whether all of the delays have been selected. Where there are still delays within the range of delays, the method passes back to step 803, where a further delay value is selected from the range of delays.
  • the delay with the maximum cross-correlation product value is selected as the delay information value.
  • The operation of selecting the maximum cross-correlation product value is shown in Figure 8 by step 811.
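  • A minimal Python sketch of this search method is given below, assuming complex frequency-domain band signals x1 and x2 for one microphone pair and bin centre frequencies freqs; the function and variable names are illustrative assumptions.

```python
import numpy as np

def estimate_delay_search(x1, x2, freqs, max_delay, n_candidates=64):
    """Search a range of candidate delays (negative and positive) and
    return the one maximising the cross-correlation between the band
    signals. x1, x2 are complex frequency-domain band signals of one
    microphone pair; freqs holds the bin centre frequencies in Hz."""
    best_delay, best_corr = 0.0, -np.inf
    for d in np.linspace(-max_delay, max_delay, n_candidates):
        # Apply the candidate delay to x2 as a phase adjustment in the
        # frequency domain (an approximation of a true time delay).
        x2_shifted = x2 * np.exp(-2j * np.pi * freqs * d)
        corr = np.real(np.sum(x1 * np.conj(x2_shifted)))
        if corr > best_corr:
            best_delay, best_corr = d, corr
    return best_delay
```

  • The candidate grid here is uniform; any suitable search strategy (e.g. coarse-to-fine) could replace the linear scan.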
  • a further example of the determination of the delay information is a phase derivative method for determining the delay information value.
  • in this phase derivative method, a delay between microphones is determined which corresponds to the derivative over frequency of the phase difference between the microphones.
  • in this manner an estimate of the delay may be provided.
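  • A corresponding sketch of the phase-derivative method, under the same assumptions about the band signals, might look as follows; the least-squares slope stands in for the derivative over frequency.

```python
import numpy as np

def estimate_delay_phase_derivative(x1, x2, freqs):
    """Estimate the inter-microphone delay as the derivative over
    frequency of the (unwrapped) phase difference between the two
    complex frequency-domain signals."""
    phase_diff = np.unwrap(np.angle(x1 * np.conj(x2)))
    # The least-squares slope of phase difference versus angular
    # frequency is the delay in seconds.
    slope, _ = np.polyfit(2 * np.pi * freqs, phase_diff, 1)
    return slope
```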
  • any suitable method for determining the delay information between selected pairs of microphone audio signals may be implemented in order to obtain the delay information.
  • this delay information can be used to determine the spatial metadata.
  • the selected pair of microphones, microphone 1 301 and microphone 3 305, may be sufficiently spatially separated that the delay information may be used to determine the directional or angular information. This is done by first normalizing the delay parameter with a maximum-delay parameter (formulated from the distance between the microphones of the pair and the speed of sound) to obtain a normalized delay dnorm that is constrained between -1 and 1.
  • the maximum normalized delay is obtained when the sound arrives from the axis defined by the pair of microphones.
  • the angular information may then be obtained, for example, as acos(dnorm).
  • the selected pair of microphones microphone 1 301 and microphone 2 303 may not be sufficiently spatially separated to perform directional analysis.
  • the delay information from this pair of microphone audio signals may be able to provide a directional ambiguity decision (the 'front-back' decision), which can be determined from the sign of the normalized delay parameter. In such a manner, combining the front-back information and the angular information provides the direction of the arriving sound in the horizontal plane (a sketch combining these steps follows below).
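  • Combining the two previous items, a hypothetical helper might map the delay of the well-separated pair to an angle and resolve the front-back ambiguity from the thin-axis delay sign; the sign convention for 'front' is an assumption of this sketch, as are all the names.

```python
import numpy as np

def horizontal_direction(d_wide, d_thin, mic_distance, c=343.0):
    """Map the delay of a well-separated pair to an angle and resolve
    the front-back ambiguity from the sign of the thin-axis delay."""
    d_max = mic_distance / c                     # maximum physical delay
    d_norm = np.clip(d_wide / d_max, -1.0, 1.0)  # normalised delay
    angle = np.degrees(np.arccos(d_norm))        # 0..180 degrees
    # Sign convention: non-negative thin-axis delay taken as 'front'.
    return angle if d_thin >= 0.0 else -angle
```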
  • a ratio parameter indicating the proportion of the sound energy arriving from the analysed direction may be determined from a coherence parameter calculated between the microphone audio signals. Only a directional sound is coherent (although potentially differently delayed) between the microphones, whereas a non-directional sound can be incoherent at some frequencies, or partially coherent at lower frequencies. Thus by performing a correlation analysis a ratio parameter of the analysed sound can be provided.
  • the correlation determination may be performed on the thin axis and non-thin axis selected microphone arrangement audio signals.
  • the determination of the ratio parameter is typically preferably performed using the correlation determination on the non-thin-axis selected microphone arrangement audio signals. This is because a pair of microphones with a larger distance will have larger differences between the correlations of directional sound and non-directional sound.
  • the normalized complex-valued cross-correlation between channels 1 and 3 may be expressed as c13 = E[x1 x3*] / sqrt(E[|x1|^2] E[|x3|^2]), where E[] denotes the expectation (in practice a mean over time) and * the complex conjugate.
  • the audio signals x are complex-valued frequency band signals where the subscript indicates the microphone source of the audio signal.
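  • As a sketch, the c13 expression above may be computed directly from the band signals; using its magnitude as the ratio parameter is a simplification of the analysis described herein, and the function name is illustrative.

```python
import numpy as np

def ratio_from_coherence(x1, x3):
    """Normalised complex-valued cross-correlation c13 of two complex
    frequency-band signals; its magnitude is used here as a simple
    stand-in for the direct-to-total energy ratio parameter."""
    num = np.sum(x1 * np.conj(x3))
    den = np.sqrt(np.sum(np.abs(x1) ** 2) * np.sum(np.abs(x3) ** 2))
    c13 = num / max(den, 1e-12)   # guard against silent frames
    return float(np.abs(c13))     # ~1 directional, ~0 diffuse
```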
  • height directional information can also be determined.
  • the device thickness defines the 'thin' axis microphone spacing compared to the height or width axes.
  • any microphone arrangement which is separated by only the thickness axis is selected to only be suitable for determining directional ambiguity spatial metadata (for example 'front-back' analysis).
  • the microphone pair defined by microphone 1 401 and microphone 3 405 which is separated by this 'thin' axis is selected such that it is a 'directional ambiguity' microphone arrangement selection, and any analysis performed on the audio signals from this selection of microphones is a 'directional ambiguity' analysis.
  • Other microphone selections such as microphone 1 401 and microphone 2 403 (or microphone 1 401 and microphone 4 407) which are separated by a distance more than the 'thin' axis may be selected to perform delay analysis to determine directional (or other robust) parameters.
  • microphones 1 401, 2 403 and 4 407 can be utilized to detect a direction of the arriving sound, however with the directional ambiguity that the sound could arrive from either of the two sides of the device, as determined by the front-back axis.
  • the microphone pair 1,2 and the pair 1,4 are located exactly at the horizontal and vertical axes. This is an example configuration which enables the direction to be estimated and expressed in a simple way.
  • directional information can be determined from the microphone audio signals using the following equations, assuming the delays between all of the microphone audio channels have been determined and defined with d1 as the delay estimate between microphone pair 1,2; d2 as the delay estimate between microphone pair 1,4; and d3 as the delay estimate between microphone pair 1,3.
  • the front-back information can be estimated from the sign of d3.
  • a unit vector v can be defined such that it would point to the direction- of-arrival.
  • the unit vector components at axes 1 and 2 may be determined from the robustly estimated delays d1 and d2, for example by normalizing each delay with the maximum delay of its pair, v1 = d1/d1,max and v2 = d2/d2,max, with the remaining component obtained from the unit-norm constraint as v3 = sqrt(max(0, 1 - v1^2 - v2^2)).
  • the directional ambiguity choice is then retrieved from the sign of d3, or any other similar directional ambiguity choice on that axis.
  • the direction of arrival is thus the direction of vector v, where the front-back parameter has been applied to select the sign of v3, i.e. whether the estimated direction should be mirrored to the other side of the device.
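  • The direction-vector construction just described may be sketched as follows, with d1_max and d2_max denoting the maximum physical delays of the respective pairs (illustrative names, consistent with the normalization above).

```python
import numpy as np

def direction_vector(d1, d2, d3, d1_max, d2_max):
    """Build a unit direction-of-arrival vector from the two robust
    'thick' axis delays and resolve the thin-axis sign from d3."""
    v1 = np.clip(d1 / d1_max, -1.0, 1.0)
    v2 = np.clip(d2 / d2_max, -1.0, 1.0)
    v3 = np.sqrt(max(0.0, 1.0 - v1 ** 2 - v2 ** 2))  # unit-norm constraint
    if d3 < 0:
        v3 = -v3  # mirror the estimate to the other side of the device
    return np.array([v1, v2, v3])
```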
  • as microphones 1 and 2 are significantly far apart (for example, on a mobile device they would be more than 4 cm apart), they are well suited for detecting coherence.
  • any microphone pair other than the microphone 1 and microphone 3 pair can be utilized for the coherence analysis.
  • multiple coherence analyses between several pairs could be obtained, and the ratio parameter estimation could combine such coherence information to arrive at a more robust ratio parameter estimate.
  • the direction, coherence and other sound characteristics can be detected separately for each frequency band.
  • the spatial metadata as described herein may also be known as directional metadata or spatial parametric information, among other terms.
  • the advantage of selecting one axis (based on the device shape and microphone positions) for only directional ambiguity ('front-back') analysis is that it enables the device to determine accurate spatial metadata within a range of relevant devices where many of the prior methods are not well suited.
  • the methods described herein enable accurate spatial metadata to be generated using smartphones, tablets or similar devices with at least three microphones, where at least one of the axes of the device is significantly shorter than the others.
  • in some embodiments the distance between the microphones is known. However, in some embodiments the distance between the microphones may be determined by implementing a training sequence, wherein the device is configured to 'test' capture an arriving sound from a range of directions and the delay determination is used to find the maximum delays between pairs of microphones, and thus define the distances between the microphones (a minimal calibration sketch is given below).
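  • A minimal sketch of such a training sequence, assuming per-pair delay estimates have already been collected into a dictionary keyed by microphone pair (an illustrative data layout):

```python
import numpy as np

def calibrate_distances(observed_delays, c=343.0):
    """Given delay estimates collected per microphone pair while sound
    arrives from many directions, take the largest observed |delay| as
    the end-fire maximum for that pair and convert it to an
    inter-microphone distance (distance = speed of sound * max delay)."""
    return {pair: c * float(np.max(np.abs(np.asarray(delays))))
            for pair, delays in observed_delays.items()}
```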
  • in some embodiments the actual distances between the microphones are not determined or known, and the selection as to whether a pair of microphones may be used to determine only a 'directional ambiguity' decision (such as the 'front-back' decision) or a range of parameter values (such as the positional/orientation or coherence or ratio parameters) may be determined based on a current 'max' delay experienced for the microphones.
  • the pair of microphone signals may be initially selected to only be able to perform 'directional ambiguity' decisions based on the delay signal analysis. In other words the sign of the delay is used to determine the directional ambiguity decision.
  • the selected pair of microphones may be used to determine more than the directional ambiguity decisions.
  • the delay values may be used to determine the spatial metadata direction.
  • this 'max' value may be a determined maximum delay value, and may thus be used to select whether the pair of microphones is currently suitable for determining the directional metadata, compared to another selection of a pair of microphones.
  • a parametric analysis of spatial sound is understood to mean that a sound model is assumed, e.g. a directional sound plus ambience at a frequency band, and algorithms are then designed to estimate the model parameters, i.e. the spatial metadata.
  • the sound model involves a directional parameter in frequency bands which is obtained using the directional ambiguity analysis at one spatial axis, and other analysis at the other axis/axes.
  • the directional parameter or other metadata is not stored or transmitted, but is analyzed, utilized for spatial synthesis and then discarded.
  • the device is configured to capture the microphone audio signals and directly process them into a 5.1 channel output.
  • the system estimates the spatial sound model parameters accordingly, and steers the sound to the loudspeaker or loudspeakers at that direction.
  • the spatial metadata analysis is performed at some part of the system in order to enable spatially accurate reproduction, but in this case the spatial metadata is not stored or transmitted.
  • the metadata is just a temporary variable within the system that is directly applied for the synthesis (e.g. selection of HRTFs, loudspeaker gains etc.) to produce the spatialized sound. This would be the case where the device is configured to perform both capture and playback. So, in this case also the metadata is estimated; however, it is not stored anywhere.
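  • As an illustration only, steering the directional component of a band to the nearest loudspeaker of a nominal 5.1 layout might be sketched as below; real systems would use amplitude panning (e.g. VBAP) rather than this hard selection, and the layout constants are assumptions.

```python
# Nominal 5.1 loudspeaker azimuths in degrees: L, R, C, Ls, Rs.
SPEAKER_AZIMUTHS = [30.0, -30.0, 0.0, 110.0, -110.0]

def nearest_speaker_gains(azimuth):
    """Steer a band's directional component to the loudspeaker nearest
    the analysed direction (a crude stand-in for amplitude panning)."""
    def angular_distance(a, b):
        return abs((a - b + 180.0) % 360.0 - 180.0)
    idx = min(range(len(SPEAKER_AZIMUTHS)),
              key=lambda i: angular_distance(azimuth, SPEAKER_AZIMUTHS[i]))
    gains = [0.0] * len(SPEAKER_AZIMUTHS)
    gains[idx] = 1.0
    return gains
```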
  • the capture device is configured to send one or more audio channels (based on the microphone channels) and the analysed metadata.
  • the audio channels can be encoded e.g. with AAC.
  • the AAC encoding reduces the SNR (although perceptual masking typically makes the quantization noise inaudible), which can reduce the metadata analysis accuracy. This is one reason why the analysis is best done in the capture device.
  • the receiver is configured to retrieve the audio and metadata, and performs the spatialization flexibly, e.g. for head-tracked headphones or loudspeakers.
  • the device may also store the raw audio waveform as is, and the metadata analysis is performed at another entity, such as a computer software.
  • the mobile device camera and microphone data (from one or more microphones) is imported to a computer executing code on at least one processor, and all the metadata analysis, picture stitching etc. are performed there.
  • the code or software is informed which device is used, and configures itself accordingly.
  • the microphone channels, encoded at a high bit rate, may be sent to the receiver, with the metadata analysis and synthesis performed there.
  • the system is configured to estimate spatial parameters, i.e., the spatial metadata, but the analysis may be performed at any suitable point in the system.
  • the analysis and estimation often take place on a computer, while on a mobile device the estimation often takes place on the device itself.
  • the various embodiments of the invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the invention is not limited thereto.
  • While various aspects of the invention may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the embodiments of this invention may be implemented by computer software executable by a data processor of the electronic device, such as in the processor entity, or by hardware, or by a combination of software and hardware.
  • any blocks of the logic flow as in the Figures may represent program steps, or interconnected logic circuits, blocks and functions, or a combination of program steps and logic circuits, blocks and functions.
  • the software may be stored on such physical media as memory chips or memory blocks implemented within the processor, magnetic media such as hard disks or floppy disks, and optical media such as, for example, DVD and the data variants thereof, or CD.
  • the memory may be of any type suitable to the local technical environment and may be implemented using any suitable data storage technology, such as semiconductor-based memory devices, magnetic memory devices and systems, optical memory devices and systems, fixed memory and removable memory.
  • the data processors may be of any type suitable to the local technical environment, and may include one or more of general purpose computers, special purpose computers, microprocessors, digital signal processors (DSPs), application specific integrated circuits (ASIC), gate level circuits and processors based on multi-core processor architecture, as non-limiting examples.
  • Embodiments of the inventions may be practiced in various components such as integrated circuit modules.
  • the design of integrated circuits is by and large a highly automated process. Complex and powerful software tools are available for converting a logic level design into a semiconductor circuit design ready to be etched and formed on a semiconductor substrate.
  • Programs such as those provided by Synopsys, Inc. of Mountain View, California and Cadence Design, of San Jose, California automatically route conductors and locate components on a semiconductor chip using well established rules of design as well as libraries of pre-stored design modules.
  • the resultant design in a standardized electronic format (e.g., Opus, GDSII, or the like) may be transmitted to a semiconductor fabrication facility or "fab" for fabrication.

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Quality & Reliability (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Obtaining Desirable Characteristics In Audible-Bandwidth Transducers (AREA)
PCT/FI2017/050778 2016-11-18 2017-11-10 Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices Ceased WO2018091776A1 (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17871590.0A EP3542546B1 (en) 2016-11-18 2017-11-10 Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices
JP2019526614A JP7082126B2 (ja) 2016-11-18 2017-11-10 デバイス内の非対称配列の複数のマイクからの空間メタデータの分析
US16/461,606 US10873814B2 (en) 2016-11-18 2017-11-10 Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices
CN201780083608.7A CN110337819B (zh) 2016-11-18 2017-11-10 来自设备中具有不对称几何形状的多个麦克风的空间元数据的分析

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB1619573.7 2016-11-18
GB1619573.7A GB2556093A (en) 2016-11-18 2016-11-18 Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices

Publications (1)

Publication Number Publication Date
WO2018091776A1 true WO2018091776A1 (en) 2018-05-24

Family

ID=57993851

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2017/050778 Ceased WO2018091776A1 (en) 2016-11-18 2017-11-10 Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices

Country Status (6)

Country Link
US (1) US10873814B2 (en)
EP (1) EP3542546B1 (en)
JP (1) JP7082126B2 (ja)
CN (1) CN110337819B (zh)
GB (1) GB2556093A (en)
WO (1) WO2018091776A1 (en)


Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2563635A (en) 2017-06-21 2018-12-26 Nokia Technologies Oy Recording and rendering audio signals
GB2572368A (en) * 2018-03-27 2019-10-02 Nokia Technologies Oy Spatial audio capture
GB2572650A (en) * 2018-04-06 2019-10-09 Nokia Technologies Oy Spatial audio parameters and associated spatial audio playback
GB2587357A (en) 2019-09-24 2021-03-31 Nokia Technologies Oy Audio processing
GB2598932A (en) 2020-09-18 2022-03-23 Nokia Technologies Oy Spatial audio parameter encoding and associated decoding
GB2598960A (en) 2020-09-22 2022-03-23 Nokia Technologies Oy Parametric spatial audio rendering with near-field effect
CA3194906A1 (en) 2020-10-05 2022-04-14 Anssi Ramo Quantisation of audio parameters
GB2608406A (en) 2021-06-30 2023-01-04 Nokia Technologies Oy Creating spatial audio stream from audio objects with spatial extent
US12010483B2 (en) 2021-08-06 2024-06-11 Qsc, Llc Acoustic microphone arrays
CN115665606B (zh) * 2022-11-14 2023-04-07 深圳黄鹂智能科技有限公司 基于四麦克风的收音方法和收音装置
EP4623437A1 (en) 2022-11-21 2025-10-01 Nokia Technologies Oy Determining frequency sub bands for spatial audio parameters
GB2626953A (en) 2023-02-08 2024-08-14 Nokia Technologies Oy Audio rendering of spatial audio


Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7039198B2 (en) * 2000-11-10 2006-05-02 Quindi Acoustic source localization system and method
EP1489596B1 (en) * 2003-06-17 2006-09-13 Sony Ericsson Mobile Communications AB Device and method for voice activity detection
US8300845B2 (en) 2010-06-23 2012-10-30 Motorola Mobility Llc Electronic apparatus having microphones with controllable front-side gain and rear-side gain
US8855341B2 (en) * 2010-10-25 2014-10-07 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for head tracking based on recorded sound signals
US9055371B2 (en) * 2010-11-19 2015-06-09 Nokia Technologies Oy Controllable playback system offering hierarchical playback options
JP5909678B2 (ja) * 2011-03-02 2016-04-27 パナソニックIpマネジメント株式会社 収音装置
WO2013093565A1 (en) * 2011-12-22 2013-06-27 Nokia Corporation Spatial audio processing apparatus
US9258644B2 (en) * 2012-07-27 2016-02-09 Nokia Technologies Oy Method and apparatus for microphone beamforming
CN103837858B (zh) * 2012-11-23 2016-12-21 中国科学院声学研究所 一种用于平面阵列的远场波达角估计方法及系统
CN104019885A (zh) * 2013-02-28 2014-09-03 杜比实验室特许公司 声场分析系统
US9788119B2 (en) * 2013-03-20 2017-10-10 Nokia Technologies Oy Spatial audio apparatus
US9781507B2 (en) * 2013-04-08 2017-10-03 Nokia Technologies Oy Audio apparatus
US9894454B2 (en) * 2013-10-23 2018-02-13 Nokia Technologies Oy Multi-channel audio capture in an apparatus with changeable microphone configurations
US9282399B2 (en) * 2014-02-26 2016-03-08 Qualcomm Incorporated Listen to people you recognize
WO2016179211A1 (en) * 2015-05-04 2016-11-10 Rensselaer Polytechnic Institute Coprime microphone array system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120051548A1 (en) 2010-02-18 2012-03-01 Qualcomm Incorporated Microphone array subset selection for robust noise reduction
US20120128160A1 (en) 2010-10-25 2012-05-24 Qualcomm Incorporated Three-dimensional sound capturing and reproducing with multi-microphones
US20130272538A1 (en) 2012-04-13 2013-10-17 Qualcomm Incorporated Systems, methods, and apparatus for indicating direction of arrival
US20150208156A1 (en) * 2012-06-14 2015-07-23 Nokia Corporation Audio capture apparatus
WO2016096021A1 (en) 2014-12-18 2016-06-23 Huawei Technologies Co., Ltd. Surround sound recording for mobile devices

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KOWALCZYK, K. ET AL.: "Parametric spatial sound processing", IEEE SIGNAL PROCESSING MAGAZINE, vol. 32, no. 2, 2015, pages 31 - 42, XP011573081, Retrieved from the Internet <URL:http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7038281> [retrieved on 20180308] *
See also references of EP3542546A4

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12156012B2 (en) 2018-11-13 2024-11-26 Dolby International Ab Representing spatial audio by means of an audio signal and associated metadata
US12167219B2 (en) 2018-11-13 2024-12-10 Dolby Laboratories Licensing Corporation Audio processing in immersive audio services
US12309568B2 (en) 2019-06-25 2025-05-20 Nokia Technologies Oy Spatial audio representation and rendering
US11956615B2 (en) 2019-06-25 2024-04-09 Nokia Technologies Oy Spatial audio representation and rendering
WO2021053264A1 (en) 2019-09-17 2021-03-25 Nokia Technologies Oy Direction estimation enhancement for parametric spatial audio capture using broadband estimates
US12243540B2 (en) 2019-12-23 2025-03-04 Nokia Technologies Oy Merging of spatial audio parameters
US12243553B2 (en) 2019-12-23 2025-03-04 Nokia Technologies Oy Combining of spatial audio parameters
US12451147B2 (en) 2019-12-31 2025-10-21 Nokia Technologies Oy Spatial audio parameter encoding and associated decoding
WO2021170900A1 (en) 2020-02-26 2021-09-02 Nokia Technologies Oy Audio rendering with spatial metadata interpolation
US12439220B2 (en) 2020-03-03 2025-10-07 Nokia Technologies Oy Apparatus, methods and computer programs for enabling reproduction of spatial audio signals
US12400667B2 (en) 2020-06-09 2025-08-26 Nokia Technologies Oy Spatial audio parameter encoding and associated decoding
WO2022136725A1 (en) 2020-12-21 2022-06-30 Nokia Technologies Oy Audio rendering with spatial metadata interpolation and source position information
EP4164255A1 (en) 2021-10-08 2023-04-12 Nokia Technologies Oy 6dof rendering of microphone-array captured audio for locations outside the microphone-arrays
WO2024115062A1 (en) * 2022-12-02 2024-06-06 Nokia Technologies Oy Apparatus, methods and computer programs for spatial audio processing

Also Published As

Publication number Publication date
US10873814B2 (en) 2020-12-22
US20200068309A1 (en) 2020-02-27
JP2020500480A (ja) 2020-01-09
CN110337819A (zh) 2019-10-15
GB2556093A (en) 2018-05-23
EP3542546A4 (en) 2020-05-13
JP7082126B2 (ja) 2022-06-07
EP3542546B1 (en) 2025-02-05
CN110337819B (zh) 2021-12-10
GB201619573D0 (en) 2017-01-04
EP3542546A1 (en) 2019-09-25

Similar Documents

Publication Publication Date Title
US10873814B2 (en) Analysis of spatial metadata from multi-microphones having asymmetric geometry in devices
US11671781B2 (en) Spatial audio signal format generation from a microphone array using adaptive capture
US10382849B2 (en) Spatial audio processing apparatus
JP2020500480A5 (ja)
US10785589B2 (en) Two stage audio focus for spatial audio processing
US11659349B2 (en) Audio distance estimation for spatial audio processing
CN105264911A (zh) 音频设备
US11350213B2 (en) Spatial audio capture
US20230362537A1 (en) Parametric Spatial Audio Rendering with Near-Field Effect

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17871590

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019526614

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2017871590

Country of ref document: EP

Effective date: 20190618