CN112153530B - Spatial audio file format for storing capture metadata


Info

Publication number
CN112153530B
CN112153530B (application CN202010591439.1A)
Authority
CN
China
Prior art keywords
microphones
microphone
capture device
metadata
location
Prior art date
Legal status
Active
Application number
CN202010591439.1A
Other languages
Chinese (zh)
Other versions
CN112153530A (en)
Inventor
J. D. Sheaffer
S. Delikaris-Manias
G. R. Rojo
P. A. Raffensperger
E. A. Allamanche
F. Baumgarte
D. Sen
J. D. Atkins
J. O. Merimaa
Current Assignee
Apple Inc
Original Assignee
Apple Inc
Priority date
Filing date
Publication date
Priority claimed from US16/899,019 (US11841899B2)
Application filed by Apple Inc
Publication of CN112153530A
Application granted
Publication of CN112153530B


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R3/00 Circuits for transducers, loudspeakers or microphones
    • H04R3/005 Circuits for transducers, loudspeakers or microphones for combining the signals of two or more microphones
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F16/61 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 Digital recording or reproducing
    • G11B20/10009 Improvement or modification of read or write signals
    • G11B20/10046 Improvement or modification of read or write signals filtering or equalising, e.g. setting the tap weights of an FIR filter
    • G11B20/10055 Improvement or modification of read or write signals filtering or equalising, e.g. setting the tap weights of an FIR filter using partial response filtering when writing the signal to the medium or reading it therefrom
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 Digital recording or reproducing
    • G11B20/10527 Audio or video recording; Data buffering arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • G PHYSICS
    • G11 INFORMATION STORAGE
    • G11B INFORMATION STORAGE BASED ON RELATIVE MOVEMENT BETWEEN RECORD CARRIER AND TRANSDUCER
    • G11B20/00 Signal processing not specific to the method of recording or reproducing; Circuits therefor
    • G11B20/10 Digital recording or reproducing
    • G11B20/10527 Audio or video recording; Data buffering arrangements
    • G11B2020/10537 Audio or video recording
    • G11B2020/10546 Audio or video recording specifically adapted for audio data

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Otolaryngology (AREA)
  • Circuit For Audible Band Transducer (AREA)

Abstract

The present disclosure relates to spatial audio file formats for storing capture metadata. A device having a microphone is capable of generating a microphone signal during audio recording. The device can store the microphone signal and metadata in an electronic audio data file, the metadata including an impulse response of the microphone. Other aspects are described and claimed.

Description

Spatial audio file format for storing capture metadata
Technical Field
One aspect of the present disclosure relates to an audio file format that includes metadata related to a capture device.
Background
An audio capture device, such as a microphone or a device with a microphone, may sense sound by converting sound pressure changes into electrical signals with an electroacoustic transducer. The electrical signals may be digitized with an analog-to-digital converter (ADC) and encoded to form audio files having known file formats, such as AIFF, AU, FLAC, MPEG-4 SLS, MPEG-4 ALS, WMA Lossless, Opus, MP3, first or higher order ambisonics formats, and the like. A decoder may decode the file format and use the decoded audio file to generate a set of audio signals that may be used to drive speakers.
Disclosure of Invention
There are audio file formats that carry audio data formatted for a particular playback configuration (e.g., stereo, 5.1, or 7.1). Such audio formatting may be specific to a predefined speaker arrangement; in this case, less-than-ideal speaker placement may result in an unpleasant audio playback experience.
In addition, audio files formatted for playback lack flexibility. Converting from one audio format to another can be inefficient, and audio data can be lost in the conversion, making the original sound recorded by the device difficult to reproduce.
Ambisonics recordings, such as B-format or higher order ambisonics recordings, are more flexible than audio files formatted for a particular playback configuration, because an ambisonics recording may be rendered in different playback configurations. Ambisonics recording files do not specify or require a specific playback arrangement. However, ambisonics capture devices require special microphone arrays in which the microphones are precisely arranged in a specific arrangement (e.g., a spherical array). Such microphone placement may not be practical for all capture devices (e.g., mobile phones or tablets).
Furthermore, first order ambisonics recordings have low spatial resolution, which may cause blurring of the sound source. Higher order ambisonics can provide higher resolution, but the resulting audio file can grow very large, making it difficult to manipulate. For example, a 12th order ambisonics recording may require a uniform or near-uniform spherical microphone array arrangement with 169 channels, because the number of channels is (M+1)², where M is the order. The channels are formatted with one of many higher order ambisonics formatting conventions (e.g., ACN, SID, Furse-Malham, or others) and different normalization schemes (such as N3D, SN3D, N2D, SN2D, maxN, or others), which may result in additional losses.
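As an illustration only (not part of the disclosure), the channel-count arithmetic can be checked with a few lines of Python; the function name is ours:

```python
# Channels required for an Mth-order ambisonics recording: (M + 1)^2.
def ambisonics_channel_count(order: int) -> int:
    return (order + 1) ** 2

assert ambisonics_channel_count(1) == 4     # first order (B-format)
assert ambisonics_channel_count(12) == 169  # the 169 channels noted above
```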
Audio data files may be generated to have flexibility in different playback configurations. The playback device or formatting device may process the user's raw microphone data in a manner selected by the device. For example, the playback device may use the metadata of the audio data file to beamform or spatialize the raw microphone data. The metadata may include one or more impulse responses of a microphone of the capture device. The impulse response data may be used on the playback side to filter the raw microphone data to provide a more immersive audio experience.
In one aspect of the disclosure, an electronic audio data file is described. The file may include raw audio data for two or more microphone signals; and metadata. The metadata may have an impulse response or transfer function for each of two or more microphones of the recording or capturing device. Each impulse response or transfer function may define a response of one of the two or more microphones to an acoustic impulse.
In one aspect, a method for capturing and/or processing audio includes receiving a microphone signal from a microphone of a capture device; and storing a) the microphone signal and b) metadata in an electronic audio data file, the metadata including one or more impulse responses of the microphone of the capture device, wherein the one or more impulse responses define a response of the microphone to an acoustic impulse.
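For illustration, the electronic audio data file described above might be modeled as below. The disclosure does not define a concrete serialization, so every name and field here is a hypothetical stand-in:

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class CaptureMetadata:
    # One impulse response per (sound source location, microphone) pair,
    # keyed by hypothetical identifiers such as ("S1", "MIC1").
    impulse_responses: dict = field(default_factory=dict)
    # Optional per-source positions: (azimuth_deg, elevation_deg, radius_m).
    source_locations: dict = field(default_factory=dict)

@dataclass
class SpatialAudioFile:
    sample_rate_hz: int
    raw_audio: np.ndarray   # shape (num_mics, num_samples); unformatted, unmixed
    metadata: CaptureMetadata
```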
The above summary does not include an exhaustive list of all aspects of the present disclosure. It is contemplated that the present disclosure includes all systems and methods that may be practiced from all suitable combinations of the various aspects summarized above, as well as those disclosed in the detailed description below and particularly pointed out in the claims section. Such combinations may have particular advantages not specifically set forth in the summary above.
Drawings
Aspects of the present disclosure are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements. It should be noted that references to "one" or "an" aspect in this disclosure are not necessarily to the same aspect, and they mean at least one. In addition, for the sake of brevity and to reduce the total number of figures, a given figure may be used to illustrate the features of more than one aspect of the disclosure, and not all elements in the figure may be required for a given aspect.
FIG. 1 illustrates a system for generating an audio file having metadata describing a capture device, according to one aspect.
FIG. 2 illustrates a capture device having a microphone and a sound source, according to one aspect.
FIG. 3 illustrates an audio file having metadata describing a capture device, according to one aspect.
FIG. 4 illustrates a process or method for generating an audio file with metadata describing a capture device, according to one aspect.
FIG. 5 illustrates an example of audio system hardware in accordance with an aspect.
Detailed Description
Aspects of the present disclosure will now be explained with reference to the accompanying drawings. Whenever the shapes, relative positions, and other aspects of the described components are not explicitly defined, the scope of the invention is not limited to the components shown, which are for illustrative purposes only. Additionally, while numerous details are set forth, it is understood that aspects of the disclosure may be practiced without these details. In other instances, well-known circuits, algorithms, structures, and techniques have not been shown in detail in order not to obscure the understanding of this description.
Generating audio files with capture device information
Referring now to FIG. 1, a system 20 includes a capture device 18 that generates audio files. The audio file contains metadata that includes information about the capture device. The device may include a plurality of microphones (Q microphones) that may generate Q microphone signals. The Q microphones may have a fixed and known arrangement on the capture device, forming one or more microphone arrays. Each microphone may have an electroacoustic transducer that converts sensed sound (e.g., pressure changes) into an electrical signal (e.g., an analog microphone signal). These analog signals may be digitized by an analog-to-digital converter (ADC) to generate digital microphone signals.
Encoder 22 may generate an electronic audio file 23 having the microphone signal or raw audio data extracted from the microphone signal (e.g., a truncated version or a cropped version of the microphone signal). The stored microphone signals may be unformatted (e.g., not upmixed or downmixed), unfiltered, and/or uncompressed. The encoder generates metadata for the audio file 23 that includes a plurality of impulse responses for the Q microphones of the capture device 18. Each impulse response may define an acoustic response of one of the microphones to an acoustic impulse at a particular location in space. By storing the impulse response of the capturing device with the microphone signal, the playback device can process the microphone signal using the impulse response of the capturing device to perform, for example, beamforming, spatialization and localization of the sound source.
In one aspect, the metadata may be compressed by a compression module 29. The number of impulse responses stored in an audio file may depend on the desired spatial resolution and "coverage" of the audio file. The size of the audio file grows as the spatial resolution and spatial "coverage" increase. Thus, the impulse responses, or filters representing the impulse responses (e.g., a finite impulse response (FIR) filter having filter taps and their coefficients), may be compressed by known compression algorithms to reduce the size of the metadata and the audio file.
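As a sketch of the compression step (the disclosure refers to "known compression algorithms" without naming one, so DEFLATE and float32 quantization are our assumptions), the FIR taps could be stored like this:

```python
import zlib
import numpy as np

def compress_impulse_response(ir: np.ndarray) -> bytes:
    # Quantize the FIR taps to float32, then apply lossless DEFLATE.
    return zlib.compress(ir.astype(np.float32).tobytes(), level=9)

def decompress_impulse_response(blob: bytes) -> np.ndarray:
    return np.frombuffer(zlib.decompress(blob), dtype=np.float32)

ir = np.random.default_rng(1).standard_normal(4096)
restored = decompress_impulse_response(compress_impulse_response(ir))
assert np.allclose(ir, restored, atol=1e-5)  # only float32 rounding remains
```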
In one aspect, the capture device includes a sensor 28, such as an inertial measurement unit formed from a combination of accelerometers, gyroscopes, and/or magnetometers. The device may process the sensor data to determine an orientation of the device (e.g., absolute or relative tilt of the device in three-dimensional space). In one aspect, the sensor 28 may comprise a camera. Images from the camera may be processed to track the device using known visual odometry and/or simultaneous localization and mapping (SLAM) algorithms. The orientation of the device may be tracked and recorded while audio is captured, such that an audio file is generated with device orientation data that is time-synchronized (e.g., frame-by-frame) with the microphone signals or raw audio data.
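A minimal sketch of frame-synchronized orientation data; the frame size and the yaw/pitch/roll representation are assumptions, since the disclosure only requires that the orientation track be time-synchronized with the audio:

```python
from dataclasses import dataclass

FRAME_SIZE = 1024  # samples per audio frame (an assumed value)

@dataclass
class OrientationFrame:
    frame_index: int            # which block of FRAME_SIZE samples this pose covers
    yaw_pitch_roll_deg: tuple   # device tilt from the IMU and/or visual odometry

def frame_for_sample(sample_index: int) -> int:
    """Map a sample position in the raw microphone data to its orientation frame."""
    return sample_index // FRAME_SIZE
```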
In one aspect, decoder or playback device 19 may receive audio data files and decode the audio data files with microphone signals and metadata. Decoder/device 19 may have an audio processor 24 that generates beamforming filters based on the impulse response of each of the microphones. In this case, the renderer 26 may apply beamforming filters to the raw audio data to generate a plurality of L beamformed signals. The beamformed signals may be used to drive speakers 27 of the playback device.
In one aspect, an audio processor of a playback device may generate a spatialization filter using an impulse response of an audio file. The renderer 26 may apply those spatial filters to the raw microphone signals of the audio file and drive the speakers with the spatialized audio signals. In one aspect, the device may locate sounds in the microphone signal and/or identify voice and/or sound activity in the microphone signal based on the impulse response.
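The disclosure does not prescribe a beamformer design. As one simple possibility, each microphone signal could be filtered with the time-reversed stored impulse response for a chosen source location (a matched filter) and the results summed:

```python
import numpy as np

def matched_filter_beamform(raw_audio: np.ndarray, irs: list) -> np.ndarray:
    """Steer toward one stored source location.

    raw_audio: (num_mics, num_samples) array of raw microphone signals
    irs:       per-microphone impulse responses for that source location
    """
    out = np.zeros(raw_audio.shape[1] + max(len(h) for h in irs) - 1)
    for mic_signal, h in zip(raw_audio, irs):
        y = np.convolve(mic_signal, h[::-1])  # matched filter = time-reversed IR
        out[: len(y)] += y
    return out
```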
Combining the impulse response of the microphone with the original microphone signal into an audio file provides the playback device with freedom as to how to filter and format the microphone signal for playback. In one aspect, the playback device may include an upmixer/downmixer to upmix/downmix the microphone signals to a desired playback configuration (e.g., stereo, 5.1, or 7.1).
Audio file metadata
Fig. 2 and 3 may be discussed together with respect to generating an audio file having metadata that includes an impulse response of a capture device. The capture device 41 is shown in fig. 2 as having a plurality of microphones 43. Although shown as a box, the capture device may be any device having two or more microphones, such as, but not limited to, a tablet, a smartphone, a laptop, a head-mounted device (e.g., "smart" glasses, a headset, a head-mounted display (HMD)), an array of microphones, and a smart speaker. The microphones 43 may generate microphone signals containing sounds sensed by the microphones.
FIG. 3 illustrates an audio data file 50 according to one aspect. The raw data 51 of the microphone (e.g., the digitized microphone signal) may be stored in an audio data file 50. In one aspect, the audio data file 50 contains one or more impulse responses 63. Each impulse response of the metadata may be formed as a digital filter.
In one aspect, each impulse response may be associated with a sound location identifier 61 to indicate a location or direction (e.g., azimuth, or azimuth and elevation) in space of the acoustic impulse from which the associated impulse response is calculated. For example, the sound sources S1-S4 may be indices of sound positions at a distance or radius around the capture device. Although shown on a circular ring, the positions may also lie on a sphere. In one aspect, the total number of sound sources on a ring or sphere may range from fewer than ten to several thousand. The number of sound sources may be selected based on application-specific considerations (e.g., how much spatial resolution is required). The location of a sound source may be described by a direction (e.g., azimuth for a ring, and azimuth and elevation for a sphere) and a distance (e.g., radius) from a point designated as the center of the device. It should be understood that the sound source locations are not limited to locations on a ring or sphere; in one aspect, the location of a sound source may be described using any coordinate system (e.g., x, y, and z) that describes the position of sound relative to the device.
In one aspect, the metadata includes a microphone identifier 62 for each microphone of the capture device. Each impulse response may be associated with a microphone as well as a sound source. For example, one of the impulse responses may have a sound source identifier S1 and a microphone identifier "MIC 1," which describes the response of MIC 1 to the acoustic impulse from location S1. Another impulse response may have the same sound source identifier S1 but a microphone identifier of "MIC 2," which describes the response of MIC 2 to the acoustic impulse at location S1. In one aspect, the impulse responses (e.g., digital filters) may define responses to acoustic impulses between each sound source location supported and defined in the audio data file and each microphone of the capture device. Each impulse response may include characteristics of the electroacoustic transducer of the corresponding microphone.
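To illustrate only the indexing (identifiers follow FIG. 2 and FIG. 3; random arrays stand in for measured responses):

```python
import numpy as np

rng = np.random.default_rng(0)
IR_TAPS = 256  # taps per stored impulse response (illustrative)

# One impulse response per (sound source, microphone) pair:
# sources S1..S4 on a ring, microphones MIC1..MIC3.
impulse_responses = {
    (f"S{s}", f"MIC{m}"): rng.standard_normal(IR_TAPS)
    for s in range(1, 5)
    for m in range(1, 4)
}

h = impulse_responses[("S1", "MIC2")]  # response of MIC 2 to an impulse at S1
```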
For example, each of S1-S4 may have three impulse responses (MICs 1-3). Similarly, each of the T1-T6 sound sources may have three impulse responses (MICs 1-3). As the number of impulse responses increases, the spatial resolution of the audio file will increase, but the size of the file will also increase. Thus, the total number of impulse responses to include in the metadata of the audio file may be application specific and/or determined based on design tradeoffs.
In one aspect, the metadata includes a sound source location relative to the capture device. For example, each impulse response is associated with a sound source location identifier in the metadata, which represents the location of the acoustic impulse of the corresponding impulse response. The sound sources may be defined on a ring or sphere around the capture device, but this is not required. The metadata may include a distance or radius of the ring from the capture device. To illustrate, in fig. 2, S1-S4 may have the same radius or distance R1 from the capture device but be located at different positions on the ring. Other impulse responses for sound locations T1-T6 may have a radius or distance R2 from the capture device. In one aspect, the audio data file does not assume or require an ideal microphone configuration, such as a spherical array of microphones.
In one aspect, the audio data file 50 may include a geometric model (e.g., a three-dimensional "mesh" or CAD drawing) of the capture device and the location of a microphone disposed on the capture device. This may be further used by the playback device or decoder to process the raw audio (e.g., by generating a beamforming filter or spatial filter).
In one aspect, at least one of the one or more impulse responses is a near-field impulse response (e.g., a response to an impulse within 2 wavelengths of the corresponding microphone or capture device), and at least one of the impulse responses is a far-field impulse response (e.g., a response to an impulse more than 2 wavelengths away from the corresponding microphone or capture device). The playback device may use the near-field and far-field impulse responses to locate sound present in the original audio file (e.g., for voice activity detection).
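Because a 2-wavelength boundary depends on frequency, a playback device could compute it as below; the 343 m/s speed of sound is the usual room-temperature assumption:

```python
SPEED_OF_SOUND_M_S = 343.0

def near_field_limit_m(frequency_hz: float) -> float:
    """Distance of two wavelengths: sources closer than this are 'near field'."""
    return 2.0 * SPEED_OF_SOUND_M_S / frequency_hz

print(near_field_limit_m(1000.0))  # ~0.686 m at 1 kHz
```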
In one aspect, as described in other sections, the metadata may include device orientation. The orientation of the device, which describes how the capture device is rotated or tilted, may vary over time throughout the recording. For example, a mobile phone may be used to record sound. During recording, the user may hold the phone in different ways (e.g., flip the phone, rotate the phone, etc.). Thus, the device orientation may be time-varying and synchronized in time (e.g., frame-by-frame) with the captured microphone signals.
While one aspect of metadata is shown, it should be understood that the metadata may be arranged in a variety of ways to organize and index the impulse response of the sound source location relative to the microphone of the capture device.
In one aspect, the audio data file may include other features not shown in FIG. 3. For example, the audio data file may include noise characteristics and the dynamic range of the audio file. In one aspect, a sensitivity parameter indicative of a sensitivity of the microphone array is included in the audio data file. The decoding/playback device may determine an original sound pressure level of the audio recording based on the sensitivity parameter and the microphone signal. In one aspect, the microphone signal and metadata are transmitted or streamed over a network to another device, e.g., as a bitstream. In this case, the metadata may be associated with the microphone signal by a streaming audio data file or by other established means (e.g., by a communication protocol that associates the streaming metadata with the streaming microphone signal).
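The disclosure does not say how the sensitivity parameter is expressed. Assuming the common digital-microphone convention (the dBFS level produced by a 94 dB SPL, i.e. 1 Pa, tone), the original level could be recovered like this:

```python
import numpy as np

def original_spl_db(mic_signal: np.ndarray, sensitivity_dbfs: float) -> float:
    """Estimate the recorded sound pressure level.

    sensitivity_dbfs: digital level the microphone outputs for a 94 dB SPL
    (1 Pa) tone, e.g. -26 dBFS (an assumed convention, not from the disclosure).
    """
    rms = np.sqrt(np.mean(np.square(mic_signal)))  # full scale == 1.0
    return 94.0 + 20.0 * np.log10(rms) - sensitivity_dbfs
```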
Process for generating audio data files with metadata
Referring now to fig. 4, a process or method 80 that may be performed by a processor of, for example, a capture device, is described. At block 82, the process may include receiving one or more microphone signals generated by a plurality of microphones (e.g., two or more microphones) of a capture device. At block 84, the method may include storing the microphone signals, or raw audio data of the one or more microphone signals, in an electronic audio data file. The audio file may be stored in an electronic memory (e.g., RAM or ROM). At block 86, the method may include storing one or more impulse responses of the microphones of the capture device in metadata of the electronic audio data file, wherein each impulse response of the one or more impulse responses defines a response of one of the microphones to an acoustic impulse. It should be understood that, for all aspects of the present disclosure, the term "impulse response" is interchangeable with "transfer function" (or any data set convertible into an acoustic transfer function between a sound source and a microphone), where a transfer function may represent an impulse response in the frequency domain. For example, in one aspect, a process for generating an audio data file having metadata includes: receiving a plurality of microphone signals from a plurality of microphones of a capture device; and storing the microphone signals and metadata in the electronic audio data file, the metadata including one or more transfer functions of the microphones of the capture device, wherein the one or more transfer functions define a response of the microphones to an acoustic impulse. The impulse responses may be derived in a variety of ways, including but not limited to recording the microphone signal in response to an acoustic impulse generated at a defined location, or simulating the device acoustics and microphone response based on a physical model. For acoustic measurements, an anechoic chamber is typically used to reduce unwanted reflections. If the device is intended to be attached to another object during regular use while the microphone signals are recorded, the impulse response measurement or simulation may also include that object; for example, the impulse response or transfer function of a head-mounted device may be measured or simulated on a person, or on a model/mannequin representing anyone who will wear the device during regular use.
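The interchangeability of impulse response and transfer function is simply the (inverse) Fourier transform, as this minimal demonstration shows:

```python
import numpy as np

h = np.random.default_rng(2).standard_normal(512)  # impulse response (time domain)
H = np.fft.rfft(h)                                 # corresponding transfer function
h_back = np.fft.irfft(H, n=len(h))                 # ...and back again

assert np.allclose(h, h_back)
```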
FIG. 5 illustrates a block diagram of audio processing system hardware that may be used with any of the aspects in one aspect. The audio processing system may represent a general purpose computer system or a special purpose computer system. It is noted that while fig. 5 illustrates various components of an audio processing system that may be incorporated into headphones, a speaker system, a microphone array, and an entertainment system, this is merely one example of a particular implementation and is merely intended to illustrate the types of components that may be present in an audio processing system. FIG. 5 is not intended to represent any particular architecture or manner of interconnecting the components as such details are not germane to the various aspects described herein. It should also be understood that other types of audio processing systems having fewer components or more components than shown in fig. 5 may also be used. Thus, the processes described herein are not limited to use with the hardware and software of FIG. 5.
As shown in fig. 5, an audio processing system 150 (e.g., a laptop computer, desktop computer, mobile phone, smart phone, tablet computer, smart speaker, head-mounted display (HMD), headphone device, or infotainment system for an automobile or other vehicle) includes one or more buses 162 for interconnecting the various components of the system. One or more processors 152 are coupled to bus 162 as is known in the art. The one or more processors may be microprocessors or special purpose processors, systems on a chip (SOC), central processing units, graphics processing units, processors created through application specific integrated circuits (ASICs), or a combination thereof. The memory 151 may include read only memory (ROM), volatile and non-volatile memory, or a combination thereof, coupled to the bus using techniques known in the art. In one aspect, a camera 158 and/or a display 160 may be coupled to the bus.
The memory 151 may be connected to the bus and may include DRAM, a hard drive or flash memory, or a magnetic optical drive or magnetic memory, or an optical drive or other type of memory system that maintains data even after power is removed from the system. In one aspect, the processor 152 retrieves computer program instructions stored in a machine-readable storage medium (memory) and executes these instructions to perform the operations described herein.
Although not shown, audio hardware may be coupled to the one or more buses 162 in order to receive audio signals to be processed and output by the speakers 156. The audio hardware may include a digital-to-analog converter and/or an analog-to-digital converter. The audio hardware may also include an audio amplifier and filter. The audio hardware may also connect with a microphone 154 (e.g., a microphone array) to receive an audio signal (whether analog or digital), digitize it if necessary, and communicate the signal to the bus 162.
The communication module 164 may communicate with remote devices and networks. For example, the communication module 164 may communicate via known technologies such as Wi-Fi, 3G, 4G, 5G, Bluetooth, ZigBee, or other equivalent technologies. The communication module may include wired or wireless transmitters and receivers that may communicate (e.g., receive and transmit data) with networked devices such as a server (e.g., a cloud) and/or other devices such as remote speakers and remote microphones.
It should be appreciated that aspects disclosed herein may utilize memory that is remote from the system, such as a network storage device that is coupled to the audio processing system through a network interface, such as a modem or Ethernet interface. As is well known in the art, the buses 162 may be connected to each other through various bridges, controllers, and/or adapters. In one aspect, one or more network devices may be coupled to the bus 162. The one or more network devices may be wired network devices (e.g., Ethernet) or wireless network devices (e.g., Wi-Fi, Bluetooth). In some aspects, the various aspects described (e.g., simulating, analyzing, estimating, modeling, object detecting, etc.) may be performed by a networked server in communication with a capture device.
Various aspects described herein may be embodied, at least in part, in software. That is, the techniques may be implemented in an audio processing system in response to its processor executing sequences of instructions contained in a storage medium, such as a non-transitory machine-readable storage medium (such as DRAM or flash memory). In various aspects, hardwired circuitry may be used in combination with software instructions to implement techniques described herein. Thus, the techniques are not limited to any specific combination of hardware circuitry and software, nor to any particular source for the instructions executed by the audio processing system.
In this specification, certain terminology is used to describe features of the various aspects. For example, in some cases, the terms "module," "encoder," "processor," "renderer," "combiner," "synthesizer," "mixer," "locator," "spatializer," and "component" represent hardware and/or software configured to perform one or more processes or functions. For example, examples of "hardware" include, but are not limited to, an integrated circuit such as a processor (e.g., a digital signal processor, a microprocessor, an application specific integrated circuit, a microcontroller, etc.). Thus, as will be appreciated by one skilled in the art, different combinations of hardware and/or software may be implemented to perform the processes or functions described by the above terms. Of course, the hardware could alternatively be implemented as a finite state machine or even a combinational logic component. Examples of "software" include executable code in the form of an application, applet, routine or even a series of instructions. As described above, the software may be stored in any type of machine-readable medium.
Some portions of the preceding detailed descriptions have been presented in terms of algorithms and symbolic representations of operations on data bits within a computer memory. These algorithmic descriptions and representations are the means used by those skilled in the audio processing arts to most effectively convey the substance of their work to others skilled in the art. An algorithm is here, and generally, conceived to be a self-consistent sequence of operations leading to a desired result. The operations are those requiring physical manipulations of physical quantities. It should be borne in mind, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities. Unless specifically stated otherwise as apparent from the above discussion, it is appreciated that throughout the description, discussions utilizing terms such as those set forth in the claims below, refer to the action and processes of an audio processing system, or similar electronic device, that manipulates and transforms data represented as physical (electronic) quantities within the system's registers and memories into other data similarly represented as physical quantities within the system memories or registers or other such information storage, transmission or display devices.
The processes and blocks described herein are not limited to the specific examples described, and are not limited to the specific orders used as examples herein. Rather, any of the processing blocks may be reordered, combined, or removed, performed in parallel or in series, as needed, to achieve the results described above. The processing blocks associated with implementing an audio processing system may be executed by one or more programmable processors executing one or more computer programs stored on a non-transitory computer-readable storage medium to perform the functions of the system. All or part of the audio processing system may be implemented as special purpose logic circuitry, e.g., an FPGA (field programmable gate array) and/or an ASIC (application specific integrated circuit). All or part of the audio system may be implemented using electronic hardware circuitry that includes electronic devices such as, for example, at least one of a processor, a memory, a programmable logic device, or a logic gate. Additionally, the processes may be implemented in any combination of hardware devices and software components.
While certain aspects have been described and shown in the accompanying drawings, it is to be understood that such aspects are merely illustrative of and not restrictive on the broad invention, and that this invention not be limited to the specific constructions and arrangements shown and described, since various other modifications may occur to those ordinarily skilled in the art. For example, the features discussed with respect to fig. 3 may be combined in an audio file produced in conjunction with fig. 1 and 4. The description is thus to be regarded as illustrative instead of limiting.
To assist the patent office and any reader of any patent issued on this application in interpreting the appended claims, applicants wish to note that they do not intend any of the appended claims or claim elements to invoke 35 U.S.C. 112(f) unless the words "means for" or "step for" are explicitly used in the particular claim.
It is well known that the use of personally identifiable information should comply with privacy policies and practices that are recognized as meeting or exceeding industry or government requirements for maintaining user privacy. In particular, personally identifiable information data should be managed and processed to minimize the risk of inadvertent or unauthorized access or use, and the nature of authorized use should be explicitly stated to the user.

Claims (20)

1. A method for processing audio, comprising:
receiving a plurality of microphone signals from a plurality of microphones of a capture device, the plurality of microphones sensing a sound source having a sound source location; and
storing in an electronic audio data file:
the plurality of microphone signals, and
metadata including a plurality of impulse responses specific to the plurality of microphones of the capture device, wherein the plurality of impulse responses define responses of the plurality of microphones to acoustic impulses at respective locations relative to the plurality of microphones to locate the sound source location in the plurality of microphone signals.
2. The method of claim 1, wherein the plurality of impulse responses of the metadata are formed as digital filters.
3. The method of claim 1, wherein the metadata includes one or more sound location identifiers that associate one of the impulse responses with a location or direction relative to the capture device.
4. The method of claim 1, wherein the metadata comprises a microphone identifier for each microphone of the capture device, and each impulse response is associated with a microphone identifier.
5. The method of claim 1, wherein the metadata comprises a location of each of the acoustic impulses relative to the capture device.
6. The method of claim 1, wherein the metadata comprises a geometric model of the capture device and a location of the microphone disposed on the capture device.
7. The method of claim 1, wherein at least one of the plurality of impulse responses is a near-field impulse response.
8. The method of claim 1, wherein at least one of the plurality of impulse responses is a far-field impulse response.
9. The method of claim 1, wherein the metadata comprises a time-varying orientation of the capture device that is synchronized in time with the microphone signal.
10. The method of claim 1, further comprising:
generating a beamforming filter based on the plurality of impulse responses, and processing the microphone signals with the beamforming filter to generate one or more beamformed signals.
11. The method of claim 1, wherein the microphone signals are uncompressed and unfiltered.
12. The method of claim 1, further comprising compressing the metadata using a compression algorithm.
13. The method of claim 1, wherein the capture device is a tablet computer.
14. The method of claim 1, wherein the capture device is a smartphone.
15. The method of claim 1, wherein the capture device is a head-mounted device.
16. An electronic article, comprising:
a processor;
a plurality of microphones to sense a sound source having a sound source position to generate a plurality of microphone signals; and
a machine-readable medium having instructions stored therein, which when executed by the processor, cause the article of manufacture to:
transmitting a bitstream to another device, the bitstream comprising:
the plurality of microphone signals, and
metadata comprising a plurality of data sets specific to the plurality of microphones, wherein each data set in the plurality of data sets describes an acoustic transfer function associated with a response of one of the plurality of microphones to an acoustic impulse at a location relative to the plurality of microphones to locate the sound source location.
17. An electronic article according to claim 16 wherein the article has a camera.
18. An electronic article according to claim 16, wherein:
the metadata includes a sound location identifier representing a location disposed on a ring or sphere around the article, and
each of the data sets is associated with one of the sound location identifiers.
19. An electronic article according to claim 16 wherein each data set defines an impulse response of one of the plurality of microphones.
20. A machine readable medium having stored therein an electronic audio data file, the electronic audio data file comprising:
raw audio data of one or more microphone signals generated by two or more microphones of a capture device, the two or more microphones sensing a sound source having a sound source location; and
metadata including a transfer function specific to each of the two or more microphones of the capture device, wherein each transfer function is associated with a response of one of the two or more microphones to an acoustic impulse at a location relative to the two or more microphones of the capture device to locate the acoustic source location.
CN202010591439.1A 2019-06-28 2020-06-24 Spatial audio file format for storing capture metadata Active CN112153530B (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962868738P 2019-06-28 2019-06-28
US62/868,738 2019-06-28
US16/899,019 2020-06-11
US16/899,019 US11841899B2 (en) 2019-06-28 2020-06-11 Spatial audio file format for storing capture metadata

Publications (2)

Publication Number Publication Date
CN112153530A (en) 2020-12-29
CN112153530B (en) 2022-05-27

Family ID=73888725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010591439.1A Active CN112153530B (en) 2019-06-28 2020-06-24 Spatial audio file format for storing capture metadata

Country Status (1)

Country Link
CN (1) CN112153530B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023025294A1 * 2021-08-27 2023-03-02 Beijing Zitiao Network Technology Co., Ltd. Signal processing method and apparatus for audio rendering, and electronic device
CN114512152A * 2021-12-30 2022-05-17 Saiyin Xinwei (Beijing) Electronic Technology Co., Ltd. Method, device and equipment for generating a broadcast audio format file, and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105981412A * 2014-03-21 2016-09-28 Huawei Technologies Co., Ltd. Apparatus and method for estimating an overall mixing time based on at least a first pair of room impulse responses, as well as corresponding computer program
CN109791769A * 2016-09-28 2019-05-21 Nokia Technologies Oy Spatial audio signal format generation from a microphone array using adaptive capture

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8073125B2 (en) * 2007-09-25 2011-12-06 Microsoft Corporation Spatial audio conferencing
BR112013033386B1 (en) * 2011-07-01 2021-05-04 Dolby Laboratories Licensing Corporation system and method for adaptive audio signal generation, encoding, and rendering
US9426300B2 (en) * 2013-09-27 2016-08-23 Dolby Laboratories Licensing Corporation Matching reverberation in teleconferencing environments
US10136216B2 (en) * 2015-04-10 2018-11-20 Dolby Laboratories Licensing Corporation Action sound capture using subsurface microphones
GB2549532A (en) * 2016-04-22 2017-10-25 Nokia Technologies Oy Merging audio signals with spatial metadata


Also Published As

Publication number Publication date
CN112153530A (en) 2020-12-29

Similar Documents

Publication Publication Date Title
CN105432097B (en) Filtering with binaural room impulse responses with content analysis and weighting
RU2661775C2 (en) Transmission of audio rendering signal in bitstream
JP6329629B2 (en) Method and apparatus for compressing and decompressing sound field data in a region
CN112153530B (en) Spatial audio file format for storing capture metadata
US11538489B2 (en) Correlating scene-based audio data for psychoacoustic audio coding
TW201714169A (en) Conversion from channel-based audio to HOA
EP2917915B1 (en) Multi-resolution audio signals
US11841899B2 (en) Spatial audio file format for storing capture metadata
US11678111B1 (en) Deep-learning based beam forming synthesis for spatial audio
EP3987516B1 (en) Coding scaled spatial components
TW202107451A (en) Performing psychoacoustic audio coding based on operating conditions
KR102593235B1 (en) Quantization of spatial audio parameters
US11252525B2 (en) Compressing spatial acoustic transfer functions
US20200402523A1 (en) Psychoacoustic audio coding of ambisonic audio data
Fonseca et al. Measurement of car cabin binaural impulse responses and auralization via convolution
CN114128312B (en) Audio rendering for low frequency effects
WO2024114373A1 (en) Scene audio coding method and electronic device
WO2024114372A1 (en) Scene audio decoding method and electronic device
US20200402522A1 (en) Quantizing spatial components based on bit allocations determined for psychoacoustic audio coding
US11758348B1 (en) Auditory origin synthesis
US20240105196A1 (en) Method and System for Encoding Loudness Metadata of Audio Components
EP3547305B1 (en) Reverberation technique for audio 3d
KR20150005438A (en) Method and apparatus for processing audio signal
CN117750293A (en) Object audio coding
CN115938388A (en) Three-dimensional audio signal processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant