WO2014204999A2 - Generating surround sound field - Google Patents

Generating surround sound field

Info

Publication number
WO2014204999A2
Authority
WO
WIPO (PCT)
Prior art keywords
sound field
surround sound
topology
audio
capturing devices
Prior art date
Application number
PCT/US2014/042800
Other languages
French (fr)
Other versions
WO2014204999A3 (en)
Inventor
Xuejing Sun
Bin Cheng
Sen XU
Zhiwei Shuang
Jun Wang
Original Assignee
Dolby Laboratories Licensing Corporation
Priority date
Filing date
Publication date
Application filed by Dolby Laboratories Licensing Corporation
Priority to CN201480034420.XA (CN105340299B)
Priority to US14/899,505 (US9668080B2)
Priority to EP14736577.9A (EP3011763B1)
Priority to JP2015563133A (JP5990345B1)
Publication of WO2014204999A2
Publication of WO2014204999A3
Priority to HK16108833.6A (HK1220844A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/301 Automatic calibration of stereophonic sound system, e.g. with test microphone
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00 Monitoring arrangements; Testing arrangements
    • H04R29/001 Monitoring arrangements; Testing arrangements for loudspeakers
    • H04R29/002 Loudspeaker arrays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R29/00 Monitoring arrangements; Testing arrangements
    • H04R29/004 Monitoring arrangements; Testing arrangements for microphones
    • H04R29/005 Microphone arrays
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S3/00 Systems employing more than two channels, e.g. quadraphonic
    • H04S3/02 Systems employing more than two channels, e.g. quadraphonic of the matrix type, i.e. in which input signals are combined algebraically, e.g. after having been phase shifted with respect to each other
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2430/00 Signal processing covered by H04R, not provided for in its groups
    • H04R2430/20 Processing of the output signals of the acoustic transducers of an array for obtaining a desired directivity characteristic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/03 Aspects of down-mixing multi-channel audio to configurations with lower numbers of playback channels, e.g. 7.1 -> 5.1
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2400/00 Details of stereophonic systems covered by H04S but not provided for in its groups
    • H04S2400/15 Aspects of sound capture and related signal processing for recording or reproduction
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/01 Enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S2420/00 Techniques used stereophonic systems covered by H04S but not provided for in its groups
    • H04S2420/11 Application of ambisonics in stereophonic audio systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/308 Electronic adaptation dependent on speaker or headphone connection

Definitions

  • the present application relates to signal processing. More specifically, embodiments of the present invention relate to generating a surround sound field.
  • embodiments of the present invention propose a method, apparatus, and computer program product for generating the surround sound field.
  • embodiments of the present invention provide a method of generating a surround sound field.
  • the method comprises: receiving audio signals captured by a plurality of audio capturing devices; estimating a topology of the plurality of audio capturing devices; and generating the surround sound field from the received audio signals at least partially based on the estimated topology.
  • Embodiments in this aspect also include a corresponding computer program product comprising a computer program tangibly embodied on a machine readable medium for carrying out the method.
  • embodiments of the present invention provide an apparatus of generating a surround sound field.
  • the apparatus comprises: a receiving unit configured to receive audio signals captured by a plurality of audio capturing devices; a topology estimating unit configured to estimate a topology of the plurality of audio capturing devices; and a generating unit configured to generate the surround sound field from the received audio signals at least partially based on the estimated topology.
  • the surround sound field may be generated by use of an ad hoc network of audio capturing devices of end users, such as microphones equipped on mobile phones. As such, the need for expensive and complex professional equipment and/or human experts can be eliminated. Furthermore, by generating the surround sound field dynamically based on the estimated topology of the audio capturing devices, the quality of the surround sound field can be maintained at a higher level.
  • Figure 1 shows a block diagram illustrating a system in which example embodiments of the present invention can be implemented
  • Figures 2A-2C show schematic diagrams illustrating several examples of topologies of audio capturing devices in accordance with example embodiments of the present invention
  • Figure 3 shows a flowchart illustrating a method for generating a surround sound field in accordance with an example embodiment of the present invention
  • Figures 4A-4C show schematic diagrams illustrating polar patterns for W, X, and Y channels, respectively, in B-format processing for various frequencies when using an example mapping matrix;
  • Figures 5A-5C show schematic diagrams illustrating polar patterns for W, X, and Y channels, respectively, in B-format processing for various frequencies when using another example mapping matrix;
  • Figure 6 shows a block diagram illustrating an apparatus for generating a surround sound field in accordance with an example embodiment of the present invention
  • Figure 7 shows a block diagram illustrating a user terminal for implementing an example embodiment of the present invention.
  • Figure 8 shows a block diagram illustrating a system for implementing an example embodiment of the present invention.
  • embodiments of the present invention provide a method, apparatus, and computer program product for surround sound field generation.
  • the surround sound field may be effectively and accurately generated by use of an ad hoc network of audio capturing devices such as mobile phones of end users.
  • the system 100 includes a plurality of audio capturing devices 101 and a server 102.
  • the audio capturing devices 101 are capable of capturing, recording and/or processing audio signals.
  • the audio capturing devices 101 may include, but are not limited to, mobile phones, personal digital assistants (PDAs), laptops, tablet computers, personal computers (PCs) or any other suitable user terminals equipped with audio capturing functionality. For example, commercially available mobile phones are usually equipped with at least one microphone and therefore can be used as the audio capturing devices 101.
  • the audio capturing devices 101 may be arranged in one or more ad hoc networks or groups 103, each of which may include one or more audio capturing devices.
  • the audio capturing devices may be grouped according to a predetermined strategy or dynamically, which will be detailed below. Different groups can be located at the same or different physical locations. Within each group, the audio capturing devices are located in the same physical location, and may be positioned proximate to each other.
  • Figures 2A-2C show some examples of groups consisting of three audio capturing devices.
  • the audio capturing devices 101 may be mobile phones, PDAs or any other portable user terminals that are equipped with audio capturing elements 201, such as one or more microphones, to capture audio signals.
  • the audio capturing devices 101 are further equipped with video capturing elements 202 such as cameras, so that the audio capturing devices 101 may be configured to capture video and/or image while capturing audio signals.
  • the number of audio capturing devices within a group is not limited to three. Instead, any suitable number of audio capturing devices may be arranged as a group. Moreover, within a group, the plurality of audio capturing devices may be arranged in any desired topology. In some embodiments, the audio capturing devices within a group may communicate with each other by means of a computer network, Bluetooth, infrared, telecommunication, and the like, just to name a few.
  • the server 102 is communicatively connected with the groups of audio capturing devices 101 via network connections.
  • the audio capturing devices 101 and the server 102 may communicate with each other, for example, by a computer network such as a local area network ("LAN"), a wide area network ("WAN") or the Internet, a communication network, a near field communication connection, or any combination thereof.
  • the generation of surround sound field may be initiated either by an audio capturing device 101 or by the server 102.
  • an audio capturing device 101 may log into the server 102 and request the server 102 to generate a surround sound field.
  • the audio capturing device 101 sending the request will become a master device which then sends invitations to other capturing devices to join the audio capturing session.
  • the other audio capturing devices within this group receive the invitation from the master device and join the audio capturing session accordingly.
  • one or more other audio capturing devices may be dynamically identified and grouped with the master device. For example, if location services such as GPS (Global Positioning System) are available to the audio capturing devices 101, it is possible to automatically invite one or more audio capturing devices located in proximity to the master device to join the audio capturing group. Discovery and grouping of the audio capturing devices may also be performed by the server 102 in some alternative embodiments.
  • Upon forming a group of audio capturing devices, the server 102 sends a capturing command to all the audio capturing devices within the group.
  • the capturing command may be sent by one of the audio capturing devices 101 within the group, for example, by the master device.
  • Each audio capturing device in the group will start to capture and record audio signals immediately after receiving the capturing command.
  • the audio capturing session will finish when any audio capturing device stops the capturing.
  • the audio signals may be recorded locally on the audio capturing devices 101 and transmitted to the server 102 after the capturing session is completed.
  • the captured audio signals may be streamed to the server 102 in a real-time manner.
  • the audio signals captured by the audio capturing devices 101 of a single group are assigned the same group identification (ID), such that the server 102 is able to identify whether the incoming audio signals belong to the same group. Further, in addition to the audio signals, any information relevant to the audio capturing session may be transmitted to the server 102, including the number of audio capturing devices 101 within the group, parameters of one or more audio capturing devices 101, and the like.
  • Based on the audio signals captured by a plurality of capturing devices 101 of a group, the server 102 performs a series of operations to process the audio signals to generate a surround sound field. In this regard, Figure 3 shows a flowchart of a method for generating the surround sound field from the audio signals captured by the plurality of capturing devices 101.
  • the topology of these audio capturing devices is estimated at step S302. Estimating the topology of the audio capturing devices 101 within the group is important to the subsequent spatial processing, which has a direct impact on reproducing the sound field.
  • the topology of audio capturing devices may be estimated in various manners. For example, in some embodiments, the topology of audio capturing devices 101 may be predefined and thus known to the server 102. In this event, the server 102 may use the group ID to determine the group from which the audio signals are transmitted, and then retrieve the predefined topology associated with the determined group as the topology estimation.
  • the topology of audio capturing devices 101 may be estimated based on the distance between each pair of the plurality of audio capturing devices 101 within the group.
  • each audio capturing device 101 may be configured to play back a piece of audio simultaneously and to receive audio signals from the other devices within the group. That is, each audio capturing device 101 broadcasts a unique audio signal to the other members of the group.
  • each audio capturing device may play back a linear chirp signal spanning a unique frequency range and/or having any other specific acoustic features. By recording the time instants when the linear chirp signal is received, the distance between each pair of audio capturing devices 101 may be calculated by an acoustic ranging processing, which is known to those skilled in the art and thus will not be detailed here.
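The ranging procedure itself is not detailed in the text. As a rough sketch only, the following assumes a two-way scheme in which each device timestamps both its own chirp and its peer's chirp on its local clock, so that the unknown clock offsets cancel. The function name, the argument names, and the 343 m/s speed of sound are illustrative assumptions, and the speaker-to-microphone distance on each device is neglected.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value


def two_way_distance(r_aa, r_ba, r_bb, r_ab, c=SPEED_OF_SOUND):
    """Estimate the distance in metres between devices A and B.

    r_aa: time at which A hears its own chirp (A's clock, seconds)
    r_ba: time at which A hears B's chirp     (A's clock, seconds)
    r_bb: time at which B hears its own chirp (B's clock, seconds)
    r_ab: time at which B hears A's chirp     (B's clock, seconds)

    Each bracket is a difference of two timestamps taken on the same
    clock, so neither clock offset appears in the result.
    """
    return 0.5 * c * ((r_ba - r_aa) - (r_bb - r_ab))
```

Because only same-clock differences are used, this sketch does not require the devices to be synchronized beforehand.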
  • Such distance calculation may be performed at the server 102, for example.
  • if the audio capturing devices can communicate with each other directly, such distance calculation may instead be performed at the client side.
  • at the server 102, no additional processing is needed if there are only two audio capturing devices 101 in the group.
  • the multidimensional scaling (MDS) analysis or a similar process can be performed on the acquired distances to estimate the topology of the audio capturing devices.
  • MDS may be applied to generate the coordinates of the audio capturing devices 101 in a two-dimensional space. For example, given the measured distance matrix of a three-device group, the outputs of the two-dimensional (2D) MDS indicating the topology of the audio capturing devices 101 are M1 (0, -0.0441), M2 (-0.0750, 0.0220), and M3 (0.0750, 0.0220).
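As an illustration of the MDS step, here is a minimal sketch of classical (Torgerson) MDS in NumPy. The distance values below are an assumption: pairwise distances of 0.10, 0.10 and 0.15 (in metres) are chosen because they are approximately consistent with the coordinates quoted above. The function name is hypothetical, and the recovered coordinates are only defined up to rotation and reflection.

```python
import numpy as np


def classical_mds(dist, dims=2):
    """Recover centred coordinates from a symmetric matrix of pairwise distances."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering      # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(gram)             # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]         # keep the largest `dims`
    vals = np.clip(vals[order], 0.0, None)        # guard against tiny negatives
    return vecs[:, order] * np.sqrt(vals)


# Assumed example distances (metres) for a three-device group.
dist = np.array([[0.00, 0.10, 0.10],
                 [0.10, 0.00, 0.15],
                 [0.10, 0.15, 0.00]])
coords = classical_mds(dist)
```

Each row of `coords` is the estimated 2D position of one device, centred on the group's centroid.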
  • the scope of the present invention is not limited to the examples illustrated above. Any suitable manner capable of estimating distance between a pair of audio capturing devices, whether currently known or developed in the future, may be used in connection with embodiments of the present invention.
  • the audio capturing devices 101 may be configured to broadcast electrical and/or optical signals to each other to facilitate the distance estimation.
  • the method 300 proceeds to step S303, where the time alignment is performed on the audio signals received at step S301, such that the audio signals captured by different capturing devices 101 are temporally aligned with each other.
  • time alignment of the audio signals may be done in many possible manners.
  • the server 102 may implement a protocol-based clock synchronization process, for example one based on the Network Time Protocol (NTP).
  • each audio capturing device 101 may be configured to synchronize with an NTP server separately while performing audio capturing. It is not necessary to adjust the local clock. Instead, an offset between the local clock and the NTP server can be calculated and stored as metadata. The local time and its offset are sent to the server 102 together with the audio signals once the audio capturing is terminated. The server 102 then aligns the received audio signals based on such time information.
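The offset-based alignment described above can be sketched as follows. This is a simplified illustration under stated assumptions: each recording carries its local start time and a stored offset defined as reference time minus local time, all devices share one sample rate, and the function and parameter names are hypothetical.

```python
def align_by_clock_offset(recordings, sample_rate):
    """Trim recordings so that they all start at the same reference time.

    recordings: list of (samples, local_start_time, offset) tuples, where
    offset = reference_time - local_time (seconds), as stored in metadata.
    """
    starts = [local + offset for _, local, offset in recordings]
    latest = max(starts)  # the last device to start defines the common origin
    aligned = []
    for (samples, _, _), start in zip(recordings, starts):
        skip = int(round((latest - start) * sample_rate))
        aligned.append(samples[skip:])
    length = min(len(a) for a in aligned)
    return [a[:length] for a in aligned]
```

The local clocks are never adjusted; only the stored offsets are used, which matches the metadata-based approach described in the text.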
  • the time alignment at step S303 may be realized by a peer-to-peer clock synchronization process.
  • the audio capturing devices may communicate with each other on a peer-to-peer basis, for example, via protocols like Bluetooth or an infrared connection.
  • One of the audio capturing devices may be selected as the synchronization master and clock offsets of all the other capturing devices may be calculated relative to the synchronization master.
  • a series of cross-correlation coefficients between a pair of input signals, x(i) and y(i), may be calculated for a range of time lags, where $\bar{x}$ and $\bar{y}$ represent the means of x(i) and y(i), N represents the length of x(i) and y(i), and d represents the time lag between the two series.
  • the delay between the two signals may then be taken as the lag d at which the cross-correlation coefficient is largest.
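A minimal sketch of this delay estimate follows. It assumes the usual normalized form, in which the mean-removed products are summed over the overlapping samples and divided by the product of the two series' root sums of squares. The lag convention (a positive lag means that y is a delayed copy of x) and the function names are assumptions made for illustration.

```python
import numpy as np


def cross_correlation(x, y, max_lag):
    """Normalized cross-correlation of equal-length series for lags -max_lag..max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    y = np.asarray(y, dtype=float) - np.mean(y)
    n = len(x)
    denom = np.sqrt(np.sum(x ** 2) * np.sum(y ** 2))
    lags = np.arange(-max_lag, max_lag + 1)
    r = np.zeros(len(lags))
    for k, d in enumerate(lags):
        if d >= 0:
            num = np.dot(x[:n - d], y[d:])   # pairs x[i] with y[i + d]
        else:
            num = np.dot(x[-d:], y[:n + d])
        if denom > 0:
            r[k] = num / denom
    return lags, r


def estimate_delay(x, y, max_lag):
    """Return the lag (in samples) at which the correlation peaks."""
    lags, r = cross_correlation(x, y, max_lag)
    return int(lags[np.argmax(r)])
```

The cost grows linearly with `max_lag`, which is why a smaller search range makes the alignment cheaper and less prone to spurious peaks.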
  • although the time alignment can be realized by applying the cross-correlation process, this process can be time consuming and error prone if the search range is large.
  • the search range has to be fairly long in order to accommodate large network delay variations.
  • information on calibration signals issued by the audio capturing devices 101 may be collected and transmitted to the server 102 to be used to reduce the search range of the cross-correlation process.
  • the audio capturing devices 101 may broadcast an audio signal to the other members within the group upon start of the audio capture to thereby facilitate calculation of the distance between each pair of the audio capturing devices 101.
  • the broadcasted audio signals can also be used as calibration signals to reduce the time consumed by signal correlation.
  • $S_A$ is the time instant when device A issues a command to play the calibration signal;
  • $S_B$ is the time instant when device B issues a command to play the calibration signal;
  • $R_{AA}$ is the time instant when device A receives the signal transmitted by device A;
  • $R_{BA}$ is the time instant when device A receives the signal transmitted by device B;
  • $R_{BB}$ is the time instant when device B receives the signal transmitted by device B;
  • $R_{AB}$ is the time instant when device B receives the signal transmitted by device A.
  • One or more of these time instants may be recorded by the audio capturing devices 101 and transmitted to the server 102 for use in the cross-correlation process.
  • in one case, the acoustic propagation delay from device A to device B is smaller than the network delay difference, that is, $S_B - S_A > R_{AB} - S_A$. Accordingly, the time instants $R_{BA}$ and $R_{BB}$ can be used to start the cross-correlation based time alignment process. In other words, only audio signal samples after the time instants $R_{BA}$ and $R_{BB}$ would be included in the correlation calculation. In this way, the search range may be reduced, thereby improving the efficiency of the time alignment.
  • in another case, the network delay difference is smaller than the acoustic propagation delay difference. This could happen when the network has very low jitter or the two devices are placed farther apart, or both.
  • in this case, $S_B$ and $S_A$ can be used as the starting point for the cross-correlation process. Specifically, since audio signals after $S_B$ and $S_A$ would contain the calibration signals, $R_{BA}$ can be used as the starting point for correlation for device A, and $S_B + (R_{BA} - S_A)$ can be used as the starting point for correlation for device B.
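The two cases above can be transcribed into a small helper that picks where the correlation should start for each device. This is only a literal sketch of the rule as stated; the function name is hypothetical and all time instants are assumed to be expressed in seconds on comparable clocks.

```python
def correlation_start_points(s_a, s_b, r_ab, r_ba, r_bb):
    """Return (start_for_device_a, start_for_device_b) for the correlation search.

    s_a, s_b: instants at which A and B issue the play command
    r_ab:     instant at which B receives A's calibration signal
    r_ba:     instant at which A receives B's calibration signal
    r_bb:     instant at which B receives its own calibration signal
    """
    if s_b - s_a > r_ab - s_a:
        # Acoustic propagation is shorter than the network delay difference:
        # start after both devices have heard B's signal.
        return r_ba, r_bb
    # Otherwise anchor device B's start to its own play command.
    return r_ba, s_b + (r_ba - s_a)
```

Samples before the returned instants would simply be excluded from the cross-correlation, which narrows the lag range that has to be searched.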
  • the time alignment can be done in a three-step process. First, the coarse time synchronization may be performed between the audio capturing devices 101 and the server 102. Next, the calibration signals as discussed above may be used to refine the synchronization. Finally, cross-correlation analysis is applied to complete the time alignment of the audio signals.
  • the time alignment at step S303 is optional. For example, if the communication and/or device conditions are good enough, it is reasonably considered that all the audio capturing devices 101 receive the capturing command nearly at the same time and thus start the audio capturing simultaneously. Furthermore, it would be readily appreciated that in some applications where the quality of surround sound field is not very sensitive, a certain degree of misalignment of the starting time of audio capturing can be tolerated or ignored. In these situations, the time alignment at step S303 can be omitted.
  • step S302 is not necessarily performed prior to S303.
  • the time alignment of audio signals may be performed prior to or even in parallel with the topology estimation.
  • the clock synchronization process such as NTP synchronization or peer-to-peer synchronization can be performed before the topology estimation.
  • such clock synchronization process may be beneficial to acoustic ranging in topology estimation.
  • the surround sound field is generated from the received audio signals (possibly temporally aligned) at least partially based on the topology estimated at step S302.
  • a mode may be selected for processing the audio signals based on the number of the plurality of audio capturing devices. For example, if there are only two audio capturing devices 101 within the group, the two audio signals may be simply combined to generate a stereo output.
  • some post processing may be performed, including but not limited to stereo sound image widening, multi-channel upmixing, and so forth.
  • Ambisonics or B-format processing may be applied to generate the surround sound field.
  • the adaptive selection of processing mode is not necessarily needed. For example, even if there are only two audio capturing devices, the surround sound field may be generated by processing the captured audio signals by the B-format processing.
  • Ambisonics is known as a flexible spatial audio processing technique that provides sound field and source localization recoverability.
  • a 3D surround sound field is recorded as a four-channel signal, named B-format with W-X-Y-Z channels.
  • the W channel contains omnidirectional sound pressure information, while the remaining three channels, X, Y, and Z, represent sound velocity information measured along the three corresponding axes of a 3D Cartesian coordinate system.
  • an ideal B-format representation of the surround sound field is:
  • $a_n(f, r)$ represents the weight for the audio capturing devices, which can be defined as the product of user-defined weights and the gain of the audio capturing device at a particular frequency and angle:
  • 0.5 represents a cardioid polar pattern
  • 0.7 represents a subcardioid polar pattern
  • weights $W_n(f)$ for the respective captured audio signals will affect the quality of the generated surround sound field. Different weights $W_n(f)$ would generate different qualities of B-format signals. Weights for different audio signals may be represented as a mapping matrix. Considering the topology shown in Figure 2A as an example, the mapping matrix (W) from audio signals $M_1$, $M_2$, and $M_3$ to the W, X, and Y channels may be defined as follows:
  • the B-format signals are generated by using specially designed (often quite expensive) microphone arrays such as professional soundfield microphones.
  • the mapping matrix may be designed in advance and kept unchanged in operation.
  • the audio signals are captured by an ad hoc network of audio capturing devices which are possibly dynamically grouped with varied topology.
  • existing solutions may not be applicable to generating W, X, Y channels from such raw audio signals captured by user devices that are not specially designed and positioned. For example, assume that the group contains three audio capturing devices 101 having angles of $\pi/2$, $3\pi/4$, and $3\pi/2$ and the same distance of 4 cm to the center.
  • Figures 4A-4C show the polar patterns for W, X, and Y channels, respectively, for various frequencies when using the original mapping matrix as described above.
  • the outputs of X and Y channels are incorrect since they are no longer orthogonal to each other.
  • the W channel becomes problematic at frequencies as low as 1000 Hz. Therefore, it is desired that the mapping matrix be adapted flexibly in order to ensure the high quality of the generated surround sound field.
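To make the role of the mapping matrix concrete, the sketch below builds a purely illustrative 3-by-N matrix from device angles and applies it to the captured signals. The weights used here (an average for W, cosine and sine weights for X and Y) are a hypothetical choice made for demonstration, not the matrix from the text, and both function names are assumptions.

```python
import numpy as np


def illustrative_mapping_matrix(angles):
    """Build a hypothetical (3, N) matrix mapping N device signals to W, X, Y."""
    angles = np.asarray(angles, dtype=float)
    n = len(angles)
    w_row = np.full(n, 1.0 / n)            # W: plain average of all devices
    x_row = (2.0 / n) * np.cos(angles)     # X: front-back weighting
    y_row = (2.0 / n) * np.sin(angles)     # Y: left-right weighting
    return np.vstack([w_row, x_row, y_row])


def to_b_format(mapping, signals):
    """Apply the mapping to signals of shape (N, samples); returns (3, samples)."""
    return mapping @ np.asarray(signals, dtype=float)


# Evenly spaced devices versus the irregular layout discussed above.
regular = illustrative_mapping_matrix([np.pi / 2, 7 * np.pi / 6, 11 * np.pi / 6])
irregular = illustrative_mapping_matrix([np.pi / 2, 3 * np.pi / 4, 3 * np.pi / 2])
```

With these illustrative weights, the X and Y rows of `regular` are orthogonal while those of `irregular` are not, which gives a rough feel for why a matrix tuned for one layout may not carry over to another.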
  • the weights for respective audio signals may be dynamically adapted based on the topology of audio capturing devices as estimated at step S302. Still considering the above example topology where three audio capturing devices 101 have angles of $\pi/2$, $3\pi/4$, and $3\pi/2$ and the same distance of 4 cm to the center, if the mapping matrix is adapted according to this specific topology, better results can be achieved, as can be seen from Figures 5A-5C, which show the polar patterns for W, X, and Y channels, respectively, for various frequencies in this situation.
  • the server 102 may maintain a repository storing a set of predefined topology templates, each of which is corresponding to a pre-tuned mapping matrix.
  • the topology templates may be represented by the coordinates and/or position relationship of the audio capturing devices.
  • the template that matches the estimated topology may be determined. There are many ways to locate the matched topology template.
  • the Euclidean distance between the estimated coordinates of the audio capturing devices and the coordinates in the template is calculated.
  • the topology template with the minimum distance is determined as the matched template.
  • the pre-tuned mapping matrix corresponding to the determined matched topology template is selected for use in the generation of surround sound field in the form of B-format signals.
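The template lookup described above might be sketched as follows. The template names and coordinates are made up for illustration, and the sketch assumes that the estimated coordinates are already expressed in the same device order and the same coordinate frame as the templates; a practical system would likely need an additional alignment step, since MDS output is only defined up to rotation and reflection.

```python
import numpy as np


def match_template(estimated, templates):
    """Return (name, distance) of the template closest to the estimated topology.

    estimated: array-like of shape (num_devices, 2)
    templates: dict mapping a template name to coordinates of the same shape
    """
    est = np.asarray(estimated, dtype=float)
    best_name, best_dist = None, float("inf")
    for name, coords in templates.items():
        dist = np.linalg.norm(est - np.asarray(coords, dtype=float))
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name, best_dist


# Hypothetical template repository (coordinates in metres).
templates = {
    "triangle": [[0.0, -0.0441], [-0.075, 0.022], [0.075, 0.022]],
    "line": [[-0.1, 0.0], [0.0, 0.0], [0.1, 0.0]],
}
name, dist = match_template(
    [[0.001, -0.045], [-0.074, 0.021], [0.076, 0.023]], templates
)
```

The returned name would then be used as the key for looking up the corresponding pre-tuned mapping matrix.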
  • the weights for audio signals captured by respective devices can be selected further based on a frequency of those audio signals. Specifically, it is observed that for higher frequencies, spatial aliasing starts to appear due to the relatively large spacing between audio capturing devices.
  • the selection of mapping matrix in B-format processing may be done on the basis of audio frequency.
  • each topology template may correspond to at least two mapping matrices.
  • the frequency of the received audio signals is compared with a predefined threshold, and one of the mapping matrices corresponding to the determined topology template can be selected and used based on the comparison.
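One way to picture the frequency-dependent selection is to apply one mapping matrix to the frequency bins below the threshold and another to the bins above it. The sketch below does this with a single FFT over the whole signal; the function name, the single-threshold split, and the whole-signal transform are simplifying assumptions.

```python
import numpy as np


def frequency_dependent_b_format(signals, fs, threshold_hz, low_map, high_map):
    """Map device signals (N, samples) to B-format using two mapping matrices.

    low_map is applied to bins below threshold_hz, high_map to the rest.
    """
    signals = np.asarray(signals, dtype=float)
    low_map = np.asarray(low_map, dtype=float)
    high_map = np.asarray(high_map, dtype=float)
    spectra = np.fft.rfft(signals, axis=1)
    freqs = np.fft.rfftfreq(signals.shape[1], d=1.0 / fs)
    low = freqs < threshold_hz
    out = np.empty((low_map.shape[0], spectra.shape[1]), dtype=complex)
    out[:, low] = low_map @ spectra[:, low]
    out[:, ~low] = high_map @ spectra[:, ~low]
    return np.fft.irfft(out, n=signals.shape[1], axis=1)
```

A streaming implementation would more likely work frame by frame, but the selection logic per frequency bin would be the same.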
  • the B-format processing is applied to the received audio signals to thereby generate the surround sound field, as discussed above.
  • although the surround sound field is shown to be generated based on the topology estimation, the scope of the present invention is not limited in this regard.
  • the sound field may be generated directly from the cross-correlation process applied to the captured audio signals.
  • instead of estimating the topology of the audio capturing devices, it is possible to perform the cross-correlation process to achieve some time alignment of the audio signals and then generate the sound field by simply applying a fixed mapping matrix in B-format processing. In this way, the time delay differences for the dominant source among different channels may be essentially removed. As a result, the sensor distance of the array of audio capturing devices may be reduced, thereby creating a coincident array.
  • the method 300 proceeds to step S305 to estimate the direction of arrival (DOA) of the generated surround sound with respect to a rendering device. Then the surround sound field is rotated at step S306 at least partially based on the estimated DOA.
  • Rotating the generated surround sound field according to the DOA is mainly for the purpose of improving the spatial rendering of the surround sound field.
  • the DOA estimation may be performed using the multi-channel input for rotating the surround sound field according to the estimated angle $\theta$.
  • DOA algorithms like Generalized Cross Correlation with Phase Transform (GCC-PHAT), Steered Response Power-Phase Transform (SRP-PHAT), Multiple Signal Classification (MUSIC), or any other suitable DOA estimation algorithms can be used in connection with embodiments of the present invention.
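As one example of the algorithms named above, a compact GCC-PHAT delay estimator for a single pair of channels is sketched below, followed by a far-field conversion from delay to angle. The two-microphone far-field geometry, the 343 m/s speed of sound, and the function names are assumptions; the sketch is not tied to any particular embodiment.

```python
import numpy as np


def gcc_phat(sig, ref, fs, max_tau):
    """Estimate the delay (seconds) of sig relative to ref using GCC-PHAT."""
    n = len(sig) + len(ref)                       # zero-pad to avoid wrap-around
    cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cross /= np.abs(cross) + 1e-12                # phase transform weighting
    cc = np.fft.irfft(cross, n=n)
    max_shift = max(1, min(int(fs * max_tau), n // 2))
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(np.abs(cc))) - max_shift
    return shift / float(fs)


def doa_from_delay(tau, mic_distance, c=343.0):
    """Convert an inter-microphone delay to a far-field arrival angle (radians)."""
    return float(np.arcsin(np.clip(c * tau / mic_distance, -1.0, 1.0)))
```

`max_tau` would normally be set from the device spacing, since the physical delay cannot exceed the spacing divided by the speed of sound.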
  • in addition to the DOA, the sound field may be rotated further based on the energy of the generated sound field. In other words, it is possible to find the most dominant sound source both in terms of energy and duration. The goal is to find the best listening angle for a user in a sound field.
  • $\theta_n$ and $E_n$ represent the short-term estimated DOA and energy for frame n of the generated sound field, respectively, and the total number of frames is N for the entire generated sound. It is further assumed that the medial plane is at 0 degrees and the angle is measured counter-clockwise. Then a frame corresponds to a point ($\theta_n$, $E_n$) in polar coordinates.
  • the rotation angle $\theta'$ may be determined, for example, by maximizing the following objective function:
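For illustration only, the sketch below assumes one plausible objective, the energy-weighted sum of cos(θ_n − θ) over all frames, which is an assumption and not necessarily the objective intended here. Under that assumption the maximizing angle has a closed form: it is the direction of the energy-weighted sum of unit vectors. The sound field is then rotated about the vertical axis so that this direction faces the front. Function names are hypothetical.

```python
import numpy as np


def best_rotation_angle(doas, energies):
    """Angle (radians) maximizing sum_n E_n * cos(theta_n - theta)."""
    doas = np.asarray(doas, dtype=float)
    energies = np.asarray(energies, dtype=float)
    return float(np.arctan2(np.sum(energies * np.sin(doas)),
                            np.sum(energies * np.cos(doas))))


def rotate_b_format(w, x, y, angle):
    """Rotate a horizontal B-format field counter-clockwise by `angle` radians."""
    c, s = np.cos(angle), np.sin(angle)
    return w, c * x - s * y, s * x + c * y
```

Rotating by the negative of the returned angle brings the dominant direction to 0 degrees; W is unchanged because it is omnidirectional.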
  • at step S307, the generated sound field may be converted into any target format suitable for playback on a rendering device.
  • the surround sound field is generated as B-format signals. It would be readily appreciated that once a B-format signal is generated, W, X, Y channels may be converted to various formats suitable for spatial rendering. The decoding and reproduction of Ambisonics is dependent on the loudspeaker system used for spatial rendering.
  • the decoding from an Ambisonics signal to a set of loudspeaker signals is based on the assumption that, if the decoded loudspeaker signals are being played back, a "virtual" Ambisonics signal recorded at the geometric center of the loudspeaker array should be identical to the Ambisonics signal used for decoding.
  • $C \cdot L = B$
  • $L = \{L_1, L_2, \ldots, L_n\}^T$ represents the set of loudspeaker signals
  • C is known as a "re-encoding" matrix defined by the geometry of the loudspeaker array, i.e. the azimuth and elevation of each loudspeaker. For example, given a square loudspeaker array, where the loudspeakers are placed horizontally at azimuths of {45°, -45°, 135°, -135°} and elevations of {0°, 0°, 0°, 0°}, this defines C as:
  • the loudspeaker signals can be derived as:
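The matrix C and the decoding equation are not reproduced in this text. The following is a minimal sketch of the re-encoding idea for the horizontal W, X, Y channels. It assumes a W gain of 1/√2 and X, Y gains of cos and sin of the azimuth (a common first-order convention, not confirmed by the document), and it assumes the loudspeaker feeds are obtained with a pseudo-inverse of C. The names `reencoding_matrix` and `decode` are hypothetical.

```python
import numpy as np

def reencoding_matrix(azimuths_deg, w_gain=1.0 / np.sqrt(2.0)):
    """Build C (3 x n_speakers) mapping loudspeaker feeds to W, X, Y.

    Assumption: W uses a constant gain and X, Y use cos/sin of azimuth.
    """
    az = np.deg2rad(np.asarray(azimuths_deg, dtype=float))
    return np.vstack([np.full_like(az, w_gain), np.cos(az), np.sin(az)])

def decode(b_wxy, azimuths_deg):
    """Solve C @ L = B for the feeds L in the least-squares sense."""
    c = reencoding_matrix(azimuths_deg)
    return np.linalg.pinv(c) @ np.asarray(b_wxy, dtype=float)

# Square array from the example: azimuths 45, -45, 135, -135 degrees.
square = [45.0, -45.0, 135.0, -135.0]
b = np.array([[0.7, 0.2], [1.0, 0.1], [0.0, -0.3]])  # rows W, X, Y; 2 samples
feeds = decode(b, square)
```

Re-encoding the decoded feeds with C returns the original B-format signal, which is the consistency condition the text describes.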
  • binaural rendering in which audio is played back through a pair of earphones or headphones, may be desired since users are expected to listen to the audio files on mobile devices.
  • B-format to binaural conversion can be achieved approximately by summing loudspeaker array feeds that are each filtered by a head-related transfer function (HRTF) matching the loudspeaker position.
  • a directional sound source travels two distinct propagation paths to arrive at the left and right ears, respectively. This results in arrival-time and intensity differences between the two ear entrance signals, which are then exploited by the human auditory system to achieve localized hearing.
  • this effect can be well modeled by a pair of direction-dependent acoustic filters, referred to as the head-related transfer functions.
  • the ear entrance signals $S_{\text{left}}$ and $S_{\text{right}}$ can be modeled as: where $H_{\text{left},\varphi}$ and $H_{\text{right},\varphi}$ represent the HRTFs of direction $\varphi$.
  • the HRTFs of a given direction can be measured by using probe microphones inserted at a subject's (either a person or a dummy head) ears to pick up responses from an impulse, or a known stimulus, placed at the direction.
  • HRTF measurements can be used to synthesize virtual ear entrances signals from a monophonic source. By filtering this source with a pair of HRTFs corresponding to a certain direction and presenting the resulting left and right signals to a listener via headphones or earphones, a sound field with a virtual sound source spatialized at the desired direction can be simulated.
  • $H_{\text{left},n}$ represents the transfer function from the nth loudspeaker to the left ear
  • $H_{\text{right},n}$ represents the transfer function from the nth loudspeaker to the right ear.
  • n represents the total number of loudspeakers.
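The binaural equations are not reproduced in this text. A minimal time-domain sketch of the described idea, summing loudspeaker feeds that are each filtered by the HRTF pair for that loudspeaker position, could look as follows. The function name `binaural_from_feeds` and the tiny impulse responses are hypothetical placeholders, not measured HRTFs.

```python
import numpy as np

def binaural_from_feeds(feeds, hrir_left, hrir_right):
    """Sum loudspeaker feeds filtered by per-speaker head-related impulse responses.

    feeds:      array (n_speakers, n_samples)
    hrir_left:  array (n_speakers, n_taps), speaker-to-left-ear responses
    hrir_right: array (n_speakers, n_taps), speaker-to-right-ear responses
    """
    feeds = np.asarray(feeds, dtype=float)
    hrir_left = np.asarray(hrir_left, dtype=float)
    hrir_right = np.asarray(hrir_right, dtype=float)
    n_out = feeds.shape[1] + hrir_left.shape[1] - 1
    left = np.zeros(n_out)
    right = np.zeros(n_out)
    for feed, hl, hr in zip(feeds, hrir_left, hrir_right):
        left += np.convolve(feed, hl)    # filter feed toward the left ear
        right += np.convolve(feed, hr)   # filter feed toward the right ear
    return left, right

# Toy data: two speakers, placeholder impulse responses (assumed values).
toy_feeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_hl = [[1.0, 0.0], [0.5, 0.0]]
toy_hr = [[0.0, 1.0], [0.0, 0.5]]
left, right = binaural_from_feeds(toy_feeds, toy_hl, toy_hr)
```

In practice the impulse responses would come from an HRTF measurement set matched to the virtual loudspeaker positions.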
  • the server 102 may transmit such signals to the rendering device for display.
  • the rendering device and the audio capturing device may be co-located on the same physical terminal.
  • the method 300 ends after step S307.
  • Figure 6 shows a block diagram illustrating an apparatus for generating a surround sound field in accordance with an embodiment of the present invention.
  • the apparatus 600 may reside at the server 102 shown in Figure 1 or be otherwise associated with the server 102, and may be configured to perform the method 300 described above with reference to Figure 3.
  • the apparatus 600 comprises a receiving unit 601 configured to receive audio signals captured by a plurality of audio capturing devices.
  • the apparatus 600 also comprises a topology estimating unit 602 configured to estimate a topology of the plurality of audio capturing devices.
  • the apparatus 600 comprises a generating unit 603 configured to generate the surround sound field from the received audio signals at least partially based on the estimated topology.
  • the estimating unit 602 may comprise a distance acquiring unit configured to acquire a distance between each pair of the plurality of audio capturing devices; and an MDS unit configured to estimate the topology by performing a multidimensional scaling (MDS) analysis on the acquired distances.
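As an illustration of how pairwise distances can yield a topology, the sketch below applies classical (Torgerson) MDS with NumPy. The document does not state which MDS variant is used, so the choice of classical MDS is an assumption, and `classical_mds` is a hypothetical name.

```python
import numpy as np

def classical_mds(dist, dims=2):
    """Recover coordinates (up to rotation, reflection, and translation) from distances.

    dist: symmetric (n, n) matrix of pairwise distances between devices.
    """
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                  # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(b)               # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:dims]        # keep the largest `dims`
    scale = np.sqrt(np.clip(vals[order], 0.0, None))
    return vecs[:, order] * scale

# Three devices placed on an equilateral triangle with 1 m sides.
d_tri = np.ones((3, 3)) - np.eye(3)
coords = classical_mds(d_tri)
```

The recovered coordinates reproduce the input distances; the absolute orientation is arbitrary, so a separate reference would be needed to fix it.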
  • the generating unit 603 may comprise a mode selecting unit configured to select a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
  • the generating unit 603 may comprise a template determining unit configured to determine a topology template matching the estimated topology of the plurality of audio capturing devices; a weight selecting unit configured to select weights for the audio signals at least partially based on the determined topology template; and a signal processing unit configured to process the audio signals using the selected weights to generate the surround sound field.
  • the weight selecting unit may comprise a unit configured to select the weights based on the determined topology template and frequencies of the audio signals.
  • the apparatus 600 may further comprise a time aligning unit 604 configured to perform a time alignment on the audio signals.
  • the time aligning unit 604 is configured to apply at least one of a protocol-based clock synchronization process, a peer-to-peer clock synchronization process, and a cross-correlation process.
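Of the three alignment options, the cross-correlation process is the easiest to sketch. The example below estimates the lag between two recordings from the peak of their cross-correlation and trims or pads one of them accordingly. The function names and the optional `max_lag` search bound are hypothetical; the bound is only meant to illustrate restricting the search range.

```python
import numpy as np

def estimate_lag(ref, sig, max_lag=None):
    """Return the lag in samples by which `sig` trails `ref` (negative if it leads)."""
    ref = np.asarray(ref, dtype=float)
    sig = np.asarray(sig, dtype=float)
    corr = np.correlate(sig, ref, mode="full")
    lags = np.arange(-(len(ref) - 1), len(sig))
    if max_lag is not None:
        keep = np.abs(lags) <= max_lag           # restrict the search range
        corr, lags = corr[keep], lags[keep]
    return int(lags[np.argmax(corr)])

def align(ref, sig, max_lag=None):
    """Shift `sig` so that it lines up with `ref`."""
    lag = estimate_lag(ref, sig, max_lag)
    sig = np.asarray(sig, dtype=float)
    if lag > 0:
        return sig[lag:]                         # drop the leading delay
    return np.concatenate([np.zeros(-lag), sig]) # pad if sig started early

rng = np.random.default_rng(0)
reference = rng.standard_normal(256)
delayed = np.concatenate([np.zeros(7), reference])  # same signal, 7 samples late
```

For long recordings a frequency-domain correlation would be faster, but the direct form keeps the sketch short.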
  • the apparatus 600 may further comprise a DOA estimating unit 605 configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and a rotating unit 606 configured to rotate the generated surround sound field at least partially based on the estimated DOA.
  • the rotating unit may comprise a unit configured to rotate the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
  • the apparatus 600 may further comprise a converting unit 607 configured to convert the generated surround sound field into a target format for playback on a rendering device.
  • the B-format signals may be converted into binaural signals or 5.1-channel surround sound signals.
  • FIG. 7 is a block diagram illustrating a user terminal 700 for implementing example embodiments of the present invention.
  • the user terminal 700 may operate as the audio capturing device 101 as discussed herein.
  • the user terminal 700 may be embodied as a mobile phone. It should be understood, however, that a mobile phone is merely illustrative of one type of apparatus that would benefit from embodiments of the present invention and, therefore, should not be taken to limit the scope of embodiments of the present invention.
  • the user terminal 700 includes an antenna(s) 712 in operable communication with a transmitter 714 and a receiver 716.
  • the user terminal 700 further includes at least one processor or controller 720.
  • the controller 720 may be comprised of a digital signal processor, a microprocessor, and various analog to digital converters, digital to analog converters, and other support circuits. Control and information processing functions of the user terminal 700 are allocated between these devices according to their respective capabilities.
  • the user terminal 700 also comprises a user interface including output devices such as a ringer 722, an earphone or speaker 724, one or more microphones 726 for audio capturing, a display 728, and user input devices such as a keyboard 730, a joystick or other user input interface, all of which are coupled to the controller 720.
  • the user terminal 700 further includes a battery 734, such as a vibrating battery pack, for powering various circuits that are required to operate the user terminal 700, as well as optionally providing mechanical vibration as a detectable output.
  • the user terminal 700 includes a media capturing element, such as a camera, video and/or audio module, in communication with the controller 720.
  • the media capturing element may be any means for capturing an image, video and/or audio for storage, display or transmission.
  • the camera module 736 may include a digital camera capable of forming a digital image file from a captured image.
  • the user terminal 700 may further include a universal identity module (UIM) 738.
  • the UIM 738 is typically a memory device having a processor built in.
  • the UIM 738 may include, for example, a subscriber identity module (SIM), a universal integrated circuit card (UICC), a universal subscriber identity module (USIM), a removable user identity module (R-UIM), etc.
  • the UIM 738 typically stores information elements related to a subscriber.
  • the user terminal 700 may be equipped with at least one memory.
  • the user terminal 700 may include volatile memory 740, such as volatile Random Access Memory (RAM) including a cache area for the temporary storage of data.
  • the user terminal 700 may also include other non-volatile memory 742, which can be embedded and/or may be removable.
  • non-volatile memory 742 can additionally or alternatively comprise an EEPROM, flash memory or the like.
  • the memories can store any of a number of pieces of information, programs, and data used by the user terminal 700 to implement the functions of the user terminal 700.
  • FIG. 8 is a block diagram illustrating an example computer system 800 for implementing embodiments of the present invention.
  • the computer system 800 may function as the server 102 as described above.
  • a central processing unit (CPU) 801 performs various processes in accordance with a program stored in a read only memory (ROM) 802 or a program loaded from a storage section 808 to a random access memory (RAM) 803.
  • data required when the CPU 801 performs the various processes or the like is also stored as required.
  • the CPU 801, the ROM 802 and the RAM 803 are connected to one another via a bus 804.
  • An input/output (I/O) interface 805 is also connected to the bus 804.
  • the following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, or the like; an output section 807 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a loudspeaker or the like; the storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like.
  • the communication section 809 performs a communication process via the network such as the internet.
  • a drive 810 is also connected to the I/O interface 805 as required.
  • a removable medium 811 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 810 as required, so that a computer program read therefrom is installed into the storage section 808 as required.
  • the program that constitutes the software is installed from the network such as the internet or the storage medium such as the removable medium 811.
  • various example embodiments of the present invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While various aspects of the example embodiments of the present invention are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the apparatus 600 described above may be implemented as hardware, software/firmware, or any combination thereof.
  • one or more units in the apparatus 600 may be implemented as software modules.
  • some or all of the units may be implemented using hardware modules like integrated circuits (ICs), application specific integrated circuits (ASICs), systems-on-chip (SoCs), field programmable gate arrays (FPGAs), and the like.
  • various blocks shown in Figure 3 may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s).
  • embodiments of the present invention include a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the method 300 as detailed above.
  • a machine readable medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the machine readable medium may be a machine readable signal medium or a machine readable storage medium.
  • a machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing.
  • More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • Computer program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server.
  • the present invention may be embodied in any of the forms described herein.
  • EEEs enumerated example embodiments
  • EEE 1 A method of generating a surround sound field, the method comprising: receiving audio signals captured by a plurality of audio capturing devices; performing a time alignment of the received audio signals by applying a cross-correlation process on the received audio signals; and generating the surround sound field from the time aligned audio signals.
  • EEE 2 The method according to EEE 1, further comprising: receiving information on calibration signals issued by the plurality of audio capturing devices; and reducing a search range of the cross-correlation process based on the received information on the calibration signals.
  • EEE 3 The method according to any of the preceding EEEs, wherein generating the surround sound field comprises: generating the surround sound field based on a predefined topology estimation of the plurality of audio capturing devices.
  • EEE 4 The method according to any of the preceding EEEs, wherein generating the surround sound field comprises: selecting a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
  • EEE 5 The method according to any of the preceding EEEs, further comprising: estimating a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and rotating the generated surround sound field at least partially based on the estimated DOA.
  • EEE 6 The method according to EEE 5, wherein rotating the generated surround sound field comprises: rotating the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
  • EEE 7 The method according to any of the preceding EEEs, further comprising: converting the generated surround sound field into a target format for playback on a rendering device.
  • EEE 8 An apparatus of generating a surround sound field, the apparatus comprising: a first receiving unit configured to receive audio signals captured by a plurality of audio capturing devices; a time aligning unit configured to perform a time alignment of the received audio signals by applying a cross-correlation process on the received audio signals; and a generating unit configured to generate the surround sound field from the time aligned audio signals.
  • EEE 9 The apparatus according to EEE 8, further comprising: a second receiving unit configured to receive information on calibration signals issued by the plurality of audio capturing devices; and a reducing unit configured to reduce a search range of the cross-correlation process based on the information on the calibration signals.
  • EEE 10 The apparatus according to any of EEEs 8 to 9, wherein the generating unit comprises: a unit configured to generate the surround sound field based on a predefined estimation of topology of the plurality of audio capturing devices.
  • EEE 11 The apparatus according to any of EEEs 8 to 10, wherein the generating unit comprises: a mode selecting unit configured to select a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
  • EEE 12 The apparatus according to any of EEEs 8 to 11, further comprising: a DOA estimating unit configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and a rotating unit configured to rotate the generated surround sound field at least partially based on the estimated DOA.
  • EEE 13 The apparatus according to EEE 12, wherein the rotating unit comprises: a unit configured to rotate the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
  • EEE 14 The apparatus according to any of EEEs 8 to 13, further comprising: a converting unit configured to convert the generated surround sound field into a target format for playback on a rendering device.


Abstract

Embodiments of the present invention relate to adaptive audio content generation. Specifically, a method for generating adaptive audio content is provided. The method comprises extracting at least one audio object from channel-based source audio content, and generating the adaptive audio content at least partially based on the at least one audio object. Corresponding system and computer program product are also disclosed.

Description

GENERATING SURROUND SOUND FIELD
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims the benefit of priority to Chinese Patent Application No. 201310246729.2 filed on 18 June 2013 and United States Provisional Patent Application No. 61/839,474 filed on 26 June 2013, both of which are hereby incorporated by reference in their entirety.
TECHNOLOGY
[0002] The present application relates to signal processing. More specifically, embodiments of the present invention relate to generating a surround sound field.
BACKGROUND
[0003] Traditionally, a surround sound field is created either by means of dedicated surround sound recording equipment, or by professional sound mixing engineers or software applications that pan sound sources to different channels. Neither of these two approaches is easily accessible to end users. In the past decades, increasingly ubiquitous mobile devices, such as mobile phones, tablets, media players, and game consoles, have been equipped with audio capturing and/or processing functionalities. However, most such mobile devices are only used for mono audio capture.
[0004] Several approaches have been proposed for surround sound field creation using mobile devices. However, those approaches either strictly rely on access points or fail to take into consideration the nature of commonly-used, non-professional mobile devices. For example, in creating a surround sound field using an ad hoc network of heterogeneous user devices, the recording times of different mobile devices might not be synchronized, and the locations and topology of the mobile devices might be unknown. Moreover, the gains and frequency responses of the audio capturing devices may be different. As a result, at present, it is not possible to generate a surround sound field effectively and efficiently using the audio capturing devices of everyday users.
[0005] In view of the foregoing, there is a need in the art for a solution capable of generating the surround sound field in an effective and efficient manner.
SUMMARY
[0006] In order to address the foregoing and other potential problems, embodiments of the present invention propose a method, apparatus, and computer program product for generating the surround sound field.
[0007] In one aspect, embodiments of the present invention provide a method of generating a surround sound field. The method comprises: receiving audio signals captured by a plurality of audio capturing devices; estimating a topology of the plurality of audio capturing devices; and generating the surround sound field from the received audio signals at least partially based on the estimated topology. Embodiments in this aspect also include a corresponding computer program product comprising a computer program tangibly embodied on a machine readable medium for carrying out the method.
[0008] In another aspect, embodiments of the present invention provide an apparatus of generating a surround sound field. The apparatus comprises: a receiving unit configured to receive audio signals captured by a plurality of audio capturing devices; a topology estimating unit configured to estimate a topology of the plurality of audio capturing devices; and a generating unit configured to generate the surround sound field from the received audio signals at least partially based on the estimated topology.
[0009] These embodiments of the present invention can be implemented to realize one or more of the following advantages. In accordance with embodiments of the present invention, the surround sound field may be generated by use of an ad hoc network of audio capturing devices of end users, such as microphones equipped on mobile phones. As such, the need for expensive and complex professional equipment and/or human experts can be eliminated. Furthermore, by generating the surround sound field dynamically based on the estimation of topology of the audio capturing devices, the quality of the surround sound field can be maintained at a higher level.
[0010] Other features and advantages of embodiments of the present invention will also be understood from the following description of example embodiments when read in conjunction with the accompanying drawings, which illustrate, by way of example, spirit and principles of the present invention.
DESCRIPTION OF DRAWINGS
[0011] The details of one or more embodiments of the present invention are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the invention will become apparent from the description, the drawings, and the claims, wherein:
[0012] Figure 1 shows a block diagram illustrating a system in which example embodiments of the present invention can be implemented;
[0013] Figures 2A-2C show schematic diagrams illustrating several examples of topologies of audio capturing devices in accordance with example embodiments of the present invention;
[0014] Figure 3 shows a flowchart illustrating a method for generating a surround sound field in accordance with an example embodiment of the present invention;
[0015] Figures 4A-4C show schematic diagrams illustrating polar patterns for W, X, and Y channels, respectively, in B-format processing for various frequencies when using an example mapping matrix;
[0016] Figures 5A-5C show schematic diagrams illustrating polar patterns for W, X, and Y channels, respectively, in B-format processing for various frequencies when using another example mapping matrix;
[0017] Figure 6 shows a block diagram illustrating an apparatus for generating a surround sound field in accordance with an example embodiment of the present invention;
[0018] Figure 7 shows a block diagram illustrating a user terminal for implementing an example embodiment of the present invention; and
[0019] Figure 8 shows a block diagram illustrating a system for implementing an example embodiment of the present invention.
[0020] Throughout the figures, the same or similar reference numbers indicate the same or similar elements.
DESCRIPTION OF EXAMPLE EMBODIMENTS
[0021] In general, embodiments of the present invention provide a method, apparatus, and computer program product for surround sound field generation. In accordance with embodiments of the present invention, the surround sound field may be effectively and accurately generated by use of an ad hoc network of audio capturing devices such as mobile phones of end users. Some embodiments of the present invention will be detailed below.
[0022] Reference is first made to Figure 1, where a block diagram illustrating a system 100 in which embodiments of the present invention can be implemented is shown. In Figure 1, the system 100 includes a plurality of audio capturing devices 101 and a server 102. In accordance with embodiments of the present invention, the audio capturing devices 101, among other things, are capable of capturing, recording and/or processing audio signals. Examples of the audio capturing devices 101 may include, but are not limited to, mobile phones, personal digital assistants (PDAs), laptops, tablet computers, personal computers (PCs) or any other suitable user terminals equipped with audio capturing functionality. For example, commercially available mobile phones are usually equipped with at least one microphone and therefore can be used as the audio capturing devices 101.
[0023] In accordance with embodiments of the present invention, the audio capturing devices 101 may be arranged in one or more ad hoc networks or groups 103, each of which may include one or more audio capturing devices. The audio capturing devices may be grouped according to a predetermined strategy or dynamically, which will be detailed below. Different groups can be located at same or different physical locations. Within each group, the audio capturing devices are located in the same physical location, and may be positioned proximate to each other.
[0024] Figures 2A-2C show some examples of groups consisting of three audio capturing devices. In the example embodiments shown in Figures 2A-2C, the audio capturing devices 101 may be mobile phones, PDAs or any other portable user terminals that are equipped with audio capturing elements 201, such as one or more microphones, to capture audio signals. Specifically, in the example embodiment shown in Figure 2C, the audio capturing devices 101 are further equipped with video capturing elements 202 such as cameras, so that the audio capturing devices 101 may be configured to capture video and/or image while capturing audio signals.
[0025] It should be noted that the number of audio capturing devices within a group is not limited to three. Instead, any suitable number of audio capturing devices may be arranged as a group. Moreover, within a group, the plurality of audio capturing devices may be arranged in any desired topology. In some embodiments, the audio capturing devices within a group may communicate with each other by means of a computer network, Bluetooth, infrared, telecommunication, and the like, just to name a few.
[0026] With continued reference to Figure 1, the server 102 is communicatively connected with the groups of audio capturing devices 101 via network connections. The audio capturing devices 101 and the server 102 may communicate with each other, for example, by a computer network such as a local area network ("LAN"), a wide area network ("WAN") or the Internet, a communication network, a near field communication connection, or any combination thereof. The scope of the present invention is not limited in this regard.
[0027] In operation, the generation of the surround sound field may be initiated either by an audio capturing device 101 or by the server 102. Specifically, in some embodiments, an audio capturing device 101 may log into the server 102 and request the server 102 to generate a surround sound field. The audio capturing device 101 sending the request then becomes a master device, which sends invitations to other capturing devices to join the audio capturing session. In this regard, there may be a predefined group to which the master device belongs. In these embodiments, the other audio capturing devices within this group receive the invitation from the master device and join the audio capturing session accordingly. Alternatively or additionally, one or more other audio capturing devices may be dynamically identified and grouped with the master device. For example, where location services such as GPS (Global Positioning System) are available to the audio capturing devices 101, it is possible to automatically invite one or more audio capturing devices located in proximity to the master device to join the audio capturing group. Discovery and grouping of the audio capturing devices may also be performed by the server 102 in some alternative embodiments.
[0028] Upon forming a group of audio capturing devices, the server 102 sends a capturing command to all the audio capturing devices within the group. Alternatively, the capturing command may be sent by one of the audio capturing devices 101 within the group, for example, by the master device. Each audio capturing device in the group will start to capture and record audio signals immediately after receiving the capturing command. The audio capturing session will finish when any audio capturing device stops the capturing. During audio capture, the audio signals may be recorded locally on the audio capturing devices 101 and transmitted to the server 102 after the capturing session is completed. Alternatively, the captured audio signals may be streamed to the server 102 in a real-time manner.
[0029] In accordance with embodiments of the present invention, the audio signals captured by the audio capturing devices 101 of a single group are assigned with the same group identification (ID), such that the server 102 is able to identify whether the incoming audio signals belong to the same group. Further, in addition to the audio signals, any information relevant to the audio capturing session may be transmitted to the server 102, including the number of audio capturing devices 101 within the group, parameters of one or more audio capturing devices 101, and the like. [0030] Based on the audio signals captured by a plurality of capturing devices 101 of a group, the server 102 performs a series of operations to process the audio signals to generate a surround sound field. In this regard, Figure 3 shows a flowchart of a method for generating the surround sound field from the audio signals captured by the plurality of capturing devices 101.
[0031] As shown in Figure 3, upon receipt of the audio signals captured by a group of audio capturing devices 101 at step S301, the topology of these audio capturing devices is estimated at step S302. Estimating the topology of positions of audio capturing devices 101 within the group is important to the subsequent spatial processing, which has a direct impact on reproducing the sound field. In accordance with embodiments of the present invention, the topology of audio capturing devices may be estimated in various manners. For example, in some embodiments, the topology of audio capturing devices 101 may be predefined and thus known to the server 102. In this event, the server 102 may use the group ID to determine the group from which the audio signals are transmitted, and then retrieve the predefined topology associated with the determined group as the topology estimation.
[0032] Alternatively or additionally, the topology of audio capturing devices 101 may be estimated based on the distance between each pair of the plurality of audio capturing devices 101 within the group. There are many possible manners of acquiring the distance between a pair of audio capturing devices 101. For example, in those embodiments where the audio capturing devices are capable of playing back audio, each audio capturing device 101 may be configured to play back a piece of audio simultaneously and to receive audio signals from the other devices within the group. That is, each audio capturing device 101 broadcasts a unique audio signal to the other members of the group. As an example, each audio capturing device may play back a linear chirp signal spanning a unique frequency range and/or having any other specific acoustic features. By recording the time instants when the linear chirp signal is received, the distance between each pair of audio capturing devices 101 may be calculated by acoustic ranging processing, which is known to those skilled in the art and thus will not be detailed here.
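The text leaves the acoustic ranging processing to the skilled reader. As a minimal sketch only, the following assumes one common two-way scheme in which each device timestamps, on its own clock, the arrival of its own chirp and of the other device's chirp, so that unknown clock offsets cancel. The function name, the argument names, and the speed-of-sound constant are illustrative assumptions and do not come from the specification.

```python
SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def two_way_distance(a_hears_a, a_hears_b, b_hears_a, b_hears_b,
                     self_dist_a=0.0, self_dist_b=0.0):
    """Estimate the distance (m) between devices A and B.

    a_hears_a / a_hears_b: arrival times of A's and B's chirps on A's clock.
    b_hears_a / b_hears_b: arrival times of A's and B's chirps on B's clock.
    self_dist_*: speaker-to-microphone spacing on each device (assumed known).
    """
    elapsed_a = a_hears_b - a_hears_a  # measured on A's clock only
    elapsed_b = b_hears_b - b_hears_a  # measured on B's clock only
    # Each difference uses a single clock, so clock offsets drop out.
    return (0.5 * SPEED_OF_SOUND * (elapsed_a - elapsed_b)
            + 0.5 * (self_dist_a + self_dist_b))
```

Because each elapsed interval is taken on one device's clock, this sketch needs no clock synchronization between the devices; it does assume both clocks tick at the same rate.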
[0033] Such distance calculation may be performed at the server 102, for example. Alternatively, if the audio capturing devices may communicate with each other directly, such distance calculation may be performed at the client side. At the server 102, no additional processing is needed if there are only two audio capturing devices 101 in the group. When there are more than two audio capturing devices 101, in some embodiments, the multidimensional scaling (MDS) analysis or a similar process can be performed on the acquired distances to estimate the topology of the audio capturing devices. Specifically, with an input matrix indicating the distances of pairs of audio capturing devices 101, MDS may be applied to generate the coordinates of the audio capturing devices 101 in a two-dimensional space. For example, assume that the measured distance matrix in a three-device group is
$$
\begin{pmatrix}
0 & 0.1 & 0.1 \\
0.1 & 0 & 0.15 \\
0.1 & 0.15 & 0
\end{pmatrix}
$$
Then the outputs of the two-dimensional (2D) MDS indicating the topology of the audio capturing devices 101 are M1 (0, -0.0441), M2 (-0.0750, 0.0220), and M3 (0.0750, 0.0220).
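The MDS step can be illustrated with classical (Torgerson) MDS, which is one variant among several and is assumed here rather than stated in the text. The sketch double-centers the squared distances and extracts the two leading eigenvectors by power iteration, using only the standard library. The function name and iteration count are illustrative. The recovered coordinates are defined only up to rotation and reflection, so signs may differ from the values quoted above.

```python
import math


def classical_mds_2d(dist, iters=500):
    """Return 2D coordinates whose pairwise distances approximate `dist`."""
    n = len(dist)
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (Gram matrix of centered points).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Deterministic, normalized start vector for power iteration.
        v = [math.sin(i + 1.0 + axis) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in v))
        v = [c / norm for c in v]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-15:
                break
            v = [c / norm for c in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
        scale = math.sqrt(max(lam, 0.0))
        for i in range(n):
            coords[i][axis] = scale * v[i]
        # Deflate so the next pass finds the next eigenvector.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)] for i in range(n)]
    return coords


distances = [[0.0, 0.1, 0.1],
             [0.1, 0.0, 0.15],
             [0.1, 0.15, 0.0]]
positions = classical_mds_2d(distances)
```

For the three-device matrix above, the magnitudes of the recovered coordinates agree with the quoted M1, M2, and M3 values.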
[0034] It should be noted that the scope of the present invention is not limited to the examples illustrated above. Any suitable manner capable of estimating distance between a pair of audio capturing devices, whether currently known or developed in the future, may be used in connection with embodiments of the present invention. For example, instead of playing back audio signals, the audio capturing devices 101 may be configured to broadcast electrical and/or optical signals to each other to facilitate the distance estimation.
[0035] Next, the method 300 proceeds to step S303, where the time alignment is performed on the audio signals received at step S301, such that the audio signals captured by different capturing devices 101 are temporally aligned with each other. In accordance with embodiments of the present invention, time alignment of the audio signals may be done in many possible manners. In some embodiments, the server 102 may implement a protocol based clock synchronization process. For example, the Network Time Protocol (NTP) provides accurate and synchronized time across the Internet. When connecting to the internet, each audio capturing device 101 may be configured to synchronize with an NTP server separately while performing audio capturing. It is not necessary to adjust the local clock. Instead, an offset between the local clock and the NTP server can be calculated and stored as metadata. The local time and its offset are sent to the server 102 together with the audio signals once the audio capturing is terminated. The server 102 then aligns the received audio signals based on such time information.
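A minimal sketch of how a server might use the stored offsets follows. It assumes each recording arrives with its local start time and an offset defined as (NTP time minus local time), and it trims every recording to the latest common start. The tuple layout, the sign convention of the offset, and the function name are assumptions made for illustration.

```python
def align_by_clock_offset(recordings, sample_rate):
    """Trim recordings so they start at the same instant on the common clock.

    recordings: list of (samples, local_start_time_s, ntp_offset_s), where
    common_time = local_time + ntp_offset (assumed sign convention).
    """
    starts = [local + offset for _, local, offset in recordings]
    latest = max(starts)
    trimmed = []
    for (samples, _, _), start in zip(recordings, starts):
        skip = round((latest - start) * sample_rate)  # samples before latest start
        trimmed.append(samples[skip:])
    shortest = min(len(s) for s in trimmed)
    return [s[:shortest] for s in trimmed]
```

Alignment obtained this way is only as accurate as the offset estimates, which is one reason a later refinement stage may be useful.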
[0036] Alternatively or additionally, the time alignment at step S303 may be realized by a peer-to-peer clock synchronization process. In these embodiments, the audio capturing devices may communicate with each other on a peer-to-peer basis, for example, via protocols like Bluetooth or an infrared connection. One of the audio capturing devices may be selected as the synchronization master, and the clock offsets of all the other capturing devices may be calculated relative to the synchronization master.
[0037] Another possible implementation is cross-correlation based time alignment. As known, a series of cross-correlation coefficients between a pair of input signals, x(i) and y(i), may be calculated by:
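$$
r(d) = \frac{\sum_{i=0}^{N-1}\left[\left(x(i)-\bar{x}\right)\left(y(i-d)-\bar{y}\right)\right]}{\sqrt{\sum_{i=0}^{N-1}\left(x(i)-\bar{x}\right)^{2}}\,\sqrt{\sum_{i=0}^{N-1}\left(y(i-d)-\bar{y}\right)^{2}}}
$$
where $\bar{x}$ and $\bar{y}$ represent the means of x(i) and y(i), N represents the length of x(i) and y(i), and d represents the time lag between the two series. The delay between the two signals may be calculated as follows:
$$
D = \arg\max_{d}\{r(d)\}
$$
Then, using x(i) as the reference, signal y(i) can be time-aligned to x(i) by:
$$
\hat{y}(i) = y(i - D)
$$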
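The cross-correlation search can be sketched as below. The normalized form used here is the standard correlation coefficient and is an assumption, as is the handling of samples that fall outside the overlap (they are skipped during correlation and zero-filled during alignment). The function names and the `max_lag` parameter are illustrative.

```python
import math


def cross_correlation(x, y, d):
    """Normalized correlation between x(i) and y(i - d) over their overlap."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    num = var_x = var_y = 0.0
    for i in range(n):
        j = i - d
        if 0 <= j < n:
            dx = x[i] - mean_x
            dy = y[j] - mean_y
            num += dx * dy
            var_x += dx * dx
            var_y += dy * dy
    den = math.sqrt(var_x * var_y)
    return num / den if den > 0 else 0.0


def estimate_delay(x, y, max_lag):
    """Return the lag D in [-max_lag, max_lag] that maximizes r(d)."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda d: cross_correlation(x, y, d))


def align(x, y, max_lag):
    """Shift y by the estimated delay so that it lines up with reference x."""
    delay = estimate_delay(x, y, max_lag)
    return [y[i - delay] if 0 <= i - delay < len(y) else 0.0
            for i in range(len(x))]
```

The cost of the search grows linearly with `max_lag`, which is why narrowing the search range matters in practice.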
[0038] It would be appreciated that though the time alignment can be realized by applying the cross-correlation process, this process can be time consuming and error prone if the search range is large. However, in practice the search range has to be fairly long in order to accommodate large network delay variations. To address this problem, information on calibration signals issued by the audio capturing devices 101 may be collected and transmitted to the server 102 to be used to reduce the search range of the cross-correlation process. As described above, in some embodiments of the present invention, the audio capturing devices 101 may broadcast an audio signal to the other members within the group upon start of the audio capture to thereby facilitate calculation of the distance between each pair of the audio capturing devices 101. In these embodiments, the broadcasted audio signals can also be used as calibration signals to reduce the time consumed by signal correlation. Specifically, considering two audio capturing devices A and B within a group, it is assumed that:
SA is the time instant when device A issues a command to play the calibration signal;
SB is the time instant when device B issues a command to play the calibration signal;
RAA is the time instant when device A receives the signal transmitted by device A;
RBA is the time instant when device A receives the signal transmitted by device B;
RBB is the time instant when device B receives the signal transmitted by device B;
RAB is the time instant when device B receives the signal transmitted by device A.
One or more of these time instants may be recorded by the audio capturing devices 101 and transmitted to the server 102 for use in cross-correlation process.
[0039] Generally speaking, the acoustic propagation delay from device A to device B is smaller than the network delay difference. That is, SB - SA > RAB - SA. Accordingly, the time instants RBA and RBB can be used to start the cross-correlation based time alignment process. In other words, only audio signal samples after the time instants RBA and RBB would be included in the correlation calculation. In this way, the search range may be reduced, which improves the efficiency of the time alignment.
[0040] It is possible, however, that the network delay difference is smaller than acoustic propagation delay difference. This could happen when the network has very low jitter or the two devices are put farther apart, or both. In this case, SB and SA can be used as the starting point for the cross correlation process. Specifically, since audio signals after SB and SA would contain the calibration signals, RBA can be used as the starting point for correlation for device A, and SB + (RBA - SA) can be used as the starting point for correlation for device B.
[0041] It would be appreciated that the above mechanisms for time alignment may be combined in any suitable manner. For example, in some embodiments of the present invention, the time alignment can be done in a three-step process. First, the coarse time synchronization may be performed between the audio capturing devices 101 and the server 102. Next, the calibration signals as discussed above may be used to refine the synchronization. Finally, cross-correlation analysis is applied to complete the time alignment of the audio signals.
[0042] It should be noted that the time alignment at step S303 is optional. For example, if the communication and/or device conditions are good enough, it is reasonably considered that all the audio capturing devices 101 receive the capturing command nearly at the same time and thus start the audio capturing simultaneously. Furthermore, it would be readily appreciated that in some applications where the quality of surround sound field is not very sensitive, a certain degree of misalignment of the starting time of audio capturing can be tolerated or ignored. In these situations, the time alignment at step S303 can be omitted.
[0043] Specifically, it should be noted that step S302 is not necessarily performed prior to S303. Instead, in some alternative embodiments, the time alignment of audio signals may be performed prior to or even in parallel with the topology estimation. For example, the clock synchronization process such as NTP synchronization or peer-to-peer synchronization can be performed before the topology estimation. Depending on the acoustic ranging approach, such clock synchronization process may be beneficial to acoustic ranging in topology estimation.
[0044] Continuing reference to Figure 3, at step S304, the surround sound field is generated from the received audio signals (possibly temporally aligned) at least partially based on the topology estimated at step S302. To this end, in accordance with some embodiments, a mode may be selected for processing the audio signals based on the number of the plurality of audio capturing devices. For example, if there are only two audio capturing devices 101 within the group, the two audio signals may be simply combined to generate a stereo output. Optionally, some post processing may be performed, including but not limited to stereo sound image widening, multi-channel upmixing, and so forth. On the other hand, when there are more than two audio capturing devices 101 within the group, Ambisonics or B-format processing may be applied to generate the surround sound field. It should be noted that the adaptive selection of processing mode is not necessarily needed. For example, even if there are only two audio capturing devices, the surround sound field may be generated by processing the captured audio signals by the B-format processing.
[0045] Next, some embodiments of the present invention of how to generate the surround sound field will be discussed with reference to the Ambisonics processing. However, it should be noted that the scope of the present invention is not limited in this regard. Any suitable techniques capable of generating the surround sound field from the received audio signals based on the estimated topology may be used in connection with embodiments of the present invention. For example, the binaural or 5.1-channel surround sound generation technology may be utilized as well.
[0046] Ambisonics is known as a flexible spatial audio processing technique that provides sound field and source localization recoverability. In Ambisonics, a 3D surround sound field is recorded as a four-channel signal, named B-format, with W, X, Y, and Z channels. The W channel contains omnidirectional sound pressure information, while the remaining three channels, X, Y, and Z, represent sound velocity information measured along the three corresponding axes of a 3D Cartesian coordinate system. Specifically, given a sound source S localized at azimuth φ and elevation θ, an ideal B-format representation of the surround sound field is:
$$
\begin{aligned}
W &= \frac{\sqrt{2}}{2}\, S \\
X &= \cos\varphi \cdot \cos\theta \cdot S \\
Y &= \sin\varphi \cdot \cos\theta \cdot S \\
Z &= \sin\theta \cdot S
\end{aligned}
$$
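As an illustration of the ideal B-format representation, the sketch below encodes a single source sample given its azimuth and elevation in radians. The W gain is taken as √2/2, the conventional B-format scaling; the function name is an assumption.

```python
import math


def encode_b_format(s, azimuth, elevation=0.0):
    """Encode source sample s at (azimuth, elevation) into (W, X, Y, Z)."""
    w = math.sqrt(2.0) / 2.0 * s
    x = math.cos(azimuth) * math.cos(elevation) * s
    y = math.sin(azimuth) * math.cos(elevation) * s
    z = math.sin(elevation) * s
    return w, x, y, z


# A unit source straight ahead contributes only to W and X.
front = encode_b_format(1.0, azimuth=0.0)
# A unit source at 90 degrees azimuth contributes only to W and Y.
left = encode_b_format(1.0, azimuth=math.pi / 2)
```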
[0047] For sake of simplicity, in the following discussion of the generation of directivity patterns for B-format signals, only the horizontal W, X, and Y channels are considered while the elevation axis Z will be ignored. This is a reasonable assumption because with the way the audio signals are captured by the audio capturing devices 101 in accordance with embodiments of the present invention, there is generally no elevation information.
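[0048] Given a plane wave, the directivity of a discrete array can be represented as follows:
$$
D(f, \varphi) = \sum_{n=0}^{N-1} A_n(f, \mathbf{r})\, e^{\,j \frac{2\pi f}{c}\, \mathbf{a} \cdot \mathbf{r}_n}
$$
where $\mathbf{r}_n = R\,[\cos\varphi_n \;\; \sin\varphi_n \;\; 0]$ represents the spatial location of the n-th audio capturing device with distance to the center of R and angle of $\varphi_n$, c represents the speed of sound, and $\mathbf{a}$ represents the source location at angle φ:
$$
\mathbf{a} = [\cos\varphi \;\; \sin\varphi \;\; 0]
$$
Further, $A_n(f, \mathbf{r})$ represents the weight for the audio capturing devices, which can be defined as the product of user defined weights and the gain of the audio capturing device at a particular frequency and angle:
$$
A_n(f, \mathbf{r}) = W_n(f)\,\Gamma(\varphi), \qquad \Gamma(\varphi) = \beta + (1 - \beta)\cos(\varphi)
$$
where β = 0.5 represents a cardioid polar pattern, β = 0.7 represents a subcardioid polar pattern, and β = 1 represents omni directivity.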
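The polar-pattern gain and its role in the array response can be sketched as follows. The gain β + (1 − β)cos(φ) is taken from the text; the phase term exp(j·2πf/c·a·r_n) is the usual plane-wave model and is an assumption here, as are the function names and the speed-of-sound constant.

```python
import cmath
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed


def polar_gain(phi, beta):
    """First-order pattern: beta=0.5 cardioid, 0.7 subcardioid, 1.0 omni."""
    return beta + (1.0 - beta) * math.cos(phi)


def array_directivity(freq, source_angle, device_angles, radius, weights, beta):
    """Complex response of a circular array to a plane wave from source_angle."""
    a = (math.cos(source_angle), math.sin(source_angle))
    total = 0j
    for weight, angle in zip(weights, device_angles):
        r_n = (radius * math.cos(angle), radius * math.sin(angle))
        # Assumed plane-wave phase for a device located at r_n.
        phase = 2.0 * math.pi * freq / SPEED_OF_SOUND * (a[0] * r_n[0] + a[1] * r_n[1])
        total += weight * polar_gain(source_angle, beta) * cmath.exp(1j * phase)
    return total
```

Sweeping `source_angle` over a full circle at several frequencies gives polar plots of the kind that could be used to compare candidate weights.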
[0049] It can be seen that once the polar pattern and the position topology of the audio capturing devices are determined, the weights Wn(f) for respective captured audio signals will affect the quality of the generated surround sound field. Different weights Wn(f) would generate different qualities of B-format signals. Weights for different audio signals may be represented as a mapping matrix. Considering the topology shown in Figure 2A as an example, the mapping matrix (W) from audio signals M1, M2, and M3 to W, X, and Y channels may be defined as follows:
Figure imgf000014_0001
Figure imgf000014_0002
[0050] Traditionally, the B-format signals are generated by using specially designed (often quite expensive) microphone arrays such as professional soundfield microphones. In this event, the mapping matrix may be designed in advance and kept unchanged in operation. However, in accordance with embodiments of the present invention, the audio signals are captured by an ad hoc network of audio capturing devices which are possibly dynamically grouped with varied topology. As a result, existing solutions may not be applicable to generate W, X, and Y channels from such raw audio signals captured by user devices that are not specially designed and positioned. For example, assume that the group contains three audio capturing devices 101 having angles of π/2, 3π/4, and 3π/2 and the same distance of 4 cm to the center. Figures 4A-4C show the polar patterns for the W, X, and Y channels, respectively, for various frequencies when using the original mapping matrix as described above. As seen, the outputs of the X and Y channels are incorrect since they are no longer orthogonal to each other. In addition, the W channel becomes problematic even at frequencies as low as 1000 Hz. Therefore, it is desired that the mapping matrix could be adapted flexibly in order to ensure the high quality of the generated surround sound field.
[0051] To this end, in accordance with embodiments of the present invention, the weights for respective audio signals, represented as the mapping matrix, may be dynamically adapted based on the topology of audio capturing devices as estimated at step S302. Still considering the above example topology where three audio capturing devices 101 have angles of π/2, 3π/4, and 3π/2 and the same distance of 4 cm to the center, if the mapping matrix is adapted according to this specific topology, for example, as
Figure imgf000015_0001
then better results can be achieved, which can be seen from Figures 5A-5C that show the polar patterns for W, X, and Y channels, respectively, for various frequencies in this situation.
[0052] According to some embodiments, it is possible to select the weights for audio signals based on the estimated topology of the audio capturing devices on-the-fly. Alternatively or additionally, adaptation of the mapping matrix may be realized based on predefined templates. In these embodiments, the server 102 may maintain a repository storing a set of predefined topology templates, each of which corresponds to a pre-tuned mapping matrix. For example, the topology templates may be represented by the coordinates and/or position relationship of the audio capturing devices. For a given estimated topology, the template that matches the estimated topology may be determined. There are many ways to locate the matched topology template. As an example, in one embodiment, the Euclidean distance between the estimated coordinates of the audio capturing devices and the coordinates in each template is calculated. The topology template with the minimum distance is determined as the matched template. As such, the pre-tuned mapping matrix corresponding to the determined matched topology template is selected for use in the generation of the surround sound field in the form of B-format signals.
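The minimum-distance template lookup can be sketched as follows. The sketch assumes that the estimated coordinates and every template list the devices in the same order and in the same coordinate frame; a practical matcher would likely also need to handle rotation, reflection, and device permutation. The repository layout and all names are hypothetical.

```python
import math


def template_distance(estimated, template_coords):
    """Euclidean distance between two topologies given as lists of (x, y)."""
    return math.sqrt(sum((ex - tx) ** 2 + (ey - ty) ** 2
                         for (ex, ey), (tx, ty) in zip(estimated, template_coords)))


def match_template(estimated, templates):
    """Return the name of the template closest to the estimated topology.

    templates: dict mapping name -> (coords, mapping_matrix); hypothetical layout.
    """
    return min(templates,
               key=lambda name: template_distance(estimated, templates[name][0]))


# Hypothetical repository; the matrices are placeholders, not tuned values.
templates = {
    "triangle": ([(0.0, -0.0441), (-0.075, 0.022), (0.075, 0.022)], "matrix_a"),
    "line": ([(-0.1, 0.0), (0.0, 0.0), (0.1, 0.0)], "matrix_b"),
}
estimated = [(0.001, -0.045), (-0.074, 0.021), (0.076, 0.023)]
best = match_template(estimated, templates)
mapping_matrix = templates[best][1]
```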
[0053] In some embodiments, in addition to the determined topology template, the weights for audio signals captured by respective devices can be selected further based on a frequency of those audio signals. Specifically, it is observed that for higher frequencies, spatial aliasing starts to appear due to the relatively large spacing between audio capturing devices. In order to further improve performance, the selection of the mapping matrix in B-format processing may be done on the basis of audio frequency. For example, in some embodiments, each topology template may correspond to at least two mapping matrices. Upon determination of the position topology template, the frequency of the received audio signals is compared with a predefined threshold, and one of the mapping matrices corresponding to the determined topology template can be selected and used based on the comparison. Using the selected mapping matrix, the B-format processing is applied to the received audio signals to thereby generate the surround sound field, as discussed above.
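The threshold comparison might look like the sketch below, in which each template carries one matrix for frequencies below the threshold and one for frequencies at or above it. The dictionary keys, the function name, and the 1000 Hz default are illustrative assumptions; the text does not specify a threshold value.

```python
def select_mapping_matrix(template_matrices, frequency_hz, threshold_hz=1000.0):
    """Pick the low-band or high-band matrix for the given frequency.

    template_matrices: dict with hypothetical keys "low_band" and "high_band".
    """
    if frequency_hz < threshold_hz:
        return template_matrices["low_band"]
    return template_matrices["high_band"]


# Placeholder matrices standing in for pre-tuned values.
matrices = {"low_band": "matrix_low", "high_band": "matrix_high"}
chosen_low = select_mapping_matrix(matrices, 400.0)
chosen_high = select_mapping_matrix(matrices, 4000.0)
```

In a filter-bank implementation the same lookup could be applied per frequency band rather than once per signal.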
[0054] It should be noted that although the surround sound field is shown to be generated based on the topology estimation, the scope of the present invention is not limited in this regard. For example, in some alternative embodiments where clock synchronization and distance/topology estimation is not available or already known, the sound field may be generated directly from the cross-correlation process applied to the captured audio signals. For example, in the case that topology of audio capturing devices is known, it is possible to perform the cross-correlation process to achieve some time alignment of the audio signals and then generate the sound field by simply applying a fixed mapping matrix in B-format processing. In this way, the time delay differences for the dominant source among different channels may be essentially removed. As a result, the sensor distance of the array of audio capturing devices may be reduced, thereby creating a coincident array.
[0055] Optionally, the method 300 proceeds to step S305 to estimate the direction of arrival (DOA) of the generated surround sound with respect to a rendering device. Then the surround sound field is rotated at step S306 at least partially based on the estimated DOA. Rotating the generated surround sound field according to the DOA is mainly for the purpose of improving the spatial rendering of the surround sound field. When performing B-format based spatial rendering, there is a nominal front, i.e. 0 degrees of azimuth, between the left and right audio capturing devices. A sound source from this direction will be perceived as coming from the front during binaural playback. It is desirable to have the target sound source coming from the front, as this is the most natural listening condition. However, due to the very nature of the positioning of audio capturing devices in the ad hoc group, it is impossible to always require users to point the left and right devices in the direction of the main target sound source, for example, a performance stage. To address this problem, the DOA estimation may be performed using the multi-channel input for rotating the surround sound field according to the estimated angle θ. In this regard, DOA algorithms like Generalized Cross Correlation with Phase Transform (GCC-PHAT), Steered Response Power-Phase Transform (SRP-PHAT), Multiple Signal Classification (MUSIC), or any other suitable DOA estimation algorithms can be used in connection with embodiments of the present invention. Then the sound field rotation can be easily achieved on the B-format signals using a standard rotation matrix as follows:
$$
\begin{bmatrix} W' \\ X' \\ Y' \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 \\
0 & \cos(\theta) & -\sin(\theta) \\
0 & \sin(\theta) & \cos(\theta)
\end{bmatrix}
\cdot
\begin{bmatrix} W \\ X \\ Y \end{bmatrix}
$$
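A sample-wise sketch of the horizontal rotation follows. It assumes the usual counter-clockwise convention, in which rotating by θ moves a source at azimuth φ to azimuth φ + θ, so bringing an estimated DOA to the nominal front means rotating by the negative of that DOA. The function name is illustrative.

```python
import math


def rotate_b_format(w, x, y, theta):
    """Rotate horizontal B-format components by theta radians (W is unchanged)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return w, cos_t * x - sin_t * y, sin_t * x + cos_t * y


# A source estimated at 60 degrees azimuth is brought to the front (0 degrees).
doa = math.radians(60.0)
w0, x0, y0 = math.sqrt(2.0) / 2.0, math.cos(doa), math.sin(doa)
w1, x1, y1 = rotate_b_format(w0, x0, y0, -doa)
```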
[0056] In some embodiments, in addition to the DOA, the sound field may be rotated further based on the energy of the generated sound field. In other words, it is possible to find the most dominant sound source both in terms of energy and duration. The goal is to find the best listening angle for a user in a sound field. Let θ_n and E_n represent the short-term estimated DOA and energy for frame n of the generated sound field, respectively, and let the total number of frames be N for the entire generated sound. It is further assumed that the medial plane is 0 degrees and the angle is measured counter-clockwise. Then a frame corresponds to a point (θ_n, E_n) using polar coordinate representation. In one embodiment, the rotation angle θ' may be determined, for example, by maximizing the following objective function:
Figure imgf000017_0002
[0057] Next, the method 300 proceeds to optional step S307 where the generated sound field may be converted into any target format suitable for playback on a rendering device. The following discussion continues with the examples where the surround sound field is generated as B-format signals. It would be readily appreciated that once a B-format signal is generated, the W, X, and Y channels may be converted to various formats suitable for spatial rendering. The decoding and reproduction of Ambisonics is dependent on the loudspeaker system used for spatial rendering. In general, the decoding from an Ambisonics signal to a set of loudspeaker signals is based on the assumption that, if the decoded loudspeaker signals are played back, a "virtual" Ambisonics signal recorded at the geometric center of the loudspeaker array should be identical to the Ambisonics signal used for decoding. This can be expressed as:
$$
C \cdot L = B
$$
where L = {L1, L2, ..., Ln}^T represents the set of loudspeaker signals, B = {W, X, Y, Z}^T represents the "virtual" Ambisonics signal assumed to be identical to the input Ambisonics signal for decoding, and C is known as a "re-encoding" matrix defined by the geometrical definition of the loudspeaker array, i.e. the azimuth and elevation of each loudspeaker. For example, given a square loudspeaker array, where loudspeakers are placed horizontally at the azimuths of {45°, -45°, 135°, -135°} and elevations of {0°, 0°, 0°, 0°}, this defines C as:
$$
C = \begin{bmatrix}
1 & 1 & 1 & 1 \\
\cos(45^\circ) & \cos(-45^\circ) & \cos(135^\circ) & \cos(-135^\circ) \\
\sin(45^\circ) & \sin(-45^\circ) & \sin(135^\circ) & \sin(-135^\circ) \\
0 & 0 & 0 & 0
\end{bmatrix}
$$
Based on this, the loudspeaker signals can be derived as:
$$
L = D \cdot B
$$
where D represents the decoding matrix, typically defined as the pseudo-inverse matrix of C.
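The decoding matrix for the square array can be computed with the sketch below. It keeps only the W, X, and Y rows of C, because the all-zero Z row contributes nothing for a horizontal array, and forms the right pseudo-inverse D = Cᵀ(CCᵀ)⁻¹, which coincides with the pseudo-inverse when the retained rows are linearly independent. The helper names and the small Gauss-Jordan inverse are illustrative choices.

```python
import math


def reencoding_matrix(azimuths_deg):
    """Rows W, X, Y of the re-encoding matrix C for horizontal loudspeakers."""
    az = [math.radians(a) for a in azimuths_deg]
    return [[1.0] * len(az), [math.cos(a) for a in az], [math.sin(a) for a in az]]


def invert(m):
    """Gauss-Jordan inverse of a small, non-singular square matrix."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def decoding_matrix(c):
    """Right pseudo-inverse D = C^T (C C^T)^-1, one row per loudspeaker."""
    rows, cols = len(c), len(c[0])
    cct = [[sum(c[i][k] * c[j][k] for k in range(cols)) for j in range(rows)]
           for i in range(rows)]
    inv = invert(cct)
    return [[sum(c[r][k] * inv[r][j] for r in range(rows)) for j in range(rows)]
            for k in range(cols)]


def decode(d, b):
    """Loudspeaker feeds L = D * B for one sample of B = (W, X, Y)."""
    return [sum(dij * bj for dij, bj in zip(row, b)) for row in d]


C = reencoding_matrix([45.0, -45.0, 135.0, -135.0])
D = decoding_matrix(C)
feeds = decode(D, [1.0, math.cos(math.radians(45.0)), math.sin(math.radians(45.0))])
```

For this input, which points toward the first loudspeaker, that loudspeaker receives the largest feed and the diagonally opposite one receives a small negative feed.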
[0058] In accordance with some embodiments, binaural rendering, in which audio is played back through a pair of earphones or headphones, may be desired since users are expected to listen to the audio files on mobile devices. B-format to binaural conversion can be achieved approximately by summing loudspeaker array feeds that are each filtered by a head-related transfer function (HRTF) matching the loudspeaker position. In spatial hearing, a directional sound source travels two distinct propagation paths to arrive at the left and right ears, respectively. This results in arrival-time and intensity differences between the two ear entrance signals, which are then exploited by the human auditory system to achieve localized hearing. These two propagation paths can be well modeled by a pair of direction-dependent acoustic filters, referred to as the head-related transfer functions. For example, given a sound source S located at direction φ, the ear entrance signals S_left and S_right can be modeled as:
$$
\begin{bmatrix} S_{left} \\ S_{right} \end{bmatrix}
=
\begin{bmatrix} H_{left,\varphi} \\ H_{right,\varphi} \end{bmatrix} \cdot S
$$
where H_left,φ and H_right,φ represent the HRTFs of direction φ. In practice, the HRTFs of a given direction can be measured by using probe microphones inserted at a subject's (either a person or a dummy head) ears to pick up responses from an impulse, or a known stimulus, placed at the direction.
[0059] These HRTF measurements can be used to synthesize virtual ear entrance signals from a monophonic source. By filtering this source with a pair of HRTFs corresponding to a certain direction and presenting the resulting left and right signals to a listener via headphones or earphones, a sound field with a virtual sound source spatialized at the desired direction can be simulated. Using the four-speaker array described above, we can thus convert the W, X, and Y channels to binaural signals as follows:
$$
\begin{bmatrix} S_{left} \\ S_{right} \end{bmatrix}
=
\begin{bmatrix}
H_{left,1} & H_{left,2} & H_{left,3} & H_{left,4} \\
H_{right,1} & H_{right,2} & H_{right,3} & H_{right,4}
\end{bmatrix}
\cdot
\begin{bmatrix} L_1 \\ L_2 \\ L_3 \\ L_4 \end{bmatrix}
$$
where H_left,n represents the transfer function from the nth loudspeaker to the left ear, and H_right,n represents the transfer function from the nth loudspeaker to the right ear. This can be extended to more loudspeakers:
$$
\begin{bmatrix} S_{left} \\ S_{right} \end{bmatrix}
=
\begin{bmatrix}
H_{left,1} & H_{left,2} & \cdots & H_{left,n} \\
H_{right,1} & H_{right,2} & \cdots & H_{right,n}
\end{bmatrix}
\cdot
\begin{bmatrix} L_1 \\ L_2 \\ \vdots \\ L_n \end{bmatrix}
$$
where n represents the total number of loudspeakers.
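In the time domain, the product with each transfer function becomes a convolution with the corresponding head-related impulse response (HRIR). The sketch below sums the filtered virtual loudspeaker feeds for each ear. The impulse responses used in the example are toy values chosen only to make the arithmetic easy to follow; they are not measured HRTF data, and the function names are assumptions.

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample lists."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def binauralize(speaker_feeds, hrirs_left, hrirs_right):
    """Sum HRIR-filtered loudspeaker feeds into left and right ear signals."""
    def mix(hrirs):
        parts = [convolve(feed, h) for feed, h in zip(speaker_feeds, hrirs)]
        length = max(len(p) for p in parts)
        return [sum(p[i] for p in parts if i < len(p)) for i in range(length)]
    return mix(hrirs_left), mix(hrirs_right)


# Two virtual loudspeakers with toy impulse responses (illustrative only).
feeds = [[1.0, 0.0], [0.0, 1.0]]
left, right = binauralize(feeds,
                          hrirs_left=[[1.0], [0.5]],
                          hrirs_right=[[0.0, 0.5], [1.0]])
```

For long signals and realistic HRIR lengths, an FFT-based convolution would usually be preferred over this direct form.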
[0060] After converting the generated surround sound field into a suitable signal format, the server 102 may transmit such signals to the rendering device for playback. In some embodiments, the rendering device and the audio capturing device may be co-located on the same physical terminal.
[0061] The method 300 ends after step S307.
[0062] Reference is now made to Figure 6 which shows a block diagram illustrating an apparatus for generating a surround sound field in accordance with an embodiment of the present invention. In accordance with embodiments of the present invention, the apparatus 600 may reside at the server 102 shown in Figure 1 or is otherwise associated with the server 102, and may be configured to perform the method 300 described above with reference to Figure 3.
[0063] As shown, in accordance with embodiments of the present invention, the apparatus 600 comprises a receiving unit 601 configured to receive audio signals captured by a plurality of audio capturing devices. The apparatus 600 also comprises a topology estimating unit 602 configured to estimate a topology of the plurality of audio capturing devices. Furthermore, the apparatus 600 comprises a generating unit 603 configured to generate the surround sound field from the received audio signals at least partially based on the estimated topology.
[0064] In some example embodiments, the topology estimating unit 602 may comprise a distance acquiring unit configured to acquire a distance between each pair of the plurality of audio capturing devices; and an MDS unit configured to estimate the topology by performing a multidimensional scaling (MDS) analysis on the acquired distances.
[0065] In some example embodiments, the generating unit 603 may comprise a mode selecting unit configured to select a mode for processing the audio signals based on a number of the plurality of audio capturing devices. Alternatively or additionally, in some example embodiments, the generating unit 603 may comprise a template determining unit configured to determine a topology template matching the estimated topology of the plurality of audio capturing devices; a weight selecting unit configured to select weights for the audio signals at least partially based on the determined topology template; and a signal processing unit configured to process the audio signals using the selected weights to generate the surround sound field. In some example embodiments, the weight selecting unit may comprise a unit configured to select the weights based on the determined topology template and frequencies of the audio signals.
[0066] In some example embodiments, the apparatus 600 may further comprise a time aligning unit 604 configured to perform a time alignment on the audio signals. In some example embodiments, the time aligning unit 604 is configured to apply at least one of a protocol-based clock synchronization process, a peer-to-peer clock synchronization process, and a cross-correlation process.
[0067] In some example embodiments, the apparatus 600 may further comprise a DOA estimating unit 605 configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and a rotating unit 606 configured to rotate the generated surround sound field at least partially based on the estimated DOA. In some example embodiments, the rotating unit may comprise a unit configured to rotate the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
[0068] In some example embodiments, the apparatus 600 may further comprise a converting unit 607 configured to convert the generated surround sound field into a target format for playback on a rendering device. For example, the B-format signals may be converted into binaural signals or 5.1-channel surround sound signals.
[0069] It should be noted that various units in the apparatus 600 correspond to the steps of method 300 described above with reference to Figure 3, respectively. As a result, all the features described with respect to Figure 3 are also applicable to the apparatus 600, which will not be detailed here.
[0070] Figure 7 is a block diagram illustrating a user terminal 700 for implementing example embodiments of the present invention. The user terminal 700 may operate as the audio capturing device 101 as discussed herein. In some embodiments, the user terminal 700 may be embodied as a mobile phone. It should be understood, however, that a mobile phone is merely illustrative of one type of apparatus that would benefit from embodiments of the present invention and, therefore, should not be taken to limit the scope of embodiments of the present invention.
[0071] As shown, the user terminal 700 includes an antenna(s) 712 in operable communication with a transmitter 714 and a receiver 716. The user terminal 700 further includes at least one processor or controller 720. For example, the controller 720 may be comprised of a digital signal processor, a microprocessor, and various analog to digital converters, digital to analog converters, and other support circuits. Control and information processing functions of the user terminal 700 are allocated between these devices according to their respective capabilities. The user terminal 700 also comprises a user interface including output devices such as a ringer 722, an earphone or speaker 724, one or more microphones 726 for audio capturing, a display 728, and user input devices such as a keyboard 730, a joystick or other user input interface, all of which are coupled to the controller 720. The user terminal 700 further includes a battery 734, such as a vibrating battery pack, for powering various circuits that are required to operate the user terminal 700, as well as optionally providing mechanical vibration as a detectable output.
[0072] In some embodiments, the user terminal 700 includes a media capturing element, such as a camera, video and/or audio module, in communication with the controller 720. The media capturing element may be any means for capturing an image, video and/or audio for storage, display or transmission. For example, in an example embodiment in which the media capturing element is a camera module 736, the camera module 736 may include a digital camera capable of forming a digital image file from a captured image. When embodied as a mobile terminal, the user terminal 700 may further include a universal identity module (UIM) 738. The UIM 738 is typically a memory device having a processor built in. The UIM 738 may include, for example, a subscriber identity module (SIM), a universal integrated circuit card (UICC), a universal subscriber identity module (USIM), a removable user identity module (R-UIM), etc. The UIM 738 typically stores information elements related to a subscriber.
[0073] The user terminal 700 may be equipped with at least one memory. For example, the user terminal 700 may include volatile memory 740, such as volatile Random Access Memory (RAM) including a cache area for the temporary storage of data. The user terminal 700 may also include non-volatile memory 742, which can be embedded and/or may be removable. The non-volatile memory 742 can additionally or alternatively comprise an EEPROM, flash memory or the like. The memories can store any of a number of pieces of information, programs, and data used by the user terminal 700 to implement the functions of the user terminal 700.
[0074] Figure 8 is a block diagram illustrating an example computer system 800 for implementing embodiments of the present invention. For example, the computer system 800 may function as the server 102 as described above. As shown, a central processing unit (CPU) 801 performs various processes in accordance with a program stored in a read only memory (ROM) 802 or a program loaded from a storage section 808 to a random access memory (RAM) 803. In the RAM 803, data required when the CPU 801 performs the various processes or the like is also stored as required. The CPU 801, the ROM 802 and the RAM 803 are connected to one another via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
[0075] The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, or the like; an output section 807 including a display such as a cathode ray tube (CRT), a liquid crystal display (LCD), or the like, and a loudspeaker or the like; the storage section 808 including a hard disk or the like; and a communication section 809 including a network interface card such as a LAN card, a modem, or the like. The communication section 809 performs a communication process via the network such as the internet. A drive 810 is also connected to the I/O interface 805 as required. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 810 as required, so that a computer program read therefrom is installed into the storage section 808 as required.
[0076] In the case where the above-described steps and processes (for example, method 300) are implemented in software, the program that constitutes the software is installed from a network such as the internet or from a storage medium such as the removable medium 811.
[0077] Generally speaking, various example embodiments of the present invention may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While various aspects of the example embodiments of the present invention are illustrated and described as block diagrams, flowcharts, or using some other pictorial representation, it will be appreciated that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[0078] For example, the apparatus 600 described above may be implemented as hardware, software/firmware, or any combination thereof. In some embodiments, one or more units in the apparatus 600 may be implemented as software modules. Alternatively or additionally, some or all of the units may be implemented using hardware modules like integrated circuits (ICs), application specific integrated circuits (ASICs), system-on-chip (SOCs), field programmable gate arrays (FPGAs), and the like. The scope of the present invention is not limited in that regard.
[0079] Additionally, various blocks shown in Figure 3 may be viewed as method steps, and/or as operations that result from operation of computer program code, and/or as a plurality of coupled logic circuit elements constructed to carry out the associated function(s). For example, embodiments of the present invention include a computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the method 300 as detailed above.
[0080] In the context of the disclosure, a machine readable medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine readable medium may be a machine readable signal medium or a machine readable storage medium. A machine readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
[0081] Computer program code for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer program codes may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor of the computer or other programmable data processing apparatus, cause the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a computer, partly on the computer, as a stand-alone software package, partly on the computer and partly on a remote computer or entirely on the remote computer or server.
[0082] Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination.
[0083] Various modifications and adaptations to the foregoing example embodiments of this invention may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. Any and all modifications will still fall within the scope of the non-limiting and example embodiments of this invention. Furthermore, other embodiments of the inventions set forth herein will come to mind to one skilled in the art to which these embodiments of the invention pertain having the benefit of the teachings presented in the foregoing descriptions and the drawings.
[0084] Accordingly, the present invention may be embodied in any of the forms described herein. For example, the following enumerated example embodiments (EEEs) describe some structures, features, and functionalities of some aspects of the present invention.
[0085] EEE 1. A method of generating a surround sound field, the method comprising: receiving audio signals captured by a plurality of audio capturing devices; performing a time alignment of the received audio signals by applying a cross-correlation process on the received audio signals; and generating the surround sound field from the time aligned audio signals.
[0086] EEE 2. The method according to EEE 1, further comprising: receiving information on calibration signals issued by the plurality of audio capturing devices; and reducing a search range of the cross-correlation process based on the received information on the calibration signals.
[0087] EEE 3. The method according to any of preceding EEEs, wherein generating the surround sound field comprises: generating the surround sound field based on a predefined topology estimation of the plurality of audio capturing devices.
[0088] EEE 4. The method according to any of preceding EEEs, wherein generating the surround sound field comprises: selecting a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
[0089] EEE 5. The method according to any of preceding EEEs, further comprising: estimating a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and rotating the generated surround sound field at least partially based on the estimated DOA.
[0090] EEE 6. The method according to EEE 5, wherein rotating the generated surround sound field comprises: rotating the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
[0091] EEE 7. The method according to any of preceding EEEs, further comprising: converting the generated surround sound field into a target format for playback on a rendering device.
[0092] EEE 8. An apparatus of generating a surround sound field, the apparatus comprising: a first receiving unit configured to receive audio signals captured by a plurality of audio capturing devices; a time aligning unit configured to perform a time alignment of the received audio signals by applying a cross-correlation process on the received audio signals; and a generating unit configured to generate the surround sound field from the time aligned audio signals.
[0093] EEE 9. The apparatus according to EEE 8, further comprising: a second receiving unit configured to receive information on calibration signals issued by the plurality of audio capturing devices; and a reducing unit configured to reduce a search range of the cross-correlation process based on the information on the calibration signals.
[0094] EEE 10. The apparatus according to any of EEEs 8 to 9, wherein the generating unit comprises: a unit configured to generate the surround sound field based on a predefined estimation of topology of the plurality of audio capturing devices.
[0095] EEE 11. The apparatus according to any of EEEs 8 to 10, wherein the generating unit comprises: a mode selecting unit configured to select a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
[0096] EEE 12. The apparatus according to any of EEEs 8 to 11, further comprising: a DOA estimating unit configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and a rotating unit configured to rotate the generated surround sound field at least partially based on the estimated DOA.
[0097] EEE 13. The apparatus according to EEE 12, wherein the rotating unit comprises: a unit configured to rotate the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
[0098] EEE 14. The apparatus according to any of EEEs 8 to 13, further comprising: a converting unit configured to convert the generated surround sound field into a target format for playback on a rendering device.
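As an illustration of the time alignment described in EEEs 1 and 2, the following is a minimal sketch, not the implementation of the embodiments. It assumes two mono sample sequences at the same sampling rate and uses a brute-force cross-correlation. The function names `best_lag` and `align` and the parameter `max_lag` are hypothetical. `max_lag` stands in for the reduced search range of EEE 2; how that range would be derived from the calibration signals is not specified in this section and is not modeled here.

```python
def best_lag(ref, sig, max_lag):
    """Return the lag (in samples) within [-max_lag, max_lag] that maximizes
    the cross-correlation between `ref` and `sig`.

    A positive result means `sig` is delayed relative to `ref`.
    `max_lag` is a hypothetical parameter bounding the search range.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, r in enumerate(ref):
            m = n + lag
            if 0 <= m < len(sig):
                score += r * sig[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def align(sig, lag):
    """Shift `sig` by `lag` samples so it lines up with the reference,
    padding with zeros to keep the original length."""
    if lag >= 0:
        return sig[lag:] + [0.0] * lag
    return [0.0] * (-lag) + sig[:lag]


# Usage: the second capture is the first one delayed by two samples.
reference = [0.0, 0.0, 1.0, 3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0]
delayed = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, -2.0, 0.0, 0.0, 0.0]
lag = best_lag(reference, delayed, max_lag=3)
aligned = align(delayed, lag)
```

A smaller `max_lag` shrinks the number of candidate lags that are evaluated, which is the practical benefit of narrowing the search range before correlating.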
[0099] It will be appreciated that the embodiments of the invention are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of the appended claims. Although specific terms are used herein, they are used in a generic and descriptive sense only and not for purposes of limitation.
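The rotation described in EEEs 5 and 6 can be sketched under strong assumptions that this section does not state. The sketch assumes the sound field is held as a horizontal first-order representation with an omnidirectional channel `w` and two orthogonal directional channels `x` and `y`, and it estimates the DOA from the energy-weighted products of those channels. Both the representation and the estimator are illustrative choices, the function names are hypothetical, and the relationship to a rendering device is not modeled.

```python
import math


def estimate_doa(w, x, y):
    """Estimate a dominant azimuth (radians) from the energy-weighted
    products w*x and w*y. This estimator is an assumption for illustration."""
    sum_x = sum(wi * xi for wi, xi in zip(w, x))
    sum_y = sum(wi * yi for wi, yi in zip(w, y))
    return math.atan2(sum_y, sum_x)


def rotate_field(w, x, y, angle):
    """Rotate the horizontal sound field by `angle` radians.
    The omnidirectional channel is unchanged; x and y rotate as a 2-D vector."""
    c, s = math.cos(angle), math.sin(angle)
    x_rot = [c * xi - s * yi for xi, yi in zip(x, y)]
    y_rot = [s * xi + c * yi for xi, yi in zip(x, y)]
    return list(w), x_rot, y_rot


# Usage: a single source at 60 degrees is rotated to the front (0 degrees).
source = [1.0, -0.5, 0.25, 2.0]
theta = math.radians(60.0)
w = list(source)
x = [v * math.cos(theta) for v in source]
y = [v * math.sin(theta) for v in source]

doa = estimate_doa(w, x, y)
w_rot, x_rot, y_rot = rotate_field(w, x, y, -doa)
```

Rotating by the negative of the estimated angle moves the dominant source onto the positive x axis, so after rotation the `y` channel carries essentially no energy from that source.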

Claims

WHAT IS CLAIMED IS:
1. A method of generating a surround sound field, the method comprising:
receiving audio signals captured by a plurality of audio capturing devices;
estimating a topology of the plurality of audio capturing devices; and
generating the surround sound field from the received audio signals at least partially based on the estimated topology.
2. The method according to claim 1, wherein estimating the topology of the plurality of audio capturing devices comprises:
acquiring a distance between each pair of the plurality of audio capturing devices; and
estimating the topology by performing a multidimensional scaling (MDS) analysis on the acquired distances.
3. The method according to any of preceding claims, wherein generating the surround sound field comprises:
selecting a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
4. The method according to any of preceding claims, wherein generating the surround sound field comprises:
determining a topology template matching the estimated topology of the plurality of audio capturing devices;
selecting weights for the audio signals at least partially based on the determined topology template; and
processing the audio signals using the selected weights to generate the surround sound field.
5. The method according to claim 4, wherein selecting the weights comprises: selecting the weights based on the determined topology template and a frequency of the audio signals.
6. The method according to any of preceding claims, further comprising: performing a time alignment of the received audio signals.
7. The method according to claim 6, wherein performing the time alignment comprises applying at least one of a protocol-based clock synchronization process, a peer-to-peer clock synchronization process, and a cross-correlation process.
8. The method according to any of preceding claims, further comprising:
estimating a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and
rotating the generated surround sound field at least partially based on the estimated DOA.
9. The method according to claim 8, wherein rotating the generated surround sound field comprises:
rotating the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
10. The method according to any of preceding claims, further comprising:
converting the generated surround sound field into a target format for playback on a rendering device.
11. An apparatus of generating a surround sound field, the apparatus comprising:
a receiving unit configured to receive audio signals captured by a plurality of audio capturing devices;
a topology estimating unit configured to estimate a topology of the plurality of audio capturing devices; and
a generating unit configured to generate the surround sound field from the received audio signals at least partially based on the estimated topology.
12. The apparatus according to claim 11, wherein the estimating unit comprises:
a distance acquiring unit configured to acquire a distance between each pair of the plurality of audio capturing devices; and
an MDS unit configured to estimate the topology by performing a multidimensional scaling (MDS) analysis on the acquired distances.
13. The apparatus according to any of claims 11 to 12, wherein the generating unit comprises:
a mode selecting unit configured to select a mode for processing the audio signals based on a number of the plurality of audio capturing devices.
14. The apparatus according to any of claims 11 to 13, wherein the generating unit comprises:
a template determining unit configured to determine a topology template matching the estimated topology of the plurality of audio capturing devices;
a weight selecting unit configured to select weights for the audio signals at least partially based on the determined topology template; and
a signal processing unit configured to process the audio signals using the selected weights to generate the surround sound field.
15. The apparatus according to claim 14, wherein the weight selecting unit comprises:
a unit configured to select the weights based on the determined topology template and a frequency of the audio signals.
16. The apparatus according to any of claims 11 to 15, further comprising:
a time aligning unit configured to perform a time alignment of the received audio signals.
17. The apparatus according to claim 16, wherein the time aligning unit is configured to apply at least one of a protocol-based clock synchronization process, a peer-to-peer clock synchronization process, and a cross-correlation process.
18. The apparatus according to any of claims 11 to 17, further comprising:
a DOA estimating unit configured to estimate a direction of arrival (DOA) of the generated surround sound field with respect to a rendering device; and
a rotating unit configured to rotate the generated surround sound field at least partially based on the estimated DOA.
19. The apparatus according to claim 18, wherein the rotating unit comprises:
a unit configured to rotate the generated surround sound field based on the estimated DOA and energy of the generated surround sound field.
20. The apparatus according to any of claims 11 to 19, further comprising:
a converting unit configured to convert the generated surround sound field into a target format for playback on a rendering device.
21. A computer program product comprising a computer program tangibly embodied on a machine readable medium, the computer program containing program codes configured to carry out the method according to any of claims 1-10.
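For illustration only, and forming no part of the claims: the multidimensional scaling (MDS) analysis recited in claim 2 can be sketched as classical MDS in two dimensions. The sketch assumes a symmetric matrix of pairwise distances between devices lying in a plane. It uses power iteration with deflation to find the two leading eigenpairs, which is a choice made to keep the example dependency-free and is not taken from the source. The function name `classical_mds_2d` is hypothetical. The recovered topology is determined only up to rotation, reflection and translation.

```python
import math


def classical_mds_2d(dist, iters=500):
    """Recover 2-D coordinates from a symmetric pairwise distance matrix."""
    n = len(dist)
    # Double-center the squared distances: B = -1/2 * J * D^2 * J.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    total_mean = sum(row_mean) / n
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]

    coords = [[0.0, 0.0] for _ in range(n)]
    for axis in range(2):
        # Power iteration for the dominant eigenpair of the current matrix.
        v = [float(i + 1) for i in range(n)]
        start_norm = math.sqrt(sum(c * c for c in v))
        v = [c / start_norm for c in v]
        lam = 0.0
        for _ in range(iters):
            prod = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(c * c for c in prod))
            if norm < 1e-12:
                lam = 0.0
                break
            v = [c / norm for c in prod]
            lam = norm
        scale = math.sqrt(lam)
        for i in range(n):
            coords[i][axis] = v[i] * scale
        # Deflate so the next pass finds the next eigenpair.
        b = [[b[i][j] - lam * v[i] * v[j] for j in range(n)]
             for i in range(n)]
    return coords


# Usage: four devices at the corners of a 4 x 3 rectangle.
points = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0), (4.0, 3.0)]
distances = [[math.hypot(p[0] - q[0], p[1] - q[1]) for q in points]
             for p in points]
topology = classical_mds_2d(distances)
```

The estimated coordinates reproduce the input distances even though their absolute orientation differs from the original layout, which is sufficient for matching against a topology template as in claim 4.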
PCT/US2014/042800 2013-06-18 2014-06-17 Generating surround sound field WO2014204999A2 (en)

Priority Applications (5)

Application Number Priority Date Filing Date Title
CN201480034420.XA CN105340299B (en) 2013-06-18 2014-06-17 Method and its device for generating surround sound sound field
US14/899,505 US9668080B2 (en) 2013-06-18 2014-06-17 Method for generating a surround sound field, apparatus and computer program product thereof
EP14736577.9A EP3011763B1 (en) 2013-06-18 2014-06-17 Method for generating a surround sound field, apparatus and computer program product thereof.
JP2015563133A JP5990345B1 (en) 2013-06-18 2014-06-17 Surround sound field generation
HK16108833.6A HK1220844A1 (en) 2013-06-18 2016-07-23 Method for generating a surround sound field, apparatus and computer program product thereof

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201310246729.2 2013-06-18
CN201310246729.2A CN104244164A (en) 2013-06-18 2013-06-18 Method, device and computer program product for generating surround sound field
US201361839474P 2013-06-26 2013-06-26
US61/839,474 2013-06-26

Publications (2)

Publication Number Publication Date
WO2014204999A2 (en) 2014-12-24
WO2014204999A3 (en) 2015-03-26

Family ID: 52105492

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2014/042800 WO2014204999A2 (en) 2013-06-18 2014-06-17 Generating surround sound field

Country Status (6)

Country Link
US (1) US9668080B2 (en)
EP (1) EP3011763B1 (en)
JP (2) JP5990345B1 (en)
CN (2) CN104244164A (en)
HK (1) HK1220844A1 (en)
WO (1) WO2014204999A2 (en)

Also Published As

Publication number Publication date
JP5990345B1 (en) 2016-09-14
JP2016533045A (en) 2016-10-20
HK1220844A1 (en) 2017-05-12
EP3011763A2 (en) 2016-04-27
JP2017022718A (en) 2017-01-26
CN104244164A (en) 2014-12-24
US20160142851A1 (en) 2016-05-19
CN105340299A (en) 2016-02-17
EP3011763B1 (en) 2017-08-09
WO2014204999A3 (en) 2015-03-26
CN105340299B (en) 2017-09-12
US9668080B2 (en) 2017-05-30

Legal Events

Date Code Title Description
WWE WIPO information: entry into national phase. Ref document number: 201480034420.X; Country of ref document: CN
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 14736577; Country of ref document: EP; Kind code of ref document: A2
DPE1 Request for preliminary examination filed after expiration of 19th month from priority date (PCT application filed from 20040101)
REEP Request for entry into the European phase. Ref document number: 2014736577; Country of ref document: EP
WWE WIPO information: entry into national phase. Ref document number: 2014736577; Country of ref document: EP
ENP Entry into the national phase. Ref document number: 2015563133; Country of ref document: JP; Kind code of ref document: A
WWE WIPO information: entry into national phase. Ref document number: 14899505; Country of ref document: US