WO2022010454A1 - Binaural down-mixing of audio signals

Info

Publication number: WO2022010454A1
Application number: PCT/US2020/040903
Authority: WO
Inventors: Sunil Bharitkar; Andre DA FONTE LOPES DASILVA; Walter Flores PEREIRA
Original assignee: Hewlett-Packard Development Company, L.P.
Priority date: 2020-07-06
Filing date: 2020-07-06
Publication date: 2022-01-13

Abstract

In example implementations, an apparatus is provided. The apparatus includes an audio decoder, a binaural down-mixer, a communication interface, and a processor. The audio decoder is to decode an audio signal to remove a coding applied to the audio signal. The binaural down-mixer is to generate a binaural down-mixed version of the audio signal. The communication interface is to communicatively connect to a device connected to headphones. The processor is to detect that the device is without media processing capabilities and to transmit the binaural down-mixed version of the audio signal to the device, wherein the binaural down-mixed version of the audio signal is to be outputted via the headphones.

Description

BINAURAL DOWN-MIXING OF AUDIO SIGNALS

BACKGROUND

[0001] Audio signals from a source may be processed to create a more enjoyable experience for a user. For example, audio signals from a movie or video game may be processed to provide a surround sound experience. For example, speakers may be placed around a user, and the audio signals may be processed into different channels to be outputted by a respective speaker around the user to create the surround sound experience.

[0002] In some instances, a user may wear headphones to listen to the audio signals. Headphones may not have the ability to output audio signals in several different speakers. Rather, headphones output audio into a left channel and a right channel. However, some headphones may perform spatial processing on the headphones to make it sound like an audio signal is coming from a particular direction.

BRIEF DESCRIPTION OF THE DRAWINGS

[0003] FIG. 1 is a block diagram of an example system to generate a binaural down-mixed version of audio signals and transmit the binaural down-mixed version to a device of the present disclosure;

[0004] FIG. 2 is a block diagram of an example apparatus to generate a binaural down-mixed version of audio signals of the present disclosure;

[0005] FIG. 3 is another block diagram of an example apparatus to generate a binaural down-mixed version of audio signals of the present disclosure;

[0006] FIG. 4 is a flow diagram of an example method for generating a binaural down-mixed version of an audio signal of the present disclosure;

[0007] FIG. 5 is another flow diagram of an example method for generating a binaural down-mixed version of an audio signal of the present disclosure; and

[0008] FIG. 6 is an example non-transitory computer readable storage medium storing instructions executed by a processor to generate a binaural down-mixed version of an audio signal of the present disclosure.

DETAILED DESCRIPTION

[0009] Examples described herein provide an apparatus and method to generate a binaural down-mixed version of an audio signal. The binaural down-mixed version can be played via devices that are connected to headphones and provide a spatially rendered audio signal over two channels that can be played on the headphones. As discussed above, headphones may be used to listen to audio signals. However, some audio signals may not be spatially rendered to provide a surround sound experience for the user via the headphones.

[0010] In some instances, audio signals can be streamed from a remotely located server. The audio signals may be associated with music, movies, or any other type of media. The audio signals may be processed for 5.1 channel (or more channels) surround sound or not spatially rendered for two channel headphones.

[0011] In some instances, the audio signals can be streamed via a smart speaker. The smart speaker may have a single speaker, but have the capability to decode the audio signal and then spatially render the audio signal for the smart speaker. However, the spatial rendering performed by the smart speaker may not be the spatial rendering used for the headphones.

[0012] In some instances, some devices may be used to stream the audio signals. However, the devices (e.g., mobile devices) may not have media processing capabilities or rendering capabilities to generate a binaural down-mixed version of the audio signal for the headphones connected to the device. Thus, the audio signals may not be properly rendered or processed for surround sound output on headphones.

[0013] The present disclosure provides a method and apparatus that can generate a binaural down-mixed version of the audio signal. The binaural down-mixed version can be stored in a server or locally stored on a device such that a user can have a surround sound experience while listening to the audio signal on headphones.

[0014] FIG. 1 illustrates an example system 100 to generate a binaural down-mixed version of an audio signal of the present disclosure. In an example, the system 100 may include an apparatus 102, a device 104, and a server 108. The apparatus 102 may be a device with media processing capabilities. For example, the apparatus 102 may be a device that may include a binaural down-mixer 114. The binaural down-mixer 114 may be able to generate a spatially rendered audio signal that can simulate surround sound over a two-channel output.

[0015] In an example, the apparatus 102 may be a smart speaker that includes media processing capabilities, such as decoding audio signals, spatially rendering audio signals for output on the smart speaker, and the binaural down-mixer 114. The apparatus 102 may be communicatively coupled to the server 108 and the device 104.

[0016] In an example, the server 108 may be a remotely located server within a network 110. The network 110 may be an internet protocol (IP) network that allows remote devices (e.g., the apparatus 102) to access the server 108. It should be noted that the network 110 has been simplified for ease of explanation and may include other components and devices that are not shown. For example, the network 110 may include gateways, routers, firewalls, additional servers, and the like.

[0017] In an example, the server 108 may include audio signals 112. The audio signals 112 may be media files, such as movies, music, speech, podcasts, and the like. The audio signals 112 may be encoded for surround sound on 5.1 channel systems. For example, the audio signals 112 may be encoded with Moving Picture Experts Group (MPEG) encoding, Dolby Atmos encoding, Digital Theater Systems (DTS) encoding, and the like. The audio signals 112 may be selected and streamed to a device via the network 110.

[0018] The encoding of the audio signals 112 may not be compatible with some devices. For example, the apparatus 102 may include media processing capabilities that can decode the audio signals 112 and then render the decoded audio signals for output on the speakers of the apparatus 102.

[0019] In an example, the device 104 may be a mobile device, smartphone, tablet computer, laptop computer, and the like. The device 104 may also be connected to headphones 106. The headphones 106 may be communicatively coupled to the apparatus 102. In an example, the headphones 106 may include speakers in the ear cups of the headphones 106. In an example, the headphones 106 may include two outputs (e.g., two channels, one for the left ear and one for the right ear).

[0020] The device 104 may be communicatively coupled to or connected to the apparatus 102. The device 104 may be without media processing capabilities. In other words, the device 104 may exclude an audio signal decoder.

[0021] The device 104 may request an audio signal 112 to be output via the headphones 106. However, the audio signal 112 may not be encoded for proper output via the headphones 106. In addition, the device 104 may be without media processing capabilities to properly render the audio signal 112 for output on the headphones 106.

[0022] In an example, the apparatus 102 may detect that the device 104 is without media processing capabilities and generate a binaural down-mixed version of the audio signal 112 that is selected by the device 104. For example, generating the binaural down-mixed version of the audio signal 112 may refer to moving from a higher spatial representation (e.g., 5.1 channel or 7.1 channel) to two channels. In other words, the binaural down-mixed version of the audio signal 112 may have fewer channels than the unprocessed version of the audio signal 112.

[0023] The binaural down-mixed version may then be transmitted to the device 104 for output via the headphones 106. The apparatus 102 can transmit the binaural down-mixed version to the server 108 for storage, or to the device 104. As a result, when the audio signal 112 is selected by the device 104 again in a location not in proximity to the apparatus 102, the binaural down-mixed version can be retrieved from storage of the server 108 and transmitted to the device 104. The device 104 may output the binaural down-mixed version via the headphones 106. The binaural down-mixed version may be spatially rendered to simulate surround sound in the audio signal 112 over the two output channels of the headphones 106.

[0024] FIG. 2 illustrates a block diagram of an example of the apparatus 102. In an example, the apparatus 102 may include a processor 202, a communication interface 204, an audio decoder 206, and the binaural down-mixer 114. The processor 202 may be communicatively coupled to the communication interface 204, the audio decoder 206, and the binaural down-mixer 114.

[0025] In an example, the communication interface 204 may be a wired or wireless communication interface. For example, the communication interface 204 may be a short range wireless interface (e.g., Bluetooth) or a WiFi interface that can communicatively connect the apparatus 102 to the device 104. In an example, the device 104 may transmit a request to the apparatus 102 for an audio signal 112.

[0026] In an example, the request may include information related to the device 104. For example, the request may indicate that the device 104 does not have media processing capabilities and that the audio signal 112 may be output by the headphones 106 connected to the device 104. In another example, the processor 202 may determine that the device 104 does not have media processing capabilities through information exchanged when the connection is initially established.
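The detection described above can be sketched in a few lines. This is a minimal illustration only: the `ConnectionRequest` fields, the `needs_binaural_downmix` helper, and the device identifiers are hypothetical names chosen for the example and are not defined by the disclosure, which does not specify a message format.

```python
from dataclasses import dataclass


@dataclass
class ConnectionRequest:
    """Hypothetical handshake payload sent by the device 104 (field names are assumptions)."""
    device_id: str
    has_media_processing: bool
    output: str  # e.g. "headphones" or "speaker"


def needs_binaural_downmix(request: ConnectionRequest) -> bool:
    """Return True when the apparatus should down-mix on the device's behalf."""
    return (not request.has_media_processing) and request.output == "headphones"


# Two example requests: a phone without media processing, and a laptop with it.
phone = ConnectionRequest("phone-1", has_media_processing=False, output="headphones")
laptop = ConnectionRequest("laptop-1", has_media_processing=True, output="headphones")

phone_needs_downmix = needs_binaural_downmix(phone)
laptop_needs_downmix = needs_binaural_downmix(laptop)
```

In this sketch the same check could run either on each request or once when the connection is established, matching the two examples in the paragraph above.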

[0027] In an example, the communication interface 204 may also communicatively connect the apparatus 102 to the server 108. The apparatus 102 may obtain the audio signal 112 requested by the device 104 and stored in the server 108.

[0028] The processor 202 may receive the audio signal 112 via the communication interface 204. The audio signal 112 may then be provided to the audio decoder 206. The audio decoder 206 may decode the audio signal 112.

[0029] The decoded audio signal 112 may then be provided to the binaural down-mixer 114. The binaural down-mixer 114 may process the decoded audio signal 112 to generate a binaural down-mixed version of the audio signal 112. The binaural down-mixed version of the audio signal 112 may be a spatially rendered version of the audio signal 112. The spatially rendered version may simulate surround sound by rendering different portions of the audio signal 112 to sound as if the sound is coming from a particular direction using two channels. Thus, the binaural down-mixed version of the audio signal 112 may simulate a surround sound audio signal when listened to over two channels of the headphones 106.

[0030] FIG. 3 illustrates a block diagram of another example of the apparatus 102. In an example, the apparatus 102 may include the processor 202, the communication interface 204, a digital signal processor (DSP) 302 and a memory 304. The processor 202 may be communicatively coupled to the communication interface 204 and the DSP 302. The processor 202 and the communication interface 204 may operate as described above in FIG. 2.

[0031] In an example, the audio decoder 206 and the binaural down-mixer 114 may be deployed as the DSP 302. For example, the memory 304 may include instructions 306 to decode audio signals and instructions 308 to generate binaural down-mixed audio signals. The DSP 302 may be communicatively coupled to the memory 304 and execute the instructions 306 and the instructions 308. For example, the DSP 302 may identify a type of encoding applied to the audio signals. Then the instructions 308 may generate a binaural down-mix before re-encoding to another format, based on the instructions 306. The binaural down-mixed version of the audio signal may be transmitted to the server or the device associated with the headphones.

[0032] In an example, the memory 304 may be a non-transitory computer readable medium. For example, the memory 304 may be a random access memory (RAM), a read only memory (ROM), a memory that is part of the DSP 302, and the like.

[0033] FIG. 4 illustrates a flow diagram of an example method 400 for generating a binaural down-mixed version of an audio signal of the present disclosure. In an example, the method 400 may be performed by the apparatus 102 or the apparatus 600 illustrated in FIG. 6, and described below.

[0034] At block 402, the method 400 begins. At block 404, the method 400 detects a connection to a device without media processing capabilities. For example, a device without media processing capabilities may connect to an apparatus with media processing capabilities to request an audio signal. The connection may be via a short range wireless connection (e.g., Bluetooth) or via a WiFi connection.

[0035] In an example, the device without media processing capabilities may be connected to headphones. As noted above, the headphones may output the audio signal using two channels. However, some audio signals from a server may be encoded for output on more than two channels (e.g., 5.1 channel surround sound). Since the device without media processing capabilities cannot render the audio signal for proper output over the headphones, the device may connect to the apparatus with media processing capabilities. The apparatus may then process and/or render the audio signal for output on the headphones connected to the device without media processing capabilities.

[0036] In an example, the detection that the device does not have media processing capabilities may be performed via information exchanged when the connection is initially established. For example, the device without media processing capabilities may transmit a connection request that includes information about the device. The information in the request may indicate that the device does not have media processing capabilities.

[0037] At block 406, the method 400 receives a request for an audio signal from the device. In an example, the request may be a first time request for the audio signal. In other words, the audio signal may not have been previously requested by the device. The audio signal may be a media file such as music, a movie, a podcast, speech, and the like.

[0038] At block 408, the method 400 decodes the audio signal to remove a coding applied to the audio signal. For example, the audio signal may be encoded to be output via multiple channels (e.g., 5.1 channel surround sound). The coding may be MPEG, Dolby Atmos, DTS, and the like.

[0039] At block 410, the method 400 generates a binaural down-mixed version of the audio signal. For example, the decoded audio signal may be encoded to generate the binaural down-mixed version of the audio signal. The binaural down-mixed version may be a spatial rendering of the audio signal. Spatial rendering may encode the audio signal to simulate surround sound over two channels used by the headphones. The spatial rendering may encode a direction, angle, and distance for different sounds in the audio signal to allow a listener to perceive sounds from the appropriate direction, angle, and distance. The sounds may be down-mixed via linear summation to simulate surround sound audio signals in the left ear and the right ear via the two-channels of the headphones.
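The linear summation described above can be illustrated with a short sketch. It assumes one common way of realizing such a down-mix: each decoded channel is filtered with a left-ear and a right-ear impulse response, and the filtered channels are summed per ear. The disclosure does not specify the filters, so the `TOY_HRIRS` values below are made-up numbers (not measured responses), and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

Signal = List[float]


def convolve(x: Signal, h: Signal) -> Signal:
    """Full linear convolution of x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


def binaural_downmix(
    channels: Dict[str, Signal],
    hrirs: Dict[str, Tuple[Signal, Signal]],
) -> Tuple[Signal, Signal]:
    """Filter each channel with its (left, right) impulse responses and sum per ear."""
    length = max(
        len(x) + max(len(hrirs[name][0]), len(hrirs[name][1])) - 1
        for name, x in channels.items()
    )
    left = [0.0] * length
    right = [0.0] * length
    for name, x in channels.items():
        h_left, h_right = hrirs[name]
        for n, v in enumerate(convolve(x, h_left)):
            left[n] += v
        for n, v in enumerate(convolve(x, h_right)):
            right[n] += v
    return left, right


# Assumed toy responses: full gain at the near ear, a delayed and
# attenuated copy at the far ear, and equal gain for the center channel.
TOY_HRIRS = {
    "L": ([1.0, 0.0], [0.0, 0.5]),
    "R": ([0.0, 0.5], [1.0, 0.0]),
    "C": ([0.7, 0.0], [0.7, 0.0]),
}

# Three decoded channels, two samples each, reduced to two output channels.
left, right = binaural_downmix(
    {"L": [1.0, 0.0], "R": [0.0, 0.0], "C": [0.0, 1.0]},
    TOY_HRIRS,
)
```

A practical renderer would likely use measured head-related impulse responses and a faster convolution, but the structure (filter per channel, then sum into two outputs) stays the same.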

[0040] At block 412, the method 400 transmits the binaural down-mixed version of the audio signal to the device to be played via headphones connected to the device. In an example, the binaural down-mixed version may be stored in a remotely located server. As a result, if the audio signal is subsequently selected or requested again (by the same device or a different device), the binaural down-mixed version of the audio signal may be immediately transmitted or streamed to the requesting device. At block 414, the method 400 ends.

[0041] FIG. 5 illustrates a flow diagram of an example method 500 for generating a binaural down-mixed version of an audio signal of the present disclosure. In an example, the method 500 may be performed by the devices of the system 100 illustrated in FIG. 1 or the apparatus 600 illustrated in FIG. 6, and described below.

[0042] The method 500 begins at block 502. At block 504, the method 500 determines if a device without media processing capabilities is connected to a device with media processing capabilities. For example, a mobile device connected to headphones may want to stream an audio signal via the headphones. However, the mobile device, or an application residing on the mobile device, may not have media processing capabilities. Thus, the mobile device may connect to a device with media processing capabilities (e.g., a smart speaker or a device with a binaural down-mixer).

[0043] If the answer to block 504 is no, the method 500 may proceed to block 506. At block 506, the method 500 may determine if a first time request for an audio signal is being made. For example, the device without media processing capabilities may request an audio signal to be streamed over the headphones connected to the device.

[0044] If the answer to block 506 is no, the method 500 may proceed to block 508. At block 508, the method 500 may determine if a binaural down-mixed version of the requested audio signal is available. For example, the audio signal may have been previously requested and the binaural down-mixed version of the audio signal may be stored in a remotely located server.

[0045] If the answer to block 508 is yes, the method 500 may proceed to block 510. At block 510, the method 500 may transmit the binaural down-mixed version of the audio signal to the device without media processing capabilities. The binaural down-mixed version of the audio signal may be output over two channels of the headphones connected to the device without media processing capabilities. The method 500 may then proceed to block 524.

[0046] Referring back to block 508, if the answer to block 508 is no, the method 500 may proceed to block 512. At block 512, the method 500 may determine if the server storing the audio signal can generate a binaural down-mixed version. For example, in some instances, the audio decoding and binaural down-mixing may be performed by the remotely located server that stores the audio signals (e.g., the server 108 illustrated in FIG. 1). If the answer to block 512 is yes, the method 500 may proceed to block 514.

[0047] At block 514, the method 500 may generate a binaural down-mixed version. For example, the remotely located server storing the audio signal may decode the requested audio signal and generate the binaural down-mixed version of the audio signal. The method 500 may then proceed to block 510.

[0048] Referring back to block 512, if the answer to block 512 is no, then the method 500 may proceed to block 524.

[0049] Referring back to block 506, if the answer to block 506 is yes, then the method 500 may proceed to block 512. If the answer to block 512 is yes, the method 500 may proceed to block 514, as described above. If the answer to block 512 is no, the method may proceed to block 524, as described above.

[0050] Referring back to block 504, if the answer to block 504 is yes, then the method 500 may proceed to block 516. At block 516, the method 500 may determine if the request for the audio signal is a first time request or an initial request. If the answer to block 516 is yes, the method 500 may proceed to block 518.

[0051] At block 518, the method 500 may generate a binaural down-mixed version of the requested audio signal. For example, the device with media processing capabilities may include an audio decoder and a binaural down-mixer. In an example, the audio decoding and the binaural down-mixing may be performed by a DSP.

[0052] In an example, the audio signal may be obtained from a remotely located server that stores the audio signal. The device with media processing capabilities may decode and generate the binaural down-mixed version of the requested audio signal as the audio signal is streamed from the remote server. For example, the audio signal may be buffered at the device with media processing capabilities to provide time for the decoding and down-mixing to be performed without interrupting the stream to the device without media processing capabilities.
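The buffering described above can be sketched as a generator that holds a few processed chunks in reserve before forwarding any of them. This is an illustration under assumptions: the `buffered_stream` name, the `prefill` depth, and the `process` callback (a stand-in for decoding followed by down-mixing) are hypothetical and not taken from the disclosure.

```python
from collections import deque
from typing import Callable, Iterable, Iterator, List

Chunk = List[float]


def buffered_stream(
    source: Iterable[Chunk],
    process: Callable[[Chunk], Chunk],
    prefill: int = 2,
) -> Iterator[Chunk]:
    """Yield processed chunks only once `prefill` further chunks are queued behind them."""
    buffer: deque = deque()
    for chunk in source:
        # Stand-in for decode + binaural down-mix of one chunk.
        buffer.append(process(chunk))
        if len(buffer) > prefill:
            yield buffer.popleft()
    # The source is exhausted: drain whatever is still queued.
    while buffer:
        yield buffer.popleft()


# Example with a trivial placeholder "process" step that halves each sample.
incoming = [[1.0], [2.0], [3.0]]
outgoing = list(buffered_stream(incoming, lambda c: [0.5 * v for v in c]))
```

Holding chunks in reserve gives the processing step slack when a chunk takes longer than usual, at the cost of a short start-up delay.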

[0053] In another example, the audio signal may be downloaded and temporarily stored at the device with the media processing capabilities. The device may then decode the audio signal and generate the binaural down-mixed version of the audio signal, as described above. The binaural down-mixed version of the audio signal may then be streamed to the device without media processing capabilities or transmitted to the device without media processing capabilities for output.

[0054] At block 520, the method 500 may transmit the binaural down-mixed version to the device without media processing capabilities. The binaural down-mixed version may be played over two channels of the headphones connected to the device without media processing capabilities. In an example, the binaural down-mixed version may be stored locally on the device with media processing capabilities or in the remotely located server. As a result, the binaural down-mixed version may be available for streaming or download in response to subsequent requests. The method 500 may then proceed to block 524.

[0055] Referring back to block 516, if the answer to block 516 is no, then the method 500 may proceed to block 522. At block 522, the method 500 may determine if the binaural down-mixed version was previously stored. For example, the binaural down-mixed version of popular audio signals that are frequently requested may be stored for subsequent requests. On the other hand, binaural down-mixed versions of audio signals that are infrequently requested may not be stored to save memory space at the remote server or the device with media processing capabilities.
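One way to picture this storage policy is a store that counts requests and keeps a down-mixed version only once a signal has been requested often enough. The disclosure does not define what counts as frequently requested, so the `min_requests` threshold, the `DownmixStore` class, and the signal identifiers below are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

Stereo = Tuple[List[float], List[float]]


class DownmixStore:
    """Keeps down-mixed versions only for signals requested at least `min_requests` times."""

    def __init__(self, min_requests: int = 2) -> None:
        self.min_requests = min_requests
        self.request_counts: Counter = Counter()
        self.stored: Dict[str, Stereo] = {}

    def record_request(self, signal_id: str) -> None:
        self.request_counts[signal_id] += 1

    def maybe_store(self, signal_id: str, downmix: Stereo) -> bool:
        """Store the down-mix if the signal is popular enough; report whether it was stored."""
        if self.request_counts[signal_id] >= self.min_requests:
            self.stored[signal_id] = downmix
            return True
        return False

    def lookup(self, signal_id: str) -> Optional[Stereo]:
        """Return the stored down-mix, or None (the block 522 'no' branch)."""
        return self.stored.get(signal_id)


store = DownmixStore(min_requests=2)
downmix = ([0.1, 0.2], [0.2, 0.1])

store.record_request("movie-a")
stored_on_first = store.maybe_store("movie-a", downmix)   # below the threshold

store.record_request("movie-a")
stored_on_second = store.maybe_store("movie-a", downmix)  # threshold reached
```

A fixed count is only one possible policy; a time-windowed count or an eviction rule would serve the same goal of saving memory space.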

[0056] If the answer to block 522 is no, the method 500 may proceed to the block 518. The method 500 may proceed from block 518, as discussed above. If the answer to block 522 is yes, then the method 500 may proceed to block 520 to transmit the binaural down-mixed version of the audio signal to the device without media processing capabilities. The method 500 may then proceed to block 524. At block 524, the method 500 ends.
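The branching of the method 500 can be summarized as a small function that returns the sequence of blocks visited for a given set of answers. The block numbers come from the description of FIG. 5; the function name and parameter names are hypothetical, and the sketch only traces the path rather than performing any decoding or transmission.

```python
from typing import List


def method_500_path(
    connected_to_processor: bool,  # answer at block 504
    first_time_request: bool,      # answer at block 506 or block 516
    downmix_stored: bool,          # answer at block 508 or block 522
    server_can_generate: bool,     # answer at block 512
) -> List[int]:
    """Return the block numbers visited by the method 500, in order."""
    path = [502, 504]
    if connected_to_processor:
        path.append(516)
        if not first_time_request:
            path.append(522)
        if first_time_request or not downmix_stored:
            path.append(518)  # generate on the device with media processing
        path.append(520)      # transmit to the device without media processing
    else:
        path.append(506)
        need_generation = first_time_request
        if not first_time_request:
            path.append(508)
            need_generation = not downmix_stored
        if need_generation:
            path.append(512)
            if not server_can_generate:
                path.append(524)  # nothing can be generated; the method ends
                return path
            path.append(514)      # the server generates the down-mix
        path.append(510)          # transmit to the device without media processing
    path.append(524)
    return path


first_local = method_500_path(True, True, False, False)
repeat_remote = method_500_path(False, False, True, False)
```

Here `first_local` traces a first-time request handled by a connected device with media processing, and `repeat_remote` traces a repeat request served from the stored version on the server.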

[0057] FIG. 6 illustrates an example of an apparatus 600. In an example, the apparatus 600 may be the apparatus 100. In an example, the apparatus 600 may include a processor 602 and a non-transitory computer readable storage medium 604. The non-transitory computer readable storage medium 604 may include instructions 606, 608, 610, and 612 that, when executed by the processor 602, cause the processor 602 to perform various functions.

[0058] In an example, the instructions 606 may include instructions to receive a first time request for an audio signal on a device without media processing capabilities. The audio signal may be a version that cannot be played on the device. The device may not have media processing capabilities to modify the audio signal into a format that is compatible with the device. The instructions 608 may include instructions to identify a connected device with media processing capabilities. For example, the connected device may be a smart speaker that can generate a binaural down-mixed version of an audio signal. The instructions 610 may include instructions to instruct the connected device with media processing capabilities to generate a binaural down-mixed version of the audio signal. The instructions 612 may include instructions to receive the binaural down-mixed version from the connected device with media processing capabilities.
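The four sets of instructions can be pictured from the requesting device's side as follows. This is a sketch under assumptions: the `RequestingDevice` class, the peer names, and the representation of a connected device as a callable that returns a (left, right) pair are all hypothetical, and the placeholder smart speaker returns silence rather than a real down-mix.

```python
from typing import Callable, Dict, List, Optional, Tuple

Stereo = Tuple[List[float], List[float]]
DownmixService = Callable[[str], Stereo]


class RequestingDevice:
    """Hypothetical model of the device without media processing capabilities."""

    def __init__(self, peers: Dict[str, Optional[DownmixService]]) -> None:
        # Maps a connected peer's name to its down-mix service,
        # or to None when that peer has no media processing capabilities.
        self.peers = peers

    def find_processing_peer(self) -> Optional[str]:
        """Instructions 608: identify a connected device with media processing."""
        for name, service in self.peers.items():
            if service is not None:
                return name
        return None

    def request_audio(self, signal_id: str) -> Optional[Stereo]:
        """Instructions 606, 610, 612: accept the request, delegate it, receive the result."""
        name = self.find_processing_peer()
        if name is None:
            return None
        service = self.peers[name]
        if service is None:
            return None
        return service(signal_id)


def smart_speaker(signal_id: str) -> Stereo:
    """Placeholder service: returns two silent channels instead of a real down-mix."""
    return ([0.0, 0.0], [0.0, 0.0])


device = RequestingDevice({"wireless-keyboard": None, "smart-speaker": smart_speaker})
chosen_peer = device.find_processing_peer()
result = device.request_audio("movie-a")
```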

[0059] It will be appreciated that variants of the above-disclosed and other features and functions, or alternatives thereof, may be combined into many other different systems or applications. Various presently unforeseen or unanticipated alternatives, modifications, or variations therein may be subsequently made by those skilled in the art which are also intended to be encompassed by the following claims.

Claims

1. An apparatus, comprising: an audio decoder to decode an audio signal to remove a coding applied to the audio signal; a binaural down-mixer to generate a binaural down-mixed version of the audio signal; a communication interface to communicatively connect to a device connected to headphones; and a processor to detect that the device is without media processing capabilities and to transmit the binaural down-mixed version of the audio signal to the device, wherein the binaural down-mixed version of the audio signal is to be outputted via the headphones.

2. The apparatus of claim 1, wherein the audio decoder comprises a digital signal processor.

3. The apparatus of claim 1, wherein the binaural down-mixer comprises a digital signal processor.

4. The apparatus of claim 1, wherein the apparatus comprises a smart speaker communicatively coupled to a remotely located server.

5. The apparatus of claim 1, wherein the apparatus comprises a remotely located server that is to store a plurality of audio signals.

6. A method, comprising: detecting a connection to a device without media processing capabilities; receiving a request for an audio signal from the device; decoding the audio signal to remove a coding applied to the audio signal; generating a binaural down-mixed version of the audio signal; and transmitting the binaural down-mixed version of the audio signal to the device to be played via headphones connected to the device.

7. The method of claim 6, wherein the coding comprises at least one of: Moving Picture Experts Group (MPEG) encoding, Dolby Atmos encoding, or Digital Theater Systems (DTS) encoding.

8. The method of claim 6, wherein the binaural down-mixed version of the audio signal comprises a spatial rendering for two channels.

9. The method of claim 6, further comprising: storing the binaural down-mixed version of the audio signal in a remotely located server.

10. The method of claim 9, further comprising: receiving a selection for the audio signal; determining that the binaural down-mixed version of the audio signal is stored in the remotely located server; and transmitting the binaural down-mixed version of the audio signal.

11. A non-transitory computer readable storage medium encoded with instructions executable by a processor, the non-transitory computer-readable storage medium comprising: instructions to receive a first time request for an audio signal on a device without media processing capabilities; instructions to identify a connected device with media processing capabilities; instructions to instruct the connected device with media processing capabilities to generate a binaural down-mixed version of the audio signal; and instructions to receive the binaural down-mixed version from the connected device with media processing capabilities.

12. The non-transitory computer readable storage medium of claim 11, further comprising: instructions to store the binaural down-mixed version of the audio signal on a remotely located server.

13. The non-transitory computer readable storage medium of claim 12, further comprising: instructions to transmit the binaural down-mixed version of the audio signal on subsequent requests for the audio signal.

14. The non-transitory computer readable storage medium of claim 11, wherein the connected device comprises a smart speaker.

15. The non-transitory computer readable storage medium of claim 11, wherein the instructions to receive are performed via a short range wireless connection to the connected device.