EP3228096B1 - Audio terminal - Google Patents

Audio terminal

Info

Publication number
EP3228096B1
EP3228096B1 (application EP14777648.8A)
Authority
EP
European Patent Office
Prior art keywords
audio
binaural
channel
terminal
signal
Prior art date
Legal status
Active
Application number
EP14777648.8A
Other languages
German (de)
French (fr)
Other versions
EP3228096A1 (en)
Inventor
Detlef Wiese
Lars IMMISCH
Hauke Krüger
Current Assignee
Binauric Se
Original Assignee
Binauric Se
Priority date
Filing date
Publication date
Application filed by Binauric Se filed Critical Binauric Se
Publication of EP3228096A1 publication Critical patent/EP3228096A1/en
Application granted granted Critical
Publication of EP3228096B1 publication Critical patent/EP3228096B1/en

Classifications

    • H04S 3/002 (Stereophonic systems; non-adaptive circuits, e.g. manually adjustable or static, for enhancing the sound image or the spatial distribution)
    • H04S 3/004 (for headphones)
    • H04R 2420/07 (applications of wireless loudspeakers or wireless microphones)
    • H04R 5/04 (circuit arrangements, e.g. for selective connection of amplifier inputs/outputs to loudspeakers, for loudspeaker detection, or for adaptation of settings to personal preferences or hearing impairments)
    • H04S 2400/15 (aspects of sound capture and related signal processing for recording or reproduction)
    • H04S 2420/01 (enhancing the perception of the sound image or of the spatial distribution using head related transfer functions [HRTF's] or equivalents thereof, e.g. interaural time difference [ITD] or interaural level difference [ILD])

Definitions

  • the present invention generally relates to the field of audio data processing. More particularly, the present invention relates to an audio terminal.
  • Everybody uses a telephone - either using a wired telephone connected to the well-known PSTN (Public Switched Telephone Network) via cable or a modern mobile phone, such as a smartphone, which is connected to the world via wireless connections based on, e.g., UMTS (Universal Mobile Telecommunications System).
  • PSTN Public Switched Telephone Network
  • UMTS Universal Mobile Telecommunications System
  • VoIP voice over internet protocol
  • Companies such as Skype or Google offer services which employ novel speech codecs offering so-called HD-Voice quality (see, e.g., 3GPP TS 26.190, "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions", 3GPP Technical Specification Group Services and System Aspects, 2001 ; ITU-T G.722.2, "Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", 2002 ).
  • AMR-WB Adaptive Multi-Rate Wideband
  • speech signals cover a frequency bandwidth between 50 Hz and 7 kHz (so-called “wideband speech”) and even more, for instance, a frequency bandwidth between 50 Hz and 14 kHz (so-called “super-wideband speech”) (see 3GPP TS 26.290, "Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions", 3GPP Technical Specification Group Services and System Aspects, 2005 ) or an even higher frequency bandwidth (e.g., "full band speech”).
  • AMR-WB+ Extended Adaptive Multi-Rate - Wideband
  • Audio-3D - also denoted as binaural communication - is expected by the present inventors to be the next emerging technology in communication.
  • the benefit of Audio-3D in comparison to conventional (HD-)Voice communication lies in the use of a binaural instead of a monaural audio signal. Audio contents will be captured and played back by novel binaural terminals involving two microphones and two speakers, yielding an acoustical reproduction that better resembles what the remote communication partner really hears.
  • binaural telephony is "listening to the audio ambience with the ears of the remote speaker", wherein the pure content of the recorded speech is extended by the capturing of the acoustical ambience.
  • the virtual representation of room acoustics in binaural signals is, preferably, based on differences in the time of arrival of the signals reaching the left and the right ear as well as attenuation and filtering effects caused by the human head, the body and the ears allowing the location of sources also in vertical direction.
  • Audio-3D is expected to represent the first radical change of the form of audio communication that has been known for more than 100 years, which society has named telephone or phoning. It particularly targets a new mobile type of communication which may be called "audio portation".
  • Everybody equipped with a future binaural terminal as well as a smartphone app to handle the communication will be able to effectively capture the acoustical environment, i.e., the acoustical events of real life, preferably as they are perceived with the two ears of the user, and provide them as captured, like a listening picture, to another user anywhere in the world.
  • Audio-3D communication partners will no longer feel distant (or, at least, less distant) and, eventually, it will even result in a reduction of traveling - which indeed is an intelligent economic and ecological approach.
  • the present invention has been made in view of the above situation and considerations and embodiments of the present invention aim at providing technology that may be used in various Audio-3D usage scenarios.
  • the term "binaural” or “binaurally” is not used in an as strict sense as in some publications, where only audio signals captured with an artificial head (also called “ Kunststoffkopf” ) are considered truly binaural. Rather the term is used here for audio any signals that compared to a conventional stereo signal more closely resemble the acoustical ambience as it would be perceived by a real human. Such audio signals may be captured, for instance, by the audio terminals described in more detail in sections 3 to 9 below.
  • WO 2012/061148 A1 discloses systems, methods and apparatuses for detecting head movement based on recorded sound signals.
  • US 2011/0280409 A1 discloses that a personalized hearing profile is generated for an ear-level device comprising a memory, microphone, speaker and processor. Communication is established between the ear-level device and a companion device, having a user interface. A frame of reference in the user interface is provided, where positions in the frame of reference are associated with sound profile data. A position on the frame of reference is determined in response to user interaction with the user interface, and certain sound profile data associated with the position. Certain data is transmitted to the ear level device. Sound can be generated through the speaker based upon the audio stream data to provide real-time feedback to the user. The determining and transmitting steps are repeated until detection of an end event.
  • an audio system according to claim 1 is presented.
  • the conference bridge is adapted to monaurally mix the multi-channel audio data streamed from the first and the second audio terminal to the multi-channel audio data streamed from the third audio terminal to generate the multi-channel audio mix.
  • the conference bridge is further adapted to spatially position the monaurally mixed multi-channel audio data streamed from the first and the second audio terminal when generating the multi-channel audio mix.
  • the audio system further comprises a telephone comprising a microphone and a speaker
  • the conference bridge is further connectable with the telephone, wherein the conference bridge is adapted to mix the multi-channel audio data streamed from the first and the second audio terminal into a single-channel audio mix comprising a single audio channel and to stream the single-channel audio mix to the telephone.
  • A basic configuration of an audio terminal 100 that may be used for Audio-3D is schematically and exemplarily shown in Fig. 1.
  • the audio terminal 100 comprises a first device 10 and a second device 20 which is separate from the first device 10.
  • In the first device 10, there are provided a first and a second microphone 11, 12 for capturing multi-channel audio data comprising a first and a second audio channel.
  • In the second device 20, there is provided a communication unit 21 for, here, voice and data communication.
  • the first and the second device 10, 20 are adapted to be connected with each other via a local wireless transmission link 30.
  • the first device 10 is adapted to stream the multi-channel audio data, i.e., the data comprising the first and the second audio channel, to the second device 20 via the local wireless transmission link 30 and the second device 20 is adapted to receive and process and/or store the multi-channel audio data streamed from the first device 10.
  • the first device 10 is an external speaker/microphone apparatus as described in detail in the unpublished International patent application PCT/EP2013/067534, filed on 23 August 2013 .
  • it comprises a housing 17 that is formed in the shape of a (regular) icosahedron, i.e., a polyhedron with 20 triangular faces.
  • Such an external speaker/microphone apparatus, in this specification also designated as a "speakerbox", is marketed by the company Binauric SE under the name "BoomBoom".
  • the first and the second microphone 11, 12 are arranged at opposite sides of the housing 17, at a distance of, for example, about 12.5 cm.
  • the multi-channel audio data captured by the two microphones 11, 12 can more closely resemble the acoustical ambience as it would be perceived by a real human (compared to a conventional stereo signal).
  • the audio terminal 100 here, in particular, the first device 10, further comprises a first and a second speaker 15, 16 for playing back multi-channel audio data comprising at least a first and a second audio channel.
  • the audio terminal 100 is adapted to stream the multi-channel audio data from the second device 20 to the first device 10 via a local wireless transmission link, for instance, a transmission link complying with the Bluetooth standard, preferably the current Bluetooth Core Specification 4.1.
  • the second device 20, here, is a smartphone, such as an Apple iPhone or a Samsung Galaxy.
  • the data communication unit 21 supports voice and data communication via one or more mobile communication standards, such as GSM (Global System for Mobile Communication), UMTS (Universal Mobile Telecommunications System) or LTE (Long-Term Evolution). Additionally, it may support one or more further network technologies, such as WLAN (Wireless LAN).
  • GSM Global System for Mobile Communication
  • UMTS Universal Mobile Telecommunications System
  • LTE Long-Term Evolution
  • WLAN Wireless LAN
  • the audio terminal 100 here, in particular, the first device 10, further comprises a third and a fourth microphone 13, 14 for capturing further multi-channel audio data comprising a third and a fourth audio channel.
  • the third and the fourth microphone 13, 14 are provided on a same side of the housing 17, at a distance of, for example, about 1.8 cm.
  • these microphones can be used to better classify audio capturing situations (e.g., the direction of arrival of the audio signals) and may thereby support stereo enhancement.
  • the third and the fourth microphone 13, 14 of each of the two speakerboxes may be used to locate the position of the speakerboxes for allowing True Wireless Stereo in combination with stereo crosstalk cancellation (see below for details).
  • Further options for using the third and the fourth microphone 13, 14 are to capture the acoustical ambience for reducing background noise with a noise cancelling algorithm (near speaker to far speaker), to measure the ambient volume level for automatically adjusting the playback level (loudness of music, voice prompts and far speaker) to a convenient listening level, e.g., a lower volume late at night in the bedroom or a louder playback in a noisy environment, and/or to detect the direction of sound sources (for example, a beamformer could focus on near speakers and attenuate unwanted sources more efficiently).
  • the local wireless transmission link 30, here, is a transmission link complying with the Bluetooth standard, preferably the current Bluetooth Core Specification 4.1.
  • the standard provides a large number of different Bluetooth "profiles" (currently over 35), which are specifications regarding a certain aspect of a Bluetooth-based wireless communication between devices.
  • One of the profiles is the so-called Advanced Audio Distribution Profile (A2DP), which describes how stereo-quality audio data can be streamed from an audio source to an audio sink. This profile could, in principle, be used to also stream binaurally recorded audio data.
  • A2DP Advanced Audio Distribution Profile
  • HFP Hands-Free Profile
  • the multi-channel audio data are streamed according to the present invention using the Bluetooth Serial Port Profile (SPP) or the iPod Accessory Protocol (iAP).
  • SPP defines how to set up virtual serial ports and connect two Bluetooth enabled devices. It is based on 3GPP TS 07.10, "Terminal Equipment to Mobile Station (TE-MS) multiplexer protocol", 3GPP Technical Specification Group Terminals, 1997 and the RFCOMM protocol. It basically emulates a serial cable to provide a simple substitute for existing RS-232, including the control signals known from that technology.
  • SPP is supported, for example, by Android based smartphones, such as a Samsung Galaxy. For iOS based devices, such as the Apple iPhone, iAP provides a similar protocol that is likewise based on both 3GPP TS 07.10 and RFCOMM.
  • the synchronization between the first and the second audio channel is maintained as far as possible during the transmission, since any synchronization problems may destroy the binaural cues or at least lead to the impression of moving audio sources. For instance, at a sampling rate of 48 kHz, the delay between the left and the right ear is limited to about 25 to 30 samples if the audio signal arrives from one side.
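  • As a plausibility check for this figure (added here for illustration; the head dimensions are textbook values, not taken from the patent), the maximum interaural time difference follows from the extra acoustic path around the head:

```latex
\mathrm{ITD}_{\max} \approx \frac{d}{c}
  \approx \frac{0.21\,\mathrm{m}}{343\,\mathrm{m/s}}
  \approx 0.61\,\mathrm{ms},
\qquad
0.61\,\mathrm{ms} \times 48\,\mathrm{kHz} \approx 29\ \text{samples},
```

    where d denotes the effective path difference between the two ears and c the speed of sound, consistent with the 25 to 30 samples stated above.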
  • one preferred solution is to transmit synchronized audio data from each of the first and the second channel together in the same packet, ensuring that the synchronization between the audio data is not lost during transmission.
  • samples from the first and the second audio channel may preferably be packed into one packet for each segment; hence, there is no chance of deviation. Moreover, it is preferred that the audio data of the first and the second audio channel are generated by the first and the second microphone 11, 12 on the basis of the same clock or a common clock reference in order to ensure a substantially zero sample rate deviation.
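  • A minimal sketch of this per-packet interleaving (Python; the header layout and helper names are hypothetical, only the principle of shipping time-aligned channels in one packet is taken from the description above):

```python
import struct

def pack_segment(left: bytes, right: bytes, seq: int) -> bytes:
    """Pack time-aligned left/right PCM segments into a single packet.

    Because both channels of one time segment always travel together,
    the receiver can never lose the inter-channel alignment that carries
    the binaural cues, even if whole packets are lost or reordered.
    """
    assert len(left) == len(right), "channels must stay sample-aligned"
    header = struct.pack(">IH", seq, len(left))  # sequence number, bytes per channel
    return header + left + right

def unpack_segment(packet: bytes):
    seq, n = struct.unpack_from(">IH", packet)
    left = packet[6:6 + n]
    right = packet[6 + n:6 + 2 * n]
    return seq, left, right
```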
  • SBC Low Complexity Subband Coding
  • the first device 10 is an external speaker/microphone apparatus, which comprises a housing 17 that is formed in the shape of a (regular) icosahedron.
  • the first device 10 may also be something else.
  • the housing may be formed substantially in a U-shape for being worn by a user on the shoulders around the neck; such a device is in this specification also designated as a "shoulderspeaker" (not shown in the figures).
  • at least a first and a second microphone for capturing multi-channel audio data comprising a first and a second audio channel may be provided at the sides of the "legs" of the U-shape, at a distance of, for example, about 20 cm.
  • the first device may be an external speaker/microphone apparatus that is configured as an over- or on-the-ear headset, as an in-ear phone or that is arranged on glasses worn by the user.
  • the captured multi-channel audio data comprising a first and a second audio channel may provide a better approximation of what a real human would hear than a conventional stereo signal, wherein the resemblance may become particularly good if the microphones are arranged as close as possible to (or even within) the ears of the user, as is possible, e.g., with headphones and in-ear phones.
  • the microphones may preferably be provided with structures that resemble the form of the human outer and/or inner ears.
  • the audio terminal 100 here, in particular, the first device 10, may also comprise an accelerometer (not shown in the figures) for measuring an acceleration and/or gravity thereof.
  • the audio terminal 100 is preferably adapted to control a function in dependence of the measured acceleration and/or gravity. For instance, it can be foreseen that the user can power up (switch on) the first device 10 by simply shaking it.
  • the audio terminal 100 can also be adapted to determine a misplacement thereof in dependence of the measured acceleration and/or gravity. For instance, it can be foreseen that the audio terminal 100 can determine whether the first device 10 is placed with an orientation that is generally suited for providing a good audio capturing performance.
  • the audio terminal 100 may comprise, in some scenarios, at least one additional second device (shown in a smaller size at the top of the figure) or, more generally, at least one further speaker for playing back audio data comprising at least a first audio channel, provided in a device that is separate from the first device 10.
  • While the second device 20, here, is a smartphone, it may also be, for example, a tablet PC, a stationary PC or a notebook with WLAN support, etc.
  • the audio terminal 100 preferably allows over-the-air flash updates and device control of the first device 10 from the second device 20 (including updates for voice prompts used to notify status information and the like to a user) over a reliable Bluetooth protocol.
  • For an Android based smartphone, such as a Samsung Galaxy, a custom RFCOMM Bluetooth service will preferably be used.
  • For an iOS based device, such as the Apple iPhone, the External Accessory Framework is preferably utilized. It is foreseen that the first device 10 supports at most two simultaneous control connections, be it to an Android based device or an iOS based device. If both are already connected, further control connections will preferably be rejected.
  • the iOS Extended Accessory protocol identifier may, for example, be a simple string like com.binauric.bconfig.
  • a custom service UUID of, for example, 0x5dd9a71c3c6341c6a3572929b4da78b1 may be used.
  • the speakerbox, here, comprises a virtual machine (VM) application executing at least part of the operations, as well as one or more flash memories.
  • VM virtual machine
  • each message consists of a tag (16 bit, unsigned), followed by a length (16 bit, unsigned) and then the optional payload.
  • the length is always the size of the entire message in bytes, including the TL header. All integer values are preferably big-endian.
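  • The framing just described can be sketched as follows (Python; a non-authoritative reading that takes the length field to count the entire message including the 4-byte tag/length header, all values big-endian):

```python
import struct

TL = struct.Struct(">HH")  # tag (uint16), length (uint16), big-endian

def encode_message(tag: int, payload: bytes = b"") -> bytes:
    # The length field covers the whole message including the TL header itself.
    return TL.pack(tag, TL.size + len(payload)) + payload

def decode_message(buf: bytes):
    """Split one message off the front of a byte stream."""
    tag, length = TL.unpack_from(buf)
    return tag, buf[TL.size:length], buf[length:]  # tag, payload, remainder

# Example: STATUS_REQUEST carries tag 1 and no payload (see Table 1).
status_request = encode_message(1)
```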
  • the OTA control operations preferably start at HASH_REQUEST and work on 8 KByte sectors.
  • the protocol is inspired by rsync: Before transmitting flash updates, applications should compute the number of changed sectors by retrieving the hashes of all sectors, and then only transmit sectors that need updating.
  • Flash updates go to a secondary flash memory, which is used to update the primary flash only once its contents are confirmed to be correct.
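  • The rsync-like flow can be sketched as follows (Python; get_sector_hashes, erase_sectors and write_sector are hypothetical stand-ins for the HASH, ERASE and WRITE requests defined in the tables below):

```python
SECTOR_SIZE = 8 * 1024   # flash sectors are 8 KByte
NUM_SECTORS = 251

def changed_sectors(image: bytes, get_sector_hashes, hash64) -> list:
    """Determine which flash sectors actually differ from the new image.

    Comparing 64-bit sector hashes first keeps the over-the-air transfer
    limited to the sectors that really changed, exactly as rsync does.
    """
    remote = get_sector_hashes(range(NUM_SECTORS))  # via HASH_REQUEST
    changed = []
    for i in range(NUM_SECTORS):
        sector = image[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE]
        if hash64(sector) != remote[i]:
            changed.append(i)
    return changed

def update(image, get_sector_hashes, erase_sectors, write_sector, hash64):
    todo = changed_sectors(image, get_sector_hashes, hash64)
    erase_sectors(todo)                              # via ERASE_REQUEST
    for i in todo:                                   # via WRITE_REQUEST(s)
        write_sector(i, image[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE])
    # Finally, EXIT_OTA_MODE_REQUEST with the full-image hash lets the device
    # validate the secondary flash before committing it to the primary flash.
```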
  • Table 1: Requests and responses
    Request/Response | Tag | Payload | Comment
    STATUS_REQUEST | 1 | - |
    STATUS_RESPONSE | 2 | + | Signal strength, battery level, and accelerometer data
    VERSION_REQUEST | 3 | - |
    VERSION_RESPONSE | 4 | + | Firmware and prompt versions, language and variant
    SET_NAME_REQUEST | 5 | + | Set the name of the device
    SET_NAME_RESPONSE | 6 | + |
    STEREO_PAIR_REQUEST | 16 | + | Start special pairing for stereo operation (master)
    STEREO_PAIR_RESPONSE | 17 | + |
    STEREO_UNPAIR_REQUEST | 18 | - | Remove special stereo pairing (should be sent to master and slave)
    STEREO_UNPAIR_RESPONSE | 19 | + |
    COUNT_PAIRED_DEVICE_REQUEST | 128 | - |
    COUNT_PAIRED_DEVICE_RESPONSE | 129 | + | Returns the number of devices in the paired device list.
    BINAURAL_RECORD_START_REQUEST | … | - | Starts the automatic sending of BINAURAL_RECORD_AUDIO_RESPONSE packets
    BINAURAL_RECORD_START_RESPONSE | 401 | + |
    BINAURAL_RECORD_STOP_REQUEST | 402 | - | Stops the automatic sending of binaural audio packets
    BINAURAL_RECORD_STOP_RESPONSE | 403 | + |
    BINAURAL_RECORD_AUDIO_RESPONSE | 405 | + | Unsolicited packets containing SBC encoded audio data.
  • Table 2 enumerates the status codes.
  • Table 2: Status codes
    Name | Code
    SUCCESS | 0
    INVALID_ARGUMENT | 1
    NOT_SUPPORTED | 3
    OTA_ERROR | 4
    PAIRED_DEV_DELETE_FAIL | 5
    STATUS_VALUE_READ_FAIL | 6
    OTA_ERROR_LOW_BATTERY | 7
    OPERATION_NOT_ALLOWED | 8
    UNKNOWN_ERROR | 0xFFFF
  • Table 3 illustrates the response to the STATUS_REQUEST, which has no parameters. It returns the current signal strength, battery level and accelerometer data.
  • Table 3: STATUS_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Signal strength | uint16 | The signal strength
    Battery level | uint16 | The battery level as integer percent value between 0 and 100.
    Charging status | uint16 | One of: power_charger_disconnected (0), power_charger_disabled (1), power_charger_trickle (2), power_charger_fast (3), power_charger_boost_internal (4), power_charger_boost_external (5), power_charger_complete (6)
    Accelerometer x-value | int16 | Acceleration in thousandths of a g
    Accelerometer y-value | int16 | See above
    Accelerometer z-value | int16 | See above
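  • Reading Table 3 together with the big-endian rule above, a response parser might look as follows (Python; the exact byte layout is an assumption derived from the listed field order and sizes):

```python
import struct

# status, signal strength, battery level, charging status (uint16 each),
# then x/y/z acceleration as signed thousandths of g (int16 each).
STATUS_RESPONSE = struct.Struct(">HHHHhhh")

def parse_status_response(payload: bytes) -> dict:
    status, rssi, battery, charging, ax, ay, az = STATUS_RESPONSE.unpack(payload)
    return {
        "status": status,                  # see Table 2
        "signal_strength": rssi,
        "battery_percent": battery,        # integer percent, 0..100
        "charging_status": charging,       # enumeration from Table 3
        "accel_g": (ax / 1000.0, ay / 1000.0, az / 1000.0),
    }
```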
  • Table 4 illustrates the response to the VERSION_REQUEST, which has no parameters. All strings in this response need only be null terminated if their values are shorter than their maximum length.
  • Table 4: VERSION_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Version | 40 byte string | The git tag (or hash if the version is not tagged).
    Prompt version | 40 byte string |
    Prompt variant | 32 byte string | The name of the variant (the name of the speaker or, more generally, the name of the prompt set)
  • Table 5 illustrates the SET_NAME_REQUEST.
  • This request allows setting the name of the speaker.
  • Table 5: SET_NAME_REQUEST
    Parameter Name | Size | Comment
    Name | 31 byte string | New name of the speaker
  • Table 6 illustrates the response to the SET_NAME_REQUEST.
  • Table 6: SET_NAME_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 7 illustrates the STEREO_PAIR_REQUEST. This request initiates the special pairing of two speakerboxes for stereo mode (True Wireless Stereo; TWS), which will be described in more detail below. It needs to be sent to both speakerboxes, in different roles. The decision which speakerbox is master and which is slave is arbitrary. The master device will become the right channel.
    Table 7: STEREO_PAIR_REQUEST
    Parameter Name | Size | Comment
    BT address | 6 byte Bluetooth address | The address of the device to pair with
    Role | uint16 | 0: slave, 1: master
  • Table 8 illustrates the response to the STEREO_PAIR_REQUEST, which has no parameters.
  • Table 8: STEREO_PAIR_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 9 illustrates the response to the STEREO_UNPAIR_REQUEST, which has no parameters. It must be sent to both the master and the slave.
  • Table 9: STEREO_UNPAIR_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 10 illustrates the response to the COUNT_PAIRED_DEVICE_REQUEST, which has no parameters. It returns the number of paired devices.
  • Table 10: COUNT_PAIRED_DEVICE_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Count | uint16 | The number of paired devices
  • Table 11 illustrates the PAIRED_DEVICE_REQUEST. It allows requesting information about a paired device from the speakerbox.
  • Table 11: PAIRED_DEVICE_REQUEST
    Parameter Name | Size | Comment
    Index | uint16 | Index of the paired device to retrieve information on. Must be between 0 and count-1.
  • Table 12 illustrates the response to the PAIRED_DEVICE_REQUEST.
  • the smartphone's app needs to send this request for each paired device it is interested in. If, for some reason, the read of the requested information fails, the speakerbox will return a PAIRED_DEVICE_RESPONSE with just the status field. The remaining fields specified below will not be included in the response packet. Therefore, the actual length of the packet will vary depending on whether the required information can be supplied.
  • Table 12: PAIRED_DEVICE_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Index | uint16 | The index of the paired device
    BT address | 6 byte Bluetooth address | The address of the paired device
    Device class | 4 bytes | The most significant byte is zero, followed by 3 bytes with the BT Class of Device. If the class of device for that device is not available, it will be set to all zeroes.
    Name | 31 byte string, null terminated | The name of the device that was used for pairing
  • Table 13 illustrates the DELETE_PAIRED_DEVICE_REQUEST. It allows deleting paired devices from the speakerbox. It is permissible to delete the currently connected device, but this will make it necessary to pair with the current device again the next time the user connects to it. If no Bluetooth address is included in this request, all paired devices will be deleted.
    Table 13: DELETE_PAIRED_DEVICE_REQUEST
    Parameter Name | Size | Comment
    Device | single 6 byte Bluetooth address | Bluetooth address of the device to be removed from the paired device list.
  • Table 14 illustrates the response to the DELETE_PAIRED_DEVICE_REQUEST.
  • Table 14: DELETE_PAIRED_DEVICE_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 15 illustrates the ENTER_OTA_MODE_RESPONSE. The corresponding request puts the device in OTA mode. The firmware will drop all other profile links, thus stopping, e.g., the playback of music.
  • Table 16 illustrates the EXIT_OTA_MODE_REQUEST. If the payload of the request is non-zero in length, the requester wants to write the new flash contents to the primary flash. To avoid bricking the device, this operation must only succeed if the flash image hash can be validated. If the payload of the request is zero in length, the requester just wants to exit OTA mode and continue without updating any flash contents.
    Table 16: EXIT_OTA_MODE_REQUEST
    Parameter Name | Type | Comment
    Complete flash image hash | uint64 | A 64-bit hash of the complete 15.69 Mbit of flash. This is an extra sanity check. If the hash doesn't match, the primary flash update operation will not occur. If this field isn't present, OTA mode is exited without updating flash contents.
  • Table 17 illustrates the response to the EXIT_OTA_MODE_REQUEST.
  • Table 17: EXIT_OTA_MODE_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes. OTA_ERROR is returned if the hash does not match; SUCCESS is returned for a matching hash, and also for "exit OTA mode and continue without updating flash contents".
  • the EXIT_OTA_COMPLETE_REQUEST will shut down the Bluetooth transport link and kick the PIC to carry out the CSR8670 internal flash update operation. This message will only be acted upon if it follows an EXIT_OTA_MODE_RESPONSE with SUCCESS ("matching hash").
  • Table 18 illustrates the HASH_REQUEST. It requests the hash values for a number of sectors. The requester should not request more sectors than can fit in a single response packet.
  • Table 18: HASH_REQUEST
    Parameter Name | Type | Comment
    Sector Map | uint256 | A bit field of 251 bits (1 bit per flash sector). Sectors are 8 KByte in size. A bit being set indicates that a hash for that sector is requested. Note: Bit 0 of the 32nd byte equates to sector 0.
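  • The sector map encoding, including the note that bit 0 of the 32nd byte is sector 0, can be expressed as follows (Python; treating the uint256 as a big-endian integer is an interpretation of that note, not a statement from the patent):

```python
def build_sector_map(sectors) -> bytes:
    """Build the 32-byte (uint256) sector bit field used by HASH_REQUEST
    and ERASE_REQUEST.

    With the field read as one big-endian 256-bit integer, "bit 0 of the
    32nd byte" is simply bit 0 of the integer, so sector i maps to bit i.
    """
    bits = 0
    for s in sectors:
        if not 0 <= s <= 250:
            raise ValueError("only 251 flash sectors exist")
        bits |= 1 << s
    return bits.to_bytes(32, "big")

def sectors_from_map(field: bytes) -> list:
    bits = int.from_bytes(field, "big")
    return [i for i in range(251) if bits >> i & 1]
```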
  • Table 19 illustrates the response to the HASH_REQUEST.
  • Table 19: HASH_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
    Hashes | array of uint64 | The requested hash values
  • Table 20 illustrates the READ_REQUEST. It requests a read of data from flash. Each sector will be read in small chunks so as not to exceed the maximum response packet size of 128 bytes.
  • Table 20: READ_REQUEST
    Parameter Name | Type | Comment
    Sector number | uint16 | The sector (0-250) where the data should be read from. Each sector is 8 KByte in size.
    Sector Offset | uint16 | An offset from the start of the sector to read from. This offset is in units of uint16's.
    Count | uint16 | Number of words (uint16's) to read
  • Table 21 illustrates the response to the READ_REQUEST.
  • Table 21: READ_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
    Data | uint16 array | The data read from the flash
  • Table 22 illustrates the ERASE_REQUEST. It requests a set of flash sectors to be erased.
  • Table 22: ERASE_REQUEST
    Parameter Name | Type | Comment
    Sector Map | uint256 | A bit field of 251 bits (1 bit per flash sector). Sectors are 8 KByte in size. A bit being set indicates that the sector should be erased. Note: Bit 0 of the 32nd byte equates to sector 0.
  • Table 23 illustrates the response to the ERASE_REQUEST.
  • Table 23: ERASE_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 24 illustrates the WRITE_REQUEST. It writes a sector. This packet has - unlike all other packets - a maximum size of 8200 bytes to be able to hold an entire 8 KByte sector.
    Table 24: WRITE_REQUEST
    Parameter Name | Size | Comment
    Sector number | uint16 | The sector (0-250) where the data should be written
    Offset | uint16 | Offset within the sector, in units of uint16, at which to start writing
    Data | uint16 array | At most 4096 words (or 8192 bytes)
  • Table 25 illustrates the response to the WRITE_REQUEST.
  • Table 25: WRITE_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 26 illustrates the WRITE_KALIMBA_RAM_REQUEST. It writes to the Kalimba RAM. The overall request must not be larger than 128 bytes.
    Table 26: WRITE_KALIMBA_RAM_REQUEST
    Parameter Name | Size | Comment
    Address | uint32 | Destination address
    Data | uint32 array | The length of the data may be at most 30 uint32 values (or 120 bytes)
  • Table 27 illustrates the response to the WRITE_KALIMBA_RAM_REQUEST.
  • Table 27: WRITE_KALIMBA_RAM_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 28 illustrates the EXECUTE_KALIMBA_REQUEST. It forces execution on the Kalimba from a given address.
    Table 28: EXECUTE_KALIMBA_REQUEST
    Parameter Name | Size | Comment
    Address | uint32 | Address
  • Table 29 illustrates the response to the EXECUTE_KALIMBA_REQUEST.
  • Table 29: EXECUTE_KALIMBA_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
  • Table 30 illustrates the response to the BINAURAL_RECORD_START_REQUEST.
  • Table 30: BINAURAL_RECORD_START_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Codec | uint16 | Codec list: PCM (linear), SBC, APT-X, Opus, G.729, AAC HE, MPEG Layer 2, MPEG Layer 3
    Sampling rate | uint16 |
  • Table 31 illustrates the response to the BINAURAL_RECORD_STOP_REQUEST.
  • Table 31: BINAURAL_RECORD_STOP_RESPONSE
    Parameter Name | Type | Comment
    Status | uint16 | See Table 2: Status codes
  • the following Table 32 illustrates the BINAURAL_RECORD_AUDIO_RESPONSE.
  • This is an unsolicited packet that will be sent repeatedly from the speakerbox with new audio content (preferably, SBC encoded audio data from the binaural microphones), following a BINAURAL_RECORD_START_REQUEST. To stop the automatic sending of these packets a BINAURAL_RECORD_STOP_REQUEST must be sent.
  • the first BINAURAL_RECORD_AUDIO_RESPONSE packet will contain the header below; subsequent packets will contain just the SBC frames (no header) until the total length of data sent equals the length in the header, i.e., a single BINAURAL_RECORD_AUDIO_RESPONSE packet may be fragmented across a large number of RFCOMM packets depending on the negotiated RFCOMM frame size.
  • the header for BINAURAL_RECORD_AUDIO_RESPONSE is not sent with every audio frame. Rather, it is only sent approximately once per second to minimize the protocol overhead.
  • Table 32: BINAURAL_RECORD_AUDIO_RESPONSE
    Parameter Name | Size | Comment
    Status | uint16 | See Table 2: Status codes
    Number of SBC frames in payload | uint16 | The number of SBC packets contained in the "SBC packet stream" portion of this control protocol packet.
    Number of SBC frames discarded | uint16 | If the radio link is not maintaining the required data rate to send all generated audio, some SBC packets will need to be discarded and not sent. This parameter holds the number of SBC packets that were discarded between this control protocol packet and the last one. N.B. this parameter should be zero for successful streaming without audio loss.
    SBC packet stream | n bytes | A concatenated stream of SBC packets.
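  • A receiver for this fragmentation scheme might look as follows (Python; it assumes, as described above, that only the first fragment carries the tag/length header and that the announced length bounds the whole logical packet):

```python
import struct

TL = struct.Struct(">HH")  # tag, length of the entire message

class AudioResponseReassembler:
    """Reassemble one fragmented BINAURAL_RECORD_AUDIO_RESPONSE.

    The first RFCOMM fragment carries the header from Table 32; subsequent
    fragments are raw SBC bytes until the announced length is reached.
    """
    def __init__(self):
        self.expected = None
        self.buffer = b""

    def feed(self, fragment: bytes):
        """Feed one RFCOMM fragment; return the complete message when done."""
        if self.expected is None:
            _tag, length = TL.unpack_from(fragment)
            self.expected = length
        self.buffer += fragment
        if len(self.buffer) >= self.expected:
            message = self.buffer[:self.expected]
            self.buffer = self.buffer[self.expected:]  # may start the next message
            self.expected = None
            return message  # header fields plus the concatenated SBC stream
        return None
```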
  • Audio-3D aims to transmit speech contents as well as the acoustical ambience in which the speaker currently is located.
  • Audio-3D may also be used to create binaural "snapshots" (also called “moments” in this specification) of situations in life, to share acoustical experiences and/or to create diary-like flashbacks based on the strong emotions that can be triggered by the reproduction of the acoustical ambience of a life-changing event.
  • a remote speaker is located in a natural acoustical environment which is characterized by a specific acoustical ambience.
  • the remote speaker uses a mobile binaural terminal, e.g., comprising an over-the-ear headset connected with a smartphone via a local wireless transmission link (see sections 4 and 5 above), which connects to a local speaker.
  • the binaural terminal of the remote speaker captures the acoustical ambience using the binaural headset.
  • the binaural audio signal is transmitted to the local speaker, which allows the local speaker to participate in the acoustical environment in which the remote speaker is located substantially as if the local speaker were there (which is designated in this specification as "audio portation").
  • Compared to a communication link based on conventional telephony, besides understanding the content of the speech emitted by the remote speaker, the local speaker preferably hears all acoustical nuances of the acoustical environment in which the remote speaker is located, such as the bird, the bat and the sounds of the beach.
  • A possible scenario "Sharing Audio Snapshots" is shown schematically and exemplarily in Fig. 3.
  • a user is at a specific location and enjoys his stay there.
  • he/she makes a binaural recording using an Audio-3D headset which is connected to a smartphone, denoted as the "Audio-3D-Snapshot”.
  • Once the snapshot is complete, the user also takes a photo of the location.
  • the binaural recording is tagged with the photo, the exact position, which is available in the smartphone, the date and time, and possibly a specific comment to identify this moment in time later on. All this information is uploaded to a virtual place, such as a social media network, at which people can share Audio-3D-Snapshots.
  • the user and those who share the uploaded contents can listen to the binaural content. Due to the additional information/data and the realistic impression that the Audio-3D-Snapshot can produce in the ears of the listener, the feelings the user may have had in the situation in which he/she captured the Audio-3D-Snapshot can be reproduced in a way much more realistic than would be possible based on a photo or a single-channel audio recording.
  • A possible scenario "Attending a Conference from Remote" is shown schematically and exemplarily in Fig. 4.
  • Audio-3D technology connects a single remote speaker with a conferencing situation with multiple speakers.
  • the remote speaker uses a binaural headset 202 which is connected to a smartphone (not shown in the figure) that operates a binaural communication link (realized, for example, by means of an app).
  • On the local side, one of the local speakers wears a binaural headset 201 to capture the signal or, alternatively, there is a binaural recording device on the local side which mimics the characteristics of a natural human head, such as an artificial head.
  • the remote person hears not only the speech content which the speakers on the local side emit, but also additional information which is inherent to the binaural signal transmitted via the Audio-3D communication link.
  • This additional information may allow the remote speaker to better identify the location of the speakers within the conference room. This, in particular, may enable the remote speaker to link specific speech segments to different speakers and may significantly increase the intelligibility even in case that all speakers talk at the same time.
  • A possible scenario "Multiple User Binaural Conference" is shown schematically and exemplarily in Fig. 5.
  • two endpoints at remote locations M, N are connected via an Audio-3D communication link with multiple communication partners on both sides.
  • One participant on each side has a "Master-Headset device” 301, 302, which is equipped with speakers and microphones. All other participants wear conventional stereo headsets 303, 304 with speakers only.
  • Due to the use of Audio-3D, communication is enabled as if all participants shared one room. In particular, even if multiple speakers on both sides speak at the same time, the transmission of the binaural cues makes it possible to tell the speakers apart based on their different locations.
  • A possible scenario "Binaural Conference with Multiple Endpoints" is shown schematically and exemplarily in Fig. 6. This scenario is very similar to the scenario "Multiple User Binaural Conference", explained in section 7.4 above.
  • the main difference is that more than two groups are connected, e.g., three groups at remote locations M, N, O in Fig. 6 .
  • a network located Audio-3D conference bridge 406 is used to connect all three parties.
  • a peer-to-peer connection from each of the groups to all other groups would, in principle, also be possible.
  • the overall number of data links, however, grows quadratically with the number of participating groups (n groups require n(n-1)/2 peer-to-peer links).
  • the purpose of the conference bridge 406 is to provide each participant group with a mix-down of the signals from all other participants. As a result, all participants involved in this communication situation have the feeling that all speakers are located at one place, such as, in one room. In specific situations, it may, however, be useful to preserve the grouping of people participating in this communication. In that case, the conference bridge may employ sophisticated digital signal processing to relocate signals in the virtual acoustical space. For example, for the listeners in group 1, the participants from group 2 may be artificially relocated to the left side and the participants from group 3 may be artificially relocated to the right side of the virtual acoustical environment.
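  • The relocation idea can be sketched as follows (Python/NumPy; constant-power panning is used here as a deliberately simple stand-in for the HRTF rendering a real bridge would employ, and all gains and angles are illustrative):

```python
import numpy as np

def mix_for_listener_group(signals: dict, azimuths: dict) -> np.ndarray:
    """Mix the other groups' (downmixed) signals into one stereo stream,
    placing each group at a virtual azimuth (-90 = hard left, +90 = hard right).

    Constant-power panning keeps the perceived loudness stable while
    preserving the grouping of the participants in the virtual space.
    """
    length = max(len(s) for s in signals.values())
    out = np.zeros((length, 2))
    for name, sig in signals.items():
        pan = (azimuths[name] + 90.0) / 180.0            # map to 0..1
        g_left, g_right = np.cos(pan * np.pi / 2), np.sin(pan * np.pi / 2)
        out[:len(sig), 0] += g_left * sig
        out[:len(sig), 1] += g_right * sig
    return out

# For the listeners in group 1: group 2 to the left, group 3 to the right.
# mix = mix_for_listener_group({"g2": g2, "g3": g3}, {"g2": -60.0, "g3": 60.0})
```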
  • A possible scenario "Binaural Conference with Conventional Telephone Endpoints" is shown schematically and exemplarily in Fig. 7.
  • This scenario is very similar to the scenario "Binaural Conference with Multiple Endpoints", explained in section 7.5 above.
  • two participants at remote location O are connected to the binaural conference situation via a conventional telephone link using a telephone 505.
  • the Audio-3D conference bridge 506 provides binaural signals to the two groups which are connected via an Audio-3D link.
  • the signals originating from the conventional telephone link are preferably extended to be located at a specific location in the virtual acoustical environment by HRTF (Head Related Transfer Function) rendering techniques (see, for example, G. Enzner et al., "Trends in Acquisition of Individual Head-Related Transfer Functions", The Technology of Binaural Listening, Springer-Verlag, pages 57 to 92, 2013).
  • HRTF Head Related Transfer Function
  • Speech enhancement technologies such as bandwidth extension (see B. Geiser, "High-Definition Telephony over Heterogeneous Networks", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2012 ) are preferably employed to improve the overall communication quality.
  • the Audio-3D conference bridge 506 creates a mix-down from the binaural signals. Sophisticated mix-down techniques should preferably be employed to avoid comb filtering effects and the like in the binaural signals. Also, the binaural signals should preferably be processed by means of sophisticated signal enhancement techniques, such as, e.g., noise reduction and dereverberation, to help the connected participants who listen to monaural signals captured in a situation with multiple speakers speaking at the same time from different directions.
  • binaural conferences may be extended by means of a recorder which captures the audio signals of the complete conference and afterwards stores them as an Audio-3D snapshot for later recovery.
  • a binaural conferencing situation (not shown in the figures) with three participants at different locations which all use a binaural terminal, such as an Audio-3D headset.
  • the audio signals from all participants are mixed to an overall resulting signal at the same time.
  • this may end up in a rather noisy result and in signal distortions due to the overlaying/mixing of three different binaural audio signals originating from the same environment. Therefore, the present invention foresees the following selection by a participant or by an automatic approach.
  • one participant of the binaural conference may select a master binaural signal, either from participant 1, 2 or 3.
  • the signal from participant 3 has been selected.
  • the participants 1 and 2 may be represented in mono (preferably, freed from the sounds related to the acoustical environment) and mixed into the binaural signal from participant 3.
  • in other words, the signals from participants 1 and 2 are only monaural (preferably, freed from the sounds related to the acoustical environment) and are then mixed binaurally into the binaural signal from participant 3.
  • the binaural signal from the currently speaking participant is preferably always used, which means that there will be a switch of the binaural acoustical environment.
  • This concept may be realized by commonly known means, such as by detecting the current speaker by means of a level detection or the like.
  • sophisticated signal processing algorithms may be employed to combine the recorded signals to form the best combination targeting a specific optimization criterion (e.g. to maximize the intelligibility).
  • a first example preferably consists of one or more of the following steps:
  • a second example preferably consists of one or more of the following steps:
  • the binaural cues inherent to the audio signal captured at the one side must be preserved until the audio signal reaches the ears of the connected partner at the other side.
  • the binaural cues are defined as the characteristics of the relations among the two channels of the binaural signal which are commonly mainly expressed as the Interaural Time Differences (ITD) and the Interaural Level Differences (ILD) (see J. Blauert, "Spatial Hearing: The Psychophysics of Human Sound Localization", The MIT press, Cambridge, Massachusetts, 1983 ).
  • the ITD cues influence the perception of the spatial location of acoustical events at low frequencies due to the time differences between the arrival of an acoustical wavefront at the left and the right human ear. Often, these cues are also denoted as phase differences between the two channels of the binaural signal.
  • the ILD binaural cues have a strong impact on the human perception at high frequencies.
  • the ILD cues are due to the shadowing and attenuation effects caused by the human head given signals arriving from a specific direction: The level tends to be higher at that side of the head which points into the direction of the origin of the acoustical event.
  • Audio-3D can only be based on transmission channels for which the provider has end-to-end control.
  • An introduction of Audio-3D as a standard in public telephone networks seems to be unrealistic due to the lack of cooperation and interest of the big telecommunication companies.
  • Audio-3D should preferably be based on packet based transmission schemes, which requires technical solutions to deal with packet losses and delays.
  • For Audio-3D, new terminal devices are required. Instead of a single microphone in proximity to the mouth of the speaker, as commonly used in conventional telephony, two microphones are required for Audio-3D, which must be located in proximity to the natural location where human perception actually happens, hence close to the entrance of the ear canal.
  • A possible realization is shown in Fig. 8, based on the example of an artificial head equipped with a prototype headset for Audio-3D.
  • the microphone capsules are in close proximity to the entrance of the ear canal.
  • the headset shown is not closed; otherwise, the usage scenario "Multiple User Binaural Teleconference" would not be possible, since in that scenario the local acoustical signals also need to reach the ears of the speaker on a direct path.
  • closed headphones extended by a "hear-through” functionality as well as loudspeaker-microphone enclosures combined with stereo-crosstalk-cancellation and stereo-widening or wave field synthesis techniques are optional variants of Audio-3D terminal devices (refer to section 8.4.2).
  • Special consideration has to be taken to realize Audio-3D, since currently available smartphones support only monaural input channels.
  • some manufacturers, such as, e.g., Tascam (see www.tascam.com), offer soundcards which can be used in stereo input and output mode in combination with, e.g., an iPhone. It is very likely that the USB On-The-Go (OTG) standard will soon allow connecting USB compliant high-quality soundcards with smartphones.
  • OTG USB On-The-Go
  • binaural signals should preferably be of a higher quality, since the binaural masking threshold level is known to be lower than the masking threshold for monaural signals (see B.C.J. Moore, "An Introduction to the Psychology of Hearing", Academic Press, 4th Edition, 1997).
  • a binaural signal transmitted from one location to the other should preferentially be of a higher quality compared to the signal transmitted in conventional monaural telephony. This implies that high-quality acoustical signal processing approaches should be realized, as well as audio compression schemes (audio codecs) which allow higher bit rates and therefore higher quality modes.
  • Audio-3D, in this example, is packet based and principally an interactive duplex application. Therefore, the end-to-end delay should preferably be as low as possible to avoid negative impacts on conversations, and the transmission should be able to deal with different network conditions. Therefore, jitter compensation methods, frame loss concealment strategies and audio codecs which adapt the quality and the delay to a given instantaneous network characteristic are deemed crucial elements of Audio-3D applications.
  • Audio-3D applications shall be available for everybody. Therefore, simplicity in usage may also be considered a key feature of Audio-3D.
  • the functional units in a packet based Audio-3D terminal can be similar to those in a conventional VoIP-terminal.
  • Two variants are considered in the following. The variant shown schematically and exemplarily in Fig. 9 is preferably foreseen for use in a headset terminal device as shown in Fig. 8, which is the preferred solution, whereas the variant shown schematically and exemplarily in Fig. 10 is preferably foreseen for use in a terminal device realized as a speakerbox, which may require additional signal processing for realizing a stereo crosstalk cancellation in the receiving direction and a stereo widening in the sending direction.
  • the Audio-3D terminal comprises two speakers and two microphones, which are associated with the left and the right ear of the person wearing the headset.
  • AEC acoustical echo cancellers
  • the signal captured by each of the microphones is preferably processed by a noise reduction (NR), an equalizer (EQ) and an automatic gain control (AGC).
  • NR noise reduction
  • EQ equalizer
  • AGC automatic gain control
  • the output from the AGC is finally fed into the source codec.
  • This source codec is preferably specifically suited for binaural signals and transforms the two channels of the audio signal into a stream of packets of a moderate data rate which fulfills the high quality constraints as defined in section 8.3 above.
  • the packets are finally transmitted to the connected communication partner via an IP link.
  • sequences of packets arrive from the connected communication partner.
  • the packets are fed into the adaptive jitter buffer unit (JB).
  • This jitter buffer has control of the decoder, which reconstructs the binaural audio signal from the arriving packets, as well as of the frame loss concealment (FLC) functionality that performs error concealment in case packets have been lost or arrive too late.
  • FLC frame loss concealment
  • Network delays, denoted as "jitter", are compensated by buffering a specific number of samples. The jitter buffer is adaptive in that the number of samples to be stored for jitter compensation may vary over time to adapt to the given network characteristics. However, caution should be taken not to unnecessarily increase the end-to-end communication delay, which depends on the number of samples stored in the buffer before playback.
  • the decoder is preferably driven to perform frame loss concealment. In some situations, however, frame loss concealment cannot be performed by the decoder. In this case, the frame loss concealment unit is preferably driven to output audio samples that conceal the gap in the audio signal due to the missing audio samples.
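  • The interplay of jitter buffer, decoder and concealment can be sketched as follows (Python; the depths, thresholds and the decode()/conceal() hooks are illustrative, not taken from the patent):

```python
import collections

class AdaptiveJitterBuffer:
    """Buffer incoming frames and adapt the target depth to observed jitter.

    A deeper buffer absorbs more network jitter but directly increases the
    end-to-end delay, so the depth grows only when frames keep arriving late.
    """
    def __init__(self, min_depth=2, max_depth=10):
        self.frames = {}                        # sequence number -> frame
        self.depth = min_depth
        self.max_depth = max_depth
        self.next_seq = 0
        self.arrival_log = collections.deque(maxlen=100)

    def push(self, seq: int, frame: bytes):
        self.arrival_log.append(seq < self.next_seq)    # late arrival?
        if sum(self.arrival_log) > 5 and self.depth < self.max_depth:
            self.depth += 1                     # too much jitter: buffer more
        self.frames[seq] = frame

    def pop(self, decode, conceal) -> bytes:
        """Return decoded audio for the next frame, concealing if missing."""
        frame = self.frames.pop(self.next_seq, None)
        self.next_seq += 1
        # The decoder conceals if it can; otherwise the FLC unit fills the gap.
        return decode(frame) if frame is not None else conceal()
```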
  • the output signal from the jitter buffer is fed, here, into an optional noise reduction (NR) and an automatic gain control (AGC) unit.
  • NR noise reduction
  • AGC automatic gain control
  • In principle, these units are not necessary, since this functionality has already been realized on the side of the connected communication partner. Nevertheless, they often make sense if the connected terminal does not provide the desired audio quality due to low bit rate source encoders or insufficient signal processing on the side of the connected terminal.
  • the following equalizer in the receiving direction is preferably used to individually equalize the headset speakers and to adapt the audio signals according to the subjective preferences of the user. It was found, e.g., in R. Bomhardt et al., "Individualisierung der kopfbezogenen Übertragungsfunktion", 40. Jahrestagung für Akustik (DAGA), 2014, that an individual equalization can be crucial for a high-quality spatial perception of the binaural signals.
  • the processed signal is finally emitted by the speakers of the Audio-3D terminal headset.
  • a functional unit for a stereo widening (STW) as well as a functional unit for a stereo crosstalk cancellation (XTC) are added.
  • the stereo widening unit transforms a stereo signal captured by means of two microphones into a binaural signal. This enhancement is principally necessary if the two microphones are not at a distance identical (or close) to that of the human ears due to, e.g., the limited size of the speakerbox terminal device. Due to the knowledge of the capturing situation, the stereo widening unit can compensate for the lack of distance by artificially adding binaural cues such as increased interchannel phase differences for low frequencies and interchannel level differences for higher frequencies.
  • stereo widening on the sending side in a communication scenario may be denoted as “side information based stereo widening”.
  • stereo widening may also be based solely on the received signal on the receiving side of a communication scenario. In that case, it is denoted as "blind stereo widening" since no side information is available in addition to the transmitted binaural signal.
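  • A crude widening sketch (Python/NumPy/SciPy; the fixed delay and gain values are purely illustrative, whereas a side information based implementation would derive them from the known microphone spacing and estimated source direction):

```python
import numpy as np
from scipy.signal import butter, lfilter

def widen(left, right, fs, itd_ms=0.3, ild_db=4.0, split_hz=1500.0):
    """Increase interchannel phase differences below split_hz and
    interchannel level differences above it, as described above.
    """
    b_lo, a_lo = butter(2, split_hz / (fs / 2), "low")
    b_hi, a_hi = butter(2, split_hz / (fs / 2), "high")
    delay = int(fs * itd_ms / 1000.0)
    gain = 10.0 ** (ild_db / 20.0)

    def bands(sig):
        return lfilter(b_lo, a_lo, sig), lfilter(b_hi, a_hi, sig)

    l_lo, l_hi = bands(np.asarray(left, dtype=float))
    r_lo, r_hi = bands(np.asarray(right, dtype=float))
    # Delay the right low band (phase difference) and attenuate its high band
    # (level difference); the left channel leads and stays louder.
    r_lo = np.concatenate([np.zeros(delay), r_lo])[:len(r_lo)]
    return l_lo + gain * l_hi, r_lo + r_hi / gain
```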
  • the stereo crosstalk cancelling unit is preferably used to aid the listener who is located at a specific position to perceive binaural signals. Mainly, it compensates for the loss of binaural cues due to the emission of the two channels via closely spaced speakers and a cross-channel interference (audio signals emitted by the right loudspeaker reaching the left ear and audio signals emitted by the left loudspeaker reaching the right ear).
  • the purpose of the stereo crosstalk canceller unit is to employ signal processing to emit signals which cancel out the undesired cross-channel interference signals reaching the ears.
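  • One common way to realize such a canceller, sketched here under the assumption that the loudspeaker-to-ear transfer functions have been measured, is a regularized per-frequency-bin inversion (Python/NumPy; the regularization constant is illustrative):

```python
import numpy as np

def xtc_filters(H: np.ndarray, reg: float = 1e-3) -> np.ndarray:
    """Compute crosstalk-cancellation filters in the frequency domain.

    H has shape (n_bins, 2, 2), where H[f, ear, speaker] is the acoustic
    path from each loudspeaker to each ear. The filters C approximately
    invert H (Tikhonov-regularized to tame ill-conditioned bins), so that
    H @ C is close to the identity: each ear then receives only its own
    channel and the cross paths cancel out.
    """
    C = np.empty_like(H)
    eye = np.eye(2)
    for f in range(H.shape[0]):
        Hf = H[f]
        C[f] = np.linalg.solve(Hf.conj().T @ Hf + reg * eye, Hf.conj().T)
    return C
```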
  • a full two-channel acoustical echo canceller is preferably used, rather than two single channel acoustical echo cancellers.
  • The purpose of the Audio-3D conference bridge is to provide audio streams to the participants of a conference situation with more than two participants. In principle, it would also be possible to establish multiple peer-to-peer connections between all participating connections; some of the functionalities performed by the conference bridge would then have to be realized in the terminals. However, the overall data rate involved would grow quadratically as a function of the number of participants and would therefore become inefficient already for a low number of connected participants.
  • The typical functionality to be realized in the conference bridge is shown schematically and exemplarily in Fig. 11, based on an exemplary setup composed of three participants, of which one is connected via a conventional telephone (PSTN; public switched telephone network) connection, whereas the other two participants are connected via a packet based Audio-3D link.
  • the conference bridge receives audio streams from all three endpoints, shown as the incoming gray arrows in the figure.
  • the streams originating from participants 1 and 2 contain binaural signals in Audio-3D quality, indicated by the double arrows, whereas the signal from participant 3 is only monaural and of narrow band quality.
  • the conference bridge creates one outgoing stream for each of the participants:
  • each participant receives the audio data from all participants but himself.
  • Variants are possible to control the outgoing audio streams, e.g.,
  • incoming audio signals may be decoded and transformed into PCM (pulse code modulation) signals to be accessible for audio signal processing algorithms.
  • the signal processing functionalities in the PCM domain are similar to those functionalities realized in the terminals (e.g., adaptive jitter buffer) and shall not be explained in detail here.
  • a signal adaptation is preferentially used in both directions, from the telephone network to the Audio-3D network (Voice to Audio-3D) and from the Audio-3D network to the telephone network (Audio-3D to Voice).
  • the audio signals must be converted from narrowband to Audio-3D quality and from monaural to binaural, as shown schematically and exemplarily in Fig. 12.
  • the monaural signal is transformed into a binaural signal.
  • So-called spatial rendering (SR) is employed for this purpose in most cases.
  • the spatial rendering is based on so-called head related transfer functions (HRTFs). HRTFs mimic the effect of the temporal delay caused by a signal reaching the one ear before the other and the attenuation effects caused by the human head.
  • an additional binaural reverberation can be useful (SR+REV).
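A minimal sketch of such spatial rendering in Python, assuming a pair of head-related impulse responses (HRIRs, the time-domain counterparts of the HRTFs) for the desired source direction is available from some external HRTF database; the optional common reverberation tail corresponds to the SR+REV variant, and the wet parameter is hypothetical.

```python
import numpy as np

def spatial_render(mono, hrir_left, hrir_right, reverb_ir=None, wet=0.2):
    # Binaural cues (ITD/ILD) are introduced by the direction-dependent HRIRs.
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    if reverb_ir is not None:
        # SR+REV: add the same reverberation tail to both channels.
        tail = wet * np.convolve(mono, reverb_ir)
        n = max(len(left), len(right), len(tail))
        pad = lambda x: np.pad(x, (0, n - len(x)))
        left, right = pad(left) + pad(tail), pad(right) + pad(tail)
    return left, right
```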
  • in the opposite direction, the binaural signal must be converted into a signal which is compliant with a conventional telephone.
  • the audio bandwidth must be limited and the signal must be converted from binaural to mono, as shown schematically and exemplarily in Fig. 13 .
  • an intelligent down-mix is preferably realized, such that undesired comb effects and spectral colorations are avoided.
  • additional signal processing / speech enhancements may preferably be implemented, such as a noise reduction and a dereverberation that may help the listener to better follow the conference.
  • the binaural cues as introduced in section 8.1 above must be preserved. In order to do so, the sensitivity of human perception with respect to phase shifts in binaural signals is preferably taken into account in the source codec. VoIP applications tend to transfer different media types in independent streams of data and to synchronize on the receiver side. This procedure makes sense for audio and video due to the use of different recording and playback clocks.
  • the receiver side synchronization is not very critical, since a temporal shift between audio and video can be tolerated unless it exceeds 15 to 45 milliseconds (see Advanced Television Systems Committee, "ATSC Implementation Subcommittee Finding: Relative Timing of Sound and Vision for Broadcast Operations", IS-191, 2003 ).
  • the two channels of a binaural signal should preferably be captured using one physical device with one common clock rate to prevent signal drifting.
  • for the two channels of a binaural signal, in contrast, the synchronization on the receiver side cannot be realized at all, or only with an immense signal processing effort, at an accuracy which allows preserving the ITD binaural cues as defined in section 8.1 above.
  • the most simple approach is to transmit, in one data packet, the encoded binary data taken from two independent instances of the same monaural source encoder, one for each binaural channel, as long as the left and the right binaural channel are captured sample- and frame-synchronously, which implies that both are recorded by ADCs (analog-to-digital converters) driven by the same clock or a common clock reference.
  • This approach yields a data rate which is twice the data rate of a monaural HD-Voice communication terminal.
  • sophisticated approaches to exploit the redundancies in both channels may be a promising solution to decrease the overall data rate (see, e.g., H.
  • VoIP transmission schemes in general rely on the so-called User Datagram Protocol (UDP) rather than the connection-oriented Transmission Control Protocol (TCP).
  • packets emitted by one side of the communication usually arrive in time, but may also arrive with a significant delay (denoted as the "network jitter").
  • packets may also get lost during the transmission (denoted as a "frame loss").
  • a jitter buffer (JB) is therefore employed on the receiving side to compensate for the network jitter.
  • the network jitter characteristics observed in real applications are in general strongly time-varying.
  • An example with strongly variable network jitter is a typical WiFi router used in many households nowadays.
  • packets may not be transmitted via the WiFi transmission link for a couple of hundred milliseconds if a microwave oven is used which produces disturbances in the same frequency band used by WiFi, or if a Bluetooth link is used in parallel. Therefore, a good jitter buffer should preferably be managed such that it adapts to the instantaneous network quality, which must be observed by the Audio-3D communication application.
  • Such a jitter buffer is denoted as an adaptive jitter buffer.
  • the number of samples stored in the jitter buffer (the fill height) is preferably modified by the employment of approaches for signal modifications such as the waveform similarity overlap-add (WSOLA) approach (see W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech", IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 554 to 557, 1993 ), a phase vocoder (see M. Dolson, "The phase vocoder: A tutorial", Computer Music Journal, Vol. 10, No. 4, pages 14 to 27, 1986 ) approach or similar techniques.
  • the goal during this adaptation is to play the signal with an increased or decreased speed without producing audible artifacts, which is also denoted as "time stretching".
  • time stretching is achieved by re-assembling the signal from signal segments originating from the past or the future.
  • the exact signal synthesis process may be different for the left and the right channel of a binaural signal due to independent WSOLA processing instances.
  • Arbitrary phase shifts may be the result, which do not really produce audible artifacts, but which may lead to a manipulation of the ITD cues in Audio-3D and may destroy or modify the spatial localization of audio events.
  • a preferred approach which does not influence the ITD binaural cues is to use an adaptive resampler.
  • the core component is a flexible resampler, the output sample rate of which can be modified continuously during operation.
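A Python sketch of such a flexible resampler based on linear interpolation (a production system would rather use polyphase filtering): because the identical read-out positions are used for the left and the right channel, the ITD cues are preserved exactly. The block interface and all names are our assumptions.

```python
import numpy as np

def resample_block(stereo, ratio, phase=0.0):
    """stereo: array of shape (2, n); ratio = input rate / output rate,
    e.g. 0.998 ... 1.002 to drain or fill the jitter buffer slowly."""
    n_in = stereo.shape[1]
    phase = max(phase, 0.0)  # a real implementation keeps one sample of history
    # Input positions at which the output samples are interpolated.
    count = int((n_in - 1 - phase) / ratio) + 1
    pos = phase + ratio * np.arange(count)
    idx = pos.astype(int)
    frac = pos - idx
    nxt = np.minimum(idx + 1, n_in - 1)
    # The identical interpolation is applied to both channels, so the
    # interchannel time differences (ITD cues) remain untouched.
    out = stereo[:, idx] * (1.0 - frac) + stereo[:, nxt] * frac
    return out, (pos[-1] + ratio) - n_in  # fractional carry-over to the next block
```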
  • signal levels are preferably adapted by an automatic gain control (AGC) such that the transmitted signal appears neither too loud nor too low in volume.
  • this increases the perceived communication quality since, e.g., a source encoder works better for signals with a higher level than for lower levels, and the intelligibility is higher for higher-level signals.
  • the ILD binaural cues are based on level differences in the two channels of a binaural signal. Given two AGC instances which operate independently on the left and the right channel, these cues may be destroyed since the level differences are removed. Thus, a usage of conventional AGCs which operate independently may not be suitable for Audio-3D. Instead, the gain control for the left channel should preferably somehow be coupled to the gain control for the right channel.
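A minimal sketch of such a coupled gain control, here with the common gain derived from the louder of the two channels; the class name and time constants are hypothetical.

```python
import numpy as np

class LinkedAGC:
    """Linked AGC sketch: one common gain, derived from the louder channel,
    is applied to both channels, so the ILD cues are left untouched."""
    def __init__(self, target_rms=0.1, attack=0.3, release=0.01):
        self.gain = 1.0
        self.target = target_rms
        self.attack, self.release = attack, release

    def process(self, left, right):
        # Control signal: the louder channel governs the common gain.
        level = max(np.sqrt(np.mean(left**2)), np.sqrt(np.mean(right**2)), 1e-9)
        desired = self.target / level
        a = self.attack if desired < self.gain else self.release
        self.gain += a * (desired - self.gain)   # smooth gain trajectory
        return left * self.gain, right * self.gain
```

The essential design point is that a single gain trajectory drives both channels, so the interchannel level difference, and with it the ILD cue, passes through unchanged.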
  • the signals are recorded with devices which mimic the influence of real ears (for example, an artificial head in general has "average ears" which shall approximate the impact of the ears of a large number of persons) or by using headset devices with a microphone in close proximity to the ear canal (see section 8.4.1).
  • the ears of the person who listens to the recorded signals and the ears which have been the basis for the binaural recording are not identical.
  • an equalizer can be used in the sending direction in Figs. 9 and 10 to compensate for possible deviations of the microphone characteristics related to the left and the right channel of the binaural recordings.
  • an equalizer may also be useful to adapt to the hearing preference of the listener to attenuate or amplify specific frequencies.
  • attenuations and amplifications of parts of the binaural signal may also be realized in the equalizer according to the needs of the person wearing the binaural terminal device to increase the overall intelligibility.
  • some care has to be taken to not destroy or manipulate the ILD binaural cues.
  • a goal of Audio-3D is the transmission of speech contents as well as a transparent reproduction of the ambience in which the acoustical contents have been recorded. In this sense, a noise reduction which removes acoustical background noise may not seem useful at first glance.
  • At least stationary undesired noises should preferably be removed to increase the conversational intelligibility.
  • for Audio-3D, a more accurate classification of the recording situation should be performed to distinguish between "desired" and "undesired" background noises.
  • two microphones rather than only one help in this classification process by locating audio sources in a given room environment.
  • additional sensors such as an accelerometer or a compass may support the auditory scene analysis.
  • noise reduction is based on the attenuation of those frequencies of the recorded signal where noise is present, such that the speech is left unaltered, whereas the noise is suppressed as much as possible.
  • two noise reduction instances operating on the left and the right channel independently may destroy or manipulate the binaural ILD cues.
  • approaches have been developed for binaural hearing aids in the past (see T. Lotter, “Single and Multimicrophone Speech Enhancement for Hearing Aids", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2004 ; M. Jeub, "Joint Dereverberation and Noise Reduction for Binaural Hearing Aids and Mobile Phones", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2012 ).
  • for acoustical echo compensation, in general an approach is followed which is composed of an acoustical echo canceller and a statistical postfilter.
  • the acoustical echo canceller part is based on the estimation of the "real" physical acoustical path between speaker and microphone by means of an adaptive filter. Once determined, the estimate of the acoustical path is afterwards used to approximate the undesired acoustical echo signal recorded by the microphones of the terminal device.
  • the approximation of the acoustical echo and the acoustical echo signal inherent to the recorded signal are finally cancelled out by means of destructive superposition (see S. Haykin, "Adaptive Filter Theory", Prentice Hall, 4th Edition, 2001 ).
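As an illustration of the adaptive-filter part only (the statistical postfilter is omitted), the following is a textbook NLMS echo canceller in Python, in the spirit of the Haykin reference above; it is the generic algorithm, not necessarily the exact variant used in an Audio-3D terminal.

```python
import numpy as np

def nlms_echo_canceller(far_end, mic, filt_len=256, mu=0.5, eps=1e-6):
    """NLMS sketch: estimate the acoustical path from the loudspeaker
    (far-end) signal to the microphone and subtract the resulting echo
    estimate from the microphone signal by destructive superposition."""
    w = np.zeros(filt_len)           # estimate of the acoustical path
    x_buf = np.zeros(filt_len)       # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        x_buf = np.roll(x_buf, 1)
        x_buf[0] = far_end[n]
        echo_est = w @ x_buf
        e = mic[n] - echo_est        # residual = near-end speech + residual echo
        out[n] = e
        w += mu * e * x_buf / (x_buf @ x_buf + eps)   # NLMS update
    return out
```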
  • in Audio-3D headset terminal devices, a strong coupling between speaker and microphone is present, which is due to the close proximity of the microphone and the speaker (see, for example, Fig. 8) and which produces a strong undesired acoustical echo.
  • a well-designed adaptive filter may reduce this acoustical echo by a couple of dB but may never remove it completely. The remaining acoustical echo can still be audible and may be very confusing in terms of the perception of a binaural signal if two independent instances of an acoustical echo compensator are operated for the left and the right channel. Phantom signals may appear to be present, which are located at arbitrary locations in the acoustical scenery.
  • a postfilter is therefore considered to be of great importance here, but it may have a negative impact on the ILD binaural cues due to an independent manipulation of the signal levels of the left and the right channel of the binaural signal.
  • the hardware setup to consume binaural contents if not using a headset device is expected to be composed of two loudspeakers, for instance, two speakerboxes, being placed in a typical stereo playback scenario.
  • Such a stereo hardware setup is not optimal for binaural contents as it suffers from cross channel interferences: Signals emitted by the left of the two loudspeakers of the stereo playback system will reach the right ear and signals emitted by the right speaker will reach the left ear.
  • the two channels of a captured binaural signal to be emitted by the two involved speakers are pre-processed by means of linear filtering in order to minimize the amount of cross channel interference. Principally, this pre-processing employs cancellation techniques based on fixed filtering as described, e.g., in B. B. Bauer, "Stereophonic Earphones and Binaural Loudspeakers", Journal of the Audio Engineering Society, Vol. 9, No. 2, pages 148 to 151, 1961.
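A sketch of such fixed-filter crosstalk cancellation for a symmetric setup, assuming impulse responses of the ipsilateral path (loudspeaker to the same-side ear) and the contralateral path (loudspeaker to the opposite ear) are known; the regularization term is a hypothetical safeguard at frequencies where the inversion becomes ill-conditioned.

```python
import numpy as np

def xtc_filters(h_ipsi, h_contra, n_fft=4096, reg=1e-3):
    """Invert the symmetric 2x2 matrix of acoustic paths between the two
    loudspeakers and the two ears in the frequency domain (cf. Bauer)."""
    Hi = np.fft.rfft(h_ipsi, n_fft)
    Hc = np.fft.rfft(h_contra, n_fft)
    det = Hi * Hi - Hc * Hc + reg
    # For a symmetric setup, the inverse matrix is also symmetric:
    #   [C_ii  C_ic]          [ Hi  -Hc]
    #   [C_ic  C_ii] = 1/det  [-Hc   Hi]
    c_ii = np.fft.irfft(Hi / det, n_fft)
    c_ic = np.fft.irfft(-Hc / det, n_fft)
    return c_ii, c_ic

def apply_xtc(left, right, c_ii, c_ic):
    # Pre-process the binaural signal before loudspeaker playback.
    out_l = np.convolve(left, c_ii) + np.convolve(right, c_ic)
    out_r = np.convolve(right, c_ii) + np.convolve(left, c_ic)
    return out_l, out_r
```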
  • the pre-processing required for crosstalk cancellation depends heavily on the physical location and characteristics of the involved loudspeakers. Normally, users follow no common scheme when placing stereo loudspeakers, e.g., in the context of a home cinema.
  • the location of the stereo speakers is fixed and users are assumed to be located in front of the display at a specific distance.
  • a carefully designed set of pre-processing filter coefficients is preferably sufficient to cover most use-cases.
  • the position of the loudspeakers is definitely not fixed.
  • stereo enhancement techniques may preferably be employed to transform a stereo signal into a somewhat binaural signal.
  • the main principle of these stereo enhancement techniques is to artificially modify the captured stereo audio signals to reconstruct lost binaural cues artificially.
  • any audio recording is simply played back by devices without taking care of how it was captured, e.g., whether it is a mono, a stereo, a surround sound or a binaural recording and/or whether the playback device is a speakerbox, a headset, a surround sound equipment, a loudspeaker arrangement in the car or the like.
  • the maximum that can be expected today is that a mono signal is automatically played back on both loudspeakers, right and left, or on both headset speakers, left and right, or that a surround sound signal is down-mixed to two speakers if the surround sound is indicated.
  • the ignorance of the audio signal's nature may result in an audio quality which is not satisfactory for the listener.
  • a binaural signal might be played back via loudspeakers and a surround sound signal might be played back via headphones.
  • Another example might occur as binaurally recorded sounds, provided by music labels or broadcasters, become more widely distributed in the market.
  • while 3D algorithms for enhancing the flat audio field of a stereo signal exist and are being applied, such devices or algorithms cannot distinguish between stereo signals and binaurally recorded signals. Thus, they would even apply 3D processing to already binaurally recorded signals. This needs to be avoided, because it could result in a severely impaired sound quality that does not at all match the target of the audio signal supplier, whether it is a broadcaster or the music industry.
  • the audio terminal 100 shown in Fig. 1 generates metadata provided with the multi-channel audio data, wherein the metadata indicates that the multi-channel audio data is binaurally captured.
  • the metadata further indicates one or more of: a type of the first device, a microphone use case, a microphone attenuation level, a beamforming processing profile, a signal processing profile and an audio encoding format.
  • a suitable metadata format could be defined as follows:
    Device ID: 3 bit to indicate a setup of the first and the second microphone, e.g.,
    '000' BoomBoom
    '001' Shoulderspeaker
    '010' Headset over the ear
    '011' Headset on the ear
    '100' In-Ear
    '101' Headset over the ear with artificial ears
    '110' Headset on the ear with artificial ears
    '111' In-Ear with hear-through
    Microphone Use Case: 3 bit to indicate the use case of the microphones, e.g.
    Level Setup: 32 bit (4 x 8 bit) or more to indicate the respective attenuation of the microphones, e.g.,
    'Bit 0-7' Attenuation of microphone 1 in dB
    'Bit 8-15' Attenuation of microphone 2 in dB
    'Bit 16-23' Attenuation of microphone 3 in dB
    'Bit 24-31' Attenuation of microphone 4 in dB
    Beamforming Processing Profile: 2 bit to indicate which beamforming algorithms have been applied to the microphones, e.g.,
    '00' Beamforming algorithm 1
    '01' Beamforming algorithm 2
    '10' Beamforming algorithm 3
    '11' Beamforming algorithm 4
    Signal Processing Profile: 4 bit to indicate which algorithms have been applied to the microphones, e.g.,
    '00' Signal processing 1
    '01' Signal processing 2
    '10' Signal processing 3
    '11' Signal processing 4
    Encoding Algorithm Format: 2 to 4 bit to indicate the encoding algorithm being used, such as SBC or aptX.
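To make the bit budget concrete, here is a hypothetical Python packing of these fields into a compact binary header; the byte layout, the leading version byte and the field order are our assumptions, as the text only fixes the field widths and code points.

```python
import struct

def pack_metadata(device_id, mic_use_case, mic_atten_db, beam_profile,
                  dsp_profile, codec):
    """Hypothetical serialization of the metadata fields sketched above:
    3+3 bit IDs, 4 x 8 bit attenuation levels, 2 bit beamforming profile,
    4 bit signal processing profile and a 4 bit codec identifier."""
    flags = (device_id & 0x7) | ((mic_use_case & 0x7) << 3)
    proc = (beam_profile & 0x3) | ((dsp_profile & 0xF) << 2) | ((codec & 0xF) << 6)
    return struct.pack(">BBBBBBH", 1, flags,                  # version byte, IDs
                       *[a & 0xFF for a in mic_atten_db],     # 4 attenuation bytes
                       proc)

# Example: a BoomBoom ('000') with all four microphones at 0 dB attenuation.
blob = pack_metadata(0b000, 0b001, [0, 0, 0, 0], 0b00, 0b0001, 0b0010)
```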
  • the metadata preferably indicates a position of the two speakers relative to each other.
  • while the audio terminal 100 described with reference to Fig. 1 comprises a first device 10 and a second device 20 which is separate from the first device 10, this does not have to be the case.
  • other audio terminals according to the present invention which may be used for Audio-3D may be integrated terminals, in which both (a) at least a first and a second microphone for capturing multi-channel audio data comprising a first and a second audio channel, and (b) a communication unit for voice and/or data communication, are provided in a single first device.
  • a connection via a local wireless transmission link may not be needed and the concepts and technologies described in sections 7 to 9 above could also be realized in an integrated terminal.
  • an audio terminal which realizes the concepts and technologies described in sections 7 to 9 above could comprise a first and a second device which are adapted to be connected with each other via a wired link.
  • also conceivable is an audio terminal which comprises only one of (a) at least a first and a second microphone and (b) at least one of a first and a second speaker, the first one being preferably usable for recording multi-channel audio data comprising at least a first and a second audio channel and the second one being preferably usable for playing back multi-channel audio data comprising at least a first and a second audio channel.
  • while the audio terminal 100 described with reference to Fig. 1 comprises a communication unit 21 for voice and/or data communication, other audio terminals according to the present invention which may be used for Audio-3D may comprise, additionally or alternatively, a recording unit (not shown in the figures) for recording the captured multi-channel audio data comprising a first and a second audio channel.
  • a recording unit preferably comprises a non-volatile memory, such as a hard disk drive or a flash memory, in particular, a flash RAM.
  • the memory may be integrated into the audio terminal or the audio terminal may provide an interface for inserting an external memory.
  • the audio terminal 100 further comprises an image capturing unit (not shown in the figures) for capturing a still or moving picture, preferably, while capturing the multi-channel audio data, wherein the audio terminal 100 is adapted to provide, preferably automatically or substantially automatically, information associating the captured still or moving picture with the captured multi-channel audio data.
  • the audio terminal 100 may further comprise a text inputting unit for inputting text, preferably, while capturing the multi-channel audio data, wherein the audio terminal 100 is adapted to provide, preferably automatically or substantially automatically, information associating the inputted text with the captured multi-channel audio data.
  • the audio terminal 100 is adapted to provide, preferably, by means of the communication unit 21, the multi-channel audio data such that a remote user is able to listen to the multi-channel audio data.
  • the audio terminal 100 may be adapted to communicate the multi-channel audio data to a remote audio terminal via a data communication, e.g., a suitable Voice-over-IP communication.
  • the first and the second microphone 11, 12 and the first speaker 15 can be provided in a headset, for instance, an over- or on-the-ear headset, or an in-ear phone.
  • Audio-3D is not realized with narrowband audio data but, preferably, with wideband or even super-wideband or full band audio data.
  • in these latter cases, which may be referred to as HD-Audio-3D, the various technologies described above are adapted to deal with such high definition audio content.
  • a single unit or device may fulfill the functions of several items recited in the claims.
  • the mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.


Description

    FIELD OF THE INVENTION
  • The present invention generally relates to the field of audio data processing. More particularly, the present invention relates to an audio terminal.
  • BACKGROUND OF THE INVENTION 1. Introduction
  • Everybody uses a telephone - either a wired telephone connected to the well-known PSTN (Public Switched Telephone Network) via cable or a modern mobile phone, such as a smartphone, which is connected to the world via wireless connections based on, e.g., UMTS (Universal Mobile Telecommunications System). However, despite the innovations in the field of mobile communication terminals and signal processing in general, no advances in the quality of speech have been observed for the last 20 years. Speech is transmitted and received with severe limitations of audio bandwidth such that frequencies below 300 Hz and above 3400 Hz are removed, which is denoted as "narrowband speech". The only progress claimed by the big telecommunication companies during that time was the reduction of cost (see F. Felden et al., "How IT Is Driving Business Value at European Telcos", bcg.perspectives, 2012), but not an increase of communication quality, which would be a real benefit for the customers. Some of the communication providers even reduced the operational cost at the expense of a degraded speech quality by, e.g., additionally compressing speech signals when transmitting them from one point to another in the internal network to create room for additional communication links on one communication line. The perceived audio quality as well as the intelligibility of PSTN telephony is quoted by users to be even worse than before. Clearly, there is a huge gap between, on the one hand, the capabilities of smartphones with millions of apps and features as well as Internet access bitrates going up to Mbit/s in 3G and LTE and, on the other hand, the absent or ultra-slow progress of the audio quality in the old telephone communication.
  • During the last years, however, the introduction of internet based telephony (VoIP: voice over internet protocol) has led to the introduction of higher quality communication applications. Companies such as Skype or Google offer services which employ novel speech codecs offering so-called HD-Voice quality (see, e.g., 3GPP TS 26.190, "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; Transcoding functions", 3GPP Technical Specification Group Services and System Aspects, 2001; ITU-T G.722.2, "Wideband coding of speech at around 16 kbit/s using Adaptive Multi-Rate Wideband (AMR-WB)", 2002). In this context, speech signals cover a frequency bandwidth between 50 Hz and 7 kHz (so-called "wideband speech") and even more, for instance, a frequency bandwidth between 50 Hz and 14 kHz (so-called "super-wideband speech") (see 3GPP TS 26.290, "Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions", 3GPP Technical Specification Group Services and System Aspects, 2005) or an even higher frequency bandwidth (e.g., "full band speech"). As a consequence, the overall communication quality - even due to the introduction of the transmission of video contents - has increased significantly. The success of this approach, however, is so far mainly limited to stationary environments.
  • In the future, Audio-3D - also denoted as binaural communication - is expected by the present inventors to be the next emerging technology in communication. The benefit of Audio-3D in comparison to conventional (HD-)Voice communication lies in the use of a binaural instead of a monaural audio signal. Audio contents will be captured and played back by novel binaural terminals involving two microphones and two speakers, yielding an acoustical reproduction that better resembles what the remote communication partner really hears. Ideally, binaural telephony is "listening to the audio ambience with the ears of the remote speaker", wherein the pure content of the recorded speech is extended by the capturing of the acoustical ambience. Compared to the transmission of stereo contents, which allow a left-right location of sound sources, the virtual representation of room acoustics in binaural signals is, preferably, based on differences in the time of arrival of the signals reaching the left and the right ear as well as attenuation and filtering effects caused by the human head, the body and the ears allowing the location of sources also in vertical direction.
  • 2. The Vision of Audio-3D
  • Audio-3D is expected to represent the first radical change of the form of audio communication that has been known for more than 100 years and that society has named telephone or phoning. It targets particularly a new mobile type of communication which may be called "audio portation". In one exemplary usage scenario, everybody equipped with a future binaural terminal as well as a smartphone app to handle the communication will be able to effectively capture the acoustical environment, i.e., the acoustical events of real life, preferably as they are perceived with the two ears of the user, and provide them as captured, like a listening picture, to another user anywhere in the world. With Audio-3D, communication partners will no longer feel distant (or, at least, less distant) and, eventually, it may even result in a reduction of traveling - which indeed is an intelligent economic and ecological approach.
  • Originally, binaural telephony was proposed by H. Fletcher and L. Sivian in 1927 in a patent describing an application in which two signals captured by an artificial head are transferred to a person located at a far-away place (see H. Fletcher and L. Sivian, "Binaural telephone system", US Patent 1,624,486 A, 1927 ). In A. Harma et al., "Techniques and Applications of Wearable Augmented Reality Audio, Audio Engineering Society Convention, 2003, and M. Karjalainen et al., "Application Scenarios of Wearable and Mobile Augmented Reality Audio", Audio Engineering Society Convention, 2004, investigations are described to create a so-called augmented-reality by connecting two persons via a binaural link.
  • However, for Audio-3D to open a new world of communication, this technology requires a new family of signal processing algorithms that takes into account the specific constraints defined by the binaural perception of audio events. In particular, the binaural cues, i.e., the inherent characteristics defining the relation between the left and the right audio channel, which are crucial for 3D audio perception, must substantially be preserved and transmitted in the complex signal processing chain of an end-to-end binaural communication.
  • The present invention has been made in view of the above situation and considerations and embodiments of the present invention aim at providing technology that may be used in various Audio-3D usage scenarios.
  • It shall be noted that in this specification, the term "binaural" or "binaurally" is not used in as strict a sense as in some publications, where only audio signals captured with an artificial head (also called "Kunstkopf") are considered truly binaural. Rather, the term is used here for any audio signals that, compared to a conventional stereo signal, more closely resemble the acoustical ambience as it would be perceived by a real human. Such audio signals may be captured, for instance, by the audio terminals described in more detail in sections 3 to 9 below.
  • WO 2012/061148 A1 discloses systems, methods and apparatuses for detecting head movement based on recorded sound signals. US 2011/0280409 A1 discloses that a personalized hearing profile is generated for an ear-level device comprising a memory, microphone, speaker and processor. Communication is established between the ear-level device and a companion device, having a user interface. A frame of reference in the user interface is provided, where positions in the frame of reference are associated with sound profile data. A position on the frame of reference is determined in response to user interaction with the user interface, and certain sound profile data associated with the position. Certain data is transmitted to the ear level device. Sound can be generated through the speaker based upon the audio stream data to provide real-time feedback to the user. The determining and transmitting steps are repeated until detection of an end event.
  • SUMMARY OF THE INVENTION
  • It is an object of the present invention to provide technology that may be used in various Audio-3D usage scenarios.
  • In an aspect of the invention, an audio system according to claim 1 is presented.
  • It is preferred that the conference bridge is adapted to monaurally mix the multi-channel audio data streamed from the first and the second audio terminal to the multi-channel audio data streamed from the third audio terminal to generate the multi-channel audio mix.
  • It is further preferred that the conference bridge is further adapted to spatially position the monaurally mixed multi-channel audio data streamed from the first and the second audio terminal when generating the multi-channel audio mix.
  • It is also preferred that the audio system further comprises a telephone comprising a microphone and a speaker, wherein the conference bridge is further connectable with the telephone, wherein the conference bridge is adapted to mix the multi-channel audio data streamed from the first and the second audio terminal into a single-channel audio mix comprising a single audio channel and to stream the single-channel audio mix to the telephone.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • These and other aspects of the present invention will be apparent from and elucidated with reference to the embodiments described hereinafter. In the following drawings:
  • Fig. 1
    shows schematically and exemplarily a basic configuration of an audio terminal that may be used for Audio-3D,
    Fig. 2
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Audio Portation",
    Fig. 3
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Sharing Audio Snapshots",
    Fig. 4
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Attending a Conference from Remote",
    Fig. 5
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Multiple User Binaural Teleconference",
    Fig. 6
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Binaural Conference with Multiple Endpoints",
    Fig. 7
    shows schematically and exemplarily a possible usage scenario for Audio-3D, here "Binaural Conference with Conventional Telephone Endpoints",
    Fig. 8
    shows an example of an artificial head equipped with a prototype headset for Audio-3D,
    Fig. 9
    shows schematically and exemplarily a signal processing chain in an Audio-3D terminal device, here a headset,
    Fig. 10
    shows schematically and exemplarily a signal processing chain in another Audio-3D terminal device, here a speakerbox,
    Fig. 11
    shows schematically and exemplarily a typical functionality of an Audio-3D conference bridge, based on an exemplary setup composed of three participants,
    Fig. 12
    shows schematically and exemplarily a conversion of monaural, narrowband signals to Audio-3D signals in the Audio-3D conference bridge shown in Fig. 11, and
    Fig. 13
    shows schematically and exemplarily a conversion of Audio-3D signals to monaural, narrowband signals in the Audio-3D conference bridge shown in Fig. 11.
    DETAILED DESCRIPTION OF EMBODIMENTS 3. Basic configuration of audio terminal for Audio-3D
  • A basic configuration of an audio terminal 100 that may be used for Audio-3D is schematically and exemplarily shown in Fig. 1. In this example, the audio terminal 100 comprises a first device 10 and a second device 20 which is separate from the first device 10. In the first device 10, there are provided a first and a second microphone 11, 12 for capturing multi-channel audio data comprising a first and a second audio channel. In the second device 20, there is provided a communication unit 21 for, here, voice and data communication. The first and the second device 10, 20 are adapted to be connected with each other via a local wireless transmission link 30. The first device 10 is adapted to stream the multi-channel audio data, i.e., the data comprising the first and the second audio channel, to the second device 20 via the local wireless transmission link 30 and the second device 20 is adapted to receive and process and/or store the multi-channel audio data streamed from the first device 10.
  • Here, the first device 10 is an external speaker/microphone apparatus as described in detail in the unpublished International patent application PCT/EP2013/067534, filed on 23 August 2013. In this example, it comprises a housing 17 that is formed in the shape of a (regular) icosahedron, i.e., a polyhedron with 20 triangular faces. Such an external speaker/microphone apparatus, in this specification also designated as a "speakerbox", is marketed by the company Binauric SE under the name "BoomBoom". Here, the first and the second microphone 11, 12 are arranged at opposite sides of the housing 17, at a distance of, for example, about 12.5 cm. Since the first and the second microphone 11, 12 are spatially separated by the housing 17, the shape of which at least approximately resembles the roundish shape of a human head, the multi-channel audio data captured by the two microphones 11, 12 can more closely resemble the acoustical ambience as it would be perceived by a real human (compared to a conventional stereo signal).
  • The audio terminal 100, here, in particular, the first device 10, further comprises a first and a second speaker 15, 16 for playing back multi-channel audio data comprising at least a first and a second audio channel. Also, the audio terminal 100 is adapted to stream the multi-channel audio data from the second device 20 to the first device 10 via a local wireless transmission link, for instance, a transmission link complying with the Bluetooth standard, preferably, the current Bluetooth Core Specification 4.1.
  • The second device 20, here, is a smartphone, such as an Apple iPhone or a Samsung Galaxy. In this case, the data communication unit 21 supports voice and data communication via one or more mobile communication standards, such as GSM (Global System for Mobile Communication), UMTS (Universal Mobile Telecommunications System) or LTE (Long-Term Evolution). Additionally, it may support one or more further network technologies, such as WLAN (Wireless LAN).
  • In this example, the audio terminal 100, here, in particular, the first device 10, further comprises a third and a fourth microphone 13, 14 for capturing further multi-channel audio data comprising a third and a fourth audio channel. The third and the fourth microphone 13, 14 are provided on a same side of the housing 17, at a distance of, for example, about 1.8 cm. Preferably, these microphones can be used to better classify audio capturing situations (e.g., the direction of arrival of the audio signals) and may thereby support stereo enhancement. Also, if the first device 10 is a speakerbox and an additional one of the first device 10 is provided (not shown in the figures), the third and the fourth microphone 13, 14 of each of the two speakerboxes may be used to locate the position of the speakerboxes for allowing True Wireless Stereo in combination with stereo crosstalk cancellation (see below for details). Further options for using the third and the fourth microphone 13, 14 are to capture the acoustical ambience for reducing background noise with a noise cancelling algorithm (near speaker to far speaker), to measure the ambience volume level for automatically adjusting the playback level (loudness of music, voice prompts and far speaker) to a convenient listening level - e.g., a lower volume late at night in the bedroom or a louder playback in a noisy environment - and/or to detect the direction of sound sources (for example, a beamformer could focus on near speakers and attenuate unwanted sources more efficiently).
  • 4. Further details on the local wireless transmission link
  • With continuing reference to Fig. 1, the local wireless transmission link 30, here, is a transmission link complying with the Bluetooth standard, preferably, the current Bluetooth Core Specification 4.1. The standard provides a large number of different Bluetooth "profiles" (currently over 35), which are specifications regarding a certain aspect of a Bluetooth-based wireless communication between devices. One of the profiles is the so-called Advanced Audio Distribution Profile (A2DP), which describes how stereo-quality audio data can be streamed from an audio source to an audio sink. This profile could, in principle, be used to also stream binaurally recorded audio data. However, available smartphones are, unfortunately, only able to act as an audio source and to stream audio data to an external speaker/microphone apparatus using A2DP; they do not support the use of the Advanced Audio Distribution Profile to stream audio data to them. On the other hand, another Bluetooth profile, the so-called Hands-Free Profile (HFP), describes how a gateway device can be used to place and receive calls for a hands-free device. This profile can be used to stream audio data to a smartphone acting as an audio sink; however, it only supports monaural signals at a comparably low data rate of up to only 64 kbit/s.
  • In view of the above, the multi-channel audio data are streamed according to the present invention using the Bluetooth Serial Port Profile (SPP) or the iPod Accessory Protocol (iAP). SPP defines how to set up virtual serial ports and connect two Bluetooth enabled devices. It is based on 3GPP TS 07.10, "Terminal Equipment to Mobile Station (TE-MS) multiplexer protocol", 3GPP Technical Specification Group Terminals, 1997, and the RFCOMM protocol. It basically emulates a serial cable to provide a simple substitute for existing RS-232 connections, including the control signals known from that technology. SPP is supported, for example, by Android based smartphones, such as a Samsung Galaxy. For iOS based devices, such as the Apple iPhone, iAP provides a similar protocol that is likewise based on both 3GPP TS 07.10 and RFCOMM.
  • With regards to the transmission of the multi-channel audio data from the first device 10 to the second device 20, it is preferred that the synchronization between the first and the second audio channel is kept as far as possible during the transmission, since any synchronization problems may destroy the binaural cues or at least lead to the impression of moving audio sources. For instance, at a sampling rate of 48 kHz, the delay between the left and the right ear is limited to about 25 to 30 samples if the audio signal arrives from one side. Thus, one preferred solution is to transmit synchronized audio data from each of the first and the second channel together in the same packet, ensuring that the synchronization between the audio data is not lost during transmission. For example, when a source coding using Low Complexity Subband Coding (SBC) is used, samples from the first and the second audio channel may preferably be packed into one packet for each segment; hence, there is no chance of deviation. Moreover, it is preferred that the audio data of the first and the second audio channel are generated by the first and the second microphone 11, 12 on the basis of the same clock or a common clock reference in order to ensure a substantially zero sample rate deviation.
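The described packing rule can be sketched as follows; the 6-byte header with a sequence number and two length fields is a hypothetical container format, the essential point being that the left and the right SBC frame of one segment always travel in the same packet.

```python
import struct

def pack_sbc_frames(seq, left_frame, right_frame):
    """One packet per segment, carrying both channels, so receiver-side
    synchronization between the channels can never drift."""
    header = struct.pack(">HHH", seq, len(left_frame), len(right_frame))
    return header + left_frame + right_frame

def unpack_sbc_frames(packet):
    seq, n_l, n_r = struct.unpack_from(">HHH", packet, 0)
    body = packet[6:]
    return seq, body[:n_l], body[n_l:n_l + n_r]
```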
  • 5. Variations
  • In the audio terminal 100 described with reference to Fig. 1, the first device 10 is an external speaker/microphone apparatus, which comprises a housing 17 that is formed in the shape of a (regular) icosahedron. However, the first device 10 may also be something else. For example, the shape of the housing may be formed in substantially a U-shape for being worn by a user on the shoulders around the neck, in this specification also designated as a "shoulderspeaker" (not shown in the figures). In this case, at least a first and a second microphone for capturing multi-channel audio data comprising a first and a second audio channel may be provided at the sides of the "legs" of the U-shape, at a distance of, for example, about 20 cm. Since this distance is quite close to the average distance between the ears of a human (approximately 25 cm) and since the first and the second microphone are spatially separated by the user's neck when the shoulderspeaker is worn, the multi-channel audio data captured by the two microphones can more closely resemble the acoustical ambience as it would be perceived by a real human (compared to a conventional stereo signal). In other examples, the first device may be an external speaker/microphone apparatus that is configured as an over- or on-the-ear headset, as an in-ear phone or that is arranged on glasses worn by the user. In all these cases, the captured multi-channel audio data comprising a first and a second audio channel may provide a better approximation of what a real human would here than a conventional stereo signal, wherein the resemblance may become particularly good if the microphones are arranged as close as possible to (or even within) the ears of the user, as it is possible, e.g., with headphones and in-ear phones. In this context, reference is made again to the unpublished International patent application PCT/EP2013/067534, filed on 23 August 2013 , which describes details of different types of external speaker/microphone apparatuses that can be used in the context of the present invention. It is noted that in order to even better emulate the human acoustical perception with microphone configurations that are not at or within the ears of the user, the microphones may preferably be provided with structures that resemble the form of the human outer and/or inner ears.
  • The audio terminal 100, here, in particular, the first device 10, may also comprise an accelerometer (not shown in the figures) for measuring an acceleration and/or gravity thereof. In this case, the audio terminal 100 is preferably adapted to control a function in dependence of the measured acceleration and/or gravity. For instance, it can be foreseen that the user can power up (switch on) the first device 10 by simply shaking it. Additionally or alternatively, the audio terminal 100 can also be adapted to determine a misplacement thereof in dependence of the measured acceleration and/or gravity. For instance, it can be foreseen that the audio terminal 100 can determine whether the first device 10 is placed with an orientation that is generally suited for providing a good audio capturing performance.
  • With returning reference to Fig. 1, the audio terminal 100 may comprise, in some scenarios, at least one additional one of the second device (shown in a smaller size at the top of the figure), or, more generally, at least one further speaker for playing back audio data comprising at least a first audio channel provided in a device that is separate from the first device 10.
  • While in the audio terminal 100 described with reference to Fig. 1, the second device 20 is a smartphone, it may also be, for example, a tablet PC, a stationary PC or a notebook with WLAN support, etc.
  • 6. Communication protocol for auxiliary control
  • The audio terminal 100 preferably allows over-the-air flash updates and device control of the first device 10 from the second device 20 (including updates for voice prompts used to notify status information and the like to a user) over a reliable Bluetooth protocol. For an Android based smartphone, such as a Samsung Galaxy, a custom RFCOMM Bluetooth service will preferably be used. For an iOS based device, such as the Apple iPhone, the External Accessory Framework is preferably utilized. It is foreseen that the first device 10 supports at most two simultaneous control connections, be it to an Android based device or an iOS based device. If both are already connected, further control connections will preferably be rejected.
  • The iOS External Accessory protocol identifier may, for example, be a simple string like com.binauric.bconfig. On an Android based device, a custom service UUID of, for example, 0x5dd9a71c3c6341c6a3572929b4da78b1 may be used.
  • In the following, we describe a message protocol that may preferably be used for these operations. The description is exemplarily made with respect to a combination where the first device 10 is a speakerbox and the second device 20 is a smartphone running an app implementing at least part of the necessary operations. The speakerbox, here, comprises a virtual machine (VM) application executing at least part of the operations as well as one or more flash memories.
  • Here, each message consists of a tag (16 bit, unsigned), followed by a length (16 bit, unsigned) and then the optional payload. The length is always the size of the entire payload in bytes, including the TL header. All integer values are preferably big-endian.
  • All requests and responses are limited to a maximum of 128 bytes unless an exception is noted in the message descriptions below.
    TAG [uint16] LENGTH [uint16] PAYLOAD [byte array]
  • All responses have a status code as the first word of their payload (See Table 2: Status codes). All unsupported requests shall be responded to with a specific error notification, namely, NOT_SUPPORTED.
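A minimal Python sketch of this framing, under the stated rule that the 16-bit length covers the entire message including the 4-byte tag/length header, with all integers big-endian:

```python
import struct

def pack_message(tag, payload=b""):
    """Build a message: 16-bit tag, 16-bit length, optional payload.
    The length counts the whole message including the 4-byte header."""
    return struct.pack(">HH", tag, 4 + len(payload)) + payload

def parse_message(data):
    tag, length = struct.unpack_from(">HH", data, 0)
    if length > 128:
        raise ValueError("requests and responses are limited to 128 bytes")
    return tag, data[4:length]

# Example: STATUS_REQUEST (tag 1, empty payload) from Table 1 below.
wire = pack_message(1)
```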
  • 6.1 Requests and Responses
  • The following Table 1 enumerates the requests and responses exemplarily defined for operations that fall into the following classes:
    • Device specific control operations
    • Over-the-air (OTA) flash control operations
    • Binaural recording operations
  • The OTA control operations preferably start at "Hash-Request" and work on 8 Kbyte sectors. The protocol is inspired by rsync: Before transmitting flash updates, applications should compute the number of changed sectors by retrieving the hashes of all sectors, and then only transmit sectors that need updating.
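A sketch of the corresponding client-side logic in Python; 'device_hashes' stands for the values returned by HASH_REQUESTs and 'local_hash' for the (unspecified) 64-bit hash function used by the firmware, both placeholders.

```python
SECTOR_SIZE = 8 * 1024  # OTA operations work on 8 KByte sectors

def changed_sectors(new_image, device_hashes, local_hash):
    """rsync-inspired diff: hash each sector of the new image locally and
    compare against the hashes reported by the device; only sectors whose
    hashes differ need to be transmitted via WRITE_REQUESTs."""
    changed = []
    for sector, dev_hash in enumerate(device_hashes):
        chunk = new_image[sector * SECTOR_SIZE:(sector + 1) * SECTOR_SIZE]
        if local_hash(chunk) != dev_hash:
            changed.append(sector)
    return changed
```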
  • Flash updates go to a secondary flash memory which, only once confirmed to be correct, is used to update the primary flash. Table 1: Requests and responses
    Request/Response Tag Payload Comment
    STATUS_REQUEST 1 -
    STATUS_RESPONSE 2 + Signal strength, battery level, and accelerometer data
    VERSION_REQUEST 3 -
    VERSION_RESPONSE 4 + Firmware and prompt versions, language and variant
    SET_NAME_REQUEST 5 + Set the name of the device
    SET_NAME_RESPONSE 6 +
    STEREO_PAIR_REQUEST 16 + Start special pairing for stereo operation (master)
    STEREO_PAIR_RESPONSE 17 +
    STEREO_UNPAIR_REQUEST 18 - Remove special stereo pairing (should be sent to master and slave)
    STEREO_UNPAIR_RESPONSE 19 +
    COUNT_PAIRED_DEVICE_REQUEST 128 -
    COUNT_PAIRED_DEVICE_RESPONSE 129 + Returns the number of devices in the paired device list.
    PAIRED_DEVICE_REQUEST 130 +
    PAIRED_DEVICE_RESPONSE 131 + Returns information on a single paired device.
    DELETE_PAIRED_DEVICE_REQUEST 130 +
    DELETE_PAIRED_DEVICE_RESPONSE 131 +
    ENTER_OTA_MODE_REQUEST 256 - Enter the over-the-air update mode
    ENTER_OTA_MODE_RESPONSE 257 +
    EXIT_OTA_MODE_REQUEST 258 + End (commit or cancel) an over-the-air firmware update
    EXIT_OTA_MODE_RESPONSE 259 +
    EXIT_OTA_COMPLETE_REQUEST 260 - Tells the VM that the EXIT_OTA_MODE_RESPONSE was received and that the VM can hand over control to the PIC for the reboot and flash update
    HASH_REQUEST 262 +
    HASH_RESPONSE 263 +
    READ_REQUEST 264 + Read flash
    READ_RESPONSE 265 +
    ERASE_REQUEST 266 + Erase flash
    ERASE_RESPONSE 267 +
    WRITE_REQUEST 268 + Write flash
    WRITE_RESPONSE 269 +
    BINAURAL_RECORD_START_REQUEST 400 - Starts the automatic sending of encoded binaural audio packets (i.e., BINAURAL_RECORD_AUDIO_RESPONSE packets)
    BINAURAL_RECORD_START_RESPONSE 401 +
    BINAURAL_RECORD_STOP_REQUEST 402 - Stops the automatic sending of binaural audio packets
    BINAURAL_RECORD_STOP_RESPONSE 403 +
    BINAURAL_RECORD_AUDIO_RESPONSE 405 + Unsolicited packets containing SBC encoded audio data.
  • 6.2 Status codes
  • The following Table 2 enumerates the status codes. Table 2: Status codes
    Name Code
    SUCCESS 0
    INVALID_ARGUMENT 1
    NOT_SUPPORTED 3
    OTA_ERROR 4
    PAIRED_DEV_DELETE_FAIL 5
    STATUS_VALUE_READ_FAIL 6
    OTA_ERROR_LOW_BATTERY 7
    OPERATION_NOT_ALLOWED 8
    UNKNOWN_ERROR 0xFFFF
  • 6.3 Parameter definitions 6.3.1 STATUS RESPONSE
  • The following Table 3 illustrates the response to the STATUS_REQUEST, which has no parameters. It returns the current signal strength, battery level and accelerometer data. Table 3: STATUS_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Signal strength uint16 The signal strength
    Battery level uint16 The battery level as integer percent value between 0 and 100.
    Charging status uint16 One of:
    • power_charger_disconnected (0)
    • power_charger_disabled (1)
    • power_charger_trickle (2)
    • power_charger_fast (3)
    • power_charger_boost_internal (4)
    • power_charger_boost_external (5)
    • power_charger_complete (6)
    Accelerometer x-value int16 Acceleration in thousandths of a g.
    Accelerometer y-value int16 See above
    Accelerometer z-value int16 See above
  • 6.3.2 VERSION RESPONSE
  • The following Table 4 illustrates the response to the VERSION_REQUEST, which has no parameters. All strings in this response need only be null terminated if their values are shorter than their maximum length. Table 4: VERSION_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Version 40 byte string The git tag (or hash if the version is not tagged).
    Prompt version 40 byte string The git tag (or hash if the version is not tagged)
    Prompt language 6 byte string Language Code/Country code, separated by an underscore
    Prompt variant 32 byte string The name of the variant (the name of the speaker or more generally, the name of the prompt set)
    Product uint16 Boom Boom, Shoulderspeaker, In-ear
    Hardware revision uint16 Hardware revision number
  • 6.3.3 SET NAME REQUEST
  • The following Table 5 illustrates the SET_NAME_REQUEST. This request allows setting the name of the speaker. Here, it is the smartphone app's responsibility to make sure that the name requested is different from the name currently set on the speaker. This assumption saves the speakerbox from having to do this check on the VM application, which often is a lot more resource constrained. It is also the smartphone app's responsibility to ensure that the name sent in this request is null terminated and that it is not an empty string. Table 5: SET_NAME_REQUEST
    Parameter Name Size Comment
    Name 31 byte string New name of the speaker
  • 6.3.4 SET NAME RESPONSE
  • The following Table 6 illustrates the response to the SET_NAME_REQUEST. Table 6: SET_NAME_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
  • 6.3.5 STEREO PAIR REQUEST
  • The following Table 7 illustrates the STEREO_PAIR_REQUEST. This request initiates the special pairing with two speakerboxes for stereo mode (True Wireless Stereo; TWS), which will be described in more detail below. It needs to be sent to both speakerboxes, in different roles. The decision which speakerbox is master, and which is slave is arbitrary. The master device will become the right channel. Table 7: STEREO_PAIR_REQUEST
    Parameter Name Size Comment
    BT address 6 byte, Bluetooth address The address of the device to pair with
    Role uint16 0: slave
    1: master
  • 6.3.6 STEREO PAIR RESPONSE
  • The following Table 8 illustrates the response to the STEREO_PAIR_REQUEST, which has no parameters. Table 8: STEREO_PAIR_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
  • 6.3.7 STEREO UNPAIR RESPONSE
  • The following Table 9 illustrates the response to the STEREO_UNPAIR_REQUEST, which has no parameters. It must be sent to both the master and the slave. Table 9: STEREO_UNPAIR_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
  • 6.3.8 COUNT PAIRED DEVICE RESPONSE
  • The following Table 10 illustrates the response to the COUNT_PAIRED_DEVICE_REQUEST, which has no parameters. It returns the number of paired devices. Table 10: COUNT_PAIRED_DEVICE_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Count uint16 The number of paired devices
  • 6.3.9 PAIRED DEVICE REQUEST
  • The following Table 11 illustrates the PAIRED_DEVICE_REQUEST. It allows requesting information about a paired device from the speakerbox. Table 11: PAIRED_DEVICE_REQUEST
    Parameter Name Size Comment
    Index uint16 Index of the paired device to retrieve information on. Must be between 0 and count-1.
  • 6.3.10 PAIRED DEVICE RESPONSE
  • The following Table 12 illustrates the response to the PAIRED_DEVICE_REQUEST. The smartphone's app needs to send this request for each paired device it is interested in. If for some reason the read of the requested information fails the speakerbox will return a PAIRED_DEVICE_RESPONSE with just the status field. The remaining fields specified below will not be included in the response packet. Therefore the actual length of the packet will vary depending on whether the required information can be supplied. Table 12: PAIRED_DEVICE_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Index uint16 The index of the paired device
    BT address 6 byte, Bluetooth address The address of the paired device
    Device class 4 bytes The most significant byte is zero, followed by 3 bytes with the BT Class of Device.
    If the class of device for that device is not available, it will be set to all zeroes.
    Name 31 byte string, null terminated The name of the device that was used for pairing
  • 6.3.11 DELETE PAIRED DEVICE REQUEST
  • The following Table 13 illustrates the DELETE_PAIRED_DEVICE_REQUEST. It allows deleting paired devices from the speakerbox. It is permissible to delete the currently connected device, but this will make it necessary to pair with the current device again the next time the user connects to it. If no Bluetooth address is included in this request, all paired devices will be deleted. Table 13: DELETE_PAIRED_DEVICE_REQUEST
    Parameter Name Size Comment
    Device Single 6 byte Bluetooth address Bluetooth address of the device to be removed from the paired device list.
  • 6.3.12 DELETE PAIRED DEVICE RESPONSE
  • The following Table 14 illustrates the response to the DELETE_PAIRED_DEVICE_REQUEST. Table 14: DELETE_PAIRED_DEVICE_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
  • 6.3.13 ENTER OTA MODE RESPONSE
  • The following Table 15 illustrates the ENTER_OTA_MODE_RESPONSE, i.e., the response to the ENTER_OTA_MODE_REQUEST, which puts the device in OTA mode. The firmware will drop all other profile links, thus stopping, e.g., the playback of music. Table 15: ENTER_OTA_MODE_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
  • 6.3.14 EXIT OTA MODE REQUEST
  • The following Table 16 illustrates the EXIT_OTA_MODE_REQUEST. If the payload of the request is non-zero in length, the requester wants to write the new flash contents to the primary flash. To avoid bricking the device, this operation must only succeed if the flash image hash can be validated. If the payload of the request is zero in length, the requester just wants to exit the OTA mode and continue without updating any flash contents. Table 16: EXIT_OTA_MODE_REQUEST
    Parameter Name Type Comment
    Complete flash image hash uint64 A 64-bit hash of the complete 15.69 Mbit of flash. This is an extra sanity check. If the hash doesn't match then the primary flash update operation will not occur.
    If this isn't present then the OTA mode is exited without updating flash contents.
  • 6.3.15 EXIT OTA MODE RESPONSE
  • The following Table 17 illustrates the response to the EXIT_OTA_MODE_REQUEST. Table 17: EXIT_OTA_MODE_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
    OTA_ERROR returned if the hash does not match.
    SUCCESS returned for "matching hash", and also for "exit OTA mode and continue without updating flash contents"
  • 6.3.16 EXIT OTA COMPLETE REQUEST
  • The EXIT_OTA_COMPLETE_REQUEST will shut down the Bluetooth transport link and kick the PIC to carry out the CSR8670 internal flash update operation. This message will only be acted upon if it follows an EXIT_OTA_MODE_RESPONSE with SUCCESS ("matching hash").
  • 6.3.17 HASH REQUEST
  • The following Table 18 illustrates the HASH_REQUEST. It requests the hash values for a number of sectors. The requester should not request more sectors than can fit in a single response packet. Table 18: HASH_REQUEST
    Parameter Name Type Comment
    Sector Map uint256 A bit field of 251 bits (1 bit per flash sector). Sectors are 8kByte in size. A bit being set indicates that a hash for that sector is requested.
    Note: Bit 0 of the 32nd byte equates to sector 0.
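  • The following minimal Python sketch builds such a sector map (the same bit field format is used by the ERASE_REQUEST in Table 22 below). Interpreting the note such that bit 0 of the last (32nd) byte corresponds to sector 0, i.e., reading the field as a big-endian 256-bit integer, is an assumption.

```python
def build_sector_map(sectors) -> bytes:
    """Build the 32-byte (uint256) sector bit field used by
    HASH_REQUEST and ERASE_REQUEST (Tables 18 and 22). Per the note,
    bit 0 of the 32nd byte corresponds to sector 0; the map is
    therefore treated here as a big-endian 256-bit integer, which is
    an assumption consistent with that note."""
    bitfield = bytearray(32)
    for s in sectors:
        if not 0 <= s <= 250:          # 251 sectors of 8 kByte each
            raise ValueError(f"sector {s} out of range 0..250")
        bitfield[31 - s // 8] |= 1 << (s % 8)
    return bytes(bitfield)

# Example: request hashes for sectors 0, 1 and 250.
sector_map = build_sector_map([0, 1, 250])
```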
  • 6.3.18 HASH RESPONSE
  • The following Table 19 illustrates the response to the HASH_REQUEST. Table 19: HASH_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
    Hashes Array of uint64 hash values
  • 6.3.19 READ REQUEST
  • The following Table 20 illustrates the READ_REQUEST. It requests a read of the data from flash. Each sector will be read in small chunks so as not to exceed the maximum response packet size of 128 bytes. Table 20: READ_REQUEST
    Parameter Name Type Comment
    Sector number uint16 The sector (0-250) where the data should be read from. Each sector is 8kByte in size
    Sector Offset uint16 An offset from the start of the sector to read from. This offset is in units of uint16's.
    Count uint16 Number of words (uint16's) to read
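  • A host-side read of a complete sector could be chunked as in the following Python sketch. The transport function send_request and the chunk size of 60 words (keeping the response at 2 status bytes plus 120 data bytes, below the 128-byte limit) are hypothetical choices; the big-endian byte order is again an assumption.

```python
import struct

WORDS_PER_SECTOR = 4096            # 8 kByte = 4096 uint16 words
CHUNK_WORDS = 60                   # 2 + 2 * 60 = 122 bytes < 128

def read_sector(send_request, sector: int) -> bytes:
    """Read one full flash sector via repeated READ_REQUESTs
    (Table 20). `send_request(payload) -> bytes` is a hypothetical
    transport function returning the READ_RESPONSE data field.
    Offsets and counts are in uint16 words, as specified."""
    data = bytearray()
    for offset in range(0, WORDS_PER_SECTOR, CHUNK_WORDS):
        count = min(CHUNK_WORDS, WORDS_PER_SECTOR - offset)
        payload = struct.pack(">HHH", sector, offset, count)
        data += send_request(payload)
    return bytes(data)
```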
  • 6.3.20 READ RESPONSE
  • The following Table 21 illustrates the response to the READ_REQUEST. Table 21: READ_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
    Data uint16 array The data read from the flash
  • 6.3.21 ERASE REQUEST
  • The following Table 22 illustrates the ERASE_REQUEST. It requests a set of flash sectors to be erased. Table 22: ERASE_REQUEST
    Parameter Name Type Comment
    Sector Map uint256 A bit field of 251 bits (1 bit per flash sector). Sectors are 8kByte in size. A bit being set indicates that the sector should be erased.
    Note: Bit 0 of the 32nd byte equates to sector 0.
  • 6.3.22 ERASE RESPONSE
  • The following Table 23 illustrates the response to the ERASE_REQUEST. Table 23: ERASE_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
  • 6.3.23 WRITE REQUEST
  • The following Table 24 illustrates the WRITE_REQUEST. It writes a sector. This packet has - unlike all other packets - a maximum size of 8200 bytes to be able to hold an entire 8kByte sector. Table 24: WRITE_REQUEST
    Parameter Name Size Comment
    Sector number uint16 The sector (0-250) where the data should be written
    Offset uint16 Offset within the sector, in units of uint16, at which to start writing
    Data uint16 array At most 4096 words (or 8192 bytes)
  • 6.3.24 WRITE RESPONSE
  • The following Table 25 illustrates the response to the WRITE_REQUEST. Table 25: WRITE_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
  • 6.3.25 WRITE KALIMBA RAM REQUEST
  • The following Table 26 illustrates the WRITE_KALIMBA_RAM_REQUEST. It writes to the Kalimba RAM. The overall request must not be larger than 128 bytes. Table 26: WRITE_KALIMBA_RAM_REQUEST
    Parameter Name Size Comment
    Address uint32 Destination address
    Data uint32 array The length of the data may be at most 30 uint32 values (or 120 bytes)
  • 6.3.26 WRITE KALIMBA RAM RESPONSE
  • The following Table 27 illustrates the response to the WRITE_KALIMBA_RAM_REQUEST. Table 27: WRITE_KALIMBA_RAM_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
  • 6.3.27 EXECUTE KALIMBA REQUEST
  • The following Table 28 illustrates the EXECUTE_KALIMBA_REQUEST. On the Kalimba, it forces execution from a given address. Table 28: EXECUTE_KALIMBA_REQUEST
    Parameter Name Size Comment
    Address uint32 Address
  • 6.3.28 EXECUTE KALIMBA RESPONSE
  • The following Table 29 illustrates the response to the EXECUTE_KALIMBA_REQUEST. Table 29: EXECUTE_KALIMBA_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
  • 6.3.29 BINAURAL RECORD START RESPONSE
  • The following Table 30 illustrates the response to the BINAURAL_RECORD_START_REQUEST. Table 30: BINAURAL_RECORD_START_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Codec uint16 Codec list:
     PCM (linear)
     SBC
     APT-X
     Opus
     G.729
     AAC HE
     MPEG Layer 2
     MPEG Layer 3
    Sampling rate uint16
  • 6.3.30 BINAURAL RECORD STOP RESPONSE
  • The following Table 31 illustrates the response to the BINAURAL_RECORD_STOP_REQUEST. Table 31: BINAURAL_RECORD_STOP_RESPONSE
    Parameter Name Type Comment
    Status uint16 See Table 2: Status codes
  • 6.3.31 BINAURAL RECORD AUDIO RESPONSE
  • The following Table 32 illustrates the BINAURAL_RECORD_AUDIO_RESPONSE. This is an unsolicited packet that will be sent repeatedly from the speakerbox with new audio content (preferably, SBC encoded audio data from the binaural microphones), following a BINAURAL_RECORD_START_REQUEST. To stop the automatic sending of these packets a BINAURAL_RECORD_STOP_REQUEST must be sent. The first BINAURAL_RECORD_AUDIO_RESPONSE packet will contain the header below, subsequent packets will contain just the SBC frames (no header) until the total length of data sent is equal to the length in the header, i.e., a single BINAURAL_RECORD_AUDIO_RESPONSE packet may be fragmented across a large number of RFCOMM packets depending on the RFCOMM frame size negotiated. In a preferred implementation scenario, the header for BINAURAL_RECORD_AUDIO_RESPONSE is not sent with every audio frame. Rather, it is only sent approximately once per second to minimize the protocol overhead. Table 32: BINAURAL_RECORD_AUDIO_RESPONSE
    Parameter Name Size Comment
    Status uint16 See Table 2: Status codes
    Number of SBC frames in payload uint16 The number of SBC packets contained in the "SBC packet stream" portion of this control protocol packet.
    Number of SBC frames discarded uint16 If the radio link is not maintaining the required data rate to send all generated audio then some SBC packets will need to be discarded and not sent. This parameter holds the number of SBC packets that were discarded between this control protocol packet and the last one. N.B. This parameter should be zero for successful streaming without audio loss.
    SBC packet stream n bytes A concatenated stream of SBC packets.
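  • The following Python sketch illustrates a possible receiver-side reassembly of one such fragmented response. Since this excerpt does not state how the total payload length is encoded, the sketch derives it from the announced frame count and a fixed SBC frame length, which is an assumption.

```python
import struct

def reassemble_audio(fragments, sbc_frame_len: int):
    """Reassemble one BINAURAL_RECORD_AUDIO_RESPONSE (Table 32) from a
    sequence of RFCOMM fragments. Only the first fragment carries the
    header (status, frame count, discard count); subsequent fragments
    carry raw SBC data. Deriving the expected byte length from the
    frame count and a fixed SBC frame length is an assumption."""
    first = fragments[0]
    status, n_frames, n_discarded = struct.unpack_from(">HHH", first, 0)
    expected = n_frames * sbc_frame_len
    stream = bytearray(first[6:])      # payload after the 6-byte header
    for frag in fragments[1:]:
        if len(stream) >= expected:
            break
        stream += frag                 # header-less continuation data
    if n_discarded:
        print(f"warning: {n_discarded} SBC frames were discarded")
    return status, bytes(stream[:expected])
```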
  • 7. Usage scenarios for Audio-3D
  • As already explained above, Audio-3D aims to transmit speech contents as well as the acoustical ambience in which the speaker currently is located. In addition, Audio-3D may also be used to create binaural "snapshots" (also called "moments" in this specification) of situations in life, to share acoustical experiences and/or to create diary like flashbacks based on the strong emotions that can be triggered by the reproduction of the acoustical ambience of a life-changing event.
  • Possible usage scenarios for Audio-3D, together with their benefits in comparison to conventional telephony, are listed in the following:
  • 7.1 Scenario "Audio Portation"
  • A possible scenario "Audio Portation" is shown schematically and exemplarily in Fig. 2. In this scenario, a remote speaker is located in a natural acoustical environment which is characterized by a specific acoustical ambience. The remote speaker uses a mobile binaural terminal, e.g., comprising an over-the-ear headset connected with a smartphone via a local wireless transmission link (see sections 4 and 5 above), which connects to a local speaker. The binaural terminal of the remote speaker captures the acoustical ambience using the binaural headset. The binaural audio signal is transmitted to the local speaker which allows the local speaker to participate in the acoustical environment in which the remote speaker is located substantially as if the local speaker would be there (which is designated in this specification as "audio portation").
  • Compared to a communication link based on conventional telephony, besides understanding the content of the speech emitted by the remote speaker, the local speaker preferably hears all acoustical nuances of the acoustical environment in which the remote speaker is located such as the bird, the bat and the sounds of the beach.
  • 7.2 Scenario "Sharing Audio Snapshots"
  • A possible scenario "Sharing Audio Snapshots" is shown schematically and exemplarily in Fig. 3. In this scenario, a user is at a specific location and enjoys his stay there. In order to capture his positive feelings at that moment, he/she makes a binaural recording using an Audio-3D headset which is connected to a smartphone, denoted as the "Audio-3D-Snapshot". Once the snapshot is complete, the user also takes a photo from the location. The binaural recording is tagged by the photo, the exact position, which is available in the smartphone, the date and time and possibly a specific comment to identify this moment in time later on. All these informations are uploaded to a virtual place, such as a social media network, at which people can share Audio-3D-Snapshots.
  • In later situations, the user and those who share the uploaded contents can listen to the binaural content. Due to the additional information/data and the realistic impression that the Audio-3D-Snapshot can produce in the ears of the listener, the feelings the user may have had in the situation where he/she captured the Audio-3D-Snapshot can be reproduced much more realistically than would be possible based on a photo or a single-channel audio recording.
  • People can share audio and visual images to let other people participate, for example, in concerts, special locations and emotional situations.
  • 7.3 Scenario "Attending a Conference from Remote"
  • A possible scenario "Attending a Conference from Remote" is shown schematically and exemplarily in Fig. 4. In this scenario, Audio-3D technology connects a single remote speaker with a conferencing situation with multiple speakers.
  • The remote speaker uses a binaural headset 202 which is connected to a smartphone (not shown in the figure) that operates a binaural communication link (realized, for example, by means of an app). On the local side, one of the local speakers wears a binaural headset 201 to capture the signal or, alternatively, there is a binaural recording device on the local side which mimics the characteristics of a natural human head such as an artificial head.
  • In comparison to a similar situation which would be based on conventional telephony, the remote person hears not only the speech content which the speakers on the local side emit, but also additional information which is inherent to the binaural signal transmitted via the Audio-3D communication link. This additional information may allow the remote speaker to better identify the location of the speakers within the conference room. This, in particular, may enable the remote speaker to link specific speech segments to different speakers and may significantly increase the intelligibility even if all speakers talk at the same time.
  • 7.4 Scenario "Multiple User Binaural Conference"
  • A possible scenario "Multiple User Binaural Conference" is shown schematically and exemplarily in Fig. 5. In that scenario, two endpoints at remote locations M, N are connected via an Audio-3D communication link with multiple communication partners on both sides. One participant on each side has a "Master-Headset device" 301, 302, which is equipped with speakers and microphones. All other participants wear conventional stereo headsets 303, 304 with speakers only.
  • Due to the use of Audio-3D, a communication is enabled as if all participants shared one room. In particular, even if multiple speakers on both sides speak at the same time, the transmission of the binaural cues enables the listeners to separate the speakers based on their different locations.
  • 7.5 Scenario "Binaural Conference with Multiple Endpoints"
  • A possible scenario "Binaural Conference with Multiple Endpoints" is shown schematically and exemplarily in Fig. 6. This scenario is very similar to the scenario "Multiple User Binaural Conference", explained in section 7.4 above.
  • The main difference is that more than two groups are connected, e.g., three groups at remote locations M, N, O in Fig. 6. In order to connect the three groups, a network located Audio-3D conference bridge 406 is used to connect all three parties. A peer-to-peer connection from each of the groups to all other groups would, in principle, also be possible. However, the overall number of data links increases quadratically with the number of participating groups.
  • The purpose of the conference bridge 406 is to provide each participant group with a mix-down of the signals from all other participants. As a result, all participants involved in this communication situation have the feeling that all speakers are located at one place, such as, in one room. In specific situations, it may, however, be useful to preserve the grouping of people participating in this communication. In that case, the conference bridge may employ sophisticated digital signal processing to relocate signals in the virtual acoustical space. For example, for the listeners in group 1, the participants from group 2 may be artificially relocated to the left side and the participants from group 3 may be artificially relocated to the right side of the virtual acoustical environment.
  • 7.6 Scenario "Binaural Conference with Conventional Telephone Endpoints"
  • A possible scenario "Binaural Conference with Conventional Telephone Endpoints" is shown schematically and exemplarily in Fig. 7. This scenario is very similar to the scenario "Binaural Conference with Multiple Endpoints", explained in section 7.5 above. In this case, however, two participants at remote location O are connected to the binaural conference situation via a conventional telephone link using a telephone 505.
  • In order to allow all connected participants to be virtually located in one acoustical environment (e.g., room), the Audio-3D conference bridge 506 provides binaural signals to the two groups which are connected via an Audio-3D link. The signals originating from the conventional telephone link, however, are preferably extended to be located at a specific location in the virtual acoustical environment by HRTF (Head Related Transfer Function) rendering techniques (see, for example, G. Enzner et al., "Trends in Acquisition of Individual Head-Related Transfer Functions", The Technology of Binaural Listening, Springer-Verlag, pages 57 to 92, 2013; J. Blauert, "Spatial Hearing: The Psychophysics of Human Sound Localization", The MIT press, Cambridge, Massachusetts, 1983). Also, due to the poor quality of the speech arriving in the context of the conventional telephone link, speech enhancement technologies such as bandwidth extension (see B. Geiser, "High-Definition Telephony over Heterogeneous Networks", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2012) are preferably employed to improve the overall communication quality. In addition, sophisticated techniques to add a specific degree of binaural reverberation may preferably be employed to transform the signal originating from the speaker connected via the conventional telephone link to better match the room acoustic properties (reverberation time, early reflections) of the group to which he/she is added (room acoustic alignment).
  • For the participants connected via the conventional telephone link, the Audio-3D conference bridge 506 creates a mix-down from the binaural signals. Sophisticated mix-down techniques should preferably be employed to avoid comb-filtering and similar effects when combining the binaural signals. Also, the binaural signals should preferably be processed by means of sophisticated signal enhancement techniques such as, e.g., noise reduction and dereverberation to help the connected participants who listen to monaural signals captured in a situation with multiple speakers speaking at the same time from different directions.
  • 7.7 Combination of the described usage scenarios
  • It shall be noted that the use cases described in sections 7.1 to 7.6 above may also be combined. For example, binaural conferences may be extended by means of a recorder which captures the audio signals of the complete conference and afterwards stores them as an Audio-3D snapshot for later recovery.
  • 7.8 Further details
  • Further details of the present invention are now described with respect to a binaural conferencing situation (not shown in the figures) with three participants at different locations which all use a binaural terminal, such as an Audio-3D headset. In a conventional monaural telephone conference, the audio signals from all participants are mixed into one overall resulting signal at the same time. With a binaural communication, this may end up in a quite noisy acoustical result and in signal distortions due to the overlaying/mixing of three different binaural audio signals originating from the same environment. Therefore, the present invention foresees the following selection by a participant or by an automatic approach.
  • For example, one participant of the binaural conference may select a master binaural signal, either from participant 1, 2 or 3. As an example, the signal from participant 3 has been selected. Then, as a first option, the signals from participants 1 and 2 may be represented in mono (preferably freed from the sounds related to their acoustical environments) and mixed into the binaural signal from participant 3. Alternatively, the signals from participants 1 and 2 are kept monaural (preferably freed from the sounds related to their acoustical environments) and are then mixed binaurally into the binaural signal from participant 3.
  • In another example, the binaural signal from the currently speaking participant is preferably always used, which means that there will be a switch of the binaural acoustical environment. This concept may be realized by commonly known means, such as by detecting the current speaker by means of a level detection or the like. Alternatively, sophisticated signal processing algorithms may be employed to combine the recorded signals to form the best combination targeting a specific optimization criterion (e.g. to maximize the intelligibility).
  • In the following, we describe some further ideas that can be realized in different usage scenarios for Audio-3D. A first example preferably consists of one or more of the following steps:
    • A. Users A and B are each listening to music, and user A calls user B.
    • A1. The volume of the music is automatically reduced by e.g. 30 dB and users A and B hear each other binaurally.
    • A2. Alternatively, the music is automatically turned off and users A and B hear each other binaurally.
    • A3. Or, users A and/or B do not want to hear each other binaurally and manually switch to mono.
    • A4. If user A and/or B do not want to hear the music while they are talking to each other, they may turn off the music manually.
  • A second example preferably consists of one or more of the following steps:
    • B. Users A and B are each listening to music and, additionally, have added the acoustical environment, e.g., with -20 dB, via the binaural microphones. Now, user A calls user B. In the worst case, three binaural signals would have to be overlaid.
    • B1. The volume of the music is automatically reduced by e.g. 30 dB and users A and B hear each other binaurally. Additionally, users A and/or B still hear their own acoustical environment, e.g., with -20 dB.
    • B2. Alternatively, the music is automatically turned off and users A and B hear each other binaurally. Additionally, users A and/or B still hear their own acoustical environment, e.g., with -20 dB.
    • B3. Or, users A and/or B do not want to hear each other binaurally and manually switch to mono.
    • B4. If user A and/or B do not want to hear the music while they are talking to each other, they may turn off the music manually.
    • B5. Users A and B hear each other binaurally, but the music is automatically switched to mono and reduced in volume.
    • B6. Users A and B hear each other binaurally, but the music is automatically switched to mono and reduced in volume. Additionally, the acoustical environment is automatically switched to mono and reduced in volume.
    • B7. All other sources, except for the signals of users A and B are automatically switched to mono and positioned in a virtual acoustical environment, e.g., mid left and mid right.
  • 8. Signal processing for Audio-3D
  • As already explained above, it is crucial for 3D audio perception that the binaural cues, i.e., the inherent characteristics defining the relation between the left and the right audio channel, are substantially preserved and transmitted in the complex signal processing chain of an end-to-end binaural communication. For this reason, Audio-3D requires new algorithm designs of partial functionalities such as acoustical echo compensation, noise reduction, signal compression and adaptive jitter control. Also, specific new classes of algorithms must be introduced, such as stereo crosstalk cancellation, which aims at achieving binaural audio playback in scenarios in which users do not use headphones. In recent years, parts of the required algorithms were developed and investigated in the context of binaural signal processing for hearing aids (see T. Lotter, "Single and Multimicrophone Speech Enhancement for Hearing Aids", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2004; M. Jeub, "Joint Dereverberation and Noise Reduction for Binaural Hearing Aids and Mobile Phones", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2012). However, a major part of the functionalities, such as a binaural adaptive jitter buffer, are not available and must be conceptually developed from scratch.
  • 8.1 Differences between conventional monaural telephony and Audio-3D
  • In comparison to the signal processing approaches nowadays employed in conventional telephony to achieve high-quality monaural communication, in Audio-3D, additionally, the binaural cues inherent to the audio signal captured at the one side must be preserved until the audio signal reaches the ears of the connected partner at the other side. In that context, the binaural cues are defined as the characteristics of the relations between the two channels of the binaural signal, which are commonly expressed mainly as the Interaural Time Differences (ITD) and the Interaural Level Differences (ILD) (see J. Blauert, "Spatial Hearing: The Psychophysics of Human Sound Localization", The MIT press, Cambridge, Massachusetts, 1983).
  • The ITD cues influence the perception of the spatial location of acoustical events at low frequencies due to the time differences between the arrival of an acoustical wavefront at the left and the right human ear. Often, these cues are also denoted as phase differences between the two channels of the binaural signal. Human perception is rather sensitive to these cues, such that already a very slight shift of a fraction of a millisecond between the left and the right signal can have a significant impact on the perceived location of an acoustical event. This is rather intuitive since, with the known speed of sound of c = 340 m/s, a wavefront typically propagates from one ear to the other in approximately 0.7 milliseconds (distance approximately 25 cm).
  • In contrast to this, the ILD binaural cues have a strong impact on the human perception at high frequencies. The ILD cues are due to the shadowing and attenuation effects caused by the human head given signals arriving from a specific direction: The level tends to be higher at that side of the head which points into the direction of the origin of the acoustical event.
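  • As a simple illustration of these two cues, the following Python sketch estimates the ITD of a signal frame as the lag maximizing the interchannel cross-correlation, and the ILD as the interchannel energy ratio. Real systems evaluate both cues per frequency band, so this broadband version is only a sketch.

```python
import numpy as np

def binaural_cues(left, right, fs):
    """Estimate the two classic binaural cues from one signal frame:
    ITD as the lag (in seconds) maximizing the cross-correlation of
    the channels, ILD as the energy ratio in dB. With this sign
    convention, a negative ITD means the right channel lags the left."""
    n = len(left)
    corr = np.correlate(left, right, mode="full")  # lags -(n-1)..(n-1)
    itd = (np.argmax(corr) - (n - 1)) / fs
    ild = 10 * np.log10(np.sum(left**2) / (np.sum(right**2) + 1e-12))
    return itd, ild

# Example: a source whose wavefront reaches the left ear 0.5 ms earlier.
fs = 48000
t = np.arange(0, 0.02, 1 / fs)
sig = np.sin(2 * np.pi * 440 * t)
shift = int(0.0005 * fs)                # 24 samples at 48 kHz
left, right = sig, np.roll(sig, shift)  # circular shift as approximation
print(binaural_cues(left, right, fs))   # ITD of about -0.5 ms, ILD near 0 dB
```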
  • 8.2 Technological Backbones for Audio-3D
  • Due to experiences made in the past and the fact that there is no standard for Audio-3D communication yet, it is evident that Audio-3D can only be based on transmission channels for which the provider has end-to-end control. An introduction of Audio-3D as a standard in public telephone networks seems to be unrealistic due to the lack of cooperation and interest of the big telecommunication companies.
  • However, since powerful smartphones are available today and due to the wide spread of high rate internet accesses in urban areas - particularly in LTE networks (see 3GPP TS 36.212, "Evolved Terrestrial Radio Access (E-UTRA); Multiplexing and Channel Coding", 3GPP Technical Specification Group Radio Access Network, 2011) -, this is not a real limitation. As a consequence, Audio-3D should preferably be based on packet based transmission schemes, which requires technical solutions to deal with packet losses and delays.
  • 8.2 Audio-3D terminal devices (headsets)
  • In order to realize Audio-3D, new terminal devices are required. Instead of a single microphone in proximity to the mouth of the speaker as commonly used in conventional telephony, two microphones are required for Audio-3D, which must be located in proximity to the natural location where human perception actually happens, hence close to the entrance of the ear canal. A possible realization is shown in Fig. 8 based on an example of an artificial head equipped with a prototype headset for Audio-3D.
  • The microphone capsules are in close proximity to the entrance of the ear canal. The shown headset is not closed; otherwise, the usage scenario "Multiple User Binaural Conference" would not be possible, since in that scenario, the local acoustical signals need to reach the ear of the speaker on a direct path as well. Alternatively, closed headphones extended by a "hear-through" functionality as well as loudspeaker-microphone enclosures combined with stereo-crosstalk-cancellation and stereo-widening or wave field synthesis techniques are optional variants of Audio-3D terminal devices (refer to section 8.4.2).
  • Special consideration has to be taken to realize Audio-3D, since currently available smartphones support only monaural input channels. However, some manufacturers, such as, e.g., Tascam (see www.tascam.com), offer soundcards which can be used in stereo input and output mode in combination with, e.g., an iPhone. It is very likely that the USB On-The-Go (OTG) standard will soon allow connecting USB compliant high-quality soundcards with smartphones.
  • 8.3 Quality constraints for Audio-3D
  • In terms of human perception, binaural signals should preferably be of a higher quality, since the binaural masking threshold level is known to be lower than the masking threshold for monaural signals (see B.C.J. Moore, "An Introduction to the Psychology of Hearing", Academic Press, 4th Edition, 1997). As a consequence, for the communication between multiple participants of a binaural teleconference, a binaural signal transmitted from one location to the other should preferentially be of a higher quality compared to the signal transmitted in conventional monaural telephony. This implies that high-quality acoustical signal processing approaches should be realized, as well as audio compression schemes (audio codecs) which allow higher bit rates and therefore higher-quality modes.
  • Audio-3D, in this example, is packet based and principally an interactive duplex application. Therefore, the end-to-end delay should preferably be as low as possible to avoid negative impacts on conversations, and the transmission should be able to deal with different network conditions. Therefore, jitter compensation methods, frame loss concealment strategies and audio codecs which adapt the quality and the delay to a given instantaneous network characteristic are deemed crucial elements of Audio-3D applications.
  • In our view, Audio-3D applications shall be available for everybody. Therefore, simplicity in usage may also be considered a key feature of Audio-3D.
  • 8.4 Signal processing units in Audio-3D terminals
  • Principally, the functional units in a packet based Audio-3D terminal can be similar to those in a conventional VoIP terminal. Two variants are considered in the following. The variant shown schematically and exemplarily in Fig. 9 is preferably foreseen for use in a headset terminal device as shown in Fig. 8, which is the preferred solution. The variant shown schematically and exemplarily in Fig. 10 is preferably foreseen for use in a terminal device realized as a speakerbox, which may require additional signal processing for realizing a stereo crosstalk cancellation in the receiving direction and a stereo widening in the sending direction.
  • 8.4.1 Signal processing units in Audio-3D headsets
  • The most important difference between a conventional VoIP terminal and a packet based Audio-3D headset terminal, as shown schematically and exemplarily in Fig. 9, is that the Audio-3D terminal comprises two speakers and two microphones, which are associated with the left and the right ear of the person wearing the headset.
  • The most important functional components are two acoustical echo cancellers (AEC), one for the left and one for the right side. These acoustical echo cancellers are of high importance if the Audio-3D headset is not closed, as required for the usage scenario "Multiple User Binaural Conference": in this case, the audio signal from the far speaker is emitted by the speaker and directly fed back into the microphone with a high amplitude, since the speaker and the microphone are in close proximity. The far speaker may then be disturbed by his own voice, unless the feedback signal is removed by the acoustical echo cancellers.
  • In tests with prototype headset devices, it was found that a feedback from the left channel to the right channel could not be observed: A natural signal attenuation of 30 to 40 dB due to the fact that the audio signals are mostly emitted into the ear canal and attenuated by the user's head in-between appears to be sufficient, such that the acoustical echo cancellers can principally operate independently.
  • In the sending direction, the signal captured by each of the microphones is preferably processed by a noise reduction (NR), an equalizer (EQ) and an automatic gain control (AGC). The output from the AGC is finally fed into the source codec. This source codec is preferably specifically suited for binaural signals and transforms the two channels of the audio signal into a stream of packets of a moderate data rate which fulfill the high quality constraints as defined in section 8.3 above. The packets are finally transmitted to the connected communication partner via an IP link.
  • In the receiving direction, sequences of packets arrive from the connected communication partner. At first, the packets are fed into the adaptive jitter buffer unit (JB). This jitter buffer has control of the decoder to reconstruct the binaural audio signal from the arriving packets as well as of the frame loss concealment (FLC) functionality that performs error concealment in case packets have been lost or arrive too late. An example of such an adaptive jitter buffer, which may be used in Audio-3D, is described in detail in the unpublished International patent application PCT/EP2013/069536, filed on 19 September 2013 .
  • In the adaptive jitter buffer, network delays, denoted as "jitter", are compensated by buffering a specific number of samples. It is adaptive as the number of samples to be stored for jitter compensation may vary over time to adapt to given network characteristics. However, caution should be taken not to increase the end-to-end communication delay, which depends on the number of samples stored in the buffer before playback. Given that a frame was lost during the transmission from the connected communication partner or arrives too late, the decoder is preferably driven to perform a frameloss concealment. In some situations, however, a frameloss concealment cannot be performed by the decoder. In this case, the frameloss concealment unit is preferably driven to output audio samples that conceal the gap in the audio signal due to the missing audio samples. The output signal from the jitter buffer is fed, here, into an optional noise reduction (NR) and an automatic gain control (AGC) unit. In the optimal case of a high transmitted audio quality, these units are not necessary, since this functionality has been realized on the side of the connected communication partner. Nevertheless, they often make sense if the connected terminal does not provide the desired audio quality due to low bit rate source encoders or insufficient signal processing on the side of the connected terminal.
  • The following equalizer in the receiving direction (EQ) is preferably used to individually equalize the headset speakers and to adapt the audio signals according to the subjective amenities of the user. It was found, e.g., in R. Bomhardt et al., "Individualisierung der kopfbezogenen Übertragungsfunktion", 40. Jahrestagung für Akustik (DAGA), 2014, that an individual equalization can be crucial for a high-quality spatial perception of the binaural signals.
  • The processed signal is finally emitted by the speakers of the Audio-3D terminal headset.
  • 8.4.2 Signal processing units in Audio-3D speakerboxes
  • The functional units in the context of a packet based Audio-3D speakerbox terminal are shown schematically and exemplarily in Fig. 10. In comparison to the Audio-3D headset terminal, in addition to the functional components from Fig. 9, a functional unit for a stereo widening (STW) as well as a functional unit for a stereo crosstalk cancellation (XTC) are added.
  • The stereo widening unit transforms a stereo signal captured by means of two microphones into a binaural signal. This enhancement is principally necessary if the two microphones are not at a distance identical (or close) to that of the ears in human perception due to, e.g., a limited size of the speakerbox terminal device. Due to the knowledge of the capturing situation, the stereo widening unit can compensate for the lack of distance by artificially adding binaural cues such as increased interchannel phase differences for low frequencies and interchannel level differences for higher frequencies.
  • The two additional microphones, as described above, may help to better classify source locations. Due to the availability of side information about the sources in the acoustic environment, stereo widening on the sending side in a communication scenario may be denoted as "side information based stereo widening". Principally, stereo widening may also be based solely on the received signal on the receiving side of a communication scenario. In that case, it is denoted as "blind stereo widening" since no side information is available in addition to the transmitted binaural signal.
  • The stereo crosstalk cancelling unit is preferably used to aid the listener who is located at a specific position to perceive binaural signals. Mainly, it compensates for the loss of binaural cues due to the emission of the two channels via closely spaced speakers and a cross-channel interference (audio signals emitted by the right loudspeaker reaching the left ear and audio signals emitted by the left loudspeaker reaching the right ear). The purpose of the stereo crosstalk canceller unit is to employ signal processing to emit signals which cancel out the undesired cross-channel interference signals reaching the ears.
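  • The following Python sketch illustrates the underlying principle for a symmetric loudspeaker setup: the 2x2 matrix of acoustic paths is inverted per frequency bin, with a regularization term limiting the gain where the matrix is nearly singular. The impulse responses h_ipsi and h_contra are assumed to be known (e.g., measured); block-wise overlap-add processing is omitted for brevity, so this is a sketch, not a production design.

```python
import numpy as np

def xtc_filters(h_ipsi, h_contra, n_fft=1024, beta=1e-3):
    """Compute frequency-domain crosstalk-cancellation prefilters for a
    symmetric two-loudspeaker setup. h_ipsi / h_contra are the impulse
    responses from a speaker to the same-side / opposite-side ear
    (assumed identical for left and right by symmetry). The acoustic
    matrix [[Hi, Hc], [Hc, Hi]] is inverted per bin; beta is a
    Tikhonov regularization constant (an assumed tuning value)."""
    Hi = np.fft.rfft(h_ipsi, n_fft)
    Hc = np.fft.rfft(h_contra, n_fft)
    det = Hi * Hi - Hc * Hc
    inv_det = np.conj(det) / (np.abs(det) ** 2 + beta)
    g_direct = Hi * inv_det       # applied to the same channel
    g_cross = -Hc * inv_det       # applied to the opposite channel
    return g_direct, g_cross

def apply_xtc(binaural_lr, g_direct, g_cross, n_fft=1024):
    """Prefilter one binaural block so that, after the acoustic paths,
    each ear receives (approximately) only its intended channel.
    Circular convolution is used here for brevity."""
    L = np.fft.rfft(binaural_lr[0], n_fft)
    R = np.fft.rfft(binaural_lr[1], n_fft)
    out_l = np.fft.irfft(g_direct * L + g_cross * R, n_fft)
    out_r = np.fft.irfft(g_cross * L + g_direct * R, n_fft)
    return np.stack([out_l, out_r])
```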
  • In contrast to the Audio-3D headset terminal device, for the Audio-3D speakerbox terminal device, a full two-channel acoustical echo canceller is preferably used, rather than two single channel acoustical echo cancellers.
  • 8.4.3 Signal processing units in Audio-3D conference bridge
  • The purpose of the Audio-3D conference bridge is to provide audio streams to the participants of a conference situation with more than two participants. Principally, it would also be possible to establish multiple peer-to-peer connections between all participants; some of the functionalities performed by the conference bridge would then have to be realized in the terminals. However, the overall data rate involved would grow quadratically as a function of the number of participants and, therefore, would start to become inefficient already for a low number of connected participants.
  • The typical functionality to be realized in the conference bridge is shown schematically and exemplarily in Fig. 11, based on an exemplary setup composed of three participants, of which one is connected via a conventional telephone (PSTN; public switched telephone network) connection, whereas the other two participants are connected via a packet based Audio-3D link.
  • The conference bridge receives audio streams from all three endpoints, shown as the incoming gray arrows in the figure. In this example, it is assumed that the streams originating from participants 1 and 2 contain binaural signals in Audio-3D quality, indicated by the double arrows, whereas the signal from participant 3 is only monaural and of narrow band quality.
  • The conference bridge creates one outgoing stream for each of the participants:
    • Participant 1 receives the data from participant 3 and participant 2.
    • Participant 2 receives the data from participant 3 and participant 1.
    • Participant 3 receives the data from participant 1 and participant 2.
  • In this context, it is very much preferred that each participant receives the audio data from all participants but himself; a minimal sketch of this mixing rule is given after the following list. Variants are possible to control the outgoing audio streams, e.g.,
    • The output audio streams contain only signals from active sources.
    • The output audio streams may be processed in order to enhance the conversation quality, e.g., by means of a noise reduction or the like. Each incoming stream may be processed independently.
    • The incoming "spatial images" of the binaural signals are virtually relocated. Given more than one connection, it may be useful to place groups of sound sources at different positions in the created virtual acoustical scenery.
  • Principally, incoming audio signals may be decoded and transformed into PCM (pulse code modulation) signals to be accessible for audio signal processing algorithms. However, in some cases it may be more useful to perform the signal processing in the coded parameter domain to avoid transcoding artifacts. The signal processing functionalities in the PCM domain are similar to those functionalities realized in the terminals (e.g., adaptive jitter buffer) and shall not be explained in detail here.
  • In the example shown in Fig. 11, there is one participant connected via PSTN. The corresponding speech signals reaching the conference bridge are monaural and of low quality, due to narrow band frequency limitations and low data rate. Therefore, a signal adaptation is preferentially used in both directions, from the telephone network to the Audio-3D network (Voice to Audio-3D) and from the Audio-3D network to the telephone network (Audio-3D to Voice).
  • Given the conversion from the telephone network to the Audio-3D network, mainly the audio signals must be converted from narrowband to Audio quality and from monaural to binaural, as shown schematically and exemplarily in Fig. 12.
  • In order to convert narrowband to Audio quality, technologies to extend the audio bandwidth are required, denoted as artificial bandwidth extension (BWE). Such an artificial bandwidth extension is described in detail in the unpublished European patent application 13 001 602.5, filed on 27 March 2013 .
  • In the next step, the monaural signal is transformed into a binaural signal. So-called spatial rendering (SR) is employed for this purpose in most cases. In that approach, given a specific angle of direction of arrival, HRTFs (head related transfer functions) are chosen for the left and the right channel, respectively, and used to filter the monaural signal. These HRTFs mimic the effect of the temporal delay caused by a signal reaching the one ear before the other and the attenuation effects caused by the human head. Sometimes, in order to better fit into the acoustical scenario in the case of a conference with multiple participants, also, an additional binaural reverberation can be useful (SR+REV).
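  • A minimal sketch of this spatial rendering step, assuming the HRIR pair for the desired direction of arrival is available, e.g., from a measured HRTF database:

```python
import numpy as np

def spatial_render(mono, hrir_left, hrir_right):
    """Spatial rendering (SR) of a monaural signal: filter the
    (bandwidth-extended) mono signal with the HRIR pair chosen for the
    desired direction of arrival, yielding a two-channel binaural
    signal. Binaural reverberation (SR+REV) could be added analogously
    by convolving with a binaural room impulse response."""
    left = np.convolve(mono, hrir_left)
    right = np.convolve(mono, hrir_right)
    return np.stack([left, right])
```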
  • Given the conversion from the Audio-3D network to the telephone network, the binaural signal must be converted into a signal which is compliant with a conventional telephone.
  • The audio bandwidth must be limited and the signal must be converted from binaural to mono, as shown schematically and exemplarily in Fig. 13.
  • Therefore, an intelligent down-mix is preferably realized, such that undesired comb effects and spectral colorations are avoided.
  • Since the intelligibility is usually significantly lower for monaural signals compared to binaural signals, additional signal processing / speech enhancements may preferably be implemented, such as a noise reduction and a dereverberation that may help the listener to better follow the conference.
  • 8.4.4 Specific characteristics in signal processing in Audio-3D
  • In the following, specific considerations for the design of signal processing algorithms for Audio-3D shall be briefly described. In particular, algorithms nowadays used in monaural VoIP terminals are preferably adapted to preserve the binaural cues in Audio-3D transmission schemes.
  • 8.4.4.1 Source coding in Audio-3D
  • Besides a higher quality constraint for source coding approaches in Audio-3D, as described in section 8.3 above, the binaural cues as introduced in section 8.1 above must be preserved. In order to do so, the sensitivity of human perception with respect to phase shifts in binaural signals is preferably taken into account in the source codec. VoIP applications tend to transfer different media types in independent streams of data and to synchronize on the receiver side. This procedure makes sense for audio and video due to the use of different recording and playback clocks. The receiver side synchronization is not very critical, since a temporal shift between audio and video can be tolerated unless it exceeds 15 to 45 milliseconds (see Advanced Television Systems Committee, "ATSC Implementation Subcommittee Finding: Relative Timing of Sound and Vision for Broadcast Operations", IS-191, 2003).
  • However, transmitting the channels of the binaural signal in independent media streams may not be suitable for Audio-3D. In particular, the two channels of a binaural signal should preferably be captured using one physical device with one common clock rate to prevent signal drifting. Synchronization on the receiver side cannot be realized at all, or only with an immense signal processing effort, at the accuracy required to preserve the ITD binaural cues as defined in section 8.1 above.
  • In principle, transmitting the encoded binary data from two independent instances of the same monaural source encoder, one for each binaural channel, in one data packet (with twice the length of the packet resulting for a monaural signal per frame) is the simplest approach, as long as the left and right binaural channels are captured sample- and frame-synchronously, which implies that both are recorded by ADCs (analog-to-digital converters) driven by the same clock or a common clock reference. This approach yields a data rate which is twice the data rate of a monaural HD-Voice communication terminal. In view thereof, sophisticated approaches to exploit the redundancies in both channels may be a promising solution to decrease the overall data rate (see, e.g., H. Fuchs, "Improving joint stereo audio coding by adaptive inter-channel prediction", IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 39 to 42, 1993; H. Krüger and P. Vary, "A New Approach for Low-Delay Joint-Stereo Coding", ITG-Fachtagung Sprachkommunikation, VDE Verlag GmbH, 2008).
  • However, many of these approaches to realize so-called Joint-Stereo coding (see, e.g., J. Breebaart and C. Faller, "Spatial Audio Processing: MPEG Surround and Other Applications", Wiley-Interscience, 1st Edition, 2007; J. Herre et al., "Intensity Stereo Coding", Audio Engineering Society Convention, 1994) must be used with care, since the phase relations between the left and the right binaural channels are often neglected. Instead, only level differences are considered, which reduces the spatial characteristics originally present in 3D audio events.
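  • The simple dual-mono packing described above could be sketched as follows; the small illustrative header (sequence number plus the two payload lengths) is not taken from the specification.

```python
import struct

def pack_binaural_frame(enc_left: bytes, enc_right: bytes, seq: int) -> bytes:
    """Dual-mono approach: the encoded data of both channels of one
    frame travel side by side in a single packet with one sequence
    number, so the channels stay sample- and frame-synchronous and
    the ITD cues survive, at twice the monaural data rate. The header
    layout (uint16 sequence number and two uint16 lengths) is
    illustrative only."""
    header = struct.pack(">HHH", seq, len(enc_left), len(enc_right))
    return header + enc_left + enc_right

def unpack_binaural_frame(packet: bytes):
    seq, n_l, n_r = struct.unpack_from(">HHH", packet, 0)
    off = 6
    return seq, packet[off:off + n_l], packet[off + n_l:off + n_l + n_r]
```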
  • 8.4.4.2 Adaptive jitter buffer
  • VoIP transmission schemes in general rely on the so-called User Datagram Protocol (UDP). Due to wide spread firewalls and routers in local private home networks, applications may also employ the Transmission Control Protocol (TCP). In both cases, packets emitted by one side of the communication arrive in time very often, but may also arrive with a significant delay (denoted as the "network jitter"). In the case of UDP, packets may also get lost during the transmission (denoted as a "frameloss").
  • If a packet does not arrive in time, audible artifacts will typically occur due to gaps in the audio signal caused by the lack of samples originating from the lost packet. In order to compensate for packet delays, it is suitable to store a certain budget of audio samples, which is realized in the jitter buffer (JB). The larger this storage of audio samples is, the larger the delay that can be compensated without audible artifacts. However, if the storage is too large, the end-to-end delay of the communication becomes too high, diminishing the overall communication quality, which should be avoided.
  • The network jitter characteristics observed in real applications are in general strongly time-varying. An example with strongly variable network jitter is a typical WiFi router used in many households nowadays. Often, packets are not transmitted via the WiFi transmission link for a couple of hundred milliseconds if a microwave oven is used which produces disturbances in the same frequency band used by WiFi or if a Bluetooth link is used in parallel. Therefore, a good jitter buffer should preferably be managed and should adapt to the instantaneous network quality which must be observed by the Audio-3D communication application. Such a jitter buffer is denoted as an adaptive jitter buffer.
  • During the adaptation - given a modification of the network characteristics -, the number of samples stored in the jitter buffer (the fill height) is preferably modified by employing approaches for signal modification such as the waveform similarity overlap-add (WSOLA) approach (see W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech", IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 554 to 557, 1993), a phase vocoder approach (see M. Dolson, "The phase vocoder: A tutorial", Computer Music Journal, Vol. 10, No. 4, 1986) or similar techniques. The goal during this adaptation is to play the signal with an increased or decreased speed without producing audible artifacts, also denoted as "Time-Stretching". In the case of binaural signals, however, not all approaches are suitable. For example, in WSOLA, time stretching is achieved by re-assembling the signal from signal segments originating from the past or the future. However, the exact signal synthesis process may be different for the left and the right channel of a binaural signal due to independent WSOLA processing instances. Arbitrary phase shifts may be the result, which do not really produce audible artifacts, but which may lead to a manipulation of the ITD cues in Audio-3D and may destroy or modify the spatial localization of audio events.
  • A preferred approach which does not influence the ITD binaural cues is to use an adaptive resampler. The core component is a flexible resampler, the output sample rate of which can be modified continuously during operation.
  • If two instances - one for the left and one for the right channel of a binaural signal - are operated synchronously, the binaural cues can be preserved even in case of a fill height manipulation of the jitter buffer. A suitable technique for a flexible resampler with quasi-continuous output rate is e.g. proposed in Matthias Pawig et al., "Adaptive Sampling Rate Correction for Acoustical Echo Control in Voice-Over-IP", IEEE Transactions on Signal Processing, Vol. 58, No. 1, pages 189 to 199, 2010. The only disadvantage of this approach is a slight variation of the pitch of the played signal. Normally, however, this does not lead to any noticeable artifacts in case of a purely speech based communication.
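  • The following Python sketch illustrates this principle: one common, continuously adjustable resampling ratio is applied to both channels, so the interchannel phase, and hence the ITD cues, remains intact. Linear interpolation stands in here for the high-quality flexible resampler of a real system.

```python
import numpy as np

def flexible_resample(stereo, ratio):
    """Resample both channels of a binaural block with one common
    ratio (>1 drains the jitter buffer faster, <1 stretches the
    signal). The identical output time grid is applied to the left
    and the right channel, so interchannel phase (the ITD cues) is
    preserved, at the price of a slight pitch shift."""
    n_in = stereo.shape[1]
    t_out = np.arange(0, n_in - 1, ratio)          # common output grid
    t_in = np.arange(n_in)
    left = np.interp(t_out, t_in, stereo[0])
    right = np.interp(t_out, t_in, stereo[1])
    return np.stack([left, right])

# Example: play 2 % faster while the buffer fill height is too high.
block = np.random.randn(2, 480)                    # 10 ms at 48 kHz
shorter = flexible_resample(block, 1.02)
```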
  • 8.4.4.3 Frameloss concealment
  • In the frameloss concealment (FLC), signal gaps caused by the loss of frames during transmission over a UDP link, or by frames which arrive with a delay that even a long jitter buffer can no longer compensate, are concealed such that the impact is not or hardly audible.
  • Most approaches known from the literature reconstruct an artificial segment based on information from the past. This segment is then adapted to fit well into the gap caused by the lost packet by creating a smooth transition between artificial segment and the signal before and after. Also in this case, due to the manipulation of the signal, the binaural cues can be destroyed. As a consequence, phantom sounds can be produced which might be located by the listener at arbitrary positions in the virtual acoustical room; this may be perceived in a very unpleasant and disturbing way.
  • 8.4.4.4 Automatic gain control
  • In the automatic gain control (AGC), signal levels are preferably adapted such that the transmitted signal appears neither too loud nor too low in volume. In general, this increases the perceived communication quality since, e.g., a source encoder works better for signals with a higher level than for lower levels, and the intelligibility is higher for higher-level signals.
  • However, the ILD binaural cues are based on level differences in the two channels of a binaural signal. Given two AGC instances which operate independently on the left and the right channel, these cues may be destroyed since the level differences are removed. Thus, a usage of conventional AGCs which operate independently may not be suitable for Audio-3D. Instead, the gain control for the left channel should preferably somehow be coupled to the gain control for the right channel.
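  • A minimal sketch of such a coupled AGC, deriving one common gain from the louder of the two channels so that the interaural level differences are preserved (the block-based structure, target level and smoothing constant are assumed tuning choices):

```python
import numpy as np

class CoupledAGC:
    """Automatic gain control with one common gain for both channels:
    the control level is derived from the louder channel and the same
    smoothed gain is applied left and right, so the interaural level
    differences (ILD cues) are preserved."""
    def __init__(self, target_rms=0.1, smooth=0.9):
        self.target_rms, self.smooth, self.gain = target_rms, smooth, 1.0

    def process(self, stereo):
        level = max(np.sqrt(np.mean(stereo[0] ** 2)),
                    np.sqrt(np.mean(stereo[1] ** 2)), 1e-9)
        self.gain = (self.smooth * self.gain
                     + (1 - self.smooth) * self.target_rms / level)
        return stereo * self.gain      # identical gain on both channels
```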
  • 8.4.4.5 Equalizer for personalization of binaural headset devices
  • In a binaural recording system, in general, the signals are recorded with devices which mimic the influence of real ears (for example, an artificial head in general has "average ears" which shall approximate the impact of the ears of a large number of persons) or by using headset devices with a microphone in close proximity to the ear canal (see section 8.4.1). In both cases, the ears of the person who listens to the recorded signals and the ears which have been the basis for the binaural recording are not identical.
  • In experiments, it was shown that binaural signals recorded with the individual ears of the listener lead to significantly better spatial localization properties than in the case where the recording has been made involving the "average ears" of an artificial head. In order to overcome these deficiencies, an individual equalization of the recorded binaural signals is preferable.
  • In addition to that, an equalizer can be used in the sending direction in Figs. 9 and 10 to compensate for possible deviations of the microphone characteristics related to the left and the right channel of the binaural recordings.
  • In the receiving direction, an equalizer may also be useful to adapt to the hearing preference of the listener to attenuate or amplify specific frequencies. For persons with hearing impairments, attenuations and amplifications of parts of the binaural signal may also be realized in the equalizer according to the needs of the person wearing the binaural terminal device to increase the overall intelligibility. However, some care has to be taken to not destroy or manipulate the ILD binaural cues.
  • 8.4.4.6 Noise reduction
  • As already explained above, a goal of Audio-3D is the transmission of speech contents as well as a transparent reproduction of the ambience in which acoustical contents have been recorded. In this sense, a noise reduction which removes acoustical background noise may not seem useful at first glance.
  • Considering, however, that in a conference situation, background noise caused by vents or other undesired acoustical sources may be present for a long time, at least stationary undesired noises should preferably be removed to increase the conversational intelligibility.
  • Approaches for noise reduction in conventional HD-Voice communications in general exploit the properties of speech, in particular the presence of speech pauses. During these pauses, estimates of the instantaneous background noise characteristics are commonly measured, which afterwards are used as a basis for a frequency specific control of the attenuation of parts of the recorded signal to attenuate undesired noise.
  • In Audio-3D, a more accurate classification of the recording situation should be performed to distinguish between "desired" and "undesired" background noises. In comparison to single channel noise reduction approaches, in Audio-3D applications, two rather than only one microphone help in this classification process by locating audio sources in a given room environment. In this context, additional sensors such as an accelerometer or a compass may support the auditory scene analysis.
  • Principally, noise reduction is based on the attenuation of those frequencies of the recorded signal where noise is present, such that the speech is left unaltered, whereas noise is as much as possible suppressed. In the sense to preserve the binaural cues of a binaural signal, two noise reduction instances operating on the left and the right channel independently may destroy or manipulate the binaural ILD cues. However, approaches have been developed for binaural hearing aids in the past (see T. Lotter, "Single and Multimicrophone Speech Enhancement for Hearing Aids", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2004; M. Jeub, "Joint Dereverberation and Noise Reduction for Binaural Hearing Aids and Mobile Phones", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2012).
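  • The key idea of such binaural-cue-preserving noise reduction can be sketched as a single spectral gain mask, computed from both channels jointly and applied identically to the left and the right channel; the noise PSD estimate from speech pauses and the gain floor are assumed inputs.

```python
import numpy as np

def binaural_nr_gain(noisy_l, noisy_r, noise_psd, floor=0.1):
    """Spectral-subtraction-style noise reduction with one common gain
    mask: the per-bin gain is computed from both channels jointly and
    applied identically to the left and the right channel, so the ILD
    cues are not distorted (a strategy in the spirit of binaural
    hearing aid processing). Inputs are one STFT frame per channel
    and a noise PSD estimate obtained during speech pauses."""
    psd = 0.5 * (np.abs(noisy_l) ** 2 + np.abs(noisy_r) ** 2)
    gain = np.maximum(1.0 - noise_psd / (psd + 1e-12), floor)
    return gain * noisy_l, gain * noisy_r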
  • 8.4.4.7 Acoustical echo compensation
  • Acoustical echo compensation generally follows an approach which is composed of an acoustical echo canceller and a statistical postfilter. The acoustical echo canceller is based on estimating the "real" physical acoustical path between speaker and microphone by means of an adaptive filter. Once determined, the estimate of the acoustical path is used to approximate the undesired acoustical echo signal recorded by the microphones of the terminal device. In order to reduce the undesired acoustical echo in the recorded signal, the approximation of the acoustical echo and the acoustical echo component inherent to the recorded signal are finally cancelled out by means of destructive superposition (see S. Haykin, "Adaptive Filter Theory", Prentice Hall, 4th Edition, 2001).
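  • A minimal sketch of such an adaptive echo canceller, assuming the widely used NLMS algorithm (the description above does not prescribe a particular adaptation rule), could look as follows in Python:

    import numpy as np

    def nlms_echo_canceller(far_end, mic, filter_len=256, mu=0.5, eps=1e-8):
        """Normalized LMS estimate of the acoustic echo path.  The
        adaptive filter h approximates the speaker-to-microphone impulse
        response; its output is subtracted from the microphone signal
        (destructive superposition), leaving near-end speech plus a
        residual echo for the postfilter."""
        h = np.zeros(filter_len)           # echo-path estimate
        x_buf = np.zeros(filter_len)       # most recent far-end samples
        out = np.zeros(len(mic))
        for n in range(len(mic)):
            x_buf = np.roll(x_buf, 1)
            x_buf[0] = far_end[n]
            echo_est = h @ x_buf           # predicted echo at the microphone
            e = mic[n] - echo_est          # error = near end + residual echo
            h += mu * e * x_buf / (x_buf @ x_buf + eps)   # NLMS update
            out[n] = e
        return out, h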
  • In practice, due to the physics of acoustics, specific uncertainties in the reconstruction of the acoustical echo path are always present, which lead to a specific residual acoustical echo signal (see G. Enzner, "A Model-Based Optimum Filtering Approach to Acoustical Echo Control: Theory and Practice", PhD dissertation, Institute of Communication Systems and Data Processing, RWTH Aachen, 2006). Therefore, solutions for acoustical echo cancellation nowadays commonly employ a statistical postfilter which acts similarly to a noise reduction, removing residual acoustical echoes by means of a frequency-selective attenuation of the recorded signal.
  • In Audio-3D headset terminal devices, a strong coupling between speaker and microphone is present due to their close proximity (see, for example, Fig. 8), which produces a strong undesired acoustical echo. A well-designed adaptive filter may reduce this acoustical echo by a couple of dB but can never remove it completely. The remaining acoustical echo can still be audible and, given that two independent instances of an acoustical echo compensator are operated for the left and the right channel, may be very confusing in terms of the perception of the binaural signal: phantom signals may appear, located at arbitrary positions in the acoustical scenery. A postfilter is therefore considered to be of great importance here, but it may have a negative impact on the binaural ILD cues due to an independent manipulation of the signal levels of the left and the right channel of the binaural signal.
  • 8.4.4.8 Crosstalk cancellation for (loud)speaker playback scenarios
  • Even if a user is not equipped with a headset terminal device, it is desirable that he/she can consume binaural audio content in a way that still benefits from the binaural recording techniques employed. In most cases, the hardware setup for consuming binaural content without a headset device is expected to consist of two loudspeakers, for instance two speakerboxes, placed in a typical stereo playback arrangement. Such a stereo hardware setup is not optimal for binaural content, as it suffers from cross-channel interference: signals emitted by the left loudspeaker of the stereo playback system will also reach the right ear, and signals emitted by the right speaker will also reach the left ear. As a result, the user will not perceive the captured acoustical sound events as required to create a realistic impression of the original surrounding acoustical environment. The consequence is that sound events are always located in a virtual acoustical space limited to somewhere in-between the left and the right speaker. This reduces the width of the created spatial image compared to a binaural playback, where no cross-channel interference occurs.
  • In a stereo crosstalk canceller, the two channels of a captured binaural signal to be emitted by the two involved speakers are pre-processed by means of linear filtering in order to minimize the amount of cross-channel interference. In principle, it employs cancellation techniques based on fixed filtering as described, e.g., in B. B. Bauer, "Stereophonic Earphones and Binaural Loudspeakers", Journal of the Audio Engineering Society, Vol. 9, No. 2, pages 148 to 151, 1961. Unfortunately, the pre-processing required for crosstalk cancellation depends heavily on the physical location and characteristics of the involved loudspeakers, and users cannot be expected to place stereo loudspeakers in a consistent way, e.g., in the context of a home cinema. However, in many practical applications, such as large TV displays, the location of the stereo speakers is fixed and users can be assumed to be located in front of the display at a specific distance. In that special case, a carefully designed set of pre-processing filter coefficients is preferably sufficient to cover most use-cases. In the case of the speaker device described herein at the beginning, the position of the loudspeakers is definitely not fixed. When using the True Wireless Stereo feature to combine two devices, for instance two speakerboxes coupled via Bluetooth, the two speakers may, however, become aware of their position relative to each other. This may lead to the following possible behaviors (a sketch of such pre-processing filtering follows after this list):
    • The two connected loudspeakers may preferably instruct the user how to place both speaker devices in relation to each other. This solution guides the user in correcting the speaker and listener positions until they are optimal for binaural sound reproduction.
    • The two loudspeakers may preferably detect their position relative to each other and adapt the pre-processing filter coefficients to create the optimal binaural sound reproduction. The four microphones as proposed herein help to locate the position of each loudspeaker precisely.
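  • Purely as an illustration, the following Python sketch inverts a 2x2 matrix of speaker-to-ear transfer functions per frequency bin in order to suppress the cross paths. The impulse responses in h_matrix are assumed to be known or measured, and the crude regularization (as well as ignoring circular convolution effects) is a simplification.

    import numpy as np

    def crosstalk_cancel(binaural_l, binaural_r, h_matrix, reg=1e-3):
        """Frequency-domain crosstalk canceller.  h_matrix[i][j] is the
        (assumed measured) impulse response from speaker j to ear i; its
        regularized inverse yields pre-filters that suppress the cross
        paths left-speaker-to-right-ear and right-speaker-to-left-ear."""
        n_fft = len(binaural_l)
        # Transfer function matrix per frequency bin: shape (bins, 2, 2).
        H = np.stack([[np.fft.rfft(h_matrix[i][j], n_fft) for j in range(2)]
                      for i in range(2)])
        H = np.moveaxis(H, -1, 0)
        # Crude regularization to limit gain at ill-conditioned bins.
        C = np.linalg.inv(H + reg * np.eye(2))
        S = np.stack([np.fft.rfft(binaural_l, n_fft),
                      np.fft.rfft(binaural_r, n_fft)])
        out = np.einsum('fij,jf->if', C, S)    # pre-filter both channels
        return np.fft.irfft(out[0], n_fft), np.fft.irfft(out[1], n_fft)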
  • 8.4.4.9 Enhancement of stereo signals
  • Given a signal that is the output of a stereo rather than a binaural recording procedure, its playback on binaural playback hardware such as a headset terminal device will not produce an outstanding spatial impression; spatial information such as the ITD cues is partly lost in a stereo recording setup. However, stereo enhancement techniques may preferably be employed to transform a stereo signal into an approximately binaural signal. The main principle of these techniques is to modify the captured stereo audio signals so as to artificially reconstruct the lost binaural cues. In most approaches to stereo enhancement, it is necessary to first determine source localization properties from the recorded signal, which can afterwards be transformed into more elaborate binaural cues: sound components can be separated, associated with a specific angle of arrival and finally rendered artificially by using HRTF (head-related transfer function) based rendering techniques. Targeting low-complexity pre-processing, very simple approaches may also be used, e.g., to increase the degree of diffuseness between the left and the right stereo channel.
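  • As an illustration of the low-complexity variant only (HRTF-based rendering is considerably more involved), the following Python sketch increases the diffuseness between the channels via mid/side processing; the gain and delay values are arbitrary examples.

    import numpy as np

    def widen_stereo(left, right, side_gain=1.5, decorr_delay=12):
        """Very simple stereo enhancement: boost the side (L-R) component
        and decorrelate it with a short delay, increasing the perceived
        diffuseness between the channels.  This does not recreate true
        binaural ITD/ILD cues."""
        mid = 0.5 * (left + right)
        side = side_gain * 0.5 * (left - right)
        # Cheap decorrelation: delay the side component by a few samples.
        side = np.concatenate([np.zeros(decorr_delay), side[:-decorr_delay]])
        return mid + side, mid - side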
  • 9. Metadata
  • In the current state of the art, any audio recording is normally played back by devices without regard to how it was captured, e.g., whether it is a mono, a stereo, a surround sound or a binaural recording, and/or whether the playback device is a speakerbox, a headset, surround sound equipment, a loudspeaker arrangement in a car or the like. The most that can be expected today is that a mono signal is automatically played back on both loudspeakers or both headset speakers, left and right, or that a surround sound signal is down-mixed to two speakers if surround sound is indicated. Overall, ignoring the nature of the audio signal may result in an audio quality which is not satisfactory for the listener. For instance, a binaural signal might be played back via loudspeakers and a surround sound signal might be played back via headphones. Another example may arise as binaurally recorded content, provided by music labels or broadcasters, becomes more widely distributed in the market. While 3D algorithms for enhancing the flat audio field of a stereo signal exist and are being applied, such devices or algorithms cannot distinguish between stereo signals and binaurally recorded signals. Thus, they would even apply 3D processing to already binaurally recorded signals. This needs to be avoided, because it could result in a severely impaired sound quality that does not at all match the intent of the audio signal supplier, whether it is a broadcaster or the music industry.
  • In order to improve the situation, the audio terminal 100 shown in Fig. 1 generates metadata provided with the multi-channel audio data, wherein the metadata indicates that the multi-channel audio data is binaurally captured. Preferably, the metadata further indicates one or more of: a type of the first device, a microphone use case, a microphone attenuation level, a beamforming processing profile, a signal processing profile and an audio encoding format.
  • For example, a suitable metadata format could be defined as follows:
    Device ID: 3 bit to indicate a setup of the first and the second microphone,
    e.g., '000' BoomBoom
    '001' Shoulderspeaker
    '010' Headset over the ear
    '011' Headset on the ear
    '100' In-Ear
    '101' Headset over the ear with artificial ears
    '110' Headset on the ear with artificial ears
    '111' In-Ear with hear-through

    Microphone Use Case: 3 bit to indicate the use case of the microphones, e.g., mono, stereo, binaural, beamformed, or combined (such as microphones 11, 12, 13 or 11, 12, 14 or 11, 13, 14, etc.),
    e.g., '000' All microphones (see 11, 12, 13, 14 in Fig. 1)
    '001' Binaural microphones only (see 11, 12 in Fig. 1)
    '010' Beamforming microphones only (see 13, 14 in Fig. 1)
    '011' One of the beamforming microphones only (see 13 in Fig. 1)

    Level Setup: 32 bit (4 x 8 bit) or more to indicate the respective attenuation of the microphones,
    e.g., 'Bit 0-7' Attenuation of microphone 1 in dB
    'Bit 8-15' Attenuation of microphone 2 in dB
    'Bit 16-23' Attenuation of microphone 3 in dB
    'Bit 24-31' Attenuation of microphone 4 in dB

    Beamforming Processing Profile: 2 bit to indicate which beamforming algorithms have been applied to the microphones,
    e.g., '00' Beamforming algorithm 1
    '01' Beamforming algorithm 2
    '10' Beamforming algorithm 3
    '11' Beamforming algorithm 4

    Signal Processing Profile: 4 bit to indicate which algorithms have been applied to the microphones,
    e.g., '0000' Signal processing 1
    '0001' Signal processing 2
    '0010' Signal processing 3
    '0011' Signal processing 4

    Encoding Algorithm Format: 2 to 4 bit to indicate the encoding algorithm being used, such as SBC, apt-X, Opus or the like,
    e.g., '000' PCM (linear)
    '001' SBC
    '010' APT-X
    '011' Opus
    '100' G.729
    '101' AAC HE
    '110' MPEG Layer 2
    '111' MPEG Layer 3
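
  • Purely by way of example, the fields listed above could be packed into a single bit field as in the following Python sketch; the layout and field order are illustrative, not a normative format.

    def pack_audio3d_metadata(device_id, mic_use_case, attenuations_db,
                              beamforming_profile, signal_profile,
                              encoding_format):
        """Pack the example metadata fields described above into one
        integer (field widths follow the example layout: 3 + 3 + 32 +
        2 + 4 + 3 = 47 bits)."""
        assert len(attenuations_db) == 4
        word = device_id & 0b111                     # 3 bit device ID
        word = (word << 3) | (mic_use_case & 0b111)  # 3 bit use case
        for att in attenuations_db:                  # 4 x 8 bit levels in dB
            word = (word << 8) | (att & 0xFF)
        word = (word << 2) | (beamforming_profile & 0b11)
        word = (word << 4) | (signal_profile & 0b1111)
        word = (word << 3) | (encoding_format & 0b111)
        return word

    # Example: in-ear device, binaural microphones only, 0 dB attenuation,
    # beamforming algorithm 1, signal processing 1, Opus encoding.
    meta = pack_audio3d_metadata(0b100, 0b001, [0, 0, 0, 0],
                                 0b00, 0b0000, 0b011)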
  • It shall be noted that for (loud)speaker playback scenarios, as described in more detail in section 8.4.4.8 above, the metadata preferably indicates a position of the two speakers relative to each other.
  • 10. Additional comments
  • While the audio terminal 100 described with reference to Fig. 1 comprises a first device 10 and a second device 20 which is separate from the first device 10, this does not have to be the case. For example, other audio terminals according to the present invention which may be used for Audio-3D may be integrated terminals, in which both (a) at least a first and a second microphone for capturing multi-channel audio data comprising a first and a second audio channel, and (b) a communication unit for voice and/or data communication are provided in a single first device. In this case, a connection via a local wireless transmission link may not be needed, and the concepts and technologies described in sections 7 to 9 above could also be realized in an integrated terminal. Also, instead of using a local wireless transmission link, an audio terminal which realizes the concepts and technologies described in sections 7 to 9 above could comprise a first and a second device which are adapted to be connected with each other via a wired link.
  • Of course, some of the technologies described above may also be used in an audio terminal which comprises only one of (a) at least a first and a second microphone and (b) at least one of a first and a second speaker, where (a) is preferably usable for recording multi-channel audio data comprising at least a first and a second audio channel and (b) is preferably usable for playing back multi-channel audio data comprising at least a first and a second audio channel.
  • Moreover, while the audio terminal 100 described with reference to Fig. 1 comprises a communication unit 21 for voice and/or data communication, other audio terminals according to the present invention which may be used for Audio-3D may comprise, additionally or alternatively, a recording unit (not shown in the figures) for recording the captured multi-channel audio data comprising a first and a second audio channel. Such a recording unit preferably comprises a non-volatile memory, such as a hard disk drive or a flash memory, in particular, a flash RAM. The memory may be integrated into the audio terminal or the audio terminal may provide an interface for inserting an external memory.
  • In some scenarios, the audio terminal 100 further comprises an image capturing unit (not shown in the figures) for capturing a still or moving picture, preferably while capturing the multi-channel audio data, wherein the audio terminal 100 is adapted to provide, preferably automatically or substantially automatically, information associating the captured still or moving picture with the captured multi-channel audio data.
  • Additionally or alternatively, the audio terminal 100 may further comprise a text inputting unit for inputting text, preferably while capturing the multi-channel audio data, wherein the audio terminal 100 is adapted to provide, preferably automatically or substantially automatically, information associating the inputted text with the captured multi-channel audio data.
  • Preferably, the audio terminal 100 is adapted to provide the multi-channel audio data, for instance by means of the communication unit 21, such that a remote user is able to listen to the multi-channel audio data. For example, the audio terminal 100 may be adapted to communicate the multi-channel audio data to a remote audio terminal via a data communication, e.g., a suitable Voice-over-IP communication. In this respect, reference is made to the description of the various usage scenarios for Audio-3D in section 7 above.
  • Also, in the audio terminal 100, the first and the second microphone 11, 12 and the first speaker 15 can be provided in a headset, for instance an over- or on-the-ear headset, or an in-ear phone.
  • It shall be noted that in preferred scenarios, Audio-3D is not realized with narrowband audio data but, preferably, with wideband or even super-wideband or full-band audio data. In these latter cases, which may be referred to as HD-Audio-3D, the various technologies described above are adapted to deal with such high-definition audio content.
  • In the claims, the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality.
  • A single unit or device may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
  • Any reference signs in the claims should not be construed as limiting the scope.

Claims (4)

  1. An audio system (400; 500) for providing a communication between at least three remote locations (M, N, O), comprising a first audio terminal (401) being located at a first remote location (M), a second audio terminal (402) being located at a second remote location (N), a third audio terminal (405) being located at a third remote location (O), and a conference bridge (406; 506) being connectable with the first, the second and the third audio terminal (401, 402, 405; 501, 502, 505) via a transmission link, preferably, a dial-in or IP transmission link, supporting at least a first and a second audio channel, respectively, wherein each of the first, the second and the third audio terminal (401, 402, 405) comprises at least a first and a second microphone (11, 12) for capturing multi-channel audio data comprising at least a first and a second audio channel, and at least a first speaker (15) for playing back audio data comprising at least a first audio channel, wherein the first and the second microphone (11, 12) and the first speaker (15) are provided in a headset or an in-ear phone, wherein each of the first, the second, and the third terminal (401, 402, 405) further comprises at least a second speaker for playing back audio data comprising at least a second audio channel, wherein the conference bridge (406; 506) is adapted to generate for each audio terminal of the first, the second and the third audio terminal (401, 402, 405) a multi-channel audio mix of multi-channel audio data streamed from all other audio terminals of the first, the second and the third audio terminal (401, 402, 405) and to stream the multi-channel audio mix to the audio terminal for which it is generated, the multi-channel audio mix comprising at least a first and a second audio channel.
  2. The audio system (400) according to claim 1, wherein the conference bridge (406) is adapted to monaurally mix the multi-channel audio data streamed from the first and the second audio terminal (401, 402) to the multi-channel audio data streamed from the third audio terminal (405) to generate the multi-channel audio mix.
  3. The audio system (400) according to claim 2, wherein the conference bridge (406) is further adapted to spatially position the monaurally mixed multi-channel audio data streamed from the first and the second audio terminal (401, 402) when generating the multi-channel audio mix.
  4. The audio system (500) according to any of claims 1 to 3, further comprising a telephone (505) comprising a microphone and a speaker, wherein the conference bridge (506) is further connectable with the telephone (505), wherein the conference bridge (506) is adapted to mix the multi-channel audio data streamed from the first and the second audio terminal (501, 502) into a single-channel audio mix comprising a single audio channel and to stream the single-channel audio mix to the telephone (505).
EP14777648.8A 2014-10-01 2014-10-01 Audio terminal Active EP3228096B1 (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2014/071083 WO2016050298A1 (en) 2014-10-01 2014-10-01 Audio terminal

Publications (2)

Publication Number Publication Date
EP3228096A1 EP3228096A1 (en) 2017-10-11
EP3228096B1 true EP3228096B1 (en) 2021-06-23

Family

ID=51655751

Family Applications (1)

Application Number Title Priority Date Filing Date
EP14777648.8A Active EP3228096B1 (en) 2014-10-01 2014-10-01 Audio terminal

Country Status (2)

Country Link
EP (1) EP3228096B1 (en)
WO (1) WO2016050298A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126185A (en) * 2016-08-18 2016-11-16 北京塞宾科技有限公司 Holographic sound field recording communication apparatus and system based on Bluetooth
CN114047902B (en) * 2017-09-29 2024-06-14 苹果公司 File format for spatial audio
WO2019157069A1 (en) * 2018-02-09 2019-08-15 Google Llc Concurrent reception of multiple user speech input for translation
CN110351690B (en) * 2018-04-04 2022-04-15 炬芯科技股份有限公司 Intelligent voice system and voice processing method thereof
CN111385775A (en) * 2018-12-28 2020-07-07 盛微先进科技股份有限公司 Wireless transmission system and method thereof
TWI700953B (en) * 2018-12-28 2020-08-01 盛微先進科技股份有限公司 A wireless transmission system and method
KR102565882B1 (en) * 2019-02-12 2023-08-10 삼성전자주식회사 the Sound Outputting Device including a plurality of microphones and the Method for processing sound signal using the plurality of microphones
CN110444232B (en) * 2019-07-31 2021-06-01 国金黄金股份有限公司 Sound recording control method and device for sound box, storage medium and processor
EP4300994A4 (en) * 2021-04-30 2024-06-19 Samsung Electronics Co., Ltd. Method and electronic device for recording audio data acquired from plurality of devices
CN117795978A (en) * 2021-09-28 2024-03-29 深圳市大疆创新科技有限公司 Audio acquisition method, system and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8379871B2 (en) * 2010-05-12 2013-02-19 Sound Id Personalized hearing profile generation with real-time feedback
US8855341B2 (en) * 2010-10-25 2014-10-07 Qualcomm Incorporated Systems, methods, apparatus, and computer-readable media for head tracking based on recorded sound signals
US8767996B1 (en) * 2014-01-06 2014-07-01 Alpine Electronics of Silicon Valley, Inc. Methods and devices for reproducing audio signals with a haptic apparatus on acoustic headphones

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None *

Also Published As

Publication number Publication date
EP3228096A1 (en) 2017-10-11
WO2016050298A1 (en) 2016-04-07

Similar Documents

Publication Publication Date Title
EP3228096B1 (en) Audio terminal
EP3276905B1 (en) System for audio communication using lte
US8918197B2 (en) Audio communication networks
US8073125B2 (en) Spatial audio conferencing
US20080004866A1 (en) Artificial Bandwidth Expansion Method For A Multichannel Signal
JP2018528479A (en) Adaptive noise suppression for super wideband music
US20090150151A1 (en) Audio processing apparatus, audio processing system, and audio processing program
US20140226842A1 (en) Spatial audio processing apparatus
US20030044002A1 (en) Three dimensional audio telephony
TW200901744A (en) Headset having wirelessly linked earpieces
US20070109977A1 (en) Method and apparatus for improving listener differentiation of talkers during a conference call
KR20110069112A (en) Method of rendering binaural stereo in a hearing aid system and a hearing aid system
US20220369034A1 (en) Method and system for switching wireless audio connections during a call
US20220038769A1 (en) Synchronizing bluetooth data capture to data playback
US20150092950A1 (en) Matching Reverberation in Teleconferencing Environments
US20170223474A1 (en) Digital audio processing systems and methods
US11503405B2 (en) Capturing and synchronizing data from multiple sensors
EP2901668A1 (en) Method for improving perceptual continuity in a spatial teleconferencing system
WO2021180115A1 (en) Recording method and recording system using true wireless earbuds
BRPI0715573A2 (en) process and device for acquiring, transmitting and reproducing sound events for communication applications
TWM626327U (en) System for distributing audio signals among a plurality of communication devices that respectively correspond to a plurality of users
CN111225102A (en) Bluetooth audio signal transmission method and device
US12010496B2 (en) Method and system for performing audio ducking for headsets
US20220368554A1 (en) Method and system for processing remote active speech during a call
Rothbucher et al. Backwards compatible 3d audio conference server using hrtf synthesis and sip

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20170818

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20190415

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

RIC1 Information provided on ipc code assigned before grant

Ipc: H04S 3/00 20060101AFI20201221BHEP

Ipc: H04R 5/04 20060101ALN20201221BHEP

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

RIC1 Information provided on ipc code assigned before grant

Ipc: H04R 5/04 20060101ALN20210113BHEP

Ipc: H04S 3/00 20060101AFI20210113BHEP

INTG Intention to grant announced

Effective date: 20210128

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602014078295

Country of ref document: DE

Ref country code: AT

Ref legal event code: REF

Ref document number: 1405344

Country of ref document: AT

Kind code of ref document: T

Effective date: 20210715

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: LT

Ref legal event code: MG9D

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: FI

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: BG

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210923

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

REG Reference to a national code

Ref country code: AT

Ref legal event code: MK05

Ref document number: 1405344

Country of ref document: AT

Kind code of ref document: T

Effective date: 20210623

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210923

Ref country code: SE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: LV

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210924

REG Reference to a national code

Ref country code: NL

Ref legal event code: MP

Effective date: 20210623

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: SM

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: SK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: CZ

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: EE

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: AT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: RO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: PT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20211025

Ref country code: NL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

REG Reference to a national code

Ref country code: DE

Ref legal event code: R097

Ref document number: 602014078295

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: DK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

PLBE No opposition filed within time limit

Free format text: ORIGINAL CODE: 0009261

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT

REG Reference to a national code

Ref country code: CH

Ref legal event code: PL

26N No opposition filed

Effective date: 20220324

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: AL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

REG Reference to a national code

Ref country code: BE

Ref legal event code: MM

Effective date: 20211031

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MC

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

REG Reference to a national code

Ref country code: DE

Ref legal event code: R082

Ref document number: 602014078295

Country of ref document: DE

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LU

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211001

Ref country code: IT

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

Ref country code: BE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211031

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: LI

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211031

Ref country code: CH

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211031

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IE

Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES

Effective date: 20211001

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: HU

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT; INVALID AB INITIO

Effective date: 20141001

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: CY

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: MK

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20210623

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: DE

Payment date: 20240318

Year of fee payment: 10

Ref country code: GB

Payment date: 20240325

Year of fee payment: 10

PGFP Annual fee paid to national office [announced via postgrant information from national office to epo]

Ref country code: FR

Payment date: 20240325

Year of fee payment: 10