EP0968624A2

EP0968624A2 - Telephonic transmission of three dimensional sound

Info

Publication number: EP0968624A2
Application number: EP98909666A
Authority: EP
Inventors: David Monteith; Alastair Sibbald; Martin Peter Todd
Original assignee: Central Research Laboratories Ltd
Current assignee: Creative Technology Ltd
Priority date: 1997-03-18
Filing date: 1998-03-18
Publication date: 2000-01-05
Also published as: WO1998042161A2; WO1998042161A3

Abstract

The invention relates to telephonic transmission of 3D sound. Existing video conferencing systems suffer from the disadvantage that following transmission of a person speaking, the speaker's voice tends to become 'disembodied'. That is, if a person moves with respect to a microphone, the reproduced voice tends not to move with the speaker. The invention overcomes or reduces this effect by obtaining left and right monophonic signals, modifying the signals to compensate for head related air-to-ear transfer functions and performing equalisation and cross-talk cancellation on the signals. Eventually signals are compressed to produce a compressed binaural signal for transmission along a telephone link so that frequencies are split into separate bands, but relative phase differences between signals in different frequency bands are preserved. 3D sounds are therefore able to be transmitted via telephone links, and reproduced more effectively, than was previously possible.

Description

TELEPHONIC TRANSMISSION OF 3D SOUND

This invention relates to telephonic transmission of three dimensional (3D) sounds and more particularly to an apparatus for communicating three dimensional sounds between two or more remote locations by telephone transmission. The present invention is concerned with telephone conference systems irrespective of whether or not a visual image is transmitted at the same time as the audio transmission.

The concept of video telephone conference systems which employ large expensive equipment to enable a large group of people in one location to communicate with another group at another location is well known. Video telephones which incorporate a camera and video screen at each location are also well known. With the advent of low cost video telephone systems for personal computers and the integration of office technologies, such as fax, telephone and video, such systems are becoming more readily available.

One of the problems of telephone conference systems, particularly those without the transmission of visual images, is that when one is listening to a person speaking at the remote location, the voice of the person speaking seems to be "disembodied". With those systems that transmit visual images as well as the sound, this undesirable effect is more noticeable because, if the person speaking moves about at the remote location relative to the microphone or microphones monitoring the speaker's voice, the voice does not appear to move with the speaker. In those systems where the microphones are voice-actuated any slight noise or speech from one person can switch off the microphone of another person so that the listener becomes confused as to who is speaking.

These undesirable effects are related to the fact that all voices and any background sounds are localised at the same loudspeaker at the receiving station. An object of the present invention is to overcome, or reduce, these undesirable effects by reproducing a three dimensional sound field of the transmitting station at the receiving location.

The processing of binaural signals to produce a highly realistic three dimensional sound image is well known, and is described in International Patent Application No WO-A-9422278. Binaural technology is based on using a so-called "artificial head" microphone system to receive sound from a sound source and convert the acoustic energy into an electrical signal which is subsequently processed digitally. The use of an artificial head ensures that the natural three dimensional sound cues, which the brain of a listener uses to determine the position of sound sources in three dimensional space, are incorporated into the audio signal. The artificial head is preferably constructed to resemble as closely as possible an actual human head and upper torso and has silicone rubber ears which precisely resemble human ears, but in some applications good (but less precise) results can be achieved using two spaced microphones with a block or sheet of wood between the microphones.

For the purpose of the present specification the term "binaural signals" is intended to mean two channel or stereophonic signals which include one or more components representing audio diffraction effects created by an artificial head means positioned between a pair of microphones. The term "artificial head" is intended to cover not only a precise model of a human head but other imprecise models (such as for example a block of wood between microphones) and electrical synthesis of the audio diffraction signals.

There are many problems associated with artificial head sound recordings. For example, because the sound passes through two sets of ears (those of the artificial head and those of the listener) the tonal qualities of the reproduced sounds are not true to life. There is generally a resonance at a frequency of several kHz created in the main cavity of the ear (the concha). This has the effect of boosting the mid-range gain of the reproduced sound, and the reproduced sound appears to lack both low-frequency and high-frequency content. It is known to use equalisation filters to shape, or equalise, the spectral response of the audio signals generated by such artificial head recording means, to compensate for this "twice-through-the-ears" effect. International Patent Application WO-A-9515069 describes a binaural sound system which compensates for this so-called "twice-through-the-ears" effect.

A further problem with artificial-head microphone systems is that, when listening to the reproduced sound through loudspeakers, interaural cross-talk occurs: an audio signal intended for one ear of a listener is also received by the other ear. In order to compensate for this effect it is well known to employ cross-talk cancellation circuits. See for example International Patent Application WO-A-9515069.

A further object of the present invention is to provide apparatus which enables binaural processing of the audio signals of a telephone conference system.

According to one aspect of the present invention there is provided apparatus for communicating three dimensional sounds via a telephone link comprising an input device consisting of two spaced microphones operable to produce left and right channel monophonic microphone output signals, signal processing means for each channel comprising filter means for receiving the microphone output signals and modifying the signals to compensate for head related air-to-ear transfer functions and equalise the spectral response of the microphone output signals, cross-talk cancellation means for cancelling out interaural cross-talk between the channels, and data compression means operable to receive an output signal from each channel, combine them to produce a binaural signal and compress said binaural signal to produce a compressed binaural signal for transmission over the telephone link, said compression means using a first compression algorithm to compress frequencies below 1 kHz whilst preserving relative phase differences between the channel output signals, a second algorithm to compress frequencies above 2 kHz whilst preserving relative differences between amplitudes of the channel output signals and a third algorithm to compress frequencies between 1 kHz and 2 kHz whilst preserving the IAD and ITD information over the whole frequency band.

Preferably the apparatus further includes a receiving means for receiving a compressed binaural signal transmitted over a telephone link and converting said compressed signal into left and right channel audio output signals, and spaced left and right channel sound reproduction means each of which is operable to receive a respective channel audio output signal from said receiving means and reproduce sound corresponding to said respective channel audio output signal.

The sound reproduction means may comprise a pair of loudspeakers, or a pair of headphones.

The apparatus may be provided with a video signal means comprising a camera operable to produce a video output signal, and the compression means is operable to receive the video output signal and to combine said video output signal with said compressed binaural signal to produce a combined output signal for transmission via the telephone link.

Preferably receiving means further includes means for receiving a video signal transmitted over a telephone link and converting the video signal into a video output signal, and display means operable to receive said video output signal and display a visual image.

The present invention will now be described by way of example, with reference to the accompanying drawings in which:-

Figure 1 illustrates schematically apparatus incorporating the present invention for telephone conference connection between two conference centres.

Figure 2 shows in block diagram form apparatus incorporating signal processors in accordance with the present invention.

Figure 3 shows schematically a human head, and

Figure 4 shows a further embodiment of the present invention.

Referring to Figure 1 each conference station 10, 11 is provided with a personal computer (PC) 12 which includes a monitor, two spaced microphones 13, 14 mounted in silicone rubber moulded ears 15 (which model precisely human outer ears) and two spaced loudspeakers 16.

The microphones 13, 14 should ideally be placed about 15 cm apart (the approximate width of a human head) and although it is preferable that the microphones are mounted in moulded ears 15 on an artificial head, the microphones could be mounted in moulded ears mounted on structure 17 (such as a block or sheet of wood). Alternatively the microphones 13, 14 and moulded ears could simply be mounted on the sides of the computer case 12, but this would give less precise detail to the three-dimensional sound field.

Each of the stations 10, 11 is connected to the other by means of the public telephone system 27 in the usual way.

Referring to Figure 2, both microphones 13, 14 are positioned to receive sound generated at their respective station 10, 11, where they are located. Each microphone converts the pressure variations associated with the sound waves that it receives into an analogue electrical signal at inputs 18a, 18b of each channel (representing left and right ears 13,14) of a digital signal processor 19.

The processor 19 comprises a HRTF filter 20 and an equalisation filter 21 for each channel. It will be understood that, for the purposes of this specification, "HRTF" or "Head Related Transfer Function" is intended to mean a function representing the transfer function of a path between a source of sound and the ear of the listener, either the ear nearer the sound (near HRTF) or the ear further from the sound (far HRTF). HRTFs may be obtained by measurements on a real human head equipped with suitable microphones; alternatively, they may be obtained using an artificial head means, which may be, as is common, a precise model of a human head or torso with microphones in the ear structures; alternatively it may be something far less precise, for example a block or sheet of wood positioned between a pair of spaced apart microphones; it might even be an electrical synthesis circuit or system which creates such functions.

Filters 21 correct the spectral response to compensate for the mid-range gain associated with the concha-related resonance, as explained in International Patent Applications WO-A-9422278 and WO-A-9515069. The outputs 21a, 21b of the filters 21 are fed to cross-talk cancellation circuits 22 which cancel out the interaural cross-talk as explained in International Patent Applications WO-A-9422278 and WO-A-9515069. The output signal at each channel output 23 comprises a monophonic digital audio signal.
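The cross-talk cancellation step can be illustrated with a minimal sketch. It assumes a symmetric listening arrangement described, at a single frequency, by two complex transfer values: `same_side` (loudspeaker to the nearer ear) and `cross_side` (loudspeaker to the further ear). These names, and the 2x2 matrix-inverse formulation itself, are a common textbook treatment chosen for illustration; they are not taken from this specification or from the cited applications, whose circuits may differ.

```python
def crosstalk_canceller(same_side, cross_side):
    """Return the 2x2 inverse of the symmetric acoustic matrix
    [[same_side, cross_side], [cross_side, same_side]] at one frequency.
    Both arguments are hypothetical complex transfer values."""
    det = same_side * same_side - cross_side * cross_side
    if abs(det) < 1e-12:
        raise ValueError("acoustic matrix is singular at this frequency")
    return (
        (same_side / det, -cross_side / det),
        (-cross_side / det, same_side / det),
    )


def apply_2x2(matrix, left, right):
    """Multiply a 2x2 matrix by the column vector (left, right)."""
    return (
        matrix[0][0] * left + matrix[0][1] * right,
        matrix[1][0] * left + matrix[1][1] * right,
    )


def acoustic_path(same_side, cross_side):
    """Model of what reaches the ears from the two loudspeakers."""
    return ((same_side, cross_side), (cross_side, same_side))
```

Feeding the loudspeakers with the output of `apply_2x2(crosstalk_canceller(s, a), left, right)` and then passing the result through `acoustic_path(s, a)` returns the original left and right values, which is the sense in which the cross-talk is cancelled at that frequency.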

The normal signals transmitted over internationally acceptable telephone networks are typically a monophonic signal covering a range of frequencies from about 200 Hz to 3.4 kHz. In order to be able to transmit the outputs 23 of each channel over a normal telecommunications line, and reproduce a realistic three dimensional sound field, it is necessary to combine the output signals 23 to produce a stereophonic signal covering a wider range of frequencies (typically lower than 1 kHz and higher than 13 kHz) whilst still being able to differentiate between the left and right channel signals. To do this, the output signals 23 of each channel are combined and compressed by a signal compression means 25 to produce a stereophonic output signal 24. The compression algorithms used by the compression means 25 are designed to preserve the three dimensional cues in the audio output signals 23 from each channel. One aspect of this is to preserve a wider range of frequencies than is normal for telephony compression. A second key aspect is to preserve the time relationship between the signals in the two channels.

The manner in which the head and outer ears of a listener modify sound waves before they are registered by the inner ears is complex, with several contributing factors playing a part. When a sound source is directly in front of the listener, then each pinna (outer ear flap), together with its auditory canal, is exposed equally to the sound source. However when the sound source is moved to one side of the head of the listener, then the more distant ear lies in the shadow of the head, and the ear closer to the sound source is aligned more on-axis with the source. When sound waves encounter the listener's head, the sound waves diffract around the listener's head. In general, the average width of a human head is 15 cm with an interaural path length of about 20 cm when the circumference effect is taken into account. Sound waves of greater wavelength than 15 cm (corresponding to frequencies below about 1.7 kHz) can diffract efficiently around a human head whereas at higher frequencies the sound wave cannot diffract efficiently around the head. This effect, known as "head-shadowing", creates differences in amplitudes of the sound signals arriving at each ear of the listener. This interaural amplitude difference (IAD) is one of the primary 3D cues which need to be preserved. The effects of diffraction on the intensity of the sound are noticeable in the range of between 700 Hz and 8 kHz and are more noticeable at higher frequencies (say above 2 kHz), where the head-shadowing creates noticeable differences in the intensities of the sound waves reaching the ears. The listener's brain uses these differences in intensity as cues to locate the direction of the source of high frequency sounds. Therefore it is important to retain the relationship between the intensities (or amplitudes) of the high frequency sounds.

At lower frequencies (below say 1 kHz) there is little or no difference in the intensity of the acoustic energy of the sound waves received at both ears but there is a marked phase difference. In general terms the phase difference is approximately proportional to frequency. The listener's brain therefore uses the phase differences of the low frequencies as an important cue to determine the direction of the source of low frequency sounds. It is therefore important to retain the phase relationships between the output signals of the left and right channels for the low frequency sounds.

In addition to the IAD there will be time-of-arrival differences between the left and right ears of the listener, unless the sound source is exactly in front, behind, above or below the head of the listener. This is known as the interaural time delay (ITD) and can be seen depicted in diagram form in Figure 3, which shows a plan view of a conceptual head with a left ear (LE) and a right ear (RE) receiving a sound signal from a distant source at azimuth angle θ (about +45° as shown in the drawing). When the wave front (W - W') arrives at the right ear (RE), then it can be seen that there is a path length of (a+b) still to travel before it reaches the left ear LE. By symmetry, the path length b is equal to the distance from head centre to wave front (W - W'), and hence b = r.sin θ, where r is the radius of the head. The path distance a represents the proportion of the circumference subtended by θ, so that a = r.θ with θ expressed in radians. By inspection, the path length (a+b) is given by:

$$a + b = r\,\theta + r\sin\theta = r\,(\theta + \sin\theta)$$

When θ tends to zero so does the path length (a + b); when θ tends to 90° and the head is 15 cm in width, then the path length is approximately 19.3 cm and the associated ITD is about 560 μs (taking the speed of sound as roughly 343 m/s). In practice, ITDs are measured to be slightly greater than this, possibly because of the non-spherical nature of human heads, the complex diffractive situation and surface effects. Hence ITDs lying in the range of 0 to 0.8 ms are also important primary 3D cues.
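The path length and ITD figures above can be checked with a short sketch. It assumes the spherical-head geometry of Figure 3 (a straight segment b = r.sin θ plus an arc a = r.θ around the head), an azimuth between 0° and 90°, and a speed of sound of 343 m/s; that last value, and all function names, are assumptions made for illustration and are not stated in the specification.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed value; not given in the specification


def interaural_path_length(azimuth_deg, head_width_m=0.15):
    """Extra distance (a + b), in metres, travelled to the far ear
    for a distant source at the given azimuth (0 to 90 degrees)."""
    r = head_width_m / 2.0
    theta = math.radians(azimuth_deg)
    a = r * theta            # arc around the head
    b = r * math.sin(theta)  # straight segment from wave front to head
    return a + b


def interaural_time_delay(azimuth_deg, head_width_m=0.15):
    """ITD in seconds for the same geometry."""
    return interaural_path_length(azimuth_deg, head_width_m) / SPEED_OF_SOUND_M_S


def interaural_phase_difference(freq_hz, azimuth_deg, head_width_m=0.15):
    """Phase difference in radians; proportional to frequency for a fixed ITD."""
    return 2.0 * math.pi * freq_hz * interaural_time_delay(azimuth_deg, head_width_m)
```

For a 15 cm head, `interaural_path_length(90)` evaluates to roughly 0.193 m. The phase helper makes explicit why the low-frequency phase difference grows in proportion to frequency for a given source direction.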

As explained above, the mid-range gain due to the concha-related resonance and the resonance in the auditory canal of the outer ear occurs at about 3 kHz or slightly higher, and this is at the extreme end of the normal bandwidth of conventional telephone transmission lines. Furthermore it is believed that the fossa (a cavity at the uppermost region of the pinna of the outer ear) creates resonance at 13 kHz which boosts the higher frequency sounds, and that the brain of the listener makes use of the higher frequency sounds at 13 kHz or above to assist in determining whether the source of sound is in front of or behind the listener. It is therefore important to retain the detail of high frequency sounds above 13 kHz, if front and back cues are necessary.

Bearing the above in mind, the compression means 25 uses a first algorithm which allows compression of frequencies below 1 kHz, whilst preserving phase differences between the channel output signals 23, and uses a second algorithm to compress frequencies above 2 kHz, whilst preserving relative differences in the amplitudes of the channel output signals 23.

The compression means 25 also employs algorithms that allow the compression of the mid range frequencies, whilst preserving the IAD and ITD information over the whole frequency band.
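The band-dependent behaviour described above can be sketched as follows. The specification does not disclose the compression algorithms themselves, so this example only illustrates which interaural cue each band is said to preserve, and how those cues could be measured from one pair of complex spectral values. The function names, and the treatment of exactly 1 kHz and 2 kHz as belonging to the mid band, are assumptions.

```python
import cmath
import math

LOW_BAND_MAX_HZ = 1000.0   # below this: preserve relative phase
HIGH_BAND_MIN_HZ = 2000.0  # above this: preserve relative amplitude


def cues_to_preserve(freq_hz):
    """Return the interaural cues to be preserved at a frequency,
    following the three bands named in the description."""
    if freq_hz < LOW_BAND_MAX_HZ:
        return ("phase",)
    if freq_hz > HIGH_BAND_MIN_HZ:
        return ("amplitude",)
    return ("phase", "amplitude")  # mid band keeps both cues


def interaural_cues(left_bin, right_bin):
    """Measure the cues from non-zero complex spectral values of the
    left and right channels at the same frequency.
    Returns (phase difference in radians, level difference in dB)."""
    phase_diff = cmath.phase(left_bin * right_bin.conjugate())
    level_diff_db = 20.0 * math.log10(abs(left_bin) / abs(right_bin))
    return phase_diff, level_diff_db
```

A codec built along these lines would compare the values returned by `interaural_cues` before and after compression, checking only the cues that `cues_to_preserve` lists for that frequency.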

The compression means 25 thus preserves the phase and amplitude relationships up to 8 kHz for reproducing three dimensional sound fields without front and back cues, or up to 13 kHz, or above, when front and back cues are wanted.

The output signal 24 of the compression means 25 is a compressed binaural signal which is transmitted over a conventional public telephone link 27 to another receiving station 10, 11.

Each station 10, 11 further includes a receiving means 28 for receiving an incoming compressed combined binaural signal transmitted via the telephone link 27. The receiving means 28, (see Figure 2), comprises a signal processor which operates to re-expand the incoming compressed signal 26 and produce two channel input signals 30. Each channel input signal 30 is supplied to a sound reproduction device 16 which may be the pair of loudspeakers 16 or a pair of headphones 32.

In the case where it is desired to listen through headphones 31(b), it is preferred not to cancel the interaural cross-talk. It is therefore possible to re-introduce the cancelled cross-talk by combining a signal which is the inverse of the cross-talk cancellation signal with the incoming signal.

In a further embodiment of the invention the apparatus of Figures 1, 2, and 3 further includes means for transmitting and receiving video signals over a telephone link 27 as shown in Figure 4. For simplicity, in Figure 4 the same reference numbers are given to the components that are common to the Figure 2 embodiment.

Referring to Figures 1 and 4, each station 10, 11 is provided with a video camera 32 and video processor 33 which is operable to produce a video output signal 34. The video output signal 34 from the camera 32 is supplied to the compression means 25 of the signal processor 19 (see Figure 4). The compression means 25 includes circuits for combining the binaural output signal 24 with the video output signal 34 to produce a combined video and binaural output signal 36 for transmission over the telephone link 27.

The apparatus is also provided with a receiving means 37 for receiving an incoming combined video and binaural signal 38 transmitted over the telephone link 27 from another remote conference centre 10 or 11. The receiving means 37 includes a decompression means 39 for expanding the received video and binaural signal 38, and operates to supply a video signal 40 to a video processor 41 and two audio output signals 30 to the speakers 16 or headphones 32. The output of the video processor 41 drives the monitor 12 to produce a visual image.

Claims

1. Apparatus for transmitting three dimensional sounds via a telephone link comprising an input device consisting of two spaced microphones operable to produce left and right channel monophonic microphone output signals, signal processing means for each channel comprising filter means for receiving the microphone output signals and modifying the signals to compensate for head related air-to-ear transfer functions and equalise the spectral response of the microphone output signals, cross-talk cancellation means for cancelling out interaural cross-talk between the channels, and data compression means operable to receive an output signal from each channel, combine them to produce a binaural signal and compress said binaural signal to produce a compressed binaural signal for transmission over the telephone link, said compression means using a first compression algorithm to compress frequencies below 1 kHz whilst preserving relative phase differences between the channel output signals, a second algorithm to compress frequencies above 2 kHz whilst preserving relative differences between amplitudes of the channel output signals and a third algorithm to compress frequencies between 1 kHz and 2 kHz whilst preserving IAD and ITD information over the whole frequency band.

2. Apparatus according to claim 1 further including receiving means for receiving a compressed binaural signal transmitted over a telephone link and converting said compressed signal into left and right channel audio output signals, and spaced left and right channel sound reproduction means each of which is operable to receive a respective channel audio output signal from said receiving means and reproduce sound corresponding to said respective channel audio output signal.

3. Apparatus according to claim 2 wherein said sound reproduction means comprises a pair of loudspeakers.

4. Apparatus according to claim 2 wherein said sound reproduction means comprises a pair of headphones.

5. Apparatus according to any one of claims 1 to 4 wherein video signal means are provided comprising a camera operable to produce a video output signal, and the compression means is operable to receive the video output signal and to combine said video output signal with said compressed binaural signal to produce a combined output signal for transmission via the telephone link.

6. Apparatus according to any one of the preceding claims wherein the receiving means further includes means for receiving a video signal transmitted over a telephone link and converting the video signal into a video output signal, and display means operable to receive said video output signal and display a visual image.

7. A method of transmitting three dimensional sounds via a telephone link comprising the steps of providing left and right channel monophonic microphone output signals to a signal processing means, filtering said microphone output signals and modifying the signals to compensate for head related air-to-ear transfer functions, equalising the spectral response of the microphone output signals, performing cross-talk cancellation of interaural cross-talk between the channels, and data compressing processed output signals, so as to produce a binaural signal and compressing said binaural signal to produce a compressed binaural signal for transmission over the telephone link, wherein said compression means employs three algorithms, a first compression algorithm compresses frequencies below 1 kHz whilst preserving relative phase differences between the channel output signals, a second algorithm compresses frequencies above 2 kHz whilst preserving relative differences between amplitudes of the channel output signals and a third algorithm compresses frequencies between 1 kHz and 2 kHz whilst preserving IAD and ITD information over the whole frequency band.

8. Apparatus and method substantially as herein described with reference to the accompanying drawings.