WO2009001035A2 - Transmission of audio information - Google Patents

Transmission of audio information

Info

Publication number
WO2009001035A2
Authority
WO
WIPO (PCT)
Prior art keywords
source
output means
sources
audio system
allocated
Application number
PCT/GB2008/002085
Other languages
French (fr)
Other versions
WO2009001035A3 (en)
Inventor
Bernard ST JAMES
Original Assignee
Wivenhoe Technology Ltd
St James Bernard
Application filed by Wivenhoe Technology Ltd, St James Bernard
Publication of WO2009001035A2
Publication of WO2009001035A3

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L19/00 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis
    • G10L19/04 Speech or audio signals analysis-synthesis techniques for redundancy reduction, e.g. in vocoders; Coding or decoding of speech or audio signals, using source filter models or psychoacoustic analysis using predictive techniques
    • G10L19/08 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters
    • G10L19/093 Determination or coding of the excitation function; Determination or coding of the long-term prediction parameters using sinusoidal excitation models
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H04M3/568 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities audio processing specific to telephonic conferencing, e.g. spatial distribution, mixing of participants
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M3/00 Automatic or semi-automatic exchanges
    • H04M3/42 Systems providing special services or facilities to subscribers
    • H04M3/56 Arrangements for connecting several subscribers to a common circuit, i.e. affording conference facilities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R2227/00 Details of public address [PA] systems covered by H04R27/00 but not provided for in any of its subgroups
    • H04R2227/007 Electronic adaptation of audio signals to reverberation of the listening space for PA
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04R LOUDSPEAKERS, MICROPHONES, GRAMOPHONE PICK-UPS OR LIKE ACOUSTIC ELECTROMECHANICAL TRANSDUCERS; DEAF-AID SETS; PUBLIC ADDRESS SYSTEMS
    • H04R27/00 Public address systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04S STEREOPHONIC SYSTEMS
    • H04S7/00 Indicating arrangements; Control arrangements, e.g. balance control
    • H04S7/30 Control circuits for electronic adaptation of the sound field
    • H04S7/305 Electronic adaptation of stereophonic audio signals to reverberation of the listening space

Definitions

  • A practical teleconference environment will be echoic to some degree, giving rise to delayed and distorted replications of incoming soundlets.
  • To identify and remove replications a store of past incoming soundlets is maintained, the frequencies in the store being used as the reference set for comparison. The time that soundlets remain in the store is proportionate to the reverberation time of the acoustic environment. If the storage time is inadequate then older reverberant material will not be discarded; conversely, excessive storage time results in erroneous matches with unrelated soundlets from other periods.
  • The accuracy of the matching process is improved by applying a joint time/intensity metric to exploit the reverberation characteristics. The rate of decay of reverberant energy can be described by a decay rate, Sd, expressed in dB/ms.
  • Figure 13a depicts an example decay for a calibration noise burst in the laboratory during system start-up. From this burst three parameters for the acoustic environment are estimated (a sketch of this estimation follows this list).
  • An estimate of reverberant energy received at the microphone can be plotted against elapsed time, the inverse of which yields a threshold of plausibility as depicted in Figure 13b. For a given elapsed time it is implausible for an acoustically-coupled soundlet to contain more energy than the threshold.
  • All soundlets that pass the frequency-matching criterion are tested for plausible time/intensity, and those plausible soundlets are classified as potentially echoic.
  • Although the parameters are frequency-dependent, the experimental arrangement in the laboratory using mean values still produced a viable level of discrimination. If the loss threshold is set too high, reverberant soundlets with the highest acoustic coupling ratios (not necessarily the loudest) may not be discarded.
  • A frequency matching tolerance of 15 Hz gave sufficient frequency discrimination, and an optimum rate of decay for the plausibility threshold, Sd, was 0.25 dB/ms compared to the laboratory decay rate of 0.5 dB/ms. This discrepancy is probably due to the simplistic estimate of decay determined using a filtered noise burst.
  • Figure 14a shows part of the signal spectrum present at the input to the three-dimensional gate and Figure 14b shows the depleted spectrum at its output. The spectrum of the incoming reference signal is depicted in Figure 14c.
  • The echo suppression technique described herein can be applied to any soundlet-based audio transmission system, including public address systems, chat rooms, on-line multi-user gaming, etc.
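The three-parameter estimation from the calibration burst is not reproduced in full above. The following Python sketch shows one plausible form of the start-up measurement, assuming numpy, a measured energy envelope in dB, and a simple linear (dB-domain) decay model; the function name and return values are illustrative, not taken from the patent.

```python
import numpy as np

def estimate_decay_rate(burst_env_db, env_rate_hz):
    """Fit a linear decay to the tail of a calibration noise burst (Figure 13a).

    burst_env_db: reverberant energy envelope in dB, sampled after the
    burst stops; env_rate_hz: sample rate of that envelope.
    Returns the decay rate S_d (dB/ms) and the initial level (dB); the
    inverse of the fitted loss profile gives the plausibility threshold
    of Figure 13b.
    """
    t_ms = np.arange(len(burst_env_db)) * 1000.0 / env_rate_hz
    slope, level0 = np.polyfit(t_ms, burst_env_db, 1)  # dB per ms, dB
    return -slope, level0
```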

Abstract

An audio transmission and broadcast system to transmit and broadcast input from a plurality of sources, wherein each source is provided with a microphone or the like to collect sound emitted by a source; a plurality of spatially separated output means; the system comprising an encoder to encode sound into a digital form, the encoder allocating a vectorising element to each source; transmission means to transmit encoded data to a receiver, the receiver being operatively connected to a decoder to decode the data and transmit an audio signal through an output means; the output means allocated to a source being dependent on the vectorising element, wherein simultaneous sources are allocated spatially separated output means; and wherein the output means of a particular source remains within a predetermined spatial distance from the output means initially allocated to that source.

Description

TRANSMISSION OF AUDIO INFORMATION
Field of the Invention
The field of the invention is that of transmission of audio information. In particular a system is described to spatially and temporally separate voices in a teleconferencing or surround sound environment. Additionally, a method of suppressing echo within a teleconferencing or broadcasting environment is disclosed.
Background to the Invention
The transmission of audio information digitally now dominates due to the clarity and relative ease of the transmission. However, in order to carry out the transmission the sounds need to be effectively encoded. Effective coding needs to take into account the quality of reproduction and the quantum of information which needs to be encoded. The latter consideration is important because certain transmissions do not require as exact a reproduction of the sound field as others. For example, when transmitting a musical concert a high quality reproduction needs to be made.
However, in a teleconferencing situation certain aspects of the sound field can be lost or neglected without unduly impairing the received sound image: a lower information transmission rate can therefore be utilised, with a resultant cost saving. When implementing a teleconferencing environment currently, a number of drawbacks are encountered. Prior art systems pay little attention to psychoacoustic problems encountered by the listener. For example, listeners have difficulty focusing on one speaker's voice when a second voice is speaking at a small angular separation from the first. A recent MPEG release, ISO/IEC 23003-1, addresses the problem of spatial resolution but suffers from the drawback of cross-talk, whereby elements of one speaker's code become partially mixed in with those of a second speaker, leading to elements of the first speaker's voice being included in the sound stream of the second speaker.
The second difficulty often encountered is that of a clash of phonemes when two speakers are talking at the same time. Certain sounds clash with one another which can lead to difficulties in understanding. Paradoxically, the presence of stops or temporary silences as part of speech can be masked by sounds from other voices.
A further, perhaps more widely experienced difficulty experienced with the transmission of audio information is that of echoic or acoustic feed back. This arises due to the microphone used by a speaker or instrument picking up sound information from a loudspeaker which is broadcasting sounds made by the speaker or instrument. Public address systems often suffer instability due to this effect. Conventional echo suppression techniques require an accurate transfer function to be generated, said function mimicking the effects of the loudspeaker and also of the sound space. Such transfer functions are however difficult to set up and are not robust in normal environments where relative motion of sources and loudspeakers can occur.
It is an object of the invention to provide an audio transmission system and methodology to address the above problems.
Summary of the Invention
According to the invention there is provided an audio transmission and broadcast system to transmit and broadcast input from a plurality of sources, wherein each source is provided with a microphone or the like to collect sound emitted by a source; a plurality of spatially separated output means; the system comprising an encoder to encode sound into a digital form, the encoder allocating a vectorising element to each source; transmission means to transmit encoded data to a receiver, the receiver being operatively connected to a decoder to decode the data and transmit an audio signal through an output means; the output means allocated to a source being dependent on the vectorising element, wherein simultaneous sources are allocated spatially separated output means; and wherein the output means of a particular source remains within a predetermined spatial distance from the output means initially allocated to that source.
Preferably, the angle subtended with the listener by two allocated output means is between 3° and 25°.
Optionally, the output means are horizontally spatially separated to achieve the optimal psychoacoustic separation.
Preferably, each vectorising means is orthogonal to any other vectorising means.
The vectorising means is particularly preferably the imaginary part of a complex number.
Preferably, a source is constrained to remain within an angular distance of x° from its original allocated output speaker, in order to reduce confusion of the listeners as to the source. The rate of movement from one position to another is constrained to be below a preset rate to minimise the likelihood of a listener perceiving the motion.
The system advantageously includes identification means allocating phonemes into classes of phonemes. Particularly advantageously, four classes are defined. Comparison means are conveniently included to compare the time parameter of the classes of phonemes from two sources and where required time-shift a phoneme to avoid cotemporaneous broadcasting of conflicting phonemes.
Brief Description of the Drawings
The invention will now be described with reference to the accompanying drawings which show by way of example only embodiments of an audio transmission system. In the drawings:
Figure 1 illustrates the general process of processing an audio signal into A- soundlets;
Figure 2 illustrates sinusoidal decomposition of an audio frame;
Figure 3 is an overall view of a spatialisation process;
Figure 4 illustrates the application of a spatialising vector;
Figures 5a and 5b illustrate a dynamic spatial separation;
Figure 6 illustrates temporal manipulation of soundlets;
Figure 7 shows the results of a temporal manipulation shown in Figure 6;
Figure 8 is a Table of test data and comprehensibility of text passages;
Figure 9 is a schematic of a process of energy management;
Figure 10 illustrates spectra of conflicting sources resolved by energy management;
Figure 11 illustrates a prior art method of echo suppression;
Figure 12 illustrates echo suppression according to the invention;
Figures 13a, 13b respectively illustrate energy loss of a laboratory signal and a loss profile generated therefrom; and
Figures 14a - 14c are spectra illustrating the effect of an echo gate.
Detailed Description of the Invention
When transmitting sound information picked up from a microphone, it is well known to perform two basic operations to encode the information efficiently. The process is illustrated in Figure 1. The first of these operations is to segment the time variable into a sequence of frames, each frame lasting typically from 20 - 30 ms. Within each frame the sound is then approximated digitally through a combination of three types of function, each type representing a different facet of the sound. The facets represented are the relatively stable components of the sound, rapidly changing components and the apparently random components: often referred to as sinusoidal, transient and noise respectively.
The sinusoidal component is relatively simple to describe in terms of a series of sinusoids (see Figure 2), each of which can be characterised by three features: the amplitude, the frequency and the phase φ. When dealing with a reproduced sound perceived by a human listener, many terms of the series can be discarded as being too close to others to be audible. Also discarded are frequencies which are too high or too low to be useful. This typically leaves 25 - 45 elements of the series to be described per frame. The noise can be expressed as a filtered white noise function $R_k(t)$. The overall signal for a frame, termed herein a 'soundlet', can be expressed as:

$$\sum_{n=1}^{N} A_{nk}\sin(2\pi F_{nk}t + \phi_{nk})E_{nk}(t) \;+\; \sum_{n=N+1}^{M} A_{nk}\sin(2\pi F_{nk}t + \phi_{nk})E_{nk}(t) \;+\; R_k(t)$$
The first term represents the N audible components and the second the M-N perceptually inaudible components. The third term $R_k(t)$ represents the noise-like residual signal once the significantly tonal elements have been extracted from the frame. It can be expressed as filtered white noise: $R_k(t) = N(t)S_k(t)H_k$, where $N(t)$ is white noise of unit amplitude, $S_k(t)$ is a scaling factor and $H_k$ is a filter function, usually expressed as a spectral response of 7 DCT coefficients.
The residual noise wave enhancement layer in the encoder can provide a simple amplitude modulation envelope. Although comparatively small in energy the noise soundlet is important objectively as it adds a degree of naturalness, particularly in the case of aspirated or unvoiced speech segments.
The amplitude modulation function, $E_{nk}(t)$, is a best-fit triangular envelope described by the parameters $a_{nk}$, the rate of attack, $d_{nk}$, the rate of decay, and $t_p$, the time of peak occurrence. Others have suggested more elaborate envelope modelling for improved fidelity. In order to ensure that there is no loss of information between adjacent frames an overlap is formed between said frames, which can be around 50%.
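As an illustration of this frame model, the following Python sketch renders one soundlet frame from its parameters. It is a minimal sketch, assuming numpy and a 16 kHz sample rate; the function names, the FIR stand-in for the filter function $H_k$ and the unit-peak normalisation of the triangular envelope are our own assumptions, not taken from the patent.

```python
import numpy as np

def triangular_envelope(attack, decay, t_peak):
    """Best-fit triangular envelope from rate of attack a_nk, rate of
    decay d_nk (both 1/s) and time of peak occurrence t_p; assumed to
    peak at unit amplitude."""
    return lambda t: np.clip(
        np.where(t <= t_peak,
                 1.0 - attack * (t_peak - t),   # rising edge
                 1.0 - decay * (t - t_peak)),   # falling edge
        0.0, None)

def synthesise_soundlet(amps, freqs, phases, envelopes,
                        noise_scale, noise_filter,
                        frame_len=0.025, fs=16000):
    """Render one soundlet frame: enveloped sinusoids plus filtered noise.

    amps, freqs, phases: per-partial A_nk, F_nk (Hz) and phi_nk (rad).
    envelopes: callables E_nk(t). noise_scale and noise_filter stand in
    for the scaling factor S_k and the filter function H_k.
    """
    t = np.arange(int(frame_len * fs)) / fs
    frame = np.zeros_like(t)
    for A, F, phi, E in zip(amps, freqs, phases, envelopes):
        frame += A * np.sin(2 * np.pi * F * t + phi) * E(t)
    # Residual term R_k(t) = N(t) S_k H_k: unit-amplitude white noise,
    # scaled and shaped here by a short FIR filter.
    white = np.random.uniform(-1.0, 1.0, t.size)
    frame += noise_scale * np.convolve(white, noise_filter, mode="same")
    return frame
```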
The above coding does not however enable information from different sources to be identified during decoding and therefore separated from other sources. The result of this is that where the receiver has more than one loudspeaker, the sounds from different sources are not separated. Quite often therefore, particularly when the sources are a number of different speakers the voices overlap and lose a degree of intelligibility.
The present invention, as illustrated in Figure 3, seeks to overcome the difficulty by assigning each speaker a vector identification. It is envisaged, referring to a teleconferencing situation, that each participant is provided with their own microphone. The sound entering each of the microphones is converted into soundlets using the methodology described above. In the present example, to each set of soundlets arising from one particular microphone is assigned a spatialising vector. The vector can simply be a number or other label, which has the effect, when signals are decoded, of a speaker's signal being directed to a particular loudspeaker. Said vectors can be in the form of the imaginary part of a complex number, $\arg(Z_{nkv})$. The encoded information transmitted therefore contains information on which particular participant each soundlet belongs to. It is advantageous if the set of vectors assigned are mutually orthogonal, assigning each speaker initially to a particular loudspeaker. In this sense orthogonal can simply have the meaning of being different to other vectors, but can also include the conventional meaning. Using the above methodology of distinguishing sources, intelligibility within a teleconferencing environment can be improved.
Any sinusoidal soundlet, $\Omega_{nkv}$, occurring in a spatially defined object, v, can be expressed as:

$$\Omega_{nkv} = A_{nkv}\sin(2\pi F_{nkv}t + \phi_{nkv})E_{nkv}(t)\,\arg(Z_{nkv})$$

Similarly, any residual noise soundlet, $N_{kv}$, can be expressed as:

$$N_{kv}(t) = N(t)S_{kv}(t)H_{kv}\,\arg(Z_{kv})$$
Spatial reproduction of speech has been shown to improve intelligibility due to the psychoacoustic phenomenon of binaural masking level difference. When a person speaks within a noisy environment there is an audibility advantage if the voice is presented to the listener binaurally rather than monaurally. To exploit this advantage the voice has to be physically displaced from the ambient noise. In the case of teleconferencing the interfering noise is predominantly other persons speaking at the same time. The binaural advantage is shown to increase with azimuthal separation although beyond 25° the improvement is less pronounced.
Intelligibility during conversations is also context sensitive. Any ambiguity in the speaker's identity compromises intelligibility. Spatial rendition of similar sounding voices provides strong cues to identity and hence another means to improving intelligibility.
In spatial audio teleconferencing contributing sources are placed at real or artificial positions within the reconstructed soundscape. Where there are several contributors significant human intervention is required to set up the soundscape in each location. For maximum binaural advantage each attendee must be placed at an isolated position within the soundscape. Numerous contributors cannot be separated by the optimum spacing within a practical soundscape and this forces some to be closely located. Conversational dynamics are such that some persons have a tendency to interrupt others. Also, parallel conversations can develop between pairs of persons. Without advance knowledge of who is about to speak it is impossible to prevent simultaneous voices emanating from near-identical positions.
When carrying out manual placement however a number of further difficulties also need to be overcome. Firstly different venues receiving the same spatialised material are likely to have different capacity, loudspeaker configurations and seating arrangements. To impose therefore a common placement scheme would not lead to optimum spatial separation of different sources. Secondly, individual microphone signals may be mixed locally before being multiplexed on the common teleconference server. The mixer in each venue may not have the soundscape data for all of the sites, resulting in an aggregated multiplex that is likely to raise spatial conflict. Thirdly even if all of the sources were manually positioned at each venue it is still likely that participants joining and leaving the teleconference at different times would compromise the soundscape.
The automated placement scheme envisaged herein is advantageous in terms of both intelligibility and ease of operation. Nevertheless, an override facility can be provided allowing limited manual grouping of participants. For example, ordering and placement of sources by importance could be introduced where policies so dictate. At each participating venue, sources being output are located and moved to maintain optimal positions within the soundscape of a venue. The process is illustrated with reference to Figure 3. Within the array of output loudspeakers O_1, O_2, ..., O_M, soundlets received from a particular source are allocated to one of these output loudspeakers. When a second speaker joins in, this second speaker is allocated an output loudspeaker at an angular separation, with reference to the listeners at that venue, from that allocated to the first speaker. When a third speaker joins in, the output loudspeaker allocated to that third speaker is again at an angular separation from both of the previous speakers. Typically, angular separations of output loudspeakers of up to 25° are ideal to facilitate a listener in distinguishing the verbal contributions of different speakers. Separations greater than 25° do not tend to bring further advantage. Typically, a separation of 3-25° can be used; below the lower value of 3° the separation would not be expected to bring any advantages to the listener.
Although it is possible to allocate to an individual person an isolated, fixed position within a soundscape, irrespective of when they joined or left a conversation, this is not optimal and can cause the listeners confusion. An algorithm is therefore employed to marshal the large number of contributors in and out of the soundscape, which algorithm can make use of a restricted number of output loudspeakers. Contributing speakers can be moved to a limited extent within a soundscape without causing the listener any confusion whatsoever. Indeed, provided the movement is not excessive, it is unlikely that the movement will be noticed. The procedure is assisted by the observation that even where there are a large number of speakers taking part in a teleconferencing event, it is likely that only two or three of them will be speaking at any one time.
To achieve this end, the following processes have been developed which modify the sound that is received, wherein the assigned vector is modified to suit the soundspace in which it is being broadcast. This process is illustrated in Figure 4, in which a spatial vector can be seen to define an azimuthal position within a soundscape. It will be appreciated therefore that the modifications will be different depending on the particular soundscape, although the methodology will be the same. The description below uses the following parameters:
λ = angular range of reconstructed soundscape (rad)
Q_p = quantity of all participating sources
Q_a = quantity of all active sources
δ = permitted angular deviation of a source from its origin (rad)
η = target separation from other active sources (rad)
ρ = {Ω : Ω ∈ all possible sources}
α = {Ω : Ω ∈ all active sources}
Using these, therefore, an optimum azimuth $Z_{O,x}$ can be defined for a given source x (x ∈ α): $Z_{O,x} = \mathrm{function}(\lambda, Q_a, \delta, \eta, Z_\alpha)$. Due to the presence of other sources the azimuth will need to change to accommodate them, and so an instantaneous azimuth is assigned to the source for each soundlet frame k in which the source x is active:

$$Z_{I,x,k} = Z_{I,x,(k-1)} + v(k - k_x)\,(Z_{O,x,k} - Z_{I,x,(k-1)})$$

In this equation v is the velocity of migration (rad/frame) and $k_x$ is the frame in which x becomes active.
Control of the rate of migration is governed to a large extent by the value of v, which is itself a function of the number of sources entering or leaving the conference. v is relatively large for new sources (500 rad/frame) but relatively small for the remaining sources after a departure (20 rad/frame). The relative slowness means that it is not easily perceived by the listener that the output speakers are spreading apart from one another or returning towards their pre-conflict positions.
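The update above lends itself to a one-line implementation. The sketch below is a minimal Python rendering of the instantaneous-azimuth recurrence; the clamping of the migration step so that a source never overshoots its target is our own defensive assumption, not stated in the text.

```python
def migrate_azimuth(z_prev, z_opt, k, k_x, v):
    """Instantaneous azimuth for source x at frame k:
    Z_I,x,k = Z_I,x,(k-1) + v(k - k_x)(Z_O,x,k - Z_I,x,(k-1)).

    z_prev: previous instantaneous azimuth Z_I,x,(k-1) (rad).
    z_opt: current optimum azimuth Z_O,x,k (rad).
    v: velocity of migration (rad/frame), larger for new sources.
    k, k_x: current frame and the frame in which x became active.
    """
    step = v * (k - k_x)
    step = min(step, 1.0)  # assumption: clamp so the source never overshoots
    return z_prev + step * (z_opt - z_prev)
```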
Figures 5a and b illustrate the procedure in practice. Figures 5a and b show actual rendition tracks plotted against time for 3 active sources. Figure 5a depicts statically rendered sources of which two are virtually co-sited. Figure 5b depicts the same sources when re-rendered dynamically. The position of each source has an interactive effect on its neighbours such that the optimum positions for all sources are continuously revised.
The general principle of operation can be observed from the rendition tracks: the first conflict between sources (B) and (C) occurs at 9 seconds. When dynamically rendered, the source (C) appears displaced to the right and source (B) migrates to the left to achieve adequate separation. Source (A) reappears after another 4 seconds at a displaced position due to the proximity of source (B). Whilst the algorithm is set to maintain a nominal 25° separation the rules for soundscape stability take precedence. For example: sources that have been inactive for a defined period may enter the soundscape at their optimum positions whilst others must migrate under controlled conditions. The optimum rate of migration is variable dependent upon prevailing conditions.
Experiments carried out on the dynamic rendition process indicate that the manipulations are detectable to an experienced listener but not to a general audience, the members of which are unaware that spatial modification may occur. "Animation" (perception of movement in active sources) and "relocation" (perception of movement in inactive sources) can be reduced. For example, for applications where ~500 ms latency is acceptable, the algorithm includes a look- ahead facility to predict impending conflicts. Predictive optimisation allows less intrusive manipulation by pre-emptive movement of sources towards optimal positions.
The dynamic rendition process is compatible with other coding techniques and is scaleable to broadcast quality. Further applications beyond teleconferencing include automatic spatialising low-budget radio "telephone chat shows" where the presenter is also burdened with panning contributors into position on the mixing desk.
The above limitations ensure efficient operation without a limitation being in place on the number of active sources. For example, each active source is maintained at least a minimum angular separation from other active sources and will moreover (assuming there are no conflicting positions), if moved, migrate back to its original position to maintain stability. It is also envisaged that each speaker should keep its position relative to the other speakers in the long term and also maintain its place order relative to the other speakers.
In addition, the introduction of a new source should not cause the listener to perceive any spatial instability in the active sources already existing: even where the new source because of its allocated position could conflict with those active sources. In this respect, a source is not allowed to move more than a pre-set angular distance from its original position. Moreover, any movement is not allowed to exceed a pre-set rate.
Further improvements to the intelligibility of an output can be achieved through addressing problems caused by overlap of conflicting phonemes within speech. Most parts of the spoken language are rugged in terms of intelligibility due to a high degree of redundancy in the information. However, some speech phonemes are more susceptible than others to corruption by conflicting voices.
During overlapping speech, some phonemes in one source may be corrupted by their underlying phones being masked by phones from another source. The corruptive phones tend to have high intensity and a spread-energy spectrum, such as plosives and some fricatives. The vulnerable phonemes tend to be low-energy, non-vowel sounds, in particular stops, which are critical to intelligibility. For example, masking of the stop in the word "stay" results in the perception of the word "say".
The inherently orthogonal nature of the vectored soundlet data stream allows access to discrete sound objects (voices) within multiplexed channels for preprocessing prior to spatial presentation. Two processes can be undertaken concurrently to provide protection to vulnerable phonemes. Firstly, temporal manipulation can be undertaken to separate vulnerable and corruptive phonemes by a process of micro-editing.
Secondly, energy manipulation is carried out within each critical band to minimise intensity masking.
In order to separate conflicting phones the vectorised soundlets can be employed to temporally separate said phones. In brief, soundlets from one time period are mapped to another to minimise corruptive overlap. In order for potential conflict to be avoided the soundlets within timeframes need first to be identified and classified.
In the following description, the protection of a 'stop', a brief silence often heard before plosives such as 'B' or 'P' is described. 'Stops' of this kind are typically of 30-60 ms in duration. Although in a silent background stops down to 8 ms can be discerned audibly, 15 ms is the limit in a noisy background. The discernability rises as the length increases to 60 ms beyond which it takes on fricative attributes. The audibility of short duration stops is also dependent on the magnitude of a following plosive. 'P' and 'B' phonemes are more resilient than 'T' and 'K' due in part to stronger associative plosives. Even in the absence of conflicting speech, room reverberations deny totally silent stops. In practice, stops with a minimum energy depression of approximately 8 dB for at least 15 ms are perceived reliably. Temporal masking of a stop by its own preceding energy may account for perception of silence.
To achieve this a four type classification system is used as follows:
"pause" A near-silent, non-vulnerable period, Tp, > 100 ms, Ap< -46 dB
"stop" A near-silent vulnerable period, 15 ms < Ts < 100 ms, As< -46 dB
"refuge" A non-corruptive, non- vulnerable, lower energy period Tr > 100 ms, Ar< -20 dB
"corruptive" A corruptive, non-vulnerable, high-energy period, Tc, > 0 ms, Ac> - 2O dB
It is more likely that vulnerable stops would be masked by fricatives in the conflicting speech rather than by vowel sounds, due to the similar spectral content of the sound in which the stop is embedded. Fortunately, fricatives have a shorter mean duration than vowel sounds requiring significantly less time displacement to misalign the conflicting elements.
Refuges are defined as segments that contain no vulnerable phonemes (stops) but are also non-corruptive, such as low to medium intensity vowel sounds. The tonal nature of the signal has little masking ability across each critical band. Vulnerable stops can be safely aligned with refuges.
Figure 6 depicts a simplified schematic of the temporal manipulation process to misalign conflicting events. Over an observation period, T_scope, provided by the source buffers, the following manipulations are available to minimise conflicts:
i. extend or shorten pauses (add/remove x silent frames);
ii. extend or shorten refuges (replicate/remove x frames);
iii. extend or shorten stops (add x silent frames).
In terms of soundlet manipulation the micro-edit commands can be described as:
Frame advanced by x: $\Omega_k = \Omega_{(k-x)}$
Frame retarded by x: $\Omega_k = \Omega_{(k+x)}$
Frame repeated x times: $\Omega_{(k+1)} \ldots \Omega_{(k+x)} = \Omega_k$
Discard x frames: $\Omega_{(k+1)} \ldots \Omega_{(k+x)} = \mathrm{null}$, and advance subsequent frames
Insert x silent frames: retard subsequent frames
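Treating a source as a Python list of soundlet frames, the commands can be sketched as pure functions. This is a minimal sketch; the function names and the list representation are our own, and a real implementation would operate on buffered, vectorised soundlet streams.

```python
def frame_advance(frames, k, x):
    """Frame advanced by x: frame k takes the value of frame k-x."""
    edited = list(frames)
    edited[k] = frames[k - x]
    return edited

def frame_retard(frames, k, x):
    """Frame retarded by x: frame k takes the value of frame k+x."""
    edited = list(frames)
    edited[k] = frames[k + x]
    return edited

def frame_repeat(frames, k, x):
    """Frame k repeated x times, retarding all subsequent frames."""
    return frames[:k + 1] + [frames[k]] * x + frames[k + 1:]

def frames_discard(frames, k, x):
    """Discard x frames after k and advance the subsequent frames."""
    return frames[:k + 1] + frames[k + 1 + x:]

def insert_silence(frames, k, x, silent_frame=None):
    """Insert x silent frames after k, retarding subsequent frames."""
    return frames[:k + 1] + [silent_frame] * x + frames[k + 1:]
```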
The editing is executed according to a set of process rules to maintain transparency, fidelity and synchronicity between sources. As an illustration, some of the temporal parameters applied to the process are:
T_scope = maximum scope of analysis and manipulation, typically 500 ms
T_range = maximum range of editing influence from any vulnerable allophone, typically 400 ms
T_exten = maximum duration to which a stop can be extended, typically 60 ms
T_mins = minimum duration of an exposed stop, typically 15 ms
T_refext = maximum extension to refuge period, typically 50 ms
T_drift = maximum drift or asynchronicity between sources, typically 1000 ms
T_pmp = post-masking protection guard-band, typically 30 ms
Relative timing is optimised over the sliding observation window of g frames allowing a small degree of "look forward" and "look backward" to identify appropriate editing points. This introduces some inherent latency which can be limited to typically 500 ms. Since manipulation is only necessary during conflicting speech the buffers can re-size dynamically to match latency with the demands of the material. Within the scope of the window events may be moved by a range of up to +/- R frames.
In cases where all potential conflicts cannot be resolved completely within the scope of the processing the experimental algorithm makes multiple analytical passes using different editing strategies. A metric of gross conflict time is used to select the optimum strategy.
For a given scope of frames, g:

$$k - \tfrac{g}{2} \le \Omega_k < k + \tfrac{g}{2}$$

$$\text{gross conflict factor, } \Psi = \alpha\,\mathrm{num}\{\Omega_{k,\mathrm{stop}} \cap \Omega_{k,\mathrm{corruptive}}\} + \beta\,\mathrm{num}\{\Omega_{k,\mathrm{stop}} \cap \Omega_{kv}\}$$

where α and β are weighting factors and num denotes the number of elements in a set.
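The second intersection in the printed formula is partially illegible, so the Python sketch below treats the β term as counting stops that coincide with any non-silent material from the conflicting source; that reading, together with the default weights, is an assumption on our part.

```python
def gross_conflict(labels_a, labels_b, k, g, alpha=1.0, beta=0.25):
    """Gross conflict factor over a window of g frames centred on frame k.

    labels_a, labels_b: per-frame class labels for two sources ('pause',
    'stop', 'refuge', 'corruptive'). The alpha term counts stops in
    source A masked by corruptive frames in source B; the beta term
    counts stops coinciding with any non-silent material at all.
    """
    lo, hi = max(0, k - g // 2), min(len(labels_a), k + g // 2)
    masked = sum(1 for i in range(lo, hi)
                 if labels_a[i] == "stop" and labels_b[i] == "corruptive")
    exposed = sum(1 for i in range(lo, hi)
                  if labels_a[i] == "stop" and labels_b[i] != "pause")
    return alpha * masked + beta * exposed
```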
Remaining stops that cannot be re-aligned safely against pauses and refuges can, in some circumstances be widened to an optimum duration to promote audibility.
Due to temporal masking, re-aligning stops in one source with the start of pauses in conflicting sources may not be adequate as the stop may still be lost within the masking decay of the previous conflicting material (~0.5 dB/ms). A practical guard band has been found to be 30 ms.
From a psychoacoustic perspective it is believed that sounds from different sources may be perceptually fused into a single, indeterminate object if there are strong cues to group them. The cue of "common fate" may cause grouping if two independent sources were to have near-simultaneous onset transients. Re-timing to misalign significant transients may preserve the audibility of individual sound objects and will be developed into the experimental model.
Consideration has to be given to the phase of the soundlets when phonemes are displaced in time, as phase discontinuities between overlapping frames give rise to audible modulation artefacts. Fortunately, the majority of edits take place on the boundaries of silent periods so discontinuities do not arise. In the case of lengthening a refuge the insertion of duplicated frames does cause discontinuities. Corrected starting phases for the inserted and all subsequent soundlets can be calculated by matching to the closing phases of the preceding frames:

$$\phi_{n(k+1)} = \left(\phi_{nk} + 2\pi F_{nk}(1 - O)T\right) \bmod 2\pi$$
where O = frame overlap factor, T = frame period
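The printed correction formula is an unreproduced image; the relation coded below is a standard phase-continuity match consistent with the overlap factor O and frame period T defined above, and should be read as our reconstruction rather than the patent's exact expression.

```python
import math

def corrected_phase(phi_prev, freq_hz, frame_period_s, overlap):
    """Starting phase for an inserted or subsequent soundlet, matched to
    the closing phase of the preceding frame (reconstruction, see above).

    phi_prev: starting phase of the preceding frame (rad).
    freq_hz: partial frequency F_nk; frame_period_s: frame period T;
    overlap: frame overlap factor O (e.g. 0.5 for 50% overlap).
    """
    hop = (1.0 - overlap) * frame_period_s  # time advance between frame starts
    return (phi_prev + 2.0 * math.pi * freq_hz * hop) % (2.0 * math.pi)
```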
This phase correction works adequately for slow pitch changes but introduces artefacts during pitch glides. In such cases a random starting phase is subjectively preferable. Periodicity artefacts can also be minimised by suppressing envelope modulation during repeated frames, i.e. by holding the amplitude envelope $E_{nk}(t)$ constant across the duplicated frames.
If left unchecked the accumulation of edits could cause the sources to drift out of synchronism. Generally, re-synchronisation takes place during pauses but long segments of speech can require a proactive re-timing regime. The controlling algorithm monitors the asynchronicity and this influences the degree of timing correction within the editing strategy. Drift is both a relative timing measure between all sources and a relative error to a global timing reference.
Figures 7a and 7b depict an actual example of bi-directional corruption between two sources. Stops (white) can be seen to be aligned with corruptive material (black) in sources A and B in Figure 7a. The processed sources in Figure 7b show stops generally aligned with pauses or refuges. This particular example illustrates extended pauses in both sources to achieve the re-timing. The overall duration of the clip has been extended due to the application of a modest re-synchronising regime.
Informal listening tests with pairs of recorded passages of speech being replayed simultaneously have shown an improvement in intelligibility. Pairs of passages were replayed monaurally at equal loudness and were described as "difficult to understand" by all members of the panel. For unprocessed material all members made comprehension errors. Typical errors were mistaking "whiter" for "wider", "stand" for "sand" and "scales" for "sails". Figure 8 gives a summary of these preliminary results when sound tracks were played to a sample of listeners. Whilst the temporal manipulation described above provides protection for vulnerable stops, other vulnerable phonemes can be protected by minimising the effects of intensity masking through the manipulation of the energy in each critical band.
It is known that within each critical band masking of less intense sounds requires a differential in energy between the masker and maskee. By reducing the differential to near-zero the potency of the masker elements is reduced significantly providing that there is only a small number of active sources. It is possible to have more than one source above the masking threshold particularly in the case of corpuscular streaming where the frequency spectrum has been depleted.
By re-balancing the energy between sources but maintaining the same total energy within each band the subjective effect of the manipulation is minimised. Due to the perceptually compressed nature of the data stream, the dynamic range of multiplexed sources within each critical band is normally limited to < 10 dB. For non-jointly compressed sources the maximum dynamic range is, typically, < 55 dB.
Figure 9 depicts a simplified schematic of the energy management process. The soundlets from each source are mapped into critical bands and then energy balanced.
In terms of corpuscular transformation the process may be described as follows. Soundlets in the active set, α, are mapped by frequency, F, into Y critical-band sets. For each critical band y:

$$\Omega_{Fy} = \{\Omega : F_{yL} \le F \le F_{yU}\}$$

where L and U denote the lower and upper critical-band boundaries. To prevent excessive and obvious manipulation a maximum working dynamic range is set, below which soundlets are not processed. This prevents very low intensity elements from being grossly over-amplified. Low-energy soundlets beyond the dynamic range, ~u dB, are excluded from processing:

$$\Omega_V = \{\Omega : A \ge u\}$$
All members of the valid set, V, are re-balanced in energy with a common amplitude, $\hat{A}$:

$$\hat{A} = \sqrt{\frac{1}{\mathrm{num}(V)}\sum_{\Omega \in V} A^2}$$
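A sketch of the band-mapping and re-balancing in Python follows. The dictionary representation, the 55 dB default working range and the root-mean-square choice of common amplitude (which preserves each band's total energy, as the text requires) are our assumptions.

```python
import math
from collections import defaultdict

def balance_bands(soundlets, band_edges, dynamic_range_db=55.0):
    """Map soundlets into critical bands, then re-balance each band to a
    common amplitude preserving the band's total energy.

    soundlets: list of dicts with 'freq' (Hz) and 'amp' (linear amplitude).
    band_edges: list of (lower_hz, upper_hz) critical-band boundaries.
    Soundlets more than dynamic_range_db below the band peak are excluded,
    so very low intensity elements are never grossly over-amplified.
    """
    bands = defaultdict(list)
    for s in soundlets:
        for y, (lo, hi) in enumerate(band_edges):
            if lo <= s["freq"] <= hi:
                bands[y].append(s)
                break
    for members in bands.values():
        peak = max(m["amp"] for m in members)
        floor = peak * 10 ** (-dynamic_range_db / 20.0)
        valid = [m for m in members if m["amp"] >= floor]
        if len(valid) < 2:
            continue  # nothing to re-balance against
        # RMS of the members: equal amplitudes, same total band energy.
        common = math.sqrt(sum(m["amp"] ** 2 for m in valid) / len(valid))
        for m in valid:
            m["amp"] = common
    return soundlets
```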
Figures 10a and 10b show the energy distribution across a section of frequency spectrum for two unprocessed conflicting sources. Figures 10c and 10d show the same sources once energy management processing has been applied. The spectra of each pair of sources, when combined, would appear virtually identical as the same energy is retained in each critical band.
The subjective effect on the perceived quality of the processed audio has been described as "occasionally slightly gritty" during some unvoiced utterances. Because energy is balanced on a critical-band basis, no wholesale gain fluctuations are perceived. In terms of intelligibility, informal listening tests indicate an increase in speech clarity for two simultaneous talkers.
Sound quality within a teleconferencing situation, although certainly not confined to it, is also affected adversely by echo effects. These arise predominantly from microphones picking up the output of loudspeakers and retransmitting it along with the voice of the speaker. Such echoes can be inconvenient, can mask a speaker's voice and can, on occasion, lead to catastrophic breakdown of the signal. A number of techniques are known to minimise the effects of echoes. Commonly, an adaptive filter is introduced (see Figure 11) which models the combined transfer function, H, of the loudspeaker, the microphone and the acoustic space in which they are located. The processed incoming signal, S_in·Ĥ, is subtracted from the microphone signal to leave a residual signal, S_o, that is attributable solely to the local sound source, S_L:
S_o = S_L + S_in·H − S_in·Ĥ
This technique does, however, have drawbacks. As can be seen from the above equation, Ĥ needs to mimic H closely, which is difficult to achieve in practice. In the case of multi-channel audio a matrix of transfer functions exists between all microphones and loudspeakers, and deriving accurate cancellation signals for all microphones is currently considered a significant challenge in signal processing.
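By way of context, a minimal single-channel sketch of this conventional adaptive-filter approach, using a normalised LMS update (the tap count, step size and all names are illustrative assumptions, not the patent's method), might be:

import numpy as np

def nlms_echo_canceller(s_in, mic, taps=256, mu=0.5, eps=1e-8):
    """Adapt h_hat so that s_in filtered by h_hat tracks the echo path H,
    then subtract the estimate from the microphone signal."""
    h_hat = np.zeros(taps)
    x = np.zeros(taps)                        # delay line of incoming samples
    residual = np.zeros(len(mic))
    for n in range(len(mic)):
        x = np.roll(x, 1)
        x[0] = s_in[n]
        e = mic[n] - h_hat @ x                # residual: ideally S_L alone
        residual[n] = e
        h_hat += mu * e * x / (x @ x + eps)   # normalised LMS update
    return residual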
Corpuscular streaming audio enables an alternative, more robust paradigm for effective echo control; Figure 12 depicts its implementation. First, a test is made of the surroundings in which a sound source is located: a known sound shape is emitted and the resulting reverberation is measured. A three-dimensional echo gate, operating in the corpuscular domain, is placed in the outgoing signal path. The incoming signal, prior to decoding, is compared with the encoded microphone signal, after which elements of the microphone signal that are recognised within the reference signal, albeit delayed and distorted, are identified for removal. This enables signals originating from the local sound source to be passed to the channel whilst incoming signals, reproduced from one or more loudspeakers within the same acoustic environment, are discarded.
The combination of amplitude, frequency and arrival time of each soundlet gives a three-dimensional reference (its co-ordinates) for determining its acoustic origin. Since the encoding process is driven by a perceptual model of human hearing, the depleted set of subjectively significant soundlets tends to be spectrally isolated. For voiced segments the frequencies generally coincide with the formant peaks.
The minimum spectral spacing is typically > 40 Hz. If the spectral elements of the local and incoming sources are mutually exclusive, then soundlets from the microphone stream can be discarded if they are identified as members of the incoming set. However, limited encoder accuracy and the distortions in the acoustically coupled material necessitate a tolerance in frequency matching; accurate matching is achieved only if the tolerance is within half the spectral spacing.
In practice some incoming soundlets will be in close spectral proximity to those from the local source. This ambiguity is resolved by allowing those soundlets to pass to a secondary matching process. Fortunately, due to perceptual masking, the subjective consequence of passing echoic soundlets is less severe than that of erroneously rejecting those from the local source, although it should be noted that the secondary matching process is intended primarily for dealing with reverberant elements.
A practical teleconference environment will be echoic to some degree, giving rise to delayed and distorted replications of incoming soundlets. To identify and remove replications, a store of past incoming soundlets is maintained, with the frequencies in the store used as the reference set for comparison. The time that soundlets remain in the store is proportional to the reverberation time of the acoustic environment. If the storage time is inadequate, older reverberant material will not be discarded; conversely, excessive storage time results in erroneous matches with unrelated soundlets from other periods.
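A simplified sketch of such a store, assuming soundlets are reduced to their frequencies and using the 15 Hz tolerance and six-frame retention reported in the experimental section (the class and method names are assumptions), might be:

from collections import deque

class ReferenceStore:
    """Store of past incoming soundlet frequencies; retention is tied to
    the reverberation time of the acoustic environment."""
    def __init__(self, frames_retained=6, tol_hz=15.0):
        self.frames = deque(maxlen=frames_retained)
        self.tol_hz = tol_hz

    def push(self, incoming_freqs_hz):
        """Store the soundlet frequencies of one incoming frame."""
        self.frames.append(list(incoming_freqs_hz))

    def is_potential_echo(self, freq_hz):
        """A microphone soundlet matching any stored frequency within
        tolerance is passed on to the time/intensity plausibility test."""
        return any(abs(freq_hz - f) <= self.tol_hz
                   for frame in self.frames for f in frame)

store = ReferenceStore()
store.push([240.0, 710.0, 2400.0])      # formant-like incoming frame
print(store.is_potential_echo(702.0))   # True: within 15 Hz of 710 Hz
print(store.is_potential_echo(1000.0))  # False: no stored match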
The accuracy of the matching process is improved by applying a joint time/intensity metric to exploit the reverberation characteristics. The rate of decay of reverberant energy can be described by:-
ΔSPL = 4.35 · t / T_ε

where t = elapsed time, s; ΔSPL = change in sound pressure level, dB; T_ε = time constant of the energy decay for an acoustic space.
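As a worked example, rearranging this relation for T_ε and inserting the laboratory figures reported below (a decay to −30 dB taking 120 ms):

# Rearranging dSPL = 4.35 * t / T_eps for T_eps, with a 30 dB decay
# measured over 120 ms:
t_s, delta_spl_db = 0.120, 30.0
T_eps_s = 4.35 * t_s / delta_spl_db
print(T_eps_s)  # ~0.0174 s: an energy time constant of roughly 17 ms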
Figure 13a depicts an example decay for a calibration noise burst in the laboratory during the system start-up. From this burst three parameters for the acoustic environment are estimated:-
I. Loss (gain) through the acoustically coupled link, inclusive of amplifiers, transducers and coding devices (= direct loss).
II. Relative intensity of the first reflection (= direct/reverberant ratio).
III. Rate of reverberation decay, S_d, dB/ms.
From these parameters an estimate of reverberant energy received at the microphone can be plotted against elapsed time, the inverse of which yields a threshold of plausibility, as depicted in Figure 13b. For a given elapsed time it is implausible for an acoustically coupled soundlet to contain more energy than the threshold. All soundlets that pass the frequency-matching criterion are tested for plausible time/intensity, and those found plausible are classified as potentially echoic. Although the parameters are frequency-dependent, the experimental arrangement in the laboratory using mean values still produced a viable level of discrimination. If the loss threshold is set too high, reverberant soundlets with the highest acoustic coupling ratios (not necessarily the loudest) may not be discarded. Conversely, a lowered threshold yields a subjectively hollow, less natural rendition as a result of missing soundlets. Experiments have shown that, with the threshold set correctly, echo suppression is subjectively almost total and remains stable under highly dynamic conditions, even when the microphone is in rapid motion.
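A sketch of the plausibility test, using the 8 dB direct/reverberant ratio and 0.25 dB/ms decay rate reported in the experimental section (the 20 dB direct loss and all names are assumptions), might be:

def plausibility_threshold_db(ref_level_db, elapsed_ms,
                              direct_loss_db=20.0, dr_ratio_db=8.0,
                              decay_db_per_ms=0.25):
    """Maximum plausible level for an acoustically coupled soundlet at a
    given elapsed time since its incoming reference was stored."""
    if elapsed_ms <= 0.0:
        return ref_level_db - direct_loss_db            # direct path only
    # The first reflection sits dr_ratio_db below the direct sound and
    # the reverberant field then decays linearly in dB.
    return (ref_level_db - direct_loss_db - dr_ratio_db
            - decay_db_per_ms * elapsed_ms)

def is_potentially_echoic(level_db, ref_level_db, elapsed_ms):
    """Soundlets louder than the threshold are implausible as echoes."""
    return level_db <= plausibility_threshold_db(ref_level_db, elapsed_ms)

print(is_potentially_echoic(35.0, ref_level_db=70.0, elapsed_ms=20.0))  # True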
Inevitably a small proportion of potentially echoic soundlets are allowed erroneously through the gate. In particular, the wide-band residual noise soundlets are not simple to detect and discard. During active periods of the local source these elements are virtually inaudible due to masking. During inactive periods a small amount of breakthrough could be perceived on the experimental device, primarily attributable to noise soundlets. The application of an energy-operated switching element mitigates breakthrough successfully.
Conventional voice-operated switches can cause temporal clipping of utterances, and large peaks in background noise can erroneously close them. This is generally acceptable for utility communications but not for good-quality conversational speech. In teleconferencing a switch cannot be placed conventionally at the front end of the encoder, as sounds from the loudspeaker can be more intense than some of the utterances from the local source. More appropriate switching results from comparing the energy in the incoming channel with that in the outgoing channel: if the outgoing energy is no greater, the local source is deemed to be inactive and the residual signal is muted. Although low in energy, noise soundlets are important subjectively, and a modification to the noise modelling is used to improve the subjective quality of the speech. When the incoming audio is predominantly noise-like (e.g. during fricatives) and the local source is tonal or near-silent, excessive noise energy is present in the outgoing channel and can cause a transitory 'hiss'. The incoming noise energy is compared with that of the source and, if a threshold ratio is exceeded, the outgoing noise soundlet is attenuated by an amount proportional to the excess energy. This simple modification makes a significant subjective improvement and is likely to be improved further by frequency-banding the energy comparison.
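A minimal sketch of the combined energy-operated switch and noise attenuation, with an assumed threshold ratio and one possible reading of the proportional-attenuation rule, might be:

def gate_outgoing(out_energy, in_energy, noise_out, noise_in,
                  noise_ratio_threshold=2.0):
    """Return (mute, noise_gain) for the current frame; energies linear."""
    # Local source deemed inactive when the outgoing channel carries no
    # more energy than the incoming one: mute the residual signal.
    mute = out_energy <= in_energy
    # 'Hiss' control: if incoming noise energy exceeds the local source's
    # by the threshold ratio, attenuate the outgoing noise soundlet in
    # proportion to the excess.
    if noise_out > 0.0 and noise_in / noise_out > noise_ratio_threshold:
        noise_gain = noise_ratio_threshold * noise_out / noise_in
    else:
        noise_gain = 1.0
    return mute, noise_gain

# Incoming channel dominant and incoming noise 8x the local noise floor:
print(gate_outgoing(out_energy=1.0, in_energy=3.0,
                    noise_out=0.1, noise_in=0.8))   # (True, 0.25)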
Experimental
An experimental coder and decoder based on the HILN paradigm were implemented on a PC using MathWorks' "Matlab" programming language. A spatial augmentation layer, based on vectored soundlets, was added to each to enable reproduction through a 15-channel audio system. The three-dimensional echo gate was also a Matlab implementation.
Recordings of persons speaking were made in the laboratory whilst decoded material was reproduced concurrently through a loudspeaker. For reliable operation a 9 dB SPL differential between the local source and the loudspeaker was necessary; an inexpensive cardioid microphone provided this differential with the loudspeaker 1.5 m away at 77 dB SPL and a person speaking in a normal voice at 0.25 m.
Acoustic measurements in the laboratory (a 5 m x 5 m room with curtains on two walls) gave a mean direct-to-first-reflection (reverberant) ratio of 8 dB and a decay to −30 dB of 120 ms (standard 60 dB measurements are impractical). The optimum reference storage time, T_S, for past soundlets (for minimum errors and best subjective quality) was found to be:-
T_S ≈ 0.6 · T_−30dB
For a 23.2 ms, 50%-overlapped frame this period equates to six previous frames (0.6 × 120 ms ≈ 72 ms, which at an 11.6 ms frame advance corresponds to approximately six frames).
A frequency-matching tolerance of 15 Hz gave sufficient frequency discrimination, and the optimum rate of decay for the plausibility threshold, S_d, was 0.25 dB/ms, compared with the laboratory decay rate of 0.5 dB/ms. This discrepancy is probably due to the simplistic estimate of decay determined using a filtered noise burst.
In challenging situations where the loudspeaker is unusually loud or close to the microphone, some breakthrough of low-level reverberant material occurs in isolated short bursts. Although its relative amplitude is small, it is subjectively significant because of its unnatural character. In such cases, low-level reverberant soundlets can be attenuated rather than eliminated; this masks the unnatural sound in a low-level background mumble that is more natural and subjectively acceptable. Real-time adaptation of thresholds was not found necessary unless the microphone was taken very close (< 0.5 m) to the loudspeaker. In such cases the person talking is likely to speak louder due to the significant rise in ambient SPL, which would partially restore the gain margin and reduce the need for adaptation.
Figure 14a shows part of the signal spectrum present at the input to the three-dimensional gate and Figure 14b shows the depleted spectrum at its output. The spectrum of the incoming reference signal is depicted in Figure 14c.
As indicated earlier, the echo suppression technique described herein can be applied to any soundlet-based audio transmission system, including public address systems, chat rooms, on-line multi-user gaming, etc.

Claims
1. An audio transmission and broadcast system to transmit and broadcast input from a plurality of sources, wherein each source is provided with a microphone or the like to collect sound emitted by a source; a plurality of spatially separated output means; the system comprising an encoder to encode sound into a digital form, the encoder allocating a vectorising element to each source; transmission means to transmit encoded data to a receiver, the receiver being operatively connected to a decoder to decode the data and transmit an audio signal through an output means; the output means allocated to a source being dependent on the vectorising element, wherein simultaneous sources are allocated spatially separated output means; and wherein the output means of a particular source remains within a predetermined spatial distance from the output means initially allocated to that source.
2. An audio system according to Claim 1, wherein the angle subtended at the listener by two allocated output means is from 3° to 25°.
3. An audio system according to Claim 1 or Claim 2, wherein the output means are horizontally spatially separated.
4. An audio system according to any preceding claim, wherein each vectorising means is orthogonal to any other vectorising means.
5. An audio system according to Claim 4, wherein the vectorising means is the imaginary part of a complex number.
6. An audio system according to any preceding claim, wherein a source is constrained to remain within a preset angular distance from its original allocated output speaker, in order to reduce confusion of the listeners as to the source.
7. An audio system according to Claim 6, wherein the rate of movement from one position to another is constrained to be below a preset rate.
8. An audio system according to any preceding claim, wherein the system includes identification means allocating phonemes into classes of phonemes.
9. An audio system according to Claim 8, wherein four classes are defined.
10. An audio system according to Claim 8 or Claim 9, wherein comparison means are included to compare the time parameters of the classes of phonemes from two sources.
11. An audio system according to Claim 10, wherein a phoneme is time-shifted to avoid contemporaneous broadcasting of conflicting phonemes.
PCT/GB2008/002085 2007-06-22 2008-06-18 Transmission of audio information WO2009001035A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0712099.1 2007-06-22
GB0712099A GB0712099D0 (en) 2007-06-22 2007-06-22 Transmission Of Audio Information

Publications (2)

Publication Number Publication Date
WO2009001035A2 true WO2009001035A2 (en) 2008-12-31
WO2009001035A3 WO2009001035A3 (en) 2009-02-19

Family

ID=38352710

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2008/002085 WO2009001035A2 (en) 2007-06-22 2008-06-18 Transmission of audio information

Country Status (2)

Country Link
GB (1) GB0712099D0 (en)
WO (1) WO2009001035A2 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5965909B2 (en) 2010-09-28 2016-08-10 ラジウス ヘルス,インコーポレイテッド Selective androgen receptor modulator

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4400724A (en) * 1981-06-08 1983-08-23 The United States Of America As Represented By The Secretary Of The Army Virtual space teleconference system
EP0730365A2 (en) * 1995-03-01 1996-09-04 Nippon Telegraph And Telephone Corporation Audio communication control unit
US6037970A (en) * 1996-04-05 2000-03-14 Sony Corporation Videoconference system and method therefor
US20030101219A1 (en) * 2000-10-06 2003-05-29 Tetsujiro Kondo Communication system, communication device, seating-order determination device, communication method, recording medium, group-determination-table generating method, and group-determination-table generating device
US20040068736A1 (en) * 2000-07-04 2004-04-08 Lafon Michel Beaudouin Communication terminal and system

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013142727A1 (en) * 2012-03-23 2013-09-26 Dolby Laboratories Licensing Corporation Talker collisions in an auditory scene
CN104205212A (en) * 2012-03-23 2014-12-10 杜比实验室特许公司 Talker collision in auditory scene
US9502047B2 (en) 2012-03-23 2016-11-22 Dolby Laboratories Licensing Corporation Talker collisions in an auditory scene
US9565314B2 (en) 2012-09-27 2017-02-07 Dolby Laboratories Licensing Corporation Spatial multiplexing in a soundfield teleconferencing system
WO2014076129A1 (en) * 2012-11-13 2014-05-22 Symonics GmbH Method for operating a telephone conference system, and telephone conference system
WO2016126813A3 (en) * 2015-02-03 2016-09-29 Dolby Laboratories Licensing Corporation Scheduling playback of audio in a virtual acoustic space
US10334384B2 (en) 2015-02-03 2019-06-25 Dolby Laboratories Licensing Corporation Scheduling playback of audio in a virtual acoustic space
US11771682B2 (en) 2016-06-22 2023-10-03 Ellipses Pharma Ltd. AR+ breast cancer treatment methods

Also Published As

Publication number Publication date
WO2009001035A3 (en) 2009-02-19
GB0712099D0 (en) 2007-08-01

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08775760

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08775760

Country of ref document: EP

Kind code of ref document: A2