WO1999017278A1

WO1999017278A1 - Method and apparatus for improving speech intelligibility

Info

Publication number: WO1999017278A1
Application number: PCT/GB1998/002890
Authority: WO
Inventors: Peter William Barnett
Original assignee: Peter William Barnett
Priority date: 1997-09-26
Filing date: 1998-09-24
Publication date: 1999-04-08
Also published as: EP1018108A1; AU9177298A; GB9720544D0

Abstract

A method and apparatus for improving the intelligibility of the spoken word in an acoustic space comprise generating an electrical signal indicative of a word or words, inputting the signal to a signal processor including a signal compressor, comparing the amplitude of the input signal with a threshold level and compressing any part of the signal in excess of the threshold, expanding both the compressed and uncompressed signal and outputting the expanded signal as an audible signal.

Description

Method and Apparatus for Improving Speech Intelligibility

The present invention relates to a method and apparatus for improving the intelligibility of the spoken word.

The intelligibility of the spoken word is determined by a number of different factors. It has been known from an examination of the temples of Ancient Egypt and the amphitheatres of Greece and Rome that structural changes to buildings and spaces can improve the intelligibility of the spoken word. The impetus towards a more scientific approach to the problem of improving intelligibility came with the advent of the telephone, which resulted in a large volume of scientific work, although mainly with the aim of solving problems of distortion, bandwidth and transducer design in telephone systems. However, in 1929 Knudsen published a work entitled "On Hearing in Auditoriums" in which he postulated that what he termed percentage articulation was a function of reverberation, noise, room shape and echo.

By today's standards, the postulation by Knudsen is incomplete, but two important factors are present even at this early date. Firstly, if we remove the obvious limitations to speech intelligibility, i.e. that the speech is not loud enough and that there is too much noise present, it is clear that articulation is limited by the acoustics and geometry of the space. Secondly, intelligibility or articulation is presented as a product of a number of reduction factors. Each factor lies in the range 0-1 and hence judicious application of one parameter or influence cannot undo the shortfall imposed by another.
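
The multiplicative relationship described above can be illustrated with a minimal sketch. The function name `articulation` and the factor values are hypothetical and are not taken from Knudsen's work; the sketch only shows that a product of factors in the range 0-1 can never exceed its smallest factor.

```python
from math import prod


def articulation(factors):
    """Combine reduction factors (each in the range 0-1) by multiplication."""
    for f in factors:
        if not 0.0 <= f <= 1.0:
            raise ValueError("each reduction factor must lie in the range 0-1")
    return prod(factors)


# Hypothetical factor values, chosen only to illustrate the arithmetic.
typical = articulation([0.6, 0.9, 0.9, 0.9])
# Even if every other factor is made perfect, the weakest factor still limits the result.
best_case = articulation([0.6, 1.0, 1.0, 1.0])
print(typical, best_case)
```

In this sketch the best case is bounded by the factor of 0.6, which is the sense in which one influence cannot undo the shortfall imposed by another.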

In 1971, Peutz et al. hypothesized that speech intelligibility could be assessed on the basis of the number of lost consonants, and a formula was suggested which estimates the percentage of lost consonants from the distance of the listener from the source, the reverberation time of the space and the volume of the space.

Work in the area of speech intelligibility continued and a speech transmission index (STI) was developed by Houtgast and Steeneken and introduced in 1980. They determined that there was a direct and robust correlation between speech intelligibility as measured by word scores and a modulation transfer function between a source position and a receiver position. The correlation between STI and word scores as shown by the experimental data is shown in Fig. 1. The speech transmission index has been widely adopted and it is now accepted that an STI score of 0.5 is the minimum required for reasonable intelligibility in most circumstances. Recently, an STI of 0.5 has been specified in relation to locations such as underground and rail transportation, stadia, shopping centres, cinemas and all public places.

It can be shown that the factors affecting speech intelligibility can be grouped into four major areas, namely those associated with the talker, the listener, the space and the transmission system. The list of factors can be greatly reduced by assuming a perfect talker, a normal listener, a space free from anomalies and a perfect transmission system. These assumptions reduce the list of influencing factors to those dependent on direct sound pressure level, reverberant sound pressure level, reverberation time and noise. If we understand that the direct sound pressure level is the signal (speech) and wanted component, then this list further reduces to two ratios as follows: the direct-to-reverberant ratio, i.e. the ratio of wanted to unwanted sound, and the signal-to-noise ratio, i.e. the ratio of wanted signal to the noise, together with the reverberation time of the space. In other words, speech intelligibility is a function of the product of all three factors mentioned above. As previously mentioned, the detrimental effects and the limitations imposed by one dependent variable may not be fully compensated by another.

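
The two ratios named above can be sketched as follows, assuming that the direct, reverberant and noise levels are each available as sound pressure levels in dB. The function name and the example levels are hypothetical.

```python
def level_ratios_db(direct_db, reverberant_db, noise_db):
    """Return (direct-to-reverberant ratio, signal-to-noise ratio), both in dB."""
    # Levels in dB are logarithmic, so a ratio of pressures becomes a difference of levels.
    direct_to_reverberant = direct_db - reverberant_db
    signal_to_noise = direct_db - noise_db
    return direct_to_reverberant, signal_to_noise


# Hypothetical levels for a reverberant space with moderate background noise.
d_r, s_n = level_ratios_db(direct_db=70.0, reverberant_db=73.0, noise_db=60.0)
print(f"direct-to-reverberant: {d_r:+.1f} dB, signal-to-noise: {s_n:+.1f} dB")
```

A negative direct-to-reverberant ratio, as in this illustration, means that the unwanted reverberant sound is louder than the wanted direct sound at the listener position.
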
While the intelligibility of speech can be improved by making structural alterations to the space where the listener is present, e.g. by reducing the reverberation time within the space by the introduction of acoustic absorption, in certain instances it is not economic to make such material changes and there is a need for a simpler and more economic way of improving the intelligibility of the spoken word in public areas. Since it is now commonplace for spoken words to be amplified by electrical apparatus, we have looked at ways in which the problem can be solved electronically.

The present invention provides apparatus for broadcasting speech into an acoustic space through one or more loudspeakers which comprises means for compressing the higher amplitude portions of the spoken words and expanding the thus compressed signal, thereby to emphasise the lower amplitude parts of the spoken words. The compression is preferably in the range 1:1 to 10:1 and usually 2:1 to 4:1. In most cases a compression ratio of 3:1 will suffice. The present invention also provides a method for broadcasting public address announcements into closed acoustic spaces which comprises the use of the compression/expansion apparatus. The threshold at which compression commences is preferably selected depending on the speech characteristics of the person enunciating the words but, as an alternative, the speech signals can be normalized prior to transmission to the compressor/expander apparatus, in which case the threshold can be preset to a specific value depending on the output from the normalization circuitry.

Features and advantages of the present invention will become apparent from the following description of an embodiment thereof given by way of example with reference to the accompanying drawings in which:

Fig. 1 shows a graph of STI versus word score;

Fig. 2 shows a graph of input against output for a compressor according to the present invention;

Fig. 3 shows a block diagram of an arrangement according to the present invention; and

Fig. 4 shows the amplitude waveforms of various words before and after processing by apparatus according to the present invention.

Speech has a dynamic range of some 10-20 dB or, in pressure terms, around 100:1, i.e. the quietest parts of our speech (at a normal level) are around one hundredth of the loudest. In fact, vowel sounds (100 Hz - 1 kHz) which are voiced, i.e. formed in the voice box and larynx, are much more powerful than the consonant or unvoiced components which are formed in the mouth and with the teeth and the expulsion of air. These large vowel sounds tend to mask the weaker consonants, which are vital and play a far more important role in intelligibility. The vowel sounds also enhance the reverberant sound, thereby reducing the direct-to-reverberant ratio.

From consideration of these facts relating to the structure of speech, it is apparent that simply increasing the gain of a public address amplifier is not sufficient to improve intelligibility. In fact, it can have the opposite effect in very reverberant spaces. As a consequence, we have investigated amplitude compression, which reduces the range between peaks and troughs. The benefit of amplitude compression, unlike gain, is that it is dynamic and is applied only above a threshold. It is proposed that the threshold should be set at a level where the vowel sounds will be compressed but the weak consonant sounds will not be compressed. This has the advantage that signal processing is applied differentially to the wanted signal and the noise (or reverberation). Figure 2 shows a typical relationship between input and output levels. It will be seen that up to a threshold the output level is linear. At the threshold, compression is applied and the drawing shows different compression ratios as compared with the uncompressed signal (1:1). It is thus clear that the effect of applying amplitude compression to speech is to reduce the ratio of the largest to the smallest sounds.
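
The input/output relationship of the kind shown in Figure 2 can be sketched as a static curve in decibels. This is a minimal illustration under stated assumptions, not the circuit of the embodiment: the function name, the threshold of -25 dB and the list of input levels are hypothetical, and the ratios are examples taken from the ranges mentioned in this description.

```python
def compressor_output_db(input_db, threshold_db, ratio):
    """Static compressor curve: linear up to the threshold, reduced slope above it."""
    if ratio < 1.0:
        raise ValueError("a compression ratio must be at least 1:1")
    if input_db <= threshold_db:
        # Below the threshold the signal passes through unchanged.
        return input_db
    # Above the threshold every `ratio` dB of input yields 1 dB of output.
    return threshold_db + (input_db - threshold_db) / ratio


threshold_db = -25.0  # hypothetical threshold, in dB relative to full scale
for ratio in (1.0, 2.0, 3.0, 4.0, 10.0):
    curve = [compressor_output_db(level, threshold_db, ratio)
             for level in range(-40, 1, 10)]
    print(f"{ratio:>4.0f}:1", [round(v, 1) for v in curve])
```

With a ratio of 1:1 the output equals the input at every level, which corresponds to the uncompressed reference line in the drawing.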

If one now looks at Fig. 4, Fig. 4a shows a waveform diagram of the word "drop" in its original, uncompressed form. After compression, as shown in Fig. 4b, it will be seen that the difference between the amplitudes of the loudest and quietest parts of the word has been reduced. Thus, on expansion or amplification, the very quiet consonant "p" has been enhanced with respect to the vowel sound "o" and therefore the intelligibility of the word "drop" has been improved.
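
The effect described for the word "drop" can be followed numerically. The sketch below uses hypothetical levels for a loud vowel and a quiet consonant, a hypothetical threshold, and a ratio of 3:1; the expansion stage is assumed to be a uniform gain that restores the vowel to its original level, which is only one possible reading of "expansion or amplification".

```python
def compress_db(level_db, threshold_db, ratio):
    """Static compression: the part of the level above the threshold is divided by the ratio."""
    excess = max(0.0, level_db - threshold_db)
    return level_db - excess + excess / ratio


# Hypothetical levels (dB relative to full scale) for the two sounds in a word.
vowel_db, consonant_db = -5.0, -28.0
threshold_db, ratio = -25.0, 3.0

vowel_out = compress_db(vowel_db, threshold_db, ratio)
consonant_out = compress_db(consonant_db, threshold_db, ratio)  # below threshold, unchanged

# Assumed expansion stage: one uniform gain applied to the whole signal.
makeup_db = vowel_db - vowel_out
vowel_final = vowel_out + makeup_db
consonant_final = consonant_out + makeup_db

gap_before = vowel_db - consonant_db
gap_after = vowel_final - consonant_final
print(f"vowel-to-consonant gap: {gap_before:.1f} dB before, {gap_after:.1f} dB after")
```

Under these assumptions the vowel ends at its original level while the consonant is raised by the full make-up gain, so the gap between them narrows.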

As far as the word "turf" is concerned, which is shown in Fig. 4c, comparison with Fig. 4d shows that there is very little compression applied; hence on expansion the whole word has been amplified.

If one looks at the word "nest" in Fig. 4e, it will be seen that the vowel sound "e" has been compressed, so that on expansion the parts of the signal representing the "s" and the "t" have been amplified with respect to the vowel sound.

It will be appreciated that the effects of the compression will be altered depending on the threshold at which compression occurs as well as the compression ratio used.

If one looks at Fig. 3, a diagram of a suitable apparatus is shown where the user speaks into a microphone 10. The output from the microphone is then passed through a normaliser 11 which will process the input signal and provide a normalised sound output. The output from the normaliser 11 is fed to a compression and expansion circuit 12, sometimes known as a compander, which applies amplitude compression to the input signal if the amplitude exceeds a pre-set threshold. The compander 12 is arranged to start compression at a threshold which is set relative to the magnitude of the speech in the signal chain. It has been determined that the threshold should be set at a value less than halfway up the dynamic range so that the majority of the speech signal is subject to compression. It has been found that a threshold at a value 5-6 dB above the level of the quietest part of the speech is adequate. Another way of expressing this is to look at the average value of the peak amplitudes of the speech signal, in which case the threshold should be in the range 28 dB to 22 dB below the average of the peak levels of the speech. Typically, the threshold is set at 25 dB below the average of the peaks. The amount of compression is usually in the range 2:1 to 10:1 but might be as high as 20:1. The output from the compander 12 is then fed to an electro-acoustic transducer in the form of a loudspeaker system 13 for broadcast to the listener who is in an acoustic space.
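
The signal chain of Fig. 3 can be sketched in software as a rough illustration. Everything below is an assumption made for the sketch rather than a detail of the embodiment: the normaliser is modelled as simple peak scaling, the "average of the peaks" is read as the mean of per-frame peak magnitudes, compression is applied to each sample instantaneously with no attack or release smoothing, and expansion is modelled as a make-up gain that restores the original peak. All function names are hypothetical.

```python
import math


def to_db(amplitude, floor=1e-9):
    """Convert a linear amplitude to decibels, guarding against log(0)."""
    return 20.0 * math.log10(max(abs(amplitude), floor))


def normalise(samples, target_peak=1.0):
    """Assumed normaliser 11: scale so the largest magnitude equals target_peak."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    return [s * target_peak / peak for s in samples]


def average_peak_db(samples, frame):
    """Level in dB of the mean of the per-frame peak magnitudes (an assumed reading)."""
    peaks = [max(abs(s) for s in samples[i:i + frame])
             for i in range(0, len(samples), frame)]
    return to_db(sum(peaks) / len(peaks))


def compand(samples, frame, offset_db=25.0, ratio=3.0):
    """Assumed compander 12: compress above the threshold, then apply make-up gain."""
    if not samples:
        return []
    # Threshold set 25 dB below the average of the peaks, as in the typical case.
    threshold_db = average_peak_db(samples, frame) - offset_db
    compressed = []
    for s in samples:
        level_db = to_db(s)
        if level_db > threshold_db:
            # The excess above the threshold is divided by the ratio.
            gain_db = (threshold_db - level_db) * (1.0 - 1.0 / ratio)
        else:
            gain_db = 0.0
        compressed.append(s * 10.0 ** (gain_db / 20.0))
    peak_in = max(abs(s) for s in samples)
    peak_out = max(abs(s) for s in compressed)
    makeup = peak_in / peak_out if peak_out > 0.0 else 1.0
    return [s * makeup for s in compressed]


# Hypothetical samples: a loud vowel-like burst followed by a quiet consonant-like tail.
speech = [0.9, -0.8, 0.7, -0.02, 0.03, -0.01]
processed = compand(normalise(speech), frame=3)
print([round(s, 3) for s in processed])
```

A practical compander would track a smoothed envelope rather than individual samples; the per-sample form is used here only to keep the relationship between threshold, ratio and make-up gain visible.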

The above apparatus can be used with good effect in public address systems for all public spaces including but not limited to stations, theatres and cinemas. It also has application in other areas where ambient noise levels are high and speech intelligibility is important, such as in aircraft for in-flight announcements. It can also be applied to induction loops and hearing aids for persons with impaired hearing, since it has been found that those who suffer from impaired hearing due to age can have their understanding of spoken words improved if the aforementioned technique is utilized.

Various tests have been carried out utilizing the equipment and it has been shown that the RASTI score of a space can be improved by approximately 0.1 or 10%.

Claims

1. A method of improving the intelligibility of the spoken word in an acoustic space comprising generating an electrical signal indicative of a word or words, inputting the signal to a signal processor including a signal compressor, comparing the amplitude of the input signal with a threshold level and compressing any part of the signal in excess of the threshold, expanding both the compressed and uncompressed signal and outputting the expanded signal as an audible signal.

2. A method according to claim 1, wherein the threshold level is at 5 or 6 dB above normal signal levels.

3. A method according to claim 1 or 2, wherein the step of generating an electrical signal includes generating a normalised electrical signal.

4. A method according to claim 1, 2 or 3, wherein the signal compressor compresses the generated signal by a ratio of between 1:1 and 10:1.

5. Apparatus for improving the intelligibility of the spoken word and comprising means for generating an electrical signal indicative of a word or words, a signal processor including means for comparing the amplitude of the generated electrical signal with a threshold level, means for compressing any part of the signal in excess of the threshold level, and means for expanding both the compressed and uncompressed signal, and an output device for generating an audible signal.

6. Apparatus according to claim 5, wherein the means for compressing the signal is arranged to compress the signal by a ratio between 1:1 and 10:1.

7. Apparatus according to claim 5 or 6, wherein the comparing means is arranged to compare the signal with a level which is 5 or 6 dB above the normal level.

8. Apparatus according to claim 5, 6 or 7, and comprising means for normalising the electrical signal prior to the signal processing means.