GB2537924A - A Speech Processing System and Method - Google Patents

A Speech Processing System and Method

Info

Publication number
GB2537924A
GB2537924A
Authority
GB
United Kingdom
Prior art keywords
speech
processing system
signal
speech processing
shaped signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1507488.3A
Other versions
GB2537924B (en)
GB201507488D0 (en)
Inventor
Koutsogiannaki Maria
Stylianou Ioannis
Petkov Petko
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1507488.3A
Publication of GB201507488D0
Publication of GB2537924A
Application granted
Publication of GB2537924B
Legal status: Active


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04: Time compression or expansion
    • G10L21/057: Time compression or expansion for improving intelligibility
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude
    • G10L21/0364: Speech enhancement, e.g. noise reduction or echo cancellation, by changing the amplitude for improving intelligibility
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00: Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04: Time compression or expansion
    • G10L21/043: Time compression or expansion by changing speed
    • G10L21/045: Time compression or expansion by changing speed using thinning out or insertion of a waveform
    • G10L21/047: Time compression or expansion by changing speed using thinning out or insertion of a waveform characterised by the type of waveform to be thinned out or inserted

Abstract

A speech processing system enhances the intelligibility of input speech by selecting at least two frequency bands (e.g. one for unvoiced consonants and another for voiced consonants and vowels), applying a different weighting to each to produce a spectrally shaped signal, and temporally scaling/stretching the shaped signal. The scaling may be based on environmental reverberation time and may insert pauses, and the stretched signal may be normalised to have the same Root Mean Square (RMS) energy as the original signal.

Description

A SPEECH PROCESSING SYSTEM AND METHOD
FIELD
Embodiments of the present invention as described herein are generally concerned with the field of speech processing systems.
BACKGROUND
Reverberation is a process under which acoustic signals generated in the past reflect off objects in the environment and are observed simultaneously with acoustic signals that follow the shortest, direct, path to an arbitrary point in space. It is often necessary to understand speech in reverberant environments such as train stations and stadiums, large factories, concert and lecture halls.
It is possible to enhance a speech signal such that it is more intelligible in such environments.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures, in which:
Figure 1 is a schematic of a speech processing system in accordance with an embodiment;
Figure 2 is a flow chart showing a method in accordance with an embodiment;
Figure 3 is an overview of the functions of a system in accordance with an embodiment;
Figure 4 is a normalised histogram of vowels and voiced consonants classified by the two different algorithms S and L, where the x-axis is the S or L values (observation values) and the y-axis is probability (normalised observation frequency);
Figure 5(a) is a plot of the intelligibility scores used for a reverberation time of 0.8s and Figure 5(b) for a reverberation time of 2s;
Figure 6 is a plot of the difference between clear speech and casual speech against frequency;
Figure 7 is a plot of the difference between the Glimpse Proportion scores against the weights used in producing the spectrally shaped signal; and
Figure 8 is a plot of the differences between clear speech and casual speech, and spectrally shaped speech and casual speech, against frequency.
DETAILED DESCRIPTION OF THE DRAWINGS
In an embodiment, a speech processing system for increasing the intelligibility of output speech is provided, the system comprising: a speech input for receiving speech; a processor adapted to receive said input speech and modify the speech to increase its intelligibility; and an output for said modified speech, wherein said processor is adapted to modify the speech to increase its intelligibility by selecting at least two frequency bands of said input speech and applying a different weighting to each of the at least two frequency bands to produce a spectrally shaped signal, the processor being configured to temporally scale said spectrally shaped signal to stretch the spectrally shaped signal.
In some embodiments, the system can be used in reverberant environments. However, the use of spectrally shaping the signal in the frequency domain and then stretching the signal in the time domain can also be used to improve intelligibility in the presence of noise in non-reverberant environments.
In an embodiment, the speech processing system further comprises an input to receive information indicating the reverberation time for the environment where the output of said speech processing system is located. Here, the information indicating the reverberation time may be used to control the temporal scaling of the spectrally shaped signal. However, the system may be adapted to apply a fixed scaling when there is no input for the reverberation time or when the input indicates that the reverberation time is below a pre-determined value. For example, the system may be used to apply a 20% scaling factor when the reverberation time is low or where the system does not allow the reverberation time to be input.
By using the reverberation time as an input, the system can be applied to mobile speakers, for example those on a computer or sound system where their settings may need to be changed depending on the location.
In an embodiment, the processor is configured to temporally scale said spectrally shaped signal to stretch the spectrally shaped signal by a factor based on the temporal difference between clear speech and casual speech. The processor may be configured to temporally scale said spectrally shaped signal to uniformly stretch the spectrally shaped signal or temporally scale said spectrally shaped signal by identifying stationary parts of speech and allowing elongation of these parts without elongating non-stationary parts of speech. Here, the stationary parts of speech may be identified by loudness measurements and modulation information.
In an embodiment, the processor is configured to temporally scale the signal by a factor Ts where:

Ts = a·RT + 1

where RT is the reverberation time and a is a constant of less than 1. The processor may alternatively or additionally be configured to stretch the spectrally shaped signal by inserting pauses into the spectrally shaped signal.
The at least two frequency bands are selected as bands where there is enhancement of a clear speech signal over a casual speech signal. One frequency band may be selected as a band where UNVOICED consonants are present and the other frequency band may be a band where VOICED CONSONANTS and vowels are present. Here, at least one of said frequency bands is above 5000Hz and the other of said frequency bands is below 5000Hz. For example, the at least two frequency bands comprise a first band from 2000Hz up to 4800Hz and a second band from 5600Hz up to 8000Hz.
Although two frequency bands are taught above, other frequency bands are possible. For example, depending on the condition of the user, the system may further comprise a control means adapted to change the at least two frequency bands to accommodate for a hearing impairment of the user. For example, if the user has severely impaired hearing at some of the frequencies where the signal is to be boosted, then the system could be configured to not boost the signal at these frequencies.
In a further embodiment, the processor is adapted to select a further frequency band corresponding to the full frequency range and apply a weighting to this further frequency band. In a yet further embodiment, the full frequency range is selected as one of the at least two frequency bands and a further band, which is narrower than the full frequency range, is selected as the other of the at least two frequency bands. In other embodiments, the at least two frequency bands are selected as bands that are narrower than the entire frequency range and which do not overlap.
The processor may be adapted to normalise the spectrally shaped signal to have the same RMS energy as the original speech such that:

y[i] = w0·s[i] + w1·s1[i] + w2·s2[i]

y_mixF[i] = y[i] · sqrt( Σ_{i=1}^{N} s[i]² / Σ_{i=1}^{N} y[i]² )

where s, s1 and s2 are the signals from the full frequency band and the first and second frequency bands, w0, w1 and w2 are the respective weighting factors, y_mixF is the proposed spectrally modified signal and N is the number of samples of the original signal s and of y. The weightings of the bands may be set to sum to 1, with the weighting of a higher frequency band greater than the weighting of a lower frequency band.
In a further embodiment, a speech processing method for increasing the intelligibility of output speech is provided, the method comprising: receiving speech; modifying the speech to increase its intelligibility; and outputting said modified speech, wherein the speech is modified by selecting at least two frequency bands of said input speech and applying a different weighting to each of the at least two frequency bands to produce a spectrally shaped signal, and temporally scaling said spectrally shaped signal to stretch the spectrally shaped signal.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
Figure 1 is a schematic of a speech intelligibility enhancing system.
The system 1 comprises a processor 3 which comprises a program 5 which takes input speech and, in one embodiment, information about the reverberation conditions where the speech will be output and enhances the speech to increase its intelligibility in the presence of noise. The storage 7 stores data that is used by the program 5. Details of what data is stored will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input for data relating to the speech to be enhanced and also, in an embodiment, an input for collecting data concerning the reverberation conditions in the places where the enhanced speech is to be output. The type of data that is input may take many forms, which will be described in more detail later. The input 15 may be an interface that allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is an audio output 17, for example a speaker.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to figures 2 to 5.
Figure 2 is a flow diagram showing the processing steps provided by program 5. In step S51, speech is input into the system. In step S53, spectral shaping starts. At this stage, n filters are applied to the input speech in what will be termed the mix-filtering approach. This signal flow is also shown in figure 3. Here, the input speech 101 is passed into filters 1030, 1031 and 1032.
The mix-filtering approach aims to enhance the input speech in the style of clear speech. Clear speech is the speaking style elicited by speakers when the listener faces a communication barrier; its most common characteristics are slowing down and hyper-articulating. Speech used in "normal" environments is generally referred to as "casual speech".
Three frequency bands are considered. BF1 is from 2000Hz up to 4800Hz and BF2 is from 5600Hz up to 8000Hz, considering a sampling frequency of 16000Hz. The third frequency band BF0 corresponds to the full-frequency range signal (0-8000Hz). From these frequency bands the signals s1, s2 and s respectively are obtained. For the isolation of the frequency bands, in an embodiment, a simple method is used. Casual speech s is filtered with a 5th-order bandpass digital elliptic filter with 0.1dB of ripple in the passband, 60dB of ripple in the stopband and bandpass edge frequencies [2000, 4800]Hz. The output of the filter is the signal s1, which contains information on the B1 frequency band. Moreover, the original speech s is filtered with a 5th-order highpass digital elliptic filter with normalized passband edge frequency fc = 5600Hz. The output of this filter is the signal s2, which contains information on the frequency band B2.
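By way of illustration, the band isolation described above could be sketched as follows with SciPy. This is a minimal sketch under the stated assumptions (16kHz sampling, 5th-order elliptic filters); the function name and the use of scipy.signal.ellip are illustrative choices, not taken from the patent.

```python
from scipy.signal import ellip, lfilter

FS = 16000  # sampling frequency in Hz, as stated above

def isolate_bands(s):
    """Return (s1, s2): the B1 = 2000-4800 Hz and B2 = 5600-8000 Hz components of s."""
    # B1: 5th-order bandpass elliptic filter, 0.1 dB passband ripple,
    # 60 dB stopband ripple, edge frequencies 2000 and 4800 Hz
    b1, a1 = ellip(5, 0.1, 60, [2000, 4800], btype="bandpass", fs=FS)
    s1 = lfilter(b1, a1, s)
    # B2: 5th-order highpass elliptic filter with edge frequency 5600 Hz
    b2, a2 = ellip(5, 0.1, 60, 5600, btype="highpass", fs=FS)
    s2 = lfilter(b2, a2, s)
    return s1, s2
```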
In step S55, the original signal s and the filtered signals s1 and s2 are combined with different weighting factors to form the modified signal y, which is normalized to have the same RMS energy as the original speech in step S57:

y[i] = w0·s[i] + w1·s1[i] + w2·s2[i]   (1)

y_mixF[i] = y[i] · sqrt( Σ_{i=1}^{N} s[i]² / Σ_{i=1}^{N} y[i]² )   (2)

where y_mixF is the proposed spectrally modified signal, N is the number of samples of the original signal s and of y, and w0 = 0.1, w1 = 0.4 and w2 = 0.5 are the weighting factors of the signals s, s1 and s2, respectively.
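Equations (1) and (2) amount to a weighted sum followed by an RMS match. A minimal NumPy sketch (the function name is illustrative):

```python
import numpy as np

def mix_filter(s, s1, s2, w0=0.1, w1=0.4, w2=0.5):
    """Weighted combination of the full-band and band-limited signals,
    normalised to the RMS energy of the original speech (equations 1 and 2)."""
    y = w0 * s + w1 * s1 + w2 * s2
    return y * np.sqrt(np.sum(s ** 2) / np.sum(y ** 2))
```

Used together with the filter sketch above, `mix_filter(s, *isolate_bands(s))` would yield the spectrally shaped signal before time scaling.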
Figure 3 shows the weighting factors applied as 1050, 1051 and 1052, and the signal being normalised and combined at 107. The values of the weighting factors were selected based on the clear and casual speech observations (the spectral energy in B2 is greater than in B1 and therefore w2 > w1) and on maximizing an objective intelligibility score, namely the Glimpse Proportion, in the presence of speech-shaped noise (SSN) at low Signal-to-Noise Ratio (SNR = -10dB). How the bands are selected and the weightings determined will be explained in detail with reference to figures 6 to 8.
The above spectral shaping method is simple and efficient to apply as it does not require frame-based analysis and modifications (detection of voiced/unvoiced regions, formant shaping, etc.). The mix-filtering technique enhances spectral information based on measured differences between the two speaking styles, clear and casual speech.
In step S59, time scaling is performed in unit 109. In one embodiment, uniform time-scaling is performed. Here, the spectrally mixed signal is time-scaled by a constant factor. In one embodiment, the spectrally mixed signal is stretched by the ratio of the duration of the clear speech signal to that of the casual speech signal. In one embodiment, the scaling is achieved using a Waveform Similarity based Overlap-Add algorithm (WSOLA). In practice, there will not be a clear speech signal to which the casual speech signal can be compared. In this case, the casual speech signal is scaled by 20%. In further embodiments, the signal is temporally scaled by between 10 and 40%, in other embodiments between 10 and 25%.
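For concreteness, a compact WSOLA-style routine is sketched below. The frame, hop and search-tolerance values are illustrative defaults rather than parameters taken from the patent, and a production implementation would use a more careful similarity search and boundary handling.

```python
import numpy as np

def wsola(x, alpha, frame=1024, hop=256, tol=256):
    """Time-scale x by factor alpha (alpha > 1 stretches) via
    waveform-similarity overlap-add."""
    win = np.hanning(frame)
    n_out = int(len(x) * alpha)
    y = np.zeros(n_out + frame)
    norm = np.zeros(n_out + frame)
    pos_out, prev = 0, 0
    while pos_out + frame < n_out:
        ideal = int(pos_out / alpha)  # analysis position for this output frame
        lo = max(0, ideal - tol)
        hi = min(len(x) - frame, ideal + tol)
        if hi <= lo or prev + hop + frame > len(x):
            break
        # the natural continuation of the previously copied segment
        target = x[prev + hop: prev + hop + frame]
        # choose the candidate segment most similar to that continuation
        scores = [np.dot(x[c: c + frame], target) for c in range(lo, hi)]
        best = lo + int(np.argmax(scores))
        y[pos_out: pos_out + frame] += win * x[best: best + frame]
        norm[pos_out: pos_out + frame] += win
        prev, pos_out = best, pos_out + hop
    norm[norm < 1e-8] = 1.0  # avoid division by zero at window edges
    return (y / norm)[:n_out]
```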
In a further embodiment, time scaling in step S59 is achieved using time-scaling modifications according to the properties of speech. Stationary parts of speech are allowed to be elongated while non-stationary parts are not. In one embodiment, this is done by measuring a degree of stationarity of speech at a certain rate (e.g., every 10ms); a mapping function then allows the determination of the scaling factor. This operation follows observations on how humans modify their speaking rate when they need to speak clearly.
One way to estimate the degree of stationarity is to combine loudness measurements and modulation information, and this is performed in unit 111. In one embodiment, the Perceptual-Speech-Quality measure (PSQ) is used to elongate the stationary parts of casual speech and to define where to insert pauses into the signal. The PSQ measure is based on the basic version of ITU Recommendation BS.1387-1 (2001), a method for objective measurement of perceived speech quality. It estimates features such as loudness and modulations in specific frequency bands, in order to describe the input signal with perceptual attributes.
The elongation and pause insertion schemes are described below. The pause insertion can be used as a technique for temporally scaling the signal in isolation from the other techniques taught herein. In an embodiment, two metrics of the PSQ model are used to detect the stationary parts of speech, where time-scaling can be applied: the perceived loudness of the signal in low frequency bands and the loudness modulations in high frequency bands. Analytically, PSQ estimates the perceived loudness in the low frequency bands (0-300Hz) of the signal, where unvoiced speech is less likely to be present. However, this metric is not perfect for distinguishing stationary from non-stationary parts of speech, as some voiced stop consonants have high energy in low frequency bands. Time-scaling voiced stop consonants can cause distortion, probably not noticeable in reverberation, but it may also reduce the phoneme's intelligibility. Therefore, in an embodiment, loudness is not the only metric used to decide which parts of speech should be elongated. However, it should be noted that loudness can be used to give an indication of the parts of speech that should be elongated.
In an embodiment, loudness is combined with another metric, namely the loudness modulations of high frequency bands (around 4000Hz), to detect stationary speech. The loudness modulations in high frequency bands are strongly correlated with the non-stationarity of the signal and are able to detect voiced stop consonants. By subtracting the modulation values (near zero for vowels, high for unvoiced and voiced consonants) from the loudness values (high for vowels, near zero for unvoiced consonants and high for voiced consonants), the stationary parts are detected more efficiently than with a purely loudness-based metric (after subtraction, values are high for vowels, negative for unvoiced consonants and near zero for voiced consonants).
As an illustration of this, the efficiency of the metric in classifying voiced consonants as non-stationary is estimated. Specifically, 100 sentences are annotated for a speaker in our database to distinguish consonant-frames from vowel-frames. Then, the average perceived loudness in low frequency bands, L, and the loudness modulations in high frequency bands, M, are calculated per frame. In an embodiment, the low frequency band is from 0 to 339Hz and the high frequency band is from 4053Hz to 4200Hz. The difference S between these values for each frame is calculated, and the normalized histograms of S for vowels and the voiced consonant frames {b, g, w, l}, together with the corresponding normalized histograms of L, are depicted in Figure 4.
Selecting a decision threshold T > 0.5 for consonant and vowel classification, the misclassification error of the proposed metric S for the consonants (the area below the consonant curve on the interval [0.5, 3]) is lower than that of the L metric. Therefore, the proposed scheme decides to elongate a speech frame if the S value is above T = 1.
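In code, the elongation decision reduces to a per-frame threshold on S = L - M. A minimal sketch, assuming the per-frame loudness values L (low band, 0-339Hz) and modulation values M (high band, 4053-4200Hz) have already been computed by a PSQ-style front end, which is not shown here:

```python
import numpy as np

def elongation_mask(L, M, T=1.0):
    """Frames with S = L - M above the threshold T are treated as stationary
    (vowel-like) and may be elongated; the text uses T = 1."""
    S = np.asarray(L) - np.asarray(M)
    return S > T
```

Frames flagged by the mask are the ones handed to WSOLA for the 20% stretch described next.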
Each frame allowed to be elongated by the S metric is time-scaled by 20% of its original duration. The time-scaling is performed by WSOLA.
In an embodiment, pause insertion is also implemented using the PSQ model. The pause insertion scheme is purely unsupervised and takes into consideration the acoustic properties of casual speech. Specifically, the perceived loudness of the speech signal over the whole frequency band is estimated (in dB SPL). Then, the loudness is normalized by the maximum loudness of the signal and all valleys are detected on the normalized loudness curve. PSQ adds pauses on valleys with less than 20% of the normalized loudness. The valleys usually lie at word boundaries and are appropriate for inserting pauses without distorting the signal.
In an embodiment, pre-processing of the signal before and after the location of the valley is performed; the signal is time-scaled around the location where the pause will be inserted and a Hamming window is applied at the centre of the valley, so that the transition from speech to silence is smoother. In an embodiment, inserted pauses have a fixed length of 90ms, based on the average pause duration in clear speech.
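A sketch of the valley-based pause insertion is given below. It assumes a per-frame loudness curve from a PSQ-style front end (not shown) and uses an illustrative 10ms fade; the 90ms pause length and 20% valley threshold follow the text.

```python
import numpy as np

def insert_pauses(x, loudness, hop, fs=16000, pause_ms=90,
                  valley_frac=0.2, fade_ms=10):
    """Insert fixed-length pauses at loudness valleys of the signal x.
    loudness: one value per analysis frame; hop: frame hop in samples."""
    x = np.array(x, dtype=float)
    norm = np.asarray(loudness) / np.max(loudness)
    # local minima of the normalised loudness curve below the 20% threshold
    valleys = [i for i in range(1, len(norm) - 1)
               if norm[i] < norm[i - 1] and norm[i] < norm[i + 1]
               and norm[i] < valley_frac]
    pause = np.zeros(int(fs * pause_ms / 1000))
    n_fade = int(fs * fade_ms / 1000)
    ramp = np.hamming(2 * n_fade)  # rising half then falling half
    pieces, prev = [], 0
    for v in valleys:
        centre = v * hop
        if centre - prev < n_fade or centre + n_fade > len(x):
            continue  # skip valleys too close together or to the signal ends
        seg = x[prev:centre].copy()
        seg[-n_fade:] *= ramp[n_fade:]               # fade out into the pause
        x[centre: centre + n_fade] *= ramp[:n_fade]  # fade back in afterwards
        pieces.extend([seg, pause])
        prev = centre
    pieces.append(x[prev:])
    return np.concatenate(pieces)
```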
Next, the methods described with reference to figures 1 to 4 above are evaluated in reverberant conditions. Reverberation is simulated using a room impulse response (RIR) model obtained with the source-image method. The hall dimensions are fixed to 20m x 30m x 8m. The speaker and listener locations used for RIR generation are {10m, 5m, 3m} and {10m, 25m, 1.8m} respectively. The propagation delay and attenuation are normalized to the direct sound. Effectively, the direct sound is equivalent to the sound output from the speaker. Convolving the modified speech signals with the RIR produces the signals for evaluation. Seven sets of signals are evaluated: (1) the clear speech (CL); (2) the casual speech (CV); (3) the mix-filtering spectrally modified casual speech signal (M); (4) the uniformly time-scaled casual speech signal (U); (5) the PSQ-based time-scaled casual speech signal (P); (6) uniform time-scaling and mix-filtering of casual speech (UM); and (7) PSQ-based time-scaling and mix-filtering of casual speech (PM).
The corpus used for the analysis is the read clear and read casual speech from the LUCID database. Read speech is an exaggerated form of speech relative to spontaneous speech and has higher intelligibility only for the clear style. The speakers who participated in the recordings were normophonic speakers of Southern British English. The sentences recorded were meaningful and simple in syntax. The term Categories will be used to refer to the above seven sets of signals. 56 randomly selected distinct sentences from the LUCID corpus are presented to the listeners, uttered by 2 male and 2 female speakers (14 sentences per speaker, 8 sentences per set of signals, 4 sentences per Category per reverberant condition). The reverberation times are RT1 = 0.8s and RT2 = 2s, simulating low and high reverberant environments respectively.
A "header" of 4 sentences is added to the listening test to serve as a preparation set for the listeners to the reverberant environment (these sentences are not evaluated). The listener hears each sentence once and is instructed to write down whatever he/she perceives to have heard. As sentence difficulty may affect the intelligibility scores (especially for the non-native population), 7 different listening scenarios have been created to ensure that each sentence will be presented in a I CL, CV, M, U, P, UM, PM} manner to different listeners (as each listener cannot hear the same sentence twice). For example, if a specific sentence is presented to the listener in CL manner on RT1 condition on the listening Scenario 1, then the same sentence will be presented to another listener in CV manner on the same reverberant condition on listening Scenario 2 etc. This allows "denoising" the performance evaluation from the sentence dependency. 32 listeners participated in the intelligibility test, 7 native speakers, 4 hearing-impaired listeners, and 21 non-native speakers with good perception of English (this was also verified in the listening test with 5 difficult sentences presented without reverberation conditions). As the majority of the listeners are non-native, explicit statistical analysis is presented for this population.
Performance evaluation for the non-native speakers comprises three parts of analysis. The first part presents the intelligibility scores of each Category across listeners, in order to reveal possible intelligibility benefits of the proposed modifications for the non-native population. The second part computes the intelligibility scores of each Category across sentences, to parcel out possible variability due to sentence difficulty and reveal the Category main effect. Lastly, the third part presents the intelligibility scores of each Category across the two different reverberant conditions.
For each reverberant condition, the ratio of correct keywords to the total number of keywords per sentence is estimated per listener and per Category. Then, the mean of the ratios over all sentences is estimated per listener and per Category. Figure 5 shows the {min, 1st quartile, median, 3rd quartile, max} of intelligibility scores per Category across all listeners. CL appears to have a higher intelligibility advantage over all Categories for both reverberant conditions, while UM seems to have a benefit over CV on RT2 (Figure 5). In order to evaluate the statistical significance of these results, a repeated-measures ANOVA (analysis of variance) is performed on intelligibility with Category nested within each listener. Results reveal significant intelligibility differences among Categories for both reverberant conditions, RT1 (F(6, 20) = 5.601, p < 0.001) and RT2 (F(6, 20) = 7.167, p < 0.001). Post-hoc comparisons using pairwise paired t-tests reveal that the mean intelligibility score of CL (M = 0.86, SD = 0.13, where M stands for mean and SD for standard deviation) is significantly different (p < 0.001) from CV (M = 0.67, SD = 0.22) in RT1, while in RT2 both CL (M = 0.83, SD = 0.16) and UM (M = 0.77, SD = 0.17) have significantly different means (p < 0.01) from CV (M = 0.64, SD = 0.23). No significant difference between the means of CL and UM is reported (p = 0.07).
A repeated-measures ANOVA on intelligibility with Category nested within each sentence is performed to remove possible dependencies of the intelligibility scores on sentence difficulty. The ANOVA null hypothesis of equal means of the intelligibility scores for every Category is rejected using the F-test for RT1 (F(6, 27) = 6.634, p < 0.001) and RT2 (F(6, 27) = 7.268, p < 0.001). Post-hoc comparisons using pairwise paired t-tests reveal that the mean intelligibility score of CL (M = 0.87, SD = 0.15) is significantly different (p < 0.01) from CV (M = 0.66, SD = 0.20) in RT1, while in RT2 both CL (M = 0.83, SD = 0.18) and UM (M = 0.75, SD = 0.21) have means different from CV (M = 0.63, SD = 0.26), and this result is statistically significant (p < 0.001 for CL, p < 0.01 for UM). No significant difference is reported between the means of CL and UM (p = 0.07). The mean of UM is significantly different from the means of all other modifications (p < 0.01). Lastly, pairwise paired t-tests showed no significant difference between the means per Category in RT1 and their corresponding means in RT2.
Subjective evaluations were also performed by 7 native listeners. As the sentences were meaningful, the content helped the native listeners to understand both CL and CV speech almost 100%. One listener appeared to have an intelligibility score below 70% for both speaking styles. That listener benefits from all modification techniques in RT2, and in RT1 from all modifications except uniform time-scaling. Repeated-measures ANOVA showed no statistically significant differences between Categories for either RT1 (F(6, 6) = 1.544, p = 0.192) or RT2 (F(6, 6) = 1.781, p = 0.131). Subjective evaluations were also performed by 4 non-native hearing-impaired listeners. CL speech was more intelligible than CV speech in RT1 (MCL = 0.87, SDCL = 0.16; MCV = 0.60, SDCV = 0.22) and RT2 (MCL = 0.80, SDCL = 0.23; MCV = 0.57, SDCV = 0.17). In the RT2 condition, the modification schemes failed to increase the intelligibility of casual speech. However, for RT1, all listeners showed an intelligibility increase of modified casual speech with the mix-filtering modification (M = 0.86, SD = 0.26). Repeated-measures ANOVA showed no statistically significant differences between Categories for RT1 (F(6, 3) = 1.754, p = 0.166) and RT2 (F(6, 3) = 3.228, p = 0.0248).
The subjective evaluations presented above show that clear speech is more intelligible than casual speech in reverberant conditions for the non-native listeners. Indeed, CL outperforms CV by 19% at both the 0.8s and 2s reverberation times. Non-native listeners also report that the combination of uniform time-scaling and the mix-filtering technique is advantageous for RT2, since the intelligibility benefit is 13%, which is 6% below the upper bound (CL). However, at the lower reverberation time, the benefit of this modification drops.
This inefficiency is possibly due to the selection of the uniform time-scaling factor. Figure 5(a) shows that the mix-filtering technique has a slight advantage over casual speech. Then, when uniform time-scaling is combined with the spectral boosting, the median intelligibility score drops and the variance increases. Therefore, this result indicates that the time-scaling factor is important for reverberant environments and its selection should be proportional to the reverberation time. Also, the PSQ-based modification fails to increase the intelligibility of casual speech. One possible reason for this is the change of rhythm between speech segments and the extreme elongation in some cases. A more conservative time-scaling factor could prove more advantageous for the time-scaling techniques and is to be explored in the future.
The hearing-impaired population is too small to draw any concrete conclusions from. However, the clear speech intelligibility advantage is 23% and 27% higher than that of casual speech for RT1 and RT2, respectively, and the mix-filtering in RT1 increases the intelligibility of casual speech by 26%. Finally, native listeners do not benefit from the transformations, as the intelligibility of CV is as high as that of CL, highlighting the importance of the semantic content and/or the amount of reverberation above which their perception is degraded (possibly at higher reverberation times).
In the above discussion, different time and spectral techniques for increasing the intelligibility of casual speech in reverberant environments are explored. In methods in accordance with an embodiment, a spectral transformation applies multi-band filtering to casual speech, boosting information from important frequency bands indicated by clear speech; it has low computational complexity as it does not require detection of steady-state portions. The mix-filtering and uniform time-scaling combination increases the intelligibility of casual speech in highly reverberant environments (RT = 2s) for the non-native population. Results indicate that modifications based on clear speech properties can be beneficial for the intelligibility enhancement of casual speech in reverberant environments.
The above embodiments use the realisation that clear speech has an intelligibility advantage over casual speech in noisy and reverberant environments. Spectral and time domain modifications are used to increase the intelligibility of casual speech in reverberant environments by compensating particular differences between the two speaking styles. To compensate spectral differences, a frequency-domain filtering approach is applied to casual speech. In the time domain, two techniques for time-scaling casual speech are demonstrated: (1) uniform time-scaling and (2) pause insertion and phoneme elongation based on loudness and modulation criteria. The combination of spectral transformation and uniform time-scaling is shown to be helpful in increasing the intelligibility of casual speech. The experimental results above support the conclusion that modifications inspired by clear speech can be beneficial for the intelligibility enhancement of speech in reverberant environments.
As shown in figure 3, a reverberation time input 113 is provided. In an embodiment, the reverberation time input is used to temporally scale the spectrally shaped signal produced at 107 by a factor Ts where:

Ts = a·RT + 1

where RT is the reverberation time and a is a constant of less than 1.
In one embodiment, the constant a is selected by setting an upper value for the reverberation time; beyond the upper value, a fixed scaling is assumed to apply. In an embodiment, the upper value of the reverberation time is set at 3s. A total allowable time-scaling factor is also set; in this embodiment, the total allowable time scaling is set to 40%. The value of a is then worked out to give a scaling factor of 1.4 at a reverberation time of 3s, i.e. a = (1.4 - 1)/3. Therefore, in this embodiment, the factor a is set to be 0.13.
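The arithmetic above fits in a couple of lines; a sketch with illustrative names:

```python
def scaling_factor(rt, rt_cap=3.0, max_stretch=0.4):
    """Ts = a*RT + 1, with a = max_stretch / rt_cap (0.4/3 ~ 0.13) so the
    factor reaches 1.4 at RT = 3s and is held there for longer RTs."""
    a = max_stretch / rt_cap
    return 1.0 + a * min(rt, rt_cap)
```

For instance, scaling_factor(2.0) evaluates to about 1.27, i.e. a 27% stretch at a 2s reverberation time, and any reverberation time above 3s is clamped to the 1.4 factor.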
This scaling factor can be used with the above described uniform scaling or the scaling which is dependent on properties of speech, such as stationary speech.
However, it will be appreciated that other scaling factors could be used. At reverberation times greater than 3s, in an embodiment, the factor of 1.4 will still be used. Inserting pauses as explained above may be used for large reverberation times.
Although the above embodiment uses the reverberation time as an input to the time scaling, it can also be used when there is no reverberation. In such cases, the time-scaling can be set to a fixed value, e.g. 20%, or somewhere from 10% to 25%. This constant factor can be used both for uniform scaling and for scaling just the stationary parts of speech.
With reference to figures 6 to 8, the selection of the frequency bands and the weighting for the mix frequency technique will be discussed.
Figure 6 shows the difference of the log average spectral envelopes of clear speech minus casual speech. In this study, the analysis is performed on the whole signal and not only on the voiced segments, accounting for the importance of the consonants to speech intelligibility. Every clear sentence and its corresponding casual sentence is analysed.
The analysis involves frame-by-frame estimation of the true envelope for the voiced segments and spectral envelope estimation directly from LPC analysis for the unvoiced segments. The true envelope estimation is based on cepstral smoothing of the amplitude spectrum. The cepstrum order is set to 10 in order to estimate the overall energy of the frequency bands. For each spectral envelope, the DC component is set to zero. Then, the spectral envelope is normalized by its RMS to eliminate intensity differences between clear and casual speech. The averaged spectral envelopes are computed as the mean over all frames for each speaking style separately. Figure 6 shows that clear speech appears to have higher energy in two frequency bands, B1 = [2000, 4800]Hz and B2 = [5600, 8000]Hz. In the above explained embodiment, the enhancement of the intelligibility of casual speech involves the isolation of these important frequency bands B1 and B2 and then the addition of their energy to the original signal with different weighting factors for each frequency band.
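For reference, the cepstral-smoothing step described above can be sketched as follows; the FFT size and per-frame handling are illustrative choices, not taken from the patent:

```python
import numpy as np

def smoothed_envelope(frame, order=10, nfft=512):
    """Cepstrally smoothed log-amplitude envelope of a single frame;
    order 10 keeps only the coarse band-energy structure."""
    spec = np.abs(np.fft.rfft(frame, nfft)) + 1e-12
    cep = np.fft.irfft(np.log(spec), nfft)
    cep[order + 1: nfft - order] = 0.0      # lifter: keep low quefrencies
    env = np.fft.rfft(cep, nfft).real       # back to a smoothed log spectrum
    env[0] = 0.0                            # zero the DC component
    return env / np.sqrt(np.mean(env ** 2)) # RMS-normalise the envelope
```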
However, although these energy bands are taught here, others are possible. For example, sub-bands of the above identified bands could also be used. In this embodiment, the upper band is selected as it is where UNVOICED consonants are defined, and the lower band is selected to boost important transition information from consonants to vowels. However, the system could be used for hearing impaired users, both those with impaired hearing at certain frequencies and those, for example with cochlear implants, where hearing is focussed at certain channels. In these situations, the bands that are boosted could be, for example, those which have been found to be enhanced during clear speech and where the user can hear. For cochlear implant users, the system could be configured to boost the signal in line with the dominant channels of the implant. The selection of the proper combination of the weights is important both for intelligibility and for quality. For example, high-pass filtering speech above 1.5kHz increases its intelligibility in noise. However, the absence of information in the lower frequency bands can degrade the quality of speech. Therefore, this information is retained in the modified speech y_mixF by choosing to keep the original speech signal weighted by w0. Then, the selection of the other two weights is inspired by clear speech properties.
Specifically, focusing on the energy differences between clear and casual speech in Figure 6, it can be observed that the energy advantage of clear speech over casual speech is greater in the B2 frequency band than in B1. Possibly, this happens because the energy of consonants is higher in clear than in casual speech. Therefore, in an embodiment, w2 > w1 to account for the slightly higher energy difference of the B2 frequency band compared to B1 between the two speaking styles.
Summarizing the above, the set of possible weighting combinations can be described by the following equations:

w0 + w1 + w2 = 1   (3)

w2 > w1   (4)

0 < w0, w1, w2 < 1   (5)

In order to select one proper weight combination {w0, w1, w2}, w0 is considered as a dependent variable. Then, the two variables w1 and w2 can vary between (0, 1) respecting the restrictions described by equations (3), (4) and (5). As the desire is to enhance the intelligibility of casual speech, the proper values {w0, w1, w2} are those that maximize the intelligibility score of the modified speech compared to the unmodified speech.
To define these values, the casual speech used to produce figure 6 is used as a training dataset. Specifically, the casual signals of this speech are modified using different weight combinations that satisfy the above equations. The intelligibility of the modified sentences using the mix-filtering approach (mixF) and of the unmodified casual sentences is evaluated objectively in the presence of low-SNR (SNR = -10dB) Speech Shaped Noise (SSN). The best combination of weights is the one that maximizes the objective intelligibility difference of the modified speech minus the unmodified speech.
In an embodiment, the objective metric used to predict intelligibility is the Glimpse Proportion (GP). The Glimpse measure comes from the Glimpse model for auditory processing. As an intelligibility predictor, the model is based on the assumption that in a noisy environment humans listen to the glimpses of speech that are less masked.
Therefore, the GP measure is the proportion of spectro-temporal regions where speech is more energetic than the noise. Figure 7 shows, for various weight combinations, the difference between the intelligibility score of mixF speech minus casual speech given by GP. Note that w0 is not present, as it is assumed from equation (3) to be the dependent variable. The optimal weight combination that maximizes this difference is {0.1, 0.4, 0.5}. The difference between the average smoothed spectral envelopes of the modified speech mixF that derives from this combination and the casual speech is depicted in Figure 8. The important frequency bands are boosted, "stealing" from the lower frequency bands.
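The weight search itself is a small constrained grid search. The sketch below assumes a Glimpse Proportion scorer gp(signal, noise) is available (its implementation is not shown here; any objective intelligibility metric with the same interface would fit) and enforces equations (3) to (5):

```python
import numpy as np

def best_weights(s, s1, s2, noise, gp, step=0.1):
    """Search (w1, w2) with w0 = 1 - w1 - w2 (eq. 3), w2 > w1 (eq. 4) and
    all weights in (0, 1) (eq. 5), maximising the GP gain over casual speech."""
    base = gp(s, noise)  # score of unmodified casual speech
    best, best_gain = None, -np.inf
    for w1 in np.arange(step, 1.0, step):
        for w2 in np.arange(w1 + step, 1.0, step):   # enforces w2 > w1
            w0 = 1.0 - w1 - w2
            if not 0.0 < w0 < 1.0:
                continue
            y = w0 * s + w1 * s1 + w2 * s2
            y *= np.sqrt(np.sum(s ** 2) / np.sum(y ** 2))  # RMS normalisation
            gain = gp(y, noise) - base
            if gain > best_gain:
                best_gain, best = gain, (w0, w1, w2)
    return best, best_gain
```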
The fact that clear speech can be applicable both in various intelligibility challenging conditions and in quiet motivates the modification of casual speech based on clear speech characteristics with a view to increasing its intelligibility in "noisy" channels and at the same time preserving its quality as these channels vary dynamically in real life.
The intelligibility benefit of the mix-filtering method for SSN is demonstrated for reverberant environments. In reverberant environments, the mix-filtered modified speech simulates clear speech in terms of spectral energy distribution, which is resistant to reverberant environments. In such environments, the intelligibility decrease of speech is due to (1) overlap masking effect where the energy of a phoneme is masked by the preceding one and (2) self-masking where the information is smeared inside a phoneme possibly as a result of flattened formant transitions.
As in clear speech, the mix-filtering approach boosts the higher spectral regions, where transient parts are more likely to be found, and "steals" spectral energy from the low frequencies, whose energy usually causes the overlap masking of the phoneme that follows. Time-scaling schemes enhance the intelligibility of unmodified speech through repetition of the information in time, reducing the overlap-masking and self-masking effects. The performance of two time-scaling techniques is evaluated above for reverberant environments: 1) uniform time-scaling, and 2) time-scaling based on the Perceptual-Speech-Quality (PSQ) model. Uniform time-scaling changes the overall duration while respecting the "local" speech rhythm. PSQ proposes both an elongation and a pause insertion scheme that could be beneficial in reverberant environments, as the energy of a speech segment falls into the pauses and does not mask the following segments. The above examples use a pause insertion scheme that inserts pauses in acoustically meaningful places.
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed the novel methods and systems described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of methods and systems described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms of modifications as would fall within the scope and spirit of the inventions.

Claims (20)

  1. A speech processing system for increasing the intelligibility of output speech, the system comprising: a speech input for receiving speech; a processor adapted to receive said input speech and modify the speech to increase its intelligibility; and an output for said modified speech, wherein said processor is adapted to modify the speech to increase its intelligibility by selecting at least two frequency bands of said input speech and applying a different weighting to each of the at least two frequency bands to produce a spectrally shaped signal, the processor being configured to temporally scale said spectrally shaped signal to stretch the spectrally shaped signal.
  2. A speech processing system according to claim 1, further comprising an input to receive information indicating the reverberation time for the environment where the output of said speech processing system is located.
  3. A speech processing system according to claim 2, being configured such that the information indicating the reverberation time is used to control the temporal scaling of the spectrally shaped signal.
  4. A speech processing system according to claim 3, wherein the system is adapted to apply a fixed scaling when there is no input for the reverberation time or when the input indicates that the reverberation time is below a pre-determined value.
  5. A speech processing system according to claim 1, wherein the processor is configured to temporally scale said spectrally shaped signal to stretch the spectrally shaped signal by a factor based on the temporal difference between clear speech and casual speech.
  6. A speech processing system according to claim 1, wherein the processor is configured to temporally scale said spectrally shaped signal to uniformly stretch the spectrally shaped signal.
  7. A speech processing system according to claim 1, wherein the processor is configured to temporally scale said spectrally shaped signal by identifying stationary parts of speech and allowing elongation of these parts without elongating non-stationary parts of speech.
  8. A speech processing system according to claim 7, wherein said stationary parts of speech are identified by loudness measurements and modulation information.
  9. A speech processing system according to claim 1, wherein the processor is configured to temporally scale the signal by a factor Ts where: Ts = a·RT + 1, where RT is the reverberation time and a is a constant of less than 1.
  10. A speech processing system according to claim 1, wherein the processor is configured to stretch the spectrally shaped signal by inserting pauses into the spectrally shaped signal.
  11. A speech processing system according to claim 1, wherein the at least two frequency bands are selected as bands where there is enhancement of a clear speech signal over a casual speech signal.
  12. A speech processing system according to claim 1, wherein one frequency band is selected as a band where UNVOICED consonants are present and the other frequency band is a band where VOICED CONSONANTS and vowels are present.
  13. A speech processing system according to claim 1, wherein at least one of said frequency bands is above 5000Hz and the other of said frequency bands is below 5000Hz.
  14. A speech processing system according to claim 1, wherein the at least two frequency bands comprise a first band from 2000Hz up to 4800Hz and a second band from 5600Hz up to 8000Hz.
  15. A speech processing system according to claim 1, the system further comprising a control means adapted to change the at least two frequency bands to accommodate for a hearing impairment of a user of said system.
  16. A speech processing system according to claim 1, wherein the processor is adapted to select a further frequency band corresponding to the full frequency range and apply a weighting to this further frequency band.
  17. A speech processing system according to claim 16, wherein the processor is adapted to normalise the spectrally shaped signal to have the same RMS energy as the original speech such that: y[i] = w0·s[i] + w1·s1[i] + w2·s2[i] and y_mixF[i] = y[i] · sqrt( Σ_{i=1}^{N} s[i]² / Σ_{i=1}^{N} y[i]² ), where s, s1 and s2 are the signals from the full frequency band and the first and second frequency bands, w0, w1 and w2 are the respective weighting factors, y_mixF is the proposed spectrally modified signal and N is the number of samples of the original signal s and y.
  18. A speech processing system according to claim 1, wherein the weightings of the bands are set to sum to 1 and wherein the weighting of a higher frequency band is greater than the weighting of a lower frequency band.
  19. A speech processing system according to claim 1, comprising a speaker to output said modified speech.
  20. A speech processing method for increasing the intelligibility of output speech, the method comprising: receiving speech; modifying the speech to increase its intelligibility; and outputting said modified speech, wherein the speech is modified by selecting at least two frequency bands of said input speech and applying a different weighting to each of the at least two frequency bands to produce a spectrally shaped signal, and temporally scaling said spectrally shaped signal to stretch the spectrally shaped signal.
GB1507488.3A 2015-04-30 2015-04-30 A Speech Processing System and Method Active GB2537924B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1507488.3A GB2537924B (en) 2015-04-30 2015-04-30 A Speech Processing System and Method


Publications (3)

Publication Number Publication Date
GB201507488D0 GB201507488D0 (en) 2015-06-17
GB2537924A (en) 2016-11-02
GB2537924B GB2537924B (en) 2018-12-05

Family

ID=53489003

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1507488.3A Active GB2537924B (en) 2015-04-30 2015-04-30 A Speech Processing System and Method

Country Status (1)

Country Link
GB (1) GB2537924B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030229490A1 (en) * 2002-06-07 2003-12-11 Walter Etter Methods and devices for selectively generating time-scaled sound signals
EP1515310A1 (en) * 2003-09-10 2005-03-16 Microsoft Corporation A system and method for providing high-quality stretching and compression of a digital audio signal
US20060270467A1 (en) * 2005-05-25 2006-11-30 Song Jianming J Method and apparatus of increasing speech intelligibility in noisy environments
EP1918911A1 (en) * 2006-11-02 2008-05-07 RWTH Aachen University Time scale modification of an audio signal


Also Published As

Publication number Publication date
GB2537924B (en) 2018-12-05
GB201507488D0 (en) 2015-06-17
