GB2536729A - A speech processing system and a speech processing method - Google Patents


Publication number
GB2536729A
Authority
GB
United Kingdom
Prior art keywords
speech
output
signal
input
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1505363.0A
Other versions
GB201505363D0 (en)
GB2536729B (en)
Inventor
Zorila Tudor-Catalin
Stylianou Ioannis
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Toshiba Europe Ltd
Original Assignee
Toshiba Research Europe Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Toshiba Research Europe Ltd filed Critical Toshiba Research Europe Ltd
Priority to GB1505363.0A priority Critical patent/GB2536729B/en
Publication of GB201505363D0 publication Critical patent/GB201505363D0/en
Publication of GB2536729A publication Critical patent/GB2536729A/en
Application granted granted Critical
Publication of GB2536729B publication Critical patent/GB2536729B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
    • G10L21/0364Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility

Abstract

A speech intelligibility enhancing system receives speech from an input (11, fig. 1), divides it into frames S1 (e.g. via a Hann window function), transforms each frame to the frequency domain S2 (e.g. via an FFT) and then applies a Spectral Contrast Filter S3 which estimates spectral tilt (S38, fig. 4), normalizes signal energy (S40) and uses compression/expansion to increase the relative magnitude of consecutive peaks and valleys according to a threshold (Input/output Contrast Characteristic filter IOCC, S41). Other filters may be used to preserve the total energy of the adapted signal before output (17).

Description

A Speech Processing System and Speech Processing Method
FIELD
Embodiments described herein relate generally to speech processing systems and speech processing methods.
BACKGROUND
The goal of speech communication is to convey information. However, in most practical scenarios the speech signal is corrupted by acoustic noise found in various communication channels (e.g. inside cars, at train stations, airports, sports arenas etc.), which lowers its intelligibility and thus impairs the information transfer. Highly intelligible speech in noisy environments is desirable both for everyday listeners using mobile phones, playing media files on mobile devices or paying attention to public announcements in such conditions, and even more so for exceptional listeners exchanging critical speech information in adverse situations (e.g. the communication between a plane's pilot and the control tower).
Applying carefully designed signal processing techniques, it is possible to enhance speech to promote its intelligibility in additive noise.
BRIEF DESCRIPTION OF THE DRAWINGS
Systems and methods in accordance with non-limiting embodiments will now be described with reference to the accompanying figures in which: Figure 1 is a schematic illustration of a system in accordance with an embodiment; Figure 2(a) is a flow diagram showing the processing steps performed by a processor in a system in accordance with an embodiment; Figure 2(b) is a flow diagram showing the processing step performed by a processor in a system in accordance with another embodiment; Figure 2(c) is a flow diagram showing the processing steps performed by a processor in a system in accordance with yet another embodiment; Figure 3 is a schematic illustration of the frequency domain spectral energy readjustment step of Figure 2(c) in more detail; Figure 4 is a schematic illustration of a frequency domain spectral contrast enhancement (fSCE) step of Figure 3 in more detail; Figure 5 is a plot of an input-output characteristic curve which may be used in the frequency domain spectral contrast enhancement step of Figure 4(a); Figure 6(a) shows the magnitude spectra output from the frequency domain spectral contrast enhancement step of Figure 4; Figure 6(b) shows the magnitude spectra output from the pre-emphasis stage of the frequency domain spectral energy readjustment step shown in Figure 3; Figure 6(c) shows the magnitude spectra output from the weighted sum step of the frequency domain spectral energy readjustment step shown in Figure 3; Figure 7 is a schematic showing the dynamic range compression stage of Figure 2(c) in more detail; Figure 8 is a plot of an input-output envelope characteristic curve used in the dynamic range compression stage of Figure 7; Figure 9(a) is a plot of a speech signal and figure 9(b) is a plot of the output from a dynamic range compression step performed on the speech signal; Figure 10 is a plot of an input-output envelope characteristic curve adapted in accordance with a signal to noise ratio; Figure 11 is a schematic illustration of a 
system in accordance with an embodiment; Figure 12(a) is a schematic of a system in accordance with a further embodiment with multiple outputs; Figure 12(b) is a schematic of a system in accordance with an alternative embodiment with multiple outputs; Figure 13 shows the percentage of correctly recognized keywords from the Harvard test set for speech shaped and competing speaker types for fSER and fSERDRC; Figure 14 shows a comparison of computational load in seconds for the Matlab implementation of fSER and fSERDRC.
DETAILED DESCRIPTION
According to one embodiment, there is provided a speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received from the speech input to enhanced speech to be output by the enhanced speech output, the processor being configured to: i) extract frames of the speech received from the speech input; ii) transform each frame of the speech received from the speech input into the frequency domain; iii) apply a spectral contrast enhancement filter over a first frequency range.
In one embodiment, the spectral contrast enhancement filter applies gains to an inputted signal in the frequency domain. The gains may be computed from the envelope of the inputted signal, and then used to rescale the input signal. The envelope may be altered to increase the magnitude of the parts having a magnitude higher than a threshold value and decrease the magnitude of the parts having a magnitude lower than the threshold value. The envelope may then be used to rescale the input signal.
In one embodiment, the spectral contrast enhancement filter is configured to: reduce the spectral tilt of an inputted signal; normalise the reduced spectral tilt signal; and increase the magnitude of parts of the normalised reduced spectral tilt signal having a magnitude higher than a first threshold value and decrease the magnitude of parts having a magnitude lower than the first threshold value, in the frequency domain.
In one embodiment, the processor is further configured to: iv) transform each frame of the filtered signal back into the time domain; v) add the frames of the filtered signal to produce an enhanced speech signal.
In a further embodiment, the processor is further configured to: apply a low-frequency preservation filter to the output of step ii); and apply a pre-emphasis filter to the output of step ii).
The low frequency preservation filter may comprise a low pass filter.
In a further embodiment, the spectral contrast enhancement filter is applied to the output of the pre-emphasis filter.
In a yet further embodiment, the processor is further configured to: adjust the signal outputted from the spectral contrast enhancement filter such that the power is the same as the signal outputted from the pre-emphasis filter.
In a yet further embodiment a band pass filter is applied to the output from the spectral contrast enhancement filter and to the output from the pre-emphasis filter, before the power adjustment is performed.
In a yet further embodiment, step i) comprises applying a windowing function; and step ii) comprises performing a discrete Fourier transform on the windowed speech signal, and wherein the output of the discrete Fourier transform for each frame is a phase spectrum and a magnitude spectrum, wherein the magnitude spectrum is inputted into the low-frequency preservation filter and the pre-emphasis filter.
In a further embodiment, the processor is further configured to: weight the output of the low-frequency preservation filter, the output of the pre-emphasis filter and the adjusted signal outputted from the spectral contrast enhancement filter and sum the weighted outputs; and wherein step iv) comprises performing an inverse Fourier transform on the summed weighted outputs using the phase spectrum, and step v) comprises overlap-adding the frames outputted from the inverse Fourier transform; and wherein the processor is further configured to: adjust the overlap-added signal such that the power is the same as the original speech signal.
In an embodiment, the spectral contrast enhancement filter comprises: a log operator, configured to calculate the log of an inputted signal; an envelope module, configured to estimate the magnitude envelope of the output of the log operator; a spectral tilt module, configured to estimate the spectral tilt of the output of the envelope module; a first subtraction module, configured to subtract the output of the spectral tilt module from the output of the envelope module; a normalization module, configured to normalize the output of the first subtraction module; a contrast enhancement module, configured to increase the magnitude of parts of the output of the normalization module having a magnitude higher than a first threshold value and decrease the magnitude of parts of the output of the normalization module having a magnitude lower than the first threshold value, in the frequency domain and over the first frequency range; a second subtraction module, configured to subtract the output of the normalisation module from the output of the contrast enhancement module; an interpolation module, configured to interpolate the output of the second subtraction module so that its resolution is the same as the resolution of the output of the log operator; an addition module, configured to add the output of the log operator to the output of the interpolation module; and an exponential operator, wherein the output of the addition module is inputted into the exponential operator.
In one embodiment, the magnitude envelope is estimated using the SEEVOC algorithm.
In one embodiment the spectral tilt is estimated using linear regression.
In one embodiment, the normalisation module is configured to calculate the mean value of the output of the first subtraction module, and subtract this value from the output of the first subtraction module.
In one embodiment, the contrast enhancement module is configured to apply an input-output contrast characteristic to the output of the normalisation module.
In one embodiment the processor is further configured to apply a dynamic range compression to the enhanced speech signal.
In one embodiment, the system further comprises: a noise input for receiving real-time information concerning the noise environment; wherein the processor is further configured to: measure the signal to noise ratio at the noise input, wherein the dynamic range compression comprises a control parameter which is updated in real time according to the measured signal to noise ratio.
In one embodiment, the control parameter for the dynamic range compression is used to control the gain to be applied by the dynamic range compression.
In a further embodiment, the dynamic range compression is configured to redistribute the energy of the speech received at the speech input and wherein the control parameter is updated such that it gradually suppresses the redistribution of energy with increasing signal to noise ratio.
In some embodiments, the dynamic range compression stage will be adapted to the noisy environment. The output may be adapted to the noise environment by adapting the DRC stage in line with the signal to noise ratio (SNR). Further, the output is continually updated such that it adapts in real time to the changing noise environment. For example, if the above system is built into a mobile telephone and the user is standing outside a noisy room, the system can adapt to enhance the speech dependent on whether the door to the room is open or closed. Similarly, if the system is used in a public address system in a railway station, the system can adapt in real time to the changing noise conditions as trains arrive and depart.
In an embodiment, the signal to noise ratio is estimated on a frame by frame basis and the signal to noise ratio for a previous frame is used to update the parameters for a current frame. A frame length may be 1 to 3 seconds.
When adapting the dynamic range compression in line with the SNR, the control parameter that is updated may be used to control the gain to be applied by said dynamic range compression. In further embodiments, the control parameter is updated such that it gradually suppresses the boosting of the low energy segments of the input speech with increasing signal to noise ratio. In some embodiments, a linear relationship is assumed between the SNR and control parameter, in other embodiments a non-linear or logistic relationship is used.
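The SNR-to-control-parameter mapping described above can be sketched as follows. The SNR limits, linear slope and logistic steepness used here are illustrative assumptions, not values from the patent; only the overall behaviour (full boosting at low SNR, gradual suppression as SNR increases, with either a linear or a logistic relationship) follows the text.

```python
import math

def drc_gain_control(snr_db, snr_low=-10.0, snr_high=20.0, mode="linear"):
    """Map a measured SNR (dB) to a DRC control parameter in [0, 1].

    1.0 means full boosting of low-energy speech segments (very noisy),
    0.0 means the energy redistribution is fully suppressed (clean).
    snr_low, snr_high and the logistic steepness are illustrative.
    """
    if mode == "linear":
        # Linear ramp from 1 (at snr_low) down to 0 (at snr_high), clipped
        t = (snr_high - snr_db) / (snr_high - snr_low)
        return min(1.0, max(0.0, t))
    # Logistic alternative: smooth roll-off around the midpoint SNR
    mid = 0.5 * (snr_low + snr_high)
    return 1.0 / (1.0 + math.exp(0.5 * (snr_db - mid)))
```

In a frame-based system, this parameter would be recomputed from the previous frame's SNR estimate and used to scale the DRC gain for the current frame.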
In one embodiment, the system further comprises an energy banking box, the energy banking box being a memory provided in the system and configured to store the total energy of the speech received at the speech input before enhancement, the processor being further configured to redistribute energy from high energy parts of the speech to low energy parts using the energy banking box.
In these embodiments, the processor is configured to increase the energy of low energy parts of the enhanced signal using energy stored in the energy banking box.
In one embodiment, the system is further configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements.
In some embodiments, the dynamic range compression comprises a control parameter which is updated in real time according to the speech received at the speech input. In one embodiment, the processor is configured to estimate the maximum value of the signal envelope of the speech received at the speech input when applying dynamic range compression, and the system is configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
In one embodiment, the signal to noise ratio is estimated on a frame by frame basis and wherein the signal to noise ratio for a previous frame is used to update the parameters for a current frame.
In one embodiment, the system is configured to output enhanced speech in a plurality of locations, the system comprising a plurality of noise inputs corresponding to the plurality of locations, the processor being configured to apply a plurality of corresponding dynamic range compression filters, such that there is a dynamic range compression filter for each noise input, the processor being configured to update the control parameters for each dynamic range compression filter in accordance with the signal to noise ratio measured from its corresponding noise input. Such a system would be of use for example in a PA system with a plurality of speakers in different environments.
Since some methods in accordance with embodiments can be implemented by software, some embodiments encompass computer code provided to a general purpose computer on any suitable carrier medium. The carrier medium can comprise any storage medium such as a floppy disk, a CD ROM, a magnetic device or a programmable memory device, or any transient medium such as any signal e.g. an electrical, optical or microwave signal.
In one embodiment, the speech is enhanced under the constraint of global signal energy preservation, before and after modification.
According to one embodiment, there is provided a method for enhancing speech to be outputted, the method comprising: receiving speech to be enhanced; converting speech received from the speech input to enhanced speech; and outputting the enhanced speech, wherein converting the speech comprises: extracting frames of the input speech signal; transforming each frame of speech received via the speech input into the frequency domain; applying a spectral contrast enhancement filter over a first frequency range.
According to one embodiment, there is provided a carrier medium comprising computer readable code configured to cause a computer to perform a method of: receiving speech to be enhanced; converting speech received from the speech input to enhanced speech; and outputting the enhanced speech, wherein converting the speech comprises: extracting frames of the input speech signal; transforming each frame of speech received via the speech input into the frequency domain; applying a spectral contrast enhancement filter over a first frequency range.
Figure 1 is a schematic of a speech intelligibility enhancing system 1.
The system 1 comprises a processor 3 which comprises a program 5 which takes input speech and enhances the speech to increase its intelligibility. The storage 7 stores data that is used by the program 5. Details of what data is stored will be described later.
The system 1 further comprises an input module 11 and an output module 13. The input module 11 is connected to an input for data relating to the speech to be enhanced. The input may also include data concerning the real time noise conditions in the places where the enhanced speech is to be output. The type of data that is input may take many forms, which will be described in more detail later. The input 15 may be an interface that allows a user to directly input data. Alternatively, the input may be a receiver for receiving data from an external storage medium or a network.
Connected to the output module 13 is audio output 17.
In use, the system 1 receives data through data input 15. The program 5, executed on processor 3, enhances the inputted speech in the manner which will be described with reference to figures 2 to 10.
The system is configured to increase the intelligibility of speech in additive noise. The system modifies plain speech such that it has high intelligibility in noisy conditions.
In one embodiment, the system is configured to preserve the energy of the original speech, i.e. it is configured to increase the intelligibility of speech in noise with energy preservation. In other words, the system may preserve the same energy of the signal before and after modification. This is known as global power preservation.
Figure 2(a) depicts two flow diagrams showing the processing steps provided by program 5 in accordance with an embodiment. In the first flow diagram, a speech signal s(n) is inputted. Frames are extracted from the inputted speech signal in step S1, for example, by applying a windowing function such as the Hann window function.
Each frame is transformed into the frequency domain in step S2, for example using a discrete Fourier transform. The second flow diagram shows an example of the processing steps in which a discrete Fourier transform is applied in step S2.
A spectral contrast enhancement filter is then applied over a first frequency range in step S3. The spectral contrast enhancement filter may be applied to the magnitude spectra outputted from the discrete Fourier transform.
In one embodiment, the spectral contrast enhancement filter is configured to use compression and expansion to increase the contrast of the signal, that is, the magnitude difference between consecutive peaks and valleys from the magnitude spectrum.
In one embodiment, the signal is transformed back into the time domain in step S4, and the frames of the signal are added to output an enhanced speech signal s_fSCE(n) in step S5.
In one embodiment, the signal is transformed back into the time domain by applying an inverse Fourier transform using the original phase spectra. This is the case for the second flow diagram, in which an inverse discrete Fourier transform is applied in step S4.
In one embodiment, the spectral contrast enhancement filter comprises steps S36 to S49 described in relation to Figure 4(a), applied to the framed inputted speech signal s(n) in the frequency domain. Figure 6(a) shows the magnitude spectrum output from the process shown in Figure 2(a), when the steps S36 to S49 in Figure 4(a) are applied to a framed inputted speech signal in the frequency domain. The horizontal axis shows frequency in Hertz. The vertical axis shows magnitude in dB. The solid line shows the plain speech. The dashed line shows the outputted enhanced speech.
Figure 2(b) is a flow diagram showing the processing steps provided by program 5 in accordance with another embodiment. In this embodiment, to enhance or boost the intelligibility of the speech, the system comprises a frequency domain based spectral energy readjustment (fSER) step S10.
The fSER step S10 of Figure 2(b) will be described in more detail with reference to Figures 3 to 5. The fSER step S10 reallocates an inputted speech signal's spectral energy and consists of three stages: low-frequency preservation, pre-emphasis and spectral contrast enhancement, all performed in the frequency domain, and all the operations being frame based. All of the fSER operations are applied on the magnitude of the DFT coefficients, on a frame-by-frame basis.
In another embodiment, the system performs a first stage based on a frequency domain approach and a second stage which allows reallocation of energy over time. Figure 2(c) is a flow diagram showing the processing steps provided by program 5 in accordance with this embodiment. In this embodiment, to enhance or boost the intelligibility of the speech, the system comprises a frequency domain based spectral energy readjustment (fSER) step S10 and a Dynamic Range Compression (DRC) step S11. The resulting system may be referred to as fSERDRC. In this embodiment, the output of the fSER step S10 is delivered to the DRC step S11, i.e. the fSER S10 and the DRC S11 are connected in cascade form. In other words, the output of the fSER step is inputted into the DRC step.
The fSER step S10 uses a frequency domain approach. The DRC step S11 allows reallocation of energy over time. fSERDRC boosts speech intelligibility by a frequency domain energy readjustment followed by a secondary energy readjustment in the time domain. fSERDRC readjusts energy both in the spectral and time domain, giving a high intelligibility gain. fSERDRC includes frequency and time domain-based operators applying piece-wise linear compression/expansion of magnitude spectra and waveform envelopes.
In an embodiment, the processing steps shown in Figures 2(a) to (c) modify an input speech signal so as to increase its intelligibility, before the signal is presented in the noisy environment, under the constraint of global power preservation. The processing steps, or modification algorithm, comprise a cascade form of filters implementing a combination of compression and expansion operators in the frequency and time domains.
The fSER step S10 of Figure 2(c) will be described in more detail with reference to Figures 3 to 5. The fSER step reallocates an inputted speech signal's spectral energy and consists of three stages: low-frequency preservation, pre-emphasis and spectral contrast enhancement, all performed in the frequency domain, and all the operations being frame based. In other words, all of the fSER operations are applied on the magnitude of the DFT coefficients, on a frame-by-frame basis.
The original speech signal s(n) in Figures 2(a) to 2(c) is a sampled signal, with n being the discrete time index. In one embodiment, the sampling rate is greater than 8 kHz and less than 44kHz. In one embodiment, the sampling rate, or sampling frequency is 16 kHz.
Figure 3 shows a flow diagram of the fSER step S10 of Figures 2(b) and (c). The filtering operations shown in Figure 3 are conducted in the frequency domain on the magnitude spectra, which results in very fast implementation.
The original speech signal s(n) is inputted in step S17.
In one embodiment, prior to enhancement by the filters shown in Figure 3, the speech signal is rescaled by 2^(Q-1), where Q is the number of bits for quantization.
In step S18, the speech signal s(n) is windowed. The output of step S18 is:
s_w(n) = s(n) w_k(n)
where w_k(n) is a windowing function, for example a Hann window function:
w_k(n) = 0.5 (1 - cos(2πn / (N - 1))), for n = 0, ..., N - 1
where N is the width, in samples n, of the window function. N is therefore equal to the length of time (in seconds) of the window multiplied by the sampling frequency, Fs. In one embodiment, the original speech signal is divided into segments of length 32 ms using 50% overlapping Hann windows. In a further embodiment, the sampling frequency is 16 kHz, and N is thus 32 ms x 16 kHz = 512 samples.
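The framing of step S18 can be sketched as follows, assuming the 32 ms / 50% overlap / 16 kHz embodiment described above; the helper name is illustrative.

```python
import numpy as np

def extract_frames(s, fs=16000, frame_ms=32, overlap=0.5):
    """Split signal s into overlapping Hann-windowed frames (step S18).

    Frame length, overlap and sampling rate follow the embodiment in
    the text (32 ms, 50% overlap, Fs = 16 kHz).
    """
    N = int(frame_ms * fs / 1000)            # window width in samples (512)
    hop = int(N * (1 - overlap))             # 256 samples for 50% overlap
    n = np.arange(N)
    w = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))   # Hann window
    frames = [s[i:i + N] * w for i in range(0, len(s) - N + 1, hop)]
    return np.array(frames), hop
```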
In step S19, the windowed signal s_w(n) is inputted into a Fourier transform processor. In one embodiment, the processor is a discrete time Fourier transform (DTFT) processor. The discrete time Fourier transform processor applies a discrete time Fourier transform to the inputted windowed signal s_w(n) in S19, resulting in S(k, ω):
S(k, ω) = Σ_{n=0}^{N-1} s_w(n) e^(-jωn)
where k is the current frame's index and ω is the frequency, normalised with respect to the sampling frequency. The below explanation and equations use the normalised frequency ω. However, they may also be rewritten in terms of the de-normalised frequency F. For each frame, the magnitude spectrum:
S0(k, ω) = |S(k, ω)|
and the phase spectrum:
φ0(k, ω) = arg{S(k, ω)}
are computed by the Fourier transform module, or Fourier transform processor, such that:
S(k, ω) = S0(k, ω) e^(j φ0(k, ω))
In an alternative embodiment, the processor is a discrete Fourier transform (DFT) processor. In this embodiment, an N-point discrete Fourier transform may be applied in step S19, resulting in:
S(k, m) = Σ_{n=0}^{N-1} s_w(n) e^(-j 2πnm / N)
where m is the discrete frequency bin. The output of the N-point discrete Fourier transform is a complex function that can be written in a similar polar form to that used in the embodiment in which a discrete time Fourier transform is applied.
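The magnitude/phase decomposition of step S19 can be sketched with a real-input N-point FFT; the function name is illustrative.

```python
import numpy as np

def analyse_frame(s_w):
    """N-point DFT of one windowed frame (step S19).

    Returns the magnitude spectrum S0 and phase spectrum phi0, so that
    S = S0 * exp(j * phi0); phi0 is kept for re-synthesis (step S22).
    """
    S = np.fft.rfft(s_w)       # real-input DFT, bins m = 0 .. N/2
    S0 = np.abs(S)             # magnitude spectrum
    phi0 = np.angle(S)         # phase spectrum
    return S0, phi0
```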
Step S20 is the low-frequency preservation stage, in which the magnitude spectrum S0(k, ω) outputted from the Fourier transform processor step S19 is inputted to low-pass filter H1(ω). In the low frequency preservation stage, the low-frequency components of the magnitude spectrum S0(k, ω) are preserved, i.e. are separated out from the rest of the signal and outputted. The low-pass filtering is applied in the frequency domain directly to the magnitude spectrum, and thus no phase distortions are introduced.
The output of the low-pass filter is:
S1(k, ω) = H1(ω) S0(k, ω)
In one embodiment, the components of the signal below 400 Hz are preserved.
In one embodiment:
H1(ω)|dB = 0 for |ω| <= ω_c, and -60 for |ω| > ω_c
In one embodiment, ω_c corresponds to 400 Hz.
The output of the low-frequency preservation stage is a signal comprising the low frequency components of the magnitude spectrum. In one embodiment, the output of the low-frequency preservation stage is a signal comprising the components of the magnitude spectrum below 400Hz.
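A sketch of the low-frequency preservation stage S20, applied directly to the magnitude spectrum (so no phase distortion is introduced). A brick-wall characteristic of 0 dB below the cutoff and -60 dB above it is assumed here, matching the 400 Hz embodiment; the bin-to-frequency mapping assumes rfft bins of an N-point frame.

```python
import numpy as np

def low_freq_preserve(S0, fs=16000, f_c=400.0, stop_db=-60.0):
    """Low-frequency preservation (step S20) on a magnitude spectrum.

    S0 holds rfft magnitudes of an N-point frame, N = 2*(len(S0)-1).
    Components at or below f_c are kept at 0 dB; the rest are
    attenuated by stop_db (an assumed brick-wall characteristic).
    """
    n_bins = len(S0)
    freqs = np.arange(n_bins) * fs / (2 * (n_bins - 1))  # bin centres in Hz
    H1 = np.where(freqs <= f_c, 1.0, 10 ** (stop_db / 20))
    return H1 * S0
```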
In one embodiment, a different low frequency preservation filter H1(ω) is used, where H1(ω) is defined by the critical points given in the below table.

Frequency, F (Hz) | H1(F) (dB)
500 | -60
Fs/2 | -60

Step S21 is the pre-emphasis stage, in which the magnitude spectrum S0(k, ω) outputted from the Fourier transform processor step S19 is passed through a pre-emphasis filter.
The output of the pre-emphasis filter is:
S2(k, ω) = H2(ω) S0(k, ω)
where the transfer function of the pre-emphasis filter H2(ω) is:
H2(ω) = sqrt(1 - 2a cos(ω) + a^2)
In one embodiment, a = 0.97.
The pre-emphasis stage flattens the spectral tilt of the signal. It attenuates low frequency parts of the signal, and amplifies the high frequency parts of the signal.
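A sketch of the pre-emphasis stage S21, applied to the magnitude spectrum. It assumes the magnitude response of the classic first-order pre-emphasis filter 1 - a·z^(-1), i.e. |H2(ω)| = sqrt(1 - 2a cos(ω) + a²), with a = 0.97 as in the embodiment above.

```python
import numpy as np

def pre_emphasis(S0, a=0.97):
    """Pre-emphasis (step S21) on a magnitude spectrum.

    Uses the magnitude response of the first-order filter 1 - a*z^{-1},
    which attenuates low frequencies and amplifies high ones,
    flattening the spectral tilt.
    """
    n_bins = len(S0)
    w = np.linspace(0.0, np.pi, n_bins)   # normalised frequency grid
    H2 = np.sqrt(1.0 - 2.0 * a * np.cos(w) + a * a)
    return H2 * S0
```

Note that |H2(0)| = 1 - a (strong attenuation at DC) and |H2(π)| = 1 + a (boost at the Nyquist frequency).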
Figure 6(b) shows the magnitude spectrum output from the pre-emphasis stage S21. The horizontal axis shows frequency in Hertz. The vertical axis shows magnitude in dB.
The darker, solid line is the signal inputted into the pre-emphasis stage, S0(k, ω), i.e. the plain speech. The lighter, dashed line is S2(k, ω), the signal outputted from the pre-emphasis stage.
In step S22, the phase spectra φ0(k, ω) outputted from the Fourier transform processor step S19 are stored for use in re-synthesis.
The frequency domain based spectral contrast enhancement (fSCE) step S23 is performed on the pre-emphasised signal S2(k, ω) output from step S21. Spectral contrast is defined as the difference in dB between adjacent peaks and valleys in the spectrum. Thus, increasing spectral contrast sharpens formants, which can improve the intelligibility of speech-in-noise.
In the fSCE step, gains are computed from the envelope of the inputted signal, and then used to rescale the input signal.
Figure 4 shows the architecture of an fSCE process, which is performed on a frame by frame basis in accordance with an embodiment. In one embodiment, the fSCE step S23 comprises the steps shown in Figure 4. The fSCE step S23 enhances speech frequency contrast by operating in the frequency domain. The contrast is achieved by compression and expansion mechanisms. The fSCE step can be applied to one frequency band, or multiple frequency bands, or selected frequency bands.
In S35, the pre-emphasised signal S2(k, ω) output from step S21 is inputted.
In step S36, the log of the inputted pre-emphasised magnitude spectrum S2(k, ω), log S2(k, ω), is calculated. In this step, the pre-emphasised magnitude spectrum S2(k, ω) is compressed using a log operator.
In S37, a magnitude envelope of log S2(k, ω) is estimated. This is referred to as the log envelope, E(k, ω'). In one embodiment, E(k, ω'), the log envelope of the input, is calculated using the SEEVOC algorithm.
In an embodiment, the SEEVOC algorithm estimates the log envelope, E(k, ω'), using a lower spectral resolution, i.e. fewer frequency bins, than that of the input spectrum, log S2(k, ω). This lower spectral resolution is used because it is more computationally efficient. The different spectral resolutions are depicted in the block diagram, with ω indicating the spectral resolution of the input spectrum and ω' indicating the lower spectral resolution resulting after the SEEVOC algorithm is applied.
In S38, the log-spectral tilt T(k, ω') is estimated. The log-spectral tilt is the spectral tilt of the log envelope, E(k, ω'); in other words, it is a measure of the difference in intensity of E(k, ω') at high and low frequencies.
In an embodiment, this is estimated by means of a weighted linear regression model: T(k, ω') = ak ω' + bk, where ak and bk are the slope and offset of the model. A linear regression is applied to E(k, ω') to estimate ak and bk.
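A minimal pure-Python sketch of this fit is given below. Since the weights of the weighted regression are not specified here, an ordinary (unweighted) least-squares estimate is shown as an illustration; the function names are not from the patent.

```python
# Sketch: estimate the log-spectral tilt T(k, w') of a log envelope E(k, w')
# by fitting a straight line T(w') = a*w' + b. The patent uses a weighted
# linear regression; the weights are not given here, so an ordinary
# (unweighted) least-squares fit is shown for illustration.

def estimate_tilt(freqs, log_env):
    """Return (a, b): slope and offset of the least-squares line."""
    n = len(freqs)
    mean_f = sum(freqs) / n
    mean_e = sum(log_env) / n
    num = sum((f - mean_f) * (e - mean_e) for f, e in zip(freqs, log_env))
    den = sum((f - mean_f) ** 2 for f in freqs)
    a = num / den
    b = mean_e - a * mean_f
    return a, b

def tilt_line(freqs, a, b):
    """Evaluate the tilt model T(w') = a*w' + b on the frequency grid."""
    return [a * f + b for f in freqs]
```

Subtracting the fitted line from the log envelope then yields the tilt-free envelope of step S39.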
Alternatively, a single line is estimated for a given frame k, and used as a model for T(k, ω') for all frames.
In an alternative embodiment, the log-spectral tilt T(k, ω') is estimated using the first two cepstral coefficients of the discrete cosine transform of the log envelope. In this case, the log envelope, E(k, ω'), is assumed to be a function of ω' only, given that k is constant for a given frame. This function is represented as a Fourier series, using only cosine terms, as the function is even with respect to ω'. The first two coefficients of the Fourier series representation are taken to be the estimate of the log-spectral tilt. The first two coefficients give a very close approximation of T(k, ω').
In step S39, the tilt-free log envelope is calculated: E1(k, ω') = E(k, ω') − T(k, ω'). In this step, the spectral tilt is suppressed from the log envelope to generate a tilt-free spectral envelope.
Speech frames have different energies; thus, in order to apply the same input/output contrast enhancement characteristic (IOCC) to all of them in step S41, the energy is first normalized, in step S40.
In step S40, the tilt-free spectral envelope E1(k, ω') calculated in S39 is normalized by subtracting its mean value, resulting in:

E2(k, ω') = E1(k, ω') − (1/2π) ∫ from −π to π of E1(k, ω') dω'

The integral in S40 may be calculated using a Riemann sum.
In step S41, the normalised tilt-free spectral envelope E2(k, ω') is inputted into an input/output contrast characteristic module. Contrast enhancement is achieved by the input/output contrast characteristic module by compression and expansion mechanisms. This step expands the parts of the normalised tilt-free envelope which have magnitudes higher than a threshold and compresses the remaining parts.
In one embodiment, the frequency range frange over which it is desired to perform the IOCC is also inputted into step S41. The input/output contrast characteristic block applies the input/output contrast characteristic (IOCC) for the pre-defined frequency range frange. In one embodiment, frange is from 250 Hz to 4 kHz.
Applying the IOCC to the normalised tilt-free magnitude spectral envelope E2(k, ω') increases the magnitude of the parts having a magnitude higher than a first threshold value and decreases the magnitude of the parts having a magnitude lower than the first threshold value, in the frequency domain. In other words, applying the IOCC to the input normalised tilt-free spectral envelope E2(k, ω') expands the parts of the input having a magnitude higher than the first threshold value and compresses the parts of the input having a magnitude lower than the first threshold value, in the frequency domain. In an embodiment, the first threshold value is the mean value of the tilt-free spectral envelope, (1/2π) ∫ from −π to π of E1(k, ω') dω'. The contrast of the signal, that is the magnitude difference between the consecutive peaks and valleys of the magnitude spectrum, is increased.
In one embodiment, the IOCC module increases the magnitude of the parts having a magnitude between the first threshold value and a second threshold value, and decreases the magnitude of the parts having a magnitude between the first threshold value and a third threshold value in the frequency domain, where the second threshold value is higher than the first threshold value, and the first threshold value is higher than the third threshold value. In one embodiment, the second threshold value is 30 dB, the first threshold value is (1/2π) ∫ from −π to π of E1(k, ω') dω' and the third threshold value is −40 dB.
In one embodiment, the IOCC is applied to a plurality of frequency ranges.
An example of an IOCC curve which may be used in step S41 is shown in Figure 5. The compression and expansion areas are indicated on the figure. The IOCC curve is linearly interpolated between the critical points shown in Figure 5. The critical points are listed in the table below:

Input level (dB)    Output level (dB)
-40                 -40
-20                 -30
-10                 -20
0                   3
15                  25
30

The IOCC curve depicted in Figure 5 is a plot of the desired output in decibels (vertical axis) against the input in decibels (horizontal axis). Unity gain is shown as a straight dotted line and the desired gain to implement contrast enhancement is shown as a solid line.
The inputted E2(k, ω') is multiplied by the IOCC function element by element. In other words, all elements of E2(k, ω') having a particular magnitude, or input level, when multiplied by the IOCC function, will have the output level magnitude given by the IOCC curve for that particular input level. For the IOCC curve shown in Figure 5, for example, all elements of E2(k, ω') having a magnitude of −20 dB will have a magnitude of −30 dB after the IOCC function is applied.
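The element-wise IOCC mapping can be sketched in pure Python as follows. The critical points used here are the first four pairs from the table above (the remainder of the table is not fully legible in this text), and the clamping behaviour outside the table is an assumption of this sketch.

```python
# Sketch: apply a piecewise-linear IOCC curve to a normalised tilt-free
# envelope (values in dB). The critical points below are illustrative,
# taken from the first rows of the table above; input levels between
# points are linearly interpolated, and levels outside the table are
# clamped to the end points (an assumption of this sketch).

IOCC_POINTS = [(-40.0, -40.0), (-20.0, -30.0), (-10.0, -20.0), (0.0, 3.0)]

def iocc(level_db, points=IOCC_POINTS):
    """Map one input level (dB) to an output level (dB)."""
    if level_db <= points[0][0]:
        return points[0][1]
    if level_db >= points[-1][0]:
        return points[-1][1]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= level_db <= x1:
            t = (level_db - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)

def apply_iocc(envelope_db):
    """Apply the IOCC curve element by element."""
    return [iocc(v) for v in envelope_db]
```

For the curve above, an envelope value of −20 dB maps to −30 dB, as in the example given in the text.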
In step S49 the output of the normalization step S40, E2(k, ω'), is subtracted from the output of the IOCC step S41. This results in a log-gain function G(k, ω'). The log-gain function is computed as the difference between the input and output of the IOCC step S41.
In S45, the log-gain function G(k, ω') is interpolated. In one embodiment, a linear interpolation is applied in order to obtain the same resolution as the original pre-emphasised magnitude spectrum S2(k, ω).
The interpolation step is included here because the SEEVOC algorithm may estimate the envelope of the magnitude spectrum, E(k, ω'), using a lower spectral resolution, i.e. fewer frequency bins, than the input spectrum, log S2(k, ω). In order to use the gain filter to reshape the input spectrum, the gain filter is interpolated so that it has the same, higher, spectral resolution as the input spectrum. In one embodiment, a simple linear interpolation (in the log domain) is used to obtain the higher spectral resolution.
In alternative embodiments, in which the envelope of the magnitude spectrum is estimated using the same spectral resolution as the input spectrum, the interpolation step S45 is not included.
In S46, the output of S45 is added to the output of S36, log S2(k, ω). In this step, the interpolated log-gain function is used to reshape the input magnitude spectrum, log S2(k, ω), to increase its spectral contrast.
In S47, the exponential of the output of S46 is taken, resulting in S'2(k, ω). S'2(k, ω) is outputted in step S48. This is the output of the fSCE step S23 and is the final enhanced magnitude spectrum.
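Steps S46 and S47 can be sketched compactly: adding the interpolated log-gain to the log spectrum and exponentiating is equivalent to multiplying the linear magnitude spectrum by exp(G). The function name below is illustrative.

```python
import math

# Sketch of steps S46-S47: adding a log-gain G to log S2 and taking the
# exponential is the same as multiplying the (linear) magnitude spectrum
# by exp(G). Values here are illustrative.

def reshape_spectrum(magnitude, log_gain):
    """Apply a log-domain gain to a linear magnitude spectrum."""
    return [m * math.exp(g) for m, g in zip(magnitude, log_gain)]
```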
In one embodiment, instead of step S49, three steps, steps S42, S43 and S44 below, are performed.
S42 is de-normalization. De-normalisation is applied to the output of the IOCC step S41 in order to recover the original dynamic range of the envelope.
In S43, the log-spectral tilt is recovered, by adding T(k, ω'), the log-spectral tilt outputted from step S38, to the output of S42.
In S44, the log-gain function G(k, ω') is computed, by subtracting E(k, ω'), the log envelope outputted from step S37, from the output of S43.
In S45, the log-gain function G(k, ω') outputted from S44 is interpolated as before.
Returning to Figure 3, in step S25, the output of the fSCE step, S'2(k, ω), is band-limited by a band pass filter H3(ω), resulting in S''2(k, ω):

S''2(k, ω) = S'2(k, ω) H3(ω), where H3(ω) = 1 for ω1 ≤ ω ≤ ω2, and H3(ω) = 0 otherwise

In one embodiment, S'2(k, ω) is band-limited between 250–4500 Hz by the band pass filter H3(ω), i.e. ω1 corresponds to 250 Hz and ω2 corresponds to 4500 Hz.
In an alternative embodiment, the band pass filter H3(ω) may be defined by the critical points given in the following table:

Frequency, F (Hz)    H3(F) (dB)
0                    -60
250                  -3
350                  0
4400                 0
4500                 -3
4600                 -60
Fs/2                 -60

In one embodiment, filter H3(ω) implements a power compensation plan for components from 250 Hz to 4500 Hz.
The next steps involve scaling the output of the BPF, S''2(k, ω), to match the energy of the band-limited version of the initial pre-emphasised speech frame S2(k, ω) output from step S21. The energy matching (EM) stage comprises steps S24, S26 and S27.
In step S24, a band-limited version of S2(k, ω) output from step S21 is obtained using the H3(ω) filter. The filter H3(ω) is applied to the initial pre-emphasised speech S2(k, ω) output from S21, resulting in S̄2(k, ω):
S̄2(k, ω) = S2(k, ω) H3(ω)

In step S26, the output of step S24, S̄2(k, ω), and the output of step S25, S''2(k, ω), are used to calculate αk:

αk = √( Σω |S̄2(k, ω)|² / Σω |S''2(k, ω)|² )

The output of step S26 is αk. In step S27, the output of step S25, S''2(k, ω), is multiplied by αk, to give the energy matched signal S3(k, ω):

S3(k, ω) = αk S''2(k, ω)

In step S28 of Figure 3, signals from the three stages (low-frequency preservation stage S20, pre-emphasis stage S21 and fSCE stage with energy matching, S23, S24, S25, S26 and S27) are weighted and in step S29 the weighted signals are added.
In step S28, the output signal from S20 (the low frequency preservation step), S1(k, ω), is multiplied by a constant c1; the output signal from S21 (the pre-emphasis step), S2(k, ω), is multiplied by a constant c2; and the output signal from S27 (the energy matched spectral contrast enhancement stage), S3(k, ω), is multiplied by a constant c3. The weighted signals are then summed in step S29 to give S4(k, ω):

S4(k, ω) = c1 S1(k, ω) + c2 S2(k, ω) + c3 S3(k, ω)

In one embodiment, c1 = 0.15, c2 = 0.5, c3 = 0.35.
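The energy matching and the weighted recombination can be sketched in pure Python as follows. The scaling factor and the weights are those given above; the helper names, and the assumption that αk matches the sum of squared magnitudes, are illustrative.

```python
import math

# Sketch of the energy-matching steps S26-S27 and the weighted sum S28-S29.
# alpha_k rescales the band-limited contrast-enhanced spectrum so its energy
# matches that of the band-limited pre-emphasised spectrum. The weights
# c1, c2, c3 default to the values quoted in the embodiment.

def energy(spectrum):
    """Sum of squared magnitudes of a spectrum."""
    return sum(abs(s) ** 2 for s in spectrum)

def energy_match(reference, target):
    """Scale `target` so its energy equals that of `reference`."""
    alpha = math.sqrt(energy(reference) / energy(target))
    return [alpha * s for s in target]

def weighted_sum(s1, s2, s3, c1=0.15, c2=0.5, c3=0.35):
    """S4 = c1*S1 + c2*S2 + c3*S3, element by element."""
    return [c1 * a + c2 * b + c3 * c for a, b, c in zip(s1, s2, s3)]
```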
Figure 6(c) shows the magnitude spectra output from step S29. The horizontal axis shows frequency in Hertz. The vertical axis shows magnitude in dB.
The darker, solid line is the plain speech signal S0(k, ω). The lighter, dashed line is S4(k, ω), the signal outputted from S29.
In step S30, the original phase spectra stored in step S22 and the output S4(k, ω) of step S29 are fed into an inverse Fourier transform processor to give s4(n). The original phase spectrum is considered and the inverse Fourier transform provides an enhanced speech frame.
In S31, the frames s4(n) are overlap-added to give the final enhanced signal s5(n).
The final step S32 is a last scaling by γ to keep the same global signal power of modified and original speech. In step S32, the value of γ is calculated from

γ = √( Σn s²(n) / Σn s5²(n) )

such that the output of the fSER step S10 is:

sfSER(n) = γ s5(n)

Returning to Figure 2(c), the dynamic range compression (DRC) step S11 will now be described in more detail with reference to Figure 7. DRC augments intelligibility in noise by redistributing energy in the time domain such that the signal's temporal envelope variations are reduced. The DRC is performed on the output of the fSER step S10. In other words, the input of the DRC step S11 is sfSER(n). Therefore, in the following description, the inputted speech signal s(n) is the output of the fSER stage, sfSER(n). In other words, the inputted speech signal s(n) is substituted by the output of the fSER stage, sfSER(n).
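The final power-matching scale factor γ of step S32 can be sketched as follows; the function name is illustrative.

```python
import math

# Sketch of the final scaling step S32: gamma matches the global power of
# the modified signal s5(n) to that of the original input s(n).

def global_rescale(original, modified):
    """Scale `modified` so its total energy equals that of `original`."""
    gamma = math.sqrt(sum(x * x for x in original) / sum(x * x for x in modified))
    return [gamma * x for x in modified]
```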
The DRC step is a second energy readjustment over time. This was found to help the intelligibility of speech in additive noise. DRC is inspired by the compression techniques used in audio recording/broadcasting and hearing-aid amplification. DRC is designed to amplify low-energy segments of speech (e.g., nasals, onsets and offsets), while more energetic voiced sounds are attenuated under the constraint of global signal energy preservation before and after modifications.
In DRC, time varying gains g(n) are computed from the dynamically and statically compressed temporal envelope, and then used to rescale the input signal. The temporal envelope ê(n) is first estimated in step S51, considering the magnitude of the analytic signal of the speech waveform; it is then further processed to remove its fast fluctuations. Next, the resulting signal is dynamically compressed with a 2 ms release and an instantaneous attack time constant in step S53, and it is then passed through a predefined input/output envelope characteristic (IOEC) which applies static compression in step S55. The output of step S55 is the time varying gain g(n). This is used to rescale the input signal in step S57.
The first step in DRC is to estimate the signal's temporal envelope. The signal's time envelope is estimated in step S51 using the magnitude of the analytic signal corresponding to s(n):

ê(n) = |s(n) + j ŝ(n)|

where ŝ(n) denotes the Hilbert transform of the inputted speech signal (i.e. the outputted signal from the fSER step, sfSER(n)). The Hilbert transform of a signal is found by taking the convolution of the signal with 1/(πn). ê(n) is the signal's time envelope.
Furthermore, because this estimated envelope has fast fluctuations, a new estimate e(n) is computed based on a moving average operator with order given by the average pitch of the speaker's gender. In an embodiment, to reduce the envelope's fast fluctuations, the previous estimated signal, ê(n), is divided into non-overlapping segments of length 2.5 times the speaker's mean pitch period, or fundamental period; then 95% of the maximum value of the envelope amplitude in each frame is saved. In one embodiment, an average 150 Hz pitch frequency is used, corresponding to a mean pitch period of 0.0067 seconds. This results in a modified envelope e(n). The pitch period may be calculated by using a pitch detection algorithm such as an autocorrelation algorithm.
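The segment-wise smoothing just described can be sketched as follows. The segment length in samples is assumed to be precomputed from the sampling rate and mean pitch (e.g. at 16 kHz and a 150 Hz mean pitch, 2.5 × 16000/150 ≈ 267 samples); the function name is illustrative.

```python
# Sketch of the envelope smoothing: the raw envelope is cut into
# non-overlapping segments of 2.5 mean pitch periods and each segment is
# replaced by 95% of its maximum value, removing fast fluctuations.

def smooth_envelope(raw_envelope, segment_len):
    """Piecewise-constant smoothed envelope e(n) from the raw envelope."""
    out = []
    for start in range(0, len(raw_envelope), segment_len):
        segment = raw_envelope[start:start + segment_len]
        level = 0.95 * max(segment)
        out.extend([level] * len(segment))
    return out
```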
In an embodiment, the speaker's gender is assumed to be male since the average fundamental period is longer for men. However, in some embodiments as noted above, the system can be adapted specifically for female speakers with a shorter fundamental period.
The signal is then passed to the DRC dynamic step S53. In an embodiment, during the DRC's dynamic stage S53, the modified envelope of the signal is dynamically compressed with a 2 ms release time constant and with an almost instantaneous attack time constant:

ẽ(n) = αr ẽ(n−1) + (1 − αr) e(n), if e(n) < ẽ(n−1)
ẽ(n) = αa ẽ(n−1) + (1 − αa) e(n), if e(n) ≥ ẽ(n−1)

where αa is the attack time constant and αr is the release time constant. In one embodiment, αr = 0.15 and αa = 0.0001.
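The recursion above can be sketched as a one-pole smoother with two coefficients; the default coefficients are those quoted in the embodiment, and the function name is illustrative.

```python
# Sketch of the dynamic compression stage S53: a one-pole smoother that
# uses a near-instantaneous attack coefficient when the envelope rises
# and a slower release coefficient when it falls.

def dynamic_compress(envelope, alpha_a=0.0001, alpha_r=0.15):
    """Return the dynamically compressed envelope."""
    out = [envelope[0]]
    for e in envelope[1:]:
        prev = out[-1]
        if e < prev:   # envelope falling: release time constant
            out.append(alpha_r * prev + (1 - alpha_r) * e)
        else:          # envelope rising: (almost instantaneous) attack
            out.append(alpha_a * prev + (1 - alpha_a) * e)
    return out
```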
The attack time constant and the release time constant are the amounts of time it takes for the gain to change by a set amount of dB. The attack time constant is the amount of time to decrease by the set amount and the release time constant is the amount of time to increase by the set amount.
Following the dynamic stage S53, a static amplitude compression step S55 controlled by an Input-Output Envelope Characteristic (IOEC) is applied.
In one embodiment, setting the reference level to 30% of the maximum value of the dynamically compressed envelope, ẽ(n), the signal is then converted to dB, followed by a static stage where a pre-defined fixed IOEC is applied.
The IOEC curve depicted in Figure 8 is a plot of the desired output in decibels against the input in decibels. Unity gain is shown as a straight dotted line and the desired gain to implement DRC is shown as a solid line. This curve is used to generate the time-varying gains required to reduce the envelope's variations. To achieve this, first the dynamically compressed envelope ẽ(n) is converted to dB:

ein(n) = 20 log10( ẽ(n) / e0 )

where, in one embodiment, the reference level e0 is set to 0.3 times the maximum level of the signal's dynamically compressed envelope ẽ(n). The selection of 0.3 times the maximum dynamically compressed envelope for e0 provided good listening results for a broad range of SNRs (signal to noise ratio).
Then, applying the IOEC to ein(n) generates emod(n) and allows the computation of the time-varying gains:

g(n) = 10^((emod(n) − ein(n)) / 20)

The IOEC curve is linearly interpolated between the critical points shown in Figure 8.
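This gain computation can be sketched as follows. The IOEC mapping itself is passed in as a function, since its critical points are given by Figure 8 and are not reproduced here; the function names are illustrative.

```python
import math

# Sketch of the static stage S55: convert the dynamically compressed
# envelope to dB relative to e0, map it through the IOEC curve, and turn
# the dB difference back into a linear time-varying gain g(n).

def ioec_gains(envelope, ioec_curve, e0):
    """Return the list of linear gains g(n) for a compressed envelope."""
    gains = []
    for e in envelope:
        e_in = 20.0 * math.log10(e / e0)
        e_mod = ioec_curve(e_in)
        gains.append(10.0 ** ((e_mod - e_in) / 20.0))
    return gains
```

With a unity IOEC (output equals input) every gain is 1, so the signal is left unchanged, which is the λmax behaviour of the adaptive variant described further below.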
The gains, g(n), resulting from the operations are used in step S57 to alter the signal's initial waveform, which produces the DRC-modified speech signal sDRC(n), where sDRC(n) is given by:

sDRC(n) = g(n) s(n)

where, as described above, in this case s(n) = sfSER(n), and sDRC(n) is the final output of the system.
Figure 9(a) shows a speech signal s(n) before modification by DRC. The signal is input to the DRC stage. Figure 9(b) shows the speech signal after DRC, i.e. Figure 9(b) shows sDRC(n) = g(n) s(n).
In an alternative embodiment, the IOEC curve is controlled in accordance with the signal to noise ratio (SNR) at the location where the speech is to be output. Such a curve is shown in Figure 10. In this embodiment, instead of using a noise-static IOEC curve, a noise-adaptive DRC is used, where the fixed IOEC is replaced by an SNR-adaptive IOEC. Sensors are provided at the speech output to allow the noise and SNR at the output to be measured. The SNR λ is used to control the IOEC as described below. The SNR is a parameter to control the adaptation of the IOEC as a function of noise.
In this embodiment, the output speech is adapted to the noise environment. The output is continually updated such that it adapts in real time to the changing noise environment. For example, if the system is built into a mobile telephone device and the user is standing outside a noisy room, the system can adapt to enhance the speech dependent on whether the door to the room is open or closed, by adapting the IOEC depending on the SNR. Similarly, if the system is used in a public address system in a railway station, the system can adapt in real time to the changing noise conditions as trains arrive and depart.
Talkers may adapt their speaking style in acoustically challenging environments. Changes in speaking style may be as simple as repeating the message (often with a different prosody, i.e. slower speaking rate, more pauses or emphasizing parts of speech, for example) or may involve special modifications in normal speech production (for example spectral tilt flattening, increased F0) caused by an increased vocal effort as a response to acoustical perturbations (the so-called Lombard effect).
In Figure 10, as the current SNR λ increases from a specified minimum value λmin towards a maximum value λmax, the IOEC is modified from the curve depicted in Figure 8 towards the bisector of the first quadrant angle. At λmin, the signal's envelope is compressed by the baseline DRC as shown by the solid line, while at λmax no compression takes place (as shown by the line comprising dashes and dots). In between, different morphing strategies may be used for the SNR-adaptive IOEC. The levels λmin and λmax are given as input parameters for each type of noise. For example, for the SSN type of noise they may be chosen as −9 dB and 3 dB.
A piecewise linear IOEC (such as the one given in Figure 10) is obtained using a discrete set of M points Pi, i = 0, …, M−1. Further on, xi and yi will denote respectively the input and output levels of the IOEC at point i. Also, the discrete family of M points denoted as Piλ = (xi, yi(λ)) in Figure 10 parameterizes the modified IOEC with respect to a given SNR λ. In this context, the noise-adaptive IOEC segment (Piλ, Pi+1λ) has the following analytical expression:

y(x, λ) = a(λ)x + b(λ)

where a(λ) is the segment's slope

a(λ) = (yi+1(λ) − yi(λ)) / (xi+1 − xi)

and b(λ) is the segment's offset

b(λ) = yi(λ) − a(λ)xi

Two embodiments will now be discussed where respectively two types of effective morphing methods were selected to control the IOEC curve: a linear and a non-linear (logistic) slope variation over λ. For an embodiment where a linear relationship is employed, the following expression may be used for a:

a(λ) = Λλ + Γ, if λmin ≤ λ ≤ λmax
a(λ) = 1, if λ > λmax
a(λ) = a(λmin), if λ < λmin

where Λ = (1 − a(λmin)) / (λmax − λmin) and Γ = 1 − Λλmax.
For the non-linear (logistic) form:

a(λ) = Φ / (1 + C e^(−σ0(λ − λ0))) + Ψ, if λmin ≤ λ ≤ λmax
a(λ) = 1, if λ > λmax
a(λ) = a(λmin), if λ < λmin

where λ0 is the logistic offset, σ0 is the logistic slope, and Φ, Ψ and C are normalisation constants chosen such that a(λmin) equals the slope of the baseline IOEC segment and a(λmax) = 1. In an embodiment, λ0 and σ0 are constants given as input parameters for each type of noise (for example, for the SSN type of noise they may be chosen as −6 dB and 2, respectively). In a further embodiment, λ0 and/or σ0 may be controlled in accordance with the measured SNR. For example, they may be controlled with a linear relationship on the SNR, for example:

if SNR ≤ 0, λ0 = λF
if 0 < SNR ≤ 15, λ0 = λF(1 − SNR/15)
if SNR > 15, λ0 = 0

and

if SNR ≤ 0, σ0 = σF
if 0 < SNR ≤ 15, σ0 = σF(1 − SNR/15)
if SNR > 15, σ0 = 0

where λF and σF are constants (for example, for the SSN type of noise they may be chosen as −6 dB and 2, respectively).
Finally, imposing P0λ = P0, the adaptive IOEC is computed for a given λ, considering the expression for a(λ) (i.e. for either the linear relationship or the non-linear relationship) as the slope for each of its segments i = 1, …, M−1. Then, using the analytical expression for (Piλ, Pi+1λ), the new piecewise linear IOEC is generated.
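The linear morphing of a segment slope can be sketched as follows. This is a reconstruction of the relationship described above rather than a verbatim transcription of the patent's formula: the slope moves linearly from its baseline value at λmin to unity (no compression) at λmax, and is clamped outside that range. The default λmin and λmax are the SSN example values.

```python
# Sketch of linear SNR-adaptive slope morphing: for a segment whose
# baseline IOEC slope is a_base, the slope is interpolated linearly
# towards 1 (no compression) as the SNR rises from snr_min to snr_max,
# and is clamped outside that range. A reconstruction, for illustration.

def adaptive_slope(snr, a_base, snr_min=-9.0, snr_max=3.0):
    """Return the morphed segment slope a(lambda) for the given SNR."""
    if snr <= snr_min:
        return a_base
    if snr >= snr_max:
        return 1.0
    t = (snr - snr_min) / (snr_max - snr_min)
    return a_base + t * (1.0 - a_base)
```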
Psychometric measurements have indicated that speech intelligibility changes with SNR following a logistic function of the type used in accordance with the above embodiment. In the above embodiments, the fSER step S10 and the DRC step S11 are very fast processes which allow real time execution at a high perceptual quality of the modified speech.
Systems in accordance with the above described embodiments show enhanced performance in terms of speech intelligibility gain, especially for low SNRs. They also provide suppression of audible artefacts inside the modified speech signal at high SNRs. At high SNRs, increasing the amplitude of low energy segments of speech (such as unvoiced speech) can cause perceptual quality and intelligibility degradation.
Systems and methods in accordance with the above embodiments provide a light, simple and fast method to adapt dynamic range compression to the noise conditions, inheriting high speech intelligibility gains at low SNRs from the non-adaptive DRC and improving perceptual quality and intelligibility at high SNRs.
Figure 11 is a flow diagram showing the processing steps provided by program 5 in an embodiment in which adaptive DRC is used. The program takes input speech and information about the noise conditions where the speech will be output and enhances the speech to increase its intelligibility in the presence of noise. In an embodiment, to enhance or boost the intelligibility of the speech, the system comprises an fSER step S10 and a DRC step S11. The output of the fSER step S10 is delivered to the dynamic range compression step S11. Stages S10 and S11 have been described in detail with reference to Figures 3 to 9.
If speech is not present, the system is off. In stage S61, a voice activity detection module is provided to detect the presence of speech. Once speech is detected, the speech signal is passed for enhancement. The voice activity detection module may employ a standard voice activity detection (VAD) algorithm.
The speech will be output at speech output 63. Sensors are provided at speech output 63 to allow the noise and SNR at the output to be measured. The SNR λ is used to control stage S11 as described in relation to Figure 10 above.
The current SNR at frame t is predicted from previous frames of noise as they have been already observed in the past (t-1, t-2, t-3...). In an embodiment, the SNR is estimated using long windows in order to avoid fast changes in the application of stage S11. In an example, the window lengths can be from 1s to 3s.
The system of Figure 11 is adaptive in that it updates the IOEC curve of step S11 in accordance with the measured SNR.
In stage S11, in the above embodiment, e0 was set to 0.3 times the maximum value of the signal envelope ẽ(n). This envelope can be continually updated dependent on the input signal. In other words, the maximum value of ẽ(n) is updated, such that e0 is varied based on the input signal. The envelope can be updated every N seconds, where N is a value between 2 and 10; in one embodiment, N is from 3 to 5.
The initial value for the maximum value of the signal envelope is obtained from database 65, where speech signals have been previously analysed and these parameters have been extracted. These parameters are passed to parameter update stage S67 with the speech signal, and stage S67 updates these parameters.
In one embodiment, the system updates the IOEC curve of step S11 in accordance with the measured SNR, and the system also updates the value of e0 in stage S11 dependent on the input voice signal and independent of the noise at speech output 63.
In an embodiment, in the dynamic range compression, energy is distributed over time. This modification is constrained by the following condition: total energy of the signal before and after modifications should remain the same (otherwise one can increase intelligibility by increasing the energy of the signal i.e. the volume). Since the signal which is modified is not known a priori, an Energy Banking box 69 is provided. In box 69, energy from the most energetic part of speech is "taken" and saved (as in a Bank) and it is then distributed to the less energetic parts of speech. These less energetic parts are very vulnerable to the noise. In this way, the distribution of energy helps the overall modified signal to be above the noise level.
In an embodiment, this can be implemented by modifying the output to be:

sDRC,a(n) = sDRC(n) a(n)

where a(n) is calculated from the values saved in the energy banking box to allow the overall modified signal to be above the noise level.
If E(sDRC(n)) > E(Noise(n)) then a(n) = 1, where E(sDRC(n)) is the energy of the enhanced signal sDRC(n) for the frame (n) and E(Noise(n)) is the energy of the noise for the same frame.
If E(sDRC(n)) ≤ E(Noise(n)) then the system attempts to further distribute energy to boost low energy parts of the signal so that they are above the level of the noise. However, the system only attempts to further distribute the energy if there is energy Eb stored in the energy banking box.
If the gain g(n) < 1, then the energy difference between the input signal and the enhanced signal (E(s(n)) − E(sDRC(n))) is stored in the energy banking box. The energy banking box stores the sum of these energy differences where g(n) < 1 to provide the stored energy Eb.
To calculate a(n) when E(sDRC(n)) ≤ E(Noise(n)), a bound on a is derived as

a1(n) = √( E(Noise(n)) / E(sDRC(n)) )

A second expression a2(n) for a(n) is derived using Eb:

a2(n) = √( Γ Eb / E(sDRC(n)) + 1 )

where Γ is a parameter chosen such that 0 < Γ ≤ 1, which expresses a percentage of the energy bank which can be allocated to a single frame. In an embodiment, Γ = 0.2, but other values can be used.
If a2(n) ≥ a1(n), then a(n) = a2(n). However, if a2(n) < a1(n), then a(n) = 1. When energy is distributed as above, the energy is removed from the energy banking box Eb, such that the new value of Eb is:

Eb − E(sDRC(n))(a²(n) − 1)

Once a(n) is derived, it is applied to the enhanced speech signal in step S71.
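The energy-banking decision can be sketched as follows. The quadratic energy bookkeeping (gains scale amplitude, so energy scales with the square of the gain) is an assumption made for internal consistency, and the function name is illustrative.

```python
import math

# Sketch of the energy-banking decision for one low-energy frame:
# a1 is the gain needed to lift the frame's energy to the noise level,
# a2 is the gain the bank can afford (at most a fraction GAMMA of the
# banked energy per frame). The boost is applied only when the
# affordable gain reaches the required one.

GAMMA = 0.2  # fraction of the bank allocatable to a single frame

def banked_gain(e_frame, e_noise, e_bank):
    """Return (a, new_bank) for one frame of the enhanced signal."""
    if e_frame > e_noise:
        return 1.0, e_bank
    a1 = math.sqrt(e_noise / e_frame)                # gain required
    a2 = math.sqrt(GAMMA * e_bank / e_frame + 1.0)   # gain affordable
    if a2 >= a1:
        spent = e_frame * (a2 * a2 - 1.0)            # energy drawn from the bank
        return a2, e_bank - spent
    return 1.0, e_bank
```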
The system of Figure 11 can be applied to devices producing speech as output (cell phones, TVs, tablets, car navigation etc.) or accepting speech (i.e., hearing aids). The system can also be applied to Public Announcement apparatus. In such a system, there may be a plurality of speech outputs, for example, speakers, located in a number of places, e.g. inside or outside a station, in the main area of an airport and a business lounge. The noise conditions will vary greatly between these environments. The system of Figure 11 can therefore be modified to produce one or more speech outputs as shown in Figure 12(a) and (b).
The system of Figure 12(a) has been simplified to show a speech input 101, which is then split to provide an input into a first subsystem 103 and a second subsystem 105. Both the first and second subsystems comprise an fSER stage S10 and a dynamic range compression stage S11. The fSER stage S10 and the dynamic range compression stage S11 are the same as those described in relation to Figures 2 to 10. Both subsystems comprise a speech output 63, and the SNR at the speech output 63 for the first subsystem is used to calculate the IOEC curve for stage S11 of the first subsystem. The SNR at the speech output 63 for the second subsystem 105 is used to calculate the IOEC curve for stage S11 of the second subsystem 105. The parameter update stage S67 can be used to supply the same data to both subsystems as it provides parameters calculated from the input speech signal. For clarity the voice activity detection module and the energy banking box have been omitted from Figure 12, but they will both be present in such a system.
Figure 12(b) shows a system according to an alternative embodiment, in which there is a single fSER stage S10, which outputs to the DRC stage S11 of both subsystems.
Figure 13 shows the percentage of correctly recognized keywords from the Harvard test set for speech shaped and competing speaker types of noise. Clean speech data consisting of the first 180 Harvard sentences uttered by a male native British English talker in an acoustically controlled environment was used. SSN (speech shaped noise) and CS (competing speaker) noise maskers were mixed with speech at low, mid and high SNRs, respectively at: -9 dB, -4 dB and 1 dB for SSN, and -21 dB, -14 dB and -7 dB for CS. The competing speaker is a female talker producing read news speech and Harvard-like sentences. All signals have 16 kHz sampling frequency and 16 bits quantization.
The listening test was conducted in sound-treated booths and using headphones, with the subjects typing the answers. After passing audiological screening, 74 native British English listeners were presented with mixes of different speech styles and noise at various SNRs, and then asked to type what they heard. In addition to the plain (not-modified) speech style, two other modified speech styles were included in this evaluation: fSER and fSER+DRC. The percentage of correctly recognized keywords from all listeners is depicted in Figure 13, and gives a measure of speech intelligibility.
Figure 14 presents the computational load (average run times) for the Matlab implementation of the selected methods. A computer with Intel dual core processor at 2.4 GHz and 3 GB of RAM was used to run the test for the first 100 Harvard sentences. The average length of the samples (ALS) in seconds is also specified.
The intelligibility of clean speech degrades rapidly when it is presented in environments with strong background acoustical noise. In order to assist the audience to some extent, talkers have the ability to adapt their speech to such conditions. This is called the Lombard reflex. Various experiments with Lombard speech have established several signal cues that may be important in promoting speech in noise intelligibility, e.g., increased vocal effort and fundamental frequency, decreased spectral tilt or increased first formant frequency.
Boosting the intelligibility of clean speech received by listeners located in noisy environments, for example by fSER or by fSERDRC, has many practical applications, such as in hearing aids and cochlear implants, telecommunication, public address systems and car navigation. It can be used to enhance face to face communication, as well as improving the intelligibility of speech for people with hearing problems.
fSERDRC is a fast algorithm aimed at boosting the intelligibility of speech in additive noise by reallocating spectral energy. It includes a frequency domain-based spectral contrast enhancement technique and a second energy readjustment over time (DRC). fSERDRC offers a good trade-off between intelligibility gains and computational complexity, and, in one embodiment, the DRC stage improves the scores of the fSER system by at least 6% in the low SNR scenario. fSERDRC has good intelligibility scores for CS noise maskers. fSERDRC has a fast run time and few non-causal operations, making it real-time friendly, and produces no phase-mismatched artefacts during signal reconstruction. fSER uses a frequency domain-based spectral contrast enhancement algorithm (fSCE). fSER works in the frequency domain, therefore all of the operations are fast, and it can be implemented in real time. The operations are also causal, meaning that real time implementation is possible; in other words, the operations can be performed immediately. The operations do not require information from a future point in time to be performed.
In one embodiment, fSCE has been shown to improve intelligibility by nearly 200% in very adverse noise conditions, in formal listening evaluation tests.
In one embodiment, fSER is both arithmetically inexpensive and easy to implement in real-time.
In one embodiment, fSER is a DFT based system which does not require zero-phasing of the filtering operations and does not require knowledge of the entire speech waveform.
In one embodiment, all of the signal processing performed in fSER is carried out in the frequency domain, on a frame-by-frame basis, altering only the magnitude spectra of the speech signal. The phase spectra are kept unaltered for re-synthesis purposes. The filtering operations thereby reduce to simple multiplications of magnitude spectra, which do not require zero-phase computations.
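As a minimal sketch of this idea, filtering one frame reduces to applying a real-valued per-bin gain to the magnitude spectrum and reusing the original phase for re-synthesis. The `gains` array here is an illustrative placeholder standing in for the filters described above, not the patent's actual filter.

```python
import numpy as np

def filter_frame_magnitude_only(frame, gains):
    """Apply a per-bin gain to one frame's magnitude spectrum,
    leaving the phase spectrum unaltered for re-synthesis.

    `gains` is a real, non-negative array with one value per rFFT bin
    (an assumed stand-in for the enhancement filters).
    """
    spectrum = np.fft.rfft(frame)
    magnitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Filtering is a simple multiplication of magnitude spectra;
    # no zero-phase computation is needed.
    enhanced = magnitude * gains * np.exp(1j * phase)
    return np.fft.irfft(enhanced, n=len(frame))
```

Because the phase is passed through untouched, a unit gain in every bin returns the input frame unchanged.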
In one embodiment the number of non-causal operations in fSER is small, and its entire signal processing chain is integrated into an overlap-and-add (OLA) reconstruction framework, which yields enhanced speech of high perceptual quality. fSER thereby offers ideal conditions for a straightforward real-time implementation.
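A minimal sketch of OLA reconstruction is shown below; the frame length, hop size and window choice are illustrative assumptions. With a periodic Hann window and 50% overlap, the overlapped windows sum to one, so adding the processed frames back at their hop positions reconstructs the signal (away from the first and last half-frame, where only one window contributes).

```python
import numpy as np

def frames_from_signal(signal, frame_len, hop):
    """Extract windowed frames; a periodic Hann window sums to 1
    across overlaps when hop == frame_len // 2."""
    n = np.arange(frame_len)
    window = 0.5 * (1.0 - np.cos(2.0 * np.pi * n / frame_len))
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [signal[s:s + frame_len] * window for s in starts]

def ola_reconstruct(frames, hop):
    """Overlap-and-add a list of equal-length, already windowed frames."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        out[i * hop:i * hop + frame_len] += frame
    return out
```

In a full system, each frame would be enhanced in the frequency domain between extraction and reconstruction; the OLA step itself only delays the output by one frame, which is why the framework suits real-time use.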
While certain embodiments have been described, these embodiments have been presented by way of example only, and are not intended to limit the scope of the inventions. Indeed, the novel methods and apparatus described herein may be embodied in a variety of other forms; furthermore, various omissions, substitutions and changes in the form of the methods and apparatus described herein may be made without departing from the spirit of the inventions. The accompanying claims and their equivalents are intended to cover such forms or modifications as would fall within the scope and spirit of the inventions.

Claims (20)

CLAIMS:
1. A speech intelligibility enhancing system for enhancing speech, the system comprising: a speech input for receiving speech to be enhanced; an enhanced speech output to output the enhanced speech; and a processor configured to convert speech received from the speech input to enhanced speech to be output by the enhanced speech output, the processor being configured to: i) extract frames of the speech received from the speech input; ii) transform each frame of the speech received from the speech input into the frequency domain; iii) apply a spectral contrast enhancement filter over a first frequency range.
2. The system according to claim 1, wherein the spectral contrast enhancement filter is configured to: reduce the spectral tilt of an inputted signal; normalise the reduced spectral tilt signal; and increase the magnitude of parts of the normalised reduced spectral tilt signal having a magnitude higher than a first threshold value and decrease the magnitude of parts having a magnitude lower than the first threshold value, in the frequency domain.
3. The system according to claim 1, wherein the processor is further configured to: iv) transform each frame of the filtered signal back into the time domain; v) add the frames of the filtered signal to produce an enhanced speech signal.
4. The system according to claim 3, wherein the processor is further configured to: apply a low-frequency preservation filter to the output of step ii); and apply a pre-emphasis filter to the output of step ii).
5. The system according to claim 4, wherein the spectral contrast enhancement filter is applied to the output of the pre-emphasis filter.
6. The system according to claim 5, wherein the processor is further configured to: adjust the signal outputted from the spectral contrast enhancement filter such that the power is the same as the signal outputted from the pre-emphasis filter.
7. The system according to claim 6, wherein step i) comprises applying a windowing function; and wherein step ii) comprises performing a discrete Fourier transform on the windowed speech signal, and wherein the output of the discrete Fourier transform for each frame is a phase spectrum and a magnitude spectrum, wherein the magnitude spectrum is inputted into the low-frequency preservation filter and the pre-emphasis filter.
8. The system according to claim 7, wherein the processor is further configured to: weight the output of the low-frequency preservation filter, the output of the pre-emphasis filter and the adjusted signal outputted from the spectral contrast enhancement filter and sum the weighted outputs; and wherein step iv) comprises performing an inverse Fourier transform on the summed weighted outputs using the phase spectrum, and step v) comprises overlap-adding the frames outputted from the inverse Fourier transform; and wherein the processor is further configured to: adjust the overlap-added signal such that the power is the same as the original speech signal.
9. The system according to claim 1, wherein the spectral contrast enhancement filter comprises: a log operator, configured to calculate the log of an inputted signal; an envelope module, configured to estimate the magnitude envelope of the output of the log operator; a spectral tilt module, configured to estimate the spectral tilt of the output of the envelope module; a first subtraction module, configured to subtract the output of the spectral tilt module from the output of the envelope module; a normalisation module, configured to normalise the output of the first subtraction module; a contrast enhancement module, configured to increase the magnitude of parts of the output of the normalisation module having a magnitude higher than a first threshold value and decrease the magnitude of parts of the output of the normalisation module having a magnitude lower than the first threshold value, in the frequency domain and over the first frequency range; a second subtraction module, configured to subtract the output of the normalisation module from the output of the contrast enhancement module; an interpolation module, configured to interpolate the output of the second subtraction module so that its resolution is the same as the resolution of the output of the log operator; an addition module, configured to add the output of the log operator to the output of the interpolation module; and an exponential operator, wherein the output of the addition module is inputted into the exponential operator.
10. The system according to claim 3, wherein the processor is further configured to: apply a dynamic range compression to the enhanced speech signal.
11. The system according to claim 10, further comprising a noise input for receiving real-time information concerning the noise environment; wherein the processor is further configured to: measure the signal to noise ratio at the noise input, wherein the dynamic range compression comprises a control parameter which is updated in real time according to the measured signal to noise ratio.
12. The system according to claim 11, wherein the control parameter for the dynamic range compression is used to control the gain to be applied by the dynamic range compression.
13. The system according to claim 12, wherein the dynamic range compression is configured to redistribute the energy of the speech received at the speech input and wherein the control parameter is updated such that it gradually suppresses the redistribution of energy with increasing signal to noise ratio.
14. The system according to claim 11, wherein the system further comprises an energy banking box, the energy banking box being a memory provided in the system and configured to store the total energy of the speech received at the speech input before enhancement, the processor being further configured to redistribute energy from high energy parts of the speech to low energy parts using the energy banking box.
15. The system according to claim 10, wherein the system is further configured to modify the dynamic range compression in accordance with the input speech independent of noise measurements.
16. The system according to claim 15, wherein the processor is configured to estimate the maximum value of the signal envelope of the speech received at the speech input when applying dynamic range compression and wherein the system is configured to update the maximum value of the signal envelope of the input speech every m seconds, wherein m is a value from 2 to 10.
17. The system according to claim 11, wherein the signal to noise ratio is estimated on a frame by frame basis and wherein the signal to noise ratio for a previous frame is used to update the parameters for a current frame.
18. The system according to claim 11, configured to output enhanced speech in a plurality of locations, the system comprising a plurality of noise inputs corresponding to the plurality of locations, the processor being configured to apply a plurality of corresponding dynamic range compression filters, such that there is a dynamic range compression filter for each noise input, the processor being configured to update the control parameters for each dynamic range compression filter in accordance with the signal to noise ratio measured from its corresponding noise input.
19. A method for enhancing speech to be outputted, the method comprising: receiving speech to be enhanced; converting speech received from the speech input to enhanced speech; and outputting the enhanced speech, wherein converting the speech comprises: extracting frames of the input speech signal; transforming each frame of speech received via the speech input into the frequency domain; applying a spectral contrast enhancement filter over a first frequency range.
20. A carrier medium comprising computer readable code configured to cause a computer to perform the method of claim 19.
GB1505363.0A 2015-03-27 2015-03-27 A speech processing system and speech processing method Active GB2536729B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1505363.0A GB2536729B (en) 2015-03-27 2015-03-27 A speech processing system and speech processing method


Publications (3)

Publication Number Publication Date
GB201505363D0 GB201505363D0 (en) 2015-05-13
GB2536729A true GB2536729A (en) 2016-09-28
GB2536729B GB2536729B (en) 2018-08-29

Family

ID=53178293


Country Status (1)

Country Link
GB (1) GB2536729B (en)


Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112642160B (en) * 2021-01-22 2021-12-03 江苏比夫电竞数字科技有限公司 Electronic sports service management system based on big data analysis

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5479560A (en) * 1992-10-30 1995-12-26 Technology Research Association Of Medical And Welfare Apparatus Formant detecting device and speech processing apparatus
US20090299742A1 (en) * 2008-05-29 2009-12-03 Qualcomm Incorporated Systems, methods, apparatus, and computer program products for spectral contrast enhancement


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110931033A (en) * 2019-11-27 2020-03-27 深圳市悦尔声学有限公司 Voice focusing enhancement method for microphone built-in earphone
CN110931033B (en) * 2019-11-27 2022-02-18 深圳市悦尔声学有限公司 Voice focusing enhancement method for microphone built-in earphone
WO2023014738A1 (en) * 2021-08-03 2023-02-09 Zoom Video Communications, Inc. Frontend capture
US11837254B2 (en) 2021-08-03 2023-12-05 Zoom Video Communications, Inc. Frontend capture with input stage, suppression module, and output stage

