EP2063420A1 - Method and assembly for improving speech intelligibility - Google Patents
Method and assembly for improving speech intelligibility (Procédé et assemblage pour améliorer l'intelligibilité de la parole)
- Publication number
- EP2063420A1 EP2063420A1 EP07405332A EP07405332A EP2063420A1 EP 2063420 A1 EP2063420 A1 EP 2063420A1 EP 07405332 A EP07405332 A EP 07405332A EP 07405332 A EP07405332 A EP 07405332A EP 2063420 A1 EP2063420 A1 EP 2063420A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- speech
- noise
- segments
- data processing
- processing module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/93—Discriminating between voiced and unvoiced parts of speech signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0316—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude
- G10L21/0364—Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude for improving intelligibility
Definitions
- the present invention concerns a method to enhance the intelligibility of speech produced by a sound device in a noisy environment.
- the present invention also concerns an assembly for implementing this method to enhance the intelligibility of speech produced by a sound device in a noisy environment.
- noise in particular stationary background noise
- Noise has always been a problem. Every signal traveling from one point to another is prone to corruption by noise. Noise can come from various sources: surrounding acoustic sources, such as traffic, babble, reverberation or acoustic echo paths, or electric/electronic sources such as thermal noise. Background noise, also known as environmental noise, can seriously affect perceptual aspects of speech such as quality or intelligibility. Considerable effort has therefore been devoted over the last decades to overcoming this problem.
- a solution to speech enhancement in the presence of local background noise is fundamental to the user experience. This issue is compounded by possible usage in unfavorable environments and by rapid changes in background conditions. Rapid means that those conditions may vary one or several times during a normal conversation, even though this change is slow compared to the signal and noise frequencies, so that the noise can still be approximated as stationary. Automatic adaptation of perceptual aspects such as quality and especially intelligibility is then of the utmost importance to make conversation and device use as seamless as possible.
- a classic noise reduction problem consists of reducing the level of stationary noise superimposed on a local voice (or sound in general) signal captured by the same recording device over the same time interval.
- the remote voice signal arrives at the sound device more or less disturbed by remote background noise and local device noise; local background noise, however, is added to it only along the acoustic path from the device speaker to one ear, and may further disturb perception by reaching the other ear.
- this kind of noise cannot be reduced for the local user by signal processing in the digital domain; the classic scheme can achieve this only for the remote user. The only possible solution is therefore to enhance the remote voice signal locally, in order to improve its perception when immersed in the local noisy condition.
- Another solution is to use isolating headset devices. This solution is invasive and cannot be used everywhere. A conventional solution consists of changing location, but this reduces mobility and is not always applicable. A further solution consists of using noise-canceling headsets; the drawback is that they are invasive, need an extra battery and are costly.
- an object of the present invention is to provide a method as defined in the preamble and characterized by a combination of specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, while preserving adequate signal quality and limiting overall power consumption as far as possible.
- the method is primarily intended for non-invasive devices, but it also operates on invasive devices.
- the method applies especially when no direct or indirect control is possible on the source of background noise. It applies when the microphones of the device capture the background noise but not necessarily the source of speech, which may be local as well as remote, received through a communication link and rendered through the device speaker(s).
- the field of use especially includes telecommunication devices, hearing aids devices and multimedia devices.
- At least one algorithm is used for identifying signal segments as silence, voiced or unvoiced segments (SUV).
- the unvoiced segments are processed by applying a constant amplification, given the reduced bandwidth of the voice signal and the correspondingly broad bandwidth of these unvoiced segments.
- the silence segments are simply ignored.
- a band energy adaptation is especially conceived to avoid increases in the overall power of the long voiced segments.
- the overall power is redistributed to where noise is less masking, instead of increasing it where noise is more intense, with a consequent reduction in the energy consumed.
- a certain amount of signal distortion is accepted to permit as advantage an increase in intelligibility in particular environmental conditions.
- the object of the present invention is also achieved by an assembly for implementing this method as defined in the preamble, characterized in that said assembly comprises at least one microphone, one speaker, and a data processing module designed to combine specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, while preserving adequate signal quality and limiting overall power consumption as far as possible.
- the data processing module comprises means designed to identify signal segments as silence, voiced and unvoiced segments.
- this means is at least one algorithm.
- the data processing module also comprises means designed to apply a constant amplification to said unvoiced segments, given the reduced bandwidth of the voice signal.
- the data processing module of the assembly may also comprise means designed to ignore the silence segments, and means designed to provide a band energy adaptation especially conceived to avoid increases in the overall power of the long voiced segments.
- the data processing module may comprise means designed to redistribute the overall power where noise is less masking instead of increasing it where noise is more intense, with consequent reduction in the energy consumed.
- the assembly according to the present invention may comprise means designed to make specific approximations in SUV segmentation, thresholds and band gain adjustments.
- Voice signals captured through a microphone may contain a DC (direct current) component. Since signal processing modules are often based on energy estimation, it is important to remove this DC component in order to avoid needless, very high offsets, especially with a limited numerical format (16-bit integer).
- the DC removal filter implements a simple IIR filter that removes the DC component within the telephone narrow- and wide-band range while limiting the loss of other low frequencies as far as possible.
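A minimal sketch of such a DC-blocking filter, assuming a standard one-pole high-pass form (the patent does not disclose its exact coefficients, so the value of `a` below is illustrative):

```python
def dc_remove(x, a=0.98):
    # One-pole DC-blocking high-pass IIR: y[n] = x[n] - x[n-1] + a*y[n-1].
    # The coefficient 'a' sets the cutoff; 0.98 is an illustrative value,
    # not the one used in the patent.
    y = []
    x_prev = y_prev = 0.0
    for s in x:
        y_cur = s - x_prev + a * y_prev
        y.append(y_cur)
        x_prev, y_prev = s, y_cur
    return y
```

Fed with a constant (pure DC) input, the output decays toward zero, while higher-frequency content passes through largely untouched.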
- a voice-only signal is typically composed of speech periods separated by Silence intervals. Moreover, speech periods can be subdivided into two classes, Unvoiced and Voiced sounds.
- Speech periods are those when the talker is active. Roughly speaking, a speech sound can be considered voiced if it is produced by the vibration of the vocal cords. Vowels are voiced sounds by definition. When a sound is pronounced without the vocal cords vibrating, it is called unvoiced. Only consonants can be unvoiced, but not all of them are. Silence normally refers to a period in the signal of interest where the talker is not speaking. While not containing speech, the signal in "silence" regions is most of the time rather different from zero, as it can contain many kinds of interfering signals, such as background noise, reverberation or echo.
- the SUV detection block 22 separates the signal into silence, unvoiced and voiced periods. This is normally obtained by calculating a number of selected signal features, which are then weighted and fed to a suitable decision algorithm. As the whole algorithm works on a frame-by-frame basis, as is common in signal processing for computational efficiency, this block provides signal frames as output, each frame being windowed before processing (and frames then being overlapped at the end).
- Unvoiced signals cover nearly the entire speech band, which in most cases is approximately 3.5 or 7 kHz wide (8 or 16 kHz sampling rate). This allows unvoiced portions to be boosted in a simple manner while keeping the processing cost to a minimum.
- the enhancement is obtained by applying a gain in time domain to each sample so as to increase unvoiced speech power to a level at least equal to that of the background noise power. This has the effect of increasing the power of consonants against vowels.
- the processing of the voiced part is the most expensive from a computation point of view: it requires analysis in the frequency domain.
- the frequency coefficients are preferably calculated by applying a Short-Time Fourier Transform (STFT) to the voiced speech signal. Once the coefficients are computed, they are grouped into frequency bands to reflect the nonlinear behavior of human hearing. In fact, from a psychoacoustic point of view, critical bands increase in width as frequency increases. Grouping is preferably obtained using a Bark-like scale.
- the number of critical bands has preferably been chosen to be twenty-four, which trades off enough frequency resolution for the purpose of noise estimation, noise reduction and speech enhancement against computational cost.
- the gain of each critical band is adjusted according to criteria that can result in an improvement of the overall intelligibility of voice periods of speech over noise.
- gain is increased inversely to the noise distribution across critical bands: the signal is boosted more where noise has less energy, reinforcing the SNR in bands that require a lower energy increase. The signal may even be reduced where noise is very strong, in order to preserve the overall energy level as far as possible.
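As an illustrative sketch of this redistribution idea (the exact gain rule is not given in this passage; the inverse weighting below is an assumption), a fixed total boost can be allocated across bands inversely to the per-band noise energy:

```python
def redistribute_gains(noise_band_energy, total_boost):
    # Allocate a fixed total boost across critical bands, giving more gain
    # where the noise energy is lower (illustrative inverse weighting,
    # not the patent's exact rule).
    inv = [1.0 / (1.0 + e) for e in noise_band_energy]
    norm = sum(inv)
    return [total_boost * w / norm for w in inv]
```

The total added power stays constant, so quieter-noise bands receive the larger share of the boost.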
- the frame gains are normalized depending on the power of the noise frame. If the original power of the speech frame is greater than or equal to the power of the noise frame, the energy of the signal is kept unchanged. If the power of the noise frame is greater, masking may occur; the speech frame power is then boosted to match the noise power, taking care not to reach values so high that they saturate the signal.
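A sketch of this normalization step, assuming per-frame average powers and a 16-bit output range (the saturation guard shown is one plausible implementation, not the patent's exact rule):

```python
import math

def normalize_frame(speech, noise_power, max_amp=32767):
    # Boost the speech frame so its power at least matches the noise power;
    # leave it unchanged if it is already as strong as the noise.
    n = len(speech)
    speech_power = sum(s * s for s in speech) / n
    if speech_power == 0 or speech_power >= noise_power:
        return list(speech)
    gain = math.sqrt(noise_power / speech_power)  # equalizes frame powers
    peak = max(abs(s) for s in speech)
    if peak * gain > max_amp:                     # anti-saturation guard
        gain = max_amp / peak
    return [s * gain for s in speech]
```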
- the signal is transformed back to the time domain and overlap-and-add is applied to the frames to recreate a complete signal (with silence, unvoiced and voiced parts all together again).
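The overlap-and-add reconstruction can be sketched as follows (frame length and hop size are illustrative):

```python
def overlap_add(frames, hop):
    # Reassemble a signal from overlapping frames by summing each frame
    # into the output buffer at multiples of the hop size.
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for i, frame in enumerate(frames):
        for j, s in enumerate(frame):
            out[i * hop + j] += s
    return out
```

With a suitable analysis window, the overlapping halves sum back to the original amplitude.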
- Background Noise Estimation consists of separating the background noise captured locally by the device microphone from noise-plus-speech periods. Many algorithms exist for this kind of separation.
- a voice activity detector (VAD) is preferably used here to isolate pure noise segments; the noise features are extracted as explained above by frequency transform and grouping into critical bands. The noise energy of each critical band is used by the enhancement algorithm outlined above.
- Parametric Spectral Subtraction is the core of the noise reduction algorithm that can be applied to the local speech signal before transmission to the remote peer. This part has no influence on the remote speech enhancement. The gains are calculated according to an Ephraim-Malah algorithm.
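The Ephraim-Malah rule itself involves a priori/a posteriori SNR estimates and is not reproduced here; as a simpler, hedged illustration of a parametric spectral-subtraction gain (a Berouti-style oversubtraction with a spectral floor, not the patent's exact rule):

```python
import math

def subtraction_gain(sig_power, noise_power, alpha=2.0, beta=0.01):
    # Per-bin gain G = sqrt(max(1 - alpha*N/S, beta)):
    # alpha is the oversubtraction factor, beta the spectral floor.
    # Both values are illustrative, not the patent's parameters.
    if sig_power <= 0.0:
        return math.sqrt(beta)
    return math.sqrt(max(1.0 - alpha * noise_power / sig_power, beta))
```

The floor `beta` limits musical-noise artifacts when the noise estimate exceeds the signal.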
- the proposed application preferably targets mobile device implementations. As such, important limitations are imposed by the device and CPU in comparison to theoretical solutions, and many approximations may be necessary to reduce the computational complexity while preserving the accuracy of the result.
- the DC Remove filter block 21 is applied to the audio signal frames before processing.
- a simple high-pass, fixed-point IIR filter is used. The cutoff frequency is approximately 200 Hz in narrowband and 60 Hz in wider bands.
- the log-energy is computed as:
- log-energy values may be stored in signed 7b/8b (16-bit) numbers.
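The exact fixed-point (7b/8b) formula is not given here in full; a common floating-point definition of frame log-energy, consistent with classical voiced/unvoiced/silence classifiers (an assumption, not the patent's exact formula), is:

```python
import math

def log_energy(frame, eps=1e-5):
    # 10*log10(eps + sum of squared samples); eps avoids log(0) on
    # all-zero (silence) frames.
    return 10.0 * math.log10(eps + sum(s * s for s in frame))
```

Doubling the amplitude of a frame raises its log-energy by about 6 dB.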
- Voiced sounds are more concentrated at low frequencies, so the normalized autocorrelation tends to be higher (near 1) for voiced than for unvoiced segments.
- the denominator sum is an approximation of the correct formula to avoid calculation of the square root.
- the range is of course -1 to 1 (signed 0b/15b representation).
- the number of zero-crossings is an integer value in (15b/0b) representation.
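The two features just described can be sketched as follows (floating-point versions; the fixed-point formats and the square-root-free denominator approximation mentioned above are not reproduced):

```python
import math

def zero_crossings(frame):
    # Count sign changes between consecutive samples.
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))

def norm_autocorr_lag1(frame):
    # Lag-1 normalized autocorrelation: near +1 for low-frequency (voiced)
    # frames, lower or negative for noise-like (unvoiced) frames.
    num = sum(a * b for a, b in zip(frame, frame[1:]))
    den = math.sqrt(sum(a * a for a in frame[:-1]) *
                    sum(b * b for b in frame[1:]))
    return num / den if den else 0.0
```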
- Figure 2 represents the block diagram for the SUV decision algorithm.
- a distance is computed between the actual feature vector and each of the three classes. This is done by assuming that the features of each class follow a multidimensional Gaussian distribution with known mean vectors and covariance matrices W i , corresponding respectively to the voiced, unvoiced and silence classes.
- the index i is 1, 2 or 3 for the three classes.
- Mean vectors and covariance matrices for the three classes are obtained (trained) by a given database of speech utterances. The data is segmented manually into silence, voiced and unvoiced, and then for each of these segments the three features above are calculated.
- the following procedure is used as shown in the block diagram.
- First the segment is tested for the Voiced class using the log-energy and the zero-crossing count. If the resulting distance d 1 is minimal among the three distances, and if the log-energy is higher than a given threshold, then Voiced is decided. If the log-energy is lower than the threshold, then Silence is decided.
- the threshold has to be determined empirically. The actual value of the threshold is preferably 3'900, relative to the 7b/8b format described above for log-energy precision.
- If d 1 is not minimal, then the distance d 3 from the Silence class is calculated using the autocorrelation feature only. If it is minimal, then Silence is decided, otherwise Unvoiced is decided.
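Under the Gaussian assumption above, the class distances are naturally Mahalanobis distances; a sketch (the means and inverse covariances below are illustrative placeholders, not trained values):

```python
def mahalanobis_sq(x, mean, cov_inv):
    # Squared Mahalanobis distance (x - m)^T W^{-1} (x - m) to one class.
    d = [a - b for a, b in zip(x, mean)]
    n = len(d)
    return sum(d[i] * cov_inv[i][j] * d[j]
               for i in range(n) for j in range(n))

def classify(x, class_params):
    # Pick the class with the smallest distance; class_params maps a class
    # name to its (mean vector, inverse covariance matrix) pair.
    return min(class_params,
               key=lambda c: mahalanobis_sq(x, *class_params[c]))
```

The threshold tests described above would be layered on top of this minimum-distance decision.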
- the enhancement of unvoiced segments is simply obtained by applying a gain in the time domain to each sample, to increase the signal power to a level at least equal to that of the noise power.
- the parameter T unvoiced is an adaptive threshold that avoids saturation. For each frame, the threshold is calculated as the ratio of the maximum value allowed by the chosen representation (32-bit) to the actual frame energy.
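A sketch of the unvoiced boost with its anti-saturation threshold, interpreting T unvoiced as the ratio of the representation maximum to the frame energy (this interpretation, and all names below, are assumptions):

```python
import math

def boost_unvoiced(frame, noise_power, max_val=2**31 - 1):
    # Apply a constant gain so the frame power reaches at least the noise
    # power, capped by the adaptive threshold to avoid saturation.
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return list(frame)
    power = energy / len(frame)
    gain = math.sqrt(max(1.0, noise_power / power))
    t_unvoiced = max_val / energy      # adaptive anti-saturation threshold
    gain = min(gain, t_unvoiced)
    return [s * gain for s in frame]
```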
- Bark(f) = 13.1 · arctan(0.00074 · f) + 2.24 · arctan(1.85 · 10⁻⁸ · f²) + 10⁻⁴ · f
- the number of band-pass filters, and therefore the number of critical bands, is twenty-four; the result is shown in Figure 3.
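The Bark mapping above, and the assignment of frequencies to the twenty-four bands, can be sketched as follows (the linear-in-Bark partition into equal-width bands is an assumption; only the Bark formula itself comes from the description):

```python
import math

def hz_to_bark(f):
    # Bark approximation from the description:
    # Bark(f) = 13.1*atan(0.00074 f) + 2.24*atan(1.85e-8 f^2) + 1e-4 f.
    return (13.1 * math.atan(0.00074 * f)
            + 2.24 * math.atan(1.85e-8 * f * f)
            + 1e-4 * f)

def band_index(f, n_bands=24, max_bark=24.0):
    # Map a frequency to one of n_bands critical bands, about one Bark each.
    b = hz_to_bark(f)
    return min(int(b * n_bands / max_bark), n_bands - 1)
```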
- S and W are the STFTs of signal and noise respectively.
- the threshold T has an actual value of 256.
- FIG 4 represents the block diagram of the assembly 10 according to the present invention and shows how the different elements are connected.
- the source of the voice can be either a local microphone 11 or, optionally, a telecommunication unit 12, which provides the voice of a remote speaker to a data processing module 13.
- the data processing module 13 is used to combine specific algorithms offering a perceptual improvement of the produced speech by increasing intelligibility, while preserving adequate signal quality and limiting overall power consumption as far as possible.
- the enhanced speech as produced by the data processing module 13 is played through a speaker 14.
- the telecommunication unit 12, which is optional, has the capability to connect to a remote system that is a source of speech, especially a telecommunication device.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP07405332A EP2063420A1 (fr) | 2007-11-26 | 2007-11-26 | Procédé et assemblage pour améliorer l'intelligibilité de la parole |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP07405332A EP2063420A1 (fr) | 2007-11-26 | 2007-11-26 | Procédé et assemblage pour améliorer l'intelligibilité de la parole |
Publications (1)
Publication Number | Publication Date |
---|---|
EP2063420A1 true EP2063420A1 (fr) | 2009-05-27 |
Family
ID=39148654
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP07405332A Withdrawn EP2063420A1 (fr) | 2007-11-26 | 2007-11-26 | Procédé et assemblage pour améliorer l'intelligibilité de la parole |
Country Status (1)
Country | Link |
---|---|
EP (1) | EP2063420A1 (fr) |
Non-Patent Citations (4)
Title |
---|
ATAL B S ET AL: "A PATTERN RECOGNITION APPROACH TO VOICED-UNVOICED-SILENCE CLASSIFICATION WITH APPLICATIONS TO SPEECH RECOGNITION", IEEE TRANSACTIONS ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, IEEE INC. NEW YORK, US, vol. ASSP-24, no. 3, June 1976 (1976-06-01), pages 201 - 212, XP009040248, ISSN: 0096-3518 * |
BEROUTI M ET AL: "ENHANCEMENT OF SPEECH CORRUPTED BY ACOUSTIC NOISE", INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH & SIGNAL PROCESSING. ICASSP. WASHINGTON, APRIL 2 - 4, 1979, NEW YORK, IEEE, US, vol. CONF. 4, 1979, pages 208 - 211, XP001079151 * |
LEE S.H ET AL: "Real Time Speech Intelligibility enhancement based on the background noise analysis", PROCEEDINGS OF FOURTH IASTED "INTERNATIONAL CONFERENCE SIGNAL PROCESSING, PATTERN RECOGNITION AND APPLICATIONS", 14 February 2007 (2007-02-14), INNSBRUCK, AUSTRIA, pages 287 - 292, XP002472964 * |
VIRAG N: "Speech enhancement based on masking properties of the auditory system", ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, 1995. ICASSP-95., 1995 INTERNATIONAL CONFERENCE ON DETROIT, MI, USA 9-12 MAY 1995, NEW YORK, NY, USA,IEEE, US, vol. 1, 9 May 1995 (1995-05-09), pages 796 - 799, XP010625353, ISBN: 0-7803-2431-5 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2471064A1 (fr) * | 2009-08-25 | 2012-07-04 | Nanyang Technological University | Procédé et système pour reconstruire une parole à partir d'un signal d'entrée comprenant des chuchotements |
EP2471064A4 (fr) * | 2009-08-25 | 2014-01-08 | Univ Nanyang Tech | Procédé et système pour reconstruire une parole à partir d'un signal d'entrée comprenant des chuchotements |
CN106060714A (zh) * | 2016-05-26 | 2016-10-26 | 惠州华阳通用电子有限公司 | 一种降低音源噪声的控制方法及装置 |
CN113192507A (zh) * | 2021-05-13 | 2021-07-30 | 北京泽桥传媒科技股份有限公司 | 一种基于语音识别的资讯检索方法及系统 |
CN113192507B (zh) * | 2021-05-13 | 2022-04-29 | 北京泽桥传媒科技股份有限公司 | 一种基于语音识别的资讯检索方法及系统 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8712074B2 (en) | Noise spectrum tracking in noisy acoustical signals | |
EP1739657B1 (fr) | Amélioration d'un signal de parole | |
US6263307B1 (en) | Adaptive weiner filtering using line spectral frequencies | |
EP2416315B1 (fr) | Dispositif suppresseur de bruit | |
US9064502B2 (en) | Speech intelligibility predictor and applications thereof | |
EP3038106B1 (fr) | Amélioration d'un signal audio | |
US7492889B2 (en) | Noise suppression based on bark band wiener filtering and modified doblinger noise estimate | |
EP0993670B1 (fr) | Procede et appareil d'amelioration de qualite de son vocal dans un systeme de communication par son vocal | |
Kim et al. | Nonlinear enhancement of onset for robust speech recognition. | |
US20100198588A1 (en) | Signal bandwidth extending apparatus | |
US20120263317A1 (en) | Systems, methods, apparatus, and computer readable media for equalization | |
US20080312916A1 (en) | Receiver Intelligibility Enhancement System | |
US10176824B2 (en) | Method and system for consonant-vowel ratio modification for improving speech perception | |
EP3757993B1 (fr) | Prétraitement de reconnaissance automatique de parole | |
Garg et al. | A comparative study of noise reduction techniques for automatic speech recognition systems | |
US20120004907A1 (en) | System and method for biometric acoustic noise reduction | |
Itoh et al. | Environmental noise reduction based on speech/non-speech identification for hearing aids | |
US7917359B2 (en) | Noise suppressor for removing irregular noise | |
Jaiswal et al. | Implicit wiener filtering for speech enhancement in non-stationary noise | |
EP2151820B1 (fr) | Procédé pour la compensation de biais pour le lissage cepstro-temporel de gains de filtre spectral | |
CN109102823B (zh) | 一种基于子带谱熵的语音增强方法 | |
EP2063420A1 (fr) | Procédé et assemblage pour améliorer l'intelligibilité de la parole | |
Flynn et al. | Combined speech enhancement and auditory modelling for robust distributed speech recognition | |
GB2336978A (en) | Improving speech intelligibility in presence of noise | |
Defraene et al. | A psychoacoustically motivated speech distortion weighted multi-channel Wiener filter for noise reduction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU LV MC MT NL PL PT RO SE SI SK TR |
|
AX | Request for extension of the european patent |
Extension state: AL BA HR MK RS |
|
AKX | Designation fees paid | ||
REG | Reference to a national code |
Ref country code: DE Ref legal event code: 8566 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20091128 |