WO2004059614A2 - Method and apparatus for enhancing the perceptual quality of synthesized speech signals - Google Patents
- Publication number
- WO2004059614A2 (application PCT/DK2003/000917)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- speech signal
- speech
- signal
- filter
- component
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/15—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being formant information
Definitions
- the present invention relates to a method and apparatus for enhancing the perceptual quality of synthesized speech signals.
- Noise reduction of a noise-corrupted target speech signal is obtained through dynamic smoothing or filtration of time-varying speech model parameters, so-called f, b, g parameters, based on a priori knowledge of human speech production.
- the filtration of the f, b, g parameters may take into account typical maximum and minimum frequencies for formants of human speech, durations of voiced sounds such as phonemes, or unvoiced sounds, and/or take into account frequency differences between formants of a present speech signal and a previous speech signal.
- the synthesized speech signals may originate from a preceding model based noise reduction process which may comprise deriving a so-called f, b, g-model representing respective frequencies, bandwidths and gains of formants in a segment of a noise-corrupted target speech signal, individually smoothing the f, b, g parameters over time, and re-synthesizing a noise-reduced version of the target speech signal.
- a speech synthesis method in accordance with the present invention may be utilized to improve the perceptual quality of speech that has been subjected to the noise reduction methodology disclosed in WO 00/72305.
- the present invention may also be utilized to improve the perceptual quality of synthesized speech in general for applications such as text-to-speech systems and voice-recognition and response systems.
- the present invention relates to a method of synthesizing speech signals, an apparatus for implementing the method as well as a computer programme product for enabling a programmable computer to perform the method.
- the time-varying synthesis filter comprises one or several filter section(s) of fractional order, i.e. filter sections having a non-integer order.
- synthesis filters employed for synthesis of parametric speech data have been of integer order, and often been constituted by a composition of paralleled or cascaded second order filter sections.
- the speech synthesis filter may in the latter situation be composed of a set of cascaded or parallel second order filters or filter sections as disclosed in lines 12-16 of page 3 of WO 00/72305.
- a significant problem associated with the use of filter sections of integer order in the synthesis filter is that individual formants, such as a second and a third formant, of the speech signal to be synthesized, or resynthesized, tend to overlap or fuse. This overlap arises because formants of a speech segment are often too closely spaced in frequency to avoid a substantial overlap with the frequency response roll-off rates provided by traditional second order filter sections. This fusing of individual formants of the speech segment leads to reduced intelligibility and/or perceptual quality of the synthesized speech signal. Listening experiments have shown that noise-free speech signals are very well modelled by all-pole transfer functions for the vocal tract, but problems are encountered in the case of noise-corrupted speech signals. Desired speech signal manipulation through the f, b, g parameters, such as manipulation motivated by speech enhancement, transposition, noise cancellation etc., eventually leads to non-all-pole speech models and therefore a need to continually vary filter section orders as functions of their respective bandwidths.
- a solution to the above-mentioned technical problem in accordance with the present invention is to include at least one, but preferably several, filter section(s) of fractional order in the speech synthesis filter.
- the fractional order filter or filters are preferably digital filters and designed with a 3 dB bandwidth that represents the formants to be modelled.
- the order of the at least one fractional order filter may advantageously lie between 2 and 6 or more preferably between 2 and 4 such as between 2.5 and 3.5.
- the generator signal component of the speech signal comprises a transient part of the speech signal and preferably additionally comprises glottal pulse components and unvoiced components of the speech signal.
- the generator signal component may have been determined from the speech signal via conventional techniques known in the art such as inverse filtering. In case of heavy noise contamination of the target speech signal, the generator signal component may have been composed of a synthetic glottal pulse combined with the proper pitch period, thereby removing/reducing the inevitable noise in the inverse filter based generator signal.
- the quasi-stationary component of the speech signal has preferably been determined by a step of mapping the speech signal into a LPC model of predetermined order, such as an order of 8 or 10 or 12.
- the quasi-stationary component of the speech signal and/or the generator signal component for synthesis of the speech signal may be stored in digital form in electronic memory means such as a RAM, ROM, EPROM, EEPROM, Flash memory and/or on a magnetic or optical memory disc.
- electronic memory means which hold the quasi-stationary component of the speech signal and/or the generator signal component may form part of a Personal Computer, hand-held computer, mobile or cellular phone, headset, hearing prosthesis etc. adapted for reproducing the synthesized speech signal to a listener through suitable loudspeaker means.
- the method according to the present invention may be fully or partly implemented by a software program running on a programmable signal processor, such as a microprocessor and/or industrial or proprietary Digital Signal Processor (DSP).
- the software program may be loaded at run-time from a nonvolatile memory of the apparatus to a suitable Program RAM storage space and then executed from the Program RAM.
- the apparatus according to the present invention may comprise a portable and battery powered communication device such as a hearing prosthesis, headset, handset, mobile or cellular phone etc.
- the LPC model of the quasi-stationary component of the speech signal is subjected to pseudo-decomposition into second order and fractional order filter sections having respective f, b, g parameter sets representing respective formant frequencies, formant bandwidths and formant gains of a segment of the speech signal.
- a significant advantage of the f, b, g representation of the speech signal is that this representation is closely coupled to physiologically identifiable, time-varying parameters, of the vocal tract of the individual producing the synthesized speech signal in question.
- the present methodology comprises a preceding speech analysis stage during which the generator signal component and the quasi-stationary component of the speech signal are derived.
- This embodiment is particularly useful during real-time processing of incoming signals such as microphone input signals to a hearing prosthesis or headset.
- the incoming signal may be subjected to non-linear or linear signal processing between the analysis and synthesis steps in order to improve the audibility, intelligibility and/or comfort of the incoming signal before a synthesized resulting signal is provided to a user, e.g. a hearing impaired user.
- the methodology comprises the above-mentioned speech analysis stage and a noise reduction step during which the f, b, g parameters are subjected to band pass or low pass filtering with a filter having a cut-off frequency below 15 Hz or 10 Hz.
- the filtering step comprises individually filtering each parameter of each set of f, b, g parameters with respective band pass filters having a pass band between 1 and 10 Hz.
- the noise reduction process may additionally comprise a step of inserting a synthetic glottal pulse signal in at least a part of the generator signal component of the speech signal.
- a noise-robust pitch detector of conventional design may be used to determine a pitch of the speech signal and the processor may use such information to select a preferred period of the synthetic glottal pulse signal.
- a noise-reduction method is particularly well suited to work in cooperation with the noise reduction method and apparatus disclosed in international patent application WO 00/72305 owned by the present applicant.
- the invention disclosed in this prior art document relates to a model based noise reduction method and apparatus.
- the present invention may operate to improve the noise-reduction results obtained with the prior art method by further reducing artefacts in the target speech signal and improving the suppression of the noise component in the target speech signal.
- the improvement is obtained by including statistical modelling of phoneme development processes over time based on one or more Hidden Markov Models that have been adapted to model f, b, g parameter development over time. Since such statistical properties of phoneme development may be language specific, HMMs for a particular language may be provided in apparatuses that are adapted to implement the present invention on a particular market.
- Fig. 1 shows an empirically determined relationship between the order, x, of a section of a speech synthesis filter and the bandwidth of the formant frequency to be modelled by the filter section.
- Fig. 2 illustrates two different exemplary magnitude responses of a particular section of a speech synthesis filter.
- Response 1 is obtained by using a prior art second order filter section to model a formant located at 400 Hz.
- Response 2 is obtained by using a fractional order filter of order 3.4 to model the same formant. Note that the bandwidth of the formant is approximately equal for both filter responses, but that the undesired spread in frequency of the "skirts" of the filter is much smaller for the fractional/higher order filter.
- the speech synthesis filter comprises at least one fractional order digital filter that models a formant of a segment of the speech signal.
- Formants with small bandwidths, such as bandwidths less than 50 Hz or 30 Hz, are preferably represented by filter sections of relatively low order, while formants with larger bandwidths may utilize a progressively rising filter order for the filter section in question.
- Figure 1 illustrates a preferred relation between filter section order, x, and formant bandwidth, BW.
- the preferred goal of the synthesis filter is to obtain the target 3 dB bandwidth of a standard second order model filter, substantially independent of the order, x, selected for the fractional order section.
- the bandwidth of the filter section shall preferably be modified according to the below-mentioned formula:
- $\text{Amplitude} = \left(\dfrac{1}{1.0 + y^{2}}\right)^{0.5}$
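The amplitude expression at the end of the list above reaches 1/sqrt(2), i.e. the 3 dB point, at y = 1. That suggests y is a frequency offset normalised to half the formant bandwidth, but the text does not define y, so this reading is an assumption. The Python sketch below illustrates one hypothetical way to obtain a section of non-integer order x: raise the second order amplitude to the power x/2 and widen the underlying section so that the 3 dB bandwidth stays at the second order target. The names `bandwidth_scale` and `fractional_amplitude`, and the correction factor itself, are introduced for this illustration only and are not taken from the application.

```python
import math


def second_order_amplitude(y: float) -> float:
    """Amplitude (1 / (1 + y^2))^0.5 as given in the text.

    Assumption: y is the frequency offset from the formant centre,
    normalised so that y = 1 is the 3 dB point.
    """
    return (1.0 / (1.0 + y * y)) ** 0.5


def bandwidth_scale(order: float) -> float:
    """Hypothetical widening factor that keeps the 3 dB point at y = 1.

    Derived for this sketch by solving
    second_order_amplitude(1 / s) ** (order / 2) == 1 / sqrt(2) for s.
    """
    return 1.0 / math.sqrt(2.0 ** (2.0 / order) - 1.0)


def fractional_amplitude(y: float, order: float) -> float:
    """Assumed amplitude of a section of (possibly non-integer) order."""
    s = bandwidth_scale(order)
    return second_order_amplitude(y / s) ** (order / 2.0)


if __name__ == "__main__":
    # Compare a second order section with an order 3.4 section
    # (the order used for response 2 in Fig. 2).
    for y in (0.5, 1.0, 2.0, 5.0, 10.0):
        print(y, second_order_amplitude(y), fractional_amplitude(y, 3.4))
```

Under these assumptions both sections share the same 3 dB bandwidth, while the order 3.4 section falls off faster away from the formant centre, which is the qualitative behaviour described for the "skirts" in Fig. 2.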
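The noise reduction step described above filters each f, b, g parameter track over time with a cut-off below 10 Hz or 15 Hz. A minimal sketch of that idea follows, assuming a one-pole low-pass filter and a parameter frame rate of 100 frames per second; neither the filter type nor the frame rate is stated in the text, and `lowpass_track` and `smooth_fbg` are hypothetical names.

```python
import math


def lowpass_track(track, cutoff_hz=10.0, frame_rate_hz=100.0):
    """Smooth one parameter track with a one-pole low-pass filter.

    Assumptions: the track holds one value per analysis frame, and the
    frame rate (frames per second) is known. Both defaults are illustrative.
    """
    if not track:
        return []
    alpha = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / frame_rate_hz)
    state = float(track[0])  # start from the first value to avoid a start-up jump
    smoothed = []
    for value in track:
        state += alpha * (value - state)
        smoothed.append(state)
    return smoothed


def smooth_fbg(fbg_tracks, cutoff_hz=10.0, frame_rate_hz=100.0):
    """Filter each of the f, b and g tracks individually."""
    return {
        name: lowpass_track(values, cutoff_hz, frame_rate_hz)
        for name, values in fbg_tracks.items()
    }


if __name__ == "__main__":
    # A formant frequency track with frame-to-frame jitter around 500 Hz.
    jittery_f = [500.0 + (20.0 if i % 2 else -20.0) for i in range(200)]
    result = smooth_fbg({"f": jittery_f, "b": [60.0] * 200, "g": [1.0] * 200})
    print(max(result["f"][100:]) - min(result["f"][100:]))
```

A band pass filter with a pass band between 1 and 10 Hz, as mentioned in the text, would additionally remove very slow drift; the low-pass form is used here only to keep the sketch short.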
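The text describes conventional synthesis filters as cascaded or parallel second order sections, each tied to a formant frequency f and bandwidth b. As an illustration, the sketch below uses the standard digital resonator relations between a complex pole pair and (f, b); this is a textbook mapping, not necessarily the pseudo-decomposition used in the application. The sampling rate of 8000 Hz and the function names are assumptions.

```python
import cmath
import math


def resonator_coefficients(f_hz, b_hz, fs_hz):
    """Denominator coefficients (a1, a2) of 1 + a1*z^-1 + a2*z^-2.

    Standard resonator mapping: pole radius from the bandwidth,
    pole angle from the formant frequency.
    """
    r = math.exp(-math.pi * b_hz / fs_hz)
    theta = 2.0 * math.pi * f_hz / fs_hz
    return -2.0 * r * math.cos(theta), r * r


def formant_from_coefficients(a1, a2, fs_hz):
    """Recover (f, b) in Hz from a second order section with complex poles."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a2)   # purely imaginary for a resonance
    pole = (-a1 + disc) / 2.0               # upper half-plane pole
    f_hz = abs(cmath.phase(pole)) * fs_hz / (2.0 * math.pi)
    b_hz = -math.log(abs(pole)) * fs_hz / math.pi
    return f_hz, b_hz


if __name__ == "__main__":
    fs = 8000.0  # assumed sampling rate
    a1, a2 = resonator_coefficients(400.0, 60.0, fs)  # formant at 400 Hz, as in Fig. 2
    print(a1, a2, formant_from_coefficients(a1, a2, fs))
```

The round trip shows that a second order section is fully described by its f and b values, so operations such as smoothing can be applied to those parameters directly before the filter coefficients are rebuilt.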
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Quality & Reliability (AREA)
- Signal Processing (AREA)
- Transmission Systems Not Characterized By The Medium Used For Transmission (AREA)
- Compression, Expansion, Code Conversion, And Decoders (AREA)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2003287927A AU2003287927A1 (en) | 2002-12-31 | 2003-12-19 | A method and apparatus for enhancing the perceptual quality of synthesized speech signals |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DKPA200202019 | 2002-12-31 | | |
DKPA200202019 | 2002-12-31 | | |
Publications (2)
Publication Number | Publication Date |
---|---|
WO2004059614A2 (fr) | 2004-07-15 |
WO2004059614A3 (fr) | 2004-09-23 |
Family
ID=32668640
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/DK2003/000917 WO2004059614A2 (fr) | Method and apparatus for enhancing the perceptual quality of synthesized speech signals | 2002-12-31 | 2003-12-19 |
Country Status (2)
Country | Link |
---|---|
AU (1) | AU2003287927A1 (fr) |
WO (1) | WO2004059614A2 (fr) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO1999001942A2 (fr) * | 1997-07-01 | 1999-01-14 | Partran Aps | Method of noise reduction in speech signals and apparatus for performing the method |
WO2000072305A2 (fr) * | 1999-05-19 | 2000-11-30 | Noisecom Aps | Method and device for noise reduction in speech signals |
-
2003
- 2003-12-19 AU AU2003287927A patent/AU2003287927A1/en not_active Abandoned
- 2003-12-19 WO PCT/DK2003/000917 patent/WO2004059614A2/fr not_active Application Discontinuation
Non-Patent Citations (1)
Title |
---|
HARTMANN U ET AL: "Model based spectral subtraction used for noise suppression in speech with low SNR", NORSIG2000, Nordic Signal Processing Symposium, Vildmarkshotellet Kolmarden, Sweden, 13-15 June 2000, pages 129-132, XP002278902, Linkoping Univ., Linkoping, Sweden, ISBN: 91-7219-789-7 * |
Also Published As
Publication number | Publication date |
---|---|
AU2003287927A8 (en) | 2004-07-22 |
AU2003287927A1 (en) | 2004-07-22 |
WO2004059614A3 (fr) | 2004-09-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US8265940B2 (en) | Method and device for the artificial extension of the bandwidth of speech signals | |
US8311842B2 (en) | Method and apparatus for expanding bandwidth of voice signal | |
US6889182B2 (en) | Speech bandwidth extension | |
EP2375785B1 (fr) | Stability improvements in hearing aids | |
JP3653826B2 (ja) | Speech decoding method and apparatus | |
US20020128839A1 (en) | Speech bandwidth extension | |
WO2008032828A1 (fr) | Audio encoding device and audio encoding method | |
WO2001056021A1 (fr) | System and method for modification of speech signals | |
TW201308316A (zh) | Adaptive voice intelligibility processor | |
EP1333700A2 (fr) | Method for frequency transposition in a hearing prosthesis, and such a hearing prosthesis | |
EP1008984A2 (fr) | Wideband speech synthesis from a narrowband speech signal | |
DE102008031150B3 (de) | Method for noise suppression and associated hearing aid | |
US20080027708A1 (en) | Method and system for FFT-based companding for automatic speech recognition | |
EP2675191A2 (fr) | Frequency translation in hearing assistance devices using additive spectral synthesis | |
WO2004059614A2 (fr) | Method and apparatus for enhancing the perceptual quality of synthesized speech signals | |
EP0570362B1 (fr) | Digitized speech decoder using a postfilter with reduced spectral distortion | |
EP4133482A1 (fr) | Reduced-bandwidth speech enhancement with bandwidth extension | |
WO2010078938A2 (fr) | Method and device for processing acoustic voice signals | |
JP3197975B2 (ja) | Pitch control method and apparatus | |
JP6159570B2 (ja) | Speech enhancement device and program | |
JPH08110796A (ja) | Speech enhancement method and apparatus | |
JP2001249676A (ja) | Method for extracting the fundamental period or fundamental frequency of a periodic waveform with added noise | |
Fulop et al. | Signal Processing in Speech and Hearing Technology | |
EP2506254A1 (fr) | Method for improving speech intelligibility with a hearing aid, and hearing aid | |
US11967334B2 (en) | Method for operating a hearing device based on a speech signal, and hearing device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
AK | Designated states | Kind code of ref document: A2; Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW |
AL | Designated countries for regional patents | Kind code of ref document: A2; Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (PCT application filed before 20040101) | |
121 | EP: the EPO has been informed by WIPO that EP was designated in this application | |
122 | EP: PCT application non-entry in European phase | |
NENP | Non-entry into the national phase | Ref country code: JP |
WWW | WIPO information: withdrawn in national office | Country of ref document: JP |