EP1078354A1 - Method and device for determining spectral voice characteristics in a spoken expression - Google Patents
Method and device for determining spectral voice characteristics in a spoken expression
- Publication number
- EP1078354A1 (application number EP99929088A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- transformation
- utterance
- speaker
- speech
- wavelet transformation
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/06—Elementary speech units used in speech synthesisers; Concatenation rules
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
Definitions
- The invention relates to a method and an arrangement for determining spectral speech characteristics in a spoken utterance.
- A wavelet transformation is known from [1].
- A wavelet filter ensures that the high-pass component and the low-pass component of a subsequent transformation stage completely restore the signal of the current transformation stage.
- The resolution of the high-pass or low-pass component is reduced from one transformation stage to the next (technical term: "subsampling").
- Because of the subsampling, the number of transformation stages is finite.
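The two properties above (complete restoration from the high-pass and low-pass components, and a finite number of stages because of subsampling) can be illustrated with a minimal sketch. It assumes a Haar filter pair as a stand-in, since the text does not reproduce the wavelet filter actually used; the function names are hypothetical.

```python
import math

SCALE = 1 / math.sqrt(2)  # Haar normalization (assumed filter, not the patent's)

def haar_analysis(signal):
    """One transformation stage: each output has half as many values (subsampling)."""
    low = [(signal[i] + signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) * SCALE for i in range(0, len(signal), 2)]
    return low, high

def haar_synthesis(low, high):
    """Inverse of one stage: restore the signal of the previous stage."""
    restored = []
    for l, h in zip(low, high):
        restored.append((l + h) * SCALE)
        restored.append((l - h) * SCALE)
    return restored

def max_stages(num_values):
    """Number of halvings until one value is left (num_values is a power of two)."""
    return num_values.bit_length() - 1
```

With a length-8 input, `haar_analysis` returns two lists of four values each, `haar_synthesis` restores the input up to rounding, and `max_stages(8)` gives three stages.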
- The object of the invention is to specify a method and an arrangement for determining spectral speech characteristics in a spoken utterance.
- To achieve this object, a method for determining spectral speech characteristics in a spoken utterance is specified.
- The spoken utterance is digitized and subjected to a wavelet transformation.
- The speaker-specific characteristics are determined on the basis of different transformation stages of the wavelet transformation.
- In the wavelet transformation, the utterance is split by means of a high-pass filter and a low-pass filter, and the different high-pass or low-pass components of different transformation stages contain speaker-specific characteristics.
- The individual high-pass or low-pass components of different transformation stages stand for predetermined speaker-specific characteristics; both the high-pass component and the low-pass component of a given transformation stage, that is to say the respective characteristic, can be modified separately from the other characteristics. If, in the inverse wavelet transformation, the signal is reassembled from the high-pass and low-pass components of the individual transformation stages, this ensures that exactly the desired characteristic has been changed. It is thus possible to change specified characteristics of the utterance without influencing the rest of the utterance.
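The idea of changing one stage and leaving the others untouched can be sketched as follows. This is an illustration under assumptions: it uses a cascaded Haar transform rather than the wavelet of the patent, and the scaling factor applied to stage 2 is arbitrary.

```python
import math

SCALE = 1 / math.sqrt(2)  # Haar normalization (assumed filter)

def decompose(signal):
    """Cascade the stages; return (final low-pass, [high-pass of stage 1, 2, ...])."""
    low, highs = list(signal), []
    while len(low) > 1:
        highs.append([(low[i] - low[i + 1]) * SCALE for i in range(0, len(low), 2)])
        low = [(low[i] + low[i + 1]) * SCALE for i in range(0, len(low), 2)]
    return low, highs

def reconstruct(low, highs):
    """Inverse transformation: merge the stages from the last one back to the first."""
    low = list(low)
    for high in reversed(highs):
        merged = []
        for l, h in zip(low, high):
            merged += [(l + h) * SCALE, (l - h) * SCALE]
        low = merged
    return low

def scale_stage(highs, stage_index, factor):
    """Return a copy of the high-pass components with one stage scaled."""
    changed = [list(h) for h in highs]
    changed[stage_index] = [factor * v for v in changed[stage_index]]
    return changed
```

Decomposing the reconstructed signal again shows that only the scaled stage differs; the remaining high-pass components and the final low-pass component come back unchanged, which is the separation the text relies on.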
- In one embodiment, the utterance is windowed before the wavelet transformation, that is to say a predetermined number of samples is cut out, and transformed into the frequency domain.
- A Fast Fourier Transform (FFT) is used in particular for this purpose.
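A minimal sketch of this preprocessing step is shown here. The Hann taper is an assumption (the text only says that samples are cut out), and a direct DFT stands in for the FFT so that the example needs no external library; it produces the same values as an FFT, only more slowly.

```python
import cmath
import math

def cut_window(samples, start, length):
    """Cut out `length` samples and apply a Hann taper (the taper is an assumption)."""
    frame = samples[start:start + length]
    n = len(frame)
    if n < 2:
        return list(frame)
    return [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
            for i, x in enumerate(frame)]

def dft_magnitude(frame):
    """Magnitude spectrum of the frame for the bins 0 .. n/2 - 1."""
    n = len(frame)
    return [abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame)))
            for k in range(n // 2)]
```

The resulting magnitude values are what a subsequent wavelet transformation would operate on in this embodiment.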
- In another embodiment, the high-pass component of a transformation stage is divided into a real part and an imaginary part.
- The high-pass component of the wavelet transformation corresponds to the difference signal between the current low-pass component and the low-pass component of the previous transformation stage.
- In a further development, the number of transformation stages to be carried out is determined by the last transformation stage, which consists of low-pass filters connected in series, containing the constant component of the utterance. The signal as a whole can then be represented by its wavelet coefficients. This corresponds to the complete transformation of the information in the signal section into wavelet space.
- The difference signal remains as the high-pass component of a transformation stage, as explained above. If the difference signals (high-pass components) are accumulated over the transformation stages, the last transformation stage yields, as the cumulative high-pass component, the information of the spoken utterance without its constant component.
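The relationship between difference signals and the constant component can be checked with a small sketch. It assumes a crude low-pass (block averaging over 2, 4, 8, ... samples, without subsampling) and takes each high-pass component as the previous low-pass minus the current one; both choices are illustrative and not taken from the patent.

```python
def block_lowpass(signal, block):
    """Replace each run of `block` samples by its mean (a crude low-pass)."""
    out = []
    for i in range(0, len(signal), block):
        chunk = signal[i:i + block]
        out += [sum(chunk) / len(chunk)] * len(chunk)
    return out

def difference_stages(signal):
    """Return the low-pass of every stage and the difference signals between them."""
    lows = [list(signal)]
    block = 2
    while block <= len(signal):
        lows.append(block_lowpass(signal, block))
        block *= 2
    highs = [[p - c for p, c in zip(prev, cur)]
             for prev, cur in zip(lows, lows[1:])]
    return lows, highs

def cumulative_highpass(highs):
    """Sum the difference signals of all stages sample by sample."""
    return [sum(column) for column in zip(*highs)]
```

For a signal whose length is a power of two, the last low-pass stage holds only the mean (the constant component), and the accumulated difference signals equal the signal minus that mean.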
- The speaker-specific characteristics can be identified as:
- a) Fundamental frequency: the transformation reveals the fundamental frequency of the utterance. The fundamental frequency indicates whether the speaker is a man or a woman.
- b) Spectral envelope: the spectral envelope contains information about the transfer function of the vocal tract during articulation. In a voiced region, the spectral envelope is dominated by the formants. The high-pass component of a higher transformation stage of the wavelet transformation contains this spectral envelope.
- c) Smokiness: the smokiness of a voice becomes visible as a negative slope in the course of the penultimate low-pass component.
- The speaker-specific characteristics a) to c) are of great importance in speech synthesis.
- Concatenative speech synthesis uses large quantities of recorded natural utterances, from which excerpts are cut out and later put together to form a new word (synthesized speech).
- Discontinuities between the concatenated sounds are disadvantageous because the human ear perceives them as unnatural.
- One advantage of the invention is that the spectral envelope reflects the speaker's articulation tract and, unlike a pole-position model for example, is not based on formants. Furthermore, because the wavelet transformation is a nonparametric representation, no data is lost; the utterance can always be completely reconstructed.
- The data resulting from the individual transformation stages of the wavelet transformation are linearly independent of one another; they can therefore be influenced separately and later recombined, without loss, into the modified utterance.
- Furthermore, an arrangement for determining spectral speech characteristics is specified, which has a processor unit set up such that an utterance can be digitized.
- The utterance is then subjected to a wavelet transformation, and speaker-specific characteristics are determined using different transformation stages.
- Fig. 1 a wavelet function
- Fig. 1 shows a wavelet function, which is defined in terms of the following quantities:
- f is the frequency,
- σ is a standard deviation,
- c is a given normalization constant.
- The standard deviation σ is determined by the predeterminable position of the sideband minimum 101 in Fig. 1.
- $$\tilde{\psi}(f) = \psi(f) + j \cdot H\{\psi(f)\} \qquad (2)$$
- $\tilde{\psi}$ denotes the conjugate complex wavelet function.
- FIG. 3 shows the cascaded application of the wavelet transformation.
- A signal 301 is filtered both by a high pass HP1 302 and by a low pass TP1 305. In particular, subsampling takes place, i.e. the number of values to be stored is reduced for each filter.
- An inverse wavelet transformation ensures that the original signal 301 can be reconstructed from the low-pass component TP1 305 and the high-pass component HP1 302.
- The high pass HP1 302 filters separately for the real part Re1 303 and the imaginary part Im1 304.
- The signal 310 after the low-pass filter TP1 305 is again filtered both by a high pass HP2 306 and by a low pass.
- The high pass HP2 306 again comprises a real part Re2 307 and an imaginary part Im2 308.
- The signal 311 after the second transformation stage is filtered again, and so on.
- FIG. 4 shows various transformation stages of the wavelet transformation, divided into low-pass components (FIGS. 4A, 4C and 4E) and high-pass components (FIGS. 4B, 4D and 4F).
- The fundamental frequency of the spoken utterance, and thus of the speaker, can be seen from the high-pass component in FIG. 4B.
- On the basis of the fundamental frequency, utterances can be adapted to one another in speech synthesis, or suitable utterances can be determined from a database of predefined utterances.
- The formants of the speech signal section appear as pronounced minima and maxima (the length of the speech signal section corresponds approximately to twice the fundamental frequency).
- The formants represent resonance frequencies of the speaker's vocal tract. Because the formants can be represented clearly, suitable sound components can be adapted and/or selected in concatenative speech synthesis.
- The smokiness of a voice can be determined from the low-pass component of the penultimate transformation stage (with 256 frequency values in the original signal: TP7).
- The descent of the curve between the maximum Mx and the minimum Mi indicates the degree of smokiness.
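As a rough illustration of the smokiness measure, the sketch below computes the slope between a maximum and a following minimum of a low-pass curve. How Mx and Mi are actually located is not stated in the text; here Mx is assumed to be the global maximum and Mi the lowest point after it. The helper `penultimate_stage` assumes that every stage halves the number of values, which reproduces the TP7 example for 256 values.

```python
def penultimate_stage(num_values):
    """Index of the penultimate stage when each stage halves the value count."""
    total_stages = num_values.bit_length() - 1  # log2 for a power of two
    return total_stages - 1

def descent_between_extrema(curve):
    """Slope from the maximum Mx to the lowest following point Mi (assumed choice)."""
    ix = max(range(len(curve)), key=lambda i: curve[i])
    if ix == len(curve) - 1:
        return 0.0  # no values after the maximum, so no descent to measure
    im = min(range(ix + 1, len(curve)), key=lambda i: curve[i])
    return (curve[im] - curve[ix]) / (im - ix)
```

A more negative return value corresponds to a steeper descent; mapping that value to a perceptual degree of smokiness would need calibration that the text does not describe.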
- The three speaker-specific characteristics mentioned are thus identified and can be influenced in a targeted manner for speech synthesis. It is particularly important that, in the inverse wavelet transformation, manipulating a single speaker-specific characteristic influences only that characteristic; the other perceptually relevant variables remain unaffected. In this way, the fundamental frequency can be adjusted in a targeted manner without affecting the smokiness of the voice.
- Another option is to select a suitable sound section for concatenation with another sound section, where the two sound sections were originally recorded from different speakers in different contexts.
- A suitable sound section can be found because the characteristics provide known criteria that allow sound sections to be compared with one another, so that the suitable sound section can be selected automatically according to given specifications.
- A database is created from a predetermined amount of naturally spoken language from different speakers; the sound sections in the naturally spoken language are identified and stored. The database thus holds numerous representatives of the different sound sections of a language, which can be accessed.
- The sound sections are in particular phonemes of a language or sequences of such phonemes. The smaller the sound section, the greater the possibilities for combining them into new words. For example, the German language contains a predetermined set of approximately 40 phonemes.
- Discontinuities that are perceived by the human ear as unnatural and "synthetic" can be avoided.
- The sound sections come from different speakers and thus have different speaker-specific characteristics.
- FIG. 5 shows two sounds A 507 and B 508, each of which has individual sound sections 505 and 506, for example.
- The sounds A 507 and B 508 each come from a spoken utterance, and the sound A 507 differs clearly from the sound B 508.
- A dividing line 509 indicates where the sound A 507 is to be linked with the sound B 508. In the present case, the first three sound sections of sound A 507 are to be concatenated with the last three sound sections of sound B 508.
- A temporal stretching or compression (see arrow 503) of the successive sound sections is carried out along the dividing line 509 in order to reduce the discontinuous impression at the transition 509.
- One variant consists in an abrupt transition between the sounds along the dividing line 509. However, this leads to the discontinuities mentioned, which human hearing perceives as disturbing. If, on the other hand, a sound C is assembled in such a way that the sound sections within a transition area 501 or 502 are taken into account, a spectral distance between two mutually assignable sound sections is adapted within the respective transition area 501 or 502 (gradual transition between the sounds).
- In particular, the Euclidean distance in wavelet space between the coefficients relevant to this area is used as the distance measure.
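The distance measure, and its use for picking a matching sound section from a database, can be sketched as follows. The dictionary layout and the function names are hypothetical; the text only specifies that the Euclidean distance between the relevant wavelet coefficients is used.

```python
import math

def euclidean_distance(coeffs_a, coeffs_b):
    """Euclidean distance between two equally long wavelet coefficient vectors."""
    if len(coeffs_a) != len(coeffs_b):
        raise ValueError("coefficient vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(coeffs_a, coeffs_b)))

def closest_section(target, candidates):
    """Return the name of the candidate whose coefficients are nearest to `target`.

    `candidates` maps a section name to its coefficient vector (assumed layout).
    """
    return min(candidates,
               key=lambda name: euclidean_distance(target, candidates[name]))
```

Restricting the vectors to the coefficients of the stages that carry the characteristic of interest would make the comparison selective for that characteristic.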
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Multimedia (AREA)
- Electrically Operated Instructional Devices (AREA)
- Machine Translation (AREA)
- Sorting Of Articles (AREA)
- Pallets (AREA)
- Navigation (AREA)
- Traffic Control Systems (AREA)
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
DE19821031 | 1998-05-11 | ||
DE19821031 | 1998-05-11 | ||
PCT/DE1999/001308 WO1999059134A1 (en) | 1998-05-11 | 1999-05-03 | Method and device for determining spectral voice characteristics in a spoken expression |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1078354A1 (en) | 2001-02-28 |
EP1078354B1 (en) | 2002-03-20 |
Family
ID=7867382
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP99929088A Expired - Lifetime EP1078354B1 (en) | 1998-05-11 | 1999-05-03 | Method and device for determining spectral voice characteristics in a spoken expression |
Country Status (6)
Country | Link |
---|---|
EP (1) | EP1078354B1 (en) |
JP (1) | JP2002515608A (en) |
AT (1) | ATE214831T1 (en) |
DE (1) | DE59901018D1 (en) |
ES (1) | ES2175988T3 (en) |
WO (1) | WO1999059134A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10031832C2 (en) | 2000-06-30 | 2003-04-30 | Cochlear Ltd | Hearing aid for the rehabilitation of a hearing disorder |
US8483854B2 (en) | 2008-01-28 | 2013-07-09 | Qualcomm Incorporated | Systems, methods, and apparatus for context processing using multiple microphones |
JP6251145B2 (en) * | 2014-09-18 | 2017-12-20 | 株式会社東芝 | Audio processing apparatus, audio processing method and program |
JP2018025827A (en) * | 2017-11-15 | 2018-02-15 | 株式会社東芝 | Interactive system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
FR2678103B1 (en) * | 1991-06-18 | 1996-10-25 | Sextant Avionique | VOICE SYNTHESIS PROCESS. |
GB2272554A (en) * | 1992-11-13 | 1994-05-18 | Creative Tech Ltd | Recognizing speech by using wavelet transform and transient response therefrom |
JP3093113B2 (en) * | 1994-09-21 | 2000-10-03 | 日本アイ・ビー・エム株式会社 | Speech synthesis method and system |
-
1999
- 1999-05-03 WO PCT/DE1999/001308 patent/WO1999059134A1/en active IP Right Grant
- 1999-05-03 DE DE59901018T patent/DE59901018D1/en not_active Expired - Fee Related
- 1999-05-03 JP JP2000548866A patent/JP2002515608A/en active Pending
- 1999-05-03 AT AT99929088T patent/ATE214831T1/en not_active IP Right Cessation
- 1999-05-03 EP EP99929088A patent/EP1078354B1/en not_active Expired - Lifetime
- 1999-05-03 ES ES99929088T patent/ES2175988T3/en not_active Expired - Lifetime
Non-Patent Citations (1)
Title |
---|
See references of WO9959134A1 * |
Also Published As
Publication number | Publication date |
---|---|
WO1999059134A1 (en) | 1999-11-18 |
ES2175988T3 (en) | 2002-11-16 |
EP1078354B1 (en) | 2002-03-20 |
JP2002515608A (en) | 2002-05-28 |
ATE214831T1 (en) | 2002-04-15 |
DE59901018D1 (en) | 2002-04-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
DE69613646T2 (en) | Method for speech detection in case of strong ambient noise | |
DE69028072T2 (en) | Method and device for speech synthesis | |
DE60000074T2 (en) | Linear predictive cepstral features organized in hierarchical subbands for HMM-based speech recognition | |
DE69811656T2 (en) | VOICE TRANSFER AFTER A TARGET VOICE | |
DE69521955T2 (en) | Method of speech synthesis by chaining and partially overlapping waveforms | |
DE69718284T2 (en) | Speech synthesis system and waveform database with reduced redundancy | |
DE4237563C2 (en) | Method for synthesizing speech | |
DE69031165T2 (en) | SYSTEM AND METHOD FOR TEXT-LANGUAGE IMPLEMENTATION WITH THE CONTEXT-DEPENDENT VOCALALLOPHONE | |
DE68919637T2 (en) | Method and device for speech synthesis by covering and summing waveforms. | |
DE69719654T2 (en) | Prosody databases for speech synthesis containing fundamental frequency patterns | |
DE69909716T2 (en) | Formant speech synthesizer using concatenation of half-syllables with independent cross-fading in the filter coefficient and source range | |
DE69932786T2 (en) | PITCH DETECTION | |
DE3687815T2 (en) | METHOD AND DEVICE FOR VOICE ANALYSIS. | |
DE69720861T2 (en) | Methods of sound synthesis | |
DE69933188T2 (en) | Method and apparatus for extracting formant based source filter data using cost function and inverted filtering for speech coding and synthesis | |
DE69627865T2 (en) | VOICE SYNTHESIZER WITH A DATABASE FOR ACOUSTIC ELEMENTS | |
DE69614233T2 (en) | Speech adaptation system and speech recognizer | |
WO2003012779A1 (en) | Method for analysing audio signals | |
DE69631037T2 (en) | VOICE SYNTHESIS | |
DE3228757A1 (en) | METHOD AND DEVICE FOR PERIODIC COMPRESSION AND SYNTHESIS OF AUDIBLE SIGNALS | |
DE69723930T2 (en) | Method and device for speech synthesis and data carriers therefor | |
EP1435087B1 (en) | Method for producing reference segments describing voice modules and method for modelling voice units of a spoken test model | |
WO2001086634A1 (en) | Method for creating a speech database for a target vocabulary in order to train a speech recognition system | |
EP1078354B1 (en) | Method and device for determining spectral voice characteristics in a spoken expression | |
DE69607928T2 (en) | METHOD AND DEVICE FOR PROVIDING AND USING DIPHONES FOR MULTI-LANGUAGE TEXT-BY-LANGUAGE SYSTEMS |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
17P | Request for examination filed |
Effective date: 20000919 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AT BE DE ES FR GB NL |
|
RIC1 | Information provided on ipc code assigned before grant |
Free format text: 7G 10L 13/06 A |
|
GRAG | Despatch of communication of intention to grant |
Free format text: ORIGINAL CODE: EPIDOS AGRA |
|
GRAG | Despatch of communication of intention to grant |
Free format text: ORIGINAL CODE: EPIDOS AGRA |
|
GRAH | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOS IGRA |
|
17Q | First examination report despatched |
Effective date: 20010904 |
|
REG | Reference to a national code |
Ref country code: GB Ref legal event code: IF02 |
|
GRAH | Despatch of communication of intention to grant a patent |
Free format text: ORIGINAL CODE: EPIDOS IGRA |
|
GRAA | (expected) grant |
Free format text: ORIGINAL CODE: 0009210 |
|
AK | Designated contracting states |
Kind code of ref document: B1 Designated state(s): AT BE DE ES FR GB NL |
|
REF | Corresponds to: |
Ref document number: 214831 Country of ref document: AT Date of ref document: 20020415 Kind code of ref document: T |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: AT Payment date: 20020424 Year of fee payment: 4 |
|
REF | Corresponds to: |
Ref document number: 59901018 Country of ref document: DE Date of ref document: 20020425 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: ES Payment date: 20020523 Year of fee payment: 4 Ref country code: BE Payment date: 20020523 Year of fee payment: 4 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: FR Payment date: 20020528 Year of fee payment: 4 |
|
PGFP | Annual fee paid to national office [announced via postgrant information from national office to epo] |
Ref country code: DE Payment date: 20020722 Year of fee payment: 4 |
|
ET | Fr: translation filed | ||
REG | Reference to a national code |
Ref country code: ES Ref legal event code: FG2A Ref document number: 2175988 Country of ref document: ES Kind code of ref document: T3 |
|
PLBE | No opposition filed within time limit |
Free format text: ORIGINAL CODE: 0009261 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: NO OPPOSITION FILED WITHIN TIME LIMIT |
|
26N | No opposition filed |
Effective date: 20021223 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: GB Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030503 Ref country code: AT Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030503 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: ES Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030505 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: BE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20030531 |
|
BERE | Be: lapsed |
Owner name: *SIEMENS A.G. Effective date: 20030531 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: NL Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20031201 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: DE Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20031202 |
|
GBPC | Gb: european patent ceased through non-payment of renewal fee |
Effective date: 20030503 |
|
PG25 | Lapsed in a contracting state [announced via postgrant information from national office to epo] |
Ref country code: FR Free format text: LAPSE BECAUSE OF NON-PAYMENT OF DUE FEES Effective date: 20040130 |
|
NLV4 | Nl: lapsed or anulled due to non-payment of the annual fee |
Effective date: 20031201 |
|
REG | Reference to a national code |
Ref country code: FR Ref legal event code: ST |
|
REG | Reference to a national code |
Ref country code: ES Ref legal event code: FD2A Effective date: 20030505 |