EP1078354A1

EP1078354A1 - Method and device for determining spectral voice characteristics in a spoken expression

Info

Publication number: EP1078354A1
Application number: EP99929088A
Authority: EP
Inventors: Martin Holzapfel
Original assignee: Siemens AG
Current assignee: Siemens AG
Priority date: 1998-05-11
Filing date: 1999-05-03
Publication date: 2001-02-28
Anticipated expiration: 2019-05-03
Also published as: WO1999059134A1; ES2175988T3; EP1078354B1; JP2002515608A; ATE214831T1; DE59901018D1

Abstract

According to the invention, spectral voice characteristics are determined in a natural language expression, whereby the expression is digitized and subjected to a wavelet transformation. The speaker-specific characteristics arise from the different transformation steps of the wavelet transformation. Within the scope of a voice synthesis, these characteristics can be compared with characteristics of other expressions in order to generate a continuously sounding synthetic voice signal for the human ear. Alternatively, the characteristics can also be modified in a targeted manner in order to counteract a perceptive dissonance.

Description

description

Method and arrangement for determining spectral speech characteristics in a spoken utterance

The invention relates to a method and an arrangement for determining spectral speech characteristics in a spoken utterance.

In a concatenative speech synthesis, individual sounds are put together from speech databases. In order to obtain a speech course that sounds natural to the human ear, discontinuities must be avoided at the points where the sounds are composed (concatenation points). The sounds are in particular phonemes of one language or a combination of several phonemes.

A wavelet transformation is known from [1]. In the wavelet transformation, a wavelet filter ensures that a high-pass component and a

Low-pass component of a subsequent transformation stage completely restore a signal of a current transformation stage. The resolution of the high-pass component or low-pass component is reduced from one transformation stage to the next (technical term: "subsampling"). In particular, the number of transformation levels is finite due to subsampling.

The object of the invention is a method and an arrangement for determining spectral

Specify language characteristics, with the help of which in particular a natural-looking synthetic speech output can be determined

This object is achieved according to the features of the independent claims. In the context of the invention, a method is specified for determining spectral speech characteristics in a spoken utterance. For this purpose, the spoken utterance is digitized and subjected to a wavelet transformation. The speaker-specific characteristics are determined on the basis of different transformation levels of the wavelet transformation.

It is an advantage in particular that the utterance is divided in the wavelet transformation by means of a high-pass filter and a low-pass filter and that different high-pass components or low-pass components of different transformation stages contain speaker-specific characteristics.

The individual high-pass components or low-pass components of different transformation levels stand for predetermined speaker-specific characteristics, it being possible for both high-pass component and low-pass component of a respective transformation stage, that is to say the respective characteristic, to be modified separately from other characteristics. If, in the case of the inverse wavelet transformation, the original signal is put together again from the respective high-pass and low-pass components of the individual transformation stages, this ensures that exactly the desired signal

Characteristic has been changed. It is thus possible to change certain specified characteristics of the utterance without influencing the rest of the utterance.

One embodiment consists in that before the wavelet transformation the utterance is windowed, that is to say a predetermined quantity of samples is cut out, and the frequency range is transformed. A Fast Fourier Transform (FFT) is used in particular for this purpose.

Another embodiment consists in that a high-pass component of a transformation stage m is a real part and an imaginary part is divided. The high-pass component of the wavelet transformation corresponds to the difference signal between the current low-pass component and the low-pass component of the previous transformation stage.

In particular, a further development consists in determining the number of transformation stages of the wavelet transformation to be carried out by including a constant component of the utterance in the last transformation stage, which consists of low-pass filters connected in series. Then the signal as a whole can be represented by its wavelet coefficients. This corresponds to the complete transformation of the information of the signal section m into the wavelet space.

If, in particular, only the respective low-pass component is further transformed (by means of a high-pass and a low-pass filter), the difference signal remains as the high-pass component of a transformation stage, as explained above. If one accumulates difference signals (high-pass components) over the transformation stages, the information of the spoken utterance without a constant component is obtained in the last transformation stage as a cumulative high-pass component.

As part of additional training, the speaker-specific characteristics can be identified as:

a) fundamental frequency:

The oscillation of the high-pass component of the first or second transformation stage of the wavelet

Transformation reveals the fundamental frequency of the utterance. The basic frequency indicates whether the speaker is a man or a woman.

b) Shape of the spectral envelope:

The spectral envelope contains information about a transfer function of the vocal tract during articulation. In a voiced area, the spectral envelope is dominated by the formants. The high-pass component of a higher transformation level of the wavelet transformation contains this spectral envelope.

c) Spectral Tilt:

The smokiness in a voice becomes visible as a negative slope in the course of the penultimate low-pass portion.

The speaker-specific characteristics a) to c) are of great importance in speech synthesis. As mentioned at the beginning, concatenative speech synthesis uses large quantities of real uttered utterances, from which excerpts are cut out and later put together to form a new word (synthesized language). Discontinuities between compound sounds are disadvantageous because they are perceived by the human ear as unnatural. In order to counteract the discontinuities, it is advantageous to directly record the perceptually relevant sizes and, if necessary, to compare and / or adapt them.

This can be done by direct manipulation by adapting a speech m at least one of its speaker-specific characteristics, so that it is not perceived as disturbing in the acoustic context of the concatenative linked sounds. It is also possible to align the selection of a suitable sound so that speaker-specific characteristics of sounds to be linked match each other as well as possible, e.g. that the sounds have the same or similar smokiness.

An advantage of the invention is that the spectral envelope curve reflects the articulation tract of the speaker and is not based on formants, such as a pole position model. Go further with the wavelet transformation no data is lost as a nonparametric representation, the utterance can always be completely reconstructed. The data resulting from the individual transformation stages of the wavelet transformation are linearly independent of one another, can thus be influenced separately from one another and can later be combined again - without loss - to the influenced utterance.

Furthermore, an arrangement for determining spectral

Speech characteristics specified, which has a processor unit which is set up such that an utterance can be digitized. The utterance is then subjected to a wavelet transformation and speaker-specific characteristics are determined using different transformation levels.

This arrangement is particularly suitable for carrying out the method according to the invention or one of its developments explained above.

Further developments of the invention also result from the dependent claims.

Exemplary embodiments of the invention are illustrated and explained below with reference to the drawing.

Show it

Fig.l a wavelet function;

2 shows a wavelet function, divided into real part and imaginary part;

3 shows a cascaded filter structure that the

Represents transformation steps of the wavelet transformation; Fig.4 low-pass components and high-pass components of different transformation levels;

Fig. 5 steps of concatenative speech synthesis.

Fig.l shows a wavelet function, which is determined by

where f is the frequency, σ is a standard deviation and c is a given normalization constant.

In particular, the standard deviation σ is determined by the predeterminable position of the sideband minimum 101 in Fig.l.

2 shows a wavelet function with a real part according to equation (1) and a Hilbert transform H of the real part as an imaginary part. The complex wavelet function thus arises

Ψ (f) = ψ (f) + j ^• H {ψ (f)} (2).

The constant c from equation (1) is used to normalize the complex wavelet function:

oo

-oo

where Ψ denotes the conjugate complex wavelet function. 3 shows the cascaded application of the wavelet transform. A signal 301 is filtered both by a high pass HP1 302 and by a low pass TP1 305. In particular, subsampling takes place, ie the number of values to be saved is reduced per filter. A mverse wavelet transformation ensures that the original signal 301 can be reconstructed from the low-pass component TP1 305 and the high-pass component HP1 304.

In the high pass HP1 302 is filtered separately for real part Rel 303 and Imagmar part Iml 304.

The signal 310 after the low-pass filter TP1 305 is again both by a high-pass HP2 306 and by a

Filtered low pass TP2 309. The high pass HP2 306 again comprises a real part Re2 307 and an imagemart Im2 308. The signal after the second transformation stage 311 is filtered again, etc.

Assuming a (FFT-transformed) short-term spectrum with 256 values, eight transformation steps are carried out (subsampl grate: 1/2) until the signal from the last low-pass filter TP8 corresponds to the DC component.

4 shows various transformation stages of the wavelet transformation, divided into low-pass components (FIGS. 4A, 4C and 4E) and high-pass components (FIGS. 4B, 4D and 4F).

The basic frequency of the spoken utterance can be seen from the high-pass component in accordance with FIG. 4B. In addition to the fluctuations in the amplitude, a predominant periodicity in the wavelet-filled spectrum can clearly be seen, the fundamental frequency of the speaker. On the basis of the fundamental frequency, it is possible to express one another in the speech synthesis adapt or determine suitable utterances from a database with predefined utterances.

In the low-pass portion of Fιg.4C, the formants of the speech signal section are shown as pronounced Mmima and Maxima (the length of the speech signal section corresponds to m approximately twice the fundamental frequency). The formants represent resonance frequencies in the speaker's vocal tract. The clear representability of the formants enables adaptation and / or selection of suitable phonetic components in concatenative speech synthesis.

The smokiness of a voice can be determined in the low-pass portion of the penultimate transformation stage (with 256 frequency values in the original signal: TP7). The descent of the curve between maximum Mx and minimum Mi indicates the degree of smokiness.

The three speaker-specific characteristics mentioned are thus identified and can be influenced in a targeted manner for speech synthesis. It is particularly important that the manipulation of a single speaker-specific characteristic only influences this in the case of the verse wavelet transformation; the other perceptually relevant variables remain unaffected. In this way, the basic frequency can be adjusted in a targeted manner without affecting the smokiness of the voice.

Another option is to select a suitable sound section for concatenative linking with another sound section, both sound sections originally being recorded by different speakers in different contexts. By determining spectral speech characteristics, a suitable sound section to be linked can be found, since criteria are known with the characteristics that allow a comparison of sound sections with each other and thus a selection of the Allow suitable sound section automatically according to certain specifications.

5 shows steps of a concatenative speech synthesis. A database is created with a predetermined amount of naturally spoken language by different speakers, sound sections in the naturally spoken language being identified and stored. There are numerous representatives for the different sound sections of a language that the database can access. The sound sections are in particular phonemes of a language or a series of such phonemes. The smaller the section of the sound, the greater the possibilities for combining new words. For example, the German language contains a predetermined amount of approximately 40

Phonemes sufficient to synthesize almost all words in the language. Different acoustic contexts must be taken into account, depending on the word in which the respective phoneme occurs. Now it is important to embed the individual phonemes in the acoustic context in such a way that

Discontinuities that are perceived by the human ear as unnatural and "synthetic" can be avoided. As mentioned, the sound sections come from different speakers and thus have different speaker-specific characteristics. In order to synthesize a statement that looks as natural as possible, it is important to minimize the discontinuities. This can be done by adapting the identifiable and modifiable speaker-specific characteristics or by selecting suitable sound sections from the database, the speaker-specific characteristics also being a decisive aid in the selection.

5 shows two sounds A 507 and B 508, each of which has individual sound sections 505 and 506, for example. The sounds A 507 and B 508 each come from a spoken utterance, whereby the sound A 507 clearly is different from the sound B 508. A dividing line 509 indicates where the sound A 507 should be linked with the sound B 508. In the present case, the first three sound sections of sound A 507 are to be concatenated with the last three sound sections of sound B 508.

A temporal stretching or compressing (see arrow 503) of the successive sound sections is carried out along the dividing line 509 in order to reduce the discontinuous impression at the transition 509.

A variant consists in an abrupt transition of the sounds divided along the dividing line 509. However, this leads to the discontinuities mentioned, which human hearing perceives as disturbing. If, on the other hand, a sound C is put together, the sound sections within a transition area 501 or 502 are taken into account, whereby a spectral distance between two mutually assignable sound sections m is adapted to the respective transition area 501 or 502 (gradual transition between the

Sound sections). The Euclidean distance between the coefficients relevant to this area is used as the distance measure, particularly in the wavelet space.

Bibliography :

[1] I. Daubechies: "Ten Lectures on Wavelets", Siam Verlag 1992, ISBN 0-89871-274-2, chapter 5.1, pages 129-137.

Claims

claims

1. Method for determining spectral speech characteristics in a spoken utterance, a) in which the utterance is digitized, b) in which the digitized utterance is subjected to a wavelet transformation, c) in which the speaker-specific characteristics are determined on the basis of different transformation levels of the wavelet transformation become.

2. The method according to claim 1, in which a windowed transformation of the digitized utterance is carried out in a frequency range before the wavelet transformation.

3. The method according to claim 2, wherein the transformation of the frequency range is carried out by means of Fast Fourier transformation.

Method according to one of the preceding claims, in which each stage of the wavelet transformation em low pass portion and em high pass portion of a signal to be transformed are determined.

Method according to one of the preceding claims, in which a high-pass component according to a real component and a

Imagmar part is divided.

6. The method according to any one of the preceding claims, wherein the wavelet transformation comprises a plurality of transformation stages, the last transformation stage providing a constant component of the utterance m of repeated low-pass filtering corresponding to the number of transformation stages.

7. The method according to any one of the preceding claims, wherein the speaker-specific characteristics are determined by: a) a basic frequency of the spoken utterance; b) spectral envelope; c) a smokiness of the spoken utterance.

8. Use of the method according to one of claims 1 to 7 for speech synthesis, wherein individual speaker-specific characteristics in

With a view to a natural sounding sequence of speech sounds can be adjusted.

9. Use of the method according to one of claims 1 to 7 for speech synthesis, wherein those speech sounds are selected from a predetermined amount of data on the basis of individual spectral speech characteristics, which ensure a natural sounding sequence of speech sounds.

10. Arrangement for determining spectral speech characteristics in a spoken utterance with a processor unit, which is set up in such a way that the following steps can be carried out: a) the utterance is digitized; b) the digitized utterance is subjected to a wavelet transformation; c) based on different levels of transformation

Wavelet transformation, the speaker-specific characteristics are determined.