WO2006058958A1 - Method for the automatic segmentation of speech - Google Patents

Method for the automatic segmentation of speech Download PDF

Info

Publication number
WO2006058958A1
WO2006058958A1 (PCT/FI2005/000519)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
vector
time
segmentation
prediction
Prior art date
Application number
PCT/FI2005/000519
Other languages
French (fr)
Inventor
Unto Laine
Petri Korhonen
Original Assignee
Helsinki University Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Helsinki University Of Technology filed Critical Helsinki University Of Technology
Publication of WO2006058958A1 publication Critical patent/WO2006058958A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for the segmentation of speech using an automatic method. The invention is characterized by the use of the vector-autoregressive (VAR) method in the segmentation. In it, the changes taking place in a vector time series depicting speech are predicted on the basis of data both preceding the prediction point in time and following the prediction point in time, with the aid of a vector-autoregressive model.

Description

METHOD FOR THE AUTOMATIC SEGMENTATION OF SPEECH
The present invention relates to a method for the segmentation of speech using an automatic method. The invention is characterized by the use of the vector-autoregressive (VAR) method disclosed hereinafter. The method can also be applied to any vector time series whatsoever that is calculated from speech.
In speech technology, spoken messages are processed with the aid of various technical systems that enhance speech communication. These can be, for example, the coding of speech and its economical transfer or storage using a low bit rate (bit/s), the conversion of speech to text (speech-to-text), or speech synthesis, i.e. the automatic production of a voice message from text material (text-to-speech).
In nearly all sub-areas of voice technology, parametric models are required for the speech signal. The continuous speech signal is typically divided into small parts, parametric models are created for the parts, and these are used for the aforementioned purposes. The division of a temporally continuous speech signal is often performed manually, for example, as the work of a professional phonetician. This restricts the processing of extensive speech material. The automatic segmentation method disclosed here is intended to reduce, or even replace, the manual work and thus to accelerate the development of speech-technology applications. In the future, it may also be possible for it to be applied in new types of speech detector or speech-synthesis devices.
The typical result of segmentation performed manually by phoneticians is a depiction of the phoneme boundaries, i.e. of the period of time that delimits each phoneme. This is referred to as phonemic segmentation. The result of automatic speech segmentation may differ substantially from that of manual segmentation. Depending on the application, it may be entirely sufficient not to find absolutely every sound boundary.
Segmentation may still be useful, even if it is made in such a way that only the very clearest sound boundaries are detected. The result is then no longer tight phoneme boundaries, but instead boundaries of sound pairs, or of even broader units, such as syllables or morphemes, or other similar units. When parameterizing a speech signal, the speech is processed within relatively small time windows, which are moved over the speech signal in such a way that the windows partly overlap. A depiction is made of the part of the signal currently visible in the window, either using spectrum-type parameters (spectrum vector, auditory spectrum vector, cepstrum vector, mel-cepstrum vector, etc.), or using quite freely chosen characteristics, i.e. characteristic vectors.
A time series of vectors depicting a continuous speech signal is created with the aid of a sliding time window, the series typically depicting the time-frequency structure of the signal. The changes occurring in such time series are often used to assist the segmentation of the speech. Such a method is disclosed in, for example, Aversano G. et al. A New Text-Independent Method for Phoneme Segmentation. Proceedings of the 44th Midwest Symposium on Circuits and Systems 2, 2001, pp. 516 - 519. Quite often, known segmentation techniques presuppose knowing what is said in the speech sample, i.e. what its phonemic depiction is. Sometimes the segmenter must be taught the voice of the speaker before performing the segmentation task.
Several fields of speech technology require interference-tolerant methods for the automatic segmentation of speech. Such methods should preferably be speaker- and language-independent. It should be possible to apply them with no prior information on the speaker or the sentence in question. The methods should also not require any kind of prior learning, and they should process sentences entirely automatically.
The method to which the present invention relates requires no prior information on the speaker and certainly not on the spoken sentence. The entire segmentation is built on the time-frequency structures specific to the speech signal and on their predictability.
In the present new method, the vector-autoregressive (VAR) model is used to predict the changes that take place in the vector time series depicting the speech (e.g., in the time-frequency-range depiction vectors). Prediction is performed both from data preceding the time of the prediction (forward prediction) and from data following the time of the prediction (backward prediction). The prediction error produced by the predictor increases at the sound boundaries. These error signals are used to detect the segment boundaries. The greatest changes provide the most reliable estimate of the segment boundary.
The automatic method produces segments consisting of a varying number of sounds. The method's interference immunity and performance were tested using 201 sentences in Finnish. The speakers were two men and one woman. In particular, boundaries between plosive consonants and vowels were detected reliably and accurately.
The present application discloses a new invention and method for the automatic segmentation of speech, which meets the requirements itemized above, up to a certain limit. The method is based on the detection of unpredictable changes at the sound boundaries of the time-frequency depiction of speech. It is known that not all sound transitions result in rapid or large spectral changes, so the kinds of sound boundaries that this method identifies most reliably must be determined.
It is relatively simple to supplement the method, for instance by arranging in parallel with it other segmenters operating on a known principle, or by performing segmentation hierarchically, starting from distinct segment boundaries and proceeding towards smaller units, until the most probable segment division is complete. The latter case may require statistical a posteriori information on the structure of the speech and the duration of the sound segments.
In speaker-independent recognition of continuous speech with an unlimited vocabulary (arising, for example, from declension), the phrases and words must be broken into small parts or units, such as morphemes. Parametric models are made for the parts found and are then compared with references (possibly with knowledge collected a posteriori). It is therefore not necessary to detect every sound boundary. Speaker-adaptive speech recognition can be performed by exploiting automatic segmentation.
Segments similar to syllables or morphemes, consisting of one or more sounds, are just as suitable for the purposes of recognition, as long as the segmentation operates reliably and the total number of different segments is not too great to be modelled (typically 4000 - 8000 different models). Correspondingly, in speech synthesis, segmental information can be automatically collected from a specific speaker and, with the aid of this, speech synthesis can produce the voice of precisely this speaker (speaker-adaptive speech synthesis). The method disclosed in the present patent application produces segments consisting of sound sets of different length. The central solution is the use of a vector-autoregressive (VAR) model to model the variation taking place in their feature-vector series. The model predicts the multivariate time series from data both preceding and following the prediction time. The errors arising in the predictions made in both directions are used to indicate the segment boundary.
The method can be applied to any vector-time series calculated from speech. The following description is an example of the exploitation, in automatic segmentation according to the method, of a vector-time series formed of line-spectrum pairs.
VAR model
The vector-autoregressive [VAR(p)] model is defined as follows.
yt = A(1)yt-1 + ... + A(p)yt-p + v + ut   (1)
In the equation, yt is the vector of the observations made at the moment t, A(i) are constant (K x K) matrices, v is a fixed (K x 1) vector, which permits a non-zero mean value E(yt), and ut is a (K x 1) vector representing white noise with a non-singular covariance matrix Cu. The coefficients A(1), ..., A(p), and Cu are unknown parameters, which are estimated from the multivariate time-series data using least squares estimation. A VAR(1) model is stable if the eigenvalues of the matrix A(1) all have a modulus smaller than one. The process yt can also be defined for the different moments in time t = 0, 1, 2, ..., even if the stability condition is not met.
The algorithm presented here exploits only first-order VAR models. The VAR(1) model predicts the vector at the moment in time t from the vector at the moment t-1 as follows:

ŷt = Ayt-1 + v   (2)
The model A is estimated from the vector set using least squares estimation. The estimation error of the model is the one-step prediction error between consecutive vectors in the data window.
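As a concrete illustration of this estimation step, the following Python sketch fits a VAR(1) model to a window of feature vectors by least squares and computes the one-step prediction errors within the window. It is only a sketch of the idea, not the inventors' implementation, and all function and variable names are invented for the example.

```python
import numpy as np

def fit_var1(window):
    """Least-squares fit of a VAR(1) model y_t = A y_{t-1} + v + u_t.

    window: array of shape (L, K) holding L consecutive feature vectors.
    Returns (A, v) with A of shape (K, K) and v of shape (K,).
    """
    past = window[:-1]                      # y_{t-1}, shape (L-1, K)
    future = window[1:]                     # y_t,     shape (L-1, K)
    # Augment the regressors with a constant column so that v is estimated too.
    X = np.hstack([past, np.ones((len(past), 1))])
    B, *_ = np.linalg.lstsq(X, future, rcond=None)   # shape (K+1, K)
    A = B[:-1].T                            # (K, K)
    v = B[-1]                               # (K,)
    return A, v

def one_step_errors(window, A, v):
    """Squared one-step prediction errors between consecutive vectors."""
    pred = window[:-1] @ A.T + v            # predictions of window[1:]
    return np.sum((window[1:] - pred) ** 2, axis=1)

# Toy usage with random "feature vectors" (K = 8, window of L = 22 frames).
rng = np.random.default_rng(0)
Y = rng.standard_normal((22, 8)).cumsum(axis=0)
A, v = fit_var1(Y)
print(one_step_errors(Y, A, v).mean())
```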
The digital speech signal s(n) is converted into a set of spectral characteristic vectors yt calculated frame by frame, each one of which is a (p x 1) vector. Other characteristics (e.g., energy) calculated from speech are also possible. The short-duration spectrum samples must be calculated at sufficiently short intervals to achieve a time resolution adequate for the purpose. This typically leads to overlapping of the consecutive frames.
The matrix At is defined as the VAR(1) model calculated from the L data vectors terminating at the vector at the moment in time t.
At = VARLSE(yt-L+1, ..., yt)   (3)
The value of L must correspond to the average length of the phonemes in the speech. For each vector yt, M estimates are calculated recursively with the aid of the models At-M, ..., At-1.
ŷt,1 = At-1yt-1,  ŷt,2 = (At-2)^2 yt-2,  ...,  ŷt,M = (At-M)^M yt-M   (4)
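The following sketch shows one way to read this recursion: the model estimated from the window ending at frame t-i is applied i times, starting from the observed vector yt-i, to produce the i-step estimate of yt. This interpretation, and the inclusion of the constant term v at each step, are assumptions made for the example; the names are invented.

```python
import numpy as np

def m_step_estimates(models, Y, t, M):
    """Estimates ŷ(t, i) of the vector Y[t] for i = 1, ..., M.

    models[s] is assumed to hold the pair (A_s, v_s) of the VAR(1) model
    estimated from the data window ending at frame s (cf. equation (3)).
    Model A_{t-i} is applied recursively i times, starting from Y[t-i].
    """
    estimates = []
    for i in range(1, M + 1):
        A, v = models[t - i]
        y_hat = Y[t - i]
        for _ in range(i):            # i recursive one-step predictions
            y_hat = A @ y_hat + v
        estimates.append(y_hat)
    return np.stack(estimates)        # shape (M, K)

# Minimal usage with trivial models (A = identity, v = 0): each estimate then
# simply repeats the starting vector, so the call mainly checks the shapes.
K, T, M = 8, 30, 5
Y = np.zeros((T, K))
models = [(np.eye(K), np.zeros(K)) for _ in range(T)]
print(m_step_estimates(models, Y, t=20, M=M).shape)   # (5, 8)
```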
The relative errors are calculated from these estimates.
et,i = (yt - ŷt,i)^T (yt - ŷt,i) / (yt^T yt)   (5)
The median of these errors represents the error at the moment in time t. Other criteria for the selection of the error at the moment in time t are also possible.
et = median(et,1, ..., et,M)   (6)
Small values of et are boosted by taking the logarithm of the error signal et.
Et = 10 log10(1 + et)   (7)
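A sketch of the per-frame error computation of equations (5)-(7) is given below. Normalising the squared error by yt^T yt is our reading of "relative error", and the helper name is invented for the example.

```python
import numpy as np

def frame_error(y_t, estimates):
    """Combine the M estimates of one frame into a single error value E_t.

    y_t:       observed feature vector, shape (K,)
    estimates: the M estimates ŷ(t, 1..M), shape (M, K)
    """
    diff = estimates - y_t                            # (M, K)
    # Relative squared errors, equation (5); normalisation by y_t^T y_t
    # is an assumption about what "relative" means here.
    e_ti = np.sum(diff ** 2, axis=1) / (y_t @ y_t + 1e-12)
    e_t = np.median(e_ti)                             # equation (6)
    return 10.0 * np.log10(1.0 + e_t)                 # equation (7)

# Hypothetical frame and estimates.
rng = np.random.default_rng(1)
y_t = rng.standard_normal(8)
estimates = y_t + 0.1 * rng.standard_normal((5, 8))
print(frame_error(y_t, estimates))
```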
In the above, the model A is used to predict the values of y outside of the window from which the model was estimated. Up to this point, A was used recursively to produce the vectors for the time instants t+1, ..., t+M; the model therefore predicts the future values of y. The model can also be used to predict the values prior to the window from which the model was estimated. This can be done easily by inverting the order in time of the original y vectors and performing the same VAR analysis. The signals Et+ and Et-, which represent the prediction error forwards and backwards, are produced in the manner described above.
Et* = Et+ - Et-   (8)
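The backward predictor needs no separate machinery: reversing the feature-vector sequence in time and running the same analysis gives the backward error, as in the sketch below. For brevity, the per-frame error here is a plain one-step prediction error from a sliding VAR(1) fit rather than the full median-of-M-estimates measure, so it is only a simplified stand-in for Et.

```python
import numpy as np

def fit_var1(window):
    """Least-squares VAR(1) fit; returns (A, v)."""
    X = np.hstack([window[:-1], np.ones((len(window) - 1, 1))])
    B, *_ = np.linalg.lstsq(X, window[1:], rcond=None)
    return B[:-1].T, B[-1]

def error_signal(Y, L=22):
    """Simplified per-frame error: one-step error of the model fitted to the
    L frames ending at t-1 (a stand-in for the full Et of equation (7))."""
    E = np.zeros(len(Y))
    for t in range(L, len(Y)):
        A, v = fit_var1(Y[t - L:t])
        e = Y[t] - (A @ Y[t - 1] + v)
        E[t] = 10.0 * np.log10(1.0 + (e @ e) / (Y[t] @ Y[t] + 1e-12))
    return E

rng = np.random.default_rng(2)
Y = rng.standard_normal((300, 8)).cumsum(axis=0)     # toy feature-vector series
E_fwd = error_signal(Y)                              # prediction from the past
E_bwd = error_signal(Y[::-1])[::-1]                  # prediction from the future
E_star = E_fwd - E_bwd                               # combined signal, equation (8)
print(E_star.shape)
```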
In the following, the invention is examined with reference to the accompanying drawings, of which

- Figure 1 (a) shows the auditory spectrum of the speech,
- Figure 1 (b) shows the signals Et+ and Et- representing the forward and backward prediction errors,
- Figure 1 (c) shows the summed error signal Et*,
- Figure 1 (d) shows the Et* signal filtered by h(t),
- Figure 2 shows clean speech, man 1, threshold 0.20,
- Figure 3 shows clean speech, man 1, threshold 9.35, and
- Figure 4 shows noisy speech, man, threshold 0.20, M = 7, L = 66 ms.

The summed error signal has a large negative peak value just before a segment boundary and a large positive peak value just after it, as in Figure 1 (c). The relevant segment boundary is located between these two peak points. In order to facilitate the detection of these points, Et* is filtered with a filter whose impulse response h(t) is given in equation (9).
In h(t), d is the mean width of the peaks in the error signal. Using h(t) to filter Et* results in a signal whose peaks coincide with the segment boundaries [Figure 1 (d)]. The threshold value used to select the peaks depends on the application.
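The impulse response h(t) of equation (9) is not reproduced in this text, so the sketch below substitutes an assumed antisymmetric filter of half-width d, which turns the negative-then-positive pattern around a boundary into a single positive maximum, and then picks the maxima that exceed a threshold. Both the filter shape and the helper names are assumptions for the example.

```python
import numpy as np

def detect_boundaries(E_star, d=10, threshold=0.20):
    """Filter the combined error signal and pick peaks above a threshold.

    An assumed antisymmetric averaging filter of half-width d (d ~ mean peak
    width in frames) stands in for the h(t) of equation (9); it turns the
    negative-then-positive pattern around a boundary into one maximum.
    """
    h = np.concatenate([np.ones(d), [0.0], -np.ones(d)]) / d
    f = np.convolve(E_star, h, mode="same")
    # Local maxima of the filtered signal that exceed the threshold.
    peaks = [n for n in range(1, len(f) - 1)
             if f[n] > threshold and f[n] >= f[n - 1] and f[n] >= f[n + 1]]
    return np.array(peaks), f

# Toy combined error signal: a negative dip followed by a positive bump.
E_star = np.zeros(200)
E_star[90:100] = -1.0
E_star[100:110] = +1.0
boundaries, _ = detect_boundaries(E_star, d=10, threshold=0.20)
print(boundaries)      # expected to contain a frame index near 100
```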
Evaluating the performance of the segmentation algorithm is not entirely straightforward. The method disclosed here detects the greatest spectral changes, and it is preferable for these moments in time to correspond to the phonetic sound boundaries. Phonetic transcriptions were therefore used in the evaluation of performance.
At this stage, the aim is not to produce perfect phonetic segmentation by detecting every sound boundary. However, it is useful to examine how far one can get by using the basic method. These preliminary results can be improved using known methods.
It should be noted that the differences between manual segmentation and automatic segmentation are not errors in all cases, particularly if they appear systematically. The differences between automatic segmentation and phonetic transcription can also be caused by inaccuracies in the phonetic transcription performed by a person.
Two types of error can be detected in the system: rejection of a segment boundary, or a boundary being set in the wrong place. The quality criterion obtained for the system is:

Q = (Hit - Removal - Ignore) / Cases   (10)

Hit is the number of correctly detected boundaries, Removal refers to the number of rejected (missed) segment boundaries, and Ignore refers to the number of wrongly placed boundaries.
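As a hypothetical numerical illustration (the figures are invented, not from the patent): if a test set contains Cases = 100 reference boundaries, of which Hit = 80 are detected correctly, Removal = 15 are rejected, and Ignore = 5 are placed wrongly, the criterion gives Q = (80 - 15 - 5) / 100 = 0.60.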
201 Finnish sentences, combined from the speech of three different speakers (two men, one woman), were used to evaluate the operation of the system. The material was recorded in an anechoic chamber, using a sampling frequency of 22.05 kHz. The short-duration spectrum samples were line spectrum pairs (Warped Line Spectrum Frequencies, WLSF) calculated, using an auditory frequency scale, at intervals of 3 ms in a 20-ms time window, using Hamming windowing. Frequency warping was used to produce the auditory frequency scale and to reduce the order p of the linear prediction model. The use of the unusually short 3-ms step was required by the need to obtain sufficient data to estimate the VAR model, as well as the need to achieve a higher time resolution for segmentation. This also agrees more closely with the time resolution of human hearing.
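The front-end parameters quoted above (22.05 kHz sampling, a 20-ms Hamming window, and a 3-ms frame step) correspond to a framing loop of the kind sketched below. The WLSF analysis itself is not reproduced here; a plain magnitude spectrum is used as a stand-in feature, and the names are invented for the example.

```python
import numpy as np

FS = 22050                     # sampling frequency (Hz)
WIN = int(0.020 * FS)          # 20-ms analysis window -> 441 samples
HOP = int(0.003 * FS)          # 3-ms frame step       -> 66 samples

def frame_features(s):
    """Split the signal into overlapping Hamming-windowed frames and return
    one feature vector per frame (a plain magnitude spectrum here; the patent
    uses warped line spectrum frequencies instead)."""
    window = np.hamming(WIN)
    frames = []
    for start in range(0, len(s) - WIN + 1, HOP):
        frames.append(np.abs(np.fft.rfft(s[start:start + WIN] * window)))
    return np.array(frames)

# One second of noise stands in for a recorded speech signal.
s = np.random.default_rng(3).standard_normal(FS)
Y = frame_features(s)
print(Y.shape)    # about 330 frames of 221 spectral bins each
```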
The segmentation results for the three speakers are combined in Table 1. There was little variation in quality between the speakers. The greatest hit probability was achieved by the female speaker, though the qualitatively best performance was that of male speaker 1. The result reinforces the view that the method is largely speaker-independent.
Figures 2 and 3 show the results obtained for male speaker 1 (clean speech) when using two different threshold values to select the peaks. The figures show the Hit, Rejection, and Quality values for different values of the number of predictions M used in the method. The number of hits grows as M grows, but a larger M also appears to increase the number of rejected cases. Lengthening the data window L has the same effect: a longer window leads to greater Hit and Rejection numbers. The quantity Q has its highest values when the length of the window approaches the average phoneme length of 70 ms. The number of predictions M has no real effect on quality. Increasing the threshold value affects both the Hit and Rejection rates, but not so much the quality.
A group of Finnish phonemes was divided into seven sub-classes on the basis of their phonetic similarity. The division is shown in Table 2. Theoretically, there can thus be 49 different connections or transitions between the classes. Seven transitions were not observed in the material. Five of them were not observed at all, or else a disputed phonological rule of the Finnish language (marked with an x) operated between them. Two cases were possible, but they did not appear in the material (marked with a 0). 34 cases out of the 42 in which three or more changes appeared gave statistically sufficient information for segmentation. Eight cases produced no statistically significant segment information (shown in italics), on account of a small number of changes or a low probability of detection.
Figure 4 shows the results for noisy speech. The signal-to-noise ratio (SNR) was adjusted by adding pink noise to the speech, as well as so-called 'babble noise' interference. 'Babble noise' is interference that resembles speech. The level of performance fell considerably when the SNR dropped below the 15 dB level. Increasing the 'babble noise' interference reduced the quality of performance more rapidly than pink noise did.
Table 1. Segmentation results for the three speakers. M = 7, L = 66 ms, and threshold 0.20.
Table 2. Distribution percentages of detected segment boundaries and number of connections between the phoneme classes. Man, M = 7, L = 66 ms, threshold 0.2. Total number of segments 2264. Vowels: /a/, /e/, /i/, /o/, /u/, /y/, /ä/, /ö/; plosives: /b/, /d/, /g/, /k/, /p/, /t/; nasals: /n/, /m/, /ng/; fr.: /f/, /h/, /s/; li.: /j/, /l/, /v/; tr.: /r/; sil.: silence.

Claims

Claims
1. Method for the automatic segmentation of speech, characterized in that the changes taking place in a vector time series depicting speech are predicted on the basis of both data preceding the prediction point in time and data following the prediction point in time, with the aid of a vector-autoregressive (VAR) model.
2. Method according to Claim 1, characterized in that the variation taking place in the characteristic-vector sequences of sound groups of differing length is modelled with the aid of a vector-autoregressive (VAR) model.
3. Method according to Claim 1, characterized in that the error signals occurring at the sound boundaries are used to detect the segment boundaries.
4. Method according to Claim 1, characterized in that predictions are made from several prediction points in time preceding and following the time window, which permits the combination of the prediction errors.
5. Method according to Claim 1, characterized in that the segment boundaries are set at the location of the local maximum values of the error.
PCT/FI2005/000519 2004-11-30 2005-11-30 Method for the automatic segmentation of speech WO2006058958A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20041541A FI20041541A (en) 2004-11-30 2004-11-30 Procedure for automatic segmentation of speech
FI20041541 2004-11-30

Publications (1)

Publication Number Publication Date
WO2006058958A1 true WO2006058958A1 (en) 2006-06-08

Family

ID=33515289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2005/000519 WO2006058958A1 (en) 2004-11-30 2005-11-30 Method for the automatic segmentation of speech

Country Status (2)

Country Link
FI (1) FI20041541A (en)
WO (1) WO2006058958A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010126709A1 (en) * 2009-04-30 2010-11-04 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123469A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Phrase border probability calculating device and continuous speech recognition device utilizing phrase border probability
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123469A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Phrase border probability calculating device and continuous speech recognition device utilizing phrase border probability
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEMUYNCK K. ET AL: "A Comparison of different approaches to automatic speech segmentation", 5TH INTERNATIONAL CONFERENCE, TSD 2002, vol. 2448, 9 September 2002 (2002-09-09) - 12 September 2002 (2002-09-12), BRNO, CZECH REPUBLIC, pages 227 *
KAWABATA T.: "Predictor codebooks for speaker-independent speech recognition", 1992 IEEE INTERNATIONAL CONFERENCE ON ACOUSTIC, SPEECH, AND SIGNAL PROCESSING, vol. 1, 1992, pages 353 - 356 *
TAHIR ET AL: "Time varying autoregressive modeling approach for speech segmentation", SIXTH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, August 2001 (2001-08-01), MALAYSIA, pages 715 - 718 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010126709A1 (en) * 2009-04-30 2010-11-04 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection
CN102414742A (en) * 2009-04-30 2012-04-11 杜比实验室特许公司 Low complexity auditory event boundary detection
JP2012525605A (en) * 2009-04-30 2012-10-22 ドルビー ラボラトリーズ ライセンシング コーポレイション Low complexity auditory event boundary detection
CN102414742B (en) * 2009-04-30 2013-12-25 杜比实验室特许公司 Low complexity auditory event boundary detection
US8938313B2 (en) 2009-04-30 2015-01-20 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection

Also Published As

Publication number Publication date
FI20041541A (en) 2006-05-31
FI20041541A0 (en) 2004-11-30

Similar Documents

Publication Publication Date Title
Mustafa et al. Robust formant tracking for continuous speech with speaker variability
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
Shanthi et al. Review of feature extraction techniques in automatic speech recognition
JPH075892A (en) Voice recognition method
Jiao et al. Convex weighting criteria for speaking rate estimation
Zhang et al. Improved modeling for F0 generation and V/U decision in HMM-based TTS
Lin et al. Automatic estimation of voice onset time for word-initial stops by applying random forest to onset detection
Karpagavalli et al. Phoneme and word based model for tamil speech recognition using GMM-HMM
Priya et al. Implementation of phonetic level speech recognition in Kannada using HTK
Nickel et al. Corpus-based speech enhancement with uncertainty modeling and cepstral smoothing
EP1081681B1 (en) Incremental training of a speech recognizer for a new language
Lugosch et al. Tone recognition using lifters and ctc
WO2006058958A1 (en) Method for the automatic segmentation of speech
Kupryjanow et al. Real-time speech signal segmentation methods
Jijomon et al. An offline signal processing technique for accurate localisation of stop release bursts in vowel-consonant-vowel utterances
Koc Acoustic feature analysis for robust speech recognition
Anh et al. A Method for Automatic Vietnamese Speech Segmentation
Hemakumar et al. Large Vocabulary in Continuous Speech Recognition Using HMM and Normal Fit
Sönmez et al. Consonant discrimination in elicited and spontaneous speech: a case for signal-adaptive front ends in ASR.
Sasou et al. Glottal excitation modeling using HMM with application to robust analysis of speech signal.
Dutta et al. A comparative study on feature dependency of the Manipuri language based phonetic engine
KR20080039072A (en) Speech recognition system for home network
Thandil et al. Automatic speech recognition system for utterances in Malayalam language
Deekshitha et al. Implementation of Automatic segmentation of speech signal for phonetic engine in Malayalam
Zeng et al. Robust children and adults speech classification

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05818023

Country of ref document: EP

Kind code of ref document: A1