WO2006058958A1 - Method for the automatic segmentation of speech - Google Patents

Method for the automatic segmentation of speech Download PDF

Info

Publication number
WO2006058958A1
WO2006058958A1 (PCT/FI2005/000519)
Authority
WO
WIPO (PCT)
Prior art keywords
speech
vector
time
segmentation
prediction
Prior art date
Application number
PCT/FI2005/000519
Other languages
French (fr)
Inventor
Unto Laine
Petri Korhonen
Original Assignee
Helsinki University Of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Helsinki University Of Technology filed Critical Helsinki University Of Technology
Publication of WO2006058958A1 publication Critical patent/WO2006058958A1/en

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals


Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for the segmentation of speech using an automatic method. The invention is characterized by the use of the vector-autoregressive (VAR) method in the segmentation. In it, the changes taking place in a vector time series depicting speech are predicted on the basis of data both preceding the prediction point in time and following the prediction point in time, with the aid of a vector-autoregressive model.

Description

METHOD FOR THE AUTOMATIC SEGMENTATION OF SPEECH
The present invention relates to a method for the segmentation of speech using an automatic method. The invention is characterized by the use of the vector-autoregressive (VAR) method disclosed hereinafter. The method can also be applied to any vector time series whatsoever that is calculated from speech.
In speech technology, spoken messages are processed with the aid of various technical systems that enhance speech communication. These can be, for example, the coding of speech and its economical transfer or storage using a low bit rate (bit/s), the conversion of speech to text (speech-to-text), or speech synthesis, i.e. the automatic production of a voice message from text material (text-to-speech).
In nearly all sub-areas of voice technology, parametric models are required for the speech signal. The continuous speech signal is typically divided into small parts, parametric models are created for the parts, and these are used for the aforementioned purposes. The division of a temporally continuous speech signal is often performed manually, for example, as the work of a professional phonetician. This restricts the processing of extensive speech material. The automatic segmentation method disclosed here is intended to reduce, or even replace, the manual work and thus to accelerate the development of speech-technology applications. In the future, it may also be possible for it to be applied in new types of speech detector or speech-synthesis devices.
The typical result of segmentation performed manually by phoneticians is a depiction of the phoneme boundaries, i.e. of the period of time that delimits each phoneme. This is referred to as phonemic segmentation. The result of automatic speech segmentation may differ substantially from that of manual segmentation. Depending on the application, it may be entirely sufficient not to find absolutely every sound boundary.
Segmentation may still be useful, even if it is made in such a way that only the very clearest sound boundaries are detected. The result is then no longer tight phoneme boundaries, but instead boundaries of sound pairs, or of even broader units, such as syllables or morphemes, or other similar units. When parameterizing a speech signal, the speech is processed within relatively small time windows, which are moved over the speech signal in such a way that the windows partly overlap. A depiction is made of the part of the signal currently visible in the window, either using spectrum-type parameters (spectrum vector, auditory spectrum vector, cepstrum vector, mel-cepstrum vector, etc.), or using quite freely chosen characteristics, i.e. characteristic vectors.
A time series of vectors depicting a continuous speech signal is created with the aid of a sliding time window, the series typically depicting the time-frequency structure of the signal. The changes occurring in such time series are often used to assist the segmentation of the speech. Such a method is disclosed in, for example, Aversano G. et al. A New Text-Independent Method for Phoneme Segmentation. Proceedings of the 44th Midwest Symposium on Circuits and Systems 2, 2001, pp. 516 - 519. Quite often, known segmentation techniques presuppose knowing what is said in the speech sample, i.e. what its phonemic depiction is. Sometimes the segmenter must be taught the voice of the speaker before performing the segmentation task.
Several fields of speech technology require interference-tolerant methods for the automatic segmentation of speech. Such methods should preferably be speaker- and language-independent. It should be possible to apply them with no prior information on the speaker or the sentence in question. The methods should also not require any kind of prior learning, and they should process sentences entirely automatically.
The method to which the present invention relates requires no prior information on the speaker and certainly not on the spoken sentence. The entire segmentation is built on the time-frequency structures specific to the speech signal and on their predictability.
In the present new method, the vector-autoregressive (VAR) model is used to predict the changes that take place in the vector time series depicting the speech (e.g., in the time-frequency-range depiction vectors). Prediction is performed both from data preceding the time of the prediction (forward prediction) and from data following the time of the prediction (backward prediction). The prediction error produced by the predictor increases at the sound boundaries. These error signals are used to detect the segment boundaries. The greatest changes provide the most reliable estimate of the segment boundary.
The automatic method produces segments consisting of a varying number of sounds. The method's interference immunity and performance were tested using 201 sentences in Finnish. The speakers were two men and one woman. In particular, boundaries between plosive consonants and vowels were detected reliably and accurately.
The present application discloses a new invention and method for the automatic segmentation of speech, which meets the requirements itemized above, up to a certain limit. The method is based on the detection of unpredictable changes at the sound boundaries of the time-frequency depiction of speech. It is known that not all sound transitions result in rapid or large spectral changes, so the kinds of sound boundaries that this method identifies most reliably must be determined.
It is relatively simple to supplement the method, for instance by arranging in parallel with it other segmenters operating on a known principle, or by performing segmentation hierarchically, starting from distinct segment boundaries and proceeding towards smaller units, until the most probable segment division is complete. The latter case may require statistical a posteriori information on the structure of the speech and the duration of the sound segments.
In speaker-independent recognition of continuous speech with an unlimited vocabulary (arising, for example, from declension), the phrases and words must be broken into small parts or units, such as morphemes. Parametric models are made for the parts found and are then compared with references (possibly with knowledge collected a posteriori). It is therefore not necessary to detect every sound boundary. Speaker-adaptive speech recognition can be performed by exploiting automatic segmentation.
Segments similar to syllables or morphemes, consisting of one or more sounds, are just as suitable for the purposes of recognition, as long as the segmentation operates reliably and the total number of different segments is not too great to be modelled (typically 4000 - 8000 different models). Correspondingly, in speech synthesis, segmental information can be automatically collected from a specific speaker and, with the aid of this, speech synthesis can produce the voice of precisely this speaker (speaker-adaptive speech synthesis). The method disclosed in the present patent application produces segments consisting of sound sets of different length. The central solution is the use of a vector-autoregressive (VAR) model to model the variation taking place in their feature-vector series. The model predicts the multivariate time series from data both preceding and following the prediction time. The errors arising in the predictions made in both directions are used to indicate the segment boundary.
The method can be applied to any vector-time series calculated from speech. The following description is an example of the exploitation, in automatic segmentation according to the method, of a vector-time series formed of line-spectrum pairs.
VAR model
The vector-autoregressive [VAR(p)] model is defined as follows.
yt = A(1)yt-1 + ... + A(p)yt-p + v + ut   (1)
In the equation, yt is the vector of the observations made at the moment t, A(i) are constant (K x K) matrices, v is a fixed (K x 1) vector, which permits a non-zero mean value E(yt), and ut is a (K x 1) vector representing white noise with a non-singular covariance matrix Cu. The coefficients A(1), ..., A(p), and Cu are unknown parameters, which are estimated from the multivariate time-series data using least squares estimation. A VAR(1) model is stable if the eigenvalues of the matrix A(1) all have a modulus smaller than one. The process yt can also be defined for the different moments in time t = 0, 1, 2, ..., even if the stability condition is not met.
The algorithm presented here exploits only first-order VAR models. The VAR(1) model predicts the vector at the moment in time t from the vector at the moment t-1 as follows:

ŷt = Ayt-1 + v   (2)
The model A is estimated from the vector set using least squares estimation. The estimation error of the model is the one-step prediction error between consecutive vectors in the data window.
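As a concrete illustration of this estimation step, the following Python sketch fits a VAR(1) model to a window of feature vectors by least squares and computes the one-step prediction errors within the window. It is only a sketch of the idea, not the inventors' implementation, and all function and variable names are invented for the example.

```python
import numpy as np

def fit_var1(window):
    """Least-squares fit of a VAR(1) model y_t = A y_{t-1} + v + u_t.

    window: array of shape (L, K) holding L consecutive feature vectors.
    Returns (A, v) with A of shape (K, K) and v of shape (K,).
    """
    past = window[:-1]                      # y_{t-1}, shape (L-1, K)
    future = window[1:]                     # y_t,     shape (L-1, K)
    # Augment the regressors with a constant column so that v is estimated too.
    X = np.hstack([past, np.ones((len(past), 1))])
    B, *_ = np.linalg.lstsq(X, future, rcond=None)   # shape (K+1, K)
    A = B[:-1].T                            # (K, K)
    v = B[-1]                               # (K,)
    return A, v

def one_step_errors(window, A, v):
    """Squared one-step prediction errors between consecutive vectors."""
    pred = window[:-1] @ A.T + v            # predictions of window[1:]
    return np.sum((window[1:] - pred) ** 2, axis=1)

# Toy usage with random "feature vectors" (K = 8, window of L = 22 frames).
rng = np.random.default_rng(0)
Y = rng.standard_normal((22, 8)).cumsum(axis=0)
A, v = fit_var1(Y)
print(one_step_errors(Y, A, v).mean())
```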
The digital speech signal s(n) is converted into a set of spectral characteristic vectors yt calculated frame by frame, each one of which is a (p x 1) vector. Other characteristics (e.g., energy) calculated from speech are also possible. The short-duration spectrum samples must be calculated at sufficiently short intervals to achieve a time resolution adequate for the purpose. This typically leads to overlapping of the consecutive frames.
The matrix At is defined as the VAR(1) model calculated from the L data vectors terminating at the vector at the moment in time t.
At = VARLSE(yt-L+1, ..., yt)   (3)
The value of L must correspond to the average length of the phonemes in the speech. For each vector yt, M estimates are calculated recursively with the aid of the models At-M, ..., At-1.
ŷt,1 = At-1yt-1,  ŷt,2 = (At-2)^2 yt-2,  ...,  ŷt,M = (At-M)^M yt-M   (4)
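The following sketch shows one way to read this recursion: the model estimated from the window ending at frame t-i is applied i times, starting from the observed vector yt-i, to produce the i-step estimate of yt. This interpretation, and the inclusion of the constant term v at each step, are assumptions made for the example; the names are invented.

```python
import numpy as np

def m_step_estimates(models, Y, t, M):
    """Estimates ŷ(t, i) of the vector Y[t] for i = 1, ..., M.

    models[s] is assumed to hold the pair (A_s, v_s) of the VAR(1) model
    estimated from the data window ending at frame s (cf. equation (3)).
    Model A_{t-i} is applied recursively i times, starting from Y[t-i].
    """
    estimates = []
    for i in range(1, M + 1):
        A, v = models[t - i]
        y_hat = Y[t - i]
        for _ in range(i):            # i recursive one-step predictions
            y_hat = A @ y_hat + v
        estimates.append(y_hat)
    return np.stack(estimates)        # shape (M, K)

# Minimal usage with trivial models (A = identity, v = 0): each estimate then
# simply repeats the starting vector, so the call mainly checks the shapes.
K, T, M = 8, 30, 5
Y = np.zeros((T, K))
models = [(np.eye(K), np.zeros(K)) for _ in range(T)]
print(m_step_estimates(models, Y, t=20, M=M).shape)   # (5, 8)
```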
The relative errors are calculated from these estimates.
et,i = (yt - ŷt,i)^T (yt - ŷt,i) / (yt^T yt)   (5)
The median of these errors represents the error at the moment in time t. Other criteria for the selection of the error at the moment in time t are also possible.
et = median(et,1, ..., et,M)   (6)
Small values of et are boosted by taking the logarithm of the error signal et.
Et = 10 log10(1 + et)   (7)
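A sketch of the per-frame error computation of equations (5)-(7) is given below. Normalising the squared error by yt^T yt is our reading of "relative error", and the helper name is invented for the example.

```python
import numpy as np

def frame_error(y_t, estimates):
    """Combine the M estimates of one frame into a single error value E_t.

    y_t:       observed feature vector, shape (K,)
    estimates: the M estimates ŷ(t, 1..M), shape (M, K)
    """
    diff = estimates - y_t                            # (M, K)
    # Relative squared errors, equation (5); normalisation by y_t^T y_t
    # is an assumption about what "relative" means here.
    e_ti = np.sum(diff ** 2, axis=1) / (y_t @ y_t + 1e-12)
    e_t = np.median(e_ti)                             # equation (6)
    return 10.0 * np.log10(1.0 + e_t)                 # equation (7)

# Hypothetical frame and estimates.
rng = np.random.default_rng(1)
y_t = rng.standard_normal(8)
estimates = y_t + 0.1 * rng.standard_normal((5, 8))
print(frame_error(y_t, estimates))
```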
In the above, the model A is used to predict the values of y outside of the window from which the model was estimated. Up to this point, A was used recursively to produce the vectors for the time instants t+1, ..., t+M; the model therefore predicts the future values of y. The model can also be used to predict the values prior to the window from which the model was estimated. This can be done easily by inverting the order in time of the original y vectors and performing the same VAR analysis. The signals Et+ and Et-, which represent the prediction error forwards and backwards, are produced in the manner described above.
Et* = Et+ - Et-   (8)
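The backward predictor needs no separate machinery: reversing the feature-vector sequence in time and running the same analysis gives the backward error, as in the sketch below. For brevity, the per-frame error here is a plain one-step prediction error from a sliding VAR(1) fit rather than the full median-of-M-estimates measure, so it is only a simplified stand-in for Et.

```python
import numpy as np

def fit_var1(window):
    """Least-squares VAR(1) fit; returns (A, v)."""
    X = np.hstack([window[:-1], np.ones((len(window) - 1, 1))])
    B, *_ = np.linalg.lstsq(X, window[1:], rcond=None)
    return B[:-1].T, B[-1]

def error_signal(Y, L=22):
    """Simplified per-frame error: one-step error of the model fitted to the
    L frames ending at t-1 (a stand-in for the full Et of equation (7))."""
    E = np.zeros(len(Y))
    for t in range(L, len(Y)):
        A, v = fit_var1(Y[t - L:t])
        e = Y[t] - (A @ Y[t - 1] + v)
        E[t] = 10.0 * np.log10(1.0 + (e @ e) / (Y[t] @ Y[t] + 1e-12))
    return E

rng = np.random.default_rng(2)
Y = rng.standard_normal((300, 8)).cumsum(axis=0)     # toy feature-vector series
E_fwd = error_signal(Y)                              # prediction from the past
E_bwd = error_signal(Y[::-1])[::-1]                  # prediction from the future
E_star = E_fwd - E_bwd                               # combined signal, equation (8)
print(E_star.shape)
```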
In the following, the invention is examined with reference to the accompanying drawings, of which

- Figure 1 (a) shows the auditory spectrum of the speech,
- Figure 1 (b) shows the signals Et+ and Et- representing the forward and backward prediction errors,
- Figure 1 (c) shows the summed error signal Et*,
- Figure 1 (d) shows the Et* signal filtered by h(t),
- Figure 2 shows clean speech, man 1, threshold 0.20,
- Figure 3 shows clean speech, man 1, threshold 9.35, and
- Figure 4 shows noisy speech, man, threshold 0.20, M = 7, L = 66 ms.

The summed error signal has a large negative peak value just before a segment boundary and a large positive peak value just after it, as in Figure 1 (c). The relevant segment boundary is located between these two peak points. In order to facilitate the detection of these points, Et* is filtered with a filter whose impulse response h(t) is given in equation (9).
In h(t), d is the mean width of the peaks in the error signal. Using h(t) to filter Et* results in a signal whose peaks coincide with the segment boundaries [Figure 1 (d)]. The threshold value used to select the peaks depends on the application.
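The impulse response h(t) of equation (9) is not reproduced in this text, so the sketch below substitutes an assumed antisymmetric filter of half-width d, which turns the negative-then-positive pattern around a boundary into a single positive maximum, and then picks the maxima that exceed a threshold. Both the filter shape and the helper names are assumptions for the example.

```python
import numpy as np

def detect_boundaries(E_star, d=10, threshold=0.20):
    """Filter the combined error signal and pick peaks above a threshold.

    An assumed antisymmetric averaging filter of half-width d (d ~ mean peak
    width in frames) stands in for the h(t) of equation (9); it turns the
    negative-then-positive pattern around a boundary into one maximum.
    """
    h = np.concatenate([np.ones(d), [0.0], -np.ones(d)]) / d
    f = np.convolve(E_star, h, mode="same")
    # Local maxima of the filtered signal that exceed the threshold.
    peaks = [n for n in range(1, len(f) - 1)
             if f[n] > threshold and f[n] >= f[n - 1] and f[n] >= f[n + 1]]
    return np.array(peaks), f

# Toy combined error signal: a negative dip followed by a positive bump.
E_star = np.zeros(200)
E_star[90:100] = -1.0
E_star[100:110] = +1.0
boundaries, _ = detect_boundaries(E_star, d=10, threshold=0.20)
print(boundaries)      # expected to contain a frame index near 100
```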
Evaluating the performance of the segmentation algorithm is not entirely straightforward. The method disclosed here detects the greatest spectral changes, and it is preferable for these moments in time to correspond to the phonetic sound boundaries. Phonetic transcriptions were therefore used in the evaluation of performance.
At this stage, the aim is not to produce perfect phonetic segmentation by detecting every sound boundary. However, it is useful to examine how far one can get by using the basic method. These preliminary results can be improved using known methods.
It should be noted that the differences between manual segmentation and automatic segmentation are not errors in all cases, particularly if they appear systematically. The differences between automatic segmentation and phonetic transcription can also be caused by inaccuracies in the phonetic transcription performed by a person.
Two types of error can be detected in the system: rejection of a segment boundary, or a boundary being set in the wrong place. The quality criterion obtained for the system is:

Q = (Hit - Removal - Ignore) / Cases   (10)

Hit is the number of correctly detected boundaries, Removal refers to the number of rejected (missed) segment boundaries, and Ignore refers to the number of wrongly placed boundaries.
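As a hypothetical numerical illustration (the figures are invented, not from the patent): if a test set contains Cases = 100 reference boundaries, of which Hit = 80 are detected correctly, Removal = 15 are rejected, and Ignore = 5 are placed wrongly, the criterion gives Q = (80 - 15 - 5) / 100 = 0.60.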
201 Finnish sentences, combined from the speech of three different speakers (two men, one woman), were used to evaluate the operation of the system. The material was recorded in an anechoic chamber, using a sampling frequency of 22.05 kHz. The short-duration spectrum samples were line spectrum pairs (Warped Line Spectrum Frequencies, WLSF) calculated, using an auditory frequency scale, at intervals of 3 ms in a 20-ms time window, using Hamming windowing. Frequency warping was used to produce the auditory frequency scale and to reduce the order p of the linear prediction model. The use of the unusually short 3-ms step was required by the need to obtain sufficient data to estimate the VAR model, as well as the need to achieve a higher time resolution for segmentation. This also agrees more closely with the time resolution of human hearing.
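The front-end parameters quoted above (22.05 kHz sampling, a 20-ms Hamming window, and a 3-ms frame step) correspond to a framing loop of the kind sketched below. The WLSF analysis itself is not reproduced here; a plain magnitude spectrum is used as a stand-in feature, and the names are invented for the example.

```python
import numpy as np

FS = 22050                     # sampling frequency (Hz)
WIN = int(0.020 * FS)          # 20-ms analysis window -> 441 samples
HOP = int(0.003 * FS)          # 3-ms frame step       -> 66 samples

def frame_features(s):
    """Split the signal into overlapping Hamming-windowed frames and return
    one feature vector per frame (a plain magnitude spectrum here; the patent
    uses warped line spectrum frequencies instead)."""
    window = np.hamming(WIN)
    frames = []
    for start in range(0, len(s) - WIN + 1, HOP):
        frames.append(np.abs(np.fft.rfft(s[start:start + WIN] * window)))
    return np.array(frames)

# One second of noise stands in for a recorded speech signal.
s = np.random.default_rng(3).standard_normal(FS)
Y = frame_features(s)
print(Y.shape)    # about 330 frames of 221 spectral bins each
```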
The segmentation results for the three speakers are combined in Table 1. There was little variation in quality between the speakers. The greatest hit probability was achieved by the female speaker, though the qualitatively best performance was that of male speaker 1. The result reinforces the view that the method is largely speaker-independent.
Figures 2 and 3 show the results obtained for male speaker 1 (clean speech) when using two different threshold values to select the peaks. The figures show the Hit, Rejection, and Quality values for different values of the number of predictions M used in the method. The number of hits grows as M grows, but a larger M also appears to increase the number of rejected cases. Lengthening the data window L has the same effect: a longer window leads to greater Hit and Rejection numbers. The quantity Q has its highest values when the length of the window approaches the average phoneme length of 70 ms. The number of predictions M has no real effect on quality. Increasing the threshold value affects both the Hit and Rejection rates, but not so much the quality.
A group of Finnish phonemes was divided into seven sub-classes on the basis of their phonetic similarity. The division is shown in Table 2. Theoretically, there can thus be 49 different connections or transitions between the classes. Seven transitions were not observed in the material. Five of them were not observed at all, or else a disputed phonological rule of the Finnish language (marked with an x) operated between them. Two cases were possible, but they did not appear in the material (marked with a 0). 34 cases out of the 42 in which three or more changes appeared gave statistically sufficient information for segmentation. Eight cases produced no statistically significant segment information (shown in italics), on account of a small number of changes or a low probability of detection.
Figure 4 shows the results for noisy speech. The signal-to-noise ratio (SNR) was adjusted by adding pink noise to the speech, as well as so-called 'babble noise' interference. 'Babble noise' is interference that resembles speech. The level of performance fell considerably when the SNR dropped below the 15 dB level. Increasing the 'babble noise' interference reduced the quality of performance more rapidly than pink noise did.
Table 1. Segmentation results for the three speakers. M = 7, L = 66 ms, and threshold 0.20.
Table 2. Distribution percentages of detected segment boundaries and number of connections between the phoneme classes. Man, M = 7, L = 66 ms, threshold 0.2. Total number of segments 2264. Vowels: /a/, /e/, /i/, /o/, /u/, /y/, /ä/, /ö/; plosives: /b/, /d/, /g/, /k/, /p/, /t/; nasals: /n/, /m/, /ng/; fr.: /f/, /h/, /s/; li.: /j/, /l/, /v/; tr.: /r/; sil.: silence.

Claims

Claims
1. Method for the automatic segmentation of speech, characterized in that the changes taking place in a vector time series depicting speech are predicted on the basis of both data preceding the prediction point in time and data following the prediction point in time, with the aid of a vector-autoregressive (VAR) model.
2. Method according to Claim 1, characterized in that the variation taking place in the characteristic-vector sequences of sound groups of differing length is modelled with the aid of a vector-autoregressive (VAR) model.
3. Method according to Claim 1, characterized in that the error signals occurring at the sound boundaries are used to detect the segment boundaries.
4. Method according to Claim 1, characterized in that predictions are made from several prediction points in time preceding and following the time window, which permits the combination of the prediction errors.
5. Method according to Claim 1, characterized in that the segment boundaries are set at the location of the local maximum values of the error.
PCT/FI2005/000519 2004-11-30 2005-11-30 Method for the automatic segmentation of speech WO2006058958A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
FI20041541A FI20041541A (en) 2004-11-30 2004-11-30 Procedure for automatic segmentation of speech
FI20041541 2004-11-30

Publications (1)

Publication Number Publication Date
WO2006058958A1 true WO2006058958A1 (en) 2006-06-08

Family

ID=33515289

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/FI2005/000519 WO2006058958A1 (en) 2004-11-30 2005-11-30 Method for the automatic segmentation of speech

Country Status (2)

Country Link
FI (1) FI20041541A (en)
WO (1) WO2006058958A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010126709A1 (en) * 2009-04-30 2010-11-04 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123469A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Phrase border probability calculating device and continuous speech recognition device utilizing phrase border probability
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH08123469A (en) * 1994-10-28 1996-05-17 Mitsubishi Electric Corp Phrase border probability calculating device and continuous speech recognition device utilizing phrase border probability
US6535851B1 (en) * 2000-03-24 2003-03-18 Speechworks, International, Inc. Segmentation approach for speech recognition systems

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
DEMUYNCK K. ET AL: "A Comparison of different approaches to automatic speech segmentation", 5TH INTERNATIONAL CONFERENCE, TSD 2002, vol. 2448, 9 September 2002 (2002-09-09) - 12 September 2002 (2002-09-12), BRNO, CZECH REPUBLIC, pages 227 *
KAWABATA T.: "Predictor codebooks for speaker-independent speech recognition", 1992 IEEE INTERNATIONAL CONFERENCE ON ACOUSTIC, SPEECH, AND SIGNAL PROCESSING, vol. 1, 1992, pages 353 - 356 *
TAHIR ET AL: "Time varying autoregressive modeling approach for speech segmentation", SIXTH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, August 2001 (2001-08-01), MALAYSIA, pages 715 - 718 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010126709A1 (en) * 2009-04-30 2010-11-04 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection
CN102414742A (en) * 2009-04-30 2012-04-11 杜比实验室特许公司 Low complexity auditory event boundary detection
JP2012525605A (en) * 2009-04-30 2012-10-22 ドルビー ラボラトリーズ ライセンシング コーポレイション Low complexity auditory event boundary detection
CN102414742B (en) * 2009-04-30 2013-12-25 杜比实验室特许公司 Low complexity auditory event boundary detection
US8938313B2 (en) 2009-04-30 2015-01-20 Dolby Laboratories Licensing Corporation Low complexity auditory event boundary detection

Also Published As

Publication number Publication date
FI20041541A (en) 2006-05-31
FI20041541A0 (en) 2004-11-30

Similar Documents

Publication Publication Date Title
Mustafa et al. Robust formant tracking for continuous speech with speaker variability
Ananthapadmanabha et al. Detection of the closure-burst transitions of stops and affricates in continuous speech using the plosion index
Shanthi et al. Review of feature extraction techniques in automatic speech recognition
JPH075892A (en) Voice recognition method
Jiao et al. Convex weighting criteria for speaking rate estimation
Zhang et al. Improved modeling for F0 generation and V/U decision in HMM-based TTS
Lin et al. Automatic estimation of voice onset time for word-initial stops by applying random forest to onset detection
Karpagavalli et al. Phoneme and word based model for tamil speech recognition using GMM-HMM
Priya et al. Implementation of phonetic level speech recognition in Kannada using HTK
Nickel et al. Corpus-based speech enhancement with uncertainty modeling and cepstral smoothing
EP1081681B1 (en) Incremental training of a speech recognizer for a new language
Lugosch et al. Tone recognition using lifters and ctc
WO2006058958A1 (en) Method for the automatic segmentation of speech
Kupryjanow et al. Real-time speech signal segmentation methods
Jijomon et al. An offline signal processing technique for accurate localisation of stop release bursts in vowel-consonant-vowel utterances
Koc Acoustic feature analysis for robust speech recognition
Anh et al. A Method for Automatic Vietnamese Speech Segmentation
Hemakumar et al. Large Vocabulary in Continuous Speech Recognition Using HMM and Normal Fit
Sönmez et al. Consonant discrimination in elicited and spontaneous speech: a case for signal-adaptive front ends in ASR.
Sasou et al. Glottal excitation modeling using HMM with application to robust analysis of speech signal.
Dutta et al. A comparative study on feature dependency of the Manipuri language based phonetic engine
KR20080039072A (en) Speech recognition system for home network
Thandil et al. Automatic speech recognition system for utterances in Malayalam language
Deekshitha et al. Implementation of Automatic segmentation of speech signal for phonetic engine in Malayalam
Zeng et al. Robust children and adults speech classification

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A1

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KN KP KR KZ LC LK LR LS LT LU LV LY MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A1

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU LV MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 05818023

Country of ref document: EP

Kind code of ref document: A1