KR100223028B1

KR100223028B1 - Apparatus and method for modelling the duration time of speech synthesizer

Info

Publication number: KR100223028B1
Application number: KR1019960065737A
Authority: KR
Inventors: 김상호; 이정철
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1996-12-14
Filing date: 1996-12-14
Publication date: 1999-10-01
Also published as: KR19980047261A

Abstract

The present invention relates to a duration modeling apparatus for a speech synthesizer and a modeling method using the same. It builds on the analysis result that the length of a phoneme within a syllable varies linearly with the length of that syllable. When the length of a given syllable is linearly related to the length of a phoneme in the syllable, the phoneme duration is optimally allocated using a linear equation modeled for that phoneme; when the syllable length and the phoneme length are nonlinearly related, the phoneme duration is modeled by a nonlinear modeling method using a multilayer neural network. A duration modeling apparatus and a modeling method capable of improving the intelligibility and naturalness of a speech synthesizer are thereby disclosed.

Description

Duration Modeling Device of Speech Synthesizer and Modeling Method Using the Same

The present invention relates to a duration modeling apparatus for a speech synthesizer and a modeling method using the same, and more particularly to an apparatus and method that, in the development of a speech synthesis system for making the computer user interface more natural, can determine the durations of the unit syllables and phonemes that make up the synthesized speech.

In general, most duration models operate on phoneme units. This has the advantage that fewer units need to be modeled than with syllable units, but modeling is difficult because many factors affect phoneme duration and the duration changes sensitively with those factors. In addition, the naturalness and intelligibility of synthesized speech depend heavily on the ratio in which duration is allocated among the phonemes within a syllable, and with phoneme-level modeling it is hard to find an appropriate allocation ratio. This becomes a major cause of degraded synthesized speech quality. Syllable-unit modeling has therefore recently been attempted to overcome the shortcomings of phoneme units. In that approach, however, the method of allocating duration to the phonemes within a syllable is not appropriate, so the quality of the synthesized speech cannot be improved. Specifically, the syllable-unit technique first obtains the syllable duration and then obtains the phoneme durations by Equation 1 below.

In Equation 1, the duration of the i-th phoneme is $\mu_i + k\sigma_i$, where $\mu_i$ is the mean phoneme duration and $\sigma_i$ is its standard deviation. The constant k is stepped in increments of 0.1 in the positive or negative direction until the phoneme durations satisfy the syllable duration. This method is sensitive to the statistical characteristics ($\mu_i$, $\sigma_i$) of each phoneme, and it is not appropriate as a model because it cannot be said that the duration of each phoneme actually equals its mean plus k times its standard deviation. For example, assuming that the phoneme durations are distributed as in FIG. 1 (which is generally the case), in the syllable "삼" (/sam/) the final consonant /ㅁ/ (/m/) is longer than the vowel /아/ (/a/) in actual utterances. Under the conventional allocation method, however, the length of /아/, $\mu_2 + k\sigma_2$, exceeds the length of /ㅁ/, $\mu_3 + k\sigma_3$, so the vowel /아/ comes out longer instead. Moreover, the use of the mean and standard deviation assumes a Gaussian normal distribution. As shown in FIGS. 2(a) and 2(b), the vowel /아/ has a skewness of 1.12, which indicates a distribution skewed in the positive direction relative to the normal distribution (a skewness of 0 corresponds to a normal distribution). The final consonant /ㄴ/ (/n/) has a skewness of 0.08 and is relatively close to normal, but its shape departs from the normal distribution at durations around 80 to 100 msec, so the prediction error becomes large in that region. In conclusion, allocating phoneme durations within a syllable from the mean and standard deviation, under the assumption that phoneme durations are normally distributed, is not suitable for improving speech quality.
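The conventional allocation criticized above can be illustrated with a short sketch. It assumes each phoneme duration is its mean plus k times its standard deviation, and that k is stepped by 0.1 away from zero until the summed phoneme durations reach the syllable duration. The function name `allocate_by_k_step` and all numeric statistics are hypothetical values chosen for illustration; they are not taken from the patent.

```python
def allocate_by_k_step(means, stds, syllable_ms, step=0.1, max_steps=1000):
    """Prior-art style allocation: duration_i = mean_i + k * std_i.

    k starts at 0 and is stepped by `step` in the positive or negative
    direction until the summed durations reach the syllable duration.
    """
    def total(k):
        return sum(m + k * s for m, s in zip(means, stds))

    # Step upward if the mean durations fall short of the target, else downward.
    direction = 1 if total(0.0) < syllable_ms else -1
    n = 0
    while n < max_steps and direction * (syllable_ms - total(direction * n * step)) > 0:
        n += 1
    k = direction * n * step
    return k, [m + k * s for m, s in zip(means, stds)]


# Hypothetical statistics (ms) for onset /s/, vowel /a/, coda /m/ of "sam".
means = [70.0, 90.0, 85.0]
stds = [15.0, 40.0, 20.0]
k, durations = allocate_by_k_step(means, stds, syllable_ms=300.0)
print(k, durations)
```

With these illustrative numbers the vowel, which has the largest standard deviation, absorbs most of the stretch and ends up longer than the coda. This is the kind of outcome the text objects to for the syllable "삼".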

Accordingly, an object of the present invention is to provide a duration modeling apparatus for a speech synthesizer, and a modeling method using the same, that can improve the intelligibility and naturalness of the speech synthesizer. Using the analysis result that the length of a phoneme within a syllable varies linearly with the length of the syllable, the invention optimally allocates the phoneme duration with a linear equation modeled for that phoneme when the syllable length and the phoneme length are linearly related, and allocates the phoneme duration by a nonlinear modeling method using a neural network when they are nonlinearly related.

To achieve the above object, a duration modeling apparatus for a speech synthesizer according to the present invention comprises a sentence input device for receiving a sentence whose durations are to be modeled; a language processing device for performing reading conversion and sentence structure analysis on the sentence received through the sentence input device; a prosody processing device consisting of an intonation processing unit, an energy processing unit, and a duration processing unit; and a signal processing device for generating the actual speech waveform. The duration processing unit is characterized by comprising a syllable duration prediction model that uses a multilayer neural network to classify the nonlinear characteristics of the syllable duration distribution and the interdependencies among the factors, and a phoneme duration prediction model that uses the syllable duration predicted by the syllable duration prediction model to predict the phoneme durations linearly or nonlinearly according to the relationship between syllable length and phoneme length.

To achieve the above object, a duration modeling method for a speech synthesizer according to the present invention is characterized by comprising the steps of: allocating syllable durations by a nonlinear duration prediction modeling method using a neural network, according to the syllable position, syllable type, phonological environment, and number of syllables of the syllables constituting the sentence whose durations are to be modeled; and modeling the phoneme durations within each syllable by a linear modeling method using the syllable duration when the syllable length and the phoneme length are linearly related, and by a nonlinear modeling method using a multilayer neural network when they are nonlinearly related.

FIG. 1 is a distribution chart showing typical phoneme durations.

FIG. 2(a) is a distribution chart showing the duration of a typical vowel /아/ (/a/).

FIG. 2(b) is a distribution chart showing the duration of a typical final voiced consonant /ㄴ/ (/n/).

FIG. 3 is a block diagram showing a duration modeling apparatus for a speech synthesizer according to the present invention.

FIG. 4 is a detailed block diagram of the duration processing unit of FIG. 3.

FIG. 5 is a diagram for explaining a method of allocating phoneme lengths from the syllable length.

FIG. 6 is a structural diagram of the phoneme duration neural network.

* Explanation of reference numerals for main parts of the drawings

1: sentence input device
2: language processing device
3: prosody processing device
4: signal processing device
31: intonation processing unit
32: duration processing unit
33: energy processing unit
321: syllable duration estimator
322: phoneme duration estimator
323: linear model
324: neural network model

Hereinafter, a duration modeling apparatus for a speech synthesizer and a modeling method using the same according to the present invention are described in detail with reference to the accompanying drawings.

FIG. 3 is a block diagram showing a duration modeling apparatus for a speech synthesizer according to the present invention. The apparatus consists of a sentence input device (1), a language processing device (2) for reading conversion and sentence structure analysis of the sentence, a prosody processing device (3) that controls intonation, duration, and energy, and a signal processing device (4) that generates the actual speech waveform. The prosody processing device (3) consists of an intonation processing unit (31), a duration processing unit (32), and an energy processing unit (33).

FIG. 4 is a detailed block diagram of the duration processing unit (32) of the prosody processing device (3) of FIG. 3, which consists of a syllable duration prediction model (321) and a phoneme duration prediction model (322). The syllable duration prediction model (321) is a multilayer neural network (multi-layer perceptron) model, which is robust at classifying complex patterns such as the nonlinear characteristics of the syllable duration distribution and the interdependencies among factors. The input to the multilayer neural network consists of four factors, namely the syllable position within the word phrase, the syllable type, the phonological environment, and the number of syllables, converted into a 24-bit binary value. The hidden layer has N nodes, with N determined experimentally, and the output is the predicted syllable duration. The neural network is trained with the error back-propagation algorithm, and the sigmoid function, a nonlinear function, is applied at the hidden-layer nodes and the output node.

The syllable duration predicted by the neural network becomes the input data for phoneme duration prediction by the linear model (323), and the phoneme duration prediction model uses this input data to optimally allocate the phoneme durations within the syllable. There are two phoneme duration prediction models: one that models the distribution of phoneme durations within a syllable linearly, and a nonlinear modeling method that uses a multilayer neural network. The linear predictive modeling method approximates the distribution of the onset, nucleus, and coda phoneme durations as a function of syllable length by the linear equation $X = aY + b$. Here Y is the syllable duration predicted by the syllable duration prediction model (321) of FIG. 4, and X is the duration of the phoneme to be allocated at the optimal ratio. For example, the phonemes of a syllable with OVC (onset + nucleus + coda) structure, namely the onset ($X_1$), nucleus ($X_2$), and coda ($X_3$), are modeled as in Equation 2 below under the least-squares error (E) criterion.
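The linear model described above can be sketched as an ordinary least-squares fit of $X = aY + b$, done separately for the onset, nucleus, and coda. The sketch below is an assumption about how such a fit might be carried out, not the patent's Equation 2. The helper names `fit_line` and `allocate` and the training durations are hypothetical.

```python
def fit_line(syllable_ms, phoneme_ms):
    """Least-squares fit of X = a*Y + b, where Y is the syllable duration
    and X is the phoneme duration. Returns (a, b)."""
    n = len(syllable_ms)
    mean_y = sum(syllable_ms) / n
    mean_x = sum(phoneme_ms) / n
    var_y = sum((y - mean_y) ** 2 for y in syllable_ms)
    cov = sum((y - mean_y) * (x - mean_x)
              for y, x in zip(syllable_ms, phoneme_ms))
    a = cov / var_y
    b = mean_x - a * mean_y
    return a, b


# Hypothetical training data in ms: syllable durations and the durations
# of the onset, nucleus, and coda observed within those syllables.
syllable = [200.0, 250.0, 300.0, 350.0]
parts = {
    "onset": [60.0, 65.0, 70.0, 75.0],
    "nucleus": [90.0, 110.0, 130.0, 150.0],
    "coda": [50.0, 75.0, 100.0, 125.0],
}
models = {name: fit_line(syllable, xs) for name, xs in parts.items()}


def allocate(predicted_syllable_ms):
    """Allocate phoneme durations from a predicted syllable duration."""
    return {name: a * predicted_syllable_ms + b
            for name, (a, b) in models.items()}


allocation = allocate(280.0)
print(models)
print(allocation)
```

In this toy data the three phoneme durations sum exactly to the syllable duration in every training sample, so the fitted slopes sum to 1 and the intercepts sum to 0, and the allocated durations add up to the predicted syllable duration. Real data would not generally satisfy this exactly.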

FIG. 5 shows schematically how the length of each phoneme is determined when the onset, nucleus, and coda length distributions are approximated by linear equations. To evaluate this model, the correlation coefficient, which indicates how closely the linear prediction model matches the training data, was computed: it was 0.222 for the onset, 0.498 for the nucleus (peak), and 0.840 for the coda. A higher correlation coefficient indicates better modeling. Applying the method of FIG. 5 to other syllable environments allows the phoneme durations within a syllable to be allocated optimally.

The second method uses a multilayer neural network. The nonlinear predictive modeling method is a multilayer neural network (multi-layer perceptron) model and is used when the relationship between input and output is nonlinear. As shown in FIG. 6 of the drawings attached to the original application, the multilayer neural network model consists of an input layer, an output layer, and a hidden layer, and the weights connecting the layers are adjusted by the error back-propagation training algorithm. The present invention uses the ability of a neural network to model nonlinear relationships in order to predict the phoneme durations within a syllable when the syllable length is given. That is, although the duration of a syllable and the durations of its phonemes are almost linearly related, the nonlinear predictive modeling method using a neural network is used to model the relationship more accurately for particular phonemes or where the relationship is nonlinear.
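A minimal sketch of such a network follows: one hidden layer, sigmoid activations on the hidden and output nodes, and weights adjusted by error back-propagation. It is a simplification under stated assumptions rather than the network of FIG. 6. The input is a single normalized syllable length instead of a binary-coded one, the class name `TinyMLP`, the hidden-layer size, the learning rate, and the toy target (output equals the square of the input) are all hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """One hidden layer, sigmoid on hidden and output nodes, scalar in/out."""

    def __init__(self, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # input -> hidden
        self.b1 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]  # hidden -> output
        self.b2 = rng.uniform(-0.5, 0.5)

    def forward(self, x):
        hidden = [sigmoid(w * x + b) for w, b in zip(self.w1, self.b1)]
        out = sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)
        return hidden, out

    def train_step(self, x, target, lr):
        hidden, out = self.forward(x)
        # Output delta for the squared error 0.5 * (out - target)**2.
        delta_out = (out - target) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            # Back-propagate through the old hidden->output weight first.
            delta_h = delta_out * self.w2[j] * h * (1.0 - h)
            self.w2[j] -= lr * delta_out * h
            self.w1[j] -= lr * delta_h * x
            self.b1[j] -= lr * delta_h
        self.b2 -= lr * delta_out


def mse(net, data):
    return sum((net.forward(x)[1] - t) ** 2 for x, t in data) / len(data)


# Toy nonlinear relation: normalized phoneme ratio = (normalized syllable length)^2.
data = [(i / 10, (i / 10) ** 2) for i in range(11)]
net = TinyMLP(n_hidden=4)
before = mse(net, data)
for _ in range(2000):
    for x, t in data:
        net.train_step(x, t, lr=0.5)
after = mse(net, data)
print(before, after)
```

The printed errors show the training error before and after back-propagation on the toy data. A practical system would train on measured durations and choose the hidden-layer size experimentally, as the text notes for the syllable model.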

The first method, the linear-equation model, can be used when the phoneme length distribution is highly correlated with the syllable length; otherwise it cannot approximate the training data well. In particular, the approximation error grows when the relationship with the syllable length is nonlinear (for example, $X = Y^2$). Therefore, as shown in FIG. 6, a multilayer neural network, which can model such nonlinear characteristics well, is used. The input to the neural network is the syllable length obtained from the sentence input device (1) of FIG. 3, expressed as a binary number. The output is a ratio normalized by the maximum phoneme length in the training data. With the neural network, the correlation coefficients were 0.672 for the onset, 0.753 for the nucleus (peak), and 0.933 for the coda. In this way the naturalness and intelligibility of the final synthesized speech are improved, making the human interface between people and machines more natural.
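One way to act on the correlation-based criterion described above is sketched here: compute the Pearson correlation between syllable length and phoneme length on the training data, and fall back to the neural network when the correlation is low. The 0.8 threshold, the function names, and the sample durations are assumptions made for illustration; the text does not specify a selection rule or a cutoff.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def choose_model(syllable_ms, phoneme_ms, threshold=0.8):
    """Hypothetical rule: linear model if |r| >= threshold, else neural network."""
    r = pearson_r(syllable_ms, phoneme_ms)
    return ("linear" if abs(r) >= threshold else "neural_network"), r


# Hypothetical training durations in ms.
syllable = [200.0, 250.0, 300.0, 350.0]
coda = [50.0, 75.0, 100.0, 125.0]   # tracks the syllable length closely
onset = [60.0, 40.0, 75.0, 50.0]    # barely tracks the syllable length

coda_choice, coda_r = choose_model(syllable, coda)
onset_choice, onset_r = choose_model(syllable, onset)
print(coda_choice, coda_r)
print(onset_choice, onset_r)
```

Applied to the linear-model coefficients reported in the text, a cutoff of 0.8 would keep the coda (0.840) on the linear model and send the onset (0.222) and nucleus (0.498) to the neural network. Other cutoffs are equally defensible.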

As described above, the present invention uses the analysis result that the length of a phoneme within a syllable varies linearly with the length of the syllable. Given the syllable length, the phoneme length is optimally allocated using the linear equation modeled for that phoneme, or, when the phoneme length within the syllable is nonlinearly related to the syllable, the relationship is modeled with a neural network. This has the effect of improving the intelligibility and naturalness of the speech synthesizer.

Claims

A duration modeling apparatus for a speech synthesizer, the speech synthesizer comprising a sentence input device for receiving a sentence whose durations are to be modeled, a language processing device for performing reading conversion and sentence structure analysis on the sentence received through the sentence input device, a prosody processing device consisting of an intonation processing unit, an energy processing unit, and a duration processing unit, and a signal processing device for generating the actual speech waveform, characterized in that the duration processing unit comprises: a syllable duration prediction model that uses a multilayer neural network to classify the nonlinear characteristics of the syllable duration distribution and the interdependencies among the factors; and a phoneme duration prediction model that uses the syllable duration predicted by the syllable duration prediction model to predict the phoneme durations linearly or nonlinearly according to the relationship between syllable length and phoneme length.

A duration modeling method for a speech synthesizer, characterized by comprising the steps of: allocating syllable durations by a nonlinear duration prediction modeling method using a neural network, according to the syllable position, syllable type, phonological environment, and number of syllables of the syllables constituting the sentence whose durations are to be modeled; and modeling the phoneme durations within each syllable by a linear modeling method using the syllable duration when the syllable length and the phoneme length are linearly related, and by a nonlinear modeling method using a multilayer neural network when the syllable length and the phoneme length are nonlinearly related.

The duration modeling method of claim 2, wherein the linear modeling approximates the phoneme duration distributions of the onset, nucleus, and coda according to the syllable length as in the following formula,

where $X_{ij}$ is the duration of the phoneme to be allocated at the optimal ratio, and

Y is the syllable duration predicted by the syllable duration prediction model.