KR100608643B1

KR100608643B1 - Pitch modelling apparatus and method for voice synthesizing system

Info

Publication number: KR100608643B1
Application number: KR1019990055463A
Authority: KR
Inventors: 이준우
Original assignee: 엘지전자 주식회사
Priority date: 1999-12-07
Filing date: 1999-12-07
Publication date: 2006-08-09
Also published as: KR20010054592A

Abstract

The present invention relates to an intonation modeling apparatus and method for a speech synthesis system that improves the naturalness of synthesized speech by reproducing each speaker's own intonation pattern across various speakers and application domains. The apparatus comprises: a language processing unit that splits arbitrary input text into sentences, parses each separated sentence, and outputs information on sentence constituents and phonological changes based on the parsing result; an intonation prediction unit that determines a tonal pattern for each syllable by applying the sentence constituents and phonological changes analyzed by the language processing unit to a trained regression tree structure for tonal pattern prediction, and generates a fundamental frequency trajectory by applying the per-syllable tonal pattern to a regression tree structure for fundamental frequency trajectory prediction; and a synthesis unit that uses the fundamental frequency trajectory generated by the intonation prediction unit to overlap-add synthesis-unit data from a speech DB and generate the corresponding synthesized speech waveform.

Description

Intonation modeling apparatus and method for speech synthesis system {PITCH MODELLING APPARATUS AND METHOD FOR VOICE SYNTHESIZING SYSTEM}

FIG. 1 is a block diagram showing the schematic configuration of a conventional speech synthesis system.

FIG. 2 is a block diagram showing the configuration of the intonation modeling apparatus of the speech synthesis system of the present invention.

FIG. 3 is a flowchart of the tonal pattern and fundamental frequency trajectory modeling performed by the intonation prediction unit of FIG. 2.

FIG. 4 is a schematic diagram showing the elements of the Tilt model used by the intonation prediction unit of FIG. 2.

***** Description of reference numerals for the main parts of the drawings *****

100: language processing unit 200: intonation prediction unit

300: synthesis unit

The present invention relates to an intonation modeling apparatus and method for a speech synthesis system, and in particular to an apparatus and method that improve the naturalness of synthesized speech by reproducing each speaker's own intonation pattern across various speakers and application domains.

Speech synthesis generates natural, intelligible synthesized speech from arbitrary text sentences through language processing and signal processing.

In general, human intonation has a pattern and level unique to each person, and even for the same speaker its characteristics vary with the nature of the sentence being uttered. When these varying intonation patterns are generalized in speech synthesis to the pattern of one particular speaker, it is difficult to expect any improvement in the naturalness of the synthesized speech.

The intonation modeling technique of a typical conventional speech synthesis system, which generates synthesized speech through the language processing and signal processing described above, is explained below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a typical speech synthesis system. As shown, it comprises: a language processing unit (10) that splits arbitrary input text into sentences, parses each sentence, and converts it into a phonetic representation; a prosody processing unit (20) that generates prosodic information such as pitch and syllable duration according to the sentence constituents and phonological changes analyzed by the language processing unit (10); and a language synthesis unit (30) that uses the prosodic information extracted by the prosody processing unit (20) to overlap-add synthesis-unit data from a speech DB and generate the waveform of the synthesized speech. The operation of this system is as follows.

First, the language processing unit (10) receives arbitrary input text, splits it into sentences, parses each sentence, and converts it into a phonetic representation.

Processing is done sentence by sentence because the sentence is the unit that can be parsed. Here, a sentence is a grammatical unit that carries one complete thought or feeling and consists of a subject part and a predicate part.

Characters in the sentence that are not Hangul are converted to Hangul. Special characters and Latin letters are replaced with the corresponding Hangul by looking them up in an English pronunciation dictionary and a special-symbol pronunciation dictionary, and numerals are likewise converted to Hangul according to the way the number is to be read.
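The numeral conversion described above can be illustrated with a minimal sketch. The patent does not specify an algorithm, so the function names, the supported range (0 to 9999), and the two reading styles shown here are assumptions made for illustration only.

```python
# Hypothetical helpers; the patent does not define how numerals are read.
DIGITS = ["영", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
UNITS = ["", "십", "백", "천"]  # ones, tens, hundreds, thousands


def read_cardinal(n: int) -> str:
    """Sino-Korean cardinal reading for 0..9999."""
    if not 0 <= n <= 9999:
        raise ValueError("this sketch only covers 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(c) for c in str(n)]
    out = []
    for pos, d in enumerate(digits):
        power = len(digits) - 1 - pos
        if d == 0:
            continue  # zero digits are silent inside a cardinal number
        # A leading "일" is conventionally dropped before 십/백/천.
        if not (d == 1 and power > 0):
            out.append(DIGITS[d])
        out.append(UNITS[power])
    return "".join(out)


def read_digits(s: str) -> str:
    """Digit-by-digit reading of a numeral string."""
    return "".join(DIGITS[int(c)] for c in s)
```

Which reading applies to a given numeral depends on context (quantity, code, date, and so on); that selection logic is outside this sketch.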

Once the language processing unit (10) has finished analyzing and processing the sentence, the prosody processing unit (20) generates prosodic information such as pitch and syllable duration according to the analyzed sentence constituents and phonological changes.

The language synthesis unit (30) then uses the prosodic information extracted by the prosody processing unit (20) to overlap-add synthesis-unit data from the speech DB and generate the waveform of the synthesized speech.

The speech DB is divided into unvoiced and voiced sounds according to the phonemes of Korean. Unvoiced sounds are stored in PCM form. For voiced sounds, a PSE (Power Spectrum Envelope) is generated for each pitch period, converted into time-domain data, and then stored.

In the prosody processing of the prior art described above, it is difficult to automate corpus construction, and the system cannot adequately cope with the many variations of intonation.

The present invention was devised in view of the above problems. Its object is to provide an intonation modeling apparatus and method for a speech synthesis system that improve the naturalness of synthesized speech by automating the prediction of per-syllable tonal patterns and of the fundamental frequency trajectory for various speakers and application domains, thereby reproducing each speaker's own intonation pattern.

To achieve the above object, the apparatus of the present invention comprises: a language processing unit that splits arbitrary input text into sentences, parses each separated sentence, and outputs information on sentence constituents and phonological changes based on the parsing result; an intonation prediction unit that determines a tonal pattern for each syllable by applying the sentence constituents and phonological changes analyzed by the language processing unit to a trained regression tree structure for tonal pattern prediction, and generates a fundamental frequency trajectory by applying the per-syllable tonal pattern to a regression tree structure for fundamental frequency trajectory prediction; and a synthesis unit that uses the fundamental frequency trajectory generated by the intonation prediction unit to overlap-add synthesis-unit data from a speech DB and generate the corresponding synthesized speech waveform.

To achieve the above object, the method of the present invention comprises: a first step of splitting arbitrary input text into sentences and parsing each sentence; a second step of predicting a tonal pattern by applying the parsing result of the first step to the Tilt model and parameterizing the shape of the fundamental frequency; a third step of predicting the fundamental frequency trajectory by applying the tonal pattern of the second step to a regression tree structure for fundamental frequency trajectory prediction; and a fourth step of reading synthesis-unit data from the speech DB according to the fundamental frequency trajectory predicted in the third step and generating the corresponding synthesized speech waveform.

The operation and effects of the intonation modeling apparatus and method for a speech synthesis system according to the present invention are described in detail below with reference to the accompanying drawings.

FIG. 2 is a block diagram showing the configuration of the intonation modeling apparatus of the speech synthesis system of the present invention. As shown, it comprises: a language processing unit (100) that splits arbitrary input text into sentences, parses each sentence, and converts it into a phonetic representation; an intonation prediction unit (200) that determines a tonal pattern for each syllable by applying the sentence constituents and phonological changes analyzed by the language processing unit (100) to a trained regression tree structure for tonal pattern prediction, and then generates a fundamental frequency trajectory by applying the per-syllable tonal pattern to a regression tree structure for fundamental frequency vector prediction; and a synthesis unit (300) that uses the fundamental frequency trajectory from the intonation prediction unit (200) to overlap-add synthesis-unit data from the speech DB and generate the waveform of the synthesized speech. The operation of the present invention is as follows.

First, the language processing unit (100) operates in the same way as in the prior art: it splits arbitrary input text into sentences, parses each sentence, converts it into a phonetic representation, and outputs the result.

The intonation prediction unit (200) then determines a tonal pattern for each syllable by applying the sentence constituents and phonological changes analyzed by the language processing unit (100) to the trained regression tree structure for tonal pattern prediction, and generates the fundamental frequency trajectory by applying the per-syllable tonal pattern to the regression tree structure for fundamental frequency vector prediction.

In more detail, with reference to FIGS. 3 and 4, the tonal pattern is modeled by converting the fundamental frequency of each syllable into coefficients. To parameterize the fundamental frequency trajectory, the Tilt model is used, which was proposed for English to model the fundamental frequency at accent positions.

As shown in FIG. 4, the Tilt model can represent a fundamental frequency trajectory by a start level (ABS), an amplitude change (A) indicating how much the trajectory varies, and a tilt (Tilt) indicating the slope of the trajectory. This is expressed by the following equation.

[Equation not reproduced in the source text]

Here, the fundamental frequency vector length is L = D1 + D2, the fundamental frequency amplitude change is A = A1 + A2, and the fundamental frequency tilt satisfies -1.0 ≤ Tilt ≤ 1.0.
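The text gives only the relations L = D1 + D2, A = A1 + A2 and -1.0 ≤ Tilt ≤ 1.0. As a minimal sketch of how such coefficients could be computed, the code below assumes the commonly published Tilt formulation, in which the tilt value is the average of an amplitude ratio and a duration ratio between the rise part (A1, D1) and the fall part (A2, D2). This formulation, the use of magnitudes for A1 and A2, and all names are assumptions and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass
class TiltParams:
    abs_level: float  # start level (ABS), e.g. in Hz
    amplitude: float  # A = |A1| + |A2| (assumed: magnitudes of rise and fall)
    duration: float   # L = D1 + D2
    tilt: float       # shape coefficient in [-1.0, 1.0]


def to_tilt(abs_level, a1, d1, a2, d2):
    """Convert rise (a1, d1) and fall (a2, d2) measurements to Tilt coefficients."""
    amp_sum = abs(a1) + abs(a2)
    dur_sum = d1 + d2
    # Each ratio is +1 for a pure rise, -1 for a pure fall, 0 when balanced.
    tilt_amp = (abs(a1) - abs(a2)) / amp_sum if amp_sum else 0.0
    tilt_dur = (d1 - d2) / dur_sum if dur_sum else 0.0
    return TiltParams(abs_level, amp_sum, dur_sum, 0.5 * (tilt_amp + tilt_dur))
```

Under these assumptions a syllable whose pitch only rises maps to Tilt = 1.0, one that only falls maps to Tilt = -1.0, and a symmetric rise-fall maps to 0, which matches the stated range.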

The Tilt model has the advantage that it can model the fundamental frequency trajectory fairly accurately with a small number of real-valued variables, and that it can express the shape of the trajectory as well as its level. The present invention adopts the Tilt coefficients as the tonal pattern information.

That is, the modeled per-syllable tonal patterns are learned by a regression tree whose inputs are suitably selected sentence structure information and parsing results. The per-syllable fundamental frequency trajectories are vectorized and learned by a second regression tree whose inputs are the sentence syntactic structure, the parsing results, and the tonal patterns. The trained regression tree for tonal pattern prediction and the trained regression tree for fundamental frequency vector prediction are combined in the intonation prediction unit (200) of the synthesis system: the tonal patterns are first predicted from the syntactic structure and parsing results of the sentence obtained from the language processing unit (100), and these are then used to obtain the final fundamental frequency trajectory.
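The two-stage arrangement described above can be sketched as follows. This is an illustration only: the tiny regression tree, the feature layout, and the training values are all invented for the example, and the patent does not state which features, tree-growing criterion, or vector length were used.

```python
def _mean(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def _sse(rows):
    m = _mean(rows)
    return sum((v - mv) ** 2 for r in rows for v, mv in zip(r, m))


def fit_tree(X, Y, depth=2, min_leaf=1):
    """Tiny multi-output regression tree (squared-error splits)."""
    best = None
    if depth > 0 and len(X) >= 2 * min_leaf:
        for j in range(len(X[0])):
            for t in sorted({x[j] for x in X})[:-1]:
                left = [i for i, x in enumerate(X) if x[j] <= t]
                right = [i for i, x in enumerate(X) if x[j] > t]
                if len(left) < min_leaf or len(right) < min_leaf:
                    continue
                cost = _sse([Y[i] for i in left]) + _sse([Y[i] for i in right])
                if best is None or cost < best[0]:
                    best = (cost, j, t, left, right)
    if best is None:
        return {"leaf": _mean(Y)}
    _, j, t, left, right = best
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([X[i] for i in left], [Y[i] for i in left], depth - 1, min_leaf),
        "right": fit_tree([X[i] for i in right], [Y[i] for i in right], depth - 1, min_leaf),
    }


def predict(tree, x):
    while "leaf" not in tree:
        branch = "left" if x[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"]


# Toy data, invented for illustration only.
# Linguistic features per syllable: [is_phrase_boundary, position_in_word]
feats = [[0, 0], [0, 1], [1, 2], [0, 0], [1, 1]]
# Stage-1 targets: tonal pattern as [amplitude, tilt]
tones = [[5.0, 0.8], [4.0, 0.2], [20.0, -0.9], [6.0, 0.7], [18.0, -0.8]]
# Stage-2 targets: F0 vector with 3 samples per syllable (Hz)
f0 = [[120, 124, 125], [125, 127, 128], [130, 120, 110], [119, 124, 125], [128, 119, 110]]

tone_tree = fit_tree(feats, tones)                              # stage 1
f0_tree = fit_tree([f + t for f, t in zip(feats, tones)], f0)   # stage 2


def predict_f0(feature_row):
    tone = predict(tone_tree, feature_row)       # predict tonal pattern first
    return predict(f0_tree, feature_row + tone)  # then predict the F0 vector
```

The point of the design is that the second tree sees the predicted tonal pattern as an extra input, so the coarse shape decision and the detailed F0 values are learned separately.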

The synthesis unit (300) then uses the final fundamental frequency trajectory from the intonation prediction unit (200) to overlap-add synthesis-unit data from the speech DB and generate the waveform of the synthesized speech.
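A minimal sketch of the overlap-add step follows. It assumes that each synthesis unit has already been cut into short segments (for example, about two pitch periods long), that one target F0 value is available per segment, and that a periodic Hann window is applied. The segmentation, the window choice, and the function names are assumptions; the patent only states that overlap-add is performed.

```python
import math


def hann(n):
    """Periodic Hann window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def overlap_add(segments, f0_track, sample_rate=16000):
    """Place windowed segments at pitch marks derived from the target F0.

    segments: list of sample lists, one per pitch mark
    f0_track: target fundamental frequency in Hz, one value per segment
    """
    # Successive pitch marks are one target pitch period apart.
    marks, pos = [], 0.0
    for f0 in f0_track:
        marks.append(int(round(pos)))
        pos += sample_rate / f0
    length = max(m + len(s) for m, s in zip(marks, segments))
    out = [0.0] * length
    for start, seg in zip(marks, segments):
        window = hann(len(seg))
        for i, sample in enumerate(seg):
            out[start + i] += window[i] * sample  # overlapping parts are summed
    return out
```

Raising the target F0 moves the pitch marks closer together and lowering it spreads them apart, which is how the predicted trajectory shapes the pitch of the output in this sketch.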

Besides tonal pattern modeling and prediction using the Tilt coefficients, it is also possible to model the tonal pattern using the average level of the fundamental frequency of each syllable, or to model it using the relative level within the word for syllables that are not at a prosodic phrase boundary and the amplitude change and Tilt coefficient of the Tilt model for boundary syllables.

As described in detail above, by automating tonal pattern extraction and fundamental frequency trajectory prediction, the present invention reduces the manpower and time previously required for tonal pattern extraction, and by reproducing each speaker's own intonation pattern it improves the naturalness of the speech synthesis system.

Claims

An intonation modeling apparatus for a speech synthesis system, comprising:

a language processing unit that splits arbitrary input text into sentences, parses each separated sentence, and outputs information on sentence constituents and phonological changes based on the parsing result;

an intonation prediction unit that determines a tonal pattern for each syllable by applying the sentence constituents and phonological changes analyzed by the language processing unit to a trained regression tree structure for tonal pattern prediction, and generates a fundamental frequency trajectory by applying the per-syllable tonal pattern to a regression tree structure for fundamental frequency trajectory prediction; and

a synthesis unit that uses the fundamental frequency trajectory generated by the intonation prediction unit to overlap-add synthesis-unit data from a speech DB and generate the corresponding synthesized speech waveform.

The intonation modeling apparatus for a speech synthesis system, wherein the tonal pattern for each syllable is divided into prosodic phrase boundary syllables and non-boundary syllables.

An intonation modeling method for a speech synthesis system, comprising:

a first step of splitting arbitrary input text into sentences and parsing each sentence;

a second step of predicting a tonal pattern by applying the parsing result of the first step to the Tilt model and parameterizing the shape of the fundamental frequency;

a third step of predicting the fundamental frequency trajectory by applying the tonal pattern of the second step to a regression tree structure for fundamental frequency trajectory prediction; and

a fourth step of reading synthesis-unit data from a speech DB according to the fundamental frequency trajectory predicted in the third step and generating the corresponding synthesized speech waveform.

delete

The method of claim 3, wherein the tonal pattern is learned through a regression tree structure that takes sentence structure information and the parsing result as inputs.

The method of claim 3, wherein the tonal pattern is predicted using the average level of the fundamental frequency of each syllable.

The method of claim 3, wherein the tonal pattern is predicted using the relative level within the word for syllables that are not at a prosodic phrase boundary, and using the amplitude change and the Tilt coefficient of the Tilt model for prosodic phrase boundary syllables.

The method of claim 3, wherein the fundamental frequency trajectory using the Tilt model is expressed by the following equation.

[Equation not reproduced in the source text]

where

fundamental frequency vector length: L = D1 + D2

fundamental frequency amplitude change: A = A1 + A2

fundamental frequency tilt: -1.0 ≤ Tilt ≤ 1.0

i is a program variable for the Tilt model