KR100373329B1 - Apparatus and method for text-to-speech conversion using phonetic environment and intervening pause duration

Info

Publication number: KR100373329B1
Application number: KR10-1999-0033869A
Authority: KR
Inventors: 이정철; 강동규; 김상훈
Original assignee: 한국전자통신연구원
Priority date: 1999-08-17
Filing date: 1999-08-17
Publication date: 2003-02-25
Also published as: KR20010018064A

Abstract

1. Technical field of the invention set forth in the claims

The present invention relates to a text-to-speech conversion apparatus and method, and to a computer-readable recording medium on which a program and a data structure for implementing the method are recorded.

2. Technical problem to be solved by the invention

The present invention aims to provide a text-to-speech conversion apparatus and method that, in a speech recognition system or the like, improve the intelligibility and naturalness of synthesized speech by using the phonetic environment and the length of the silent intervals located before and after a synthesis unit when selecting synthesis units, and a computer-readable recording medium on which a program for implementing the same is recorded.

3. Summary of the solution

The present invention comprises: a first step of building a synthesis unit database composed of speech segments corresponding to synthesis unit search information; a second step of extracting a phoneme string and the syntactic structure information of a sentence from input text and estimating prosody control parameter values using rules and a prosody table; and a third step of estimating, from the estimated phoneme string and prosody control parameter values, the phonetic environment of the target phoneme and of the immediately preceding and following phoneme strings together with the symbol-converted silent interval lengths to generate synthesis unit search information, and then using that search information to select and concatenate candidate speech segments stored in the synthesis unit database to generate the desired synthesized speech.

4. Principal use of the invention

The present invention is used in apparatuses that generate synthesized speech from input text, and the like.

Description

Text-to-speech conversion apparatus and method using phonetic environment and silent interval length {APPARATUS AND METHOD FOR TEXT-TO-SPEECH CONVERSION USING PHONETIC ENVIRONMENT AND INTERVENING PAUSE DURATION}

The present invention relates to a text-to-speech conversion apparatus and method that, in a speech recognition system or the like, improve the intelligibility and naturalness of synthesized speech by using the phonetic environment and the length of the silent intervals located before and after a synthesis unit when selecting synthesis units, and to a computer-readable recording medium on which a program and a data structure for implementing the method are recorded.

In selecting candidates, existing synthesizers generally consider only the left and right phonetic environment in order to reproduce the allophonic characteristics of the target phoneme naturally. Conventional methods of generating synthesized speech thus treat a silent interval as a single phoneme regardless of its length.

In actual speech, however, depending on how long the silent interval is, the phonetic environment before or after it still affects the acoustic and phonetic features of the target phoneme, so this must be taken into account.

The function of a text-to-speech conversion system (TTS) is to allow a computer to present various kinds of information to its human user as speech.

To this end, a TTS must be able to provide the user with high-quality synthesized speech from the given text. High-quality synthesized speech here means that the sound values are clear and that prosodic elements such as phrasing, duration, intensity, and pitch are realized appropriately, so that naturalness is high. This is comparable to using suitable vocabulary, spacing, and punctuation in written text to convey meaning accurately to the reader.

These prosodic elements of an utterance, however, result from the combined effect of the sentence's semantic structure, its syntactic structure, the words, coarticulation, the speaker's intention, the speaking rate, and so on.

Therefore, to generate high-quality synthesized speech, an existing speech synthesizer takes an input text sentence through reading conversion, syntactic structure analysis, prosody processing, and signal processing, and generates synthesized speech in which segment selection, phrasing, duration, intensity, and pitch are controlled. The process by which a conventional TTS generates synthesized speech from input text is described below with reference to FIG. 1.

FIG. 1 is a block diagram of a conventional text-to-speech conversion apparatus, in which "11" denotes a language processing unit, "12" a prosody processing unit, "13" a signal processing unit, and "14" a synthesis unit database.

The language processing unit (11) receives a text sentence input to the TTS and converts symbols, numbers, and foreign words into their readings. It also estimates the syntactic structure information needed for prosody processing, such as syllable and word boundary information, the part of speech of each word in the sentence, and grammatical functions, and performs grapheme-to-phoneme conversion using an exception pronunciation dictionary and reading conversion rules.

Through the above process, the language processing unit (11) passes the phoneme string and the syntactic structure information of the sentence to the prosody processing unit (12).

The prosody processing unit (12) takes the syntactic structure information and the phoneme string received from the language processing unit (11) as input and, using rules and a prosody table, calculates the prosody parameter values related to phrasing, pitch, intensity, and duration. That is, it applies phrase-break rules according to the degree of separation at phrase and clause boundaries to insert appropriate pause intervals, calculates intonation-related pitch values using rules and tables according to the modification relations between words, and assigns the pitch values to the corresponding phonemes.

The prosody processing unit (12) then calculates the duration of each phoneme using the intrinsic duration of each phoneme and modification rules that depend on the phonetic environment and the syntactic structure. It also generates the energy contour of each phoneme, taking into account the articulatory characteristics and grammatical function of each phoneme.
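The duration calculation described above can be illustrated with a short sketch. It assumes a multiplicative rule scheme, which the text does not specify; the table `INTRINSIC_MS`, the function `phoneme_duration`, the two rule conditions, and every numeric value are hypothetical and chosen only for illustration.

```python
# Hypothetical intrinsic durations in milliseconds (illustrative values only).
INTRINSIC_MS = {"a": 90.0, "k": 60.0, "s": 100.0}


def phoneme_duration(phoneme, phrase_final=False, before_pause=False):
    """Scale a phoneme's intrinsic duration by context-dependent rule factors."""
    duration = INTRINSIC_MS[phoneme]
    if phrase_final:   # assumed rule driven by syntactic structure
        duration *= 1.5
    if before_pause:   # assumed rule driven by phonetic environment
        duration *= 1.25
    return duration


print(phoneme_duration("a"))                     # 90.0
print(phoneme_duration("a", phrase_final=True))  # 135.0
```

A real rule set would be derived from recorded speech; the point here is only that each phoneme starts from an intrinsic value that context rules then adjust.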

Through the above process, the prosody processing unit (12) passes the estimated prosody parameter values, together with the phoneme string information, to the signal processing unit (13).

The signal processing unit (13) uses the phoneme string and prosody control parameter values received from the prosody processing unit (12) to select, from the synthesis unit database (14), candidate speech segments suitable for generating the synthesized speech. Where necessary, it modifies the selected segments by signal processing so that they match the prosody control parameter values, and then concatenates the segments to produce the desired synthesized speech.

The design of the synthesis units and the selection of candidate segments are important factors that strongly affect whether a TTS can generate high-quality synthesized speech, and they require considerable experience and skill.

As for selecting and building synthesis units, sentences and words are unlimited in number and therefore cannot be considered as synthesis units for a TTS. Syllables are finite in number, but the number of allophonic variants arising from the adjacent phonetic environment becomes large; phonemes are few in number, but it is difficult to reproduce correctly the coarticulation between phonemes that occurs in actual speech.

To account for coarticulation, Peterson proposed the diphone in 1958; it was implemented in a formant synthesizer by Dixon in 1968 and adopted in a linear predictive coding (LPC) synthesizer by Olive in 1977.

However, because these attempts involved smoothing at the concatenation points, changes to diphone durations, and changes to the fundamental frequency, the naturalness of the synthesized speech fell short of expectations.

The demisyllable proposed by Fujimura in 1978 has the advantage of grouping strongly coarticulated consonant clusters within a syllable into a single unit, but the disadvantage of handling coarticulation at syllable boundaries inadequately.

Other approaches have also been tried: the "syllable dyad", a VCV unit (a phonetic environment consisting of a vowel + consonant + vowel sequence) proposed by Sivertsen in 1961; the triphone and tetraphone, which came into use in the 1980s; Sagisaka's synthesis units of non-uniform length in 1988; and Nakajima's method of building triphones by context oriented clustering using an automatic regression approach. In 1996, Donovan used a phoneme recognizer to automate the construction of the synthesis unit database on a statistical basis.

Despite their various advantages, however, these methods depend on the performance of the speech recognizer, and the transition regions within a segment are not joined naturally.

Research on synthesis units thus began with building small inventories of phonemes, diphones, and demisyllables, whose intelligibility and naturalness were clearly limited, and is being extended to synthesis units that take multi-phoneme sequences and prosodic characteristics into account in order to improve the intelligibility of synthesized speech.

As described above, in selecting candidates a conventional synthesizer generally considers the left and right phonetic environment in order to reproduce the allophonic characteristics of the target phoneme naturally.

For example, if the target phoneme is denoted T and the preceding and following phonetic environments are denoted x and y, the speech segments stored in the synthesis unit database (14) are built according to the condition xTy, and candidate selection likewise picks segments that satisfy the condition xTy. A silent interval is treated as a phoneme and included in the left or right phonetic environment. In other words, the conventional method regards a silent interval as a single phoneme regardless of its length.
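The consequence of the xTy condition can be seen in a small sketch. The names `Phone`, `SIL`, and `conventional_key`, as well as the phoneme labels and durations, are hypothetical; the sketch only illustrates that a short pause and a long pause produce the same lookup key when a pause counts as one phoneme.

```python
from typing import List, NamedTuple, Tuple

SIL = "sil"  # hypothetical label for a silent interval treated as a phoneme


class Phone(NamedTuple):
    label: str
    duration_ms: int


def conventional_key(phones: List[Phone], i: int) -> Tuple[str, str, str]:
    """Return the (x, T, y) key for phones[i]; pause length is ignored."""
    left = phones[i - 1].label if i > 0 else SIL
    right = phones[i + 1].label if i + 1 < len(phones) else SIL
    return (left, phones[i].label, right)


# Two utterances that differ only in the length of the pause before "k".
short_pause = [Phone("a", 80), Phone(SIL, 30), Phone("k", 60), Phone("a", 90)]
long_pause = [Phone("a", 80), Phone(SIL, 400), Phone("k", 60), Phone("a", 90)]

# Both yield ("sil", "k", "a"), so the same segment would be selected.
print(conventional_key(short_pause, 2) == conventional_key(long_pause, 2))  # True
```

The vowel "a" before the pause never enters the key, even when the pause is short enough for it to influence "k".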

This conventional method has the advantage of accounting for the effect of a silent interval on the acoustic and phonetic features of the target phoneme while reducing the number of segments that must be registered. However, it overlooks the fact that, in actual speech, the phonetic environment before or after a silent interval still affects the acoustic and phonetic features of the target phoneme depending on the length of the interval, and the naturalness and intelligibility of the synthesized speech suffer as a result.

The present invention has been proposed to solve the problems described above. Its object is to provide a text-to-speech conversion apparatus and method that, in a speech recognition system or the like, improve the intelligibility and naturalness of synthesized speech by using the phonetic environment and the length of the silent intervals located before and after a synthesis unit when selecting synthesis units, and a computer-readable recording medium on which a program for implementing the method is recorded.

FIG. 1 is a block diagram of a conventional text-to-speech conversion apparatus.

FIG. 2 is a block diagram of an embodiment of a text-to-speech conversion apparatus according to the present invention.

FIG. 3 is a flowchart of an embodiment of a text-to-speech conversion method according to the present invention.

* Description of reference numerals for the main parts of the drawings

21: language processing unit 22: prosody processing unit

23: signal processing unit 24: synthesis unit database

To achieve the above object, the present invention provides a method of reading content stored in a database management apparatus, characterized in that, when classifying the allophones of a phoneme, a target phoneme (T), an element (x) that is the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string, and an element (y) that is the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string are arranged in sequence, with a first silent interval (S1) inserted between the element (x) and the target phoneme (T), and a second silent interval (S2) inserted between the target phoneme (T) and the element (y).

The present invention also provides a method of storing content in a database management apparatus, characterized in that, when classifying the allophones of a phoneme, a target phoneme (T), an element (x) that is the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string, and an element (y) that is the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string are arranged in sequence, with a first silent interval (S1) inserted between the element (x) and the target phoneme (T), and a second silent interval (S2) inserted between the target phoneme (T) and the element (y).

The present invention also provides a method of reading content stored in a database management apparatus, characterized in that, when classifying the allophones of a syllable, a target syllable (T), an element (x) that is the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string, and an element (y) that is the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string are arranged in sequence, with a first silent interval (S1) inserted between the element (x) and the target syllable (T), and a second silent interval (S2) inserted between the target syllable (T) and the element (y).

The present invention also provides a method of storing content in a database management apparatus, characterized in that, when classifying the allophones of a syllable, a target syllable (T), an element (x) that is the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string, and an element (y) that is the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string are arranged in sequence, with a first silent interval (S1) inserted between the element (x) and the target syllable (T), and a second silent interval (S2) inserted between the target syllable (T) and the element (y).

The present invention also provides an apparatus for generating synthesized speech from input text, comprising: language processing means for extracting a phoneme string and the syntactic structure information of a sentence from the input text; prosody processing means for receiving the syntactic structure information and the phoneme string extracted by the language processing means and estimating prosody control parameter values using rules and a prosody table; synthesis unit storage means for storing speech segments corresponding to synthesis unit search information; and synthesized speech generating means for estimating, from the phoneme string and the prosody control parameter values estimated by the prosody processing means, the phonetic environment of the target phoneme and of the immediately preceding and following phoneme strings together with the symbol-converted silent interval lengths to generate the synthesis unit search information, selecting speech segments stored in the synthesis unit storage means using the synthesis unit search information, and concatenating the selected segments to generate the desired synthesized speech.

The present invention also provides a text-to-speech conversion method applied to an apparatus for generating synthesized speech from input text, comprising: a first step of building a synthesis unit database composed of speech segments corresponding to synthesis unit search information; a second step of extracting a phoneme string and the syntactic structure information of a sentence from the input text and estimating prosody control parameter values using rules and a prosody table; and a third step of estimating, from the estimated phoneme string and prosody control parameter values, the phonetic environment of the target phoneme and of the immediately preceding and following phoneme strings together with the symbol-converted silent interval lengths to generate the synthesis unit search information, and using the synthesis unit search information to select and concatenate candidate speech segments stored in the synthesis unit database to generate the desired synthesized speech.

The present invention also provides a computer-readable recording medium on which is recorded a program for implementing, in a text-to-speech conversion apparatus having a processor: a function of building a synthesis unit database composed of speech segments corresponding to synthesis unit search information; a function of extracting a phoneme string and the syntactic structure information of a sentence from input text and estimating prosody control parameter values using rules and a prosody table; and a function of estimating, from the estimated phoneme string and prosody control parameter values, the phonetic environment of the target phoneme and of the immediately preceding and following phoneme strings together with the symbol-converted silent interval lengths to generate the synthesis unit search information, and using the synthesis unit search information to select and concatenate candidate speech segments stored in the synthesis unit database to generate the desired synthesized speech.

The present invention also provides a computer-readable recording medium on which is recorded, for a database management apparatus having a processor, data that has the following structures when the allophones of a phoneme are classified: a target phoneme (T) structure; a structure (x) for the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string; a structure (y) for the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string; a first silent interval (S1) structure inserted between the structure (x) and the target phoneme (T); and a second silent interval (S2) structure inserted between the target phoneme (T) and the structure (y).

The present invention also provides a computer-readable recording medium on which is recorded, for a database management apparatus having a processor, data that has the following structures when the allophones of a syllable are classified: a target syllable (T) structure; a structure (x) for the immediately preceding phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string; a structure (y) for the immediately following phoneme string or the articulation environment and manner of articulation related to the immediately preceding phoneme string; a first silent interval (S1) structure inserted between the structure (x) and the target syllable (T); and a second silent interval (S2) structure inserted between the target syllable (T) and the structure (y).
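The x, S1, T, S2, y arrangement described in the preceding paragraphs can be pictured as a five-field record used as a database key. The following is a minimal sketch under that assumption; the class names `UnitKey` and `UnitDatabase`, the pause symbols "P0" and "P2", and the segment identifiers are all hypothetical and do not come from the patent.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UnitKey:
    """Hypothetical record arranged as x, S1, T, S2, y."""
    x: str   # preceding phoneme string or its articulation class
    s1: str  # symbol for the silent interval between x and T
    t: str   # target phoneme or syllable
    s2: str  # symbol for the silent interval between T and y
    y: str   # following phoneme string or its articulation class


class UnitDatabase:
    """In-memory stand-in for the synthesis unit database."""

    def __init__(self) -> None:
        self._segments = defaultdict(list)

    def store(self, key: UnitKey, segment_id: str) -> None:
        self._segments[key].append(segment_id)

    def read(self, key: UnitKey) -> List[str]:
        return list(self._segments.get(key, []))


db = UnitDatabase()
db.store(UnitKey("a", "P0", "k", "P0", "a"), "seg_001")  # no pause before "k"
db.store(UnitKey("a", "P2", "k", "P0", "a"), "seg_002")  # longer pause before "k"

print(db.read(UnitKey("a", "P2", "k", "P0", "a")))  # ['seg_002']
```

Because the pause symbols are part of the key, segments recorded with different pause lengths around the same target are stored and retrieved separately, while the phoneme on the far side of the pause stays visible in x or y.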

In a speech recognition system or the like, the present invention defines the allophones of a target phoneme using information on the presence and length of the silent intervals inserted before and after the target phoneme together with information on the phonemes located before and after it, and searches for and selects candidate speech segments suitable for generating the synthesized speech accordingly, thereby improving the naturalness and intelligibility of the synthesized speech.

The above objects, features, and advantages will become clearer from the following detailed description taken in conjunction with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 2 is a block diagram of an embodiment of a text-to-speech conversion apparatus according to the present invention.

As shown in FIG. 2, the text-to-speech conversion apparatus according to the present invention, which builds on the apparatus of FIG. 1 for generating synthesized speech from input text, comprises: a language processing unit (21) for extracting a phoneme string and the syntactic structure information of a sentence from the input text; a prosody processing unit (22) for receiving the syntactic structure information and the phoneme string extracted by the language processing unit (21) and estimating prosody control parameter values using rules and a prosody table; a synthesis unit database (24) for storing speech segments corresponding to synthesis unit search information; and a signal processing unit (23) for estimating, from the phoneme string and the prosody control parameter values estimated by the prosody processing unit (22), the phonetic environment of the target phoneme and of the immediately preceding and following phoneme strings together with the symbol-converted silent interval lengths to generate the synthesis unit search information used to search the segments stored in the synthesis unit database (24), selecting candidate segments stored in the synthesis unit database (24) using that search information, and concatenating the selected segments to generate the desired synthesized speech.

The language processing unit (21) receives a text sentence input to the TTS and converts symbols, numbers, and foreign words into their readings. It also estimates the syntactic structure information needed for prosody processing, such as syllable and word boundary information, the part of speech of each word in the sentence, and grammatical functions, and performs grapheme-to-phoneme conversion using an exception pronunciation dictionary and reading conversion rules.

Through the above process, the language processing unit (21) passes the phoneme string and the syntactic structure information of the sentence to the prosody processing unit (22).

The prosody processing unit (22) takes the syntactic structure information and the phoneme string received from the language processing unit (21) as input and uses rules and a prosody table to compute prosody parameter values related to phrasing (pauses), pitch, intensity, and duration. Specifically, it applies phrase-break rules according to the degree of separation at phrase and clause boundaries to insert appropriate pauses, computes intonation-related pitch values using rules and tables according to the modification relationships between words, and assigns the pitch values to the corresponding phonemes.

The prosody processing unit (22) also computes the duration of each phoneme from the phoneme's intrinsic duration and from modification rules that depend on the phonetic environment and the syntactic structure. In addition, it generates an energy contour for each phoneme, taking into account the phoneme's articulatory characteristics and grammatical function.
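A rule-based duration calculation of this kind can be sketched as an intrinsic value scaled by context-dependent factors. The table values, the two rules, and the name `phoneme_duration` are assumptions made for illustration; the patent does not give the actual rules or numbers.

```python
# Hypothetical intrinsic durations in milliseconds.
INTRINSIC_MS = {"a": 90.0, "k": 60.0, "n": 55.0}


def phoneme_duration(phoneme, phrase_final=False, before_pause=False):
    """Scale the intrinsic duration by factors chosen from the context."""
    duration = INTRINSIC_MS[phoneme]
    if phrase_final:
        duration *= 1.5   # assumed lengthening at the end of a phrase
    if before_pause:
        duration *= 1.25  # assumed lengthening before a silent interval
    return duration
```

For example, `phoneme_duration("a", phrase_final=True)` yields 135.0 ms with these assumed factors, against 90.0 ms for the same phoneme in a neutral context.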

Through the above process, the prosody processing unit (22) passes the estimated prosody parameter values, together with the phoneme string information, to the signal processing unit (23).

From the phoneme string received from the prosody processing unit (22), the signal processing unit (23) selects the target phoneme and the immediately preceding and immediately following phoneme strings with respect to the phonetic environment. From the preceding and following phoneme strings it estimates coarticulation patterns related to manner and place of articulation, and sets the target phoneme (T), the phonetic environment (x) of the preceding phoneme string, and the phonetic environment (y) of the following phoneme string. The signal processing unit (23) then takes, from the prosody control parameter values, the durations of the silent intervals inserted immediately before and after the target phoneme, quantizes them linearly or nonlinearly, and converts them into symbols (S1, S2).
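The conversion of silent-interval durations into symbols can be sketched with one linear (uniform) and one nonlinear (non-uniform) quantizer. The step size, number of levels, and bin boundaries below are assumed values, and the function names are hypothetical; the patent states only that the quantization is linear or nonlinear.

```python
import bisect


def quantize_linear(pause_ms, step_ms=50.0, levels=4):
    """Uniform bins of width step_ms; the last level is open-ended."""
    pause_ms = max(pause_ms, 0.0)
    return min(int(pause_ms // step_ms), levels - 1)


def quantize_nonlinear(pause_ms, boundaries_ms=(10.0, 50.0, 200.0)):
    """Non-uniform bins: the symbol is the number of boundaries <= pause_ms."""
    return bisect.bisect_right(boundaries_ms, max(pause_ms, 0.0))


def pause_symbols(before_ms, after_ms, quantizer=quantize_nonlinear):
    """Return (S1, S2) for the silences before and after the target phoneme."""
    return quantizer(before_ms), quantizer(after_ms)
```

A nonlinear scheme with narrow bins near zero distinguishes "no pause" from "very short pause" while grouping all long pauses together, which is one plausible motivation for offering both options.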

From the target phoneme (T), the phonetic environment (x) of the immediately preceding phoneme string, the phonetic environment (y) of the immediately following phoneme string, and the silent-interval durations converted into symbols (S1, S2), the signal processing unit (23) generates synthesis unit search information of the form x.S1.T.S2.y. It uses this search information (x.S1.T.S2.y) to search the synthesis unit database (24) and select candidate segments suitable for generating the synthesized speech. Where necessary, it modifies the intonation, stress, and duration of the selected segments by signal processing so that they match the prosody control parameter values, inserts the phrase breaks within the sentence, and then concatenates the segments to generate the desired synthesized speech, which it outputs.
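Forming the x.S1.T.S2.y search information and using it as a lookup key can be sketched as below. The articulation classes, the segment identifiers, and the names `make_search_key` and `select_candidates` are hypothetical; the patent does not specify how the environments x and y are encoded.

```python
# Hypothetical manner/place classes used as the environments x and y.
ARTICULATION_CLASS = {
    "#": "sil",
    "p": "stop-labial",
    "m": "nasal-labial",
    "a": "vowel-open",
    "i": "vowel-close",
}


def make_search_key(prev_phoneme, s1, target, s2, next_phoneme):
    """Build the key x.S1.T.S2.y from neighbours, target, and pause symbols."""
    x = ARTICULATION_CLASS[prev_phoneme]
    y = ARTICULATION_CLASS[next_phoneme]
    return f"{x}.{s1}.{target}.{s2}.{y}"


def select_candidates(database, key):
    """Return the candidate segments stored under the key, if any."""
    return database.get(key, [])


# Toy database: the same /a/ after /p/ maps to different units depending on
# whether a long pause (symbol 2) or another sound follows.
database = {
    "stop-labial.0.a.2.sil": ["a_017"],
    "stop-labial.0.a.0.nasal-labial": ["a_003"],
}
key = make_search_key("p", 0, "a", 2, "#")
candidates = select_candidates(database, key)
```

Including S1 and S2 in the key is what separates a phoneme spoken before a pause from the same phoneme spoken in running speech, even when the neighbouring sounds are identical.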

Speech segments corresponding one-to-one to the synthesis unit search information (x.S1.T.S2.y) are stored in the synthesis unit database (24) in advance by the operator.
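Building such a database ahead of time can be sketched as indexing labelled segments by their x.S1.T.S2.y key and rejecting duplicates so that the mapping stays one-to-one. The record layout (`label`, `samples`) and the name `build_database` are assumptions for illustration only.

```python
def build_database(labelled_segments):
    """Index segments by their x.S1.T.S2.y key, enforcing one segment per key."""
    database = {}
    for segment in labelled_segments:
        key = "{x}.{s1}.{t}.{s2}.{y}".format(**segment["label"])
        if key in database:
            raise ValueError(f"duplicate synthesis unit for key {key}")
        database[key] = segment["samples"]
    return database
```

Each record is assumed to carry a `label` dictionary with the fields `x`, `s1`, `t`, `s2`, and `y`, plus the segment's waveform samples.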

FIG. 3 is a flowchart of an embodiment of the text-to-speech conversion method according to the present invention.

As shown in FIG. 3, in the text-to-speech conversion method according to the present invention, the language processing unit (21) first extracts a phoneme string and the syntactic structure information of the sentence from the text input to the TTS system (301). That is, the language processing unit (21) receives the input text sentence, converts symbols, numerals, and foreign words into their spoken readings, estimates the syntactic structure information needed for prosody processing, such as syllable and word boundary information, the part of speech of each word in the sentence, and its grammatical function, and performs grapheme-to-phoneme conversion using an exception pronunciation dictionary and reading conversion rules.

Next, the prosody processing unit (22) receives the syntactic structure information and the phoneme string extracted by the language processing unit (21) and uses rules and a prosody table to estimate prosody parameter values related to phrasing (pauses), pitch, intensity, and duration (302). The prosody processing unit (22) applies phrase-break rules according to the degree of separation at phrase and clause boundaries to insert appropriate pauses, computes intonation-related pitch values using rules and tables according to the modification relationships between words, and assigns the pitch values to the corresponding phonemes. It then computes the duration of each phoneme from the phoneme's intrinsic duration and from modification rules that depend on the phonetic environment and the syntactic structure, and generates an energy contour for each phoneme, taking into account the phoneme's articulatory characteristics and grammatical function.

Next, from the phoneme string received from the prosody processing unit (22), the signal processing unit (23) selects the target phoneme and the immediately preceding and immediately following phoneme strings with respect to the phonetic environment (303). From the preceding and following phoneme strings it estimates coarticulation patterns related to manner and place of articulation, and sets the target phoneme (T), the phonetic environment (x) of the preceding phoneme string, and the phonetic environment (y) of the following phoneme string (304).

The signal processing unit (23) then takes, from the prosody control parameter values, the durations of the silent intervals inserted immediately before and after the target phoneme (T), quantizes them linearly or nonlinearly, and converts them into symbols (S1, S2) (305).

The signal processing unit (23) then combines the target phoneme (T), the phonetic environment (x) of the immediately preceding phoneme string, the phonetic environment (y) of the immediately following phoneme string, and the symbols (S1, S2) to generate the synthesis unit search information (x.S1.T.S2.y) (306). To generate the synthesized speech, it searches the speech segments stored in advance in the synthesis unit database (24) on the basis of the generated search information (x.S1.T.S2.y) and selects among them (307). Where necessary, it modifies the intonation, stress, and duration of the selected segments by signal processing so that they match the prosody control parameter values, and inserts the phrase breaks within the sentence (308).

Finally, the selected segments are concatenated to generate the desired synthesized speech (309).
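Steps 303 to 309 can be sketched end to end as a loop over the phoneme string that builds one x.S1.T.S2.y key per target phoneme, looks up a segment, and concatenates the results. Several simplifications are assumptions of this sketch and not statements about the patented method: the neighbouring phoneme itself stands in for the environments x and y, `#` marks an utterance boundary, the pause before the first phoneme is treated as symbol 0, prosody modification (308) is omitted, and the bin boundaries and function names are hypothetical.

```python
import bisect

PAUSE_BOUNDARIES_MS = (10.0, 50.0, 200.0)  # assumed nonlinear bin edges


def pause_symbol(pause_ms):
    """Quantize a silent-interval duration into a symbol 0..3."""
    return bisect.bisect_right(PAUSE_BOUNDARIES_MS, pause_ms)


def search_keys(phonemes, pauses_after_ms):
    """Build one key per phoneme; pauses_after_ms[i] follows phonemes[i]."""
    keys = []
    for i, target in enumerate(phonemes):
        x = phonemes[i - 1] if i > 0 else "#"
        y = phonemes[i + 1] if i + 1 < len(phonemes) else "#"
        s1 = pause_symbol(pauses_after_ms[i - 1]) if i > 0 else 0
        s2 = pause_symbol(pauses_after_ms[i])
        keys.append(f"{x}.{s1}.{target}.{s2}.{y}")
    return keys


def synthesize(phonemes, pauses_after_ms, database):
    """Look up the segment for each key and concatenate the sample lists."""
    samples = []
    for key in search_keys(phonemes, pauses_after_ms):
        samples.extend(database[key])
    return samples
```

With `phonemes = ["k", "a"]` and `pauses_after_ms = [0.0, 300.0]`, the keys are `#.0.k.0.a` and `k.0.a.3.#`, so the final /a/ is drawn from a unit recorded before a long pause.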

As described above, by using the durations of the silent intervals inserted between consecutive phonemes in the synthesis unit search information, the present invention can classify more accurately the coarticulation effects that the adjacent phonetic environment has on the target phoneme, and can improve the intelligibility and naturalness of the synthesized speech. Because it generates high-quality synthesized speech, the invention can be applied to communication services such as web page reading, e-mail reading, and on-demand novel (children's story) narration, as well as to education and other fields.

The method of the present invention described above can be implemented as a program and stored on a computer-readable recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

The present invention described above is not limited to the foregoing embodiments and the accompanying drawings, since those of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from the technical spirit of the invention.

As described above, the present invention uses the durations of the silent intervals inserted between consecutive phonemes in the synthesis unit search information used to search the speech segments stored in the synthesis unit database. It thereby classifies more accurately the coarticulation effects that the adjacent phonetic environment has on the target phoneme, and greatly improves the intelligibility and naturalness of the synthesized speech.

Claims

In the method of reading the contents stored in the database management device,

wherein, for classifying phoneme variants, a target phoneme (T), an articulatory environment and manner of articulation (x) associated with the immediately preceding phoneme or phoneme string, and an articulatory environment and manner of articulation (y) associated with the immediately following phoneme or phoneme string are arranged in sequence; a first silent interval (S1) is inserted between the articulatory environment and manner of articulation (x) and the target phoneme (T); and a second silent interval (S2) is inserted between the target phoneme (T) and the articulatory environment and manner of articulation (y).

In the method for storing in the database management device,

In the method of reading the contents stored in the database management device,

wherein, for classifying syllable variants, a target syllable (T), an articulatory environment and manner of articulation (x) associated with the immediately preceding phoneme or phoneme string, and an articulatory environment and manner of articulation (y) associated with the immediately following phoneme or phoneme string are arranged in sequence; a first silent interval (S1) is inserted between the articulatory environment and manner of articulation (x) and the target syllable (T); and a second silent interval (S2) is inserted between the target syllable (T) and the articulatory environment and manner of articulation (y).

In the method for storing in the database management device,

A text-to-speech conversion apparatus for generating synthesized speech from input text, comprising:

language processing means for extracting a phoneme string and syntactic structure information of a sentence from the input text;

prosody processing means for receiving the syntactic structure information and the phoneme string extracted by the language processing means and estimating prosody control parameter values using rules and a prosody table;

synthesis unit storage means for storing speech segments corresponding to synthesis unit search information; and

synthesized speech generating means for generating the synthesis unit search information by estimating, from the phoneme string and the prosody control parameter values estimated by the prosody processing means, the target phoneme, the phonetic environments of the immediately preceding and following phoneme strings, and the symbol-converted silent-interval durations, selecting candidate speech segments stored in the synthesis unit storage means using the synthesis unit search information, and concatenating the selected segments to generate the desired synthesized speech.

The text-to-speech conversion apparatus of claim 5,

wherein the synthesis unit search information (x.S1.T.S2.y), for classifying phoneme variants, is composed of:

the target phoneme (T), the phonetic environment (x) of the immediately preceding phoneme string, and the phonetic environment (y) of the immediately following phoneme string, which are set by selecting the target phoneme and the immediately preceding and following phoneme strings from the phoneme string with respect to the phonetic environment and estimating, from the preceding and following phoneme strings, coarticulation patterns related to manner and place of articulation; and

symbols (S1, S2) obtained by quantizing, either linearly or nonlinearly, the durations of the silent intervals inserted immediately before and after the target phoneme (T) among the prosody control parameter values.

The text-to-speech conversion apparatus of claim 6,

wherein, in quantizing the duration of the silent interval either linearly or nonlinearly,

different quantization patterns are used for groups formed according to the phonetic environments of the immediately preceding phoneme string (x) and the target phoneme (T), and of the immediately following phoneme string (y) and the target phoneme (T).

The text-to-speech conversion apparatus according to any one of claims 5 to 7,

wherein the synthesis unit search information (x.S1.T.S2.y), for classifying syllable variants, is composed of:

a target syllable (T), the phonetic environment (x) of the immediately preceding phoneme string, and the phonetic environment (y) of the immediately following phoneme string, which are set from the target syllable and the immediately preceding and following phoneme strings selected from the phoneme string with respect to the phonetic environment; and symbols (S1, S2) obtained by quantizing, either linearly or nonlinearly, the durations of the silent intervals inserted immediately before and after the target syllable (T) among the prosody control parameter values.

A text-to-speech conversion method applied to an apparatus for generating synthesized speech from input text, comprising:

a first step of constructing a synthesis unit database composed of speech segments corresponding to synthesis unit search information;

a second step of extracting a phoneme string and syntactic structure information of a sentence from the input text and estimating prosody control parameter values using rules and a prosody table; and

a third step of generating the synthesis unit search information by estimating, from the estimated phoneme string and prosody control parameter values, the target phoneme, the phonetic environments of the immediately preceding and following phoneme strings, and the symbol-converted silent-interval durations, and of selecting and concatenating candidate speech segments stored in the synthesis unit database using the synthesis unit search information to generate the desired synthesized speech.

The method of claim 9,

wherein the second step comprises:

a fourth step in which a language processing unit extracts the phoneme string and the syntactic structure information of the sentence from the input text; and

a fifth step in which a prosody processing unit receives the syntactic structure information and the phoneme string extracted by the language processing unit and estimates, using rules and a prosody table, prosody parameter values related to phrasing (pauses), pitch, intensity, and duration.

The method according to claim 9 or 10,

wherein the third step comprises:

a sixth step in which a signal processing unit selects, from the phoneme string received from the prosody processing unit, the target phoneme and the immediately preceding and immediately following phoneme strings with respect to the phonetic environment;

a seventh step in which the signal processing unit estimates, from the immediately preceding and following phoneme strings, coarticulation patterns related to manner and place of articulation, and sets the target phoneme (T), the phonetic environment (x) of the immediately preceding phoneme string, and the phonetic environment (y) of the immediately following phoneme string;

an eighth step in which the signal processing unit quantizes, linearly or nonlinearly, the durations of the silent intervals inserted immediately before and after the target phoneme (T) among the prosody control parameter values and converts them into symbols (S1, S2);

a ninth step in which the signal processing unit combines the target phoneme (T), the phonetic environment (x) of the immediately preceding phoneme string, the phonetic environment (y) of the immediately following phoneme string, and the symbols (S1, S2) to generate the synthesis unit search information (x.S1.T.S2.y);

a tenth step in which the signal processing unit, in order to generate the synthesized speech, searches for and selects speech segments stored in the synthesis unit database on the basis of the synthesis unit search information (x.S1.T.S2.y); and

an eleventh step in which the signal processing unit concatenates the selected speech segments to generate the desired synthesized speech.

The method of claim 11,

wherein the signal processing unit,

where necessary, modifies the intonation, stress, and duration of the selected speech segments by signal processing so that they match the prosody control parameter values, and inserts phrase breaks within the sentence.

The method of claim 12,

wherein the synthesis unit search information (x.S1.T.S2.y), for classifying syllable variants, is composed of: a target syllable (T), the phonetic environment (x) of the immediately preceding phoneme string, and the phonetic environment (y) of the immediately following phoneme string, which are set from the target syllable and the immediately preceding and following phoneme strings selected from the phoneme string with respect to the phonetic environment; and symbols (S1, S2) obtained by quantizing, either linearly or nonlinearly, the durations of the silent intervals inserted immediately before and after the target syllable (T) among the prosody control parameter values.

A computer-readable recording medium having recorded thereon a program for realizing, in a text-to-speech conversion apparatus having a processor:

a function of constructing a synthesis unit database composed of speech segments corresponding to synthesis unit search information;

a function of extracting a phoneme string and syntactic structure information of a sentence from input text and estimating prosody control parameter values using rules and a prosody table; and

a function of generating the synthesis unit search information by estimating, from the estimated phoneme string and prosody control parameter values, the target phoneme, the phonetic environments of the immediately preceding and following phoneme strings, and the symbol-converted silent-interval durations, and of selecting and concatenating candidate speech segments stored in the synthesis unit database using the synthesis unit search information to generate the desired synthesized speech.

A computer-readable recording medium for a database management device having a processor, having recorded thereon data comprising,

for classifying phoneme variants:

a target phoneme (T) structure;

an articulatory environment and manner of articulation (x) structure associated with the immediately preceding phoneme or phoneme string;

an articulatory environment and manner of articulation (y) structure associated with the immediately following phoneme or phoneme string;

a first silent interval (S1) inserted between the articulatory environment and manner of articulation (x) structure and the target phoneme (T); and

a second silent interval (S2) inserted between the target phoneme (T) and the articulatory environment and manner of articulation (y) structure.

A computer-readable recording medium for a database management device having a processor, having recorded thereon data comprising,

for classifying syllable variants:

a target syllable (T) structure;

a first silent interval (S1) inserted between the articulatory environment and manner of articulation (x) associated with the immediately preceding phoneme or phoneme string and the target syllable (T); and

a second silent interval (S2) inserted between the target syllable (T) and the articulatory environment and manner of articulation (y) associated with the immediately following phoneme or phoneme string.