KR0146549B1

KR0146549B1 - Korean text-to-speech conversion method

Info

Publication number: KR0146549B1
Application number: KR1019950014828A
Authority: KR
Inventors: 이정철; 최운천; 김상훈
Original assignee: 양승택; 한국전자통신연구원 (Electronics and Telecommunications Research Institute); 조백제; 한국전기통신공사 (Korea Telecommunication Authority)
Priority date: 1995-06-05
Filing date: 1995-06-05
Publication date: 1998-09-15
Also published as: KR970002706A

Abstract

The present invention relates to a Korean text-to-speech conversion method. Rule-based language-processing and prosody-processing modules improve the naturalness of the synthesized speech, and the speech-generation module uses a TD-PSOLA synthesizer to improve its intelligibility. The method comprises: a first step of classifying synthesis units by analyzing the phonological structure of Korean and the constraints on phoneme concatenation; a second step of building a synthesis-unit database structured so that units can be accessed easily phoneme by phoneme and phoneme duration and pitch can be modified in real time; a third step of fetching, from the synthesis-unit database, the data required for each segment of a syllable as phonemes and demisyllables; and a fourth step of preprocessing the input text, analyzing its words, parsing it, performing grapheme-to-phoneme conversion, applying prosody rules suited to the sentence structure, and retrieving the phonetic symbols and prosodic information from the synthesis-unit DB to synthesize the units. The method improves the naturalness and fluency of the synthesized speech, is easy to implement, and greatly improves its intelligibility.

Description

Korean text-to-speech conversion method

Fig. 1 is a block diagram of the hardware to which the present invention is applied.

Fig. 2 is a block diagram of the synthesis-unit database used in the present invention.

Fig. 3 is a flowchart of CDU concatenation by type for the CDUs used in the present invention.

Fig. 4 is a flowchart of the method according to the present invention.

Fig. 5 is a flowchart of the parsing process according to the present invention.

Fig. 6 is a flowchart of the prosody-processing procedure according to the present invention.

Fig. 7 is a flowchart of word, syllable, and phoneme duration calculation according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

11: character input device    12: central processing unit

13: synthesis-unit database    14: D/A converter

21: structure of the dbdic file in the synthesis-unit DB

22: structure of the dbptch file in the synthesis-unit DB

23: structure of the dbsp file in the synthesis-unit DB

The present invention relates to a Korean text-to-speech (hereinafter TTS) conversion method in which rule-based language-processing and prosody-processing modules improve the naturalness of the synthesized speech, and a TD-PSOLA (time-domain pitch-synchronous overlap-add) synthesizer improves its intelligibility.

Speech synthesis lets a computer present various kinds of information to a human user by voice; with it, a user can have existing text data, or text supplied by a conversation partner, read aloud. To provide a high-quality synthesis service, the synthesized speech must be highly intelligible and natural, fluent enough to achieve an appropriate speaking rate and semantic emphasis, and easy to implement in hardware and software.

Synthesis units currently in use include sentences, words (eojeol), and combinations of syllables, as well as smaller units such as disyllables, diphones, and phonemes. The choice of synthesis unit, the construction of the synthesis-unit DB, and the method of concatenating units are important factors that directly affect the quality and timbre of the synthesized speech.

Prosody manifests in three forms: pitch, intensity, and duration. Pitch variation carries intonation; intensity carries semantic stress; and duration carries both the distinctive duration differences caused by place of articulation, manner of articulation, and coarticulation, and information about prosodic boundaries. Producing intelligible, natural synthesized speech therefore requires identifying the distinguishable types of real prosodic patterns and their meanings, determining how prosodic patterns relate to sentence type, syntactic structure, and context, and writing prosody rules accordingly. The text-analysis method, the prosody model, and the prosodic-pattern realization method are thus key enabling technologies that determine the intelligibility and naturalness of synthesized speech.

Synthesizers under study include analysis/synthesis types such as LPC (Linear Predictive Coding), LSP (Line Spectral Pairs), and formant synthesizers, and time-domain synthesizers such as TD-PSOLA. Because the synthesizer strongly influences the intelligibility and timbre of the output, it is chosen by weighing naturalness, fluency, and complexity.

However, producing natural, human-like synthesized speech remains difficult, and few commercial products exist. In particular, for rule-based synthesis, which converts unrestricted vocabulary into human-like speech, the techniques for selecting and concatenating synthesis units to ensure intelligibility, for realizing prosody to ensure naturalness, and for generating the synthesized waveform are still inadequate, so new techniques are needed.

The present invention, devised to meet this need, uses a mix of phonemes, demisyllables, and syllables as synthesis units. The synthesis-unit DB consists of 1,204 context-dependent units (CDUs), the minimum number that covers all Korean phonological environments, and units are concatenated according to the CDU construction principles. For prosody, Korean linguistic analysis elements (attributes) are first defined, and methods are developed for analyzing sentence type and syntactic structure from the input text and for controlling intonation, segment duration, and emphasis, which improves the naturalness and fluency of the synthesized speech. Using TD-PSOLA, an efficient waveform-editing method, makes the system easy to implement and greatly improves intelligibility. The object of the invention is to provide such a Korean text-to-speech conversion method.

To achieve this object, the present invention provides a method applied to an apparatus comprising: character input means for receiving Korean characters; central control means that receives the Korean characters from the character input means and controls each component according to the algorithm of the present invention; a synthesis-unit database, stored in a storage device as the CDU synthesis-unit DB used by the synthesis algorithm, that transmits the required data to the central control means; and digital/analog conversion means that converts the synthesized digital data into an analog signal and outputs it. The method comprises: a first step of classifying synthesis units by analyzing the phonological structure of Korean and the constraints on phoneme concatenation; a second step of building the synthesis-unit database in a structure that allows units to be accessed easily phoneme by phoneme and phoneme duration and pitch to be modified in real time; a third step of fetching, from the synthesis-unit database, the data required for each segment of a syllable as phonemes and demisyllables; and a fourth step in which a language-processing module preprocesses the input text sentence, analyzes its words, parses it, and performs grapheme-to-phoneme conversion, a prosody-processing module receives the language-processing results and applies prosody rules suited to the sentence structure, and a speech-generation module receives the results of both modules, retrieves the phonetic symbols and prosodic information from the synthesis-unit DB, and synthesizes the synthesis units.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

Fig. 1 is a block diagram of the hardware to which the present invention is applied: 11 denotes a character input device, 12 a central processing unit, 13 a synthesis-unit database, and 14 a D/A converter.

In operation, the character input device 11 receives Korean characters, which may be encoded in the KS5601 precomposed code or the 2-byte combinational code, and passes them to the central processing unit 12. The central processing unit 12 receives the Korean characters from the character input device 11 and controls each component according to the algorithm of the present invention. The synthesis-unit database 13, stored in a storage device, is the CDU synthesis-unit DB used by the synthesis algorithm and transfers the required data to the central processing unit 12. The D/A converter 14 converts the synthesized digital data into an analog signal and outputs it.
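As a small illustration of the input stage, the Python sketch below decodes raw 2-byte Korean input with Python's built-in codecs. It assumes that the `euc_kr` codec corresponds to the KS5601 precomposed code and `johab` to the 2-byte combinational code; the function name `decode_input` is hypothetical, and the patent does not describe how decoding was implemented.

```python
def decode_input(raw: bytes, encoding: str = "euc_kr") -> str:
    """Decode 2-byte Korean text from the character input device.

    Assumption: "euc_kr" stands for the KS5601 precomposed code and
    "johab" for the 2-byte combinational code; both codecs ship with Python.
    """
    if encoding not in ("euc_kr", "johab"):
        raise ValueError(f"unsupported encoding: {encoding}")
    return raw.decode(encoding)


# Usage: the same text arriving in either encoding decodes to the same string.
print(decode_input(b"\xb0\xa1"))                          # 가
print(decode_input("한국어".encode("johab"), "johab"))     # 한국어
```

Normalizing both encodings to one string type early means the later stages (word analysis, syllable decomposition) only ever see one representation.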

Table 1 below classifies the CDUs used in the present invention by type. To synthesize speech, each syllable is classified with reference to its preceding and following syllables.

To build the synthesis units for a high-quality Korean TTS system, the phonological and prosodic environments of Korean are analyzed to select the units needed for synthesis; within each syllable used, the specific portion actually used in each environment is determined; and that portion is segmented and labeled on the synthesis-unit speech waveform, yielding units that reflect changes in phonological environment. Analyzing the phonological structure of Korean and the constraints on phoneme concatenation yields 1,204 synthesis units in total. The resulting CDUs are used to fetch data from the synthesis-unit database and to concatenate the synthesis units.

Fig. 2 is a block diagram of the synthesis-unit database used in the present invention. It is structured so that synthesis units can be accessed easily phoneme by phoneme and phoneme duration and pitch can be modified in real time.

The synthesis-unit DB comprises a synthesis-unit address data file 21, a pitch-mark file 22, and a speech data file 23. The speech data file 23 stores the PCM (Pulse Code Modulation) data of the synthesis units sequentially.

The pitch-mark file 22 stores, as sample counts, the positions of the pitch marks in the speech signal stored in the speech data file 23, together with each pitch value.

For each synthesis unit, the synthesis-unit address data file 21 stores the unit number, its start and end points in the speech data file 23, the number of segments, and, for each segment, its start point in the pitch-mark file 22 and its number of pitch marks.

Using these database structures, the speech signal of a pitch period with pitch value n, starting from the beginning of the vowel 'ㅏ' in '가', is retrieved as follows.

First, the entry for '가' is read from the synthesis-unit address data file 21 using its DB access number n1. Because the vowel 'ㅏ' is the second segment of CDU n1, the start point n2 in the pitch-mark file 22 and the pitch-mark count n3 for that second segment are read. Starting from n2 in the pitch-mark file 22, the start point n4 in the speech data file 23 and the sample count n5 of each pitch mark are read in order, and each sample count is compared with the pitch value n. On a match, n5 samples are read from start point n4 in the speech data file 23.
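The lookup above can be sketched in Python with in-memory stand-ins for the three files. The record types `UnitEntry` and `PitchMark` and the function `fetch_pitch_period` are hypothetical: the patent does not give the on-disk record layout, so a dict and two lists take the place of the dbdic, dbptch, and dbsp files.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class UnitEntry:
    """One record of the address file (21); the layout is hypothetical."""
    start: int                       # start sample in the speech data file (23)
    end: int                         # end sample in the speech data file (23)
    segments: List[Tuple[int, int]]  # per segment: (first pitch-mark index in file 22, pitch-mark count)


@dataclass
class PitchMark:
    """One record of the pitch-mark file (22)."""
    speech_start: int  # n4: start point in the speech data file (23)
    n_samples: int     # n5: number of samples, i.e. the pitch period


def fetch_pitch_period(address_file: Dict[int, UnitEntry],
                       pitch_file: List[PitchMark],
                       speech_file: List[int],
                       unit_no: int, seg_no: int,
                       pitch_value: int) -> Optional[List[int]]:
    """Return the first pitch period of segment seg_no (1-based) whose length equals pitch_value."""
    entry = address_file[unit_no]                    # entry for DB access number n1
    pm_start, pm_count = entry.segments[seg_no - 1]  # n2 and n3 for that segment
    for mark in pitch_file[pm_start:pm_start + pm_count]:
        if mark.n_samples == pitch_value:            # sample count matches pitch value n
            return speech_file[mark.speech_start:mark.speech_start + mark.n_samples]
    return None                                      # no pitch period of that length


# Usage with toy data: unit 1 has two segments; segment 2 owns pitch marks 1 and 2.
speech = list(range(100))
pitch_marks = [PitchMark(10, 5), PitchMark(15, 6), PitchMark(21, 7)]
address = {1: UnitEntry(start=0, end=100, segments=[(0, 1), (1, 2)])}
print(fetch_pitch_period(address, pitch_marks, speech, 1, 2, 7))  # [21, 22, 23, 24, 25, 26, 27]
```

Storing per-segment pitch-mark ranges in the address file is what makes the access phoneme-granular: the search only scans the marks of one segment.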

Fig. 3 is a flowchart of CDU concatenation by type for the CDUs used in the present invention.

Each syllable in a sentence is divided into five units: the initial consonant C1, the first half of the vowel V1, the second half of the vowel V2, the final consonant C2, and C3. The data appropriate for each unit are fetched from the synthesis-unit DB as phonemes and demisyllables. Which data are selected in the synthesis-unit DB depends on the vowel Vp and final consonant Cp of the preceding syllable, the initial C1, vowel V, and final C2 of the target syllable, and the initial Cn and vowel Vn of the following syllable. The concatenation types are as follows.
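To make the context symbols concrete, the sketch below splits precomposed Hangul syllables into initial, vowel, and final using the standard Unicode Hangul-syllable arithmetic, then collects Vp, Cp, C1, V, C2, Cn, and Vn for each syllable of a word. This is an illustrative assumption: the original system worked on KS5601 input and presumably on pronounced forms produced by grapheme-to-phoneme conversion, and `decompose` and `syllable_contexts` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"          # 19 initials
JUNGSEONG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"   # 21 vowels
JONGSEONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
             "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]  # 28 incl. none


def decompose(syllable: str) -> Tuple[Optional[str], str, Optional[str]]:
    """Split a precomposed Hangul syllable into (initial, vowel, final)."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syllable!r}")
    cho = CHOSEONG[code // (21 * 28)]
    jung = JUNGSEONG[(code % (21 * 28)) // 28]
    jong = JONGSEONG[code % 28]
    # A syllable-initial ㅇ is silent, so report it as "no initial consonant".
    return (None if cho == "ㅇ" else cho), jung, (jong or None)


def syllable_contexts(word: str) -> List[Dict[str, Optional[str]]]:
    """Collect Vp, Cp, C1, V, C2, Cn, Vn for each syllable of one word (eojeol)."""
    parts = [decompose(s) for s in word]
    none3: Tuple[Optional[str], Optional[str], Optional[str]] = (None, None, None)
    contexts = []
    for i, (c1, v, c2) in enumerate(parts):
        prev = parts[i - 1] if i > 0 else none3
        nxt = parts[i + 1] if i + 1 < len(parts) else none3
        contexts.append({"Vp": prev[1], "Cp": prev[2], "C1": c1, "V": v,
                         "C2": c2, "Cn": nxt[0], "Vn": nxt[1]})
    return contexts


print(syllable_contexts("한국")[1])
# {'Vp': 'ㅏ', 'Cp': 'ㄴ', 'C1': 'ㄱ', 'V': 'ㅜ', 'C2': 'ㄱ', 'Cn': None, 'Vn': None}
```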

1. C1 type

· First syllable of a word: C1 = CV

· Cp = C1 = 'ㄹ': C1 = e1LV

· Otherwise: C1 = eCV

2. V1 type

· Syllable has an initial consonant: V1 = C1

· First syllable of a word that begins with a vowel:

- Monosyllable, or diphthong without a final consonant: V1 = V

- Diphthong with the final consonant 'ㅇ': V1 = V

- Otherwise: V1 = -1 (resolved in a later step)

· Preceding syllable has a final consonant: V1 = eCV

· Preceding syllable ends in a monophthong: V1 = VV

· Preceding syllable ends in a 'j'-series vowel: V1 = ejV

· Preceding syllable ends in a 'w'-series vowel: V1 = ewV

3. V2 type

· No final consonant:

- End of a word: V2 = V1

- Following syllable has an initial consonant: V2 = VCe

- Following syllable is a monophthong: V2 = VV

- Following syllable is a 'j'-series vowel: V2 = Vje

- Following syllable is a 'w'-series vowel: V2 = Vwe

· Final consonant present:

- End of a word: V2 = VC

- Closed (stop) final consonant: V2 = VC

- Cp = 'ㄹ' and C1 = 'ㄹ': V2 = VL1e

- Following syllable begins with a vowel: V2 = VCe

- Voiced final consonant and the following syllable has an initial consonant: V2 = VCDa

4. C2 type

· Final consonant present: C2 = V2

5. C3 type

· Voiced final consonant and the following syllable has an initial consonant: C3 = aCCw

6. Finally, if V1 = -1: V1 = V2
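The selection rules listed above can be written as a single function. The sketch below is a hypothetical transcription: `Ctx`, `select_units`, and the vowel-class labels "mono", "j", and "w" are illustrative; the sets of closed and voiced finals are assumptions; "monosyllable" is read as a one-syllable word; and where a rule says X = Y, the function copies Y's value. It returns the unit-type labels used in the text (CV, eCV, VCe, and so on), not database addresses.

```python
from dataclasses import dataclass
from typing import Dict, Optional

CLOSED = {"ㄱ", "ㄷ", "ㅂ"}       # assumed: stop finals after neutralization
VOICED = {"ㄴ", "ㄹ", "ㅁ", "ㅇ"}  # assumed: voiced (sonorant) finals
PENDING = "-1"                     # "V1 = -1" in the text: decided in step 6


@dataclass
class Ctx:
    """Hypothetical per-syllable context; symbols follow the text."""
    c1: Optional[str]   # initial consonant C1 (None if the syllable starts with a vowel)
    c2: Optional[str]   # final consonant C2
    cp: Optional[str]   # final consonant Cp of the preceding syllable
    vp: Optional[str]   # class of the preceding vowel Vp: "mono", "j" or "w"
    cn: Optional[str]   # initial consonant Cn of the following syllable in the word
    vn: Optional[str]   # class of the following vowel Vn: "mono", "j" or "w"
    diphthong: bool     # True if the target vowel V is a diphthong
    word_initial: bool
    word_final: bool


def select_units(x: Ctx) -> Dict[str, Optional[str]]:
    u: Dict[str, Optional[str]] = {}
    # 1. C1 type (only when the syllable has an initial consonant)
    if x.c1 is None:
        u["C1"] = None
    elif x.word_initial:
        u["C1"] = "CV"
    elif x.cp == "ㄹ" and x.c1 == "ㄹ":
        u["C1"] = "e1LV"
    else:
        u["C1"] = "eCV"
    # 2. V1 type
    if x.c1 is not None:
        u["V1"] = u["C1"]
    elif x.word_initial:
        monosyllable = x.word_final  # assumed meaning of "monosyllable"
        if monosyllable or (x.diphthong and x.c2 in (None, "ㅇ")):
            u["V1"] = "V"
        else:
            u["V1"] = PENDING
    elif x.cp is not None:
        u["V1"] = "eCV"
    else:
        u["V1"] = {"mono": "VV", "j": "ejV", "w": "ewV"}.get(x.vp)
    # 3. V2 type
    if x.c2 is None:
        if x.word_final:
            u["V2"] = u["V1"]
        elif x.cn is not None:
            u["V2"] = "VCe"
        else:
            u["V2"] = {"mono": "VV", "j": "Vje", "w": "Vwe"}.get(x.vn)
    elif x.word_final or x.c2 in CLOSED:
        u["V2"] = "VC"
    elif x.cp == "ㄹ" and x.c1 == "ㄹ":  # condition copied as stated in the text
        u["V2"] = "VL1e"
    elif x.cn is None:
        u["V2"] = "VCe"                   # following syllable begins with a vowel
    elif x.c2 in VOICED:
        u["V2"] = "VCDa"
    else:
        u["V2"] = None                    # combination not covered by the listed rules
    # 4. C2 type and 5. C3 type
    u["C2"] = u["V2"] if x.c2 is not None else None
    u["C3"] = "aCCw" if (x.c2 in VOICED and x.cn is not None) else None
    # 6. Resolve a pending V1
    if u["V1"] == PENDING:
        u["V1"] = u["V2"]
    return u
```

For example, '안' at the start of a longer word (no initial consonant, voiced final 'ㄴ', followed by a consonant-initial syllable) gets V1 pending, V2 = VCDa, C3 = aCCw, and V1 is finally set to VCDa.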

Fig. 4 is a flowchart of the method according to the present invention; the function of each module is as follows.

The language-processing module first preprocesses the input text sentence (41), expanding abbreviations, punctuation symbols, and special terms into Korean (42). Next, word analysis estimates and assigns a grammatical function to each word (eojeol) of the input sentence, using the Korean particles, inflectional endings, adverbs, conjunctions, and other items registered in the lexicon in about 60 groups (43). A parsing step then estimates the syntactic structure of the input sentence using Korean grammar (44). Finally, the exception pronunciation dictionary is searched: words registered there are converted according to that dictionary, and all other words are converted to their pronounced form using a word-dictionary lookup and Korean pronunciation rules (45).
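A minimal sketch of step (45), assuming a dictionary-first lookup followed by a single pronunciation rule: linking (연음), in which a simple final consonant moves into a following syllable that begins with silent 'ㅇ'. Actual Korean pronunciation rules (neutralization, nasalization, tensification, palatalization, and others) are far more numerous; `apply_linking`, `to_pronunciation`, and the example exception entry are hypothetical.

```python
from typing import Dict, List, Optional

CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
        "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]
# Simple finals that can move to the next syllable (ㅇ and ㅎ behave differently).
MOVABLE = set("ㄱㄲㄴㄷㄹㅁㅂㅅㅆㅈㅊㅋㅌㅍ")


def split(ch: str) -> Optional[List[str]]:
    """[initial, vowel, final] for a Hangul syllable, or None for any other character."""
    code = ord(ch) - 0xAC00
    if not 0 <= code < 11172:
        return None
    return [CHO[code // 588], JUNG[code % 588 // 28], JONG[code % 28]]


def join(p: List[str]) -> str:
    """Recompose [initial, vowel, final] into one precomposed syllable."""
    return chr(0xAC00 + (CHO.index(p[0]) * 21 + JUNG.index(p[1])) * 28 + JONG.index(p[2]))


def apply_linking(word: str) -> str:
    parts = [split(ch) for ch in word]
    for cur, nxt in zip(parts, parts[1:]):
        # Move a simple final into a following syllable that starts with silent ㅇ.
        if cur and nxt and cur[2] in MOVABLE and nxt[0] == "ㅇ":
            nxt[0], cur[2] = cur[2], ""
    return "".join(join(p) if p else ch for p, ch in zip(parts, word))


def to_pronunciation(word: str, exceptions: Dict[str, str]) -> str:
    """Exception dictionary first, then the (single) rule."""
    if word in exceptions:
        return exceptions[word]
    return apply_linking(word)


print(to_pronunciation("한국어", {}))  # 한구거
```

Consulting the exception dictionary before the rules mirrors the order described in step (45); words the rules would get wrong are simply listed there.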

The prosody-processing module receives the output of the language-processing module and applies prosody rules suited to the sentence structure, generating the information that governs the naturalness and fluency of the synthesized speech, such as speaking rate, intonation, and pause placement (phrasing) (46).

The speech-generation module retrieves from the synthesis-unit DB the units matching the phonetic symbols and prosodic information obtained above, and adjusts, modifies, and concatenates them using TD-PSOLA (47). Finally, the synthesized speech is generated and output to the user (48).
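The following is a minimal, pure-Python sketch of TD-PSOLA under simplifying assumptions, not the patent's implementation: analysis pitch marks are given (as in the pitch-mark file), each short-time segment is a Hann-windowed slice two local pitch periods long centered on a pitch mark, and segments are overlap-added at synthesis marks spaced `new_period` samples apart. `time_scale` changes duration by mapping each synthesis mark back to the nearest analysis mark. All names are hypothetical.

```python
import math
from typing import List, Sequence


def hann(n: int) -> List[float]:
    """Periodic Hann window of length n; copies shifted by n/2 sum to 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def td_psola(x: Sequence[float], marks: Sequence[int], new_period: int,
             time_scale: float = 1.0) -> List[float]:
    """Overlap-add pitch-synchronous segments of x at synthesis marks spaced new_period apart.

    marks: strictly increasing analysis pitch-mark positions (sample indices) in x.
    new_period: target pitch period in samples (must be > 0).
    time_scale: > 1 lengthens the output, < 1 shortens it.
    """
    out_len = int(round(len(x) * time_scale))
    y = [0.0] * out_len
    t_s = marks[0] * time_scale                # first synthesis mark
    while t_s < out_len:
        t_a = t_s / time_scale                 # corresponding analysis time
        i = min(range(len(marks)), key=lambda k: abs(marks[k] - t_a))
        m = marks[i]
        # Local analysis period: distance to the neighbouring pitch mark.
        if len(marks) == 1:
            p = new_period
        elif i + 1 < len(marks):
            p = marks[i + 1] - m
        else:
            p = m - marks[i - 1]
        win = hann(2 * p)
        centre = int(round(t_s))
        for k in range(2 * p):                 # window spans m-p .. m+p-1
            src, dst = m - p + k, centre - p + k
            if 0 <= src < len(x) and 0 <= dst < out_len:
                y[dst] += win[k] * x[src]
        t_s += new_period
    return y
```

With a constant input and `new_period` equal to the analysis period, the overlapping windows sum to one, so the interior of the output reproduces the input; a smaller `new_period` packs the segments closer together and raises the pitch, while `time_scale` repeats or skips segments to change duration.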

Fig. 5 is a flowchart of the parsing process according to the present invention.

The purpose of syntactic analysis based on function words is to split the input sentence, in order, into words (eojeol) at spaces and to assign each word a grammatical attribute using a morpheme dictionary (51). The attributes are defined in Table 2 below. The estimated syntactic-structure information is used to generate control information for prosody processing of the synthesized speech. In the present invention, the linguistic analysis elements of a sentence are set on the basis of the grammatical information defined in the morpheme dictionary, and the prosodic characteristics of speech are classified at the levels of sentence type, clause, phrase, and word (52, 53, 54, 55).

- Sentence types: declarative (affirmative/negative), interrogative (with or without an interrogative word; inverted), and exclamatory.

- Clauses are classified by position (sentence-initial/medial/final), coordination, and modification (dependence on the part of speech of the modified word).

- Phrases are classified by grammatical role (subject/predicate), phrase type (noun phrase, verb phrase), position within the sentence, and position relative to specific phrases.

- Words (eojeol) are analyzed in terms of how their morphemes, especially function words, combine.
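A toy version of step (51) is sketched below: each space-delimited word (eojeol) gets an attribute from its longest matching function-word suffix, and the sentence type is guessed from final punctuation. The attribute table is hypothetical (Table 2 is not reproduced in this text), and plain suffix matching will mis-tag words whose stems happen to end in a particle, which is why the text relies on a morpheme dictionary.

```python
from typing import List, Tuple

# Hypothetical function-word table: suffix -> attribute (Table 2 is not reproduced here).
FUNCTION_WORDS = {
    "은": "TOPIC", "는": "TOPIC",
    "이": "SUBJECT", "가": "SUBJECT",
    "을": "OBJECT", "를": "OBJECT",
    "에서": "LOCATIVE", "에": "LOCATIVE",
    "고": "CONNECTIVE",
    "다": "DECLARATIVE_END", "까": "INTERROGATIVE_END",
}
MAX_SUFFIX = max(len(s) for s in FUNCTION_WORDS)


def tag_eojeols(sentence: str) -> List[Tuple[str, str]]:
    """Split at spaces and tag each word by its longest matching function-word suffix."""
    tagged = []
    for word in sentence.split():
        core = word.rstrip(".?!")
        attr = "OTHER"
        # Longer suffixes first so "에서" wins over "에"; keep at least one stem syllable.
        for n in range(min(MAX_SUFFIX, len(core) - 1), 0, -1):
            if core[-n:] in FUNCTION_WORDS:
                attr = FUNCTION_WORDS[core[-n:]]
                break
        tagged.append((word, attr))
    return tagged


def sentence_type(sentence: str) -> str:
    """Guess the sentence type from final punctuation only."""
    s = sentence.rstrip()
    if s.endswith("?"):
        return "interrogative"
    if s.endswith("!"):
        return "exclamatory"
    return "declarative"


print(tag_eojeols("나는 학교에서 책을 읽었다."))
# [('나는', 'TOPIC'), ('학교에서', 'LOCATIVE'), ('책을', 'OBJECT'), ('읽었다.', 'DECLARATIVE_END')]
```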

Fig. 6 is a flowchart of the prosody-processing procedure according to the present invention. Using the syntactic-analysis result produced by the parsing step, it performs the duration control of Fig. 7 (described below), selects an intonation model from Table 3 and computes its parameter values, and uses these to adjust the speaking rate and pitch of the synthesized speech.

Durations are computed in the order word, syllable, phoneme (56, 57, 58), as shown in Fig. 7. Pitch-control rules are written on the basis of the basic pitch pattern for each sentence type, the degree of change associated with syntactic structure (gradual/abrupt), analysis of where a change starts and the region it affects, and functional classification and analysis (64, 65, 66).

The fundamental frequency (F0) is computed by first generating the basic pattern at the sentence, clause, and phrase levels, in that order, and then making local adjustments that account for the phonological environment within each word.

- A basic intonation pattern is generated from the syntactic information; the degree of departure and rise at clause boundaries is computed, and part-of-speech information is used to determine relative peak heights.

- Phonological information (place and manner of articulation) determines local peaks and valleys that make the intonation sound lively.
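The layered F0 computation can be illustrated with a deliberately simple model: a linear declination line over normalized time, an optional final rise (for example, for questions), and per-syllable local offsets standing in for the phonologically driven peaks and valleys. The linear shape, parameter names, and default values are assumptions for illustration only; the patent's own intonation models (Table 3) are not given in this text.

```python
from typing import Dict, List, Optional


def f0_contour(n_syllables: int, f_start: float = 180.0, f_end: float = 120.0,
               final_rise: float = 0.0,
               local: Optional[Dict[int, float]] = None) -> List[float]:
    """Return one F0 target (Hz) per syllable: global pattern first, local adjustments after."""
    local = local or {}
    out = []
    for i in range(n_syllables):
        t = i / (n_syllables - 1) if n_syllables > 1 else 0.0  # normalized time in [0, 1]
        f0 = f_start + (f_end - f_start) * t                    # linear declination
        if i == n_syllables - 1:
            f0 += final_rise                                    # e.g. interrogative ending
        f0 += local.get(i, 0.0)                                 # local peak (+) or valley (-)
        out.append(f0)
    return out


print(f0_contour(3))                   # [180.0, 150.0, 120.0]
print(f0_contour(3, final_rise=80.0))  # [180.0, 150.0, 200.0]
```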

Table 2 is the attribute classification table used to assign a grammatical function (attribute) to each word on the basis of particles, inflectional endings, auxiliary predicates, adverbs, and conjunctions.

Fig. 7 is a flowchart of word, syllable, and phoneme duration calculation according to the present invention. Syllables are lengthened at clause and phrase boundaries, and within a word or phrase durations change depending on whether it is monosyllabic or polysyllabic. First, the word duration WDdur is computed (71).

Here, RFdur is the average duration of a single syllable, a is a proportionality constant, and j indexes the syllables within the word.

Next, the duration SYLdur of each syllable in the word is computed (72).

The durations of syllables immediately preceding sentence, clause, and phrase boundaries are then stretched or compressed (73). Next, the initial stretch ratio PRCNT0 of each syllable is computed; it serves as the initial value when computing each phoneme's stretch ratio according to its phonological environment (74).

Here, INHdur_i denotes the intrinsic duration of the i-th phoneme in the syllable.

The stretch ratio PRCNT of each phoneme's duration according to its phonological environment is then computed by applying the applicable rules to each phoneme in sequence, using the duration change ratio PRcnt_i assigned to each rule (75).

Finally, the phoneme duration PHONdur is computed from the resulting stretch ratio, the phoneme's intrinsic duration INHdur, and its minimum duration MINdur (76).
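The equations for steps (74) to (76) are not reproduced in this text. Because the variable names INHdur, MINdur, and PRCNT match Klatt's well-known duration rule, the sketch below assumes that form, PHONdur = MINdur + (INHdur - MINdur) × PRCNT / 100, with PRCNT obtained by multiplying the initial ratio by each applicable rule's percentage in turn. The formula used for PRCNT0 (100 × SYLdur / ΣINHdur_i) is likewise an assumption, and both function names are hypothetical.

```python
from typing import Iterable, Sequence


def initial_prcnt(syl_dur: float, inh_durs: Sequence[float]) -> float:
    """Assumed PRCNT0: syllable duration as a percentage of its phonemes' summed intrinsic durations."""
    return 100.0 * syl_dur / sum(inh_durs)


def phoneme_duration(inh_dur: float, min_dur: float, prcnt0: float,
                     rule_percents: Iterable[float]) -> float:
    """Assumed Klatt-style rule: PHONdur = MINdur + (INHdur - MINdur) * PRCNT / 100."""
    prcnt = prcnt0
    for p in rule_percents:  # apply each environment rule's PRcnt_i in sequence
        prcnt = prcnt * p / 100.0
    return min_dur + (inh_dur - min_dur) * prcnt / 100.0


# A phoneme with INHdur = 100 ms and MINdur = 40 ms, shortened by two 50 % rules:
print(phoneme_duration(100, 40, 100, [50, 50]))  # 55.0
```

The MINdur floor is the point of this form: however many shortening rules apply, a phoneme never drops below its minimum duration.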

Table 3 classifies the intonation models. There are four basic pitch contours: declarative, interrogative, exclamatory, and phrase/word. There are four final pitch models: declarative ending, interrogative ending, exclamatory ending, and phrase/word ending. For specific syllables there are two models: the local pitch characteristics of a specific syllable and the local pitch characteristics of a specific syllable in final position.

Here, t denotes time normalized to the sentence, phrase, or word.

As described above, the present invention uses a mix of phonemes, demisyllables, and syllables as synthesis units; builds the synthesis-unit DB from 1,204 CDUs, the minimum number covering all Korean phonological environments; concatenates units according to the CDU construction principles; and, to realize prosody, defines Korean linguistic analysis elements (attributes) and develops methods for analyzing sentence type and syntactic structure from the input text and for controlling intonation, segment duration, and emphasis. This improves the naturalness and fluency of the synthesized speech, and using the TD-PSOLA synthesizer, an efficient waveform-editing method, makes the system easy to implement and greatly improves intelligibility.

Claims

A Korean text-to-speech conversion method applied to an apparatus having character input means (11) for receiving Korean characters; central control means (12) for receiving the Korean characters from the character input means (11) and controlling each component according to the algorithm of the present invention; a synthesis-unit database (13), stored in a storage device as the CDU synthesis-unit DB used by the synthesis algorithm, for transmitting the required data to the central control means (12); and digital/analog conversion means (14) for converting the synthesized digital data into an analog signal and outputting it, the method comprising: a first step of classifying synthesis units by analyzing the phonological structure of Korean and the constraints on phoneme concatenation; a second step of building a synthesis-unit database in a structure that allows units to be accessed easily phoneme by phoneme and phoneme duration and pitch to be modified in real time; a third step of fetching, from the synthesis-unit database, the data required for each segment of a syllable as phonemes and demisyllables; and a fourth step (41 to 48) in which a language-processing module preprocesses the input text sentence, analyzes its words, parses it, and performs grapheme-to-phoneme conversion, a prosody-processing module receives the results of the language-processing module and applies prosody rules suited to the sentence structure, and a speech-generation module receives the results of the language-processing and prosody-processing modules, retrieves the phonetic symbols and prosodic information from the synthesis-unit DB, and synthesizes the synthesis units.

2. The method of claim 1, wherein the synthesis units classified in the first step number 1,204.

3. The method of claim 1, wherein the synthesis unit database (DB) comprises: a voice data file (23) that sequentially stores the pulse code modulation (PCM) data of the synthesis units; a pitch mark file (22) that stores, in units of samples, the positions of the pitch marks of the speech signals stored in the voice data file (23) and the respective pitch values; and a synthesis unit address data file (21) that stores, for each synthesis unit, the synthesis unit number, the start and end points in the voice data file (23), the number of segments, and, for each segment, the start point in the pitch mark file (22) and the number of pitch marks.
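The three-file layout of claim 3 can be pictured as an index (file 21) pointing into two flat arrays (files 22 and 23). The sketch below is a hypothetical in-memory model of that layout, not the patent's on-disk format: it assumes pitch mark positions are absolute sample offsets into the voice data file and that a unit's end point is exclusive. All class and method names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    pm_start: int  # index of this segment's first entry in the pitch mark file (22)
    pm_count: int  # number of pitch marks belonging to this segment


@dataclass
class UnitAddress:
    unit_no: int              # synthesis unit number
    start: int                # first sample of the unit in the voice data file (23)
    end: int                  # one past the last sample (assumed exclusive)
    segments: list[Segment]   # len(segments) is the stored segment count


class SynthesisUnitDB:
    """Hypothetical in-memory stand-in for files (21), (22), and (23)."""

    def __init__(self, pcm, pitch_marks, addresses):
        self.pcm = pcm                  # (23): PCM samples of all units, back to back
        self.pitch_marks = pitch_marks  # (22): (sample position, pitch value) pairs
        self.by_no = {a.unit_no: a for a in addresses}  # (21): address records by unit number

    def waveform(self, unit_no):
        """Return the PCM samples of one synthesis unit."""
        a = self.by_no[unit_no]
        return self.pcm[a.start:a.end]

    def segment_marks(self, unit_no, seg_idx):
        """Return the pitch marks of one segment of a unit, for phoneme-level access."""
        seg = self.by_no[unit_no].segments[seg_idx]
        return self.pitch_marks[seg.pm_start:seg.pm_start + seg.pm_count]
```

Because every segment carries its own pitch-mark range, a synthesizer can fetch a single phoneme-sized piece of a unit and rescale its periods without scanning the whole waveform, which is the real-time access the second step of claim 1 calls for.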

4. The method of claim 1, wherein the third step divides each syllable into five segments, namely the initial consonant C1, the first vowel part V1, the second vowel part V2, the final consonant C2, and C3, and fetches data suitable for each segment from the synthesis unit DB.
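The claim does not say how a written syllable is split before its segments are chosen. One standard way to obtain the initial, vowel, and final of a precomposed Hangul syllable is Unicode arithmetic: syllables start at U+AC00 and are ordered as initial × 21 vowels × 28 finals. The sketch below assumes that approach; the function name is hypothetical, and treating a written initial 'ㅇ' as "no initial consonant" reflects the fact that it is silent in that position.

```python
# Compatibility jamo in Unicode Hangul-syllable order.
CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                      # 19 initials
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals incl. none


def split_syllable(syl):
    """Return (initial, vowel, final) for one precomposed Hangul syllable.

    initial is None when the syllable is written with a silent 'ㅇ' onset,
    final is None when there is no final consonant.
    """
    code = ord(syl) - 0xAC00
    if not 0 <= code < 19 * 21 * 28:
        raise ValueError(f"not a precomposed Hangul syllable: {syl!r}")
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    initial = None if CHO[cho] == "ㅇ" else CHO[cho]
    final = JONG[jong] or None
    return initial, JUNG[jung], final
```

For example, `split_syllable("한")` gives `("ㅎ", "ㅏ", "ㄴ")`: the initial feeds C1, the vowel feeds V1 and V2, and the final feeds C2.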

5. The method of claim 4, wherein data are selected from the synthesis unit DB according to the vowel Vp and final consonant Cp of the preceding syllable, the initial consonant C1, vowel V, and final consonant C2 of the target syllable, and the initial consonant Cn and vowel Vn of the following syllable, as follows:
- Type C1: C1 = CV for the first syllable of a word; C1 = e1LV when Cp = C1 = 'ㄹ'; C1 = eCV otherwise. When the syllable has an initial consonant, V1 = C1.
- Type V1 (syllable beginning with a vowel): for the first syllable of a word, V1 = V for a monophthong or a diphthong without a final consonant, V1 = V for the final vowel of a diphthong, and V1 = -1 otherwise (determined in a later step); V1 = eCV when the preceding syllable ends in a final consonant; V1 = VV when the preceding syllable ends in a monophthong; V1 = ejV when it ends in a 'j'-series vowel; V1 = ewV when it ends in a 'w'-series vowel.
- Type V2 without a final consonant: V2 = V1 at the end of a word; V2 = VCe when the following syllable has an initial consonant; V2 = VV when the following syllable begins with a monophthong; V2 = Vwe when the following syllable begins with a 'j'-series vowel.
- Type V2 with a final consonant: V2 = VC at the end of a word; V2 = VC when the final is a stop (closure); V2 = VL1e when Cp = 'ㄹ' and C1 = d; V2 = VCe when the following syllable begins with a vowel; V2 = VCDa when the final is a continuant and the following syllable begins with a consonant.
- Type C3: C3 = aCCw when the following syllable begins with a consonant.
- Finally, wherever V1 = -1, V1 = V2.
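A partial sketch of how the C1 and V1 rules of claim 5 could be coded as context lookups. It covers only the clearest rules and simplifies the first-syllable V1 case: any non-monophthong is left unresolved (the claim's V1 = -1) and later filled from V2. The `Syl` record, the vowel-class labels "mono", "j", and "w", and all function names are hypothetical; V2 and C3 selection is omitted.

```python
from dataclasses import dataclass
from typing import Optional

UNRESOLVED = -1  # the claim's "V1 = -1": decided only after V2 is known


@dataclass
class Syl:
    initial: Optional[str]  # None when the syllable begins with a vowel
    vowel_class: str        # hypothetical labels: "mono", "j" (j-series), "w" (w-series)
    final: Optional[str]    # None when there is no final consonant


def select_c1(target, prev, first_in_word):
    """Type C1 rules, for a target syllable that has an initial consonant."""
    if first_in_word:
        return "CV"
    if prev is not None and prev.final == "ㄹ" and target.initial == "ㄹ":
        return "e1LV"  # Cp = C1 = 'ㄹ'
    return "eCV"       # all other cases


def select_v1(target, prev, first_in_word):
    """Type V1 rules (simplified for the word-initial case)."""
    if target.initial is not None:
        return select_c1(target, prev, first_in_word)  # V1 = C1
    if first_in_word or prev is None:
        return "V" if target.vowel_class == "mono" else UNRESOLVED
    if prev.final is not None:
        return "eCV"  # preceding syllable ends in a final consonant
    return {"mono": "VV", "j": "ejV", "w": "ewV"}[prev.vowel_class]


def resolve_v1(v1, v2):
    """Last rule of the claim: an unresolved V1 takes the value of V2."""
    return v2 if v1 == UNRESOLVED else v1
```

Keeping each rule a plain conditional on the neighbouring syllables mirrors the claim's context-dependent (CDU) selection: the same vowel maps to a different stored unit depending on what precedes it.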

6. The method of claim 1, wherein the fourth step (41 to 48) comprises: a fifth step (41 to 45) in which the language processing module first performs preprocessing that interprets abbreviations, sentence symbols, and special terms in the input text sentence as Korean; performs word analysis by estimating and assigning a grammatical function to each word of the input sentence using the part-of-speech classification in the lexical dictionary and the registered Korean particles, conjugational endings, adverbs, and conjunctions; performs parsing that estimates the syntactic structure of the input sentence using Korean grammar; and performs grapheme-to-phoneme conversion, in which words registered in the exception pronunciation dictionary are converted according to that dictionary and words not registered there are converted into phonemes by word dictionary lookup and Korean pronunciation rules; a sixth step (46) in which the prosody processing module receives the processing result of the language processing module and applies prosody rules suited to the sentence structure to generate information related to naturalness and fluency, such as speaking rate, intonation, and phrase breaks; and a seventh step (47, 48) in which the synthesized-speech generation module retrieves from the synthesis unit DB the units matching the phonetic symbols and prosody information obtained above, adjusts, processes, and concatenates them by the TD-PSOLA method to generate synthesized speech, and outputs it to the user as voice.
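The claims name TD-PSOLA for adjusting and joining units but do not spell out the algorithm. Below is a minimal sketch of TD-PSOLA pitch modification, assuming a mono signal and strictly increasing analysis pitch marks such as the pitch mark file of claim 3 would supply. Two-period Hann-windowed segments are cut around each analysis mark and overlap-added at synthesis marks spaced by the local period divided by `f0_scale`; each synthesis mark reuses the nearest analysis mark, so overall duration is preserved. Duration scaling and unit concatenation are omitted, and the function name and parameters are hypothetical.

```python
import numpy as np


def td_psola_pitch(x, marks, f0_scale):
    """Return (y, synthesis_marks) with pitch scaled by f0_scale (> 0).

    x      : 1-D signal
    marks  : strictly increasing analysis pitch-mark sample positions
    """
    x = np.asarray(x, dtype=float)
    marks = np.asarray(marks, dtype=int)
    periods = np.diff(marks)  # local pitch period (samples) after each mark
    y = np.zeros(len(x))
    synth_marks = []

    t = float(marks[0])
    while t < marks[-1]:
        k = int(np.argmin(np.abs(marks - t)))         # nearest analysis mark
        p = int(periods[min(k, len(periods) - 1)])    # local period at that mark
        lo, hi = marks[k] - p, marks[k] + p           # two-period analysis window
        s = int(round(t))
        if lo >= 0 and hi <= len(x) and s - p >= 0 and s + p <= len(y):
            y[s - p:s + p] += x[lo:hi] * np.hanning(2 * p)  # overlap-add
        synth_marks.append(s)
        t += p / f0_scale                             # new spacing sets the new pitch
    return y, synth_marks
```

With a 100-sample period and `f0_scale=2.0`, synthesis marks fall every 50 samples, so pitch doubles while the output keeps the input length.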

7. The method of claim 6, wherein the sixth step (46) comprises: an eighth step (61 to 63) of calculating durations in the order of word duration, syllable duration, and phoneme duration; and a ninth step (64, 65) of creating pitch control rules according to the basic pitch pattern for each sentence type, the degree of change (gradual or abrupt) related to the syntactic structure, an analysis of the starting point and effect of each change, and a functional classification and analysis.

8. The method of claim 7, wherein the eighth step (61 to 63) comprises: a tenth step (71, 72) of obtaining the duration WDdur of the word and then the duration SYLdur of each syllable in the word; an eleventh step (73, 74) of adjusting the lengthening of syllables that precede a sentence, clause, or phrase boundary and then obtaining the initial stretching ratio PRCNTO of each syllable; and a twelfth step (75, 76) of applying the corresponding rules to each phoneme in turn, calculating the stretching ratio of the phoneme duration according to its phonetic environment using the duration change rate PRcnt_i assigned to each rule, and calculating the phoneme duration PHONdur from the intrinsic duration INHdur and the minimum duration MINdur.
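Claim 8 names the quantities (INHdur, MINdur, a per-rule percentage PRcnt_i, a starting ratio PRCNTO) but not the formula that combines them. These variables match Klatt's rule-based duration model, in which each applicable rule multiplies a running percentage and only the compressible part above the minimum is scaled: PHONdur = MINdur + (INHdur - MINdur) * PRCNT / 100. The sketch assumes that form and assumes PRCNTO is the starting percentage; the function name is hypothetical.

```python
def phoneme_duration(inh_dur, min_dur, initial_prcnt, rule_prcnts):
    """Klatt-style phoneme duration (assumed form; durations in ms).

    inh_dur       : intrinsic duration INHdur of the phoneme
    min_dur       : minimum duration MINdur
    initial_prcnt : starting ratio from the syllable stage (PRCNTO), in percent
    rule_prcnts   : PRcnt_i of each rule that fires for this phoneme, in percent
    """
    prcnt = float(initial_prcnt)
    for p in rule_prcnts:
        prcnt = prcnt * p / 100.0  # rules compose multiplicatively
    # Only the stretchable part above MINdur is scaled.
    return min_dur + (inh_dur - min_dur) * prcnt / 100.0
```

For example, with INHdur = 100 ms, MINdur = 40 ms, a starting ratio of 150 %, and one rule of 50 %, the running percentage is 75 % and the phoneme lasts 40 + 60 * 0.75 = 85 ms. Scaling only the part above MINdur keeps heavily shortened phonemes from collapsing below a perceptible length.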