KR0150366B1

KR0150366B1 - Graphic user interface

Info

Publication number: KR0150366B1
Application number: KR1019950032887A
Authority: KR
Inventors: 김상훈
Original assignee: 양승택; 한국전자통신연구소; 이준; 한국전기통신공사
Priority date: 1995-09-29
Filing date: 1995-09-29
Publication date: 1998-10-15
Also published as: KR970017007A

Abstract

The present invention relates to a method of generating intonation rules using a graphical user interface, and in particular provides a method of generating intonation rules through a graphical user interface so that more natural speech can be produced in speech synthesis with an unlimited-vocabulary speech synthesizer. The method comprises: a first step (11 to 15) of synthesizing, in the speech synthesis means (5), test data read from the storage means (2) and calculating the per-syllable average intonation, the mean square error, and the squared error; a second step (16) of comparing the squared error of each syllable with the mean square error; and a third step (17 to 23) of, depending on the result of the comparison in the second step (16), either storing the analyzed information of the sentence and then modifying the intonation data and creating an intonation rule, or directly modifying the intonation data and creating an intonation rule. The invention improves the naturalness of the final synthesized speech, yields results that are easy to formalize into rules, can extract prosodic patterns within a range that does not greatly impair naturalness, and is useful as a tool for experimenting with and writing prosody rules for speech linguists who have no programming ability.

Description

Method of generating intonation rules using a graphical user interface (GUI)

FIG. 1 is a block diagram of a system to which the present invention is applied.

FIG. 2 is a flowchart of the method according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

1: central processing unit
2: storage device
3: output device
4: input device
5: speech synthesis device
6: D/A conversion device
7: speaker

The present invention relates to a method of generating intonation rules using a graphical user interface, and more particularly to a method of generating intonation rules through a graphical user interface so that more natural speech can be produced in speech synthesis with an unlimited-vocabulary speech synthesizer.

Speech synthesis is significant in that it lets a computer present various kinds of information to its human user in spoken form. With a speech synthesizer, the user can have existing text data, or text information supplied by a conversation partner, output as speech. Synthesized speech can deliver a high-quality speech synthesis service only if it is highly intelligible and natural and is fluent, with an appropriate speaking rate and appropriate semantic emphasis.

This requires rules that appropriately assign intonation, a prosodic element of language, to text data. Creating such intonation rules is time-consuming, because speech-linguistic knowledge must be formalized into rules on the basis of speech data. Moreover, research developers have had insufficient speech-linguistic knowledge, which limited their ability to produce accurate intonation rules, while speech linguists have had insufficient technical knowledge, which limited their ability to implement intonation rules on a computer.

As a result, it has so far been difficult to generate intonation rules based on speech-linguistic knowledge.

The present invention was devised to solve the above problem. Its object is to provide an intonation rule generation method by which, in speech synthesis with an unlimited-vocabulary speech synthesizer, a linguist can easily create intonation rules by using a graphical user interface to analyze and display the prosodic phenomena that occur in a language.

To achieve the above object, the present invention provides an intonation rule generation method applied to a speech conversion system comprising: central processing means for controlling each component; storage means for storing speech waveforms, text data, and intonation data; input/output means for input from and output to an operator; speech synthesis means for synthesizing the output data of the central processing means into speech; D/A conversion means for converting the digital signal of the speech synthesis means into an analog signal; and a speaker for outputting the speech data of the D/A conversion means. The method is characterized by comprising: a first step of synthesizing, in the speech synthesis means, test data read from the storage means and calculating the per-syllable average intonation, the mean square error, and the squared error; a second step of comparing the squared error of each syllable with the mean square error; and a third step of, depending on the result of the comparison in the second step, either storing the analyzed information of the sentence and then modifying the intonation data and creating an intonation rule, or directly modifying the intonation data and creating an intonation rule.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system to which the present invention is applied. The system comprises a central processing unit (1) that controls each component; a storage device (2) that stores speech waveforms, text data, and intonation data; an output device (3) that presents output to the operator on a screen and a printer; an input device (4) that receives input data from the operator; a speech synthesis device (5) that synthesizes the output data of the central processing unit (1) into speech; a D/A conversion device (6) that converts the digital signal of the speech synthesis device (5) into an analog signal; and a speaker (7) that outputs the speech data of the D/A conversion device (6).

The present invention is loaded into and executed by the central processing unit (1). The storage device (2) stores intonation data that a research developer has produced according to general rules. The speech linguist has the intonation data in the storage device (2) processed by the central processing unit (1), synthesized into speech by the speech synthesis device (5), converted from digital to analog (D/A), and output through the speaker (7). If the output speech requires correction, the correction is made through the input device (4). During this process, the speech synthesized by the speech synthesis device (5) and the original speech data are displayed on the output device (3). The process is repeated; once it is complete, the modified data are stored, and the factors affecting naturalness that emerged during the process are recorded to generate the intonation rules.

FIG. 2 is an overall flowchart of the method according to the present invention.

When the present invention is loaded, the speech waveform, text data, and intonation data are read from the storage device (2) and displayed on the screen (11). The speech waveform is displayed so that the portion of speech corresponding to each character is visible, which lets the operator see how the waveform changes for the part of the intonation that was modified. The text data are displayed as the Hangul code of the sentence together with International Phonetic Alphabet (IPA) symbols, and the intonation data and text data are displayed aligned with each other so that the desired part can be selected when the intonation is modified.

The speech synthesis device (5) processes the read data according to the rules to generate intonation data, stores them in the storage device (2), and then outputs the intonation data of the synthesized speech to the output device (3) (12).

For both the intonation data of the original speech stored in the storage device (2) and the intonation data output by the speech synthesis device (5), the average of the intonation data within the interval corresponding to each syllable is calculated (13). The per-syllable average intonation serves as the reference data for comparing the intonation data of the original speech with that of the synthesized speech, and the average is obtained as follows.

$$\bar{P} = \frac{1}{N}\sum_{i=1}^{N} P_i$$

Here, $N$ is the number of pitch values in the syllable and $P_i$ denotes the $i$-th pitch value.
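The per-syllable averaging of step 13 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `syllable_mean_pitch`, the representation of the intonation data as a list of pitch values in Hz, and the `(start, end)` syllable boundaries are hypothetical and do not come from the patent.

```python
from statistics import fmean


def syllable_mean_pitch(pitch_hz, boundaries):
    """Return the mean pitch of each syllable.

    pitch_hz   -- pitch values for the whole sentence (assumed unit: Hz)
    boundaries -- (start, end) index pairs, end exclusive, one per syllable
    """
    means = []
    for start, end in boundaries:
        segment = pitch_hz[start:end]
        if not segment:
            raise ValueError(f"empty syllable segment {start}:{end}")
        # Arithmetic mean of the N pitch values inside this syllable.
        means.append(fmean(segment))
    return means


contour = [100.0, 110.0, 120.0, 200.0, 220.0]
print(syllable_mean_pitch(contour, [(0, 3), (3, 5)]))  # [110.0, 210.0]
```

The same function would be applied once to the original speech and once to the synthesized speech, so that the two lists of syllable means can be compared position by position.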

Next, the difference between the intonation data of the original speech and the intonation data generated by rule (by the speech synthesizer) is obtained as the mean square error (MSE) between the per-syllable average intonation of the original speech and that of the synthesized speech (14). The mean square error expresses the difference between the intonation data of the original speech and the synthesized speech as an average difference over the whole sentence.

$$\mathrm{MSE} = \frac{1}{M}\sum_{j=1}^{M}\left(\bar{P}^{\,o}_{j} - \bar{P}^{\,s}_{j}\right)^{2}$$

Here, $M$ is the number of pitch values in the sentence, $\bar{P}^{\,o}_{j}$ is the average pitch of the original speech, and $\bar{P}^{\,s}_{j}$ is the average pitch of the synthesized speech.
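A minimal Python sketch of the mean square error of step 14 is given below. The function name `mean_square_error` is hypothetical, and the sketch assumes that one average pitch per syllable is available for each of the two utterances and that the sum is normalized by the number of compared averages; the patent text defines the normalizing count as the number of pitch values in the sentence, so this normalization is an assumption.

```python
def mean_square_error(original_means, synthetic_means):
    """Mean of the squared differences between paired average pitches.

    Both arguments hold one average pitch per syllable, in the same order.
    Assumption: the sum is divided by the number of compared pairs.
    """
    if len(original_means) != len(synthetic_means):
        raise ValueError("both utterances need the same number of syllables")
    if not original_means:
        raise ValueError("at least one syllable is required")
    squared = [(o - s) ** 2 for o, s in zip(original_means, synthetic_means)]
    return sum(squared) / len(squared)


print(mean_square_error([110.0, 210.0], [100.0, 200.0]))  # 100.0
```

A single number per sentence is convenient here because it gives each syllable a sentence-specific threshold to be compared against.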

Next, the squared error between the per-syllable average intonation of the original speech and that of the synthesized speech is calculated (15), and the difference between the intonation curve of the original speech and that of the synthesized speech is judged by comparing each syllable's squared error with the mean square error (16).

If the squared error of a syllable is larger than the mean square error, the sentence analysis results (information affecting naturalness) for the portion where the intonation curves of the original and synthesized speech differ greatly are stored in the storage device (2) (17). The stored information includes the type of sentence (declarative, interrogative, exclamatory), the type of clause (position, coordinate relation, modifying relation), the type of phrase (noun phrase, verb phrase), how each word (eojeol) combines with function words, the position of the word within the clause and phrase, the number of syllables in the word, and the position of the syllable within the word. These items are factors that affect the naturalness of the speech synthesis device (5), and they are analyzed and output by the speech synthesis device (5).
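Steps 15 to 17 can be illustrated with the following Python sketch, which flags the syllables whose squared error exceeds the mean square error and keeps their linguistic context. The `SyllableContext` record, its field names, and the function `collect_large_deviations` are hypothetical; only a subset of the items listed above is modeled, and the mean is taken over the compared syllables as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyllableContext:
    """Linguistic context of one syllable (field names are hypothetical)."""
    sentence_type: str      # e.g. "declarative", "interrogative", "exclamatory"
    phrase_type: str        # e.g. "noun phrase", "verb phrase"
    syllables_in_word: int  # number of syllables in the containing word
    position_in_word: int   # 0-based position of the syllable in that word


def collect_large_deviations(original_means, synthetic_means, contexts):
    """Return {syllable index: context} for syllables whose squared error
    is larger than the mean square error of the sentence."""
    if not (len(original_means) == len(synthetic_means) == len(contexts)):
        raise ValueError("all inputs need one entry per syllable")
    if not original_means:
        raise ValueError("at least one syllable is required")
    errors = [(o - s) ** 2 for o, s in zip(original_means, synthetic_means)]
    mse = sum(errors) / len(errors)
    return {i: contexts[i] for i, e in enumerate(errors) if e > mse}


contexts = [
    SyllableContext("interrogative", "noun phrase", 2, 0),
    SyllableContext("interrogative", "noun phrase", 2, 1),
    SyllableContext("interrogative", "verb phrase", 1, 0),
]
stored = collect_large_deviations(
    [110.0, 210.0, 150.0], [100.0, 210.0, 170.0], contexts
)
print(sorted(stored))  # [2]: only the third syllable exceeds the mean
```

In this example the squared errors are 100, 0, and 400, so the mean is about 166.7 and only the last syllable is recorded.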

If the squared error of a syllable is smaller than the mean square error, or once the sentence analysis results have been stored, the intonation data of the synthesized speech are modified using the graphical user interface (GUI) (18). The modification is made by using the mouse to reshape the intonation curve on the screen into the desired curve.
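The patent does not say how the mouse input is turned into a new curve. As one possible illustration, the Python sketch below treats mouse clicks as anchor points and redraws the contour by linear interpolation between them. The function name `apply_anchor_edit`, the anchor representation, and the choice of linear interpolation are all assumptions.

```python
def apply_anchor_edit(contour, anchors):
    """Return a copy of the contour redrawn through the given anchor points.

    contour -- pitch values of the synthesized sentence
    anchors -- (index, new pitch) pairs standing in for mouse clicks
    Values between consecutive anchors are linearly interpolated (assumption).
    """
    points = sorted(anchors)
    if any(not 0 <= i < len(contour) for i, _ in points):
        raise IndexError("anchor index outside the contour")
    edited = list(contour)
    for i, pitch in points:
        edited[i] = pitch
    for (i0, p0), (i1, p1) in zip(points, points[1:]):
        if i1 == i0:
            continue  # two clicks on the same sample: nothing to interpolate
        for i in range(i0 + 1, i1):
            edited[i] = p0 + (p1 - p0) * (i - i0) / (i1 - i0)
    return edited


flat = [100.0] * 5
print(apply_anchor_edit(flat, [(0, 100.0), (4, 140.0)]))
# [100.0, 110.0, 120.0, 130.0, 140.0]
```

Returning a copy rather than editing in place keeps the rule-generated contour available for comparison with the edited one.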

When the modification is complete, the speech synthesis device (5) is started (19); it synthesizes speech using the modified intonation data and performs digital-to-analog conversion (20).

Based on the synthesized speech generated from the modified intonation data, a save message that decides whether the modified intonation curve is to be stored in the storage device (2) is received and evaluated (21). If the save message indicates that modification is complete, the modified intonation data are stored (22), a new intonation rule is created from the modified intonation data and the sentence analysis information (the factors affecting naturalness), and the process ends (23). The intonation rule is created with a training algorithm, namely a hidden Markov model (HMM), a neural network, or a table look-up method.
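Of the three rule-creation options, the table look-up method is the simplest to illustrate. The Python sketch below builds a table that maps a context key to the average pitch correction the operator applied in that context. The key layout, the use of a pitch correction as the stored value, and the function name `build_lookup_rules` are assumptions made for illustration; the patent does not specify the table contents.

```python
from collections import defaultdict


def build_lookup_rules(samples):
    """Build a look-up table from (context key, pitch correction) samples.

    Each key is a hashable description of the syllable context; the table
    stores the mean correction observed for that key (assumed rule format).
    """
    totals = defaultdict(lambda: [0.0, 0])  # key -> [sum, count]
    for key, correction in samples:
        totals[key][0] += correction
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


samples = [
    (("interrogative", "last syllable"), 20.0),
    (("interrogative", "last syllable"), 30.0),
    (("declarative", "last syllable"), -10.0),
]
rules = build_lookup_rules(samples)
print(rules[("interrogative", "last syllable")])  # 25.0
```

At synthesis time such a table would be consulted with the context of each syllable, and a missing key would simply leave the rule-generated pitch unchanged.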

If the save message is a command requesting further modification, the process is repeated from the step of modifying the intonation data (18).
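The loop formed by steps 18 to 22 can be sketched in Python as follows. The three callbacks stand in for the GUI editor, the synthesizer with D/A output, and the operator's save message; their names, the `"done"` message value, and the `max_rounds` safety limit are hypothetical.

```python
def refine_intonation(contour, edit, synthesize_and_play, read_save_message,
                      max_rounds=20):
    """Repeat edit -> synthesize -> decide until the operator accepts.

    Returns the accepted contour, which the caller would then store.
    max_rounds is a safety limit added for this sketch.
    """
    for _ in range(max_rounds):
        contour = edit(contour)              # step 18: modify via the GUI
        synthesize_and_play(contour)         # steps 19-20: synthesis and D/A
        if read_save_message() == "done":    # step 21: evaluate save message
            return contour                   # step 22: ready to be stored
    raise RuntimeError("operator did not confirm within max_rounds")


messages = iter(["retry", "done"])
played = []
result = refine_intonation(
    [100.0, 100.0],
    edit=lambda c: [p + 10.0 for p in c],    # stand-in for a mouse edit
    synthesize_and_play=played.append,       # stand-in for audio output
    read_save_message=lambda: next(messages),
)
print(result, len(played))  # [120.0, 120.0] 2
```

Passing the three actions in as callbacks keeps the control flow testable without a real GUI or audio device.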

The intonation curve of the original speech serves as reference data for producing an intonation curve in a form that is easy to formalize into rules, that is, a curve that can be modeled by a mathematical expression or by a minimal set of intonation patterns.

As described above, the present invention improves the naturalness of the final synthesized speech and yields results that are easy to formalize into rules. It can extract prosodic patterns within a range that does not greatly impair naturalness, it allows intonation rules for speech synthesis to be created by training on the stored intonation data and sentence analysis information, and it is useful as a tool for experimenting with and writing prosody rules for speech linguists who have no programming ability.

Claims

1. An intonation rule generation method applied to a speech conversion system comprising: central processing means (1) for controlling each component; storage means (2) for storing speech waveforms, text data, and intonation data; input/output means (3, 4) for input from and output to an operator; speech synthesis means (5) for synthesizing the output data of the central processing means (1) into speech; D/A conversion means (6) for converting the digital signal of the speech synthesis means (5) into an analog signal; and a speaker (7) for outputting the speech data of the D/A conversion means (6), the method comprising: a first step (11 to 15) of synthesizing, in the speech synthesis means (5), test data read from the storage means (2) and calculating the per-syllable average intonation, the mean square error, and the squared error; a second step (16) of comparing the squared error of each syllable with the mean square error; and a third step (17 to 23) of, depending on the result of the comparison in the second step (16), either storing the analyzed information of the sentence and then modifying the intonation data and creating an intonation rule, or directly modifying the intonation data and creating an intonation rule.

2. The method of claim 1, wherein the first step (11 to 15) comprises: a fourth step (11) of reading the speech waveform, text data, and intonation data from the storage means (2); a fifth step (12) in which the speech synthesis means (5) processes the read data according to the rules to generate intonation data, stores them in the storage means (2), and then outputs the intonation data of the synthesized speech to the input/output means (3, 4); a sixth step (13) of calculating, for the intonation data of the original speech stored in the storage means (2) and the intonation data output by the speech synthesis means (5), the average of the intonation data for each syllable; a seventh step (14) of calculating the mean square error (MSE) between the per-syllable average intonation of the original speech and that of the synthesized speech; and an eighth step (15) of calculating the squared error between the per-syllable average intonation of the original speech and that of the synthesized speech.

3. The method of claim 1, wherein the third step (17 to 23) comprises: a fourth step (17) of storing in the storage means (2), if the squared error of a syllable is larger than the mean square error, the sentence analysis results for the portion where the intonation curve of the original speech and that of the synthesized speech differ greatly; a fifth step (18 to 20) of, if the squared error of a syllable is smaller than the mean square error or after the sentence analysis results have been stored, modifying the intonation data of the synthesized speech using a graphical user interface (GUI), the speech synthesis means (5) then synthesizing speech using the modified intonation data and performing digital-to-analog conversion; a sixth step (21) of determining the type of the save message; and a seventh step (22, 23) of, according to the determination in the sixth step (21), either storing the modified intonation data and creating a new intonation rule using the modified intonation data and the sentence analysis information, or repeating the fifth step (18 to 20).