KR100883649B1

KR100883649B1 - Text to speech conversion apparatus and method thereof

Info

Publication number: KR100883649B1
Application number: KR1020020018503A
Authority: KR
Inventors: 정지혜
Original assignee: 삼성전자주식회사
Priority date: 2002-04-04
Filing date: 2002-04-04
Publication date: 2009-02-18
Also published as: KR20030079460A

Abstract

The present invention relates to a text-to-speech (TTS) conversion apparatus and method capable of generating, based on context information, a synthesized sound in which discontinuity for the input text information is minimized.

A text-to-speech conversion apparatus according to the present invention comprises: a language processor that analyzes context information (morphemes, syntactic structure) of the text, divides into segments the portions in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value, and lists context information (morphemes, syntactic structure) for each segment; a storage unit that stores prosodic and phonological information predicted in advance on a per-segment basis; a detector that detects candidate segments for each segment in the storage unit based on the listing information transmitted from the language processor; and a synthesis processor that generates a synthesized sound corresponding to the text using the candidate segments detected by the detector.

Discontinuity in the generated synthesized sound can therefore be minimized.

Description

Text to speech conversion apparatus and method

FIG. 1 is a block diagram of a text-to-speech conversion apparatus according to a preferred embodiment of the present invention.

FIG. 2 is an example of Korean text used to explain the operation of the apparatus according to the present invention.

FIG. 3 is an example of English text used to explain the operation of the apparatus according to the present invention.

FIG. 4 is an operational flowchart of a text-to-speech conversion method according to a preferred embodiment of the present invention.

The present invention relates to a text-to-speech conversion apparatus (hereinafter referred to as a TTS (Text-To-Speech) apparatus) and method, and more particularly to a text-to-speech conversion apparatus and method for minimizing discontinuity in a synthesized sound.

In general, a TTS apparatus provides speech corresponding to text information and is mainly used in computer systems to deliver various kinds of information to the user by voice. Such a TTS apparatus should be able to produce high-quality synthesized sound from a given text. High-quality synthesized sound means highly natural sound in which the pronunciation (phonetic value or phoneme) is clear and prosodic elements such as phrasing (pause placement), duration, pitch, and intensity are properly realized.

To provide high-quality synthesized sound, a conventional TTS apparatus first separates from the input text only the pure sentence text, including punctuation marks. It then estimates linguistic information from the separated sentences and converts the sentences into phoneme strings through pronunciation conversion. Based on the estimated linguistic information and the phoneme strings, it calculates prosodic parameter values related to phrasing, pitch, intensity, and duration, and uses the calculated prosodic parameter values and the phoneme string information to select suitable speech units from a synthesis unit database and generate the desired synthesized sound.

However, because a conventional TTS apparatus generates synthesized sound for text by concatenating predefined synthesis units, discontinuity in the synthesized sound is likely to be strongly perceived at the unit joins. The sections in which such discontinuity is perceived are pause sections or sections segmented by linguistic analysis, and they are determined by syllable type and phonetic combination. The synthesis units of a conventional TTS apparatus, however, are defined by considering the conditions under which synthesis is possible at the phoneme level, without regard to the sections in which discontinuity is perceived. Consequently, whether or not the defined synthesis units have a constant length, the discontinuity described above is likely to be perceived in the generated synthesized sound.

The present invention is intended to solve the above problem, and an object thereof is to provide a text-to-speech conversion apparatus and method capable of generating, based on context information, a synthesized sound in which discontinuity for the input text information is minimized.

Another object of the present invention is to provide a text-to-speech conversion apparatus and method that can minimize discontinuous sections in a synthesized sound by dividing into segments, based on context information, the portions (or morphemes) in which discontinuity between synthesis units is perceived only slightly or not at all, and generating the synthesized sound from those segments.

Still another object of the present invention is to provide a text-to-speech conversion apparatus and method in which prosody generation and phoneme selection are easy, by obtaining prosodic and phonological (or pronunciation) information for the input text information using presegmentation information prepared in advance on the basis of context information.

To achieve the above objects, a text-to-speech conversion apparatus according to the present invention preferably comprises: a language processor that analyzes context information on the text, divides into segments the portions in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value, and lists context information for each segment; a storage unit that stores prosodic and phonological information predicted in advance on a per-segment basis; a detector that detects candidate segments for each segment in the storage unit based on the listing information transmitted from the language processor; and a synthesis processor that generates a synthesized sound corresponding to the text using the candidate segments detected by the detector.

The text-to-speech conversion apparatus preferably further comprises a prosody processor that, for a segment for which the detector has detected no candidate segment, generates the phonemes and predicts the prosody needed to generate the synthesized sound.

The information listed by the language processor preferably includes context information, comprising morpheme information of the preceding and following segments of the segment concerned, and morpheme information of the segment itself.

The text-to-speech conversion apparatus preferably further comprises a determiner that, when the detector detects a plurality of candidate segments, determines a priority for the candidate segments of the segment concerned based on context information among the candidate segments of that segment, the candidate segments of its preceding segment, and the candidate segments of its following segment, and thereby determines the optimal candidate segment.

When a plurality of optimal candidate segments are determined on the basis of the priority, the determiner preferably determines a single optimal candidate segment based on the acoustic spectrum between the optimal candidate segments of the preceding and following segments and the determined optimal candidate segments.

To achieve the above objects, a text-to-speech conversion method according to the present invention preferably comprises the steps of: when text information is input, analyzing context information, dividing into segments the portions in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value, and listing context information for each segment; searching, based on the listing information, for candidate segments for each segment in segment-related information whose phonemes and prosody have been predicted and stored in advance; and generating a synthesized sound for the text information using the retrieved candidate segments.

The method preferably further comprises the step of, when a plurality of candidate segments are retrieved in the searching step, determining a priority for the candidate segments of the segment concerned based on context information among the candidate segments of that segment, the candidate segments of its preceding segment, and the candidate segments of its following segment, and thereby determining the optimal candidate segment.

Hereinafter, preferred embodiments of the present invention will be described with reference to the accompanying drawings.

FIG. 1 is a block diagram of a text-to-speech conversion apparatus according to a preferred embodiment of the present invention. It shows text 101, a language processor 102 (natural language processing, also abbreviated NLP), a candidate segment detector 103, a candidate segment database 104 (also abbreviated DB), an optimal candidate segment determiner 105, a synthesis processor 106, and a prosody processor 107.

The text 101 is the information to be converted into speech; it may be information recorded in a paper document or information in any of the various forms used on a computer. The text 101 may also be expressed in Korean or in any other language in which discontinuous sections exist. As described above, a discontinuous section is a pause section or a section segmented by linguistic analysis, and is determined by syllable type and phonetic combination.

The language processor 102 analyzes the context information of the text 101 and divides the text into segments. A known method is used for the context analysis. A segment is a portion in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value. The predetermined value is obtained experimentally; it is a value at which the difference between synthesis units (the empty interval existing between synthesis units) is small enough that the units can be regarded as connected to form a single morpheme.

For example, when the sentence "정확한 번호를 입력하세요." ("Please enter the correct number.") shown in FIG. 2 is input as the text 101, the language processor 102 analyzes the context information and recognizes the characters "정", "확", and "한" as forming a morpheme in which discontinuity is not perceived or is at or below the predetermined value. Accordingly, "정확한" ("correct") is divided off as one segment. Likewise, "번호를" ("number", with an object particle) is divided off as one segment, and "입력하세요" ("please enter") as one segment. A segment may also be described as an eojeol (a space-delimited word phrase).

Once the segments of the text 101 have been determined by the context analysis, the language processor 102 lists information about each segment based on the context information for that segment. As shown in FIG. 2(a), the listing information for each segment can be defined using the symbols whose meanings are given in Table 1.

Table 1

Symbol  Meaning                 Symbol  Meaning                 Symbol  Meaning
ef      sentence-final ending   ii      interjection            mm      determiner (adnominal)
ep      pre-final ending        jc      particle                pp      predicate
es      connective ending       jp      predicative particle    ss      symbol
et      transformative ending   nc      substantive (nominal)   xp      prefix
ff      foreign word            md      adverb                  xs      suffix

As FIG. 2(a) also shows, the listing information for each segment includes information on the morphemes of the segment itself and on the morphemes of the preceding and following segments. If there is no preceding or following segment, that fact is also recorded in the listing information. In FIG. 2(a), $=$ means that there is no preceding segment, and $:$ means that there is no following segment. The relationships among the segments of "정확한 번호를 입력하세요" can thus be read from the information given in FIG. 2(a).
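The listing step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Segment` and `ListingEntry` types, the `text/tag+tag` notation, and the morpheme tags assigned to the example sentence are hypothetical and are not taken from FIG. 2(a); only the `$=$` and `$:$` markers and the Table 1 symbols come from the description.

```python
from dataclasses import dataclass
from typing import List, Tuple

NO_PRECEDING = "$=$"  # marker from the description: no preceding segment
NO_FOLLOWING = "$:$"  # marker from the description: no following segment


@dataclass(frozen=True)
class Segment:
    text: str
    tags: Tuple[str, ...]  # Table 1 symbols; the values below are illustrative guesses


@dataclass(frozen=True)
class ListingEntry:
    preceding: str
    current: str
    following: str


def describe(seg: Segment) -> str:
    """Render a segment as 'text/tag+tag' (a hypothetical notation)."""
    return f"{seg.text}/{'+'.join(seg.tags)}"


def build_listing(segments: List[Segment]) -> List[ListingEntry]:
    """List each segment together with its preceding and following segments."""
    entries = []
    for i, seg in enumerate(segments):
        prev = describe(segments[i - 1]) if i > 0 else NO_PRECEDING
        nxt = describe(segments[i + 1]) if i + 1 < len(segments) else NO_FOLLOWING
        entries.append(ListingEntry(prev, describe(seg), nxt))
    return entries


# Example sentence from the description; the tags are assumed for illustration.
sentence = [
    Segment("정확한", ("nc", "xs", "et")),
    Segment("번호를", ("nc", "jc")),
    Segment("입력하세요", ("nc", "xs", "ep", "ef")),
]
listing = build_listing(sentence)
for entry in listing:
    print(entry.preceding, "|", entry.current, "|", entry.following)
```

Keeping the neighbouring segments inside each entry means that a later lookup can compare context without revisiting the whole sentence.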

If the input text 101 is expressed in English, such as "There were fifteen people present." shown in FIG. 3, the context analysis divides into segments the portions in which discontinuity is not perceived or is perceived at or below the predetermined value, just as in the Korean case. Information about the resulting segments is then listed on the basis of the analyzed context information. The symbols defined in Table 1 can be used for the listing, so the listing information for this text can be defined as shown in FIG. 3(a).

Once the listing information for each segment of the input text 101 has been obtained, the language processor 102 transmits the listing information to the candidate segment detector 103 and transmits information on the corresponding segment to the prosody processor 107.

The candidate segment detector 103 detects candidate segments for each segment from the candidate segment database 104 based on the received listing information. The candidate segment database 104 stores phonological and prosodic information for segments predicted in advance, on a per-segment basis. Thus, when candidate segments for each segment are detected from the candidate segment database 104 based on the listing information, the phonological and prosodic information of each candidate segment is provided to the candidate segment detector 103. For a segment for which no candidate segment is detected at all, the candidate segment detector 103 notifies the prosody processor 107. For example, it transmits information meaning "NULL" for that segment to the prosody processor 107.
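A minimal sketch of this lookup is given below. It assumes an in-memory dictionary standing in for the candidate segment database 104 and keys it by segment text alone, which is a simplification, since the description detects candidates from the full listing information. The `CandidateSegment` fields and all stored values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class CandidateSegment:
    candidate_id: int
    phonemes: str       # phonological information (made-up values)
    duration_ms: float  # prosodic information (made-up values)
    pitch_hz: float     # prosodic information (made-up values)


# Hypothetical stand-in for the candidate segment database 104.
CANDIDATE_DB: Dict[str, List[CandidateSegment]] = {
    "There": [
        CandidateSegment(1, "DH EH R", 210.0, 118.0),
        CandidateSegment(2, "DH EH R", 180.0, 131.0),
    ],
    "were": [CandidateSegment(1, "W ER", 150.0, 110.0)],
}


def detect(segment: str,
           db: Dict[str, List[CandidateSegment]]) -> Optional[List[CandidateSegment]]:
    """Return the stored candidates, or None as the "NULL" notice."""
    candidates = db.get(segment, [])
    return list(candidates) if candidates else None


results = {seg: detect(seg, CANDIDATE_DB) for seg in ["There", "were", "fifteen"]}

# Segments reported as NULL are the ones handed to the prosody processor 107.
to_prosody_processor = [seg for seg, found in results.items() if found is None]
print(to_prosody_processor)
```

Returning `None` rather than an empty list keeps the "nothing found" case explicit for the caller.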

When a plurality of candidate segments are detected, on the other hand, the number of detected candidate segments can be limited using a predetermined reference value. In that case, a value (cost) is calculated for each candidate segment using context information based on the phonological and prosodic information of the candidate segments for the preceding and following segments and the phonological and prosodic information of the candidate segment itself.

This value corresponds to the priority of the candidate segment. To calculate it, the candidate segment detector 103 considers, for example, the eojeol information of the current segment (such as its vocabulary and eojeol tag), the eojeol information of the preceding segment (its vocabulary and tag, and its final phoneme), and the eojeol information of the following segment (its vocabulary and tag, and its initial phoneme). The tag is information such as the representative part of speech of the eojeol; that is, it is a value that takes into account both the content-word part of speech and the function-word part of speech of the eojeol.

The candidate segment detector 103 calculates the value of each candidate segment from these items. For each item, the candidate is compared with the listing information of the segment concerned and assigned a value that approaches 0 the closer the match; the sum of the values assigned for all items is taken as the value of the candidate segment. The resulting value is then compared with a preset threshold, and only candidate segments whose value is at or below the threshold are selected.
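The value calculation above can be illustrated with a short Python sketch. The description does not say how each item is scored, so the 0/1 per-item cost, the item names, the example tags, and the threshold of 2.0 are all assumptions chosen for illustration; only the overall scheme (per-item values near 0 for a close match, summed, then compared with a threshold) follows the text.

```python
from typing import Dict, List, Tuple

# Items considered per candidate, following the description: current, preceding
# and following eojeol information. The key names are hypothetical.
ITEMS = (
    "word", "tag",
    "prev_word", "prev_tag", "prev_last_phoneme",
    "next_word", "next_tag", "next_first_phoneme",
)


def item_cost(wanted: str, offered: str) -> float:
    """Assumed scoring: 0.0 for an exact match, 1.0 otherwise."""
    return 0.0 if wanted == offered else 1.0


def candidate_cost(target: Dict[str, str], candidate: Dict[str, str]) -> float:
    """Sum of the per-item costs; 0.0 means every item matches."""
    return sum(item_cost(target[k], candidate[k]) for k in ITEMS)


def select_candidates(target: Dict[str, str],
                      candidates: List[Dict[str, str]],
                      threshold: float) -> List[Tuple[float, Dict[str, str]]]:
    """Keep candidates whose cost is at or below the threshold, best first."""
    scored = [(candidate_cost(target, c), c) for c in candidates]
    kept = [pair for pair in scored if pair[0] <= threshold]
    return sorted(kept, key=lambda pair: pair[0])


# Listing information for "were" in "There were fifteen people present."
# (tags and phonemes are illustrative guesses).
target = {
    "word": "were", "tag": "pp",
    "prev_word": "There", "prev_tag": "md", "prev_last_phoneme": "R",
    "next_word": "fifteen", "next_tag": "mm", "next_first_phoneme": "F",
}
c1 = dict(target, id="c1")  # recorded in an identical context
c2 = dict(target, id="c2", next_word="people", next_first_phoneme="P")
c3 = dict(target, id="c3", prev_word="We", prev_tag="nc", prev_last_phoneme="IY",
          next_word="people", next_first_phoneme="P")

kept = select_candidates(target, [c3, c2, c1], threshold=2.0)
for cost, cand in kept:
    print(cand["id"], cost)
```

A graded per-item distance (for example, partial credit for phonemes of the same class) would fit the same structure; only `item_cost` would change.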

Through this value-based selection, suitable candidate segments are chosen from among the candidate segments retrieved from the candidate segment database 104, as shown in FIG. 2(b) or FIG. 3(b). In FIG. 2(b), first to fifth candidate eojeols were retrieved as candidate segments for "번호를", but after the selection only the third and fifth candidate eojeols remain. Similarly, in FIG. 3(b), first to third candidate eojeols were retrieved as candidate segments for "There", but after the selection the first and second candidate eojeols remain. The remaining candidate segments are transmitted to the optimal candidate segment determiner 105.

The optimal candidate segment determiner 105 determines the optimal candidate segment from among the candidate segments selected by the candidate segment detector 103. If the candidate segment detector 103 selected a single candidate segment, the determiner 105 takes it as the optimal candidate segment for the segment concerned. If a plurality of candidate segments were selected, the determiner 105 determines the optimal candidate segment based on the acoustic spectrum between the selected candidate segments of the preceding and following segments and the candidate segment concerned. The determined candidate segment is transmitted to the synthesis processor 106.
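One way to picture the spectrum-based decision is the sketch below. The description does not specify the spectral measure, so this example assumes each candidate carries a small feature vector for its start and end boundaries and uses Euclidean distance as the join cost. It also assumes the neighbouring segments have already been fixed, whereas a fuller treatment would search over the neighbours' candidates jointly. All names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def join_cost(left_end: Sequence[float], right_start: Sequence[float]) -> float:
    """Assumed spectral mismatch at a join: Euclidean distance between features."""
    return math.dist(left_end, right_start)


def pick_optimal(prev_end: Sequence[float],
                 candidates: List[Dict],
                 next_start: Sequence[float]) -> Dict:
    """Return the candidate that joins most smoothly with both neighbours."""
    if len(candidates) == 1:
        # A single selected candidate is taken as optimal without comparison.
        return candidates[0]
    return min(
        candidates,
        key=lambda c: join_cost(prev_end, c["start"]) + join_cost(c["end"], next_start),
    )


# Made-up two-dimensional boundary features.
prev_end = (1.0, 0.0)    # end of the chosen preceding segment
next_start = (0.0, 1.0)  # start of the chosen following segment
candidates = [
    {"id": "b", "start": (0.0, 0.0), "end": (1.0, 1.0)},  # total join cost 2.0
    {"id": "a", "start": (1.0, 0.0), "end": (0.0, 1.0)},  # total join cost 0.0
]

best = pick_optimal(prev_end, candidates, next_start)
print(best["id"])
```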

Meanwhile, when information indicating that no candidate segment was detected (NULL) is received from the candidate segment detector 103, the prosody processor 107 generates phonemes for the currently input segment (or the segment concerned) in the predefined synthesis units and predicts its prosody in the conventional manner, and transmits the result to the synthesis processor 106.

The synthesis processor 106 generates the synthesized sound for the input text based on the phonological and prosodic information of the candidate segments determined by the optimal candidate segment determiner 105 and the phonological and prosodic information transmitted from the prosody processor 107. The synthesized sound is generated from the input prosodic information in the same way as in conventional methods.

FIG. 4 is an operational flowchart of a text-to-speech conversion method according to a preferred embodiment of the present invention.

First, in step 401, the context and morphemes of the input text are analyzed, in the same manner as in conventional methods. Next, in step 402, the input text is divided into segments based on the results of the context and morpheme analysis, and information on each segment (or eojeol) is listed based on the same results, as described with reference to FIG. 1.

In step 403, the segments predicted in advance are searched, based on the listing information for each segment, for matching segments. If at least one segment is found in step 404, the segments found are selected in step 405 as candidate segments for the segment concerned; there may be more than one. In step 406, the prosodic and phonological information of the selected candidate segments is retrieved from the database 104. In step 407, suitable candidate segments are determined using the retrieved prosodic and phonological information. This determination can use the priority information obtained by calculating a value (cost) for each candidate segment from context information, as in the candidate segment detector 103 of FIG. 1.

In step 408, the optimal candidate segment is determined from among the suitable candidate segments determined in step 407, in the same manner as in the optimal candidate segment determiner 105 of FIG. 1. In step 409, a synthesized sound is generated using the determined optimal candidate segment. The synthesized sound is generated in the conventional manner.

If, on the other hand, no segment is found in step 404, then in step 410 phonemes (pronunciation) are generated for the segment in the predefined synthesis units in the conventional manner, and in step 411 the prosody is estimated in the conventional manner. In step 409, a synthesized sound is generated based on the estimated content.
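The control flow of steps 401 to 411 can be summarized in a short Python sketch. It is a simplification under stated assumptions: whitespace splitting stands in for the context and morpheme analysis, a dictionary stands in for the database 104, and the `choose` and `fallback` callables are hypothetical placeholders for the selection logic and the conventional rule-based path. No audio is produced; the function only returns the per-segment plan that step 409 would render.

```python
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, str, str]]  # (source, segment, chosen unit)


def plan_synthesis(text: str,
                   segmenter: Callable[[str], List[str]],
                   db: Dict[str, List[str]],
                   choose: Callable[[List[str]], str],
                   fallback: Callable[[str], str]) -> Plan:
    """Decide, segment by segment, which path supplies the unit for synthesis."""
    plan: Plan = []
    for segment in segmenter(text):          # steps 401-402: analyze and segment
        candidates = db.get(segment, [])     # step 403: search stored segments
        if candidates:                       # step 404: at least one found
            unit = choose(candidates)        # steps 405-408: select the optimal one
            plan.append(("db", segment, unit))
        else:
            unit = fallback(segment)         # steps 410-411: conventional path
            plan.append(("rule", segment, unit))
    return plan                              # step 409 would synthesize from this


# Toy stand-ins, all hypothetical.
toy_db = {"There": ["There#1", "There#2"], "were": ["were#1"]}
plan = plan_synthesis(
    "There were fifteen",
    segmenter=str.split,
    db=toy_db,
    choose=lambda cands: cands[0],
    fallback=lambda seg: f"rule:{seg}",
)
for row in plan:
    print(row)
```

Passing the selection and fallback logic in as callables keeps the branch at step 404 visible without committing to any particular scoring or prosody model.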

As described above, the present invention divides into segments, based on context information, the portions in which discontinuity between synthesis units is not perceived or is perceived only slightly, and concatenates synthesized sound in units of those segments, thereby minimizing discontinuous sections in the generated synthesized sound.

In addition, by generating the synthesized sound from per-segment prosodic and phonological information obtained for the input text using presegmentation information prepared in advance, the invention not only makes prosody generation and phoneme selection easy but also ensures clear pronunciation.

Furthermore, by generating the synthesized sound from the optimal segment determined among the candidate segments for the current segment, in consideration of the relationships between the current segment and the preceding segment and between the current segment and the following segment, the invention can provide a synthesized sound with more natural prosody across segments.

Claims

1. An apparatus for converting text into speech, comprising:

a language processor that analyzes context information on the text, divides into segments the portions in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value, and lists context information on each segment;

a storage unit that stores prosodic and phonological information predicted in advance on a per-segment basis;

a detector that detects candidate segments for each segment in the storage unit based on the listing information transmitted from the language processor;

a synthesis processor that generates a synthesized sound corresponding to the text using the candidate segments detected by the detector; and

a determiner that, when the detector detects a plurality of candidate segments, determines a priority for the candidate segments of the segment concerned based on context information among the candidate segments of that segment, the candidate segments of its preceding segment, and the candidate segments of its following segment, and thereby determines an optimal candidate segment.

2. The apparatus of claim 1, further comprising a prosody processor that, for a segment for which the detector detects no candidate segment, generates the phonemes and predicts the prosody needed to generate the synthesized sound.

3. The apparatus of claim 1 or 2, wherein the information listed by the language processor comprises context information, including morpheme information of the preceding and following segments of the segment concerned, and morpheme information of the segment itself.

4. (Deleted)

5. The apparatus of claim 1 or 2, wherein, when a plurality of optimal candidate segments are determined on the basis of the priority, the determiner determines a single optimal candidate segment based on the acoustic spectrum between the optimal candidate segments of the preceding and following segments and the determined optimal candidate segments.

6. A method of converting text into speech, comprising:

when text information is input, analyzing context information, dividing into segments the portions in which discontinuity between synthesis units is not perceived or is perceived at or below a predetermined value, and listing syntax analysis information for each segment;

searching, based on the listing information, for candidate segments for each segment in segment-related information whose phonemes and prosody have been predicted and stored in advance;

generating a synthesized sound for the text information using the retrieved candidate segments; and

when a plurality of candidate segments are retrieved in the searching step, determining a priority for the candidate segments of the segment concerned based on context information among the candidate segments of that segment, the candidate segments of its preceding segment, and the candidate segments of its following segment, and thereby determining an optimal candidate segment.

7. The method of claim 6, further comprising generating the phonemes and predicting the prosody needed to generate a synthesized sound for a segment for which no candidate segment is retrieved in the searching step.

8. (Deleted)