KR100306205B1

KR100306205B1 - Text-to-speech processing method and continuous speech recognition method using pronunciation connection graph

Info

Publication number: KR100306205B1
Application number: KR1019990046356A
Authority: KR
Inventors: 이근배; 김병창; 이진석
Original assignee: 정명식; 학교법인 포항공과대학교
Priority date: 1999-10-25
Filing date: 1999-10-25
Publication date: 2001-11-02
Also published as: KR20010038395A

Abstract

The present invention relates to a method by which speech processing systems, such as a TTS system and a continuous speech recognition system, can integrate the part of speech, base form, surface form, pronunciation, and connection information of each morpheme in a sentence and represent and process them efficiently in graph form.

In a computer-readable recording medium on which a pronunciation connection node constituting a pronunciation connection graph according to the present invention is recorded, the pronunciation connection node comprises: a morpheme part-of-speech field that stores the part of speech of one morpheme; a morpheme pronunciation field that stores the string as which the morpheme is pronounced; a probability information field that stores the probability that the morpheme is pronounced according to the string stored in the morpheme pronunciation field; a left phonological connection information field that stores the spelling and pronunciation of the first phoneme of the morpheme, and a right phonological connection information field that stores the spelling and pronunciation of the last phoneme of the morpheme; and a previous pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced before the morpheme, and a next pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced after the morpheme.

According to the present invention, representing the several candidate pronunciations of a morpheme in the form of a pronunciation connection graph saves memory in TTS systems and continuous speech recognition systems and improves overall performance.

Description

Text-to-speech processing method and continuous speech recognition method using pronunciation connection graph

The present invention relates to a method by which speech processing systems, such as a TTS (Text-to-Speech) system and a continuous speech recognition system, can integrate the part of speech, base form, surface form, pronunciation, and connection information of each morpheme in a sentence and represent and process them efficiently in graph form.

In general, speech processing systems such as TTS systems and continuous speech recognition systems are divided into several submodules according to their internal functions, and the overall system operates as the submodules exchange the information each has processed. The format used to represent the exchanged information strongly affects system performance.

The present invention was devised to meet this need. Its object is to provide a computer-readable recording medium on which is recorded a pronunciation connection node constituting a pronunciation connection graph that efficiently represents the information processed in TTS systems and continuous speech recognition systems, together with a TTS processing method and a continuous speech recognition method that use the pronunciation connection graph.

FIG. 1 shows the data structure of a node used in the pronunciation connection graph according to the present invention.

FIG. 2 shows an example of a pronunciation connection graph used in a TTS system.

FIG. 3 is a flowchart of the TTS processing procedure using pronunciation connection nodes according to the present invention.

FIG. 4 shows an example of a pronunciation connection graph used in continuous speech recognition.

FIG. 5 is a flowchart of the continuous speech recognition procedure using pronunciation connection nodes according to the present invention.

To achieve the above object, in a computer-readable recording medium according to the present invention, on which is recorded a pronunciation connection node constituting a pronunciation connection graph that represents the information processed in a TTS system for converting a text sentence into speech, the pronunciation connection node comprises: a morpheme part-of-speech field that stores the part of speech of one morpheme; a morpheme pronunciation field that stores the string as which the morpheme is pronounced; a probability information field that stores the probability that the morpheme is pronounced according to the string stored in the morpheme pronunciation field; a left phonological connection information field that stores the spelling and pronunciation of the first phoneme of the morpheme, and a right phonological connection information field that stores the spelling and pronunciation of the last phoneme of the morpheme; and a previous pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced before the morpheme, and a next pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced after the morpheme.

In this computer-readable recording medium, the pronunciation connection node may further comprise: a morpheme base-form field that stores the base form of the morpheme; and a morpheme surface-form field that stores the string as which the morpheme appears in the text sentence.

To achieve another of the above objects, in a computer-readable recording medium according to the present invention, on which is recorded a pronunciation connection node constituting a pronunciation connection graph that represents the information processed in a continuous speech recognition system for recognizing the speech signal of one sentence, the pronunciation connection node comprises, for one morpheme: a morpheme surface-form field that stores the string as which the morpheme appears in a text sentence; a morpheme pronunciation field that stores the string as which the morpheme is pronounced; a probability information field that stores the support (confidence) with which the morpheme was recognized and extracted; and a previous pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced before the morpheme, and a next pronunciation connection node field that stores pointers for accessing one or more pronunciation connection nodes holding the morphemes pronounced after the morpheme.

In this computer-readable recording medium, the pronunciation connection node may further comprise: a morpheme part-of-speech field that stores the part of speech of the morpheme; a morpheme base-form field that stores the base form of the morpheme; a start field that stores the time at which the pronunciation of the morpheme begins and an end field that stores the time at which the pronunciation of the morpheme ends; and a left phonological connection information field that stores the spelling and pronunciation of the first phoneme of the morpheme, and a right phonological connection information field that stores the spelling and pronunciation of the last phoneme of the morpheme.

To achieve yet another of the above objects, a method according to the present invention for converting a text sentence into speech using a pronunciation connection graph comprises the steps of: (a) performing morphological analysis and part-of-speech tagging on the text sentence to determine the base form, surface form, and part of speech of each morpheme; (b) extracting eonjeol (pause-unit) boundaries between the morphemes determined in step (a); (c) normalizing the morphemes by rewriting in Hangul any symbols not written in Hangul; (d) for each normalized morpheme, generating one or more pronunciation strings using a dictionary and CCV rules, and generating one pronunciation connection node for each pronunciation string to build a pronunciation connection graph; (e) for each pronunciation connection node in the pronunciation connection graph, checking the phonological connection between morphemes using the left and right phonological variations, and deleting from the pronunciation connection graph any pronunciation connection node that cannot connect; and (f) generating speech according to the optimal pronunciation connection sequence, which consists of the morpheme pronunciation fields of the pronunciation connection nodes on the path with the highest probability information in the checked pronunciation connection graph.

To achieve yet another of the above objects, a method according to the present invention for continuous speech recognition of the speech signal of one sentence using a pronunciation connection graph comprises the steps of: (a) signal-processing the speech signal of the sentence to obtain feature vectors; (b) recognizing a morpheme from the feature vectors and generating a pronunciation connection node; (c) checking the phonological connection information between the morpheme of that pronunciation connection node and the morphemes of the preceding nodes in the pronunciation connection graph and, if they can connect, storing the probability information and linking the pronunciation connection node into the pronunciation connection graph; (d) repeating steps (b) and (c) for all the feature vectors obtained in step (a); and (e) building an optimal pronunciation connection graph consisting of a predetermined number of high-probability paths, using the probability information of each pronunciation connection node in the pronunciation connection graph.

The present invention is described in detail below with reference to the accompanying drawings.

As shown in FIG. 1, the present invention represents the <part of speech>, <base form>, <surface form>, <pronunciation>, <left phonological connection information>, <right phonological connection information>, and related data of one morpheme as a single node structure, and links in graph form only those node structures that can connect at the phonological level. As shown in FIGS. 2 and 4, this lets various spoken-language processing systems represent an entire sentence.

To represent morpheme pronunciations as a graph, each pronunciation connection node has links to the nodes before and after it, namely <previous pronunciation connection node> and <next pronunciation connection node>.

The <probability information> shown in FIG. 1 is, in a TTS system, the probability that the morpheme is realized with the given pronunciation; in continuous speech recognition, it is the support (confidence) of the morpheme that was recognized and extracted.

When the pronunciation connection graph is used for continuous speech recognition, two additional fields, <start> and <end>, record the start time and end time of the pronunciation.

<Left/right phonological connection information> holds information about the phonological changes that occur when the morpheme is pronounced. The <left phonological connection information> field stores the spelling and pronunciation of the first phoneme of the morpheme, and the <right phonological connection information> field stores the spelling and pronunciation of its last phoneme. Using this information, it can be determined whether an adjacent <previous pronunciation connection node> and <next pronunciation connection node> are correctly connected.
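
The node structure described above can be sketched in Python as follows. This is only an illustrative sketch: the class name, field names, and the `link` helper are assumptions, and the patent does not prescribe any programming language or memory layout. The example morphemes (the verb stem 신 followed by the ending 고, pronounced 꼬) are also an illustration chosen here, not taken from the figures.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(eq=False)
class PronunciationNode:
    """One candidate pronunciation of one morpheme (all names are hypothetical)."""
    pos: str                      # <part of speech>
    base: str                     # <base form>
    surface: str                  # <surface form>
    pronunciation: str            # <pronunciation>
    probability: float            # <probability information>
    left: Tuple[str, str]         # (spelling, pronunciation) of the first phoneme
    right: Tuple[str, str]        # (spelling, pronunciation) of the last phoneme
    start: Optional[int] = None   # <start>, only used for speech recognition
    end: Optional[int] = None     # <end>, only used for speech recognition
    # Links to neighbouring nodes; excluded from repr to keep output short.
    prev_nodes: List["PronunciationNode"] = field(default_factory=list, repr=False)
    next_nodes: List["PronunciationNode"] = field(default_factory=list, repr=False)


def link(before: PronunciationNode, after: PronunciationNode) -> None:
    """Connect two nodes in both directions."""
    before.next_nodes.append(after)
    after.prev_nodes.append(before)


# Illustrative data: stem 신 (ends in ㄴ) and ending 고 realized as 꼬.
stem = PronunciationNode("D", "신", "신", "신", 1.0, ("ㅅ", "ㅅ"), ("ㄴ", "ㄴ"))
ending = PronunciationNode("e", "고", "고", "꼬", 0.9, ("ㄱ", "ㄲ"), ("ㅗ", "ㅗ"))
link(stem, ending)
```

Keeping both link directions mirrors the <previous pronunciation connection node> and <next pronunciation connection node> fields, so the graph can be walked from either end of the sentence.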

Table 1. Phonological connection check table

Right connection information of left morpheme    Connectable    Left connection information of right morpheme
D*[*|ㄴ]       <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
D*[*|ㅁ]       <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
H*[*|ㄴ]       <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
H*[*|ㅁ]       <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
D*[*|ㄵ:ㄴ]    <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
D*[*|ㄻ:ㅁ]    <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
H*[*|ㄵ:ㄴ]    <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
H*[*|ㄻ:ㅁ]    <==>    e*[ㄱ=ㄲ|*] e*[ㄷ=ㄸ|*] e*[ㅅ=ㅆ|*] e*[ㅈ=ㅉ|*]
D*[*|ㄴ]       <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
D*[*|ㅁ]       <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
H*[*|ㄴ]       <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
H*[*|ㅁ]       <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
D*[*|ㄵ:ㄴ]    <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
D*[*|ㄻ:ㅁ]    <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
H*[*|ㄵ:ㄴ]    <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]
H*[*|ㄻ:ㅁ]    <==>    b*[ㄱ=ㄲ|*] b*[ㄷ=ㄸ|*] b*[ㅅ=ㅆ|*] b*[ㅈ=ㅉ|*]

Using the morpheme part of speech and the left/right phonological connection information held in a pronunciation connection node, connection information can be written in the following format.

part of speech of the morpheme[left phonological connection information | right phonological connection information]

Using connection information in this format, each pronunciation connection node is checked for connectability against the phonological connection check table of Table 1, which specifies whether two pieces of phonological connection information can be connected. In the table, the part-of-speech tags D, H, e, and b denote verb, adjective, ending, and auxiliary predicate, respectively, and * indicates that the subcategory of the part of speech is not considered. 'ㄵ:ㄴ' shows that 'ㄵ' is pronounced as its representative sound 'ㄴ', and 'ㄱ=ㄲ' shows that 'ㄱ' is tensified and pronounced as 'ㄲ'. When checking the connection with the node on the left, the right phonological connection information is not used, so '*' marks it as not considered.

Reading the first row of Table 1 in this format: a verb ending in 'ㄴ' can be followed by an ending whose initial 'ㄱ', 'ㄷ', 'ㅅ', or 'ㅈ' is tensified in pronunciation. Reading the last row: the 'ㄻ' of an adjective ending in 'ㄻ' is pronounced 'ㅁ', and the adjective is followed by an auxiliary predicate whose initial 'ㄱ', 'ㄷ', 'ㅅ', or 'ㅈ' is tensified in pronunciation. By consulting this table to check whether each pair of adjacent pronunciation connection nodes can connect, the integrity of the pronunciation connection graph for the whole sentence can be verified.
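
As a rough sketch of how such a table lookup might work, the Python below encodes only the sixteen rows of Table 1 as a set of tuples and tests a pair of neighbouring morphemes against it. The names `ConnInfo`, `RULES`, and `connectable` are hypothetical, and treating the first character of the tag as the main part of speech (to honour the `*` wildcard) is an assumption. A real table would presumably contain many more rules than this excerpt, so a `False` result here only means "not covered by Table 1".

```python
from typing import NamedTuple, Tuple


class ConnInfo(NamedTuple):
    pos: str                # part-of-speech tag, e.g. "D", "H", "e", "b"
    left: Tuple[str, str]   # (spelling, pronunciation) of the first phoneme
    right: Tuple[str, str]  # (spelling, pronunciation) of the last phoneme


# Final consonants of the left morpheme: (spelling, pronunciation).
FINALS = [("ㄴ", "ㄴ"), ("ㅁ", "ㅁ"), ("ㄵ", "ㄴ"), ("ㄻ", "ㅁ")]
# Initial consonants of the right morpheme, tensified in pronunciation.
TENSIFIED = [("ㄱ", "ㄲ"), ("ㄷ", "ㄸ"), ("ㅅ", "ㅆ"), ("ㅈ", "ㅉ")]

# Each rule: (left POS, left morpheme's last phoneme,
#             right POS, right morpheme's first phoneme).
RULES = {
    (left_pos, final, right_pos, initial)
    for left_pos in ("D", "H")    # verb, adjective
    for final in FINALS
    for right_pos in ("e", "b")   # ending, auxiliary predicate
    for initial in TENSIFIED
}


def connectable(left_node: ConnInfo, right_node: ConnInfo) -> bool:
    """Check the pair against the Table 1 excerpt, ignoring POS subcategories."""
    key = (left_node.pos[0], left_node.right, right_node.pos[0], right_node.left)
    return key in RULES


stem = ConnInfo("D", ("ㅅ", "ㅅ"), ("ㄴ", "ㄴ"))       # verb stem ending in ㄴ
tense = ConnInfo("e", ("ㄱ", "ㄲ"), ("ㅗ", "ㅗ"))      # ending with ㄱ pronounced ㄲ
plain = ConnInfo("e", ("ㄱ", "ㄱ"), ("ㅗ", "ㅗ"))      # ending with ㄱ pronounced ㄱ
print(connectable(stem, tense), connectable(stem, plain))
```

Only the right-hand information of the left morpheme and the left-hand information of the right morpheme enter the lookup, which matches the `*` placeholders in the table.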

FIG. 2 shows the pronunciation connection graph used in a TTS system to generate the pronunciation of the text sentence '몇 개가 있습니까' ("How many are there?"). The pronunciation of the text sentence is determined from this graph and can then be used when the TTS system generates speech. Each pronunciation connection node holds candidate information for one morpheme, such as its part of speech, base form, surface form, pronunciation, and left and right phonological connection information, extracted from the results of morphological analysis and part-of-speech tagging of the text sentence. Morphemes that are linked in the pronunciation connection graph are also adjacent in the text sentence. Therefore, concatenating the pronunciations from the leftmost node of the graph to the rightmost node yields the pronunciation of the entire sentence.

FIG. 3 shows the TTS processing procedure that uses the pronunciation connection graph. Once a text sentence has undergone morphological analysis and part-of-speech tagging, the base form, surface form, and part of speech of each morpheme in the sentence are determined.

From the morphemes determined in this way, eonjeol boundaries are extracted (step 300) using probability information (30) that holds the probability of an eonjeol boundary occurring between one part of speech and the next; an eonjeol is the unit at which a speaker pauses. The morphemes then undergo normalization, in which symbols not written in Hangul, such as digits and Latin letters, are rewritten in Hangul (step 310).

Each morpheme, now written in Hangul, is converted to its pronunciation using a dictionary (31) and CCV (consonant + consonant + vowel) rules (32), and the result is represented as a pronunciation connection graph (step 320). The converted pronunciations are candidates for the pronunciation that will be turned into speech, so a single morpheme can yield several pronunciation candidates. Representing these multiple candidates as a list consumes a large amount of memory and makes processing considerably more complex. Representing the candidate pronunciations of each morpheme in the pronunciation connection graph proposed by the present invention saves memory in the TTS system and improves overall performance.
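
The memory argument can be made concrete with a small count. The sketch below assumes that the list form stores every whole-sentence combination of candidates, while the graph form stores one node per candidate; that reading of "list form" is an assumption, and the candidate counts are made-up numbers used only for illustration.

```python
from math import prod
from typing import Sequence


def list_entries(candidates_per_morpheme: Sequence[int]) -> int:
    """Whole-sentence pronunciations a flat list would have to enumerate."""
    return prod(candidates_per_morpheme)


def graph_nodes(candidates_per_morpheme: Sequence[int]) -> int:
    """Nodes the graph needs: one per candidate pronunciation."""
    return sum(candidates_per_morpheme)


# Hypothetical sentence with four morphemes having 2, 1, 3 and 2 candidates.
counts = [2, 1, 3, 2]
print(list_entries(counts), graph_nodes(counts))  # 12 8
```

Under this assumption the list grows multiplicatively with the number of ambiguous morphemes, whereas the graph grows additively.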

For each pronunciation connection node in the pronunciation connection graph, the phonological connection between morphemes is checked using the phonological connection check table (33), an example of which is given in Table 1 (step 330). Any pronunciation connection node found by this check to be unable to connect is deleted from the pronunciation connection graph. Speech is then generated according to the optimal pronunciation connection sequence, which consists of the morpheme pronunciation fields of the pronunciation connection nodes on the path with the highest probability information in the checked graph.
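
The pruning and best-path selection just described might be sketched as follows. Several things here are assumptions rather than the patent's method: the graph is stored as plain dictionaries, the score of a path is taken to be the product of its node probabilities, pruning removes links rather than node records (a node left without valid links simply never appears on a path), and the probabilities are invented for the example.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

Nodes = Dict[str, Tuple[str, float]]   # node id -> (pronunciation, probability)
Edges = Dict[str, List[str]]           # node id -> ids of following nodes


def prune(edges: Edges, ok: Callable[[str, str], bool]) -> Edges:
    """Drop every link whose two nodes fail the phonological connection check."""
    return {a: [b for b in succ if ok(a, b)] for a, succ in edges.items()}


def best_path(nodes: Nodes, edges: Edges,
              starts: Iterable[str], ends: Set[str]) -> Tuple[float, str]:
    """Return (score, pronunciation) of the highest-scoring start-to-end path."""
    memo: Dict[str, Tuple[float, List[str]]] = {}

    def best_from(n: str) -> Tuple[float, List[str]]:
        if n in memo:
            return memo[n]
        _, p = nodes[n]
        # A path may stop here only if n is a sentence-final node.
        best = (p, [n]) if n in ends else (0.0, [])
        for m in edges.get(n, []):
            score, path = best_from(m)
            if path and p * score > best[0]:
                best = (p * score, [n] + path)
        memo[n] = best
        return best

    score, path = max((best_from(s) for s in starts), key=lambda r: r[0])
    return score, "".join(nodes[n][0] for n in path)


# Toy graph: stem 신 followed by the ending realized as 고 or as 꼬.
nodes = {"s": ("신", 1.0), "g": ("고", 0.7), "kk": ("꼬", 0.3)}
edges = {"s": ["g", "kk"]}
pruned = prune(edges, lambda a, b: (a, b) != ("s", "g"))  # 신+고 fails the check
print(best_path(nodes, pruned, ["s"], {"g", "kk"}))
```

In this toy case the untensified candidate has the higher probability, so it is the connection check, not the probability, that steers the result to 신꼬.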

FIG. 4 shows a pronunciation connection graph built while recognizing a continuously spoken sentence. In continuous speech recognition, morphemes are recognized without knowing their exact temporal boundaries within the spoken sentence, so the same morpheme may be recognized at different time offsets. For example, the morpheme '줄여' shown in FIG. 4 may be recognized as two morphemes with different time spans (0-5 and 1-6), and each becomes a morpheme candidate for recognizing the whole sentence. Even though the whole spoken sentence was '줄여야 한다' ("must reduce"), the performance of current morpheme-level speech recognizers is not satisfactory, so misrecognized morphemes such as '주리' or '야한' appear in the pronunciation connection graph. These misrecognized morphemes are first filtered out by the phonological connection check, and after the whole pronunciation connection graph has been built, a complete sentence can be recognized by means such as a search over the probability values.

FIG. 5 shows the process of recognizing continuous speech using the pronunciation connection graph. Feature vectors are obtained by signal-processing the spoken sentence, and from these feature vectors a morpheme is recognized using a dictionary (50) and a Hidden Markov Model Trie (HMM Trie) (51), producing one node of the pronunciation connection graph (step 500).

Each time a morpheme is recognized, the phonological connection check table (52) is used to check the phonological connection information between the new morpheme and the previously obtained morphemes and to decide whether they can connect (step 510). The previously obtained morphemes are already represented as a pronunciation connection graph, and if the new morpheme can connect to a previous morpheme, it is added to the graph together with its probability value, start time, and end time (step 520). When the last feature vector of the spoken sentence has been processed in this way (step 530), an optimal pronunciation connection graph is built using the probability values of the whole pronunciation connection graph (step 540), and the desired N-best sentences can be obtained from this optimal graph.
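
A simplified sketch of this incremental construction and N-best extraction is given below. It rests on assumptions not stated in the text: a new hypothesis is only compared with nodes whose end time equals its start time, the path score is the product of node scores, N-best paths are found by exhaustive enumeration, and the segments, times, and scores are invented. `Hyp`, `Graph`, and the `connectable` callback are hypothetical names.

```python
import heapq
from typing import Callable, Dict, Iterator, List, NamedTuple, Tuple


class Hyp(NamedTuple):
    surface: str   # recognized morpheme
    start: int     # <start> time
    end: int       # <end> time
    score: float   # <probability information> (support)


class Graph:
    def __init__(self, connectable: Callable[[Hyp, Hyp], bool]) -> None:
        self.nodes: List[Hyp] = []
        self.prev: Dict[int, List[int]] = {}   # node index -> predecessor indexes
        self.connectable = connectable

    def add(self, hyp: Hyp) -> bool:
        """Add hyp if it starts the sentence or connects to an earlier node."""
        preds = [i for i, n in enumerate(self.nodes)
                 if n.end == hyp.start and self.connectable(n, hyp)]
        if hyp.start != 0 and not preds:
            return False                       # filtered out by the check
        self.nodes.append(hyp)
        self.prev[len(self.nodes) - 1] = preds
        return True

    def n_best(self, n: int, final_time: int) -> List[Tuple[float, List[str]]]:
        """Return the n highest-scoring complete paths ending at final_time."""
        def paths_to(i: int) -> Iterator[Tuple[float, List[str]]]:
            node = self.nodes[i]
            if node.start == 0:
                yield node.score, [node.surface]
            for j in self.prev[i]:
                for score, words in paths_to(j):
                    yield score * node.score, words + [node.surface]

        finals = [i for i, nd in enumerate(self.nodes) if nd.end == final_time]
        all_paths = [p for i in finals for p in paths_to(i)]
        return heapq.nlargest(n, all_paths, key=lambda p: p[0])


# Invented check: the misrecognized 주리 cannot connect to what follows it.
bad = {("주리", "야"), ("주리", "야한")}
graph = Graph(lambda a, b: (a.surface, b.surface) not in bad)
for h in [Hyp("줄여", 0, 5, 0.8), Hyp("주리", 0, 5, 0.3), Hyp("야", 5, 7, 0.7),
          Hyp("야한", 5, 9, 0.2), Hyp("한다", 7, 12, 0.9), Hyp("다", 9, 12, 0.5)]:
    graph.add(h)
for score, words in graph.n_best(2, 12):
    print(round(score, 3), "".join(words))
```

Exhaustive enumeration is used only to keep the sketch short; a practical recognizer would more likely use a beam or dynamic-programming search over the same graph.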

The embodiments of the present invention described above can be written as programs executable on a computer, and can be implemented on a general-purpose digital computer that runs such a program from a computer-usable medium. Such media include magnetic storage media (for example, ROM, floppy disks, and hard disks), optically readable media (for example, CD-ROM and DVD), and carrier waves (for example, transmission over the Internet).

Claims

A computer-readable recording medium on which is recorded a pronunciation connection node constituting a pronunciation connection graph that represents information processed in a TTS system for converting a text sentence into speech, the pronunciation connection node comprising:

a morpheme part-of-speech field for storing the part of speech of one morpheme;

a morpheme pronunciation field for storing the string as which the morpheme is pronounced;

a probability information field for storing the probability that the morpheme is pronounced according to the string stored in the morpheme pronunciation field;

a left phonological connection information field for storing the spelling and pronunciation of the first phoneme of the morpheme, and a right phonological connection information field for storing the spelling and pronunciation of the last phoneme of the morpheme; and

a previous pronunciation connection node field storing pointers for accessing one or more pronunciation connection nodes that hold the morphemes pronounced before the morpheme, and a next pronunciation connection node field storing pointers for accessing one or more pronunciation connection nodes that hold the morphemes pronounced after the morpheme.

The recording medium of claim 1, wherein the pronunciation connection node further comprises:

a morpheme base-form field for storing the base form of the morpheme; and

a morpheme surface-form field for storing the string as which the morpheme appears in the text sentence.

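A computer-readable recording medium on which is recorded a pronunciation connection node constituting a pronunciation connection graph that represents information processed in a continuous speech recognition system for recognizing the speech signal of one sentence, the pronunciation connection node comprising:

a morpheme surface-form field for storing, for one morpheme, the string as which the morpheme appears in a text sentence;

a morpheme pronunciation field for storing the string as which the morpheme is pronounced;

a probability information field for storing the support with which the morpheme was recognized and extracted; and

a previous pronunciation connection node field storing pointers for accessing one or more pronunciation connection nodes that hold the morphemes pronounced before the morpheme, and a next pronunciation connection node field storing pointers for accessing one or more pronunciation connection nodes that hold the morphemes pronounced after the morpheme.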

The recording medium of claim 3, wherein the pronunciation connection node further comprises:

a morpheme part-of-speech field for storing the part of speech of the morpheme;

a morpheme base-form field for storing the base form of the morpheme;

a start field for storing the time at which the pronunciation of the morpheme begins and an end field for storing the time at which the pronunciation of the morpheme ends; and

a left phonological connection information field for storing the spelling and pronunciation of the first phoneme of the morpheme, and a right phonological connection information field for storing the spelling and pronunciation of the last phoneme of the morpheme.

A TTS processing method for converting a text sentence into speech using a pronunciation connection graph, the method comprising:

(a) performing morphological analysis and part-of-speech tagging on the text sentence to determine the base form, surface form, and part of speech of each morpheme;

(b) extracting eonjeol (pause-unit) boundaries between the morphemes determined in step (a);

(c) normalizing the morphemes by rewriting in Hangul any symbols not written in Hangul;

(d) for each normalized morpheme, generating one or more pronunciation strings using a dictionary and CCV rules, and generating one pronunciation connection node for each pronunciation string to build a pronunciation connection graph;

(e) for each pronunciation connection node in the pronunciation connection graph, checking the phonological connection between morphemes using the left and right phonological variations, and deleting from the pronunciation connection graph any pronunciation connection node that cannot connect; and

(f) generating speech according to an optimal pronunciation connection sequence consisting of the morpheme pronunciation fields of the pronunciation connection nodes on the path with the highest probability information in the checked pronunciation connection graph.

A continuous speech recognition method for recognizing the speech signal of one sentence using a pronunciation connection graph, the method comprising:

(a) signal-processing the speech signal of the sentence to obtain feature vectors;

(b) recognizing a morpheme from the feature vectors and generating a pronunciation connection node;

(c) checking the phonological connection information between the morpheme of the pronunciation connection node and the morphemes of the preceding nodes in the pronunciation connection graph and, if they can connect, storing the probability information and linking the pronunciation connection node into the pronunciation connection graph;

(d) repeating steps (b) and (c) for all the feature vectors obtained in step (a); and

(e) building an optimal pronunciation connection graph consisting of a predetermined number of high-probability paths, using the probability information of each pronunciation connection node in the pronunciation connection graph.