KR19980047177A

KR19980047177A - Korean document analyzer for voice conversion system

Info

Publication number: KR19980047177A
Application number: KR1019960065620A
Authority: KR
Inventors: 오영환; 이상호
Original assignee: 윤덕용; 한국과학기술원
Priority date: 1996-12-14
Filing date: 1996-12-14
Publication date: 1998-09-15
Also published as: KR100202292B1

Abstract

A document analyzer for a text-to-speech system, which converts a given input document into speech, introduces statistical language processing techniques to supply more accurate phonetic transcriptions and syntactic structure to the prosody generation module that determines the naturalness of the synthesized speech. The analyzer comprises: preprocessing means that extracts sentences one at a time from the input document and converts non-Hangul characters into Hangul using a nondeterministic finite automaton; morphological analysis means that obtains all possible morphological analyses for each eojeol (space-delimited word unit) of each sentence supplied by the preprocessing means; a part-of-speech tagger that selects the most probable morphological analysis on the basis of probability information and extracts the optimal morphological analysis sequence; phonetic transcription conversion means that derives the phonetic transcription of each eojeol from the morphological analysis sequence supplied by the tagger; and syntactic analysis means that uses a probabilistic dependency grammar to find the head-dependent relations among the eojeols and outputs, as the final result, the phonetic transcription of each eojeol, the morpheme part-of-speech sequences, and the dependency tree for the input document. The part-of-speech tagger achieves an accuracy of 85.11% and the parser 78.68%; in particular, the tagger reaches 96.46% when the document being processed contains no unregistered words.

Description

Korean document analyzer for voice conversion system

The present invention relates to a text-to-speech system that converts a given input document into speech, and more particularly to a document analyzer for a Korean text-to-speech system that introduces statistical language processing techniques to supply more accurate phonetic transcriptions and syntactic structure to the prosody generation module, which determines the naturalness of the synthesized speech.

In general, a text-to-speech system converts a given input document into speech and has many applications, such as reading machines for the blind.

Such a system consists broadly of a document analyzer and prosody generation and signal synthesis modules. The document analyzer is used to analyze the parts of speech of words and the syntactic structure of sentences, because the pronunciation of words and the prosody of a sentence are closely related to them.

Existing Korean document analyzers extract only a single result at the morphological analysis stage, using methods such as longest matching, and carry out pronunciation conversion, syntactic analysis, and so on from that single result.

In addition, a phonetic transcription conversion module based on the two-level model has recently been developed, and much research is under way to extract more accurate information.

However, such a text-to-speech system analyzes the part-of-speech information of the input document using parts of speech fixed by the programmer. When the same syllables can carry different parts of speech, a wrong part-of-speech decision causes an incorrect phonetic transcription, which lowers the reliability of text-to-speech conversion.

The present invention has been devised in view of this problem. Its object is to provide reliable text-to-speech conversion by using statistical language processing techniques to determine the phonetic transcription of each eojeol (space-delimited word unit), the morpheme part-of-speech sequence, and the dependency tree, thereby supplying more accurate phonetic transcriptions and syntactic structure to the prosody generation module that determines the naturalness of the synthesized speech.

To achieve this object, the present invention is characterized by comprising: preprocessing means that extracts sentences one at a time from a document input for speech conversion and converts non-Hangul characters into Hangul using a nondeterministic finite automaton; morphological analysis means that obtains all possible morphological analyses for each eojeol of each sentence supplied by the preprocessing means; a part-of-speech tagger that, on the basis of probability information, selects the most probable morphological analysis of the analyzed input sentence and extracts the optimal morphological analysis sequence; phonetic transcription conversion means that derives the phonetic transcription of each eojeol from the morphological analysis sequence supplied by the part-of-speech tagger; and syntactic analysis means that uses a probabilistic dependency grammar to find the head-dependent relations among the eojeols and outputs, as the final result, the phonetic transcriptions, the morpheme part-of-speech sequences, and the dependency tree for the input document.

Fig. 1 is a block diagram of the document analyzer for a Korean text-to-speech system according to the present invention.

Fig. 2 is a diagram of the automaton used for preprocessing in the present invention.

Fig. 3 is a block diagram of the morphological analyzer of Fig. 1.

Fig. 4 shows the morpheme lattice for the eojeol '나는'.

Fig. 5 shows the lattice construction algorithm of the present invention.

Fig. 6 shows the morphological analysis lattice for '신을 신고 신고하기' ("putting on shoes and reporting").

Fig. 7 shows the dependency tree for '신을 신고 신고한다.'.

Fig. 8 shows the phrase structure tree for '신을 신고 신고한다.'.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings.

As shown in Fig. 1, in the document analyzer for a Korean text-to-speech system according to the present invention, the preprocessing module (10) uses a nondeterministic finite automaton to convert the non-Hangul characters of the input document into Hangul and supplies the result to the morphological analyzer (20) one sentence at a time.

The morphological analyzer (20) obtains all possible morphological analyses. When analysis fails, it concludes that the eojeol being analyzed contains an unregistered word, generates unregistered-word candidates, and then uses linguistic heuristics to reduce the number of candidates.

The part-of-speech tagger (30) extracts the optimal morphological analysis sequence for the input sentence supplied by the morphological analyzer, on the basis of probability information obtained from a corpus.

The phonetic transcription conversion module (40) derives the phonetic transcription of each eojeol from the morphological analysis sequence supplied by the part-of-speech tagger (30).

The parser (50) uses a probabilistic dependency grammar to find the head-dependent relations among the eojeols and outputs, as the final result, the phonetic transcription of each eojeol, the morpheme part-of-speech sequences, and the dependency tree for the input document.

In addition, to perform disambiguation, the part-of-speech tagger (30) and the parser (50) are provided with probability information for part-of-speech tagging (70) and probability information for syntactic analysis (80), respectively.

The speech conversion of an input Korean document by the present invention, configured as described above, proceeds as follows.

When a document to be converted to speech is input to the preprocessing module (10), a nondeterministic finite automaton with 48 states in total is used. Twelve of these are final states, each implemented as a Hangul conversion module for a different kind of non-Hangul text. The preprocessing module (10) extracts sentences one at a time from the input document and at the same time converts all non-Hangul characters in each sentence into Hangul.

Non-Hangul characters are classified as English words, English abbreviations, numbers, telephone numbers, years, times, and so on, and are converted into Hangul taking into account the Hangul characters that follow them.

For example, when the text 'Mr. Lee의 전화번호는' ("Mr. Lee's telephone number is") is input, then, as the preprocessing automaton of Fig. 2 shows, 'Mr.' leads to state 4. If 'Mr.' is registered in the dictionary it is converted to '미스터' ("mister"); otherwise it is spelled out as '엠 알' (the letter names of M and R).

Here 'Mr.' is an English abbreviation registered in the dictionary, so it is converted to '미스터'.

Next, for 'Lee의', 'Lee' is converted to '리' in state 5, and the '의' that has just been read is fed back into the input and processed as Hangul in state 2.

This process yields the final result '미스터 리의 전화번호는'.
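The dictionary lookup with a spelled-out fallback can be sketched as follows. This is a minimal Python sketch, not the 48-state automaton of Fig. 2: a plain token loop stands in for the automaton, and the dictionaries `ABBREVIATIONS`, `ENGLISH_WORDS`, and `LETTER_NAMES`, as well as the function names, are hypothetical and hold only the entries needed for the example.

```python
import re

# Hypothetical mini-dictionaries; the patent does not list their contents.
ABBREVIATIONS = {"Mr.": "미스터"}
ENGLISH_WORDS = {"Lee": "리"}
LETTER_NAMES = {"M": "엠", "R": "알"}

# A token is either a run of Latin letters (with an optional trailing period)
# or a run of anything else (Hangul, spaces, punctuation).
TOKEN = re.compile(r"[A-Za-z]+\.?|[^A-Za-z]+")


def spell_out(token):
    """Fallback for unregistered tokens: read the Latin letters one by one."""
    return " ".join(LETTER_NAMES.get(ch.upper(), ch) for ch in token if ch.isalpha())


def to_hangul(text, abbreviations=ABBREVIATIONS):
    out = []
    for tok in TOKEN.findall(text):
        if not ("A" <= tok[0] <= "Z" or "a" <= tok[0] <= "z"):
            out.append(tok)                  # already Hangul or spacing
        elif tok in abbreviations:
            out.append(abbreviations[tok])   # registered abbreviation
        elif tok in ENGLISH_WORDS:
            out.append(ENGLISH_WORDS[tok])   # registered English word
        else:
            out.append(spell_out(tok))       # unregistered: spell it out
    return "".join(out)


print(to_hangul("Mr. Lee의 전화번호는"))      # 미스터 리의 전화번호는
print(to_hangul("Mr.", abbreviations={}))   # 엠 알
```

The second call shows the fallback branch: with an empty abbreviation dictionary, 'Mr.' is read letter by letter.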

Next, the morphological analyzer (20) searches each eojeol of the preprocessed sentence from right to left for morphemes and builds a morpheme lattice. As Fig. 3 shows, for an input eojeol (21) the position estimation module (22) first computes the positions where irregular conjugation and contraction may occur, and morphological analysis is then carried out by the algorithm in the lattice construction module (23).

The result of morphological analysis is represented as a morpheme lattice. For example, the final analysis of '나는' is shown in Fig. 4, where every possible path from 'INI' to 'FIN' represents a different analysis.

To represent a morpheme lattice such as that of Fig. 4, the set L shown in [Table 1] is defined.

An element l = (k, w, t, I) of the set L represents one node of the morpheme lattice together with the edges attached to its right side. The set L corresponding to Fig. 4 is given in [Table 2].

[Table 1]

k : index of the element l
w : morpheme
t : part of speech
I : set of indices of the elements that l points to

[Table 2]

L = {(0, FIN, NULL, {}),
(1, ㄴ, transformative ending, {0}),
(2, 는, transformative ending, {0}),
(3, 는, topic particle, {0}),
(4, 날, verb, {2}),
(5, 나, auxiliary predicate, {2}),
(6, 나, verb, {2}),
(7, 나, personal pronoun, {3}),
(8, INI, NULL, {7, 6, 5, 4})}
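To make the reading of the set L concrete, the following Python sketch stores the lattice of Table 2 as a dictionary and enumerates every path from 'INI' to 'FIN'. The dictionary layout and the function name `analyses` are illustrative assumptions; only the node contents come from the table.

```python
# The set L of Table 2: index -> (morpheme, part of speech, indices pointed to).
LATTICE = {
    0: ("FIN", None, []),
    1: ("ㄴ", "transformative ending", [0]),
    2: ("는", "transformative ending", [0]),
    3: ("는", "topic particle", [0]),
    4: ("날", "verb", [2]),
    5: ("나", "auxiliary predicate", [2]),
    6: ("나", "verb", [2]),
    7: ("나", "personal pronoun", [3]),
    8: ("INI", None, [7, 6, 5, 4]),
}


def analyses(lattice, start=8):
    """Return every INI-to-FIN path as a list of (morpheme, POS) pairs."""
    def walk(k):
        morpheme, pos, successors = lattice[k]
        if morpheme == "FIN":
            yield []
            return
        here = [] if morpheme == "INI" else [(morpheme, pos)]
        for nxt in successors:
            for rest in walk(nxt):
                yield here + rest
    return list(walk(start))


for path in analyses(LATTICE):
    print(" + ".join(f"{w}/{t}" for w, t in path))
```

The lattice yields four analyses of '나는'. Node 1 ('ㄴ') is present in L but is not reachable from 'INI', so it contributes to no analysis.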

The lattice construction algorithm held in the lattice construction module (23) first turns the eojeol into a sequence of graphemes and then, for the convenience of the algorithm, inserts between the graphemes numbers that start at 0 and increase by a fixed step.

For example, with a step of 6, '나는' becomes '0 ㄴ 6 ㅏ 12 ㄴ 18 ㅡ 24 ㄴ 30'.
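The numbered grapheme sequence can be reproduced with the standard Unicode arithmetic for precomposed Hangul syllables. The Python sketch below is illustrative only: the function names are hypothetical, and treating a compound final consonant such as 'ㄳ' as a single grapheme is an assumption, since the text does not say how such finals are split.

```python
# Jamo tables in Unicode order for precomposed Hangul syllables (U+AC00..U+D7A3).
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
JONG = ["", "ㄱ", "ㄲ", "ㄳ", "ㄴ", "ㄵ", "ㄶ", "ㄷ", "ㄹ", "ㄺ", "ㄻ", "ㄼ", "ㄽ", "ㄾ",
        "ㄿ", "ㅀ", "ㅁ", "ㅂ", "ㅄ", "ㅅ", "ㅆ", "ㅇ", "ㅈ", "ㅊ", "ㅋ", "ㅌ", "ㅍ", "ㅎ"]


def graphemes(eojeol):
    """Split each Hangul syllable into initial, medial and (optional) final."""
    out = []
    for ch in eojeol:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172:
            raise ValueError(f"not a Hangul syllable: {ch!r}")
        out.append(CHO[code // 588])
        out.append(JUNG[(code % 588) // 28])
        if code % 28:
            out.append(JONG[code % 28])
    return out


def numbered(eojeol, step=6):
    """Interleave position numbers (0, step, 2*step, ...) with the graphemes."""
    parts = ["0"]
    for n, g in enumerate(graphemes(eojeol), start=1):
        parts += [g, str(n * step)]
    return " ".join(parts)


print(numbered("나는"))  # 0 ㄴ 6 ㅏ 12 ㄴ 18 ㅡ 24 ㄴ 30
```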

After the eojeol has been expressed as a grapheme sequence, the set L is obtained with the algorithm of Fig. 5. Here w_{i,j} denotes the grapheme string between numbers i and j, and the function LookupDict finds in the dictionary the possible parts of speech of the input grapheme string w_{i,j}.

The function searchL returns the set of indices of the elements of L to which (w_{i,j}, t) can connect; it relies on a pre-built part-of-speech connectivity table.

The set J used in the algorithm collects the left-hand numbers of the nodes, in the lattice built so far, from which a path to the 'FIN' node exists. It is used to avoid unnecessary dictionary lookups.

Irregular forms are handled on the basis of the irregular-position information computed beforehand. In the case of '나는', 'ㄹ' deletion may have occurred at positions 12 and 24, so when j is 12 or 24 a 'ㄹ' is appended, and the corresponding grapheme strings are looked up in the dictionary while a new variable h is decreased from j to 0.

Thus (4, 날, verb, {2}) in the set L was added when j was 12.
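Fig. 5 is not reproduced in this text, so the following Python sketch is only one possible reading of the right-to-left construction described above. The toy dictionary `DICT`, the connectivity table `CONNECT`, the use of a "FIN" tag for the final node, and grapheme indices in place of the step-6 numbers (so positions 12 and 24 become 2 and 4) are all assumptions made for illustration.

```python
# Toy dictionary: grapheme string -> possible parts of speech (hypothetical entries).
DICT = {
    "ㄴ": ["transformative ending"],
    "ㄴㅡㄴ": ["transformative ending", "topic particle"],
    "ㄴㅏ": ["personal pronoun", "verb", "auxiliary predicate"],
    "ㄴㅏㄹ": ["verb"],
}
# Toy connectivity table: (left POS, right POS) pairs that may be adjacent.
CONNECT = {
    ("personal pronoun", "topic particle"),
    ("verb", "transformative ending"),
    ("auxiliary predicate", "transformative ending"),
    ("transformative ending", "FIN"),
    ("topic particle", "FIN"),
}


def lookup_dict(w):
    return DICT.get(w, [])


def search_l(L, left_of, j, t):
    """Indices of elements that start at position j and may follow POS t."""
    return {k for (k, _, t2, _) in L if left_of[k] == j and (t, t2) in CONNECT}


def build_lattice(graphemes, l_drop_positions=()):
    n = len(graphemes)
    L = [(0, "FIN", "FIN", set())]
    left_of = {0: n}          # left-hand position of each element
    J = {n}                   # positions from which FIN is reachable
    for j in range(n, 0, -1):
        if j not in J:
            continue          # skip lookups that cannot reach FIN
        for i in range(j - 1, -1, -1):
            base = "".join(graphemes[i:j])
            candidates = [base]
            if j in l_drop_positions:
                candidates.append(base + "ㄹ")   # restore a possibly deleted ㄹ
            for w in candidates:
                for t in lookup_dict(w):
                    targets = search_l(L, left_of, j, t)
                    if targets:
                        k = len(L)
                        L.append((k, w, t, targets))
                        left_of[k] = i
                        J.add(i)
    ini_targets = {k for k, pos in left_of.items() if pos == 0}
    L.append((len(L), "INI", None, ini_targets))
    return L


for element in build_lattice(list("ㄴㅏㄴㅡㄴ"), l_drop_positions={2, 4}):
    print(element)
```

With these toy tables the sketch produces nine elements, the same nodes and edges as Table 2, although elements 4 to 7 come out in a different order and morphemes are kept as grapheme strings (for example 'ㄴㅏㄹ' for '날').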

When the eojeol being analyzed contains an unregistered word, all possible unregistered-word candidates are generated from the incomplete morpheme lattice to complete it.

Assuming that all function words, such as particles and endings, are registered in the dictionary, only content words such as nominals and predicates can be unregistered, so an unregistered word always appears in the left part of an eojeol.

Guessing an unregistered word therefore amounts to assigning to the grapheme string remaining on the left every part of speech that can connect to the part of speech of the node on its right. In this way a morpheme lattice with estimated unregistered words is produced.

However, such a lattice contains too many possible analyses, which can cause many errors in the next stage, part-of-speech tagging. To prevent this, the number of candidates is reduced using syllable information and clue morphemes.

The syllable-information method excludes a candidate when the last syllable of the unregistered word is never used as the last syllable of a word with the estimated part of speech.

For example, only six Korean predicates in total end in the syllable '느'. Using this fact, a node can be excluded from the lattice when the estimated part of speech of the unregistered word is a predicate and its last syllable is '느'.

The clue-morpheme method applies when the incomplete lattice contains a morpheme that is very frequent and is considered sufficient to determine the composition of the eojeol. In that case only the unregistered-word candidate immediately in front of that morpheme is kept, and the rest are removed from the lattice.

For example, if '우회' is an unregistered word in the eojeol '우회시켜라', the candidates include '우회/action common noun', '우회시키/verb', and '우회시키/adjective'. Because '시키' is a clue morpheme, only '우회/action common noun' is kept and the others are removed.

With these methods, the number of morphological analyses per eojeol, for eojeols containing an unregistered word, is reduced from 22.01 to 10.83.
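The two pruning heuristics can be sketched as a pair of filters. This Python sketch is a simplification under stated assumptions: each candidate is given as a (word, part of speech, following morpheme) triple, the tables hold only the two facts named in the text ('느' for predicates and the clue morpheme '시키'), and the candidate '가느' is made up to exercise the syllable filter.

```python
# Only '느' and '시키' come from the text; the table layout is hypothetical.
RARE_FINAL_SYLLABLES = {"verb": {"느"}, "adjective": {"느"}}
CLUE_MORPHEMES = {"시키"}


def prune(candidates):
    """candidates: list of (word, pos, following_morpheme) triples."""
    # Syllable information: drop a candidate whose last syllable is
    # (almost) never the last syllable of a word with that POS.
    kept = [c for c in candidates
            if c[0][-1] not in RARE_FINAL_SYLLABLES.get(c[1], set())]
    # Clue morphemes: if some candidate is directly followed by a clue
    # morpheme, keep only such candidates.
    clued = [c for c in kept if c[2] in CLUE_MORPHEMES]
    return clued or kept


candidates = [
    ("우회", "action common noun", "시키"),
    ("우회시키", "verb", "어라"),
    ("우회시키", "adjective", "어라"),
]
print(prune(candidates))
```

For '우회시켜라' only the action-common-noun candidate survives, as in the example above.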

When morphological analysis of the input sentence is complete, the part-of-speech tagger (30) performs probability-based part-of-speech tagging to find the best morphological analysis among the candidates for each eojeol, with particular emphasis on how unregistered words are handled.

Probability-based part-of-speech tagging is the problem of finding the optimal morphological analysis for a sentence of n eojeols, that is, for the eojeol sequence w_1 w_2 w_3 … w_n, written w_{1..n}. Representing the morphological analysis of the i-th eojeol as a pair of a morpheme sequence m_i and a part-of-speech sequence t_i, the tagging function ψ(w_{1..n}) is defined as follows.

[Equation 1]

Applying Bayes' rule and a first-order Markov assumption to Equation 1 gives the following expression.

[Equation 2]

Equation 2 thus means finding the morphological analysis sequence that gives the largest value when, for every eojeol, the probability of transitioning from the part-of-speech sequence of the previous eojeol to that of the current eojeol is multiplied by the probability that the morpheme sequence is generated from the eojeol's part-of-speech sequence.
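The equation images did not survive in this text. Judging from the description above (for every eojeol, a transition probability between part-of-speech sequences multiplied by the probability of the morpheme sequence given the part-of-speech sequence), Equations 1 and 2 presumably have the following standard form. The exact notation, including an initial symbol $t_0$ for the sentence start, is an assumption.

```latex
\begin{align}
\psi(w_{1..n}) &= \operatorname*{arg\,max}_{m_{1..n},\, t_{1..n}}
                  P(m_{1..n}, t_{1..n} \mid w_{1..n}) \tag{1} \\
               &\approx \operatorname*{arg\,max}_{m_{1..n},\, t_{1..n}}
                  \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(m_i \mid t_i) \tag{2}
\end{align}
```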

This can be viewed as treating each morphological analysis as a node, placing edges between the analyses of adjacent eojeols, assigning P(m_i | t_i) to the nodes and P(t_i | t_{i-1}) to the edges, and taking the path with the highest probability.

For example, the eojeol sequence '신을 신고 신고하기' can be represented as in Fig. 6, and this kind of problem is solved with the Viterbi algorithm.
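The node-and-edge view maps directly onto a Viterbi search. The Python sketch below illustrates it for '신을 신고 신고하기'; the candidate analyses and every probability value are invented for illustration and are not taken from Fig. 6 or from the patent's corpus statistics.

```python
def viterbi(sentence, trans, start="<s>"):
    """sentence: one list per eojeol of (analysis, tag_sequence, p_emit);
    trans[(prev_tags, tags)]: transition probability on an edge."""
    prev = [(1.0, [], start)]                 # (probability, path, tag sequence)
    for candidates in sentence:
        cur = []
        for analysis, tags, p_emit in candidates:
            # Best way to reach this node from any node of the previous eojeol.
            p, path = max(
                ((pp * trans.get((ptags, tags), 0.0), ppath)
                 for pp, ppath, ptags in prev),
                key=lambda x: x[0],
            )
            cur.append((p * p_emit, path + [analysis], tags))
        prev = cur
    prob, path, _ = max(prev, key=lambda x: x[0])
    return path, prob


# Hypothetical candidates: (analysis, tag sequence, P(m_i | t_i)).
sentence = [
    [("신/nc+을/po", "nc+po", 0.5)],
    [("신/vb+고/ec", "vb+ec", 0.3), ("신고/na", "na", 0.4)],
    [("신고/na+하/xv+기/en", "na+xv+en", 0.5),
     ("신/vb+고/ex+하/vx+기/en", "vb+ex+vx+en", 0.1)],
]
# Hypothetical P(t_i | t_{i-1}) values.
trans = {
    ("<s>", "nc+po"): 1.0,
    ("nc+po", "vb+ec"): 0.6, ("nc+po", "na"): 0.1,
    ("vb+ec", "na+xv+en"): 0.4, ("vb+ec", "vb+ex+vx+en"): 0.2,
    ("na", "na+xv+en"): 0.3, ("na", "vb+ex+vx+en"): 0.1,
}

path, prob = viterbi(sentence, trans)
print(path, prob)
```

With these made-up numbers the best path reads the first '신고' as verb plus connective ending and '신고하기' as action noun plus derivational suffix plus nominalizing ending.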

Here the tags are defined as nc: common noun, po: objective case particle, vb: verb, ex: auxiliary connective ending, ec: connective ending, xj: adjective-derivational suffix, en: nominalizing transformative ending, na: action common noun, xv: verb-derivational suffix, vx: auxiliary predicate.

However, using Equation 2 directly can run into data sparseness, so it is approximated by probabilities over smaller units to reduce the number of model parameters.

First, since t_i in P(t_i | t_{i-1}) of Equation 2 denotes a sequence of morpheme parts of speech, it can be written out as t_{i,1} t_{i,2} … t_{i,N_i}.

Here t_{i,j} denotes the j-th morpheme part of speech in a given morphological analysis of the i-th eojeol, and N_i is the number of parts of speech used in that analysis.

For example, if t_i is vb, ex, vx, en, then t_{i,3} is vx and N_i is 4.

The tags are vb: verb, ex: auxiliary connective ending, vx: auxiliary predicate, en: nominalizing transformative ending.

With this notation, P(t_i | t_{i-1}) can be expanded as follows.

Equation 4 is obtained from Equation 3 by assuming that the part-of-speech sequence of the eojeol being processed depends only on the last morpheme part of speech of the previous eojeol. It is then expanded by the chain rule into Equation 5, and applying a first-order Markov assumption yields Equation 6.

In other words, the transition probability between part-of-speech sequences is approximated by the product of the probability of transitioning from the last morpheme part of speech of the previous eojeol to the first morpheme part of speech of the current eojeol and the part-of-speech transition probabilities within the eojeol.
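Equations 3 to 6 are not reproduced in this text. From the verbal description above, the final approximation (Equation 6) presumably has the form below, where $t_{i,j}$ stands for the j-th morpheme part of speech of the i-th eojeol and $N_i$ for the number of morphemes in it. This subscript notation is an assumption and may differ from the original.

```latex
P(t_i \mid t_{i-1}) \;\approx\;
  P\!\left(t_{i,1} \mid t_{i-1,\,N_{i-1}}\right)
  \prod_{j=2}^{N_i} P\!\left(t_{i,j} \mid t_{i,j-1}\right)
```

The first factor crosses the eojeol boundary; the product covers the transitions inside the current eojeol.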

P(m_i | t_i) in Equation 2 is the probability that a morpheme sequence is generated from a part-of-speech sequence. The point to consider here is that the first morpheme may be an unregistered word.

Therefore a new variable indicating whether the first morpheme is a registered or an unregistered word is introduced, and the expression is redefined as follows.

In Equation 7, letting the variable that marks the first morpheme take the value 1 for a registered word and 0 for an unregistered word gives the following expansion.

In Equation 9, the probability of this registration variable given t_i is the probability, for a given morpheme part-of-speech sequence, that its first part of speech belongs to an unregistered or to a registered word. Under an independence assumption it can be approximated as follows.

Meanwhile, the term of Equation 9 that gives the probability of m_i conditioned on t_i and the registration variable can be expanded as follows.

Equation 15 is obtained under the assumptions of Equations 16 and 17 below; in the end, the registration variable affects only the first term of Equation 15.

The first term of Equation 15 is handled differently according to the value of the registration variable: when it is 1, the approximation of Equation 18 is used, and when it is 0, the assumption of Equation 19 is used.

Equation 19 assumes that, in Korean, the estimation of an unregistered word depends on the morpheme and part of speech that follow it; it can be regarded as a kind of linguistic heuristic.

Using the two expressions above, the probability of m_i given t_i and the registration variable is approximated as follows, depending on the value of that variable.

Substituting the two expressions above together with Equations 6 and 9 into Equation 2 yields the following part-of-speech tagging formula.

In Equation 23, the probability of the registration variable given a part of speech is the probability that an unregistered or a registered word occurs for that part of speech. These probability values are obtained from a corpus that contains unregistered words.

In particular, because these values depend on the number of headwords in the dictionary currently in use, a probabilistically optimal tagging result is obtained regardless of the size of the dictionary.

The formula developed so far can be solved with the Viterbi algorithm after the morpheme generation terms have been assigned to the nodes of Fig. 6 and the part-of-speech transition probabilities to its edges.

Observing the nodes of Fig. 6 shows that they are the morpheme lattices produced by the morphological analyzer, unfolded. The result can therefore be obtained faster by concatenating the morpheme lattices of the eojeols and running the Viterbi algorithm on them directly.

When the analysis of each eojeol is complete, the phonetic transcription conversion module (40) finds the phonetic transcription of the sentence. To recognize correctly how every eojeol in the sentence is used, it relies on the output of the part-of-speech tagger, so that even identical eojeols can be pronounced differently according to their use.

The pronunciation conversion rules follow the 'Standard Pronunciation Rules' (표준 발음법) promulgated by the Ministry of Education.

To convert to phonetic transcription, a part of speech is first assigned to each grapheme of the input eojeol on the basis of the tagging result, together with the position, within the morphological analysis, of the morpheme to which the grapheme belongs.

For example, the part-of-speech assignment for '신을 신고 신고하기' is shown in [Table 3].

[Table 3]

In the Korean Standard Pronunciation Rules, pronunciation rules are triggered by the medial vowels 'ㅕ', 'ㅖ', 'ㅢ' and by final consonants. Accordingly, pronunciation rules for the 3 medial vowels and for the 27 final consonants are written separately, and the relevant rule is executed by examining the medial vowel and final consonant of the input eojeol.

For example, as [Table 4] shows, the rules for the final consonant 'ㄴ' cover tensification, consonant assimilation, and liaison, all of which depend on the part of speech of the following initial consonant.

[Table 4]

Tensification: the initial 'ㄱ, ㄷ, ㅅ, ㅈ' of an ending attached after a stem-final 'ㄴ' is pronounced as a tense consonant.

Consonant assimilation: 'ㄴ' is pronounced [ㄹ] before or after 'ㄹ'.

Liaison: when a single or double final consonant is followed by a particle, ending, or suffix that begins with a vowel, it is carried over with its own sound value and pronounced as the initial sound of the following syllable.

Therefore, in the example above, tensification occurs in the first '신고', which is pronounced '신꼬', whereas no change occurs in the second, '신고하기', which is pronounced '신고하기'.
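The part-of-speech-dependent tensification rule for stem-final 'ㄴ' can be sketched as follows. This Python sketch covers only that single rule; the tag sets `STEM_TAGS` and `ENDING_TAGS` and the function names are assumptions chosen to match the tags used in this document, not the module's actual rule format.

```python
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
TENSE = {"ㄱ": "ㄲ", "ㄷ": "ㄸ", "ㅅ": "ㅆ", "ㅈ": "ㅉ"}
FINAL_N = 4                          # index of final consonant ㄴ in Unicode order
STEM_TAGS = {"vb", "vj"}             # assumed: verb and adjective stems
ENDING_TAGS = {"ec", "ex", "ef", "en"}  # assumed: tags that mark endings


def split(syllable):
    """Return (initial, medial, final) indices of a precomposed Hangul syllable."""
    code = ord(syllable) - 0xAC00
    return code // 588, (code % 588) // 28, code % 28


def join(cho, jung, jong):
    return chr(0xAC00 + (cho * 21 + jung) * 28 + jong)


def pronounce(morphs):
    """morphs: list of (morpheme, tag) pairs for one eojeol."""
    out = [w for w, _ in morphs]
    for i in range(len(morphs) - 1):
        (stem, t1), (ending, t2) = morphs[i], morphs[i + 1]
        if t1 in STEM_TAGS and t2 in ENDING_TAGS and split(stem[-1])[2] == FINAL_N:
            cho, jung, jong = split(ending[0])
            if CHO[cho] in TENSE:
                tense_cho = CHO.index(TENSE[CHO[cho]])
                out[i + 1] = join(tense_cho, jung, jong) + ending[1:]
    return "".join(out)


print(pronounce([("신", "vb"), ("고", "ec")]))                # 신꼬
print(pronounce([("신고", "na"), ("하", "xv"), ("기", "en")]))  # 신고하기
```

The same surface string '신고' is tensified only when the tagger has analyzed it as a verb stem followed by an ending.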

When the part-of-speech information of the morphological analysis sequence extracted as described above is supplied to the phonetic transcription conversion module (40), the module handles most pronunciation changes by rule. Words subject to tensification in compounds (e.g. 문고리 [문꼬리]), sound insertion (e.g. 솜이불 [솜니불]), or tensification in Sino-Korean words (e.g. 갈증 [갈쯩]) are defined as exception words.

In this way the pronunciation of the input sentence is obtained from part-of-speech information, although obtaining the exact pronunciation of every eojeol would require a semantic processing stage.

The next step is to find the syntactic structure of the input sentence.

Unlike a phrase structure grammar, a dependency grammar is used to find the head-dependent relations among the words of a sentence, yet it can be expressed as an ordinary phrase structure grammar.

Parsing techniques for phrase structure grammars can therefore be used as they are.

구문 분석기(50)는 구문 분석을 할 때, 어절들을 대표할 수 있는 비단말 기호(nonterminal symbol)가 필요하기 때문에 우선 품사 태깅된 결과를 바탕으로 각 어절들의 어절 품사를 생성한다.Since the parser 50 needs a nonterminal symbol that can represent the phrases when parsing, the parser 50 generates phrases for each of the phrases based on the tagged results.

어절 품사를 생성하는 방법은 입력 어절의 가장 왼쪽에 위치하는 형태소 품사와 가장 오른쪽에 위치하는 형태소 품사를 연결하여 만드는 것을 기본으로 하고, 만약 오른쪽에 쉼표 등 문장 기호들이 있을 경우에는 그 기호들을 더한다.The method of generating the phrase part is based on the connection between the leftmost and the rightmost part of the input word and if there are any sentence symbols such as a comma on the right side, the symbols are added.

예를 들어 피어 있습니다.라는 어절은 피/vb+어/ex+있/vx+습니다/ef+./se로 형태소 분석이 되고, 어절 품사는 vbefse가 된다.For example, the word "bloom" is morphemically analyzed as p / vb + / ex + vx + + /ef+./se, and the part of speech is vbefse.

한편, 두 개의 형태소 품사가 더해져서 다른 형태소 품사로 변하는 경우가 있는데, 예를 들어 공부하다.의 경우 공부/na+하/xv+다/ef+./se로 형태소 분석이 되지만, '공부하'가 의미적으로 동사의 역할을 하므로 어절 품사를 vbefse 로 만든다.On the other hand, there are cases where two morpheme parts are added and become morpheme parts of another morpheme. For example, in the case of studying, the morpheme analysis is performed by study / na + ha / xv + da /ef+./se. Because it acts as a verb, it makes vbsfse a part of speech.

The rules by which two morpheme parts of speech are merged into one are as follows.

1. Stative common noun + adjective-deriving suffix → adjective

(e.g. 건강/ns+하/xj+다/ef+./se → vjefse)

2. Active common noun + verb-deriving suffix → verb

(e.g. 공부/na+하/xv+다/ef+./se → vbefse)

3. Common noun + noun suffix → common noun

(e.g. 사람/nc+들/xn → nc)

4. Stative common noun + adverb-deriving suffix → adverb

(e.g. 간단/ns+히/xa → ad)

5. Verb + adverb-deriving suffix → adverb

(e.g. 소중/ns+하/xj+게/xa → ad)
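
The generation of word-phrase parts of speech described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the tag names come from the examples in the text, while the function names, the stack-based way the merge rules are applied, and the set of punctuation tags are assumptions. Rule 5 is worded as "verb + suffix", but its example merges an adjective (vj) with xa, so an extra (vj, xa) entry is assumed here to reproduce that example.

```python
# Merge rules from the text, keyed by (left tag, right tag).
MERGE_RULES = {
    ("ns", "xj"): "vj",  # rule 1: stative common noun + adjective-deriving suffix
    ("na", "xv"): "vb",  # rule 2: active common noun + verb-deriving suffix
    ("nc", "xn"): "nc",  # rule 3: common noun + noun suffix
    ("ns", "xa"): "ad",  # rule 4: stative common noun + adverb-deriving suffix
    ("vb", "xa"): "ad",  # rule 5 as worded: verb + adverb-deriving suffix
    ("vj", "xa"): "ad",  # ASSUMPTION: needed for the rule 5 example
}

# ASSUMPTION: only the sentence-final tag "se" appears in the text; other
# punctuation tags (for commas etc.) would be added to this set.
PUNCT_TAGS = {"se"}


def parse_analysis(analysis):
    """Split 'morph/tag+morph/tag+...' into (morph, tag) pairs."""
    return [tuple(item.rsplit("/", 1)) for item in analysis.split("+")]


def merge_tags(tags):
    """Apply the merge rules to adjacent tags, left to right."""
    out = []
    for tag in tags:
        out.append(tag)
        while len(out) >= 2 and (out[-2], out[-1]) in MERGE_RULES:
            right = out.pop()
            left = out.pop()
            out.append(MERGE_RULES[(left, right)])
    return out


def word_phrase_pos(analysis, punct_tags=PUNCT_TAGS):
    """Leftmost tag + rightmost tag, followed by any trailing punctuation tags."""
    tags = [tag for _morph, tag in parse_analysis(analysis)]
    trailing = []
    while tags and tags[-1] in punct_tags:
        trailing.insert(0, tags.pop())
    merged = merge_tags(tags)
    if not merged:
        return "".join(trailing)
    core = merged[0] if len(merged) == 1 else merged[0] + merged[-1]
    return core + "".join(trailing)


print(word_phrase_pos("피/vb+어/ex+있/vx+습니다/ef+./se"))  # vbefse
print(word_phrase_pos("공부/na+하/xv+다/ef+./se"))          # vbefse
```

A word phrase that reduces to a single tag after merging, such as 사람/nc+들/xn, keeps that single tag rather than repeating it.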

Once the word-phrase parts of speech have been obtained for all word phrases of the sentence in this way, the probabilistically optimal syntactic structure is found using a CYK table.

In general, a dependency tree can be represented by a phrase structure tree. Because Korean follows the head-final principle (a head follows its dependents), the dependency tree can be represented by an even simpler phrase structure tree.

For example, the dependency tree of the sentence '신을 신고 신고한다' ("puts on shoes and files a report") is shown in the attached FIG. 7, and the corresponding phrase structure tree is shown in the attached FIG. 8.

Here it can be observed that every rule used in the phrase structure tree of FIG. 8 has one of the following three forms.

1. S → A

(S: start symbol, A: word-phrase part of speech)

2. A → B A

(A, B: word-phrase parts of speech)

3. A → a

(A: word-phrase part of speech, a: word phrase)

The probability of obtaining a parse tree such as that of FIG. 8 is the product of the rule probabilities of all rules used in the tree. The implemented parser assigns a probability of 1.0 to every rule of the first and third forms.

This relies on two facts. First, the last word phrase of a sentence is always the head of that sentence, so rules of the first form do not affect the search for the optimal parse tree. Second, when only the best part-of-speech tagging result is used, the probability of a particular word phrase being generated from a word-phrase part of speech does not affect the search either.

Therefore, the probability of the parse tree in the attached FIG. 8 is P(vbefse → vbec vbefse) · P(vbec → ncpo vbec).
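
A CYK-style search over rules of the form A → B A can be sketched as below. This is an illustrative sketch under stated assumptions, not the implemented parser: the function name, the table layout, and in particular the rule probabilities are invented for the example (the text gives no numeric values). Because every rule is head-final, the label of a span is the word-phrase part of speech of its last word phrase, and rules of the forms S → A and A → a contribute a factor of 1.0.

```python
def best_dependency_tree(tags, rule_prob):
    """Find the most probable head-final tree over word-phrase POS tags.

    rule_prob maps (A, B) to P(A -> B A). Returns (probability, heads),
    where heads[i] is the index of the head of word phrase i and the
    sentence-final word phrase (the root) is marked with -1.
    """
    n = len(tags)
    best = [[0.0] * n for _ in range(n)]     # best[i][j]: best probability of span i..j
    split = [[None] * n for _ in range(n)]   # split[i][j]: last index of the left child
    for i in range(n):
        best[i][i] = 1.0                     # A -> a has probability 1.0

    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            for k in range(i, j):
                # Left child i..k is labelled tags[k]; right child k+1..j is labelled tags[j].
                p = best[i][k] * best[k + 1][j] * rule_prob.get((tags[j], tags[k]), 0.0)
                if p > best[i][j]:
                    best[i][j] = p
                    split[i][j] = k

    heads = [-1] * n

    def walk(i, j):
        k = split[i][j]
        if i == j or k is None:
            return
        heads[k] = j                         # head of the left child depends on the span's head
        walk(i, k)
        walk(k + 1, j)

    if n > 0:
        walk(0, n - 1)
    return (best[0][n - 1] if n else 0.0), heads


# ASSUMPTION: illustrative probabilities only; they are not taken from the patent.
RULE_PROB = {
    ("vbec", "ncpo"): 0.5,    # P(vbec -> ncpo vbec)
    ("vbefse", "vbec"): 0.4,  # P(vbefse -> vbec vbefse)
    ("vbefse", "ncpo"): 0.1,  # P(vbefse -> ncpo vbefse)
}

prob, heads = best_dependency_tree(["ncpo", "vbec", "vbefse"], RULE_PROB)
print(prob, heads)
```

With these assumed values the search prefers the tree in which the first word phrase depends on the second and the second on the third, and its probability is the product of the two rule probabilities, matching the shape of the expression given for FIG. 8.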

The results of the performance evaluation of the part-of-speech tagger in the document analyzer operating as described above are as follows.

To evaluate the performance of the part-of-speech tagger (30), the model was trained on a manually tagged training corpus of 49,506 word phrases and tested on 4,729 word phrases.

Of the 49,506 word phrases, 3,652 were used to train the probability P(｜) for unregistered words: a dictionary was built from the remaining 45,854 word phrases, and any word in the 3,652 word phrases that was not found in the dictionary was treated as an unregistered word.
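
The held-out procedure for collecting unregistered words can be sketched as follows. This is only a toy illustration: the function names and the three sample word phrases are assumptions, and looking words up at the morpheme level is also an assumption (the text only says that words missing from the dictionary are treated as unregistered).

```python
TRAINING_WORD_PHRASES = 49_506
HELD_OUT_WORD_PHRASES = 3_652
# The dictionary is built from the remainder of the training corpus.
DICTIONARY_WORD_PHRASES = TRAINING_WORD_PHRASES - HELD_OUT_WORD_PHRASES  # 45,854


def build_lexicon(tagged_word_phrases):
    """Collect every morpheme seen in the dictionary-building portion."""
    return {morph for phrase in tagged_word_phrases for morph, _tag in phrase}


def find_unregistered(held_out_word_phrases, lexicon):
    """Return (morpheme, tag) pairs from the held-out portion missing from the lexicon."""
    return [
        (morph, tag)
        for phrase in held_out_word_phrases
        for morph, tag in phrase
        if morph not in lexicon
    ]


# Hypothetical toy data: each word phrase is a list of (morpheme, tag) pairs.
dictionary_part = [
    [("사람", "nc"), ("들", "xn")],
    [("공부", "na"), ("하", "xv"), ("다", "ef")],
]
held_out_part = [
    [("건강", "ns"), ("하", "xj"), ("다", "ef")],
]

lexicon = build_lexicon(dictionary_part)
print(find_unregistered(held_out_part, lexicon))  # [('건강', 'ns')]
```

The (morpheme, tag) pairs collected this way are the kind of evidence from which a probability model for unregistered words could be estimated.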

The tagger's performance is shown in [Table 5] below, where "registered word phrase" and "unregistered word phrase" mean a word phrase that contains no unregistered word and a word phrase that contains an unregistered word, respectively.

[Table 5]

As [Table 5] shows, the accuracy for registered word phrases is 88.03%, which is not very high. This is because the implemented tagger assumes that a word phrase contains an unregistered word only when no morpheme lattice could be constructed for it.

For example, in the sentence '신을 신고 신고한다', the correct tagging of '신을' is '신/common noun + 을/objective case particle'. If the dictionary lists '신' only as a verb, however, the morphological analyzer builds only the lattice '신/verb + ㄹ/adnominal ending', and the tagger, assuming that there is no unregistered word, tags the word phrase that way.

Errors of this kind are the reason the tagging accuracy for word phrases without unregistered words is not very high.

When the words used in the test corpus were added to the dictionary before tagging, the accuracy was 96.46% per word phrase and 98.01% per morpheme.

These results indicate that the tagger's performance improves as the number of dictionary entries increases.

The results of the performance evaluation of the syntactic analyzer (50) are as follows.

The model was trained on manually constructed parse trees for 498 sentences and then tested on 100 sentences.

The 100 sentences consist of 738 word phrases. Because the last word phrase of each sentence always takes itself as its head, there are 638 head-dependent relations.

Two experiments were carried out: one in which the tagger output was fully corrected so as to evaluate the parser itself, and one in which the tagger output was used as is. The former gave an accuracy of 80.87% and the latter 78.68%.

This means, for example, that in a sentence of about 11 word phrases, 8 of the word phrases other than the last one select their head correctly.
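
The counting behind these figures can be sketched as follows. The function names and the example head assignments are hypothetical; only the totals (738 word phrases, 100 sentences, 638 relations) and the 8-out-of-10 illustration come from the text.

```python
def dependency_count(num_word_phrases, num_sentences):
    """Each sentence-final word phrase is its own head, so it adds no relation."""
    return num_word_phrases - num_sentences


def attachment_accuracy(predicted_heads, gold_heads):
    """Fraction of non-final word phrases whose head was chosen correctly.

    Expects one sentence with at least two word phrases.
    """
    pairs = list(zip(predicted_heads[:-1], gold_heads[:-1]))
    correct = sum(1 for predicted, gold in pairs if predicted == gold)
    return correct / len(pairs)


print(dependency_count(738, 100))  # 638

# Hypothetical 11-word-phrase sentence in which each word phrase depends on the next one.
gold = list(range(1, 11)) + [-1]
predicted = list(gold)
predicted[0] = 10  # two deliberately wrong attachments
predicted[3] = 10
print(attachment_accuracy(predicted, gold))  # 0.8
```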

As described above, the document analyzer according to the present invention preprocesses the input document using a nondeterministic finite automaton, obtains all possible morphological analyses of each word phrase in the morphological analysis step, and selects the most probable of those analyses. The pronunciation conversion module then assigns a part of speech to every grapheme of the word phrase and applies 3 rules based on medial vowels and 27 rules based on final consonants to obtain the phonetic transcription of the word phrase. Finally, the probabilistic dependency grammar of the syntactic analyzer is used to find the head-dependent relations of the input word phrases so that prosody can be generated, providing reliable text-to-speech conversion.

The document analyzer according to the present invention achieves an accuracy of 85.11% in the part-of-speech tagger and 78.68% in the syntactic analyzer. For the part-of-speech tagger in particular, the accuracy rises to 96.46% when the document being processed contains no unregistered words.

In addition to text-to-speech conversion systems, the document analyzer according to the present invention can be useful for building pronunciation dictionaries for speech recognition systems.

Claims

A document analyzer for a Korean text-to-speech conversion system, comprising: preprocessing means for extracting sentences one at a time from a document input for text-to-speech conversion and converting non-Hangul characters into Hangul characters using a nondeterministic finite automaton;

morphological analysis means for obtaining all possible morphological analyses of each word phrase of the sentence supplied by the preprocessing means;

a part-of-speech tagger for extracting the optimal morpheme sequence by selecting, on the basis of probability information, the most probable morphological analysis of the input sentence;

pronunciation conversion means for obtaining the phonetic transcription of each word phrase using the morpheme sequence supplied by the part-of-speech tagger; and

syntactic analysis means for obtaining the head-dependent relations of the word phrases using a probabilistic dependency grammar and outputting, as the final result, the phonetic transcriptions, the morpheme part-of-speech sequences, and the dependency tree of the word phrases of the input document.

The document analyzer according to claim 1, wherein the part-of-speech tagger and the syntactic analysis means are each provided with probability information, for part-of-speech tagging and for parsing respectively.

The document analyzer according to claim 1, wherein the preprocessing means classifies non-Hangul characters into English words, English abbreviations, numbers, telephone numbers, years, and times.

The document analyzer according to claim 1, wherein the preprocessing means is implemented as a nondeterministic finite automaton consisting of 48 states, 12 of which are final states, each associated with a different non-Hangul-to-Hangul conversion module.

The document analyzer according to claim 1, wherein the morphological analysis means passes the phonetic transcription directly to the pronunciation conversion means when the pronunciation conversion means refers to a word phrase read from the input document.

The document analyzer according to claim 1, wherein, if morphological analysis of a word phrase supplied by the preprocessing means fails, the morphological analysis means judges that the word phrase contains an unregistered word, estimates the unregistered word, and minimizes the number of candidates using linguistic heuristics.

The document analyzer according to claim 1, wherein the lattice construction algorithm used in the morphological analysis means is as follows.

The document analyzer according to claim 1, wherein the morphological analysis means computes in advance the positions of irregular conjugation and contraction in the input word phrase, performs morphological analysis according to a lattice construction algorithm, and represents the result as a morpheme lattice.

The document analyzer according to claim 1, wherein the morphological analysis means builds the morpheme lattice by finding morphemes from right to left in the input word phrase.

The document analyzer according to claim 1, wherein the morphological analysis means uses syllable information to exclude a node from the lattice if the last syllable of the unregistered word is not one that occurs as the last syllable of words with the estimated part of speech.

The document analyzer according to claim 1, wherein the morphological analysis means uses clue morphemes, detecting in an incomplete morpheme lattice those morphemes considered sufficient to estimate the morphemic composition of the word phrase and removing the rest from the lattice.

The document analyzer according to claim 1, wherein the part-of-speech tagger is implemented as follows.

The document analyzer according to claim 1, wherein the pronunciation conversion means assigns a part of speech to every grapheme constituting the word phrase and then obtains the phonetic transcription of the word phrase by applying 3 rules based on medial vowels and 27 rules based on final consonants.

The document analyzer according to claim 1, wherein the pronunciation conversion means defines as exceptional-pronunciation words those words that exhibit a recurring phenomenon in compound words, a sound-addition phenomenon, or a recurring phenomenon in Sino-Korean words.

The document analyzer according to claim 1, wherein the syntactic analysis means generates the basic word-phrase part of speech by concatenating the part of speech of the leftmost morpheme of the input word phrase with that of the rightmost morpheme.

The document analyzer according to claim 1, wherein the syntactic analysis means generates the word-phrase part of speech by appending punctuation symbols, such as a comma, when they appear on the right.