KR100304654B1

KR100304654B1 - Method and apparatus for analyzing korean document

Info

Publication number: KR100304654B1
Application number: KR1019930026152A
Authority: KR
Inventors: 김정수; 김상룡
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1993-11-30
Filing date: 1993-11-30
Publication date: 2001-11-22
Also published as: KR950015053A

Abstract

PURPOSE: A method and apparatus for analyzing a Korean document are provided to convert a document into language codes by analyzing every morpheme, including function words, unit by unit, and by analyzing sentence structure using a Korean analysis dictionary and a morpheme connection information table. CONSTITUTION: A Korean analysis dictionary (14) holds information such as morpheme category codes. A pre-processing unit (10) separates an input document into word (eojeol) units. A morpheme analysis unit (12) identifies morphemes using a morpheme connection information table (16). A syntax analysis unit (18) analyzes which syntax element each word is used as, using the morphemes output from the morpheme analysis unit (12). A language code generating unit (20) converts the syntax elements output from the syntax analysis unit (18) into language codes using a pronunciation dictionary (22) and a pause length table (24).

Description

Method and apparatus for analyzing Korean documents

FIG. 1 is a block diagram of a conventional Korean document analysis apparatus.

FIG. 2 is a block diagram of a Korean document analysis apparatus according to an embodiment of the present invention.

FIG. 3 is a table showing an example of preprocessing in the preprocessing unit of FIG. 2.

FIG. 4 is a table showing an example of syntactic analysis in the syntax analysis unit of FIG. 2.

FIG. 5 is a table showing examples of number reading in the language code generation unit of FIG. 2.

FIG. 6 is a table showing examples of pause lengths and their meanings in the pause length table of FIG. 2.

FIG. 7 is a table showing examples of prosody patterns in the language code generation unit of FIG. 2.

FIG. 8 is a table showing the performance evaluation of the Korean document analysis apparatus according to the present invention.

* Explanation of reference numerals for the main parts of the drawings

10: preprocessing unit; 12: morpheme analysis unit

14: Korean analysis dictionary; 16: connection information table

18: syntax analysis unit; 20: language code generation unit

22: pronunciation dictionary; 24: pause length table

The present invention relates to a method and apparatus for analyzing Korean documents, and more particularly to a method and apparatus that analyze every morpheme, not only function words, unit by unit, analyze syntax using a Korean analysis dictionary and a morpheme connection information table, classify the results by category, and determine not only prosody but also pause positions and lengths, thereby converting a document into language codes.

In general, a text-to-speech converter such as an unlimited-vocabulary speech synthesizer or a text-to-speech system must artificially insert the phonological and prosodic information that a human produces naturally when uttering a sentence. Only then is the meaning of the sentence conveyed well and the synthesized speech both intelligible and natural. Electrically or mechanically generated speech therefore needs its phonological symbols and prosody adjusted according to a set of rules.

Conventionally, as shown in FIG. 1, a syntax analyzer (1) analyzed phrases or clauses on the basis of sentence information extracted from Korean function words, that is, particles and endings, and used the result for tasks such as deciding where to pause when reading.

However, with only eojeol-level analysis or function words it is difficult to classify the role of each phrase in a sentence, and this greatly degrades the naturalness of the generated output speech.

Accordingly, an object of the present invention is to solve the above problem by providing a method that analyzes every morpheme, not only function words, unit by unit, analyzes syntax using a Korean analysis dictionary and a morpheme connection information table, classifies the results by category, and determines not only prosody but also pause positions and lengths, thereby converting a document into language codes.

Another object of the present invention is to provide an apparatus best suited to implementing the above Korean document analysis method.

To achieve the above object, the present invention provides a method for analyzing a Korean document, comprising: a preprocessing step of separating an input document into eojeol units; a morpheme analysis step of identifying the morphemes of each eojeol, the unit of spacing, in the eojeol-level Korean document produced by the preprocessing step, using a Korean analysis dictionary and a connection information table; a syntactic analysis step of analyzing, using the morphemes produced by the morpheme analysis step, which syntax element each eojeol, the smallest unit of a sentence constituent, is used as; and a language code generation step of generating language codes from the syntax elements produced by the syntactic analysis step, using a pronunciation dictionary and a pause length table.

To achieve the other object, the Korean document analysis apparatus according to the present invention is provided with a Korean analysis dictionary holding information such as morpheme headwords, exceptional-pronunciation flags, long vowel positions, and morpheme category codes, a morpheme connection information table, a pronunciation dictionary for Korean, English, special symbols, and the like, and a pause length table, and comprises: a preprocessing unit for separating an input document into eojeol units; a morpheme analysis unit for identifying the morphemes of each eojeol, the unit of spacing, in the eojeol-level Korean document output from the preprocessing unit, using the Korean analysis dictionary and the connection information table; a syntax analysis unit for analyzing, using the morphemes output from the morpheme analysis unit, which syntax element each eojeol, the smallest unit of a sentence constituent, is used as; and a language code generation unit for generating language codes from the syntax elements output from the syntax analysis unit, using the pronunciation dictionary and the pause length table.

Hereinafter, the present invention will be described with reference to the accompanying drawings.

The block diagram of FIG. 2 consists of: a Korean analysis dictionary (14) holding information such as morpheme headwords, exceptional-pronunciation flags, long vowel positions, and morpheme category codes; a morpheme connection information table (16); a pronunciation dictionary (22) for Korean, English, special symbols, and the like; a pause length table (24); a preprocessing unit (10) for separating an input document into eojeol units; a morpheme analysis unit (12) for identifying the morphemes of each eojeol, the unit of spacing, in the eojeol-level Korean document output from the preprocessing unit (10), using the Korean analysis dictionary (14) and the morpheme connection information table (16); a syntax analysis unit (18) for analyzing, using the morphemes output from the morpheme analysis unit (12), which syntax element each eojeol, the smallest unit of a sentence constituent, is used as; and a language code generation unit (20) for generating language codes from the syntax elements output from the syntax analysis unit (18), using the pronunciation dictionary (22) and the pause length table (24). The input document is assumed to be a Korean document made up of Hangul codes, Hanja codes, and ASCII codes.

FIG. 3 is a table showing an example of preprocessing in the preprocessing unit (10) of FIG. 2, FIG. 4 is a table showing an example of syntactic analysis in the syntax analysis unit (18) of FIG. 2, FIG. 5 is a table showing examples of number reading in the language code generation unit (20) of FIG. 2, FIG. 6 is a table showing examples of pause lengths and their meanings in the pause length table (24) of FIG. 2, and FIG. 7 is a table showing examples of prosody patterns in the language code generation unit (20) of FIG. 2.

The operation of the present invention will now be described in detail with reference to FIGS. 2 to 8.

In FIG. 2, the Korean analysis dictionary (14) needed for document analysis holds information such as morpheme headwords, exceptional-pronunciation flags, long vowel positions, and morpheme category codes, and supplies this dictionary information to the morpheme analysis unit (12) during morpheme analysis. Connection information is applied using the longest-match method.

The preprocessing unit (10) takes a Korean document as input and separates it into eojeol units, where an eojeol is the basic unit delimited by spacing. Preprocessing includes Hanja-to-Hangul conversion, eojeol separation, and sentence separation.

To convert Hanja codes to Hangul codes, a Hanja/Hangul conversion table is used that holds, for each Hangul syllable, the first code of the corresponding Hanja and the number of consecutive codes. Eojeol separation splits a sentence into eojeols, the units of morpheme analysis, and classifies each eojeol by its constituent characters as Hangul (K), English (E), number (N), or special symbol (S). Sentence separation splits a document into sentences: a period (.) that is not a decimal point, a question mark (?), or an exclamation mark (!) is taken as the end of a sentence. A blank is also treated as the end of a sentence, because a title or heading is generally followed by a blank.
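The eojeol classification and sentence separation described above can be illustrated with a minimal Python sketch. The function names, the precedence used for eojeols that mix character types, and the reading of "blank" as a blank line are assumptions made for illustration; the patent does not specify them.

```python
import re

def classify_eojeol(token: str) -> str:
    """Classify an eojeol as K (Hangul), E (English), N (number) or S (special symbol).

    Assumption: for mixed tokens the first matching class wins, in the order K, E, N.
    """
    if any("\uac00" <= ch <= "\ud7a3" for ch in token):
        return "K"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "E"
    if any(ch.isdigit() for ch in token):
        return "N"
    return "S"

# A period ends a sentence unless it sits between two digits (a decimal point).
_SENTENCE_END = re.compile(r"(?<!\d)\.|\.(?!\d)|[?!]")

def split_sentences(text: str) -> list[str]:
    """Split a document into sentences at ., ?, ! and at blank lines."""
    sentences = []
    for block in re.split(r"\n\s*\n", text):  # assumption: a blank line ends a title or heading
        start = 0
        for match in _SENTENCE_END.finditer(block):
            sentence = block[start:match.end()].strip()
            if sentence:
                sentences.append(sentence)
            start = match.end()
        tail = block[start:].strip()
        if tail:
            sentences.append(tail)
    return sentences

def split_eojeols(sentence: str) -> list[str]:
    """An eojeol is the unit delimited by spacing."""
    return sentence.split()
```

With this sketch, a title followed by a blank line becomes its own sentence, and the period in a value such as 3.14 is not treated as a terminator.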

FIG. 3 shows an example of the preprocessing result when the National Education Charter is input as the Korean document.

The morpheme analysis unit (12) analyzes a sentence so that it can be reorganized into pause units. Korean morpheme analysis in particular means identifying the morphemes that make up an eojeol, the unit of spacing. The morpheme analysis unit (12) is described below in terms of the Korean morpheme classification, the structure of the analysis dictionary, and the search for morpheme combinations with longest matching using the connection information table, which is the core of morpheme analysis.

First, a morpheme can be defined as the smallest unit of speech in which a given sound is bound to a given meaning, that is, the minimal meaningful unit. Classifications differ depending on the processing viewpoint; here, morphemes are divided into about 60 categories based on 12 major parts of speech, as shown in Table 1.

[Table 1]
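As an illustration of longest matching against an analysis dictionary combined with a connection information table, the following Python sketch segments one eojeol into morphemes. The dictionary entries, category names, and connection pairs are hypothetical toy data, and the backtracking to a shorter match when a longer one leads to a dead end is an assumption added here, not something the patent states.

```python
# Hypothetical toy analysis dictionary: surface form -> morpheme category code.
DICTIONARY = {
    "우리": "PRONOUN",
    "학교": "NOUN",
    "는": "AUX_PARTICLE",
    "에": "ADV_PARTICLE",
    "에서": "ADV_PARTICLE",
}

# Hypothetical connection information table: allowed (left, right) category pairs.
CONNECTIONS = {
    ("PRONOUN", "AUX_PARTICLE"),
    ("NOUN", "AUX_PARTICLE"),
    ("NOUN", "ADV_PARTICLE"),
}

def analyze(eojeol, prev_category=None):
    """Return a list of (morpheme, category) pairs, or None if no analysis exists.

    The longest dictionary entry is tried first; a candidate is accepted only if
    the connection table allows it to follow the previous morpheme's category.
    """
    if not eojeol:
        return []
    for end in range(len(eojeol), 0, -1):  # longest candidate first
        piece = eojeol[:end]
        category = DICTIONARY.get(piece)
        if category is None:
            continue
        if prev_category is not None and (prev_category, category) not in CONNECTIONS:
            continue
        rest = analyze(eojeol[end:], category)
        if rest is not None:
            return [(piece, category)] + rest
    return None
```

For example, `analyze("학교에서")` prefers the longer particle "에서" over "에", and `analyze("는우리")` yields no analysis because the toy connection table does not allow a pronoun to follow a particle.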

Next, the syntax analysis unit (18) will be described.

Humans can break up a long sentence appropriately when reading because they can work out its syntactic structure. Syntactic analysis determines the syntactic structure from the result of morpheme analysis, on the basis of which syntax element each eojeol, the smallest unit of a sentence constituent, is used as.

The syntactic analysis method is described in the order of syntax element classification, simple-sentence decomposition, case analysis, and independent-word processing. An example of syntactic analysis is shown in FIG. 4.

1) Detailed classification of syntax elements

An eojeol is the smallest unit of a sentence constituent. According to their use in a sentence, eojeols are divided into 7 types, which are further subdivided into 19, as shown in Table 2.

[Table 2]

2) Simple-sentence decomposition

Korean sentences are broadly divided into simple sentences and complex sentences. A simple sentence is one in which the subject-predicate relation occurs once; a complex sentence is one in which simple sentences are joined to each other or nested in several layers. Simple-sentence decomposition means separating a complex sentence made of joined simple sentences into its simple sentences.

3) Case analysis

In case analysis, a surface case is determined for each eojeol, according to the category of its last morpheme. For example, in the eojeol '우리는' ('we' plus a topic particle), the last morpheme '는' belongs to the general auxiliary particle category, so the eojeol is assigned the general case.
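The rule that an eojeol's surface case follows from the category of its last morpheme amounts to a table lookup, sketched below in Python. The category names, case labels, and the "unmarked" fallback are hypothetical; the actual categories of Table 1 are not reproduced in the text.

```python
# Hypothetical mapping from the last morpheme's category to a surface case.
SURFACE_CASE_BY_CATEGORY = {
    "SUBJECT_PARTICLE": "subjective",
    "OBJECT_PARTICLE": "objective",
    "GENERAL_AUX_PARTICLE": "general",
}

def surface_case(morphemes):
    """Return the surface case of an eojeol given its (morpheme, category) list."""
    _, last_category = morphemes[-1]  # only the final morpheme decides the case
    return SURFACE_CASE_BY_CATEGORY.get(last_category, "unmarked")  # fallback is an assumption
```

For the example in the text, `surface_case([("우리", "PRONOUN"), ("는", "GENERAL_AUX_PARTICLE")])` returns the general case.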

4) Independent-word processing

An independent word is a constituent that is given its prosody without any direct relation to the main or subordinate constituents, and that affects the sentence as a whole. Interjections, sentence adverbs, eojeols ending in a vocative particle, and characters that indicate order are treated as independent words.

The language code generation unit (20) generates phonological symbols for the syntax elements obtained by syntactic analysis. Phonological symbols are generated for Hangul, English, numbers, and special symbols, each according to its own rules, as follows.

1) Reading Hangul

Reading Hangul is a matter of pronunciation rules, arising from the difference between written symbols and spoken symbols. The graphemes that make up a word influence one another and change into sounds that are easier for a human to utter. The phonological rules that arise from relations between graphemes are laid down in the Standard Korean Pronunciation rules.

Most phonological changes are regular, but about 6 to 7% may be irregular. Regular changes are handled with a pronunciation conversion table and the standard pronunciation rules to generate phonological symbols; irregular ones are looked up in the Korean pronunciation dictionary. In particular, a heuristic approach was used for words with the same spelling that are pronounced differently.
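The split between dictionary-handled exceptions and rule-handled regular changes can be sketched in Python as follows. Only one regular rule is implemented, liaison (a final consonant carried over to a following syllable that begins with silent ㅇ), using the arithmetic layout of precomposed Hangul syllables in Unicode. This is a deliberate simplification: palatalization, tensification, consonant clusters, and the other standard pronunciation rules are ignored, and the single exception entry is illustrative.

```python
HANGUL_BASE = 0xAC00
SILENT_INITIAL = 11  # index of the initial consonant ㅇ

# Final-consonant index -> initial-consonant index, for single finals that carry over.
FINAL_TO_INITIAL = {
    1: 0, 2: 1, 4: 2, 7: 3, 8: 5, 16: 6, 17: 7,
    19: 9, 20: 10, 22: 12, 23: 14, 24: 15, 25: 16, 26: 17,
}

# Exception dictionary consulted before any rule (illustrative entry).
EXCEPTIONS = {"밟다": "밥따"}

def _is_syllable(ch):
    return "가" <= ch <= "힣"

def _split(ch):
    """Decompose a syllable into (initial, medial, final) indices."""
    code = ord(ch) - HANGUL_BASE
    return code // 588, (code % 588) // 28, code % 28

def _join(initial, medial, final):
    return chr(HANGUL_BASE + (initial * 21 + medial) * 28 + final)

def pronounce(word):
    """Return a Hangul rendering of the pronunciation: exceptions first, then liaison."""
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    out = list(word)
    for i in range(len(out) - 1):
        left, right = out[i], out[i + 1]
        if not (_is_syllable(left) and _is_syllable(right)):
            continue
        l_init, l_med, l_fin = _split(left)
        r_init, r_med, r_fin = _split(right)
        if r_init == SILENT_INITIAL and l_fin in FINAL_TO_INITIAL:
            out[i] = _join(l_init, l_med, 0)  # drop the final consonant
            out[i + 1] = _join(FINAL_TO_INITIAL[l_fin], r_med, r_fin)
    return "".join(out)
```

Checking the exception dictionary first mirrors the description above, where irregular items are resolved by lookup and everything else falls through to the rules.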

2) Reading English

For English, a word that exists in the dictionary is read with its registered pronunciation; otherwise it is spelled out letter by letter.
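A minimal Python sketch of this fallback is shown below. The dictionary entry is a hypothetical example, and the Hangul letter names are the commonly used ones; the patent does not list either.

```python
# Hypothetical English pronunciation dictionary (lower-cased word -> Hangul reading).
ENGLISH_DICTIONARY = {"news": "뉴스"}

# Commonly used Korean names of the Latin letters.
LETTER_NAMES = {
    "a": "에이", "b": "비", "c": "시", "d": "디", "e": "이", "f": "에프",
    "g": "지", "h": "에이치", "i": "아이", "j": "제이", "k": "케이", "l": "엘",
    "m": "엠", "n": "엔", "o": "오", "p": "피", "q": "큐", "r": "아르",
    "s": "에스", "t": "티", "u": "유", "v": "브이", "w": "더블유", "x": "엑스",
    "y": "와이", "z": "제트",
}

def read_english(word):
    """Use the registered pronunciation if present, else spell the word letter by letter."""
    key = word.lower()
    if key in ENGLISH_DICTIONARY:
        return ENGLISH_DICTIONARY[key]
    # Characters without a letter name (digits, punctuation) are skipped in this sketch.
    return "".join(LETTER_NAMES.get(ch, "") for ch in key)
```

Here `read_english("News")` uses the dictionary entry, while an unregistered abbreviation such as "OA" is spelled out.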

3) Reading numbers

Numbers appear in sentences in forms such as 'number + unit-denoting dependent noun', 'number, number', and 'number, ..., number', and sometimes in idiomatic expressions such as '8.15'. Examples of how each case is read are shown in FIG. 5.
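Since FIG. 5 is not reproduced in the text, the following Python sketch only illustrates two of the readings involved: the Sino-Korean positional reading of an integer and the digit-by-digit reading used for idiomatic expressions such as '8.15'. The function names and the supported range (below 100,000,000) are assumptions; the native Korean reading and the rule for choosing between the two readings are not covered.

```python
DIGITS = ["영", "일", "이", "삼", "사", "오", "육", "칠", "팔", "구"]
SMALL_UNITS = ["", "십", "백", "천"]

def _read_group(n):
    """Read 0 <= n < 10000 positionally; returns an empty string for 0."""
    parts = []
    for power in (3, 2, 1, 0):
        digit = (n // 10 ** power) % 10
        if digit == 0:
            continue
        # A leading 일 is dropped before 십/백/천 (10 is 십, not 일십).
        name = "" if (digit == 1 and power > 0) else DIGITS[digit]
        parts.append(name + SMALL_UNITS[power])
    return "".join(parts)

def read_sino_korean(n):
    """Sino-Korean reading of an integer in the range 0 to 99,999,999."""
    if n == 0:
        return DIGITS[0]
    if not 0 < n < 10 ** 8:
        raise ValueError("this sketch supports 0 to 99,999,999 only")
    high, low = divmod(n, 10000)
    text = ""
    if high:
        text += ("" if high == 1 else _read_group(high)) + "만"
    return text + _read_group(low)

def read_digit_by_digit(token):
    """Read each digit separately and skip separators, as in idiomatic '8.15'."""
    return "".join(DIGITS[int(ch)] for ch in token if ch.isdigit())
```

For instance, the 2280 eojeols mentioned in the evaluation would be read 이천이백팔십, while '8.15' is read 팔일오.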

In the language code generation unit (20), prosodic symbol generation produces pause lengths and intonation patterns according to the connections between syntax elements, as described in order below.

1) Pause length determination

To convey the meaning of a sentence well, it must be read with pauses in appropriate places.

This is what pause length determination does. In the present invention, pause length is determined on six levels, as shown in FIG. 6.

The pause length is determined as follows.

First, the pause length between syntax elements is determined from the degree of syntactic coupling between them; pause lengths 5, 4, and 3 are assigned at this stage. Next, if a pause unit is an adnominal phrase containing 12 or more vowels and the units before and after it each contain 3 or more vowels, the pause length is adjusted to 2. Likewise, if a pause unit is a noun containing 12 or more vowels and the units before and after it each contain 3 or more vowels, the pause length is adjusted to 1.
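The two-stage decision above can be sketched in Python as follows. The `Unit` type, its field names, and the function name are hypothetical. The text does not say which pauses the second stage may overwrite, so this sketch assumes that levels 5, 4, and 3 from the syntactic stage are kept and that only weaker pauses are adjusted.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    kind: str    # "adnominal", "noun", or any other label (hypothetical labels)
    vowels: int  # number of vowels in the pause unit

def pause_after(base_pause, prev, unit, nxt):
    """Return the pause level (0 to 5) following `unit`.

    base_pause is the level assigned from syntactic coupling (5, 4, 3, or 0 if none).
    """
    if base_pause in (3, 4, 5):
        return base_pause  # assumption: syntactic pauses are not overwritten
    long_unit = unit.vowels >= 12
    neighbours_long_enough = prev.vowels >= 3 and nxt.vowels >= 3
    if long_unit and neighbours_long_enough:
        if unit.kind == "adnominal":
            return 2
        if unit.kind == "noun":
            return 1
    return base_pause
```

A long adnominal phrase between two units of at least three vowels thus receives level 2, and a long noun in the same position receives level 1.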

2) Prosodic symbol generation

Prosody patterns are defined according to the number of voiced sounds in an utterance unit and their relationships. Based on these decision rules, a prosodic symbol generation table is built and applied eojeol by eojeol. FIG. 7 was drawn up from prosody statistics of Korean speech and covers up to four voiced sounds.

To evaluate the performance of the algorithm of the Korean document analysis method according to the present invention, the accuracy of each stage was measured using the 'National Education Charter', editorials, and similar texts as input documents. The results are shown in FIG. 8. The unit of measurement is the eojeol, and the ratio of failed eojeols to the total number of eojeols was recorded.

In FIG. 8, for a total of 172 sentences and 2,280 eojeols, the success rates of morpheme analysis and syntactic analysis were 97.7% and 95.0%, respectively. The main cause of errors is ambiguity, where a single eojeol can be analyzed in several ways. To reduce this ambiguity, it is desirable to add an analysis method that uses context information.

The success rate of phonological symbol generation is 99.8%. Failures arise in number reading, from the choice between the Sino-Korean reading and the native Korean reading. To eliminate these failures, more number-reading rules and heuristic information need to be added.

The success rates of prosodic symbol generation, that is, of pause length and prosody pattern generation, were 93.1% and 94.0%, respectively. The reference was obtained by classifying sentences read aloud by a person into the pause lengths and prosody patterns defined above, and comparing them with the output of the apparatus. Because an inaccurate pause length severely harms the naturalness of the speech, heuristics on human vocalization practice should also be added to raise accuracy.

The Korean document analysis method according to the present invention can be applied to automatic translation systems, educational OA equipment, and the like, and enables interactive-mode system operation.

As described above, the Korean document analysis method and apparatus according to the present invention have the advantage of accurately handling the phonology, prosody, and pauses that govern the naturalness of machine-generated speech.

In addition, tabulating the morpheme and syntactic analysis results related to phonology, prosody, and pauses not only improves the speed of Korean document analysis but also, for prosody rules, makes it possible to derive a relation between prosody patterns and the number of voiced sounds.

In addition, for pause length, units from 1 second down to 100 ms are classified according to the syntax elements of the sentence, which improves naturalness.

Claims

1. A Korean document analysis method for converting a Korean document into speech, the method comprising: a preprocessing step of separating an input Korean document into eojeol units; a morpheme analysis step of analyzing, for each separated eojeol, the morphemes of the eojeol, which is the unit of spacing, using a Korean analysis dictionary and a connection information table; a syntactic analysis step of analyzing, using the analyzed morphemes, which syntax element each eojeol, the smallest unit of a sentence constituent, is used as; and a language code generation step of generating phonological symbols and prosodic symbols for the syntax elements obtained by the syntactic analysis, using a pronunciation dictionary and a pause length table that serve as the basis for speech synthesis, wherein the language code generation step generates phonological symbols for the syntax elements using the pronunciation dictionary for at least Hangul, English, numbers, and special symbols, and generates prosodic symbols, including pause lengths and intonation patterns between the syntax elements, using the pause length table, which determines a pause length having a plurality of levels from the syntactic coupling between syntax elements.

2. The method of claim 1, wherein the preprocessing step includes Hanja-to-Hangul conversion, eojeol separation, and sentence separation.

3. The method of claim 1, wherein the morpheme analysis step uses 60 morpheme categories based on the major parts of speech.

4. The method of claim 1, wherein the syntactic analysis step comprises: a first step of classifying syntax elements according to the use of each eojeol in the sentence; a second step of decomposing the sentence into simple sentences; a third step of determining a surface case for each eojeol according to the category of its last morpheme; and a fourth step of treating interjections, sentence adverbs, eojeols ending in a vocative particle, and characters indicating order as independent words.

5. The method of claim 1, wherein the pause length is set to 5, 4, or 3 according to the degree of syntactic coupling between the syntax elements; is set to 2 if a pause unit is an adnominal phrase containing 12 or more vowels and the units before and after it each contain 3 or more vowels; and is set to 1 if a pause unit is a noun containing 12 or more vowels and the units before and after it each contain 3 or more vowels.

6. The method of claim 5, wherein the sentence pause length is set to 5, the simple-sentence pause length to 4, the pause length to 3, the pause for breath to 2, the pause for meaning to 1, and no pause to 0.

7. A Korean document analysis apparatus for converting a Korean document into speech, provided with a Korean analysis dictionary holding information such as morpheme headwords, exceptional-pronunciation flags, long vowel positions, and morpheme category codes, a morpheme connection information table, a pronunciation dictionary for Korean, English, numbers, and special symbols, and a pause length table, the apparatus comprising: a preprocessing unit for separating an input Korean document into eojeol units; a morpheme analysis unit for analyzing, for each separated eojeol, the morphemes of the eojeol, which is the unit of spacing, using the Korean analysis dictionary and the connection information table; a syntax analysis unit for analyzing, using the analyzed morphemes, which syntax element each eojeol, the smallest unit of a sentence constituent, is used as; and a language code generation unit for generating phonological symbols and prosodic symbols for the syntax elements obtained by the syntactic analysis, using the pronunciation dictionary and the pause length table that serve as the basis for speech synthesis, wherein the language code generation unit generates phonological symbols for the syntax elements using the pronunciation dictionary for at least Hangul, English, numbers, and special symbols, and generates prosodic symbols, including pause lengths and intonation patterns between the syntax elements, using the pause length table, which determines a pause length having a plurality of levels from the syntactic coupling between syntax elements.

8. The apparatus of claim 7, wherein the morpheme analysis unit analyzes the morphemes by parsing each eojeol output from the preprocessing unit, looking up the parsed morphemes in the Korean analysis dictionary to obtain dictionary information, and looking up morpheme pairs in the connection information table to obtain connection status information.