KR100434526B1 - Sentence extracting method from document by using context information and local document form

Info

Publication number: KR100434526B1
Application number: KR1019970024231A
Authority: KR
Inventors: 이혜정
Original assignee: 삼성전자주식회사
Priority date: 1997-06-12
Filing date: 1997-06-12
Publication date: 2004-09-04
Also published as: KR19990001034A

Abstract

PURPOSE: A method of extracting sentences from a document using context information and local document form is provided as a preprocessor for any document interpreter. It improves interpreter performance by splitting an overly long sentence into sentences of suitable length at suitable positions, by correctly extracting sentences that have no terminating punctuation, and by correctly handling punctuation marks that do not terminate a sentence. CONSTITUTION: In a preliminary sentence extraction process (10), a preliminary sentence is extracted from the document based on punctuation information and the context information of the punctuation, while information on the local document form is stored for later use. In an actual sentence extraction process (20), the actual sentence is extracted using the stored local document form information together with the context information.

Description

Sentence Extraction Method Using Contextual Information and Local Document Type

The present invention relates to a method of extracting sentences from a document as a preprocessing step for document interpretation. In particular, it relates to a sentence extraction method using context information and local document form that extracts sentences through a two-stage process, a preliminary sentence extraction process followed by an actual sentence extraction process, and thereby overcomes the problems of sentence extraction based on punctuation information alone.

Almost no research has been done so far on sentence extraction from Korean documents, and most existing methods rely on punctuation information alone. That is, sentences are extracted by considering only the symbols that mark the end of a sentence, such as the period (.), the question mark (?), and the exclamation mark (!).

In real documents, however, there are often sentences with no punctuation and sentences that do not end at a punctuation mark. In addition, a sentence is sometimes so long that it has to be split into two or more sentences.

Sentences without punctuation are common in titles and in itemized (outline-style) text, as shown by sentence (1) of FIG. 1.

Sentence (2) shows a case in which a punctuation mark is present but does not mark the end of the sentence. This happens with direct quotations and whenever a punctuation mark is used for some purpose other than ending a sentence.

Finally, sentence (3) of FIG. 1 is an example of a sentence that is so long that it has to be split into two or more sentences.

If everything shown in sentence (3) is treated as a single sentence, the sentence becomes so long that its structure is hard to extract during document interpretation, and the system itself becomes harder to implement. The sentence therefore needs to be split at a suitable position.

Conventional sentence extraction methods do not take these situations into account, so they extract incorrect sentences and pass them to the document interpreter as input, which then produces incorrect results.

The problem is even more serious in a speech synthesizer that reads a document aloud sentence by sentence. For long sentences, moreover, the sentence must be split at a suitable position so that the synthesizer can read the document naturally.

The present invention was devised to solve the problems of the prior art described above. Its object is to provide a sentence extraction method using context information and local document form that extracts sentences through a two-stage process, a preliminary sentence extraction process and an actual sentence extraction process, taking context information and local document form into account, and thereby solves the problems of sentence extraction based on punctuation information alone.

To achieve this object, the present invention comprises: a preliminary sentence extraction process that, in order to make local document form information available, extracts from the document a preliminary sentence based on punctuation information and the context information of the punctuation while storing information on the local document form; and an actual sentence extraction process that extracts the sentence actually to be used, using context information together with the local document form information stored during the preliminary sentence extraction process.

In the sentence extraction method proposed by the present invention, sentences are extracted with regard to both context information and local document form in order to solve the problems of conventional extraction based on punctuation information alone.

Here, local document form means the form of a part of the document: the number of characters in a line, the number of leading blanks, the type of the first character of the line, whether the next line contains any characters, and so on. It is literally information about the shape of the document, without any linguistic information.

The word "local" also indicates that only a part of the document, not the whole document, is considered.

FIG. 1 is an exemplary diagram for explaining the problems of the prior art;

FIG. 2 is an overall block diagram of the sentence extraction method using context information and local document form according to the present invention;

FIG. 3 is a detailed operation flowchart of the preliminary sentence extraction process of FIG. 2;

FIG. 4 is a detailed operation flowchart of the actual sentence extraction process of FIG. 2;

FIG. 5 is an exemplary diagram in which the sentence extraction method proposed in the present invention is applied to a speech synthesizer.

* Explanation of reference numerals for the main parts of the drawings *

10: preliminary sentence extraction process

20: actual sentence extraction process

The operating principle of the sentence extraction method using context information and local document form according to the present invention is described in detail below.

As shown in FIG. 2, the proposed sentence extraction method consists of two stages: a preliminary sentence extraction process (10) that, in order to make local document form information available, extracts from the document a preliminary sentence based on punctuation information and the context information of the punctuation while storing information on the local document form; and an actual sentence extraction process (20) that extracts the sentence actually to be used, using context information together with the local document form information stored during the preliminary sentence extraction process.

In the preliminary sentence extraction process (10), a preliminary sentence is extracted in the form of a token list while information on the local document form is extracted and stored.

As shown in FIG. 2, the local document form information (33) extracted in the preliminary sentence extraction process (10) is used in the actual sentence extraction process (20).

The sentence separation rules (31) based on context information, on the other hand, are used in both the preliminary sentence extraction process (10) and the actual sentence extraction process (20).

In the actual sentence extraction process (20), context information (34) is extracted for each token so that the rules can be applied.

The preliminary sentence extraction process (10) is described in detail below.

First, one character is read from the document and checked to see whether it is a line feed. If it is, document form information for the line read so far is extracted and stored in the local document form information (33).

The stored information concerns the form of the current line, not its language: how many blanks there are at the start of the current line, how many characters the current line contains, the position of the current line's line feed, the position of the current line's first token, and so on.

This information, extracted line by line, is what represents the local document form.
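The per-line information listed above could be held in a small record. The following is a minimal sketch only; the names `LineForm` and `line_form` are hypothetical, and counting only non-blank characters as "characters" is an assumption, since the text does not say how blanks are counted.

```python
from dataclasses import dataclass


@dataclass
class LineForm:
    """Form (not language) information for one line; field names are illustrative."""
    leading_blanks: int     # number of blanks at the start of the line
    char_count: int         # number of non-blank characters (assumed definition)
    linefeed_pos: int       # column at which the line feed occurs
    first_token_index: int  # position of the line's first token in the token list


def line_form(line: str, first_token_index: int = 0) -> LineForm:
    """Collect the form information of a single line (given without its line feed)."""
    return LineForm(
        leading_blanks=len(line) - len(line.lstrip(" ")),
        char_count=sum(1 for ch in line if not ch.isspace()),
        linefeed_pos=len(line),
        first_token_index=first_token_index,
    )


forms = [line_form(text_line) for text_line in "  Title\n\nBody text here.".split("\n")]
```

Here `forms[0]` describes an indented, short line and `forms[1]` an empty one, which is the kind of evidence the second stage consults when a line has no punctuation.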

If the character read is not a line feed, it is checked to see whether it is a blank. If it is a blank, the blank information of the token is updated.

This blank information is used to assemble word units (eojeol) when the token list produced by sentence extraction is passed on to language analysis, and it is also used to restore the original document from the token list.

Because a line feed is itself a kind of blank, when the character read is a line feed, the document form information for the line read so far is first extracted and stored in the local document form information (33), and the blank information of the token is then updated in the same way as for any other blank.

If the character read is not a blank, it is checked to see whether it is a symbol. If it is a symbol, or if it is of a different type from the previously read character, a new token is formed.

Otherwise, the character just read is appended to the string of the last token in the token list, and it is then determined whether the end of the sentence has been reached.

As described above, the preliminary sentence extraction process extracts sentences on the basis of punctuation.

That is, extraction of a preliminary sentence ends only when a punctuation mark is encountered.

However, a punctuation mark does not always mean the end of a sentence.

As sentence (2) of FIG. 1 shows, the question mark (?) in "어제밤에는 어디 갔었읍던교?" ("Where did you go last night?") is not the end of the sentence; the character that follows it () is the end of sentence (2).

Furthermore, the first word of the line following sentence (2), "하며" ("saying"), shows that sentence (1) does not form a sentence on its own but is a direct quotation within sentence (3).

If the question mark (?) were judged to be the end of a sentence, the next sentence would begin with "하며", which is an awkward way to start a sentence.

Therefore, the question mark (~?) of sentence (2) must not be judged to be the end of a sentence.

To solve this problem, whenever a punctuation mark appears, the context before and after it must be used to decide whether it is the end of a sentence.

The sentence separation rules (31) that use context information related to punctuation are as follows.

A punctuation mark that comes at the end of a direct quotation or appears inside parentheses is not treated as a sentence boundary.

For the period (.) in particular, a period in an English abbreviation, a period in the middle of an ellipsis, a period serving as a decimal point, and a period following the item number at the head of a sentence are likewise not treated as sentence boundaries.

An ellipsis is not the end of a sentence unless it is used to cut the sentence short.
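The punctuation rules above can be approximated with simple character-level checks. The sketch below is an illustration under stated assumptions, not the patented rule set: the function name `is_sentence_end` is hypothetical, "inside a quotation" is approximated by counting unclosed quotation marks, an English abbreviation is approximated as a single Latin letter before the period, and the final period of an ellipsis is not given special treatment.

```python
import re


def is_sentence_end(text: str, i: int) -> bool:
    """Heuristically decide whether the punctuation mark at text[i] ends a sentence."""
    ch = text[i]
    if ch not in ".?!":
        return False
    before, after = text[:i], text[i + 1:]

    # Punctuation inside an unclosed direct quotation or parenthesis is not a boundary.
    if before.count('"') % 2 == 1 or before.count("“") > before.count("”"):
        return False
    if before.count("(") > before.count(")"):
        return False

    if ch == ".":
        prev_ch, next_ch = before[-1:], after[:1]
        if next_ch == ".":                            # middle of an ellipsis
            return False
        if prev_ch.isdigit() and next_ch.isdigit():   # decimal point
            return False
        line_start = before.rsplit("\n", 1)[-1]
        if re.fullmatch(r"\s*\d+", line_start):       # item number such as "1."
            return False
        if re.search(r"(?:^|[^A-Za-z])[A-Za-z]$", before):  # abbreviation such as "U.S."
            return False
    return True


quoted = 'He asked "Where did you go?" and left.'
print(is_sentence_end(quoted, quoted.index("?")))  # question mark inside the quotation
print(is_sentence_end(quoted, len(quoted) - 1))    # final period
```

A production version would need morphological context to tell a sentence-shortening ellipsis from other uses, which is beyond what character tests can decide.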

Expressed as a flowchart, as shown in FIG. 3, the above operation comprises:
a first step (11) of reading one character from the document;
a second step (12) of determining whether the character read is a line feed and, if so, extracting document form information for the line read so far and storing it in the local document form information;
a third step (13) of determining, if the character read is not a line feed, whether it is a blank;
a fourth step (14) of updating the blank information of the token if the character read is a blank and, because a line feed is also a kind of blank, of updating the blank information of the token in the same way after the document form information has been stored when the character read is a line feed;
a fifth step (15) of determining, if the character read is not a blank, whether it is a symbol;
a sixth step (16) of determining, if the character read is not a symbol, whether it is of a different type from the previous character;
a seventh step (17) of forming a new token if the character read is a symbol or is of a different type from the previous character;
an eighth step (18) of appending the character just read to the string of the last token in the token list if it is not of a different type from the previous character; and
a ninth step (19) of determining whether the end of the sentence has been reached, repeating steps (11 to 18) if it has not, and extracting the preliminary sentence and the local document form information if it has.
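The character loop of steps 11 to 19 can be sketched as follows. This is a simplified illustration with hypothetical names (`Token`, `char_kind`, `preliminary_sentences`): it records only each line's length instead of the full form information, it ends a preliminary sentence at every period, question mark, or exclamation mark without applying the context rules, and it assumes that a blank always closes the current token.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    kind: str              # "hangul", "latin", "digit" or "symbol"
    blanks_after: int = 0  # blanks (including line feeds) that follow the token


def char_kind(ch: str) -> str:
    """Coarse character type used to decide where one token ends and the next begins."""
    if ch.isdigit():
        return "digit"
    if "가" <= ch <= "힣":   # Hangul syllables block
        return "hangul"
    if ch.isalpha():
        return "latin"       # any other letter, for simplicity
    return "symbol"


def preliminary_sentences(text: str):
    sentences, tokens, line_lengths = [], [], []
    line_start, after_blank = 0, True
    for pos, ch in enumerate(text):               # step 11
        if ch == "\n":                            # step 12: store line form
            line_lengths.append(pos - line_start)
            line_start = pos + 1
        if ch.isspace():                          # steps 13 and 14: update blank info
            target = tokens or (sentences[-1] if sentences else [])
            if target:
                target[-1].blanks_after += 1
            after_blank = True
            continue
        kind = char_kind(ch)                      # steps 15 and 16
        if kind == "symbol" or after_blank or not tokens or tokens[-1].kind != kind:
            tokens.append(Token(ch, kind))        # step 17: start a new token
        else:
            tokens[-1].text += ch                 # step 18: extend the last token
        after_blank = False
        if ch in ".?!":                           # step 19, without context rules
            sentences.append(tokens)
            tokens = []
    if tokens:
        sentences.append(tokens)
    return sentences, line_lengths


sents, lengths = preliminary_sentences("가나 abc12.\n다음 줄")
print([[t.text for t in s] for s in sents])
```

Because each sentence is already a list of typed tokens, a later stage can work on tokens directly rather than re-scanning a character string.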

The process of extracting actual sentences from the preliminary sentences obtained through the preliminary sentence extraction process (10) is as follows.

First, one token is read from the token list of the preliminary sentence.

If no tokens remain in the preliminary sentence, the preliminary sentence extraction process (10) is performed to extract, as described above, a preliminary sentence and its local document form.

If the token read is the end of a line, the local document form information (33) is examined to determine whether the current line is the end of a sentence.

In line (1) of FIG. 1, "뽕" is the last token of the line; line (1) has no punctuation at its end, yet it forms an independent sentence.

This can be determined using the local document form information (33) stored during preliminary sentence extraction.

That is, line (1) is extracted as an independent sentence because the line after it is blank and because, judging from its number of characters, the position of its line feed, and its number of leading blanks, its text occupies only a small part of the document width.

The rules that use local document form information to extract lines without punctuation as sentences are as follows.

A line with no characters at all, or a line consisting only of a single repeated symbol, is regarded as a line that does not form a sentence.

Whether the current line is regarded as a sentence is decided by considering the proportion of the document width that it occupies.

If the next line carries information indicating the start of a sentence, the current line is regarded as the last line of a sentence and the sentence is extracted.

Any other line is treated as belonging to the same sentence as the line that follows it.
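The four line rules above might be combined as in the sketch below. Several details are assumptions made for illustration and are not stated in the text: the function name `classify_line`, the 0.6 width threshold, and the choice of a blank or indented next line as the "information indicating the start of a sentence".

```python
from typing import Optional


def classify_line(line: str, next_line: Optional[str], doc_width: int,
                  narrow_ratio: float = 0.6) -> str:
    """Return "skip", "end" or "continue" for a line that has no closing punctuation."""
    body = line.strip()
    # Rule 1: empty lines and lines made of one repeated symbol form no sentence.
    if not body or (len(set(body)) == 1 and not body[0].isalnum()):
        return "skip"
    # Rule 2: a line using only a small share of the document width ends a sentence.
    if len(line.rstrip()) / doc_width < narrow_ratio:
        return "end"
    # Rule 3: the next line signals a new sentence (assumed: missing, blank or indented).
    if next_line is None or not next_line.strip() or next_line[:1].isspace():
        return "end"
    # Rule 4: otherwise the line continues into the next line.
    return "continue"


print(classify_line("    Title", "", 40))
print(classify_line("a" * 38, "b" * 38, 40))
```

The threshold would have to be tuned per document, since the useful value depends on how the text was wrapped.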

A sentence that ends with a punctuation mark has already been treated as the end of a preliminary sentence during preliminary sentence extraction, so no separate handling is needed; it is enough to check whether the token just read is the last token of the preliminary sentence.

If the token list read so far is too long, that is, if the sentence being extracted is too long, the sentence must be cut at a suitable position.

To this end, context information (34) is extracted for the token preceding the current token, and it is determined whether that position is a suitable place to split the sentence.

The extracted context information (34) includes the linguistic information of the neighboring tokens as well as that of the current token.

Sentence (3) of FIG. 1 is in fact a single sentence. However, if it is extracted whole and used as input to a document interpreter, the correct sentence structure is hard to extract; and if it is used as input to a speech synthesizer in particular, the synthesizer cannot generate natural prosody.

The most suitable place to split a sentence is where the sentence divides most strongly in structure or meaning.

With this in mind, the position right after "헤치면서", just before line (2) of FIG. 1, is the most suitable split position.

This follows from the context information (34) that the last morpheme of the token "헤치면서" is a subordinating connective ending and that the next token is an opening quotation mark.

The context information used to split long sentences at suitable positions is as follows.

For the current token and its neighboring tokens, the method uses subordinating/coordinating connective ending information, whether there is an idiomatic expression that functions like a conjunctive adverb, such as "-에 따르면" ("according to"), temporal noun information, particle information, noun information, and symbol information.
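Splitting an overlong sentence at a connective ending can be sketched as below. The sketch assumes that each token already carries a tag for its last morpheme; the tag names, the `Tok` record, and the `split_long` function are hypothetical, and only the connective-ending cue is implemented, whereas the text also lists idioms, temporal nouns, particles, nouns, and symbols. The placeholder tokens "A", "B", and "C" are invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tok:
    text: str
    last_morph: str  # hypothetical tag: "conn_ending", "particle", "noun", "symbol", ...


def split_long(tokens: List[Tok], max_tokens: int) -> List[List[Tok]]:
    """Split a token list once it reaches max_tokens, but only after a connective ending."""
    pieces, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        has_next = i + 1 < len(tokens)
        # Context of the token just added decides whether the sentence may be cut here.
        if len(current) >= max_tokens and has_next and tok.last_morph == "conn_ending":
            pieces.append(current)
            current = []
    if current:
        pieces.append(current)
    return pieces


sample = [Tok("A", "noun"), Tok("B", "particle"), Tok("헤치면서", "conn_ending"),
          Tok('"', "symbol"), Tok("C", "noun")]
print([[t.text for t in piece] for piece in split_long(sample, 3)])
```

Waiting for a connective ending instead of cutting exactly at the length limit keeps each piece structurally complete, which matters most when the pieces are handed to a speech synthesizer.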

Expressed as a flowchart, as shown in FIG. 4, the above operation comprises:
a first step (21) of determining whether any tokens remain in the preliminary sentence;
a second step (22) of performing the preliminary sentence extraction process (10) if no tokens remain in the preliminary sentence, and of reading one token from the token list of the preliminary sentence if tokens remain;
a third step (23) of determining whether the token read is the end of a line;
a fourth step (24) of examining, if the token read is the end of a line, the local document form information to determine whether the current line is the end of a sentence;
a fifth step (25) of checking, if the token read is not the end of a line or if the local document form information shows that the current line is not the end of a sentence, whether the token just read is the last token of the preliminary sentence;
a sixth step (26) of determining, if the token just read is the last token of the preliminary sentence, whether the sentence being extracted is too long;
a seventh step (27) of extracting, if the sentence being extracted is too long, context information for the token preceding the current token in order to cut the sentence at a suitable position; and
an eighth step (28) of determining whether the position is a suitable sentence split position and, if so, extracting the token list of the sentence and the partial morphological analysis result.

As one embodiment of the present invention, the sentence extraction method operating on the above principle was applied to a speech synthesizer, as shown in FIG. 5, and tested.

(40) uses the previous sentence extraction method and data structure: a sentence is first extracted as a character string, and a token list is then generated from it.

(41) uses the sentence extraction method proposed in the present invention: because the data structure of a sentence is a token list rather than a character string, processing proceeds directly to the morphological analysis stage.

As described in detail above, the present invention extracts sentences through a two-stage process, a preliminary sentence extraction process and an actual sentence extraction process, taking context information and local document form into account, and thereby solves the problems of sentence extraction based on punctuation information alone. It can extract correct sentences from text that has no punctuation, it can correctly handle sentences in which punctuation is used with another meaning, and it can split a sentence that is too long into sentences of suitable length at suitable positions. Because it extracts correct sentences of suitable length, it can serve as a preprocessor for any document interpreter that takes sentences as input and can help improve the performance of such interpreters. It can also improve the naturalness of an unrestricted speech synthesizer that reads documents aloud sentence by sentence.

Claims

A sentence extraction method using context information and local document form, comprising: a first step of generating tokens from characters read from a document, using punctuation information and context information related to the punctuation;

a second step of extracting a preliminary sentence in the form of a token list from the tokens generated in the first step;

a third step of storing, as local document form information, the document form information extracted for each line of the preliminary sentence, according to the punctuation information present in the preliminary sentence extracted in the second step; and

a fourth step of separating the preliminary sentence at a predetermined position, using the context information of each token read from the preliminary sentence extracted in the second step and the local document form information stored in the third step, and extracting an actual sentence in the form of a token list consisting of the tokens read from the preliminary sentence up to that position.

The method of claim 1, wherein the second step comprises:

step 2-1 of modifying the blank information of the token if the character read from the document is a line feed or a blank;

step 2-2 of forming a new token if the character read is a symbol or is of a different type from the previous character;

step 2-3 of appending the character just read to the string of the last token in the token list if the character read is not of a different type from the previous character; and

step 2-4 of repeating steps 2-1 to 2-3 until the end of the sentence is reached and, at the end of the sentence, extracting the preliminary sentence in the form of the token list generated in step 2-3.

The method of claim 1, wherein the fourth step comprises:

step 4-1 of adding the token read from the token list of the extracted preliminary sentence to the token list of the actual sentence;

step 4-2 of repeating step 4-1 until the token read reaches the last token of the preliminary sentence, if the token read corresponds to the end of a line and examination of the local document form information shows that the current line does not correspond to the last line of a sentence;

step 4-3 of extracting, while step 4-2 is performed, context information for the token preceding the token read and determining a separation position for the preliminary sentence, if the length of the preliminary sentence is greater than or equal to a predetermined length; and

step 4-4 of extracting as the actual sentence, if the token read corresponds to the last token of the preliminary sentence while step 4-2 is performed, the token list generated in step 4-1, according to the separation position determined in step 4-3.