KR100303171B1 - Morphological and syntactic analysis method using morpheme connection graph

Info

Publication number: KR100303171B1
Application number: KR1019990044750A
Authority: KR
Inventors: 이근배; 김준석; 심준혁
Original assignee: 정명식; 학교법인 포항공과대학교
Priority date: 1999-10-15
Filing date: 1999-10-15
Publication date: 2001-11-02
Also published as: KR20010037309A

Abstract

The present invention relates to natural language processing, including morphological analysis, part-of-speech tagging, and syntactic analysis.

In the morphological analysis process according to the present invention, a morpheme connection node is defined as the data structure representing one morpheme, and these nodes are linked to one another to form a morpheme connection graph.

According to the present invention, the multiple morphological analysis candidates that arise in Korean language processing can be represented as a graph. This reduces the memory wasted when the candidates are represented as conventional N-linked lists and allows the search process to be carried out efficiently.

Description

Morphological and syntactic analysis method using morpheme connection graph

The present invention relates to natural language processing, including morphological analysis, part-of-speech tagging, and syntactic analysis.

In the prior art, the multiple morphological analysis candidates generated during natural language processing are represented as N-linked lists. The large number of linked lists requires a large amount of memory and also reduces the efficiency of the search process.

The present invention was devised to solve the above problem. Its object is to provide a storage medium on which is recorded a morpheme connection graph that effectively represents the intermediate products generated during morphological analysis of natural language, so that morphological and syntactic analysis can be performed efficiently, and to provide a morphological analysis method using that morpheme connection graph.

FIG. 1 shows the components of a morpheme connection node, the unit from which the morpheme connection graph according to the present invention is built.

FIG. 2 shows an example of the process of generating a morpheme connection graph by joining morpheme connection nodes.

FIG. 3 shows the morphological analysis process using the morpheme connection graph according to the present invention.

FIG. 4a shows an example of the use of the morpheme connection graph in the morphological analysis process, and FIG. 4b shows the N-best morphological analysis results.

FIG. 5a shows an example of the use of the morpheme connection graph in the part-of-speech tagging process, and FIG. 5b shows the 1-best tagging result.

FIG. 6 shows an example of triangular table packing at the initialization of syntactic analysis.

To achieve the above object, the present invention provides a computer-readable recording medium on which is recorded a morpheme connection node constituting a morpheme connection graph that represents the intermediate products generated during morphological analysis of a natural language input sentence, wherein the morpheme connection node comprises: a surface information field storing the surface form of one or more morphemes; a dictionary information field storing the part of speech, base form, allomorph, and connection information of the one or more morphemes; a phoneme-sequence start number field and a phoneme-sequence end number field which, when the input sentence is divided into sequentially numbered phonemes, indicate the numbers of the first and last phonemes of the one or more morphemes, respectively; a morpheme start number field and a morpheme end number field which, when the input sentence is divided into sequentially numbered morphemes, indicate the numbers of the first and last of the one or more morphemes, respectively; a morpheme probability field storing the lexical probability value of the one or more morphemes; a cumulative probability field storing a cumulative probability value calculated from the lexical probability value, the contextual probability value, and the syllable trigram probability value of the one or more morphemes; and a previous-nodes pointer field storing pointers to all morpheme connection nodes that immediately precede and connect to the one or more morphemes, together with a next-nodes pointer field storing pointers to all morpheme connection nodes that immediately follow and connect to them. The previous-nodes pointer field and the next-nodes pointer field are organized as priority queues ordered by the probability values of the morpheme connection nodes, so that when N-best part-of-speech tagging results are required, entries are taken from the priority queue one at a time and N part-of-speech-tagged morpheme sequences are output.

In the computer-readable recording medium on which the morpheme connection node is recorded, the morpheme connection node further comprises: a free-memory pointer field storing a pointer that lets the memory manager reclaim the memory space occupied by the node when the node fails the connection check; a previous-morpheme pointer field storing a pointer to the immediately preceding morpheme connection node whose part of speech was determined by Viterbi search, so that the 1-best part-of-speech-tagged morpheme connection nodes can be found; a registered/unregistered information field storing information that distinguishes whether the one or more morphemes are words registered in the dictionary or unregistered words; a serial/parallel tag information field storing information on parallel tags, in which morphemes with different parts of speech are analyzed as a higher-level part of speech, and on serial tags, in which several morphemes are combined and analyzed as a single morpheme; a morpheme count information field storing the number of morphemes among the one or more morphemes; a word-phrase start information field storing whether the one or more morphemes are located at the start of a word phrase (eojeol); and a connection flag field storing the result of the connection check performed using the connection check table.

To achieve the other object above, the method according to the present invention for morphologically analyzing a natural language input sentence using a morpheme connection graph comprises the steps of: (a) constructing candidate morpheme connection nodes for each morpheme of the input sentence using the information in a morpheme dictionary and a morpheme pattern dictionary, checking whether each candidate node connects using the information in a connection table, adding the candidate node to the morpheme connection graph if it connects, and handing the memory space allocated to the candidate node back to the memory manager if it does not, thereby generating an initial morpheme connection graph; (b) performing a Viterbi search over the candidate morpheme connection nodes added to the initial morpheme connection graph, using lexical probability values, contextual probability values, and syllable trigram probability values, to calculate cumulative probability values, and using the calculated cumulative probability values to reduce the number of candidate nodes, thereby generating a filtered morpheme connection graph; (c) repeating steps (a) and (b) for every morpheme of the input sentence to generate a tagged morpheme connection graph in which the optimal path through the filtered morpheme connection graph is recorded; and (d) correcting the tagging errors of the tagged morpheme connection graph using previously learned error correction rules, thereby generating an error-corrected morpheme connection graph.

Morphological analysis is the basic step of Korean language processing. It separates each word phrase (eojeol) of the input sentence into morphemes, assigns candidate parts of speech from the dictionary, checks the connections between them using a connection table, and uses various kinds of statistical information to assign the appropriate part of speech to each morpheme, thereby decomposing Korean text into its smallest meaningful units. In the present invention, the data structure that represents one morpheme in this process is defined as a morpheme connection node, and the structure in which these nodes are linked to one another is defined as a morpheme connection graph.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

First, consider the morpheme connection node, the basic unit of the morpheme connection graph. As shown in FIG. 1, a morpheme connection node consists of the following elements.

<free-memory pointer> is the pointer used to hand the space occupied by a morpheme connection node back to the memory manager when the node fails the connection check and no longer has any reason to remain in memory. In other words, it is a pointer used for efficient memory management.

Next, <previous-morpheme pointer> points to the immediately preceding morpheme connection node whose part of speech was determined by Viterbi search. It is used to obtain the 1-best part-of-speech-tagged morpheme connection nodes.

<surface information> stores the surface form of the input morpheme.

<registered/unregistered information> stores information that distinguishes whether the morpheme of the current node is a word registered in the dictionary or an unregistered word.

<serial/parallel tag information> stores information on parallel and serial tags. Tags such as {MPN (proper noun, person name), MPC (proper noun, country name), MPP (proper noun, place name), MPO (proper noun, other)}, all of which can be treated as MP (proper noun), are called parallel tags. A case such as '(ㄹ#수#있)', in which several morphemes are combined and analyzed as a single morpheme, is called a serial tag.

<morpheme count information> stores the number of morphemes contained in the morpheme connection node. It is normally 1, but in a case such as '(ㄹ#수#있)' it is 5.

<dictionary information> stores the information found in the morpheme dictionary and the morpheme pattern dictionary. These dictionaries record the part of speech, base form, allomorph, and connection information of each morpheme. This content is recorded in <dictionary information> and used to construct the morpheme connection graph.

<word-phrase start information> stores whether the current morpheme connection node is at the start of a word phrase (eojeol). <connection flag> records whether the node can connect, as determined by the connection check performed with the connection check table.

<phoneme-sequence start number> and <phoneme-sequence end number> record where the morpheme lies in the input sentence: with the sentence decomposed into a sequentially numbered phoneme sequence, they hold the numbers of the morpheme's first and last phonemes.

<morpheme start number> and <morpheme end number> store the range of morpheme positions, from which morpheme to which morpheme of the input sentence, that the current node covers.

<morpheme probability> stores the lexical probability value of the current morpheme. <cumulative probability> stores the cumulative probability value calculated during Viterbi search from the lexical probability value, contextual probability value, and syllable trigram probability value of the current node.

Finally, <previous-nodes pointer> and <next-nodes pointer> store pointers to all morpheme connection nodes that come immediately before and after the current node. These fields are organized as priority queues ordered by the probability values of the morpheme connection nodes, so when N-best part-of-speech tagging results are required, entries can be taken from the queue one at a time to output N part-of-speech-tagged morpheme sequences.
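The fields listed above can be pictured as a single record type. The following is only a minimal sketch: the class name `MorphemeNode`, the attribute names, the helpers `link` and `pop_best`, and the choice to order the queues by cumulative probability are all assumptions made for illustration, since the patent names the fields but does not give a concrete layout.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

_tie_breaker = itertools.count()  # keeps heap entries comparable when scores tie

QueueEntry = Tuple[float, int, "MorphemeNode"]


@dataclass(eq=False)
class MorphemeNode:
    """Illustrative layout of one morpheme connection node (names are assumed)."""
    surface: str                      # <surface information>
    dictionary_info: str              # <dictionary information>
    phoneme_start: int                # <phoneme-sequence start number>
    phoneme_end: int                  # <phoneme-sequence end number>
    morpheme_start: int               # <morpheme start number>
    morpheme_end: int                 # <morpheme end number>
    morpheme_prob: float              # <morpheme probability> (lexical)
    cumulative_prob: float = 0.0      # <cumulative probability>
    registered: bool = True           # <registered/unregistered information>
    serial_parallel_tag: Optional[str] = None   # <serial/parallel tag information>
    morpheme_count: int = 1           # <morpheme count information>
    starts_word_phrase: bool = False  # <word-phrase start information>
    connection_flag: bool = False     # <connection flag>
    free_pointer: Optional[object] = None        # <free-memory pointer>
    prev_morpheme: Optional["MorphemeNode"] = None  # <previous-morpheme pointer>
    # Priority queues of neighbours; entries are (-probability, tie, node).
    prev_nodes: List[QueueEntry] = field(default_factory=list)
    next_nodes: List[QueueEntry] = field(default_factory=list)


def link(prev: MorphemeNode, nxt: MorphemeNode) -> None:
    """Connect two nodes in both directions (assumed ordering: cumulative probability)."""
    heapq.heappush(prev.next_nodes, (-nxt.cumulative_prob, next(_tie_breaker), nxt))
    heapq.heappush(nxt.prev_nodes, (-prev.cumulative_prob, next(_tie_breaker), prev))


def pop_best(queue: List[QueueEntry]) -> MorphemeNode:
    """Take the most probable neighbour out of a pointer queue."""
    return heapq.heappop(queue)[2]
```

Calling `pop_best` repeatedly on a node's `prev_nodes` yields its predecessors from most to least probable, which is the access pattern the description associates with N-best output.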

Table 1. Phoneme numbering of the input sentence '나는 밥을 먹었습니다'

Number:   0  1  2  3  4  5  6  7  8  ...
Phoneme:  $  ㄴ ㅏ ㄴ ㅡ ㄴ #  ㅂ ㅏ ...

Next, the morpheme connection graph is examined through an example. When the sentence '나는 밥을 먹었습니다' ("I ate a meal") is input, a morpheme connection graph in which the connection nodes are linked to one another is created, as shown in FIG. 2. Consider the contents of the node for '는/jS (auxiliary particle)'. First, because the current node passed the connection check and must remain in memory, <free-memory pointer> holds NULL. <previous-morpheme pointer> points to the '나/T (pronoun)' node, linking the '나/T (pronoun)' node selected by Viterbi search to the '는/jS (auxiliary particle)' node in the graph. <surface information> holds '는', and because '는/jS (auxiliary particle)' is registered in the dictionary, <registered/unregistered information> records 'registered'. '는/jS (auxiliary particle)' is neither a parallel nor a serial morpheme, so nothing is recorded in <serial/parallel tag information>. <morpheme count information> holds 1, and <dictionary information> holds the result of searching the morpheme dictionary and morpheme pattern dictionary, in the form tag<base form>(allomorph)[connection information] shown in FIG. 2. '는/jS (auxiliary particle)' is not the first morpheme of its word phrase, so <word-phrase start information> holds 'NO'; had it been the first morpheme, 'YES' would have been stored. Because the current node passed the connection check, <connection flag> holds 1; had it failed, 0 would have been recorded.

As shown in Table 1, when the sentence is decomposed into a phoneme sequence, '는/jS (auxiliary particle)' is recorded with start number 3 and end number 5. Because '는/jS (auxiliary particle)' is the second morpheme of the input sentence, <morpheme start number> and <morpheme end number> both hold 2. <morpheme probability> holds the lexical probability value learned in advance, and <cumulative probability> holds the cumulative probability value determined by Viterbi search. Finally, <previous-nodes pointer> and <next-nodes pointer> hold pointers to the preceding and following morpheme connection nodes, as shown in FIG. 2.
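The phoneme numbers quoted above can be reproduced with a short sketch. The patent does not say how the sentence is split into phonemes, so the Unicode arithmetic below (standard decomposition of precomposed Hangul syllables) and the function name `to_phonemes` are assumptions. The use of '$' for the sentence start at position 0 and '#' for a space is read off Table 1.

```python
# Compatibility jamo in the standard Unicode Hangul syllable order.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")            # 19 initial consonants
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")        # 21 vowels
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 28 finals


def to_phonemes(sentence):
    """Return the numbered phoneme sequence; list index = phoneme number."""
    phonemes = ["$"]                      # position 0 marks the sentence start
    for ch in sentence:
        if ch == " ":
            phonemes.append("#")          # word-phrase boundary
        elif "가" <= ch <= "힣":           # precomposed Hangul syllable
            offset = ord(ch) - 0xAC00
            lead, rest = divmod(offset, 21 * 28)
            vowel, tail = divmod(rest, 28)
            phonemes.append(CHOSEONG[lead])
            phonemes.append(JUNGSEONG[vowel])
            if tail:                      # tail 0 means no final consonant
                phonemes.append(JONGSEONG[tail])
        else:
            phonemes.append(ch)           # leave other characters untouched
    return phonemes


phonemes = to_phonemes("나는 밥을 먹었습니다")
neun = phonemes[3:6]   # phonemes 3 to 5, the span recorded for '는'
bap = phonemes[7:10]   # phonemes 7 to 9, the span recorded for '밥'
```

With this numbering, the slice for phonemes 3 to 5 spells '는' and the slice for phonemes 7 to 9 spells '밥', matching the start and end numbers given in the description.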

FIG. 3 shows the morphological analysis process using the morpheme connection graph. Here, morphological analysis is used in the broad sense, covering not only morphological analysis in the narrow sense but also part-of-speech tagging.

The morphological analyzer (30) analyzes each morpheme of the input sentence to generate morphological analysis candidates and builds each candidate as a morpheme connection node. In doing so, it searches the morpheme dictionary and the morpheme pattern dictionary to obtain the dictionary information and fills in the contents of each morpheme connection node. Referring to the connection table, the morphological analyzer (30) adds each node that connects to the morpheme connection graph and hands the memory space of each node that does not connect back to the memory manager, thereby generating the initial morpheme connection graph.
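The interplay between the connection check and the memory manager can be sketched as follows. This is only an illustration under stated assumptions: `NodePool` and `extend_graph` are hypothetical names, nodes are plain dictionaries, the connection table is modelled as a set of (left tag, right tag) pairs, and a simple free list stands in for the memory manager, whose actual design the patent does not describe.

```python
from typing import Dict, Iterable, List, Set, Tuple


class NodePool:
    """Toy memory manager: released nodes go on a free list and are reused."""

    def __init__(self) -> None:
        self._free: List[Dict] = []
        self.created = 0                  # number of nodes ever allocated

    def acquire(self, surface: str, tag: str) -> Dict:
        if self._free:
            node = self._free.pop()       # reuse a reclaimed node
            node.clear()
        else:
            node = {}
            self.created += 1
        node.update(surface=surface, tag=tag, connected=False)
        return node

    def release(self, node: Dict) -> None:
        self._free.append(node)


def extend_graph(frontier: List[Dict],
                 candidates: Iterable[Tuple[str, str]],
                 table: Set[Tuple[str, str]],
                 pool: NodePool) -> List[Dict]:
    """Keep candidates that connect to some frontier node; reclaim the rest."""
    kept = []
    for surface, tag in candidates:
        node = pool.acquire(surface, tag)
        if any((prev["tag"], tag) in table for prev in frontier):
            node["connected"] = True      # plays the role of <connection flag>
            kept.append(node)
        else:
            pool.release(node)            # hand the space back
    return kept
```

In this sketch a rejected candidate costs no lasting allocation, because its storage is reused by the next candidate that is built.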

Next, the statistical part-of-speech tagger (32) performs a Viterbi search over the morpheme connection nodes of the initial morpheme connection graph, using lexical probability values, contextual probability values, and syllable trigram probability values, to calculate cumulative probability values. It then uses these values to prune the graph, reducing the number of candidate morpheme connection nodes.
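One step of such a search with pruning might look like the sketch below. The patent does not state how the three probability values are combined or how the pruning threshold is chosen, so summing log probabilities and keeping a fixed number of best candidates (`beam_width`) are assumptions, as are the function name `viterbi_step` and the dictionary keys.

```python
import math


def viterbi_step(prev_nodes, candidates, context_logp, beam_width):
    """Score candidates against the previous column and keep the best few.

    prev_nodes:   dicts with 'tag' and 'cum' (cumulative log probability)
    candidates:   dicts with 'tag', 'lex' (lexical log probability) and
                  'tri' (syllable trigram log probability)
    context_logp: {(prev_tag, tag): contextual log probability}
    """
    scored = []
    for cand in candidates:
        best_prev, best_score = None, -math.inf
        for prev in prev_nodes:
            trans = context_logp.get((prev["tag"], cand["tag"]))
            if trans is None:             # no such transition: skip
                continue
            score = prev["cum"] + trans + cand["lex"] + cand["tri"]
            if score > best_score:
                best_prev, best_score = prev, score
        if best_prev is not None:
            cand["cum"] = best_score              # <cumulative probability>
            cand["prev_morpheme"] = best_prev     # <previous-morpheme pointer>
            scored.append(cand)
    scored.sort(key=lambda n: n["cum"], reverse=True)
    return scored[:beam_width]            # pruning: drop low-scoring candidates
```

Each surviving candidate remembers its best predecessor, which is what later allows the 1-best path to be read back from the end of the sentence.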

If morphological analysis has reached the end of the sentence, the tagged morpheme connection graph is passed to the error corrector (34). Otherwise, the morphological analyzer (30) analyzes the next morpheme of the input sentence, and this is repeated until the end of the sentence.

When the end of the sentence is reached, the error corrector (34) uses previously learned error correction rules to find and correct tagging errors in the tagged morpheme connection graph, completing the final morpheme connection graph. The final morpheme connection graph is passed to the syntactic analyzer (36).

FIGS. 4a and 4b show, respectively, an example of the initial morpheme connection graph and of the morphological analysis results produced during morphological analysis by the morphological analyzer (30) of FIG. 3. The morpheme connection graph is formed by creating one node for every morphological analysis candidate that can be segmented at the phoneme level and linking these nodes together. FIG. 4a shows the initial morpheme connection graph for the input sentence '나는 밥을 먹었습니다'. According to FIG. 4a, when the word phrase '나는' is analyzed to produce candidates, three connections are possible: '날다/DR (regular verb) + 는/eCNMG (adnominal ending)', '나다/DI (irregular verb) + 는/eCNMG (adnominal ending)', and '나/T (pronoun) + 는/jS (auxiliary particle)'. Unlike a trellis, the initial morpheme connection graph according to the present invention admits as candidates only those links that can be connected according to a threshold probability value and the connection check table. Candidates that would arise in a trellis, such as '날다/DR (regular verb) + 는/jS (auxiliary particle)', '나다/DI (irregular verb) + 는/jS (auxiliary particle)', and '나/T (pronoun) + 는/eCNMG (adnominal ending)', are therefore filtered out by the connection check before they are created. FIG. 4b shows the morphological analysis results of POSTAG, the morphological analyzer of the Natural Language Processing Laboratory at Pohang University of Science and Technology, from which the morpheme connection graph is built. In the result table, the numbers in the square brackets on the left give the phoneme-level and syllable-level position information of each analyzed morpheme of the input sentence. For example, for the candidate '밥', analyzed as a registered word and a noun, the phoneme information covers the input from the 7th to the 9th phoneme counted from the beginning, and the syllable information covers the 4th syllable. For this input information, the morpheme connection graph creates a single node, '밥/MC', carrying the information MC<밥>(밥)[등|밥]. In other words, FIG. 4a is constructed as the morpheme connection graph that reflects the morphological analysis results of FIG. 4b.
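The difference from a full trellis can be made concrete with the '나는' example. The sketch below is illustrative only: the contents of `CONNECTION_TABLE` are inferred from the three connections the description lists as possible, the function name `connectable_links` is hypothetical, and the threshold probability value mentioned in the text is left out.

```python
from itertools import product

# Assumed connection table: only the tag pairs the description lists as possible.
CONNECTION_TABLE = {("DR", "eCNMG"), ("DI", "eCNMG"), ("T", "jS")}


def connectable_links(left, right, table):
    """Return only the (left, right) candidate pairs whose tags may connect."""
    return [(l, r) for l, r in product(left, right) if (l[1], r[1]) in table]


left = [("날다", "DR"), ("나다", "DI"), ("나", "T")]   # analyses of the first morpheme
right = [("는", "eCNMG"), ("는", "jS")]               # analyses of the second morpheme

trellis_links = list(product(left, right))            # a trellis would hold all 6 pairs
graph_links = connectable_links(left, right, CONNECTION_TABLE)  # the graph keeps 3
```

Half of the links a trellis would store are never created here, which is the saving the description attributes to checking connections before nodes are added.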

FIGS. 5a and 5b show, respectively, an example of the tagged morpheme connection graph and of the 1-best tagging result produced during part-of-speech tagging, in which the statistical part-of-speech tagger (32) of FIG. 3 extracts the most accurate analysis. The role of the statistical part-of-speech tagger (32) is to extract the most probable morphological analysis from the candidates in the morpheme connection graph that remain after filtering by the connection check and probability information. The part-of-speech tagger of the Natural Language Processing Laboratory at Pohang University of Science and Technology generates all possible paths through the morpheme connection graph of FIG. 4a, from the 'START' node to the 'END' node, and selects the optimal path among them. The path drawn in bold in FIG. 5a is the optimal path through the morpheme connection graph.
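Once the optimal path has been chosen, it can be read back by following each node's previous-morpheme pointer from the 'END' node. The sketch below assumes nodes are dictionaries with `surface`, `tag`, and `prev_morpheme` keys and that 'START' and 'END' are sentinel tags; these names and the helper `best_path` are illustrative, not taken from the patent.

```python
def best_path(end_node):
    """Follow previous-morpheme pointers back from END and return the 1-best tags."""
    path = []
    node = end_node
    while node is not None:
        if node["tag"] not in ("START", "END"):   # skip the sentinel nodes
            path.append(f'{node["surface"]}/{node["tag"]}')
        node = node.get("prev_morpheme")
    path.reverse()                                # pointers run right to left
    return path


# A hand-built fragment of a tagged graph for the start of the example sentence.
start = {"surface": "", "tag": "START", "prev_morpheme": None}
na = {"surface": "나", "tag": "T", "prev_morpheme": start}
neun = {"surface": "는", "tag": "jS", "prev_morpheme": na}
bap = {"surface": "밥", "tag": "MC", "prev_morpheme": neun}
end = {"surface": "", "tag": "END", "prev_morpheme": bap}

tagged = best_path(end)
```

Because only one pointer per node is followed, the walk visits each morpheme on the best path exactly once.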

FIG. 6 shows an example of the result of embedding, which takes as input the morphological analysis result for the example sentence '나는 밥을 먹었습니다' shown in FIG. 4a. FIG. 6 illustrates an incremental syntactic analysis process that embeds the morpheme connection graph as an intermediate structure. Parsing of the morpheme connection graph begins with a morpheme connection graph embedding step, which creates the initial syntactic information for the morphemes in the triangular table generated by chart parsing and determines the table index of each node of the morpheme connection graph. The figure shows the result after the original frame indices have been replaced by state indices along the longest morpheme path. After the index of each morpheme connection graph node has been changed, the syntactic categories of each morpheme are created in the corresponding cells of the triangular table. At this point, the table index information and morpheme connection information attached to each syntactic category are used to block syntactic combinations between categories that were not joined in the morphological analysis result.

The embodiments of the present invention described above can be written as a program executable on a computer and can be implemented on a general-purpose digital computer that runs the program from a computer-usable medium. Such media include magnetic storage media (for example, ROM, floppy disks, and hard disks), optically readable media (for example, CD-ROM and DVD), and carrier waves (for example, transmission over the Internet).

The recording medium contains program code that executes the following modules on a computer: (a) a module that constructs candidate morpheme connection nodes for each morpheme of the input sentence using the information in a morpheme dictionary and a morpheme pattern dictionary, checks whether each candidate node connects using the information in a connection table, adds the candidate node to the morpheme connection graph if it connects, and hands the memory space allocated to the candidate node back to the memory manager if it does not, thereby generating an initial morpheme connection graph; (b) a module that performs a Viterbi search over the candidate morpheme connection nodes added to the initial morpheme connection graph, using lexical probability values, contextual probability values, and syllable trigram probability values, to calculate cumulative probability values, and uses the calculated cumulative probability values to reduce the number of candidate nodes, thereby generating a filtered morpheme connection graph; (c) a module that repeats modules (a) and (b) for every morpheme of the input sentence to generate a tagged morpheme connection graph in which the optimal path through the filtered morpheme connection graph is recorded; and (d) a module that corrects the tagging errors of the tagged morpheme connection graph using previously learned error correction rules, thereby generating an error-corrected morpheme connection graph.

The functional modules for implementing the present invention as described above can be readily implemented by programmers skilled in the art to which the present invention pertains.

According to the present invention, the multiple morphological analysis candidates that arise in Korean language processing can be represented as a graph. This reduces the memory wasted when the candidates are represented as conventional N-linked lists and allows the search process to be carried out efficiently.

Claims

A computer-readable recording medium having recorded thereon a morpheme connection node that constitutes a morpheme connection graph representing intermediate results generated during morphological analysis of a natural-language input sentence, the morpheme connection node comprising:

A surface information field for storing the surface form of one or more morphemes;

A dictionary information field for storing the part of speech, base form, morphology, and connection information of the one or more morphemes;

A phoneme string start number field indicating, when the input sentence is divided into phonemes that are numbered sequentially, the number at which the first phoneme of the one or more morphemes is located, and a phoneme string end number field indicating the number at which the last phoneme of the one or more morphemes is located;

A morpheme start number field indicating, when the input sentence is divided into morphemes that are numbered sequentially, the number at which the first morpheme of the one or more morphemes is located, and a morpheme end number field indicating the number at which the last morpheme of the one or more morphemes is located;

A morpheme probability field for storing lexical probability values of the one or more morphemes;

A cumulative probability field for storing a cumulative probability value calculated using the lexical probability value, the context probability value, and the syllable trigram probability value of the one or more morphemes; and

A previous-nodes pointer field for storing pointers to all morpheme connection nodes that immediately precede and connect to the one or more morphemes, and a next-nodes pointer field for storing pointers to all morpheme connection nodes that immediately follow and connect to the one or more morphemes,

Wherein the previous-nodes pointer field and the next-nodes pointer field are organized as priority queues ordered by the probability values of the morpheme connection nodes, so that when N-best part-of-speech tagging results are to be obtained, the results are taken out of the priority queues one at a time.
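The node layout recited above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not the recorded format: the class name `MorphemeConnectionNode`, the helper functions `link` and `best_predecessors`, the use of a binary heap as the priority queue, and the choice to order entries by cumulative probability are all hypothetical.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count

_tie = count()  # unique tie-breaker so heap entries never compare nodes


@dataclass(eq=False)
class MorphemeConnectionNode:
    surface: str              # surface information field
    dictionary_info: dict     # part of speech, base form, connection info, ...
    phoneme_start: int        # phoneme string start number
    phoneme_end: int          # phoneme string end number
    morpheme_start: int       # morpheme start number
    morpheme_end: int         # morpheme end number
    morpheme_prob: float      # lexical probability of the morpheme(s)
    cumulative_prob: float = 0.0
    # Priority queues stored as heaps of (-probability, tie, node) entries.
    prev_nodes: list = field(default_factory=list)
    next_nodes: list = field(default_factory=list)


def link(left, right):
    """Connect two nodes; the priority is each node's cumulative probability
    at the time of linking (an assumption of this sketch)."""
    heapq.heappush(left.next_nodes, (-right.cumulative_prob, next(_tie), right))
    heapq.heappush(right.prev_nodes, (-left.cumulative_prob, next(_tie), left))


def best_predecessors(node, n):
    """Return up to n predecessors in descending probability order
    without modifying the stored queue."""
    return [entry[2] for entry in heapq.nsmallest(n, node.prev_nodes)]
```

Because the queue is kept ordered at insertion time, retrieving the next-best predecessor for an N-best result does not require re-sorting all connected nodes.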

The recording medium of claim 1, wherein the morpheme connection node further comprises:

A free memory space pointer field for storing a pointer that allows the memory manager to reclaim the memory space occupied by the morpheme connection node if the node fails the connection check;

A previous-morpheme pointer field for storing a pointer to the preceding morpheme connection node whose part of speech has been determined by a Viterbi search, used to find the 1-best part-of-speech-tagged morpheme sequence;

A registered/unregistered information field for storing information indicating whether the one or more morphemes are registered words or unregistered words;

A serial / parallel tag information field for storing information on a parallel tag in which morphemes of different parts of speech are analyzed as parts of speech of a higher concept and a serial tag in which a plurality of morphemes are combined and analyzed as one morpheme;

A morpheme number information field for storing the number of morphemes in the one or more morphemes;

A word start morpheme node information field for storing whether the one or more morphemes are located at the beginning of a word; and

A connection flag field for storing the result of a connection check performed using a connection check table.

A method of morphologically analyzing a natural-language input sentence using a morpheme connection graph, the method comprising:

(a) constructing candidate morpheme connection nodes for each morpheme unit of the input sentence using information from a morpheme dictionary and a morpheme pattern dictionary, checking whether each candidate morpheme connection node can be connected using information from a connection table, adding the candidate morpheme connection node to the morpheme connection graph if it connects, and otherwise handing the memory space allocated to the candidate morpheme connection node back to a memory manager, thereby generating an initial morpheme connection graph;

(b) performing a Viterbi search on the candidate morpheme connection nodes added to the initial morpheme connection graph, using lexical probability values, context probability values, and syllable trigram probability values to calculate cumulative probability values, and reducing the number of candidate morpheme connection nodes using the calculated cumulative probability values, thereby generating a filtered morpheme connection graph;

(c) repeating steps (a) and (b) for all morphemes of the input sentence to generate a tagged morpheme connection graph in which the optimal path through the filtered morpheme connection graph is recorded; and

(d) correcting tagging errors in the tagged morpheme connection graph using previously learned error correction rules, thereby generating an error-corrected morpheme connection graph.
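Steps (b) and (c) can be illustrated with a small Viterbi-style search in Python. This is a sketch under several assumptions that the claim does not state: the three probability values are combined as a simple product (summed in log space), the filtering keeps a fixed number of best candidates per position (`beam_width`), and the candidates are arranged in one layer per morpheme position. The function name `viterbi_prune` and all data values in the usage note are hypothetical.

```python
import math


def viterbi_prune(layers, context_prob, beam_width=2):
    """Return the best tag sequence through a layered candidate lattice.

    layers: list of positions; each position is a list of
            (tag, lexical_prob, syllable_trigram_prob) candidates.
    context_prob: dict mapping (previous_tag, tag) to a probability.
    """
    # Each survivor is (cumulative log probability, tag, back pointer).
    survivors = [(0.0, "BOS", None)]
    for layer in layers:
        scored = []
        for tag, lex, syl in layer:
            best = None
            for prev in survivors:
                ctx = context_prob.get((prev[1], tag), 0.0)
                if lex <= 0.0 or syl <= 0.0 or ctx <= 0.0:
                    continue  # impossible transition or candidate
                score = prev[0] + math.log(lex) + math.log(ctx) + math.log(syl)
                if best is None or score > best[0]:
                    best = (score, tag, prev)
            if best is not None:
                scored.append(best)
        # Filtering step: keep only the highest-scoring candidates.
        scored.sort(key=lambda s: s[0], reverse=True)
        survivors = scored[:beam_width]
        if not survivors:
            return []  # no connected analysis exists
    # Follow back pointers from the best final candidate.
    path = []
    node = survivors[0]
    while node is not None and node[1] != "BOS":
        path.append(node[1])
        node = node[2]
    return path[::-1]
```

For example, with `layers = [[("noun", 0.6, 0.9), ("verb", 0.4, 0.9)], [("particle", 0.7, 0.8), ("ending", 0.3, 0.8)]]` and a `context_prob` that allows only `BOS→noun`, `BOS→verb`, `noun→particle`, and `verb→ending`, the search selects the noun followed by the particle. The back pointers play the role of the recorded optimal path in the tagged graph.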

A computer-readable recording medium having recorded thereon a morphological analysis program that analyzes a natural-language input sentence using a morpheme connection graph, the program comprising:

(a) a module for constructing candidate morpheme connection nodes for each morpheme unit of the input sentence using information from a morpheme dictionary and a morpheme pattern dictionary, checking whether each candidate morpheme connection node can be connected using information from a connection table, adding the candidate morpheme connection node to the morpheme connection graph if it connects, and otherwise handing the memory space allocated to the candidate morpheme connection node back to a memory manager, thereby generating an initial morpheme connection graph;

(b) a module for performing a Viterbi search on the candidate morpheme connection nodes added to the initial morpheme connection graph, using lexical probability values and syllable trigram probability values to calculate cumulative probability values, and reducing the number of candidate morpheme connection nodes using the calculated cumulative probability values, thereby generating a filtered morpheme connection graph;

(c) a module for repeating modules (a) and (b) for all morphemes of the input sentence to generate a tagged morpheme connection graph in which the optimal path through the filtered morpheme connection graph is recorded; and

(d) a module for correcting tagging errors in the tagged morpheme connection graph using previously learned error correction rules, thereby generating an error-corrected morpheme connection graph.