KR20000038079A

KR20000038079A - Apparatus and method for statistical extraction of usage examples from a large corpus

Info

Publication number: KR20000038079A
Application number: KR1019980052941A
Authority: KR
Inventors: 정한민; 김태완; 심철민; 최승권; 여상화; 김영길; 박상규; 박세영
Original assignee: 정선종; 한국전자통신연구원
Priority date: 1998-12-03
Filing date: 1998-12-03
Publication date: 2000-07-05
Also published as: KR100283100B1

Abstract

PURPOSE: An apparatus and method for statistically extracting usage examples from a large corpus are provided, so that the extracted examples can be used to build basic knowledge in all fields of natural language processing. CONSTITUTION: The apparatus includes a unit that generates a base table for extracting usage examples from a corpus. Usage-example candidates are determined by a usage-example candidate determination unit. A reduced-table generation unit is used to manage memory and time efficiently.

Description

Statistical Usage-Example Extraction Means and Method for a Large Corpus

The present invention relates to a means and method for extracting continuous and discontinuous usage examples from a large corpus while using space and time efficiently. In particular, it determines usage-example candidates by comparing strings extracted from the corpus with other strings, determines continuous and discontinuous usage examples on a reduced table for efficiency in space and time, and finally removes unnecessary usage examples using pattern removal rules. The result is a usage-example extraction means and method that reflects the characteristics of a large corpus and can be used in all fields of natural language processing.

A usage-example extraction means is a means that extracts phrases and words, including compound nouns and idiomatic expressions, that occur repeatedly within a corpus and that a user considers to reflect the characteristics of that corpus. It does so efficiently, using a pattern table, a sorted pattern table, and the like to minimize unnecessary patterns and shorten extraction time.

In the field of natural language processing, a series of studies on automatic knowledge extraction has been under way since the early 1990s and has brought many technical advances. These techniques enabled fast and robust automatic extraction of predefined language patterns from real-world online text and from resources available over computer networks.

Phrase-level or pattern-level electronic dictionaries built from automatically extracted patterns can be useful for, among other things, handling compound units in machine translation systems. In particular, pattern-based translation presupposes the identification and automatic extraction of patterns that are actually used frequently. Outside Korea, methods have been proposed that exploit the similarity of words appearing in each language, as well as methods that apply character-based N-gram sorting to relatively small corpora. Within Korea, research has been conducted on a usage-example presenter that operates on word-phrase (eojeol) units.

Existing character-based pattern extraction has two drawbacks: it generates a large number of unnecessary patterns, and its tables become so large that extraction time is excessive relative to the size of the corpus.

An object of the present invention is to make usage-example extraction generally usable in natural language processing applications. To that end, it allows processing even when the target corpus is very large, manages efficiently the memory space and time that extraction requires, and moves beyond existing extraction centered on simple keywords (verb stems or nouns) so that continuous and discontinuous usage examples can be extracted.

To achieve the above object, the statistical usage-example extraction means for a large corpus according to the present invention comprises: means for generating a base table for extracting usage examples from a corpus; usage-example candidate determination means for determining usage-example candidates from the table entries; means for generating a reduced table, obtained by reducing the base table, for efficient management of memory and extraction time; means for extracting continuous and discontinuous usage examples from the usage-example candidates; means for removing unnecessary usage examples from the extracted usage examples using rules; and means for presenting the finally extracted usage examples.

Further, to achieve the above object, the statistical usage-example extraction method for a large corpus according to the present invention comprises the steps of: inserting fixed-length strings from a corpus supplied through an input device into a pointer table, one entry per string, while advancing through the corpus one character at a time; generating a sorted pointer table by copying the pointer table when no strings remain in the corpus to insert; obtaining the NMC (Number of Matched Characters), inserting it into the N-th entry of the sorted pointer table, and then determining the N-th NSC (Number of Significant Characters) of the sorted pointer table as the larger of the (N-1)-th NMC and the N-th NMC; checking whether the validity flag obtained for the pointer table is 1; inserting the entry into a reduced pointer table when the validity flag is 1, and, when the validity flag is 0, moving to the next entry and checking whether an entry exists; returning to the step of calculating the validity flag when an entry exists, and generating a reduced sorted pointer table from the reduced pointer table when no entry exists; calculating the NES (Number of Extracted Substring) on the reduced sorted pointer table to extract continuous usage examples, and using the start pointer in the reduced sorted pointer table to check for overlap between continuous usage examples that appear in the same sentence; and, when the overlap check finds no overlap, treating continuous usage examples that belong to the same sentence as a new discontinuous usage example, and removing unnecessary usage examples from the continuous and discontinuous usage examples using pattern removal rules.
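The reduction steps of this method (copying flagged entries into a reduced pointer table, then sorting it) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the entry layout `(sentence_index, start, text)` and the function names `reduce_pointer_table` and `sort_reduced_table` are assumptions introduced here, and the validity flags are taken as already computed.

```python
def reduce_pointer_table(entries, valid_flags):
    """Copy entries whose validity flag is 1 into a reduced pointer table."""
    reduced = []
    for entry, flag in zip(entries, valid_flags):
        if flag == 1:
            reduced.append(entry)
        # A flag of 0 means the entry is skipped and the next one is examined.
    return reduced


def sort_reduced_table(reduced):
    """Build the reduced sorted pointer table by sorting a copy on the string field."""
    return sorted(reduced, key=lambda entry: entry[2])


# Hypothetical entries: (sentence index, start position, string).
entries = [(0, 2, "cab"), (0, 0, "abc"), (0, 1, "bca")]
valid_flags = [1, 1, 0]  # assumed to have been computed beforehand

rpt = reduce_pointer_table(entries, valid_flags)
rspt = sort_reduced_table(rpt)
```

Because only flagged entries survive, every later pass (sorting, counting, overlap checking) runs over a table that is at most as large as the original, which is the space and time saving the method aims for.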

Further, to achieve the above object, the present invention provides a computer-readable recording medium on which is recorded a program for implementing: means for generating a base table for extracting usage examples from a corpus; usage-example candidate determination means for determining usage-example candidates from the table entries; means for generating a reduced table, obtained by reducing the base table, for efficient management of memory and extraction time; means for extracting continuous and discontinuous usage examples from the usage-example candidates; means for removing unnecessary usage examples from the extracted usage examples using pattern removal rules; and means for outputting the finally extracted usage examples.

FIG. 1 is a block diagram illustrating the concept of the statistical usage-example extraction means according to the present invention.

FIG. 2 is a flowchart illustrating the method for statistically extracting continuous and discontinuous usage examples according to the present invention.

FIGS. 3(a) and 3(b) show the internal structures of the PT and SPT, and of the RPT and RSPT, used in the present invention.

FIG. 4 shows an embodiment of the statistical usage-example extraction means according to the present invention.

<Description of reference numerals for the main parts of the drawings>

101: input means; 102: PT generation unit

103: SPT generation unit; 104: NMC calculation unit

105: NSC calculation unit; 106: validity flag calculation unit

107: RPT generation unit; 108: RSPT generation unit

109: NES calculation unit; 110: ST calculation unit

111: unnecessary usage-example removal unit; 112: printing unit

113: printing device; 114: display control unit

115: display device

A continuous usage example is one in which the extracted example appears in a sentence as a sequence of consecutive words. A discontinuous usage example is one in which the extracted example appears in a sentence in separated form, with its parts divided by other words, phrases, or the like.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram illustrating the concept of the statistical usage-example extraction means according to the present invention.

Fixed-length strings in the corpus, supplied through the input device (101), are inserted one entry per string into the pointer table (Pointer Table; hereinafter PT) generated by the PT generation unit (102), advancing through the corpus one character at a time. When no strings remain in the corpus to insert, the sorted pointer table (Sorted Pointer Table; hereinafter SPT) generation unit (103) copies the PT to generate the SPT. The internal structures of the PT and the SPT are shown in FIG. 3(a). The SPT is the PT sorted by string, and it is used to count the characters that the current entry shares with the entry immediately above or below it.

The NMC (Number of Matched Characters) calculation unit (104) obtains the NMC, the number of characters common to the N-th string and the (N+1)-th string, and inserts it into the N-th entry of the SPT. Next, the NSC (Number of Significant Characters) calculation unit (105) determines the N-th NSC of the SPT as the larger of the (N-1)-th NMC and the N-th NMC. The validity flag calculation unit (106) obtains a validity flag for the PT, which is used to determine usage-example candidates. The validity flag is set to 1 if the N-th NSC is greater than or equal to the (N-1)-th NSC, and to 0 otherwise.

Entries whose validity flag is 1 are moved to the reduced pointer table (Reduced PT; hereinafter RPT) generated by the RPT generation unit (107), so that subsequent operations are not performed needlessly on entries that do not require them, which improves memory use and speed. When the validity flag is 0, processing moves to the next entry. When no further entry exists, the reduced sorted pointer table (Reduced SPT; hereinafter RSPT) generation unit (108) generates the RSPT in the same way that the SPT is made from the PT. The internal structures of the RPT and the RSPT are shown in FIG. 3(b).

The NES (Number of Extracted Substring) is the number assigned to an extracted usage example. It increases by 1 each time a usage example is found, so the final NES value gives the number of extracted usage examples. The NES increases when, in the RSPT, the validity flag and the NSC value are present and equal to the values at the previous index. The NES calculation unit (109) regards a new continuous usage example as extracted each time the NES value changes, and the start pointer (Start Point; hereinafter ST) calculation unit (110) is used on the RSPT to check for overlap between continuous usage examples that appear in the same sentence. The ST records, at the time a string is first inserted from the corpus into the PT, the character position at which that string begins within the sentence to which it belongs. Continuous usage examples that belong to the same sentence without overlapping are regarded as a new discontinuous usage example. Finally, the unnecessary usage-example removal unit (111) applies pattern removal rules to remove unnecessary usage examples, and the result is output to the printing device (113) through the printing unit (112) or to the display device (115) through the display control unit (114).
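The table construction and the three per-entry quantities described above (NMC, NSC, and the validity flag) can be sketched in Python as follows. The sketch rests on several assumptions that the text does not state: the "number of common characters" is read as the length of the common prefix, windows near the end of a sentence are allowed to be shorter than the fixed width, the missing neighbours of the first and last entries are counted as 0, and all function names are hypothetical.

```python
def build_pointer_table(sentences, width):
    """PT: one entry per character position, as (sentence index, ST, window)."""
    table = []
    for sentence_index, sentence in enumerate(sentences):
        for start in range(len(sentence)):
            # Assumption: windows at the end of a sentence may be shorter than `width`.
            table.append((sentence_index, start, sentence[start:start + width]))
    return table


def common_prefix_length(a, b):
    """Assumed reading of NMC: number of leading characters two strings share."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def annotate_sorted_table(spt):
    """Return (NMC, NSC, validity) lists for a sorted pointer table."""
    strings = [entry[2] for entry in spt]
    size = len(strings)
    # NMC[n]: characters shared by entry n and entry n+1 (0 for the last entry).
    nmc = [common_prefix_length(strings[n], strings[n + 1]) if n + 1 < size else 0
           for n in range(size)]
    # NSC[n]: the larger of NMC[n-1] and NMC[n] (NMC before the first entry taken as 0).
    nsc = [max(nmc[n - 1] if n > 0 else 0, nmc[n]) for n in range(size)]
    # Validity[n]: 1 if NSC[n] >= NSC[n-1], else 0 (NSC before the first entry taken as 0).
    valid = [1 if nsc[n] >= (nsc[n - 1] if n > 0 else 0) else 0
             for n in range(size)]
    return nmc, nsc, valid


pt = build_pointer_table(["abcab"], width=3)
spt = sorted(pt, key=lambda entry: entry[2])  # SPT: a copy of the PT sorted by string
nmc, nsc, valid = annotate_sorted_table(spt)
```

For the single sentence "abcab" with a width of 3, the sorted strings are ab, abc, b, bca, cab, which gives NMC = [2, 0, 1, 0, 0], NSC = [2, 2, 1, 1, 0], and validity flags [1, 1, 0, 1, 0] under the assumptions above.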

FIG. 2 is a flowchart of the statistical method for extracting continuous and discontinuous usage examples according to the present invention.

Fixed-length strings in the corpus (200) supplied through the input device are inserted into the PT, one entry per string, advancing one character at a time (202). When no strings remain in the corpus to insert (201), the PT is copied to generate the SPT (203). The SPT is the PT sorted by string, and it is used to count the characters that the current entry shares with the entry immediately above or below it. The NMC, the number of characters common to the N-th string and the (N+1)-th string, is obtained and inserted into the N-th entry of the SPT. Next, the N-th NSC of the SPT is determined as the larger of the (N-1)-th NMC and the N-th NMC (206). A validity flag is obtained for the PT and used to determine usage-example candidates (207). The validity flag is set to 1 if the N-th NSC is greater than or equal to the (N-1)-th NSC, and to 0 otherwise.

Entries whose validity flag is 1 are moved to the RPT, so that subsequent operations are not performed needlessly on entries that do not require them, which improves memory use and speed (209). When the validity flag is 0 (208), processing moves to the next entry and checks whether an entry exists (211). If an entry exists, processing returns to step (207); if no entry exists, the RSPT is generated in the same way that the SPT is made from the PT (212 and 213).

The NES is the number assigned to an extracted usage example. It increases by 1 each time a usage example is found, so the final NES value gives the number of extracted usage examples. The NES increases when, in the RSPT, the validity flag and the NSC value are present and equal to the values at the previous index (214). Each time the NES value changes, a new continuous usage example is regarded as extracted (215), and the ST is used on the RSPT to check for overlap between continuous usage examples that appear in the same sentence (216). The ST records, at the time a string is first inserted from the corpus into the PT, the character position at which that string begins within the sentence to which it belongs. Continuous usage examples that belong to the same sentence without overlapping are regarded as a new discontinuous usage example (217). Finally, unnecessary usage examples are removed using pattern removal rules (218), which ends the extraction.
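The overlap check of steps (216) and (217) can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: each continuous usage example is represented as a hypothetical tuple `(sentence_index, start, text)` whose `start` plays the role of the ST, only pairs of examples are combined, and the names `overlaps` and `discontinuous_candidates` are introduced here.

```python
from itertools import combinations


def overlaps(a_start, a_length, b_start, b_length):
    """True if two character spans in the same sentence share any position."""
    return a_start < b_start + b_length and b_start < a_start + a_length


def discontinuous_candidates(occurrences):
    """Pair continuous examples that share a sentence and do not overlap.

    `occurrences` is a list of (sentence index, start position, text) tuples.
    """
    by_sentence = {}
    for sentence_index, start, text in occurrences:
        by_sentence.setdefault(sentence_index, []).append((start, text))

    pairs = []
    for sentence_index, items in by_sentence.items():
        # Sort by start position so each pair is reported in reading order.
        for (s1, t1), (s2, t2) in combinations(sorted(items), 2):
            if not overlaps(s1, len(t1), s2, len(t2)):
                pairs.append((sentence_index, t1, t2))
    return pairs


# Hypothetical continuous usage examples found in two sentences.
occurrences = [
    (0, 0, "not only"),
    (0, 15, "but also"),
    (0, 4, "only x"),
    (1, 0, "not only"),
]
pairs = discontinuous_candidates(occurrences)
```

Here "not only" and "only x" share characters 4 to 7 of sentence 0 and are therefore rejected as a pair, while each of them combines with "but also"; sentence 1 holds a single example and yields no pair.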

FIG. 4 shows an embodiment of the statistical usage-example extraction means according to the present invention. After a corpus is selected, choosing whether the extraction target is continuous or discontinuous presents the corresponding usage examples together with the sentences in which they occur.

The present invention described above is not limited to the foregoing embodiments and the accompanying drawings, since those of ordinary skill in the art to which the invention pertains can make various substitutions, modifications, and changes without departing from its technical spirit.

As described above, the present invention can extract a wide variety of continuous and discontinuous usage examples from a large corpus. It can therefore be used broadly to build basic knowledge in all fields of natural language processing, including fields that analyze the characteristics of a corpus and fields that perform natural language processing using large numbers of usage examples.

Claims

A statistical usage-example extraction means for a large corpus, comprising:

means for generating a base table for extracting usage examples from a corpus;

usage-example candidate determination means for determining usage-example candidates from table entries;

means for generating a reduced table, obtained by reducing the base table, for efficient management of memory and extraction time;

means for extracting continuous and discontinuous usage examples from the usage-example candidates;

means for removing unnecessary usage examples from the extracted usage examples using pattern removal rules; and

means for outputting the finally extracted usage examples.

The statistical usage-example extraction means for a large corpus of claim 1, wherein the base table generating means comprises: a pointer table generation unit that generates a pointer table by sequentially extracting strings from the corpus; and

a sorted pointer table generation unit that generates a sorted pointer table corresponding to the pointer table.

The statistical usage-example extraction means for a large corpus, wherein the usage-example candidate determination means comprises: an NMC calculation unit and an NSC calculation unit that calculate the number of identical characters between strings; and

a validity flag calculation unit that determines whether an entry is a usage-example candidate.

The statistical usage-example extraction means for a large corpus of claim 1, wherein the reduced table generating means comprises: a reduced pointer table generation unit that extracts only the usage-example candidates from the base table to form a reduced table; and

a reduced sorted pointer table generation unit that generates a sorted reduced pointer table corresponding to the reduced pointer table.

The statistical usage-example extraction means for a large corpus of claim 1, wherein the continuous and discontinuous usage-example extracting means comprises: an NES calculation unit that determines whether a usage-example candidate is selected as the longest match in the corpus; and

a start pointer calculation unit that checks whether two usage examples in a sentence overlap.

The statistical usage-example extraction means for a large corpus of claim 1, wherein the unnecessary usage-example removing means comprises an unnecessary usage-example removal unit that removes unnecessary usage examples using pattern removal rules.

A statistical usage-example extraction method for a large corpus, comprising the steps of:

inserting fixed-length strings from a corpus supplied through an input device into a pointer table, one entry per string, while advancing through the corpus one character at a time;

generating a sorted pointer table by copying the pointer table when no strings remain in the corpus to insert;

obtaining an NMC, inserting it into the N-th entry of the sorted pointer table, and determining the N-th NSC of the sorted pointer table as the larger of the (N-1)-th NMC and the N-th NMC;

checking whether the validity flag obtained for the pointer table is 1;

inserting the entry into a reduced pointer table when the validity flag is 1, and, when the validity flag is 0, moving to the next entry and checking whether an entry exists;

returning to the step of calculating the validity flag when the check finds an entry, and generating a reduced sorted pointer table from the reduced pointer table when no entry exists;

calculating the NES on the reduced sorted pointer table to extract continuous usage examples, and using the start pointer in the reduced sorted pointer table to check for overlap between continuous usage examples that appear in the same sentence; and

when the overlap check finds no overlap, treating continuous usage examples that belong to the same sentence as a new discontinuous usage example, and removing unnecessary usage examples from the continuous and discontinuous usage examples using pattern removal rules.

The statistical usage-example extraction method for a large corpus of claim 7, wherein the sorted pointer table is the pointer table sorted by string and is used to count the characters that the current entry shares with the entry immediately above or below it.

The statistical usage-example extraction method for a large corpus of claim 7, wherein the validity flag is set to 1 if the N-th NSC is greater than or equal to the (N-1)-th NSC, and to 0 otherwise.

Means for generating a base table for extracting usage examples from a corpus,

A computer-readable recording medium on which is recorded a program for causing the means for outputting the finally extracted usage examples to function.