KR20020054254A

KR20020054254A - Analysis Method for Korean Morphology using AVL+Trie Structure

Info

Publication number: KR20020054254A
Application number: KR1020000083385A
Authority: KR
Inventors: 손소현; 송종철; 이성용; 문병주; 홍기채; 정현수; 이동일
Original assignee: 오길록; 한국전자통신연구원
Priority date: 2000-12-27
Filing date: 2000-12-27
Publication date: 2002-07-06

Abstract

PURPOSE: A Korean morphological analysis method is provided that analyzes Korean morphemes using an AVL+Trie dictionary structure (AVL: Adelson-Velskii and Landis; Trie: reTrieval) and serves as the input list generator of a word clustering engine, reducing the load on that engine. CONSTITUTION: The method comprises the steps of: extracting words from a sentence by splitting on spaces, commas, and other special characters (100); checking, by exact matching, whether each extracted word is in a compressed stopword dictionary made up of high-frequency non-noun words such as verbs, adjectives, and adverbs (103); if the word is in the compressed stopword dictionary, moving on to the next word (102); if the word is not in the compressed stopword dictionary, checking by longest matching whether it is in a noun dictionary (104); determining, by referring to a general noun dictionary and a dependent morpheme dictionary, whether a non-conformance error occurs in the part of the word that remains after the nouns found in the noun dictionary are extracted (105); if a non-conformance error occurs, passing the word to an unregistered word processing module; extracting the nouns found in the noun dictionary (106); checking whether there is a list of adjacent nouns (107); combining the adjacent nouns into a noun phrase and checking whether the noun phrase is valid (108); and transmitting the extracted nouns or noun phrases to an application database (109).

Description

Analysis Method for Korean Morphology using AVL + Trie Structure

The present invention relates to a Korean morphological analysis method using a dictionary structure (AVL+Trie) and to a computer-readable recording medium on which a program for carrying out the method is recorded. More particularly, it relates to a Korean morphological analysis method that uses the AVL+Trie dictionary structure to act as the input list generator of a word clustering engine and thereby reduce the performance burden on that engine, and to a computer-readable recording medium on which a program for carrying out the method is recorded.

As the Internet industry expands rapidly and an enormous amount of knowledge and information pours out, Internet usage is moving toward more convenient, user-centered services.

For search systems as well, the focus is shifting from how much data a system holds to how effectively it delivers useful and accurate information to users. The emerging clustering search technology clusters the queries and documents most relevant to a user's query and presents them together with the search results for that query.

The additional search results provided by a clustering engine let users who are encountering a new field recognize related terms and synonyms, which reduces the burden of grasping the initial concepts.

Clustering search technology, which reduces this initial concept recognition burden, falls broadly into document clustering and word clustering. Word clustering clusters only words to build its database and provides the related documents through an internal search. It therefore has the advantages that the clustering database is quick to build and that an existing search engine can be reused.

The prior art on morphological analyzers, by application field, corpus type, and language, is as follows.

First, the trends in the application fields of morphological analyzers are as follows.

Morphological analyzers have now reached practical use in most application systems for both English and Korean, although research on English language processing, where sentence structure and word forms vary comparatively little, is ahead of research on Korean language processing.

Morphological analyzers in practical use differ in their algorithms and reference dictionary structures according to the purpose and scope of the application system they serve (e.g., spell checkers, information retrieval, translation systems, speech recognition systems).

A morphological analyzer developed for a spell checker must put accuracy first; one developed for information retrieval needs both speed and accuracy; and one developed for a translation system needs complete conversion matching between the languages being translated, including idiom handling, more than it needs speed. Consequently, if the morphological analysis algorithm and output of one system are shared with another system, a loss of efficiency has to be accepted.

Morphological analyzers for search engines, which are directly relevant to the present invention, can also be divided by purpose, and each purpose calls for an analyzer suited to it. A morphological analyzer may serve as an index term extractor for building a search engine's index database, as a preprocessor for automatic document summarization, as a preprocessor for a natural language query processor, or, as in the present invention, as an input word extractor for a word clustering engine.

Because morphological analyzers serve such varied applications, some modular analyzers have been developed so that an analyzer used in one system can be ported to another, but to date analyzers have mostly been developed individually for each application.

Next, the trends in the corpora to which morphological analyzers are applied are as follows.

Corpora can be classified by degree of processing, by document characteristics, and by subject matter. By degree of processing, there are unprocessed raw corpora, corpora tagged by morphological analysis, corpora tagged with syntax trees, categorized corpora, and collections of sentence patterns in which documents are classified according to a list of basic verbs. By document characteristics, there are corpora of short, single-topic texts such as newspapers and magazines, and corpora of long texts with multiple plot lines such as novels. By subject matter, a corpus can be organized into fields such as information and communication, semiconductors, politics, and economics.

Current Korean corpora include KTset 96, the ETRI-계몽 corpus excerpted from the Korea Economic Daily, the corpus used in the first ETRI (Electronics and Telecommunications Research Institute) morphological analyzer and tagger evaluation contest, and the KORDIC corpus. So far, however, there has been no case of building a cluster engine on a raw corpus of documents with specialized information and communication content.

The morphological analysis algorithms and dictionary structures applied to Korean are as follows.

Even for Korean morphological analyzers in practical use, processing speed can vary considerably depending on the algorithm used and on how the reference dictionary is structured. In particular, because Korean contains many compound nouns, these must be stored and retrieved quickly, and there must be a module that handles words not registered in the dictionary.

Representative algorithms include the longest matching method, the shortest matching method, the tabular parsing method, the head-tail division method, the syllable-based analysis method, and two-level morphology. Dictionary structures include the trie structure, the hash structure, and the B+ tree structure.

To handle words not registered in the dictionary, one existing invention proposed a morphological analysis method that, in addition to the main dictionary, estimates in advance a list of words likely to be unregistered and stores them classified by type, thereby reducing how often unregistered words occur.

However, these morphological analyzers cover few unregistered words, are slow at lookup, and have difficulty with domain-specific morphological analysis and with newly coined words.

The present invention has been proposed to solve the problems described above. Its object is to provide a dictionary-based Korean morphological analysis method that uses a dictionary structure (AVL+Trie) to reduce the performance burden on a word clustering engine and improve its output quality, and a computer-readable recording medium on which a program for carrying out the method is recorded.

FIG. 1 is a flowchart of one embodiment of the morphological analysis method according to the present invention.

FIG. 2 is an explanatory diagram of one embodiment of the AVL+Trie dictionary structure according to the present invention.

To achieve the above object, the method of the present invention is a dictionary-based Korean morphological analysis method using an AVL+Trie structure, applied to a morphological analysis apparatus, comprising: a first step of extracting words in word-phrase (eojeol) units; a second step of looking up each extracted word in a compressed stopword dictionary; a third step of looking up the extracted word in a noun dictionary; a fourth step of checking the syllables remaining after the noun dictionary lookup to determine whether a non-conformance error has occurred; a fifth step of handling the word as an unregistered word if the fourth step finds a non-conformance error; and a sixth step of extracting the noun and storing it in a database if the fourth step finds no non-conformance error.

Another method of the present invention further comprises: a seventh step of temporarily storing a stopword that is not in the dictionary; an eighth step of checking, by referring to a stopword dictionary, whether it is a stopword registered in that dictionary; a ninth step of storing frequently occurring stopwords in the compressed stopword dictionary if the eighth step finds the word registered in the stopword dictionary; and a tenth step of treating the word as a newly coined word and passing it to a manager module if the eighth step finds it is not in the stopword dictionary.

The present invention also provides a computer-readable recording medium on which is recorded a program for realizing, in a morphological analysis system having a high-capacity processor: a first function of extracting words in word-phrase (eojeol) units; a second function of looking up each extracted word in a compressed stopword dictionary; a third function of looking up the extracted word in a noun dictionary; a fourth function of checking the syllables remaining after the noun dictionary lookup to determine whether a non-conformance error has occurred; a fifth function of handling the word as an unregistered word if the fourth function finds a non-conformance error; and a sixth function of extracting the noun and storing it in a database if the fourth function finds no non-conformance error.

The present invention further provides a computer-readable recording medium on which is recorded a program for additionally realizing: a seventh function of temporarily storing a stopword that is not in the dictionary; an eighth function of checking, by referring to a stopword dictionary, whether it is a stopword registered in that dictionary; a ninth function of storing frequently occurring stopwords in the compressed stopword dictionary if the eighth function finds the word registered in the stopword dictionary; and a tenth function of treating the word as a newly coined word and passing it to a manager module if the eighth function finds it is not in the stopword dictionary.

The above objects, features, and advantages will become clearer from the following detailed description taken together with the accompanying drawings. A preferred embodiment of the present invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of one embodiment of the morphological analysis method according to the present invention.

The morphological analyzer of the present invention is an input word extractor for a word clustering engine and extracts the key nouns from a document. By basing the key noun category on specialized technical terminology of the information and communication field rather than on general nouns, it lays the groundwork for presenting specialized information through a systematic and efficient interface.

As shown in FIG. 1, given a sentence in a document such as "정보통신기술에 미친 영향 분석" (101; roughly, "analysis of the impact on information and communication technology"), the morphological analysis method according to the present invention first extracts words from the sentence in word-phrase (eojeol) units (100). Words are delimited by spaces, commas, and other special symbols. Extraction is done per document and per sentence so that the results can be used to derive associations between nearby words and to identify key sentences.
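The word-phrase extraction step (100) can be sketched as below. This is a minimal illustration, not the patent's implementation: the function name `extract_word_phrases` and the exact delimiter set are assumptions, since the text only names spaces, commas, and "other special symbols".

```python
import re

# Assumed delimiter set: whitespace, commas, and some common punctuation.
# The patent does not enumerate the special symbols it uses.
_DELIMITERS = re.compile(r"[\s,.;:!?()\[\]\"']+")


def extract_word_phrases(sentence: str) -> list[str]:
    """Split one sentence into word phrases (eojeol), dropping empty pieces."""
    return [token for token in _DELIMITERS.split(sentence) if token]


if __name__ == "__main__":
    print(extract_word_phrases("정보통신기술에 미친 영향 분석"))
```

Calling the function once per sentence keeps the sentence boundary available to later steps, which matches the per-sentence processing described above.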

Then, moving on to the next word (102), high-frequency words that are not nouns (e.g., verbs, adjectives, adverbs, interjections) are collected in a compressed stopword dictionary (122), which is consulted before the noun dictionary (131).

If stopwords turn out to occur more frequently than high-frequency nouns, consulting the high-frequency stopwords before the noun dictionary (131) improves speed. Accordingly, the compressed stopword dictionary (122) is kept small and is searched with a simple exact matching method.

The threshold for selecting high-frequency stopwords is determined on the basis of [Equation 1] below.

In [Equation 1], "n" denotes a noun, "N" the noun dictionary, "d" a stopword, "D" the stopword dictionary, "k" and "q" the occurrence frequencies of the respective words, and "max" the highest frequency value.
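A minimal sketch of how a compressed stopword dictionary might be built and consulted by exact matching follows. The threshold is simply passed in by the caller as a hypothetical `threshold` parameter, and the function names `build_compressed_stopwords` and `drop_stopwords` are illustrative assumptions.

```python
from collections import Counter


def build_compressed_stopwords(stopword_counts: Counter, threshold: int) -> frozenset:
    """Keep only stopwords whose occurrence count reaches the given threshold."""
    return frozenset(w for w, count in stopword_counts.items() if count >= threshold)


def drop_stopwords(words: list, compressed: frozenset) -> list:
    """Exact-match check: a word is skipped only if it equals a stored stopword."""
    return [w for w in words if w not in compressed]


if __name__ == "__main__":
    counts = Counter({"미친": 50, "매우": 3})
    compressed = build_compressed_stopwords(counts, threshold=10)
    print(drop_stopwords(["정보통신기술에", "미친", "영향", "분석"], compressed))
```

A hash-based set gives the exact-match lookup in one step, which is in keeping with the stated goal of a small dictionary that is cheap to consult before the noun dictionary.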

Words that are not compressed stopwords are checked against the noun dictionary (131) to determine whether they are nouns (104). The noun dictionary has two broad parts: general nouns, and technical nouns, which hold the specialized terminology of the information and communication field.

Because this noun dictionary lookup step must extract the technical nouns that the word clustering engine will use as key terms, technical nouns and general nouns are distinguished.

Technical nouns are matched with the longest matching method, and general nouns with the shortest matching method.

The longest matching method compares the word being looked up with the words stored in the dictionary and, when several stored words match, selects the longest of them. Compared with the shortest matching method, it extracts fewer words and can extract meaningful compound nouns. The longest matching method is therefore effective for extracting compound nouns from sentences that contain them.

If a word registered as a general noun precedes a technical noun, as in "한일기계번역" (Korean-Japanese machine translation), where "한일" (Korean-Japanese, a general noun) and "기계번역" (machine translation, a technical noun) occur together, then the longest word is extracted when the dictionary lookup result is tagged as a technical noun, and the shortest word is extracted when it is tagged as a general noun.

Technical nouns take priority: if a general noun precedes a technical noun, the technical noun is extracted with the general noun portion excluded.
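The matching rules above can be sketched as follows. This is an illustration under assumptions: dictionaries are modeled as plain Python sets, and the function names (`longest_match`, `shortest_match`, `extract_technical_noun`) and the way a leading general noun is skipped are hypothetical, not taken from the patent.

```python
from typing import Optional, Tuple


def _matching_prefixes(text: str, dictionary: set) -> list:
    """All prefixes of text that are dictionary entries, shortest first."""
    return [text[:i] for i in range(1, len(text) + 1) if text[:i] in dictionary]


def longest_match(text: str, dictionary: set) -> Optional[str]:
    matches = _matching_prefixes(text, dictionary)
    return matches[-1] if matches else None


def shortest_match(text: str, dictionary: set) -> Optional[str]:
    matches = _matching_prefixes(text, dictionary)
    return matches[0] if matches else None


def extract_technical_noun(phrase: str, technical: set, general: set) -> Optional[Tuple[str, str]]:
    """Return (skipped general-noun prefix, technical noun), or None if none is found.

    Technical nouns are tried first by longest matching; otherwise a general
    noun is skipped by shortest matching and the search continues.
    """
    pos, skipped = 0, []
    while pos < len(phrase):
        rest = phrase[pos:]
        tech = longest_match(rest, technical)
        if tech is not None:
            return "".join(skipped), tech
        gen = shortest_match(rest, general)
        if gen is None:
            return None
        skipped.append(gen)
        pos += len(gen)
    return None


if __name__ == "__main__":
    print(extract_technical_noun("한일기계번역", {"기계", "기계번역"}, {"한일"}))
```

With both "기계" and "기계번역" in the technical dictionary, longest matching returns the compound "기계번역" rather than the shorter "기계".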

Next, after the nouns registered in the technical noun dictionary have been extracted from a word phrase by longest matching, the remaining syllables are checked against the general noun dictionary (131) and the dependent morpheme dictionary (132) to see whether they are general nouns or dependent morphemes, and it is determined whether a non-conformance error has occurred (105). For example, for the word phrase "기계번역시스템에서의" (in a machine translation system), if "기계번역" (machine translation) is stored in the noun dictionary and is extracted by longest matching, the remaining-syllable check identifies "시스템" (system) as a general noun and "의" (of) as a dependent morpheme, continuing to the end of the word phrase.

If checking the word phrase to its end in the remaining-syllable check (105) produces a non-conformance error, processing restarts from the noun dictionary lookup (104). If, on the retry, there are no more noun candidates to analyze, or checking to the end of the word phrase again produces a non-conformance error, the whole word phrase is passed to the unregistered word processing module (120).

If checking the word phrase to its end (105) finds it conformant, the technical noun is passed to the application database.
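The remaining-syllable check and the retry described above might look like the sketch below. Several points are assumptions made for illustration: the remainder is segmented by simple backtracking, "retrying from the noun dictionary lookup" is read as trying the next-shorter technical noun candidate, and all names are hypothetical.

```python
def remainder_conforms(rest: str, general_nouns: set, dependent_morphemes: set) -> bool:
    """True if rest splits entirely into general nouns and dependent morphemes."""
    if not rest:
        return True
    for i in range(len(rest), 0, -1):
        piece = rest[:i]
        if piece in general_nouns or piece in dependent_morphemes:
            if remainder_conforms(rest[i:], general_nouns, dependent_morphemes):
                return True
    return False  # a non-conformance error in the patent's terms


def analyze_word_phrase(phrase: str, technical_nouns: set,
                        general_nouns: set, dependent_morphemes: set) -> tuple:
    """Return ("noun", technical_noun) or ("unregistered", phrase)."""
    # Candidate technical nouns at the start of the phrase, longest first.
    candidates = [phrase[:i] for i in range(len(phrase), 0, -1)
                  if phrase[:i] in technical_nouns]
    for noun in candidates:  # shorter candidates act as the retries
        if remainder_conforms(phrase[len(noun):], general_nouns, dependent_morphemes):
            return ("noun", noun)
    # No candidate left, or every candidate failed: hand over the whole phrase.
    return ("unregistered", phrase)


if __name__ == "__main__":
    print(analyze_word_phrase("기계번역시스템에서의", {"기계번역"}, {"시스템"}, {"에서", "의"}))
```

In this sketch "에서" is added to the dependent morpheme set so that the example word phrase can be consumed to its end; the patent's example names only "의".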

The word phrase passed to the unregistered word processing module (120) is handled as follows.

Some existing methods estimate unregistered words, that is, words not in the prebuilt dictionary, in advance and store them classified by type so as to minimize how often unregistered words occur. Words such as personal names and city names, which are not in the dictionary and could not otherwise be processed when they appear in a document, are estimated in advance and stored by type. The present method instead does not classify unregistered words by type; it provides an unregistered word processor (120) that decides in real time, when an unregistered word appears, whether to add it to the stopword dictionary or to treat it as a newly coined word.

The unregistered word processor (120) has two main parts: an automatic stopword classification module (130) and a manager module (125). A word sent to the unregistered word processor (120) first goes to the temporary stopword dictionary of the automatic stopword classification module (130), where it is determined whether the word is listed in the stopword dictionary (121) (124); if it is, its frequency count is incremented. If it is not listed, the word is handed to the manager module (125), and the administrator is given authority to decide its part of speech so that such words can be processed at regular intervals.

The stopword dictionary (121) contains verbs, adjectives, adverbs, and words of other parts of speech, together with nouns that the administrator has designated as unnecessary.
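A rough sketch of the unregistered word processor follows. The class name, field names, and the `promote_threshold` value are hypothetical; in particular, promoting a stopword to the compressed dictionary once its count reaches a fixed number is an assumption standing in for the patent's frequency criterion.

```python
from collections import Counter


class UnregisteredWordProcessor:
    """Routes an unregistered word to stopword counting or to the manager queue."""

    def __init__(self, stopword_dictionary: set, promote_threshold: int = 3):
        self.stopword_dictionary = stopword_dictionary  # full stopword dictionary (121)
        self.promote_threshold = promote_threshold      # assumed promotion criterion
        self.counts = Counter()                         # temporary stopword frequencies
        self.compressed_stopwords = set()               # compressed dictionary (122)
        self.manager_queue = []                         # words awaiting the administrator

    def handle(self, word: str) -> str:
        if word in self.stopword_dictionary:
            self.counts[word] += 1
            if self.counts[word] >= self.promote_threshold:
                self.compressed_stopwords.add(word)
            return "stopword"
        # Not a known stopword: treat as a possible new word for manual review.
        self.manager_queue.append(word)
        return "pending"


if __name__ == "__main__":
    processor = UnregisteredWordProcessor({"그리고"}, promote_threshold=2)
    print(processor.handle("그리고"), processor.handle("그리고"), processor.handle("메타버스"))
    print(processor.compressed_stopwords, processor.manager_queue)
```

Queuing unknown words rather than classifying them immediately mirrors the description, in which the administrator decides the part of speech at regular intervals.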

Some existing methods extract nouns from the words obtained per word phrase and, if a noun is a compound noun, split it into single nouns. The present method, however, values using compound nouns as they are, because a Korean compound noun loses its distinctive meaning when it is split into single nouns. It therefore provides a noun combination module that extracts not only single nouns within one word phrase but also compound nouns formed across several word phrases.

To this end, morphemes previously identified as compressed stopwords, dependent morphemes, or unregistered words, that is, morphemes that are not meaningful nouns, have a tag appended after them for use in compound noun combination.

A run of consecutive noun words containing no such tag is likely to be a single meaningful compound noun even though its parts were extracted separately in the word-phrase extraction step, so the nouns are combined. The combined noun is checked against the technical noun dictionary to see whether it is a meaningful compound noun and, if so, is added to the resulting noun list.
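The noun combination step can be illustrated with the sketch below. The input representation, a list of `(text, tagged)` pairs in which `tagged=True` marks a non-noun morpheme, and the function name `combine_adjacent_nouns` are assumptions; for brevity only the whole untagged run is tested against the technical noun dictionary, not its sub-runs.

```python
def combine_adjacent_nouns(tokens: list, technical_nouns: set) -> list:
    """Join each run of untagged (noun) tokens and keep it if the dictionary knows it.

    tokens: list of (text, tagged) pairs; tagged=True marks a stopword,
    dependent morpheme, or unregistered word, which breaks the run.
    """
    result, run = [], []

    def flush() -> None:
        if len(run) > 1:
            joined = "".join(run)
            if joined in technical_nouns:
                result.append(joined)
        run.clear()

    for text, tagged in tokens:
        if tagged:
            flush()
        else:
            run.append(text)
    flush()
    return result


if __name__ == "__main__":
    tokens = [("정보", False), ("검색", False), ("에서", True), ("기계", False)]
    print(combine_adjacent_nouns(tokens, {"정보검색"}))
```

The dictionary check at the end is what keeps arbitrary adjacent nouns from being reported as compounds.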

For each sentence of each document, a list of the nouns that appeared and their occurrence frequencies is compiled and passed to the word cluster processing module.

FIG. 2 is an explanatory diagram of one embodiment of the AVL+Trie dictionary structure according to the present invention.

As shown in FIG. 2, the AVL+Trie dictionary structure combines the advantages of the existing trie (reTrieval) structure with those of the AVL (Adelson-Velskii and Landis) tree structure. A trie keeps search speed constant even for keys of variable length, which makes it a useful data structure for storing, searching, and deleting Korean words, many of which are compound nouns.

To combine the advantages of the AVL structure with the trie structure, an AVL tree is built in place of the node at each trie level, so that storage, search, and deletion operate over a small character set of syllable units or grapheme (jaso) units. When the trie levels are organized by syllable, the structure is as shown in FIG. 2, and the first syllable is found by consulting the AVL structure (210) built for first syllables.

Each node of an AVL structure holds link information for the next syllable (220, 230, 240), and this link is used to jump to the linked AVL tree that holds the next syllable. In addition, to handle spacing errors in compound nouns and splitting into single nouns, a separation tag (Tag) is set on each AVL tree node whose syllable is a point at which the word can be split into single nouns.

Because an AVL tree holding a small amount of data can be kept in memory for storage, search, and insertion, dictionary access can be sped up considerably.

An example of searching for a word with the AVL+Trie structure is as follows.

For example, suppose the dictionary contains the words "정보처리", "의존형태", "태그", "정보통신", "단일명사", "불용어", "합성어", "정보화", "정보검색", and "정밀측위", and the word being looked up is "정보처리" (information processing). In the first step, "정" is searched for in the first AVL tree (210), which holds first syllables; in the second step, the search moves to the second AVL tree (220), which holds the second syllables that "정" links to, and searches for "보".

Next, in the third step the search moves to the third AVL tree (230), which holds the third syllables that "보" links to, and searches for "처"; in the fourth step it moves to the fourth AVL tree (240), which holds the fourth syllables that "처" links to, and searches for "리".

The method of the present invention as described above can be implemented as a program and stored in computer-readable form on a recording medium (CD-ROM, RAM, ROM, floppy disk, hard disk, magneto-optical disk, etc.).

The present invention described above is not limited by the foregoing embodiment and the accompanying drawings; it will be apparent to those of ordinary skill in the art to which the invention pertains that various substitutions, modifications, and changes are possible without departing from the technical spirit of the invention.

As described above, the present invention reduces the initial concept recognition burden during search and reduces the performance load on the word cluster engine.

Claims

In a Korean morpheme analysis method using a dictionary structure (AVL+Trie) applied to a morpheme analysis device,

A first step of extracting words in word units;

A second step of searching for the extracted word by referring to a compressed stop word dictionary;

A third step of searching for the extracted word by referring to a noun dictionary;

A fourth step of checking the syllables remaining after the noun dictionary search to determine whether a non-conformance error has occurred;

A fifth step of processing the word as a non-registered word if a non-conformance error has occurred as a result of the checking in the fourth step; and

A sixth step of extracting a noun and storing it in a database if no non-conformance error has occurred as a result of the checking in the fourth step;

A morphological analysis method comprising the above steps.
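The six steps of this claim can be sketched as a single loop. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names `longest_prefix` and `analyze` are hypothetical, the dictionaries are plain Python sets rather than the AVL+Trie structure, the non-conformance check is simplified to "the remaining syllables are empty or appear in a dependent morpheme set", and returned lists stand in for the database and for non-registered word processing.

```python
import re

def longest_prefix(word, dictionary):
    """Return the longest prefix of word found in dictionary, or '' if none."""
    for end in range(len(word), 0, -1):
        if word[:end] in dictionary:
            return word[:end]
    return ""

def analyze(sentence, stop_words, nouns, dependent_morphemes):
    extracted, unregistered = [], []
    # Step 1: split into words on whitespace, commas, and other punctuation.
    words = [w for w in re.split(r"[\s,.;:!?]+", sentence) if w]
    for word in words:
        # Step 2: full match against the compressed stop word dictionary.
        if word in stop_words:
            continue
        # Step 3: longest match against the noun dictionary.
        noun = longest_prefix(word, nouns)
        remainder = word[len(noun):]
        # Step 4 (simplified assumption): the leftover syllables must be
        # empty or a known dependent morpheme.
        if noun and (remainder == "" or remainder in dependent_morphemes):
            extracted.append(noun)        # Step 6: keep the noun
        else:
            unregistered.append(word)     # Step 5: non-registered word
    return extracted, unregistered

extracted, unregistered = analyze(
    "정보처리를 매우 한다, 태그는 클러스터가",
    stop_words={"매우", "한다"},
    nouns={"정보", "정보처리", "태그"},
    dependent_morphemes={"를", "는", "가"},
)
print(extracted, unregistered)
```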

The method of claim 1,

A seventh step of temporarily storing stop words that are not in the dictionary;

An eighth step of checking, by referring to the stop word dictionary, whether the stop word is registered in the stop word dictionary;

A ninth step of storing stop words having a high frequency of occurrence in the compressed stop word dictionary if the stop word is registered in the stop word dictionary as a result of the checking in the eighth step; and

A tenth step of regarding the word as a new word and passing it to the manager module if it is not found in the stop word dictionary as a result of the checking in the eighth step

A morphological analysis method further comprising the above steps.

The method according to claim 1 or 2,

The second step,

An eleventh step of checking whether the extracted word fully matches an entry in the compressed stop word dictionary;

A twelfth step of proceeding to extraction of the next word if the word is a full match as a result of the checking in the eleventh step; and

A twelfth step of proceeding to the third step if the word is not a full match as a result of the checking in the eleventh step;

A morphological analysis method comprising the above steps.

The method according to any one of claims 1 to 3,

The third step,

A thirteenth step of checking, by referring to the noun dictionary, whether the extracted word has a longest match;

A fourteenth step of proceeding to the fourth step if a longest match is found as a result of the checking in the thirteenth step; and

A fifteenth step of processing the word as a non-registered word if no longest match is found as a result of the checking in the thirteenth step

A morphological analysis method comprising the above steps.

The method according to any one of claims 1 to 4,

The process of referring to the noun dictionary,

A morphological analysis method characterized in that the longest match method is used when checking against a noun dictionary of a specialized field and the shortest match method is used when checking against a noun dictionary of a general field.
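The difference between the two matching methods named in this claim can be illustrated with a short sketch. The function names are hypothetical, the dictionary is a plain Python set rather than the AVL+Trie structure, and matching is assumed to be prefix matching from the start of the word.

```python
def prefix_matches(word, dictionary):
    """All prefixes of word that are dictionary entries, shortest first."""
    return [word[:end] for end in range(1, len(word) + 1)
            if word[:end] in dictionary]

def shortest_match(word, dictionary):
    matches = prefix_matches(word, dictionary)
    return matches[0] if matches else None

def longest_match(word, dictionary):
    matches = prefix_matches(word, dictionary)
    return matches[-1] if matches else None

nouns = {"정보", "정보처리"}
print(shortest_match("정보처리를", nouns))  # picks the shorter entry "정보"
print(longest_match("정보처리를", nouns))   # picks the longer entry "정보처리"
```

With the same dictionary and input, the longest match keeps the compound "정보처리" whole, while the shortest match stops at "정보".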

The method of claim 2,

The process of storing the stop word in the compressed stop word dictionary,

A morpheme analysis method comprising selecting a stop word having a high frequency of occurrence according to a threshold value determined as in Equation 1 and storing the stop word in the compressed stop word dictionary.

[Equation 1]

Where "n" is a noun, "N" is a noun dictionary, "d" is a stop word, "D" is a stop word dictionary, "k" and "q" are the frequencies of occurrence of the respective words, and "max" is the highest frequency value.

In a morphological analysis system equipped with a large-capacity processor,

A first function of extracting words in word units;

A second function of searching for the extracted word by referring to a compressed stop word dictionary;

A third function of searching for the extracted word by referring to a noun dictionary;

A fourth function of checking the syllables remaining after the noun dictionary search to determine whether a non-conformance error has occurred;

A fifth function of processing the word as a non-registered word if a non-conformance error has occurred as a result of the checking by the fourth function; and

A sixth function of extracting a noun and storing it in a database if no non-conformance error has occurred as a result of the checking by the fourth function

A computer-readable recording medium having recorded thereon a program for realizing the above functions.

The method of claim 6,

A seventh function of temporarily storing stop words that are not in the dictionary;

An eighth function of checking, by referring to the stop word dictionary, whether the stop word is registered in the stop word dictionary;

A ninth function of storing stop words having a high frequency of occurrence in the compressed stop word dictionary if the stop word is registered in the stop word dictionary as a result of the checking by the eighth function; and

A tenth function of regarding the word as a new word and passing it to the manager module if it is not found in the stop word dictionary as a result of the checking by the eighth function

A computer-readable recording medium having recorded thereon a program for further realizing the above functions.