KR100757340B1

KR100757340B1 - Method for improving performance of morpheme analyzer using automatic extraction and system for executing the method

Info

Publication number: KR100757340B1
Application number: KR1020060029199A
Authority: KR
Inventors: 김태일
Original assignee: 엔에이치엔(주)
Priority date: 2006-03-30
Filing date: 2006-03-30
Publication date: 2007-09-11

Abstract

A method and a system for improving the performance of a morpheme analyzer using automatic extraction are provided. They improve the quality of morphological analysis results and search results by reinforcing the dictionary of the morpheme analyzer with unregistered words automatically extracted from a query log or newspaper articles. A word extractor extracts words included in data related to morphological analysis. A registration checker checks whether each extracted word is registered in the morpheme dictionary of the morpheme analyzer. An error object extractor (701) extracts, from the data, a first error object and a second error object that cause a spacing error. An analysis result generator (702) generates a first and a second morphological analysis result by performing morphological analysis on the first and the second error object, respectively. An analysis result comparator (703) compares the first morphological analysis result with the second. An error object adder (704) adds the first and second error objects to an error list if the two morphological analysis results differ from each other.

Description

METHOD FOR IMPROVING PERFORMANCE OF MORPHEME ANALYZER USING AUTOMATIC EXTRACTION AND SYSTEM FOR EXECUTING THE METHOD

FIG. 1 is a diagram illustrating an example, in the related art, of morphological analysis results for analysis targets that include an unregistered word and a spacing error.

FIG. 2 is a flowchart illustrating an overview of a search system including a morpheme analyzer according to an embodiment of the present invention.

FIG. 3 is a flowchart illustrating a method of improving the performance of a morpheme analyzer by automatically extracting unregistered words, according to an embodiment of the present invention.

FIG. 4 is a flowchart illustrating a method of improving the performance of a morpheme analyzer by automatically extracting spacing errors, according to another embodiment of the present invention.

FIG. 5 is a flowchart illustrating a method of improving the performance of a morpheme analyzer by automatically extracting content names, according to yet another embodiment of the present invention.

FIG. 6 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting unregistered words, according to an embodiment of the present invention.

FIG. 7 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting spacing errors, according to another embodiment of the present invention.

FIG. 8 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting content names, according to yet another embodiment of the present invention.

FIG. 9 is a diagram illustrating an example of a method of inspecting spacing errors, according to another embodiment of the present invention.

<Explanation of reference numerals for main parts of the drawings>

601: word extractor

602: registration checker

603: word registrar

604: repeater

The present invention relates to a method for improving the performance of a morpheme analyzer using automatic extraction and to a system for performing the method.

In morphological analysis according to the related art, when an analysis target includes an unregistered word or a spacing error, the performance of the morpheme analyzer degrades. Because the morpheme analyzer largely determines search performance, this degradation in turn lowers the quality of search results.

FIG. 1 is a diagram illustrating an example, in the related art, of morphological analysis results for analysis targets that include an unregistered word and a spacing error.

Reference numeral 110 shows the morphological analysis result 113 of an analysis target 112 that includes an unregistered word 111. The target "섀튼교수의" ("Professor Schatten's") 112 includes the unregistered word "섀튼" ("Schatten") 111 and is analyzed as "섀튼교/수의" 113, a segmentation that splits the word 교수 ("professor") in the wrong place. Such a result can lower the quality of search results.

Reference numeral 120 shows that two analysis targets 121 and 122 with different spacing can yield different morphological analysis results. The result "브라질/전결/과" 123, obtained for the target 122 that includes a spacing error, can likewise lower the quality of search results.

To solve the problems of the related art described above, the present invention proposes a new technique for improving the performance of a morpheme analyzer using automatic extraction.

An object of the present invention is to improve the quality of morphological analysis results and search results by automatically extracting, in advance, unregistered words that degrade the performance of a morpheme analyzer from a query log or newspaper articles, and reinforcing the dictionary of the morpheme analyzer with them.

Another object of the present invention is to improve the quality of morphological analysis results and search results by automatically extracting spacing errors from a query log in advance and including them in an error list, so that spacing errors do not degrade the performance of the morpheme analyzer.

Yet another object of the present invention is to expand a list of owned content by automatically extracting, from a query log or newspaper articles, unregistered person names (a person name, an occupation name + person name, or a person name + occupation name) or content names (titles of movies, dramas, games, or plays).

To achieve the above objects and solve the problems of the related art, a method of improving the performance of a morpheme analyzer using automatic extraction according to an embodiment of the present invention includes: a first step of extracting a word included in data associated with morphological analysis; a second step of checking, through a morpheme dictionary of the morpheme analyzer, whether the word is registered; and a third step of adding the word to the morpheme dictionary and registering it when registration of the word is not confirmed.

According to one aspect of the present invention, the method may further include repeating the first to third steps for all words included in the data.

According to another aspect of the present invention, the second step may be a step of searching for the word in the morpheme dictionary and determining whether the word is registered based on whether the word exists in the dictionary.

A method of improving the performance of a morpheme analyzer according to another embodiment of the present invention includes: extracting, from predetermined data, a first error object and a second error object that cause a spacing error; performing morphological analysis on the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result, respectively; comparing the first morphological analysis result with the second morphological analysis result; and adding the first error object and the second error object to an error list when the compared results differ from each other.

A method of improving the performance of a morpheme analyzer according to yet another embodiment of the present invention includes: a first step of searching data associated with morphological analysis for unregistered content that is not registered in a list of owned content; and a second step of adding the unregistered content to the list of owned content when such content is found.

Hereinafter, various embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 2 is a flowchart illustrating an overview of a search system including a morpheme analyzer according to an embodiment of the present invention. As shown in FIG. 2, a search system 203, which exchanges queries and search information with a user computer 202 over a communication network 201, uses a morpheme analyzer 204 and a database 205 to generate search information for a query.

In this case, the morpheme analyzer performance improvement system automatically extracts unregistered words and spacing errors from the query log or newspaper articles stored in the database 205 and uses them to improve the performance of the morpheme analyzer 204, which in turn improves the performance of the search system 203. The term morpheme analyzer as used here includes a part-of-speech tagger, which selects only the single most likely result from among a plurality of morphological analysis results.

FIG. 3 is a flowchart illustrating a method of improving the performance of a morpheme analyzer by automatically extracting unregistered words, according to an embodiment of the present invention.

In step S301, the morpheme analyzer performance improvement system extracts a word included in data associated with morphological analysis. The data may include a query log or newspaper articles.

In step S302, the system checks, through the morpheme dictionary of the morpheme analyzer, whether the word is registered. Registration may be determined by searching for the word in the morpheme dictionary and checking whether it exists there.

In step S303, the system proceeds to step S305 when the word is confirmed to be registered, and to step S304 when it is not.

In step S304, the system adds the word to the morpheme dictionary and registers it. The morpheme dictionary may include a dictionary used in the morphological analysis process of the morpheme analyzer.

In step S305, the system checks whether steps S301 to S304 have been performed for all words in the data. If so, the process ends; if not, the system returns to step S301 and repeats steps S301 to S304. Through step S305, all unregistered words included in the data can be added to the morpheme dictionary.

In this way, unregistered words that degrade the performance of the morpheme analyzer are automatically extracted in advance from a query log or newspaper articles and used to reinforce the dictionary of the morpheme analyzer, which improves the quality of morphological analysis results and search results.
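Steps S301 to S305 can be sketched as a simple loop. The following is a minimal illustration, not the patented implementation: it assumes the morpheme dictionary is an in-memory set of strings and that words are obtained by whitespace tokenization, neither of which the source specifies. The function names and sample data are hypothetical.

```python
from typing import Iterable, Iterator, List, Set


def extract_words(data: Iterable[str]) -> Iterator[str]:
    """S301: yield words in the order they appear.

    Whitespace tokenization is an assumption made for this sketch.
    """
    for line in data:
        for word in line.split():
            yield word


def reinforce_dictionary(data: Iterable[str], morpheme_dict: Set[str]) -> List[str]:
    """Run S301-S305 over every word and return the newly registered words."""
    added: List[str] = []
    for word in extract_words(data):   # S301, repeated until the data is exhausted (S305)
        if word in morpheme_dict:      # S302/S303: registration is decided by lookup
            continue
        morpheme_dict.add(word)        # S304: register the unregistered word
        added.append(word)
    return added


# Hypothetical query log and dictionary contents.
query_log = ["섀튼 교수", "브라질전 결과", "섀튼 논문"]
dictionary = {"교수", "결과", "논문"}
new_words = reinforce_dictionary(query_log, dictionary)
print(new_words)  # ['섀튼', '브라질전']
```

Because each new word is added to the dictionary as soon as it is found, a word that appears several times in the data is registered only once.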

In step S401, the morpheme analyzer performance improvement system extracts, from predetermined data, a first error object and a second error object that cause a spacing error. The first error object and the second error object may be data in which the positions of the space characters differ but which are identical once the space characters are removed.
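One way to find such pairs is to group entries by their text with all spaces removed. The sketch below assumes the data is a list of query strings; the source does not describe how candidate pairs are actually located, so the grouping strategy, the function name, and the sample queries are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple


def extract_error_objects(queries: Iterable[str]) -> List[Tuple[str, str]]:
    """Return pairs of queries that are identical except for space positions."""
    groups: Dict[str, Set[str]] = defaultdict(set)
    for query in queries:
        normalized = " ".join(query.split())   # collapse runs of whitespace
        key = normalized.replace(" ", "")      # text with every space removed
        groups[key].add(normalized)

    pairs: List[Tuple[str, str]] = []
    for variants in groups.values():
        # Every two distinct spacings of the same text form a candidate pair.
        pairs.extend(combinations(sorted(variants), 2))
    return pairs


# Hypothetical query log entries.
queries = ["브라질전 결과", "브라질 전결과", "브라질전 결과", "서울 날씨"]
pairs = extract_error_objects(queries)
print(pairs)  # [('브라질 전결과', '브라질전 결과')]
```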

In step S402, the system performs morphological analysis on the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result, respectively.

In step S403, the system compares the first morphological analysis result with the second morphological analysis result.

In step S404, the system ends the process when the two analysis results are identical, and proceeds to step S405 when they are not.

In step S405, the system adds the first error object and the second error object to an error list.

In step S406, the system provides the error list to an inspector or an inspection program that inspects the spacing errors. Based on the error list, the inspector or the inspection program may (1) add entries to the morpheme dictionary, (2) prepare an answer table, or (3) modify the morpheme analyzer.

In this way, spacing errors are automatically extracted in advance from data including a query log and included in an error list, so that spacing errors do not degrade the performance of the morpheme analyzer. This improves the quality of morphological analysis results and search results.
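The comparison in steps S402 to S405 can be sketched as follows. The source does not describe the analyzer's internals, so a toy greedy longest-match segmenter stands in for the morpheme analyzer here; that segmenter, the dictionary contents, and the sample pairs are assumptions made only for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Analysis = List[str]


def make_longest_match_analyzer(dictionary: Set[str]) -> Callable[[str], Analysis]:
    """Toy stand-in for a morpheme analyzer (an assumption, not the patent's analyzer).

    Each space-separated token is segmented by greedy longest dictionary match.
    """
    def analyze(text: str) -> Analysis:
        morphemes: Analysis = []
        for token in text.split():
            i = 0
            while i < len(token):
                # Try the longest dictionary entry starting at position i.
                for j in range(len(token), i, -1):
                    if token[i:j] in dictionary:
                        break
                else:
                    j = i + 1  # no entry matched: emit one character
                morphemes.append(token[i:j])
                i = j
        return morphemes
    return analyze


def collect_spacing_errors(
    pairs: Iterable[Tuple[str, str]],
    analyze: Callable[[str], Analysis],
) -> List[Tuple[str, str]]:
    """S402-S405: keep the pairs whose two analysis results differ."""
    error_list: List[Tuple[str, str]] = []
    for first, second in pairs:
        if analyze(first) != analyze(second):    # S402-S404
            error_list.append((first, second))   # S405
    return error_list


# Hypothetical dictionary and candidate pairs.
analyze = make_longest_match_analyzer(
    {"브라질", "브라질전", "결과", "전결", "과", "서울", "날씨"}
)
pairs = [("브라질전 결과", "브라질 전결과"), ("서울 날씨", "서울날씨")]
errors = collect_spacing_errors(pairs, analyze)
print(errors)  # [('브라질전 결과', '브라질 전결과')]
```

With this toy dictionary the second spacing of the first pair is segmented as 브라질/전결/과, so the pair is reported, while both spacings of the second pair are segmented as 서울/날씨 and the pair is dropped.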

In step S501, the morpheme analyzer performance improvement system searches data associated with morphological analysis for unregistered content that is not registered in a list of owned content. The unregistered content may be (1) a person name or (2) the name of a movie, drama, game, or play that is not registered in the list of owned content.

In step S502, the system proceeds to step S503 when unregistered content is found, and to step S504 when it is not.

In step S503, the system adds the unregistered content that was found to the list of owned content.

In step S504, the system checks whether steps S501 to S503 have been performed for all unregistered content. If so, the system proceeds to step S505; if not, it returns to step S501. Through step S504, all unregistered content included in the data can be added to the list of owned content.

In step S505, the system provides the unregistered content to an inspector or an inspection program that inspects the unregistered content.

In this way, unregistered person names (a person name, an occupation name + person name, or a person name + occupation name) or content names (titles of movies, dramas, games, or plays) are automatically extracted from a query log or newspaper articles to expand the list of owned content.
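The person-name case of steps S501 to S505 can be sketched as below. The source names the patterns "occupation name + person name" and "person name + occupation name" but does not say how they are detected, so this sketch assumes a small set of occupation cue words and only considers two-token queries. The cue words, function names, and sample data are hypothetical.

```python
from typing import Iterable, List, Optional, Set

# Hypothetical occupation cue words: professor, director, actor.
OCCUPATIONS = {"교수", "감독", "배우"}


def candidate_name(query: str) -> Optional[str]:
    """Return the person-name candidate of an 'occupation + name' or
    'name + occupation' query, or None if the query fits neither pattern."""
    tokens = query.split()
    if len(tokens) != 2:
        return None
    first, second = tokens
    if first in OCCUPATIONS and second not in OCCUPATIONS:
        return second
    if second in OCCUPATIONS and first not in OCCUPATIONS:
        return first
    return None


def expand_content_list(queries: Iterable[str], content_list: Set[str]) -> List[str]:
    """S501-S504: add every unregistered name to the list of owned content.

    The returned list is what S505 would hand to an inspector or inspection program.
    """
    newly_added: List[str] = []
    for query in queries:
        name = candidate_name(query)                # S501
        if name is None or name in content_list:    # S502
            continue
        content_list.add(name)                      # S503
        newly_added.append(name)
    return newly_added


# Hypothetical query log and list of owned content.
queries = ["섀튼 교수", "감독 홍길동", "배우 김철수", "서울 날씨"]
owned = {"김철수"}
added = expand_content_list(queries, owned)
print(added)  # ['섀튼', '홍길동']
```

A real system would need a much stronger detector for names and titles; the two-token restriction is used here only to keep the heuristic from picking up ordinary nouns next to an occupation word.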

FIG. 6 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting unregistered words, according to an embodiment of the present invention. As shown in FIG. 6, the morpheme analyzer performance improvement system 600 may include a word extractor 601, a registration checker 602, and a word registrar 603, and may further include a repeater 604.

The word extractor 601 extracts a word included in data associated with morphological analysis. The data may be a query log or a newspaper article, and words may be extracted in the order in which they appear in the query log and the newspaper article.

The registration checker 602 checks, through the morpheme dictionary of the morpheme analyzer, whether the word is registered.

The word registrar 603 adds the word to the morpheme dictionary and registers it when registration of the word is not confirmed.

The repeater 604 causes the word extractor, the registration checker, and the word registrar to operate repeatedly for all words included in the data.

In this way, unregistered words that degrade the performance of the morpheme analyzer are automatically extracted in advance from a query log or newspaper articles and used to reinforce the dictionary of the morpheme analyzer, which improves the quality of morphological analysis results and search results.

FIG. 7 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting spacing errors, according to another embodiment of the present invention. As shown in FIG. 7, the morpheme analyzer performance improvement system 700 may include an error object extractor 701, an analysis result generator 702, an analysis result comparator 703, and an error object adder 704, and may further include an error list provider 705.

The error object extractor 701 extracts, from predetermined data, a first error object and a second error object that cause a spacing error.

The analysis result generator 702 performs morphological analysis on the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result, respectively.

The analysis result comparator 703 compares the first morphological analysis result with the second morphological analysis result.

The error object adder 704 adds the first error object and the second error object to an error list when the compared results differ from each other.

The error list provider 705 provides the error list to an inspector or an inspection program that inspects the spacing errors. Based on the error list, the inspector or the inspection program may (1) add entries to the morpheme dictionary, (2) prepare an answer table, or (3) modify the morpheme analyzer.

In this way, spacing errors are automatically extracted in advance from data including a query log and included in an error list, so that spacing errors do not degrade the performance of the morpheme analyzer. This improves the quality of morphological analysis results and search results.

FIG. 8 is a block diagram illustrating the internal configuration of a system that improves the performance of a morpheme analyzer by automatically extracting content names, according to yet another embodiment of the present invention. As shown in FIG. 8, the morpheme analyzer performance improvement system 800 may include an unregistered content inspector 801 and an unregistered content adder 802, and may further include a repeater 803 and an unregistered content provider 804.

The unregistered content inspector 801 searches data associated with morphological analysis for unregistered content that is not registered in the list of owned content.

The unregistered content adder 802 adds the unregistered content to the list of owned content when such content is found.

The repeater 803 causes the unregistered content inspector and the unregistered content adder to operate repeatedly for all unregistered content included in the data.

The unregistered content provider 804 provides the unregistered content to an inspector or an inspection program that inspects the unregistered content.

In this way, unregistered person names (a person name, an occupation name + person name, or a person name + occupation name) or content names (titles of movies, dramas, games, or plays) are automatically extracted from a query log or newspaper articles to expand the list of owned content.

FIG. 9 is a diagram illustrating an example of a method of inspecting spacing errors, according to another embodiment of the present invention. As shown in FIG. 9, when an error list containing spacing errors has been provided to an inspector or an inspection program, the inspection frame 900 performs morphological analysis on each of the first error object 901 and the second error object 902, generates a search result for each, and compares them. This makes it possible to examine morphological misanalysis caused by the spacing error.

The method of improving the performance of a morpheme analyzer using automatic extraction according to the present invention may be implemented as program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. The medium may also be a transmission medium, such as an optical or metal line or a waveguide, that carries a carrier wave transmitting a signal specifying program instructions, data structures, and the like. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the present invention, and vice versa.

Although the present invention has been described above with reference to specific details such as particular components, limited embodiments, and drawings, these are provided only to aid overall understanding of the invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

Accordingly, the spirit of the present invention should not be confined to the described embodiments. The claims below, and everything equal or equivalent to those claims, fall within the scope of the spirit of the present invention.

According to the present invention, unregistered words that degrade the performance of a morpheme analyzer are automatically extracted in advance from a query log or newspaper articles and used to reinforce the dictionary of the morpheme analyzer, which improves the quality of morphological analysis results and search results.

According to the present invention, spacing errors are automatically extracted from a query log in advance and included in an error list, so that spacing errors do not degrade the performance of the morpheme analyzer. This improves the quality of morphological analysis results and search results.

According to the present invention, unregistered person names (a person name, an occupation name + person name, or a person name + occupation name) or content names (titles of movies, dramas, games, or plays) are automatically extracted from a query log or newspaper articles to expand the list of owned content.

Claims

A method of improving the performance of a morpheme analyzer, the method comprising:

a first step of extracting a word included in data associated with morphological analysis;

a second step of checking, through a morpheme dictionary of the morpheme analyzer, whether the word is registered;

a third step of adding the word to the morpheme dictionary when registration of the word is not confirmed;

a fourth step of repeating the first to third steps for all words included in the data;

a fifth step of extracting, from the data, a first error object and a second error object that cause a spacing error;

a sixth step of performing morphological analysis on the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result, respectively;

a seventh step of comparing the first morphological analysis result with the second morphological analysis result; and

an eighth step of adding the first error object and the second error object to an error list when the compared results differ,

wherein the first error object and the second error object differ in the positions of the space characters they contain and are identical data when the space characters are excluded.

(Deleted)

The method of claim 1,

wherein the second step comprises searching the morpheme dictionary for the vocabulary item and determining whether the vocabulary item is registered based on whether it is found.

A method of improving the performance of a morpheme analyzer, the method comprising:

extracting, from predetermined data, a first error object and a second error object that cause a spacing error;

morphologically analyzing the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result;

comparing the first morphological analysis result and the second morphological analysis result; and

adding the first error object and the second error object to an error list when the comparison shows that the results differ.

The method of claim 4, further comprising:

providing the error list to an inspector or an inspection program that inspects the spacing error,

wherein the inspector or the inspection program, based on the error list, (1) adds entries to the morpheme dictionary, (2) creates a correct-answer table, or (3) modifies the morpheme analyzer.

(Deleted)

A method of improving the performance of a morpheme analyzer, the method comprising:

a first step of searching data associated with morphological analysis for unregistered content that is not registered in a retained content list;

a second step of adding the unregistered content to the retained content list when the unregistered content is found;

a third step of repeating the first step and the second step for all unregistered content included in the data;

a fifth step of morphologically analyzing the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result;

a sixth step of comparing the first morphological analysis result and the second morphological analysis result; and

a seventh step of adding the first error object and the second error object to an error list when the comparison shows that the results differ.

The method of claim 7, further comprising:

an eighth step of providing the unregistered content to an inspector or an inspection program that inspects the unregistered content.

The method of claim 8,

wherein the unregistered content is (1) a person name or (2) the title of a movie, drama, game, or play that is not registered in the retained content list.

The method of claim 1, 4, or 7,

wherein the data includes a query log or a newspaper article.

A computer-readable recording medium having recorded thereon a program for executing the method according to any one of claims 1 to 3 or 7 to 9.

A system for improving the performance of a morpheme analyzer, the system comprising:

a vocabulary extraction unit that extracts a vocabulary item included in data associated with morphological analysis;

a registration confirmation unit that checks, through a morpheme dictionary of the morpheme analyzer, whether the vocabulary item is registered;

a vocabulary registration unit that registers the vocabulary item by adding it to the morpheme dictionary when registration of the vocabulary item is not confirmed;

an error object extraction unit that extracts, from the data, a first error object and a second error object that cause a spacing error;

an analysis result generation unit that morphologically analyzes the first error object and the second error object to generate a first morphological analysis result and a second morphological analysis result;

an analysis result comparison unit that compares the first morphological analysis result and the second morphological analysis result; and

an error object addition unit that adds the first error object and the second error object to an error list when the comparison shows that the results differ,

wherein the first error object and the second error object are data that differ in the position of the space character they contain and are identical apart from the space character.

The system of claim 12, further comprising:

a repetition unit that repeatedly operates the vocabulary extraction unit, the registration confirmation unit, and the vocabulary registration unit for all vocabulary items included in the data.

A system for improving the performance of a morpheme analyzer, the system comprising:

an error object extraction unit that extracts, from predetermined data, a first error object and a second error object that cause a spacing error.

The system of claim 14, further comprising:

an error list providing unit that provides the error list to an inspector or an inspection program that inspects the spacing error,

wherein the inspector or the inspection program, based on the error list, (1) adds entries to the morpheme dictionary, (2) creates a correct-answer table, or (3) modifies the morpheme analyzer.

A system for improving the performance of a morpheme analyzer, the system comprising:

an unregistered content search unit that searches data associated with morphological analysis for unregistered content that is not registered in a retained content list; and

an unregistered content addition unit that adds the unregistered content to the retained content list when the unregistered content is found.

The system of claim 16, further comprising:

a repetition unit that repeatedly operates the unregistered content search unit and the unregistered content addition unit for all unregistered content included in the data; and

an unregistered content providing unit that provides the unregistered content to an inspector or an inspection program that inspects the unregistered content.