KR101740330B1

KR101740330B1 - Apparatus and method for correcting multilanguage morphological error based on co-occurrence information

Info

Publication number: KR101740330B1
Application number: KR1020130122054A
Authority: KR
Inventors: 김창현; 김영길; 권오욱; 나승훈; 노윤형; 서영애; 이기영; 정상근; 최승권; 김운; 박은진; 신종훈; 황금하
Original assignee: 한국전자통신연구원
Priority date: 2013-10-14
Filing date: 2013-10-14
Publication date: 2017-05-29
Also published as: KR20150043065A

Abstract

An apparatus and method for correcting multilingual morphological analysis errors based on word co-occurrence information are disclosed. The method includes constructing an ambiguity candidate dictionary for words in which ambiguity occurs, extracting co-occurrence information for those ambiguous words from a large-scale raw corpus, generating a morphological analysis result for an input sentence, and correcting the morphological analysis result based on the co-occurrence information.

Description

APPARATUS AND METHOD FOR CORRECTING MULTILANGUAGE MORPHOLOGICAL ERROR BASED ON CO-OCCURRENCE INFORMATION

The present invention relates to an apparatus and method for correcting multilingual morphological analysis errors based on word co-occurrence information. In particular, by introducing a co-occurrence-based morphological analysis method, the invention provides error correction that does not depend on any specific language.

The most widely used morphological analysis methods today are statistical, dictionary-based approaches. They combine probability information learned from a part-of-speech (POS) tagged corpus with a separate, manually built lexical dictionary.

Most of these conventional methods are word-based: a part of speech is attached to each individual word or morpheme.

However, because tagging is performed word by word, a highly ambiguous word greatly increases the ambiguity of the part-of-speech decision for that word, which can in turn significantly degrade tagging performance.

In particular, in languages such as Korean, where two or more morphemes combine to form a single eojeol (a space-delimited word unit), word ambiguity is considerably higher. The resulting complexity of morpheme segmentation and tagging can degrade morphological analysis performance.

To address this problem, existing methodologies proposed building a pre-analyzed dictionary for eojeols or partial eojeols, which are longer than single words. Given an input sentence, POS tagging is performed by combining the candidate analyses that the pre-analyzed dictionary provides for each eojeol.

However, when the pre-analyzed dictionary holds two or more candidates for an eojeol, these eojeol-based and partial-eojeol-based methods cannot define an integrated probabilistic framework, such as the decoder used in statistical methods, for combining the pre-analyzed results into an optimal analysis. As a result, the search for the optimal analysis is heuristic rather than mathematically concise, and diverse features cannot be integrated effectively.

Moreover, even with these methodologies, the amount of context that can be considered during analysis is limited, so there remain cases in which it is difficult to produce the correct analysis.

For example, consider a sentence containing a highly ambiguous word, such as "산에 가느니 집에서 쉬자" ("Rather than going to the mountain, let's rest at home").

In this sentence, the eojeol "가느니" admits three analyses: "가늘+니" (from 가늘다, "to be thin"), "갈+느니" (from 갈다, "to grind" or "to plow"), and "가+느니" (from 가다, "to go").

However, it is hard to decide which of these three candidates is correct without understanding the meaning of the sentence, and existing pre-analyzed dictionaries or probabilistic methods struggle to resolve such cases.

Beyond these problems, the way new processing methods have to be coupled into an existing morphological analysis pipeline also hinders multilingual extensibility.

In addition, Korean Patent No. 0474823 describes a part-of-speech tagging apparatus and method for natural language, but it is limited in that the error correction data must be prepared manually in advance.

Accordingly, there is a need for a technique that overcomes the technical limitations of the existing approaches described above in terms of both POS tagging efficiency and multilingual extensibility.

An object of the present invention is to overcome the limits on multilingual extensibility and to improve performance over existing morphological analysis by introducing a co-occurrence-based morphological analysis method.

Another object of the present invention is to provide a technique that does not depend on a specific morphological analysis methodology or a specific language, because it operates as a correction applied to the morphological analysis result after analysis has been performed.

To achieve the above objects, a method for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention includes constructing an ambiguity candidate dictionary for words in which ambiguity occurs, extracting co-occurrence information for the ambiguous words from a large-scale raw corpus, generating a morphological analysis result for an input sentence, and correcting the morphological analysis result based on the co-occurrence information.
The ambiguity candidate dictionary may include ambiguous words found in the large-scale raw corpus together with either the particle or the ending corresponding to each ambiguous word.
Generating the morphological analysis result may produce the optimal analysis result among the candidates generated using the words in the input sentence and the ambiguity candidate dictionary.
The large-scale raw corpus may be extracted from at least one of web sites, blogs, and newspaper articles.
Correcting the morphological analysis result may extract, from the morphological analysis result, the eojeols present in the ambiguity candidate dictionary and correct the result for the extracted eojeols based on a co-occurrence information dictionary.
The co-occurrence information dictionary may extract, from the morphological analysis results of sentences, the words for which neither structural nor morphological ambiguity was found, and store those words.
In addition, an apparatus for correcting multilingual morphological analysis errors based on word co-occurrence information according to an embodiment of the present invention includes: a morphological analysis candidate generation module that generates analysis candidates for words in which ambiguity occurs in an input sentence and generates a morphological analysis result for each of the analysis candidates; an optimal solution selection module that selects one analysis result using the morphological analysis result of each analysis candidate and an ambiguity candidate dictionary generated from ambiguous words extracted from a large-scale raw corpus; and a co-occurrence-information-based correction module that corrects the morphological analysis result based on the selected analysis result and the co-occurrence information.
In the apparatus, the ambiguity candidate dictionary may include ambiguous words found in the large-scale raw corpus together with either the particle or the ending corresponding to each ambiguous word.
The large-scale raw corpus may be extracted from at least one of web sites, blogs, and newspaper articles.
The co-occurrence-information-based correction module may extract, from the morphological analysis result, the eojeols present in the ambiguity candidate dictionary and correct the result for the extracted eojeols based on a co-occurrence information dictionary.

According to the present invention, introducing a co-occurrence-based morphological analysis method overcomes the limits on multilingual extensibility and improves performance over existing morphological analysis.

In addition, because the invention takes the form of a correction applied to the morphological analysis result after analysis, it provides a technique that does not depend on a specific morphological analysis methodology or a specific language.

FIG. 1 is a flowchart of the method for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention.
FIG. 2 is a block diagram of the apparatus for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention.
FIG. 3 shows an embodiment of error correction when a morphological analysis error has occurred according to the present invention.

The present invention is described in detail below with reference to the accompanying drawings. Repeated descriptions, and detailed descriptions of known functions and configurations that could unnecessarily obscure the gist of the invention, are omitted.

The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art.

Accordingly, the shapes and sizes of elements in the drawings may be exaggerated for clarity.

FIG. 1 is a flowchart of the method for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention.

Referring to FIG. 1, the method includes constructing an ambiguity candidate dictionary for words in which ambiguity occurs (S10), extracting co-occurrence information for the ambiguous words from a large-scale raw corpus (S20), generating a morphological analysis result for an input sentence (S30), and correcting the morphological analysis result based on the co-occurrence information (S40).

FIG. 2 is a block diagram of the apparatus for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention. FIG. 3 shows an embodiment of error correction when a morphological analysis error has occurred according to the present invention.

Referring to FIG. 2, the apparatus (100) for correcting multilingual morphological analysis errors based on word co-occurrence information includes a morphological analysis candidate generation module (110), an optimal solution selection module (120), and a co-occurrence-information-based correction module (130).

The morphological analysis candidate generation module (110) generates all possible analysis candidates when part of the input sentence can be analyzed ambiguously.

The optimal solution selection module (120) selects, from all possible analysis candidates, the single most probable analysis result.

The co-occurrence-information-based correction module (130) verifies the morphological analysis result produced by the optimal solution selection module (120) against co-occurrence information and, if it determines that the result contains an error, corrects that error.
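The division of labor among the three modules can be sketched as a simple composition of functions. This is only an illustrative skeleton: `run_pipeline`, the type aliases, and the stub callables are hypothetical names, and the patent does not prescribe any particular interface.

```python
from typing import Callable, List

Candidates = List[List[str]]  # candidate analyses for each eojeol
Analysis = List[str]  # one chosen analysis per eojeol


def run_pipeline(
    sentence: str,
    generate: Callable[[str], Candidates],  # stands in for module 110
    select: Callable[[Candidates], Analysis],  # stands in for module 120
    correct: Callable[[Analysis], Analysis],  # stands in for module 130
) -> Analysis:
    """Generate candidates, pick one analysis per eojeol, then verify and correct it."""
    return correct(select(generate(sentence)))


# Toy stubs, only to show how the stages connect; they perform no real analysis.
toy_result = run_pipeline(
    "산에 가자",
    generate=lambda s: [[word + "/?"] for word in s.split()],
    select=lambda candidates: [options[0] for options in candidates],
    correct=lambda analysis: analysis,
)
```

Because correction is a separate final stage, the first two callables could in principle be replaced by any existing analyzer, which is the language-independence argument made in the text.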

The overall operation of the method for correcting multilingual morphological analysis errors based on word co-occurrence information is described below with reference to FIGS. 2 and 3.

In step S10, an ambiguity candidate dictionary is constructed for words in which ambiguity occurs.

Specifically, referring to FIG. 2, the morphological analysis candidate generation module, that is, the part of morphological analysis that precedes the optimal solution selection module, is run over a large raw corpus composed of web pages, blogs, newspaper articles, and the like, and an ambiguous-word dictionary covering the ambiguous words is built from its output. Where an input sentence admits ambiguous analyses, the module generates all possible analysis candidates; Table 1 below shows an example.

Table 1

Input eojeol: 가는지
Analysis candidates: 가/V+는지/E, 갈/V+는지/E, 가늘/A+ㄴ지/E

Input eojeol: 가느니
Analysis candidates: 가늘/A+니/E, 갈/V+느니/E, 가/V+느니/E

Input eojeol: 사는
Analysis candidates: 사/N+는/J, 사/V+는/E, 살/V+는/E

Input eojeol: 주는
Analysis candidates: 주/N+는/J, 주/V+는/E, 줄/V+는/E

V: verb, E: ending, N: noun, J: particle

That is, the input eojeol '가는지' admits three ambiguous analyses: '가/V+는지/E', '갈/V+는지/E', and '가늘/A+ㄴ지/E'.

Likewise, the input eojeol '가느니' admits three ambiguous analyses: '가늘/A+니/E', '갈/V+느니/E', and '가/V+느니/E'.

The input eojeol '사는' admits three ambiguous analyses, '사/N+는/J', '사/V+는/E', and '살/V+는/E', and the input eojeol '주는' admits three as well: '주/N+는/J', '주/V+는/E', and '줄/V+는/E'.

The analysis candidates generated for each eojeol as above are stored in the ambiguous-word dictionary as in the following example.

Example: Key: 가는지, Value: 가/V+는지/E, 갈/V+는지/E, 가늘/A+ㄴ지/E
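The key/value layout above maps naturally onto an associative array. The following is a minimal Python sketch rather than the patent's implementation: the function name `build_ambiguity_dictionary` and the `(eojeol, candidates)` input format are assumptions made for illustration. It keeps only the eojeols for which the candidate generator returned more than one analysis.

```python
from typing import Dict, Iterable, List, Tuple


def build_ambiguity_dictionary(
    analyses: Iterable[Tuple[str, List[str]]],
) -> Dict[str, List[str]]:
    """Collect eojeols that have two or more candidate analyses.

    `analyses` yields (eojeol, candidates) pairs, standing in for the
    output of the candidate generation module over a raw corpus.
    """
    dictionary: Dict[str, List[str]] = {}
    for eojeol, candidates in analyses:
        if len(candidates) < 2:
            continue  # unambiguous eojeols are not stored
        stored = dictionary.setdefault(eojeol, [])
        for candidate in candidates:
            if candidate not in stored:
                stored.append(candidate)
    return dictionary


# Candidate lists taken from Table 1, plus one unambiguous eojeol.
corpus_analyses = [
    ("가는지", ["가/V+는지/E", "갈/V+는지/E", "가늘/A+ㄴ지/E"]),
    ("집에", ["집/N+에/J"]),
    ("사는", ["사/N+는/J", "사/V+는/E", "살/V+는/E"]),
]
ambiguity_dictionary = build_ambiguity_dictionary(corpus_analyses)
```

Here `ambiguity_dictionary["가는지"]` holds the three candidates from Table 1, while '집에' is left out because it has a single analysis.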

In step S20, co-occurrence information is extracted from the large-scale raw corpus for the words in which ambiguity occurs.

Specifically, the following describes how the co-occurrence information dictionary and the co-occurrence information are extracted; these are used in step S40, described later, to correct the morphological analysis result.

Here, the co-occurrence information dictionary is a dictionary whose entries take the form 'noun+case-particle predicate'. For example, an entry such as '사과/N+를/J 사/V+다/E' ("buy an apple") belongs to the co-occurrence information dictionary.

The co-occurrence information dictionary is built differently from the ambiguous-word dictionary: it is constructed only from analysis results that have no morphological or structural ambiguity. Table 2 below gives a concrete example.

Table 2

Sentence: 산에 가느니 집에 가자. ("Rather than going to the mountain, let's go home.")

Eojeol: 산에
Analysis candidates: 산/N+에

Eojeol: 가느니
Analysis candidates: 가늘/A+니/E, 갈/V+느니/E, 가/V+느니/E

Eojeol: 집에
Analysis candidates: 집/N+에/J

Eojeol: 가자
Analysis candidates: 가/V+자/E

Referring to Table 2, '산+에 가늘/갈/가' is excluded from co-occurrence extraction because it is morphologically ambiguous, and only '집+에 가' is selected as co-occurrence information.

When selecting co-occurrence information, various heuristics can be applied to determine whether structural ambiguity is present. Examples of such heuristics are the patterns 'noun+case-particle connective-ending', 'noun+case-particle adverb connective-ending', 'noun+case-particle final-ending', and 'noun+case-particle adverb final-ending'.
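The exclusion rule illustrated by Table 2 can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names `parse` and `extract_cooccurrences` are hypothetical, only directly adjacent eojeol pairs are considered (the adverb patterns listed above are not handled), and the count attached to each entry is an assumed bookkeeping choice.

```python
from collections import Counter
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (surface form, tag)


def parse(analysis: str) -> List[Morpheme]:
    """Split 'form/TAG+form/TAG' into (form, tag) pairs; a missing tag becomes ''."""
    morphemes = []
    for part in analysis.split("+"):
        form, _, tag = part.partition("/")
        morphemes.append((form, tag))
    return morphemes


def extract_cooccurrences(sentence: List[List[str]]) -> Counter:
    """Count 'noun+particle predicate' pairs from adjacent unambiguous eojeols.

    `sentence` holds, for each eojeol, its list of candidate analyses.
    """
    counts: Counter = Counter()
    for left, right in zip(sentence, sentence[1:]):
        if len(left) != 1 or len(right) != 1:
            continue  # morphologically ambiguous: excluded from extraction
        noun_part, predicate_part = parse(left[0]), parse(right[0])
        is_noun_particle = (
            len(noun_part) == 2
            and noun_part[0][1] == "N"
            and noun_part[1][1] == "J"
        )
        is_predicate = predicate_part[0][1] in ("V", "A")
        if is_noun_particle and is_predicate:
            stem, tag = predicate_part[0]
            counts[f"{left[0]} {stem}/{tag}"] += 1
    return counts


# Candidate lists exactly as given in Table 2.
table2 = [
    ["산/N+에"],
    ["가늘/A+니/E", "갈/V+느니/E", "가/V+느니/E"],
    ["집/N+에/J"],
    ["가/V+자/E"],
]
table2_counts = extract_cooccurrences(table2)
```

For the Table 2 sentence, only '집/N+에/J 가/V' is produced, since every pair involving '가느니' is skipped.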

Table 3 below shows another example of extracting co-occurrence information.

Table 3

Sentence: 화가 나더라도 화를 참자. ("Even if you get angry, hold back your anger.")

Eojeol: 화가
Analysis candidates: 화/N+가/J, 화가/N

Eojeol: 나더라도
Analysis candidates: 나/V+더라도/E

Eojeol: 화를
Analysis candidates: 화/N+를/J

Eojeol: 참자
Analysis candidates: 참/V+자

Referring to Table 3, '화가' has two competing analyses, '화+가' ("anger" plus a subject particle) and '화가' ("painter"), so it is initially excluded from co-occurrence extraction. That is, only '화/N+를/J 참/V' is extracted as co-occurrence information.

However, if only unambiguous cases are extracted, co-occurrence information such as '화+가 나다' ("to get angry") can never be built.

In such cases, co-occurrence information can be extracted by applying the following two methods.

1. Extracting co-occurrence information when the ambiguity can be resolved from context

In the sentence '화가 나더라도 화를 참자' above, '화가' is ambiguous between '화+가' and '화가'. However, because the noun '화' clearly appears elsewhere in the sentence, '화+가' can be judged more likely than '화가', so '화+가 나다' can be extracted as co-occurrence information.

Various heuristics can serve as the basis for this disambiguation. For example: 1) extract co-occurrence information when the disambiguating evidence is in the current sentence; 2) extract it when the evidence is in a preceding sentence; 3) extract it when the evidence is in a following sentence.
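The first heuristic, evidence within the current sentence, can be sketched as below. The function names and the specific notion of evidence (a noun that leads an unambiguous eojeol elsewhere in the sentence) are assumptions made for this illustration; heuristics 2 and 3, which look at neighboring sentences, are not covered.

```python
from typing import List, Optional


def leading_noun(analysis: str) -> Optional[str]:
    """Return the first morpheme's form if it is tagged N, else None."""
    form, _, tag = analysis.split("+")[0].partition("/")
    return form if tag == "N" else None


def resolve_by_context(sentence: List[List[str]], index: int) -> Optional[str]:
    """Pick a candidate for the ambiguous eojeol at `index` using in-sentence evidence.

    Evidence is any noun that leads an unambiguous eojeol elsewhere in the
    sentence. A candidate is returned only if exactly one is supported.
    """
    evidence = set()
    for position, candidates in enumerate(sentence):
        if position != index and len(candidates) == 1:
            noun = leading_noun(candidates[0])
            if noun is not None:
                evidence.add(noun)
    supported = [c for c in sentence[index] if leading_noun(c) in evidence]
    return supported[0] if len(supported) == 1 else None


# Candidate lists exactly as given in Table 3.
table3 = [
    ["화/N+가/J", "화가/N"],
    ["나/V+더라도/E"],
    ["화/N+를/J"],
    ["참/V+자"],
]
resolved = resolve_by_context(table3, 0)
```

With the Table 3 sentence, the unambiguous '화/N+를/J' supplies the noun '화', so the candidate '화/N+가/J' is chosen for '화가'; without such evidence the function returns `None` and the eojeol stays excluded.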

2. Extracting co-occurrence information that includes auxiliary particles

The extraction method described above considers only case particles. If auxiliary particles are also included, co-occurrence information that is hard to extract because of ambiguity, such as '화+가 나다', can be obtained.

For example, from a sentence such as '화만 났다' ("only got angry"), co-occurrence information of the form '화+만 나다' can be extracted, which provides information for resolving the ambiguity between '화+가' and '화가'.

Examples of the extraction heuristics that include auxiliary particles are the patterns 1) 'noun+auxiliary-particle (도/만) connective-ending', 2) 'noun+auxiliary-particle final-ending', and 3) 'noun+auxiliary-particle adverb final-ending'.

The co-occurrence information extracted in this way is stored in the co-occurrence information dictionary in the following forms.

Lexical co-occurrence information 1: noun+particle predicate

집/N+에/J 가/V 100

Lexical co-occurrence information 2: noun predicate

집/N 가/V 150

Semantic co-occurrence information 1: noun-sense+particle predicate

$장소+에 가/V 200

Semantic co-occurrence information 2: noun-sense predicate

$장소 가/V 300

Semantic co-occurrence information is generated from the lexical co-occurrence information by referring to the sense of the noun in each lexical entry. Nouns whose sense is ambiguous are excluded; semantic co-occurrence information is generated only for nouns whose sense is unambiguous.
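The four entry forms, and the rule that sense-ambiguous nouns yield no semantic entries, can be sketched as follows. Several things here are assumptions: the function name `add_entries`, the toy sense inventory `noun_senses`, and the treatment of the trailing number in each stored entry as an occurrence count (the text does not say what the number represents).

```python
from collections import Counter
from typing import Dict, List


def add_entries(
    counts: Counter,
    noun: str,
    particle: str,
    predicate: str,
    noun_senses: Dict[str, List[str]],
) -> None:
    """Add lexical entries and, where the noun sense is unambiguous, semantic entries."""
    counts[f"{noun}/N+{particle}/J {predicate}/V"] += 1  # lexical 1
    counts[f"{noun}/N {predicate}/V"] += 1  # lexical 2
    senses = noun_senses.get(noun, [])
    if len(senses) == 1:  # nouns with ambiguous senses yield no semantic entries
        counts[f"${senses[0]}+{particle} {predicate}/V"] += 1  # semantic 1
        counts[f"${senses[0]} {predicate}/V"] += 1  # semantic 2


# Hypothetical sense inventory: '집' has one sense, '배' has several.
noun_senses = {"집": ["장소"], "배": ["과일", "탈것", "신체"]}

cooccurrence_counts: Counter = Counter()
add_entries(cooccurrence_counts, "집", "에", "가", noun_senses)
add_entries(cooccurrence_counts, "배", "를", "타", noun_senses)
```

After these two calls the dictionary holds all four forms for '집' (including '$장소+에 가/V' and '$장소 가/V') but only the two lexical forms for '배'.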

Referring to FIG. 2, in step S30 a morphological analysis result is generated for the input sentence through the optimal solution selection module (120).

The optimal solution selection module (120) selects the single most probable analysis result from among all possible analysis candidates. Table 4 below shows an example of the selection result.

Table 4

Input sentence: 산에 가느니 집에 가자.

Eojeol: 산에
Candidate generation module: 산/N+에
Optimal solution selection module: 산/N+에

Eojeol: 가느니
Candidate generation module: 가늘/A+니/E, 갈/V+느니/E, 가/V+느니/E
Optimal solution selection module: 갈/V+느니/E

Eojeol: 집에
Candidate generation module: 집/N+에/J
Optimal solution selection module: 집/N+에/J

Eojeol: 가자
Candidate generation module: 가/V+자/E
Optimal solution selection module: 가/V+자/E

Specifically, for the eojeols where the morphological analysis candidate generation module (110) produced a single analysis, namely '산/N+에', '집/N+에/J', and '가/V+자/E', the optimal solution selection module (120) selects that analysis as the optimal solution. Among the three analyses that the candidate generation module (110) produced for '가느니', namely '가늘/A+니/E', '갈/V+느니/E', and '가/V+느니/E', it selects '갈/V+느니/E'.

As noted above, various methodologies exist for selecting the optimal solution in the optimal solution selection module (120). In general, a range of machine learning methods can be used: the parameters used for selection are learned from a POS-tagged corpus and then applied in real time.
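Since the text leaves the selection method open, the following sketch substitutes the simplest possible stand-in: relative frequency of each analysis in a tagged corpus. It is not the method of the patent. The names `train` and `select_optimal` and the tiny training corpus are hypothetical, chosen only so that the output matches the selection shown in Table 4.

```python
from collections import Counter
from typing import List


def train(tagged_corpus: List[List[str]]) -> Counter:
    """Count how often each analysis appears in a POS-tagged corpus."""
    return Counter(analysis for sentence in tagged_corpus for analysis in sentence)


def select_optimal(candidates: List[List[str]], model: Counter) -> List[str]:
    """Choose, for each eojeol, the candidate seen most often in training.

    Ties (including all-unseen candidates) fall back to the first candidate.
    """
    return [max(options, key=lambda analysis: model[analysis]) for options in candidates]


# Hypothetical toy training data, not taken from the source.
model = train([["갈/V+느니/E"], ["갈/V+느니/E", "가/V+느니/E"]])

# Candidate lists as given in Table 4.
table4_candidates = [
    ["산/N+에"],
    ["가늘/A+니/E", "갈/V+느니/E", "가/V+느니/E"],
    ["집/N+에/J"],
    ["가/V+자/E"],
]
selected = select_optimal(table4_candidates, model)
```

A purely frequency-driven choice like this ignores the surrounding words, which is the kind of limitation the later co-occurrence-based correction step is meant to compensate for.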

Continuing with FIG. 2, in step S40 the morphological analysis result is corrected based on co-occurrence information through the co-occurrence-information-based correction module (130).

The co-occurrence-information-based correction module (130) verifies the morphological analysis result produced by the optimal solution selection module (120) against co-occurrence information and, if it determines that the result contains an error, corrects that error. The following shows an example of correcting a morphological analysis error in a predicate through this module.

The co-occurrence-information-based correction module (130) 1) searches for eojeols that are present in the ambiguous-word dictionary, and 2) for each such eojeol, consults the co-occurrence information dictionary and corrects the analysis if a morphological analysis error has occurred.

The entries of the co-occurrence information dictionary are weighted in descending order: 'noun+particle predicate' > 'noun predicate' > '$noun-sense+particle predicate'.
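The ordering above can be sketched as a lookup that returns the weight of the strongest matching entry type. The numeric weights 3, 2, and 1 are assumptions (the text gives only the ordering), and `evidence_weight`, the entry set, and the sense table are hypothetical names and data used for illustration.

```python
from typing import Dict, Optional, Set


def evidence_weight(
    noun: str,
    particle: Optional[str],
    predicate: str,
    entries: Set[str],
    senses: Dict[str, str],
) -> int:
    """Return 3, 2, 1 or 0 according to the strongest matching entry type."""
    if particle and f"{noun}/N+{particle}/J {predicate}/V" in entries:
        return 3  # 'noun+particle predicate'
    if f"{noun}/N {predicate}/V" in entries:
        return 2  # 'noun predicate'
    sense = senses.get(noun)
    if particle and sense and f"${sense}+{particle} {predicate}/V" in entries:
        return 1  # '$noun-sense+particle predicate'
    return 0  # no supporting entry


# Entries in the forms shown earlier; the sense table is a toy assumption.
entries = {"집/N+에/J 가/V", "집/N 가/V", "$장소+에 가/V"}
senses = {"집": "장소", "학교": "장소"}

weights = [
    evidence_weight("집", "에", "가", entries, senses),
    evidence_weight("집", "으로", "가", entries, senses),
    evidence_weight("학교", "에", "가", entries, senses),
    evidence_weight("학교", "를", "가", entries, senses),
]
```

When several candidate analyses of an eojeol are supported, comparing these weights gives one way to prefer exact lexical evidence over the more general semantic evidence.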

FIG. 3 shows an embodiment of error correction when a morphological analysis error has occurred according to the present invention.

An example of correcting an error in the morphological analysis of a noun is described with reference to FIGS. 2 and 3.

The result of passing the input sentence through only the morphological analysis candidate generation module (110) and the optimal solution selection module (120) is called the pre-correction result (220), and the result after the co-occurrence-information-based correction module is called the post-correction result (240).

For example, suppose the pre-correction result is '화가/N 나/V+더라도/E 화/N+를/J 참/V+자/E'. The co-occurrence-information-based correction module looks up '화가' in the ambiguity candidate dictionary (111) and retrieves the ambiguity candidates '화가/N' and '화/N+가/J'. It then consults the entries '화/N+도/J 나/V' and '화/N 나/V' in the co-occurrence information dictionary (112) for that eojeol, detects that a morphological analysis error has occurred, and corrects it.

Accordingly, the post-correction result (240) is '화/N+가/J 나/V+더라도/E 화/N+를/J 참/V+자/E'.
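The worked example can be reproduced with a short sketch. The matching rule used here, that a candidate is supported if either the full candidate or its leading noun co-occurs with the following predicate stem, is an assumption chosen to fit this example; `head` and `correct` are hypothetical names, and weighting between entry types is omitted.

```python
from typing import Dict, List, Set


def head(analysis: str) -> str:
    """Return the first morpheme with its tag, e.g. '화/N' from '화/N+가/J'."""
    return analysis.split("+")[0]


def correct(
    result: List[str],
    surfaces: List[str],
    ambiguity_dictionary: Dict[str, List[str]],
    cooccurrence_entries: Set[str],
) -> List[str]:
    """Replace an analysis when exactly one other candidate is backed by an entry."""
    corrected = list(result)
    for i in range(len(result) - 1):
        candidates = ambiguity_dictionary.get(surfaces[i])
        if not candidates:
            continue  # only eojeols in the ambiguity candidate dictionary are checked
        predicate = head(result[i + 1])
        supported = [
            c
            for c in candidates
            if f"{c} {predicate}" in cooccurrence_entries
            or f"{head(c)} {predicate}" in cooccurrence_entries
        ]
        if len(supported) == 1 and supported[0] != result[i]:
            corrected[i] = supported[0]
    return corrected


# Data taken from the example in the text.
surfaces = ["화가", "나더라도", "화를", "참자"]
before = ["화가/N", "나/V+더라도/E", "화/N+를/J", "참/V+자/E"]
ambiguity_dictionary = {"화가": ["화가/N", "화/N+가/J"]}
cooccurrence_entries = {"화/N+도/J 나/V", "화/N 나/V"}

after = correct(before, surfaces, ambiguity_dictionary, cooccurrence_entries)
```

Here the entry '화/N 나/V' supports '화/N+가/J' and nothing supports '화가/N', so `after` matches the post-correction result given above.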

As described above, the apparatus and method for correcting multilingual morphological analysis errors based on word co-occurrence information according to the present invention are not limited to the configurations and methods of the embodiments described; all or part of the embodiments may be selectively combined so that various modifications can be made.

100: Apparatus for correcting multilingual morphological analysis errors based on word co-occurrence information
110: Morphological analysis candidate generation module
120: Optimal solution selection module
130, 230: Co-occurrence-information-based correction module
220: Pre-correction result
240: Post-correction result
111: Ambiguity candidate dictionary
112: Co-occurrence information dictionary

Claims

A method for correcting multilingual morphological analysis errors based on word co-occurrence information, performed by an apparatus for correcting multilingual morphological analysis errors, the method comprising:
constructing an ambiguity candidate dictionary for words in which ambiguity occurs;
extracting co-occurrence information for the ambiguous words from a large-scale raw corpus;
generating a morphological analysis result for an input sentence; and
correcting the morphological analysis result based on the co-occurrence information,
wherein correcting the morphological analysis result comprises extracting, from the morphological analysis result, an eojeol present in the ambiguity candidate dictionary, and correcting the morphological analysis result for the extracted eojeol based on a co-occurrence information dictionary.

The method of claim 1,
wherein the ambiguity candidate dictionary includes ambiguous words present in the large-scale raw corpus and either a particle or an ending corresponding to each ambiguous word.

The method of claim 2,
wherein generating the morphological analysis result produces the optimal analysis result among candidates generated using the words present in the input sentence and the ambiguity candidate dictionary.

The method of claim 3,
wherein the large-scale raw corpus is extracted from at least one of a web site, a blog, and a newspaper article.

(Deleted)

The method of claim 1,
wherein the co-occurrence information dictionary extracts, from the morphological analysis results of sentences, words for which neither structural ambiguity nor morphological ambiguity is found, and stores the extracted words.

An apparatus for correcting multilingual morphological analysis errors based on word co-occurrence information, the apparatus comprising:
a morphological analysis candidate generation module that generates analysis candidates for a word in which ambiguity occurs in an input sentence and generates a morphological analysis result for each of the analysis candidates;
an optimal solution selection module that selects one morphological analysis result using the morphological analysis result of each analysis candidate and an ambiguity candidate dictionary generated from ambiguous words extracted from a large-scale raw corpus; and
a co-occurrence-information-based correction module that extracts, from the selected morphological analysis result, an eojeol present in the ambiguity candidate dictionary and corrects the morphological analysis result for the extracted eojeol based on a co-occurrence information dictionary.

The apparatus of claim 7,
wherein the ambiguity candidate dictionary includes ambiguous words present in the large-scale raw corpus and either a particle or an ending corresponding to each ambiguous word.

The apparatus of claim 8,
wherein the large-scale raw corpus is extracted from at least one of a web site, a blog, and a newspaper article.

(Deleted)