KR101573854B1

KR101573854B1 - Method and system for statistical context-sensitive spelling correction using probability estimation based on relational words

Info

Publication number: KR101573854B1
Application number: KR1020140089089A
Authority: KR
Inventors: 권혁철; 윤애선; 김민호; 서한영
Original assignee: 부산대학교 산학협력단
Priority date: 2014-07-15
Filing date: 2014-07-15
Publication date: 2015-12-02

Abstract

The present invention relates to context-sensitive spelling error correction. More specifically, it relates to an apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation. Correction is performed using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

Description

Apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation

The present invention relates to the correction of context-sensitive spelling errors. More specifically, it relates to an apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation, in which correction is performed using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

The information environment in which computers, the Internet, and smartphones have converged has created new channels of information distribution, including social network services (SNS), and everyone has become both a producer and a consumer of information.

As a result, spelling errors in documents, whether caused by mistakes, intent, or ignorance, continue to increase.

In addition, input errors take different forms with different characteristics depending on the input environment, such as the two-set (Dubeolsik) keyboard, the three-set (Sebeolsik) keyboard, smartphones, and feature phones (ordinary mobile phones). Moreover, with internationalization, including the Korean Wave and the rise in international marriages, the number of foreigners who use or learn Korean is growing rapidly.

These changes are increasing the demand for better Korean document proofreaders. However, existing rule-based spell-checking technology cannot produce a proofreader that adapts to these changes.

The main reason is that, although context-sensitive spelling errors are now an important problem to solve, existing proofreaders take a rule-based approach and can therefore correct only the stereotyped context-sensitive errors that Korean users frequently make.

In general, erroneous words in Korean sentences fall into two broad types: non-word spelling errors and context-sensitive spelling errors.

The former is an error that uses a word not listed in the dictionary, such as '결죄'. Such errors can be found easily by morphological analysis of the text alone.

The latter, by contrast, can be identified as an error only by considering the semantic and syntactic relations of the context, as with '결재' in '요금 결재' (where '결재', "approval", is written for '결제', "payment").

Methods for correcting context-sensitive spelling errors are broadly divided into rule-based methods and statistical methods.

Rule-based methods are further divided into those in which people write the rules by hand and those that use machine learning.

Statistical methods for detecting and correcting context-sensitive spelling errors have been studied actively for English, and fall into the following three broad groups.

The first uses correction word pairs and basically applies the same methodology as word sense disambiguation (WSD).

That is, a word belonging to a correction word pair is treated as ambiguous and the ambiguity is resolved statistically. If the result is the same as the original word, the spelling is considered correct; otherwise it is considered a context-sensitive spelling error.
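The decision rule described here can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the function names, the toy counts, and the choice of a summed maximum-likelihood conditional probability as the score are all invented for the example.

```python
from collections import Counter

def choose_word(original, pair, context, cooc, unigram):
    """Pick the member of `pair` best supported by the context words."""
    def score(candidate):
        if unigram[candidate] == 0:
            return 0.0
        # Sum of MLE estimates P(c | candidate) over the context words.
        return sum(cooc[(candidate, c)] / unigram[candidate] for c in context)

    # Put the original word first so that ties keep the user's spelling
    # (max returns the first maximal element it encounters).
    candidates = sorted(pair, key=lambda w: w != original)
    return max(candidates, key=score)

def is_context_error(original, pair, context, cooc, unigram):
    """True if disambiguation selects a different word than the one written."""
    return choose_word(original, pair, context, cooc, unigram) != original

# Toy counts, invented for illustration only.
unigram = Counter({"결제": 100, "결재": 100})
cooc = Counter({("결제", "요금"): 40, ("결재", "요금"): 1})
pair = ("결제", "결재")

flagged = is_context_error("결재", pair, ["요금"], cooc, unigram)
kept = is_context_error("결제", pair, ["요금"], cooc, unigram)
```

With these toy counts, `flagged` is true because the context word '요금' supports '결제' far more strongly than '결재', while `kept` is false because the written word already wins.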

The second method uses an n-gram-based language model.

This method obtains word (eojeol) n-grams from a large corpus and uses them to compute the probability of each sentence or partial sentence. It then takes a low-frequency n-gram in the sentence or partial sentence, replaces it with a higher-probability n-gram from which it could have arisen through a spelling error, and compares the probability of the resulting sentence or partial sentence with the original probability to find context-sensitive spelling errors.

The third method analyzes the entire document to verify whether the words used remain contextually consistent.

This method requires a kind of knowledge base for analyzing the relations between words.

Thus, research on handling context-sensitive spelling errors can be broadly divided into rule-based and statistical methods.

Compared with statistical methods, rule-based methods have higher precision but lower recall. Because precision and recall in theory move in opposite directions, raising precision is accompanied by a drop in recall.

General users without knowledge of orthography will prefer a high-precision method, but professionals who proofread textbooks or books want error detection and replacement suggestions to be as extensive as possible, unless precision falls so low as to be inconvenient.

That is, they want recall to rise as long as precision is maintained at a reasonable level.

The most important information in statistical methods is the occurrence probability of a word based on the co-occurrence frequency between two words.

The simplest way to estimate this probability is maximum likelihood estimation (MLE). However, because MLE uses the frequencies observed in the training data, it is difficult to predict the word occurrence probability accurately.

In particular, when the observed frequency is 0 the occurrence probability is also taken to be 0, which degrades the performance of the whole statistical model.
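The zero-frequency weakness can be seen in a small sketch. The helper names and the toy counts below are assumptions made for illustration; the point is only that a single unseen pair drives a product of MLE estimates to zero.

```python
from collections import Counter

def mle(cooc, unigram, target, context_word):
    """MLE estimate of P(context_word | target) from raw counts."""
    if unigram[target] == 0:
        return 0.0
    return cooc[(target, context_word)] / unigram[target]

def context_likelihood(cooc, unigram, target, context):
    """Product of MLE estimates over all context words."""
    p = 1.0
    for c in context:
        p *= mle(cooc, unigram, target, c)
    return p

# Toy counts, invented for illustration only.
unigram = Counter({"cool": 10})
cooc = Counter({("cool", "down"): 3, ("cool", "water"): 2})

seen = context_likelihood(cooc, unigram, "cool", ["down", "water"])
unseen = context_likelihood(cooc, unigram, "cool", ["down", "engine"])
```

Here `seen` is positive, but `unseen` collapses to zero because the pair ("cool", "engine") never occurred in the toy data, even though "down" supports the target word.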

Various smoothing techniques have been studied to solve this problem and estimate word occurrence probabilities more accurately.

The simplest smoothing technique is additive smoothing, also called Lidstone smoothing.

This method assumes that data never observed occur λ more times than actually observed, and to that end adds λ to the occurrence count of every item.

Setting λ to 1 is the most basic variant, also called Laplace smoothing.

For an N-gram model this is expressed as follows, where B is the size of the set containing all vocabulary items.
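The equation itself does not appear in this text. For reference, additive smoothing is conventionally written as below. The notation is an assumption and may differ from the original figure: $c(\cdot)$ denotes a count in the training data and $w_{i-n+1}^{i}$ denotes the n-gram ending at word $w_i$.

```latex
P_{\mathrm{add}}\left(w_i \mid w_{i-n+1}^{i-1}\right)
  = \frac{c\left(w_{i-n+1}^{i}\right) + \lambda}
         {\sum_{w_i} c\left(w_{i-n+1}^{i}\right) + \lambda B}
```

With $\lambda = 1$ this reduces to Laplace smoothing.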

Additive smoothing is known to perform poorly in general.
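As a concrete illustration of additive smoothing, the following sketch applies the add-λ rule to a toy count table. The function name, parameter names, and counts are assumptions made for the example; `vocab_size` plays the role of B.

```python
def lidstone(count, context_total, vocab_size, lam=1.0):
    """Additive (Lidstone) estimate; lam=1.0 gives Laplace smoothing."""
    return (count + lam) / (context_total + lam * vocab_size)

# Toy counts of words following some fixed context (invented data).
counts = {"down": 3, "water": 2, "engine": 0}
total = sum(counts.values())

probs = {w: lidstone(c, total, len(counts), lam=1.0) for w, c in counts.items()}
```

The unseen word "engine" now receives a non-zero probability, and the smoothed estimates still sum to 1 over the vocabulary.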

The second is Good-Turing smoothing. It addresses the shortcomings of additive smoothing by assuming that an n-gram appearing $r$ times in the training data actually appears $r^*$ times:

$$r^* = (r + 1)\,\frac{n_{r+1}}{n_r}$$

Here $n_r$ is the number of n-grams that appear exactly $r$ times in the training data. The adjusted frequency is then used to estimate the word occurrence probability:

$$P_{GT}(w) = \frac{r^*}{N}$$

where $N$ is the total number of n-gram occurrences in the training data.

Good-Turing smoothing is known to give accurate estimates of the word occurrence probability when $n_r$ is very large.
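The Good-Turing adjustment can be sketched in a few lines. This is a simplified illustration, not a production estimator: the function names and toy counts are assumptions, and the fallback to the raw count when no n-gram has frequency r + 1 is a shortcut chosen for the example (practical implementations smooth the frequency-of-frequency values instead).

```python
from collections import Counter

def good_turing_adjusted(ngram_counts):
    """Map each observed frequency r to r* = (r + 1) * n_{r+1} / n_r."""
    n = Counter(ngram_counts.values())  # n[r] = number of n-grams seen r times
    adjusted = {}
    for r in sorted(n):
        n_next = n.get(r + 1, 0)
        # Assumption: keep the raw count when n_{r+1} is zero.
        adjusted[r] = (r + 1) * n_next / n[r] if n_next else float(r)
    return adjusted

def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen n-grams: n_1 / N."""
    n1 = sum(1 for c in ngram_counts.values() if c == 1)
    return n1 / sum(ngram_counts.values())

# Toy bigram counts, invented for illustration only.
toy = {"a b": 1, "b c": 1, "c d": 1, "d e": 2, "e f": 2, "f g": 3}
adjusted = good_turing_adjusted(toy)
p0 = unseen_mass(toy)
```

In this toy table three bigrams occur once, two occur twice, and one occurs three times, so a count of 1 is discounted to 4/3 times... that is, replaced by 2 × 2 / 3, and a count of 2 is replaced by 3 × 1 / 2.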

The third is back-off smoothing, which estimates the probability separately for n-grams whose frequency is 0 and for those whose frequency is not 0. Katz smoothing is the representative example: if the n-gram exists, the probability is computed from its adjusted frequency; if it does not, the value is approximated from the frequency of the (n-1)-gram.

The last is interpolation smoothing. Unlike back-off smoothing, it does not treat zero and non-zero frequencies separately; instead it multiplies the higher-order probability and the lower-order probability by their respective weights and adds the results. Jelinek-Mercer smoothing is a representative example.
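Interpolation can be sketched for the bigram case as follows. This is a simplification made for illustration: the names and counts are invented, and the interpolation weight `lam` is a fixed constant here, whereas Jelinek-Mercer smoothing estimates such weights from held-out data.

```python
from collections import Counter

def interpolated(prev, word, bigram, unigram, lam=0.7):
    """lam * P(word | prev) + (1 - lam) * P(word), both estimated by MLE."""
    total = sum(unigram.values())
    p_low = unigram[word] / total if total else 0.0
    p_high = bigram[(prev, word)] / unigram[prev] if unigram[prev] else 0.0
    return lam * p_high + (1.0 - lam) * p_low

# Toy counts, invented for illustration only.
unigram = Counter({"cool": 4, "down": 3, "water": 3})
bigram = Counter({("cool", "down"): 3, ("cool", "water"): 1})

p_seen = interpolated("cool", "down", bigram, unigram)
p_unseen = interpolated("down", "water", bigram, unigram)
```

The bigram ("down", "water") was never observed, yet `p_unseen` stays above zero because the unigram term always contributes.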

Context-sensitive spelling error correction using these prior-art statistical methods has the following problems.

First, because spelling errors are detected and corrected based on the occurrence frequencies of the words that appear in the current context, semantic similarity cannot be analyzed.

Second, a data sparseness problem arises. It can be mitigated with the smoothing techniques used in n-gram language models, but because those techniques rely on the occurrence frequencies of the words appearing in the current context, no information beyond the words currently present can be obtained.

Korean Patent Laid-Open Publication No. 10-2009-0106937
Korean Patent Laid-Open Publication No. 10-2008-0039009

The present invention is intended to solve these problems of prior-art context-sensitive spelling error correction. Its object is to provide a method that uses pre-built correction word pairs and a statistical model based on the occurrence frequencies between each word of a correction word pair and the words appearing in the surrounding context to detect and correct context-sensitive spelling errors.

Another object of the present invention is to provide an apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation that uses a confidence measure based on the typing error rate to keep the precision of correction above a certain level, while limiting the range of surrounding context words used for detection and correction.

Another object of the present invention is to provide an apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation that performs correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

Another object of the present invention is to provide a context-sensitive spelling error correction apparatus and method that, among the various spelling and grammar errors appearing in Korean sentences entered by a user, detects context-sensitive spelling errors that cannot be resolved by dictionary lookup and suggests replacement words to correct them.

Another object of the present invention is to provide a new probability estimation method that resolves the data sparseness problem by expanding the vocabulary with the relational words of a Korean lexical semantic network, and that uses the relation information for semantic analysis between the correction target word and the co-occurring words of the context.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve these objects, an apparatus for statistical context-sensitive spelling error correction using relational-word-based probability estimation according to the present invention comprises: an input unit that receives a sentence in which context-sensitive spelling errors are to be detected and corrected; a morphological analysis unit that, based on a morphological analysis dictionary, performs morphological analysis separating each word (eojeol) of the input sentence into morphemes; a part-of-speech tagging unit that resolves morphological ambiguity when it arises among the morphemes analyzed by the morphological analysis unit; an association analysis unit that quantifies the association between a given word and the words appearing in the surrounding context using conditional probability and confidence; and a spelling error correction unit that, using the values derived by the association analysis unit, performs context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

Here, the association analysis unit uses the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context as a weight when computing the conditional probability between two words.

The association analysis unit obtains the weight by taking the root or the logarithm of that co-occurrence frequency.

The association analysis unit uses sibling words or hyponyms in the Korean lexical semantic network as the relational words of each word of the correction word pair.

The association analysis unit assigns different proportions to the weights obtained from the sibling words and from the hyponyms of each word of the correction word pair, and uses their sum.

To achieve another object, a method for statistical context-sensitive spelling error correction using relational-word-based probability estimation according to the present invention comprises: performing, based on a morphological analysis dictionary, morphological analysis that separates each word (eojeol) of the input sentence into morphemes; resolving morphological ambiguity when it arises among the analyzed morphemes; quantifying the association between a given word and the words appearing in the surrounding context using conditional probability and confidence; and performing context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

Here, in the step of quantifying the association between a given word and the words appearing in the surrounding context using conditional probability and confidence, the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context is used as a weight when computing the conditional probability between two words.

In the same step, the weight is obtained by taking the root or the logarithm of that co-occurrence frequency.

In the same step, sibling words or hyponyms in the Korean lexical semantic network are used as the relational words of each word of the correction word pair, and the weights obtained from the sibling words and from the hyponyms of each word are given different proportions and summed.

In the same step, the weight α obtained by expansion between the relational words of the correction word and the context words is computed as

[equation not reproduced]

where:

β₁: weight for the sibling-word expansion of the correction target word (0.1 to 1.0)
β₂: weight for the hyponym expansion of the correction target word (0.1 to 1.0)
β₃: weight for the hyponym expansion of the context word (0.1 to 1.0)
RW_cw: co-occurrence frequency between the correction target word and the hyponyms of the context word

The two remaining terms of the equation, whose symbols are likewise not reproduced, are the co-occurrence frequency between the result of vocabulary expansion using sibling words and the co-occurring words of the context, and the co-occurrence frequency between the result of vocabulary expansion using hyponyms and the co-occurring words of the context.
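Since the equation for α is not available in this text, the following is only a hypothetical sketch of how such a weight might be assembled from the ingredients the description names: three co-occurrence frequencies, damped by a logarithm or a root, and combined with the proportions β₁, β₂, β₃. The additive form, the function name, the default values, and the use of `log1p` (chosen so that a frequency of 0 contributes 0) are all assumptions, not the patented formula.

```python
import math

def relation_weight(rw_sibling, rw_hyponym, rw_cw,
                    beta1=0.5, beta2=0.5, beta3=0.5, damp=math.log1p):
    """Hypothetical alpha: weighted sum of damped co-occurrence frequencies.

    rw_sibling: frequency between sibling-word expansion and context words
    rw_hyponym: frequency between hyponym expansion and context words
    rw_cw:      frequency between the target word and context-word hyponyms
    """
    for b in (beta1, beta2, beta3):
        if not 0.1 <= b <= 1.0:
            raise ValueError("beta weights are expected in [0.1, 1.0]")
    return (beta1 * damp(rw_sibling)
            + beta2 * damp(rw_hyponym)
            + beta3 * damp(rw_cw))

# Logarithmic damping (default) and square-root damping on toy frequencies.
alpha_log = relation_weight(12, 5, 0)
alpha_root = relation_weight(4, 9, 16, beta1=1.0, beta2=1.0, beta3=1.0,
                             damp=math.sqrt)
```

Passing the damping function as a parameter makes it easy to compare the root and logarithm variants that the description mentions.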

The apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation according to the present invention have the following effects.

First, the performance of Korean document proofreaders can be improved by correcting context-sensitive spelling errors, the most difficult part of Korean document proofreading.

Second, reliability can be improved by detecting and correcting context-sensitive spelling errors with pre-built correction word pairs and a statistical model based on the occurrence frequencies between each word of a correction word pair and the words appearing in the surrounding context.

Third, precision can be improved by using a confidence measure based on the typing error rate to keep the precision of correction above a certain level, while limiting the range of surrounding context words used for detection and correction.

Fourth, a new probability estimation method is provided that resolves the data sparseness problem by expanding the vocabulary with the relational words of a Korean lexical semantic network, and that uses the relation information for semantic analysis between the correction target word and the co-occurring words of the context.

Fifth, the invention can serve as a base technology for various Korean-language applications, such as Korean information retrieval and information extraction, Korean user interfaces, machine translation, and automatic interpretation, enabling those systems to achieve their best performance.

FIG. 1 is a block diagram of a context-sensitive spelling error correction apparatus according to the present invention.
FIG. 2 is a flowchart of a statistical context-sensitive spelling error correction method using relational-word-based probability estimation according to the present invention.

Hereinafter, preferred embodiments of the apparatus and method for statistical context-sensitive spelling error correction using relational-word-based probability estimation according to the present invention are described in detail.

The features and advantages of the apparatus and method will become apparent from the detailed description of each embodiment below.

FIG. 1 is a block diagram of a context-sensitive spelling error correction apparatus according to the present invention, and FIG. 2 is a flowchart of a statistical context-sensitive spelling error correction method using relational-word-based probability estimation according to the present invention.

The present invention performs context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

The apparatus for statistical context-sensitive spelling error correction using relational-word-based probability estimation according to the present invention is configured as follows.

FIG. 1 is a block diagram of the context-sensitive spelling error correction apparatus according to the present invention.

The apparatus includes an input unit (101) that receives a sentence in which context-sensitive spelling errors are to be detected and corrected; a morphological analysis unit (102) that, based on a morphological analysis dictionary, performs morphological analysis separating each word (eojeol) of the sentence received through the input unit (101) into morphemes; a part-of-speech tagging unit (103) that resolves morphological ambiguity when it arises among the morphemes analyzed by the morphological analysis unit (102); an association analysis unit (104) that quantifies the association between a given word and the words appearing in the surrounding context using conditional probability and confidence; and a spelling error correction unit (105) that uses the values derived by the association analysis unit (104) to determine whether a spelling error is present, corrects it, and sends the correction result to an output unit (106).

Here, the spelling error correction unit (105) performs context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

The association analysis unit (104) uses the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context as a weight when computing the conditional probability between two words.

The association analysis unit (104) obtains the weight by taking the root or the logarithm of that co-occurrence frequency.

The association analysis unit (104) uses sibling words or hyponyms in the Korean lexical semantic network as the relational words of each word of the correction word pair.

The association analysis unit (104) assigns different proportions to the weights obtained from the sibling words and from the hyponyms of each word of the correction word pair, and uses their sum.

Statistical context-sensitive spelling error correction in the apparatus according to the present invention proceeds as follows.

As shown in FIG. 2, the method includes: receiving a sentence in which context-sensitive spelling errors are to be detected and corrected (S201); performing, based on a morphological analysis dictionary, morphological analysis that separates each word (eojeol) of the input sentence into morphemes (S202); resolving morphological ambiguity when it arises among the analyzed morphemes (S203); quantifying the association between a given word and the words appearing in the surrounding context using conditional probability and confidence (S204); using the derived values to determine whether a spelling error is present and correcting it (S205); and outputting the correction result (S206).

Here, the spelling error correction step performs context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

The co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context is used as a weight when computing the conditional probability between two words.

The weight is obtained by taking the root or the logarithm of that co-occurrence frequency.

Sibling words or hyponyms in the Korean lexical semantic network are used as the relational words of each word of the correction word pair, and the weights obtained from the sibling words and from the hyponyms of each word are given different proportions and summed.

The most frequent kind of context-sensitive spelling error is one caused by a typo. For example, when typing 'cool down' on a keyboard, 'cool' may be entered as 'cook' because the two keys are close together.

Mistyping the 'l' in 'cool' as the adjacent 'k' produces 'cook', which is itself a valid word, so the error is hard to detect without semantic analysis.

When a typo produces a word that actually exists, correcting the error requires semantic analysis.

However, correcting context-sensitive spelling errors with currently available semantic analysis technology is not feasible. Statistical approaches to the problem have therefore been studied extensively for English. Among them, the approach that is both the best performing and simple is the one that uses 'correction word pairs'.

In the present invention, words at edit distance 1 from each other are selected as 'correction word pairs', and context-sensitive spelling errors are detected and corrected by computing the probabilities between the words of a correction word pair and the words co-occurring in the context.
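As a minimal sketch of how correction word pairs at edit distance 1 could be enumerated: the text does not state which edit operations count or at what unit the distance is measured, so the code below assumes a single insertion, deletion, or substitution over characters (for Korean, the distance could equally be measured over jamo). The function names `is_edit_distance_one` and `correction_pairs` are hypothetical.

```python
from itertools import combinations


def is_edit_distance_one(a: str, b: str) -> bool:
    """True if a and b differ by exactly one insertion, deletion, or substitution."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    # Skip the common prefix.
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted character
    return a[i:] == b[i + 1:]          # one extra character in b


def correction_pairs(vocabulary):
    """All unordered word pairs in the vocabulary that are one edit apart."""
    return [(a, b) for a, b in combinations(sorted(vocabulary), 2)
            if is_edit_distance_one(a, b)]
```

With a vocabulary containing 'cool', 'cook', and 'down', only ('cook', 'cool') is returned. The pairwise scan is quadratic in vocabulary size, which is acceptable for illustration but would need indexing for a real dictionary.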

First, let D⁰ be the document the user intended to type and D^τ the document that is actually observed.

In theory, a statistical spelling checker is then a system that produces D⁰ from D^τ by statistical means.

In other words, a statistical spelling checker is a system that finds the document D⁰ most likely to have produced D^τ, that is, the model that maximizes

$$P(D^{0} \mid D^{\tau})$$

Applying Bayes' rule to this expression gives

$$P(D^{0} \mid D^{\tau}) = \frac{P(D^{\tau} \mid D^{0})\,P(D^{0})}{P(D^{\tau})}$$

However, P(D^τ | D⁰) and P(D⁰) are difficult to determine in practice.

Therefore, the context-sensitive spelling correction method based on correction pairs is used to find spelling errors that occur between word pairs that are highly likely to be confused with each other (correction pairs).

Here, s₀ is a word that appears in the actual document D^τ, and s₁, ..., s_n are the words that may have been written as s₀ by mistake.

Then D⁰ = D^τ[s₀ → s_j] for j = 0, ..., n, and D^τ = D^τ[s₀]. Separate notation is used for the case s₀ → s₀, in which the word was written correctly, and for each of the cases s₀ → s₁, ..., s_n, in which it was written in error.

Accordingly, Equation 3 can be rewritten as follows.

The following assumption can be made here.

(Assumption 1) Apart from the correction target word s₀, the documents D⁰ and D^τ have the same left context C_L and right context C_R.

(Assumption 1) holds in an environment where documents contain very few errors, that is, an environment in which assuming that only s₀ may differ and all other words are identical does not change statistical significance. Assuming that only s₀ differs, Equation 4 is expressed as follows.

If the input error in Equation 5 is assumed to occur independently of the context, the relevant term can be rewritten accordingly and defined by the following formula.

Under the assumption above, the required probability in Equation 6 can be obtained from a reliable large corpus.

In practice, however, even obtaining it from a reliable large corpus is difficult, so the Naive Bayes (bigram probability) assumption is introduced. Because looking at the entire context carries little benefit under the Naive Bayes assumption, the context is restricted.

It is then expressed as follows.

Here, if the correction target is the m-th word, m-1 is the number of words in the left context C_L and n-m is the number of words in the right context C_R.
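The split into left and right context can be sketched as follows. This is a minimal illustration that assumes the sentence is already tokenized into n words and that the target position m is 1-based; the function name `split_context` and the optional `window` parameter (one way to restrict the context) are hypothetical.

```python
def split_context(tokens, m, window=None):
    """Return (left context, target word, right context) for the m-th word (1-based).

    If window is given, each side is trimmed to at most that many words
    nearest to the target.
    """
    if not 1 <= m <= len(tokens):
        raise ValueError("m must be a 1-based position inside the sentence")
    left = tokens[:m - 1]   # m - 1 words
    right = tokens[m:]      # n - m words
    if window is not None:
        left = left[len(left) - min(window, len(left)):]
        right = right[:window]
    return left, tokens[m - 1], right
```

For a five-word sentence and m = 3, the left context has 2 words (m-1) and the right context has 2 words (n-m).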

An error-tagged corpus could in principle be used to obtain the corresponding term in Equation 4, but this is not feasible in practice. If each of the error rates is instead assumed to be known, the expression can be rewritten as in Equation 8.

However, the error probability of a word is itself very difficult to obtain. One option is to estimate the degree of error of each word from the input error patterns associated with the keyboard layout.

Given the variety of ways in which context-sensitive spelling errors arise, however, the probability of the other error types is not easy to obtain. To make the problem easier to solve, all error rates are assumed to be equal to ε, and Equation 8 simplifies as follows.

Here, ε may be set to a value reported in the literature or to a value chosen by the user for the purpose at hand.
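To make the role of a uniform error rate ε concrete, the following is a hedged sketch of a candidate-scoring rule. The scoring form is an assumption, not the simplified equation itself: the observed word is given channel weight 1 - ε, every alternative in the correction pair is given ε, and the context is scored with a Naive Bayes product of per-word conditional probabilities. The function name `choose_word` and all probability values in the usage note are hypothetical.

```python
import math


def choose_word(observed, candidates, context, prior, cond_prob, eps=0.01):
    """Pick the candidate with the highest log score.

    prior[w]          : probability of word w (assumed nonzero)
    cond_prob[(c, w)] : smoothed probability of context word c given w (assumed nonzero)
    eps               : uniform error rate shared by all confusable words (assumption)
    """
    best, best_score = None, -math.inf
    for cand in candidates:
        # Keeping the observed word costs log(1 - eps); changing it costs log(eps).
        channel = 1.0 - eps if cand == observed else eps
        score = math.log(channel) + math.log(prior[cand])
        for c in context:
            score += math.log(cond_prob[(c, cand)])
        if score > best_score:
            best, best_score = cand, score
    return best
```

With made-up values in which 'down' is far more likely next to 'cool' than next to 'cook', the observed 'cook' is replaced by 'cool'; with a context word that favors 'cook', the observed word is kept. A smaller ε makes the system more conservative about proposing corrections.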

A major problem with existing statistical context-sensitive spelling correction is data sparseness. It could be resolved if a reliable large corpus existed, but that is not realistic.

For this reason, data sparseness is generally handled with the smoothing techniques used in n-gram language models. Smoothing techniques estimate probabilities from the occurrence frequencies of words in the currently observed context.
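For reference, standard add-one (Laplace) smoothing of a context-word probability can be sketched as below. The representation of counts as a `Counter` keyed by (context word, target word) and the function name `laplace_prob` are assumptions made for illustration.

```python
from collections import Counter


def laplace_prob(pair_counts, context_word, target, vocab):
    """Add-one smoothed estimate of P(context_word | target).

    pair_counts[(c, t)] is the co-occurrence count of context word c with target t.
    vocab is the list of all context words considered; its size plays the role of B.
    """
    total = sum(pair_counts[(c, target)] for c in vocab)
    return (pair_counts[(context_word, target)] + 1) / (total + len(vocab))
```

An unseen pair receives a small nonzero probability instead of zero, and the estimates over the whole vocabulary still sum to 1. The estimate nevertheless depends only on observed frequencies, not on meaning.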

These methods estimate the co-occurrence frequency between the correction target word and words for which no frequency information exists, so they ultimately depend on the occurrence probabilities of the words observed in the current context.

If the words appearing in the context are expanded to other semantically similar words, a wider range of co-occurrence information between the correction target word and the context words can be obtained.

Moreover, smoothing techniques based on word occurrence frequency have difficulty analyzing the semantic similarity between the surrounding context and the correction target word.

Because these techniques operate only on the words that appear in the current context, they cannot analyze semantic relations beyond the observed words.

If co-occurrence information between the context words and words that are semantically similar to the correction target word is used, the semantic similarity between the target word and its surrounding context can be analyzed.

Accordingly, the present invention expands the vocabulary using the Korean lexical semantic network and uses the expanded result to analyze the semantic relation with the words co-occurring in the context.

It further proposes a probability estimation method that performs context-sensitive spelling error correction using the expanded result.

The Korean lexical semantic network (KorLex) was built with PWN (Princeton WordNet) as its reference model. It is a large-scale knowledge base that takes the semantic relations of PWN as its basic skeleton while revising, supplementing, and extending the relations that are biased toward English vocabulary so that they fit Korean word meanings.

KorLex takes the synset (synonym set) as its basic unit and covers nouns, verbs, adjectives, adverbs, and classifiers. It contains 130,000 synsets and about 150,000 word senses, organized in a hierarchy with synonym, hypernym, hyponym, and sibling relations.

Vocabulary expansion using KorLex is described below.

In the present invention, KorLex is used to obtain the relational words of the correction target word, and the vocabulary is expanded with them to address data sparseness. The expanded vocabulary also makes it possible to analyze the semantic similarity between the correction target word and the context words.

Table 1 shows an example in which a context-sensitive spelling error cannot be corrected using only the co-occurrence information between the correction target word and the context words.

Table 1 lists the co-occurrence frequencies between the context words and the words of the correction word pair for the sentence '원숙한 눈길로 반추하려는 의자가 물씬 풍기는 시편들' (roughly, "poems that strongly exude the chair to reflect with a mature gaze", where "chair" is the error).

In this sentence '의지' (will) was mistakenly written as '의자' (chair), but '의자' was judged to be correct and the error was not corrected.

The example in Table 1 cannot be corrected properly because the statistical information between the correction word pair and the context words is insufficient.

This problem is addressed by expanding the words of the correction word pair.

Table 2 shows an example of expanding each word of the correction word pair to its sibling words.

Table 2 gives the relational words obtained for '의자' and '의지'.

The co-occurrence frequencies between these relational words and the context words can serve as information indicating how strongly the context is associated with the word being expanded.

Table 3 is an example in which KorLex is used to additionally obtain the co-occurrence frequencies between the sibling words of the correction word pair and the context.

Sibling expansion provides additional information about '눈길' (gaze) and the correction word pair.

When this is combined with the information in Table 1 to analyze the semantic relations, '눈길' in this sentence can be judged to be strongly linked to '의지'.

Using the co-occurrence information between the expanded vocabulary and the context, this sentence, in which '의지' was miswritten as '의자', can be corrected to '의지'.

To overcome the problems of the methods used in existing models, the present invention expands the vocabulary using the relational words of KorLex, obtains the co-occurrence frequency between the expanded result and the words appearing in the context, and modifies the Laplace smoothing formula so that this frequency can be used as a weight.

The modified formula is as follows.

Here, B is the size of the set of all words that can be considered, and α is the weight obtained by expanding the relational words of the correction word against the context words. The formula computes the probability that a word appearing in the left context and the correction word appear together in the context.

When this probability is computed, α is additionally obtained and used as a weight in order to reflect the information from the vocabulary expanded through KorLex.

The probability with respect to the right context is obtained with the corresponding formula.
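The text characterizes the modified Laplace formula only through B and α, so the following is one assumed reading rather than the formula itself: α is added to the smoothed numerator and the denominator is left unchanged. Under that assumption the result is a score that is no longer guaranteed to sum to 1 over the vocabulary. The function name `weighted_laplace` is hypothetical.

```python
def weighted_laplace(count_cw, count_w, alpha, vocab_size):
    """Laplace-style score with an additive relational-word weight alpha (assumed form).

    count_cw   : co-occurrence count of the context word with the correction word
    count_w    : total co-occurrence count of the correction word
    alpha      : weight derived from relational-word expansion (0 gives plain add-one)
    vocab_size : B, the size of the set of all words considered
    """
    return (count_cw + alpha + 1.0) / (count_w + vocab_size)
```

With α = 0 this reduces to ordinary add-one smoothing; a positive α raises the score of a correction word whose relational words co-occur with the context even when the word itself was never observed with it.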

To obtain α in the modified formula, the vocabulary must first be expanded using KorLex.

In principle, each word of the correction word pair can be a target of expansion.

A word selected for expansion can be expanded through KorLex into relational words such as sibling words and hyponyms. This expansion matters because of a weakness in existing methods.
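Sibling and hyponym expansion over a hierarchy can be sketched with a toy child-to-parent mapping. This does not use the KorLex interface, which is not described here; the dictionary `parent_of`, its English entries, and the function names are all hypothetical stand-ins.

```python
def hyponyms(parent_of, word):
    """Words whose direct parent (hypernym) is the given word."""
    return sorted(w for w, p in parent_of.items() if p == word)


def siblings(parent_of, word):
    """Words that share the given word's direct parent, excluding the word itself."""
    parent = parent_of.get(word)
    if parent is None:
        return []
    return sorted(w for w, p in parent_of.items() if p == parent and w != word)


# Hypothetical toy hierarchy: each key maps to its direct hypernym.
toy_hierarchy = {
    "chair": "furniture",
    "desk": "furniture",
    "bed": "furniture",
    "armchair": "chair",
    "stool": "chair",
}
```

Here `siblings(toy_hierarchy, "chair")` gives the other furniture terms and `hyponyms(toy_hierarchy, "chair")` gives the kinds of chair. A real semantic network works on synsets rather than bare strings, so word sense selection would be needed before expansion.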

Context-sensitive spelling error correction based on existing statistical methods has the weakness that it uses only the words appearing in the current context.

That is, even if the smoothing techniques used with n-grams compensate for co-occurrence frequencies of zero, they still only estimate the occurrence probabilities of the words appearing in the current context and cannot capture semantic similarity.

With the vocabulary expanded through KorLex, it becomes possible not only to estimate the occurrence probabilities of the words in the current context but also to obtain the semantic similarity between the correction target word and the context information, which in turn improves correction performance for context-sensitive spelling errors.

The vocabulary is expanded through KorLex, and experiments that use the KorLex relational words in various ways are carried out to find the α that gives the best correction performance.

The simplest choice for α is the total co-occurrence frequency between the relational-word expansion of the correction word and the context words.
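The total co-occurrence frequency described above can be computed as in this minimal sketch. It assumes counts are stored in a `Counter` keyed by (context word, related word); the function name `relational_cooccurrence` and the example words are hypothetical.

```python
from collections import Counter


def relational_cooccurrence(related_words, context_words, pair_counts):
    """Sum of co-occurrence counts between every related word and every context word."""
    return sum(pair_counts[(c, r)] for r in related_words for c in context_words)


# Hypothetical counts for illustration only.
example_counts = Counter({
    ("gaze", "determination"): 4,
    ("gaze", "resolve"): 2,
    ("poem", "resolve"): 1,
})
```

Because the sum grows with the number of related words, a word with many relational words tends to receive a larger raw total than a word with few, regardless of how well it fits the context.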

However, using the additionally obtained statistic as it is can lead to wrong results, because the number of expanded words differs from word to word.

For example, '사장' has 18 sibling words, whereas '사정' has more than 100.

α must therefore be adjusted to an appropriate value so that it acts as meaningful information when correcting context-sensitive spelling errors without causing wrong results.

Because the role of α is to be added on top of the existing statistics as a clue for more accurate error detection and correction, it must not become too large.

First, the vocabulary can be expanded using sibling information.

For the total occurrence frequency to act as meaningful information for context-sensitive spelling error correction without causing wrong corrections, it has to be transformed to a suitable value.

Because α must not become too large, the frequency is used in a transformed form.

The quantity transformed is the co-occurrence frequency between the sibling-based expansion and the words co-occurring in the context, and n is the optimal value found through repeated experiments.
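The transformation can be illustrated with a damping function. The exact form is an assumption: the description states elsewhere that a root or a logarithm is taken, so the sketch offers an n-th root and a base-n logarithm (with 1 added to keep the result non-negative), where n is the experimentally tuned parameter. The function name `damp_frequency` is hypothetical.

```python
import math


def damp_frequency(total, n, mode="root"):
    """Compress a raw co-occurrence total so that alpha does not grow too large.

    mode="root": n-th root of the total (n > 0)
    mode="log" : logarithm base n of (1 + total) (n > 1)
    """
    if total <= 0:
        return 0.0
    if mode == "root":
        return total ** (1.0 / n)
    if mode == "log":
        return math.log(1.0 + total, n)
    raise ValueError(f"unknown mode: {mode}")
```

Damping narrows the gap between words with very different numbers of relational words: raw totals of 100 and 18 differ by a factor of about 5.6, while their square roots differ by a factor of about 2.4.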

Second, the vocabulary can be expanded using hyponym information.

The vocabulary is expanded with hyponym information instead of sibling information, and the total co-occurrence frequency between the expanded result and the words co-occurring in the context is transformed before being reflected in α.

Here the quantity transformed is the co-occurrence frequency between the hyponym-based expansion and the words co-occurring in the context, and n is the optimal value found through repeated experiments.

Third, the vocabulary can be expanded using sibling information and hyponym information together. In this method as well, the frequencies are transformed before use.

Finally, performance can be improved further if the sibling expansion result and the hyponym expansion result are given different proportions.

In addition, recall was highest when both the sibling and the hyponym expansion results were used, which suggests that the number of expanded words is also an important factor.

Performance can therefore be improved further by also expanding the words that appear in the context and adding that contribution.

α is accordingly transformed as follows.

β₁: weight for the sibling expansion of the correction target word (0.1 to 1.0)

β₂: weight for the hyponym expansion of the correction target word (0.1 to 1.0)

β₃: weight for the hyponym expansion of the context words (0.1 to 1.0)

RW_cw: co-occurrence frequency between the correction target word and the hyponyms of the context words
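How the three weights might combine into α can be sketched as a weighted sum. This is an assumed form: the sibling and hyponym terms are taken to be already-transformed frequencies, RW_cw is used as given, and the three contributions are simply added. The function name `combined_alpha` and its parameter names are hypothetical.

```python
def combined_alpha(sibling_term, hyponym_term, rw_cw, beta1, beta2, beta3):
    """Weighted sum of the three expansion contributions (assumed form).

    sibling_term : transformed co-occurrence frequency from sibling expansion
    hyponym_term : transformed co-occurrence frequency from hyponym expansion
    rw_cw        : co-occurrence frequency with hyponyms of the context words
    beta1..beta3 : proportions, each expected in the range 0.1 to 1.0
    """
    for beta in (beta1, beta2, beta3):
        if not 0.1 <= beta <= 1.0:
            raise ValueError("each beta is expected to lie in [0.1, 1.0]")
    return beta1 * sibling_term + beta2 * hyponym_term + beta3 * rw_cw
```

In practice the three β values would be chosen by searching the stated range on held-out data and keeping the combination with the best correction performance.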

As described above, the statistical context-sensitive spelling error correction apparatus and method using the relational-word-based probability estimation method according to the present invention perform context-sensitive spelling error correction using the conditional probability between each word of a correction word pair and the words appearing in the surrounding context, together with the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

It will be understood from the foregoing description that the present invention may be implemented in modified forms without departing from its essential characteristics.

The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the invention is defined by the claims rather than by the foregoing description, and all differences within the scope of their equivalents should be construed as included in the invention.

101. Input unit 102. Morphological analysis unit
103. Part-of-speech tagging unit 104. Association analysis unit
105. Spelling error correction unit 106. Output unit

Claims

A statistical context-sensitive spelling error correction apparatus using a relational-word-based probability estimation method, comprising:
an input unit for inputting a sentence in which context-sensitive spelling errors are to be detected and corrected;
a morphological analysis unit for performing morphological analysis that splits each word of the input sentence into morphemes on the basis of a morphological analysis dictionary;
a part-of-speech tagging unit for resolving morphological ambiguity when it arises among the morphemes analyzed by the morphological analysis unit;
an association analysis unit for quantifying the association between a target word and the words appearing in its surrounding context using conditional probability and confidence; and
a spelling error correction unit for performing context-sensitive spelling error correction using the values derived by the association analysis unit, namely the conditional probability between each word of a correction word pair and the words appearing in the surrounding context and the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context,
wherein the association analysis unit uses the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context as a weight when computing the conditional probability between two words.

(Deleted)

The apparatus of claim 1, wherein the association analysis unit obtains the weight by taking a root or a logarithm of the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context.

The apparatus of claim 1, wherein the association analysis unit uses sibling words or hyponyms from the Korean lexical semantic network as the relational words of each word of the correction word pair.

The apparatus of claim 4, wherein the association analysis unit sums the weights obtained using the sibling words and the hyponyms of each word of the correction word pair, applying a different proportion to each.

A statistical context-sensitive spelling error correction method using a relational-word-based probability estimation method, comprising:
inputting a sentence in which context-sensitive spelling errors are to be detected and corrected;
performing morphological analysis that splits each word of the input sentence into morphemes on the basis of a morphological analysis dictionary;
resolving morphological ambiguity when it arises among the analyzed morphemes;
quantifying the association between a target word and the words appearing in its surrounding context using conditional probability and confidence, wherein the co-occurrence frequency between the relational words of each word of a correction word pair and the words appearing in the surrounding context is used as a weight when computing the conditional probability between two words; and
performing context-sensitive spelling error correction using the conditional probability between each word of the correction word pair and the words appearing in the surrounding context and the co-occurrence frequency between the relational words of each word of the pair and the words appearing in the surrounding context.

(Deleted)

The method of claim 6, wherein, in quantifying the association between the target word and the words appearing in its surrounding context using conditional probability and confidence,
the weight is obtained by taking a root or a logarithm of the co-occurrence frequency between the relational words of each word of the correction word pair and the words appearing in the surrounding context.

The method of claim 8, wherein, in quantifying the association between the target word and the words appearing in its surrounding context using conditional probability and confidence,
sibling words or hyponyms from the Korean lexical semantic network are used as the relational words of each word of the correction word pair, and the weights obtained using the sibling words and the hyponyms of each word of the pair are summed with a different proportion applied to each.

The method of claim 6, wherein, in quantifying the association between the target word and the words appearing in its surrounding context using conditional probability and confidence,
the weight α obtained by expanding the relational words of the correction word against the context words is defined in terms of the following quantities:
β₁ is the weight for the sibling expansion of the correction target word (0.1 to 1.0), β₂ is the weight for the hyponym expansion of the correction target word (0.1 to 1.0), β₃ is the weight for the hyponym expansion of the context words (0.1 to 1.0), RW_cw is the co-occurrence frequency between the correction target word and the hyponyms of the context words,
and the hyponym term is the co-occurrence frequency between the hyponym-based expansion and the words co-occurring in the context.