KR102348845B1 - A method and system for context sensitive spelling error correction using realtime candidate generation

Info

Publication number: KR102348845B1
Application number: KR1020190060639A
Authority: KR
Inventors: 권혁철; 김민호
Original assignee: 부산대학교 산학협력단
Priority date: 2018-05-23
Filing date: 2019-05-23
Publication date: 2022-01-11
Also published as: KR20190133624A

Abstract

The present invention relates to an apparatus and method for context-sensitive spelling error correction using real-time error candidate generation, in which correction is performed against correction vocabulary pairs that are generated in real time during the checking stage, so that a wide variety of errors can be handled. The apparatus includes: an input unit that receives a sentence in which context-sensitive spelling errors are to be detected and corrected; a word-unit inspection unit that inspects the input sentence one eojeol (space-delimited word unit) at a time; a real-time correction vocabulary pair generation unit that generates the correction vocabulary pair for each inspected word unit; an error determination unit that decides whether an error exists based on the most probable word; and a replacement word presentation unit that presents a replacement word according to the result of that decision.

Description

A method and system for context sensitive spelling error correction using realtime candidate generation

The present invention relates to spelling error correction, and more specifically to an apparatus and method for context-sensitive spelling error correction using real-time error candidate generation, in which correction is performed against correction vocabulary pairs generated in real time during the checking stage so that a wide variety of errors can be handled.

As interest in the use of big data has grown in recent years, research on artificial intelligence technologies such as deep learning has been actively pursued.

In particular, with the spread of mobile environments centered on smartphones, the volume of unstructured data such as text has grown explosively, and the demand for natural language processing technology that can process it effectively is greater than ever.

The core of natural language processing is natural language understanding based on morphological and semantic analysis of text, and document proofreading technology lies at its center.

Text contains many spelling errors, whether caused by user mistakes, intent, or lack of knowledge, and these errors are a major factor in lowering the accuracy of natural language understanding.

FIG. 1 is a block diagram of the noisy channel model, and FIG. 2 is a block diagram showing an example of context-sensitive spelling error correction viewed through the noisy channel model.

Spelling errors fall into two broad types: simple spelling errors (non-word spelling errors) and context-sensitive spelling errors.

In the former, the erroneous form, such as "결죄", is not an actual word of the language, so the error is easily detected by morphological analysis of the word unit alone.

In the latter, as in "요금 결재" (where "결재", "approval", appears in place of "결제", "payment"), the word is valid on its own and becomes an error only when used together with "요금" ("fee"). Detection therefore requires considering the morphological and semantic properties of the word in its context, which makes it very difficult.
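The difference between the two error types can be illustrated with a minimal sketch. The tiny lexicon and the function name `is_non_word` are hypothetical and stand in for a real morphological analyzer; the point is only that a lookup catches "결죄" but cannot flag "결재", which is a valid word.

```python
# Hypothetical toy lexicon; a real checker would use a morphological analyzer.
LEXICON = {"요금", "결제", "결재"}


def is_non_word(word: str) -> bool:
    """Return True if the word is absent from the lexicon (a simple spelling error)."""
    return word not in LEXICON


# "결죄" is not a real word, so a lookup alone detects it.
print(is_non_word("결죄"))
# "결재" is a real word, so "요금 결재" passes the lookup even though it is wrong in context.
print(is_non_word("결재"))
```

Because the second case passes the lookup, detecting it requires a model of the surrounding words, which is the problem the rest of the description addresses.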

Methods for correcting context-sensitive spelling errors can be broadly divided into rule-based methods and statistical methods.

Rule-based methods are likely to correct context-sensitive spelling errors that occur frequently or follow a regular pattern, but they have difficulty correcting irregular errors caused by typing mistakes.

Statistical methods, by contrast, can be applied even to context-sensitive spelling errors that rarely recur. Because the corpus used to gather the statistics can be swapped, a spelling correction technique suited to a different environment can be developed in a short time, which makes these methods widely applicable.

The most widely used approach to statistical context-sensitive spelling error correction is based on the noisy channel model. This approach treats context-sensitive spelling error correction as a decoding problem.

As shown in FIG. 1, the noisy channel model assumes that input data can be distorted into output data by noise present in the channel, and uses a decoder to restore the input data from the output data.

As shown in FIG. 2, context-sensitive spelling error correction based on the noisy channel model takes the input data to be the document the user intended to write and the output data to be the document actually observed.

That is, the document the user intended to write acquires spelling errors as it passes through the noisy channel, and restoring this distorted data is regarded as context-sensitive spelling error correction.

In the prior art, a statistical context-sensitive spelling error correction model was built by interpreting the language model as the probability distribution over the candidate word sequences the user intended to enter, and the channel probability as the rate at which spelling errors occur.

Such approaches use pre-built correction vocabulary pairs such as '사장/사정' and '가장/가정', and judge whether a spelling error exists only for sentences that contain a word from one of these pairs.

For example, when the sentence '그는 한 집안의 가정으로서 최선을 다하고 있다.' ("He is doing his best as the 가정 of a household.") is checked, the error judgment is made for '가정' ("home"), because it belongs to a pre-built correction vocabulary pair.

That is, the system decides whether '가정' is correct in this sentence, or whether '가장' ("head of household"), the other member of the same correction vocabulary pair, should have been written instead.

One prior-art method corrects context-sensitive spelling errors using pre-built correction vocabulary pairs and a statistical model based on the co-occurrence frequency between each word of a pair and the words appearing in the surrounding context (Korean Patent No. 10-1495240).

Another method uses a Korean lexical semantic network when generalizing correction rules, in order to raise the recall of those rules (Korean Patent No. 10-1500617).

Yet another method addresses the data sparseness problem that arises when computing the association between each word of a correction vocabulary pair and the words appearing in the surrounding context (Korean Patent No. 10-1573854).

However, because these prior-art methods perform spelling error correction only for a limited set of correction vocabulary pairs, they cannot handle error types for which no correction vocabulary pair has been built.

In particular, for irregular errors such as context-sensitive spelling errors caused by typos, there is no way to build the correction vocabulary pairs in advance at all.

Accordingly, a new technique is needed that overcomes the limitation of the prior art, in which correction is possible only for words covered by pre-built correction rules.

Korean Patent No. 10-1495240
Korean Patent No. 10-1500617
Korean Patent No. 10-1573854

The present invention is intended to solve the above problems of prior-art spelling error correction. Its object is to provide an apparatus and method for context-sensitive spelling error correction using real-time error candidate generation, in which correction is performed against correction vocabulary pairs generated in real time during the checking stage so that a wide variety of errors can be handled.

Another object of the present invention is to provide such an apparatus and method in which correction vocabulary pairs are generated in real time during checking rather than built in advance, and a language model based on deep learning is used, so that irregular errors such as context-sensitive spelling errors caused by typos can be corrected efficiently.

Another object of the present invention is to provide such an apparatus and method that, among the various spelling and grammar errors appearing in a Korean sentence entered by a user, detects context-sensitive spelling errors that can arise from user mistakes such as typos or from unfamiliarity with orthographic rules, and presents replacement words to correct them.

Another object of the present invention is to provide such an apparatus and method that raises the accuracy of natural language understanding by resolving the errors in documents that contain many spelling errors caused by user mistakes, intent, or lack of knowledge.

Another object of the present invention is to provide such an apparatus and method that improves the performance of Korean document proofreaders by correcting context-sensitive spelling errors, the most difficult class of error in Korean document proofreading.

Another object of the present invention is to provide such an apparatus and method that, unlike approaches that correct only against pre-built correction vocabulary pairs, corrects against correction vocabulary pairs built in real time from a large-scale corpus, and can therefore handle a wide variety of errors.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve these objects, an apparatus for context-sensitive spelling error correction using real-time error candidate generation according to the present invention comprises: an input unit that receives a sentence in which context-sensitive spelling errors are to be detected and corrected; a word-unit inspection unit that inspects the sentence received through the input unit one word unit (eojeol) at a time; a real-time correction vocabulary pair generation unit that generates the correction vocabulary pair for each word unit inspected by the word-unit inspection unit; an error determination unit that decides whether an error exists based on the most probable word; and a replacement word presentation unit that presents a replacement word according to the decision of the error determination unit.

To achieve another object, a method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention comprises: a word-unit determination step of carrying out inspection one word unit at a time; a real-time correction vocabulary pair generation step of first generating the correction vocabulary pair for the word unit, by using a Trigram dictionary built from three-word sequences extracted from a corpus to find words that share the same context as the target word T and thereby form the correction vocabulary pair to which T belongs; an error determination step of deciding whether an error exists based on the most probable word; and a replacement word presentation step of presenting a replacement word according to the result of the error determination step.

The apparatus and method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention have the following effects.

First, correction is performed against correction vocabulary pairs generated in real time during the checking stage, so a wide variety of errors can be handled and the reliability of natural language processing technology is improved.

Second, correction vocabulary pairs are generated in real time during checking rather than built in advance, and a language model based on deep learning is used, so irregular errors such as context-sensitive spelling errors caused by typos can be corrected efficiently.

Third, among the various spelling and grammar errors appearing in a Korean sentence entered by a user, context-sensitive spelling errors that can arise from user mistakes such as typos or from unfamiliarity with orthographic rules can be detected, and replacement words can be presented to correct them.

Fourth, the accuracy of natural language understanding can be raised by resolving the errors in documents that contain many spelling errors caused by user mistakes, intent, or lack of knowledge.

Fifth, the performance of Korean document proofreaders can be improved by correcting context-sensitive spelling errors, the most difficult class of error in Korean document proofreading.

Sixth, unlike approaches that correct only against pre-built correction vocabulary pairs, correction is performed against correction vocabulary pairs built in real time from a large-scale corpus, so a wide variety of errors can be handled.

FIG. 1 is a block diagram of the noisy channel model.
FIG. 2 is a block diagram showing an example of context-sensitive spelling error correction viewed through the noisy channel model.
FIG. 3 is a block diagram of the apparatus for context-sensitive spelling error correction using real-time error candidate generation according to the present invention.
FIG. 4 is a diagram illustrating the method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention.
FIG. 5 is a diagram showing the data structure of the Trigram statistical dictionary.
FIG. 6 is a flowchart of the method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention.

Preferred embodiments of the apparatus and method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention are described in detail below.

The features and advantages of the apparatus and method will become apparent from the detailed description of each embodiment.

FIG. 3 is a block diagram of the apparatus for context-sensitive spelling error correction using real-time error candidate generation according to the present invention.

Instead of using pre-built correction vocabulary pairs, the apparatus and method according to the present invention generate correction vocabulary pairs in real time during the checking stage.

To generate correction vocabulary pairs in real time during the checking stage, the invention may use a Trigram dictionary built from three-word sequences extracted from a large-scale corpus: words that share the same context as the target word T are looked up in this dictionary to form the correction vocabulary pair to which T belongs, and this pair is generated first for each word unit.

As shown in FIG. 3, the apparatus according to the present invention includes an input unit (10) that receives a sentence in which context-sensitive spelling errors are to be detected and corrected; a word-unit inspection unit (20) that inspects the sentence received through the input unit (10) one word unit at a time; a real-time correction vocabulary pair generation unit (30) that generates the correction vocabulary pair for each word unit inspected by the word-unit inspection unit (20); an error determination unit (40) that decides whether an error exists based on the most probable word; and a replacement word presentation unit (50) that presents a replacement word according to the decision of the error determination unit (40).

Here, the real-time correction vocabulary pair generation unit (30) generates the correction vocabulary pair for a word unit by using the Trigram dictionary, built from three-word sequences extracted from a corpus, to find words that share the same context as the target word T.

The real-time correction vocabulary pair generation unit (30) generates the correction vocabulary pair based on the two word units to the left and the two word units to the right of the word unit under inspection; words that have those four word units as their context make up the correction vocabulary pair.

If the word unit under inspection has the highest probability, the replacement word presentation unit (50) treats it as not being an error. If another word has a higher probability, the word unit under inspection is judged to be an error and that other word is presented as the replacement.

Each step of the method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention is described in detail below.

FIG. 4 is a diagram illustrating the method for context-sensitive spelling error correction using real-time error candidate generation according to the present invention, and FIG. 5 is a diagram showing the data structure of the Trigram statistical dictionary.

The restoration problem based on the noisy channel model is expressed by Bayes' theorem, in which the probability of the output data is a constant.

Equation 1 contains two probability distributions: the language model and the channel probability.
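The equation itself did not survive in this text, so the following is a hedged reconstruction of the standard noisy channel decoding rule that the paragraph describes, not a copy of the patent's Equation 1. The symbol $I$ for the input data appears in the source; $O$ for the output data and the hat notation are assumptions made here.

```latex
\hat{I} = \operatorname*{arg\,max}_{I} P(I \mid O)
        = \operatorname*{arg\,max}_{I} \frac{P(O \mid I)\, P(I)}{P(O)}
        = \operatorname*{arg\,max}_{I} P(O \mid I)\, P(I)
```

The last step drops $P(O)$ because it is constant with respect to $I$. In this form $P(I)$ plays the role of the language model and $P(O \mid I)$ the role of the channel probability.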

A language model is a model used for generating or understanding natural language, and statistical language modeling is the task of building a language model that can predict the probability of a string.

Here, a statistical language model can be expressed as a probability distribution over a given string. The most widely used statistical language model today is the N-gram model, which computes an approximation of the probability distribution over a string as in Equation 2.

In Equation 3, the conditional probability values are estimated by Maximum Likelihood Estimation (MLE) using frequencies obtained from the training corpus.
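As a minimal sketch of MLE for an N-gram model, the example below estimates a conditional probability as a ratio of corpus counts. The choice of N = 3, the function names `train_counts` and `mle`, and the toy corpus are all assumptions for illustration; the source does not give the order of the model or any implementation.

```python
from collections import Counter


def train_counts(sentences):
    """Count trigrams and the two-word histories that precede their last word."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        for i in range(len(words) - 2):
            tri[tuple(words[i:i + 3])] += 1
            bi[tuple(words[i:i + 2])] += 1
    return tri, bi


def mle(tri, bi, w1, w2, w3):
    """MLE of P(w3 | w1, w2): count(w1 w2 w3) divided by count of the history (w1 w2)."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0


# Hypothetical toy corpus of tokenized sentences.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
tri, bi = train_counts(corpus)
print(mle(tri, bi, "a", "b", "c"))
```

Unseen histories return 0.0 here; a practical system would apply smoothing or back-off, which this sketch leaves out.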

In Equation 1, replacing the input data I with the word sequence of the document the user intended to write, and the output data with the word sequence of the document the user actually sees, yields the statistical context-sensitive spelling error correction model of Equation 4.

Suppose that, between the input word sequence and the output word sequence, a single word was changed while passing through the noisy channel and all other words are unchanged. Context-sensitive spelling error correction then becomes the problem of selecting the word that maximizes the resulting probability.

Equation 4 can therefore be rearranged as Equation 5.
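Equations 4 and 5 are not preserved in this text. The block below is only a sketch of the form such a rearrangement usually takes under the single-changed-word assumption; every symbol other than $T$ is an assumption introduced here: $T_{\mathrm{obs}}$ is the observed word, $C(T_{\mathrm{obs}})$ its candidate set, and $w_1, \dots, w_n$ the unchanged words.

```latex
\hat{T} = \operatorname*{arg\,max}_{T \in C(T_{\mathrm{obs}})}
          P(T_{\mathrm{obs}} \mid T)\;
          P(w_1, \dots, w_{k-1},\, T,\, w_{k+1}, \dots, w_n)
```

The first factor is the channel probability restricted to the one changed word, and the second is the language model probability of the sentence with candidate $T$ in position $k$.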

In the prior art, when a word belonging to a pre-built correction vocabulary pair such as '가장/가정' is found, whether it is an error is decided by solving the decoding problem of the noisy channel model.

For example, when the sentence '그는 한 집안의 가정으로서 최선을 다하고 있다.' is checked, '가정' ("home") belongs to a correction vocabulary pair, so Equation 5 is used to compare the probability for T = '가장' ("head of household") with the probability for T = '가정'.

If the probability for T = '가장' is higher, '가정' is judged to be a context-sensitive spelling error and is corrected to '가장'.
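The comparison described above can be sketched as follows. The trigram counts are invented for illustration, the channel probability is assumed equal for both candidates, and `context_score` is a hypothetical count-based stand-in for the language model rather than the model used in the patent.

```python
import math

# Hypothetical trigram counts; real values would come from a large corpus.
TRIGRAM_COUNTS = {
    ("한", "집안의", "가장으로서"): 8,
    ("집안의", "가장으로서", "최선을"): 6,
    ("가장으로서", "최선을", "다하고"): 5,
    ("한", "집안의", "가정으로서"): 1,
}


def context_score(left, target, right):
    """Sum log(1 + count) over the three trigrams of the window that contain `target`."""
    window = list(left) + [target] + list(right)
    return sum(
        math.log1p(TRIGRAM_COUNTS.get(tuple(window[i:i + 3]), 0))
        for i in range(3)
    )


left, right = ("한", "집안의"), ("최선을", "다하고")
candidates = ["가정으로서", "가장으로서"]
# Pick the candidate whose surrounding trigrams are best supported by the counts.
best = max(candidates, key=lambda t: context_score(left, t, right))
print(best)
```

With these toy counts the candidate '가장으로서' scores higher, so '가정으로서' would be flagged and replaced.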

Such techniques cannot correct context-sensitive spelling errors in words that do not belong to a pre-built correction vocabulary pair.

In the present invention, to generate correction vocabulary pairs in real time during checking, words that share the same context are collected into a correction vocabulary pair in the manner shown in FIG. 4.

A correction vocabulary pair is a set of words that can be written in place of one another through user mistakes and the like while composing a document, so it can be interpreted as a set of words that share the same context.

For example, suppose the target word T under inspection is used together with the words a, b, c, and d.

That is, it is used in the form 'a, b, T, c, d'.

The 'Trigram dictionary' is a dictionary built from three-word sequences extracted from a large-scale corpus.

Therefore, finding words that have the same context as the target word T yields the correction vocabulary pair to which T belongs.

In this case, e, w, b, v, and q were found in the 'Trigram dictionary', so these words form the correction vocabulary pair.

This is expressed as Equation 6.
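The lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions: the source does not preserve Equation 6, so combining the three trigram positions by set union is an assumption, and the names `build_context_index` and `generate_candidates` are hypothetical.

```python
from collections import defaultdict


def build_context_index(trigrams):
    """Index trigrams so a word can be looked up by the two words that accompany it."""
    index = defaultdict(set)
    for w1, w2, w3 in trigrams:
        index[("left", w1, w2)].add(w3)   # pattern: a b [X]
        index[("mid", w1, w3)].add(w2)    # pattern: b [X] c
        index[("right", w2, w3)].add(w1)  # pattern: [X] c d
    return index


def generate_candidates(index, a, b, target, c, d):
    """Collect words seen in the same slot as `target` (union of three lookups)."""
    found = (
        index.get(("left", a, b), set())
        | index.get(("mid", b, c), set())
        | index.get(("right", c, d), set())
    )
    return found - {target}


# Hypothetical trigram entries.
trigrams = [("a", "b", "e"), ("b", "w", "c"), ("v", "c", "d"),
            ("a", "b", "T"), ("x", "y", "z")]
index = build_context_index(trigrams)
print(sorted(generate_candidates(index, "a", "b", "T", "c", "d")))
```

The target word is removed from the result here only so that the output lists alternatives; when scoring, the target itself is compared alongside them.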

As shown in FIG. 5, the 'Trigram dictionary' used to generate correction vocabulary pairs is built on a Trie data structure.

A 1-gram is stored as an entry that holds the frequency of occurrence of that 1-gram together with a reference to the storage holding every 2-gram that begins with it.

A 2-gram is stored as a mapping from its first word to its second word, and the value kept with it is the count of that 2-gram.

Finally, the third level is stored as a mapping in the same manner; an alternative storage form may also be used for this level.
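A three-level trie of this kind can be sketched as below. The exact record layout of FIG. 5 is not recoverable from this text, so the node fields (`count`, `children`) and the function names are assumptions; the sketch only shows the general idea of storing 1-, 2- and 3-gram counts along a shared path.

```python
class TrieNode:
    """One n-gram entry: its frequency plus children keyed by the next word."""
    __slots__ = ("count", "children")

    def __init__(self):
        self.count = 0
        self.children = {}


def insert_ngrams(root, words, max_n=3):
    """Add every 1-, 2- and 3-gram of a tokenized sentence to the trie."""
    for start in range(len(words)):
        node = root
        for word in words[start:start + max_n]:
            node = node.children.setdefault(word, TrieNode())
            node.count += 1


def ngram_count(root, *ngram):
    """Return the stored frequency of an n-gram, or 0 if it was never seen."""
    node = root
    for word in ngram:
        node = node.children.get(word)
        if node is None:
            return 0
    return node.count


root = TrieNode()
insert_ngrams(root, ["a", "b", "c", "a", "b", "d"])
print(ngram_count(root, "a"), ngram_count(root, "a", "b"), ngram_count(root, "a", "b", "c"))
```

Sharing prefixes this way means every 2-gram starting with a given word is reachable from that word's 1-gram entry, which matches the pointer relationship described above.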

본 발명에 따른 실시간 오류 후보 생성을 이용한 문맥의존 철자오류 교정 방법을 구체적으로 설명하면 다음과 같다.The context-dependent spelling error correction method using real-time error candidate generation according to the present invention will be described in detail as follows.

도 6은 본 발명에 따른 실시간 오류 후보 생성을 이용한 문맥의존 철자오류 교정 방법을 나타낸 플로우 차트이다.6 is a flowchart illustrating a context-dependent spelling error correction method using real-time error candidate generation according to the present invention.

먼저, 문맥 철자오류를 검색하고 교정하기 위한 텍스트 입력이 이루어지면(S601), 검사할 어절에 대한 모든 교정 어휘쌍 생성 및 대치어 제시가 이루어졌는지 판단하여(S602), 어절 단위의 검사를 진행하여 실시간 교정 어휘쌍 생성을 한다.(S603)First, when text input for searching and correcting contextual spelling errors is made (S601), it is determined whether all corrective vocabulary pairs for the word to be checked and replacements have been presented (S602), and the word unit check is performed Real-time proofreading vocabulary pair generation (S603)

이어, 가장 확률이 높은 단어를 기준으로 오류 판단을 하고(S604), 확률이 가장 높으면 해당 어절이 오류가 아닌 것으로 판단하고, 다른 어절이 확률이 더 높다면 해당 어절은 오류로 판단하고, 해당 어휘를 대치어로 제시한다.(S605)Next, an error determination is made based on the word with the highest probability (S604), and if the probability is highest, it is determined that the corresponding word is not an error, and if another word has a higher probability, the corresponding word is determined as an error, and the corresponding word is presented as a substitute. (S605)

일 예로, 본 발명에 따른 실시간 오류 후보 생성을 이용한 문맥의존 철자오류 교정 방법은 어절 단위로 검사를 진행하게 되며, 해당 어절의 교정 어휘 쌍을 먼저 생성한다.For example, in the context-dependent spelling error correction method using real-time error candidate generation according to the present invention, inspection is performed in units of words, and a corrective vocabulary pair of the corresponding word is first generated.

예를 들어, '그는 한 집안의 가정으로서 최선을 다하고 있다.'라는 문장이 입력되며, '그는', '한', '집안의', '가정으로서', '최선을', '다하고', '있다.'가 검사의 대상이 된다.For example, the sentence 'He is doing his best as a family member' is input, 'He', 'Han', 'In the family', 'As a family', 'Doing his best', 'Doing his best', 'Yes' is the subject of inspection.

Because correction vocabulary pairs are generated from the two words on the left and the two words on the right, the correction vocabulary pair to which '가정으로서' belongs consists of the words that have '한', '집안의', '최선을', and '다하고' as their context.
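The context selection described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `(previous word, next word)` index, and the toy trigram list are all assumptions introduced here, and the lookup uses only the nearest neighbor on each side even though the window itself spans two words per side.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Trigram = Tuple[str, str, str]


def context_window(words: List[str], i: int, k: int = 2) -> Tuple[List[str], List[str]]:
    """Return up to k words on each side of words[i]."""
    left = words[max(0, i - k):i]
    right = words[i + 1:i + 1 + k]
    return left, right


def build_index(trigrams: Iterable[Trigram]) -> Dict[Tuple[str, str], Set[str]]:
    """Map (previous word, next word) to every middle word seen between them."""
    index: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for prev, mid, nxt in trigrams:
        index[(prev, nxt)].add(mid)
    return index


def same_context_words(index: Dict[Tuple[str, str], Set[str]],
                       left: List[str], right: List[str]) -> Set[str]:
    """Hypothetical lookup: words sharing the nearest left and right neighbor."""
    if not left or not right:
        return set()
    return set(index.get((left[-1], right[0]), set()))


# Toy trigram list invented purely for illustration (not corpus data).
toy_trigrams = [
    ("집안의", "가정으로서", "최선을"),
    ("집안의", "가장으로서", "최선을"),
    ("한", "집안의", "가정으로서"),
]

words = "그는 한 집안의 가정으로서 최선을 다하고 있다.".split()
left, right = context_window(words, words.index("가정으로서"))
pair = same_context_words(build_index(toy_trigrams), left, right)
print(left, right)  # ['한', '집안의'] ['최선을', '다하고']
```

With the toy data, `pair` contains '가정으로서' and '가장으로서', the two words that occur between '집안의' and '최선을'.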

When Equation 5 yields T = 가정으로서, 가장으로서, 사정으로서, 사정으로서, ..., the error is determined based on the word with the highest probability.

If the sentence containing '가정으로서' has the highest probability, the word is not an error.

However, if a sentence containing another word has a higher probability, '가정으로서' is an error and that other word becomes the replacement word.
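The decision rule in this example can be sketched as follows. The function `judge`, the scoring callback, and the numeric scores are hypothetical and invented for illustration only; the source computes probabilities with its own equations, which are not reproduced here.

```python
from typing import Callable, Iterable, Optional, Tuple


def judge(target: str,
          candidates: Iterable[str],
          score: Callable[[str], float]) -> Tuple[bool, Optional[str]]:
    """Return (is_error, replacement) for the word under inspection.

    The target is placed first in the pool, so on a tie max() keeps the
    target and no error is reported; an error requires a strictly higher
    score for some other candidate.
    """
    pool = [target] + [c for c in candidates if c != target]
    best = max(pool, key=score)
    if best == target:
        return False, None
    return True, best


# Made-up scores standing in for the sentence probabilities (assumption).
toy_scores = {"가정으로서": 0.1, "가장으로서": 0.7, "사정으로서": 0.2}
result = judge("가정으로서", ["가장으로서", "사정으로서"], toy_scores.__getitem__)
print(result)  # (True, '가장으로서')
```

Placing the target first is a deliberate choice in this sketch: it encodes the condition that a word is replaced only when another word scores strictly higher.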

FIG. 6 is a flowchart showing the process of correcting context-dependent spelling errors in a single sentence using the context-dependent spelling error correction method using real-time error candidate generation according to the present invention.

Word-by-word inspection proceeds from the first word of the sentence to the last to determine errors, and any error word found is corrected with its replacement word.

In the example above, '그는 한 집안의 가정으로서 최선을 다하고 있다.', the correction vocabulary pair to which '가정으로서' belongs consists of the words that have '한', '집안의', '최선을', and '다하고' as their context.

However, if '집안의' was judged to be an error in the preceding step and '집만의' became its replacement word, the correction vocabulary pair to which '가정으로서' belongs consists of the words that have '한', '집만의', '최선을', and '다하고' as their context.

That is, because error determination and correction proceed sequentially, the correction vocabulary pair of the current word is determined dynamically according to whether the previous word was an error.
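The sequential behavior can be illustrated with a short Python sketch. The names `correct_sentence`, `toy_generate`, and `toy_score` are hypothetical stand-ins: the toy generator and scorer simply force the '집안의' to '집만의' replacement from the example so that the effect on the next word's context is visible.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Generate = Callable[[str, Sequence[str], Sequence[str]], List[str]]
Score = Callable[[Sequence[str], str, Sequence[str]], float]


def correct_sentence(words: List[str], generate: Generate, score: Score,
                     k: int = 2) -> List[str]:
    """Inspect words left to right, writing each replacement back in place.

    Because corrections are written into `corrected` immediately, the left
    context of a later word already reflects earlier replacements.
    """
    corrected = list(words)
    for i, target in enumerate(corrected):
        left = corrected[max(0, i - k):i]
        right = corrected[i + 1:i + 1 + k]
        pool = [target] + [c for c in generate(target, left, right) if c != target]
        corrected[i] = max(pool, key=lambda w: score(left, w, right))
    return corrected


# Toy stand-ins (assumptions): record the context seen for each word and
# force the single replacement used in the example.
seen: Dict[str, Tuple[Tuple[str, ...], Tuple[str, ...]]] = {}


def toy_generate(target: str, left: Sequence[str], right: Sequence[str]) -> List[str]:
    seen[target] = (tuple(left), tuple(right))
    return ["집만의"] if target == "집안의" else []


def toy_score(left: Sequence[str], word: str, right: Sequence[str]) -> float:
    return 1.0 if word == "집만의" else 0.0


words = "그는 한 집안의 가정으로서 최선을 다하고 있다.".split()
output = correct_sentence(words, toy_generate, toy_score)
print(seen["가정으로서"])  # (('한', '집만의'), ('최선을', '다하고'))
```

In this run the context recorded for '가정으로서' contains '집만의' rather than '집안의', which mirrors the dynamic determination described above.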

The context-dependent spelling error correction apparatus and method using real-time error candidate generation according to the present invention, as described above, perform correction on correction vocabulary pairs that are generated and built in real time during the inspection step of spelling error correction, which makes it possible to handle a wide variety of errors.

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered in an illustrative rather than a restrictive sense; the scope of the present invention is set out in the claims rather than in the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

10. Input unit
20. Word unit inspection unit
30. Real-time correction vocabulary pair generation unit
40. Error determination unit
50. Replacement word presentation unit

Claims

1. A context-dependent spelling error correction apparatus using real-time error candidate generation, comprising:
an input unit for inputting a sentence in which context-dependent spelling errors are to be detected and corrected;
a word unit inspection unit that inspects the sentence input through the input unit word by word;
a real-time correction vocabulary pair generation unit that generates a correction vocabulary pair for each word inspected by the word unit inspection unit;
an error determination unit that determines an error based on the word with the highest probability; and
a replacement word presentation unit that presents a replacement word according to the determination result of the error determination unit,
wherein, to correct context-dependent spelling errors in a sentence, word-by-word inspection is performed from the first word to the last word of the sentence to determine errors, and an error word, when found, is corrected with a replacement word; if the sentence containing the word under inspection has the highest probability, that word is not an error, and if a sentence containing another word has a higher probability, the word under inspection is an error and the other word becomes the replacement word; and
wherein the correction vocabulary pair of the current word is determined dynamically according to whether the previous word was an error, as error determination and correction proceed sequentially.

2. The apparatus of claim 1, wherein the real-time correction vocabulary pair generation unit generates the correction vocabulary pair to which a target word T belongs by finding words that have the same context as the target word T, using a trigram dictionary built by extracting three consecutive words from a corpus.

3. The apparatus of claim 2, wherein the real-time correction vocabulary pair generation unit generates the correction vocabulary pair based on the two words to the left and the two words to the right of the word under inspection, such that the words having those left and right words as context form the correction vocabulary pair.

4. The apparatus of claim 1, wherein the replacement word presentation unit determines that the word under inspection is not an error if its probability is the highest, and, if another word has a higher probability, determines that the word under inspection is an error and presents the other word as the replacement word.

5. The apparatus of claim 1, wherein the real-time correction vocabulary pair generation unit, for an input word string [symbol not reproduced] and an output word string [symbol not reproduced], with the words changed in passing through the noisy channel [symbol not reproduced], selects [symbol not reproduced] so as to maximize it, and is defined as [equation not reproduced].

6. The apparatus of claim 5, wherein the correction vocabulary pair comprises: [expression not reproduced]

7. A context-dependent spelling error correction method using real-time error candidate generation, comprising:
a word unit inspection step of performing inspection word by word;
a real-time correction vocabulary pair generation step of first generating the correction vocabulary pair of the word under inspection by finding, using a trigram dictionary built by extracting three consecutive words from a corpus, words having the same context as a target word T, and thereby generating the correction vocabulary pair to which the target word T belongs;
an error determination step of determining an error based on the word with the highest probability; and
a replacement word presentation step of presenting a replacement word according to the determination result of the error determination step,
wherein, to correct context-dependent spelling errors in a sentence, word-by-word inspection is performed from the first word to the last word of the sentence to determine errors, and an error word, when found, is corrected with a replacement word; if the sentence containing the word under inspection has the highest probability, that word is not an error, and if a sentence containing another word has a higher probability, the word under inspection is an error and the other word becomes the replacement word; and
wherein the correction vocabulary pair of the current word is determined dynamically according to whether the previous word was an error, as error determination and correction proceed sequentially.

8. The method of claim 7, wherein the real-time correction vocabulary pair generation step generates the correction vocabulary pair based on the two words to the left and the two words to the right of the word under inspection, such that the words having those left and right words as context form the correction vocabulary pair.

9. The method of claim 7, wherein the replacement word presentation step determines that the word under inspection is not an error if its probability is the highest, and, if another word has a higher probability, determines that the word under inspection is an error and presents the other word as the replacement word.

10. The method of claim 7, wherein the real-time correction vocabulary pair generation step, for an input word string [symbol not reproduced] and an output word string [symbol not reproduced], with the words changed in passing through the noisy channel [symbol not reproduced], selects [symbol not reproduced] so as to maximize it, and is defined as [equation not reproduced].

11. The method of claim 10, wherein the correction vocabulary pair comprises: [expression not reproduced]

12. The method of claim 7, wherein, in the real-time correction vocabulary pair generation step, the trigram dictionary used to generate the correction vocabulary pair is built on a trie data structure, in which:
a 1-gram is stored as [expression not reproduced], where [symbol not reproduced] is the frequency of occurrence of [symbol not reproduced] (1-gram), and [symbol not reproduced] points to the storage holding all 2-grams that start with [symbol not reproduced];
a 2-gram is stored as [expression not reproduced] → [expression not reproduced], where [symbol not reproduced] is the count of the 2-gram ([symbol not reproduced], [symbol not reproduced]), and [symbol not reproduced] is the number of [text not reproduced] of the 2-gram ([symbol not reproduced], [symbol not reproduced]); and
finally, the third level is stored as [expression not reproduced] → [expression not reproduced].

13. (Deleted)