KR20220075807A

KR20220075807A - System and Method for correcting Context sensitive spelling error using Generative Adversarial Network

Info

Publication number: KR20220075807A
Application number: KR1020200164302A
Authority: KR
Inventors: 권혁철; 이정훈
Original assignee: 부산대학교 산학협력단
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-06-08
Also published as: KR102517983B1

Abstract

The present invention relates to an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network (GAN), in which the GAN is used to approximate the real language model closely and this information is used to handle the various errors that appear in ordinary text documents. The apparatus comprises: an input unit that receives a sentence to be corrected; an error word inspection unit that inspects the input sentence word by word (by eojeol, the space-delimited unit of Korean text) and searches for context-sensitive spelling errors; a candidate selection unit that selects candidate words by calculating the edit distance between the word to be corrected and the words in a word dictionary; a prediction candidate generation unit that uses a language model generated with the GAN to calculate the distance between the entire surrounding context of the word to be corrected and the candidate words filtered by the candidate selection unit; and a correction presentation unit that selects the final correction based on the inter-word distance values calculated by the language model built with the GAN.

Description

System and Method for correcting Context sensitive spelling error using Generative Adversarial Network

The present invention relates to spelling error correction, and more specifically to an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network, in which the network is used to approximate the real language model closely and this information is used to handle the various errors that appear in ordinary text documents.

Artificial intelligence techniques such as deep learning are now being actively studied worldwide in many areas of computing.

In natural language processing in particular, research groups around the world such as Google Research, Facebook Research, and AllenNLP are developing deep learning techniques, and demand for deep-learning-based programs is growing rapidly across industry.

In natural language processing, the results of information analysis improve when correct sentences are supplied at the outset. Context-sensitive spelling error correction is therefore an indispensable preprocessing technique.

Spelling errors fall into two broad classes: simple spelling errors (non-word spelling errors) and context-sensitive spelling errors.

Simple spelling errors are easier to correct than context-sensitive ones: a word is judged to be an error by checking whether it appears in a dictionary.
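The dictionary check for simple spelling errors can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `find_non_word_errors` and the three-entry `toy_dictionary` are assumptions made for the example.

```python
def find_non_word_errors(words, dictionary):
    """Return the words that do not appear in the dictionary."""
    return [w for w in words if w not in dictionary]


# Hypothetical toy dictionary; a real system would load a full lexicon.
toy_dictionary = {"주의", "주위", "살피다"}

errors = find_non_word_errors(["주이", "주의", "주위"], toy_dictionary)
print(errors)  # only the string missing from the dictionary is flagged
```

A lookup of this kind flags only strings that are not words at all. A real word used in the wrong place passes the check, which is why context-sensitive errors need a different approach.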

Context-sensitive spelling errors are considerably harder to correct. For example, of the phrases "주의를 살피다" and "주위를 살피다" ("to look around"), the word "주의" (attention) is the error. Because both "주의" and "주위" (surroundings) exist in the dictionary, the method used for simple spelling errors cannot resolve the case, and the information in the surrounding context must be used instead.

In this example "주의" is the error, but in another context "주위" could be the error.

Methods for correcting context-sensitive spelling errors can be divided into rule-based methods, methods based on statistical information, and methods that use neural networks.

Rule-based correction requires experts with advanced knowledge of linguistics and computer science to write and verify the rules, and it is not realistically possible to write rules that cover every linguistic phenomenon in the real world.

In particular, frequent or regular errors are likely to be corrected by rule-based methods, but irregular errors caused by typing mistakes cannot be corrected by rules alone and are much harder to correct.

Statistical correction can be applied in language environments where irregular errors occur frequently, and it was the main approach proposed before neural-network-based correction.

Although neural network technology has advanced rapidly, it is hard to find cases where it has been applied to context-sensitive spelling error correction.

One prior-art method corrects context-sensitive spelling errors using pre-built correction word pairs and a statistical model based on the co-occurrence frequency between each word of a pair and the words appearing in the surrounding context (Korean Registered Patent No. 10-1495240).

Another method uses a Korean lexical semantic network when generalizing rules in order to raise the recall of the correction rules (Korean Registered Patent No. 10-1500617).

A further method addresses the data sparseness problem that arises when calculating the association between each word of a correction word pair and the words appearing in the surrounding context (Korean Registered Patent No. 10-1573854).

A more recent method corrects context-sensitive spelling errors by generating error candidates in real time for a variety of errors (Korean Patent Application Publication No. 10-2019-0133624).

A limitation shared by the statistical and deep learning methods of the prior art is that performance drops considerably as the amount of training data decreases.

In some situations training data is hard to obtain, which makes the desired performance hard to reach, and natural language processing is no exception.

Techniques that improve performance with small amounts of data are therefore needed, and this need extends to context-sensitive spelling error correction.

Korean Registered Patent No. 10-1495240
Korean Registered Patent No. 10-1500617
Korean Registered Patent No. 10-1573854
Korean Patent Application Publication No. 10-2019-0133624

The present invention addresses the problems of prior-art spelling error correction. Its object is to provide an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network, in which the network is used to approximate the real language model closely and this information is used to handle the various errors that appear in ordinary text documents.

Another object is to provide such an apparatus and method in which context-sensitive spelling errors are corrected efficiently from small amounts of data, in environments where training data is hard to obtain and the desired performance is therefore hard to reach.

Another object is to provide such an apparatus and method in which the generative adversarial network generates training data resembling sentences written by real people, so that data plentiful enough for training is available.

Another object is to provide such an apparatus and method in which correction accuracy is improved for words that were previously corrected poorly because of data scarcity.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve these objects, an apparatus for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention comprises: an input unit that receives a sentence to be corrected; an error word inspection unit that inspects the input sentence word by word and searches for context-sensitive spelling errors; a candidate selection unit that selects candidate words by calculating the edit distance between the word to be corrected and the words in a word dictionary; a prediction candidate generation unit that uses a language model generated with the generative adversarial network to calculate the distance between the entire surrounding context of the word to be corrected and the candidate words filtered by the candidate selection unit; and a correction presentation unit that selects the final correction based on the inter-word distance values calculated by the language model built with the generative adversarial network.

To achieve another object, a method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention comprises: inspecting an input sentence word by word to determine whether a spelling error is likely; calculating the edit distance between the word to be corrected and the dictionary words of the language model that may become candidates; calculating the words that may replace the word to be corrected by using a language model generated with the generative adversarial network to compute the distance between the entire surrounding context of the word to be corrected and the candidate words filtered by the candidate selection unit; and presenting the final correction based on the ranked information.

The apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention, as described above, have the following effects.

First, a generative adversarial network is used to approximate the real language model closely, and this information makes it possible to handle the various errors that appear in ordinary text documents.

Second, context-sensitive spelling errors are corrected efficiently from small amounts of data in environments where training data is hard to obtain and the desired performance is therefore hard to reach.

Third, the generative adversarial network generates training data resembling sentences written by real people, so that data plentiful enough for training can be provided.

Fourth, correction accuracy can be improved for words that were previously corrected poorly because of data scarcity.

FIGS. 1A and 1B are diagrams of language model training using context generated by a generative adversarial network according to the present invention.
FIG. 2 is a diagram of a language model training algorithm using context generated by a generative adversarial network according to the present invention.
FIG. 3 is a block diagram of an apparatus for correcting context-sensitive spelling errors using a language model according to the present invention.
FIG. 4 is a flowchart of a method for correcting context-sensitive spelling errors using a language model according to the present invention.

A preferred embodiment of the apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention is described in detail below.

The features and advantages of the apparatus and method will become apparent from the detailed description of each embodiment below.

FIGS. 1A and 1B are diagrams of language model training using context generated by a generative adversarial network according to the present invention, and FIG. 2 is a diagram of the corresponding language model training algorithm.

The apparatus and method according to the present invention use a generative adversarial network to approximate the real language model closely and use this information to handle the various errors that appear in ordinary text documents.

To this end, the present invention generates sentences with a generative adversarial network, trains a language model on the generated sentences, and performs correction by using the language model to determine the relationship between the word to be corrected and its surrounding context.

A generative adversarial network approaches the problem of data generation on the basis of the minimax theorem of John von Neumann, a result in game theory. As shown in FIGS. 1A and 1B, deep neural networks forming a generative model and a discriminative model are trained against each other, repeating generation and discrimination in mutual dependence.

As this training is repeated, the discriminator eventually can no longer tell real data from fake data, and the Nash equilibrium of game theory has been reached. This means that the discriminator, which used to classify the arbitrary sentences produced by the generator as fake, can no longer distinguish the sentences produced by the generator as improved through training.

The sentences finally produced by the generator come to resemble sentences written by real people and thereby become meaningful information.

For correction, the present invention trains a language model close to real usage on the generated sentences and performs correction with that language model.

The present invention also applies to generative adversarial networks of various structures and is not limited to the structure of FIGS. 1A and 1B.

FIG. 2 shows an example of an algorithm according to the present invention that generates sentences with a generative adversarial network and reflects them in the language model.

The apparatus for correcting context-sensitive spelling errors using the language model according to the present invention is described in detail below.

FIG. 3 is a block diagram of the apparatus for correcting context-sensitive spelling errors using a language model according to the present invention.

As shown in FIG. 3, the apparatus comprises an input unit (301) that receives a sentence to be corrected, an error word inspection unit (302) that inspects the words of the sentence received through the input unit (301) for errors in order, a candidate selection unit (303) that selects correction candidates for a word when the word is judged to contain an error, a prediction candidate generation unit (304) that calculates the distance values between the candidates and the context surrounding the position of the word to be corrected, and a correction presentation unit (305) that finally presents the candidate closest to the context.

The correction process using the apparatus configured in this way is described step by step below.

Every word used in correction is represented using the generated language model, where a language model is a model used for understanding natural language.

Equation 1 represents the basic generative adversarial network, where D is the discriminative model (discriminator) and G is the generative model (generator).

The generator takes a random vector z as input and produces fake data, and the discriminator outputs 1 for real data and 0 for fake data. Early in training, D(G(z)), which judges the fake data produced by the generator, evaluates to 0, and D(x), which takes real data x as input, evaluates to 1.

The two expectations E in Equation 1 are taken over the real data distribution and over the distribution of the data produced by the generator, respectively.

In Equation 1, the discriminator must judge real data as real and fake data as fake during training, so it tries to maximize the total value of the expression, whereas the generator tries to minimize it by producing fake data as close as possible to the real data. As training proceeds, the generator and discriminator networks therefore come into balance.
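The formula of Equation 1 itself is not reproduced in the text above. For reference, the minimax objective of a basic generative adversarial network, which matches the description of D, G, z, x and E, is conventionally written as shown here. The distribution names p_data and p_z are the customary ones and are an assumption, not symbols confirmed by the source.

```latex
\min_{G} \max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

With D(x) = 1 and D(G(z)) = 0, both logarithms are 0, which is the largest value V can take. This is the state the discriminator aims for, while the generator pushes D(G(z)) toward 1 and so drives the second term down.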

The error word inspection unit (302) is an N-gram model, a statistical language model. Equation 2 substitutes each member of the statistical candidate set T at the position of the word under inspection and, given the context on either side of that position, computes the context probability of the statistical candidate for which that probability is highest.

The statistical language model is used only in the error word inspection step. It decides whether a word is an error solely by whether the word under inspection has a higher or lower probability than the statistical candidates in the candidate set T.

Statistical candidates are distinct from correction candidates and are used only while inspecting the word to be corrected.

The result of Equation 2 is the predicted candidate with the highest probability. The equation also contains the probability that a word Y, which can be written by mistake, is misused in place of the word W that appears in the actual document.

As in Equation 3, the statistical candidates are obtained from a pre-built 3-gram dictionary: the 3-grams that fall within two words on either side of the center position '*' are looked up. The purpose of the lookup is to retrieve every statistical candidate that occurs together with the context words surrounding the center position '*'.

The retrieved words form the candidate set, and the words close to the word currently under inspection are selected by calculating their edit distance from it.
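The wildcard lookup in the 3-gram dictionary can be sketched as follows. This is an illustrative reading of the description, not the patent's code: the function names and the four-entry `toy_trigrams` list are invented for the example, and a real dictionary would be indexed rather than scanned linearly.

```python
from typing import Iterable

Trigram = tuple[str, str, str]


def wildcard_patterns(words: list[str], center: int) -> list[tuple[int, tuple]]:
    """Return (slot, pattern) pairs for each 3-gram window covering `center`.

    `slot` is the index of the wildcard '*' inside the pattern.
    """
    patterns = []
    for start in range(center - 2, center + 1):
        if start < 0 or start + 3 > len(words):
            continue  # the window would run off the sentence
        window = list(words[start:start + 3])
        slot = center - start
        window[slot] = "*"
        patterns.append((slot, tuple(window)))
    return patterns


def statistical_candidates(words: list[str], center: int,
                           trigram_dict: Iterable[Trigram]) -> set[str]:
    """Collect every word that fills '*' in some stored 3-gram."""
    trigrams = list(trigram_dict)
    found = set()
    for slot, pattern in wildcard_patterns(words, center):
        for gram in trigrams:
            if all(p == "*" or p == g for p, g in zip(pattern, gram)):
                found.add(gram[slot])
    return found


# Toy data, invented for illustration only.
sentence = ["도대체", "장모라는", "사람이", "가위가", "왔는디", "씨암탉은", "못"]
toy_trigrams = [
    ("장모라는", "사람이", "사위가"),
    ("사람이", "거위가", "왔는디"),
    ("학교가", "왔는디", "씨암탉은"),
    ("도대체", "장모라는", "사람이"),
]
candidates = statistical_candidates(sentence, 3, toy_trigrams)
```

For the center word at index 3, the three generated patterns place '*' at the end, middle, and start of the window, so each stored 3-gram contributes the word found at the wildcard slot.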

Edit distance starts at 1 and is the measure used to compare how much two words differ: it grows with each letter or phoneme (jamo) that is inserted, deleted, or substituted to turn the reference word into the compared word.

For example, the compared word '사위' (son-in-law) differs from the reference word '가위' (scissors) by the substitution of 'ㅅ' for 'ㄱ', so the edit distance of '사위' from '가위' is 1.
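A phoneme-level edit distance for Korean can be sketched by splitting each precomposed Hangul syllable into its initial consonant, vowel, and optional final consonant, then running an ordinary Levenshtein computation over those units. The decomposition below uses the standard Unicode arithmetic for the Hangul syllable block; the function names are assumptions, and the source does not state how its own edit distance function is implemented.

```python
def to_jamo(text: str) -> list[tuple[str, int]]:
    """Split precomposed Hangul syllables into (position, index) jamo units."""
    units = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:               # inside the Hangul syllable block
            lead, rest = divmod(code, 588)  # 588 = 21 vowels * 28 finals
            vowel, tail = divmod(rest, 28)
            units.append(("lead", lead))
            units.append(("vowel", vowel))
            if tail:                        # tail 0 means no final consonant
                units.append(("tail", tail))
        else:
            units.append(("other", ord(ch)))
    return units


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (insert, delete, substitute) over jamo units."""
    x, y = to_jamo(a), to_jamo(b)
    prev = list(range(len(y) + 1))
    for i, xu in enumerate(x, start=1):
        curr = [i]
        for j, yu in enumerate(y, start=1):
            cost = 0 if xu == yu else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("가위", "사위"))  # 1: only the initial consonant differs
```

Comparing at the jamo level rather than the syllable level is what makes '가위' and '사위' one edit apart through a single consonant substitution, as in the example above.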

The prediction candidate generation unit (304) uses the language model generated with the generative adversarial network.

In Equation 4, given the sentence input to the language model and the set C of correction candidates chosen by the candidate selection unit, the candidate with the largest context-sensitive spelling error correction distance value is selected.

The set C of correction candidates in Equation 4 is, as in Equation 5, the set of N correction candidates obtained by applying the edit distance function EDF between the center word and the entire embedding vocabulary of the language model and keeping the words that satisfy the configured edit distance.

Equation 6 computes the sum of the dot products over the input context of Equation 4. Its terms are the correction candidate word, the size of the context surrounding the candidate, and a function that returns the dot product between the candidate and a context word; this function is where the inter-word distance of the particular word embedding model is applied.

Equation 6 also includes a smoothing value for handling out-of-vocabulary words. If the language model compares words by their form and computes the dot product from the features of the out-of-vocabulary word, the smoothing value may be omitted.
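The ranking described by Equations 4 and 6 can be sketched as a sum of dot products between a candidate's vector and the vectors of the context words, with the largest sum winning. Everything concrete here is an assumption for illustration: the function names, the two-dimensional `toy_embeddings`, and in particular the way the smoothing value is applied (substituted for the dot product whenever a word has no vector), which the source does not specify.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_score(candidate, context_words, embeddings, smoothing=0.0):
    """Sum of dot products between the candidate and each context word.

    Assumption: a pair involving an out-of-vocabulary word contributes
    only the smoothing value.
    """
    cand_vec = embeddings.get(candidate)
    total = 0.0
    for word in context_words:
        ctx_vec = embeddings.get(word)
        if cand_vec is None or ctx_vec is None:
            total += smoothing
        else:
            total += dot(cand_vec, ctx_vec)
    return total


def rank_candidates(candidates, context_words, embeddings, smoothing=0.0):
    """Order candidates from the highest context score to the lowest."""
    return sorted(
        candidates,
        key=lambda c: context_score(c, context_words, embeddings, smoothing),
        reverse=True,
    )


# Hypothetical 2-dimensional vectors; a trained model would supply real ones.
toy_embeddings = {
    "장모라는": [1.0, 0.0],
    "사람이": [0.5, 0.5],
    "왔는디": [0.8, 0.2],
    "사위가": [0.9, 0.1],
    "거위가": [0.1, 0.9],
    "가위가": [0.0, 0.3],
}
context = ["장모라는", "사람이", "왔는디"]
ranked = rank_candidates(["가위가", "거위가", "사위가"], context, toy_embeddings)
```

With these toy vectors '사위가' scores highest because its vector points in nearly the same direction as the context vectors, which is the sense in which a larger value means the candidate is closer to the context.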

The method for correcting context-sensitive spelling errors using the language model according to the present invention is described in detail below.

FIG. 4 is a flowchart of the method for correcting context-sensitive spelling errors using a language model according to the present invention.

First, the document to be searched and corrected for context-sensitive spelling errors is input (S401), the words in each sentence of the document are inspected in order (S402), and each word is judged for errors (S403).

If the word has no problem, the next word is inspected. If the word is judged to contain an error, it is designated the word to be corrected and the tag '<target word>' is applied to it (S404).

To reduce the amount of computation before the distance to the context is calculated with the correction candidates, the edit distance between the word to be corrected and every word in the language model's dictionary is calculated, and the candidates within the configured distance are identified (S405). For these candidates, the language model is used to obtain the distance value between each candidate and the context surrounding the target word (S406), and the top-ranked candidate is selected as the final correction (S407).

If the selected correction is the same as the word to be corrected, no correction is made; if it differs, the word is replaced with it (S407).

When prediction of the correction is finished, the method checks whether a next word or sentence remains (S408) and decides whether to terminate.
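Steps S402 to S408 amount to a loop over the words of a sentence, which can be sketched as follows. The three callables `is_error`, `select_candidates`, and `rank` are hypothetical stand-ins for the error word inspection unit, the candidate selection unit, and the prediction candidate generation unit; the stubs in the usage lines return fixed answers only to show the control flow. Inspecting later words against already corrected earlier words is also an assumption.

```python
def correct_sentence(words, is_error, select_candidates, rank):
    """Inspect the words in order and replace those judged erroneous."""
    corrected = list(words)
    for i in range(len(corrected)):                 # S402: next word
        word = corrected[i]
        if not is_error(corrected, i):              # S403: error check
            continue
        # S404: mark the word to be corrected with the tag.
        tagged = corrected[:i] + ["<target word>"] + corrected[i + 1:]
        candidates = select_candidates(word)        # S405: edit-distance filter
        if not candidates:
            continue
        ranked = rank(candidates, tagged)           # S406: distance to context
        best = ranked[0]                            # S407: top-ranked candidate
        if best != word:
            corrected[i] = best
    return corrected                                # S408: no words remain


# Stub components with fixed behavior, for illustration only.
result = correct_sentence(
    ["장모라는", "사람이", "가위가", "왔는디"],
    is_error=lambda ws, i: ws[i] == "가위가",
    select_candidates=lambda w: ["가위가", "사위가", "거위가"],
    rank=lambda cands, ctx: sorted(cands, key=lambda c: c != "사위가"),
)
```

Keeping the three stages as separate callables mirrors the unit structure of the apparatus, so that the statistical detector and the language-model ranker can be swapped independently.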

In the present invention, the method inspects each sentence for errors in order from its first word to its last.

As an example of inspecting a word, take the sentence '도대체 장모라는 사람이 가위가 왔는디 씨암탉은 못...'. Its words are '도대체', '장모라는', '사람이', '가위가', '왔는디', '씨암탉은', '못', ..., and the center word currently being inspected is assumed to be '가위가' (scissors).

With '가위가' as the center, the patterns within a two-word range, ('장모라는', '사람이', '*'), ('사람이', '*', '왔는디'), and ('*', '왔는디', '씨암탉은'), are looked up in a large pre-built 3-gram dictionary. This yields the set of candidates that can occupy the position '*' in context, such as '거위가', '써위가', '학교가', '사위가', '하위가', '가위가', and '도로가'. From these, the candidates whose edit distance from the center word '가위가' is small (edit distance 1 in this example), such as '거위가', '사위가', '하위가', and '가위가', are selected.

Next, the probability of the two-word-range context around the inspected word is calculated. The sentence probability of the context containing the center word, '장모라는 사람이 가위가 왔는디 씨암탉은', is compared with the probabilities obtained by replacing '가위가' with each candidate in the same context. If the context containing '가위가' has the highest probability, the word is judged to contain no error; if some candidate gives the context a higher probability than '가위가' does, the word is judged to contain an error.
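The decision rule just described can be sketched as a comparison of context probabilities with and without substitution. The way the context probability is computed here (a product of 3-gram probabilities with a small floor for unseen 3-grams) is an assumption for illustration, as are the function names and the probabilities in `toy_trigram_prob`.

```python
def make_context_probability(trigram_prob, floor=1e-6):
    """Build a scorer that multiplies the probabilities of all 3-grams."""
    def context_probability(words):
        p = 1.0
        for i in range(len(words) - 2):
            p *= trigram_prob.get(tuple(words[i:i + 3]), floor)
        return p
    return context_probability


def is_error_word(words, center, candidates, context_probability):
    """True if some candidate makes the context more probable than the original."""
    original = context_probability(words)
    for cand in candidates:
        if cand == words[center]:
            continue
        replaced = words[:center] + [cand] + words[center + 1:]
        if context_probability(replaced) > original:
            return True
    return False


# Invented probabilities, for illustration only.
toy_trigram_prob = {
    ("장모라는", "사람이", "사위가"): 0.02,
    ("사람이", "사위가", "왔는디"): 0.01,
    ("사위가", "왔는디", "씨암탉은"): 0.01,
    ("장모라는", "사람이", "가위가"): 0.0001,
}
score = make_context_probability(toy_trigram_prob)
window = ["장모라는", "사람이", "가위가", "왔는디", "씨암탉은"]
flagged = is_error_word(window, 2, ["거위가", "사위가", "가위가"], score)
```

In this toy setting the window with '사위가' is far more probable than the window with '가위가', so the word is flagged; the statistical candidates serve only this detection step and are not themselves the corrections.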

When the word is judged to be a correction target, the central word '가위가' is first replaced with '<target word>', and the resulting sentence '도대체', '장모라는', '사람이', '<target word>', '왔는디', '씨암탉은', '못', ... is fed into the language model.

The language model is trained in advance through deep learning. It takes as input the entire context with the central word marked, and at its output it predicts the central word from the entire context excluding the central word, computing the distance to the context for every entry in the embedding vocabulary of the corpus used during training.

To reduce the amount of computation in the correction language model, words that are close in edit distance to '가위가', such as '고위가', '거위가', '사위가', '다위가', and '하위가', are selected from the entire embedding vocabulary (the language model's own dictionary, which differs from the candidate words obtained from the statistical model in the earlier example). These correction candidates are then ranked by the distance values, computed in the language model, between each candidate and the surrounding context of the word to be corrected.
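
The ranking step can be sketched as below, assuming the score is the sum of dot products between a candidate's embedding and the embeddings of the context words, with a small smoothing value for words missing from the vocabulary. The `EMBED` vectors, the `SMOOTHING` constant, and the function names are hypothetical; in the described system the embeddings would come from the language model trained with the generative adversarial network.

```python
TARGET = "<target word>"
SMOOTHING = 1e-6  # assumed value used for out-of-vocabulary words

# Hypothetical toy embeddings standing in for the trained language model.
EMBED = {
    "장모라는": [0.9, 0.1, 0.0],
    "사람이": [0.7, 0.2, 0.1],
    "왔는디": [0.5, 0.3, 0.2],
    "씨암탉은": [0.6, 0.1, 0.3],
    "사위가": [0.8, 0.2, 0.1],
    "가위가": [0.0, 0.9, 0.1],
    "거위가": [0.2, 0.1, 0.9],
    "고위가": [0.1, 0.5, 0.2],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def context_score(candidate, context):
    """Sum of dot products between the candidate and each context word."""
    if candidate not in EMBED:
        return SMOOTHING
    return sum(dot(EMBED[candidate], EMBED[w]) if w in EMBED else SMOOTHING
               for w in context)

def rank_candidates(words, i, candidates):
    """Mask position i, then sort candidates by context score, best first."""
    masked = words[:i] + [TARGET] + words[i + 1:]
    context = [w for w in masked if w != TARGET]
    return sorted(candidates, key=lambda c: context_score(c, context),
                  reverse=True)

def correct(words, i, candidates):
    """Return the top-ranked candidate; equal to words[i] means no error."""
    return rank_candidates(words, i, candidates)[0]

words = ["도대체", "장모라는", "사람이", "가위가", "왔는디", "씨암탉은", "못"]
pool = ["고위가", "거위가", "사위가", "하위가", "가위가"]
ranking = rank_candidates(words, 3, pool)
```

With these toy vectors '사위가' ranks first, so '가위가' would be replaced. A higher dot product is treated here as a closer fit to the context, which is one reading of the "distance value" in the text.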

If the top-ranked candidate is the same as the word being checked, '가위가', the word is judged to contain no error. If the top rank goes to a different candidate, the word is judged to contain an error and is corrected.

This process is repeated sequentially until the end of the input document, and the final correction result is then output.

The apparatus and method for correcting context-dependent spelling errors using a language model according to the present invention, as described above, can handle a variety of errors by using, in the spelling error correction step, the distance values between the word to be corrected and its context obtained from the language model.

The apparatus and method for correcting context-dependent spelling errors using a generative adversarial network according to the present invention, as described above, use a generative adversarial network to closely approximate the real language model and use this information to handle the various errors that appear in ordinary text documents.
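
For reference, the objective commonly used to train a generative adversarial network in the general literature is the minimax game below. The patent's own formula is not reproduced in this text, so the notation here ($p_{\text{data}}$ for the real data distribution, $p_z$ for the distribution of the random input vector $z$) is an assumption and may differ from the symbols used in the claims.

```latex
\min_{G}\,\max_{D}\; V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

In this form the discriminator $D$ is pushed toward outputting 1 for real samples and 0 for generated ones, while the generator $G$ is trained to make $D(G(z))$ approach 1.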

As described above, it will be understood that the present invention may be implemented in modified forms without departing from its essential characteristics.

Therefore, the described embodiments should be considered illustrative rather than restrictive. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within an equivalent scope should be construed as being included in the present invention.

301. Input unit
302. Error word checker
303. Candidate selection unit
304. Prediction candidate generator
305. Correction word presenting unit

Claims

[1] A context-dependent spelling error correction apparatus using a generative adversarial network, comprising:
an input unit for inputting a sentence to be corrected;
an error word checker that examines the input sentence word by word and searches for context-dependent spelling errors;
a candidate selection unit that selects candidate words by calculating the edit distance between the word to be corrected and the words in a word dictionary;
a prediction candidate generator that calculates, using a language model generated with a generative adversarial network, the distance between the entire surrounding context of the word to be corrected and the candidate words filtered by the candidate selection unit; and
a correction word presenting unit that selects a final correction word based on the inter-word distance values calculated in the language model generated with the generative adversarial network.

[2] The apparatus of claim 1, wherein the generative adversarial network in the prediction candidate generator is defined as
[formula]
where D is the discriminator, G is the generator, [formula] is the actual data distribution, [formula] is the distribution of the data generated by the generator, and E is the expected value;
the generator takes a random vector z as input and generates fake data, and the discriminator outputs 1 for real data and 0 for fake data; and
[formula] is calculated as 1.

[3] The apparatus of claim 1, wherein the error word checker uses an N-gram model, which is a statistical language model;
for the position [formula] of the word to be corrected, at which the words of the statistical candidate set T are substituted, and its surrounding context [formula], the error word checker calculates, with respect to [formula], the context probability of [formula], the largest among the statistical candidates;
[formula] is the probability that a word Y is used in error in place of the word W that appears in the actual document; and
the statistical language model judges whether the word is an error solely by whether the word being checked has a higher or lower probability than the statistical candidate words in the candidate set T.

[4] The apparatus of claim 3, wherein the statistical candidate words are obtained from a 3-gram dictionary built in advance by searching, relative to the central word position '*', the 3-grams that cover the two words on either side, defined as
[formula]
the purpose of the search being to find all statistical candidate words that appear at the central word position '*' together with the surrounding context words; the words found belong to the candidate word set; and the edit distance between each of these words and the word currently being corrected is calculated and the close words are selected.

[5] The apparatus of claim 4, wherein the prediction candidate generator uses a language model [formula] generated with a generative adversarial network;
the sentence input to the language model is denoted [formula] and the set of correction candidates selected by the candidate selection unit is denoted C; and
the context-dependent spelling error correction distance value is taken such that [formula] is maximized, [formula].

[6] The apparatus of claim 5, wherein C, the set of correction candidates, is defined as
[formula]
and is the set of N correction candidate words obtained, using the edit distance calculation function EDF, from the entire embedding vocabulary of the language model as those satisfying the set edit distance relative to the central word [formula].

[7] The apparatus of claim 6, wherein the correction word presenting unit obtains the sum of the dot products with the input context using
[formula]
where [formula] is a correction candidate word; [formula] is the size of the surrounding context of the word to be corrected; [formula] is a function that computes the dot product between the correction candidate word and a context word, being the part that applies the distance value between words in each word embedding model; and [formula] is a smoothing value for handling unregistered (out-of-vocabulary) words.

[8] The apparatus of claim 1, wherein the correction word presenting unit, in order to predict the correction word, obtains the distance value of each candidate word from the surrounding context using the values calculated over the entire vocabulary by the language model generated with the generative adversarial network, determines the candidate with the highest value among the correction candidates based on the calculated distance values, and presents that word as the replacement.

[9] A context-dependent spelling error correction method using a generative adversarial network, comprising:
determining the possibility of a spelling error by examining the input sentence word by word;
selecting candidate words by calculating the edit distance between the word to be corrected and the dictionary words in the language model;
determining a word to replace the word to be corrected by calculating, using a language model generated with a generative adversarial network, the distance between the entire surrounding context of the word to be corrected and the candidate words filtered by the candidate selection unit; and
presenting a final correction word based on the ranked information.