KR102517983B1

KR102517983B1 - System and Method for correcting Context sensitive spelling error using Generative Adversarial Network

Info

Publication number: KR102517983B1
Application number: KR1020200164302A
Authority: KR
Inventors: 권혁철; 이정훈
Original assignee: 부산대학교 산학협력단
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2023-04-05
Also published as: KR20220075807A

Abstract

The present invention relates to an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network, in which the generative adversarial network is used to closely approximate the actual language model and this information is used to handle the various errors that appear in ordinary text documents. The apparatus comprises: an input unit for inputting a sentence to be corrected; an error word checking unit that inspects the input sentence word by word and searches for context-sensitive spelling errors; a candidate selection unit that selects candidate words by calculating the edit distance between the word to be corrected and the words in a word dictionary; a prediction candidate generation unit that calculates, using a language model generated with the generative adversarial network, the distance between the entire context surrounding the word to be corrected and the candidate words filtered by the candidate selection unit; and a correction presentation unit that selects the final correction based on the inter-word distance values calculated in the language model produced with the generative adversarial network.

Description

System and method for correcting context sensitive spelling error using Generative Adversarial Network

The present invention relates to spelling error correction and, more specifically, to an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network, in which the generative adversarial network is used to closely approximate the actual language model and this information is used to handle the various errors that appear in ordinary text documents.

Recently, artificial intelligence technologies such as deep learning have been actively studied worldwide in various computer-related fields.

In natural language processing in particular, research groups around the world such as Google Research, Facebook Research, and AllenNLP are developing deep learning technologies, and demand for deep-learning-based programs is growing rapidly across industries.

In natural language processing, the more correct the sentences supplied at the start of the analysis, the better the quality of the results. Context-sensitive spelling error correction is therefore an indispensable preprocessing technique.

Spelling errors fall into two broad types: simple (non-word) spelling errors and context-sensitive spelling errors.

Simple spelling errors are easier to correct than context-sensitive ones: a word is judged to be an error by checking whether it is contained in a dictionary.

Context-sensitive spelling errors, by contrast, are considerably harder to correct. For example, of the phrases "주의를 살피다" and "주위를 살피다" ("to look around"), the first contains the error word "주의" ("attention"). Because both "주의" and "주위" ("surroundings") exist in the dictionary, the error cannot be resolved by the approach used for simple spelling errors and must instead be resolved by using information from the surrounding context.

In this example "주의" is the error word, but depending on the context "주위" could be the error word instead.
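The difference between the two error types can be illustrated with a minimal sketch. The toy dictionary and the function name `is_non_word_error` are assumptions made for illustration; the misspelling "주이" is likewise a made-up example.

```python
# Toy dictionary for illustration only; a real system would load a full lexicon.
DICTIONARY = {"주의", "주위", "가위", "사위"}


def is_non_word_error(word, dictionary=DICTIONARY):
    """A simple (non-word) spelling error is any word missing from the dictionary."""
    return word not in dictionary


# A misspelling that is not a word is caught by the lookup.
print(is_non_word_error("주이"))
# "주의" is a valid word, so a dictionary lookup cannot tell that it was
# written where "주위" was intended; only the context can reveal that.
print(is_non_word_error("주의"))
```

Both "주의" and "주위" pass the lookup, which is why the context around the word has to be consulted.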

Methods for correcting context-sensitive spelling errors can be divided into rule-based methods, methods based on statistical information, and methods using neural networks.

Rule-based correction requires experts with advanced knowledge of linguistics and computer science to create and verify the rules, and it is practically impossible to write rules that reflect every linguistic phenomenon in the real world.

In particular, frequent or regular errors are likely to be correctable by rule-based methods, but irregular errors caused by typing mistakes cannot be corrected by rule-based methods alone and are much harder to correct.

Statistical correction methods are applicable to language environments in which irregular errors occur frequently, and they were the approach mainly proposed before neural-network-based correction.

Despite the rapid development of neural network technology, it is hard to find cases in which it has been applied to context-sensitive spelling error correction.

One prior-art method corrects context-sensitive spelling errors using pre-built correction word pairs and a statistical model based on the co-occurrence frequency between each word of a pair and the words appearing in the surrounding context (Korean Registered Patent No. 10-1495240).

Another method uses a Korean lexical semantic network when generalizing rules in order to increase the recall of correction rules (Korean Registered Patent No. 10-1500617).

A further method addresses the data sparseness problem that arises when calculating the association between each word of a correction word pair and the words appearing in the surrounding context (Korean Registered Patent No. 10-1573854).

A more recent method corrects context-sensitive spelling errors by generating error candidates in real time for a variety of errors (Korean Patent Application Publication No. 10-2019-0133624).

A limitation shared by the prior-art statistical learning methods and deep learning methods is that performance drops considerably as the amount of training data decreases.

In some situations training data is hard to obtain, which makes the desired performance hard to reach, and natural language processing is no exception.

Accordingly, techniques that improve performance using a small amount of data are needed, and this need also applies to context-sensitive spelling error correction.

Korean Registered Patent No. 10-1495240
Korean Registered Patent No. 10-1500617
Korean Registered Patent No. 10-1573854
Korean Patent Application Publication No. 10-2019-0133624

The present invention is intended to solve the problems of prior-art spelling error correction techniques. One object is to provide an apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network, in which the generative adversarial network is used to closely approximate the actual language model and this information is used to handle the various errors that appear in ordinary text documents.

Another object of the present invention is to provide such an apparatus and method that corrects context-sensitive spelling errors efficiently using a small amount of data in environments where training data is hard to obtain and the desired performance is therefore hard to reach.

Another object of the present invention is to provide such an apparatus and method that uses a generative adversarial network to generate training data resembling sentences actually used by people, thereby supplying data in sufficient quantity for training.

Another object of the present invention is to provide such an apparatus and method that increases correction accuracy for words that were previously corrected poorly because of insufficient data.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the description below.

To achieve the above objects, an apparatus for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention comprises: an input unit for inputting a sentence to be corrected; an error word checking unit that inspects the input sentence word by word and searches for context-sensitive spelling errors; a candidate selection unit that selects candidate words by calculating the edit distance between the word to be corrected and the words in a word dictionary; a prediction candidate generation unit that calculates, using a language model generated with a generative adversarial network, the distance between the entire context surrounding the word to be corrected and the candidate words filtered by the candidate selection unit; and a correction presentation unit that selects the final correction based on the inter-word distance values calculated in the language model produced with the generative adversarial network.

To achieve another object, a method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention comprises: examining an input sentence word by word to determine the possibility of a spelling error; calculating the edit distance between the word to be corrected and the dictionary words of the language model that may become candidates; calculating, using a language model generated with a generative adversarial network, the distance between the entire context surrounding the word to be corrected and the candidate words filtered by the candidate selection unit, thereby computing the words that may replace the word to be corrected; and presenting the final correction word based on the ranked information.

The apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention, as described above, have the following effects.

First, a generative adversarial network is used to closely approximate the actual language model, and this information makes it possible to handle the various errors that appear in ordinary text documents.

Second, context-sensitive spelling errors are corrected efficiently using a small amount of data in environments where training data is hard to obtain and the desired performance is therefore hard to reach.

Third, a generative adversarial network is used to generate training data resembling sentences actually used by people, so that data can be supplied in sufficient quantity for training.

Fourth, correction accuracy is increased for words that were previously corrected poorly because of insufficient data.

FIGS. 1A and 1B are diagrams showing language model training using context generated by a generative adversarial network according to the present invention.
FIG. 2 is a diagram showing a language model training algorithm using context generated by a generative adversarial network according to the present invention.
FIG. 3 is a block diagram of an apparatus for correcting context-sensitive spelling errors using a language model according to the present invention.
FIG. 4 is a flowchart showing a method for correcting context-sensitive spelling errors using a language model according to the present invention.

Hereinafter, preferred embodiments of the apparatus and method for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention are described in detail.

The features and advantages of the apparatus and method according to the present invention will become apparent from the detailed description of each embodiment below.

FIGS. 1A and 1B are diagrams showing language model training using context generated by a generative adversarial network according to the present invention, and FIG. 2 is a diagram showing a language model training algorithm using context generated by a generative adversarial network according to the present invention.

The apparatus and method according to the present invention use a generative adversarial network to closely approximate the actual language model and use this information to handle the various errors that appear in ordinary text documents.

To this end, the present invention generates sentences with a generative adversarial network, trains a language model on the generated sentences, and uses the language model to identify the relationship between the word to be corrected and its surrounding context in order to correct it.

A generative adversarial network solves the data generation problem with an approach based on John von Neumann's minimax theorem from game theory. As shown in FIGS. 1A and 1B, deep neural networks consisting of a generative model and a discriminative model of the data are trained adversarially, repeating generation and discrimination in dependence on each other.

As this training is repeated, the point at which the discriminator can no longer distinguish real data from fake data corresponds to reaching the Nash equilibrium of game theory. This means that the discriminator, which used to classify the arbitrary sentences produced by the generator as fake, can no longer tell apart the sentences produced by the generator once the generator has improved through training.

Ultimately, the sentences produced by the generator come to resemble sentences written by real people and thus become meaningful information.

For correction in the present invention, a language model close to the actual one is trained on the generated sentences, and correction is performed using that language model.

The present invention can also be applied to generative adversarial networks of various structures and is not limited to the structure shown in FIGS. 1A and 1B.

FIG. 2 shows an example of an algorithm that generates sentences using a generative adversarial network according to the present invention and reflects them in the language model.

The apparatus for correcting context-sensitive spelling errors using a language model according to the present invention is described in detail as follows.

FIG. 3 is a block diagram of an apparatus for correcting context-sensitive spelling errors using a language model according to the present invention.

As shown in FIG. 3, the apparatus for correcting context-sensitive spelling errors using a generative adversarial network according to the present invention comprises: an input unit (301) for inputting a sentence whose errors are to be corrected; an error word checking unit (302) that sequentially checks the words of the sentence input through the input unit (301) for errors; a candidate selection unit (303) that, when a word is judged to contain an error, selects correction candidates for that word; a prediction candidate generation unit (304) that calculates the distance values between the candidates and the context surrounding the position of the word to be corrected; and a correction presentation unit (305) that finally presents the candidate closest to the context.

The correction process using the apparatus configured as above is described step by step as follows.

All words used in correction are represented using the generated language model, where a language model is a model used for understanding natural language.

Equation 1 represents a basic generative adversarial network, where D is the discriminative model (discriminator) and G is the generative model (generator).

The generator takes a random vector z as input and generates fake data, and the discriminator outputs 1 for real data and 0 for fake data. Accordingly, $D(G(z))$, which judges the fake data initially produced by the generator, evaluates to 0, and $D(x)$, which takes real data x as input, evaluates to 1.

Of the two distributions appearing in the equation, one is the real data distribution and the other is the distribution of the data produced by the generator; E denotes the expected value.

In Equation 1, during training the discriminator must judge real data as real and fake data as fake, so it tries to maximize the total value of the expression, whereas the generator tries to minimize it by producing fake data as close as possible to the real data. As training is repeated, the generator and discriminator networks therefore reach a balance.
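As a point of reference, the minimax objective of a basic generative adversarial network is commonly written in the following form. The symbol names $V(D,G)$, $p_{data}$, and $p_z$ follow the usual convention in the literature and are assumed here; they may differ from the notation used in Equation 1.

```latex
\min_{G}\max_{D} V(D,G)
  = \mathbb{E}_{x \sim p_{data}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_{z}(z)}\left[\log\left(1 - D(G(z))\right)\right]
```

With $D(x) = 1$ and $D(G(z)) = 0$, both logarithms equal 0, which is the largest value the expression can take and is what the discriminator aims for. As the generator improves, $D(G(z))$ rises toward 1 and the second term decreases, which is the direction the generator pushes.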

The error word checking unit (302) is an N-gram model, a statistical language model. Equation 2 substitutes the members of the statistical candidate set T at the position of the word to be corrected and, for the context surrounding that position, computes the context probability of the statistical candidate for which that probability is highest.

The statistical language model is used only in the error word checking step. It decides whether a word is an error solely by checking whether the word under inspection has a higher or lower probability than the statistical candidates in the candidate set T.

Statistical candidates are distinct from correction candidates and are used only while checking the word to be corrected.

The result of Equation 2 is the prediction candidate with the highest probability. The other quantity defined with the equation is the probability that a word Y, which may be written by mistake, is used in error in place of the word W that appears in the actual document.

As in Equation 3, the statistical candidates are obtained from a pre-built 3-gram dictionary by searching the 3-grams that fall within two words on either side of the center word position '*'. The purpose of the search is to retrieve every statistical candidate that occurs together with the context words surrounding the center word position '*'.
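The 3-gram search around the '*' position can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the padding tokens `<s>` and `</s>`, the index layout, and the three toy 3-grams are all hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def context_windows(words, i):
    """Return the three 3-gram patterns around position i, with '*' in the center slot."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>", "</s>"]
    j = i + 2  # index of the center word inside the padded list
    return [
        (padded[j - 2], padded[j - 1], "*"),
        (padded[j - 1], "*", padded[j + 1]),
        ("*", padded[j + 1], padded[j + 2]),
    ]


def build_index(trigrams):
    """Index each known 3-gram under the three patterns obtained by masking one slot."""
    index = defaultdict(set)
    for a, b, c in trigrams:
        index[(a, b, "*")].add(c)
        index[(a, "*", c)].add(b)
        index[("*", b, c)].add(a)
    return index


def statistical_candidates(words, i, index):
    """Collect every word seen in the '*' slot of any pattern around position i."""
    found = set()
    for pattern in context_windows(words, i):
        found |= index.get(pattern, set())
    return found


# Toy 3-gram dictionary (made up for illustration).
toy_trigrams = [
    ("장모라는", "사람이", "사위가"),
    ("사람이", "가위가", "왔는디"),
    ("학교가", "왔는디", "씨암탉은"),
]
toy_index = build_index(toy_trigrams)
sentence = ["도대체", "장모라는", "사람이", "가위가", "왔는디", "씨암탉은", "못"]
print(context_windows(sentence, 3))
print(statistical_candidates(sentence, 3, toy_index))
```

Masking one slot at indexing time keeps each lookup a single dictionary access; a production system would more likely store counts alongside the words so that the same index can also supply the context probabilities.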

The retrieved words form the candidate set, and their edit distance to the word currently under inspection is calculated in order to select the closest ones.

Edit distance is the measure used to compare how much two words differ, and it is counted from 1: it grows with each letter or phoneme of the compared word that is inserted, deleted, or substituted relative to the reference word.

For example, the compared word '사위' ("son-in-law") differs from the reference word '가위' ("scissors") only in that 'ㄱ' is replaced by 'ㅅ', so the edit distance of '사위' from '가위' is 1.
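A phoneme-level edit distance for Hangul can be sketched as below. The sketch assumes that splitting syllables into jamo with Unicode canonical decomposition (NFD) is an acceptable stand-in for the phoneme split described above, and it uses the standard Levenshtein recurrence; the function names are illustrative.

```python
import unicodedata


def to_jamo(word):
    """Split Hangul syllables into jamo via Unicode canonical decomposition (NFD)."""
    return list(unicodedata.normalize("NFD", word))


def edit_distance(a, b):
    """Levenshtein distance over jamo: each insertion, deletion, or substitution costs 1."""
    s, t = to_jamo(a), to_jamo(b)
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


print(edit_distance("가위", "사위"))  # 1: only the initial consonant differs
```

Comparing at the jamo level rather than the syllable level matters for Korean: a syllable-level comparison would also give 1 here, but it could not distinguish a one-consonant slip from a completely different syllable.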

The prediction candidate generation unit (304) uses the language model generated with the generative adversarial network.

In Equation 4, given the sentence input to the language model and the set C of correction candidates selected by the selection unit, the candidate for which the context-sensitive spelling error correction distance value is largest is selected.

C, the set of correction candidates in Equation 4, is defined as in Equation 5: using the edit distance function EDF, it is the set of N correction candidates drawn from the language model's entire embedding vocabulary that satisfy the configured edit distance from the center word.

Equation 6 computes the sum of the dot products over the input context of Equation 4. Its terms include the correction candidate word and the size of the context surrounding the correction candidate word.

The equation contains a function that computes the dot product between the correction candidate word and a context word; this is where the inter-word distance value of the particular word embedding model is applied.

The equation also contains a smoothing value for handling out-of-vocabulary words. If the language model compares words by their form and computes the dot product from the features of the out-of-vocabulary word, this smoothing value need not be used.
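The candidate scoring described for Equations 4 to 6 can be sketched as a sum of dot products between a candidate's embedding and the embeddings of the context words. Everything concrete below is an assumption for illustration: the function names, the two-dimensional toy vectors, and in particular the way the smoothing value is applied (substituted for the dot product whenever either word has no embedding), which the text does not specify.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def context_score(candidate, context_words, embeddings, smoothing=0.0):
    """Sum of dot products between the candidate and each context word.

    Assumption: a pair involving an out-of-vocabulary word contributes the
    smoothing value instead of a dot product.
    """
    total = 0.0
    for word in context_words:
        if candidate in embeddings and word in embeddings:
            total += dot(embeddings[candidate], embeddings[word])
        else:
            total += smoothing
    return total


def best_candidate(candidates, context_words, embeddings, smoothing=0.0):
    """Pick the candidate with the largest summed dot product (closest to the context)."""
    return max(candidates,
               key=lambda c: context_score(c, context_words, embeddings, smoothing))


# Toy embeddings, made up for illustration only.
toy_embeddings = {
    "장모라는": [1.0, 0.0],
    "사람이": [0.5, 0.5],
    "왔는디": [0.0, 1.0],
    "사위가": [0.9, 0.8],
    "가위가": [-0.5, 0.1],
    "거위가": [0.1, -0.4],
}
context = ["장모라는", "사람이", "왔는디"]
print(best_candidate(["가위가", "사위가", "거위가"], context, toy_embeddings))
```

With these toy vectors '사위가' scores highest, so it would be presented as the correction; with real embeddings the ranking depends entirely on the trained language model.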

The method for correcting context-sensitive spelling errors using a language model according to the present invention is described in detail as follows.

FIG. 4 is a flowchart showing a method for correcting context-sensitive spelling errors using a language model according to the present invention.

First, a document to be searched for context-sensitive spelling errors and corrected is input (S401), the words in each sentence of the document are inspected sequentially (S402), and it is determined whether each word contains an error (S403).

If the word has no problem, the next word is inspected. If the word is judged to contain an error, it is designated the word to be corrected and the tag '<target word>' is applied to it (S404).

Before the distance to the context is calculated for the correction candidates, the amount of computation is reduced: the edit distance between the word to be corrected and every word in the language model's dictionary is calculated, and the words within the configured distance are identified as candidates (S405). For these candidates, the distance value between each candidate and the context surrounding the word to be corrected is obtained using the language model (S406), and the top-ranked candidate is selected as the final correction word (S407).

If the correction word is identical to the word to be corrected, no correction is made; if it differs, the word is replaced with the correction word (S407).

When the prediction of the correction word is finished, it is determined whether there is a next word or sentence (S408), which decides whether the system terminates.
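The flow of steps S402 to S407 for a single sentence can be sketched as a loop. The three callables `is_error`, `select_candidates`, and `score` are hypothetical placeholders for the error word checking unit, the candidate selection unit, and the language-model distance calculation; the stub implementations at the bottom exist only to make the sketch runnable.

```python
TARGET_TAG = "<target word>"


def correct_sentence(words, is_error, select_candidates, score):
    """Check each word in order and replace it when a better candidate is found."""
    corrected = list(words)
    for i, word in enumerate(words):                               # S402
        if not is_error(words, i):                                 # S403
            continue
        tagged = corrected[:i] + [TARGET_TAG] + corrected[i + 1:]  # S404
        candidates = select_candidates(word)                       # S405
        if not candidates:
            continue
        best = max(candidates, key=lambda c: score(c, tagged, i))  # S406, S407
        if best != word:  # identical means no correction is made
            corrected[i] = best
    return corrected


# Stub components, made up for illustration.
toy_scores = {"사위가": 2.0, "가위가": 0.5, "거위가": 0.1}
result = correct_sentence(
    ["장모라는", "사람이", "가위가", "왔는디"],
    is_error=lambda ws, i: ws[i] == "가위가",
    select_candidates=lambda w: ["가위가", "사위가", "거위가"],
    score=lambda c, tagged, i: toy_scores[c],
)
print(result)
```

Passing the components in as functions keeps the loop independent of how the N-gram check and the language model are actually implemented.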

본 발명에서는 언어 모형을 이용한 문맥의존 철자오류 교정 방법은 문장 단위를 기준으로 첫 어절부터 끝 어절까지 순차적으로 오류를 검사한다.In the present invention, the method for correcting context-dependent spelling errors using a language model sequentially checks errors from the first word to the last word on a sentence-by-sentence basis.

교정 대상 어절의 오류 검사의 예로 '도대체 장모라는 사람이 가위가 왔는디 씨암탉은 못...'이라는 문장이 있었을 때 '도대체', '장모라는', '사람이', '가위가', '왔는디', '씨암탉은', '못', ...을 각 어절이라고 하고, 현재 오류의 검사가 이루어지고 있는 중심 어절을 '가위가'라고 가정한다.As an example of the error check of the word to be corrected, when there is a sentence 'A mother-in-law came with scissors, but a hen couldn't...' It is assumed that 'Where's you from', 'The hen', 'Nail', ... are each word, and the central word that is currently being checked for errors is 'scissors'.

중심 어절 '가위가'를 기준으로 미리 구축된 대용량의 3-gram 사전에서 2어절 범위의 ('장모라는', '사람이', '*'), ('사람이', '*', '왔는디'), ('*', '왔는디', '씨암탉은')을 검색해서 '*'의 위치에 올 수 있는 문맥의 후보어 집합 '거위가', '써위가', '학교가', '사위가', '하위가', '가위가', '도로가' 등을 얻고, 중심 어절 '가위가'와의 편집가리(예에서는 편집거리 1)가 가까운 후보어 집합 '거위가', '사위가', '하위가', '가위가' 등을 선별해서 얻는다.In the large-capacity 3-gram dictionary built in advance based on the central word 'scissors', the two-word range ('mother-in-law', 'person', '*'), ('person', '*', ' A set of candidate words in the context that can appear in the position of '*' by searching for ('*', 'Coming', 'Mr. ', 'Sawiga', 'Hawiga', 'Gawiga', 'Doroga', etc., and a set of candidate words 'goosega' with close editing distance (editing distance 1 in the example) to the central word 'Gawiga'. .

Next, the probability of the two-word-range context around the word being checked is calculated. The probability of the context containing the central word, '장모라는 사람이 가위가 왔는디 씨암탉은', is compared with the probabilities obtained by replacing '가위가' in that context with each candidate word. If the context containing '가위가' has the highest probability, the word is judged to contain no error; if any candidate word forms a context with a higher probability than '가위가', the word is judged to contain an error.
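The comparison above can be illustrated with a small scoring function. The patent's exact probability formula is not reproduced in this text, so the sketch below substitutes a simple assumption: each covering 3-gram is scored by its relative frequency plus a small additive constant, and the log scores are summed. The function names, the counts, and the smoothing constant are all hypothetical.

```python
import math

def context_log_prob(words, i, candidate, trigram_counts, smoothing=1e-6):
    """Sum of log relative frequencies of the 3-grams covering position i,
    after substituting `candidate` at that position (assumed scoring scheme)."""
    replaced = list(words)
    replaced[i] = candidate
    total = sum(trigram_counts.values())
    score = 0.0
    for start in (i - 2, i - 1, i):
        if start < 0 or start + 3 > len(replaced):
            continue  # window falls outside the sentence
        gram = tuple(replaced[start:start + 3])
        score += math.log(trigram_counts.get(gram, 0) / total + smoothing)
    return score

def has_context_error(words, i, candidates, trigram_counts):
    """True if any other candidate gives the context a higher score than the original word."""
    original = context_log_prob(words, i, words[i], trigram_counts)
    return any(
        context_log_prob(words, i, cand, trigram_counts) > original
        for cand in candidates
        if cand != words[i]
    )

# Hypothetical toy counts, chosen so that '사위가' fits the context best.
TOY_COUNTS = {
    ("장모라는", "사람이", "사위가"): 5,
    ("사람이", "사위가", "왔는디"): 4,
    ("사위가", "왔는디", "씨암탉은"): 3,
    ("사람이", "가위가", "왔는디"): 1,
}
WITH_ERROR = ["도대체", "장모라는", "사람이", "가위가", "왔는디", "씨암탉은", "못"]
WITHOUT_ERROR = ["도대체", "장모라는", "사람이", "사위가", "왔는디", "씨암탉은", "못"]
```

Here `has_context_error(WITH_ERROR, 3, {"사위가", "거위가", "가위가"}, TOY_COUNTS)` flags the word, while the same call on `WITHOUT_ERROR` with the candidates `{"가위가", "거위가"}` does not.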

Once a word has been judged to be a correction target, the central word '가위가' is first replaced with '<target word>', and the resulting sequence '도대체', '장모라는', '사람이', '<target word>', '왔는디', '씨암탉은', '못', ... is fed into the language model.

The language model is trained in advance through deep learning. It takes as input the entire context in which the central word has been marked, and at its output it predicts the central word from the entire context excluding the central word. To do so, it calculates the distance to the context for every word in the embedding vocabulary of the corpus used during training.

To reduce the amount of computation in the correction language model, words close in edit distance to '가위가', such as '고위가', '거위가', '사위가', '다위가', and '하위가', are selected from the entire embedding vocabulary. This vocabulary is the language model's own dictionary and differs from the candidate words obtained from the statistical model in the earlier example. For these correction candidates, the language model calculates the distance value to the context surrounding the target word, and the candidates are ranked by that value.
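The masking, candidate selection, and ranking steps can be sketched together as below. The scoring follows the description in the claims of a sum of dot products between a candidate and the context words, with a fallback value for unregistered words. Everything else is an assumption for illustration: the two-dimensional toy embeddings stand in for a trained model, and the names `select_candidates`, `rank_candidates`, and `oov_score` are hypothetical. Although the text speaks of a "distance value", the candidate with the highest value is chosen, so the sketch treats the score as a similarity.

```python
MASK = "<target word>"

def edit_distance(a, b):
    """Levenshtein distance counted per Unicode character."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def select_candidates(target, vocabulary, max_dist=1):
    """Keep only vocabulary words within max_dist edits of the target word."""
    return {w for w in vocabulary if edit_distance(w, target) <= max_dist}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(words, i, candidates, embeddings, window=2, oov_score=0.0):
    """Rank candidates by the sum of dot products with the context words around i."""
    masked = list(words)
    masked[i] = MASK  # the target word itself is hidden from the context
    lo, hi = max(0, i - window), min(len(masked), i + window + 1)
    context = [w for w in masked[lo:hi] if w != MASK]
    scores = {}
    for cand in sorted(candidates):
        total = 0.0
        for ctx in context:
            if cand in embeddings and ctx in embeddings:
                total += dot(embeddings[cand], embeddings[ctx])
            else:
                total += oov_score  # fallback for unregistered words
        scores[cand] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical toy embeddings; a trained model would supply real vectors.
TOY_EMBEDDINGS = {
    "장모라는": [1.0, 0.2], "사람이": [0.8, 0.1],
    "왔는디": [0.9, 0.3], "씨암탉은": [0.7, 0.4],
    "사위가": [1.0, 0.5], "가위가": [0.1, 1.0],
    "거위가": [0.3, 0.6], "고위가": [0.2, 0.1],
}
SENTENCE = ["도대체", "장모라는", "사람이", "가위가", "왔는디", "씨암탉은", "못"]
SELECTED = select_candidates("가위가", TOY_EMBEDDINGS)
RANKED = rank_candidates(SENTENCE, 3, SELECTED, TOY_EMBEDDINGS)
```

Filtering by edit distance before scoring means the dot products are computed for a handful of candidates rather than for the whole vocabulary, which is the cost saving the paragraph above describes.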

If the top-ranked candidate is the same as the target word '가위가', the word is judged to contain no error; if a different candidate ranks first, the word is judged to be erroneous and is corrected.

This process is repeated sequentially until the end of the input document, and the final correction result is output.
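The overall control flow (check each word, substitute only when the predicted word differs, continue to the end of the document) might look like the following sketch. The `select` and `rank` callables are placeholders for the candidate selection and language-model ranking steps, and the stub versions here are hypothetical stand-ins so the loop can run on its own. Using the original sentence, rather than the partially corrected one, as context for later words is also an assumption; the text does not specify this.

```python
def correct_sentence(words, select, rank):
    """Check every word in order and substitute when the top candidate differs."""
    corrected = list(words)
    for i, word in enumerate(words):
        candidates = select(word)
        if not candidates:
            continue  # nothing to compare against, keep the word
        best = rank(words, i, candidates)
        if best != word:
            corrected[i] = best  # substitution step
    return corrected

def correct_document(sentences, select, rank):
    """Apply correct_sentence to each whitespace-tokenized sentence in turn."""
    return [" ".join(correct_sentence(s.split(), select, rank)) for s in sentences]

# Hypothetical stubs standing in for the real selection and ranking components.
def stub_select(word):
    return {"가위가", "사위가"} if word == "가위가" else {word}

def stub_rank(words, i, candidates):
    return "사위가" if "사위가" in candidates else words[i]

RESULT = correct_document(
    ["도대체 장모라는 사람이 가위가 왔는디 씨암탉은 못"],
    stub_select,
    stub_rank,
)
```

Passing the two components in as functions keeps the loop independent of how candidates are generated or scored.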

The apparatus and method for correcting context-dependent spelling errors using a language model according to the present invention, as described above, can handle a variety of errors because the spelling error correction step uses the distance value between the target word and its context obtained from the language model.

The apparatus and method for correcting context-dependent spelling errors using a generative adversarial network according to the present invention, as described above, use a generative adversarial network to closely approximate the actual language model, and use this information to respond to the variety of errors that appear in general text documents.
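For reference, the standard minimax objective of a generative adversarial network, as commonly given in the literature, is shown below. It uses the same roles that the claims name (discriminator D, generator G, an expectation E, and a random input vector z), but the patent's own formula is not reproduced in this text, so whether it matches this form term by term is an assumption. The symbols p_data and p_z are the conventional names for the real-data distribution and the distribution of the random input, not names taken from the source.

```latex
\min_{G}\max_{D} V(D, G)
  = \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

In this form the discriminator is trained to push D(x) toward 1 on real data and D(G(z)) toward 0 on generated data, while the generator is trained to make D(G(z)) approach 1.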

As described above, it will be understood that the present invention can be implemented in modified forms without departing from its essential characteristics.

Therefore, the disclosed embodiments should be considered in an illustrative rather than a limiting sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the scope of equivalents should be construed as being included in the present invention.

301. Input unit
302. Erroneous word checking unit
303. Candidate selection unit
304. Prediction candidate generation unit
305. Correction word presentation unit

Claims

An apparatus for correcting context-dependent spelling errors using a generative adversarial network, comprising:
an input unit for inputting a sentence to be corrected;
an erroneous word checking unit that inspects the input sentence word by word and searches for contextual spelling errors;
a candidate selection unit that selects candidate words by calculating the edit distance between a word to be corrected and the words in a word dictionary;
a prediction candidate generation unit that calculates the distance between the entire context surrounding the word to be corrected and the candidate words filtered by the candidate selection unit, using a language model generated by a generative adversarial network; and
a correction word presentation unit that selects a final corrected word based on the inter-word distance values calculated by the language model generated by the generative adversarial network.

The apparatus of claim 1, wherein the generative adversarial network in the prediction candidate generation unit is defined as
[equation not reproduced in the source text]
where D is the discriminator, G is the generator, [symbol] is the actual data distribution, [symbol] is the distribution of data produced by the generator, and E is the expected value,
the generator takes a random vector z as input and generates fake data, and the discriminator outputs 1 when the data is real and 0 when it is fake,
such that [expression] is calculated as 1.

The apparatus of claim 1, wherein the erroneous word checking unit uses an N-gram model, which is a statistical language model,
with the position of the word to be corrected, at which the statistical candidate words of the set T are substituted, denoted [symbol], and the surrounding context of that position denoted [symbol],
the largest context probability among the statistical candidate words is calculated by
[equation not reproduced in the source text]
where [symbol] is the misuse probability of a word Y that could be used by mistake in place of the word W appearing in the actual document, and
the statistical language model determines the presence or absence of an error solely from whether the word under test has a higher or lower probability than the statistical candidate words in the candidate word set T.

The apparatus of claim 3, wherein the statistical candidate words are obtained from a pre-built 3-gram dictionary by searching, relative to the central word position '*', the 3-grams within a range of two words on both sides, defined as
[equation not reproduced in the source text]
and wherein the search retrieves all statistical candidate words that appear together with the context words surrounding the central word position '*', the retrieved words belong to the candidate word set, and words close to the current word to be corrected are selected by calculating the edit distance.

The apparatus of claim 4, wherein the prediction candidate generation unit uses a language model generated by a generative adversarial network and, using
[equation not reproduced in the source text]
where the sentence input to the language model is [symbol] and the set of correction candidates selected by the candidate selection unit is C, selects the [symbol] for which the context-dependent spelling error correction distance value [symbol] is maximum.

The apparatus of claim 5, wherein C, the set of correction candidates, is defined as
[equation not reproduced in the source text]
and wherein, using the edit distance calculation function EDF with respect to the central word [symbol], a set of N correction candidate words satisfying the set edit distance is obtained from the entire embedding vocabulary of the language model.

The apparatus of claim 6, wherein the correction word presentation unit obtains the sum of the dot products with the input context using
[equation not reproduced in the source text]
where [symbol] is a correction candidate word,
[symbol] is the size of the surrounding context of the correction candidate word,
[symbol] is a function that obtains the dot product between the correction candidate word and a context word, and is the part that applies the inter-word distance value of each word embedding model, and
[symbol] is a smoothing value for handling unregistered words.

The apparatus of claim 1, wherein the correction word presentation unit,
in order to predict the correction word, obtains the distance value between the surrounding context and each candidate word from the values computed over the entire word dictionary by the language model generated by the generative adversarial network, and,
based on the calculated distance values, determines the correction candidate with the highest value as the final corrected word and presents that word as the replacement.

A method for correcting context-dependent spelling errors using a generative adversarial network, comprising:
when a sentence to be corrected is input through an input unit, determining, in an erroneous word checking unit, the possibility of a spelling error by inspecting the input sentence word by word;
calculating, in a candidate selection unit, the edit distance between a word to be corrected and the dictionary words of a language model to obtain candidate words;
calculating, in a prediction candidate generation unit, a word to replace the word to be corrected by computing the distance between the entire context surrounding the word to be corrected and the candidate words filtered by the candidate selection unit, using a language model generated by a generative adversarial network; and
presenting, in a correction word presentation unit, a final correction word based on the ranked information.