KR101633556B1

KR101633556B1 - Apparatus for grammatical error correction and method using the same

Info

Publication number: KR101633556B1
Application number: KR1020140125944A
Authority: KR
Inventors: 이근배; 이규송
Original assignee: 포항공과대학교 산학협력단
Priority date: 2014-09-22
Filing date: 2014-09-22
Publication date: 2016-06-24
Also published as: KR20160034678A

Abstract

A grammatical error correction apparatus and a grammatical error correction method using the same are disclosed. The apparatus comprises: an error candidate generation unit that generates an error candidate word and an error candidate group for an input text based on a grammatical error corpus; a router that extracts a context frame set based on the error candidate word and the error candidate group; a score calculation unit that calculates n-gram scores for the context frame set; and a word correction unit that corrects the input text using the n-gram scores. Because n-gram scores are compared per context frame set for each grammatical error in the input text provided by a user, the grammatical errors of foreign language learners can be corrected efficiently, and because the grammatical errors in the input text are corrected accurately, foreign language learning can be carried out effectively.

Description

APPARATUS FOR GRAMMATICAL ERROR CORRECTION AND METHOD USING THE SAME

The present invention relates to a grammatical error correction apparatus and a grammatical error correction method using the same, and more particularly to an apparatus that corrects grammatical errors in an input text and a method of correcting grammatical errors using that apparatus.

As modern society becomes increasingly globalized and the demand for foreign language proficiency grows, foreign language education systems that support efficient learning are being actively studied. In addition, with advances in information and communication technology, foreign language learning on information processing devices such as smartphones, tablet PCs, portable multimedia players (PMPs), personal digital assistants (PDAs), and computers is increasing. In particular, as users' demand for learning foreign language grammar grows, systems that use such devices to correct grammatical errors in a user's foreign language writing and to provide correction information are being commercialized.

Microsoft's MS Word is a representative program for correcting grammatical errors in foreign language writing. MS Word checks the spelling and orthography of text written by the user and marks the errors it detects, thereby giving the user grammar information. However, because MS Word corrects only simple errors such as misspelled words or incorrect capitalization, it has difficulty correcting grammatical errors that depend on the part-of-speech information of words. Two approaches have therefore been proposed. One, as with the structure analysis rules described in Korean Laid-open Patent Publication No. 10-2009-0096905, detects and corrects grammatical errors in a learner's sentences using a database or model in which the structural forms and grammar rules of foreign language sentences are registered in advance. The other, as in Korean Laid-open Patent Publication No. 10-2013-0059795, analyzes a foreign language sentence based on part-of-speech information and then corrects the learner's grammatical errors through statistical classification based on regular grammar rules and context-free grammar rules. However, because the structural forms and grammar rules of foreign language sentences are highly varied, it is difficult in these conventional techniques to craft precise grammar rules for correction, which lowers both the accuracy and the efficiency of grammatical error correction.

To solve the above problems, an object of the present invention is to provide a grammatical error correction apparatus, and a grammatical error correction method using the same, that corrects grammatical errors in an input text provided by a user by comparing n-gram scores per context frame set for each grammatical error.

Another object of the present invention is to provide a grammatical error correction apparatus, and a grammatical error correction method using the same, that corrects a grammatical error in the input text when the word of the context frame with the highest n-gram score is the same in every context frame.

To achieve the above object, the present invention provides a grammatical error correction apparatus comprising: an error candidate generation unit that generates an error candidate word and an error candidate group for an input text based on a grammatical error corpus; a router that extracts a context frame set based on the error candidate word and the error candidate group; a score calculation unit that calculates an n-gram score for the context frame set; and a word correction unit that corrects the input text using the n-gram score.

Here, the error candidate generation unit may generate the error candidate word and the error candidate group from the n words that occur most frequently for a given word in a pre-built grammatical error corpus.

The router may measure the error correction performance for the error candidate word and the error candidate group and extract, from among the context frame sets, a context frame set with high error correction performance.

The score calculation unit may calculate the n-gram score as the word frequency or probability value, in existing native-speaker documents, of each word sequence in the context frame set.

The score calculation unit may calculate, as the n-gram score, count information when the number of words in the context frame set is five or fewer, and a forward n-gram score and a backward n-gram score when the number of words is five or more.

The word correction unit may correct the input text when the word of the context frame with the highest n-gram score is the same for all context frames.

To achieve the other object, the present invention provides a grammatical error correction method comprising: generating an error candidate word and an error candidate group for an input text based on a grammatical error corpus; extracting a context frame set based on the error candidate word and the error candidate group; calculating an n-gram score for the context frame set; and correcting the input text using the n-gram score.

Here, the generating step may generate the error candidate word and the error candidate group from the n words that occur most frequently for a given word in a pre-built grammatical error corpus.

The extracting step may measure the error correction performance for the error candidate word and the error candidate group and extract, from among the context frame sets, a context frame set with high error correction performance.

The calculating step may calculate the n-gram score as the word frequency or probability value, in existing native-speaker documents, of each word sequence in the context frame set.

The calculating step may calculate, as the n-gram score, count information when the number of words in the context frame set is five or fewer, and a forward n-gram score and a backward n-gram score when the number of words is five or more.

The correcting step may correct the input text when the word of the context frame with the highest n-gram score is the same for all context frames.

With the grammatical error correction apparatus and the grammatical error correction method according to the present invention as described above, n-gram scores are compared per context frame set for each grammatical error in the input text provided by the user, so the grammatical errors of foreign language learners can be corrected efficiently.

In addition, because the grammatical errors in the input text are corrected accurately, learners can respond appropriately to their grammatical errors and study a foreign language effectively.

FIG. 1 is an overall block diagram of a grammatical error correction apparatus according to an embodiment of the present invention.
FIG. 2 is a detailed block diagram of the router of FIG. 1.
FIG. 3 is a detailed block diagram of the word correction unit of FIG. 1.
FIG. 4 is a flowchart of a grammatical error correction method using the grammatical error correction apparatus according to an embodiment of the present invention.

The present invention can be modified in various ways and can have various embodiments; specific embodiments are illustrated in the drawings and described in detail below. This is not intended to limit the invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first, second, A, and B may be used to describe various elements, but the elements should not be limited by these terms. The terms are used only to distinguish one element from another. For example, without departing from the scope of the present invention, a first element may be named a second element, and similarly a second element may be named a first element. The term "and/or" includes any combination of a plurality of related listed items or any one of the plurality of related listed items.

When an element is said to be "connected" or "coupled" to another element, it may be directly connected or coupled to that other element, but it should be understood that other elements may exist in between. In contrast, when an element is said to be "directly connected" or "directly coupled" to another element, it should be understood that no other element exists in between.

The terminology used in this application is used only to describe specific embodiments and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "have" specify the presence of the features, numbers, steps, operations, elements, parts, or combinations thereof described in the specification, and do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which this invention belongs. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or overly formal sense unless expressly so defined in this application.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

An embodiment of the present invention uses a router to select the optimal context frame set for each grammatical error candidate group. In the sentence below, the word that follows the article is the key information for correcting the article before "equipment".

We need (an→Ø) equipment to solve problems.

In the example below, the preceding word "One" is the key information that determines the form of the verb that follows it.

One (are→is) deemed to death at a later stage.

In some cases both the forward and backward directions, or several different context frames, must be considered. However, as the range of a context frame and the number of words in it increase, precision rises but fewer errors end up being corrected; that is, recall falls. The invention therefore aims to provide a grammatical error correction apparatus, and a grammatical error correction method using the same, that achieves the highest F-score by selecting the optimal direction, range, and word count of the context frames.

FIG. 1 is an overall block diagram of a grammatical error correction apparatus according to an embodiment of the present invention; FIG. 2 is a detailed block diagram of the router, which extracts the most appropriate context frame set for each error candidate word and error candidate group; FIG. 3 is a detailed block diagram of the word correction unit, which makes a correction when the word of the error candidate group with the highest n-gram score is the same for all context frames; and FIG. 4 is a flowchart of a grammatical error correction method using the grammatical error correction apparatus according to an embodiment of the present invention.

Referring to FIG. 1, a grammatical error correction apparatus according to an embodiment of the present invention includes an error candidate generation unit 100, a router 200, a score calculation unit 300, and a word correction unit 400.

The error candidate generation unit 100 generates an error candidate word and an error candidate group for the input text based on a grammatical error corpus. Using a pre-built grammatical error corpus, it takes the n words that occur most frequently for a given word and generates the error candidate word and the error candidate group from them.

The error candidate generation unit 100 may extract the error candidate word and the error candidate group from a tagged grammatical error corpus. For example, if the corpus contains "a" tagged as needing to change to "an" and "a" tagged as needing to change to "the", then "a" can be changed to "an" or "the". This can be expressed as follows.

a -> {a, an, the}

That is, the error candidate word "a" can either stay as it is or be corrected to "an" or "the". Here, "a" is called the error candidate word, and {a, an, the} is called the error candidate group.
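The construction of an error candidate group can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `build_candidate_groups`, the pair format `(written word, tagged correction)`, and the sample data are all assumptions introduced here.

```python
from collections import Counter, defaultdict

def build_candidate_groups(tagged_pairs, n=2):
    """Map each error candidate word to itself plus its n most frequent corrections."""
    corrections = defaultdict(Counter)
    for written, corrected in tagged_pairs:
        corrections[written][corrected] += 1
    groups = {}
    for written, counter in corrections.items():
        top = [word for word, _ in counter.most_common(n) if word != written]
        # The word itself always stays in the group ("no change" is a valid outcome).
        groups[written] = [written] + top
    return groups

# Hypothetical tags taken from an error-annotated learner corpus.
tagged_pairs = [("a", "an"), ("a", "an"), ("a", "the"), ("an", "a")]
groups = build_candidate_groups(tagged_pairs)
print(groups["a"])   # ['a', 'an', 'the']
```

Keeping the original word first in each group means that "leave unchanged" is always one of the options scored later.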

FIG. 2 is a detailed block diagram of the router of FIG. 1.

The router 200 extracts a context frame set based on the error candidate word and the error candidate group. Using a performance calculator 210 that measures the error correction performance for the error candidate word and the error candidate group, the router 200 extracts the most appropriate context frame set from among the candidate context frame sets.

A context frame consists of the n words preceding and the m words following the position to be corrected, and is written (n;m). For example, in the sentence "There is another way to look at it.", if the word at the correction position is "to", then (1;1) = "way to look" and (1;2) = "way to look at". When a context frame contains five or fewer words, the n-gram count information for those words is used; when it contains five or more words, directionality is also considered. For example, for (1;1) the value count(way to look) is computed, whereas for (5;5) both the forward and backward directions are considered: for (5;5->) a forward n-gram score is computed in the forward direction, and for (5;5<-) a backward n-gram score is computed in the backward direction.
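The (n;m) notation can be made concrete with a short sketch. The helper name `context_frame` and the whitespace tokenization are assumptions for illustration; the patent does not prescribe a tokenizer.

```python
def context_frame(tokens, index, n, m):
    """Return the (n;m) frame: n words before and m words after tokens[index], inclusive."""
    start = max(0, index - n)
    end = min(len(tokens), index + m + 1)
    return tokens[start:end]

tokens = "There is another way to look at it .".split()
position = tokens.index("to")

frame_1_1 = " ".join(context_frame(tokens, position, 1, 1))
frame_1_2 = " ".join(context_frame(tokens, position, 1, 2))
print(frame_1_1)  # way to look
print(frame_1_2)  # way to look at
```

Frames are clipped at the sentence boundary, so a frame near the start or end of a sentence simply contains fewer words.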

A context frame set consists of different context frames. The number of words taken before or after the error candidate word in a context frame is at least one and at most three. For example, denoting a context frame set by M, possible sets are M = {(1;1)}, {(1;1),(1;2)}, ..., or {(1;1),(1;2),(3;1)}.

For each error candidate word and error candidate group, the router 200 determines the most appropriate context frame set from among the candidate context frame sets. The router 200 is trained by a router trainer. The input of the router 200 is an error candidate word and its error candidate group, and the output is a context frame set.

The router model can be represented as a single table that gives the most appropriate context frame set for each error candidate group, as shown in Table 1 below.

Table 1

Error candidate word and error candidate group    Router output
(a→{a, an, the})                                  {(1;2), (2;1)}
(an→{an, a, the})                                 {(1;2)}
(the→{the, a, the})                               {(1;2), (2;1)}
(a→{a, Ø})                                        {(1;2)}
(an→{an, Ø})                                      {(1;1)}
(the→{the, Ø})                                    {(1;2)}
(Ø→{a, an, the})                                  {(1;1)}

The router 200 fills in the table of the router model. Router training requires grammatical-error-tagged data and the performance calculator 210. The performance calculator 210 computes a performance index that indicates how accurate the corrections are when errors are corrected for a given error candidate group. It can also compute a performance index that indicates how many of the errors were caught. To train the router, the router 200 generates all possible context frame sets, which can be written as {(1;1)}, {(1;2), (1;1)}, {(1;1), (2;1)}, {(1;3), (1;1)}, {(1;1), (2;2)}, {(3;1), (1;1)}, {(1;2), (1;1), (2;1)}, {(1;2), (1;3), (1;1)}, {(1;2), (1;1), (2;2)}, {(1;2), (3;1), (1;1)}, and so on. The router 200 computes the performance of every possible context frame set using the performance calculator 210, takes the context frame set with the highest performance as the router's result, and stores it in the router model. Repeating this for every error candidate word and error candidate group completes the router model as shown in Table 1.
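The training loop described above can be sketched as an exhaustive search. Several things here are assumptions rather than details from the patent: the names `enumerate_frame_sets`, `f_score`, and `train_router`; the cap of two frames per set (`max_set_size`); the use of the balanced F1 measure as the performance index; and the hypothetical precision/recall figures that stand in for a real performance calculator run over tagged data.

```python
from itertools import combinations

def enumerate_frame_sets(max_side=3, max_set_size=2):
    """Yield every non-empty set of (n;m) frames with 1 <= n, m <= max_side."""
    frames = [(n, m) for n in range(1, max_side + 1) for m in range(1, max_side + 1)]
    for size in range(1, max_set_size + 1):
        for combo in combinations(frames, size):
            yield frozenset(combo)

def f_score(precision, recall):
    """Balanced F1; returns 0.0 when both inputs are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def train_router(candidate_groups, performance):
    """For each error candidate word, keep the frame set with the best performance."""
    model = {}
    for word, group in candidate_groups.items():
        model[word] = max(enumerate_frame_sets(),
                          key=lambda frame_set: performance(word, group, frame_set))
    return model

# Hypothetical (precision, recall) measurements for the word "a".
measured = {
    frozenset({(1, 1)}): (0.50, 0.60),
    frozenset({(1, 2), (2, 1)}): (0.70, 0.55),
}

def performance(word, group, frame_set):
    precision, recall = measured.get(frame_set, (0.0, 0.0))
    return f_score(precision, recall)

router_model = train_router({"a": ["a", "an", "the"]}, performance)
print(sorted(router_model["a"]))  # [(1, 2), (2, 1)]
```

In a real system `performance` would run the full correction procedure over the tagged corpus for each frame set, which is why capping the search space matters in practice.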

The score calculation unit 300 calculates an n-gram score for the context frame set. It calculates the n-gram score as the word frequency or probability value, in existing native-speaker documents, of each word sequence in the context frame set. As the n-gram score, it calculates count information when the number of words in the context frame set is five or fewer, and forward and backward n-gram scores, computed in the forward and backward directions respectively, when the number of words is five or more.
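For short frames, the count-based score amounts to a lookup in an n-gram count table. The sketch below builds such a table from a tiny made-up corpus; the function names and the three sentences are assumptions, standing in for a large native-speaker n-gram resource.

```python
from collections import Counter

def build_ngram_counts(sentences, max_n=5):
    """Count every word n-gram of length 1..max_n in a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts

def count_score(counts, words):
    """n-gram score as the raw corpus count of the word sequence (0 if unseen)."""
    return counts[tuple(word.lower() for word in words)]

# Hypothetical stand-in for a native-speaker corpus.
native_corpus = [
    "there is another way to look at it",
    "that is one way to look at the problem",
    "we found a way to solve it",
]
counts = build_ngram_counts(native_corpus)
score_look = count_score(counts, ["way", "to", "look"])
score_looks = count_score(counts, ["way", "to", "looks"])
print(score_look, score_looks)  # 2 0
```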

The score calculation unit 300 may use the count information of the Google n-gram corpus or an n-gram model based on a recurrent neural network (RNN). If a context frame contains five or more words, the score calculation unit 300 calculates the n-gram score with the RNN-based n-gram model, which can be evaluated separately as a forward score and a backward score depending on direction. The score indicates how often the sequence of consecutive words in the frame appears in the corpus. For example, the input "I am a boy" appears far more often in an actual native English corpus than "I is a boy", so it receives a higher n-gram score.
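The idea of scoring the same frame in two directions can be illustrated without an RNN. The sketch below swaps in a much simpler technique, an add-one-smoothed bigram model trained once on the corpus as written and once on the reversed corpus; it is only a stand-in for the RNN-based model named in the text, and the function names and toy corpus are assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences, reverse=False):
    """Collect unigram and bigram counts, optionally over reversed sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        if reverse:
            tokens = tokens[::-1]
        tokens = ["<s>"] + tokens
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def log_score(model, words, reverse=False):
    """Add-one-smoothed log probability of the word sequence under a bigram model."""
    unigrams, bigrams = model
    tokens = [word.lower() for word in words]
    if reverse:
        tokens = tokens[::-1]
    vocab_size = len(unigrams)
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return total

corpus = ["I am a boy", "She is a girl", "I am a student"]  # hypothetical corpus
forward_model = train_bigram(corpus)
backward_model = train_bigram(corpus, reverse=True)

good = "I am a boy".split()
bad = "I is a boy".split()
forward_prefers_good = log_score(forward_model, good) > log_score(forward_model, bad)
backward_prefers_good = (log_score(backward_model, good, reverse=True)
                         > log_score(backward_model, bad, reverse=True))
print(forward_prefers_good, backward_prefers_good)
```

Both directions rank "I am a boy" above "I is a boy" on this toy data; a neural model would play the same role with far better generalization to unseen sequences.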

FIG. 3 is a detailed block diagram of the word correction unit of FIG. 1.

The word correction unit 400 corrects the input text using the n-gram scores, based on agreement on the highest-scoring word. It makes a correction when the word of the context frame with the highest n-gram score is the same for all context frames.

The word correction unit 400 decides whether or not to correct an error candidate. It obtains from the router 200 the most appropriate context frame set for the error candidate word and its error candidate group, and computes a score for every context frame in that set using the score calculation unit 300. For each word in the error candidate group, it computes an n-gram score with each context frame, and it makes the correction only when the word with the highest n-gram score is the same in every context frame. For example, if the router result for the error candidate word a -> {a, an, the} is {(1;1),(2;1),(1;2)}, the word correction unit 400 performs the calculations shown in Table 2 to correct "a" in the sentence below.

I have a apple and...

Table 2

        (1;1)                (2;1)                  (1;2)
a       C(Have a apple)      C(I have a apple)      C(a apple and)
an      C(Have an apple)     C(I have an apple)     C(an apple and)
the     C(Have the apple)    C(I have the apple)    C(the apple and)

If inserting "an" gives the highest n-gram score in every one of the frames (1;1), (2;1), and (1;2), the word correction unit 400 changes "a" to "an".
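The unanimity rule can be sketched as follows. The helper names, the hypothetical count table, and the tie-breaking behavior (ties favor the first group member, which is the original word) are assumptions for illustration; frames are built from the (n;m) definition given earlier, and the deletion candidate Ø is not handled.

```python
def context_frame(tokens, index, n, m):
    """Return n words before and m words after tokens[index], inclusive."""
    return tokens[max(0, index - n):min(len(tokens), index + m + 1)]

def correct_word(tokens, index, group, frame_set, score):
    """Replace tokens[index] only if one candidate scores highest in every frame."""
    winners = set()
    for n, m in frame_set:
        def frame_score(candidate):
            trial = tokens[:index] + [candidate] + tokens[index + 1:]
            return score(context_frame(trial, index, n, m))
        winners.add(max(group, key=frame_score))
    if len(winners) == 1:
        return tokens[:index] + [winners.pop()] + tokens[index + 1:]
    return list(tokens)  # frames disagree, so leave the text unchanged

# Hypothetical n-gram counts standing in for a native-speaker corpus.
counts = {
    ("have", "an", "apple"): 120, ("have", "a", "apple"): 3, ("have", "the", "apple"): 40,
    ("i", "have", "an", "apple"): 30, ("i", "have", "a", "apple"): 1,
    ("i", "have", "the", "apple"): 8,
    ("have", "an", "apple", "and"): 12, ("have", "the", "apple", "and"): 5,
}

def score(words):
    return counts.get(tuple(word.lower() for word in words), 0)

tokens = "I have a apple and".split()
fixed = correct_word(tokens, 2, ["a", "an", "the"], [(1, 1), (2, 1), (1, 2)], score)
print(" ".join(fixed))  # I have an apple and
```

Requiring agreement across all frames trades recall for precision: a correction is made only when every view of the context points to the same word.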

FIG. 4 is a flowchart of a grammatical error correction method using the grammatical error correction apparatus according to an embodiment of the present invention.

The grammatical error correction apparatus generates an error candidate word and an error candidate group for the input text based on a grammatical error corpus (510). Using a pre-built grammatical error corpus, it takes the n words that occur most frequently for a given word and generates the error candidate word and the error candidate group from them.

The grammatical error correction apparatus extracts a context frame set based on the error candidate word and the error candidate group (520). Using a performance calculator that measures the error correction performance for the error candidate word and the error candidate group, it extracts the most appropriate context frame set from among the candidate context frame sets. The performance calculator measures how accurately errors were corrected and how many errors were caught; the apparatus computes the error correction performance from these measurements and extracts the context frame set with the highest error correction performance.

The grammatical error correction apparatus calculates an n-gram score for the context frame set (530). It calculates the n-gram score as the word frequency or probability value, in existing native-speaker documents, of each word sequence in the context frame set. As the n-gram score, it calculates count information when the number of words in the context frame set is five or fewer, and forward and backward n-gram scores, computed in the forward and backward directions respectively, when the number of words is five or more.

The grammatical error correction apparatus corrects the input text using the n-gram score, according to the highest degree of word agreement (540). The apparatus performs a correction only when the word of the context frame for which a high n-gram score was calculated is the same across all context frames. That is, the apparatus finds the word of the context frame that yields a high n-gram score and, if that word matches for all context frames, replaces the corresponding word of the input text with it.
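The agreement rule of step 540 can be sketched as below. The sketch assumes that each context frame has already been scored for every candidate word and that "high n-gram score" means the highest-scoring candidate within a frame; the function names and the scores are hypothetical.

```python
def best_word(frame_scores):
    """Candidate with the highest n-gram score within one context frame."""
    return max(frame_scores, key=frame_scores.get)


def correct_word(original, scores_per_frame):
    """Replace `original` only if every context frame prefers the same word.

    scores_per_frame: list of {candidate_word: n_gram_score}, one dict per
    context frame (an assumed representation).
    """
    winners = {best_word(scores) for scores in scores_per_frame}
    if len(winners) == 1:
        return winners.pop()
    return original


# All three frames prefer "on", so the word is corrected.
unanimous = [
    {"in": 2, "on": 9, "at": 1},
    {"in": 0, "on": 4, "at": 3},
    {"in": 1, "on": 5, "at": 0},
]
# One frame prefers "at", so the original word is kept.
split = [
    {"in": 2, "on": 9, "at": 1},
    {"in": 0, "on": 3, "at": 4},
]
print(correct_word("in", unanimous), correct_word("in", split))
```

Requiring agreement across every frame makes the correction conservative: a single dissenting frame leaves the learner's word untouched.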

Although the present invention has been described above with reference to preferred embodiments, those skilled in the art will understand that the invention may be modified and changed in various ways without departing from the spirit and scope of the invention as set forth in the following claims.

100: error candidate generation unit 200: router
300: score calculating unit 400: word correcting unit

Claims

A grammatical error correction apparatus comprising:
an error candidate generation unit that generates an error candidate word and an error candidate group for input text based on a grammatical error corpus;
a router that extracts a context frame set based on the error candidate word and the error candidate group;
a score calculating unit that calculates an n-gram score for the context frame set; and
a word correcting unit that corrects the input text using the n-gram score,
wherein the router measures error correction performance for the error candidate word and the error candidate group and extracts, from among the context frame sets, a context frame set having high error correction performance.

The apparatus according to claim 1,
wherein the error candidate generation unit generates, based on a grammatical error corpus constructed in advance, the n words having the highest frequency for a specific word as the error candidate word and the error candidate group.

(deleted)

The apparatus according to claim 1,
wherein the score calculating unit calculates the n-gram score as a word frequency or a probability value, in existing native-speaker documents, corresponding to each word sequence of the context frame set.

The apparatus according to claim 1,
wherein the score calculating unit calculates, as the n-gram score, count information when the number of words of the context frame set is five or fewer, and a forward n-gram score and a backward n-gram score when the number of words of the context frame set is five or more.

The apparatus according to claim 1,
wherein the word correcting unit corrects the input text when the word of the context frame for which a high n-gram score is calculated matches across all context frames.

A grammatical error correction method comprising:
generating an error candidate word and an error candidate group for input text based on a grammatical error corpus;
extracting a context frame set based on the error candidate word and the error candidate group;
calculating an n-gram score for the context frame set; and
correcting the input text using the n-gram score,
wherein the extracting of the context frame set comprises measuring error correction performance for the error candidate word and the error candidate group and extracting, from among the context frame sets, a context frame set having high error correction performance.

The method of claim 7,
wherein the generating comprises generating, based on a grammatical error corpus constructed in advance, the n words having the highest frequency for a specific word as the error candidate word and the error candidate group.

(deleted)

The method of claim 7,
wherein the calculating comprises calculating the n-gram score as a word frequency or a probability value, in existing native-speaker documents, corresponding to each word sequence of the context frame set.

The method of claim 7,
wherein the calculating comprises calculating, as the n-gram score, count information when the number of words of the context frame set is five or fewer, and a forward n-gram score and a backward n-gram score when the number of words of the context frame set is five or more.

The method of claim 7,
wherein the correcting of the input text comprises correcting the input text when the word of the context frame for which a high n-gram score is calculated matches across all context frames.