KR20150092879A

KR20150092879A - Language Correction Apparatus and Method based on n-gram data and linguistic analysis

Info

Publication number: KR20150092879A
Application number: KR1020140013464A
Authority: KR
Inventors: 노윤형; 권오욱; 김창현; 김운; 김강일; 나승훈; 박은진; 서영애; 신종훈; 이기영; 최승권; 정상근; 황금하; 김영길; 박상규
Original assignee: 한국전자통신연구원
Priority date: 2014-02-06
Filing date: 2014-02-06
Publication date: 2015-08-17
Also published as: KR102026967B1

Abstract

Disclosed are an apparatus and a method for correcting grammar errors using a hybrid approach that combines a statistical method based on n-gram data with a language analysis method. According to an embodiment of the present invention, a computer-implemented method for correcting grammar errors based on n-gram data and language analysis includes: tagging and preprocessing an input sentence; detecting grammar error candidates in the tagged and preprocessed input sentence using n-gram statistical data extracted from a large corpus; extracting similar n-grams for the grammar error candidates and generating a corrected sentence by selecting a final corrected n-gram using similarity, frequency, and grammatical conditions; receiving the corrected sentence as input again, parsing it with error rules applied to generate a parse tree, and assigning node correction information to each node of the parse tree; and correcting the corrected sentence using the node correction information.

Description

Grammar error correcting apparatus and method based on n-gram data and language analysis

The present invention relates to an apparatus and method for detecting grammar errors in natural-language sentences and correcting them. More particularly, it relates to an apparatus and method that correct grammar errors in a hybrid manner, combining a statistical technique using n-gram data with a language analysis technique.

As computer-based foreign language learning, translation, document writing, and writing evaluation become more common, demand for automatic grammar error detection and correction is growing. In grammar error correction for foreign language learning in particular, a wide variety of errors can occur depending on the learner's level, so existing grammar correction techniques do not meet practical requirements for applicability or accuracy. Existing grammar correction methods rely mainly on rules, language analysis, n-gram statistical data, or learning-based approaches.

Of these, the rule-based approach requires considerable effort to write the rules; the learning-based approach suffers from coverage problems and limited performance when long-distance context is needed; and the language analysis approach may misanalyze a sentence that contains errors. Each approach has therefore been difficult to apply on its own.

To solve the problems described above, an object of the present invention is to provide a grammar error correcting apparatus and method that improve the applicability and accuracy of grammar error correction by combining n-gram data extracted from a large language corpus with a language analysis technique.

The objects of the present invention are not limited to the one mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

According to one aspect of the present invention for achieving the above object, a computer-implementable grammar error correction method based on n-gram data and language analysis includes: performing tagging and preprocessing on an input sentence; detecting grammar error candidates in the tagged and preprocessed input sentence using n-gram statistical data extracted from a large corpus; extracting similar n-grams for the grammar error candidates and generating a corrected sentence by selecting a final corrected n-gram using similarity, frequency, and grammatical conditions; receiving the corrected sentence as input again, generating a parse tree by parsing it with error rules applied, and assigning node correction information to each node of the parse tree; and correcting the corrected sentence using the node correction information.

According to the present invention, grammar errors in an input sentence can be detected and corrected easily using data extracted from a large corpus, without manually building specific knowledge. By parsing the sentence corrected in this way again, grammar errors with long-distance dependencies can be recognized and corrected while minimizing the risk of parsing errors. Through this process, grammar errors in input sentences can be recognized and corrected with relatively high accuracy and broad applicability, so the invention can be put to practical use in tasks such as foreign language learning, translation, document writing, and writing evaluation.

FIG. 1 illustrates a grammar error correction method based on n-gram data and language analysis according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram explaining the result of parsing with an error parsing rule applied, according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram explaining the result of parsing with an error recognition rule applied, according to an embodiment of the present invention.
FIG. 4 illustrates one configuration of a computer device on which the grammar error correction method based on n-gram data and language analysis according to an embodiment of the present invention can be executed.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains; the present invention is defined by the claims. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms include plural forms unless the context specifically states otherwise. As used in this specification, "comprises" or "comprising" does not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In assigning reference numerals to the components of each drawing, the same components are given the same numerals wherever possible, even when they appear in different drawings. In describing the present invention, detailed descriptions of related known configurations or functions are omitted where they could obscure the gist of the invention.

FIG. 1 is a diagram illustrating a grammar error correction method based on n-gram data and language analysis according to an embodiment of the present invention.

Referring to FIG. 1, a sentence, either text or a speech recognition result, is input to a computing device or the like through an input device (100).

Tagging and preprocessing (101) are first performed on the input sentence, and grammar error candidates are then detected in the input sentence using n-gram probabilities (102). For each detected grammar error candidate, n-gram correction candidates are extracted through similar n-gram matching, a final corrected n-gram is selected using similarity, frequency, and grammatical conditions, and a corrected sentence for the input sentence is generated (103). In this process, n-gram statistical data (107) extracted from a large corpus is used.

The computing device then detects and corrects grammar error candidates in the corrected sentence by parsing. This process includes a parsing step (104) that uses syntax rules including error rules, and a step (105) of assigning grammar correction information through parsing and generating the sentence.

Specifically, the computing device detects grammar errors by parsing the input sentence that was first corrected by the probabilistic technique using n-gram statistical data (104). At this point, error parsing rules and error recognition rules (108) may be applied.

Then, if parsing of the entire first-corrected sentence succeeds, the computing device generates a sentence while traversing the finally selected parse tree in depth-first order (105). During the traversal, the computing device checks whether correction information is assigned to the parse node currently being visited, among the parse nodes that make up the tree. If correction transformation information exists, the node is generated according to that information, and if the transformation information contains correction information for a child node, that correction information is applied to the child. If no correction transformation information exists, the child nodes are generated in order (105).

Below, each step of the grammar error correction method based on n-gram data and language analysis described above is explained with concrete examples, with reference to FIGS. 2 and 3.

For example, suppose the following sentence is input.

"one of the hobbies I like the most are listening music."

In the sentence above, "are" must be changed to "is" for subject-verb number agreement with the subject "one", and in "listening music", since "listen" is an intransitive verb, the preposition "to" must be added to give "listening to music". Mapping this correction onto the method described with FIG. 1: first, in the tagging and preprocessing step (101), the input sentence is tagged, punctuation and the like are removed, and uppercase letters are converted to lowercase. The symbols "START" and "END" are then inserted to mark the beginning and end of the sentence.
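
The preprocessing part of step 101 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `preprocess` is hypothetical, tokens are split on whitespace, only leading and trailing punctuation is stripped, and part-of-speech tagging is omitted.

```python
import string


def preprocess(sentence: str) -> list[str]:
    """Lowercase the words, drop punctuation, and add START/END markers.

    Assumption: whitespace tokenization; POS tagging is not shown here.
    """
    tokens = []
    for raw in sentence.split():
        # Strip punctuation only at the edges so words like "I-pad" survive.
        word = raw.strip(string.punctuation).lower()
        if word:
            tokens.append(word)
    return ["START", *tokens, "END"]


print(preprocess("One of the hobbies I like the most are listening music."))
```

The markers are kept in uppercase so that they cannot collide with any lowercased word of the sentence.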

In the step (102) of detecting grammar error candidates using n-gram probabilities, the n-gram probabilities are checked in order from bi-grams through tri-grams, 4-grams, 5-grams, and so on, to determine whether the input sentence may contain a grammar error. For example, the n-grams generated for the sentence above are as follows.

bi-gram: START one, one of, of the, the hobbies, hobbies i, i like, like the, the most, most are, are listening, listening music, music END

tri-gram: START one of, one of the, of the hobbies, the hobbies i, hobbies i like, ...

4-gram: START one of the, one of the hobbies, of the hobbies i, the hobbies i like, hobbies i like the, ...

...
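
The n-gram lists above can be reproduced with a sliding window over the preprocessed tokens. The sketch below is illustrative only; the helper name `ngrams` is an assumption, and the token list is written out by hand.

```python
def ngrams(tokens: list[str], n: int) -> list[str]:
    """Return all contiguous n-grams of the token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ["START", "one", "of", "the", "hobbies", "i", "like", "the",
          "most", "are", "listening", "music", "END"]

for n in (2, 3, 4):
    # Show the first three n-grams of each order.
    print(n, ngrams(tokens, n)[:3])
```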

Grammar error candidates are detected from the n-gram probabilities as follows. First, the uni-gram probability is checked; if a word's uni-gram probability is 0, the word is flagged as a spelling error. Then, from bi-grams upward, an n-gram is flagged as a grammar error candidate when the ratio in Equation 1 below is smaller than a given threshold t.

[Equation 1]

bi-gram: P(w_i | w_{i-1}) / P(w_i) < t

n-gram: P(w_i | w_{i-n+1}, w_{i-n+2}, ..., w_{i-1}) / P(w_i | w_{i-n+2}, ..., w_{i-1}) < t

That is, if the conditional n-gram probability falls, by more than a certain ratio, below the probability conditioned on the shorter context (the context of the (n-1)-gram), the n-gram is judged to be a candidate that may contain a grammar error. If the value of P(w_i | w_{i-n+2}, ..., w_{i-1}) is too small, or the frequency of (w_{i-n+2}, ..., w_{i-1}) is too low, a sparseness problem is assumed and the n-gram is excluded from the grammar error candidate check. In this embodiment, because of the sparseness problem, the simplified value P(w_i | w_{i-n+2}, ..., w_{i-1}) ≈ P(w_i | w_{i-1}) is used.

Taking the bi-grams of the sentence above as an example:

P(one|START)/P(one) = 1.7463

P(of|one)/P(of) = 19.696323

P(the|of)/P(the) = 3.3802967

...

In the sentence above, the bi-gram ratios pass the threshold up to "listening music", where P(music|listening)/P(music) = 0, so that bi-gram is flagged as a grammar error candidate.
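
The bi-gram case of this check can be sketched with maximum-likelihood estimates from raw counts. Everything below is an illustrative assumption rather than the patent's implementation: the three-sentence toy corpus is invented, the threshold of 0.1 is arbitrary, and the sparseness rule is reduced to skipping contexts seen fewer than `min_context` times.

```python
from collections import Counter


def count_ngrams(sentences):
    """Count uni-grams and bi-grams in whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def flag_bigrams(tokens, unigrams, bigrams, threshold=0.1, min_context=1):
    """Flag (prev, word) pairs where P(word|prev) / P(word) < threshold."""
    total = sum(unigrams.values())
    flagged = []
    for prev, word in zip(tokens, tokens[1:]):
        if unigrams[word] == 0:
            # Zero uni-gram probability: treated as a spelling error.
            flagged.append((prev, word, "spelling"))
            continue
        if unigrams[prev] < max(1, min_context):
            continue  # context too rare (sparseness): skip the check
        p_cond = bigrams[(prev, word)] / unigrams[prev]
        p_word = unigrams[word] / total
        if p_cond / p_word < threshold:
            flagged.append((prev, word, "grammar"))
    return flagged


# Invented toy corpus, already preprocessed with START/END markers.
TOY_CORPUS = [
    "START i like listening to music END",
    "START listening to music is fun END",
    "START one of the hobbies is music END",
]
UNIGRAMS, BIGRAMS = count_ngrams(TOY_CORPUS)

print(flag_bigrams("START i like listening music END".split(), UNIGRAMS, BIGRAMS))
# -> [('listening', 'music', 'grammar')]
```

Because "listening music" never occurs in the toy corpus, its ratio is 0 and it is the only pair flagged, mirroring the example in the text.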

Next, grammar error correction is performed on the detected grammar error candidates, as follows. Starting from the n-gram detected as a grammar error candidate (for example, "listening_music"), matching n-grams are generated by extending the window to the left and right, as shown below.

[Matching n-grams]

listening_music, are_listening_music, listening_music_END, are_listening_music_END, the_most_are_listening_music_END, most_are_listening_music
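
One way to enumerate such windows is sketched below. The source does not specify the expansion order or the maximum window size, so the bounds `max_left` and `max_right` and the function name are assumptions, and the output differs slightly from the list above.

```python
def matching_ngrams(tokens, start, end, max_left=2, max_right=1):
    """Extend the flagged span tokens[start:end] to the left and right.

    Assumption: every combination of 0..max_left extra words on the left
    and 0..max_right extra words on the right is generated.
    """
    result = []
    for left in range(max_left + 1):
        for right in range(max_right + 1):
            lo, hi = start - left, end + right
            if lo < 0 or hi > len(tokens):
                continue  # window would run past the sentence boundary
            result.append("_".join(tokens[lo:hi]))
    return result


tokens = ["START", "one", "of", "the", "hobbies", "i", "like", "the",
          "most", "are", "listening", "music", "END"]

# "listening music" occupies tokens[10:12].
for candidate in matching_ngrams(tokens, 10, 12):
    print(candidate)
```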

Next, for each matching n-gram, the most similar n-gram (hereinafter, the similar n-gram) is retrieved from the n-gram statistical data extracted from the corpus. The similar n-gram for a single matching n-gram is retrieved as follows.

First, for every n-gram, a key consisting of the n-gram's first word, its last word, and n is built, and the n-gram statistical data is stored in a DB under these keys. For example, "the_gym_listening_to_music" may be stored as content of the DB under the key "the_music_5".

For example, suppose an n-gram similar to "the_gym_listening_music" is to be retrieved. To do so, the most similar n-gram content is searched for in the DB keyed by "the_music_4" (the substitution case), in the DB keyed by "the_music_5" (the case of a missing word), and in the DB keyed by "the_music_3" (the case of an inserted word).
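
The keyed lookup can be sketched with an in-memory dictionary. This is only an illustration under stated assumptions: the source does not define its similarity measure, so `difflib.SequenceMatcher` over word lists is swapped in here; the entry "the_kind_of_pop_music" and its count are invented; and all function names are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_index(ngram_counts):
    """Group n-grams under the key (first word, last word, n)."""
    index = defaultdict(list)
    for ngram, freq in ngram_counts.items():
        words = ngram.split("_")
        index[(words[0], words[-1], len(words))].append((ngram, freq))
    return index


def similar_ngrams(query, index):
    """Look up same-length, longer, and shorter n-grams sharing both end words."""
    words = query.split("_")
    n = len(words)
    candidates = []
    for kind, size in (("substitution", n), ("missing", n + 1), ("inserted", n - 1)):
        for ngram, freq in index.get((words[0], words[-1], size), []):
            # Assumed similarity: word-level matching ratio.
            sim = SequenceMatcher(None, words, ngram.split("_")).ratio()
            candidates.append((sim, freq, kind, ngram))
    # Highest similarity first; frequency breaks ties.
    return sorted(candidates, reverse=True)


counts = {
    "the_gym_listening_to_music": 1,
    "listening_to_music": 44,
    "the_kind_of_pop_music": 3,  # invented entry for illustration
}
index = build_index(counts)
best = similar_ngrams("the_gym_listening_music", index)[0]
print(best[3], best[2])
# -> the_gym_listening_to_music missing
```

Keying on the two end words and the length keeps each lookup to three small buckets instead of a scan over the whole n-gram table.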

In addition, within each key DB, the most similar n-gram among the n-gram contents is determined. To this end, an inverted index is built over the constituent words of the n-grams in each key DB. The n-grams similar to a matching n-gram are sorted by frequency, so that when several have the same similarity, the one with the higher frequency is selected first.

For all retrieved similar n-grams, a weight is computed as a function of similarity and frequency, and the similar n-grams are sorted again by that weight, giving Table 1 below.

[Table 1]

weight       matching n-gram                similar n-gram               frequency
-0.0519805   listening_music_END            listening_to_music_END       23
-0.0534099   listening_music                listening_to_music           44
-0.214286    the_most_are_listening_music   the_gym_listening_to_music   1
-0.339937    most_are_listening_music_END   most_hated_season_END        4
-0.340583    are_listening_music_END        are_not_fresh_               14

Referring to Table 1, for each matching n-gram the similar n-gram with the largest weight is selected from the n-gram statistical data, and the results are sorted by weight.

A similar n-gram whose weight is at or above a given threshold is compared with its matching n-gram. If the difference is the insertion, deletion, or substitution of a word, base-form and part-of-speech information is consulted to determine whether specific conditions are satisfied. If the similar n-gram satisfies the conditions, it is selected as the final correction result.

In Table 1, the similar n-gram "listening_to_music_END" is selected as the highest-weighted candidate for the matching n-gram "listening_music_END". This is a case of preposition insertion; it passes the part-of-speech check and is selected as the final result. Correction by n-grams is then performed again, repeatedly, on the text as modified by the final correction result.

For the final correction result, the error type is classified with reference to part-of-speech information and the like, and an error classification tag is added as follows.

One of the hobbies I like the most are listening <MT>|to</MT> music.

In "<MT>|to</MT>", "|to" means that "" (the empty string) should be changed to "to", and "MT" is the error classification code meaning that a preposition is missing.

The above describes how grammar errors in the input sentence are corrected by the probabilistic technique based on n-gram statistical data, according to an embodiment of the present invention. The input sentence first corrected by this probabilistic technique is then corrected a second time by parsing with error rules applied. If the input sentence were corrected by applying only the parsing-based error rules, parsing failures and parsing errors could occur. The present invention is characterized in that it corrects grammar errors first by the probabilistic technique using n-gram statistical data and then corrects the result through parsing, thereby reducing the likelihood of parsing failures and parsing errors.

Below, the process of re-correcting the first-corrected input sentence using the parsing technique, according to an embodiment of the present invention, is examined in detail with reference to FIGS. 2 and 3.

There are two kinds of error rules: error parsing rules and error recognition rules. An error parsing rule is a syntax rule used in parsing; it must be written for cases in which parsing would fail if the rule did not exist.

For example, in "Our I-pad can easily revealed our locations.", "can" must be followed by the base form of a verb, but the past form "revealed" appears instead, so an ordinary parser would fail to parse the sentence. To recognize this construction and parse it successfully, alongside the following normal parsing rule,

{ MD VP!:[(eform == [vr])] } -> {VP MD VP!}

the following error parsing rule must be added.

{ MD VP!:[(eform == [vb])] } -> {VP:[child2.tenseverb.gc:=[<FV>wd|VB(wd)</FV>],feat:=[modal]] MD VP!}

"VP:[child2.tenseverb.gc:=[<FV>wd|VB(wd)</FV>],feat:=[modal]]" means that correction information such as "child2.tenseverb.gc:=[<FV>wd|VB(wd)</FV>],feat:=[modal]" is to be assigned to the current verb phrase node. "eform == [vr]" denotes the case where the head verb of the verb phrase is in its base form, and "eform == [vb]" denotes a tensed form such as past or present. "tenseverb" denotes the word in the verb phrase that carries tense.

When parsing is performed with such an error rule added, node correction information is assigned by the error parsing rule applied during parsing, and the final parse result is produced as shown in FIG. 2 (104).

Thereafter, correction information is generated from the resulting parse tree (105).

When depth-first traversal of this tree begins, the S (204) node has no correction information, so generation continues with its "NP VP" children. In the NP, the children are leaf nodes, so words are generated; that is, "Our I-pad" is generated as is.

The VP (203) node has correction information, so the action specified by that information is performed. "child2.tenseverb.gc:=[<FV>wd|VB(wd)</FV>]" means that "<FV>wd|VB(wd)</FV>" is to be assigned in turn to the correction information of the tenseverb of the second child node. Accordingly, "<FV>wd|VB(wd)</FV>" is assigned to the "revealed" node, the tenseverb of the second child "(VP easily revealed our locations)".

This results in the following state.

(S Our I-pad (VP can (VP easily revealed[<FV>wd|VB(wd)</FV>] our locations)))

The same process is then repeated for the lower nodes, and when "revealed[<FV>wd|VB(wd)</FV>]" is generated, "<FV>revealed|reveal</FV>" is generated in place of "revealed". The final output is as follows.

Our I-pad can easily <FV>revealed|reveal</FV> our locations
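
The depth-first generation walk can be sketched on a hand-built tree. The sketch rests on several assumptions that are not in the source: the `Node` class and its field names are hypothetical, the tense verb is approximated as the first leaf whose tag starts with "VB", and `VB(wd)` is replaced by a one-entry lookup table.

```python
from dataclasses import dataclass, field
from typing import Optional

BASE_FORM = {"revealed": "reveal"}  # hypothetical stand-in for VB(wd)


@dataclass
class Node:
    label: str
    word: Optional[str] = None                 # set on leaf nodes only
    children: list["Node"] = field(default_factory=list)
    push_to_child: Optional[tuple[int, str]] = None  # (child index, error code)
    gc: Optional[str] = None                   # error code assigned to a leaf


def tense_verb(node):
    """Return the first verb leaf under node (a simplification)."""
    if node.word is not None:
        return node if node.label.startswith("VB") else None
    for child in node.children:
        found = tense_verb(child)
        if found is not None:
            return found
    return None


def generate(node):
    """Depth-first walk that emits words and applies correction information."""
    if node.push_to_child is not None:
        child_index, code = node.push_to_child
        target = tense_verb(node.children[child_index])
        if target is not None:
            target.gc = code  # hand the correction down to the tense verb
    if node.word is not None:
        if node.gc is not None:
            fixed = BASE_FORM.get(node.word, node.word)
            return [f"<{node.gc}>{node.word}|{fixed}</{node.gc}>"]
        return [node.word]
    words = []
    for child in node.children:
        words.extend(generate(child))
    return words


inner_vp = Node("VP", children=[
    Node("RB", "easily"), Node("VBD", "revealed"),
    Node("NP", children=[Node("PRP$", "our"), Node("NNS", "locations")]),
])
# child2 of the outer VP is index 1 here.
outer_vp = Node("VP", children=[Node("MD", "can"), inner_vp], push_to_child=(1, "FV"))
tree = Node("S", children=[
    Node("NP", children=[Node("PRP$", "Our"), Node("NN", "I-pad")]),
    outer_vp,
])

print(" ".join(generate(tree)))
# -> Our I-pad can easily <FV>revealed|reveal</FV> our locations
```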

The error recognition rule is described next. An error recognition rule assigns correction information to a parse node that satisfies specific conditions, in cases where parsing succeeds normally.

For example, the following rule

{ S << NP:[pos == [NN]] VP!:[ tenseverb.pos == [VBP]] } -> {S:[tenseverb.gc:=[<AGV>wd|VBZ(wd)</AGV>]] }

means that when the parsing rule S -> NP VP is applied, if the part of speech of the head of the NP is "NN" and the part of speech of the tense verb of the VP is "VBP", there is a subject-verb number agreement error, and correction information such as "[tenseverb.gc:=[<AGV>wd|VBZ(wd)</AGV>]]" is to be assigned.
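
The condition of this recognition rule can be sketched as a plain function. This is a simplified illustration, not the rule engine itself: the function name is hypothetical, the part-of-speech tags are passed in directly instead of being read from a parse tree, and `VBZ(wd)` is replaced by a small assumed lookup table.

```python
VBZ_FORM = {"are": "is", "have": "has", "do": "does"}  # assumed stand-in for VBZ(wd)


def agreement_tag(np_head_pos: str, verb: str, verb_pos: str) -> str:
    """Return the verb wrapped in an AGV tag when the rule's condition holds."""
    if np_head_pos == "NN" and verb_pos == "VBP" and verb in VBZ_FORM:
        # Singular noun head with a non-third-person present verb.
        return f"<AGV>{verb}|{VBZ_FORM[verb]}</AGV>"
    return verb


print(agreement_tag("NN", "are", "VBP"))   # singular head, plural verb form
print(agreement_tag("NNS", "are", "VBP"))  # plural head: left unchanged
```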

Referring to FIG. 3, when example sentence 1 is parsed with the above rule, the above correction information is assigned to the S (303) node at the point where the following rule is applied.

Rule: S -> NP VP:[eform==[vb]]

Parsing result: (S[tenseverb.gc:=[<AGV>wd|VBZ(wd)</AGV>]] (NP One of the hobbies I like the most) (VP are listening to music))

When depth-first traversal of the tree shown in FIG. 3 begins, the S (303) node has correction information, so the action specified by that information is performed. "tenseverb.gc:=[<AGV>wd|VBZ(wd)</AGV>]" means that "<AGV>wd|VBZ(wd)</AGV>" is to be assigned in turn to the correction information of the tenseverb of the S (303) node. Accordingly, "<AGV>wd|VBZ(wd)</AGV>" is assigned to the "are" node, which is the tenseverb.

(S (NP One of the hobbies I like the most) (VP are[<AGV>wd|VBZ(wd)</AGV>] listening to music))

The same process is then repeated for the lower nodes, and when "are[<AGV>wd|VBZ(wd)</AGV>]" is generated, "<AGV>are|is</AGV>" is generated in place of "are".

The final output is as follows.

One of the hobbies I like the most <AGV>are|is</AGV> listening to music

This is then merged with the n-gram grammar correction result, and the following final result is produced.

One of the hobbies I like the most <AGV>are|is</AGV> listening <MT>|to</MT> music
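
The inline tags in this output follow the pattern `<CODE>original|correction</CODE>`, so the corrected sentence and the list of errors can be read back with a regular expression. The sketch below is an illustration only; the function name `apply_tags` is hypothetical, and it assumes tags are not nested and that the tagged text contains no "<" or "|" characters.

```python
import re

# <CODE>original|correction</CODE>, where CODE is an uppercase error code.
TAG = re.compile(r"<(?P<code>[A-Z]+)>(?P<wrong>[^|<]*)\|(?P<right>[^<]*)</(?P=code)>")


def apply_tags(annotated: str):
    """Return (corrected sentence, [(code, original, correction), ...])."""
    errors = [(m["code"], m["wrong"], m["right"]) for m in TAG.finditer(annotated)]
    corrected = TAG.sub(lambda m: m["right"], annotated)
    # Normalize whitespace left behind by replaced tags.
    return " ".join(corrected.split()), errors


text = "One of the hobbies I like the most <AGV>are|is</AGV> listening <MT>|to</MT> music"
corrected, errors = apply_tags(text)
print(corrected)
print(errors)
```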

Meanwhile, the grammar error correction method based on n-gram data and language analysis according to an embodiment of the present invention may be implemented in a computer system or recorded on a recording medium. As shown in FIG. 4, the computer system may include at least one processor (121), a memory (123), a user input device (126), a data communication bus (122), a user output device (127), and a storage (128). These components communicate with one another through the data communication bus (122).

The computer system may further include a network interface 129 coupled to a network. The processor 121 may be a central processing unit (CPU) or a semiconductor device that processes instructions stored in the memory 123 and/or the storage 128.

The memory 123 and the storage 128 may include various forms of volatile or non-volatile storage media. For example, the memory 123 may include a ROM 124 and a RAM 125.

Accordingly, the grammar error correction method based on n-gram data and language analysis according to an embodiment of the present invention may be implemented as a computer-executable method. When the method is performed on a computer device, computer-readable instructions may carry out the recognition method according to the present invention.

The grammar error correction method based on n-gram data and language analysis described above may also be implemented as computer-readable code on a computer-readable recording medium. Computer-readable recording media include every kind of recording medium that stores data readable by a computer system, for example ROM (Read Only Memory), RAM (Random Access Memory), magnetic tape, magnetic disk, flash memory, and optical data storage devices. The computer-readable recording medium may also be distributed across computer systems connected by a computer network, so that the code is stored and executed in a distributed manner.

The configuration of the present invention has been described in detail above through preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will nevertheless understand that the invention may be embodied in specific forms other than those disclosed herein without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention is defined by the claims that follow rather than by the detailed description above, and all changes or modifications derived from the scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

Claims

A computer-implemented grammar error correction method based on n-gram data and language analysis, the method comprising:
performing tagging and preprocessing on an input sentence;
detecting grammatical error candidates in the tagged and preprocessed input sentence using n-gram statistical data extracted from a large corpus;
extracting similar n-grams for the grammatical error candidates, and generating a corrected sentence by selecting a final correction n-gram using similarity, frequency, and grammatical conditions;
receiving the corrected sentence as input again, performing syntactic analysis with error rules applied to generate a syntax tree, and assigning node correction information to each node constituting the syntax tree; and
correcting the corrected sentence using the node correction information.