KR20060111920A - System or method of detecting mis-written word and supplying information to correct it

Info

Publication number: KR20060111920A
Application number: KR1020050033875A
Authority: KR
Inventors: 봉래 박
Original assignee: 봉래 박
Priority date: 2005-04-25
Filing date: 2005-04-25
Publication date: 2006-10-31

Abstract

A system and a method for providing miswritten word detection and correction information are disclosed. They check for words that contain a spelling error and for words that are misused in their context, then generate detection and correction information and present it to the user. A miswritten word database (2) stores frequently miswritten words. A word instance database (4) stores correct usage instances of the words stored in the miswritten word database. A word obtaining module obtains words from written text. A miswritten word determining module (1) determines whether a word is miswritten by searching for it in the miswritten word database. An instance extracting module (3) extracts correct usage instances for a miswritten word by searching for it in the word instance database.

Description

System for determining suspected error words and providing correction information, and method of implementing it {System or Method of Detecting Mis-Written Word and Supplying Information to correct it}

1. Suspected error word determination module 2. Advanced error word information database

3. Usage extraction module 4. Word usage database

5. Correction word extraction module 6. Correction word database

As we enter the information society, people are becoming less careful in how they use language. Rather than writing appropriate vocabulary with the correct spelling, people accustomed to communicating quickly over the Internet use only the frequently used words that come to mind first, and even those they write as they are pronounced, so spelling mistakes are common. This has become a social problem, and the number of quiz shows and educational programs on TV and radio that promote correct language use has risen sharply.

Word processors used to write documents on a computer include spell checkers, but these work by morphologically analyzing a word and treating it as an error word only when no analysis exists in which the constituent morphemes combine without violating the part-of-speech combination rules. Because this approach does not take meaning into account, morphemes that cannot combine semantically are often wrongly accepted, so an error word is misanalyzed and goes undetected.

Moreover, such checkers cannot detect a word that is not an error in itself but does not fit the context of the sentence in which it is used. For example, in the sentence

"돌을 던졌다. 그리고는 달아났다" ("Threw a stone. And then ran away.")

'그리고는' is not an error word in itself, but it is used incorrectly in a place where the word '그러고는' should have been used.

The present invention was devised in view of the above points. It is a method of implementing a program module that can be called by various application programs, such as word processors, that need to find incorrectly used words. Its object is to provide a suspected error word detection method that finds words containing spelling errors and words used inappropriately for their context, and a correction information providing method that generates appropriate correction information and presents it to the user.

To achieve the above object, the suspected error word detection method according to the present invention is characterized in that

an error information database is provided which contains words that can be morphologically analyzed but are errors, together with an error mark indicating that each such word is an error, and words that may be errors depending on the context, together with normal context information for the case in which each such word is used in a normal context,

and in that the method comprises:

(1) receiving a target word to be checked for errors and the context text in which the word is used;

(2) extracting candidate words from the word;

(3) searching the error information database for the candidate words;

(4) if the search result contains information indicating an error word, determining the word to be a suspected error word;

(5) if the search result contains normal context information, extracting context information from the context text;

(6) comparing the normal context information with the context information; and

(7) if the comparison shows that the normal context information and the context information differ, determining the word to be a suspected error word.
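The overall procedure can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `ErrorEntry` record layout, the function names, the whitespace tokenization, and the "no shared token means different" rule are all assumptions, and the tokens `X` and `Y` in the sample database are invented placeholders rather than real normal-context data.

```python
from dataclasses import dataclass, field


@dataclass
class ErrorEntry:
    """One record of the error information database (hypothetical layout)."""
    error_mark: bool = False  # True: the indexed word is always an error
    normal_contexts: list = field(default_factory=list)  # token sets from correct uses


def extract_candidates(word):
    # Placeholder: a real system would also strip particles and endings.
    return [word]


def extract_context(word, context_text):
    # Placeholder: the other whitespace-separated tokens, punctuation stripped.
    tokens = {tok.strip('.,!?"\'') for tok in context_text.split()}
    return tokens - {word, ""}


def is_suspected_error(word, context_text, db):
    for candidate in extract_candidates(word):
        entry = db.get(candidate)  # search the error information database
        if entry is None:
            continue
        if entry.error_mark:  # an error mark settles the question
            return True
        if entry.normal_contexts:
            context = extract_context(word, context_text)
            # Assumed rule: "different" means no token shared with any normal context.
            if not any(context & normal for normal in entry.normal_contexts):
                return True
    return False


# Sample database; "X" and "Y" are invented placeholder context tokens.
db = {
    "부억에서": ErrorEntry(error_mark=True),
    "그리고는": ErrorEntry(normal_contexts=[{"X", "Y"}]),
}
```

With this sample database, '부억에서' is flagged because of its error mark, while '그리고는' is flagged only when its surrounding tokens share nothing with the stored normal context.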

The invention is described in detail below with reference to the drawings.

FIG. 1 shows a suspected error word determination system according to the present invention, which includes a suspected error word determination module (1) and an error information database (2).

The suspected error word determination module receives a word, and the context text in which the word is used, from a general computer application program that requires text input. It then determines whether the word is a suspected error word by using the information obtained by searching the error information database (2) and the context information extracted from the context text.

The error information database (2) is prepared in advance, either built from information obtained from a large corpus by natural language processing techniques or built manually by language experts.

FIG. 2 is a flowchart of the suspected error word determination method according to the present invention.

In s10, the suspected error word determination module receives a word and the context text in which it is used from a computer application program that needs to know whether the word fits its context and is free of spelling errors.

The context text may be the sentence or paragraph containing the word, or a certain number of words or sentences located before and after the word.
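One of the options above, taking a fixed number of words before and after the target word, might look like the following sketch. The function name `context_window` and the default window size are assumptions made for illustration.

```python
def context_window(words, index, size=2):
    """Return up to `size` words before and after words[index], excluding the target."""
    start = max(0, index - size)
    return words[start:index] + words[index + 1:index + 1 + size]


words = "돌을 던졌다. 그리고는 달아났다".split()
neighbors = context_window(words, 2, size=1)  # the words around '그리고는'
```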

In s20, candidate words to be searched for in the error information database are extracted from the word.

In Korean, a word (eojeol) consists of one or more morphemes. A morpheme containing an error can therefore combine with many other morphemes to produce a large number of error words, and storing every such error word in the error information database would be inefficient. It is therefore preferable to extract error candidate words from the word and search for those. For example, for '부억에서', a misspelling of the word '부엌에서' ('in the kitchen'), the candidate words '부억에서', '부억에' and '부억' are extracted.
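The '부억에서' example can be reproduced with a simple suffix-stripping sketch. The tiny `SUFFIXES` list is an assumption chosen only to cover this example; a practical system would presumably rely on a morphological analyzer to identify particles and endings.

```python
# Illustrative only: a hand-picked list that covers the '부억에서' example.
SUFFIXES = ("에서", "서", "에")


def extract_candidates(word):
    """Return the word and every stem reachable by stripping listed suffixes."""
    candidates = [word]
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for suffix in SUFFIXES:
            stem = current[:-len(suffix)]
            if current.endswith(suffix) and stem and stem not in candidates:
                candidates.append(stem)
                frontier.append(stem)
    # Longest first, so the most specific index term is tried first.
    return sorted(candidates, key=len, reverse=True)
```

Trying the longest candidate first is a design choice of this sketch; the source does not state an order.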

In s30, the error information database is searched for the candidate words.

The error information database stores, as index terms, frequently occurring error words and the candidate words extracted from them in the same way as in s20. Each index term generated from an error word or its candidate words is stored together with an error mark, which is information indicating that it is an error. Words that are not errors in themselves but are often used in unsuitable contexts, and the candidate words extracted from them, are also stored as index terms, and the context information extracted from context text in which such a word is used normally is stored in the error information database as normal context information.
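A rough sketch of how such an index might be assembled is shown below. The dictionary layout, the function names, and the stand-in candidate extractor are assumptions, and the context tokens `X` and `Y` are invented placeholders.

```python
def simple_candidates(word):
    # Stand-in for the s20 extraction: the word plus, if it ends in '에서', its stem.
    if word.endswith("에서") and len(word) > 2:
        return [word, word[:-2]]
    return [word]


def build_index(error_words, context_words, candidates_of):
    """Build {index term: {"error_mark": bool, "normal_contexts": [...]}}."""
    index = {}
    # Error words and their candidates are indexed with an error mark.
    for word in error_words:
        for key in candidates_of(word):
            index[key] = {"error_mark": True, "normal_contexts": []}
    # Context-dependent words are indexed with their normal context information.
    for word, contexts in context_words.items():
        for key in candidates_of(word):
            entry = index.setdefault(key, {"error_mark": False, "normal_contexts": []})
            entry["normal_contexts"].extend(contexts)
    return index


index = build_index(["부억에서"], {"그리고는": [{"X", "Y"}]}, simple_candidates)
```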

In s40, when the search of the error information database succeeds, it is checked whether error mark information is present. If it is, the word under examination is determined to be a suspected error word (s90).

In s50, when the search of the error information database succeeds and no error mark information is present, it is checked whether normal context information is present. If no normal context information is present, the word is determined not to be a suspected error word.

In s60, when normal context information is present, context information is extracted from the context text in order to check whether the word is used normally in that context text.

The context information is information about the context in which the word is used, including the vocabulary used together with the word in the context text, the parts of speech of that vocabulary, and the sentence structure. The vocabulary may also be extracted from within the word itself.
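As an illustration of what such context information could hold, the sketch below collects the co-occurring vocabulary and, where a lookup table is supplied, part-of-speech pairs. The `ContextInfo` type, the `pos_lookup` parameter, and the whitespace tokenization are assumptions; sentence structure is left out because the source does not say how it is represented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextInfo:
    vocabulary: frozenset  # words used together with the target word
    pos_tags: frozenset    # (word, part of speech) pairs, when known


def extract_context_info(word, context_text, pos_lookup=None):
    """Collect co-occurring words and any part-of-speech tags known for them."""
    tokens = [tok.strip('.,!?"\'') for tok in context_text.split()]
    vocabulary = frozenset(tok for tok in tokens if tok and tok != word)
    pos_lookup = pos_lookup or {}
    pos_tags = frozenset((tok, pos_lookup[tok]) for tok in vocabulary if tok in pos_lookup)
    return ContextInfo(vocabulary, pos_tags)


info = extract_context_info("그리고는", "돌을 던졌다. 그리고는 달아났다", {"던졌다": "verb"})
```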

In s70, the context information is compared with the normal context information.

This step checks whether the word is used normally by comparing how similar the context information extracted from the context text in which the word is used is to the normal context information extracted from context text in which the word is used normally.

If the context information and the normal context information differ (s80), the word is regarded as likely to have been used inappropriately in the context text and is determined to be a suspected error word (s90).
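The source does not specify how similarity is measured. One possible reading, sketched here purely as an assumption, treats each context as a set of tokens, scores overlap with the Jaccard index, and calls the contexts "different" when the best score falls below a threshold. The function names and the threshold value are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def differs_from_normal(context, normal_contexts, threshold=0.2):
    """True when no stored normal context is similar enough to `context`."""
    best = max((jaccard(context, normal) for normal in normal_contexts), default=0.0)
    return best < threshold
```

A lower threshold flags fewer words; the value 0.2 is arbitrary and would need tuning against real data.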

Of course, it is practically difficult for the error information database to store all the normal context information needed for every error word. Extracting normal contexts from frequently occurring contexts nevertheless raises the accuracy of suspected error word determination.

Accuracy can also be improved by extracting abnormal context information from context text in which a word is used inappropriately, and adding a step that compares how similar the context information is to that abnormal context information.
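How the normal and abnormal comparisons are combined is not described. The sketch below assumes one simple rule: flag the word when its context shares more tokens with some stored abnormal context than with any stored normal context. The names and the tie-breaking behavior are hypothetical.

```python
def best_overlap(context, stored_contexts):
    """Largest number of tokens shared with any stored context (0 if none stored)."""
    return max((len(context & stored) for stored in stored_contexts), default=0)


def suspected_by_contexts(context, normal_contexts, abnormal_contexts):
    """Assumed rule: suspected when the abnormal evidence outweighs the normal evidence."""
    return best_overlap(context, abnormal_contexts) > best_overlap(context, normal_contexts)
```

Under this rule a tie is resolved in favor of "not suspected", which keeps false alarms down at the cost of missing some errors.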

Another, further improved embodiment of the present invention is shown in FIG. 3.

For more effective use of the present invention, the user can be provided with correction word information for the suspected error word, with usage examples of text in which the suspected error word is used correctly, and with usage examples of text in which the correction word is used correctly.

For this purpose, the system is provided with a correction word database (6), which collects correction words for the words in the error information database (2), and a word usage database (4), which collects usage examples of text in which those words and the correction words are used correctly.

For a word that the suspected error word determination module (1) has determined to be a suspected error word, the correction word extraction module (5) checks whether a correction word for it exists in the correction word database (6), and the usage extraction module (3) searches the word usage database (4) for usage examples of text in which the word and the correction word are used correctly.

For example, for the error word '부억에서', the user is informed that the word is a suspected error word and is also given a usage example such as '어머니께서 부엌에서 요리를', which contains the correction word '부엌에서' ('in the kitchen').

As another example, for the error word '가르키는' in the text fragment '학생들을 가르키는 사람으로', the user is informed that the word is a suspected error word and is also given usage examples such as '학생들을 가르치는 선생님' ('a teacher who teaches students') and '방향을 가리키는 표지' ('a sign pointing the direction'). Of course, one or more usage examples may be provided.
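The two examples can be expressed as lookups against the correction word database and the word usage database. In this sketch both databases are plain dictionaries and the function name `correction_info` is hypothetical; the sample entries are taken from the examples in the text, with the second usage example written with '가리키는'.

```python
# Correction word database (6): suspected error word -> correction word.
corrections = {"부억에서": "부엌에서"}

# Word usage database (4): word -> usage examples of correct text.
usages = {
    "부엌에서": ["어머니께서 부엌에서 요리를"],
    "가르키는": ["학생들을 가르치는 선생님", "방향을 가리키는 표지"],
}


def correction_info(suspect, corrections, usages):
    """Gather the correction word (if any) and usage examples for a suspected word."""
    corrected = corrections.get(suspect)
    examples = list(usages.get(suspect, []))
    if corrected is not None:
        examples += usages.get(corrected, [])
    return {"word": suspect, "correction": corrected, "usages": examples}
```

For '가르키는' no single correction word is returned, because the intended word depends on the context; the usage examples let the user choose.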

The terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe his or her own invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the present invention and do not represent all of its technical idea, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing this application.

According to the present invention, inappropriately used words and words containing spelling errors that could not be found by existing spell checkers become easier to find, which provides an environment for correct language use.

Claims

A method of determining whether a word is likely to contain an error, wherein

an error information database is provided which contains words that can be morphologically analyzed but are errors, together with error mark information indicating that each such word is an error, and words that may be errors depending on the context, together with normal context information for the case in which each such word is used in a normal context, the method comprising:

(1) receiving a target word to be checked for errors and the context text in which the word is used;

(2) extracting candidate words from the word;

(3) searching the error information database for the candidate words;

(4) if the search result contains information indicating an error word, determining the word to be a suspected error word;

(5) if the search result contains normal context information, extracting context information from the context text;

(6) comparing the normal context information with the context information; and

(7) if the comparison shows that the normal context information and the context information differ, determining the word to be a suspected error word.

A method of determining whether a word is likely to contain an error, wherein

an error information database is provided which contains words that can be morphologically analyzed but are errors, together with error mark information indicating that each such word is an error, and words that may be errors depending on the context, together with abnormal context information, which is context information extracted from context text in which each such word is used inappropriately, the method comprising:

(2) extracting candidate words from the word;

(3) searching the error information database for the candidate words;

(4) if the search result contains abnormal context information, extracting context information from the context text;

(5) comparing the abnormal context information with the context information; and

(6) if the abnormal context information and the context information are similar, determining the word to be a suspected error word.

The method according to claim 1 or 2, wherein

the candidate words are the word itself and

the remainder of the word after any possible particles or endings have been removed.

The method according to claim 1 or 2, wherein

the context information includes vocabulary extracted from the context text.

A system that finds words that writers often use inappropriately and provides the user with usage examples in which those words are used correctly, comprising:

a suspected error word database that stores words that are often used inappropriately;

a word usage database that stores usage examples of text in which the words stored in the suspected error word database are used correctly;

a word acquisition module that obtains a word from written text;

a suspected error word determination module that searches the suspected error word database for the word and determines whether the word is a suspected error word;

a usage extraction module that searches the word usage database for a word determined to be a suspected error word and extracts usage examples of text in which the word is used appropriately; and

display means for presenting the extracted usage examples to the user.

A system that detects an error word and provides the user with a correction word and usage examples of text in which it is used correctly, comprising:

a suspected error word database that stores frequently occurring error words;

a correction word database that stores correction words for the words stored in the suspected error word database and usage examples of text in which the correction words are used correctly;

a word acquisition module that obtains a word from written text;

a usage extraction module that searches the correction word database for a word determined to be a suspected error word and extracts the correction word for that word and usage examples of text in which the correction word is used appropriately; and

display means for presenting the extracted correction word and usage examples to the user.

The system according to claim 5 or 6,

further comprising means for informing the user of a word determined to be a suspected error word.

The system according to claim 5 or 6, wherein

the suspected error word database is the error information database provided in claim 1 or 2, and

the suspected error word determination module uses the suspected error word determination method provided in claim 1 or 2.