KR101621154B1

KR101621154B1 - Method and apparatus for correcting spelling error for social text

Info

Publication number: KR101621154B1
Application number: KR1020140098936A
Authority: KR
Inventors: 임해창; 고대옥
Original assignee: 고려대학교 산학협력단
Priority date: 2014-08-01
Filing date: 2014-08-01
Publication date: 2016-05-13
Also published as: KR20160015933A

Abstract

The present invention relates to a spelling error correction method and apparatus for social text that correct spelling errors by combining a rule-based correction method with a statistics-based correction method.
A spelling error correction method for social text according to an embodiment of the present invention may include: maintaining a database storing information on a rule-based correction method and information on a statistics-based correction method; receiving an input of an original text; generating a first corrected sentence from the original text using the information on the rule-based correction method; generating a second corrected sentence from the original text using the information on the statistics-based correction method; and generating a final corrected sentence by combining the first corrected sentence and the second corrected sentence.

Description

METHOD AND APPARATUS FOR CORRECTING SPELLING ERROR FOR SOCIAL TEXT

The present invention relates to a spelling error correction method and apparatus for social text, and more particularly to a spelling error correction method and apparatus for social text that correct spelling errors by combining a rule-based correction method with a statistics-based correction method.

With the recent growth in the use of mobile devices such as smartphones and tablets, social network services that let users report their situation or express their thoughts in real time have been attracting attention. Twitter is one of the representative social network services, and the number of tweet messages written by Twitter users is increasing explosively.

Such a large volume of tweets is a highly valuable resource for big data mining aimed at identifying the latest trends, detecting real-time issues, or supporting decision making.

However, tweets contain many spelling errors. Examples include substituting a similar-sounding form (요 → 여, 고 → 구), typographical errors (있 → 잇), and intentional modification of a syllable (다 → 당). Such spelling errors propagate into the statistical analysis that is essential to text mining and thereby lower the reliability of the text analysis. More accurate and reliable analysis of tweet data therefore requires a system that can automatically correct these spelling errors.

The spelling error correction task takes a sentence written by a user as input, automatically detects the spelling errors in the sentence, and outputs the result of correcting them to the proper spelling. For example:

Original sentence: 처음부터 여유롭개 할 걸... ("I should have taken it easy from the start...", with 여유롭게 misspelled as 여유롭개)

Corrected sentence: 처음부터 여유롭게 할 걸...

Existing research on spelling error correction has mainly targeted SMS messages. Korean Registered Patent No. 10-0897718 is a related prior art document.

That research mainly built a spelling error correction corpus from SMS messages and then extracted rules or trained a statistical model from the corpus. Correcting spelling errors in tweets rather than SMS messages likewise requires a correction corpus of tweet data from which the errors frequent in tweets can be learned, but such a resource is currently difficult to obtain. As a result, research on spelling error correction suited to tweets is also lacking.

Therefore, research is needed on techniques for correcting spelling errors in user-generated text on social media, such as tweets.

An object of the present invention is to provide a spelling error correction method and apparatus for social text that can accurately correct many spelling errors by using a rule-based correction method and a statistics-based correction method together.

To achieve the above object, an embodiment of the present invention discloses a spelling error correction method for social text, including: maintaining a database storing information on a rule-based correction method and information on a statistics-based correction method; receiving an input of an original text; generating a first corrected sentence from the original text using the information on the rule-based correction method; generating a second corrected sentence from the original text using the information on the statistics-based correction method; and generating a final corrected sentence by combining the first corrected sentence and the second corrected sentence.

To achieve the above object, an embodiment of the present invention also discloses a spelling error correction apparatus for social text, including: a database storing information on a rule-based correction method and information on a statistics-based correction method; an input unit that receives an input of an original text; a first correction unit that generates a first corrected sentence from the original text using the information on the rule-based correction method; a second correction unit that generates a second corrected sentence from the original text using the information on the statistics-based correction method; a result mixing unit that generates a final corrected sentence by combining the first corrected sentence and the second corrected sentence; and a control unit that controls the database, the input unit, the first correction unit, the second correction unit, and the result mixing unit.

The spelling error correction method and apparatus for social text according to an embodiment of the present invention can accurately correct many spelling errors.

According to an embodiment of the present invention, sentences can be produced whose errors are corrected better than with existing methods, so the invention can serve as a useful preprocessing step for other natural language processing techniques.

FIG. 1 is a block diagram of a spelling error correction apparatus for social text according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a spelling error correction method for social text according to an embodiment of the present invention.
FIG. 3 is a conceptual diagram of a spelling error correction method that applies the rule-based correction method first and then the statistics-based correction method, among the spelling error correction methods for social text according to an embodiment of the present invention.
FIGS. 4 and 5 are graphs showing performance changes of the spelling error correction method for social text according to an embodiment of the present invention.

Hereinafter, a spelling error correction method and apparatus for social text according to an embodiment of the present invention will be described with reference to the drawings.

As used herein, singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "consists of" or "includes" should not be construed as necessarily including all of the components or steps described in the specification; some of the components or steps may be omitted, or additional components or steps may be included.

In this specification, social text means text generated by users on social media.

FIG. 1 is a block diagram of a spelling error correction apparatus for social text according to an embodiment of the present invention.

As illustrated, the spelling error correction apparatus 100 may include a database 110, an input unit 120, a first correction unit 130, a second correction unit 140, a result mixing unit 150, an output unit 160, and a control unit 170.

The database 110 may store information on the rule-based correction method and information on the statistics-based correction method.

The rule-based correction method generates correction rules and then corrects spelling errors by applying each rule to errors of the type it matches. Correction rules can be generated automatically or manually: automatic generation extracts corrections from a correction corpus consisting of pairs of original and gold-standard sentences, whereas in manual generation a correction expert composes the rules directly. The information on the rule-based correction method may include error patterns and the correction patterns mapped as pairs to the error patterns.
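The pairing of error patterns with correction patterns can be illustrated with a short Python sketch. It is not the patented implementation: the function names `extract_rules` and `apply_rules`, the character-by-character alignment, and the toy corpus (built from the 여유롭개/여유롭게 and 잇/있 examples mentioned in the background) are assumptions made for illustration only.

```python
from collections import Counter


def extract_rules(corpus):
    """Count syllable-level substitutions in (original, gold) sentence pairs.

    Assumption: both sentences of a pair have the same length, so characters
    align one-to-one. A real extractor would need a proper alignment step.
    """
    counts = Counter()
    for original, gold in corpus:
        if len(original) != len(gold):
            continue  # this simple aligner cannot handle insertions or deletions
        for src, dst in zip(original, gold):
            if src != dst:
                counts[(src, dst)] += 1
    return counts


def apply_rules(sentence, rules):
    """Replace every occurrence of each error pattern with its correction pattern."""
    for error_pattern, correction_pattern in rules.items():
        sentence = sentence.replace(error_pattern, correction_pattern)
    return sentence


# Hypothetical two-pair correction corpus.
corpus = [("여유롭개", "여유롭게"), ("잇어요", "있어요")]
counts = extract_rules(corpus)
rules = {src: dst for (src, dst) in counts}  # error pattern -> correction pattern

print(apply_rules("처음부터 여유롭개 할 걸", rules))  # prints: 처음부터 여유롭게 할 걸
```

Context-free replacement like this would over-correct in practice, which is presumably why rules are later filtered by an accuracy threshold in the experiments.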

The statistics-based correction method generates all candidate sentences into which the original sentence could be converted and then selects the candidate sentence with the highest score. The score may be determined by considering the probability that the candidate sentence would actually be written and the probability that the original sentence is converted into the candidate sentence. For example, the score may be computed as the product of these two probabilities.
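Written as a formula, the product-based score might look as follows. The symbols are introduced here for illustration and do not appear in the original text: $o$ is the original sentence, $C(o)$ the set of candidate sentences generated from it, $P_{\text{write}}(c)$ the probability that candidate $c$ would actually be written, and $P_{\text{conv}}(o \to c)$ the probability that $o$ is converted into $c$.

```latex
\hat{c} \;=\; \operatorname*{arg\,max}_{c \in C(o)} \; P_{\text{write}}(c) \cdot P_{\text{conv}}(o \to c)
```

The text only says that the product "may" be used, so other ways of combining the two probabilities are not excluded.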

The input unit 120 may receive social text (the original text) written by a user.

The first correction unit 130 may generate a first corrected sentence using the information on the rule-based correction method.

The second correction unit 140 may generate a second corrected sentence using the information on the statistics-based correction method.

The result mixing unit 150 may generate a final corrected sentence by combining the first corrected sentence generated by the first correction unit 130 and the second corrected sentence generated by the second correction unit 140.

The output unit 160 may output the final corrected sentence.

The control unit 170 may control the overall operation of the database 110, the input unit 120, the first correction unit 130, the second correction unit 140, the result mixing unit 150, and the output unit 160.

FIG. 2 is a flowchart illustrating a spelling error correction method for social text according to an embodiment of the present invention.

The control unit 170 may maintain the database 110 in which the information on the rule-based correction method and the information on the statistics-based correction method are stored (S210).

The input unit 120 may receive an input of an original text (S220). The original text may include social text.

The first correction unit 130 may generate a first corrected sentence by correcting spelling errors in the original text according to the rule-based correction method (S230). The first correction unit 130 may generate the first corrected sentence using the error patterns included in the information on the rule-based correction method and the correction patterns mapped as pairs to those error patterns.

The second correction unit 140 may generate a second corrected sentence by correcting spelling errors in the original text according to the statistics-based correction method (S240). The second correction unit 140 may correct the spelling errors using the candidate correction patterns corresponding to an error pattern included in the information on the statistics-based correction method, the conversion probability of each candidate correction pattern, and the appearance probability of each candidate correction pattern. For example, the second correction unit 140 may compute a score as the product of a candidate correction pattern's conversion probability and its appearance probability, and select as the second corrected sentence the sentence containing the correction pattern with the highest score.
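The pattern-level scoring in step S240 can be sketched in Python as below. The function name `choose_correction`, the table layout, and every probability value are hypothetical; the source gives no concrete numbers. The values were picked only so that the outcome matches the 쿠 → 국 correction that the statistics-based method produces in the worked example of this document.

```python
def choose_correction(error_pattern, table):
    """Return the candidate correction pattern with the highest score.

    `table` maps an error pattern to its candidates, each stored as
    candidate -> (conversion_probability, appearance_probability).
    Score = conversion probability * appearance probability.
    An error pattern with no candidates is returned unchanged.
    """
    candidates = table.get(error_pattern, {})
    if not candidates:
        return error_pattern
    return max(candidates, key=lambda c: candidates[c][0] * candidates[c][1])


# Hypothetical statistics for the error pattern "쿠".
table = {
    "쿠": {
        "쿠": (0.10, 0.001),  # leave unchanged
        "코": (0.35, 0.020),
        "국": (0.40, 0.030),
    }
}

best = choose_correction("쿠", table)
```

Including the unchanged pattern among the candidates is a design choice of this sketch; it lets the scorer decide that no correction is needed.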

The first correction unit 130 and the second correction unit 140 may also operate in parallel, independently of each other.
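Because neither unit depends on the other's output, the two corrections can be computed concurrently. The following is a minimal sketch using Python's standard `concurrent.futures` module; the two correction functions are trivial stand-ins invented for the example, not the actual correction units.

```python
from concurrent.futures import ThreadPoolExecutor


def rule_based_correct(text):
    # Stand-in for the first correction unit 130.
    return text.replace("쿠", "코")


def statistics_based_correct(text):
    # Stand-in for the second correction unit 140.
    return text.replace("는", "은")


def correct_in_parallel(text):
    """Run both stand-in units on the same original text and return both results."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(rule_based_correct, text)
        second = pool.submit(statistics_based_correct, text)
        return first.result(), second.result()


first_sentence, second_sentence = correct_in_parallel("결쿠 않는")
```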

The result mixing unit 150 may generate a final corrected sentence by combining the first corrected sentence and the second corrected sentence (S250).

Compared with the statistics-based correction method, the rule-based correction method has higher correction precision and relatively lower correction recall.

Accordingly, when the correction result of the rule-based method conflicts with that of the statistics-based method, the result mixing unit 150 gives priority to the rule-based correction so that correction precision is preserved as far as possible. In other words, when applying the second corrected sentence, the result mixing unit 150 corrects only those portions that were not already corrected in the first corrected sentence.

Hereinafter, the approach of applying the rule-based correction result first and the statistics-based correction result afterwards is referred to as the rule-first-then-statistics method.

The output unit 160 may output the final corrected sentence through a user interface (S260).
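The flow from S220 to S260 can be summarized as a small Python skeleton. The class name `SpellingCorrector`, its field names, and the stand-in callables are assumptions for illustration; the specification describes the units only functionally.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpellingCorrector:
    rule_based: Callable[[str], str]        # first correction unit (S230)
    statistics_based: Callable[[str], str]  # second correction unit (S240)
    mix: Callable[[str, str, str], str]     # result mixing unit (S250)

    def correct(self, original: str) -> str:
        """Receive the original text (S220) and return the final sentence (S260)."""
        first = self.rule_based(original)
        second = self.statistics_based(original)
        return self.mix(original, first, second)


# Trivial stand-ins: prefer the rule-based result whenever it changed anything.
corrector = SpellingCorrector(
    rule_based=lambda text: text.replace("개", "게"),
    statistics_based=lambda text: text,
    mix=lambda original, first, second: first if first != original else second,
)
result = corrector.correct("여유롭개")
```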

FIG. 3 is a conceptual diagram of the rule-first-then-statistics method among the spelling error correction methods for social text according to an embodiment of the present invention.

As illustrated, a corrected sentence is first generated for the input sentence by each of the two methods, namely the rule-based correction method using a rule-based spelling error correction model and the statistics-based correction method using a statistics-based spelling error correction model, and the result mixing unit 150 then combines the two corrected sentences. In doing so, the result mixing unit 150 applies every portion that the rule-based method corrected relative to the input sentence unconditionally, regardless of whether the statistics-based method also corrected that portion. The result mixing unit 150 then applies the statistics-based corrections to the portions that the rule-based method left uncorrected.

The following is an example of correction by the rule-first-then-statistics method.

Original text containing errors: 결쿠 후회스럽지 않는 선택임니다. (intended meaning: "It is a choice I will never regret.")

The original text contains three spelling errors in total.

Rule-based correction: 결코 후회스럽지 않는 선택입니다.

Correction rules: 쿠 → 코, 임 → 입

The rule-based correction fixed two spelling errors. Both corrections are carried over to the mixed correction.

Statistics-based correction: 결국 후회스럽지 않은 선택입니다.

Corrections made: 쿠 → 국, 는 → 은, 임 → 입

The statistics-based correction changed three portions. Two of them (쿠 → 국, 임 → 입) are excluded because those portions were already corrected by the rule-based correction (쿠 → 코, 임 → 입), so only the remaining correction (는 → 은) is applied in the mixed correction.

Mixed correction (rule-first-then-statistics method): 결코 후회스럽지 않은 선택입니다.

Rule-based corrections applied: 쿠 → 코, 임 → 입

Statistics-based correction applied: 는 → 은

Thus the mixed correction (rule-first-then-statistics method) can combine the high correction precision that is the strength of the rule-based method with the correction recall that is the strength of the statistics-based method.
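The worked example above can be reproduced with a short Python sketch of the mixing step. The function names `diff_positions` and `merge` are hypothetical, and the sketch assumes that all three sentences have the same length so that corrections can be located by character position; the patent itself works with spelling error portions and does not prescribe this representation.

```python
def diff_positions(original, corrected):
    """Map character index -> replacement character for same-length sentences."""
    if len(original) != len(corrected):
        raise ValueError("this sketch only handles same-length sentences")
    return {i: c for i, (o, c) in enumerate(zip(original, corrected)) if o != c}


def merge(original, rule_sentence, stat_sentence):
    """Combine both corrections, letting rule-based edits win where they overlap."""
    rule_edits = diff_positions(original, rule_sentence)
    stat_edits = diff_positions(original, stat_sentence)
    merged = dict(stat_edits)
    merged.update(rule_edits)  # rule-based corrections take priority
    chars = list(original)
    for index, replacement in merged.items():
        chars[index] = replacement
    return "".join(chars)


original = "결쿠 후회스럽지 않는 선택임니다."
rule_sentence = "결코 후회스럽지 않는 선택입니다."
stat_sentence = "결국 후회스럽지 않은 선택입니다."

print(merge(original, rule_sentence, stat_sentence))  # prints: 결코 후회스럽지 않은 선택입니다.
```

At the position of 쿠 the two methods disagree (코 versus 국), and the rule-based choice 코 is kept; 는 → 은 is taken from the statistics-based result because the rule-based method left that position untouched.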

The following describes an evaluation (experiment) of the spelling error correction methods, carried out on each corpus by 10-fold cross-validation.
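For readers unfamiliar with the procedure, 10-fold cross-validation can be sketched as follows. The helper `k_fold_splits` and the interleaved way it assigns items to folds are assumptions of this example; the source does not say how the folds were formed.

```python
def k_fold_splits(items, k=10):
    """Yield (train, test) lists; fold i tests on every k-th item starting at index i."""
    for i in range(k):
        test = items[i::k]
        train = [item for j, item in enumerate(items) if j % k != i]
        yield train, test


# Toy stand-in for a corpus of 25 sentence pairs.
sentence_pairs = list(range(25))
folds = list(k_fold_splits(sentence_pairs, k=10))
```

Each item appears in exactly one test fold, so every sentence pair is used for evaluation once and for training nine times.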

The experiment used the following evaluation measures: correction precision, correction recall, and the F-measure.

Correction precision and correction recall were computed from the number of words (eojeol, the space-delimited units of Korean text) corrected by the system and the number of gold-standard corrected words in the correction corpus. The F-measure (F1), which combines correction precision and correction recall, was used to express the two as a single value.
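The equations themselves did not survive in this text. Under the standard definitions, which are consistent with the description above but are an assumption rather than a quotation from the patent, the measures would read as follows, where $N_{\text{correct}}$ is the number of words the system corrected correctly, $N_{\text{system}}$ the number of words the system corrected, and $N_{\text{gold}}$ the number of gold-standard corrected words in the correction corpus:

```latex
\text{Precision} = \frac{N_{\text{correct}}}{N_{\text{system}}}, \qquad
\text{Recall} = \frac{N_{\text{correct}}}{N_{\text{gold}}}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
```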

As social text, corpora built from SMS and Twitter were used in the experiment; statistics for the corpora are shown in Table 1.

Table 1
Corpus    Total sentences    Total words    Words per sentence    Error words
SMS       109,084            336,564        3.09                  70,226
Tweets    13,534             85,837         6.34                  3,018

FIGS. 4 and 5 are graphs showing performance changes of the spelling error correction method for social text according to an embodiment of the present invention. FIG. 4 shows the performance change of the rule-first-then-statistics method on the SMS corpus, and FIG. 5 shows its performance change on the tweet corpus.

FIG. 4 plots the results of experiments on the SMS corpus with varying parameters. The highest F1 value (76.18) was obtained when the accuracy threshold for the correction rules of the rule-based method was set to 70% and the correction conversion probability threshold of the statistics-based method was set to 0%.

FIG. 5, in turn, plots the results of experiments on the tweet corpus with varying parameters. The highest F1 value (45.28) was obtained when the accuracy threshold for the correction rules of the rule-based method was set to 60% and the correction conversion probability threshold of the statistics-based method was set to 10%.
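How the two thresholds might act as filters can be sketched in Python. The text does not define how rule accuracy is measured or whether the comparison is inclusive, so the helper `filter_by_threshold`, the inclusive comparison, and all numeric values below are hypothetical.

```python
def filter_by_threshold(scored_items, threshold):
    """Keep items whose score (a fraction between 0 and 1) is at least the threshold."""
    return {item: score for item, score in scored_items.items() if score >= threshold}


# Hypothetical rule accuracies and conversion probabilities.
rule_accuracy = {("쿠", "코"): 0.82, ("임", "입"): 0.95, ("는", "은"): 0.41}
conversion_probability = {("쿠", "국"): 0.40, ("는", "은"): 0.08}

kept_rules = filter_by_threshold(rule_accuracy, 0.70)                 # 70% rule accuracy threshold
kept_candidates = filter_by_threshold(conversion_probability, 0.10)   # 10% conversion threshold
```

With a 0% conversion probability threshold, as in the best SMS setting, every statistical candidate would pass this filter.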

The best-performing mixed configurations on the two corpora differ in their conversion probability thresholds because, in the tweet corpus, the ratio of corrected words to all words is low (per Table 1, about 3.5% of words in the tweet corpus versus about 21% in the SMS corpus), so there were very few correction-pair conversion candidates.

As described above, the spelling error correction method and apparatus for social text according to an embodiment of the present invention can accurately correct many spelling errors.

The spelling error correction method for social text described above may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the recording medium may be specially designed and configured for the present invention, or may be known and available to those skilled in computer software.

Computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory.

Such a recording medium may also be a transmission medium, such as an optical or metal line or a waveguide, that includes a carrier wave transmitting a signal specifying program instructions, data structures, and the like.

Program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

The spelling error correction method and apparatus for social text described above are not limited to the configurations and methods of the embodiments described; all or part of the embodiments may be selectively combined so that various modifications can be made.

100: spelling error correction apparatus
110: database
120: input unit
130: first correction unit
140: second correction unit
150: result mixing unit
160: output unit
170: control unit

Claims

1. A spelling error correction method for social text, comprising:
maintaining a database in which information on a rule-based correction method and information on a statistics-based correction method are stored;
receiving an input of an original text;
generating a first corrected sentence from the original text using the information on the rule-based correction method;
generating a second corrected sentence from the original text using the information on the statistics-based correction method; and
generating a final corrected sentence by combining the first corrected sentence and the second corrected sentence,
wherein generating the final corrected sentence comprises:
extracting the spelling error portions corrected in the first corrected sentence;
extracting the spelling error portions corrected in the second corrected sentence; and
when a spelling error portion of the first corrected sentence overlaps a spelling error portion of the second corrected sentence, applying the correction pattern of the first corrected sentence to the overlapping spelling error portion.

2. (Deleted)

3. The method of claim 1, wherein the information on the rule-based correction method includes
an error pattern and a correction pattern mapped as a pair to the error pattern.

4. The method of claim 3, wherein generating the second corrected sentence comprises
selecting the sentence having the highest score among a plurality of candidate sentences,
wherein the score is calculated in consideration of a conversion probability of a candidate correction pattern corresponding to an error pattern included in the information on the statistics-based correction method and an appearance probability of the candidate correction pattern.

5. The method of claim 1,
wherein the rule-based correction method has higher correction precision and lower correction recall than the statistics-based correction method.

6. A spelling error correction apparatus for social text, comprising:
a database storing information on a rule-based correction method and information on a statistics-based correction method;
an input unit that receives an input of an original text;
a first correction unit that generates a first corrected sentence from the original text using the information on the rule-based correction method;
a second correction unit that generates a second corrected sentence from the original text using the information on the statistics-based correction method;
a result mixing unit that generates a final corrected sentence by combining the first corrected sentence and the second corrected sentence; and
a control unit that controls the database, the input unit, the first correction unit, the second correction unit, and the result mixing unit,
wherein the result mixing unit extracts the spelling error portions corrected in the first corrected sentence, extracts the spelling error portions corrected in the second corrected sentence, and, when a spelling error portion of the first corrected sentence overlaps a spelling error portion of the second corrected sentence, applies the correction pattern of the first corrected sentence to the overlapping spelling error portion.

7. (Deleted)

8. The apparatus of claim 6, wherein the information on the rule-based correction method includes
an error pattern and a correction pattern mapped as a pair to the error pattern.

9. The apparatus of claim 8,
wherein the second correction unit generates the second corrected sentence by selecting the sentence having the highest score among a plurality of candidate sentences,
wherein the score is calculated in consideration of a conversion probability of a candidate correction pattern corresponding to an error pattern included in the information on the statistics-based correction method and an appearance probability of the candidate correction pattern.

10. The apparatus of claim 6,
wherein the rule-based correction method has higher correction precision and lower correction recall than the statistics-based correction method.