KR20230061001A

KR20230061001A - Apparatus and method for correcting text

Info

Publication number: KR20230061001A
Application number: KR1020210145794A
Authority: KR
Inventors: 김민지; 신나영; 유재춘
Original assignee: 삼성에스디에스 주식회사
Priority date: 2021-10-28
Filing date: 2021-10-28
Publication date: 2023-05-08

Abstract

Disclosed are an apparatus and a method for correcting a document. In accordance with one embodiment, the apparatus for correcting a document includes: a shape reproduction correction module correcting a sentence to be corrected based on visual characteristics of the sentence to be corrected; a pronunciation reproduction correction module correcting the sentence to be corrected based on auditory characteristics of the sentence to be corrected; a composite reproduction correction module correcting the sentence to be corrected based on statistical characteristics of a language to which the sentence to be corrected belongs; and a determination module receiving the sentence to be corrected, and determining whether to correct the sentence to be corrected using at least one of the shape reproduction correction module, the pronunciation reproduction correction module, and the composite reproduction correction module. Therefore, the present invention is capable of effectively correcting an error in a text without user intervention.

Description

Document correction device and method {APPARATUS AND METHOD FOR CORRECTING TEXT}

The disclosed embodiments relate to techniques for correcting errors in documents.

Sentences written by people may contain accidental or deliberate typos, as well as words expressed indirectly by exploiting the visual or phonetic characteristics of characters. In addition, when text is obtained by running OCR on a document, incorrect text is often extracted for a variety of reasons. Text produced in everyday life and text produced by OCR therefore often contains typos or is otherwise unrefined, so correcting it takes considerable time and cost.

In particular, demand for text correction is growing as more people write text or use chatbots on portable devices such as smartphones, where typos are relatively likely. For a chatbot it is especially important to identify the intent of the user's utterance: if the utterance contains a typo, the chatbot may fail to identify the intent correctly, so that the chatbot process proceeds in an entirely different direction or does not start at all.

Korean Patent Application Publication No. 10-2021-0023288 (2021.03.04)

The disclosed embodiments are intended to provide technical means for correcting errors in text composed of one or more sentences.

According to an exemplary embodiment, there is provided a document correction apparatus including: a shape regeneration correction module that corrects a correction target sentence based on visual characteristics of the correction target sentence; a pronunciation regeneration correction module that corrects the correction target sentence based on auditory characteristics of the correction target sentence; a composite regeneration correction module that corrects the correction target sentence based on statistical characteristics of the language to which the correction target sentence belongs; and a determination module that receives the correction target sentence and determines whether to correct it using one or more of the shape regeneration correction module, the pronunciation regeneration correction module, and the composite regeneration correction module.

The shape regeneration correction module may convert the correction target sentence into an image and then recognize text in the converted image using optical character recognition.

The shape regeneration correction module may set the recognized text as the corrected sentence when the number of noncomplete characters in the recognized text is smaller than the number of noncomplete characters in the correction target sentence.

The shape regeneration correction module may compare the recognized text with the correction target sentence to identify a changed region, and may set the recognized text as the corrected sentence when the occurrence frequency of the changed region is higher than the occurrence frequency of the region before the change.
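The first acceptance criterion above (fewer noncomplete characters after regeneration) can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes that a "noncomplete character" means a standalone Hangul jamo (a consonant or vowel not composed into a syllable block), which this passage does not define, and all function names are hypothetical.

```python
import unicodedata


def is_noncomplete(ch: str) -> bool:
    """Assumption: a noncomplete character is a standalone Hangul jamo."""
    code = ord(ch)
    return (
        0x3131 <= code <= 0x318E      # Hangul Compatibility Jamo, e.g. ㄷ or ㅏ
        or 0x1100 <= code <= 0x11FF   # conjoining Hangul Jamo left uncomposed
    )


def count_noncomplete(text: str) -> int:
    # NFC composes conjoining jamo sequences into syllable blocks first,
    # so only jamo that truly stand alone are counted.
    return sum(is_noncomplete(ch) for ch in unicodedata.normalize("NFC", text))


def accept_regenerated(original: str, regenerated: str) -> bool:
    """Accept the regenerated text only if it has fewer noncomplete characters."""
    return count_noncomplete(regenerated) < count_noncomplete(original)
```

For instance, `accept_regenerated("밥을 먹었ㄷㅏ .", "밥을 먹었다.")` is true under this assumption, because the input contains two standalone jamo and the regenerated text contains none.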

The pronunciation regeneration correction module may generate speech data from the correction target sentence using a speech synthesis means (Text-to-Speech) and then convert the speech data into text using a speech recognition means (Speech-to-Text).

The pronunciation regeneration correction module may set the converted text as the corrected sentence when the number of noncomplete characters in the converted text is smaller than the number of noncomplete characters in the correction target sentence.

The pronunciation regeneration correction module may compare the converted text with the correction target sentence to identify a changed region, and may set the converted text as the corrected sentence when the occurrence frequency of the changed region is higher than the occurrence frequency of the region before the change.
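The text-to-speech and speech-to-text round trip, together with the frequency-based acceptance test, might look like the sketch below. It makes several assumptions: `tts` and `stt` are hypothetical callables standing in for whatever engines are used, `frequency` is a hypothetical table of occurrence counts, and a "changed region" is taken to be a differing substring reported by Python's `difflib`. The source does not specify the granularity of a region.

```python
from difflib import SequenceMatcher
from typing import Callable, Mapping


def changed_regions(before: str, after: str) -> list[tuple[str, str]]:
    """Return (old, new) substring pairs where the two strings differ."""
    matcher = SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def regenerate_by_pronunciation(
    sentence: str,
    tts: Callable[[str], bytes],      # hypothetical text-to-speech engine
    stt: Callable[[bytes], str],      # hypothetical speech-to-text engine
    frequency: Mapping[str, int],     # hypothetical occurrence counts
) -> str:
    candidate = stt(tts(sentence))    # round trip: text -> speech -> text
    regions = changed_regions(sentence, candidate)
    # Keep the round-trip result only if every changed region became more frequent.
    if regions and all(
        frequency.get(new, 0) > frequency.get(old, 0) for old, new in regions
    ):
        return candidate
    return sentence
```

Passing the engines in as callables keeps the sketch independent of any particular speech library and makes the acceptance rule easy to test with stubs.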

The composite regeneration correction module may divide the correction target sentence into a plurality of morphemes, select one or more correction target tokens from among the plurality of morphemes, and generate one or more correction candidate tokens for each correction target token.

When the part of speech of a divided morpheme is a predicate (an inflecting word such as a verb or adjective), the composite regeneration correction module may convert the morpheme into its base form.

The composite regeneration correction module may select, as the correction target token, a morpheme among the plurality of morphemes that is not found in a preset dictionary database.

The composite regeneration correction module may input the correction target sentence into a preset language model so that the model predicts the token at the position of the morpheme not found in the dictionary database, and, when the language model fails to predict a token corresponding to that morpheme, may select the morpheme as the correction target token.

The composite regeneration correction module may input the correction target sentence into a preset language model so that the model predicts the token at the position of the correction target token, and may set the N prediction tokens (N being a natural number of 1 or more) predicted by the language model as the correction candidate tokens for the correction target token.

The composite regeneration correction module may add, to the correction candidate tokens, tokens whose edit distance from the primary correction target token is within K (K being a natural number of 1 or more).

When the correction target token is a predicate, the composite regeneration correction module may generate one or more candidate base forms from the correction target token and add, to the correction candidate tokens, tokens whose edit distance from each of the one or more candidate base forms is within K (K being a natural number of 1 or more).

The composite regeneration correction module may exclude, from the correction candidate tokens, tokens that are not found in a preset dictionary or corpus.

The composite regeneration correction module may calculate, for each of the one or more correction candidate tokens, one or more of the Levenshtein distance from the correction target token, the part-of-speech similarity to the correction target token, the word frequency in a corpus, and the idiom frequency in the corpus, and may select, based on the calculated values, a token to replace the correction target token from among the one or more correction candidate tokens.
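The target selection, candidate generation, filtering, and replacement steps described above can be sketched in a few functions. This is an illustration under stated assumptions, not the claimed implementation: the language model's predictions are passed in as a plain list, edit-distance candidates are drawn from a hypothetical dictionary set, edit distance is computed over Unicode code points (so one Hangul syllable block counts as one unit), and the ranking rule (smaller distance first, then higher word frequency) is invented for the example. Part-of-speech similarity and idiom frequency are omitted.

```python
from typing import Iterable, Mapping


def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (one Hangul syllable = one unit)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def select_targets(morphemes: Iterable[str], dictionary: set[str]) -> list[str]:
    """Morphemes missing from the dictionary become correction targets."""
    return [m for m in morphemes if m not in dictionary]


def build_candidates(target: str, lm_predictions: Iterable[str],
                     dictionary: set[str], k: int = 1) -> list[str]:
    # Start from the language model's predictions (order kept, duplicates dropped),
    # then add dictionary words within edit distance k of the target.
    candidates = list(dict.fromkeys(lm_predictions))
    for word in sorted(dictionary):
        if word not in candidates and levenshtein(target, word) <= k:
            candidates.append(word)
    # Exclude anything the dictionary does not contain.
    return [c for c in candidates if c in dictionary]


def select_replacement(target: str, candidates: list[str],
                       word_freq: Mapping[str, int]) -> str:
    # Assumed ranking: smaller edit distance first, then higher corpus frequency.
    return min(candidates,
               key=lambda c: (levenshtein(target, c), -word_freq.get(c, 0)))
```

With a toy dictionary `{"예약", "예술", "파스타", "음식"}` and the misspelled token `예악`, the candidates within distance 1 are `예약` and `예술`. Which one wins depends entirely on the frequency table supplied, which is why the source also brings in sentence context through the language model.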

The apparatus may further include an additional correction module that further corrects a primary corrected sentence obtained by correcting the correction target sentence using one or more of the shape regeneration correction module, the pronunciation regeneration correction module, and the composite regeneration correction module.

The additional correction module may receive the primary corrected sentence and output a secondary corrected sentence obtained by further correcting it, and may be trained so that the secondary corrected sentence matches a preset ground-truth sentence.

According to another exemplary embodiment, there is provided a document correction method performed on a computer, the method including: receiving a correction target sentence; determining whether to correct the correction target sentence using one or more of a shape regeneration correction module that corrects the correction target sentence based on visual characteristics of the correction target sentence, a pronunciation regeneration correction module that corrects the correction target sentence based on auditory characteristics of the correction target sentence, and a composite regeneration correction module that corrects the correction target sentence based on statistical characteristics of the language to which the correction target sentence belongs; and correcting the correction target sentence based on the result of the determination.

The correcting may further include further correcting a primary corrected sentence obtained by correcting the correction target sentence using one or more of the shape regeneration correction module, the pronunciation regeneration correction module, and the composite regeneration correction module.

According to the disclosed embodiments, errors in text can be corrected effectively without user intervention by applying the most suitable error correction module according to the characteristics of the errors that appear in the text.

FIG. 1 is a block diagram illustrating a document correction apparatus 100 according to an embodiment.
FIG. 2 is an exemplary diagram illustrating a training process of a determination model 112 according to an embodiment.
FIG. 3 is an exemplary diagram illustrating a process in which the trained determination model 112 according to an embodiment identifies sentences requiring correction in a document.
FIG. 4 is a flowchart illustrating a process of correcting a sentence in the shape regeneration correction module 114 according to an embodiment.
FIG. 5 is a flowchart illustrating a process of correcting a sentence in the pronunciation regeneration correction module 116 according to an embodiment.
FIG. 6 is a flowchart illustrating a process of correcting a sentence in the composite regeneration correction module 118 according to an embodiment.
FIG. 7 is an exemplary diagram illustrating a process of training an additional correction model 120 according to an embodiment.
FIG. 8 is a flowchart illustrating a document correction method 800 according to an embodiment.
FIG. 9 is a block diagram illustrating a computing environment including a computing device suitable for use in the exemplary embodiments.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to aid a comprehensive understanding of the methods, apparatuses, and/or systems described herein. However, this is merely an example, and the present invention is not limited thereto.

In describing the embodiments of the present invention, a detailed description of known technology related to the present invention is omitted where it is judged that it could unnecessarily obscure the gist of the present invention. The terms used below are defined in consideration of their functions in the present invention and may vary according to the intention or practice of a user or operator. Their definitions should therefore be based on the content of this specification as a whole. The terminology used in the detailed description is only for describing embodiments of the present invention and should in no way be limiting. Unless clearly used otherwise, a singular expression includes the plural meaning. In this description, expressions such as "including" or "having" are intended to indicate certain characteristics, numbers, steps, operations, elements, or parts or combinations thereof, and should not be construed to exclude the presence or possibility of one or more other characteristics, numbers, steps, operations, elements, or parts or combinations thereof beyond those described.

FIG. 1 is a block diagram illustrating a document correction apparatus 100 according to an embodiment. The document correction apparatus 100 receives a document composed of one or more sentences, identifies which sentences in the document require correction, and corrects them. In the disclosed embodiments, the documents to be corrected by the document correction apparatus 100 include, without limitation, not only formal documents such as reports and articles but also every kind of text generated online or offline, such as Internet bulletin board posts, SNS posts, and chat messages.

As illustrated, the document correction apparatus 100 according to an embodiment includes a sentence separation module 102, a determination module 104, and a correction module 106.

The sentence separation module 102 splits the document to be corrected into sentences. As described above, the document to be corrected may be any kind of text composed of one or more sentences. In an embodiment, the sentence separation module 102 may be configured to split the document into sentences either with rules based on punctuation marks (periods, question marks, exclamation marks, and the like) or with a machine learning model.
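A rule-based splitter of the kind mentioned above can be as small as one regular expression. The sketch below is only an illustration with a hypothetical function name: it splits at whitespace that follows a period, question mark, or exclamation mark, and it makes no attempt to handle abbreviations or sentences that lack closing punctuation.

```python
import re

# Split at whitespace that directly follows sentence-ending punctuation.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(document: str) -> list[str]:
    """Return the sentences of `document`, keeping their end punctuation."""
    return [s for s in _SENTENCE_BOUNDARY.split(document.strip()) if s]
```

A machine-learning splitter would replace this function without changing anything downstream, since the rest of the pipeline only needs a list of sentences.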

In the disclosed embodiments, the document correction apparatus 100 performs correction sentence by sentence. Because the sentences do not depend on one another, correcting them in parallel allows the document as a whole to be corrected very quickly. In other words, the overall correction speed of the document correction apparatus 100 is affected only by the number of repeated corrections applied to an individual sentence, not by the total number of sentences. In the following description, a sentence separated by the sentence separation module 102 is referred to as a "correction target sentence".
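Because the sentences are independent, per-sentence correction maps naturally onto a worker pool. The following is a minimal sketch using Python's standard library; `correct_sentence` is a hypothetical callable standing in for the determination and correction steps, and the choice of a thread pool and the worker count are assumptions rather than anything stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def correct_document(
    sentences: Sequence[str],
    correct_sentence: Callable[[str], str],   # hypothetical per-sentence corrector
    max_workers: int = 4,
) -> list[str]:
    """Correct every sentence independently and return them in the original order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so the document can be reassembled.
        return list(pool.map(correct_sentence, sentences))
```

A thread pool suits correctors that wait on external OCR or speech services; a CPU-bound corrector would more likely use a process pool.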

The determination module 104 determines whether each correction target sentence produced by the sentence separation module 102 requires correction. For each correction target sentence, the determination module 104 may determine whether to apply one or more of the three correction modules in the correction module 106, described below. For example, depending on the sentence, the determination module 104 may determine that no correction is needed, or that correction by one, or by two or more, correction modules is needed. In an embodiment, the determination module 104 may be configured to use a machine-learning-based determination model 112 to determine whether each correction target sentence requires correction and which types of correction module to use. Details of the determination model 112 are described later.

The correction module 106 corrects the correction target sentences that the determination module 104 has determined to require correction. The correction module 106 includes a shape regeneration correction module 114, a pronunciation regeneration correction module 116, and a composite regeneration correction module 118.

The shape regeneration correction module 114 corrects the correction target sentence based on its visual characteristics. In an embodiment, the shape regeneration correction module 114 may convert the correction target sentence into a high-quality image by rendering it as an image and upscaling it, and then correct the sentence by re-recognizing the converted image through optical character recognition (OCR).

The pronunciation regeneration correction module 116 corrects the correction target sentence based on its auditory characteristics. In an embodiment, the pronunciation regeneration correction module 116 may generate speech data from the correction target sentence using a speech synthesis means (Text-to-Speech) and then correct the sentence by converting the speech data back into text using a separate speech recognition means (Speech-to-Text).

The composite regeneration correction module 118 corrects the correction target sentence based on the statistical characteristics of the language to which it belongs. In an embodiment, the composite regeneration correction module 118 uses a morphological analyzer to divide the correction target sentence into morphemes and restores the base form of each word. It then selects, as correction target tokens, the morphemes that are not listed in the dictionary, and corrects the sentence by replacing each correction target token with another word in consideration of the statistical characteristics of the language.

Meanwhile, the document correction apparatus 100 according to an embodiment may further include an additional correction module 108 and a judgment module 110.

The additional correction module 108 further corrects a primary corrected sentence obtained by correcting the correction target sentence using one or more of the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the composite regeneration correction module 118 described above. In an embodiment, the additional correction module 108 may further include a machine-learning-based additional correction model 120. The additional correction model 120 may receive the primary corrected sentence and output a secondary corrected sentence obtained by further correcting it, and may be trained so that the secondary corrected sentence matches a preset ground-truth sentence.

The judgment module 110 judges whether the correction result from the correction module 106 or the additional correction module 108 requires further correction. In an embodiment, the judgment module 110 may make this judgment using the same determination model 112 as the determination module 104. However, this is merely an example, and the judgment module 110 may instead use a separate determination model different from that of the determination module 104.

When the determination model 112 indicates that further correction is needed, the judgment module 110 feeds the sentence corrected by the correction module 106 or the additional correction module 108 back into the correction module 106. When no further correction is needed, the judgment module 110 outputs the corrected document. In an embodiment, the judgment module 110 may also check whether the number of corrections performed by the correction module 106 has reached a preset maximum and, if so, output the corrected document regardless of whether further correction is needed.
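The feedback loop with a cap on the number of corrections can be sketched as below. Both callables are hypothetical stand-ins: `needs_correction` plays the role of the determination model, and `correct_once` plays the role of one pass through the correction module. The default cap of three rounds is an arbitrary value chosen for the example.

```python
from typing import Callable


def correct_until_clean(
    sentence: str,
    needs_correction: Callable[[str], bool],  # stand-in for the determination model
    correct_once: Callable[[str], str],       # stand-in for one correction pass
    max_rounds: int = 3,                      # preset maximum number of corrections
) -> tuple[str, int]:
    """Re-run correction until the sentence passes or the round limit is hit."""
    rounds = 0
    while rounds < max_rounds and needs_correction(sentence):
        sentence = correct_once(sentence)
        rounds += 1
    return sentence, rounds
```

The cap guarantees termination even if the model keeps flagging a sentence that the correction modules cannot improve.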

As described above, the correction module 106 includes the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the composite regeneration correction module 118. The errors that each correction module can correct are illustrated below.

First, the errors that can be corrected through the shape regeneration correction module 114 are as follows.

1. A word that has no meaning in itself but is written, for fun, as an indirect combination of consonants and vowels whose shape calls a specific text to mind.

(Example: 멍멍이 ("doggy") -> 댕댕이)

2. The result of running OCR on a document looks similar to the intended text, but incorrect text has actually been extracted.

(Examples: 아기 ("baby") -> or71, 이야기 ("story") -> oloF71, 아가미 ("gills") -> or7rㅁㅣ, 아가리 ("maw") -> or7rzl)

3. A typo in which consonants and vowels are entered separately while typing.

(Example: 밥을 먹었다. ("I ate a meal.") -> 밥을 먹었ㄷㅏ .)

The above examples are cases in which errors arise, intentionally or unintentionally, from the visual form of the writing. OCR can therefore provide a corrective effect. Even when an OCR result is poor, an improved result can be expected by improving the image quality and running OCR again.

Next, the errors that can be corrected through the pronunciation regeneration correction module 116 are as follows.

1. Some words are replaced with other words or numerals that sound similar, exploiting similarity of pronunciation.

(Example: 하2하2~ 5랜만이에요! (input sentence) / 하이하이~ 오랜만이에요! ("Hi hi~ Long time no see!") (intended sentence); the numerals 2 and 5 are read 이 and 오 in Korean)

2. A word is replaced with another word that, when actually pronounced, sounds the same as the intended word.

(Example: 열어분~ (input sentence) / 여러분~ ("Everyone~") (intended sentence))

3. A sentence written by adding aspirated or tense sounds to the pronunciation or by altering vowels. Such a sentence is produced by slightly varying the consonants and vowels of the original sentence. A native speaker of the language can work out what it means, but a foreigner who does not know the language cannot, and a machine translator does not translate it properly.

(Example: a review of a restaurant abroad, written so that the owner takes it for a good review and does not delete it, while warning Korean readers away from the restaurant.

So delicious food and good atmosphere! I wanna visit again! 쩔?? 까찌 마?施?! (input sentence)

So delicious food and good atmosphere! I wanna visit again! 절대 가지 마세요! ("Never go there!") (intended sentence))

The above examples are errors that exploit similarity of pronunciation. As the awkward sentence passes through TTS and is converted into speech data, an utterance suitable as a correction result is obtained, and as that utterance passes through STT it can be expected to be corrected into text that fits the context of the whole sentence.

마지막으로 복합 재생성 교정 모듈(118)을 통해 교정 가능한 오류는 단어 수준에서 자음이나 모음이 사소하게 틀리는 등 일반적인 의미에서의 오타가 발생한 경우이다.Finally, an error that can be corrected through the compound regeneration correction module 118 is a case where a typo in a general sense occurs, such as a slight mismatch of a consonant or a vowel at a word level.

(Example: 피스타는 진자 ?記羚年쨉? 예악이 너무 힘둘다. (input sentence)

파스타는 진짜 맛있었는데 예약이 너무 힘들다. (intended sentence; the Korean means "The pasta was really good, but it is so hard to get a reservation."))

The example above is a case of typos at the word level. Conventional correctors have relied heavily on edit distance or predefined rules to correct typos. The disclosed embodiments, by contrast, take into account a dictionary, a deep-learning language model, a word perturbator, and the characteristics of the corpus, and are configured to recommend a word that suits both the context of the corpus in which the sentence is used and the context of the sentence as a whole.

In one embodiment, the correction module 106 may execute the sub-correction modules (the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the composite regeneration correction module 118) sequentially. For example, suppose the execution order is set to the shape regeneration correction module 114, then the pronunciation regeneration correction module 116, then the composite regeneration correction module 118, and it is determined that a particular sentence needs correction by the shape regeneration correction module 114 and the composite regeneration correction module 118. The correction module 106 may then be configured to run the shape regeneration correction module 114 first and feed its result into the composite regeneration correction module 118 to produce the final correction result.
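The sequential execution described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `run_pipeline`, the module numbering 0 to 2, and the toy string-replacement correctors are all assumptions made for the example.

```python
from typing import Callable, Dict, Iterable, List

# A sub-correction module is modeled as a function from sentence to sentence.
Corrector = Callable[[str], str]


def run_pipeline(sentence: str,
                 needed: Iterable[int],
                 modules: Dict[int, Corrector],
                 order: List[int]) -> str:
    """Run only the needed modules, in the configured order,
    feeding each module's output into the next one."""
    needed_set = set(needed)
    result = sentence
    for module_id in order:
        if module_id in needed_set:
            result = modules[module_id](result)
    return result


# Toy stand-ins for modules 114 (id 0), 116 (id 1) and 118 (id 2).
# Real modules would use OCR, TTS/STT and a language model respectively.
toy_modules: Dict[int, Corrector] = {
    0: lambda s: s.replace("댕댕이", "멍멍이"),
    1: lambda s: s,
    2: lambda s: s.replace("궈엽", "귀엽"),
}

corrected = run_pipeline("댕댕이들은 너무 궈엽답니다.",
                         needed=[0, 2],
                         modules=toy_modules,
                         order=[0, 1, 2])
print(corrected)
```

Because the order is a separate argument, the same set of needed modules can be run under a different configured order without changing the modules themselves.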

FIG. 2 is an exemplary diagram for explaining the training process of the discrimination model 112 according to one embodiment. As shown, to train the discrimination model 112, a document split into sentences, the indices of the sentences requiring correction, and information on the modules required for the correction are input to the discrimination model 112.

For example, assume that the training document consists of the following two sentences.

(Index 0) 저는 개를 좋아합니다. ("I like dogs.")

(Index 1) 댕댕이들은 너무 궈엽답니다. ("The doggies are so cute.", written with errors)

Also, assume that the following module numbers are assigned to the correction modules.

0: correction by the shape regeneration correction module 114 required

1: correction by the pronunciation regeneration correction module 116 required

2: correction by the composite regeneration correction module 118 required

3: no correction required

In this training document, only index 1 needs correction. Specifically, assume that "댕댕이" must be corrected to "멍멍이" by the shape regeneration correction module 114, and "궈엽답니다" to "귀엽답니다" by the composite regeneration correction module 118. The pairs of (index of the sentence requiring correction, module required for the correction) for this document are then given as follows.

(1, 0) -> sentence index 1 requires correction by the shape regeneration correction module 114

(1, 2) -> sentence index 1 requires correction by the composite regeneration correction module 118

The discrimination model 112 receives this information and may be trained to output, for a training document, the indices of the sentences requiring correction and the required module information.
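One way to turn the (sentence index, module number) pairs into training targets is a multi-hot vector per sentence. The sketch below assumes that encoding; the source does not specify the target format, and `build_targets` is a hypothetical helper.

```python
from typing import List, Tuple

NUM_CORRECTION_MODULES = 3   # ids 0, 1, 2 as numbered above
NO_CORRECTION_ID = 3         # id 3: no correction required


def build_targets(num_sentences: int,
                  pairs: List[Tuple[int, int]]) -> List[List[int]]:
    """Build one multi-hot row per sentence (assumed encoding).

    Column m is 1 when module m is required for that sentence;
    column 3 is 1 when no module is required."""
    targets = [[0] * (NUM_CORRECTION_MODULES + 1) for _ in range(num_sentences)]
    for sentence_index, module_id in pairs:
        targets[sentence_index][module_id] = 1
    for row in targets:
        if not any(row[:NUM_CORRECTION_MODULES]):
            row[NO_CORRECTION_ID] = 1
    return targets


# The two-sentence training document above: index 1 needs modules 0 and 2.
labels = build_targets(2, [(1, 0), (1, 2)])
print(labels)
```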

FIG. 3 is an exemplary diagram for explaining the process by which the trained discrimination model 112 identifies, in a document, the sentences requiring correction according to one embodiment.

As shown, the discrimination model 112 trained through the process of FIG. 2 receives a document split into sentences and outputs, for each sentence, a predicted value for each correction module. For example, assume that the discrimination model 112 receives the following document.

(Index 0) 저는 개를 좋아합니다. ("I like dogs.")

The discrimination model 112 may output, for each sentence in the input document, the need for correction by each correction module as a probability between 0 and 1. For example, assume that the per-module need for correction for (index 1) is output as follows (module numbers are the same as in FIG. 2).

0: 0.5

1: 0.05

2: 0.3

3: 0.15

If an output probability is equal to or greater than a preset threshold, the discrimination model 112 may determine that correction by the corresponding module is required. For example, with a threshold of 0.3, the output above leads to the determination that correction by the shape regeneration correction module 114 and the composite regeneration correction module 118 is required. In this case, the discrimination model may output pairs of the sentence index and the number of the module requiring correction, as follows.

(1, 0), (1, 2)
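The thresholding step can be sketched as below, using the probabilities and the threshold of 0.3 from the example. The function name `modules_to_run` is hypothetical, and skipping module number 3 ("no correction required") when emitting pairs is an assumption of this sketch.

```python
from typing import Dict, List, Tuple


def modules_to_run(sentence_index: int,
                   probs: Dict[int, float],
                   threshold: float = 0.3,
                   no_correction_id: int = 3) -> List[Tuple[int, int]]:
    """Return (sentence index, module number) pairs for every correction
    module whose probability is at or above the threshold."""
    return [(sentence_index, module_id)
            for module_id, p in sorted(probs.items())
            if module_id != no_correction_id and p >= threshold]


# Per-module probabilities for sentence index 1, as in the example.
pairs = modules_to_run(1, {0: 0.5, 1: 0.05, 2: 0.3, 3: 0.15})
print(pairs)
```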

FIG. 4 is a flowchart for explaining the process of correcting a sentence in the shape regeneration correction module 114 according to one embodiment.

In step 402, the shape regeneration correction module 114 receives the sentence to be corrected.

In step 404, the shape regeneration correction module 114 converts the input sentence into an image. In one embodiment, the module may render the text at or above a preset size so that it is easily legible to the naked eye. The module may also set the image colors so that the text contrasts clearly with the background; for example, the text may be black and the background white.

In step 406, the shape regeneration correction module 114 performs at least one of upscaling and image quality enhancement on the converted image. It is generally known that, when optical character recognition is performed on images that differ only in image quality, the recognition rate rises with image quality. The shape regeneration correction module 114 therefore upscales the image obtained in step 404 before performing optical character recognition on it. The upscaling may use a general method such as interpolation, or a separate upscaling algorithm specialized for text. In addition to upscaling, various preprocessing steps for improving image quality may also be performed.

In step 408, the shape regeneration correction module 114 recognizes text in the upscaled image using optical character recognition (OCR).

In step 410, the shape regeneration correction module 114 compares the input sentence with the text recognized in step 408, performs the sentence correction, and outputs the correction result.

In one embodiment, if the number of incomplete characters (noncomplete characters) in the recognized text is smaller than the number in the sentence to be corrected, the shape regeneration correction module 114 may set the recognized text as the corrected sentence. For example, if the sentence to be corrected contains characters consisting of a consonant or a vowel alone (incomplete characters), the fewer such characters remain after OCR, the better the correction is considered to be.
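The incomplete-character criterion can be sketched as follows. The sketch assumes that an incomplete character is a standalone Hangul jamo, detected by its Unicode range (Hangul Compatibility Jamo U+3131 to U+3163, or the Hangul Jamo block U+1100 to U+11FF); the source does not say how such characters are detected, and the function names are hypothetical.

```python
def is_incomplete(ch: str) -> bool:
    """True for a standalone consonant or vowel (assumed definition)."""
    code = ord(ch)
    return 0x3131 <= code <= 0x3163 or 0x1100 <= code <= 0x11FF


def count_incomplete(text: str) -> int:
    """Number of incomplete characters in the text."""
    return sum(1 for ch in text if is_incomplete(ch))


def choose_by_incomplete(original: str, recognized: str) -> str:
    """Accept the recognized text only if it has strictly fewer
    incomplete characters than the sentence to be corrected."""
    if count_incomplete(recognized) < count_incomplete(original):
        return recognized
    return original


print(count_incomplete("ㅋㅋ 맛있다"))
print(choose_by_incomplete("ㅁㅏㅅ있다", "맛있다"))
```

When the recognized text does not reduce the count, the original sentence is kept, which matches the module either outputting the recognized text or passing the input through unchanged.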

In another embodiment, the shape regeneration correction module 114 may compare the recognized text with the sentence to be corrected to identify changed regions and, if the occurrence frequency of a changed region is higher than that of the region before the change, set the recognized text as the corrected sentence. Here, a region is a part of a sentence and may be understood variously as a morpheme, a word, an eojeol (a space-delimited unit), or the like.

For example, assume the use of a dataset built by splitting the entire corpus of the language to which the sentence to be corrected belongs, or a specific corpus selected by the user, at spaces and at compound-noun boundaries (for example, '과일상자' ("fruit box") is split into the two words '과일' and '상자'), and by storing the occurrence frequency of each resulting word and eojeol.

If optical character recognition worked well, then, when the recognized text is compared with the sentence to be corrected, the changed regions will have a higher frequency in the corpus than the regions they replaced. Conversely, if optical character recognition worked poorly, it will produce unusual words that are rarely found in the corpus. Using this property, the shape regeneration correction module 114 identifies the changed regions and sets the recognized text as the corrected sentence when their occurrence frequency is higher than that of the regions before the change.
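The frequency criterion can be sketched as below. The sketch assumes that regions are space-delimited units, that changed regions are found with a sequence diff (`difflib.SequenceMatcher`), and that frequencies are summed over all changed regions; these are illustrative choices, and the counts in `toy_freq` are made up.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def changed_regions(original: str,
                    recognized: str) -> List[Tuple[List[str], List[str]]]:
    """Return (before, after) token lists for every region that differs."""
    a, b = original.split(), recognized.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    return [(a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def accept_recognized(original: str, recognized: str,
                      freq: Dict[str, int]) -> bool:
    """Accept the recognized text when its changed regions are more
    frequent in the corpus than the regions they replaced."""
    regions = changed_regions(original, recognized)
    if not regions:
        return False  # nothing changed, so nothing to accept
    before = sum(freq.get(w, 0) for old, _ in regions for w in old)
    after = sum(freq.get(w, 0) for _, new in regions for w in new)
    return after > before


# Made-up corpus counts, for illustration only.
toy_freq = {"멍멍이들은": 40, "너무": 900}
print(accept_recognized("댕댕이들은 너무", "멍멍이들은 너무", toy_freq))
```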

Through the above process, the shape regeneration correction module 114 outputs either the text recognized in step 408 as the corrected sentence, or the sentence input in step 402 unchanged.

FIG. 5 is a flowchart for explaining the process of correcting a sentence in the pronunciation regeneration correction module 116 according to one embodiment.

In step 502, the pronunciation regeneration correction module 116 receives the sentence to be corrected.

In step 504, the pronunciation regeneration correction module 116 generates speech data from the sentence to be corrected using speech synthesis (Text-to-Speech). In one embodiment, if the sentence contains a number, the module may generate a plurality of different pieces of speech data, one for each way of pronouncing the number. For example, the module may generate speech data for every case, considering that the number may be read as a cardinal (일, 이, and so on) or as an ordinal (하나, 둘, and so on), because the pronunciation differs depending on how the number is counted.

In step 506, the pronunciation regeneration correction module 116 converts the speech data into text using speech recognition (Speech-to-Text).

In step 508, the pronunciation regeneration correction module 116 compares the input sentence with the text converted in step 506, performs the sentence correction, and outputs the correction result. The method for evaluating whether the sentence was corrected properly is the same as in step 410. For example, in one embodiment, if the number of incomplete characters (noncomplete characters) in the converted text is smaller than the number in the sentence to be corrected, the pronunciation regeneration correction module 116 may set the converted text as the corrected sentence. In another embodiment, the module may compare the converted text with the sentence to be corrected to identify changed regions and, if the occurrence frequency of a changed region is higher than that of the region before the change, set the converted text as the corrected sentence.

If a plurality of pieces of speech data were generated in step 504, the pronunciation regeneration correction module 116 may set as the corrected sentence the resulting text that has the fewest incomplete characters, or whose changed regions have the highest frequency in the corpus or dictionary.

Through the above process, the pronunciation regeneration correction module 116 outputs either the text converted in step 506 (one of the texts, if several were converted) as the corrected sentence, or the sentence input in step 502 unchanged.
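Steps 502 to 508 can be sketched as a round trip through injected TTS and STT functions. Everything named here is an assumption for illustration: `tts` and `stt` are caller-supplied callables rather than a real speech API, the number readings are produced by a naive digit substitution from tiny toy tables (which ignores the contracted forms Korean actually uses before counters), and only the incomplete-character criterion is used for selection.

```python
import re
from typing import Callable, List

# Toy reading tables covering only the digits 1 to 3.
READING_TABLES = (
    {"1": "일", "2": "이", "3": "삼"},
    {"1": "하나", "2": "둘", "3": "셋"},
)


def count_incomplete(text: str) -> int:
    """Count standalone jamo (Hangul Compatibility Jamo range)."""
    return sum(1 for ch in text if 0x3131 <= ord(ch) <= 0x3163)


def reading_variants(sentence: str) -> List[str]:
    """One spelled-out variant per reading table when digits are present."""
    if not re.search(r"[123]", sentence):
        return [sentence]
    return [re.sub(r"[123]", lambda m, t=table: t[m.group()], sentence)
            for table in READING_TABLES]


def pronunciation_round_trip(sentence: str,
                             tts: Callable[[str], bytes],
                             stt: Callable[[bytes], str]) -> str:
    """Synthesize each variant, transcribe it, and keep the transcript with
    the fewest incomplete characters if it improves on the input."""
    candidates = [stt(tts(variant)) for variant in reading_variants(sentence)]
    best = min(candidates, key=count_incomplete)
    if count_incomplete(best) < count_incomplete(sentence):
        return best
    return sentence


# Stub engines for demonstration only: "audio" is just UTF-8 bytes, and the
# fake recognizer drops the stray jamo the way a real STT model might.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8").replace("ㅋ", "").strip()


print(pronunciation_round_trip("사과 2개 ㅋ", fake_tts, fake_stt))
```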

FIG. 6 is a flowchart for explaining the process of correcting a sentence in the composite regeneration correction module 118 according to one embodiment.

In step 602, the composite regeneration correction module 118 receives the sentence to be corrected.

In step 604, the composite regeneration correction module 118 splits the input sentence into a plurality of morphemes. If the part of speech of a morpheme is a predicate (a verb or an adjective), the composite regeneration correction module 118 may convert that morpheme to its base form.

For example, assume that the following sentence to be corrected is input.

Sentence to be corrected: "참외가 ?記獵?."

Splitting this sentence into morphemes gives the following.

참외 (noun), 가 (particle), ?弩獵? (verb)

Here, since "?弩獵?" is a verb, converting it to its base form yields "?年?".

In step 606, the composite regeneration correction module 118 selects one or more of the morphemes obtained by the splitting and base-form conversion as correction target tokens.

In one embodiment, the composite regeneration correction module 118 looks up each extracted morpheme in a preset dictionary database and may select any morpheme that is not found as a correction target token. The module may treat a morpheme as present in the dictionary only when both the headword and the part of speech match. In the example above, "?年?" is not found in an ordinary dictionary, so it may be selected as a correction target token.

In a further embodiment, the composite regeneration correction module 118 inputs the sentence to be corrected into a preset language model and has it predict the token at the position of a morpheme not found in the dictionary database; if the language model fails to predict that morpheme, the morpheme may be selected as a correction target token. In this embodiment, the correction target token is thus selected in two stages, a dictionary and a language model, and a morpheme is selected only when the language model fails to predict it. For example, assume that the sentence "너는 ?薦? 먹었다." ("You ate ?薦?.") is input into the language model, which is asked to predict the token at the position of the morpheme not found in the dictionary database (here, "??"). The language model may then recommend tokens such as "빵" (bread), "밥" (rice), "점심" (lunch), and "아침" (breakfast) for that position. Since "??" is not among the recommended tokens, the composite regeneration correction module 118 may select "??" as a correction target token. Moreover, because "빵", "밥", "점심", and "아침" result from predicting the token at the position of the correction target token "??", they may be set as correction candidate tokens for "??". Correction candidate tokens are described in more detail under step 608 below.

In general, a pre-trained language model such as BERT can perform various natural language processing tasks through fine-tuning, and its pre-training itself proceeds by randomly deleting words from sentences and having the model guess the deleted words. Therefore, if the language model fails to infer a given word, that word can be judged likely to be a correction target token. For example, the composite regeneration correction module 118 may delete the morpheme, have the language model generate M guesses (M being a natural number of 1 or more) for it, and finally select the morpheme as a correction target token if it is not among the guesses.
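The two-stage selection (dictionary lookup, then a check against the language model's top M guesses) can be sketched as follows. The language model is represented by an injected callable `predict_top_m`, which is a hypothetical interface rather than the API of any particular library, and the part-of-speech labels and the misspelled token "빵ㅇ" are toy values.

```python
from typing import Callable, List, Set, Tuple

Morpheme = Tuple[str, str]  # (base form, part of speech)
# Hypothetical predictor: (tokens, masked position, M) -> top-M guesses.
Predictor = Callable[[List[str], int, int], List[str]]


def select_targets(morphemes: List[Morpheme],
                   dictionary: Set[Morpheme],
                   predict_top_m: Predictor,
                   m: int = 5) -> List[int]:
    """Return positions of morphemes that are absent from the dictionary
    (headword and part of speech must both match) and that the language
    model does not reproduce among its top-M guesses."""
    tokens = [form for form, _ in morphemes]
    targets = []
    for position, morpheme in enumerate(morphemes):
        if morpheme in dictionary:
            continue
        guesses = predict_top_m(tokens, position, m)
        if morpheme[0] not in guesses:
            targets.append(position)
    return targets


# Toy data: "빵ㅇ" stands in for a misspelled noun.
toy_morphemes = [("너", "pronoun"), ("는", "particle"),
                 ("빵ㅇ", "noun"), ("먹다", "verb")]
toy_dictionary = {("너", "pronoun"), ("는", "particle"), ("먹다", "verb")}


def toy_predictor(tokens: List[str], position: int, m: int) -> List[str]:
    """Stub language model that always proposes the same food words."""
    return ["빵", "밥", "점심", "아침"][:m]


print(select_targets(toy_morphemes, toy_dictionary, toy_predictor))
```

The guesses returned for a selected position could also be reused as correction candidate tokens, as the text describes.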

In step 608, the composite regeneration correction module 118 generates one or more correction candidate tokens for each correction target token.

In one embodiment, the composite regeneration correction module 118 may input the sentence to be corrected into a preset language model, have it predict the token at the position of the correction target token, and set the N tokens predicted by the language model (N being a natural number of 1 or more) as the correction candidate tokens for that correction target token.

In the example above, for the correction target token "?磯?", the composite regeneration correction module 118 may use a pre-trained language model such as BERT to set the five words "크다" (big), "달다" (sweet), "맛있다" (delicious), "익었다" (ripe), and "비싸다" (expensive) as correction candidate tokens.

In addition, the composite regeneration correction module 118 may generate further correction candidate tokens using a word perturbator.

If the correction target token is not a predicate, the composite regeneration correction module 118 may add to the correction candidate tokens those tokens whose edit distance from the correction target token derived from the language model (the primary correction target token) is within K (from 1 to K, K being a natural number of 1 or more).

If, on the other hand, the correction target token is a predicate, the composite regeneration correction module 118 may generate one or more candidate base forms from the correction target token and add to the correction candidate tokens those tokens whose edit distance from any of the candidate base forms is within K (K being a natural number of 1 or more). This allows for the possibility that morphological analysis identified the wrong stem for the correction target token. To generate the candidate base forms, the first L consecutive characters of the correction target token are assumed to be the stem, and an ending such as '하다' or '다' is appended.
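The candidate generation for predicates can be sketched as below. The sketch assumes that edit distance is measured over whole Hangul syllables (Python characters), that L ranges from 1 to one less than the token length, and that the perturbator draws its candidates from a given vocabulary; the misspelled token "맛잇다" and the vocabulary are toy values, and the function names are hypothetical.

```python
from typing import Iterable, List, Set


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def candidate_base_forms(token: str,
                         endings: Iterable[str] = ("다", "하다")) -> List[str]:
    """Assume the first L characters are the stem, for each L, and append
    each ending to obtain candidate base forms."""
    endings = tuple(endings)
    return [token[:length] + ending
            for length in range(1, len(token))
            for ending in endings]


def perturb(word: str, vocabulary: Set[str], k: int) -> List[str]:
    """Vocabulary words whose edit distance from `word` is between 1 and K."""
    return sorted(w for w in vocabulary if 1 <= levenshtein(word, w) <= k)


def expand_predicate(token: str, vocabulary: Set[str], k: int) -> List[str]:
    """Union of perturbations of every candidate base form."""
    found: Set[str] = set()
    for base in candidate_base_forms(token):
        found.update(perturb(base, vocabulary, k))
    return sorted(found)


toy_vocabulary = {"맛있다", "맛다", "말하다", "마다"}
print(candidate_base_forms("맛잇다"))
print(expand_predicate("맛잇다", toy_vocabulary, k=1))
```

Restricting the perturbator to a vocabulary also has the effect of dropping candidates that are not found in a dictionary or corpus.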

For example, since "?磯?" is a predicate, its candidate base forms need to be regenerated.

First, if the stem is assumed to be '??', the base forms are '?磯?' and '?饑求?', and the word perturbator may additionally generate the following correction candidate tokens.

?磯? -> 맛다, 마다, 말다...

?饑求? -> 말하다...

If the stem is '?記?', the base forms are '?記獵?' and '?記例求?', and the word perturbator may additionally generate the following correction candidate tokens.

?記獵? -> 맛있다... / ?記例求? -> ...

The composite regeneration correction module 118 may exclude from the correction candidate tokens any token that is not found in a preset dictionary or corpus.

In step 610, the composite regeneration correction module 118 replaces the correction target token with one of the correction candidate tokens. In one embodiment, the composite regeneration correction module 118 may compute, for each of the one or more correction candidate tokens, at least one of the Levenshtein distance from the correction target token, the part-of-speech similarity to the correction target token, the word frequency in the corpus, and the idiom frequency in the corpus, and may select the token that replaces the correction target token on the basis of the computed values. Each of these features for selecting the final correction candidate token is described below.

1. Levenshtein distance: an algorithm for computing the edit distance between the correction target token and a correction candidate token; the smaller the edit distance, the more similar in form the words are considered to be. For example, 귤 and 귤 have an edit distance of 0 and are the same word, whereas 귤 and 글 are different words with an edit distance of 1.

2. Part-of-speech similarity: the sentence to be corrected and the sentence with the correction candidate token substituted are each run through a morphological analyzer to determine the parts of speech, and a higher score is given the more similar the two parts of speech are in nature. For example, a verb and an adjective are similar in nature, so that pairing may receive a high score. The similarity between parts of speech may be set in view of the linguistic characteristics of each part of speech.

3. Word frequency in the corpus: using a morphological analyzer, the frequency of each word is recorded for a general corpus containing a large amount of prose or for a specific corpus designated by the user, and the frequencies of the correction candidate tokens are then compared with one another. Predicates are converted to their base form before the frequency is counted, and a correction candidate token with a higher frequency receives a higher score.

4. Idiom frequency in the corpus (Idiom Recommendation): a set of idiom candidates is built by recording the frequencies of N-grams in the corpus. The correction target token in the sentence to be corrected is replaced with each correction candidate token, and the frequencies, within the idiom candidates, of the N-grams containing that token are compared. As with word frequency, predicates are converted to their base form before counting, and a higher frequency earns a higher score.

The set of idiom candidates is generated as follows.

If the example sentence is "나는 밥을 맛있게 먹었다." ("I ate the meal with relish."), simple N-grams are first extracted from it. For bigrams, (나는, 밥을), (밥을, 맛있게), and (맛있게, 먹다) are extracted (predicates are converted to their base form).

Next, an SVO extractor is used to extract the pairs of subject (S) and object (O), subject (S) and verb (V), and object (O) and verb (V). In the example above, three pairs are extracted: (나는, 밥을), (나는, 먹다), and (밥을, 먹다).

Next, a dependency parser is used to extract and count the pairs that stand in a dependency relation. In the example above, three pairs are extracted as dependent pairs: (나는, 먹다), (밥을, 먹다), and (밥을, 맛있게).

In the disclosed embodiments, the SVO extractor and the dependency parser make it possible to obtain from the corpus tuples that, like idioms, co-occur with high frequency, and to use them.
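The simple N-gram part of the idiom candidates can be sketched as below for bigrams. The sketch assumes the corpus is already tokenized with predicates in base form, and it omits the SVO extractor and dependency parser, whose pairs could be counted into the same table. The function names and the two-sentence corpus are illustrative only.

```python
from collections import Counter
from typing import List, Tuple

Bigram = Tuple[str, str]


def bigrams(tokens: List[str]) -> List[Bigram]:
    """Adjacent token pairs, in order."""
    return list(zip(tokens, tokens[1:]))


def build_idiom_counts(corpus: List[List[str]]) -> Counter:
    """Count every bigram over a tokenized, base-form corpus."""
    counts: Counter = Counter()
    for tokens in corpus:
        counts.update(bigrams(tokens))
    return counts


def idiom_frequency(tokens: List[str], position: int, counts: Counter) -> int:
    """Sum the counts of the bigrams that contain the token at `position`."""
    return sum(counts[pair]
               for i, pair in enumerate(bigrams(tokens))
               if position in (i, i + 1))


toy_corpus = [["나는", "밥을", "맛있게", "먹다"],
              ["너는", "밥을", "맛있게", "먹다"]]
idiom_counts = build_idiom_counts(toy_corpus)

# Compare two candidates substituted at position 2 of the same sentence.
print(idiom_frequency(["나는", "밥을", "맛있게", "먹다"], 2, idiom_counts))
print(idiom_frequency(["나는", "밥을", "크게", "먹다"], 2, idiom_counts))
```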

The composite regeneration correction module 118 obtains features 1 to 4 above for each of the correction candidate tokens and computes a score as a weighted average of their values. The weighted average is arranged so that the final score is higher the lower the Levenshtein distance is and the higher the other features are. The weight of each feature may be set by the user or optimized through model training. Depending on the embodiment, some of the features may be omitted, or only a single feature may be used.

The complex regeneration correction module 118 selects the correction candidate token with the highest score as the final correction candidate token and corrects the sentence to be corrected by replacing the correction target token with the selected final correction candidate token. If the correction target token is a predicate, the final correction candidate token is inflected appropriately to reflect the conjugation of the predicate.
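The scoring and selection just described can be sketched as follows. The text does not say how the Levenshtein distance is inverted or how the other features are scaled, so this sketch makes two assumptions: the distance d is mapped to 1 / (1 + d), and the remaining features are already normalized to the range 0 to 1. All feature values, weights, and function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def score(target, candidate, features, weights):
    """Weighted average in which a lower distance and higher features win.

    Only the features named in `weights` are used, so a subset of the
    features (or a single feature) can be selected.
    """
    values = dict(features)
    # Assumption: invert the distance so that a smaller distance scores higher.
    values["distance"] = 1.0 / (1.0 + levenshtein(target, candidate))
    total = sum(weights.values())
    return sum(w * values[name] for name, w in weights.items()) / total


def select_best(target, candidates, weights):
    """Return the candidate token with the highest score."""
    return max(candidates,
               key=lambda c: score(target, c, candidates[c], weights))


# Hypothetical, pre-normalized feature values for two candidate tokens.
candidates = {
    "맛있게": {"pos_similarity": 1.0, "word_freq": 0.8, "idiom_freq": 0.9},
    "멋있게": {"pos_similarity": 1.0, "word_freq": 0.6, "idiom_freq": 0.1},
}
weights = {"distance": 1.0, "pos_similarity": 1.0,
           "word_freq": 1.0, "idiom_freq": 1.0}

best = select_best("맛잇게", candidates, weights)
```

With equal weights the first candidate wins, because it is both closer to the misspelled token and more frequent. The distance here is computed per syllable block; computing it over decomposed jamo would be an equally plausible choice.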

In step 612, the complex regeneration correction module 118 outputs the correction result.

FIG. 7 is an exemplary diagram illustrating a process of training the additional correction model 120 according to an embodiment.

The additional correction model 120 according to an embodiment may be used to further correct a sentence when the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the complex regeneration correction module 118 have not corrected it completely. To this end, the additional correction model 120 receives a sentence that has passed through the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the complex regeneration correction module 118, outputs an additionally corrected sentence, and may be trained so that the additionally corrected sentence becomes identical to the correct answer. A language model such as a seq2seq or transformer model may be used as the additional correction model. Using the additional correction model 120 in this way makes it possible to correct fine-grained typos that may remain in the corrected sentence.
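The data flow for this training can be sketched as follows. The sketch only shows how training pairs might be assembled: the input of each pair is the output of the first-pass modules, and the target is the correct answer. The function names and the toy first-pass corrector are hypothetical, and the actual seq2seq or transformer training is not shown.

```python
def build_training_pairs(samples, first_pass):
    """Pair each first-pass output with its correct-answer sentence.

    samples:    iterable of (noisy_sentence, correct_sentence)
    first_pass: callable standing in for modules 114, 116, and 118
    """
    return [(first_pass(noisy), correct) for noisy, correct in samples]


def residual_errors(pairs):
    """Pairs the first pass did not fully fix, which the model must learn."""
    return [(src, tgt) for src, tgt in pairs if src != tgt]


def toy_first_pass(sentence):
    # Hypothetical stand-in that fixes only one of the two typos.
    return sentence.replace("맛잇게", "맛있게")


samples = [("나는 밥을 맛잇게 먹엇다.", "나는 밥을 맛있게 먹었다.")]
pairs = build_training_pairs(samples, toy_first_pass)
remaining = residual_errors(pairs)
```

In this toy case the first pass repairs "맛잇게" but leaves "먹엇다", so the pair remains a useful training example for the additional correction model.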

FIG. 8 is a flowchart illustrating a document correction method 800 according to an embodiment. The illustrated method may be performed by a computing device having one or more processors and a memory storing one or more programs executed by the one or more processors, for example the document correction device 100 described above. Although the flowchart presents the method as a sequence of steps, at least some of the steps may be performed in a different order, performed together in combination with other steps, omitted, divided into sub-steps, or supplemented with one or more steps not shown.

In step 802, the document correction device 100 receives a sentence to be corrected.

In step 804, the document correction device 100 determines whether to correct the sentence to be corrected using one or more of the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the complex regeneration correction module 118.

In step 806, the document correction device 100 corrects the sentence to be corrected based on the result of the determination. Step 806 may further include additionally correcting a first corrected sentence, that is, the sentence obtained by correcting the sentence to be corrected using one or more of the shape regeneration correction module 114, the pronunciation regeneration correction module 116, and the complex regeneration correction module 118.
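Steps 802 to 806 can be sketched as a small pipeline. This is an illustration under stated assumptions, not the disclosed implementation: the determination of step 804 is modeled as a Boolean predicate, the correction modules are modeled as plain string-to-string functions applied in sequence, and every name below is hypothetical.

```python
from typing import Callable, Optional, Sequence

Corrector = Callable[[str], str]


def correct_sentence(sentence: str,
                     needs_correction: Callable[[str], bool],
                     correctors: Sequence[Corrector],
                     additional: Optional[Corrector] = None) -> str:
    """Steps 802 to 806: receive, decide, correct, optionally correct again."""
    # Step 804: decide whether the sentence should be corrected at all.
    if not needs_correction(sentence):
        return sentence
    # Step 806: apply the selected correction modules (assumed sequential).
    for corrector in correctors:
        sentence = corrector(sentence)
    # Optional additional correction of the first corrected sentence.
    if additional is not None:
        sentence = additional(sentence)
    return sentence


# Toy stand-ins for the determination and the correction modules.
def has_typo(s: str) -> bool:
    return "맛잇게" in s or "먹엇다" in s


def fix_adverb(s: str) -> str:
    return s.replace("맛잇게", "맛있게")


def fix_verb(s: str) -> str:
    return s.replace("먹엇다", "먹었다")


result = correct_sentence("나는 밥을 맛잇게 먹엇다.", has_typo,
                          correctors=[fix_adverb], additional=fix_verb)
```

A sentence for which the predicate returns False is passed through unchanged, which mirrors the case in which the determination finds no need for correction.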

FIG. 9 is a block diagram illustrating a computing environment that includes a computing device suitable for use in exemplary embodiments. In the illustrated embodiment, each component may have functions and capabilities different from those described below, and components in addition to those described below may be included.

The illustrated computing environment 10 includes a computing device 12. In one embodiment, the computing device 12 may be the document correction device 100 described above.

The computing device 12 includes at least one processor 14, a computer-readable storage medium 16, and a communication bus 18. The processor 14 may cause the computing device 12 to operate according to the exemplary embodiments described above. For example, the processor 14 may execute one or more programs stored in the computer-readable storage medium 16. The one or more programs may include one or more computer-executable instructions which, when executed by the processor 14, may be configured to cause the computing device 12 to perform operations according to the exemplary embodiments.

The computer-readable storage medium 16 is configured to store computer-executable instructions or program code, program data, and/or other suitable forms of information. A program 20 stored in the computer-readable storage medium 16 includes a set of instructions executable by the processor 14. In one embodiment, the computer-readable storage medium 16 may be a memory (a volatile memory such as random access memory, a non-volatile memory, or a suitable combination thereof), one or more magnetic disk storage devices, optical disk storage devices, flash memory devices, any other form of storage medium that can be accessed by the computing device 12 and can store the desired information, or a suitable combination thereof.

The communication bus 18 interconnects the various components of the computing device 12, including the processor 14 and the computer-readable storage medium 16.

The computing device 12 may also include one or more input/output interfaces 22, which provide an interface for one or more input/output devices 24, and one or more network communication interfaces 26. The input/output interface 22 and the network communication interface 26 are connected to the communication bus 18. The input/output device 24 may be connected to other components of the computing device 12 through the input/output interface 22. Exemplary input/output devices 24 may include input devices such as a pointing device (a mouse, a trackpad, or the like), a keyboard, a touch input device (a touchpad, a touchscreen, or the like), a voice or sound input device, various kinds of sensor devices, and/or an image capturing device, and/or output devices such as a display device, a printer, a speaker, and/or a network card. An exemplary input/output device 24 may be included inside the computing device 12 as one of its components, or may be connected to the computing device 12 as a separate device distinct from the computing device 12.

Although the present invention has been described in detail above through representative embodiments, those of ordinary skill in the art to which the present invention pertains will understand that various modifications can be made to the embodiments described above without departing from the scope of the present invention. The scope of rights of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims set out below and by their equivalents.

100: document correction device
102: sentence separation module
104: discrimination module
106: correction module
108: additional correction module
110: determination module
112: discrimination model
114: shape regeneration correction module
116: pronunciation regeneration correction module

Claims

A document correction device comprising:
a shape regeneration correction module that corrects a sentence to be corrected based on visual characteristics of the sentence to be corrected;
a pronunciation regeneration correction module that corrects the sentence to be corrected based on auditory characteristics of the sentence to be corrected;
a complex regeneration correction module that corrects the sentence to be corrected based on statistical characteristics of the language to which the sentence to be corrected belongs; and
a determination module that receives the sentence to be corrected and determines whether to correct the sentence to be corrected using at least one of the shape regeneration correction module, the pronunciation regeneration correction module, and the complex regeneration correction module.

The document correction device of claim 1,
wherein the shape regeneration correction module converts the sentence to be corrected into an image and then recognizes text in the converted image using optical character recognition.

The document correction device of claim 2,
wherein the shape regeneration correction module sets the recognized text as the corrected sentence when the number of incomplete characters included in the recognized text is smaller than the number of incomplete characters included in the sentence to be corrected.

The document correction device of claim 2,
wherein the shape regeneration correction module compares the recognized text with the sentence to be corrected to identify a changed region, and sets the recognized text as the corrected sentence when the frequency of occurrence of the changed region is higher than the frequency of occurrence of the region before the change.

The document correction device of claim 1,
wherein the pronunciation regeneration correction module generates voice data from the sentence to be corrected using a speech synthesis unit (text-to-speech) and then converts the voice data into text using a speech recognition unit (speech-to-text).

The document correction device of claim 5,
wherein the pronunciation regeneration correction module sets the converted text as the corrected sentence when the number of incomplete characters included in the converted text is smaller than the number of incomplete characters included in the sentence to be corrected.

The document correction device of claim 5,
wherein the pronunciation regeneration correction module compares the converted text with the sentence to be corrected to identify a changed region, and sets the converted text as the corrected sentence when the frequency of occurrence of the changed region is higher than the frequency of occurrence of the region before the change.

The method of claim 1,
The complex regeneration correction module,
Dividing the sentence to be corrected into a plurality of morphemes;
Selecting one or more correction target tokens from among the plurality of morphemes;
A document proofreading device that generates one or more proofreading candidate tokens for each proofreading target token.

The document correction device of claim 8,
wherein the complex regeneration correction module converts a divided morpheme into its base form when the part of speech of that morpheme corresponds to a predicate.

The document correction device of claim 8,
wherein the complex regeneration correction module selects, as the correction target token, a morpheme among the plurality of morphemes that is not found in a preset dictionary database.

The document correction device of claim 10,
wherein the complex regeneration correction module inputs the sentence to be corrected into a preset language model to predict a token corresponding to the position of the morpheme not found in the dictionary database, and selects the morpheme not found in the dictionary database as the correction target token when the language model does not predict a token corresponding to that morpheme.

The document correction device of claim 8,
wherein the complex regeneration correction module inputs the sentence to be corrected into a preset language model to predict a token corresponding to the position of the correction target token, and sets N prediction tokens predicted by the language model (N being a natural number equal to or greater than 1) as the correction candidate tokens for the correction target token.

The document correction device of claim 12,
wherein the complex regeneration correction module adds, to the correction candidate tokens, a token whose edit distance from the correction target token is within K (K being a natural number equal to or greater than 1).

The document correction device of claim 12,
wherein the complex regeneration correction module:
generates one or more candidate base forms from the correction target token when the correction target token corresponds to a predicate; and
adds, to the correction candidate tokens, a token whose edit distance from each of the one or more candidate base forms is within K (K being a natural number equal to or greater than 1).

The document correction device of any one of claims 12 to 14,
wherein the complex regeneration correction module excludes, from the correction candidate tokens, any token that is not found in a preset dictionary or corpus.

The document correction device of claim 8,
wherein the complex regeneration correction module, for each of the one or more correction candidate tokens, calculates at least one of a Levenshtein distance from the correction target token, a part-of-speech similarity with the correction target token, a word frequency in a corpus, and an idiom frequency in the corpus, and, based on the calculated values, selects from among the one or more correction candidate tokens a token to replace the correction target token.

The document correction device of claim 1,
further comprising an additional correction module that additionally corrects a first corrected sentence obtained by correcting the sentence to be corrected using at least one of the shape regeneration correction module, the pronunciation regeneration correction module, and the complex regeneration correction module.

The document correction device of claim 17,
wherein the additional correction module receives the first corrected sentence, outputs a second corrected sentence obtained by further correcting it, and is trained so that the second corrected sentence matches a preset correct-answer sentence.

A method performed on a computer, the method comprising:
receiving a sentence to be corrected;
determining whether to correct the sentence to be corrected using one or more of:
a shape regeneration correction module that corrects the sentence to be corrected based on visual characteristics of the sentence to be corrected,
a pronunciation regeneration correction module that corrects the sentence to be corrected based on auditory characteristics of the sentence to be corrected, and
a complex regeneration correction module that corrects the sentence to be corrected based on statistical characteristics of the language to which the sentence to be corrected belongs; and
correcting the sentence to be corrected based on a result of the determination.

The method of claim 19,
wherein the correcting further comprises additionally correcting a first corrected sentence obtained by correcting the sentence to be corrected using at least one of the shape regeneration correction module, the pronunciation regeneration correction module, and the complex regeneration correction module.