KR20090055087A

KR20090055087A - Method and system for evaluating document image automatically for optical character recognition

Info

Publication number: KR20090055087A
Application number: KR1020070121819A
Authority: KR
Inventors: 윤병훈; 강재우; 김원용
Original assignee: 엔에이치엔(주)
Priority date: 2007-11-28
Filing date: 2007-11-28
Publication date: 2009-06-02
Also published as: KR100923935B1

Abstract

A method and system for automatically evaluating a document image for OCR (Optical Character Recognition) are provided. Features of the document image are extracted, and an evaluation score and feedback are presented to the user so that a clear image can be obtained. The system (100) comprises a feature data extractor (110), a feature data evaluating unit (120), a scanner control unit (130), an interface unit (140), a communication unit (150), and a control unit (160). The feature data extractor extracts, from a document image, feature data representing attributes of the document image that are related to the character recognition rate of OCR. The feature data evaluating unit evaluates the extracted feature data. The scanner control unit feeds a scanner control method back to the user so that the scanner can be reset according to the evaluated feature data.

Description

METHOD AND SYSTEM FOR EVALUATING DOCUMENT IMAGE AUTOMATICALLY FOR OPTICAL CHARACTER RECOGNITION

The present invention relates to a method and system for automatically evaluating a document image for OCR (Optical Character Recognition). More specifically, it relates to a method and system in which features of a document image are extracted before recognition and evaluated as numerical scores per item, so that the character recognition rate of the document image can be predicted quickly. The per-item scores and the expected recognition rate are fed back to the user, who is thereby guided either to adjust the scanner's factors or to proceed to the OCR preprocessing stage, so that the character recognition rate of the document image is improved simply and without trial and error.

In general, OCR (Optical Character Recognition) technology, which scans a document and recognizes the resulting document image, is used to digitize documents recorded on paper. Even with OCR, however, a perfect image is hard to obtain because of the preservation state of the document itself, such as discoloration, and noise introduced during scanning, so a high character recognition rate cannot readily be expected.

Document recognition systems commercialized in Korea in recent years tend to focus on Hangul recognition. They commonly exploit the structural properties of Hangul by separating each character into its initial consonant, medial vowel, and final consonant and then attempting recognition at the grapheme level. Such systems function properly only when the original document is scanned at a relatively high resolution and without noise. Consequently, recognition may fail when the original document is not clean, or when the character size and the scanner resolution are poorly matched at scan time.

Accordingly, efforts have long been made to remove noise and other elements unnecessary for recognition through a preprocessing stage before the characters in a document image are recognized. Even after preprocessing, however, elements that interfere with recognition may remain and ultimately lower the final recognition rate of the document. In that case recognition must be attempted again after a large amount of computation has already been wasted, so an efficient process cannot be achieved.

Accordingly, an object of the present invention is to solve the problems of the prior art. Before a document image generated by scanning an original document is recognized, features of the document image are extracted and evaluated as numerical scores per item, so that the character recognition rate can be predicted quickly. The per-item scores are fed back to the user so that the user can reset the scanner's resolution or brightness, correct the skew of the document, or proceed to the actual OCR preprocessing stage, thereby improving the character recognition rate of the document image simply and without trial and error.

The characteristic configuration of the present invention for achieving the above object and for performing the characteristic functions described below is as follows.

According to one aspect of the present invention, there is provided a method for obtaining an expected recognition rate for characters before a recognition process using OCR (Optical Character Recognition) technology is performed on a document image, which is an image file containing characters, the method comprising: (a) extracting from the document image at least one item of feature data representing an attribute of the document image related to the character recognition rate of the OCR, and obtaining, for each extracted item of feature data, a per-item score quantifying how suitable that feature data is for character recognition by the OCR; (b) applying a weight to each per-item score, the weight being set to a larger value the greater the influence that the attribute represented by the feature data has on character recognition of the document image; and (c) summing the weighted per-item scores to provide the expected recognition rate.

According to the present invention, features of a document image are extracted and an evaluation score and feedback are provided to the user, enabling the user to control the scanner directly and obtain a clean image.

In addition, because the feedback presents the extracted features as per-item numerical values, it is useful in the actual OCR preprocessing of the document image. Because the evaluation result is displayed as a score, the recognition rate can be estimated explicitly even before the image is recognized.

The following detailed description refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different from one another, need not be mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the location or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an overall system (100) according to an embodiment of the present invention. The system extracts, from a document image, feature data representing attributes of the document image related to the character recognition rate of OCR, and evaluates the quality of the document image by scoring the feature data per item before the recognizer performs character recognition, so that the character recognition rate of the document image can be predicted in a short time.

Referring to FIG. 1, the overall system (100) may include a feature data extractor (110), a feature data numerical calculator (120), a scanner controller (130), an interface unit (140), a communication unit (150), a control unit (160), and the like.

According to an embodiment of the present invention, at least some of the feature data extractor (110), the feature data numerical calculator (120), the scanner controller (130), the interface unit (140), the communication unit (150), and the control unit (160) may be program modules that are included in a user terminal device or that communicate with a user terminal device (FIG. 1, however, illustrates all of them as being included in the user terminal device). These program modules may be included in the user terminal device in the form of an operating system, application program modules, and other program modules, and may be physically stored on various known storage devices. They may also be stored in a remote storage device capable of communicating with the user terminal device. The program modules encompass, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform particular tasks described below or implement particular abstract data types in accordance with the present invention.

The feature data extractor (110) extracts, from a document image, feature data representing attributes of the document image related to the character recognition rate of OCR. Here, feature data means the degree of skew of the image, the number of clustering iterations during binarization of the image, the ratio of noise contained in the image, the degree to which text zones of the image are set, the difficulty of character segmentation within a text zone, the brightness contrast within the segmented character regions, the number of iterations in the character thinning process, the size of loop characters such as 'ㅇ', and the like.

The feature data numerical calculator (120) quantifies and evaluates the feature data extracted by the feature data extractor (110), and combines the resulting values with predetermined weights to produce a weighted sum.

The scanner controller (130) feeds a scanner control method back to the user so that the scanner can be reset according to the quantified feature data.

The interface unit (140) presents the quantified feature data values, the weighted sum, and the like to the user on the screen of a digital device, and receives the user's response as to whether to proceed with the OCR process.

The communication unit (150) is responsible for transmitting and receiving signals between the component modules inside the system (100) and for exchanging data with various external devices.

The control unit (160) controls the flow of data among the feature data extractor (110), the feature data numerical calculator (120), the scanner controller (130), the interface unit (140), and the communication unit (150). That is, by controlling the signals exchanged between the component modules through the communication unit (150), the control unit (160) causes the feature data extractor (110), the feature data numerical calculator (120), the scanner controller (130), and the interface unit (140) each to perform its own function.

FIG. 2 is a flowchart showing the overall sequence of operations of a system that resets a scanner using a pre-evaluation value of a document image, according to an embodiment of the present invention.

First, a digital device such as a scanner scans an original document to generate a document image (S210).

From the generated document image, features are extracted for evaluation items including the following, and each item is quantified to yield a score (S220): the degree of skew of the image (Skew), the number of clustering iterations during binarization (Binarization Iteration: BI), the ratio of noise contained in the image (Noise Ratio: NR), the degree to which text zones of the image are set (Zone Detection: ZD), the difficulty of character segmentation in a text zone (Segmentation Difficulty: SD), the brightness contrast in the segmented character regions (Segmentation Contrast: SC), the number of iterations in character thinning (Thinning Iteration: TI), and the size of loop characters such as 'ㅇ' (Size of Loop Character: SLC).

Weights, each determined optimally according to the type of document image, are applied to the per-item scores, and the weighted scores are summed into a total score, that is, the expected recognition rate, which is provided to the user as feedback (S230).

The system (100) then detects whether the user has entered a response as to whether to proceed with OCR (S240). When entering the response, the user can refer to the expected recognition rate calculated in step S230.

If the user judges the expected recognition rate to be adequate and enters an instruction to proceed with OCR, the system (100) carries out the OCR process (S250). The OCR process may consist of (1) preprocessing, including grayscale conversion, black-and-white conversion, and skew correction; (2) structure analysis, including layout analysis and the classification and recognition of text, pictures, and tables; (3) character segmentation; (4) character recognition, including feature extraction; and (5) post-processing. Because OCR technology itself is generally known, a detailed description is omitted here. The final recognition rate can be determined from the result of recognizing characters through processes (1) to (5).

Once the final recognition rate is obtained, the correlation between the expected recognition rate calculated in step S230 and the final recognition rate can be examined. Depending on the weights applied to the evaluation items, the expected recognition rate can be made nearly equal to the final recognition rate.

Meanwhile, in step S240 the user may consider the expected recognition rate too low. In that case, to raise the character recognition rate for the document image in the OCR process, the user can adjust the factors of the scanner without starting OCR. The control unit (160) then presents a scanner control method based on the per-item scores calculated in step S220 (S270). For example, referring to the per-item scores, the control unit (160) may present a specific scanner control method such as having the user control the scanner's automatic document feeder, adjust the scanner's brightness, or increase the scanner's resolution. The invention is of course not limited to this; the scanner may instead be adjusted automatically to an appropriate state without this information being presented to the user.

After the scanner's factors have been adjusted according to the information provided in step S260, the process may return to step S210, where the original document is scanned again and the expected recognition rate is obtained anew.
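The scan, score, and feedback loop of FIG. 2 can be sketched as follows. This is only a minimal illustration under stated assumptions: the function name `evaluate_until_accepted`, the callback parameters, and the `max_rounds` bound are hypothetical and do not appear in the source, which places no limit on the number of rescans.

```python
from typing import Callable, Dict, Optional, Tuple


def evaluate_until_accepted(
    scan: Callable[[], object],
    score_items: Callable[[object], Dict[str, float]],
    predict_rate: Callable[[Dict[str, float]], float],
    user_accepts: Callable[[float, Dict[str, float]], bool],
    suggest_adjustments: Callable[[Dict[str, float]], None],
    max_rounds: int = 5,  # hypothetical safety bound, not in the source
) -> Tuple[Optional[object], Optional[float]]:
    """Repeat scan -> score -> feedback until the user accepts the prediction."""
    for _ in range(max_rounds):
        image = scan()                     # S210: scan the original document
        items = score_items(image)         # S220: per-item scores
        rate = predict_rate(items)         # S230: expected recognition rate
        if user_accepts(rate, items):      # S240: user decides whether to run OCR
            return image, rate             # caller would proceed to OCR (S250)
        suggest_adjustments(items)         # present a scanner control method
    return None, None                      # no scan was accepted
```

All five callbacks are placeholders for the modules described above; passing them in keeps the loop independent of any particular scanner or scoring implementation.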

FIG. 3 is a detailed flowchart showing a process of evaluating a document image item by item according to an embodiment of the present invention.

For a document image generated by scanning an original document, a feature value is calculated for each of the following evaluation items (S310): the degree of skew of the image (S302: angle feature extraction), the number of clustering iterations during binarization (S303: BI feature extraction), the ratio of noise contained in the image (S304: NR feature extraction), the degree to which text zones of the image are set (S305: ZD feature extraction), the difficulty of character segmentation in a text zone (S306: SD feature extraction), the brightness contrast in the segmented character regions (S307: SC feature extraction), the number of iterations in character thinning (S308: TI feature extraction), and the size of loop characters such as 'ㅇ' (S309: SLC feature extraction). Once the scores of the evaluation items have been derived, weights are assigned in order of their influence on recognition, and the weighted scores are summed to yield a final evaluation score (that is, the expected recognition rate), for example on a 100-point scale (S311).
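The weighted sum of step S311 can be illustrated with a short sketch. It assumes each item score already lies between 0 and 100 and normalizes by the total weight so that the result stays on a 100-point scale; that normalization, the function name, and any concrete weight values are assumptions for illustration, since the source does not state them.

```python
# Item abbreviations as used in the description.
ITEMS = ("Skew", "BI", "NR", "ZD", "SD", "SC", "TI", "SLC")


def expected_recognition_rate(scores, weights):
    """Weighted sum of per-item scores (each 0-100), kept on a 100-point scale.

    Dividing by the total weight is an assumption made for this sketch.
    """
    missing = [k for k in ITEMS if k not in scores or k not in weights]
    if missing:
        raise KeyError(f"missing items: {missing}")
    total_weight = sum(weights[k] for k in ITEMS)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(scores[k] * weights[k] for k in ITEMS) / total_weight
```

Items with more influence on recognition would be given larger weights, so a poor score on such an item pulls the expected recognition rate down further than the same score on a minor item.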

Specifically, the degree of skew of the document image (S302) is an issue that must be considered in the scanning process. When skew distorts the document image severely, document recognition itself becomes impossible, so the skew must be corrected. FIG. 4A shows a skewed image, for which the expected recognition rate may be calculated to be low; in such a case the image can be adjusted, through feedback, into a correct image with the skew corrected, as shown in FIG. 4B.

The left part of FIG. 5 shows the allowable skew range of a document image, and the right part of FIG. 5 shows an example of extracting angles from four parts of the image.

To recognize a document image, a skewed image must first be restored to its upright orientation, which requires determining the degree of skew as an angle. This angle is usually judged from the skew of the upper part of the document, but that is insufficient as a measure of the skew of the whole document. It is therefore preferable, as shown in the right part of FIG. 5, to detect angles in the top, bottom, left, and right parts of the document and then take their average. An angle among the four that differs greatly from the others may be judged erroneous and excluded, and an angle outside the ±5 degree range may likewise be excluded before the average angle is calculated. A correctly scanned document image generally has a skew close to 0 degrees. Because characters become severely distorted even at a skew of only 1 degree, a document image outside the ±5 degree range may be judged to have been scanned incorrectly and given a score of 0. By convention, angles are positive in the clockwise direction. In evaluating skew, the score may be graded in 0.5 degree steps within the ±5 degree range around 0 degrees.
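The averaging and grading rules above might be sketched as follows. The ±5 degree limit and the 0.5 degree step come from the text; the outlier rule (`outlier_gap`, measured against the lower median), the linear 100-to-0 grading, and the function names are assumptions introduced only for this example.

```python
import math
from statistics import median_low


def mean_skew_angle(angles, limit=5.0, outlier_gap=2.0):
    """Average the top/bottom/left/right angles (clockwise positive).

    Angles outside +/-limit are dropped, as are angles farther than
    outlier_gap (a hypothetical threshold) from the lower median.
    Returns None when no angle is usable.
    """
    in_range = [a for a in angles if abs(a) <= limit]
    if not in_range:
        return None
    mid = median_low(in_range)  # always one of the measured angles
    kept = [a for a in in_range if abs(a - mid) <= outlier_gap]
    return sum(kept) / len(kept)


def skew_score(angle, limit=5.0, step=0.5):
    """Grade the skew in 0.5-degree steps; 0 when outside +/-limit."""
    if angle is None or abs(angle) > limit:
        return 0.0
    steps = math.floor(abs(angle) / step + 1e-9)  # completed steps away from 0
    total = round(limit / step)
    return 100.0 * (total - steps) / total
```

Using the lower median as the reference guarantees that at least one angle survives the outlier filter, so the average is always defined when any angle is within range.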

Step S303 of FIG. 3 concerns the number of clustering iterations during binarization of the image (Binarization Iteration, BI). To recognize a document image, a color (RGB) image is converted to a grayscale image, which is in turn converted to a binary image consisting of black, representing characters, and white, representing the non-character background.

The basic method of converting a gray image with levels 0 to 255 into a binary image is global thresholding, which uses the mean of the whole image as the threshold. Because character and background colors must be clearly distinguished in a binary image, however, the fixed threshold of global thresholding does not always give good results. To address this, the conversion may be performed using the Foreground and Background Clustering (FBC) technique, which searches for an adaptive threshold while clustering the character and background colors of the image, as disclosed in Andreas E. Savakis, "Adaptive Document Image Thresholding Using Foreground and Background Clustering," Proceedings of the IEEE International Conference on Image Processing (ICIP'98), 1998.

For images in which characters are hard to distinguish from the background because of discoloration or a dark background color, a good binary image is difficult to obtain. In the FBC technique described above, character and background pixels are clustered while the mean of each cluster is sought. When a discolored or dark-background image is converted by the FBC method, more clustering passes are needed than for an ordinary image. The binarization evaluation calculates a score based on the number of iterations of the clustering process.

FIG. 6 shows pseudocode that returns the number of clustering iterations. First, the mean of the whole image is computed, and pixels are classified as black or white on that basis. The pixels are then reclassified as black or white using the mean of the pixels classified as black and the mean of the pixels classified as white. This process is repeated until neither mean changes. The evaluation score is inversely proportional to the number of FBC iterations.
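The iteration count described above can be approximated with the following sketch. It is not the pseudocode of FIG. 6, which is not reproduced in this text, nor the full FBC algorithm: it assumes that reclassification assigns each pixel to the nearer of the two cluster means (equivalent to thresholding at their midpoint), and `max_iter` is a hypothetical safety bound.

```python
def clustering_iterations(pixels, max_iter=100):
    """Count clustering passes until the black and white means stop changing.

    pixels: flat, non-empty sequence of gray levels (0-255).
    """
    threshold = sum(pixels) / len(pixels)       # start from the global mean
    previous = None
    for iteration in range(1, max_iter + 1):
        dark = [p for p in pixels if p <= threshold]
        light = [p for p in pixels if p > threshold]
        if not dark or not light:               # one cluster is empty
            return iteration
        means = (sum(dark) / len(dark), sum(light) / len(light))
        if means == previous:                   # both means unchanged
            return iteration
        previous = means
        threshold = (means[0] + means[1]) / 2   # assumed reassignment rule
    return max_iter
```

Under these assumptions a cleanly bimodal image converges in two passes, while an image with intermediate gray levels needs more, which is the behavior the BI score is meant to capture.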

Step S304 of FIG. 3 concerns the ratio of noise contained in the image (Noise Ratio: NR).

When noise is mixed into the character portion of a document image, a character may be recognized as an entirely different character. It must be determined whether such noise is part of a character, and if it is not, the noise must be removed. Noise removal techniques broadly include low-pass filtering, median filtering, and smoothing. Among these, the median filter can be effective for preserving the strong edges and fine details of the original image, although the invention is not limited to it.

Even a median filter cannot remove noise completely, however. The noise evaluation item therefore calculates a score from the ratio of pixels judged to be noise in the image to which the median filter has been applied. The noise ratio can be expressed as follows.
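The formula itself is not preserved in this text. One plausible reading, offered only as an assumption, is NR = (number of pixels judged to be noise) / (total number of pixels). The sketch below further assumes that a pixel counts as noise when a 3x3 median filter changes its value; the source does not specify how noise pixels are identified, and the function names are hypothetical.

```python
from statistics import median


def median_filter_3x3(img):
    """Apply a 3x3 median filter to a 2D list; border pixels are left as-is."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = median(window)
    return out


def noise_ratio(img):
    """Fraction of pixels whose value the median filter changes (assumed rule)."""
    filtered = median_filter_3x3(img)
    h, w = len(img), len(img[0])
    changed = sum(img[y][x] != filtered[y][x]
                  for y in range(h) for x in range(w))
    return changed / (h * w)
```

A lower ratio would then map to a higher NR score; the mapping from ratio to score is left open here because the source does not give it.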

Step S305 of FIG. 3 concerns the degree to which text zones of the image are set (Zone Detection, ZD).

When the zones of a document image are classified using projection profiles, a picture zone should be set as a single zone, and a text zone usually comprises one or more paragraphs. Because of irregular character spacing, line spacing, and the like, however, a zone is sometimes set too small or too large. The left part of FIG. 7 shows an example of text zones set too small by zone classification, and the right part of FIG. 7 shows a correctly set example.

That is, in step S305, after the skew of the document image has been corrected by preprocessing, the document image can be divided into meaningful zones, each of which may contain text or graphic information (pictures, tables, lines, and so on). To this end, step S305 may consist of a step of extracting the geometric structure of the document image and dividing the image into a set of zones, and a zone classification step of examining the characteristics of each zone and classifying it as a text or non-text zone. Zone classification generally projects the image in the horizontal and vertical directions to generate profiles and divides the zones on that basis.
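Projection-profile splitting of a binary image can be sketched as below. The representation (a 2D list with 1 for foreground), the `min_gap` parameter, and both function names are assumptions for illustration; a real zone detector would apply the two profiles recursively and then classify each zone.

```python
def projection_profile(img, axis):
    """Count foreground (1) pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in img]
    return [sum(col) for col in zip(*img)]


def split_runs(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of non-empty runs.

    Runs are separated by at least min_gap consecutive empty entries.
    """
    runs, start, gap = [], None, 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                runs.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:
        runs.append((start, len(profile) - gap))
    return runs
```

Raising `min_gap` merges lines or words that are separated by narrow gaps, which is one way irregular spacing leads to zones that are too large or too small.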

In the evaluation based on this region classification, the ratio of regions whose width and height are at or below a reference value, among all regions set by the classification of the document image, is converted into a score. In other words, the ratio of incorrectly set regions is evaluated.
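A minimal sketch of this ratio is given below. The function names, the `(width, height)` tuple representation of regions, the reference values, and the linear conversion of the ratio into a 0 to 100 score are all assumptions for illustration; the description does not specify how the ratio is converted into a score.

```python
def zone_detection_ratio(regions, ref_width, ref_height):
    """Fraction of regions whose width AND height are at or below the reference.

    `regions` is assumed to be a list of (width, height) tuples in pixels.
    """
    if not regions:
        return 0.0
    too_small = sum(
        1 for width, height in regions
        if width <= ref_width and height <= ref_height
    )
    return too_small / len(regions)


def zone_detection_score(regions, ref_width, ref_height):
    """Hypothetical score: 100 when no region is too small, 0 when all are."""
    return 100.0 * (1.0 - zone_detection_ratio(regions, ref_width, ref_height))


# Two of the four regions fall at or below the 20 x 10 reference size.
regions = [(200, 80), (10, 8), (150, 60), (12, 9)]
ratio = zone_detection_ratio(regions, ref_width=20, ref_height=10)   # 0.5
score = zone_detection_score(regions, ref_width=20, ref_height=10)   # 50.0
```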

Step S306 of FIG. 3 evaluates the degree of character segmentation in the text area (Segmentation Difficulty, SD).

The text area extracted through document structure analysis is divided into characters, which are the units of recognition. In general, the projection profile method can be applied, or the connected component analysis method described in Oh Il-Seok, Kim Soo-Hyung, and two others, "Document Image Processing Technology and Digital Library," Journal of Information Science (정보과학회지), Vol. 20, No. 8, pp. 24-34, 2002. This is described in detail below with reference to FIGS. 8 and 9.

FIG. 8 shows the result of character-level segmentation in a text area.

A region classified as a character region during region segmentation must go through character segmentation so that the recognizer can process it. Character segmentation is the final step of preprocessing and has the greatest influence on recognition, so abnormal character segmentation can produce entirely different recognition results. For Korean (Hangul) characters, the width and height of each segmented rectangle should be similar.

FIG. 9 shows abnormal character segmentation in which characters are merged because the character spacing is too narrow. When '스' and '피' are merged, the height and width of the segmentation rectangle differ considerably.

The character segmentation evaluation attempts character segmentation on the regions set as text by region classification and identifies abnormally segmented characters. The score is then calculated from the ratio of abnormal character segments to all character segments.
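The ratio of abnormal character segments can be sketched as follows, using the criterion stated later in the claims (a segment is abnormal if its width or height differs from the average character width or height by at least a threshold). The function name, the `(width, height)` box representation, and the choice to compute the averages from the same list of boxes are assumptions for illustration.

```python
def abnormal_segment_ratio(boxes, threshold):
    """Fraction of character boxes whose width or height deviates from the
    average by at least `threshold` pixels.

    `boxes` is assumed to be a list of (width, height) tuples, one per
    segmented character; the averages are taken over these boxes.
    """
    if not boxes:
        return 0.0
    mean_width = sum(width for width, _ in boxes) / len(boxes)
    mean_height = sum(height for _, height in boxes) / len(boxes)
    abnormal = sum(
        1 for width, height in boxes
        if abs(width - mean_width) >= threshold
        or abs(height - mean_height) >= threshold
    )
    return abnormal / len(boxes)


# Three roughly square characters and one merged pair twice as wide.
boxes = [(20, 20), (21, 20), (19, 20), (40, 20)]
ratio = abnormal_segment_ratio(boxes, threshold=10)  # 0.25
```

In this example the average width is 25, so only the 40-pixel-wide box deviates by 10 pixels or more, which corresponds to the kind of merged characters shown in FIG. 9.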

Step S307 of FIG. 3 evaluates the degree of brightness contrast in the character segmentation regions (Segment Contrast, SC). To evaluate the brightness contrast of the segmented characters, a value is calculated using the following equation.

The closer the value calculated by the equation, that is, the brightness contrast value, is to 1, the more clearly the characters and the background are separated, and the character recognition rate of the document image rises in proportion to the contrast value.

Step S308 of FIG. 3 evaluates the degree of iteration in the character thinning process (Thinning Iteration, TI). It measures the number of thinning iterations needed to reduce the thickness of a character to 1 during thinning, and is one way of measuring the thickness of characters. FIG. 10 shows the character thinning process. The left image of FIG. 10 shows a single character extracted from the document image, the middle image shows the extracted character after normalization, and the right image shows the character thinned to a thickness of 1.
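The idea of counting thinning iterations can be illustrated with the sketch below. The description does not name a thinning algorithm, so the well-known Zhang-Suen thinning scheme is used here as a stand-in; the function name, the 0/1 list-of-rows image format, the requirement of a one-pixel background border, and the convention of counting one iteration per full pass that removes at least one pixel are all assumptions.

```python
def thinning_iterations(image):
    """Thin a binary character image with Zhang-Suen thinning.

    Returns (iterations, thinned_image), where `iterations` is the number of
    full passes (two sub-steps each) that removed at least one pixel.
    `image` is a list of rows of 0/1 values with a background (0) border.
    """
    img = [row[:] for row in image]  # work on a copy
    rows = len(img)
    cols = len(img[0]) if rows else 0

    def neighbours(r, c):
        # P2..P9: clockwise, starting from the pixel directly above.
        return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1],
                img[r + 1][c + 1], img[r + 1][c], img[r + 1][c - 1],
                img[r][c - 1], img[r - 1][c - 1]]

    iterations = 0
    while True:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(r, c)
                    p2, _, p4, _, p6, _, p8, _ = n
                    count = sum(n)  # number of foreground neighbours
                    # Number of 0 -> 1 transitions around the pixel.
                    transitions = sum(
                        1 for i in range(8)
                        if n[i] == 0 and n[(i + 1) % 8] == 1
                    )
                    if step == 0:
                        ok = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        ok = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if 2 <= count <= 6 and transitions == 1 and ok:
                        to_clear.append((r, c))
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
        if not changed:
            return iterations, img
        iterations += 1
```

A stroke that is already one pixel thick yields zero iterations, while thicker strokes need one or more passes, so the count can serve as a rough indicator of character thickness.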

Step S309 of FIG. 3 measures the size of characters that contain a loop shape such as 'ㅇ' (Size of Loop Character, SLC). FIG. 11 shows the SLC process.

The upper part of FIG. 11 shows the process of dividing characters into grapheme (jaso) units through connected component analysis, and the lower part of FIG. 11 shows the result of extracting loop characters such as ㅁ, ㅂ, ㅇ(ㅎ), and ㅍ from the divided graphemes. The SLC feature value can be determined from the measured size of the loops.

FIG. 12 is an example of a result screen obtained by evaluating one document image with the automatic evaluation method, using the evaluation items corresponding to steps S302 to S309 of FIG. 3.

The left side of FIG. 12 shows the document image divided into two text areas and one picture area.

Embodiments according to the present invention may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and constructed for the present invention, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the present invention, and vice versa.

As described above, the present invention has been explained with reference to specific matters such as particular components and to limited embodiments and drawings, but these are provided only to aid an overall understanding of the invention. The present invention is not limited to the above embodiments, and a person of ordinary skill in the art to which the invention pertains can make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be confined to the described embodiments; not only the claims set out below but also everything that is equal or equivalent to those claims falls within the scope of the spirit of the invention.

FIG. 1 is a block diagram of an automatic evaluation system for document images for OCR.

FIG. 2 is a flowchart of a process for evaluating a document image.

FIG. 3 is a flowchart of a process for obtaining an expected recognition rate using a plurality of feature data reflecting the attributes of a document image.

FIG. 4A shows a skewed image, and FIG. 4B shows the image after correction.

FIG. 5 shows an example of determining the degree of skew of a skewed image.

FIG. 6 shows pseudocode that returns the number of clustering iterations.

FIG. 7 shows a region classification result for a document image.

FIG. 8 shows the result of character-level segmentation in a text area of a document image.

FIG. 9 shows an example of abnormal character segmentation in which characters are merged because the character spacing is too narrow.

FIG. 10 shows the character thinning process.

FIG. 11 shows an example of detecting loop portions through the SLC process.

FIG. 12 is an example of a result screen obtained by evaluating one document image with the automatic evaluation method to which the evaluation items are applied.

<Description of reference numerals for the main parts of the drawings>

110: feature data extraction unit

120: feature data numerical calculation unit

130: scanner control unit

140: interface unit

150: communication unit

160: control unit

Claims

A method for obtaining an expected character recognition rate for a document image, which is an image file containing characters, before performing a recognition process on the document image using an optical character recognition (OCR) technique, the method comprising:

(a) extracting from the document image at least one piece of feature data indicating an attribute of the document image related to the character recognition rate of the OCR, and obtaining, for each extracted piece of feature data, an item-specific score quantifying the extent to which the feature data is in a state suitable for character recognition by the OCR;

(b) applying a weight to each item-specific score, wherein the weight is set to a larger value as the attribute of the feature data has a greater influence on character recognition of the document image; and

(c) summing the weighted item-specific scores to provide the expected recognition rate.

The method of claim 1, wherein step (c) further comprises providing the item-specific scores.

The method of claim 2, wherein the document image is generated by scanning an original document that is not digital data.

The method of claim 2, further comprising (d) adjusting factors of a scanner, with reference to the expected recognition rate and the item-specific scores, in order to regenerate the document image.

The method of claim 2, further comprising (d) presenting a scanner control method to the user so that the scanner can be reset according to the expected recognition rate and the item-specific scores.

The method of claim 5, wherein resetting the scanner comprises at least one of controlling the automatic document feeder of the scanner, adjusting the brightness of the scanner, or increasing the resolution of the scanner.

The method of claim 2, further comprising (d) performing an OCR process if the expected recognition rate is greater than or equal to a predetermined threshold.

The method of claim 2, further comprising (d) receiving from the user, after the expected recognition rate has been provided to the user, a decision as to whether to perform an OCR process.

The method of claim 1, wherein, in step (a), the feature data includes at least one of: the skew of the image; the degree of clustering iteration during binarization of the image (BI); the ratio of noise included in the image (NR); the degree of text-area setting of the image (ZD); the degree of character segmentation in the text area (SD); the degree of brightness contrast in the character segmentation regions (SC); the degree of iteration in the character thinning process (TI); and the size of loop characters such as 'ㅇ' (SLC).

The method of claim 9, wherein the skew of the document image is determined by detecting the angles of four portions (Left, Right, Top, Bottom) of the document image and referring to the average of these angles.

The method of claim 9, wherein the degree of clustering iteration (BI) during binarization of the document image is determined by referring to the number of clustering iterations in a binarization process using FBC.

The method of claim 9, wherein the ratio of noise (NR) included in the document image is determined by referring to the ratio of noise in the entire document image.

The method of claim 9, wherein the degree of text-area setting (ZD) of the document image is determined by referring to the ratio, among all regions, of regions set by the classification of the document image whose width and height are at or below a reference value.

The method of claim 9, wherein the degree of character segmentation (SD) in the text area of the document image is determined by referring to the ratio of abnormal character segmentation regions, where, after character segmentation has been performed on the text area, a segmented character region is judged to be an abnormal character segmentation region if its width or height differs from the average width or height of the characters included in the document image by a predetermined threshold or more.

The method of claim 9, wherein the degree of brightness contrast (SC) in the character segmentation regions of the document image is determined with reference to the calculated value of the equation.

The method of claim 9, wherein the degree of iteration (TI) in the character thinning process of the document image is determined by the number of thinning iterations needed to make the thickness of a character 1.

The method of claim 9, wherein the size of loop characters (SLC) of the document image is determined by dividing characters into grapheme (jaso) units through connected component analysis and extracting the loop portions.

A computer-readable medium on which is recorded a computer program for executing the method according to any one of claims 1 to 17.