KR101028670B1

KR101028670B1 - A method, system and computer readable recording medium for recognizing character strings contained in a document using a language model and OCR

Info

Publication number: KR101028670B1
Application number: KR1020080103890A
Authority: KR
Inventors: 양병석; 서희철; 윤병훈; 성기준; 이도길
Original assignee: 엔에이치엔(주)
Priority date: 2008-10-22
Filing date: 2008-10-22
Publication date: 2011-04-12
Anticipated expiration: 2028-10-22
Also published as: JP2010102709A; KR20100044668A

Abstract

The present invention relates to a method, a system, and a computer-readable recording medium for recognizing character strings included in a document consisting of an image area and a text area. According to one aspect of the present invention, there is provided a method comprising: (a) analyzing the document structure of the document and classifying it into a text area and an image/noise area; (b) recognizing a character string included in the text area using a first OCR; (c) finding, through a language model, a character string included in a specific area that was incorrectly classified as a text area, and reclassifying the specific area as the image/noise area with reference to position information for the specific area obtained from the first OCR; and (d) recognizing, using a second OCR, the character strings included in the image/noise areas classified in steps (a) and (c).

OCR, Character Recognition, Language Model

Description

METHOD, SYSTEM, AND COMPUTER-READABLE RECORDING MEDIUM FOR RECOGNIZING CHARACTERS INCLUDED IN A DOCUMENT BY USING LANGUAGE MODEL AND OCR

The present invention relates to a method, a system, and a computer-readable recording medium for recognizing character strings included in a document using a language model and OCR. More specifically, it relates to a method, a system, and a computer-readable recording medium that remove text noise contained in an OCR result through a language model, identify image regions using the OCR recognition result and the language model, and perform recognition on regions judged to be images using an OCR engine specialized for images.

Recently, with the rapid spread of digital storage media, work to digitize documents that previously existed on paper has been actively pursued. This trend is accelerating further with the development of optical character recognition (OCR), a technology for automatically recognizing the characters contained in a document.

When images and text coexist in a document, the text area and the image area must be distinguished for character recognition, but distinguishing them has not been easy.

There are several ways to recognize character strings included in a document, one of which is to use a language model. A language model is a method that, based on a dictionary, frequency of use, probability of use, and the like, produces the grammatically or probabilistically most likely output for a number of input strings. Such a language model is disclosed in, for example, Korean Patent Application Publication No. 2006-46128, "Low Resolution OCR for Camera Input Documents", and is widely used in speech recognition methods and systems.

However, when part of an image area is inserted into a text area, even such a conventional language model still outputs the grammatically or probabilistically most likely result, so the recognition result becomes very messy. In practice, document structure analysis, that is, accurately dividing a document into image areas and text areas, is technically difficult, so this problem could occur frequently.

Accordingly, an object of the present invention is to solve all of the above problems of the prior art and, in order to recognize more accurately the characters included in a document consisting of image/noise areas and text areas, to make it possible to identify image/noise areas wrongly incorporated into a text area by referring to analysis through a language model and to information on where the characters input to the OCR device are located within the whole document.

Another object of the present invention is to enable successful character recognition, in a document consisting of image/noise areas and text areas, by applying image-specialized OCR technology to the characters contained in areas that have been identified with high accuracy as image/noise areas.

The characteristic configuration of the present invention for achieving the objects described above and realizing the characteristic effects described below is as follows.

According to one aspect of the present invention, there is provided a method for recognizing character strings included in a document, the method comprising: (a) analyzing the document structure of the document and classifying it into a text area and an image/noise area; (b) recognizing a character string included in the text area using a first OCR; (c) finding, through a language model, a character string included in a specific area that was incorrectly classified as a text area, and reclassifying the specific area as the image/noise area with reference to position information for the specific area obtained from the first OCR; and (d) recognizing, using a second OCR, the character strings included in the image/noise areas classified in steps (a) and (c).
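Steps (a) through (d) can be sketched as a small control flow. This is only an illustrative outline, not the patented implementation: the names `Zone`, `Word`, `recognize_document`, and the three callables are hypothetical, and for brevity the sketch reclassifies a whole zone in step (c), whereas the method reclassifies a specific area located through the first OCR's position information.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), hypothetical layout


@dataclass
class Zone:
    box: Box
    kind: str  # "text" or "image_noise", as provisionally assigned in step (a)


@dataclass
class Word:
    text: str
    box: Box  # position reported by the OCR engine that produced the word


def recognize_document(
    zones: List[Zone],
    first_ocr: Callable[[Zone], List[Word]],
    second_ocr: Callable[[Zone], List[Word]],
    looks_like_noise: Callable[[List[Word]], bool],
) -> List[Word]:
    """Run steps (b)-(d) over zones already classified in step (a)."""
    results: List[Word] = []
    image_zones = [z for z in zones if z.kind == "image_noise"]
    for zone in (z for z in zones if z.kind == "text"):
        words = first_ocr(zone)  # step (b): text-oriented OCR
        if looks_like_noise(words):  # step (c): language-model check
            image_zones.append(Zone(zone.box, "image_noise"))
        else:
            results.extend(words)
    for zone in image_zones:  # step (d): image-oriented OCR
        results.extend(second_ocr(zone))
    return results
```

Passing the OCR engines and the language-model check in as callables keeps the sketch independent of any particular OCR library.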

According to another aspect of the present invention, there is provided a system for recognizing character strings included in a document consisting of a text area and an image/noise area, the system comprising: a first OCR unit that recognizes character strings included in the text area using a first OCR; a second OCR unit that recognizes character strings included in the image/noise area using a second OCR; and a document structure analysis unit that analyzes the document structure of the document to provisionally classify it into a text area and an image/noise area, then finds, through a language model, a character string included in a specific area that was incorrectly classified as a text area, and reclassifies the specific area as the image/noise area with reference to position information for the specific area obtained from the first OCR unit.

In addition, other methods, other systems, and a computer-readable recording medium on which a computer program for executing the methods is recorded are further provided.

The effects of the present invention achieved by its characteristic configuration are as follows.

1. According to the present invention, character recognition on a document consisting of image areas and text areas can be performed with higher accuracy than with a conventional commercial OCR program.

2. According to the present invention, the image areas and text areas included in an arbitrary document can be accurately distinguished, so a text-only OCR and an image-only OCR can each be applied where appropriate.

The following detailed description of the invention refers to the accompanying drawings, which show, by way of illustration, specific embodiments in which the invention may be practiced. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention. It should be understood that the various embodiments of the invention, although different, are not necessarily mutually exclusive. For example, a particular shape, structure, or characteristic described herein in connection with one embodiment may be implemented in other embodiments without departing from the spirit and scope of the invention. It should also be understood that the position or arrangement of individual components within each disclosed embodiment may be changed without departing from the spirit and scope of the invention. The following detailed description is therefore not to be taken in a limiting sense, and the scope of the invention, if properly described, is limited only by the appended claims together with the full range of equivalents to which those claims are entitled. In the drawings, like reference numerals refer to the same or similar functions throughout the several aspects.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the invention pertains can easily practice it.

For reference, this specification assumes a general text-specialized OCR as the OCR that performs optical character recognition on the text area, and an image-specialized OCR as the OCR that performs optical character recognition on characters included in the image area. The invention is not necessarily limited to this, however: adopting, as the OCR for the text area, an OCR usable for both text and images or some other type of OCR, or adopting, as the OCR for characters in the image area, an OCR usable for both images and text or some other type of OCR, also falls within the scope of the present invention.

[Preferred Embodiments of the Invention]

FIG. 1 is a diagram illustrating the configuration of an optical character recognition system 100 according to an embodiment of the present invention.

Referring to FIG. 1, the optical character recognition system 100 may include a document information input unit 110, a document structure analysis unit 120, a text OCR unit 130, an image OCR unit 140, a control unit 150, and a communication unit 160. According to an embodiment of the present invention, at least some of the document information input unit 110, the document structure analysis unit 120, the text OCR unit 130, the image OCR unit 140, the control unit 150, and the communication unit 160 may be program modules that communicate with an external terminal device, an external server, or the like. These program modules may be included in the optical character recognition system 100 as an operating system, application program modules, and other program modules, and may be physically stored on various known storage devices. They may also be stored in a remote storage device capable of communicating with the optical character recognition system 100. Such program modules encompass, but are not limited to, routines, subroutines, programs, objects, components, and data structures that perform the particular tasks described below or implement particular abstract data types in accordance with the present invention.

The optical character recognition system 100 according to one embodiment of the present invention may be included in or connected to an image data generating device such as a scanner or a camera. The optical character recognition system 100 according to another embodiment may be included in or connected to a digital device such as a personal computer (for example, a desktop, notebook, tablet, or palmtop computer), a workstation, a PDA, a web pad, or a mobile phone. Here, the communication network may be configured regardless of its communication mode, whether wired or wireless, and may consist of various networks such as a local area network (LAN), a metropolitan area network (MAN), or a wide area network (WAN).

The document information input unit 110 according to an embodiment of the present invention may receive, from a digital device, information on a document containing text and/or images, and may pass the received document information to the document structure analysis unit 120.

In identifying the structure of a document, the document structure analysis unit 120 according to an embodiment of the present invention may classify the whole area of the document into text areas, image areas, and the like. The document structure is analyzed in units of zones, and such a unit generally corresponds closely to a paragraph of text. The document structure analysis unit 120 is described in more detail below.

The text OCR unit 130 according to an embodiment of the present invention recognizes the characters included in a text area. As shown in FIG. 2, the text OCR unit 130 may include, but is not necessarily limited to, a segmentation unit 131, a character normalization unit 132, and a character recognition unit 133.

The segmentation unit 131 according to an embodiment of the present invention may divide a character string included in a text area into individual characters.

Specifically, the segmentation unit 131 may split the lines contained in a text area by projecting the text area along its line spacing (projection); apply the Connected Component Labeling technique to the split lines to recognize punctuation marks such as commas, periods, exclamation marks, colons, semicolons, parentheses, and quotation marks; re-split the words separated around the punctuation marks on the basis of spaces; and split them into characters according to the characteristics of the language concerned. The segmentation unit 131 is of course not limited to these functions and may be implemented with various modifications.
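The projection-based line splitting mentioned above could look roughly like the following. It is a minimal sketch under stated assumptions: the zone is given as a binarized bitmap (a list of rows, 1 = ink), a row with no ink counts as inter-line space, and the function name `split_lines` is hypothetical. Punctuation detection by connected component labeling is not covered here.

```python
from typing import List, Tuple


def split_lines(bitmap: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (start_row, end_row_exclusive) for each run of rows containing ink."""
    lines: List[Tuple[int, int]] = []
    start = None
    for y, row in enumerate(bitmap):
        has_ink = sum(row) > 0  # horizontal projection of this row
        if has_ink and start is None:
            start = y  # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(bitmap)))  # line touching the bottom edge
    return lines
```

The same idea applied to columns within one line would split it at blank columns, which is one possible way to separate words or characters.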

Meanwhile, the character normalization unit 132 according to an embodiment of the present invention may normalize the divided characters to a specific ratio, and the character recognition unit 133 may recognize the normalized characters.
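The text does not say how characters are normalized. One common reading, assumed here purely for illustration, is to rescale each segmented glyph to a fixed square size so the recognizer always sees the same dimensions. The function name `normalize_glyph` and the nearest-neighbour sampling are assumptions, not details from the source.

```python
from typing import List


def normalize_glyph(glyph: List[List[int]], size: int = 8) -> List[List[int]]:
    """Rescale a non-empty binarized glyph to size x size by nearest-neighbour sampling."""
    height, width = len(glyph), len(glyph[0])
    return [
        # Map each target pixel back to the source pixel that covers it.
        [glyph[y * height // size][x * width // size] for x in range(size)]
        for y in range(size)
    ]
```

For example, `normalize_glyph([[1, 0], [0, 1]], size=4)` expands each source pixel into a 2 x 2 block.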

The image OCR unit 140 according to an embodiment of the present invention recognizes the characters included in an image/noise area.

The image OCR unit 140 according to an embodiment of the present invention may be implemented using a known OCR specialized for images. For example, image-specialized OCR may be performed using at least one known technique such as the paper "Automatic Text Location in Natural Scene Images" by Chuang Li et al., published by the IEEE in 2001, or the paper "A Novel Method for Character Segmentation in Natural Scenes" by Li Xu et al., presented by the Department of Computer Science and Engineering of Shanghai JiaoTong University, China (the contents of these papers should be considered incorporated herein in their entirety). The present invention is not, however, to be construed as limited by the known techniques listed above.

The control unit 150 according to an embodiment of the present invention controls the flow of data among the document information input unit 110, the document structure analysis unit 120, the text OCR unit 130, the image OCR unit 140, and the communication unit 160.

The communication unit 160 according to an embodiment of the present invention may enable the optical character recognition system 100 to communicate with external devices such as a scanner or a camera.

Hereinafter, the process by which the optical character recognition system 100 according to an embodiment of the present invention recognizes character strings included in a document consisting of an image area and a text area is described in detail with reference to FIGS. 3 and 4.

1. Distinguishing the text area from the image/noise area

FIG. 3 is a diagram showing in detail the process of recognizing character strings included in a document consisting of a text area and an image area according to an embodiment of the present invention.

First, a step (S110) is performed in which the document structure of the input document is analyzed and the document is provisionally separated into a text area and an image/noise area.

The document structure analysis unit 120 may analyze the structure of the document according to the regions of the binarized image information. For example, it compares the binarized image information with text standard patterns held in a storage device, partitions off regions with high similarity and classifies them as text areas, and partitions off regions with low similarity to the text standard patterns and classifies them as image/noise areas. The text standard patterns are font information for various typefaces; they may be stored in the storage device in the form of a database and consulted while the document structure analysis unit 120 analyzes the document structure and classifies the regions. At this point, information on the positions that the text areas and image/noise areas occupy in the whole document may be stored.
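A toy version of this similarity test is sketched below. The source does not specify the similarity measure or the threshold, so everything here is an assumption for illustration: zones are pre-cut into equally sized binarized cells, similarity is the fraction of agreeing pixels against the best-matching stored pattern, and the names `pixel_similarity` and `classify_zone` and the 0.8 threshold are hypothetical.

```python
from typing import List

Glyph = List[List[int]]  # binarized pixels, 1 = ink


def pixel_similarity(a: Glyph, b: Glyph) -> float:
    """Fraction of pixels that agree between two equally sized bitmaps."""
    total = len(a) * len(a[0])
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / total


def classify_zone(cells: List[Glyph], templates: List[Glyph], threshold: float = 0.8) -> str:
    """Label a zone by how well its cells match the stored text patterns."""
    if not cells:
        return "image_noise"
    # For each cell keep the score of its best-matching template.
    best = [max(pixel_similarity(c, t) for t in templates) for c in cells]
    mean_best = sum(best) / len(best)
    return "text" if mean_best >= threshold else "image_noise"
```

A real analyzer would also record each zone's position on the page, since later steps rely on that information.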

2. Recognizing character strings in the text area

A step (S120) is performed in which the text area classified in the preceding step is analyzed and information on the character strings it contains is recognized.

As mentioned above, in the text OCR unit 130 the segmentation unit 131 splits the character strings in the text area into their constituent characters, the character normalization unit 132 normalizes the split characters to a specific ratio, and the character recognition unit 133 recognizes the normalized characters.

3. Excluding from the text area a specific area that was judged to be text

A step (S130) is performed in which a specific area among the areas judged to be text areas is excluded from the text area.

Before describing step S130, the concept of the language model applied in this step is reviewed. The language model corrects the OCR result. Specifically, to determine whether some region that is actually image content was wrongly classified as a text area and passed through OCR, it checks whether the characters in that region exceed a certain threshold distance; if they do, it reports that the region is an image/noise area misclassified as a text area and removes it from the language model's output data. The distance of the characters in a given region may be computed using, for example, differences in stroke count or stroke position, or may be implemented with reference to various conventional techniques; since this is known in the art, a detailed description is omitted.

Using the language model described above, the document structure analysis unit 120 can judge a specific region within a text area to be an invalid region (one that should not be classified as text) when the characters belonging to it exceed the threshold distance, and remove it from the language model's output data. For this it can rely on the text OCR unit 130. The input to the language model exists only as plain text values, so removing the specific region from the output data requires the position information of the input characters that the text OCR unit 130 holds. By finding a region with an excessively high distance value through the language model and obtaining its position from the OCR, it becomes possible to judge in what units the text area and the image/noise area are best separated.
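The interplay between the language model (which sees only text) and the OCR position information can be illustrated as follows. This is a sketch under loud assumptions: the source computes distance from stroke counts or stroke positions, whereas this example substitutes the edit distance to the nearest word of a small non-empty lexicon as a stand-in. The names `edit_distance`, `lm_distance`, and `find_misclassified_box`, and the threshold of 2, are hypothetical.

```python
from typing import Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) from the text OCR


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def lm_distance(word: str, lexicon: Iterable[str]) -> int:
    """Stand-in language-model distance: distance to the closest lexicon word."""
    return min(edit_distance(word.lower(), entry) for entry in lexicon)


def find_misclassified_box(
    words: List[Tuple[str, Box]], lexicon: Iterable[str], threshold: int = 2
) -> Optional[Box]:
    """Return the bounding box covering every word whose distance exceeds the threshold."""
    lexicon = list(lexicon)
    bad = [box for text, box in words if lm_distance(text, lexicon) > threshold]
    if not bad:
        return None
    # The language model only flags strings; the OCR boxes say where they are.
    return (
        min(b[0] for b in bad),
        min(b[1] for b in bad),
        max(b[2] for b in bad),
        max(b[3] for b in bad),
    )
```

The returned box is the candidate area to hand over to the image/noise side; a slightly misspelled word stays below the threshold and remains text.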

4. Merging the excluded specific area into the image/noise area

A step (S140) is performed in which the specific area excluded from the text area by the document structure analysis unit 120 and the text OCR unit 130 is merged into the image/noise area.

The control unit 150 merges the area classified as image/noise by the document structure analysis unit 120 in step S110 with the area excluded from the text area and reclassified as image/noise in step S130.

For example, referring to FIG. 4, it can be seen that steps S130 and S140 allow an arbitrary document 400 to be accurately divided into a text area 400a and an image/noise area 400b.

5. Recognizing character strings in the image/noise area

A step (S150) is performed in which the merged image/noise area is analyzed by the image OCR unit 140, which includes an optical character reader (OCR) specialized for images, to recognize any character strings that may exist in the image/noise area.

Referring to FIG. 4, it can be seen that the character strings 420 and 440 existing in the image/noise area 400b are recognized through step S150. The areas 410 and 430 within the image/noise area 400b may be mistakenly recognized as containing characters. However, when the language model makes its noise determination in units of words, the strings that the image-specialized OCR produces for these areas 410 and 430 (two quoted strings in the original, not reproduced in this text) can be judged to be noise and removed from the OCR result.
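Word-level noise filtering of the image OCR output might be approximated as below. The source does not say how the language model scores a word, so this sketch substitutes the similarity ratio from Python's standard `difflib` module against a small lexicon; the function name `drop_noise_words` and the 0.7 cutoff are assumptions made for illustration.

```python
import difflib
from typing import List


def drop_noise_words(words: List[str], lexicon: List[str], cutoff: float = 0.7) -> List[str]:
    """Keep only the words that closely match at least one lexicon entry."""
    kept = []
    for word in words:
        # get_close_matches returns [] when nothing reaches the cutoff ratio.
        if difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff):
            kept.append(word)
    return kept
```

With a lexicon such as `["sale", "open", "today"]`, a string like `"SALE"` is kept while symbol runs recognized from picture content are dropped.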

6. Merging the character strings from the text area and the image/noise area

A step (S160) is performed in which the character strings recognized in the text area are merged with those recognized in the image/noise area.

The control unit 150 merges the character strings recognized by the text OCR unit 130 with those recognized by the image OCR unit 140.

Referring to FIG. 4, it can be seen that the character strings in the text area 400a recognized by the text OCR unit 130 and the character strings 420 and 440 in the image area 400b recognized by the image OCR unit 140 can be merged and provided together.
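The source does not state in what order the merged strings are presented. One plausible choice, assumed here only for illustration, is reading order by position: top to bottom, then left to right. The name `merge_recognized` and the `(text, box)` tuple layout are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)
Recognized = Tuple[str, Box]


def merge_recognized(text_words: List[Recognized], image_words: List[Recognized]) -> List[Recognized]:
    """Combine both OCR outputs and sort by top edge, then left edge."""
    combined = list(text_words) + list(image_words)
    return sorted(combined, key=lambda item: (item[1][1], item[1][0]))
```

Sorting by stored position is only possible because each OCR unit keeps the location of what it recognized.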

The embodiments of the present invention described above may be implemented in the form of program instructions executable by various computer components and recorded on a computer-readable recording medium. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the field of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that a computer can execute using an interpreter or the like. The hardware device may be configured to operate as one or more software modules in order to perform the processing according to the present invention, and vice versa.

Although the present invention has been described above with reference to specific matters such as particular components, limited embodiments, and drawings, these are provided only to aid a more general understanding of the invention. The present invention is not limited to the above embodiments, and those of ordinary skill in the art to which the invention pertains may make various modifications and variations from this description.

Therefore, the spirit of the present invention should not be confined to the embodiments described above, and not only the claims set forth below but also everything modified equally or equivalently to those claims falls within the scope of the spirit of the present invention.

FIG. 1 is a diagram exemplarily illustrating the configuration of an optical character recognition system according to an embodiment of the present invention.

FIG. 2 is a diagram exemplarily illustrating the detailed configuration of a text OCR unit according to an embodiment of the present invention.

FIG. 3 is a diagram schematically illustrating a process of recognizing character strings included in a document consisting of a text area and an image area according to an embodiment of the present invention.

FIG. 4 is a diagram illustrating an example of recognizing character strings included in a document consisting of a text area and an image area according to an embodiment of the present invention.

<Description of reference numerals for the main parts of the drawings>

100: optical character recognition system

110: document information input unit

120: document structure analysis unit

130: text OCR unit

131: segmentation unit

132: character normalization unit

133: character recognition unit

140: image OCR unit

150: control unit

160: communication unit

Claims

A method of recognizing character strings included in a document, the method comprising:

(a) analyzing the document structure of the document and classifying it into a text area and an image/noise area;

(b) recognizing character strings included in the text area using a first OCR;

(c) finding, through a language model, character strings included in a specific area that was incorrectly classified as a text area among the text areas, and reclassifying the specific area into the image/noise area by referring to location information on the specific area obtained from the first OCR; and

(d) recognizing character strings included in the image/noise area, as classified in steps (a) and (c), using a second OCR.
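As a rough illustration of how steps (a) through (d) fit together, the following sketch wires the four steps into one function. Every name in it (`recognize_document`, `Region`, and the four callables) is hypothetical, and the callables are stand-ins for the document structure analysis, the two OCR engines, and the language model check, none of which are specified here.

```python
from typing import Callable, Dict, List, Tuple

Region = Tuple[int, int, int, int]  # hypothetical (x, y, width, height) box


def recognize_document(
    regions: List[Region],
    is_text_region: Callable[[Region], bool],  # stand-in for structure analysis
    first_ocr: Callable[[Region], str],        # stand-in for the text OCR
    second_ocr: Callable[[Region], str],       # stand-in for the image OCR
    is_implausible: Callable[[str], bool],     # stand-in for the language model check
) -> Dict[Region, str]:
    # (a) classify regions into text and image/noise
    text_regions = [r for r in regions if is_text_region(r)]
    image_regions = [r for r in regions if not is_text_region(r)]

    results: Dict[Region, str] = {}

    # (b) run the first OCR on each text region;
    # (c) if the language model rejects the output, reclassify that region
    #     (its location is the region box itself in this sketch)
    for region in text_regions:
        text = first_ocr(region)
        if is_implausible(text):
            image_regions.append(region)
        else:
            results[region] = text

    # (d) run the second OCR on all image/noise regions, including reclassified ones
    for region in image_regions:
        results[region] = second_ocr(region)

    return results
```

Passing the engines in as callables keeps the control flow of the claim visible without committing to any particular OCR library.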

The method of claim 1,

wherein the first OCR is an OCR specialized for plain text, and the second OCR is an OCR specialized for images.

The method of claim 1, further comprising:

(e) merging and providing the results recognized in steps (b) and (d).

The method of claim 1,

wherein, in step (c), a distance value is obtained for the result of recognizing the characters included in the text area, and the specific area corresponds to an area containing characters whose distance value exceeds a certain threshold distance.
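One way to picture the threshold test is sketched below. The claim does not define how the distance value is computed, so the scoring used here (an average negative log of add-one-smoothed character bigram frequencies) is purely an assumption, as are the names `train_bigram_counts`, `distance`, and `find_misclassified`, the `vocab_size` parameter, and the threshold passed by the caller.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def train_bigram_counts(corpus: Iterable[str]) -> Counter:
    """Count character bigrams over a word list, with ^ and $ as boundary marks."""
    counts: Counter = Counter()
    for word in corpus:
        padded = f"^{word}$"
        counts.update(zip(padded, padded[1:]))
    return counts


def distance(text: str, counts: Counter, vocab_size: int = 128) -> float:
    """Average negative log smoothed bigram frequency; larger means less language-like."""
    padded = f"^{text}$"
    pairs = list(zip(padded, padded[1:]))
    total = sum(counts.values())
    # add-one smoothing over an assumed vocab_size**2 bigram space
    nll = -sum(math.log((counts[p] + 1) / (total + vocab_size ** 2)) for p in pairs)
    return nll / len(pairs)


def find_misclassified(recognized: Dict[str, str], counts: Counter,
                       threshold: float) -> List[str]:
    """Return ids of areas whose recognized text exceeds the threshold distance."""
    return [area for area, text in recognized.items()
            if distance(text, counts) > threshold]
```

In practice the threshold would need to be tuned on sample documents; the sketch only shows where the comparison sits.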

The method of claim 1,

wherein step (d) comprises determining noise on a word basis in order to remove noise from the recognized character strings.
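Word-level noise removal can be sketched as follows. The claim does not say how a word is judged to be noise, so the `mostly_symbols` heuristic (a word counts as noise when fewer than half of its characters are alphanumeric) and the names `remove_noise_words` and `min_ratio` are assumptions for illustration only.

```python
from typing import Callable


def mostly_symbols(word: str, min_ratio: float = 0.5) -> bool:
    """Hypothetical noise test: too few alphanumeric characters in the word."""
    alnum = sum(ch.isalnum() for ch in word)
    return alnum / len(word) < min_ratio


def remove_noise_words(recognized: str, is_noise: Callable[[str], bool]) -> str:
    """Split the recognized string into words and drop those judged to be noise."""
    return " ".join(word for word in recognized.split() if not is_noise(word))


cleaned = remove_noise_words("SALE ~~|| 50% off ::;", mostly_symbols)
print(cleaned)  # SALE 50% off
```

Deciding per word rather than per character keeps partly symbolic tokens such as "50%" intact while discarding runs of stray marks.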

A computer-readable recording medium having recorded thereon a computer program for executing the method according to any one of claims 1 to 5.

A system for recognizing character strings included in a document consisting of a text area and an image/noise area, the system comprising:

a first OCR unit for recognizing character strings included in the text area using a first OCR;

a second OCR unit for recognizing character strings included in the image/noise area using a second OCR; and

a document structure analysis unit that analyzes the document structure of the document and provisionally classifies it into a text area and an image/noise area, then finds, through a language model, character strings included in a specific area incorrectly classified as a text area among the text areas, and reclassifies the specific area into the image/noise area by referring to location information on the specific area obtained from the first OCR unit.

The system of claim 7,

wherein the first OCR is an OCR specialized for plain text, and the second OCR is an OCR specialized for images.

The system of claim 7,

further comprising a control unit configured to merge and provide the result recognized by the first OCR unit and the result recognized by the second OCR unit.

The system of claim 7,

wherein a distance value is obtained for the result of recognizing the characters included in the text area, and the specific area corresponds to an area containing characters whose distance value exceeds a certain threshold distance.

The system of claim 7,

wherein the second OCR unit removes noise from the result recognized using the second OCR.

The system of claim 11,

wherein the noise is determined on a word basis in order to remove the noise.