KR101831204B1

KR101831204B1 - Method and apparatus for document area segmentation

Info

Publication number: KR101831204B1
Application number: KR1020170046633A
Authority: KR
Inventors: 조남익; 서원교; 길태호
Original assignee: 주식회사 한글과컴퓨터; 서울대학교산학협력단
Priority date: 2017-04-11
Filing date: 2017-04-11
Publication date: 2018-02-22

Abstract

A method and apparatus for segmenting document areas are proposed. The method includes: detecting text line candidates by clustering the connected components included in a document image; detecting text lines by filtering the detected candidates with two different thresholds to remove non-text elements; detecting a non-text area by removing the detected text lines from the document image and then performing a repeated X-Y cut; and detecting the document boundary by comparing the detected non-text area with a region of interest. As a result, text lines can be detected accurately even in a document image with a complicated layout.

Description

METHOD AND APPARATUS FOR DOCUMENT AREA SEGMENTATION

The embodiments disclosed herein relate to a method and apparatus for detecting text areas, non-text areas, and document boundaries in a document image.

As the cameras built into personal devices such as smartphones have improved, documents are often photographed with a smartphone camera instead of being scanned, and the photographed document image is used in place of a scanned image.

However, when a document image is converted into an electronic document, character recognition results deteriorate if elements other than text are included. To improve the accuracy of the conversion, a preprocessing step is therefore needed to determine which parts of the document image are text areas and non-text areas, and where the boundary of the whole document lies.

In this regard, Korean Patent Laid-Open Publication No. 2000-0033954, a prior art document, discloses a method and apparatus for segmenting a document image based on clustering. Specifically, it discloses performing clustering based only on the margins between character regions in a document image input through a scanner, and segmenting and extracting the character regions based on the result.

The background art described above is technical information that the inventors held for deriving the present invention or acquired in the course of deriving it, and is not necessarily known art disclosed to the general public before the filing of the present application.

The embodiments disclosed herein aim to provide a method and apparatus capable of detecting text areas, non-text areas, and document boundaries in a document image with high accuracy.

As a technical means for achieving the above objective, according to one embodiment, a document area segmentation method may include: detecting text line candidates by clustering the connected components included in a document image; removing non-text elements and detecting text lines by filtering the detected text line candidates using two different thresholds; detecting a non-text area by removing the detected text lines from the document image and then performing a repeated X-Y cut; and detecting a document boundary by comparing the detected non-text area with a region of interest.

According to another embodiment, a computer program for performing a document area segmentation method is provided, wherein the method may include: detecting text line candidates by clustering the connected components included in a document image; removing non-text elements and detecting text lines by filtering the detected text line candidates using two different thresholds; detecting a non-text area by removing the detected text lines from the document image and then performing a repeated X-Y cut; and detecting a document boundary by comparing the detected non-text area with a region of interest.

According to yet another embodiment, a computer-readable recording medium on which a program for performing a document area segmentation method is recorded is provided, wherein the method may include: detecting text line candidates by clustering the connected components included in a document image; removing non-text elements and detecting text lines by filtering the detected text line candidates using two different thresholds; detecting a non-text area by removing the detected text lines from the document image and then performing a repeated X-Y cut; and detecting a document boundary by comparing the detected non-text area with a region of interest.

According to yet another embodiment, a document area segmentation apparatus includes an input/output unit for receiving input related to the processing of a document image and for showing the progress and results of the processing, a storage unit in which a program for performing document area segmentation is stored, and a control unit that segments the areas of the document image by executing the program. The control unit may detect text line candidates by clustering the connected components included in the document image, remove non-text elements and detect text lines by filtering the detected text line candidates using two different thresholds, detect a non-text area by removing the detected text lines from the document image and then performing a repeated X-Y cut, and detect a document boundary by comparing the detected non-text area with a region of interest.

According to any one of the means described above, text lines can be detected accurately even in a document image with a complicated layout, because the connected components included in the document image are clustered and then projected.

In addition, applying a classifier that uses two different thresholds improves the accuracy of non-text element removal.

In addition, detecting the document boundary by comparing the detected non-text area with a region of interest given as an initial value improves the accuracy of document boundary detection.

The effects obtainable from the disclosed embodiments are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the disclosed embodiments belong.

FIG. 1 is a block diagram showing the configuration of a document area segmentation apparatus according to an embodiment.
FIG. 2 shows a document image according to an embodiment.
FIG. 3 shows the result of applying the MSER algorithm, extracting connected components, and clustering them according to an embodiment.
FIG. 4 shows the result of separating clusters by performing a vertical projection on each cluster in the document image according to an embodiment.
FIG. 5 shows the result of separating clusters by performing a horizontal projection on each cluster in the document image according to an embodiment.
FIG. 6 shows the result of detecting text lines in the document image according to an embodiment.
FIG. 7 shows the result of segmenting the text area and the non-text area in the document image and detecting the document boundary according to an embodiment.
FIG. 8 is a flowchart illustrating a document area segmentation method according to an embodiment.

Various embodiments are described in detail below with reference to the accompanying drawings. The embodiments described below may be modified and implemented in various different forms. To describe the features of the embodiments more clearly, detailed descriptions of matters well known to those of ordinary skill in the art to which the embodiments belong are omitted. In the drawings, parts unrelated to the description of the embodiments are omitted, and similar parts are denoted by similar reference numerals throughout the specification.

Throughout the specification, when a component is said to be "connected" to another component, this includes not only the case where it is directly connected but also the case where it is connected with another component in between. In addition, when a component is said to "include" another component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing the configuration of a document area segmentation apparatus according to an embodiment.

A document area segmentation apparatus according to an embodiment may be a device that photographs a document directly and segments the areas of the photographed document image. For example, it may be a terminal such as a smartphone or tablet that has a camera and on which an application for performing document area segmentation is installed.

Alternatively, a document area segmentation apparatus according to an embodiment may be a device that receives a document image from another device and segments the areas of the received document image. For example, it may be a desktop or laptop computer on which a program for performing document area segmentation is installed.

Alternatively, a document area segmentation apparatus according to an embodiment may be a server that receives a document image from a user's terminal over a network, performs area segmentation, and transmits the result back to the user's terminal.

Referring to FIG. 1, the document area segmentation apparatus 100 according to an embodiment may include an input/output unit 110, a control unit 120, a storage unit 130, and a communication unit 140. Although not shown in FIG. 1, the document area segmentation apparatus 100 may further include a photographing unit.

The input/output unit 110 may receive input related to the processing of a document image from a user and may display a screen showing the progress and results of the processing. For example, the input/output unit 110 may include an operation panel that receives user input and a display panel that displays a screen.

Specifically, the input unit may include devices capable of receiving various forms of user input, such as a keyboard, physical buttons, a touch screen, a camera, or a microphone. The output unit may include a display panel, a speaker, or the like. However, the input/output unit 110 is not limited to these and may include components that support various other forms of input and output.

The control unit 120 controls the overall operation of the document area segmentation apparatus 100 and may include a processor such as a CPU. According to an embodiment, the control unit 120 may include at least one processor. The control unit 120 may perform the processes for segmenting document areas by executing a program stored in the storage unit 130.

Various kinds of data, such as files, applications, and programs, may be installed and stored in the storage unit 130. According to an embodiment, a program for performing the document area segmentation method may be installed in the storage unit 130.

The communication unit 140 may perform wired or wireless communication with other devices or networks. To this end, the communication unit 140 may include a communication module supporting at least one of various wired and wireless communication methods. For example, the communication module may be implemented as a chipset. According to an embodiment, the communication unit 140 may receive a document image from another device.

The wireless communication supported by the communication unit 140 may be, for example, Wi-Fi (Wireless Fidelity), Wi-Fi Direct, Bluetooth, UWB (Ultra Wide Band), or NFC (Near Field Communication). The wired communication supported by the communication unit 140 may be, for example, USB or HDMI (High Definition Multimedia Interface). The communication unit 140 may also transmit data or messages to a destination over the Internet or a mobile communication network.

The process by which the control unit 120 segments document areas according to an embodiment is described in detail below.

The control unit 120 may perform text area detection, non-text area detection, and document boundary detection on a document image, in that order.

The control unit 120 may detect text line candidates by clustering the connected components included in the document image, and may detect the text area by filtering the detected text line candidates with two different thresholds to remove non-text elements.

Next, the control unit 120 may detect the non-text area by removing the detected text lines and the background from the document image and then performing a repeated X-Y cut.

Finally, the control unit 120 may detect the document boundary by comparing the detected non-text area with a preset region of interest.

The process by which the control unit 120 detects the text area is described in detail below.

The control unit 120 extracts connected components from the document image. For example, the control unit 120 may convert the document image into a binary image using a Retinex binarization method and extract connected components from the binary image. Alternatively, the control unit 120 may extract connected components by extracting blobs using the MSER (Maximally Stable Extremal Regions) algorithm.

The Retinex binarization method can be expressed as Equation 1 below.

Here, B is the binary image and I is the input image. G·I is the image obtained by applying Gaussian filtering to the input image, and k₁ is a threshold.
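Equation 1 itself is not reproduced in this text, so the following is only a minimal sketch of one ratio-based form that is consistent with the symbols defined above: a pixel is marked as foreground when I is darker than k₁ times the Gaussian-filtered image G·I. The comparison direction, the kernel parameters, and the names `gaussian_kernel`, `blur`, and `retinex_binarize` are assumptions made for illustration, not the patented formula.

```python
import math

def gaussian_kernel(sigma, radius):
    """1-D Gaussian kernel normalized to sum to 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur(image, sigma=1.0, radius=2):
    """Separable Gaussian blur with edge replication (stands in for G·I)."""
    h, w = len(image), len(image[0])
    k = gaussian_kernel(sigma, radius)
    # Horizontal pass: clamp column indices at the image border.
    tmp = [[sum(k[j + radius] * image[y][min(max(x + j, 0), w - 1)]
                for j in range(-radius, radius + 1))
            for x in range(w)] for y in range(h)]
    # Vertical pass: clamp row indices at the image border.
    return [[sum(k[j + radius] * tmp[min(max(y + j, 0), h - 1)][x]
                 for j in range(-radius, radius + 1))
             for x in range(w)] for y in range(h)]

def retinex_binarize(image, k1=0.9, sigma=1.0, radius=2):
    """Assumed rule: B = 1 where I < k1 * (G·I), otherwise B = 0."""
    smoothed = blur(image, sigma, radius)
    return [[1 if image[y][x] < k1 * smoothed[y][x] else 0
             for x in range(len(image[0]))] for y in range(len(image))]
```

Comparing each pixel with its locally smoothed surroundings, rather than with a global level, is what makes this kind of rule tolerant of the uneven lighting typical of camera-captured documents.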

The MSER algorithm detects, as a blob, a region of pixels whose intensity differs from that of the surrounding neighboring pixels.

The control unit 120 may filter the connected components extracted by the binarization algorithm or the MSER algorithm, removing those that correspond to non-text and keeping only those likely to correspond to text. For example, the control unit 120 may connect the pixels that have the foreground value among the eight neighboring pixels, and filter the connected pixels as follows.
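The 8-neighbor connection step can be sketched as a breadth-first flood fill over a binary image. This is an illustrative sketch only; the function name `connected_components` and the list-of-lists image representation are assumptions.

```python
from collections import deque

def connected_components(binary):
    """Group foreground pixels (value 1) into 8-connected components.

    Returns a list of components, each a list of (x, y) coordinates.
    """
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Start a new component and flood-fill from this seed pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                # Visit all eight neighbors, diagonals included.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < w and 0 <= ny < h
                                and binary[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            components.append(pixels)
    return components
```

With 8-connectivity, two foreground pixels that touch only diagonally end up in the same component, which keeps thin diagonal strokes of a character together.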

First, letting N be the number of pixels in a single connected component, the component can be expressed as in Equation 2 below, and the covariance matrix of C can then be decomposed as in Equation 3 below.

Here, the symbols appearing in the decomposition denote the eigenvalues and eigenvectors.

For a connected component that corresponds to text, the ratio of the pixel count to the area represented by the product of the eigenvalues is usually high, and the ratio between the eigenvalues is not large. Using this property, filtering the connected components that satisfy Equations 4 and 5 below removes as many of the connected components corresponding to non-text as possible.

Here, the constants appearing in Equations 4 and 5 are thresholds.
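Equations 2 to 5 are not reproduced in this text, so the following sketch only illustrates the two properties stated above, under assumed forms: a fill measure N / sqrt(λ₁λ₂) that should be high for text, and an elongation measure λ₁/λ₂ that should not be large. The exact expressions, the threshold values, and the names `shape_features` and `looks_like_text` are hypothetical.

```python
import math

def shape_features(pixels):
    """Return (fill, elongation) for a component given as (x, y) pixels.

    fill       = N / sqrt(l1 * l2), an assumed form of the pixel count
                 relative to the area spanned by the principal axes.
    elongation = l1 / l2, the larger eigenvalue over the smaller one.
    """
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n
    my = sum(y for _, y in pixels) / n
    sxx = sum((x - mx) ** 2 for x, _ in pixels) / n
    syy = sum((y - my) ** 2 for _, y in pixels) / n
    sxy = sum((x - mx) * (y - my) for x, y in pixels) / n
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix.
    mean = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    eps = 1e-9  # guards against degenerate one-pixel-wide components
    l1 = max(mean + spread, eps)
    l2 = max(mean - spread, eps)
    return n / math.sqrt(l1 * l2), l1 / l2

def looks_like_text(pixels, min_fill=4.0, max_elongation=20.0):
    """Hypothetical thresholds: keep dense components that are not too thin."""
    fill, elongation = shape_features(pixels)
    return fill >= min_fill and elongation <= max_elongation
```

Under these assumed forms, a solid square passes both tests, while a long one-pixel-wide rule line fails the elongation test because its smaller eigenvalue collapses toward zero.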

The control unit 120 may cluster the connected components that remain after filtering and detect text line candidates. First, the control unit 120 may cluster the connected components using the fact that text within the same paragraph has similar attributes. For example, the control unit 120 may cluster connected components with similar attributes based on the color distance between them, the ratio of their stroke widths, the ratio of their bounding box perimeters, and whether they are collinear in the vertical or horizontal direction.
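The attribute-based clustering above can be sketched as a pairwise similarity test combined with union-find grouping. Everything specific here is an assumption for illustration: the dictionary keys `color`, `stroke_width`, and `bbox`, the threshold values, and the simplification that collinearity is checked only in the horizontal direction, as vertical overlap of bounding boxes.

```python
import math

def similar(a, b, max_color_dist=40.0, max_ratio=2.0):
    """Pairwise test on color distance, stroke-width ratio, perimeter
    ratio, and rough horizontal collinearity (vertical bbox overlap)."""
    def ratio(p, q):
        return max(p, q) / min(p, q)

    def perimeter(box):
        x0, y0, x1, y1 = box
        return 2 * ((x1 - x0) + (y1 - y0))

    color_ok = math.dist(a["color"], b["color"]) <= max_color_dist
    stroke_ok = ratio(a["stroke_width"], b["stroke_width"]) <= max_ratio
    perim_ok = ratio(perimeter(a["bbox"]), perimeter(b["bbox"])) <= max_ratio
    overlap = min(a["bbox"][3], b["bbox"][3]) - max(a["bbox"][1], b["bbox"][1])
    return color_ok and stroke_ok and perim_ok and overlap > 0

def cluster(components):
    """Union-find grouping: components linked by `similar` share a cluster.

    Returns a sorted list of clusters, each a list of component indices.
    """
    parent = list(range(len(components)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            if similar(components[i], components[j]):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(components)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Union-find makes the grouping transitive: if component A resembles B and B resembles C, all three land in one cluster even when A and C were never directly similar.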

FIG. 2 shows a document image according to an embodiment. FIG. 3 shows the image that results from applying the MSER algorithm to the document image of FIG. 2, extracting connected components, and clustering them. Referring to FIG. 3, connected components with similar attributes among those included in the document image 300 form a single cluster 301. Connected components of the same color belong to the same cluster; for convenience of explanation, however, the boundary of the cluster 301 is drawn as a rectangle even though it encloses some connected components of other colors.

After the clustering of connected components is complete, the control unit 120 may detect text line candidates by performing projection on the clusters. For example, the control unit 120 may project each cluster in at least one of the vertical and horizontal directions, thereby dividing each cluster into smaller clusters corresponding to text paragraphs and text lines, and may detect the smaller clusters corresponding to text lines as text line candidates.

In the example document image, the text paragraphs are divided into columns. The control unit 120 therefore performs a vertical projection on each cluster in the clustered document image 300 of FIG. 3. If the projection reveals a gap that is at least a certain multiple (e.g., 4 times) of the average stroke width of the connected components in the cluster, the control unit 120 may treat the gap as a boundary between paragraphs and split the cluster into smaller clusters at the gap. FIG. 4 shows the result of performing a vertical projection on each cluster in the document image 300 of FIG. 3, which separates the clusters into smaller clusters, one per text paragraph. Referring to the document image 400 of FIG. 4, the connected components that formed a single cluster 301 in FIG. 3 have been separated into two smaller clusters 401 and 402. Because the length d1 of the gap between the two clusters 401 and 402 is at least the given multiple of the average stroke width of the connected components, the control unit 120 separated the connected components into different clusters.
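The column-splitting step can be sketched by projecting bounding boxes onto one axis and cutting wherever the projection is empty for at least a minimum gap. The function name `split_by_projection`, the `(x0, y0, x1, y1)` box format, and the sample values are assumptions for illustration; the sketch covers only the "gap at least a multiple of the stroke width" rule described in this paragraph.

```python
def split_by_projection(boxes, axis, min_gap):
    """Split boxes into groups at empty gaps of the projection profile.

    boxes:   list of (x0, y0, x1, y1) bounding boxes.
    axis:    0 projects onto the x-axis (vertical projection),
             1 projects onto the y-axis (horizontal projection).
    min_gap: smallest empty interval that counts as a boundary.
    Returns a list of groups, each a list of box indices.
    """
    order = sorted(range(len(boxes)), key=lambda i: boxes[i][axis])
    groups, current, reach = [], [], 0
    for i in order:
        start, end = boxes[i][axis], boxes[i][axis + 2]
        if not current:
            reach = end
        elif start - reach >= min_gap:
            # Empty stretch wide enough: close the group and start anew.
            groups.append(current)
            current, reach = [], end
        else:
            reach = max(reach, end)
        current.append(i)
    if current:
        groups.append(current)
    return groups

# Two columns of text separated by a 20-pixel gutter.
boxes = [(0, 0, 40, 10), (0, 12, 38, 22), (60, 0, 100, 10), (61, 12, 99, 22)]
mean_stroke_width = 3
columns = split_by_projection(boxes, axis=0, min_gap=4 * mean_stroke_width)
```

In this example the gutter (20 pixels) exceeds four times the assumed stroke width (12 pixels), so `columns` contains two groups, one per column.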

The control unit 120 may then perform a horizontal projection on each cluster in the document image 400 of FIG. 4, separating each cluster into smaller clusters, one per text line. For example, the control unit 120 performs a horizontal projection on each cluster in the document image 400 of FIG. 4, in which the clusters are divided by text paragraph. If the projection reveals a gap smaller than a certain multiple (e.g., 0.5 times) of the average stroke width of the connected components in the cluster, the control unit 120 may treat the gap as a boundary between text lines and split the cluster into smaller clusters at the gap. FIG. 5 shows the result of performing a horizontal projection on each cluster in the document image 400 of FIG. 4, which separates the clusters into smaller clusters 501 and 502, one per text line. Referring to the document image 500 of FIG. 5, some of the connected components that formed a single cluster 402 in FIG. 4 have been separated into two smaller clusters 501 and 502. Because the length d2 of the gap between the two clusters 501 and 502 is smaller than the given multiple of the average stroke width of the connected components, the control unit 120 separated the connected components into different clusters.

In the document image of FIG. 2, the text paragraphs are divided into columns, so the control unit 120 separated text paragraphs and text lines by performing both vertical and horizontal projections. If the text paragraphs are instead divided into rows, performing only the horizontal projection is enough to divide the clusters into smaller clusters corresponding to text paragraphs and text lines. For example, the control unit 120 may perform a horizontal projection on each cluster in the clustered document image and compare the widths of the gaps found with the average stroke width of the connected components, thereby determining whether each gap is a boundary between paragraphs or a boundary between lines.

Once the text line candidates have been detected, the control unit 120 may filter them with a classifier to remove non-text elements and detect text lines. Using a classifier with two thresholds improves the accuracy of non-text element removal.

According to an embodiment, the control unit 120 may remove non-text elements from the text line candidates using an MLP (Multi-Layer Perceptron) classifier with two thresholds. First, the control unit 120 filters with the lower of the two thresholds, removing non-text elements while avoiding the removal of text lines as far as possible, even if some non-text elements remain. That is, the control unit 120 performs primary filtering using the low threshold.

The control unit 120 then clusters the text line candidates that remain after the primary filtering. Based on the property that, within a single paragraph, the height of the text lines and the spacing between them are constant, the control unit 120 clusters the remaining text line candidates into paragraphs.

If, even after this clustering, some text line candidates are not included in any paragraph and remain isolated, the control unit 120 performs secondary filtering on those isolated candidates using the higher of the two thresholds.

The control unit 120 may detect the text line candidates that remain after the secondary filtering as text lines, and may detect the area containing the text lines as the text area.
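The two-stage filtering above can be sketched as follows, with the classifier and the paragraph clustering abstracted away. The function name `detect_text_lines`, the score dictionary, the `group_into_paragraphs` callable, and the threshold values 0.3 and 0.7 are all hypothetical; the source specifies only that an MLP classifier with a low and a high threshold is used.

```python
def detect_text_lines(scores, group_into_paragraphs, low=0.3, high=0.7):
    """Two-threshold filtering of text line candidates.

    scores: dict mapping candidate id -> text score from a classifier
            (higher means more text-like).
    group_into_paragraphs: callable taking a list of candidate ids and
            returning a list of groups (lists of ids); a group with a
            single member is treated as an isolated candidate.
    """
    # Primary filtering: permissive low threshold, favors keeping text.
    survivors = [c for c, s in scores.items() if s >= low]
    text_lines = []
    for group in group_into_paragraphs(survivors):
        if len(group) > 1:
            # Candidates that joined a paragraph are accepted as text lines.
            text_lines.extend(group)
        else:
            # Secondary filtering: isolated candidates face the high threshold.
            text_lines.extend(c for c in group if scores[c] >= high)
    return sorted(text_lines)

# Toy example: a hypothetical paragraph assignment stands in for the
# height-and-spacing clustering described above.
paragraph_of = {"a": 0, "b": 0, "c": 1, "e": 2}

def group_by_paragraph(ids):
    groups = {}
    for i in ids:
        groups.setdefault(paragraph_of[i], []).append(i)
    return list(groups.values())

scores = {"a": 0.9, "b": 0.4, "c": 0.5, "d": 0.2, "e": 0.8}
lines = detect_text_lines(scores, group_by_paragraph)
```

Here candidate `d` is dropped by the low threshold, `a` and `b` are accepted because they form a paragraph, and of the isolated candidates only `e` clears the high threshold. The design lets paragraph context vouch for weak candidates while demanding stronger evidence from candidates that stand alone.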

도 5의 문서 영상(500)에서 추출된 텍스트 라인 후보들 중에서, 비텍스트 요소가 제거되고 텍스트 라인이 검출된 결과를 도 6에 도시하였다. 도 6의 문서 영상(600)은 텍스트 라인만을 포함하고 있다.Among the text line candidates extracted from the document image 500 of FIG. 5, the result of the removal of the non-text elements and the detection of the text line is shown in FIG. The document image 600 of FIG. 6 includes only text lines.

Once the text area has been detected, the control unit 120 detects the non-text area. This process is described in detail below.

The control unit 120 removes the previously detected text area from the binarized document image. The remaining foreground pixels, that is, the remaining connected components, are likely to belong to the non-text area. However, when a hard-copy document is photographed with a camera, the document image may contain background other than the document, so some of the remaining connected components may belong to the background. The following operations are therefore needed to remove the connected components that belong to the background.

First, the control unit 120 presets at least part of the document image as a region of interest and adjusts the size of the region of interest based on the previously detected text lines. Here, the control unit 120 may set, as the initial region of interest, the area of the whole document image that is most likely to correspond to the document. If some of the detected text lines lie outside the preset region of interest, the control unit 120 expands the region of interest so that it includes those text lines.

Once the region of interest has been resized, the control unit 120 checks whether any connected component straddles its boundary. For each such connected component, the control unit 120 compares the number of its pixels inside the region of interest with the number of its pixels outside. If more pixels lie inside than outside, the connected component is likely to belong to a non-text area of the document rather than the background, so the control unit 120 keeps it. Conversely, if more pixels lie outside than inside, the connected component is likely to belong to the background, so the control unit 120 removes it.
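The inside/outside pixel comparison described above might be sketched as follows, as an illustration only. The representation of a connected component as a list of (x, y) pixel coordinates, the (left, top, right, bottom) form of the region of interest with exclusive right and bottom edges, and the function names are assumptions of this sketch. The specification does not say what happens on a tie, so keeping the component in that case is also an assumption.

```python
def count_inside(pixels, roi):
    """Count the pixels of one component that fall inside the region of interest."""
    left, top, right, bottom = roi
    return sum(1 for x, y in pixels if left <= x < right and top <= y < bottom)


def remove_background_components(components, roi):
    """Drop components that straddle the ROI boundary and lie mostly outside it."""
    kept = []
    for pixels in components:
        inside = count_inside(pixels, roi)
        outside = len(pixels) - inside
        straddles = inside > 0 and outside > 0
        if straddles and outside > inside:
            continue  # mostly outside: treated as background and removed
        kept.append(pixels)  # mostly inside, tied (assumption), or not straddling
    return kept
```

Components lying wholly outside the region of interest are left untouched here, since the comparison is described only for components that straddle the boundary.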

After removing from the document image the connected components that are likely to belong to the background, the control unit 120 detects the non-text area by performing an iterative X-Y cut on the remaining connected components. For example, the control unit 120 projects the document image, from which the text area and the likely background components have been removed, in the horizontal and vertical directions to obtain histograms. If the number of consecutive zeros between connected components in a histogram is less than or equal to a preset threshold, the control unit 120 detects those connected components as a single non-text area. Conversely, if the number of consecutive zeros is greater than the threshold, the control unit 120 detects the connected components as separate non-text areas. The non-text area can be detected effectively by tuning the threshold appropriately.
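One possible way to realize the projection-and-gap rule described above is sketched below in plain Python. The image is assumed to be a list of rows of 0/1 values, the names `segments`, `xy_cut`, and `max_gap` are hypothetical, and cutting along both axes in each pass before recursing is a design choice of this sketch rather than something the specification prescribes.

```python
def segments(profile, max_gap):
    """Split a projection profile into runs, merging runs whose zero gap is <= max_gap."""
    segs = []
    for i, value in enumerate(profile):
        if not value:
            continue
        if segs and i - segs[-1][1] <= max_gap:
            segs[-1][1] = i + 1      # small gap: extend the current run
        else:
            segs.append([i, i + 1])  # first hit or large gap: start a new run
    return [tuple(s) for s in segs]


def xy_cut(img, max_gap, box=None):
    """Return non-text boxes as (left, top, right, bottom), right/bottom exclusive."""
    if box is None:
        box = (0, 0, len(img[0]), len(img))
    left, top, right, bottom = box
    # Project the foreground pixels of the current box onto the x and y axes.
    cols = [sum(img[y][x] for y in range(top, bottom)) for x in range(left, right)]
    rows = [sum(img[y][x] for x in range(left, right)) for y in range(top, bottom)]
    xs, ys = segments(cols, max_gap), segments(rows, max_gap)
    if not xs or not ys:
        return []  # no foreground pixels in this box
    if len(xs) == 1 and len(ys) == 1:
        # No gap exceeds the threshold: report one tight region.
        return [(left + xs[0][0], top + ys[0][0], left + xs[0][1], top + ys[0][1])]
    regions = []
    for x0, x1 in xs:
        for y0, y1 in ys:
            regions.extend(xy_cut(img, max_gap, (left + x0, top + y0, left + x1, top + y1)))
    return regions
```

With a small `max_gap` nearby components are split into separate regions, while a larger value merges them, which mirrors the role of the threshold in the description.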

Once the non-text area has been detected, the control unit 120 may detect the document boundary. This process is described in detail below.

The control unit 120 compares the detected non-text area with the preset region of interest. If any non-text area lies outside the region of interest, the control unit 120 expands the region of interest so that the non-text area falls inside it. Once every non-text area in the document image lies inside the region of interest, the control unit 120 detects the boundary of the region of interest as the document boundary.
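A minimal sketch of this boundary step follows, assuming axis-aligned rectangles of the form (left, top, right, bottom) for both the region of interest and the non-text areas. The function name and the rectangular representation are assumptions of this sketch; the specification does not fix the shape of the region of interest.

```python
def document_boundary(roi, non_text_regions):
    """Grow the region of interest until it covers every non-text region."""
    left, top, right, bottom = roi
    for r_left, r_top, r_right, r_bottom in non_text_regions:
        # A region already inside the ROI leaves all four edges unchanged.
        left, top = min(left, r_left), min(top, r_top)
        right, bottom = max(right, r_right), max(bottom, r_bottom)
    return (left, top, right, bottom)
```

For example, `document_boundary((10, 10, 90, 90), [(20, 20, 40, 40), (5, 50, 30, 95)])` grows the left and bottom edges only, giving `(5, 10, 90, 95)`.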

FIG. 7 shows an example in which a text area and a non-text area are segmented from a document image and the document boundary is detected. Referring to FIG. 7, the text area 701 and the non-text area 702 have been segmented in the document image 700. In addition, the document boundary 703 has been detected so that it includes the text area 701 and the non-text area 702 and excludes the background.

Thus, with the document area segmentation apparatus 100 according to one embodiment, text lines can be detected accurately even in a document image with a complicated layout, by clustering the connected components in the document image and then performing projection. Applying a classifier that uses two different thresholds also improves accuracy in removing non-text elements. Furthermore, detecting the document boundary by comparing the region of interest given as an initial value with the detected non-text area improves the accuracy of document boundary detection.

FIG. 8 is a flowchart illustrating a document area segmentation method according to one embodiment.

The document area segmentation method according to the embodiment shown in FIG. 8 includes steps that are processed in time series by the document area segmentation apparatus 100 shown in FIG. 1. Therefore, even where omitted below, the description given above for the document area segmentation apparatus 100 of FIG. 1 also applies to the document area segmentation method of FIG. 8.

Referring to FIG. 8, in step 801, text line candidates are detected by clustering the connected components in the document image. Specifically, connected components are extracted by applying the binarization algorithm or the MSER algorithm described above to the document image, and the extracted connected components are clustered. Text line candidates may then be detected by projecting each cluster of connected components in at least one of the horizontal and vertical directions.

In extracting the connected components from the document image at this step, the connected components obtained by applying the binarization algorithm or the MSER algorithm may be filtered based on eigenvalues and pixel counts to remove non-text elements, so that only the remaining connected components are extracted.

In detecting the text line candidates at this step, a cluster may be divided into a plurality of sub-clusters according to the result of comparing the width of the gaps between the connected components in the cluster with the average stroke width of the connected components, and the sub-clusters may be detected as text line candidates.
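The sub-cluster split described above might look like the following sketch. Each connected component is reduced to its (left, right) extent along the reading direction, and a cluster is cut wherever the gap exceeds a multiple of the average stroke width. The multiplier `factor` and the function name are assumptions of this sketch; the specification states only that the gap width is compared with the average stroke width.

```python
def split_cluster(extents, stroke_width, factor=2.0):
    """Split a cluster of (left, right) extents at gaps wider than factor * stroke_width."""
    if not extents:
        return []
    ordered = sorted(extents)
    subclusters = [[ordered[0]]]
    reach = ordered[0][1]  # rightmost edge seen so far
    for extent in ordered[1:]:
        gap = extent[0] - reach
        if gap > factor * stroke_width:
            subclusters.append([extent])    # wide gap: start a new sub-cluster
        else:
            subclusters[-1].append(extent)  # narrow gap: same text line candidate
        reach = max(reach, extent[1])
    return subclusters
```

Scaling the gap limit by stroke width makes the split adapt to font size, so large headings and small body text can be handled with the same rule.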

Next, in step 802, the detected text line candidates are filtered using two different thresholds, so that non-text elements are removed and text lines are detected. Specifically, the text line candidates are filtered using a first threshold to remove non-text elements, and the remaining text line candidates are clustered into paragraphs. Those remaining candidates that were not clustered into a paragraph are then filtered using a second threshold, larger than the first, to remove non-text elements, and the candidates that remain are detected as text lines.

In step 803, the detected text lines are removed from the document image and an iterative X-Y cut is performed to detect the non-text area. Specifically, the detected text lines are removed from the document image, the region of interest is adjusted based on the detected text lines, it is determined whether the connected components straddling the boundary of the region of interest belong to the background, the connected components judged to be background are removed, and an iterative X-Y cut is performed on the remaining connected components to detect the non-text area.

In step 804, the document boundary is detected by comparing the detected non-text area with the region of interest. Specifically, if a non-text area also exists outside the region of interest, the region of interest is expanded to include that non-text area, and the boundary of the expanded region of interest is determined to be the document boundary.

The term "unit" as used in the above embodiments refers to software or a hardware component such as a field-programmable gate array (FPGA) or an application-specific integrated circuit (ASIC), and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium, or may be configured to run on one or more processors. Thus, by way of example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables.

The functionality provided by the components and "units" may be combined into a smaller number of components and "units", or further separated into additional components and "units".

In addition, the components and "units" may be implemented to run on one or more CPUs in a device or a secure multimedia card.

The document area segmentation method according to the embodiment described with reference to FIG. 8 may also be implemented in the form of a computer-readable medium that stores computer-executable instructions and data. The instructions and data may be stored in the form of program code and, when executed by a processor, may generate a predetermined program module to perform a predetermined operation. The computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and non-volatile media as well as removable and non-removable media. The computer-readable medium may also be a computer recording medium, which may include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. For example, the computer recording medium may be a magnetic storage medium such as an HDD or SSD, an optical recording medium such as a CD, DVD, or Blu-ray disc, or memory included in a server accessible over a network.

The document area segmentation method according to the embodiment described with reference to FIG. 8 may also be implemented as a computer program (or computer program product) that includes computer-executable instructions. The computer program includes programmable machine instructions processed by a processor, and may be implemented in a high-level programming language, an object-oriented programming language, an assembly language, or a machine language. The computer program may also be recorded on a tangible computer-readable recording medium (for example, memory, a hard disk, a magnetic/optical medium, or a solid-state drive (SSD)).

Accordingly, the document area segmentation method according to the embodiment described with reference to FIG. 8 may be implemented by a computing device executing the computer program described above. The computing device may include at least some of a processor, memory, a storage device, a high-speed interface connected to the memory and a high-speed expansion port, and a low-speed interface connected to a low-speed bus and the storage device. These components are interconnected using various buses and may be mounted on a common motherboard or in any other suitable manner.

Here, the processor can process instructions within the computing device. Such instructions include, for example, instructions stored in the memory or the storage device for displaying graphical information to provide a graphical user interface (GUI) on an external input/output device, such as a display connected to the high-speed interface. In other embodiments, multiple processors and/or multiple buses may be used, as appropriate, along with multiple memories and types of memory. The processor may also be implemented as a chipset made up of chips that include multiple independent analog and/or digital processors.

The memory stores information within the computing device. In one example, the memory may consist of a volatile memory unit or a set of such units. In another example, the memory may consist of a non-volatile memory unit or a set of such units. The memory may also be another form of computer-readable medium, such as a magnetic or optical disk.

The storage device can provide mass storage for the computing device. The storage device may be a computer-readable medium or a configuration including such a medium. For example, it may include devices in a storage area network (SAN) or other configurations, and may be a floppy disk device, a hard disk device, an optical disk device, a tape device, flash memory, or another similar semiconductor memory device or device array.

The embodiments described above are illustrative, and a person of ordinary skill in the art to which they pertain will understand that they can readily be modified into other specific forms without changing their technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in combined form.

The scope of protection sought through this specification is defined by the claims below rather than by the detailed description above, and should be construed as including all changes or modifications derived from the meaning and scope of the claims and their equivalents.

100: document area segmentation apparatus 110: input/output unit
120: control unit 130: storage unit
140: communication unit

Claims

A document area segmentation method, comprising:
clustering connected components included in a document image to detect text line candidates;
removing non-text elements and detecting text lines by filtering the detected text line candidates using two different threshold values;
detecting a non-text area by removing the detected text lines from the document image and then performing an iterative X-Y cut; and
detecting a document boundary by comparing the detected non-text area with a region of interest.

The method according to claim 1,
wherein detecting the text line candidates comprises:
extracting connected components by applying a binarization algorithm or a maximally stable extremal regions (MSER) algorithm to the document image;
clustering the extracted connected components; and
detecting the text line candidates by projecting each cluster of the connected components in at least one of a horizontal direction and a vertical direction.

The method according to claim 2,
wherein extracting the connected components comprises:
filtering the connected components extracted by applying the binarization algorithm or the MSER algorithm to the document image, based on eigenvalues and pixel counts, to remove non-text elements, and then extracting only the remaining connected components.

The method according to claim 2,
wherein detecting the text line candidates comprises:
dividing the cluster into a plurality of sub-clusters according to a result of comparing the width of the gaps between the connected components included in the cluster with the average stroke width of the connected components, and detecting the sub-clusters as text line candidates.

The method according to claim 1,
wherein detecting the text lines comprises:
filtering the text line candidates using a first threshold value to remove non-text elements, and clustering the remaining text line candidates into paragraphs; and
filtering, using a second threshold value larger than the first threshold value, those remaining text line candidates that were not clustered into paragraphs to remove non-text elements, and detecting the text line candidates that remain as text lines.

The method according to claim 1,
wherein detecting the non-text area comprises:
removing the detected text lines from the document image;
adjusting the region of interest based on the detected text lines;
determining whether connected components straddling the boundary of the region of interest correspond to background;
removing the connected components determined to be background; and
performing an iterative X-Y cut on the remaining connected components.

The method according to claim 1,
wherein detecting the document boundary comprises:
expanding the region of interest so that the non-text area is included within the region of interest if the non-text area also exists outside the region of interest; and
determining the boundary of the expanded region of interest as the document boundary.

A computer-readable recording medium on which a program for carrying out the method according to claim 1 is recorded.

A computer program stored in a medium, for execution by a document area segmentation apparatus to perform the method according to claim 1.

A document area segmentation apparatus, comprising:
an input/output unit for receiving input related to processing of a document image and for displaying the status and result of processing the document image;
a storage unit for storing a program for performing document area segmentation; and
a control unit for performing area segmentation of the document image by executing the program,
wherein the control unit detects text line candidates by clustering connected components included in the document image, removes non-text elements and detects text lines by filtering the detected text line candidates using two different threshold values, detects a non-text area by removing the detected text lines from the document image and then performing an iterative X-Y cut, and detects a document boundary by comparing the detected non-text area with a region of interest.

The apparatus according to claim 10,
wherein the control unit
extracts connected components by applying a binarization algorithm or a maximally stable extremal regions (MSER) algorithm to the document image, clusters the extracted connected components, and detects the text line candidates by projecting each cluster of the connected components in at least one of a horizontal direction and a vertical direction.

The apparatus according to claim 11,
wherein the control unit
filters the connected components extracted by applying the binarization algorithm or the MSER algorithm to the document image, based on eigenvalues and pixel counts, to remove non-text elements, and then extracts only the remaining connected components.

The apparatus according to claim 11,
wherein the control unit
divides the cluster into a plurality of sub-clusters according to a result of comparing the width of the gaps between the connected components included in the cluster with the average stroke width of the connected components, and detects the sub-clusters as text line candidates.

The apparatus according to claim 10,
wherein the control unit
filters the text line candidates using a first threshold value to remove non-text elements, clusters the remaining text line candidates into paragraphs, filters, using a second threshold value larger than the first threshold value, those remaining text line candidates that were not clustered into paragraphs to remove non-text elements, and detects the text line candidates that remain as text lines.

The apparatus according to claim 10,
wherein the control unit
removes the detected text lines from the document image, adjusts the region of interest based on the detected text lines, determines whether connected components straddling the boundary of the region of interest correspond to background, removes the connected components determined to be background, and detects the non-text area by performing an iterative X-Y cut on the remaining connected components.

The apparatus according to claim 10,
wherein the control unit
expands the region of interest so that the non-text area is included within the region of interest if the non-text area also exists outside the region of interest, and determines the boundary of the expanded region of interest as the document boundary.