KR101571681B1

KR101571681B1 - Method for analysing structure of document using homogeneous region

Info

Publication number: KR101571681B1
Application number: KR1020140192693A
Authority: KR
Inventors: 김수형; 나인섭; 오아란; 뚜안; 강재우; 금지수
Original assignee: 주식회사 디오텍; 전남대학교산학협력단
Priority date: 2014-12-29
Filing date: 2014-12-29
Publication date: 2015-11-25

Abstract

The present invention relates to a method for analyzing a document structure by using a homogeneous region, the method including the steps of: binarizing a document image into black and white by analyzing the document image; extracting a homogeneous region by dividing the document image along at least some of the white lines, based on the widths of the white lines and black lines formed by binarizing the document image; classifying connected components into text components and non-text components by analyzing the connected components included in the homogeneous region, and then obtaining a text document including the text components and a non-text document including the non-text components; dividing the text document based on the distance between the text components; and classifying the non-text components included in the non-text document based on feature values of the non-text components. Analyzing the connected components on the basis of the homogeneous region increases the accuracy of classifying them into text components and non-text components.

Description

METHOD FOR ANALYZING STRUCTURE OF DOCUMENT USING HOMOGENEOUS REGION

The present invention relates to a method of analyzing a document structure using a homogeneous region, and more particularly, to a method of extracting homogeneous regions through repeated division and analyzing the document structure based on the extracted homogeneous regions.

Optical character recognition (OCR) refers to converting an image of text into computer-readable text data through a predetermined character recognition process. OCR is widely used to convert paper documents or electronic images containing text into electronic documents, because it eliminates the manual work of typing the text. OCR requires a process of analyzing the structure of the document to be converted. By analyzing the structure, the document can be classified into text regions, table regions, image regions, line regions, and the like. The accuracy of the document structure analysis can affect the accuracy of OCR.

Conventionally, top-down and bottom-up methods have mainly been used to analyze document structure. The top-down method divides a document image into blocks, divides the blocks into text lines, and divides the text lines into words. The top-down method has a short computing time and can efficiently analyze document images with a regular layout, but accurate division becomes difficult when the structure of the document image is complex. The bottom-up method, by contrast, determines the words included in a document image, merges the words into text lines, and merges the text lines into blocks. The bottom-up method can analyze a document image even when its structure is complex, but its computing time is excessively long and it is difficult to set the threshold values required for the analysis appropriately. A combination of the top-down and bottom-up methods has been proposed to overcome these disadvantages, but it does not fundamentally solve them.

Therefore, there is a need for a method and an apparatus that can accurately analyze the structure of a complex document image with a short computing time.
[Related Technical Literature]
1. Method, system, and computer-readable recording medium for recognizing a character string included in a document using a language model and OCR (Patent Application No. 10-2008-0103890)

An object of the present invention is to provide a method of analyzing a document structure using a homogeneous region that can extract homogeneous regions having the same characteristics by repeatedly dividing a document image using the characteristics of the white lines and black lines included in the document image.

Another object of the present invention is to provide a method of analyzing a document structure using a homogeneous region that can accurately classify connected components into text components and non-text components by analyzing the connected components based on the extracted homogeneous regions.

Still another object of the present invention is to provide a method of analyzing a document structure using a homogeneous region that can quickly and accurately classify connected components into text components and non-text components by classifying the connected components through two rounds of filtering.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

To achieve the objects described above, a method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention includes: analyzing a document image and binarizing it into black and white; extracting a homogeneous region by dividing the document image along at least some of the white lines, based on the widths of the white lines and black lines formed by binarizing the document image; analyzing the connected components included in the homogeneous region to classify them into text components and non-text components, and obtaining a text document including the text components and a non-text document including the non-text components; dividing the text document based on the distance between the text components; and classifying the non-text components included in the non-text document based on feature values of the non-text components.

According to another feature of the present invention, extracting the homogeneous region includes: obtaining a curve, along the horizontal or vertical axis of the document image, that represents the frequency of black pixels among the pixels of the document image; determining whether a heterogeneous region exists based on the rate of change of the curve and a statistical value of the frequency of black pixels; and, when a heterogeneous region exists, dividing the document image along at least some of the white lines based on the widths of the white lines and black lines included in the heterogeneous region.

According to still another feature of the present invention, the method further includes repeating the obtaining of the curve, the determining of whether a heterogeneous region exists, and the dividing of the document image until no heterogeneous region exists.

According to still another feature of the present invention, obtaining the text document and the non-text document includes: analyzing the connected components to extract the non-text components from among them; and separating the non-text components from the homogeneous region and storing a non-text document including the non-text components.

According to still another feature of the present invention, extracting the non-text components consists of classifying each connected component as a text component or a non-text component based on feature values including the coordinates, width, height, and area of the connected component and of the box surrounding it.

According to still another feature of the present invention, the method includes: repeating the extracting of the non-text components and the storing of the non-text document until no non-text component is extracted; and storing a text document from which the non-text components have been removed and which includes the text components.

According to still another feature of the present invention, classifying the non-text components consists of classifying them into line components, table components, separator components, and noise components.

According to still another feature of the present invention, dividing the text document consists of dilating the text components through a morphology operation and then dividing the text document based on the distance between the text components.

According to still another feature of the present invention, the document image, the text document, and the non-text document have the same size.

To achieve the objects described above, an apparatus for analyzing a document structure using a homogeneous region according to an embodiment of the present invention includes a set of instructions for: analyzing a document image and binarizing it into black and white; extracting a homogeneous region by dividing the document image along at least some of the white lines, based on the widths of the white lines and black lines included in the document image; analyzing the connected components included in the homogeneous region to classify them into text components and non-text components, and obtaining a text document including the text components and a non-text document including the non-text components; dividing the text document based on the distance between the text components; and classifying the non-text components included in the non-text document based on feature values of the non-text components.

To achieve the objects described above, a computer-readable medium according to an embodiment of the present invention includes a processor and a memory. The processor analyzes a document image and binarizes it into black and white; extracts a homogeneous region by dividing the document image along at least some of the white lines, based on the widths of the white lines and black lines included in the document image; analyzes the connected components included in the homogeneous region to classify them into text components and non-text components, and obtains a text document including the text components and a non-text document including the non-text components; divides the text document based on the distance between the text components; and classifies the non-text components included in the non-text document based on feature values of the non-text components. The memory stores at least one of the document image, the homogeneous region, the connected components, the text document, and the non-text document.

Specific details of other embodiments are included in the detailed description and the drawings.

The present invention can accurately extract homogeneous regions having the same characteristics by repeatedly dividing a document image using the characteristics of the white lines and black lines included in the document image.

The present invention can improve the accuracy of classifying connected components into text components and non-text components by analyzing the connected components based on the extracted homogeneous regions.

The present invention can quickly and accurately classify connected components into text components and non-text components by repeatedly applying two rounds of filtering to the connected components.

The effects of the present invention are not limited to those exemplified above, and further effects are included in this specification.

FIG. 1 is a schematic block diagram of an apparatus for analyzing a document structure using a homogeneous region according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention.
FIG. 3 is a flowchart illustrating a method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention.
FIGS. 4A to 4D show an example of processing a curve representing the frequency of black pixels in a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIGS. 5A and 5B show examples of a homogeneous region and a heterogeneous region in a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIGS. 6A and 6B show an example of dividing a document image in a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIG. 7 is a conceptual diagram showing the relationships among connected components in a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIG. 8 shows an example of a text document and a non-text document obtained by a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIG. 9 shows an example of a document image analyzed by a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.
FIG. 10 is a graph showing the success rate of a method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.

The advantages and features of the present invention, and the ways of achieving them, will become apparent from the embodiments described in detail below together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below, however, and may be implemented in many different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention pertains. The present invention is defined only by the scope of the claims.

Although terms such as first and second are used to describe various components, these components are of course not limited by the terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may also be a second component within the technical idea of the present invention.

Like reference numerals refer to like components throughout the specification.

The features of the various embodiments of the present invention can be partially or wholly coupled or combined with one another and, as those skilled in the art will fully understand, can be technically interlocked and operated in various ways. The embodiments can be carried out independently of one another or together in association.

Hereinafter, various embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic block diagram of an apparatus for analyzing a document structure using a homogeneous region according to an embodiment of the present invention.

Referring to FIG. 1, an apparatus 100 for analyzing a document structure using a homogeneous region includes a processor 110 and a memory 120. The apparatus 100 may be a general-purpose computer, a special-purpose computer, a smartphone, a desktop or notebook computer, or the like, or an accessory device used in combination with any of these.

The processor 110 performs various operations in the apparatus 100. The processor 110 may binarize a document image into black and white, extract homogeneous regions by dividing the document image, analyze the connected components included in the homogeneous regions to obtain a text document including text components and a non-text document including non-text components, divide the text document based on the distance between the text components, and classify the non-text components included in the non-text document based on feature values of the non-text components. These operations are described in detail below with reference to FIGS. 2 and 3.

The memory 120 temporarily stores data and exchanges data with the processor 110. The memory 120 may temporarily store data on the document image, homogeneous regions, connected components, text components, non-text components, text document, non-text document, and the like described above.

Although not shown in FIG. 1, the apparatus 100 may also include an input means, such as a camera, for acquiring a document image.

FIG. 2 is a flowchart illustrating a method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention. For convenience, the description also refers to FIG. 1.

The method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention begins when the processor 110 analyzes a document image and binarizes it into black and white (S210). To perform the method, each pixel value of the document image must be converted into either a pixel value representing white or a pixel value representing black. A pixel value is data representing the brightness of a pixel. For example, the pixel value representing white may be 0 and the pixel value representing black may be 1, or the pixel value representing white may be 1 and the pixel value representing black may be 0. When the document image is a color image, the processor 110 may first convert the color image into a grayscale image from its RGB components. After the conversion to grayscale, the processor 110 may convert pixels having a pixel value equal to or less than a threshold value to 0 and pixels having a pixel value equal to or greater than the threshold value to 1. The threshold value may be a predetermined value or a value determined differently for each pixel. For example, the threshold value may be determined by the Sauvola technique (Sauvola, J. and Pietikainen, M., 2000. Adaptive document image binarization. Pattern Recognition, 33(2): 225-236. [doi:10.1016/S0031-3203(99)00055-2]).
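The binarization step (S210) can be sketched as follows. This is only an illustration under stated assumptions, not the patented implementation: the function names, the BT.601 grayscale weights, and the Sauvola parameters `window`, `k`, and `r_max` are all choices made for this example, and the image is represented as plain nested lists so that the sketch needs only the standard library. A pixel exactly at the threshold is mapped to 0 here.

```python
from math import sqrt

def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to grayscale with BT.601 luma weights."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def sauvola_threshold(gray, y, x, window=15, k=0.2, r_max=128.0):
    """Per-pixel threshold T = m * (1 + k * (s / R - 1)).

    m and s are the mean and standard deviation of the window centered
    at (y, x); window, k, and r_max are illustrative values.
    """
    half = window // 2
    rows = range(max(0, y - half), min(len(gray), y + half + 1))
    cols = range(max(0, x - half), min(len(gray[0]), x + half + 1))
    values = [gray[j][i] for j in rows for i in cols]
    mean = sum(values) / len(values)
    std = sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean * (1.0 + k * (std / r_max - 1.0))

def binarize(gray, threshold=None):
    """Map pixels at or below the threshold to 0 and brighter pixels to 1.

    With threshold=None a Sauvola threshold is computed for every pixel;
    otherwise the given global threshold is used for all pixels.
    """
    result = []
    for y, row in enumerate(gray):
        out_row = []
        for x, value in enumerate(row):
            t = sauvola_threshold(gray, y, x) if threshold is None else threshold
            out_row.append(0 if value <= t else 1)
        result.append(out_row)
    return result
```

Calling `binarize(gray, threshold=128)` corresponds to the predetermined threshold case, and `binarize(gray)` to the case in which the threshold differs from pixel to pixel. The naive window loop is quadratic in the window size; a practical implementation would typically use integral images.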

Next, the processor 110 extracts homogeneous regions by dividing the document image along at least some of the white lines, based on the widths of the white lines and black lines of the binarized document image (S220). The processor 110 may extract the homogeneous regions by repeatedly dividing the binarized document image. Step S220 is described in detail below with reference to FIG. 3.
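To make the idea of dividing along white lines based on line widths concrete, the sketch below measures runs of white and black horizontal lines and cuts the image in the middle of unusually wide white runs. It assumes the convention in which 1 marks a black pixel. The cut rule (a white run wider than `factor` times the larger of the median white width and the median black width) and the names `row_runs` and `split_rows` are hypothetical, chosen for illustration; the passage above does not state the exact criterion.

```python
from statistics import median

def row_runs(binary):
    """Return [(is_white, start_row, length), ...] for consecutive rows.

    A row is a white line if it contains no black pixel (value 1).
    """
    runs = []
    for y, row in enumerate(binary):
        is_white = sum(row) == 0
        if runs and runs[-1][0] == is_white:
            kind, start, length = runs[-1]
            runs[-1] = (kind, start, length + 1)
        else:
            runs.append((is_white, y, 1))
    return runs

def split_rows(binary, factor=2.0):
    """Split the image into (top, bottom) row ranges at wide white runs."""
    runs = row_runs(binary)
    interior = runs[1:-1]  # leading and trailing runs are page margins or text edges
    white = [length for is_white, _, length in interior if is_white]
    black = [length for is_white, _, length in runs if not is_white]
    if not white or not black:
        return [(0, len(binary))]
    limit = factor * max(median(white), median(black))  # assumed rule
    cuts = [start + length // 2
            for is_white, start, length in interior
            if is_white and length > limit]
    bounds = [0] + cuts + [len(binary)]
    return list(zip(bounds[:-1], bounds[1:]))
```

Applying the same function to the transposed image would give the corresponding split along the other axis, and calling it again on each resulting range gives the repeated division described above.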

Next, the processor 110 analyzes the connected components included in the homogeneous regions to classify them into text components and non-text components, and obtains a text document including the text components and a non-text document including the non-text components (S230). The processor 110 may obtain the text document and the non-text document by analyzing each extracted homogeneous region individually and separating its connected components. Step S230 is described in detail below with reference to FIG. 3.
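As a rough illustration of step S230, the sketch below labels the connected components of one homogeneous region and records the bounding-box features (coordinates, width, height, area) that the summary mentions. It assumes that 1 marks a black pixel and uses 8-connectivity. The separation rule in `split_text_and_non_text`, which treats a component as non-text when it is much taller than the median component of the region, is a placeholder assumption rather than the criterion of the invention; it only shows why statistics taken inside a homogeneous region are meaningful.

```python
from collections import deque
from statistics import median

def connected_components(binary):
    """Label 8-connected groups of black pixels and return their box features."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0][x0] != 1 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            ys, xs = [], []
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                ys.append(y)
                xs.append(x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            width = max(xs) - min(xs) + 1
            height = max(ys) - min(ys) + 1
            components.append({"x": min(xs), "y": min(ys),
                               "width": width, "height": height,
                               "pixels": len(ys), "box_area": width * height})
    return components

def split_text_and_non_text(components, ratio=3.0):
    """Placeholder rule: components much taller than the median are non-text."""
    if not components:
        return [], []
    typical = median(c["height"] for c in components)
    text = [c for c in components if c["height"] <= ratio * typical]
    non_text = [c for c in components if c["height"] > ratio * typical]
    return text, non_text
```

Running the split repeatedly on the remaining text components until nothing more is removed mirrors the repetition described in the summary.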

Next, the processor 110 divides the text document based on the distance between the text components (S240). By dividing the text document based on the distance between the text components, the processor 110 can extract homogeneous text regions in which the distance between text components is uniform. The processor 110 may also dilate the text components through a morphology operation and then divide the text document based on the distance between the text components. For example, the processor 110 may dilate each text component by a fixed factor and then analyze the dilated text components to extract a homogeneous text region consisting of the text components that form a single body. Text components that form a single body are text components whose dilated versions make up one connected component. The processor 110 may also dilate the text components by different factors. For example, the processor 110 may set the dilation factor differently according to the characteristics of the homogeneous region that contains the text components: a large factor when the distance between the text components in the region is large, and a small factor when the distance is small. Setting the factor according to the characteristics of the homogeneous region allows the homogeneous text regions to be extracted more accurately.
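The grouping of text components into single bodies can be approximated as follows. Note that this sketch swaps in a simpler technique than a pixel-level morphological dilation: it inflates each bounding box by a margin and merges boxes that then overlap, using a union-find structure. Boxes are hypothetical `(x, y, width, height)` tuples, and `margin` plays the role of the dilation factor.

```python
def inflate(box, margin):
    """Grow an (x, y, width, height) box by margin on every side."""
    x, y, w, h = box
    return (x - margin, y - margin, w + 2 * margin, h + 2 * margin)

def overlaps(a, b):
    """True if two (x, y, width, height) boxes share any area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def group_text_components(boxes, margin):
    """Return lists of box indices whose inflated boxes form one connected body."""
    parent = list(range(len(boxes)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    grown = [inflate(b, margin) for b in boxes]
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if overlaps(grown[i], grown[j]):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(boxes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

A homogeneous region with widely spaced text would be passed a larger `margin` and a tightly set region a smaller one, which corresponds to choosing the dilation factor per region.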

Next, the processor 110 classifies the non-text components included in the non-text document based on feature values of the non-text components (S250). The processor 110 may classify the non-text components into, for example, line components, table components, separator components, and noise components. The feature values of a non-text component are the width, height, area, and the like of the non-text component and of the bounding box surrounding it.

For example, the processor 110 may classify non-text components with a very large or very small aspect ratio, and non-text components with a very high density, as line components. The processor 110 may classify as a table component a non-text component that has a low density, includes a rectangular component, has a bounding box that does not pass through other text components, and contains a number of text components within its bounding box. The processor 110 may classify as a separator component a non-text component that has a very low density (for example, 0.01 or less) and whose bounding box contains text components. The processor 110 may classify non-text components with a very small area as noise components. The processor 110 may classify the non-text components that remain after the line, table, separator, and noise components have been classified as image components. Because this classification is performed within a non-text document that contains only non-text components, the accuracy of classifying the non-text components can be improved.
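The rules above can be written as a small decision function. This is a sketch under assumptions: only the separator density limit of 0.01 comes from the text, while `noise_area`, `extreme_aspect`, `high_density`, and `low_density` are placeholder thresholds, the `NonTextComponent` fields are hypothetical names, and the noise test is placed first so that tiny specks are not mistaken for dense lines.

```python
from dataclasses import dataclass

@dataclass
class NonTextComponent:
    width: int                   # bounding-box width in pixels
    height: int                  # bounding-box height in pixels
    pixels: int                  # number of black pixels in the component
    text_inside: int = 0         # text components inside the bounding box
    crosses_text: bool = False   # bounding box passes through a text component
    has_rectangle: bool = False  # component includes a rectangular shape

    @property
    def density(self):
        """Fraction of the bounding box covered by black pixels."""
        return self.pixels / (self.width * self.height)

def classify(c, noise_area=9, extreme_aspect=20.0, high_density=0.9,
             low_density=0.2, separator_density=0.01):
    """Return 'noise', 'line', 'table', 'separator', or 'image'."""
    aspect = c.width / c.height
    if c.width * c.height <= noise_area:
        return "noise"
    if (aspect >= extreme_aspect or aspect <= 1 / extreme_aspect
            or c.density >= high_density):
        return "line"
    if (c.density <= low_density and c.has_rectangle
            and not c.crosses_text and c.text_inside >= 2):
        return "table"
    if c.density <= separator_density and c.text_inside >= 1:
        return "separator"
    return "image"  # whatever remains after the other classes
```

For instance, `classify(NonTextComponent(400, 2, 800))` falls into the line class because of its extreme aspect ratio, while a large, half-filled component with no text inside falls through to the image class.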

FIG. 3 is a flowchart illustrating a method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention. For convenience, the description also refers to FIG. 1.

The method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention begins with the processor 110 analyzing a document image and binarizing it into black and white (S210). Step S210 has been described with reference to FIG. 2, so a redundant description is omitted.

Next, the processor 110 obtains a curve, along the horizontal-axis or vertical-axis direction of the document image, that represents the frequency of black pixels among the pixels of the document image (S221). When the processor 110 analyzes the document image for the first time, it performs the following analysis on the entire area of the document image. FIGS. 4A to 4D illustrate an exemplary embodiment of processing a curve representing the frequency of black pixels in the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. As shown in FIG. 4B, the processor 110 may compute, from the document image shown in FIG. 4A, a histogram obtained by projecting the black pixels of the document image onto the vertical axis. However, as FIG. 4B also shows, the computed histogram has a very irregular profile and is therefore not suitable for analysis. Accordingly, as shown in FIG. 4C, the processor 110 may smooth the profile of the histogram to obtain a curve in the vertical-axis direction.

Exemplary equations for computing a histogram by projecting the document image onto the horizontal axis and for smoothing the computed histogram are as follows.

[Equation 1]

[Equation 2]

Here, x is the horizontal-axis coordinate of a pixel, y is its vertical-axis coordinate, a is the maximum horizontal coordinate, and b is the maximum vertical coordinate. f(x, y) is the pixel value at coordinates (x, y) and is either 0 (white pixel) or 1 (black pixel). p_x is the number of black pixels whose horizontal coordinate is x, and P is the set whose elements are p_x. Further, s is the kernel size used for smoothing, sp_x is the smoothed value of p_x, and SP is the set whose elements are sp_x.
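The projection and smoothing described by these definitions can be sketched as follows. The projection follows directly from the definition of p_x; the smoothing is written as a centered moving average of window size s, which is an assumption, since the exact form of the smoothing kernel is not recoverable from the text. The function names are illustrative.

```python
from typing import List


def projection_profile(image: List[List[int]]) -> List[int]:
    """Return p_x for every column x: the number of black pixels (value 1)
    whose horizontal coordinate is x. image[y][x] plays the role of f(x, y)."""
    if not image:
        return []
    width = len(image[0])
    return [sum(row[x] for row in image) for x in range(width)]


def smooth(profile: List[int], s: int) -> List[float]:
    """Return sp_x, here assumed to be a centered moving average of p_x
    with kernel size s; the window is truncated at the borders."""
    half = s // 2
    smoothed = []
    for x in range(len(profile)):
        lo = max(0, x - half)
        hi = min(len(profile), x + half + 1)
        window = profile[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed
```

Projecting onto the other axis only requires summing over rows instead of columns, for example by applying the same function to the transposed image.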

As described above, the processor 110 may project the black pixels onto the horizontal or vertical axis and then smooth the result, thereby obtaining a horizontal-axis or vertical-axis curve that represents the frequency of black pixels as a function of the horizontal or vertical coordinate of the document image. Here, a curve means a set of consecutive points.

Next, the processor 110 determines whether a heterogeneous region exists, based on the rate of change of the curve and a statistic of the black-pixel frequency (S222). As shown in FIG. 4D, the processor 110 may differentiate the obtained curve to compute its rate of change. Because the rate of change of the curve reveals the distribution of black pixels, it can serve as a criterion for judging the homogeneity of a region.

For example, the processor 110 may find the extrema at which the rate of change of the curve is zero, compute the distances between neighboring extrema, and compute the variance of those distances. A low variance of the distances between extrema means that the analyzed region is a homogeneous region, and a high variance means that the analyzed region is a heterogeneous region. For example, the processor 110 may determine a region in which the variance of the distances between extrema is 1.1 or less to be a homogeneous region, and a region in which the variance is greater than 1.1 to be a heterogeneous region. Referring to FIGS. 4A to 4D, the illustrated text region has constant distances between extrema and is therefore a homogeneous region. FIGS. 5A and 5B illustrate examples of a homogeneous region and a heterogeneous region in the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. Referring to FIG. 5A, the distances (d1, d2) between neighboring extrema are constant, so the region shown in FIG. 5A may be determined to be a homogeneous region. Referring to FIG. 5B, the distances (d3, d4) between neighboring extrema differ greatly, so the region shown in FIG. 5B may be determined to be a heterogeneous region.

Exemplary equations for computing the derivative of the curve, the extrema of the curve, the distances between extrema, and the variance of those distances are as follows.

[Equation 3]

[Equation 4]

[Equation 5]

[Equation 6]

Here, dsp_x is the slope of sp_x at horizontal coordinate x, and DSP is the set whose elements are dsp_x. x_i is the horizontal coordinate of an extremum at which the slope of sp_x is zero, and E is the set whose elements are x_i. de_i is the distance between neighboring extrema, and DE is the set whose elements are de_i. V is the variance of de_i, μ is the mean of de_i, and n is the number of de_i.
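A minimal sketch of the homogeneity test built from these quantities is given below. Several details are assumptions: the derivative is taken as a forward difference, an extremum is detected as a sign change of that difference (a sampled curve rarely has a slope of exactly zero), and V is computed as the population variance (division by n). The threshold of 1.1 is the example value from the description.

```python
from typing import List


def derivative(sp: List[float]) -> List[float]:
    """dsp_x approximated by the forward difference sp[x + 1] - sp[x]."""
    return [sp[x + 1] - sp[x] for x in range(len(sp) - 1)]


def extrema_positions(sp: List[float]) -> List[int]:
    """Coordinates x_i where the slope changes from rising to not rising
    (peak) or from falling to not falling (valley)."""
    d = derivative(sp)
    points = []
    for i in range(1, len(d)):
        peak = d[i - 1] > 0 >= d[i]
        valley = d[i - 1] < 0 <= d[i]
        if peak or valley:
            points.append(i)
    return points


def distance_variance(points: List[int]) -> float:
    """V: variance of the distances de_i between neighboring extrema."""
    de = [b - a for a, b in zip(points, points[1:])]
    if not de:
        return 0.0
    mu = sum(de) / len(de)
    return sum((v - mu) ** 2 for v in de) / len(de)


def is_homogeneous(sp: List[float], threshold: float = 1.1) -> bool:
    """A region counts as homogeneous when the variance does not exceed the threshold."""
    return distance_variance(extrema_positions(sp)) <= threshold
```

A profile with evenly spaced peaks and valleys, such as regularly spaced text lines, yields a variance near zero, whereas a profile mixing narrow and wide structures yields a large variance.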

Next, when a heterogeneous region exists, the processor 110 divides the document image with respect to at least some of the white lines, based on the widths of the white lines and the widths of the black lines included in the heterogeneous region (S223). When the processor 110 analyzes the document image for the first time, the analysis covers the entire area of the document image, so a heterogeneous region is likely to exist. FIGS. 6A and 6B illustrate an exemplary embodiment of dividing a document image in the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention.

Referring to FIG. 6A, the processor 110 may compute the widths (w1, w2, w3) of the white lines included in the heterogeneous region. The processor 110 may divide the heterogeneous region with respect to the white line whose width is greater than the median width and is the largest among the white lines included in the heterogeneous region. In FIG. 6A, the processor 110 divides the heterogeneous region with respect to the white line having the width w2, which is greater than the median and is the largest.

Referring to FIG. 6B, the processor 110 may compute the widths (b1, b2, b3, b4) of the black lines included in the heterogeneous region. The processor 110 may identify the black line whose width is greater than the median width and is the largest among the black lines included in the heterogeneous region, and may divide the heterogeneous region with respect to the white lines adjacent to the identified black line. In FIG. 6B, the processor 110 divides the heterogeneous region with respect to the two white lines adjacent to the black line having the width b3, which is greater than the median and is the largest. Dividing the heterogeneous region based on the widths of the black lines as well as the widths of the white lines provides the advantageous effect that homogeneous regions can be extracted more accurately.
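The two splitting rules of FIGS. 6A and 6B can be sketched on a one-dimensional projection profile. In this sketch a "white line" is assumed to be a run of positions whose black-pixel count is zero and a "black line" a run of non-zero positions; the function names and the tuple layout are illustrative, not part of the description.

```python
from statistics import median
from typing import List, Optional, Tuple

Run = Tuple[bool, int, int]  # (is_white, start, width)


def runs(profile: List[int]) -> List[Run]:
    """Group consecutive positions into alternating white and black runs."""
    grouped = []
    for pos, count in enumerate(profile):
        white = count == 0
        if grouped and grouped[-1][0] == white:
            grouped[-1][2] += 1
        else:
            grouped.append([white, pos, 1])
    return [tuple(r) for r in grouped]


def split_by_white(profile: List[int]) -> Optional[Run]:
    """FIG. 6A rule: the widest white run, if it is wider than the median white width."""
    whites = [r for r in runs(profile) if r[0]]
    if not whites:
        return None
    widest = max(whites, key=lambda r: r[2])
    return widest if widest[2] > median([r[2] for r in whites]) else None


def split_by_black(profile: List[int]) -> List[Run]:
    """FIG. 6B rule: the white runs adjacent to the widest black run,
    if that black run is wider than the median black width."""
    all_runs = runs(profile)
    blacks = [(i, r) for i, r in enumerate(all_runs) if not r[0]]
    if not blacks:
        return []
    idx, widest = max(blacks, key=lambda t: t[1][2])
    if widest[2] <= median([r[2] for _, r in blacks]):
        return []
    # Runs alternate, so the neighbors of a black run are always white.
    return [all_runs[j] for j in (idx - 1, idx + 1) if 0 <= j < len(all_runs)]
```

The returned runs give the positions at which the region would be cut; how the cut is placed inside a white run (for example at its center) is left open here.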

The processor 110 may repeat the step of obtaining a curve (S221), the step of determining whether a heterogeneous region exists (S222), and the step of dividing the document image (S223) until no heterogeneous region remains. That is, after dividing the document image, the processor 110 may analyze each of the divided regions to determine whether it is a heterogeneous region, and may divide any heterogeneous region again. When, after repeated division, every divided region is determined to be a homogeneous region, the processor 110 stops dividing.
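The repetition of steps S221 to S223 amounts to splitting regions until all of them pass the homogeneity test. The sketch below expresses that loop with two caller-supplied functions, `is_homogeneous` and `split`, which are hypothetical placeholders for the analysis and division steps. The guard that keeps a region which cannot be split further is an added assumption to guarantee termination.

```python
from typing import Callable, List, TypeVar

Region = TypeVar("Region")


def extract_homogeneous(region: Region,
                        is_homogeneous: Callable[[Region], bool],
                        split: Callable[[Region], List[Region]]) -> List[Region]:
    """Split regions repeatedly until every remaining region is homogeneous."""
    pending = [region]
    finished = []
    while pending:
        current = pending.pop()
        if is_homogeneous(current):
            finished.append(current)
            continue
        parts = split(current)
        if len(parts) < 2:
            # No usable split line was found; keep the region as it is.
            finished.append(current)
        else:
            pending.extend(parts)
    return finished
```

For instance, with intervals as regions, "homogeneous" meaning a length of at most 2, and a split into two halves, the interval (0, 8) is reduced to four intervals of length 2.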

After the division of the document image in the vertical-axis direction is complete, the processor 110 may divide the document image in the horizontal-axis direction using the same method. When the division in the horizontal-axis direction is complete, the document image consists of rectangular regions, all of which are homogeneous regions. The subsequent steps are performed based on the extracted homogeneous regions. Repeating the division of the document image to extract homogeneous regions, and only then classifying the components included in the document, provides the advantageous effect that the document structure can be analyzed more accurately.

Next, as part of obtaining the text document and the non-text document, the processor 110 analyzes the connected components and extracts the non-text components among them (S231). A connected component is a set of black pixels that are connected to one another in the document image. For example, all black pixels of the letter “ㅐ” are connected, so one connected component is extracted from the letter “ㅐ”, whereas the letter “ㅔ” has a part in which the black pixels are separated, so two connected components are extracted from the letter “ㅔ”. The processor 110 may extract connected components from each homogeneous region included in the document image and label the extracted connected components. The processor 110 may classify each connected component as a text component or a non-text component based on feature values including the coordinates, width, height, and area of the connected component and of the bounding box surrounding it. A bounding box is the rectangle of minimum area that contains one connected component and whose sides are parallel to the horizontal and vertical axes of the document image. A bounding box may be defined by the coordinates of its upper-left and lower-right points. The processor 110 may extract the non-text components among the connected components through two filtering passes.
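Extraction and labeling of connected components, together with their bounding boxes, can be sketched with a breadth-first flood fill. The use of 8-connectivity is an assumption, since the description does not state whether diagonal neighbors count as connected; the dictionary layout is likewise illustrative.

```python
from collections import deque
from typing import Dict, List


def connected_components(image: List[List[int]]) -> List[Dict]:
    """Label sets of connected black pixels (value 1) and return, for each,
    its label, its area in pixels, and its bounding box as
    (left, top, right, bottom) in inclusive pixel coordinates."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] != 1 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                # Visit the 8 neighbors (assumed connectivity).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            xs = [p[0] for p in pixels]
            ys = [p[1] for p in pixels]
            components.append({
                "label": len(components) + 1,
                "area": len(pixels),
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
            })
    return components
```

The bounding box is stored as its upper-left and lower-right corners, matching the definition given above.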

As a primary filtering that roughly extracts non-text components, the processor 110 may extract connected components satisfying preset conditions as non-text components. Because the primary filtering is intended to quickly extract connected components that are very likely to be non-text components, the conditions need to be set carefully. For example, the processor 110 may extract as non-text components connected components that are too small to be recognized visually, connected components whose density is so low that they are likely to be hollow shapes, and connected components whose aspect ratio is so extremely large or small that they are likely to be lines.

Exemplary equations expressing the conditions for extracting non-text components in the primary filtering are as follows.

[Equation 7]

[Equation 8]

[Equation 9]

[Equation 10]

Here, CC_i is the i-th connected component included in a particular homogeneous region, CC_area(CC_i) is the area of CC_i, that is, the number of pixels occupied by CC_i, and T_area is a threshold. For example, T_area may be 6. B_CC_i is the bounding box of CC_i, Inside(B_CC_i) is the number of bounding boxes of other connected components contained inside B_CC_i, and T_inside is a threshold. For example, T_inside may be 3. B_dens(CC_i) is the density of CC_i, that is, the number of pixels occupied by CC_i divided by the number of pixels occupied by B_CC_i, and T_dens is a threshold. For example, T_dens may be 0.06. W_i is the width of B_CC_i, H_i is the height of B_CC_i, HW_ratio(CC_i) is the smaller of W_i and H_i divided by the larger, and T_HWratio is a threshold. For example, T_HWratio may be 0.06.

In the primary filtering, the conditions given by Equations 7, 8, 9, and 10 are exemplary; the primary filtering is not limited to them, and various other conditions for extracting non-text components may be used.
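The primary filtering conditions can be sketched from the definitions of the four quantities and their example thresholds. The direction of each inequality is an assumption inferred from the prose (too small, too many enclosed boxes, too sparse, too elongated), because the equations themselves are not reproduced in the text. The component layout (a dictionary with `area` and `bbox`) is illustrative.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive


def bbox_size(bbox: Box) -> Tuple[int, int]:
    """Width W_i and height H_i of a bounding box."""
    left, top, right, bottom = bbox
    return right - left + 1, bottom - top + 1


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def is_non_text_primary(i: int, comps: List[Dict], t_area: float = 6,
                        t_inside: int = 3, t_dens: float = 0.06,
                        t_hw: float = 0.06) -> bool:
    """Apply the four primary-filter tests to component i (assumed inequalities)."""
    cc = comps[i]
    w, h = bbox_size(cc["bbox"])
    inside = sum(1 for j, other in enumerate(comps)
                 if j != i and contains(cc["bbox"], other["bbox"]))
    density = cc["area"] / (w * h)       # B_dens(CC_i)
    hw_ratio = min(w, h) / max(w, h)     # HW_ratio(CC_i)
    return (cc["area"] < t_area          # too small to be visible
            or inside > t_inside         # encloses many other components
            or density < t_dens          # likely a hollow shape
            or hw_ratio < t_hw)          # likely a line
```

The default thresholds are the example values 6, 3, 0.06, and 0.06 given above; a component that fails all four tests is passed on to the secondary filtering.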

From the connected components not extracted in the primary filtering, the processor 110 may extract further non-text components through a secondary filtering that is more precise than the primary filtering.

In the secondary filtering, the processor 110 may first extract connected components having at least a certain area as candidate components. For example, the processor 110 may extract as a candidate component any connected component whose area is larger than a certain multiple of the median area of all connected components included in the same homogeneous region. Because the areas of text components within the same homogeneous region do not differ greatly, a connected component whose area differs greatly from that of the other connected components in the same homogeneous region may be a non-text component.

Based on the coordinates of each extracted candidate component and of its bounding box, the processor 110 may compute the distances between the bounding box of the candidate component and the bounding boxes of other connected components; based on these distances, it may identify the connected components that are horizontally adjacent and compute the distances between horizontally adjacent connected components. Text components within the same homogeneous region are spaced at regular distances, and each is adjacent to one connected component on its left and one connected component on its right. Based on these properties of text components, the processor 110 may extract non-text components according to the number of connected components horizontally adjacent to a candidate component and the distances between the candidate component and its horizontally adjacent connected components.

An exemplary equation for extracting candidate components in the secondary filtering is as follows.

[Equation 11]

Here, CCs is the set of connected components included in the homogeneous region that contains the candidate component, CC_j denotes each connected component included in CCs, CC_i is the candidate component, median(CC_area) is the median area of all connected components included in CCs, and k is a threshold.
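Candidate extraction can be sketched as a comparison of each area against k times the median area of the region. This reading of the condition follows the prose description of the secondary filtering; the function name and the choice of a strict inequality are assumptions.

```python
from statistics import median
from typing import List


def candidate_components(areas: List[float], k: float) -> List[int]:
    """Return the indices of components whose area exceeds k times the
    median area of all components in the same homogeneous region."""
    if not areas:
        return []
    limit = k * median(areas)
    return [i for i, area in enumerate(areas) if area > limit]
```

For example, in a region where four characters have areas around 20 pixels and one embedded figure has an area of 400 pixels, a value of k = 5 selects only the figure as a candidate.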

k is a configurable constant: setting k large improves the accuracy of extraction, and setting k small shortens the computing time. An exemplary equation for computing a suitable value of k that takes both accuracy and computing time into account is as follows.

[Equation 12]

Here, mean(CC_area) is the mean area of all connected components included in the homogeneous region that contains the candidate component.

Exemplary equations for extracting the non-text components among the candidate components are as follows.

[Equation 13]

[Equation 14]

FIG. 7 is a conceptual diagram illustrating the relationships between connected components in the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. Referring to FIG. 7, in Equations 13 and 14, LNN(CC_i) is the connected component nearest to CC_i on its left, LNN(CC_j) is the connected component nearest to CC_j on its left, num{LNN(CC_i)} is the number of connected components nearest to CC_i on its left, and max_2nd(CC_area(CC_i)) is the second-largest area among the areas of CC_i. LNWS(CC_i) is the distance between CC_i and LNN(CC_i), RNWS(CC_i) is the distance between CC_i and RNN(CC_i), RNN(CC_i) is the connected component nearest to CC_i on its right, and MeanWS is the mean of RNWS(CC_i). The processor 110 may extract as a non-text component any candidate component that satisfies at least one of Equations 13 and 14.
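The neighbor quantities LNN, RNN, LNWS, and RNWS can be sketched as a search for the nearest bounding box on each side. The sketch assumes that two components are horizontally adjacent only if their bounding boxes overlap vertically, and that the distance is the white gap in pixels between the boxes; neither detail is stated in the description, and the conditions of Equations 13 and 14 themselves are not reproduced here.

```python
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive
Neighbour = Optional[Tuple[int, int]]  # (index of neighbour, gap in pixels)


def horizontal_neighbours(i: int, boxes: List[Box]) -> Tuple[Neighbour, Neighbour]:
    """Return (left, right) for box i, where each entry is the index of the
    nearest box on that side and the white gap to it, or None if there is none."""
    x0, y0, x1, y1 = boxes[i]
    left: Neighbour = None
    right: Neighbour = None
    for j, (a0, b0, a1, b1) in enumerate(boxes):
        if j == i or b1 < y0 or b0 > y1:
            continue  # skip the box itself and boxes without vertical overlap
        if a1 < x0:  # box j lies entirely to the left
            gap = x0 - a1 - 1
            if left is None or gap < left[1]:
                left = (j, gap)
        elif a0 > x1:  # box j lies entirely to the right
            gap = a0 - x1 - 1
            if right is None or gap < right[1]:
                right = (j, gap)
    return left, right
```

In a regular text line every interior character has one neighbor on each side at a roughly constant gap, which is the property the secondary filtering relies on.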

As described above, using a first filtering that roughly extracts non-text components through simple operations and a second filtering that extracts all remaining non-text components through precise operations provides the advantageous effect that non-text components can be extracted accurately within a short computing time.

Next, the processor 110 separates the non-text components from the homogeneous regions and stores a non-text document including the non-text components (S232). After separating the extracted non-text components, the processor 110 may collect them to obtain a non-text document, which is a set of non-text components, and may store the obtained non-text document in the memory 120.

Next, the processor 110 determines whether any separated non-text component exists (S233). The processor 110 may make this determination by checking whether a non-text component has been extracted. For example, the processor 110 may compute the pixel distribution of the document image before the non-text components are separated and the pixel distribution of the document image after they are separated; if the two differ, it may determine that a separated non-text component exists, and if the two are identical, it may determine that no separated non-text component exists.

The processor 110 may repeat the step of extracting non-text components (S231) and the step of separating the non-text components and storing the non-text document (S232) until no further non-text component is extracted. For example, the processor 110 may repeat steps S231 and S232 until the pixel distribution of the document image before the non-text components are separated is identical to the pixel distribution of the document image after they are separated.
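The repetition of steps S231 and S232 can be sketched as a loop that runs until a pass extracts nothing. The `extract_non_text` callback is a hypothetical stand-in for the two filtering passes and is assumed to return a subset of the components it is given; checking for an empty result replaces the pixel-distribution comparison mentioned above.

```python
from typing import Callable, List, Tuple, TypeVar

Component = TypeVar("Component")


def separate_until_stable(
    components: List[Component],
    extract_non_text: Callable[[List[Component]], List[Component]],
) -> Tuple[List[Component], List[Component]]:
    """Repeatedly move extracted non-text components out of the working set
    until a pass finds none; return (text, non_text)."""
    text = list(components)
    non_text: List[Component] = []
    while True:
        found = extract_non_text(text)
        if not found:
            break  # nothing was separated in this pass, so the result is stable
        non_text.extend(found)
        text = [c for c in text if c not in found]
    return text, non_text
```

Repetition matters because removing a very large component changes the statistics of the region (for example its median area), so that a moderately large component missed in one pass can be caught in the next.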

The processor 110 may collect the connected components that remain after the non-text components are removed, that is, the text components, to obtain a text document, which is a set of text components, and may store the obtained text document in the memory 120.

FIG. 8 illustrates an example of a text document and a non-text document obtained by the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. Referring to FIG. 8, the processor 110 may binarize a document image 710 to obtain a binarized document image 720, divide the binarized document image 720 to extract homogeneous regions, and then analyze the binarized document image 720 region by region to extract text components and non-text components from it, thereby obtaining a text document 730 containing only text components and a non-text document 740 containing only non-text components. The document image 710, the binarized document image 720, the text document 730, and the non-text document 740 may all have the same size.

Next, the processor 110 divides the text document based on the distances between text components (S240), and classifies the non-text components included in the non-text document based on the feature values of the non-text components (S250). Steps S240 and S250 may be performed simultaneously, sequentially, or in reverse order. Steps S240 and S250 have been described with reference to FIG. 2, so a redundant description is omitted.

FIG. 9 illustrates an example of a document image analyzed by the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. Referring to FIG. 9, the text document is divided into homogeneous text regions (911, 912, 913, 914, 915, 916, 917, 918, 919) based on the distances between text components. The non-text components included in the non-text document are classified into a table component (941), which has a low density and contains text inside it; line components and separator components (931, 932, 933, 934, 936), which consist of straight lines; and image components (921, 922, 923, 924), which are neither table, line, nor separator components. As shown in FIG. 9, the method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention can accurately classify each component of the document image by type.

FIG. 10 is a graph showing the success rate of the method of analyzing a document structure using a homogeneous region according to some embodiments of the present invention. The success rates shown in FIG. 10 are values computed using the F-measure. The success rates of Comparative Examples 1, 2, 3, and 4 shown in FIG. 10 are those obtained using “Fraunhofer”, “FineReader”, “Tesseract”, and “DICE”, respectively.

Referring to FIG. 10, when the method of analyzing a document structure using a homogeneous region according to an embodiment of the present invention is used, the classification success rate is 96.88% for text components and 92.27% for non-text components. Compared with Comparative Examples 1, 2, 3, and 4, and for non-text components in particular, the classification success rate of the method according to an embodiment of the present invention is about 20% higher than those of the comparative examples. As described above, extracting homogeneous regions through repeated division and then classifying non-text components by repeating the primary and secondary filtering achieves an excellent classification success rate for non-text components.

In this specification, each block or step may represent a module, a segment, or a portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that, in some alternative embodiments, the functions noted in the blocks or steps may occur out of order. For example, two blocks or steps shown in succession may in fact be performed substantially simultaneously, or may sometimes be performed in reverse order, depending on the functions involved.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integral with the processor. The processor and the storage medium may reside within an application-specific integrated circuit (ASIC). The ASIC may reside within a user terminal. Alternatively, the processor and the storage medium may reside as discrete components within a user terminal.

Although the embodiments of the present invention have been described in detail with reference to the accompanying drawings, the present invention is not necessarily limited to these embodiments, and various modifications may be made without departing from the technical idea of the present invention. Accordingly, the embodiments disclosed in the present invention are intended to describe, not to limit, the technical idea of the present invention, and the scope of the technical idea of the present invention is not limited by these embodiments. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of protection of the present invention should be construed according to the following claims, and all technical ideas within the scope of their equivalents should be construed as falling within the scope of the present invention.

100 Method for analyzing a document structure using a homogeneous region
110 Processor
120 Memory
d1, d2, d3, d4 Distance between neighboring extreme points
w1, w2, w3 Width of white line
d1, d2, d3, d4 Width of black line
710 Document image
720 Binarized document image
730 Text document
740 Non-text document
911, 912, 913, 914, 915, 916, 917, 918, 919 Homogeneous text regions
921, 922, 923, 924 Image components
931, 932, 933, 934, 936 Line components and separator components
941 Table component

Claims

A method for analyzing a document structure using a homogeneous region, the method comprising:
analyzing a document image to binarize the document image into black and white;
extracting a homogeneous region by dividing the document image along at least some of the white lines, based on a width of the white lines and a width of the black lines formed by binarizing the document image;
classifying connected components included in the homogeneous region into text components and non-text components by analyzing the connected components, and obtaining a text document including the text components and a non-text document including the non-text components;
dividing the text document based on a distance between the text components; and
classifying the non-text components included in the non-text document based on a property value of the non-text components.

The method according to claim 1,
wherein extracting the homogeneous region comprises:
obtaining a curve, along the horizontal-axis or vertical-axis direction of the document image, indicating the frequency of black pixels among the pixels of the document image;
determining whether a heterogeneous region exists based on a rate of change of the curve and a statistical value of the frequency of the black pixels; and
when the heterogeneous region exists, dividing the document image along at least some of the white lines, based on the width of the white lines and the width of the black lines included in the heterogeneous region.
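
The curve and the white-line and black-line widths recited in this claim can be illustrated with a minimal sketch. It assumes the binarized image is a list of rows holding 0 (white) and 1 (black), and it only handles the horizontal-line direction. The function names and the rule "cut at an interior white line wider than `factor` times the median black-line width" are hypothetical and stand in for the criterion of the described method, which is not reproduced here.

```python
from statistics import median

def row_profile(image):
    """Black-pixel count per row (the 'curve' along the vertical axis)."""
    return [sum(row) for row in image]

def runs(profile):
    """Group consecutive rows into (is_black, start, length) runs.
    A white run is a white line; a black run is a black line."""
    out = []
    for i, count in enumerate(profile):
        is_black = count > 0
        if out and out[-1][0] == is_black:
            kind, start, length = out[-1]
            out[-1] = (kind, start, length + 1)
        else:
            out.append((is_black, i, 1))
    return out

def split_rows(image, factor=1.5):
    """Return row indices at which to cut the image.
    Hypothetical criterion: an interior white line is a cut candidate
    when its width exceeds factor * median black-line width."""
    all_runs = runs(row_profile(image))
    black_widths = [length for is_black, _, length in all_runs if is_black]
    if not black_widths:
        return []
    limit = factor * median(black_widths)
    cuts = []
    for idx, (is_black, start, length) in enumerate(all_runs):
        interior = 0 < idx < len(all_runs) - 1
        if not is_black and interior and length > limit:
            cuts.append(start + length // 2)  # cut in the middle of the white line
    return cuts
```

Applying `split_rows` again to each resulting part until it returns no cut would correspond to repeating the division until no heterogeneous region remains.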

The method of claim 2,
further comprising repeating the obtaining of the curve, the determining of whether the heterogeneous region exists, and the dividing of the document image until no heterogeneous region exists.

The method according to claim 1,
wherein obtaining the text document and the non-text document comprises:
analyzing the connected components to extract the non-text components from among the connected components; and
separating the non-text components from the homogeneous region and storing the non-text document including the non-text components.

The method of claim 4,
wherein extracting the non-text component comprises:
classifying the connected component as the text component or the non-text component based on property values including the coordinates, width, height and area of the connected component and of the box surrounding the connected component.
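
The property values named in this claim (coordinates, width, height and area of a connected component and its surrounding box) can be computed as in the following minimal sketch. It assumes a binarized image given as a list of rows of 0 (white) and 1 (black) and uses 8-connectivity. The classification rule in `classify`, which compares each box with the median box size in the region, and the parameter `factor` are hypothetical and only illustrate how such property values could drive the decision.

```python
from collections import deque
from statistics import median

def connected_components(image):
    """Find 8-connected black components and return their property values."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for y0 in range(height):
        for x0 in range(width):
            if image[y0][x0] == 0 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            top, bottom, left, right, pixels = y0, y0, x0, x0, 0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                pixels += 1
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            box_w, box_h = right - left + 1, bottom - top + 1
            components.append({"x": left, "y": top, "width": box_w,
                               "height": box_h, "box_area": box_w * box_h,
                               "pixels": pixels})
    return components

def classify(components, factor=3.0):
    """Hypothetical rule: a box much taller or wider than the median is non-text."""
    if not components:
        return [], []
    med_h = median(c["height"] for c in components)
    med_w = median(c["width"] for c in components)
    text, non_text = [], []
    for c in components:
        too_big = c["height"] > factor * med_h or c["width"] > factor * med_w
        (non_text if too_big else text).append(c)
    return text, non_text
```

Because the medians are taken over a single homogeneous region, components in that region are compared only with neighbors of similar size, which is the motivation the description gives for classifying within homogeneous regions.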

The method of claim 4, further comprising:
repeating the extracting of the non-text component and the storing of the non-text document until no non-text component is extracted; and
storing the text document, from which the non-text components have been removed and in which the text components are included.

The method according to claim 1,
wherein classifying the non-text components comprises:
classifying the non-text components into line components, table components, separator components, and noise components.

The method according to claim 1,
wherein dividing the text document comprises:
expanding the text components through a morphology operation and then dividing the text document based on the distance between the text components.
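
The effect of expanding text components before dividing by distance can be illustrated with a minimal sketch. It assumes a binarized image given as a list of rows of 0 (white) and 1 (black), and it uses binary dilation with a horizontal structuring element as the morphology operation. The function names and the `radius` parameter are hypothetical, and the choice of structuring element is an assumption rather than the one used in the described method.

```python
def dilate_horizontal(image, radius):
    """Binary dilation with a 1 x (2 * radius + 1) structuring element."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            lo, hi = max(0, x - radius), min(width, x + radius + 1)
            if any(image[y][lo:hi]):
                out[y][x] = 1
    return out

def column_blocks(image):
    """Return (start, end) column spans, end exclusive, that contain black pixels."""
    width = len(image[0])
    occupied = [any(row[x] for row in image) for x in range(width)]
    spans, start = [], None
    for x, flag in enumerate(occupied):
        if flag and start is None:
            start = x
        elif not flag and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans
```

After dilation, components separated by a gap of at most `2 * radius` columns merge into one block, while components farther apart stay separate, so the remaining gaps mark where the text document is divided.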

The method according to claim 1,
wherein the document image, the text document and the non-text document have the same size.

A set of instructions for:
analyzing a document image to binarize the document image into black and white;
extracting a homogeneous region by dividing the document image along at least some of the white lines, based on a width of the white lines and a width of the black lines included in the document image;
analyzing connected components included in the homogeneous region to classify the connected components into text components and non-text components, and obtaining a text document including the text components and a non-text document including the non-text components;
dividing the text document based on a distance between the text components; and
classifying the non-text components included in the non-text document based on a property value of the non-text components.

A processor; and
a memory,
wherein the processor is configured to:
analyze a document image to binarize the document image into black and white;
extract a homogeneous region by dividing the document image along at least some of the white lines, based on a width of the white lines and a width of the black lines included in the document image;
analyze connected components included in the homogeneous region to classify the connected components into text components and non-text components, and obtain a text document including the text components and a non-text document including the non-text components;
divide the text document based on a distance between the text components; and
classify the non-text components included in the non-text document based on a property value of the non-text components,
and wherein the memory stores at least one of the document image, the homogeneous region, the connected components, the text document, and the non-text document.