KR20010104164A

KR20010104164A - A method for automatically analyzing document layout

Info

Publication number: KR20010104164A
Application number: KR1020000025639A
Authority: KR
Inventors: 이성환
Original assignee: 이성환
Priority date: 2000-05-13
Filing date: 2000-05-13
Publication date: 2001-11-24

Abstract

A method is disclosed for automatically and accurately analyzing and storing a document, regardless of typeface size, line spacing, or document structure, without additional steps such as parameter input. The object of the invention is to provide an automatic document structure analysis method that can segment document images with varied typefaces, line spacings, and document structures independently of parameters. The input document image is divided into maximal homogeneous regions. To classify each region accurately as text, picture, table, or line, the image is organized hierarchically into a pyramid structure and analyzed in multiple stages; periodicity is measured to decide whether a region should be divided; and texture-based analysis is applied to regions where periodicity is hard to find. As a result, documents with varied typeface sizes, line spacings, and document structures can be analyzed automatically and accurately, independently of parameters.

Description

A METHOD FOR AUTOMATICALLY ANALYZING DOCUMENT LAYOUT

The present invention relates to an automatic document structure analysis method and, more particularly, to a method that uses multi-level analysis, periodicity measurement, and texture analysis to analyze and store a document accurately regardless of typeface size, line spacing, and document structure.

With the advance of information technology and the computerization of industry, the use of documents carrying information has grown steadily, and because paper is easy to read, such documents continue to be published as newspapers, magazines, reports, books, and similar publications.

However, publishing large volumes of documents increases both publication cost and storage space. If large volumes of paper documents could be converted automatically into electronic documents, content would become easier to search and storage space could be reduced dramatically.

Automatic conversion from paper documents to electronic documents is not easy, however, and document structure analysis must be performed before such conversion can be carried out.

Document structure analysis and document segmentation methods fall into two representative groups. The bottom-up approach starts from local information such as black pixels or connected components, forms words, and successively merges words into lines and lines into paragraphs. The top-down approach starts from global information about the document, splits the document into columns, and successively splits columns into blocks, blocks into lines, and lines into words.

Among bottom-up approaches, the connected-component growing method is widely used: connected components are extracted from the input document image, and components of the same type are merged repeatedly so that regions grow gradually.

This method generally has high time complexity for classifying, analyzing, and growing connected components, and the complexity grows exponentially when there are many connected components, as in a newspaper.

To reduce this time complexity, a method was recently proposed that adds to the connected-component growing method a hierarchical document structure built with a minimum-cost spanning tree, together with heuristics. However, this method can still produce segmentation errors caused by incorrect merging at an early stage.

The top-down approach, by contrast, has lower time complexity than the bottom-up approach and resembles the way a person views an object, from low resolution to high resolution. Its drawback is that segmentation becomes difficult for documents with complex structures that cannot be divided into rectangles or with a variety of font sizes.

Among the many methods for geometric document structure analysis and document segmentation, texture-based methods treat regions such as text, pictures, and graphics as textured regions and segment the document by finding regions of uniform texture. Their drawback is very high time complexity. To characterize a textured region, several filters tuned to the desired local spatial frequencies and orientations must be used, so many masks are needed to extract local information. Reducing the mask size makes it impossible to detect regions with coarse texture, such as text in a large font, while enlarging the mask increases the time complexity.

To overcome these drawbacks, multi-level analysis was introduced, which extracts local features at each resolution with only two filters. Even so, extracting multi-level texture information is time-consuming, and regions of different types that have similar textures are often confused or merged.

Apart from the methods above, Antonacopoulos used white space to extract region contours, and Jain and Yu used a typical bottom-up method for document segmentation and region classification together with a document model carrying top-down information for logical structure analysis. However, the performance of these methods depends on thresholds and parameters, so several thresholds and parameters must unavoidably be set.

The present invention was devised to remedy these drawbacks. Its object is to provide an automatic document structure analysis method that can segment document images with varied typefaces, line spacings, and document structures independently of parameters, by means of a periodicity measurement that extracts the periodic characteristics of a document and texture-based analysis for regions from which periodic characteristics cannot be extracted.

To achieve this object, the automatic document structure analysis method of the present invention comprises the steps of:
a) determining whether a document to be analyzed has been input and, when it has, converting the input document from low resolution to high resolution to generate a pyramid structure and extracting multi-level images of the document;
b) performing connected component analysis on the top-level (lowest-resolution) image of the pyramid structure of step a) to extract the bounding rectangles of the document, and measuring the periodicity of a given region in the horizontal and vertical directions of the images other than the top-level image to extract periodic characteristics;
c) determining whether the region of step b) has a single period and, if it does not, detecting the position at which to divide the document, dividing it into a plurality of regions, and returning to step b);
d) if the period measured in step c) is determined to be a single period, classifying the maximal homogeneous regions into text, pictures, tables, and lines; and
e) supplying the text classified in step d) to a character recognizer, compressing the pictures, regenerating the tables, vectorizing graphics such as lines, and storing the results in a database.

According to the present invention, the document to be analyzed is organized hierarchically into a pyramid structure and processed from low resolution to high resolution; a periodicity measurement capable of extracting the periodic characteristics of a region is used to divide the document into regions having a single period; and, after the periodicity of each divided region has been detected, the maximal homogeneous regions are classified into text, pictures, tables, and lines and stored. The structure of a document can thus be analyzed automatically and accurately regardless of typeface size, line spacing, and document structure.

FIG. 1 shows the configuration of a document input device to which the present invention is applied.

FIG. 2 is a flowchart illustrating the automatic document analysis and storage process according to the present invention.

FIG. 3 shows the process of measuring periodicity according to the present invention.

FIG. 4 shows examples of dividing, in the horizontal direction, a document to which the present invention is applied.

<Description of reference numerals for the main parts of the drawings>

10: document input unit    20: processor

30: database    40: display unit

Hereinafter, the present invention is described in more detail through an embodiment.

FIG. 1 is a block diagram of a document input device to which the present invention is applied. The device comprises a document input unit (10) for inputting a document; a processor (20) that analyzes the input document, classifies it into text, graphic, picture, and table regions, and processes each classified region, feeding text regions to a character recognizer for conversion to ASCII codes, compressing picture regions, regenerating table regions, and vectorizing graphic regions such as lines; a database (30) that stores the output of the processor (20); and a display unit (40) that displays documents stored in the database (30) at the user's request.

FIG. 2 is a flowchart illustrating the geometric document structure analysis and storage process according to the present invention.

The operation of the flowchart of FIG. 2 is described in detail below with reference to FIG. 1.

A document input through the document input unit (10) is supplied to the processor (20). After confirming that a document has been input (step S11), the processor (20) proceeds to step S12, in which the document is built into a pyramid structure and multi-level images are extracted.

Step S12 repeatedly halves the width and the height of the input image, reducing it to 1/4 of its size each time, until the resolution falls below 50 × 50 pixels. This yields multi-level images ranging from the lowest resolution to the highest.

Here, each pixel $S^{(n)}$ at level $n$ (an integer) corresponds to four pixels $S_i^{(n-1)}$ ($i = 1, 2, 3, 4$) at the finer level $n-1$, and these four pixels are the children of $S^{(n)}$. If at least one of the four child pixels is black, the parent pixel $S^{(n)}$ is also assigned black. Taking a black pixel as 1 and a white pixel as 0, this rule can be written as follows.

$$S^{(n)} = \begin{cases} 1, & \text{if } \sum_{i=1}^{4} S_i^{(n-1)} \ge 1 \\ 0, & \text{otherwise} \end{cases}$$

The number of pixels at level $n$ is therefore 1/4 of the number of pixels at level $n-1$. In this way a pyramid structure of the input document is generated and the multi-level images are output.
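The pyramid construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function names `reduce_level` and `build_pyramid`, the list-of-lists image format, and the stopping rule (stop once both dimensions are below `min_size`) are assumptions made for the example.

```python
def reduce_level(img):
    """Halve width and height: a parent pixel is black (1) if any child is black."""
    h, w = len(img), len(img[0])
    parent = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            # Collect the (up to) four children of the parent pixel; edge blocks may be smaller.
            block = [img[yy][xx]
                     for yy in (y, y + 1) if yy < h
                     for xx in (x, x + 1) if xx < w]
            row.append(1 if any(block) else 0)
        parent.append(row)
    return parent


def build_pyramid(img, min_size=50):
    """Return levels from full resolution (index 0) to the coarsest level.

    Assumption: reduction stops once both dimensions are below min_size.
    """
    min_size = max(min_size, 2)  # guard against a loop that can never terminate
    levels = [img]
    while len(levels[-1]) >= min_size or len(levels[-1][0]) >= min_size:
        levels.append(reduce_level(levels[-1]))
    return levels
```

Because the reduction is a logical OR, a single black pixel survives at every coarser level, so thin strokes are not lost while the image shrinks.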

After the multi-level images have been output in step S12, the processor (20) proceeds to step S13, in which the skew angle of the document is computed and a projection histogram is generated in order to extract the periodicity of the text, picture, table, and graphic regions from the multi-level images.

That is, the skew angle (θ) of the document is obtained, and the horizontal (or vertical) projection histogram is computed along the direction given by this angle using the following equation.

Here, $I(x, y)$ is the intensity value in an image of size width × height, and $P_H^{(n)}$ is the horizontal projection histogram at the $n$-th level. FIG. 3(b) shows an example of the projection histogram for the skewed text region of FIG. 3(a).
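As a rough illustration of a horizontal projection histogram taken along a skewed direction, the sketch below sums pixel values into bins indexed by each pixel's coordinate perpendicular to the text lines. The patent's own equation is not reproduced in this text, so the binning rule `y*cos(theta) - x*sin(theta)`, the function name `projection_histogram`, and the sign convention of `theta` are all assumptions.

```python
import math


def projection_histogram(img, theta=0.0):
    """Sum pixel values along lines tilted by theta (radians); theta = 0 gives row sums."""
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    h, w = len(img), len(img[0])

    def bin_of(x, y):
        # Coordinate of the pixel along the axis perpendicular to the (skewed) text lines.
        return int(round(y * cos_t - x * sin_t))

    # The bin index is linear in (x, y), so its extremes occur at the image corners.
    corners = [bin_of(x, y) for x in (0, w - 1) for y in (0, h - 1)]
    lo, hi = min(corners), max(corners)
    hist = [0] * (hi - lo + 1)
    for y in range(h):
        for x in range(w):
            hist[bin_of(x, y) - lo] += img[y][x]
    return hist
```

With `theta = 0` this reduces to plain row sums, which is the usual projection profile for a deskewed page.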

After the projection histogram has been generated, the processor (20) proceeds to step S14, in which the projection histogram obtained in step S13 is smoothed to remove its noise components.

That is, the noise components of the projection histogram of FIG. 3(b) are removed using the following equation.

Here, $s$ is the kernel size and $m_y$ is an integer. FIG. 6 shows the result of smoothing the projection histogram of FIG. 5.
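The smoothing equation itself is not reproduced in this text. As a stand-in, the sketch below uses a plain moving average with kernel size `s`; the choice of a box kernel, the edge handling (shorter windows at the ends), and the name `smooth` are assumptions, and the patent's actual kernel may differ.

```python
def smooth(hist, s=3):
    """Moving-average smoothing with kernel size s (assumed box kernel)."""
    half = s // 2
    out = []
    for i in range(len(hist)):
        # Clip the window at the ends of the histogram instead of padding.
        lo, hi = max(0, i - half), min(len(hist), i + half + 1)
        window = hist[lo:hi]
        out.append(sum(window) / len(window))
    return out
```

A larger `s` suppresses more noise but can also merge the peaks of closely spaced text lines, so the kernel should stay small relative to the line pitch.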

After the projection histogram has been smoothed in step S14, the processor (20) proceeds to step S15 and computes the slope between neighboring elements using the following equation.

FIG. 3(d) shows the result of differentiating the smoothed projection histogram of FIG. 3(c). Once the slopes between neighboring elements have been computed, the processor (20) proceeds to step S16, in which the points where the sign of the slope changes (sign crossing points) are computed using the following equation.

The sign crossing points obtained from this equation correspond to local maxima or local minima of the projection histogram. FIG. 3(e) shows the sign crossing points obtained from the derivative of the smoothed projection histogram shown in FIG. 3(d).
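Steps S15 and S16 can be sketched as a first difference followed by a search for sign changes. The names `gradient` and `sign_crossings` are hypothetical, and skipping zero slopes (so that a flat plateau yields a single crossing at its end) is a design choice of this sketch rather than something stated in the patent.

```python
def gradient(hist):
    """Slope between neighboring histogram elements (first difference)."""
    return [hist[i + 1] - hist[i] for i in range(len(hist) - 1)]


def sign_crossings(grad):
    """Histogram indices where the slope changes sign (local maxima or minima)."""
    crossings = []
    prev = 0  # sign of the last non-zero slope seen so far
    for i, g in enumerate(grad):
        if g == 0:
            continue  # flat segment: keep the previous sign
        sign = 1 if g > 0 else -1
        if prev and sign != prev:
            # grad[i] is the slope leaving hist[i], so the extremum sits at index i.
            crossings.append(i)
        prev = sign
    return crossings
```

For a regular text block the crossings alternate between line centers (maxima) and inter-line gaps (minima) at nearly constant spacing.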

Once the sign crossing points have been computed in step S16, the processor (20) proceeds to step S17, in which the variance of the distances between neighboring sign crossing points is computed to obtain the frequency distribution of the given region of the document. The variance (V) in step S17 is computed using the following equation.

A low variance of the distances between neighboring sign crossing points means that the frequency distribution of the lines in the region is nearly uniform, which indicates that the region is homogeneous.

Conversely, a high variance means that the frequency distribution of the lines in the region is irregular, which indicates that the region contains a position at which it should be divided.
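The variance of step S17 can be illustrated as follows. The patent's equation is not reproduced in this text, so the use of the population variance (dividing by the number of distances) and the name `crossing_distance_variance` are assumptions.

```python
def crossing_distance_variance(crossings):
    """Variance of the distances between neighboring sign crossing points."""
    distances = [b - a for a, b in zip(crossings, crossings[1:])]
    if not distances:
        return 0.0  # fewer than two crossings: no distances to compare
    mean = sum(distances) / len(distances)
    # Population variance (an assumption; the source equation is not available here).
    return sum((d - mean) ** 2 for d in distances) / len(distances)
```

Evenly spaced crossings such as `[2, 4, 6, 8]` give a variance of 0, while crossings such as `[0, 2, 10]` give a large variance and so signal a region that should be divided.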

Once the variance has been computed in step S17, the processor (20) proceeds to step S18, in which the range of the variance (V) is normalized to obtain a normalized value (P), and a decision value D(P), used to determine whether the region has a single periodicity, is computed from P. Step S18 computes the normalized value (P) using the following equation.

Here, as the variance (V) of the region approaches 0, the normalized value (P) becomes 1, and as the variance approaches 1, the normalized value (P) becomes 0. The periodicity of a region is therefore expressed by the normalized value (P), which ranges from 0 to 1.

As described above, after the normalized value (P) has been computed in step S18, the decision value D(P), which determines whether the region has a single periodicity, is computed from it.

The decision value D(P) is determined by the following equation, where the threshold (TH) is set to 0.5; this threshold is the value of P when the variance (V) is 1.099.

If the decision value D(P) is 0, the variance of the distances between sign crossing points is high and the region is not homogeneous, so the region must be divided at least once. If D(P) is 1, the variance is low and the region is homogeneous, so no division is needed.
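The normalization and decision equations are not reproduced in this text. One function that matches two of the stated properties is $P = 2 / (1 + e^{V})$: it gives $P = 1$ at $V = 0$ and $P = 0.5$ at $V = \ln 3 \approx 1.099$, the stated threshold point, and it falls toward 0 as $V$ grows large. The sketch below uses that function purely as an inferred assumption; the comparison `p >= th` at the boundary and the names `periodicity` and `decision` are also assumptions.

```python
import math


def periodicity(v):
    """Map a variance v >= 0 to (0, 1]; assumed form P = 2 / (1 + e^v)."""
    v = min(v, 700.0)  # avoid OverflowError in math.exp for very large variances
    return 2.0 / (1.0 + math.exp(v))


def decision(p, th=0.5):
    """D(P): 1 if the region is treated as single-period, 0 if it must be divided."""
    return 1 if p >= th else 0  # behavior exactly at the threshold is an assumption
```

Under this assumed form the fixed threshold of 0.5 corresponds to a fixed variance of about 1.099, so no per-document tuning is involved in the decision.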

Once the decision value D(P) has been computed in step S18, the processor (20) proceeds to step S19 and checks, according to D(P), whether the region has a single period.

If it is determined in step S19 from the decision value D(P) that the document must be divided, the process proceeds to step S20, in which the document is divided in the horizontal or vertical direction. Step S20 distinguishes two cases: as shown in FIG. 4(a), a white space in the horizontal projection is larger than the other white spaces, so the division is made there; or, as shown in FIG. 4(b), a black space in the horizontal projection is larger than the other black spaces, so the division is made at the white space next to it. In both cases the division position is found as described below and the region is divided into two.

In the first case, let W be the set of white spaces sorted from smallest to largest, with $W_i$ an arbitrary element, $W_{med}$ the median element, and $W_{max}$ the largest element. If $W_i > W_{med}$ and $W_i = W_{max}$, the region is divided at $W_i$.

In the second case, let B be the set of black spaces sorted by size, with $B_i$ an arbitrary element, $B_{med}$ the median element, and $B_{max}$ the largest element. If $B_i > B_{med}$ and $B_i = B_{max}$, the region is divided at the white space $W_{i-1}$.
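The two division cases can be sketched on a one-dimensional projection histogram, where zero bins are white space and non-zero bins are black space. Several details below are assumptions made for illustration: page margins are ignored, the lower median is used as the "median element", the cut is placed in the middle of the chosen white gap, and the names `runs`, `lower_median`, and `split_position` are hypothetical.

```python
def runs(hist):
    """Split a histogram into (is_black, start, length) runs of zero / non-zero bins."""
    out, start = [], 0
    for i in range(1, len(hist) + 1):
        if i == len(hist) or (hist[i] > 0) != (hist[start] > 0):
            out.append((hist[start] > 0, start, i - start))
            start = i
    return out


def lower_median(values):
    s = sorted(values)
    return s[(len(s) - 1) // 2]


def split_position(hist):
    """Return the bin index at which to cut, or None if no run stands out."""
    rs = runs(hist)
    # Case (a): an interior white gap larger than the median white gap.
    whites = [(length, start) for k, (black, start, length) in enumerate(rs)
              if not black and 0 < k < len(rs) - 1]
    if whites:
        length, start = max(whites)
        if length > lower_median([l for l, _ in whites]):
            return start + length // 2
    # Case (b): a black run larger than the median; cut at the white gap before it.
    blacks = [(length, k) for k, (black, _, length) in enumerate(rs) if black]
    if blacks:
        length, k = max(blacks)
        if length > lower_median([l for l, _ in blacks]) and k >= 2:
            _, start, gap = rs[k - 1]
            return start + gap // 2
    return None
```

Comparing the largest run with the median of the same histogram is what keeps the rule free of fixed pixel thresholds: the reference size comes from the region itself.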

After the document has been divided in this way, the process returns to step S13 to compute the periodicity of each divided region.

However, like other conventional methods, this division method has difficulty dividing or classifying by periodicity measurement alone when two regions touch or overlap, when a paragraph begins with a large character, or when a region consists of a single line with only a few words.

Therefore, in the present invention, after the input document has been divided into maximal homogeneous regions through the above process, the processor (20) performs an additional step S21 that verifies the division result. Step S21 determines whether the decision value D(P) computed in step S18 is 1 and the number of sign crossing points of the horizontal projection histogram is less than or equal to 3. If both conditions hold, the processor (20) decides that the periodicity measurement cannot be trusted and proceeds to step S22 for texture-based analysis. The texture-based analysis uses the ordinary 2D Haar wavelet and is carried out as follows.
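The routing between division (S20), texture-based analysis (S22), and classification (S23) described in the flow above can be summarized in a small helper. The function name `next_step` and the string labels are hypothetical; only the conditions come from the text.

```python
def next_step(d_value, n_crossings):
    """Choose the next processing step for a region.

    d_value:     decision value D(P) from step S18 (0 or 1)
    n_crossings: number of sign crossing points of the horizontal projection histogram
    """
    if d_value == 0:
        return "split"             # not a single period: divide the region (S20)
    if n_crossings <= 3:
        return "texture_analysis"  # too few crossings to trust the periodicity (S22)
    return "classify"              # periodicity result accepted (S23)
```

The second branch covers short regions such as single-line titles, where the variance is trivially low because there are almost no distances to compare.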

First, the wavelet transform is computed with a pyramid algorithm, as in the one-dimensional case. The two-dimensional wavelet transform applies the one-dimensional wavelet transform along the x-axis and the y-axis, decomposing the image $A_{n+1}f$ at each level into $A_n f$, $D_n^1 f$, $D_n^2 f$, and $D_n^3 f$.

That is, every second row of $A_{n+1}f$ is first convolved with a one-dimensional filter, and every second column of the result is then convolved with another one-dimensional filter. Through this process the input document can be transformed into four sub-images called LL, LH, HL, and HH, as shown in FIG. 13. The LL sub-image is the low-frequency part along both the horizontal and vertical axes and equals $A_n f$. The LH sub-image is low-frequency along the horizontal axis and high-frequency along the vertical axis and equals $D_n^1 f$.

The HL sub-image is high-frequency along the horizontal axis and low-frequency along the vertical axis and equals $D_n^2 f$, and the HH sub-image is high-frequency along both axes and equals $D_n^3 f$.

For the 2D Haar wavelet of a binary image, the sub-images $LL^{(n)}$, $LH^{(n)}$, $HL^{(n)}$, and $HH^{(n)}$, which correspond to $A_n f$, $D_n^1 f$, $D_n^2 f$, and $D_n^3 f$, are computed from the following equation.
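One level of a 2D Haar decomposition on a binary image can be sketched per 2 × 2 block as below. The patent's own equation is not reproduced in this text, so the normalization (division by 4) and the name `haar_level` are assumptions. The sub-band naming follows the description above: LH is low-pass along the horizontal axis and high-pass along the vertical axis, so it responds to horizontal strokes.

```python
def haar_level(img):
    """One Haar level on a 2-D list; returns (LL, LH, HL, HH) at half resolution.

    Odd trailing rows or columns are ignored in this sketch.
    """
    h, w = len(img) // 2, len(img[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            a, b = img[2 * y][2 * x], img[2 * y][2 * x + 1]          # top row of the block
            c, d = img[2 * y + 1][2 * x], img[2 * y + 1][2 * x + 1]  # bottom row of the block
            ll[y][x] = (a + b + c + d) / 4  # average: low-pass on both axes
            lh[y][x] = (a + b - c - d) / 4  # top minus bottom: vertical change
            hl[y][x] = (a - b + c - d) / 4  # left minus right: horizontal change
            hh[y][x] = (a - b - c + d) / 4  # diagonal change
    return ll, lh, hl, hh
```

Applying `haar_level` again to the returned LL image gives the next coarser level, which is how the multi-level coefficients are obtained.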

Each coefficient is classified as picture, text, horizontal line, or vertical line according to the transform coefficients computed from this equation. For example, if the LH coefficient is high and the HL coefficient is low, only a horizontal component is present, so the coefficient is assigned to horizontal line (H). Conversely, if the LH coefficient is low and the HL coefficient is high, only a vertical component is present, so it is assigned to vertical line (V).

If both the LH and HL coefficients are high, both horizontal and vertical components are present, so the coefficient is assigned to text (T). If both are low, there is neither a horizontal nor a vertical component, so it is assigned to picture or background: picture (I) if the LL coefficient is high and background (X) if the LL coefficient is low. These rules are summarized in the following table.

              LH LOW (0)    LH HIGH (1)
HL LOW (0)    I / X         H
HL HIGH (1)   V             T
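The table can be read as a small decision function. In the sketch below the cut-off that separates "high" from "low" (`threshold=0.25`) and the name `classify_coefficient` are hypothetical; the text does not say how high and low are decided.

```python
def classify_coefficient(ll, lh, hl, threshold=0.25):
    """Map one (LL, LH, HL) coefficient triple to T, H, V, I, or X.

    threshold is an illustrative value, not taken from the source.
    """
    lh_high = abs(lh) >= threshold
    hl_high = abs(hl) >= threshold
    if lh_high and hl_high:
        return "T"  # horizontal and vertical components: text
    if lh_high:
        return "H"  # horizontal component only: horizontal line
    if hl_high:
        return "V"  # vertical component only: vertical line
    # Neither component: picture if the low-pass energy is high, otherwise background.
    return "I" if abs(ll) >= threshold else "X"
```

For instance, `classify_coefficient(0.5, 0.5, 0.0)` yields `"H"`, matching the top-right cell of the table.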

둘째, 마스크로 모든 클래스에 대해 콘볼루션을 행하여 이웃 계수들의 할당 결과를 판단하고, 각 이웃간의 관계를 나타내는 각 클래스의 마스크는 다음 표와 같다.Second, the convolution of all classes is performed as a mask to determine the allocation result of neighbor coefficients, and the mask of each class representing the relationship between neighbors is shown in the following table.

n번째 레벨의 (x,y)점에서 해당하는 계수들을 콘볼루션한 결과값은 다음 식으로부터 산출된다.The result of convolving corresponding coefficients at the (x, y) point of the nth level is calculated from the following equation.

여기서, 상기 m_i ^c는 대응하는 클래스의 마스크의 i번째 요소이며, b_i는 b_i에할당된 클래스가 클래스 c,와 같으면 1이고, 그렇지 않으면 0이 된다.Here, m _i ^c is the i th element of the mask of the corresponding class, b _i is the class assigned to b _i is class c, Equals 1, otherwise 0.

셋째, 각 레벨의 콘볼루션한 결과값은 다음 식을 이용하여 누적되며, 상기와 같은 누적과정은 최상위 레벨(N 레벨)에서부터 최하위 레벨(1)까지 적용되고, 각 클래스에 대응하는 계수 c의 최종 누적값 A¹ _x,y(c)가 산출된다.Third, the convoluted result of each level is accumulated using the following equation, and the cumulative process as described above is applied from the highest level (N level) to the lowest level (1) and the final value of the coefficient c corresponding to each class. The cumulative value A ¹ _{x, y} (c) is calculated.

Finally, at level 1, the lowest level, the class of each coefficient is determined as picture, text, horizontal line, or vertical line. The coefficient is assigned the class whose final accumulated value is the largest among those of picture (I), text (T), vertical line (V), and horizontal line (H).
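The top-down accumulation and the final maximum-value decision might look like the sketch below. The accumulation equation is not reproduced in this text, so the rule used here is an assumption: each level's convolution result is added to the value inherited from the parent position (x // 2, y // 2) at the next coarser level. The function name `accumulate_and_decide` and the data layout are likewise hypothetical.

```python
from typing import Dict, List

Grid = List[List[float]]
CLASSES = ("I", "T", "V", "H")


def accumulate_and_decide(levels: List[Dict[str, Grid]]) -> List[List[str]]:
    """levels[0] is level 1 (finest); levels[-1] is level N (coarsest).

    Each level maps a class label to its grid of convolution results.
    Assumed rule: A^n(x, y) = result^n(x, y) + A^(n+1)(x // 2, y // 2).
    """
    # Start from the coarsest level, then walk down to level 1.
    acc = {c: [row[:] for row in levels[-1][c]] for c in CLASSES}
    for level in reversed(levels[:-1]):
        nxt = {}
        for c in CLASSES:
            parent = acc[c]
            max_y, max_x = len(parent) - 1, len(parent[0]) - 1
            nxt[c] = [
                [v + parent[min(y // 2, max_y)][min(x // 2, max_x)]
                 for x, v in enumerate(row)]
                for y, row in enumerate(level[c])
            ]
        acc = nxt
    height, width = len(acc["T"]), len(acc["T"][0])
    # Pick, per position, the class with the largest accumulated value.
    return [[max(CLASSES, key=lambda c: acc[c][y][x]) for x in range(width)]
            for y in range(height)]
```

Ties are broken by the order of `CLASSES`, which is an arbitrary choice of this sketch.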

From the result of the document analysis by the texture-based analysis, non-text regions (pictures, horizontal lines, vertical lines) and small text regions are removed, and the process then proceeds to step S13.

Meanwhile, if it is determined in step S21 that the determination value D(P) is 1 and the number of sign crossings of the horizontal projection histogram is not less than or equal to 3, the processor 20 judges that the document has been analyzed correctly and proceeds to step S23, in which the document is classified into text regions, line regions, table regions, and picture regions.

For example, a region is classified as a text region if, through the above process, it has a periodic component in the horizontal direction, one or more peaks appear in its horizontally smoothed projection histogram, the determination value D(P) in step S19 is 1, and the number of sign crossings of the horizontally smoothed projection histogram in step S16 is 3 or more.

If a region has a periodic characteristic in the vertical direction, has one or more peaks in its vertically smoothed projection histogram, the determination value D(P) in step S19 is 1, and the number of sign crossings of the horizontally smoothed projection histogram in step S16 is less than or equal to 3, the region is classified as a text region consisting of a single line, such as a title, subtitle, legend, or caption.
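The two text rules above reduce to a short decision on D(P) and the sign-crossing count, sketched below. The sketch checks only those two quantities and assumes the periodicity and peak conditions have been verified elsewhere. Because the passages state the boundary both as "3 or more" and as "less than or equal to 3", the code treats a count of exactly 3 as single-line text, following the wording of step S21; that choice, the function name `classify_text_region`, and the returned labels are assumptions.

```python
def classify_text_region(d_p: int, sign_crossings: int) -> str:
    """Classify a region from D(P) and the horizontal sign-crossing count.

    Only the D(P) and crossing-count conditions are modeled; the
    periodicity-direction and histogram-peak conditions are assumed to
    have been checked by the caller.
    """
    if d_p != 1:
        return "not-single-period"  # region would be divided further
    if sign_crossings > 3:
        return "paragraph-text"     # multi-line text region
    return "single-line-text"       # title, subtitle, legend, caption
```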

The characters of the classified paragraph text regions are supplied to a character recognizer and stored in the database 30.

Meanwhile, after broken lines are joined, a region is classified as a line if it consists of a single connected component and the ratio between its width and height is 5 or more. The classified line is vectorized as a graphic region and stored in the database 30.
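A minimal sketch of the line test follows. It assumes the region's bounding-box size and connected-component count are already known, and it applies the ratio to the longer side over the shorter side so that both horizontal and vertical lines qualify; that symmetric reading, the function name `is_line`, and the parameter names are assumptions.

```python
def is_line(width: int, height: int, num_components: int,
            min_ratio: float = 5.0) -> bool:
    """Return True if the region looks like a line.

    The region must be a single connected component whose longer side is
    at least min_ratio times its shorter side.
    """
    if num_components != 1 or width <= 0 or height <= 0:
        return False
    long_side, short_side = max(width, height), min(width, height)
    return long_side / short_side >= min_ratio
```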

Then, horizontal lines are extracted from the regions not yet classified. If a top line and a bottom line whose size is similar to the width of the region are detected, and one or more horizontal lines other than the top and bottom lines are detected, the region is classified as a table. The classified table is regenerated and stored in the database 30.
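The table test can be sketched in the same style. The text does not say how close "similar" must be, so the `tolerance` fraction of 0.9 is an assumed value, as are the function name `is_table` and the (y, length) representation of each extracted horizontal line.

```python
from typing import List, Tuple


def is_table(lines: List[Tuple[int, int]], region_width: int,
             tolerance: float = 0.9) -> bool:
    """lines holds (y, length) for each horizontal line found in the region.

    A table needs a top and a bottom line that span roughly the region
    width, plus at least one more horizontal line between them.
    """
    if len(lines) < 3 or region_width <= 0:
        return False
    ordered = sorted(lines)  # sort by vertical position
    top_len, bottom_len = ordered[0][1], ordered[-1][1]
    min_len = tolerance * region_width
    return top_len >= min_len and bottom_len >= min_len
```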

All remaining regions not classified above are classified as picture regions, and the classified picture regions are compressed and stored in the database 30.

As described above, according to the present invention, the input document image is divided into maximal homogeneous regions, a pyramid structure is built to classify each region accurately as text, picture, table, or line, and periodicity measurement and texture-based analysis are applied. As a result, documents having various typeface sizes, line spacings, and document structures can be segmented accurately and automatically, without any additional process such as separate parameter input.

Although the present invention has been described in detail with reference to the above embodiment, the present invention is not limited thereto, and modifications and improvements are possible within the ordinary knowledge of those skilled in the art.

Claims

a) determining whether a document to be analyzed has been input and, when it is determined that the document has been input, processing the input document from high resolution to low resolution to generate a pyramid structure and extracting multi-level images of the document;

b) performing connected-component analysis on the top-level, low-resolution image of the pyramid structure generated in step a) to extract bounding rectangles of the document, and measuring the periodicity of a predetermined region in the horizontal and vertical directions on the remaining images other than the top-level image to extract periodic characteristics;

c) determining whether the predetermined region of step b) has a single period and, if it is determined that the region does not have a single periodicity, detecting a position at which to divide the document, dividing the document into a plurality of regions, and then returning to step b);

d) if it is determined in step c) that the predetermined region has a single period, classifying the maximum homogeneous regions into text, picture, table, and line; and

e) supplying the text classified in step d) to a character recognizer, compressing pictures, regenerating tables, vectorizing graphics such as lines, and storing the results in a database.

The method of claim 1, wherein step b) comprises:

b-1) obtaining the skew angle of the document and generating a projection histogram along the direction of that angle;

b-2) smoothing the projection histogram generated in step b-1) to remove noise components;

b-3) calculating a slope between neighboring elements by differentiating the smoothed projection histogram in step b-2);

b-4) calculating the sign crossings of the projection histogram from the slopes between the neighboring elements;

b-5) calculating a frequency distribution in the predetermined region by calculating the variance of the distances between the sign crossings of the projection histogram calculated in step b-4); and

b-6) calculating a normalized value by normalizing the variance of step b-5), and calculating, from the normalized value, a determination value for determining the periodicity of the predetermined region.

The method of claim 2, wherein step b-6) comprises:

b-i) calculating and outputting the normalized value P according to Equation 1 below; and

b-ii) calculating, according to Equation 2 below, the determination value D(P) for determining from the normalized value P whether the predetermined region has a single periodicity.

.. Equation 1

.. Equation 2

The method of claim 1, wherein, in step c), when an arbitrary element of the set W of white spaces sorted by size is W_i, the median element is W_med, and the maximum element is W_max, the document is divided at the element W_i if W_i > W_med and W_i = W_max.

The method of claim 1, wherein, in step c), when an arbitrary element of the set B of black spaces sorted by size is B_i, the median element is B_med, and the maximum element is B_max, the document is divided at the element W_{i-1} if B_i > B_med and B_i = B_max.

c) determining whether the predetermined region of step b) has a single period and, if it is determined that the region does not have a single periodicity, detecting a position at which to divide the document, dividing the document into a plurality of regions, and then returning to step b);

d) if it is determined in step c) that the predetermined area has a single period, verifying the result of document division;

e) analyzing the document by texture-based analysis only for the corresponding region when it is determined in step d) that the document division result is incorrect;

f) classifying the largest homogeneous region into text, picture, table, and line when the document division result is determined to be accurate in step e) or the document division result is determined to be accurate in step d); And

g) supplying the text classified in step d) to a character recognizer, compressing a picture, regenerating a table, and vectorizing a graphic such as a line and storing it in a database.

7. The automatic document structure analysis method according to claim 6, wherein the step c) determines that a predetermined area of the document does not have a single period when the determination value is zero.

7. The automatic document structure analysis method according to claim 6, wherein step d) determines that the document analysis result is incorrect when the determination value is 1 and the number of sign crossings is 3 or less.