KR100310306B1

KR100310306B1 - Method of analyzing document image on computer and recording media recording it

Info

Publication number: KR100310306B1
Application number: KR1019980042336A
Authority: KR
Inventors: 문경애; 황영섭; 오원근
Original assignee: 오길록; 한국전자통신연구원
Priority date: 1998-10-09
Filing date: 1998-10-09
Publication date: 2001-12-17
Also published as: KR20000025311A

Abstract

The present invention relates to a method of analyzing form document images on a computer and to a recording medium on which the method is recorded. Its object is to provide a form document image analysis method, and a computer-readable recording medium recording it, that automatically processes a form contained in a document image by extracting the line segments of the form and then the items of the form, so that printed documents containing tables of various structures, hand-drawn tables, and complex form documents can be processed effectively.

A form document image analysis method for automatically processing a form from an image containing a form document comprises: a first step of extracting form line segments by analyzing the image with mathematical morphology; a second step of filtering the line segments extracted in the first step and grouping the filtered segments to extract form lines; and a third step of extracting valid form items from the vertical and horizontal lines extracted in the second step.

Description

Image Analysis Method of Form Document on Computer and Recording Media Recording the Same

The present invention relates to a method of analyzing form document images on a computer and to a recording medium on which the method is recorded. More particularly, it relates to a form document image analysis method, and a recording medium for it, that automatically processes a form contained in a document image by extracting the line segments of the form and then the items of the form, so that printed documents containing tables of various structures, hand-drawn tables, and complex form documents can be processed effectively.

Form documents can be divided into two kinds. In a physical format, the position and size of each item are strictly fixed. In a logical format, forms are regarded as the same format even when the position and size of their items vary, provided the topology and content are preserved.

When a form document in a physical format is processed, the form portion alone can be removed from the scanned document image, after which the item portions can be processed using OCR. However, this approach requires prior knowledge of every form to be interpreted to be registered as a form model. Because most documents used in offices are re-edited or resized by each user to fit a prescribed format, the conventional approach required several form models to be registered for a single form. For this reason, automatically processing scanned form document images has been very difficult.

The present invention is intended to solve these problems of the prior art. Its object is to provide an effective form document image analysis method, and a computer-readable recording medium recording it, that automatically processes a form contained in a document image by extracting the lines and the items of the form to analyze the form's structure, and then segments and recognizes the individual characters in the extracted items and expresses the result in a reusable file format.

FIG. 1 is a flowchart illustrating the form document analysis method according to the present invention.

FIG. 2 is a detailed flowchart of the form line segment extraction step according to the present invention.

FIGS. 3 and 4 show examples processed by the present invention.

To achieve the above object, the method of the present invention is a form document image analysis method for automatically processing a form from an image containing a form document, comprising: a first step of extracting form line segments by analyzing the image containing the form document using mathematical morphology; a second step of filtering the line segments extracted in the first step and grouping the filtered segments to extract form lines; a third step of extracting valid form items from the vertical and horizontal lines extracted in the second step; and a fourth step of separating the extracted form items into individual characters, recognizing them, and converting the result to a file.

The recording medium according to the present invention is a computer-readable recording medium for extracting a form, through structural analysis, from an image containing a form document. It records a program for executing the steps of: analyzing the image containing the form document through a morphological closing operation and then extracting, through a morphological opening operation, the connected components that become horizontal or vertical line segments; filtering out each extracted line segment whose slope exceeds a threshold T, and grouping all the filtered horizontal segments to extract form lines; extracting form items from the extracted horizontal and vertical form lines; and separating the extracted form items into individual characters, recognizing them, and converting the result to a file.

Unlike conventional methods, the present method does not rely on simple image-to-image matching alone. It analyzes the structure of the form document and uses both its physical coordinates and its logical structure. Using the logical structure to match form documents accommodates variations in the position and size of the form, which removes the need to register the same form several times, as is required when forms are classified by physical format only.

Furthermore, the present invention can handle the cases that caused item extraction errors in conventional methods (error rates of about 11.3% to 16.7%): form lines that are partly broken, and handwritten characters inside an item that touch the form lines. It can also handle complex forms, such as a rectangle enclosed within another rectangle. The item extraction error rate can therefore be made significantly lower than with conventional methods.

Hereinafter, the most preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 illustrates the form document analysis method according to the present invention, and FIG. 2 shows the form line segment extraction step (11) of FIG. 1 in detail.

The analysis method of the present invention consists broadly of three steps: extracting form line segments by mathematical morphology; extracting form lines by filtering and grouping the extracted segments; and extracting valid form items from the extracted vertical and horizontal lines.

Likewise, the computer-readable recording medium records a program for executing the steps of: extracting form line segments by mathematical morphology; filtering the extracted segments and grouping the filtered segments to extract form lines; extracting valid form items from the extracted vertical and horizontal lines; and separating each extracted item into individual characters and recognizing them.

This configuration is described in more detail below with reference to FIGS. 1 and 2.

First, the process of extracting the line segments of the form (11 in FIG. 1) is described.

As shown in FIG. 2, the scanned form document image is first subjected to a morphological closing operation (21, 22) so that broken portions of the form lines are joined. A morphological opening operation (23, 24) then extracts the connected components C_h or C_v that become horizontal or vertical line segments, as given by Equation 1.

Here, the morphological closing operation means performing a dilation (21) followed by an erosion (22), and the morphological opening operation means performing an erosion (23) followed by a dilation (24).
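The closing-then-opening sequence can be sketched in pure Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the image is a list of rows of 0/1 pixels, the structuring element is a run of ones anchored at its first pixel, and the names `close_row`, `open_row`, `extract_segments`, `close_w`, and `open_w` are all hypothetical.

```python
def _erode_row(row, w):
    # A pixel survives only if the w-wide window starting at it is all foreground.
    n = len(row)
    return [1 if x + w <= n and all(row[x:x + w]) else 0 for x in range(n)]


def _dilate_row(row, w):
    # Adjoint of _erode_row: a pixel is set if any of the w pixels ending at it is set.
    return [1 if any(row[max(0, x - w + 1):x + 1]) else 0 for x in range(len(row))]


def close_row(row, w):
    # Closing = dilation then erosion; joins gaps shorter than w pixels.
    # Right padding keeps runs that touch the border from being shortened.
    padded = list(row) + [0] * (w - 1)
    return _erode_row(_dilate_row(padded, w), w)[:len(row)]


def open_row(row, w):
    # Opening = erosion then dilation; keeps only runs at least w pixels long.
    return _dilate_row(_erode_row(row, w), w)


def transpose(img):
    return [list(col) for col in zip(*img)]


def extract_segments(img, close_w, open_w):
    """Return (c_h, c_v): masks of horizontal and vertical line pixels."""
    c_h = [open_row(close_row(r, close_w), open_w) for r in img]
    c_v = transpose([open_row(close_row(c, close_w), open_w) for c in transpose(img)])
    return c_h, c_v


# Toy form: a vertical line in column 0, a horizontal line in row 1 with a
# one-pixel break at x = 4, and a two-pixel "character" stroke in row 3.
form = [
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
c_h, c_v = extract_segments(form, close_w=2, open_w=5)
```

In this toy case the break in the horizontal line is bridged by the closing, and the short stroke is discarded by the opening because it is shorter than `open_w`.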

[Equation 1]

Here, B_h and B_v are [1...1] elements of width L; B′_h and B′_v are [1...1] elements of width L′, where L′ is the widest gap; Θ denotes erosion and ⊕ denotes dilation.
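The body of the equation is not reproduced in this text. One reading consistent with the definitions given here (a closing with B′, that is, dilation then erosion, followed by an opening with B, that is, erosion then dilation) is sketched below. The symbol I for the input binary image and the exact bracketing are assumptions, not taken from the source.

```latex
\begin{aligned}
C_h &= \bigl(\bigl((I \oplus B'_h) \mathbin{\Theta} B'_h\bigr) \mathbin{\Theta} B_h\bigr) \oplus B_h,\\
C_v &= \bigl(\bigl((I \oplus B'_v) \mathbin{\Theta} B'_v\bigr) \mathbin{\Theta} B_v\bigr) \oplus B_v
\end{aligned}
```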

This line segment extraction method based on mathematical morphology is unaffected by contact between characters and the line segments of the form. The tolerable skew of a form depends on the thickness of its lines; for a typical form document with lines about 0.4 mm thick, forms skewed by about 12° can still be processed.

Next, the filtering and grouping (12 in FIG. 1) of the line segments extracted as above is described.

Among the extracted line segments, horizontal segments with both a large height and a large slope, and vertical segments with a large width and a small slope, are filtered out. In addition, any segment whose slope exceeds a threshold T is filtered out. The threshold is T = tan^-1(1/50) for solid-line forms and T = tan^-1(1/24) for dotted-line forms.
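The slope test can be illustrated with a short sketch. The two thresholds come from the text (about 1.15° and 2.39° respectively); everything else is an assumption made for illustration: segments are given as endpoint pairs, "slope" is read as the angular deviation from the segment's nominal horizontal or vertical direction, and the names `skew_angle` and `filter_by_slope` are hypothetical. The height and width criteria are not quantified in the text and are left out.

```python
import math

T_SOLID = math.atan(1 / 50)    # threshold for solid-line forms
T_DOTTED = math.atan(1 / 24)   # threshold for dotted-line forms


def skew_angle(segment, orientation):
    """Angle (radians) by which a segment deviates from its nominal direction.

    segment is ((x0, y0), (x1, y1)); orientation is "h" or "v" (assumed inputs).
    """
    (x0, y0), (x1, y1) = segment
    dx, dy = abs(x1 - x0), abs(y1 - y0)
    return math.atan2(dy, dx) if orientation == "h" else math.atan2(dx, dy)


def filter_by_slope(segments, orientation, threshold):
    # Keep only segments whose deviation does not exceed the threshold T.
    return [s for s in segments if skew_angle(s, orientation) <= threshold]


horizontal = [((0, 0), (100, 1)), ((0, 0), (100, 3)), ((0, 0), (100, 5))]
kept_solid = filter_by_slope(horizontal, "h", T_SOLID)
kept_dotted = filter_by_slope(horizontal, "h", T_DOTTED)
```

With these inputs, the segment rising 3 pixels over 100 fails the solid-line threshold but passes the looser dotted-line one.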

After all segments have been filtered, horizontal (or vertical) segments lying at nearby vertical (or horizontal) positions are grouped into a single line in order to find the lines of the form. The grouping proceeds as follows. All horizontal segments are sorted from top to bottom, and the number of form lines is initialized to 0. Each segment in turn is then tested for proximity to the current form line. If it is close, the segment is added to the current form line; if it is not, a new form line is created, the number of form lines is incremented by 1, and the segment is added to that new line. This is repeated for all horizontal segments.
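The grouping loop described above might look like the following sketch. The segment representation `(y_center, x_start, x_end)`, the proximity predicate `near_by_y`, and its tolerance of 3 pixels are assumptions chosen for illustration; the text does not fix a data structure or a tolerance.

```python
def near_by_y(segment, line, tol=3):
    # Hypothetical proximity test: compare with the last segment added to the line.
    return abs(segment[0] - line[-1][0]) <= tol


def group_horizontal(segments, near=near_by_y):
    """Group horizontal segments (y_center, x_start, x_end) into form lines."""
    ordered = sorted(segments, key=lambda s: s[0])   # sort top to bottom
    lines = []        # each form line is a list of segments
    line_count = 0    # number of form lines, initialized to 0
    for seg in ordered:
        if line_count > 0 and near(seg, lines[-1]):
            lines[-1].append(seg)      # close: join the current form line
        else:
            lines.append([seg])        # not close: start a new form line
            line_count += 1
    return lines


segments = [(41, 0, 90), (10, 0, 40), (12, 80, 120), (90, 0, 120), (11, 42, 78), (40, 95, 120)]
form_lines = group_horizontal(segments)
```

Vertical segments can be handled the same way by swapping the roles of x and y.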

When determining whether the current segment is close to the current form line, the distance d_ij between two segments i and j is |d_i - d_j| if they lie on the same side of the standard line segment, and |d_i + d_j| if they lie on opposite sides. Here, the standard line segment has the same slope as the form and passes through the longest of the form's line segments, and d_i and d_j are the distances from the standard line segment to the centers of segments i and j.
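A small sketch of this distance rule follows. It assumes, purely for illustration, that the standard line segment is extended to an infinite line given by coefficients (a, b, c) of ax + by + c = 0 and that each segment is represented by its center point; the function names are hypothetical.

```python
import math


def signed_distance(point, line):
    # Signed perpendicular distance from a point to the line ax + by + c = 0.
    a, b, c = line
    x, y = point
    return (a * x + b * y + c) / math.hypot(a, b)


def segment_distance(center_i, center_j, standard_line):
    """d_ij as described above, from unsigned distances d_i and d_j."""
    s_i = signed_distance(center_i, standard_line)
    s_j = signed_distance(center_j, standard_line)
    d_i, d_j = abs(s_i), abs(s_j)
    same_side = s_i * s_j >= 0            # a point on the line counts as same side
    return abs(d_i - d_j) if same_side else abs(d_i + d_j)


standard = (0, 1, -50)                     # the horizontal line y = 50
same = segment_distance((10, 60), (200, 65), standard)      # both below the line
opposite = segment_distance((10, 40), (200, 65), standard)  # one on each side
```

Here `same` evaluates to 5.0 and `opposite` to 25.0, which matches the perpendicular separation of the two centers in both cases.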

Next, the step of extracting form items from the lines obtained above (13 in FIG. 1) is described.

To determine whether an extracted horizontal or vertical form line is a valid edge of a rectangle enclosing a form item, the following conditions are applied.

First, if the union of the rectangles covers a candidate rectangle entirely, that candidate rectangle is excluded from consideration.

Second, if the edges of other rectangles cover the whole of one edge of a candidate rectangle, that edge is valid, and the candidate rectangle is extracted.

Third, if the edges of other rectangles cover only part of one edge of a candidate rectangle and the uncovered part is sufficiently long, that edge is also valid, and the candidate rectangle is likewise extracted.
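The second and third conditions reduce to a one-dimensional coverage check along an edge, which can be sketched as below. This is an illustrative reading only: edges and covering edges are modelled as intervals along the same axis, the parameter `min_uncovered` stands in for "sufficiently long" (the text gives no value), an edge with no coverage at all is treated as falling outside both conditions, and the first condition (two-dimensional coverage) is not modelled. All names are hypothetical.

```python
def covered_length(edge, covers):
    """Length of the interval `edge` covered by the union of `covers`."""
    lo, hi = edge
    clipped = sorted(
        (max(lo, a), min(hi, b)) for a, b in covers if min(hi, b) > max(lo, a)
    )
    total, reach = 0, lo
    for a, b in clipped:
        if b > reach:                       # count only the part not yet counted
            total += b - max(a, reach)
            reach = b
    return total


def edge_is_valid(edge, covers, min_uncovered):
    lo, hi = edge
    covered = covered_length(edge, covers)
    uncovered = (hi - lo) - covered
    if uncovered == 0:
        return True                         # second condition: whole edge covered
    if covered > 0 and uncovered >= min_uncovered:
        return True                         # third condition: partly covered, long remainder
    return False                            # assumption: anything else is not accepted


full = edge_is_valid((0, 100), [(0, 60), (50, 100)], min_uncovered=20)
partial_long = edge_is_valid((0, 100), [(0, 30)], min_uncovered=20)
partial_short = edge_is_valid((0, 100), [(0, 95)], min_uncovered=20)
```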

Each item extracted through the above process is processed by individual character segmentation and recognition (14) and then passes through file conversion (15). In this embodiment, the result is converted to HTML and expressed as an HTML file.
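As a rough illustration of the file conversion stage, the sketch below writes recognized item text as an HTML table. It assumes the items have already been arranged into a plain rectangular grid of strings, which is a simplification: nested or merged cells, which the method is said to handle, are not modelled, and the function name `cells_to_html` is hypothetical.

```python
from html import escape


def cells_to_html(rows):
    """rows: list of rows, each a list of recognized cell strings."""
    lines = ['<table border="1">']
    for row in rows:
        # Escape the recognized text so characters such as & and < stay valid HTML.
        cells = "".join(f"<td>{escape(text)}</td>" for text in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


page = cells_to_html([["Name", "A&B"], ["Qty", "3"]])
```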

FIGS. 3 and 4 show results produced by the present invention. FIG. 3 shows an example of analyzing the structure of a complex form in which handwritten characters inside the items touch the form lines. FIG. 4 shows an example of analyzing the structure of a hand-drawn form with handwritten digits in its items and expressing the result in HTML.

The present invention as described above is recorded on a computer-readable recording medium and executed by a computer.

According to the present invention, the structure of an arbitrary filled-in form document is analyzed with a logical-format structural analysis that regards forms as the same format despite variations, as long as the topology is preserved. The method is therefore insensitive to changes in size and position, and forms can be classified with a minimal number of registrations. Good results were also obtained in cases that conventional methods handled poorly: broken form lines, dotted-line forms, handwritten characters touching the form lines, and skewed forms. In various experiments using the present invention, the item extraction rate was about 97% or higher, a marked improvement over the prior art.

Claims

A form document image analysis method for automatically processing a form from an image containing a form document, the method comprising:

a first step of extracting form line segments by analyzing the image containing the form document using mathematical morphology;

a second step of filtering the line segments extracted in the first step and grouping the filtered line segments to extract form lines;

a third step of extracting valid form items from the vertical and horizontal lines extracted in the second step; and

a fourth step of separating the extracted form items into individual characters, recognizing them, and converting the result to a file.

The method of claim 1,

wherein the first step comprises:

a first sub-step of analyzing the input image through a morphological closing operation so that broken portions of the form lines are connected; and

a second sub-step of extracting, after the first sub-step, the connected components (C_h or C_v) that become horizontal or vertical line segments through a morphological opening operation.

The method of claim 2,

wherein the connected components C_h that become horizontal segments and the connected components C_v that become vertical segments are calculated by the following equation:

[Equation 1]

Here, B_h and B_v are [1...1] of width L, B′_h and B′_v are [1...1] of width L′, Θ denotes erosion, and ⊕ denotes dilation.

The method of claim 1,

wherein filtering the line segments in the second step comprises

filtering out, among the extracted line segments, horizontal segments having both a large height and a large slope, vertical segments having a large width and a small slope, and any segment whose slope exceeds a threshold T.

The method of claim 4,

wherein the threshold T is T = tan^-1(1/50) for solid-line forms and T = tan^-1(1/24) for dotted-line forms.

The method of claim 1,

wherein grouping the filtered line segments in the second step comprises:

a first sub-step of sorting all horizontal line segments and initializing the number of form lines to 0;

a second sub-step of determining, after the sorting and initialization, whether the current line segment is close to the current form line;

a third sub-step of adding the current line segment to the current form line if it is close to the current form line;

a fourth sub-step of, if the current line segment is not close to the current form line, creating a new form line, incrementing the number of form lines by 1, and adding the current line segment to the new form line; and

a fifth sub-step of repeating the grouping by proximity between line segment and form line until the last horizontal line segment has been processed.

The method of claim 6,

wherein determining, in the second sub-step, whether the current line segment is close to the current form line comprises:

when the current line segment (i) and the current form line (j) are located on the same side of the standard line segment, obtaining the distance d_ij between them as |d_i - d_j|;

when the current line segment (i) and the current form line (j) are located on opposite sides of the standard line segment, obtaining the distance d_ij between them as |d_i + d_j|; and

comparing the distance d_ij with a threshold value to determine proximity.

Here, the standard line segment has the same slope as the form and passes through the longest line segment among the form lines, and d_i and d_j are the distances from the standard line segment to the current line segment and to the current form line, respectively.

The method of claim 1,

wherein the third step comprises:

a first sub-step of excluding a candidate rectangle if the union of the rectangles covers the entire candidate rectangle;

a second sub-step of extracting a candidate rectangle as a form item when the edges of other rectangles cover the whole of one edge of the candidate rectangle; and

a third sub-step of extracting a candidate rectangle as a form item when the edges of other rectangles cover part of one edge of the candidate rectangle and the uncovered part is sufficiently long.

A computer-readable recording medium for extracting a form, through structural analysis, from an image containing a form document, the medium having recorded thereon a program for executing the steps of:

analyzing the image containing the form document through a morphological closing operation, and then extracting, through a morphological opening operation, the connected components that become horizontal or vertical line segments;

filtering out each extracted line segment whose slope exceeds a threshold T, and grouping all the filtered horizontal line segments to extract form lines;

extracting form items from the extracted horizontal and vertical form lines; and

separating the extracted form items into individual characters, recognizing them, and converting the result to a file.