KR100331035B1 - Method for Automatic Interpretation of Form Document Images

Info

Publication number: KR100331035B1
Application number: KR1019980049189A
Authority: KR
Inventors: 문경애; 황영섭; 오원근
Original assignee: 한국전자통신연구원
Priority date: 1998-11-17
Filing date: 1998-11-17
Publication date: 2002-09-05
Also published as: KR20000032648A

Abstract

The present invention relates to a method for automatically interpreting form document images. More specifically, its object is to provide an effective method that takes the large volume of existing paper office forms as input through a device such as a scanner or camera, interprets them, and converts them into electronic form documents, so that forms with diverse and complex structures can be processed automatically.

According to the present invention, there is provided a method for automatically interpreting a form document image in order to convert an arbitrary filled-in form document image into an electronic form document, the method comprising: a first step of analyzing the structure of the form document image to build a representation model of the input form document image; a second step of comparing the representation model of the input form with the standard models already registered; a third step of, if the comparison in the second step finds no matching model, registering the representation model as a new model and then adding the items to be processed for the new form to an item definition database; and a fourth step of, if the comparison in the second step finds a matching model, classifying the input as the matching form, extracting the item regions to be processed from the classified input form document image, and converting them into an electronic form document.

Description

Method for Automatic Interpretation of Form Document Images

The present invention relates to a method for automatically interpreting form document images. More specifically, it relates to an effective method that takes the large volume of existing paper office forms as input through a device such as a scanner or camera, interprets them, and converts them into electronic form documents, so that forms with diverse and complex structures can be processed automatically.

Form documents can be divided into two kinds of format: a physical format, in which the position and size of every item are strictly fixed, and a logical format, in which forms are regarded as the same form even when the positions and sizes of items vary, as long as the topology and content are preserved.

A form document in a physical format can be processed by removing only the form portion from the scanned document image and then applying OCR to the relevant item regions. This approach, however, requires prior knowledge of every form to be interpreted to be registered as a form model. Because most documents used in an office are re-edited or resized by individual users while following a prescribed format, the conventional approach required several form models to be registered for a single form. For this reason, existing physical-format interpretation methods are not effective for automatically processing scanned form document images.

The present invention was devised to solve the problems of the prior art described above. Its object is to provide a method for automatically interpreting form document images that analyzes the structure of a form to generate a representation model, compares that model with the existing standard models to classify the form by its logical format, performs character segmentation and character recognition on the item regions to be extracted from the classified form, and automatically converts the result into a reusable, editable electronic document, so that forms contained in document images can be processed automatically.

FIG. 1 is a flowchart of a method for automatically interpreting a form document image according to an embodiment of the present invention.

FIG. 2 is a flowchart of the representation model generation process of FIG. 1.

FIG. 3 is a flowchart of the form classification and registration process of FIG. 1.

FIG. 4 is an example of a form document used to generate a representation model of a form according to the present invention.

FIG. 5A is an example of a form document that can be used for form classification and registration according to the present invention.

FIG. 5B is an example run of registering the form document shown in FIG. 5A.

FIG. 6 is an example run of extracting items from the form document shown in FIG. 5A.

FIG. 7 is an example run of segmenting the characters within an item of the form document shown in FIG. 5A.

FIG. 8 is an example in which the recognition result for the form document shown in FIG. 5A is converted into an HTML file.

FIGS. 9A to 9D show an arbitrary form to which the present invention was applied and the results of analyzing that form with the method according to the present invention.

To achieve the object described above, the present invention provides a method for automatically interpreting a form document image in order to convert an arbitrary filled-in form document image into an electronic form document, the method comprising: a first step of analyzing the structure of the form document image to build a representation model of the input form document image; a second step of comparing the representation model of the input form with the standard models already registered; a third step of, if the comparison in the second step finds no matching model, registering the representation model as a new model and then adding the items to be processed for the new form to an item definition database; and a fourth step of, if the comparison in the second step finds a matching model, classifying the input as the matching form, extracting the item regions to be processed from the classified input form document image, and converting them into an electronic form document.

More preferably, the first step comprises: a fifth step of extracting the line segments of the form from the scanned form document image; a sixth step of extracting, from the extracted vertical and horizontal line segments, the rectangles that enclose the items of the form; and a seventh step of generating the representation model of the input form image from information on the length, height, and position of the extracted vertical and horizontal line segments and the relationships among them, together with information on the top, bottom, left, and right edges that make up each extracted rectangle.

More preferably, the fifth step comprises: an eighth step of extracting, by mathematical morphology operations, the connected components that make up horizontal and vertical line segments; a ninth step of computing the line width and slope of each extracted connected component and filtering out components unsuitable to be regarded as form lines; and a tenth step of sorting all horizontal connected components from top to bottom and all vertical connected components from left to right, and grouping horizontal or vertical connected components that lie close to one another into a single line, thereby extracting the line segments of the form.

More preferably, the sixth step comprises: an eighth step of excluding a candidate rectangle if the union of rectangles covers the entire candidate rectangle; a ninth step of extracting a candidate rectangle as a form item if the edges of other rectangles cover the whole of one edge of the candidate rectangle; and a tenth step of extracting a candidate rectangle as a form item if the edges of other rectangles cover part of one edge of the candidate rectangle and the uncovered part is sufficiently long.

More preferably, in the tenth step, horizontal or vertical connected components that lie close to one another are grouped into a single line as follows: if two connected components i and j are on the same side of the standard line segment, the distance d_ij is |d_i - d_j|; if they are on opposite sides, d_ij is |d_i + d_j|; and if this distance is at or below a given threshold, the components are judged to be close and are grouped into a single line.

More preferably, in the second step, the distance between the input form (its logical structure information together with its physical position and size information) and each currently registered form model is computed, and a match is determined by checking whether the computed distance is at or below a predetermined value.

The present invention also provides a recording medium storing a program that causes a computer to execute a method for automatically interpreting a form document image in order to convert an arbitrary filled-in form document image into an electronic form document, the program executing: a first step of analyzing the structure of the form document image to build a representation model of the input form document image; a second step of comparing the representation model of the input form with the standard models already registered; a third step of, if the comparison in the second step finds no matching model, registering the representation model as a new model and then adding the items to be processed for the new form to an item definition database; and a fourth step of, if the comparison in the second step finds a matching model, classifying the input as the matching form, extracting the item regions to be processed from the classified input form document image, and converting them into an electronic form document.

Unlike conventional methods, the present method does not rely on simple image-to-image matching alone; it analyzes the structure of the form document and uses its physical coordinates together with its logical structure. Using the logical structure for form matching makes it possible to cope with variations in the position and size of a form document, which removes the need to register the same form multiple times, as is required when forms are classified by physical format alone.

Furthermore, the present invention can handle cases that were major causes of item extraction errors in conventional methods, namely form lines that are partially broken and handwritten characters inside an item that touch the form lines. It can also handle complex forms, such as a rectangle enclosed within another rectangle, so the item extraction error rate can be markedly lower than with conventional methods.

A preferred embodiment of the method for automatically interpreting form document images according to the present invention is described in detail below with reference to the accompanying drawings.

In the drawings, FIG. 1 is a flowchart of a method for automatically interpreting a form document image according to an embodiment of the present invention, FIG. 2 is a flowchart of the representation model generation process of FIG. 1, and FIG. 3 is a flowchart of the form classification and registration process of FIG. 1.

As shown in the drawings, in the method according to an embodiment of the present invention, when a form document image is received (101), a representation model of the form is generated (102).

In the representation model generation process, as shown in FIG. 2, the line segments of the form are first extracted from the scanned form document image (201), and the rectangles enclosing the items of the form are then extracted from the extracted vertical and horizontal line segments (202).

Next, the representation model of the input form image is generated (203) from information such as the length, height, and position of the extracted vertical and horizontal line segments and the relationships among them, together with information on the top, bottom, left, and right edges that make up each extracted rectangle, and the process ends.

Line segment extraction (201) proceeds as follows. First, the connected components that make up horizontal and vertical line segments are extracted by mathematical morphology operations. Next, the line width, slope, and similar properties of each connected component are computed, and components unsuitable to be regarded as form lines are filtered out. Finally, all horizontal connected components are sorted from top to bottom and all vertical connected components from left to right, and horizontal or vertical components that lie close to one another are grouped into a single line.
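As a rough illustration of the first sub-step: a binary morphological opening with a 1 x k horizontal structuring element keeps exactly those foreground pixels that lie in horizontal runs of at least k pixels. The sketch below implements that effect directly in pure Python for one-pixel-thick lines. The function names, the `min_len` parameter, and the toy image are assumptions made for illustration; the patent does not specify structuring elements or sizes, and real form lines are several pixels thick.

```python
def horizontal_runs(image, min_len):
    """Return (row, x_start, x_end) for every horizontal run of 1-pixels
    that is at least min_len long. Keeping only such runs has the same
    effect as opening with a 1 x min_len horizontal structuring element."""
    runs = []
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:
                    runs.append((y, start, x - 1))
            else:
                x += 1
    return runs


def vertical_runs(image, min_len):
    """Return (column, y_start, y_end) for long vertical runs,
    found by transposing the image and reusing horizontal_runs."""
    transposed = [list(col) for col in zip(*image)]
    return horizontal_runs(transposed, min_len)


# Toy binary image (1 = ink): one long horizontal line, one long vertical
# line, and a few short strokes standing in for handwriting.
image = [
    [1, 1, 1, 1, 1, 1],
    [1, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [1, 0, 0, 0, 0, 0],
]
h_lines = horizontal_runs(image, 4)
v_lines = vertical_runs(image, 4)
```

With `min_len` set to 4, the short strokes are discarded and only the top row and the left column survive as line candidates.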

To decide whether the current connected component is close to the current form line segment, the distance d_ij between two connected components i and j is taken to be |d_i - d_j| if they are on the same side of the standard line segment, and |d_i + d_j| if they are on opposite sides.

Here, the standard line segment is the connected component that has the same slope as the form and passes through the longest line segment of the form, and d_i and d_j are the distances from the standard line segment to the center of each connected component.
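The distance rule can be sketched as follows. This is a minimal illustration under stated assumptions: each component is represented by a hypothetical pair `(d, side)`, where `d` is its unsigned distance from the standard line segment and `side` is +1 or -1; components are grouped by chaining each one to the previous component in sorted order; and the threshold value is arbitrary. None of these representation choices comes from the patent.

```python
def component_distance(d_i, d_j, same_side):
    """d_ij = |d_i - d_j| on the same side of the standard line segment,
    |d_i + d_j| on opposite sides."""
    return abs(d_i - d_j) if same_side else abs(d_i + d_j)


def group_components(components, threshold):
    """Group components whose mutual distance is at or below threshold.

    components: list of (d, side) pairs, side being +1 or -1.
    Returns a list of groups, each a list of indices into components.
    """
    # Sort by signed offset so that neighbours in the list are neighbours
    # in the image (top to bottom, or left to right).
    order = sorted(range(len(components)),
                   key=lambda k: components[k][0] * components[k][1])
    groups = []
    for k in order:
        d_k, side_k = components[k]
        if groups:
            d_p, side_p = components[groups[-1][-1]]
            if component_distance(d_k, d_p, side_k == side_p) <= threshold:
                groups[-1].append(k)
                continue
        groups.append([k])
    return groups


# Five components: two near offset 10, two straddling the standard line, one far away.
components = [(10, +1), (12, +1), (1, -1), (1, +1), (40, +1)]
groups = group_components(components, threshold=3)
```

Components 2 and 3 lie on opposite sides of the standard line segment, so their distance is |1 + 1| = 2 and they fall into the same group.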

Next, the process (202) of extracting the rectangles of the form from the vertical and horizontal line segments obtained by the line segment extraction process (201) is described.

To determine whether an extracted horizontal or vertical line of the form is a valid edge of a rectangle enclosing a form item, candidates are processed according to the following conditions.

First, if the union of rectangles covers an entire candidate rectangle, that candidate rectangle is excluded from consideration.

Second, if the edges of other rectangles cover the whole of one edge of a candidate rectangle, that edge is valid, and the candidate rectangle is extracted.

Third, if the edges of other rectangles cover part of one edge of a candidate rectangle and the uncovered part is sufficiently long, that edge is also valid, and the candidate rectangle is likewise extracted.
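One possible reading of the second and third conditions is a one-dimensional coverage test on a single candidate edge. The sketch below is only that reading: edges are modelled as hypothetical `(start, end)` intervals along one axis, `min_uncovered` stands in for the unspecified "sufficiently long" length, and an edge with no coverage at all is treated as not validated, which is an assumption because the text does not address that case.

```python
def uncovered_length(edge, covers):
    """Length of the part of `edge` not covered by any interval in `covers`.
    edge and each cover are (start, end) intervals on the same axis."""
    a, b = edge
    # Clip the covering intervals to the edge and sweep them left to right.
    clipped = sorted((max(a, s), min(b, e)) for s, e in covers if e > a and s < b)
    covered = 0
    cursor = a
    for s, e in clipped:
        if e > cursor:
            covered += e - max(s, cursor)
            cursor = e
    return (b - a) - covered


def edge_is_valid(edge, covers, min_uncovered):
    """Apply the second and third conditions to one candidate edge."""
    gap = uncovered_length(edge, covers)
    length = edge[1] - edge[0]
    if gap == 0:
        return True                       # whole edge covered (second condition)
    if gap < length and gap >= min_uncovered:
        return True                       # partly covered, long uncovered part (third condition)
    return False                          # assumption: all other cases are not validated


partial = uncovered_length((0, 100), [(0, 40), (30, 60)])
```

Here the two covering edges overlap between 30 and 40, so 60 units of the candidate edge are covered and 40 remain uncovered.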

From the information extracted by the structure analysis described above, a representation model of the form is built as follows. The model consists of the line segment information (NH, NV, H, V, D_h, D_v) and the rectangle information (NR, (X_min, Y_min), (X_max, Y_max), (E_top, E_bottom, E_left, E_right)).

Here, NH and NV are the numbers of extracted horizontal and vertical line segments; H and V are matrices representing the relationships among the horizontal and vertical lines; D_h and D_v are the lengths of each horizontal and vertical line segment; NR is the number of extracted rectangles; (X_min, Y_min) and (X_max, Y_max) are the coordinates of the top-left and bottom-right corners of the bounding rectangle enclosing a form item; and (E_top, E_bottom, E_left, E_right) identify the four edges (top, bottom, left, right) of a rectangle by the index of the horizontal or vertical line that forms each edge.
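One way to hold this model in code is a pair of small data classes. The class and field names below are hypothetical, and the contents of H and V are left opaque because the text does not define how the line relationships are encoded. The sample values (five horizontal lines, three vertical lines, edge indices (2, 4, 0, 1)) echo the FIG. 4 example in the description; the coordinates and lengths are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Rectangle:
    top_left: Tuple[int, int]          # (X_min, Y_min)
    bottom_right: Tuple[int, int]      # (X_max, Y_max)
    edges: Tuple[int, int, int, int]   # (E_top, E_bottom, E_left, E_right) as line indices


@dataclass
class FormModel:
    H: List[List[int]]                 # relationship matrix for horizontal lines (encoding unspecified)
    V: List[List[int]]                 # relationship matrix for vertical lines (encoding unspecified)
    D_h: List[int]                     # one length per horizontal line segment
    D_v: List[int]                     # one length per vertical line segment
    rectangles: List[Rectangle] = field(default_factory=list)

    @property
    def NH(self) -> int:
        return len(self.D_h)

    @property
    def NV(self) -> int:
        return len(self.D_v)

    @property
    def NR(self) -> int:
        return len(self.rectangles)


model = FormModel(
    H=[], V=[],                        # left empty: contents not specified in the text
    D_h=[0] * 5, D_v=[0] * 3,          # placeholder lengths
    rectangles=[Rectangle((0, 0), (10, 10), (2, 4, 0, 1))],  # placeholder coordinates
)
```

Deriving NH, NV, and NR from the stored lists keeps the counts consistent with the data they describe.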

After the representation model generation process (102), the form classification and registration process is performed.

The form classification and registration process (103), which matches the representation model generated for the input form document image against the standard models already registered and classifies it, is described next.

In the form classification and registration process (103), the logical structure information and the physical position and size information obtained from the structure analysis are matched against the models of the forms already registered. If a matching form document exists, processing moves on to the steps appropriate to that form. If none exists, the input is a new form document, so the name of the form and information on the items to be processed are obtained from the user and a new form model is registered.

The advantage of this process is that, unlike existing methods, it does not simply match images against each other; it analyzes the structure of the form document and uses its physical coordinates together with its logical structure.

Using the logical structure for matching makes it possible to cope with variations in the position and size of a form document, which resolves the problem of registering the same form multiple times, a drawback of classifying forms by physical format alone.

The form classification and registration process shown in FIG. 3 is described in detail below.

As shown in FIG. 3, to classify the input form document image, the distance between the input form document image F and every form model F_i is computed, and the form model F_s with the smallest distance is found (301).

It is then determined whether the distance between F_s and F is smaller than an experimentally determined threshold (302). If it is smaller, the input form document image is classified as form model F_s (304) and the process ends; otherwise, the input form document image F is registered as a new form model F (303) and the process ends.
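Steps 301 to 304 amount to a nearest-model search followed by a threshold test. The sketch below shows that control flow only. The function name, the dictionary of registered models, and the toy distance (sum of absolute differences over the counts NH, NV, NR) are all assumptions: the patent does not give the distance formula, and the threshold is stated to be determined experimentally.

```python
def classify_or_register(input_model, registered, distance, threshold, new_name):
    """Return (form_name, is_new).

    registered: dict mapping form names to models (mutated on registration).
    distance:   callable(model_a, model_b) -> number (hypothetical).
    new_name:   name supplied by the user if the form turns out to be new.
    """
    best_name, best_dist = None, float("inf")
    for name, model in registered.items():          # step 301: find the closest model F_s
        d = distance(input_model, model)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_name is not None and best_dist < threshold:   # step 302
        return best_name, False                     # step 304: classify as F_s
    registered[new_name] = input_model              # step 303: register as a new model
    return new_name, True


def toy_distance(a, b):
    """Placeholder distance over (NH, NV, NR) tuples; not the patent's formula."""
    return sum(abs(x - y) for x, y in zip(a, b))


registered = {"invoice": (5, 3, 7)}
first = classify_or_register((5, 3, 7), registered, toy_distance, 1, "unused")
second = classify_or_register((9, 6, 20), registered, toy_distance, 1, "receipt")
```

The first input matches the registered model exactly, so it is classified; the second is too far from every registered model, so it is added under the user-supplied name.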

After the form classification and registration process (103), it is determined whether a registration was performed (104).

If a new form model was registered, the items to be processed for the new form model are inserted into the item definition database (107), and character segmentation and character recognition within the items are then performed (106). If no new registration was performed, the information on each item to be processed in form F_s is retrieved from the item definition database (105), and each item is processed by the character segmentation and recognition process (106) appropriate to it.

The results of the above steps then go through a conversion process (108) that produces a reusable, re-editable file.

In this embodiment, HTML is used as the output file format, and the final processing result for the input form image is expressed as an HTML file.

The method described above is now illustrated concretely with a worked example based on a sample document.

First, the procedure for building the representation model of a form from the line segment information (NH, NV, H, V, D_h, D_v) and the rectangle information (NR, (X_min, Y_min), (X_max, Y_max), (E_top, E_bottom, E_left, E_right)) is described.

FIG. 4 shows an example of a simple form document to which the present invention is applied. In the representation model F of the form shown in FIG. 4, NH is 5 and NV is 3, and the matrices H and V representing the relationships among the horizontal and vertical lines are as follows. (The matrices themselves are not reproduced in this text.)

The values of D_h and D_v are the lengths of the horizontal and vertical components; for example, the value of D_h'0' is the number of pixels between horizontal line 0 and horizontal line 1.

NR is 7 (the number of rectangles), and (X_min, Y_min) and (X_max, Y_max) are the bounding-rectangle coordinates of each rectangle, which are used in the item extraction stage. As an example of computing (E_top, E_bottom, E_left, E_right), take rectangle R4: its top edge E_top is horizontal line 2, its bottom edge E_bottom is horizontal line 4, its left edge E_left is vertical line 0, and its right edge E_right is vertical line 1, giving (2, 4, 0, 1).

Next, the form classification and registration procedure is described with reference to FIGS. 5A and 5B.

First, the input is matched against the form documents currently registered. If a matching form document exists, processing moves on to the steps appropriate to that form; if not, the input is a new form document, so the form's name is obtained from the user and it is registered as a new form.

After the representation model is obtained for an arbitrarily drawn form (FIG. 5A) whose items contain handwritten digits, it is compared with the reference models to determine whether it is an already registered model or a new one. If it is new, it is registered among the reference models through a registration window, as in the example run shown in FIG. 5B.

FIG. 6 shows, with brief annotations, an example run in which the items of the form are extracted on the basis of the information for each rectangle in the representation model F obtained in the representation model generation stage.

The processing screen in FIG. 6 finds the horizontal and vertical line segments of the form, finds the rectangles of the form, and then, from the rectangle information (the number of rectangles, the coordinates of each rectangle, and the indices of the horizontal and vertical line segments that form each rectangle's four edges), locates the parts corresponding to the items of the form, which are shown in FIG. 6 by the small pop-up window and the boxes.
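Once a rectangle's corner coordinates are known, cutting out the item region is a simple crop. The sketch below assumes an image stored as a list of pixel rows and treats both corners as inclusive; the function name and the sample data are hypothetical.

```python
def crop_item(image, top_left, bottom_right):
    """Return the sub-image bounded by (X_min, Y_min) and (X_max, Y_max),
    both corners inclusive. image is a list of rows of pixel values."""
    (x_min, y_min), (x_max, y_max) = top_left, bottom_right
    return [row[x_min:x_max + 1] for row in image[y_min:y_max + 1]]


# 3 x 4 toy image whose pixel values are just their running index.
page = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
]
item = crop_item(page, (1, 0), (2, 1))
```

The cropped region would then be handed to the character segmentation and recognition steps.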

FIG. 7 shows an example of segmenting the characters within an item. The items extracted in the item extraction stage are numbered in the order shown in FIG. 7, according to a fixed rule applied when the representation model is generated. As shown in FIG. 7, the bounding-rectangle coordinates of the characters in each extracted item are passed to a character recognizer, which recognizes the characters one at a time; the recognizer itself is not described here.

The recognition results are then converted automatically, by tag generation, into an HTML file (a Web document), as shown in FIG. 8.
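A minimal sketch of such tag generation is given below, assuming the recognition results are available as (item name, recognized text) pairs. The function name `items_to_html` and the table layout are hypothetical choices for illustration; the actual output of FIG. 8 may be structured differently.

```python
from html import escape


def items_to_html(title, items):
    """Build a simple HTML page from (item name, recognized text) pairs."""
    rows = "\n".join(
        f"      <tr><th>{escape(name)}</th><td>{escape(value)}</td></tr>"
        for name, value in items
    )
    return (
        "<html>\n"
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        "    <table border=\"1\">\n"
        f"{rows}\n"
        "    </table>\n"
        "  </body>\n"
        "</html>\n"
    )
```

Escaping the recognized text keeps characters such as `<` or `&` from being read as markup in the generated file.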

Finally, FIG. 9 shows an example in which the automatic interpretation method for form document images according to the present invention is actually applied to a general form. FIGS. 9A to 9D show, respectively, an example form image, the structure analysis result screen for the form image, the classification result screen for the form image, and the item processing result screen for the form image.

As described in detail above, the automatic interpretation method for form document images of the present invention takes the large existing volume of paper office forms as input through an input medium such as a scanner or camera, interprets them, and converts them into electronic form documents, so that form document images of diverse and complex structure can be processed automatically. To analyze the structure of an arbitrary filled-in form, the method includes a logical structure analysis method that treats forms as identical even when their size and position vary, as long as the topology is preserved. Form document images can therefore be classified effectively with a minimal number of registrations.

In addition, good processing results can be obtained even in cases that were difficult for conventional methods to handle: where the lines of the form are broken, where the form is drawn with dotted lines, where characters written in an item touch the lines of the form, and where the form is skewed.

The technical idea of the automatic interpretation method for form document images of the present invention has been described above with reference to the accompanying drawings, but this description illustrates the best embodiment of the invention by way of example and does not limit the invention. It is also evident that anyone with ordinary skill in the art can make various modifications and imitations without departing from the scope of the technical idea of the invention.

Claims

A method for automatically interpreting a form document image, which converts an arbitrary filled-in form document image into an electronic form document, the method comprising:

a first step of analyzing the structure of the form document image to create a representation model that represents the input form document image;

a second step of comparing the representation model of the input form with the standard models already registered;

a third step of, if the comparison in the second step finds no matching model, registering the representation model as a new model and then adding the items to be processed for the new form to an item definition database; and

a fourth step of, if the comparison in the second step finds a matching model, classifying the input as the matching form and then extracting the item regions to be processed from the classified input form document image and converting them into an electronic form document.

The method of claim 1,

wherein the first step comprises:

a fifth step of extracting the line segments of the form from the scanned form document image;

a sixth step of extracting the rectangles that enclose the items of the form from the extracted vertical and horizontal line segments; and

a seventh step of generating the representation model of the input form image based on the length, height, position, and inter-segment relationships of the extracted vertical and horizontal line segments and on information about the top, bottom, left, and right edges that form the extracted rectangles.

The method of claim 2,

wherein the fifth step comprises:

an eighth step of extracting the connected components that form horizontal and vertical line segments by mathematical morphology operations;

a ninth step of computing the width and slope of the line segment for each extracted connected component and filtering out the connected components that are not suitable to be regarded as lines of the form; and

a tenth step of sorting all horizontal connected components from top to bottom and all vertical connected components from left to right, and then grouping mutually adjacent horizontal or vertical connected components into single lines to extract the line segments of the form.

The method of claim 2,

wherein the sixth step comprises:

an eighth step of excluding a candidate rectangle if the sum of other rectangles covers the entire candidate rectangle;

a ninth step of extracting a candidate rectangle as a form item if the edges of other rectangles cover an entire edge of the candidate rectangle; and

a tenth step of extracting a candidate rectangle as a form item if the edges of other rectangles cover only part of one edge of the candidate rectangle and the uncovered part is sufficiently long.

The method of claim 3,

wherein, in the tenth step, the grouping of mutually adjacent horizontal or vertical connected components into a single line is performed such that:

if two connected components i and j lie on the same side of a reference line, the distance d_ij between them is |d_i - d_j|, and if the two connected components i and j lie on opposite sides of the reference line, the distance d_ij is |d_i + d_j|; and if the distance is less than a predetermined threshold, the components are judged to be near each other and the horizontal or vertical connected components are grouped into a single line.

The method of claim 1,

wherein, in the second step,

a distance is computed between the logical structure information and the physical position and size information of the form and those of the currently registered form models, and it is determined whether the computed distance is less than or equal to a predetermined value.

A recording medium on which is recorded a program for causing a computer to execute

the fourth step of, if the comparison in the second step finds a matching model, extracting the item regions to be processed from the classified input form document image and converting them into an electronic form document.
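The distance rule used for grouping connected components in the claims above can be illustrated with the following sketch. It assumes that d_i is the distance of component i from the reference line and that each component also carries a side flag (+1 or -1); the names `pair_distance` and `group_components`, the `(d, side)` tuple layout, and the restriction to comparing only neighbouring components in a pre-sorted list are all assumptions made for illustration.

```python
def pair_distance(d_i, side_i, d_j, side_j):
    """Distance d_ij between two components relative to a reference line.

    Same side of the line:      |d_i - d_j|
    Opposite sides of the line: |d_i + d_j|
    """
    if side_i == side_j:
        return abs(d_i - d_j)
    return abs(d_i + d_j)


def group_components(components, threshold):
    """Group neighbouring components whose distance is below the threshold.

    components -- list of (d, side) tuples, assumed already sorted along the
                  scan direction (top to bottom, or left to right)
    threshold  -- hypothetical predetermined threshold
    """
    groups = []
    for comp in components:
        if groups and pair_distance(*groups[-1][-1], *comp) < threshold:
            groups[-1].append(comp)  # close to the previous one: same line
        else:
            groups.append([comp])    # start a new line
    return groups
```

With signed offsets (distance multiplied by side), both cases reduce to the absolute difference of the two offsets, which is why the rule treats components on opposite sides of the reference line by adding their distances.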