KR100573392B1

KR100573392B1 - Method and System for digitalizing a large volume of documents based on character recognition with adaptive training module to real data

Info

Publication number: KR100573392B1
Application number: KR1020040007621A
Authority: KR
Inventors: 곽희규
Original assignee: 한국과학기술원; 주식회사 유씨티코리아
Priority date: 2004-02-05
Filing date: 2004-02-05
Publication date: 2006-04-25
Also published as: KR20050079378A

Abstract

The present invention relates to the efficient configuration of character recognition in document digitization and, in particular, to the automatic generation of the representative pattern models loaded into a character recognition engine, so that the engine can be adaptively trained on real data when digitizing large and varied document collections.

The invention comprises the steps of: extracting, from document data containing the structure and text information of document images, the frequency of appearance of each character pattern contained in the documents to be digitized; segmenting an individual image for each character pattern using the document structure and segmentation information; extracting statistical features from each character pattern image using the individual image segmentation information; and generating, from the statistical features, a representative model for each character pattern and providing it for comparison with input character patterns. The character recognition engine of the document digitization system is thereby equipped with new representative pattern models for the real data, which can maximize its performance.

Document digitization, character recognition, representative pattern, feature extraction, adaptive learning

Description

Method and System for digitalizing a large volume of documents based on character recognition with adaptive training module to real data

FIG. 1 is a diagram showing the classification of character recognition by recognition target.

FIG. 2 is a block diagram of a document digitization system based on character recognition.

FIG. 3 is a diagram showing the character recognition process.

FIG. 4 is a diagram showing the configuration of the adaptive learning module of a character recognition engine according to an embodiment of the present invention.

FIG. 5 is a diagram showing an example distribution of the frequency of appearance of character patterns in a document.

FIG. 6 is a diagram showing the character pattern information in the document image structure and segmentation information DB.

FIG. 7A is a diagram showing a nonlinear shape normalization method for a character region.

FIG. 7B is a diagram showing the direction feature of contour pixels extracted from a normalized image.

FIG. 8A is an example of a representative model of a character pattern rendered from contour pixels (printed characters).

FIG. 8B is an example of a representative model of a character pattern rendered from contour pixels (handwritten characters).

FIG. 9A is a diagram showing an example of the adaptive learning module interface (stage 1).

FIG. 9B is a diagram showing an example of the adaptive learning module interface (stage 2).

FIG. 9C is a diagram showing an example of the adaptive learning module interface (stage 3).

FIG. 9D is a diagram showing an example of the adaptive learning module interface (stage 4).

FIG. 10 is a diagram showing the hierarchical input/output relationship of the four stages of the adaptive learning process.

FIG. 11A is a diagram showing an example of the character pattern appearance frequency survey generated in stage 1.

FIG. 11B is a diagram showing 100 '가' patterns segmented from document images (stage 2).

FIG. 12 is a diagram showing the changes in recognition performance and in the number of character models of the character recognition engine.

The present invention relates to character recognition technology, which has been used to meet the demand for automated input when digitizing large volumes of documents, and in particular to an efficient document digitization method and system using an adaptive learning module. Character recognition is broadly divided into on-line and off-line recognition, and document digitization falls under the latter. FIG. 1 shows the classification of character recognition by recognition target.

In general, handwritten characters exhibit more complex variation than printed characters, and the central issue in recognizing handwriting is how to absorb that variation. For constrained handwriting, writers are sometimes asked to fit characters to a cross-shaped (田) guide so as to minimize variation. For printed text as well, it is common to use a single-font (mono-font OCR) engine rather than a multi-font (omni-font OCR) engine in order to minimize variation across typefaces. This shows how strongly recognizer performance varies with the real data.

Character recognition consists of extracting features from input data and classifying the input as one of several predefined models. The models built in advance during training serve as the criteria that map an input to a single concept through recognition, so the nature of the models limits the data that can be recognized and affects recognition performance.

The difficulty of machine character recognition has a qualitative aspect and a quantitative aspect. The qualitative difficulty arises because the human ability to recognize characters rests on intuition and experience, and no general method has been established for objectifying and quantifying that process and formulating it as an algorithm for a computer. For printed characters, for example, difficulties arise from mismatches in typeface (font), size, and slant.

The quantitative difficulty arises when the required memory capacity or recognition time makes a practical implementation infeasible, and it is especially severe for character sets with high information content. For example, the difficulty grows as the recognition target changes from digits (10 classes) to English letters (52 classes), Hangul (11,172 classes), and Chinese characters (about 50,000 classes). As the number of character classes increases, the number of character shapes that must be stored increases, and so does the scale of the recognizer that must compare and decide among the many candidates.

These qualitative and quantitative difficulties make it hard to implement a general-purpose character recognizer that performs consistently on all data. To digitize large volumes of documents efficiently, it is therefore unavoidable to add to existing document digitization systems a module that can be applied in a way that minimizes the difficulty of character recognition.

Accordingly, an object of the present invention, which addresses the above problems of the prior art, is to provide a method and system for digitizing documents that is best suited to the real data and maintains consistent, optimal performance without having to take the qualitative and quantitative aspects of the real data into account.

Another object of the present invention is to provide a method and system for adaptively driving a character recognition engine so that the document digitization system operates in the way best suited to the real document data.

Still another object of the present invention is to provide a method and system in which the document digitization system is configured adaptively, so that the operator can, at any time of the operator's choosing, use already-digitized document data to generate models of the actual patterns.

To achieve these objects, the present invention mounts an off-line adaptive learning module on the character recognition engine of a document digitization system. The off-line adaptive learning module examines the distribution of patterns appearing in the real document data and builds new models for those patterns. Because the adaptive learning module lets the operator use already-digitized documents to generate models of the actual patterns at any time, the character recognition engine does not need to take the qualitative and quantitative aspects of the real data into account and can maintain consistent, optimal performance suited to that data.

To drive the document digitization system adaptively, the present invention uses the document structure and segmentation information and the text information generated during digitization. Specifically, the operation of the adaptive learning module comprises the steps of: extracting, from document data containing the structure and text information of document images, the frequency of appearance of each character pattern contained in the documents to be digitized; segmenting an individual image for each character pattern using the document structure and segmentation information; extracting statistical features from each character pattern image using the individual image segmentation information; and generating, from the statistical features, a representative model for each character pattern and providing it for comparison with input character patterns.

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Note that the same components are denoted by the same reference numerals and symbols wherever possible, even when they appear in different drawings. In the following description, detailed explanations of related known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention. In the following description, "adaptive learning" or "learning process" means the process by which the adaptive learning module according to an embodiment of the present invention generates representative models of character patterns using the segmentation information DB of the document images and the document text information generated during document digitization.

FIG. 2 shows a document digitization system to which the present invention is applied. Referring to FIG. 2, document digitization extracts text information from input document images through an automation stage followed by a verification and correction stage performed through a verification interface. The automation stage consists of character region extraction by structural analysis and segmentation of the document image, character recognition, and related steps. The techniques in FIG. 2 for clustering by character recognition and for cluster-level verification are disclosed in detail in Korean Patent Application No. 2002-0067286 (system for inputting and correcting old documents through Chinese character image clustering) and in Korean Patent Application No. 2003-0088230 (system and method for verifying document digitization based on character clustering technology), the latter filed earlier by the present inventor.

FIG. 3 shows the character recognition process of a large-volume document digitization system. A recognition result is generated for each input pattern through preprocessing, feature extraction, classification, and postprocessing. In the classification step, the input pattern is compared with previously trained character models to determine the most likely text code, and the adaptive learning module of the present invention automatically generates character models suited to the input patterns. The information used in the learning process is the segmentation information DB of the document images and the text DB, both generated during digitization.
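The classification step can be pictured with a short Python sketch. It assumes each character model is a plain feature vector and that the comparison is a Euclidean distance; the source does not name the distance measure, so the metric, the function name `classify`, and the toy values are illustrative assumptions only.

```python
import math

def classify(feature, models):
    """Return the text code of the representative model nearest to `feature`.

    `models` maps a text code to its model vector. Euclidean distance is an
    assumption made for this sketch, not a detail taken from the source.
    """
    if not models:
        raise ValueError("no representative models loaded")
    return min(models, key=lambda code: math.dist(feature, models[code]))

# Toy 3-dimensional models for two character codes (illustrative values only).
models = {"가": [1.0, 0.0, 0.0], "나": [0.0, 1.0, 0.0]}
best = classify([0.9, 0.1, 0.0], models)
```

Replacing the entries of `models` with models built from the real documents changes the decision without touching the classifier itself, which is the role the adaptive learning module plays.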

<Principle of the Invention>

FIG. 4 shows the configuration of the adaptive learning module of a character recognition engine according to an embodiment of the present invention. Referring to FIG. 4, the document structure/segmentation information DB (110) stores the structure/segmentation information data generated by the document digitization unit (100) during document digitization. The document text information DB (130) stores the document text information for which the automation stage and the verification and correction stage through the verification interface have been completed. The character recognition module (120) uses the representative character pattern models provided by the adaptive learning module (200), which are the models best suited to the real document data, to determine the most likely text code in the classification step, and this result is used for document digitization.

An operator responsible for digitizing a large volume of documents can run the adaptive learning module (200) whenever the performance of the character recognition module (120) is not optimal, and thereby keep recognition performance consistently suited to the real document data. Running the adaptive learning module (200) on demand in this way has two advantages, which address the qualitative and the quantitative difficulty of character recognition respectively. First, the qualitative difficulty stems from a generalization problem: the character recognition engine fails to absorb the differences between its training data and the real data. Training the engine on the real data therefore makes it easier to absorb the various conditions of that data, such as typeface shape, size, slant, and image quality. Second, the quantitative difficulty is especially severe when the data to be recognized carries a large amount of information. Training only on the patterns that actually appear in the real document data reduces the number of character shapes the engine must store and the number of candidates it must compare, which addresses this aspect.

The adaptive learning module (200) comprises: a character pattern distribution survey unit (210) that examines, from the data stored in the document text information DB (130), the distribution of appearance of each of the character patterns contained in the documents to be digitized; a character pattern segmentation unit (230) that segments character pattern information from the data stored in the document structure/segmentation information DB (110); a statistical feature extraction unit (250) that extracts statistical feature information from each segmented character pattern; and a representative model generation unit (270) that generates a representative model of each character pattern using the average or weighted average of the feature values extracted by the statistical feature extraction unit (250).

The adaptive learning process performed by the adaptive learning module (200) shown in FIG. 4 is described in detail below.

A. Surveying the distribution of character patterns appearing in the documents

The operator runs the adaptive learning module (200) so that the character recognition module (120) performs optimally. When the adaptive learning module (200) runs, the character pattern distribution survey unit (210) examines, from the data stored in the document text information DB (130), the distribution of appearance of each of the character patterns contained in the documents to be digitized. A document contains many character patterns, each with its own frequency of appearance: some patterns are used often, while others are hardly used at all. If the character recognition engine has no model, or a poorly fitting model, for a rarely used pattern, overall performance is not greatly affected; a poor fit for a frequently used pattern, however, degrades performance severely. This means that, when digitizing large volumes of documents, the character recognition engine must be sensitive to the frequency of appearance of character patterns.

The information on the distribution of appearance of each character pattern examined by the character pattern distribution survey unit (210) is stored in the pattern appearance frequency survey (220).
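A minimal Python sketch of such a frequency survey follows. It assumes the verified text is available as plain strings and that whitespace is not a character pattern; the function names and the `min_count` threshold are hypothetical choices for illustration, not details from the source.

```python
from collections import Counter

def survey_frequencies(texts):
    """Count how often each character pattern appears in the verified text."""
    counts = Counter()
    for text in texts:
        # Assumption: whitespace is layout, not a character pattern to model.
        counts.update(ch for ch in text if not ch.isspace())
    return counts

def select_patterns(counts, min_count=1):
    """Patterns that appear at least `min_count` times, most frequent first."""
    return [ch for ch, n in counts.most_common() if n >= min_count]

def coverage(counts, selected):
    """Fraction of all surveyed characters covered by the selected patterns."""
    total = sum(counts.values())
    return sum(counts[ch] for ch in selected) / total if total else 0.0

counts = survey_frequencies(["가나가", "가 다"])
frequent = select_patterns(counts, min_count=2)
```

The `coverage` helper makes the point of this section concrete: a small set of frequent patterns can account for most of the text, so fitting those patterns well matters most.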

B. Segmenting the regions of the appearing character patterns

Segmenting the regions of the appearing character patterns means extracting character images for each pattern in the following manner.

Once the survey of the patterns actually appearing in the documents is complete, the character pattern segmentation unit (230) extracts character images for each pattern from the data stored in the document structure/segmentation information DB (110) and the pattern appearance frequency survey (220). In this step, the character patterns for which representative models must be built are determined from the distribution information extracted from the documents, and the character images for each pattern are extracted using the structure and segmentation information DB of the document images. The extracted character images are stored in the character pattern image DB (240).

FIG. 6 shows an example of the character pattern information, held in the document image structure and segmentation information DB, that is used for segmentation. In addition to the coordinates of the character region and the code value of the character pattern, this information includes the page number of the document (Page Idx), the number of the paragraph containing the character (Para Idx), the line containing the character (Line Idx), and the sequence number of the character (Char Idx).
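The record described for FIG. 6 and its use in segmentation can be sketched in Python as follows. The field names, the rectangle convention (right and bottom exclusive), and the helper names are assumptions made for this illustration; the source lists only the kinds of information stored.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class CharRecord:
    """One character entry; field names are hypothetical stand-ins for FIG. 6."""
    code: str       # code value of the character pattern
    page_idx: int   # Page Idx
    para_idx: int   # Para Idx
    line_idx: int   # Line Idx
    char_idx: int   # Char Idx
    left: int       # character region; right/bottom are exclusive (assumption)
    top: int
    right: int
    bottom: int

def crop(page, rec):
    """Cut the character region out of a page given as rows of 0/1 pixels."""
    return [row[rec.left:rec.right] for row in page[rec.top:rec.bottom]]

def sample_records(records, code, max_count, seed=0):
    """Pick at most `max_count` records of one pattern at random."""
    matching = [r for r in records if r.code == code]
    if len(matching) <= max_count:
        return matching
    return random.Random(seed).sample(matching, max_count)

page = [[0, 0, 0, 0],
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
rec = CharRecord("가", 1, 1, 1, 1, left=1, top=1, right=3, bottom=3)
glyph = crop(page, rec)
```

Capping the number of images per pattern keeps the later feature extraction bounded even for very frequent characters.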

C. Extracting statistical feature information from the character patterns

Once the actually appearing patterns have been segmented from the document images, the statistical feature extraction unit (250) uses the data stored in the character pattern image DB (240) to extract statistical feature information from each segmented character pattern image. This step is performed in the same way as the preprocessing and feature extraction of the character recognition process. First, the preprocessing step normalizes the position, size, and shape of the image; then the feature extraction step quantifies the statistical information extracted from the normalized image. Accordingly, the representative model generation unit (270) may be implemented with means for normalizing the position, size, and shape of an image and means for quantifying the statistical information extracted from the normalized image.

Normalized images are obtained with nonlinear shape normalization techniques based on dot density, crossing count, line interval, and the like. FIG. 7A shows an example in which an image of the Chinese character '李' is normalized to 64*64 pixels and then divided into 64 (8*8) blocks, each covering an 8*8-pixel region. Statistical features that absorb the variation among character patterns include stroke count, contour-pixel count, contour direction, background area, and projection; FIG. 7B shows an example of extracting the contour directional feature (CDF). If, as in FIG. 8B, each contour pixel of a character pattern is assumed to have one of four directions (horizontal, vertical, diagonal, inverse diagonal), the contour directional feature that can be extracted from any character pattern is a 256-dimensional (8*8*4) vector.
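As a rough illustration of dot-density normalization, the Python sketch below resamples a binary character image so that rows and columns containing more black pixels receive proportionally more of the output. It is a simplified, projection-based variant written for this explanation; the smoothing constant `alpha` and both function names are assumptions, and the source does not specify the exact formulation it uses.

```python
def density_map(density, out_size):
    """For each output index, choose the source index whose share of the
    cumulative density covers that output position."""
    total = sum(density)
    mapping = []
    src = 0
    cum = density[0]
    for j in range(out_size):
        target = (j + 0.5) / out_size * total
        while cum < target and src < len(density) - 1:
            src += 1
            cum += density[src]
        mapping.append(src)
    return mapping

def normalize_dot_density(img, out_size=64, alpha=1.0):
    """Resample a 0/1 image (list of rows) to out_size x out_size.

    `alpha` is a constant added to every row and column so that empty
    rows and columns still receive some space (an assumption of this sketch).
    """
    h, w = len(img), len(img[0])
    col_density = [alpha + sum(img[y][x] for y in range(h)) for x in range(w)]
    row_density = [alpha + sum(img[y]) for y in range(h)]
    xs = density_map(col_density, out_size)
    ys = density_map(row_density, out_size)
    return [[img[y][x] for x in xs] for y in ys]

small = [[0, 1, 1, 0],
         [0, 1, 1, 0]]
normalized = normalize_dot_density(small, out_size=8)
```

In the toy image the two dense middle columns each expand to three output columns while the empty border columns keep one each, which is the equalizing effect the technique aims for.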

The statistical feature information extracted from the segmented character pattern images is stored in the character pattern feature information storage (260).
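The 256-dimensional contour directional feature described in this section can be sketched in Python as below. The rule used to decide which pixels are contour pixels and how each one is assigned a single direction is an assumption of this sketch (foreground pixels with a background 4-neighbour, voting by neighbouring contour pixels, ties broken in a fixed order); the source states only that each contour pixel takes one of four directions.

```python
# Neighbour offsets (dy, dx) examined for each of the four directions.
DIRECTIONS = (
    ((0, -1), (0, 1)),    # horizontal: left and right
    ((-1, 0), (1, 0)),    # vertical: up and down
    ((-1, 1), (1, -1)),   # diagonal: upper right and lower left
    ((-1, -1), (1, 1)),   # inverse diagonal: upper left and lower right
)

def is_contour(img, y, x):
    """A foreground pixel with a background or out-of-image 4-neighbour."""
    n = len(img)
    if not img[y][x]:
        return False
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if not (0 <= ny < n and 0 <= nx < n) or not img[ny][nx]:
            return True
    return False

def contour_direction_feature(img, block=8):
    """Count contour pixels per block and direction.

    Assumes a square 0/1 image whose side is a multiple of `block`
    (64 x 64 with 8 x 8 blocks gives 8 * 8 * 4 = 256 values).
    """
    n = len(img)
    blocks = n // block
    feat = [0] * (blocks * blocks * 4)
    for y in range(n):
        for x in range(n):
            if not is_contour(img, y, x):
                continue
            votes = [
                sum(1 for dy, dx in pair
                    if 0 <= y + dy < n and 0 <= x + dx < n
                    and is_contour(img, y + dy, x + dx))
                for pair in DIRECTIONS
            ]
            if max(votes) == 0:
                continue  # isolated contour pixel: no direction assigned
            d = votes.index(max(votes))  # ties resolved in DIRECTIONS order
            feat[((y // block) * blocks + x // block) * 4 + d] += 1
    return feat

# A one-pixel-thick horizontal stroke across row 10 of a 64 x 64 image.
img = [[0] * 64 for _ in range(64)]
for x in range(64):
    img[10][x] = 1
feat = contour_direction_feature(img)
```

Every pixel of the test stroke is counted in the horizontal bin of the block it falls in, so the feature vector records both where the contour lies and how it runs.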

D. Generating representative models of the character patterns

Once the statistical feature information has been extracted from the segmented character patterns, the representative model generation unit (270) generates a representative model of each character pattern from the feature information stored in the character pattern feature information storage (260). A representative model of a character pattern is easily obtained by taking the average or the weighted average of the feature values extracted from the training character images. If the contour directional feature is used as the statistical feature, the representative model of a character pattern can be rendered as a 64*64 gray-scale image, as in FIGS. 8A and 8B. This image is built as a 256-level gray image from the distribution of contour pixels extracted from 100 character images per pattern: a white pixel means that no pattern has a contour pixel at that position, and a black pixel means that a contour pixel is always present there. FIG. 8A shows representative models of printed character patterns, which have relatively little shape variation, so the models are regular and the contours appear very clearly. FIG. 8B, in contrast, is derived from character images extracted from manuscripts written in a very free style, so the representative models lack a distinct shape and the contours appear faint.
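The averaging step and the gray-scale rendering can be sketched in Python as follows. The function names and the exact mapping from contour frequency to gray level are assumptions for illustration; the source states only that the model is an average or weighted average of feature values and that white means "never a contour pixel" and black means "always a contour pixel".

```python
def representative_model(features, weights=None):
    """Average (or weighted average) of the feature vectors of one pattern."""
    if not features:
        raise ValueError("at least one feature vector is required")
    if weights is None:
        weights = [1.0] * len(features)
    total = sum(weights)
    dim = len(features[0])
    return [
        sum(w * f[i] for f, w in zip(features, weights)) / total
        for i in range(dim)
    ]

def contour_gray_image(contour_masks):
    """Render contour occupancy over many samples as gray levels.

    Each mask is a 0/1 image marking the contour pixels of one sample.
    255 (white) means no sample has a contour pixel at that position and
    0 (black) means every sample does.
    """
    count = len(contour_masks)
    h, w = len(contour_masks[0]), len(contour_masks[0][0])
    return [
        [255 - round(255 * sum(m[y][x] for m in contour_masks) / count)
         for x in range(w)]
        for y in range(h)
    ]

plain = representative_model([[0, 2], [2, 4]])
weighted = representative_model([[0, 0], [4, 8]], weights=[3, 1])
gray = contour_gray_image([[[1, 0]], [[1, 0]]])
```

A weighted average lets some samples count more than others, for example clean scans over degraded ones; how weights would be chosen is not stated in the source.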

<실시예 들><Examples>

도 9a 내지 9d는 본 발명의 실시예에 따른 적응학습 모듈의 인터페이스 구성을 보여주는 도면이다. 상기 도 9a 내지 9d는 실제 문서데이터에 적합한 문자패턴의 대표모델을 생성하는 과정을 보여준다. 9A to 9D illustrate an interface configuration of an adaptive learning module according to an embodiment of the present invention. 9A to 9D illustrate a process of generating a representative model of a character pattern suitable for actual document data.

상기 도 9a의 "AWD_Inspection"은 문서의 출현패턴 분포를 조사하는 단계로서, 문서의 텍스트 정보가 있는 디렉토리, 즉 상기 문서텍스트 정보DB(130)를 지정하는 영역과 문서의 문자패턴에 대한 분포를 저장할 파일, 즉 상기 패턴출현 빈도 조사서(220)을 지정하는 영역이 제공된다. "AWD_Inspection" of FIG. 9A is a step of examining the distribution of appearance patterns of a document, and stores a directory in which the text information of the document is located, that is, a region specifying the document text information DB 130 and a distribution of the character pattern of the document. An area for specifying a file, that is, the pattern appearance frequency survey 220, is provided.

상기 도 9b의 "Random Selection"은 문서영상으로부터 문자패턴의 영역을 분할하는 단계로서, 문서의 문자패턴에 대한 분포를 저장한 파일을 지정하는 영역과 문서의 구조/분할 정보가 저장된 디렉토리, 즉 문서구조/분할정보DB(110)를 지정하는 영역과 상기 문자패턴 영상DB(240)인 분할될 문자패턴의 영상저장 디렉토리를 지정하는 영역과 분할될 문자패턴 영상의 최대 개수가 표시되는 영역이 제공된다. "Random Selection" in FIG. 9B is a step of dividing an area of a character pattern from a document image. The "Random Selection" of FIG. An area for designating a structure / division information DB 110, an area for designating an image storage directory for the character pattern to be divided, which is the character pattern image DB 240, and an area for displaying the maximum number of character pattern images to be divided are provided. .

상기 도 9c의 "FeatureEx"는 문자패턴의 특징을 추출하는 단계로서, 상기 문자패턴 특징정보 저장부(260)인 문자패턴의 통계적 특징 저장 디렉토리를 지정하는 영역과 분할된 영상 저장 디렉토리를 지정하는 영역과 비선형 형태 정규화 기법들을 선택할 수 있는 영영과 통계적 특징 추출 방법을 선택할 수 있는 영역을 제공한다. "FeatureEx" of FIG. 9C is a step of extracting a feature of a character pattern. The character pattern feature information storage unit 260 is an area for specifying a statistical feature storage directory for a character pattern and an area for specifying a divided image storage directory. It provides a range of choices for both English and statistical feature extraction methods to choose from and nonlinear shape normalization techniques.

"Representative" in FIG. 9D is the step of generating the representative models of the character patterns. It provides a field for specifying the directory in which the statistical features of the character patterns are stored and a field for specifying the directory in which the representative model of each character pattern is to be stored, which is the character pattern representative model DB 280.
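The claims name an average or weighted average of the extracted feature values as one form of representative model. The following is a minimal sketch under that reading; the nearest-model matching by Euclidean distance is an assumption added for illustration, since this passage does not specify how the comparison is carried out, and all names are hypothetical.

```python
import math


def mean_vector(vectors, weights=None):
    """Average (or weighted average) of equally long feature vectors."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


def build_representative_models(features_by_char):
    """Map each character to the mean of its sample feature vectors."""
    return {ch: mean_vector(vectors) for ch, vectors in features_by_char.items()}


def recognize(feature, models):
    """Return the character whose model is closest to the input (assumed matching rule)."""
    return min(models, key=lambda ch: math.dist(feature, models[ch]))
```

For example, with `models = build_representative_models({"가": [[0, 0], [2, 2]], "나": [[10, 10]]})`, the call `recognize([1.5, 0.5], models)` returns `"가"`.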

The hierarchical input/output relationship of the four steps above, which generate representative models of actual character patterns from digitized document information, is shown in FIG. 10. FIG. 11A shows an example of the "CharFrequency.txt" file generated in the "AWD_Inspection" step of FIG. 9A from documents totaling about 740,000 characters, and FIG. 11B shows an example of 100 character images of the "가" pattern segmented at random in the "Random Selection" step of FIG. 9B.

Although specific embodiments have been described in the detailed description of the present invention, various modifications are of course possible without departing from the scope of the invention. The scope of the present invention should therefore not be limited to the described embodiments, but should be determined by the claims below and by their equivalents.

As described above, in an environment for efficiently digitizing large volumes of documents, the present invention equips the character recognition engine with an adaptive learning module that can adapt directly to actual data. This minimizes quantitative and qualitative problems and thereby has the advantage of maintaining optimal performance.

To demonstrate the effect of the present invention with concrete data, the recognition performance and the number of models were measured for a character recognition engine trained on general training data and for character recognition engines equipped with the periodic adaptive learning module. The source material for the quantitative evaluation is a book of about 7 million characters published in 1974; because it was printed with lead type, the quality of its document images is very low. The experimental data consists of document images of about 700,000 characters from this book scanned at 400 dpi. Before character recognition, basic image preprocessing techniques were applied to remove noise, skew, and similar defects.

FIG. 12 shows the experimental results for recognition performance and the change in the number of character models. Character recognition engine (a) contains character models for 2,350 classes, which can represent 99.999% of Korean documents; each character model was generated from 100 character images per class. Character recognition engine (b) performed the adaptive learning step on about 5% (about 35,000 characters) of the experimental data, and engines (c), (d), and (e) used about 20%, 40%, and 50% of the data, respectively, in the adaptive learning step.

Character recognition engine (a) contains character models intended to recognize almost all data, but it shows a low recognition rate because its models differ greatly from the actual document data in image quality and in character shape. In contrast, character recognition engines (b) through (e) performed adaptive learning on actual data, so their recognition performance and number of character models remain in an optimal state. In particular, engine (b) was trained on only about 5% of the actual data and therefore contains only 894 character models, yet it shows a much higher recognition rate than engine (a). This can be regarded as a notable effect of the present invention.

In an environment where large volumes of documents are digitized using the present invention, a small-scale character recognition engine combined with an adaptive learning step that uses a small amount of actual data is expected to be more efficient at maintaining consistent performance than a very complex large-scale character recognition engine.

Claims

A character recognition-based method for digitizing a large volume of documents, the method including a step of comparing an input character pattern with a previously stored character model, and comprising:

extracting the frequency of appearance of character patterns included in a digitization target document from document data containing document structure and text information;

segmenting an individual image for each character pattern using the document structure and segmentation information;

extracting statistical features of each character pattern image using the individual image segmentation information; and

generating, from time to time, a representative model for each character pattern from the digitization target document using the statistical features, and providing it for comparison with the input character pattern.

The method of claim 1, wherein the information for segmenting the individual image for each character pattern

includes at least one of: coordinate values of a character region of the digitization target document, a code value of the character pattern, a page number of the digitization target document, the number of the paragraph to which the character belongs, the line to which the character belongs, and a sequence number of the character.

The method of claim 1, wherein extracting the statistical features comprises:

normalizing the position, size, and shape information of the image; and

digitizing statistical information extracted from the normalized information.

The method of claim 1 or 3, wherein the statistical feature information

includes at least one of stroke count, outline pixels, outline direction, background area, and projection information.

The method of any one of claims 1 to 3, wherein the representative model

includes at least one of an average and a weighted average of the extracted feature values.

A document digitization system including a character recognition module that compares an input character pattern with a previously stored character model, the system comprising:

a character pattern distribution survey unit that examines the frequency of appearance of character patterns included in a digitization target document from data stored in a document text information DB;

a character pattern segmentation unit that segments an individual image for each character pattern using document structure and segmentation information;

a statistical feature information extraction unit that extracts statistical features of each character pattern image using the individual image segmentation information; and

a representative model generation unit that generates, from time to time, a representative model for each character pattern using the statistical features, for comparison with the input character pattern.