KR100655916B1

KR100655916B1 - Document image processing and verification system for digitalizing a large volume of data and method thereof

Info

Publication number: KR100655916B1
Application number: KR1020040055974A
Authority: KR
Inventors: 곽희규; 최운식
Original assignee: 한국과학기술원; 주식회사 유씨티코리아
Priority date: 2004-07-19
Filing date: 2004-07-19
Publication date: 2006-12-08
Also published as: KR20060007204A

Abstract

The present invention relates to a document image processing and verification system for digitizing a large volume of data. The system comprises: a document image structure analysis and character region segmentation module that analyzes a series of document images obtained during scanning and automatically batch-segments the character regions; a document image segmentation verification module that verifies and corrects the automatic segmentation result through an image-based interface; a character clustering module that applies intelligent character recognition to each character region of the document image to determine a text code, and gathers the character regions with the same code to generate a cluster image; a character-cluster verification and text conversion module that verifies the recognition result cluster by cluster on the per-code cluster images and assigns text codes in batch; a character-level code input module for individually entering the small number of character regions whose text code could not be determined by character recognition; a cluster-level verification and correction module that performs a first verification and correction, through a cluster-level interface, of each character region entered by the batch assignment or the individual input; a text-line-level source text verification and correction module that performs a second verification and correction of each such character region through a text-line-level interface; and a content tagging and supplementary information input module for entering tag information and supplementary information for the document through an interface that displays the character regions of the document image.

According to the present invention, high-quality digital text information can be obtained from a large volume of document data at very high speed. Accuracy and throughput are increased by enabling cluster-level character verification and batch text conversion, and error search time is reduced because the data processing and visual verification needed to extract text from document images are performed cluster by cluster.

Digitization of old documents, character recognition, clustering, verification and correction, text code, Unicode

Description

Document image processing and verification system for digitization of massive data and its method {DOCUMENT IMAGE PROCESSING AND VERIFICATION SYSTEM FOR DIGITALIZING A LARGE VOLUME OF DATA AND METHOD THEREOF}

FIG. 1 shows examples of document images obtained by scanning conventional manuscript, woodblock-print, and letterpress-print documents.

FIG. 2 is a flowchart of a prior-art digitization process based on manual work by a large workforce.

FIG. 3 is a flowchart of a prior-art digitization process based on character recognition.

FIG. 4 is a structural diagram of a document image processing and verification system for digitizing a large volume of data according to an embodiment of the present invention.

FIG. 5 shows an example of the analysis and segmentation result for a document image containing various noise and skew, in the document image processing and verification system according to an embodiment of the present invention.

FIG. 6 shows the hierarchical structure of a document image in the document image processing and verification system according to an embodiment of the present invention.

FIG. 7 shows an example of structure analysis and character region segmentation results for woodblock-print, manuscript, and letterpress-print images in the document image processing and verification system according to an embodiment of the present invention.

FIG. 8 shows an example of the toolbar used to verify and correct document image segmentation results in the document image processing and verification system according to an embodiment of the present invention.

FIG. 9A shows character recognition applied to a character region and acceptance based on recognition confidence, in the document image processing and verification system according to an embodiment of the present invention.

FIG. 9B shows character recognition applied to a character region and rejection based on recognition confidence, in the document image processing and verification system according to an embodiment of the present invention.

FIG. 10 shows examples of the character clustering module applied to various document data in the document image processing and verification system according to an embodiment of the present invention.

FIG. 11 shows an example of the cluster image generated on the basis of recognition confidence for the Chinese character class '之', together with the distribution of errors, in the document image processing and verification system according to an embodiment of the present invention.

FIG. 12 shows an example of the character cluster image verification and text conversion interface in the document image processing and verification system according to an embodiment of the present invention.

FIG. 13A shows an example of the initial screen of the manual input interface (one-to-one correspondence between character regions and text codes) in the document image processing and verification system according to an embodiment of the present invention.

FIG. 13B shows an example of the input mode of the manual input interface in the document image processing and verification system according to an embodiment of the present invention.

FIG. 14 shows an example of the cluster-level character verification and correction interface (a screen listing the character regions of the Chinese character class '中') in the document image processing and verification system according to an embodiment of the present invention.

FIG. 15 shows an example of the text-line-level verification and correction interface for document images of various data in the document image processing and verification system according to an embodiment of the present invention.

FIG. 16A shows an example of the document-image-based interface for assigning structural tag information to a document in the document image processing and verification system according to an embodiment of the present invention.

FIG. 16B shows word-spacing information extracted automatically through structure analysis of the document image in the document image processing and verification system according to an embodiment of the present invention.

FIG. 17A shows an example of the final text file, with XML tags and supplementary information added, in the document image processing and verification system according to an embodiment of the present invention.

FIG. 17B shows an example of the error file generated during conversion between Unicode and Shift-JIS in the document image processing and verification system according to an embodiment of the present invention.

FIG. 18 is a structural diagram of a document image processing and verification system for digitizing a large volume of data according to another embodiment of the present invention.

FIG. 19 is a flowchart of a document image processing and verification method for digitizing a large volume of data according to the present invention.

<Description of reference numerals for the main parts of the drawings>

410: document image structure analysis and character region segmentation module

420: document image segmentation verification module

430: character clustering module

440: character-cluster verification and text conversion module

450: character-level code input module

460: cluster-level verification and correction module

470: text-line-level source text verification and correction module

480: content tagging and supplementary information input module

The present invention relates to a document image processing and verification system for digitizing a large volume of data. More specifically, it relates to such a system that combines automation technology, including high-performance document image processing and intelligent character recognition, with an image-based verification and correction interface, so that high-quality digital text information can be obtained at very high speed from various document data such as manuscripts, woodblock prints, and letterpress prints.

FIG. 1 shows examples of document images obtained by scanning conventional manuscript, woodblock-print, and letterpress-print documents. FIG. 1(a) is a document image of a manuscript from the facsimile edition of the Seungjeongwon Ilgi (承政院日記) held by the National Institute of Korean History (국사편찬위원회); FIG. 1(b) is a document image of a woodblock print from the Munjip Chonggan (文集叢刊) held by the National Culture Promotion Committee (민족문화추진위); FIG. 1(c) is a document image of a manuscript of the Ilseongnok (日省錄) held by Kyujanggak at Seoul National University; and FIG. 1(d) is a document image of a letterpress print from the Complete Works of Yi Kwang-su (이광수 전집) published in 1972.

Document data such as historical sources in manuscript or woodblock-print form are generally digitized by hand by specialists, both because the documents have deteriorated over long periods of storage and because archaic Chinese characters pose particular difficulties. That is, digitization is carried out by visually checking and typing each character one at a time. Exceptionally, when a document is well preserved and easy to read, an automated approach using optical character recognition (OCR) from the field of pattern recognition can be considered, but its accuracy is lower than that of manual digitization. Conventional manual digitization therefore suffers from enormous labor cost, long processing time, and a limited processing scale.

FIG. 2 is a flowchart of a prior-art digitization process based on manual work by a large workforce.

As shown, the document to be digitized is turned into a document image by copying and scanning (S210); a large workforce then enters the data by hand, character by character (S220); a large workforce likewise verifies and corrects the text character by character (S230); and a documented DB is generated (S240).

To overcome the drawbacks of manual data entry, verification, and correction, automation technology for digitizing printed documents with a regular structure, such as letterpress prints, has been studied extensively in Korea and abroad, and commercial products have been released.

FIG. 3 is a flowchart of a prior-art digitization process using commercial character recognition. As shown, the document to be digitized is scanned into a document image (S310), the image is converted into text information by OCR (S320), and the converted information is verified and corrected (S330) to generate a documented DB (S340).

Such conventional automatic character recognition can be applied to mechanically produced letterpress prints. For documents with an irregular structure, such as manuscripts and woodblock prints, however, the various obstacles to automation remain unsolved and very little research and development has been done. Moreover, digitization by conventional character recognition runs its component technologies (for example, document preprocessing, document segmentation, character recognition, and post-processing) in sequence. Text is extracted quickly, but errors accumulate from one component to the next, so the extracted text is of very low quality and must then be fully verified and corrected by hand.

Thus, digitization by conventional character recognition extracts text very quickly but cannot guarantee its quality, and because repeated full verification is needed to improve quality, the verification and correction stage is very slow.

This is because, with conventional character recognition, errors in document layout analysis across varied documents, together with the quantitative and qualitative errors of the recognition engine, produce text information of very low quality. Producing high-quality information therefore requires manual verification and correction of the text, which unavoidably adds cost.

Accordingly, to digitize efficiently document data that has an irregular structure, such as manuscripts and woodblock prints, and that has been stored for a long time, two things are needed: the relatively poor performance of automatic recognition on such documents, compared with letterpress prints, must be addressed, and a fast and accurate method of verifying and correcting the automated results must be developed.

An object of the present invention is to provide a document image processing and verification system, and a corresponding method, for digitizing a large volume of data, which can obtain high-quality digital text information from a large volume of document data at very high speed, which increases accuracy and throughput by enabling cluster-level character verification and batch text conversion, and which reduces error search time by performing the data processing and visual verification needed to extract text from document images cluster by cluster.

To achieve the above technical object, the present invention provides a document image processing and verification system for digitizing a large volume of data, comprising: a document image structure analysis and character region segmentation module that analyzes a series of document images obtained during scanning, using a projection profile or connected component analysis to extract their hierarchical structure, and automatically batch-segments the character regions; a document image segmentation verification module that verifies and corrects the automatic segmentation result through an image-based interface; a character clustering module that applies intelligent character recognition to each character region of the document image verified by the document image segmentation verification module to determine a text code, and gathers the character regions with the same code to generate a cluster image; a character-cluster verification and text conversion module that verifies the recognition result cluster by cluster on the per-code cluster images and assigns text codes in batch; a character-level code input module for individually entering the small number of character regions whose text code could not be determined by character recognition; a cluster-level verification and correction module that performs a first verification and correction, through a cluster-level interface, of each character region of the document image entered by the batch text code assignment or the individual input; a text-line-level source text verification and correction module that performs a second verification and correction of each such character region through a text-line-level interface of the document; and a content tagging and supplementary information input module for entering tag information and supplementary information for the document through an interface that displays the character regions of the document image.
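As an illustration of the projection-profile option named above, the following minimal Python sketch splits a binarized page into text lines and then into character regions. It is not the patented implementation: the function names, the 0/1 pixel convention, the `min_ink` parameter, and the assumption that text lines run vertically (as in the classical documents of FIG. 1) are all introduced here for illustration only.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed convention: 1 = ink pixel, 0 = background


def projection_profile(image: Image, axis: str) -> List[int]:
    """Count ink pixels per row (axis='rows') or per column (axis='cols')."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs(profile: List[int], min_ink: int = 1) -> List[Tuple[int, int]]:
    """Return half-open (start, end) index ranges where profile >= min_ink."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = i
        elif value < min_ink and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_characters(image: Image) -> List[Tuple[int, int, int, int]]:
    """Split the page into vertical text lines, then each line into characters.

    Returns half-open boxes (top, bottom, left, right), text lines in
    left-to-right order and characters top-to-bottom within each line.
    """
    boxes = []
    for left, right in runs(projection_profile(image, "cols")):
        strip = [row[left:right] for row in image]
        for top, bottom in runs(projection_profile(strip, "rows")):
            boxes.append((top, bottom, left, right))
    return boxes
```

A real page would need noise and skew correction first, since a tilted line smears the profile and merges neighboring runs; connected component analysis is the alternative the text names for layouts where blank gaps between characters are unreliable.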

The document image processing and verification system for digitizing a large volume of data according to the present invention preferably further comprises: a document image database that stores the document images obtained by scanning; a character cluster image database that stores the character cluster images built from the character regions by character recognition; a document image structure and segmentation database that stores the data produced by structure analysis and automatic segmentation of the document images; and a text database that stores the final text information obtained.

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the document image structure analysis and character region segmentation module preferably detects and corrects automatically the various noise and skew introduced during document image acquisition.

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the document image segmentation verification module preferably includes an interface with functions for creating and deleting text lines, and for creating, deleting, resizing, merging, horizontally splitting, and vertically splitting character regions.
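The merge and split operations listed above can be pictured as simple operations on bounding boxes. The sketch below is an assumption-laden illustration, not the interface described in the patent: the `Box` type, the half-open coordinate convention, and the reading of a "horizontal split" as a cut along a horizontal line are all hypothetical.

```python
from typing import NamedTuple, Tuple


class Box(NamedTuple):
    """Hypothetical character region: half-open pixel bounds."""
    top: int
    bottom: int
    left: int
    right: int


def merge(a: Box, b: Box) -> Box:
    """Smallest box that covers both regions (e.g. a character cut in two)."""
    return Box(min(a.top, b.top), max(a.bottom, b.bottom),
               min(a.left, b.left), max(a.right, b.right))


def split_horizontal(box: Box, y: int) -> Tuple[Box, Box]:
    """Cut along the horizontal line at row y, giving an upper and a lower box."""
    if not box.top < y < box.bottom:
        raise ValueError("cut row must lie strictly inside the box")
    return (Box(box.top, y, box.left, box.right),
            Box(y, box.bottom, box.left, box.right))


def split_vertical(box: Box, x: int) -> Tuple[Box, Box]:
    """Cut along the vertical line at column x, giving a left and a right box."""
    if not box.left < x < box.right:
        raise ValueError("cut column must lie strictly inside the box")
    return (Box(box.top, box.bottom, box.left, x),
            Box(box.top, box.bottom, x, box.right))
```

An operator would typically merge two boxes when automatic segmentation has cut one character apart, and split a box when two touching characters were segmented as one.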

(Deleted)

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the character clustering module preferably determines the text code of a character region as that of the character class at minimum distance, using the Euclidean distance or the Mahalanobis distance.
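The minimum-distance rule can be sketched as follows. This is only an illustration under stated assumptions: the feature vectors, the per-class means and variances, and the function names are hypothetical, and the Mahalanobis distance is simplified to a diagonal covariance (per-feature variances) so that no matrix inversion is needed.

```python
import math
from typing import Dict, Sequence, Tuple

# Hypothetical class model: text code -> (mean vector, per-feature variances)
ClassModel = Dict[str, Tuple[Sequence[float], Sequence[float]]]


def euclidean(x: Sequence[float], mean: Sequence[float]) -> float:
    """Plain Euclidean distance between a feature vector and a class mean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean)))


def mahalanobis_diag(x: Sequence[float], mean: Sequence[float],
                     var: Sequence[float]) -> float:
    """Mahalanobis distance assuming a diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, mean, var)))


def classify(x: Sequence[float], classes: ClassModel) -> Tuple[str, float]:
    """Return (text code, distance) of the class at minimum distance from x."""
    scored = ((code, mahalanobis_diag(x, mean, var))
              for code, (mean, var) in classes.items())
    return min(scored, key=lambda pair: pair[1])
```

The two metrics can disagree: a class with a large variance along some feature is "closer" under the Mahalanobis distance than its Euclidean distance suggests, which is why the choice of metric matters for handwritten or worn characters whose shapes vary widely.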

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the character clustering module preferably uses the distance value obtained with the Euclidean distance or the Mahalanobis distance as a recognition confidence, and on that basis decides whether each character region is accepted or rejected for clustering.
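A minimal sketch of the accept/reject decision follows, assuming that a smaller distance means higher confidence and that a single threshold separates the two outcomes. The threshold value, the tuple layout, and the function name are hypothetical; the source does not specify how the cutoff is chosen.

```python
from typing import List, Tuple

# Hypothetical recognition result: (region id, recognized text code, distance)
Result = Tuple[str, str, float]


def split_by_confidence(results: List[Result],
                        threshold: float) -> Tuple[List[Result], List[Result]]:
    """Accept results whose distance is at most the threshold; reject the rest."""
    accepted = [r for r in results if r[2] <= threshold]
    rejected = [r for r in results if r[2] > threshold]
    return accepted, rejected
```

Accepted regions would go on to cluster-level verification, while rejected regions would be routed to individual character entry.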

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the character-cluster verification and text conversion module preferably assigns text codes in batch to the character regions that were accepted for clustering on the basis of the recognition confidence.

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the character-level code input module preferably assigns text codes individually to the character regions that were rejected for clustering on the basis of the recognition confidence.
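The division of labor between batch assignment and individual entry might look like the sketch below. Everything here is illustrative: the data layout, the idea of ordering each cluster by ascending distance, and the `flagged` set standing in for regions an operator marks as wrong while inspecting a cluster image are assumptions, not details taken from the patent.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical recognition result: (region id, recognized text code, distance)
Result = Tuple[str, str, float]


def build_clusters(accepted: Iterable[Result]) -> Dict[str, List[str]]:
    """Group accepted region ids by text code, most confident regions first."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for region_id, code, _distance in sorted(accepted, key=lambda r: r[2]):
        clusters[code].append(region_id)
    return dict(clusters)


def assign_batch(clusters: Dict[str, List[str]],
                 flagged: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Give every unflagged region its cluster's code in one step.

    Regions in `flagged` are sent to a queue for individual entry instead.
    Returns (region id -> text code, manual entry queue).
    """
    text_codes: Dict[str, str] = {}
    manual_queue: List[str] = []
    for code, region_ids in clusters.items():
        for region_id in region_ids:
            if region_id in flagged:
                manual_queue.append(region_id)
            else:
                text_codes[region_id] = code
    return text_codes, manual_queue
```

The point of the design is that the operator confirms one cluster at a time rather than one character at a time, so the manual effort scales with the number of misrecognized regions rather than with the size of the document.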

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the character-level code input module preferably uses a high-speed Chinese character input method, a radical/stroke-count-based input method, or a pronunciation-based input method.

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the cluster-level verification and correction module preferably displays each character region of the document image entered by the batch text code assignment or the individual input in a different color according to its recognition confidence.
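One way to picture the color coding is a simple mapping from distance to a display color. The three color bands, the `strong_ratio` parameter, and the function name are purely hypothetical; the source states only that regions are shown in different colors according to recognition confidence.

```python
def confidence_color(distance: float, threshold: float,
                     strong_ratio: float = 0.5) -> str:
    """Map a recognition distance to a display color (smaller = more confident)."""
    if distance <= strong_ratio * threshold:
        return "green"   # well inside the acceptance threshold
    if distance <= threshold:
        return "yellow"  # accepted, but close to the threshold
    return "red"         # beyond the threshold, e.g. entered individually
```

With such a scheme the operator's eye is drawn first to the low-confidence regions, where errors are most likely to be found.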

In the document image processing and verification system for digitizing a large volume of data according to the present invention, the content tagging and supplementary information input module preferably converts text information encoded in Unicode 4.0 into Shift-JIS text and, during the conversion, generates an error log file recording the position in the document and the Unicode value of every text code that has no Shift-JIS equivalent.
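The conversion with an error log can be sketched with Python's built-in `shift_jis` codec. This is an illustration rather than the patented procedure: recording the position as a (line, column) pair, substituting a `?` placeholder, and the function name are assumptions made here, and Python's codec tables may differ in detail from whatever converter the original system used.

```python
from typing import List, Tuple

# Hypothetical log entry: (line number, column number, Unicode code point)
LogEntry = Tuple[int, int, str]


def to_shift_jis(text: str, placeholder: str = "?") -> Tuple[bytes, List[LogEntry]]:
    """Encode text as Shift-JIS, logging every character that cannot be encoded.

    Unencodable characters are replaced by `placeholder`; their 1-based
    line and column and their code point (e.g. 'U+D55C') are logged.
    """
    out = bytearray()
    errors: List[LogEntry] = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if line_no > 1:
            out += b"\n"
        for col, ch in enumerate(line, start=1):
            try:
                out += ch.encode("shift_jis")
            except UnicodeEncodeError:
                out += placeholder.encode("shift_jis")
                errors.append((line_no, col, f"U+{ord(ch):04X}"))
    return bytes(out), errors
```

For example, a Hangul syllable has no Shift-JIS representation, so it would appear in the log with its position and code point while the surrounding Chinese characters convert normally. The log lets an editor revisit exactly those positions instead of re-reading the whole converted file.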

The present invention also provides a document image processing and verification method for digitizing a large volume of data, comprising the steps of: analyzing a series of document images obtained during scanning, using a projection profile or connected component analysis to extract their hierarchical structure, and automatically batch-segmenting the character regions; verifying and correcting the automatic segmentation result through an image-based interface; applying intelligent character recognition to each character region of the document image verified and corrected through the image-based interface to determine a text code, and gathering the character regions with the same code to generate a cluster image; verifying the recognition result cluster by cluster on the per-code cluster images and assigning text codes in batch; individually entering the small number of character regions whose text code could not be determined by character recognition; performing a first verification and correction, through a cluster-level interface, of each character region of the document image entered by the batch text code assignment or the individual input; performing a second verification and correction of each such character region through a text-line-level interface of the document; and entering tag information and supplementary information for the document through an interface that displays the character regions of the document image.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of analyzing the series of document images obtained during scanning and automatically batch-segmenting the character regions preferably includes a step of automatically detecting and correcting the various noise and skew introduced during document image acquisition.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of applying intelligent character recognition to each character region of the document image to determine a text code and gathering the character regions with the same code to generate a cluster image preferably includes a step of determining the text code of a character region as that of the character class at minimum distance, using the Euclidean distance or the Mahalanobis distance.

(Deleted)

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of applying intelligent character recognition to each character region of the document image to determine a text code and gathering the character regions with the same code to generate a cluster image preferably includes a step of using the distance value obtained with the Euclidean distance or the Mahalanobis distance as a recognition confidence, and on that basis deciding whether each character region is accepted or rejected for clustering.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of verifying the recognition result cluster by cluster on the per-code cluster images and assigning text codes in batch preferably includes a step of assigning text codes in batch to the character regions that were accepted for clustering on the basis of the recognition confidence.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of individually entering the small set of character regions whose text codes could not be determined by character recognition preferably includes a step of individually assigning text codes to the character regions that were rejected for clustering on the basis of the recognition reliability.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of primarily verifying and correcting, through a cluster-unit interface, each character region of the document image entered via the batch text code assignment or the individual input preferably includes a step of displaying those character regions in different colors according to their recognition reliability.

In the document image processing and verification method for digitizing a large volume of data according to the present invention, the step of entering tag information and additional information of the document through an interface that displays the character regions of the document image converts text information supported by Unicode 4.0 into Shift-JIS text, and preferably includes a step of generating an error log file that records, for each text code that has no match during conversion, its position in the document and its Unicode value.
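
The conversion and error log described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the use of a character index as the "position in the document", the `?` placeholder, and the tab-separated log format are all assumptions. It relies only on Python's built-in `shift_jis` codec.

```python
def convert_to_shift_jis(text):
    """Encode text as Shift-JIS and collect characters that have no mapping.

    Returns (encoded_bytes, errors), where errors is a list of
    (character index, "U+XXXX") pairs. Both names are hypothetical.
    """
    encoded = bytearray()
    errors = []
    for position, ch in enumerate(text):
        try:
            encoded += ch.encode("shift_jis")
        except UnicodeEncodeError:
            # No Shift-JIS mapping: record where it was and which code point.
            errors.append((position, f"U+{ord(ch):04X}"))
            encoded += b"?"  # assumed placeholder byte
    return bytes(encoded), errors


def format_error_log(errors):
    """Render the collected errors as tab-separated log lines (assumed format)."""
    return "\n".join(f"{position}\t{code_point}" for position, code_point in errors)
```

Encoding one character at a time keeps the reported position tied to the source text rather than to the byte offset in the output.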

The present invention also provides a computer-readable recording medium storing a program for implementing: a function of automatically segmenting character regions in batch by analyzing a series of document images obtained during scanning, using projection profiles or connected component analysis of the document images to extract their hierarchical structure; a function of verifying and correcting the automatic segmentation result of the document image through an image-based interface; a function of determining text codes by applying intelligent character recognition to each character region of the document image verified and corrected through the image-based interface, and generating cluster images by gathering character regions with the same code; a function of verifying the character recognition results cluster by cluster on the per-code cluster images and assigning text codes in batch; a function of individually entering the small set of character regions whose text codes could not be determined by character recognition; a function of primarily verifying and correcting, through a cluster-unit interface, each character region of the document image entered via the batch text code assignment or the individual input; a function of secondarily verifying and correcting those character regions through a text-line-unit interface of the document; and a function of entering tag information and additional information of the document through an interface that displays the character regions of the document image.

Hereinafter, the document image processing and verification system for digitizing a large volume of data according to the present invention, and the method thereof, are described in more detail with reference to the drawings.

FIG. 4 is a structural diagram of a document image processing and verification system for digitizing a large volume of data according to an embodiment of the present invention.

As shown, the document image processing and verification system (400) for digitizing a large volume of data according to an embodiment of the present invention includes a document image structure analysis and character region segmentation module (410), a document image segmentation verification module (420), a character clustering module (430), a character-cluster-unit verification and text conversion module (440), a character-unit code input module (450), a cluster-unit verification and correction module (460), a text-line-unit source text verification and correction module (470), and a content tagging and additional information input module (480).

The document image structure analysis and character region segmentation module (410) analyzes the series of document images obtained during scanning and automatically segments the character regions in batch.

Document images acquired with a scanner contain various noise and skew introduced during acquisition, so a preprocessing step that improves image quality is needed before image analysis. The document image structure analysis and character region segmentation module (410) therefore includes a function that automatically detects and corrects the skew of an image before structure analysis, and it first checks whether each connected component of pixels in the document image is noise.
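
The noise check on connected components can be illustrated with a small sketch. The patent does not state the criterion used to decide that a component is noise; the pixel-count threshold `min_pixels`, the 8-connectivity, and the function names below are assumptions chosen for illustration. The image is a binary grid (list of rows) with 1 for ink.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected components of ink pixels as lists of (row, col)."""
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    components = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from an unvisited ink pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            pixels = []
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def remove_small_components(image, min_pixels):
    """Erase components smaller than min_pixels (assumed noise criterion)."""
    cleaned = [row[:] for row in image]
    for pixels in connected_components(image):
        if len(pixels) < min_pixels:
            for y, x in pixels:
                cleaned[y][x] = 0
    return cleaned
```

A real system would likely also consider component shape and position; size alone is used here only to keep the sketch short.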

FIG. 5 shows an example of the analysis and segmentation result for a document image containing various noise and skew, in a document image processing and verification system according to an embodiment of the present invention.

As shown in FIG. 5(a), a scanned image that contains various noise and is skewed is received as input, analyzed, and segmented character by character into an image such as that of FIG. 5(b).

In general, a document is a set of paragraphs, a paragraph is a set of text lines, and a text line is a set of characters, so a document can be said to have a hierarchical structure of paragraphs, text lines, and characters. The document image structure analysis and character region segmentation module (410) therefore analyzes a document image such as that of FIG. 6(a) into a hierarchical structure as shown in FIG. 6(b), and extracts the hierarchy by dividing the image into individual paragraphs, text lines, and characters as shown in FIG. 6(c). The character regions of the document image are assigned a logical order that follows this hierarchy.
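
One way to picture the paragraph / text line / character hierarchy and the logical order derived from it is the sketch below. The class and field names are hypothetical, and the traversal order (paragraphs, then lines, then characters, each in stored order) is an assumption about what "logical order" means here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class CharRegion:
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class TextLine:
    chars: List[CharRegion] = field(default_factory=list)


@dataclass
class Paragraph:
    lines: List[TextLine] = field(default_factory=list)


@dataclass
class Document:
    paragraphs: List[Paragraph] = field(default_factory=list)


def logical_order(doc: Document) -> Iterator[Tuple[int, Tuple[int, int, int], CharRegion]]:
    """Yield (sequence number, (paragraph, line, char) indices, region)."""
    sequence = 0
    for p, paragraph in enumerate(doc.paragraphs):
        for l, line in enumerate(paragraph.lines):
            for c, region in enumerate(line.chars):
                yield sequence, (p, l, c), region
                sequence += 1
```

Storing each level in reading order means a vertical, right-to-left page and a horizontal page can share the same traversal; only the segmentation step needs to know the layout direction.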

The document image processing and verification system according to an embodiment of the present invention may use, for example, the projection profile of the document image and connected component analysis, among image processing and analysis techniques.
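
As a rough illustration of projection-profile segmentation, the sketch below sums ink pixels along rows or columns of a binary image and cuts wherever the profile drops below a threshold. The function names and the `min_ink` parameter are assumptions; the patent names the technique but not its details.

```python
def projection_profile(image, axis):
    """Sum ink pixels per row (axis="rows") or per column (axis="cols")."""
    if axis == "rows":
        return [sum(row) for row in image]
    return [sum(column) for column in zip(*image)]


def split_runs(profile, min_ink=1):
    """Return half-open (start, end) index ranges where profile >= min_ink."""
    runs = []
    start = None
    for index, value in enumerate(profile):
        if value >= min_ink and start is None:
            start = index
        elif value < min_ink and start is not None:
            runs.append((start, index))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs
```

For a horizontally written page, applying `split_runs` to the row profile gives candidate text lines, and applying it to the column profile of each line gives candidate characters; for vertical text the two axes swap roles.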

Document image segmentation according to an embodiment of the present invention may be performed in batch mode once, for example, the operator specifies a folder of document images to process.

FIG. 7 shows examples of structure analysis and character region segmentation results for woodblock-printed, manuscript, and type-printed images in a document image processing and verification system according to an embodiment of the present invention.

FIG. 7(a) shows the result of structure analysis and character region segmentation for a vertically written woodblock-printed document image, FIG. 7(b) for a vertically written manuscript document image, FIG. 7(c) for a horizontally written type-printed document image, and FIG. 7(d) for a vertically written type-printed document image.

The document image segmentation verification module (420) verifies and corrects the automatic segmentation result of the document image through an image-based interface.

Based on an interface that displays the automatic segmentation result of the document image, the document image segmentation verification module (420) provides functions for verifying and correcting the text regions, text lines, character regions, and logical order information of the document extracted during segmentation.

FIG. 8 shows an example of a toolbar used to verify and correct document image segmentation results in a document image processing and verification system according to an embodiment of the present invention. As shown, the toolbar provides an interface for a variety of correction operations, including creating a character region (810), deleting a character region (820), merging character regions (830), splitting a character region vertically (840), splitting a character region horizontally (850), resizing a character region (860), re-segmenting the document image (870), viewing the whole image (875), showing text line creation (880), selecting a block (885), and correcting the order (890).

The character clustering module (430) determines a text code by applying intelligent character recognition to each character region of the document image, and generates cluster images by gathering character regions with the same code.

The character clustering module (430) includes a function that determines a text code by applying character recognition to the character regions making up the document image, a function that extracts a probabilistic recognition reliability for the determined text code, and a cluster image function that gathers character regions having the same code into a single cluster according to the recognition reliability.

Character recognition is the problem of analyzing an input character region, extracting its features, and classifying it as one of the character models learned in advance. Character regions are classified by a distance metric between the feature values of the character region and the representative values of the character models; the present invention uses the Euclidean distance or the Mahalanobis distance and assigns the character region the code of the character class at the minimum distance. The character model built in advance during training acts as the reference that maps a recognized input to a single concept, so recognition performance depends on the properties of the character model.
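
The minimum-distance rule can be sketched as a nearest-class-mean classifier. Everything concrete below is an assumption made for illustration: the feature vectors, the `class_means` dictionary, and in particular `mahalanobis_diag`, which simplifies the Mahalanobis distance to a diagonal covariance (per-feature variances) so that the example needs no matrix inversion.

```python
import math


def euclidean(x, mean):
    """Euclidean distance between a feature vector and a class mean."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, mean)))


def mahalanobis_diag(x, mean, variances):
    """Mahalanobis distance under an assumed diagonal covariance matrix."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, mean, variances)))


def classify(features, class_means, distance=euclidean):
    """Return (text code, distance) of the class mean closest to features."""
    best_code, best_distance = None, math.inf
    for code, mean in class_means.items():
        d = distance(features, mean)
        if d < best_distance:
            best_code, best_distance = code, d
    return best_code, best_distance
```

For example, with `class_means = {"承": [0.0, 0.0], "改": [3.0, 4.0]}`, the vector `[3.0, 3.0]` is assigned to "改" because its distance to that mean is 1.0, the smaller of the two.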

Generating character models through such adaptive learning is described in detail in, for example, Korean Patent Application No. 10-2004-7621, entitled "Method and System for Digitizing Large-Volume Documents Based on Character Recognition with an Adaptive Learning Module", filed by the present applicant on February 25, 2004.

Character recognition is defined as mapping an arbitrary given character region to the appropriate character class, and a recognition reliability is extracted for it in order to decide whether to accept or reject the recognition result. With the Euclidean distance metric, the text code of the character region is set to the character class giving the minimum distance, and that distance value is used as the recognition reliability.

For recognition with the Mahalanobis distance metric, linear discriminant analysis (LDA) was applied as a statistical method that uses a discriminant rule. The posterior probability of every class, that is, the probability of each character class given the features of the input character region, is computed; the character region is assigned the code of the class label with the highest probability, and that posterior probability is used as the recognition reliability.
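
The patent gives no formulas for this step. As a reference point only, the textbook form of LDA with Gaussian class models and a shared covariance matrix relates the posterior to the Mahalanobis distance as follows; the symbols $x$ (feature vector), $\mu_c$ (class mean), $\Sigma$ (shared covariance), and $\pi_c$ (class prior) are notation introduced here, not taken from the source.

```latex
d_c^2(x) = (x - \mu_c)^{\top} \Sigma^{-1} (x - \mu_c)

P(c \mid x) = \frac{\pi_c \exp\!\left(-\tfrac{1}{2} d_c^2(x)\right)}
                   {\sum_{k} \pi_k \exp\!\left(-\tfrac{1}{2} d_k^2(x)\right)}

\hat{c} = \arg\max_{c} P(c \mid x), \qquad
\text{recognition reliability} = P(\hat{c} \mid x)
```

Under equal priors, the class with the largest posterior is the one with the smallest Mahalanobis distance, which is consistent with the minimum-distance rule described for classification.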

FIGS. 9a and 9b show an example of an interface, in a document image processing and verification system according to an embodiment of the present invention, for the character recognition result and extracted recognition reliability of a character region and for acceptance and rejection against a given threshold.

FIG. 9a is a screen in which the text code determined for the input character region is '承' and the recognition reliability, 0.9999999934809, exceeds the given threshold of 0.99, so the region is accepted for clustering and batch verification.

In contrast, FIG. 9b is a screen in which the text code determined for the input character region is '改' and the recognition reliability, 0.5318746685161, falls short of the given threshold of 0.99, so recognition and clustering are rejected. In this case the text code is later entered individually by hand.
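
The accept/reject decision in FIGS. 9a and 9b amounts to comparing the recognition reliability with a threshold. A minimal sketch, assuming each recognition result is a `(region_id, text_code, reliability)` tuple (a hypothetical layout) and that "exceeds" means strictly greater than:

```python
def route_by_reliability(results, threshold=0.99):
    """Split recognition results into accepted and rejected lists.

    results: iterable of (region_id, text_code, reliability) tuples.
    Accepted regions go on to clustering and batch verification;
    rejected regions are left for individual manual input.
    """
    accepted, rejected = [], []
    for region_id, text_code, reliability in results:
        if reliability > threshold:
            accepted.append((region_id, text_code, reliability))
        else:
            rejected.append((region_id, text_code, reliability))
    return accepted, rejected
```

With the two values from the figures, a region recognized as '承' at 0.9999999934809 is accepted and a region recognized as '改' at 0.5318746685161 is rejected.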

FIG. 10 shows examples of the character clustering module (430) applied to various document data in a document image processing and verification system according to an embodiment of the present invention. FIG. 10(a) shows the result of applying the character clustering module (430) to a vertically written woodblock-printed document image, FIG. 10(b) to a vertically written manuscript document image, FIG. 10(c) to a horizontally written type-printed document image, and FIG. 10(d) to a vertically written type-printed document image.

After determining the text code of each character region and extracting its recognition reliability, the character clustering module (430) accepts only the character regions that exceed the given threshold and generates cluster images for batch verification. The cluster image for each character consists of the character regions classified under the same text code, arranged in order of recognition reliability. This enables selective attention during interface-based verification, which reduces the time cost of error checking and increases throughput.
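
Grouping accepted regions by text code and ordering each group by recognition reliability could look like the sketch below. The tuple layout, the descending sort direction, and the `paginate` helper with its `per_page` parameter are assumptions for illustration; the actual system renders the groups as cluster images.

```python
from collections import defaultdict


def build_clusters(accepted):
    """Group (region_id, text_code, reliability) tuples by text code.

    Returns {text_code: [(region_id, reliability), ...]} with each list
    sorted from highest to lowest reliability.
    """
    clusters = defaultdict(list)
    for region_id, text_code, reliability in accepted:
        clusters[text_code].append((region_id, reliability))
    return {
        code: sorted(members, key=lambda member: member[1], reverse=True)
        for code, members in clusters.items()
    }


def paginate(members, per_page):
    """Split one cluster into fixed-size pages, one page per cluster image."""
    return [members[i:i + per_page] for i in range(0, len(members), per_page)]
```

Because each cluster is sorted before it is split into pages, the last pages hold the least reliable regions, which is where a reviewer would expect most recognition errors.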

FIG. 11 shows an example of the cluster images generated by recognition reliability for the Chinese character class '之', together with the error distribution, in a document image processing and verification system according to an embodiment of the present invention.

As shown, FIG. 11(a) plots the number of errors in each cluster image, that is, in each of the 100 pages of cluster images (cluster image 1 to cluster image 100) for the Chinese character class '之', and FIG. 11(b) shows an example of the cluster images. Because the character regions in these cluster images are arranged in descending order of recognition reliability, more errors were found in cluster images with relatively low recognition reliability (for example, cluster image 100) than in cluster images with relatively high recognition reliability (for example, cluster image 1). The operator can therefore concentrate selectively on the low-reliability portion during batch verification of the recognition results, which saves time and boosts throughput.

The character-cluster-unit verification and text conversion module (440) verifies the character recognition results cluster by cluster on the per-code cluster images and assigns text codes in batch.

The character-cluster-unit verification and text conversion module (440) takes the cluster images made up of character regions that exceeded the recognition reliability threshold given during character clustering, checks them for errors made by automatic character recognition, and assigns text codes in batch, one cluster at a time.

As shown in FIG. 11(b), the cluster image for each character consists of the character regions that character recognition classified as having the same text code, arranged in order of recognition reliability. Because every region in a cluster image is supposed to have the same code value, visual verification is very convenient and the time spent searching for errors is reduced.

FIG. 12 shows an example of an interface for verifying such character cluster images and converting them to text. As shown, the interface includes the character class (1210), the number of character regions clustered by character recognition (1220), the cluster image (1230), the original document image (1240), the text code (1250), the path information of the document image (1260), and information about the cluster images of the given character class (1270; for example, the total number of pages, the number of clustered character regions, the number of characters per page, and the current page). When the operator finds a recognition error through this interface, the operator marks the character region with an 'X' so that it is excluded from batch text conversion.

The character-unit code input module (450) is used to individually enter the small set of character regions whose text codes could not be determined by character recognition.

With the character-unit code input module (450), text codes are entered directly for the small set of character regions that were rejected during recognition-based clustering because of low recognition reliability, or that were found to be recognition errors during cluster image verification.

FIG. 13a shows an example of the initial screen of the manual input interface (one-to-one correspondence between character regions and text codes) in a document image processing and verification system according to an embodiment of the present invention, and FIG. 13b shows an example of the input mode of the manual input interface in the same system.

As shown in FIG. 13a, the manual input interface includes a part (1310) that lists the character regions of the document image and their text code fields one-to-one (1:1), and a part (1320) that shows the corresponding character region in the original image for comparison. In the part (1310) that lists character regions and text code fields one-to-one, entries are divided into those that already have a text code assigned and those whose text code field is empty; an empty text code field marks a region that must be entered directly through character-unit code input.

Accordingly, as in FIG. 13b, only the character regions whose text code field is empty are gathered for input. The input methods supported by the manual input interface according to an embodiment of the present invention include a high-speed Chinese character input method, a radical/stroke-count-based input method, and a pronunciation-based input method. In particular, the high-speed Chinese character input method allows a skilled operator to process about 15,000 characters per day (based on 8 hours). It works by analyzing the shapes of Chinese characters into 248 root elements, grouping these roots by similarity of shape and meaning into 40 representative roots, and then encoding this set of representative roots and assigning it to 40 keys on the keyboard.

The character-unit code input module (450) according to an embodiment of the present invention supports Unicode 4.0 by default and allows input of Korean, Chinese characters, Japanese, and English through its own IME.

The cluster-unit verification and correction module (460) primarily verifies and corrects, through a cluster-unit interface, each character region of the document image entered via the batch text code assignment or the individual input.

The cluster-unit verification and correction module (460) provides an interface for the primary verification and correction of the text codes determined for the document image, after the data filtered by recognition-based clustering has been verified and converted to text in batch and the individual character-unit text codes have been entered. Its user interface shows character regions and text codes in one-to-one (1:1) correspondence and is designed so that character regions with the same code value can be verified together.

FIG. 14 shows an example of the character-cluster-unit verification and correction interface (a screen listing the character regions of the Chinese character class '中') in a document image processing and verification system according to an embodiment of the present invention. As shown, it displays information such as the number of the document image being processed (1410), the Chinese character class (1420), the corresponding character region in the document image (1430), the character region (1440), and the corresponding text code (1450). By displaying the original document image around each character region, it also lets the operator verify a character from its context.

In the character-cluster-unit verification and correction interface, the character regions are laid out in part A of FIG. 14, whose codes were determined by manual input, and parts B and C of FIG. 14, whose codes were determined by character recognition and verification. Parts B and C are ordered by the recognition reliability from the character recognition process; for example, the character regions corresponding to 5% and 95% of the automatically processed total are displayed separately.

Parts A, B, and C in FIG. 14 may be displayed in different colors so that they can be told apart. This lets the verifying operator see how the code of each character region was decided, whether by manual input or by recognition followed by verification, which reduces error search time and improves accuracy and throughput.
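
The A/B/C grouping can be sketched as below. The source does not say which of B and C holds the 5% share, so this sketch assumes B is the lowest-reliability 5% of automatically recognized regions and C is the remaining 95%. The dictionary keys (`id`, `source`, `reliability`) and the parameter `low_percent` are likewise hypothetical.

```python
def assign_display_groups(regions, low_percent=5):
    """Map each region id to display group "A", "B", or "C".

    regions: list of dicts with keys "id", "source" ("manual" or "auto"),
    and "reliability" (only meaningful for "auto" regions).
    """
    groups = {}
    auto = sorted(
        (r for r in regions if r["source"] == "auto"),
        key=lambda r: r["reliability"],
    )
    # Integer ceiling of low_percent % of the automatic regions.
    cutoff = (len(auto) * low_percent + 99) // 100
    for region in regions:
        if region["source"] == "manual":
            groups[region["id"]] = "A"  # code decided by manual input
    for rank, region in enumerate(auto):
        groups[region["id"]] = "B" if rank < cutoff else "C"
    return groups
```

An interface could then map each group label to a color of its choice so that the operator sees at a glance how each code was decided.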

The text-line-unit source text verification and correction module (470) secondarily verifies and corrects, through a text-line-unit interface of the document, each character region of the document image entered via the batch text code assignment or the individual input. In other words, it performs the second round of verification and correction of the text codes determined for the character regions.

FIG. 15 shows examples of the text-line-unit verification and correction interface for document images of various data in a document image processing and verification system according to an embodiment of the present invention. FIG. 15(a) shows the interface for a vertically written woodblock-printed document image, FIG. 15(b) for a vertically written manuscript document image, FIG. 15(c) for a horizontally written type-printed document image, and FIG. 15(d) for a vertically written type-printed document image.

As shown, character regions and text codes are displayed one-to-one (1:1) based on the text line information of the document image. Through this interface, errors in the structural segmentation information, such as errors in text lines, character region segmentation, and logical order, are verified and corrected.

Once text code input and verification for the document image are complete, structural tags and additional information of the document need to be entered for the final text. The content tagging and additional information input module (480) is used to enter the tag information and additional information of the document through an interface that displays the character regions of the document image.

In the content tagging and additional information input module 480, tags for generating an XML (or HTML) document and various additional information (such as word spacing and paragraph breaks) can be assigned through the interface in which the character regions of the document image are displayed. A text file carrying the structural information of the final document can also be generated.

The interface of the content tagging and additional information input module 480 allows various tags and additional information to be defined, including information that can be entered automatically in a batch. An operator can therefore assign the defined tags and additional information with mouse clicks on the interface in which the character regions of the document image are displayed.

FIG. 16A shows an example of the document-image-based interface for assigning structural tag information to a document in the document image processing and verification system according to an embodiment of the present invention, and FIG. 16B shows word spacing information extracted automatically through structural analysis of the document image in the same system.

Once the various tags and additional information of the document have been entered, XML (or HTML) based text information can be extracted.
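As an illustration of this extraction step, the following minimal sketch builds an XML string from text that already carries word-spacing and paragraph information. The element names `document` and `p`, the `id` attribute, and the function name are hypothetical; the patent does not specify the tag set or the file layout.

```python
import xml.etree.ElementTree as ET


def build_document_xml(doc_id, paragraphs):
    """Serialize tagged text as XML.

    `paragraphs` is a list of paragraphs, each a list of words; the word
    boundaries stand in for the spacing information and the outer list for
    the paragraph breaks. Tag names are illustrative assumptions.
    """
    root = ET.Element("document", {"id": doc_id})
    for words in paragraphs:
        p = ET.SubElement(root, "p")
        p.text = " ".join(words)
    # encoding="unicode" returns a str without an XML declaration
    return ET.tostring(root, encoding="unicode")


xml_text = build_document_xml("doc-0001", [["天地", "玄黃"], ["宇宙", "洪荒"]])
print(xml_text)
```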

FIG. 17A shows an example of a final text file to which XML tags and additional information have been assigned in the document image processing and verification system according to an embodiment of the present invention.

In addition, this interface can convert text information that supports Unicode 4.0 into Shift JIS text. Text codes that have no match in Shift JIS are recorded, together with their position in the document and their Unicode value, in an error log file so that they can be handled separately later.

As shown, the error file contains data such as the document number 1710, the position in the document 1720, and the Unicode value 1730, so that the entries can be corrected later.
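The conversion and error log described above can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in `shift_jis` codec, records the position as a (line, column) pair, and substitutes `?` for unmatched characters. The function name, the log record layout, and the placeholder character are hypothetical and not taken from the patent.

```python
def convert_to_shift_jis(doc_id, lines):
    """Encode Unicode text lines as Shift JIS, logging unmatched characters.

    Returns (encoded_lines, error_log). Each log entry holds the document
    number, the position in the document (line, column), and the Unicode
    value, mirroring the fields 1710, 1720, and 1730 of the error file.
    """
    encoded_lines = []
    error_log = []
    for line_no, line in enumerate(lines, start=1):
        out = bytearray()
        for col, ch in enumerate(line, start=1):
            try:
                out += ch.encode("shift_jis")
            except UnicodeEncodeError:
                # No Shift JIS equivalent: log it for later manual handling.
                error_log.append((doc_id, line_no, col, f"U+{ord(ch):04X}"))
                out += b"?"  # placeholder choice is an assumption
        encoded_lines.append(bytes(out))
    return encoded_lines, error_log


encoded, log = convert_to_shift_jis("D1", ["漢한"])
print(log)
```

Encoding character by character keeps the exact position of every failure, which a whole-line `encode(..., errors="replace")` call would lose.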

FIG. 18 is a structural diagram of a document image processing and verification system for digitizing a large volume of data according to another embodiment of the present invention.

As shown, the document image processing and verification system 400' according to this other embodiment includes a document image structure analysis and character region segmentation module 410, a document image segmentation verification module 420, a character clustering module 430, a character cluster unit verification and text conversion module 440, a character unit code input module 450, a cluster unit verification and correction module 460, a text line unit original text verification and correction module 470, a content tagging and additional information input module 480, a document image database 1810, a document cluster image database 1820, a document image structure and segmentation database 1830, and a text database 1840.

The document image structure analysis and character region segmentation module 410, the document image segmentation verification module 420, the character clustering module 430, the character cluster unit verification and text conversion module 440, the character unit code input module 450, the cluster unit verification and correction module 460, the text line unit original text verification and correction module 470, and the content tagging and additional information input module 480 are the same as in the embodiment shown in FIG. 4, so their description is omitted.

The document image database 1810 stores the document images obtained through scanning.

The document cluster image database 1820 temporarily stores the character cluster images built from the character regions by character recognition.

The document image structure and segmentation database 1830 stores the data produced by structural analysis and automatic segmentation of the document images. It is the database from which each interface reads the information it needs, and to which it writes its processing results, throughout the processing stages until the final text information is extracted.

The text database 1840 stores the final text information obtained by the document image processing and verification system according to the present invention.

The information stored in each of these databases can be read and updated by each module.

Specifically, the document image structure analysis and character region segmentation module 410 receives the scanned document images from the document image database 1810, segments them, and stores the result in the document image structure and segmentation database 1830.

The document image segmentation verification module 420 receives the segmented document images from the document image structure and segmentation database 1830, verifies them, and stores them back in the document image structure and segmentation database 1830.

The character clustering module 430 receives the document images from the document image structure and segmentation database 1830, generates cluster images from them, and stores the cluster images in the document cluster image database 1820.

The character cluster unit verification and text conversion module 440 receives the character cluster images from the document cluster image database 1820, verifies them cluster by cluster, converts them to text in a batch, and stores the result in the document image structure and segmentation database 1830.

The character unit code input module 450 individually enters text codes for those images, among the document images received from the document image structure and segmentation database 1830, that were not clustered by the character clustering module 430, and stores the result in the document image structure and segmentation database 1830.

The cluster unit verification and correction module 460 performs cluster-unit verification and correction based on the document images received from the document image structure and segmentation database 1830 and stores the result in the same database.

The text line unit original text verification and correction module 470 performs text-line-unit verification and correction based on the document images received from the document image structure and segmentation database 1830 and stores the result in the same database.

The content tagging and additional information input module 480 performs content tagging and additional information input based on the document images received from the document image structure and segmentation database 1830 and stores the result in the same database.

The text information produced through this process is finally stored in the text database 1840.
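The data flow between the modules and the four databases can be pictured with a toy sketch. Everything here is a simplifying assumption made for illustration: the databases are plain dictionaries, a "document image" is a string whose characters stand in for character regions, and the recognizer and manual input are supplied by the caller. The verification modules 420, 460, and 470 and the tagging module 480 are omitted.

```python
from dataclasses import dataclass, field


@dataclass
class Stores:
    document_images: dict = field(default_factory=dict)  # database 1810
    cluster_images: dict = field(default_factory=dict)   # database 1820
    structure: dict = field(default_factory=dict)        # database 1830
    text: dict = field(default_factory=dict)             # database 1840


def segment(stores, doc_id):
    """Module 410: read 1810, write segmented regions to 1830."""
    image = stores.document_images[doc_id]
    stores.structure[doc_id] = [
        {"region": i, "glyph": g, "code": None} for i, g in enumerate(image)
    ]


def cluster(stores, doc_id, recognize):
    """Module 430: read 1830, group recognized regions by code into 1820."""
    groups = {}
    for r in stores.structure[doc_id]:
        code = recognize(r["glyph"])
        if code is not None:
            groups.setdefault(code, []).append(r["region"])
    stores.cluster_images[doc_id] = groups


def assign_batch_codes(stores, doc_id):
    """Module 440: read 1820, write one code per whole cluster into 1830."""
    for code, regions in stores.cluster_images[doc_id].items():
        for i in regions:
            stores.structure[doc_id][i]["code"] = code


def enter_codes_individually(stores, doc_id, manual):
    """Module 450: fill in regions that were left without a code."""
    for r in stores.structure[doc_id]:
        if r["code"] is None:
            r["code"] = manual[r["region"]]


def export_text(stores, doc_id):
    """Final step: write the finished text to 1840."""
    stores.text[doc_id] = "".join(r["code"] for r in stores.structure[doc_id])


stores = Stores(document_images={"d1": "ABA?"})
segment(stores, "d1")
cluster(stores, "d1", recognize=lambda g: g if g != "?" else None)
assign_batch_codes(stores, "d1")
enter_codes_individually(stores, "d1", manual={3: "C"})
export_text(stores, "d1")
print(stores.text["d1"])
```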

FIG. 19 is a flowchart of the document image processing and verification method for digitizing a large volume of data according to the present invention.

As shown, the document image processing and verification method for digitizing a large volume of data according to the present invention includes: analyzing a series of document images obtained in a scanning process and automatically segmenting the character regions in a batch (S110); verifying and correcting the automatic segmentation result of the document images through an image-based interface (S120); determining a text code by applying intelligent character recognition to each character region of the document images and generating cluster images by collecting character regions having the same code (S130); verifying the character recognition result cluster by cluster for the cluster image of each code and assigning text codes in a batch (S140); individually entering text codes for the small set of character regions whose text code was not determined by the character recognition (S150); performing primary verification and correction, through a cluster-unit interface, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input (S160); performing secondary verification and correction of each such character region through a text-line-unit interface of the document (S170); and entering the various tag information and additional information of the document through an interface that displays the character regions of the document images (S180).

The document image processing and verification method for digitizing a large volume of data according to the present invention has already been explained in detail through the description of the document image processing and verification system given with reference to FIG. 4, so further description is omitted.

Although the configuration of the present invention has been described by way of example, the description serves only to illustrate the invention. The scope of protection of the present invention is not limited by these examples and is defined by the claims.

As described above, according to the present invention, high-quality digital text information can be obtained at very high speed from a large volume of document data. Enabling cluster-unit verification and batch text conversion increases accuracy and throughput, and because the data processing and visual verification for extracting text information from document images are performed cluster by cluster, the time spent searching for errors is reduced.

In particular, the present invention has a very large impact in that it combines automation technology with verification interfaces so as to promote efficient database construction in the digitization of historical and cultural source materials, work that has depended heavily on manual labor both in Korea and abroad. This is because manual digitization demands enormous time, manpower, and cost, and the volume of material that could be processed was small.

In addition, the present invention increases accuracy and throughput by using the results of document image processing and recognition to enable cluster-unit verification and batch text conversion, and it reduces error search time because the data processing and visual verification (eye inspection) for extracting text information from document images are performed cluster by cluster.

Furthermore, to adapt to real data containing many variations in form, such as woodblock prints and manuscripts, the present invention parameterizes the character model used in the character recognition process and incorporates a periodic adaptive training module, so that a separate character model can be provided for each kind of data.

The present invention also enables automated cluster-unit processing of at least 70% of the data for manuscripts, 80% for woodblock prints, and 95% for type-printed books. The small amount of data that cannot be handled automatically is processed with a built-in high-speed Unicode input tool. Compared with existing manual digitization methods, time and financial cost are therefore reduced by about 60% to 80%, and the volume of material processed increases by more than 500%.

In the present invention, the verification and correction steps of the text information extraction process are also carried out through an image-based interface that displays the character regions of the document image. Because the operator can work while visually understanding the original document image, accuracy and throughput are greatly improved. Moreover, the operator can see how the text information of the document image was determined, that is, whether manually or automatically and, if automatically, with what recognition reliability. This allows attention to be focused selectively during verification and makes the data efficient to track and manage.
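The recognition reliability mentioned above is tied in the claims to a minimum-distance decision using Euclidean or Mahalanobis distance. The following minimal sketch shows only the Euclidean case. The feature vectors, class means, and rejection threshold are hypothetical values chosen for illustration, and the function name is an assumption; the patent does not give the feature set or the threshold.

```python
import math


def classify(feature, class_means, reject_threshold):
    """Minimum-distance classification with rejection.

    Returns (code, distance). The code is the class whose mean is nearest
    in Euclidean distance; if that distance exceeds `reject_threshold`, the
    region is rejected for clustering (code is None) and would be left for
    individual input. The distance doubles as the reliability measure
    (smaller means more reliable).
    """
    code, distance = min(
        ((c, math.dist(feature, mean)) for c, mean in class_means.items()),
        key=lambda pair: pair[1],
    )
    if distance > reject_threshold:
        return None, distance
    return code, distance


# Hypothetical two-dimensional class means for two character classes.
means = {"A": (0.0, 0.0), "B": (10.0, 0.0)}
accepted = classify((1.0, 0.0), means, reject_threshold=3.0)
rejected = classify((6.0, 0.0), means, reject_threshold=3.0)
print(accepted, rejected)
```

Sorting regions by the returned distance is one simple way to direct an operator's attention to the least reliable decisions first.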

Claims

A document image processing and verification system for digitizing a large volume of data, comprising:

a document image structure analysis and character region segmentation module that, in order to extract the hierarchical structure of a series of document images obtained in a scanning process, analyzes the document images using a projection profile or connected component analysis and automatically segments character regions in a batch;

a document image segmentation verification module that verifies and corrects the automatic segmentation result of the document images through an image-based interface;

a character clustering module that determines a text code by applying intelligent character recognition to each character region of the document images verified by the document image segmentation verification module, and generates cluster images by collecting character regions having the same code;

a character cluster unit verification and text conversion module that verifies the character recognition result cluster by cluster for the cluster image of each code and assigns text codes in a batch;

a character unit code input module for individually entering text codes for the small set of character regions whose text code was not determined by the character recognition;

a cluster unit verification and correction module that performs primary verification and correction, through a cluster-unit interface, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input;

a text line unit original text verification and correction module that performs secondary verification and correction, through a text-line-unit interface of the document, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input; and

a content tagging and additional information input module for entering various tag information and additional information of the document through an interface that displays the character regions of the document images.

The system of claim 1, further comprising:

a document image database that stores document images obtained through scanning;

a character cluster image database that stores character cluster images built from the character regions by character recognition;

a document image structure and segmentation database that stores data produced by structural analysis and automatic segmentation of the document images; and

a text database that stores the final text information obtained.

The system of claim 1, wherein the document image structure analysis and character region segmentation module automatically detects and corrects the various kinds of noise and skew that arise in the document image acquisition process.

(delete)

The system of claim 1, wherein the document image segmentation verification module includes an interface providing functions for creating and deleting text lines and for creating, deleting, resizing, merging, horizontally splitting, and vertically splitting character regions.

The system of claim 1, wherein the character clustering module uses a Euclidean distance or Mahalanobis distance method and determines, as the text code of a character region, the character class corresponding to the minimum distance.

The system of claim 6, wherein the character clustering module determines adoption or rejection of document clustering by using the distance value obtained with the Euclidean distance or Mahalanobis distance method as the recognition reliability.

The system of claim 7, wherein the character cluster unit verification and text conversion module assigns text codes in a batch to the character regions for which adoption of document clustering has been determined using the recognition reliability.

The system of claim 7, wherein the character unit code input module individually assigns text codes to the character regions for which rejection of document clustering has been determined using the recognition reliability.

The system of claim 1, wherein the character unit code input module uses a high-speed Chinese character input method, a radical and stroke count based input method, or a pronunciation based input method.

The system of claim 1, wherein the cluster unit verification and correction module displays each character region of the document images whose text code was entered through the batch text code assignment or the individual input in a different color according to its recognition reliability.

The system of claim 1, wherein the content tagging and additional information input module converts text information that supports Unicode 4.0 into Shift JIS text and generates an error log file that records, for each text code that has no match, its position in the document and the corresponding Unicode value.

A document image processing and verification method for digitizing a large volume of data, comprising:

automatically segmenting character regions in a batch by analyzing a series of document images obtained in a scanning process using a projection profile or connected component analysis of the document images, in order to extract the hierarchical structure of the document images;

verifying and correcting the automatic segmentation result of the document images through an image-based interface;

determining a text code by applying intelligent character recognition to each character region of the document images verified and corrected through the image-based interface, and generating cluster images by collecting character regions having the same code;

verifying the character recognition result cluster by cluster for the cluster image of each code and assigning text codes in a batch;

individually entering text codes for the small set of character regions whose text code was not determined by the character recognition;

performing primary verification and correction, through a cluster-unit interface, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input;

performing secondary verification and correction, through a text-line-unit interface of the document, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input; and

entering various tag information and additional information of the document through an interface that displays the character regions of the document images.

The method of claim 13, wherein the step of automatically segmenting character regions by analyzing the series of document images obtained in the scanning process includes automatically detecting and correcting the various kinds of noise and skew that arise in the document image acquisition process.

(delete)

The method of claim 13, wherein the step of determining a text code by applying intelligent character recognition to each character region of the document images and generating cluster images by collecting character regions having the same code includes determining the text code of a character region using a Euclidean distance or Mahalanobis distance method.

The method of claim 16, wherein the step of determining a text code by applying intelligent character recognition to each character region of the document images and generating cluster images by collecting character regions having the same code includes determining adoption or rejection of document clustering by using the distance value obtained with the Euclidean distance or Mahalanobis distance method as the recognition reliability.

The method of claim 17, wherein the step of verifying the character recognition result cluster by cluster for the cluster image of each code and assigning text codes in a batch assigns text codes in a batch to the character regions for which adoption of document clustering has been determined using the recognition reliability.

The method of claim 17, wherein the step of individually entering text codes for the small set of character regions whose text code was not determined by the character recognition individually assigns text codes to the character regions for which rejection of document clustering has been determined using the recognition reliability.

The method of claim 13, wherein the step of performing primary verification and correction, through the cluster-unit interface, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input includes displaying each such character region in a different color according to its recognition reliability.

The method of claim 13, wherein the step of entering various tag information and additional information of the document through the interface that displays the character regions of the document images includes converting text information that supports Unicode 4.0 into Shift JIS text and generating an error log file that records, for each text code that has no match, its position in the document and the corresponding Unicode value.

A computer-readable recording medium having recorded thereon a program for implementing:

a function of automatically segmenting character regions in a batch by analyzing a series of document images obtained in a scanning process using a projection profile or connected component analysis of the document images, in order to extract the hierarchical structure of the document images;

a function of determining a text code by applying intelligent character recognition to each character region of the document images verified and corrected through the image-based interface, and generating cluster images by collecting character regions having the same code;

a function of verifying the character recognition result cluster by cluster for the cluster image of each code and assigning text codes in a batch;

a function of individually entering text codes for the small set of character regions whose text code was not determined by the character recognition;

a function of performing primary verification and correction, through a cluster-unit interface, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input;

a function of performing secondary verification and correction, through a text-line-unit interface of the document, of each character region of the document images whose text code was entered through the batch text code assignment or the individual input; and

a function of entering various tag information and additional information of the document through an interface that displays the character regions of the document images.