KR102043693B1

KR102043693B1 - Machine learning based document management system

Info

Publication number: KR102043693B1
Application number: KR1020180115569A
Authority: KR
Inventors: 김지성
Original assignee: 김지성
Priority date: 2018-09-28
Filing date: 2018-09-28
Publication date: 2019-11-12

Abstract

The present invention relates to a document management system based on machine learning. The system comprises: an input unit that forms a document image from a non-electronic document; and a document management server that identifies the font of the document image and, if no corresponding font exists in a font database, constructs a model for the font using a machine learning algorithm and recognizes the characters in the document image by referring to the font information of the constructed model. Such a system may increase the character recognition rate.

Description

Machine learning based document management system

The present invention relates to a machine learning based document management system.

In general, documents are all papers and records used for business in any organization, such as a company or a government office.

Documents are either in electronic form (hereinafter, electronic documents) or in non-electronic form (hereinafter, non-electronic documents). An electronic document is an electronic file of an audiovisual record such as an image or a video, and a non-electronic document is a document in offline form, such as the original (e.g., paper) of a paper or record.

Such non-electronic documents require a place for storage, and recently efforts have been made to convert such paper documents into electronic documents for storage and archiving.

Korean Laid-open Publication No. 10-2017-0011754
Korean Laid-open Publication No. 10-2014-0124100

The present invention has been devised to meet the above need. Its object is to provide a machine learning based document management system including a document management server that identifies the font of a document image and, if no corresponding font exists in a font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model.

The present invention comprises: an input unit that forms a document image from a non-electronic document; and a document management server that identifies the font of the document image and, if no corresponding font exists in a font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model.

According to the present invention, the font of a document image is identified and, if no corresponding font exists in the font database, a model for the font is constructed using a machine learning algorithm and the characters of the document image are recognized by referring to the font information of the constructed model, so that the character recognition rate can be increased.

FIG. 1 is a block diagram of a machine learning based document management system according to a preferred embodiment of the present invention.
FIG. 2 is a diagram of the internal configuration of the document management server of FIG. 1.
FIG. 3 is a diagram of the internal configuration of the character recognition unit of FIG. 2.
FIG. 4 is a diagram of the internal configuration of the picture character recognition unit of FIG. 2.
FIG. 5 is an exemplary diagram for describing the process of recognizing character positions in the picture character recognition unit of FIG. 2.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and can be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that the scope of the invention is fully conveyed to those of ordinary skill in the art to which the invention belongs; the present invention is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are clearly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention.

In this specification, the singular also includes the plural unless the phrase specifically states otherwise. As used in this specification, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

FIG. 1 is a block diagram of a machine learning based document management system according to a preferred embodiment of the present invention.

Referring to FIG. 1, the machine learning based document management system according to a preferred embodiment of the present invention includes a scanner 20 and a document management server 30.

The scanner 20 transmits an image obtained by scanning a paper document 10, which is a non-electronic document, to the document management server 30. Although only the scanner 20 is shown here, a camera may also be included, and the scanner and the camera may be collectively referred to as an input unit.

The document management server 30 may store the scanned image of the paper document 10 (that is, the document image) in a memory and transmit it to another device at the request of a user.

The document management server 30 identifies the font of the document image and, if no corresponding font exists in the font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model.

FIG. 2 is a diagram of the internal configuration of the document management server 30 of FIG. 1.

Referring to FIG. 2, the document management server 30 of FIG. 1 includes a document structure analysis unit 110, a character recognition unit 120, a picture character recognition unit 130, and a document generation unit 150.

The document structure analysis unit 110 analyzes the document image and classifies the characters and pictures included in it into character blocks and picture blocks.

The document image may be an image scanned by a scanner, an image captured by a camera, or an image read from an image file, and its format may be any of various image formats such as BMP, JPEG, and PNG.

The document image may also have various color depths, for example a color image or a black-and-white image.

Although not shown in FIG. 2, the document management server 30 may further include a memory. The document image may be stored in the memory, and the document structure analysis unit 110 may access the stored document image to process it.

The document image includes character blocks and picture blocks. A character block is a block of characters in the document image that were edited with a word processor or the like. A picture block is a block of pictures or photographs included in the document image.

The document structure analysis unit 110 identifies and classifies the character blocks and picture blocks included in the document image.

The document structure analysis unit 110 analyzes the document image and divides it into character blocks and picture blocks.

For a character block, the characters it contains must be recognized in order to restore the document, so the document structure analysis unit 110 transmits the character block to the character recognition unit 120.

For a picture block, when the characters it contains need to be recognized, the document structure analysis unit 110 transmits the picture block to the picture character recognition unit 130.

When the characters contained in a picture block do not need to be recognized, the document structure analysis unit 110 may instead transmit the picture block directly to the document generation unit 150.
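
The routing performed by the document structure analysis unit 110 can be illustrated with a minimal sketch. The `Block` type, its `needs_text` flag, and the returned unit names are hypothetical names introduced here for illustration; the specification does not define how the need for character recognition in a picture block is decided.

```python
from dataclasses import dataclass


@dataclass
class Block:
    kind: str                 # "character" or "picture" (assumed labels)
    needs_text: bool = False  # assumed flag; only meaningful for picture blocks


def route_block(block: Block) -> str:
    """Return the name of the unit that should receive the block."""
    if block.kind == "character":
        # Character blocks always go to the character recognition unit (120).
        return "character_recognition_unit"
    if block.kind == "picture":
        if block.needs_text:
            # Picture blocks containing text to recognize go to unit 130.
            return "picture_character_recognition_unit"
        # Other picture blocks go straight to the document generation unit (150).
        return "document_generation_unit"
    raise ValueError(f"unknown block kind: {block.kind!r}")
```

In this sketch the decision is a plain lookup; a real analyzer would first have to segment the image and decide the two attributes itself.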

The character recognition unit 120 identifies the font of the document image of a character block and, if no corresponding font exists in the font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image of the character block by referring to the font information of the constructed model.

As shown in FIG. 3, the character recognition unit 120 includes a preprocessor 121, a feature extractor 122, a font searcher 123, a font database 124, a machine learner 125, a machine learning database 126, a recognizer 127, a post-processor 128, and a character recognition result evaluator 129.

The preprocessor 121 may perform various kinds of preprocessing: it corrects the skew of the characters included in the character block, removes noise such as specks from the character block, binarizes a color image into a black-and-white image to make character recognition easier when the characters in the block are in color, and performs line segmentation, word segmentation, and character segmentation on the characters included in the character block.
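
Two of the preprocessing steps named above, binarization and line segmentation, can be sketched as follows. This is only an illustration under stated assumptions: the image is a list of rows of 0 to 255 grayscale values, the fixed threshold of 128 is arbitrary, and lines are found with a simple row-wise ink test. The specification does not say which binarization or segmentation method the preprocessor 121 uses.

```python
def binarize(gray, threshold=128):
    """Map a grayscale image to 1 (ink) / 0 (background) with a fixed threshold."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def segment_lines(binary):
    """Return (start_row, end_row) pairs, end exclusive, for runs of rows with ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                 # a text line begins
        elif not has_ink and start is not None:
            lines.append((start, y))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(binary)))
    return lines
```

The same projection idea, applied column-wise inside each line, would give a rough word or character segmentation.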

The feature extractor 122 extracts features of the characters included in the preprocessed character block.

The font searcher 123 then uses the features extracted by the feature extractor 122 to determine whether the font database 124 contains the corresponding font information.

The font searcher 123 accesses the font database 124, in which various kinds of fonts are stored, and searches for the font most similar to the recognized font.

In doing so, it searches for similar fonts by comparing or matching the feature points extracted by the feature extractor 122 against the fonts in the font database 124.

One to three similar fonts may be retrieved; because the number of similar fonts to retrieve can be set arbitrarily, it is not limited to a specific number.

Here, the font database 124 is a storage unit that stores data related to the fonts of the present invention. It typically includes internal or external memory and may further include virtualized storage of a cloud computing network (cloud storage for short). For example, the cloud storage may include dedicated cloud storage provided by a service provider, or free cloud storage (or a network HDD) provided by a portal site (e.g., Naver, Daum).

For the font or similar fonts found in this way, various font information, including the image and name of each font, is transmitted together with the font to the recognizer 127 so that it can be used for character recognition.

If, on the other hand, the font searcher 123 finds no similar font after comparing or matching the feature points extracted by the feature extractor 122 against the fonts in the font database 124, it passes the feature points to the machine learner 125.
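
The lookup and its fallback can be sketched as a nearest-neighbour search. The cosine similarity measure, the `min_similarity` cut-off, and the dictionary-shaped font database are assumptions made for this illustration; the specification only says that feature points are compared or matched and that one to three similar fonts may be returned.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_similar_fonts(features, font_db, k=3, min_similarity=0.9):
    """Return up to k font names whose stored features are similar enough.

    An empty list stands for "no similar font found", the case in which the
    features would be handed to the machine learner.
    """
    scored = [(cosine(features, vec), name) for name, vec in font_db.items()]
    scored = [(s, n) for s, n in scored if s >= min_similarity]
    scored.sort(reverse=True)  # most similar first
    return [name for _, name in scored[:k]]
```

A caller would treat a non-empty result as font information for the recognizer 127 and an empty result as the trigger for building a new font model.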

The machine learner 125 stores the feature points of the character block received from the font searcher 123 in the machine learning database 126 as raw data.

The machine learner 125 then generates multiple bootstrap samples from the raw data, models each bootstrap sample, and combines the resulting models to generate a model for the font, thereby forming the font information. Here, a bootstrap sample is a sample of the same size as the raw data, drawn by simple random sampling with replacement.
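
The bootstrap-and-combine procedure described above is the general shape of bagging, and a minimal sketch follows. The per-sample "model" here is just a per-dimension mean of the feature vectors and the combination is an average of those means; both are placeholders chosen for this illustration, since the specification does not state what model is fitted to each bootstrap sample or how the models are combined.

```python
import random
from statistics import mean


def bootstrap_samples(raw, n_samples, rng):
    """Draw n_samples resamples of raw, each the same size, with replacement."""
    return [[rng.choice(raw) for _ in raw] for _ in range(n_samples)]


def fit_mean_vector(sample):
    """Toy per-sample model (assumption): the per-dimension mean of the vectors."""
    return [mean(col) for col in zip(*sample)]


def bagged_font_model(raw, n_samples=25, seed=0):
    """Fit one toy model per bootstrap sample and average them."""
    rng = random.Random(seed)
    models = [fit_mean_vector(s) for s in bootstrap_samples(raw, n_samples, rng)]
    return [mean(col) for col in zip(*models)]
```

Swapping `fit_mean_vector` for a real learner would keep the resampling and combining steps unchanged.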

The machine learning algorithm may include various algorithms, such as bagging, linear regression, multilayer neural network regression for nonlinear modeling, support vector regression (SVR), and expectation-maximization (EM) clustering as an unsupervised learning technique; the present invention is not limited to these.

The machine learner 125 recognizes the characters of the character block through the machine learning algorithm, uses the recognized characters to form a model for the font combined with the feature points, generates font information based on the formed model, and stores it in the font database 124 so that the recognizer 127 can use it later when recognizing the characters of a character block.

Based on the features of the characters extracted by the feature extractor 122, the recognizer 127 recognizes the characters included in the character block using various methods, such as structural analysis or statistical methods, and generates character codes such as ASCII.

The post-processor 128 uses language modeling, such as a dictionary (lexicon), to correct misrecognized characters among the characters recognized by the recognizer 127.
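
Dictionary-based correction of this kind can be sketched with the standard library's `difflib.get_close_matches`. The lexicon contents and the `cutoff` value are assumptions for illustration; the specification does not describe the matching rule used by the post-processor 128.

```python
import difflib


def correct_word(word, lexicon, cutoff=0.75):
    """Replace a word with its closest lexicon entry, if one is close enough."""
    if word in lexicon:
        return word  # already a known word
    matches = difflib.get_close_matches(word, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word  # leave unknown words untouched
```

For example, with a lexicon containing "document", the common OCR confusion "docurnent" (r and n read in place of m) would be mapped back to "document", while a string with no close entry is returned unchanged.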

The character recognition result evaluator 129 calculates the degree of confidence in the character recognition result of the recognizer 127. For example, the evaluator 129 may express the degree of confidence as a numerical value and judge the recognition a success if the value is equal to or greater than a predetermined threshold and a failure if it is below the threshold.
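
The threshold test can be sketched as below. Taking the confidence as the mean of per-character scores, and the threshold value of 0.8, are assumptions made only for this illustration; the specification states only that a numerical confidence is compared with a predetermined threshold.

```python
from statistics import mean


def recognition_succeeded(char_scores, threshold=0.8):
    """Return True when the overall confidence reaches the threshold.

    char_scores: per-character confidence values in [0, 1] (assumed input).
    """
    if not char_scores:
        return False  # nothing was recognized
    confidence = mean(char_scores)
    return confidence >= threshold  # at or above the threshold counts as success
```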

When the evaluation by the character recognition result evaluator 129 is a recognition success, the recognition result of the recognizer 127 is transmitted to the document generation unit 150.

Meanwhile, the picture character recognition unit 130 extracts a character area from a picture block classified by the document structure analysis unit 110 and transmits it to the character recognition unit 120, so that the characters recognized by the character recognition unit 120 are transmitted to the document generation unit 150 and displayed in the character area of the picture block.

As shown in FIG. 4, the picture character recognition unit 130 includes a marker recognizer 141, a character area determiner 142, and a picture character recognizer 143.

The marker recognizer 141 recognizes the positions of horizontal or vertical character markers in the picture block.

Here, a character marker means a horizontal bar (e.g., ㅡ) or a vertical bar (e.g., l). In Hangul, every letter includes a horizontal bar or a vertical bar, so the character area can be determined by detecting these markers. As FIG. 5 shows, a horizontal bar and a vertical bar are present in each character.

When the horizontal bars or vertical bars recognized by the marker recognizer 141 appear periodically at a regular distance, the character area determiner 142 determines the character area based on those regularly spaced horizontal bars or vertical bars.

When a horizontal bar and a vertical bar overlap, as in the letters '과', '의', and '워', the character area determiner 142 ignores the vertical bar and determines the character area, because the horizontal bar is positioned at the center.
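
The periodicity test used by the character area determiner 142 can be sketched in one dimension. The input is assumed to be a sorted list of bar positions along one axis, already reduced to one bar per character (so the overlap rule above is not modeled); the `tolerance` and `min_count` parameters are hypothetical and not taken from the specification.

```python
def periodic_runs(positions, tolerance=2, min_count=3):
    """Find runs of bars whose spacing stays (nearly) constant.

    positions: sorted bar positions along one axis (assumed input).
    Returns (first_position, last_position) for each run of at least
    min_count bars whose gaps match the run's first gap within tolerance.
    """
    runs, current = [], []
    for p in positions:
        if len(current) < 2:
            current.append(p)  # need two bars to define a spacing
            continue
        gap = current[1] - current[0]
        if abs((p - current[-1]) - gap) <= tolerance:
            current.append(p)  # spacing continues the period
        else:
            if len(current) >= min_count:
                runs.append((current[0], current[-1]))
            current = [current[-1], p]  # start a new candidate run
    if len(current) >= min_count:
        runs.append((current[0], current[-1]))
    return runs
```

Each returned span is a candidate character area along that axis; isolated bars, such as strokes inside a drawing, do not form a run and are dropped.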

Next, the picture character recognizer 143 transmits the determined character area to the character recognition unit 120 for character recognition, receives the processing result, and transmits the character area and the character information recognized for it to the document generation unit 150 so that a document for the character area of the picture block is generated.

As described above, the present invention identifies the font of a document image and, if no corresponding font exists in the font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model, so that the character recognition rate can be increased.

The steps of a method or algorithm described in connection with the embodiments disclosed in this specification may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor, so that the processor can read information from and write information to the storage medium. Alternatively, the storage medium may be integral to the processor. The processor and the storage medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside in a user terminal as discrete components.

Although the present invention has been described in more detail above with reference to embodiments, it is not necessarily limited to these embodiments and may be modified in various ways without departing from its technical idea. The embodiments disclosed in the present invention are therefore intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be interpreted according to the claims below, and all technical ideas within an equivalent scope should be interpreted as falling within the scope of rights of the present invention.

20: scanner 30: document management server
110: document structure analysis unit 120: character recognition unit
130: picture character recognition unit 150: document generation unit

Claims

A machine learning based document management system comprising:
an input unit that forms a document image from a non-electronic document; and
a document management server that identifies the font of the document image and, if no corresponding font exists in a font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model,
wherein the document management server comprises:
a document structure analysis unit that analyzes the document image and classifies the characters and pictures included in the document image into character blocks and picture blocks;
a character recognition unit that identifies the font of a character block classified by the document structure analysis unit and, if no corresponding font exists in the font database, constructs a model for the font using a machine learning algorithm and recognizes the characters of the document image by referring to the font information of the constructed model;
a document generation unit that generates a document for the document image based on the character recognition result from the character recognition unit and on the picture block; and
a picture character recognition unit that extracts a character area from the picture block classified by the document structure analysis unit and transmits it to the character recognition unit, so that the characters recognized by the character recognition unit are transmitted to the document generation unit and displayed in the character area of the picture block,
wherein the picture character recognition unit comprises:
a marker recognizer that recognizes the positions of horizontal or vertical character markers in the picture block;
a character area determiner that, when the horizontal bars or vertical bars recognized by the marker recognizer appear periodically at a regular distance, determines the character area based on those regularly spaced horizontal bars or vertical bars; and
a picture character recognizer that transmits the character area determined by the character area determiner to the character recognition unit for character recognition, receives the processing result, and transmits the character area and the character information recognized for it to the document generation unit so that a document for the character area of the picture block is generated,
wherein the character markers consist of horizontal bars and vertical bars, and
wherein the character area determiner determines the character area based on the position of the horizontal bar when a horizontal bar and a vertical bar are in close proximity, and, when a horizontal bar and a vertical bar overlap as in the letters '과', '의', and '워', ignores the vertical bar and determines the character area because the horizontal bar is positioned at the center.

The system according to claim 1,
wherein the input unit is a scanner that scans a non-electronic document and transmits a document image to the document management server.

The system according to claim 1,
wherein the input unit is a camera that photographs a non-electronic document and transmits a document image to the document management server.

The system according to claim 1,
wherein the document management server stores the document image in a memory and transmits it to another device at the request of a user.

delete

The system according to claim 1,
wherein the document structure analysis unit transmits the picture block to the picture character recognition unit when a character area needs to be extracted from the picture block.

The system according to claim 1,
wherein the document structure analysis unit transmits the picture block to the document generation unit when no character area needs to be extracted from the picture block.

delete

The system according to claim 1,
wherein the character recognition unit comprises:
a preprocessor that preprocesses the document image;
a feature extractor that extracts features of the characters included in the character block preprocessed by the preprocessor;
a machine learner that applies machine learning to the features extracted by the feature extractor to generate a model for the font and form font information;
a font searcher that uses the features extracted by the feature extractor to determine whether the font database contains the font information and, if it does not, causes the machine learner to generate a model for the font and form the font information; and
a recognizer that performs character recognition on the character block using the features extracted by the feature extractor, by referring to the font information formed by the machine learner.

The system according to claim 12,
wherein the character recognition unit further comprises
a post-processor that uses language modeling, such as a dictionary, to correct misrecognized characters among the characters recognized by the recognizer.

The system according to claim 12,
wherein the character recognition unit further comprises
a character recognition result evaluator that calculates the degree of confidence in the character recognition result of the recognizer.

The system according to claim 12,
wherein the preprocessor corrects the skew of the characters included in the character block, removes noise included in the character block, binarizes a color image into a black-and-white image when the characters included in the character block are in color, and performs line segmentation, word segmentation, and character segmentation on the characters included in the character block.