KR20230029206A

KR20230029206A - Apparatus for constructing training data for artificial intelligence based text recognition

Info

Publication number: KR20230029206A
Application number: KR1020210111434A
Authority: KR
Inventors: 하용욱; 임동진; 김경선
Original assignee: (주) 엔에이치엔다이퀘스트
Priority date: 2021-08-24
Filing date: 2021-08-24
Publication date: 2023-03-03

Abstract

Provided is a device for accurately and conveniently constructing massive training data for artificial intelligence-based text recognition. If a segmenting module draws a bounding box around each character for characters in the image of a scanned document, a user can correct the bounding box using a segmenting verification module. A cluster processing part performs cluster processing for grouping characters with high similarity into one cluster for each character image separated by the bounding box and provides a tool for correcting the clusters. A code input part provides a tool for recognizing a character in the clustered character image, matching the character with a text, and correcting the same. An examination and correction part provides a tool for examining and correcting a segment and a text for a document image having the text inputted thereinto.

Description

Apparatus for constructing training data for artificial intelligence based text recognition

The present invention relates to an apparatus for constructing training data for artificial intelligence character recognition, and more particularly to an apparatus that can construct a large volume of such training data accurately and conveniently.

There is high demand for converting documents into text. In particular, converting documents such as old books into text greatly helps research in the humanities and the study of old documents.

The most efficient way to convert the characters in a document into text automatically is to use an artificial intelligence character recognition method. Patent Registration No. 10-2264988 discloses a system for recognizing Chinese characters using artificial intelligence.

The process of recognizing characters, such as Chinese characters, contained in a document is as follows. First, character images are separated from the image file of the scanned document, and meta-information is obtained by analyzing the dynamic relationships among the positions where characters appear. Each separated character image is then recognized by matching it to a natural-language character. A knowledge base can be built from the recognized characters and the meta-information.

Patent Registration No. 10-1937398 discloses a method of separating and extracting Chinese character images from images of old documents. Character images separated and extracted in this way can be recognized using a neural network.

However, to recognize characters using artificial intelligence, the neural network for character recognition must be trained in advance. The network is trained on training data consisting of pairs of a separated character image and the character that the image represents.

To raise the accuracy of character recognition, a large amount of training data extracted from real documents must be constructed. There is therefore strong demand for tools that improve the accuracy and efficiency of constructing training data.

The present invention has been made in view of the above, and its object is to provide an apparatus for constructing training data for artificial intelligence character recognition that constructs the training data efficiently while reducing errors.

An apparatus for constructing training data for artificial intelligence character recognition according to a preferred embodiment of the present invention comprises: a segment processing unit including a segmenting module that draws a bounding box around each character in a scanned document image, and a segment verification module that provides a tool for correcting the bounding boxes; a cluster processing unit including a clustering module that groups highly similar characters into one cluster for the character images separated by the bounding boxes, and a cluster verification module that provides a tool for correcting the clusters; a code input unit that recognizes the clustered character images, matches them to character text, and provides a tool for correcting the result; and an inspection and correction unit that provides a tool for inspecting and correcting the segments and character text of a document image for which character text has been entered.

When the segmenting module performs segmentation by artificial intelligence, it may also output a confidence score. For segmentation results with low confidence, the segment verification module may display the bounding box with a color, thickness, background color, or the like that differs from other segments.

The segment verification module provides a vertical cut tool that, when the user clicks between two vertically stacked characters, separates them into upper and lower characters at the clicked position; a horizontal cut tool that, when the user clicks between two characters placed side by side, separates them into left and right characters at the clicked position; and a merge tool that combines two selected segments into one segment.

The clustering module extracts a feature vector from each individual character image, uses it as the feature vector of that character, computes the similarity between characters as the cosine similarity between feature vectors, and builds clusters of similar characters based on that similarity.

When the clustering module clusters character images by artificial intelligence, it may also output a confidence score. For character images with low confidence, the cluster verification module may display a color, thickness, background color, border, or the like that differs from other character images.

The cluster processing unit may provide a cluster optimization function that rearranges the clusters to reflect the edits made so far. In the cluster optimization function, the cluster verification module may find and display the clusters that it judges to be the same as a given character cluster, so that the user can select the matching clusters among them.

For clusters that remain with a predetermined number of characters or fewer, the cluster verification module may provide a tool for performing a 'merge' operation, which selects and merges the characters that match the first character, and a 'create new' operation, which merges into a new cluster those remaining characters that differ from the first character but form a cluster among themselves.

The code input unit includes a character recognition module that runs a character recognition algorithm to recognize the character in a character image, and provides a tool for correcting the recognized character text cluster by cluster.

The inspection and correction unit displays the document image, the segment marks applied to the individual characters in the document image, and the character text assigned to each character in the document image, and provides a tool for correcting them.

When the user hovers the mouse over a character, the inspection and correction unit may display the character text entered for the characters in the column or row to which that character belongs.

According to the present invention, the character clustering and character recognition results that artificial intelligence produces for a scanned document image can be corrected easily, so training data for artificial intelligence character recognition can be constructed efficiently.

FIG. 1 is a block diagram showing the configuration of an apparatus for constructing training data for artificial intelligence character recognition according to a preferred embodiment of the present invention.
FIG. 2 is a block diagram showing the configuration of the segment processing unit.
FIG. 3 is a block diagram showing the configuration of the cluster processing unit.
FIG. 4 is an example screen showing a segment processing result.
FIG. 5 is an example screen for cluster verification.
FIG. 6 is a diagram for explaining the cluster merging function.
FIG. 7 is a diagram for explaining the 'merge' operation, which selects and merges the characters that match the representative image in the 'remaining clusters' process.
FIG. 8 is a diagram for explaining the 'create new' operation, which, in the 'remaining clusters' process, merges into a new cluster those remaining characters that differ from the representative image but form a cluster among themselves.
FIG. 9 is an example screen for entering character text.
FIG. 10 is an example screen for inspecting and correcting a document image.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the drawings. The description below uses the recognition of Chinese characters as an example, but the present invention can also be applied to the recognition of other scripts such as Hangul and the alphabet. In the description below, 'character' (or 'character image') refers to the image of a character contained in a book, such as a Chinese character image or a Hangul image, and 'character text' refers to its representation on a computer as a character code such as Unicode.

As shown in FIG. 1, the apparatus (100) for constructing training data for artificial intelligence character recognition comprises a segment processing unit (110), a cluster processing unit (120), a code input unit (130), and an inspection and correction unit (140). The segment processing unit (110) draws a bounding box around each character in the scanned document image (D) and provides a tool with which the user can correct the boxes. The cluster processing unit (120) groups the character images contained in the document into groups recognized as the same character and provides a tool with which the user can correct the groups. The code input unit (130) recognizes the clustered character images, matches them to character text, and provides a tool for correcting the result. The inspection and correction unit (140) provides a tool for inspecting and correcting the segments and the entered character text of the document image to produce the final training data (T).

As shown in FIG. 2, the segment processing unit (110) comprises a segmenting module (111) and a segment verification module (112). When the user selects a target file and requests segment processing, the segmenting module (111) draws a bounding box around each character in the document image contained in that file. Segment processing may be performed by a general segmentation method such as those disclosed in Patent Registration Nos. 10-1937398 and 10-1182173, and the present invention is not limited to a particular segmentation method. In one embodiment, the segmenting module (111) passes a page-level image of an old document through a U-Net-based network to convert it into a heatmap. The segmenting module (111) then separates the heatmap with the watershed algorithm to generate the position (bounding box) of each individual character. Here, the heatmap expresses, for each pixel, the probability that the pixel lies at the center of a character.
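The heatmap-to-bounding-box step can be illustrated with a minimal sketch. It is not the method of the embodiment: instead of the watershed algorithm it uses a simpler stand-in, thresholding followed by connected-component labeling, which cannot separate touching characters the way watershed can. The function name `boxes_from_heatmap`, the threshold value, and the toy heatmap are assumptions made for illustration.

```python
from collections import deque

def boxes_from_heatmap(heatmap, threshold=0.5):
    """Return (x_min, y_min, x_max, y_max) for each 4-connected region
    of pixels whose score is at least `threshold` (an assumed value)."""
    height = len(heatmap)
    width = len(heatmap[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or heatmap[y][x] < threshold:
                continue
            # Breadth-first flood fill over one region.
            queue = deque([(x, y)])
            seen[y][x] = True
            x_min = x_max = x
            y_min = y_max = y
            while queue:
                cx, cy = queue.popleft()
                x_min, x_max = min(x_min, cx), max(x_max, cx)
                y_min, y_max = min(y_min, cy), max(y_max, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and not seen[ny][nx] and heatmap[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes

# Toy heatmap with two separate high-probability regions.
heatmap = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.9, 0.8, 0.0, 0.0],
    [0.0, 0.7, 0.9, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.6, 0.9],
]
print(boxes_from_heatmap(heatmap))  # [(1, 1, 2, 2), (3, 4, 4, 4)]
```

A real pipeline would replace the flood fill with a marker-based watershed so that two characters whose heatmap regions touch still receive separate boxes.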

When the segmenting module (111) has finished segmenting the characters, the segment verification module (112) displays the segmented document image as shown in FIG. 4 and lets the user modify it.

As shown in FIG. 4, the segmentation results are displayed on the document image as rectangular boxes (bounding boxes). When the segmenting module (111) performs segmentation by artificial intelligence such as a convolutional neural network, it may also output a confidence score. For segmentation results with low confidence, the segment verification module (112) may display the bounding box with a color, thickness, background color, or the like that differs from other segments. In FIG. 4, high-confidence segments are shown as blue boxes and segments with relatively low confidence as yellow boxes. At the top center of FIG. 4, two vertically stacked characters have been segmented as one and are marked with a yellow box. Displaying segmentation results differently according to confidence makes the user's correction work easier.
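A minimal sketch of confidence-based highlighting follows. The blue and yellow colors come from the description of FIG. 4; the cutoff of 0.8, the function name `box_style`, and the sample segments are assumptions, since the source does not state how the threshold is chosen.

```python
def box_style(confidence, low_threshold=0.8):
    """Pick a display color for a bounding box from its confidence score.

    `low_threshold` is an assumed cutoff; the source gives no value.
    """
    return "yellow" if confidence < low_threshold else "blue"

# Hypothetical segmentation output: box coordinates plus a confidence score.
segments = [
    {"box": (10, 10, 40, 40), "confidence": 0.97},
    {"box": (10, 45, 40, 110), "confidence": 0.42},  # two characters fused into one box
]
for seg in segments:
    print(seg["box"], box_style(seg["confidence"]))
```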

The user can click a part where the segmentation is wrong and correct it. The segment verification module (112) may provide the user with tools such as the following.

o Horizontal cut: split a segment horizontally into two segments at the specified position
o Vertical cut: split a segment vertically into two segments at the specified position
o Re-segment: delete the current segment and create a new segment
o Merge: combine two segments into one segment
o Undo, redo: revert a work step to the previous state or restore it
o Zoom in, zoom out: enlarge or reduce the work screen
o Mini-map: show an overview of the content currently being worked on
o Magnified map: show an enlarged view of the segment currently being worked on

On the screen of FIG. 4, clicking between the two characters in the yellow box and then selecting "Vertical cut" separates them into an upper and a lower segment. If two characters placed side by side are shown as a single segment, they can be separated with the "Horizontal cut" tool. Conversely, if a single character has been split into two segments, the "Merge" tool combines them into one segment. The segment verification module (112) stores the segmented character images.
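The cut and merge tools can be sketched as operations on axis-aligned boxes. This is an illustration only: the `(x_min, y_min, x_max, y_max)` tuple layout and the function names are assumptions, and the tool naming follows the paragraph above (the vertical cut separates stacked characters, the horizontal cut separates side-by-side characters).

```python
def vertical_cut(box, y):
    """Split a box into upper and lower boxes at the clicked y coordinate."""
    x0, y0, x1, y1 = box
    if not y0 < y < y1:
        raise ValueError("cut position must lie strictly inside the box")
    return (x0, y0, x1, y), (x0, y, x1, y1)

def horizontal_cut(box, x):
    """Split a box into left and right boxes at the clicked x coordinate."""
    x0, y0, x1, y1 = box
    if not x0 < x < x1:
        raise ValueError("cut position must lie strictly inside the box")
    return (x0, y0, x, y1), (x, y0, x1, y1)

def merge(a, b):
    """Combine two boxes into the smallest box that contains both."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

# Two stacked characters that were segmented as one box.
stacked = (10, 10, 40, 70)
top, bottom = vertical_cut(stacked, 40)
print(top, bottom)         # (10, 10, 40, 40) (10, 40, 40, 70)
print(merge(top, bottom))  # (10, 10, 40, 70)
```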

The segmented document image verified by the segment verification module (112) may be stored as a JPEG + JSON file pair and may be used as training data for training the neural network of the segmenting module (111).
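The source states only that the verified result is a JPEG + JSON file pair and does not give the JSON layout. The sketch below shows one plausible shape for the JSON half; every field name (`image`, `segments`, `id`, `box`) and the file name `page_0001.jpg` are hypothetical.

```python
import json

def build_annotation(image_file, boxes):
    """Build a JSON-serializable record pairing an image file with its boxes.

    The schema is an assumption made for illustration only.
    """
    return {
        "image": image_file,
        "segments": [{"id": i, "box": list(box)} for i, box in enumerate(boxes)],
    }

annotation = build_annotation("page_0001.jpg", [(10, 10, 40, 40), (10, 40, 40, 70)])
text = json.dumps(annotation, ensure_ascii=False, indent=2)
print(text)
```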

As shown in FIG. 3, the cluster processing unit (120) comprises a clustering module (121) and a cluster verification module (122).

When segment processing is finished, the clustering module (121) performs clustering, grouping highly similar characters into one cluster for the character images separated by the bounding boxes. Clustering may be performed by a general image clustering method such as those disclosed in Patent Registration Nos. 10-1371657 and 10-1486495, and the present invention is not limited to a particular clustering method. In one embodiment, the clustering model may be based on a ResNet with bottleneck blocks. In the early stage of constructing training data, when little data on individual Chinese characters from old documents is available, the model may be trained on open Hangul and Chinese character data. After that, only the feature extraction part is used to extract a feature vector from each individual Chinese character image; this vector serves as the feature vector of the character, the similarity between characters is computed as the cosine similarity between feature vectors, and clusters of similar characters are built based on that similarity.
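The similarity step can be sketched as follows. Cosine similarity is taken from the description; the grouping rule is an assumption, because the source does not say how clusters are formed from the similarities. Here a simple greedy pass assigns each vector to the first cluster whose representative (its first member) is similar enough. The function names, the 0.9 threshold, and the toy three-dimensional vectors are illustrative; real feature vectors would come from the network's feature extractor.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def cluster_by_similarity(vectors, threshold=0.9):
    """Greedy grouping (an assumed rule): join the first cluster whose
    representative is at least `threshold` similar, else start a new one.
    Returns clusters as lists of indices into `vectors`."""
    clusters = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vectors[cluster[0]], vector) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])
    return clusters

# Toy feature vectors: items 0 and 1 point one way, items 2 and 3 another.
features = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.1], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
print(cluster_by_similarity(features))  # [[0, 1], [2, 3]]
```

A greedy pass depends on input order and on the threshold, which is one reason a manual verification step like the one described next is useful.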

The cluster verification module (122) reads the file clustered by the clustering module (121) and lets the user verify and edit it. When character images are clustered by artificial intelligence, a confidence score may also be output. Character images whose clustering confidence is low may be displayed with a color, thickness, background color, border, or the like that differs from other character images.

FIG. 5 shows an example of the cluster verification and editing screen. When the user clicks "Next character" on the screen of FIG. 5, a cluster list is displayed in the left pane. The cluster list shows a representative character image for each cluster and the number of character images clustered under it. When the user clicks an entry in the list, the character images clustered under that character are displayed in the right pane.

Within the cluster displayed in the right pane, the user clicks any character to be excluded from that cluster. A mark indicating exclusion is shown on the clicked character; in FIG. 5, an X mark indicates that the character has been excluded from the cluster. Clicking a character marked with an X again removes the X mark and returns the character to the cluster.

As shown in FIG. 6, the same character may appear more than once in the character cluster list. In that case, the 'Merge clusters' function can be used to combine the clusters into one. For example, the user selects one of the clusters to be merged (①), presses the 'Merge clusters' button (②), and selects the other cluster to be merged (③). When the user selects 'OK' in the message window asking whether to merge the two clusters, the two clusters are combined into one.
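The two editing actions described above, toggling the exclusion mark on a character and merging two clusters, can be sketched with plain Python containers. The data layout (a dict from cluster id to a list of image ids, and a set of excluded image ids) and all names are assumptions made for illustration.

```python
def toggle_excluded(excluded, image_id):
    """Click handler: add the X mark to an image, or remove it if already marked."""
    if image_id in excluded:
        excluded.remove(image_id)
    else:
        excluded.add(image_id)

def merge_clusters(clusters, keep_id, other_id):
    """Move every member of `other_id` into `keep_id` and drop `other_id`."""
    if keep_id == other_id:
        raise ValueError("cannot merge a cluster with itself")
    clusters[keep_id].extend(clusters.pop(other_id))

# Hypothetical state: two clusters that turn out to hold the same character.
clusters = {"c1": ["img_01", "img_02"], "c2": ["img_07"]}
excluded = set()

toggle_excluded(excluded, "img_02")   # first click marks the image
toggle_excluded(excluded, "img_02")   # second click clears the mark
merge_clusters(clusters, "c1", "c2")
print(clusters)  # {'c1': ['img_01', 'img_02', 'img_07']}
print(excluded)  # set()
```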

Once cluster verification has progressed to some extent, the user can run the cluster optimization function from time to time. When the function is run, the cluster verification module (122) rearranges the clusters to reflect the edits made so far. Depending on the embodiment, a 'Find and merge clusters' function may be provided for the rearranged clusters. This function finds and merges duplicate character clusters within the image file; it proceeds from the first character to the most recent character, and at each step the matching clusters are selected and then merged. For example, the cluster verification module (122) finds and displays the clusters that it judges to be the same as a given character cluster, the user selects the matching clusters among them, and this process is repeated from the first character to the most recent character.

The above process may be performed on clusters consisting of a predetermined number of characters or more. After this process, the 'remaining clusters' process may be performed on the clusters that remain with the predetermined number of characters or fewer. In the 'remaining clusters' process, the user can perform a 'merge' operation, which selects and merges the characters that match the representative image (the first character) as shown in FIG. 7, and a 'create new' operation, which merges into a new cluster those remaining characters that differ from the representative image but form a cluster among themselves, as shown in FIG. 8.
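The 'merge' and 'create new' operations of the 'remaining clusters' process can be sketched as two list-partitioning steps. The source describes these only at the user-interface level, so the data model here (a leftover group as a list of image ids whose first element is the representative, and a set of ids the user has selected) and the function names are assumptions.

```python
def merge_with_representative(group, selected):
    """'Merge': keep the representative (first item) together with the selected
    items; return (merged_cluster, leftover_items)."""
    representative, rest = group[0], group[1:]
    merged = [representative] + [item for item in rest if item in selected]
    leftover = [item for item in rest if item not in selected]
    return merged, leftover

def create_new_cluster(leftover, selected):
    """'Create new': gather the selected leftover items into a new cluster;
    return (new_cluster, still_unassigned)."""
    new_cluster = [item for item in leftover if item in selected]
    unassigned = [item for item in leftover if item not in selected]
    return new_cluster, unassigned

# Hypothetical leftover group; "a" is the representative image.
group = ["a", "b", "c", "d", "e"]
merged, leftover = merge_with_representative(group, selected={"b", "d"})
new_cluster, unassigned = create_new_cluster(leftover, selected={"c", "e"})
print(merged)       # ['a', 'b', 'd']
print(new_cluster)  # ['c', 'e']
print(unassigned)   # []
```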

The cluster data of the character images verified by the cluster verification module (122) may be stored as a JPEG + JSON file pair and may be used as training data for training the neural network of the clustering module (121).

When clustering is finished, the user can enter character text for the clustered characters. The code input unit (130) may include a character recognition module that runs a character recognition algorithm to recognize the character in a character image. The character recognition module uses a neural-network-based character recognition algorithm to recognize a character image and convert it into character text. The recognized character text is applied identically to all character images in the same cluster.

As shown in the example screen of FIG. 9, the code input unit (130) provides a tool for correcting, for each cluster, the character text that the character images represent. The cluster list window (91) on the left shows each representative character image together with the character text recognized for it. If the recognition is wrong, the user selects the character image to be corrected in the cluster list window (91) and enters the correct character text. The entered character text is applied at once to all character images in the same cluster; that is, character text is corrected cluster by cluster. In the example of FIG. 9, after selecting the Chinese character image to be corrected, the user enters the reading (pronunciation) of the character in the reading input window (92); a list of Chinese characters with that reading is then displayed in the Chinese character list window (93), and the user selects the desired character from it. The character text of the selected Chinese character is applied to all character images belonging to the same cluster.
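The cluster-level correction described above can be sketched as a reading lookup followed by a bulk assignment. The tiny `READINGS` table, the function names, and the image ids are assumptions made for illustration; a real tool would draw candidates from a full Hanja dictionary.

```python
# Tiny hypothetical table from a Korean reading to candidate Chinese characters.
READINGS = {
    "천": ["天", "千", "川"],
    "지": ["地", "知", "之"],
}

def candidates_for_reading(reading):
    """Return the candidate characters to list for a typed reading."""
    return READINGS.get(reading, [])

def apply_text_to_cluster(labels, cluster, text):
    """Assign the same character text to every image in the cluster."""
    for image_id in cluster:
        labels[image_id] = text

# The recognizer mislabeled a whole cluster; the user fixes it once.
labels = {"img_01": "夫", "img_02": "夫", "img_03": "夫"}
cluster = ["img_01", "img_02", "img_03"]

options = candidates_for_reading("천")
apply_text_to_cluster(labels, cluster, options[0])  # the user picks the first candidate
print(labels)  # {'img_01': '天', 'img_02': '天', 'img_03': '天'}
```

Correcting once per cluster rather than once per image is what makes the clustering step pay off: the number of manual inputs scales with the number of distinct characters, not with the number of character images.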

Depending on the embodiment, the clustering process and the code input process may be configured to be performed at the same time. Also depending on the embodiment, the code input unit (130) may provide a tool for correcting segmentation errors or clustering errors found during code input. For example, if a cluster on the screen of FIG. 9 contains a different character, the user can select that character (or characters) and press the 'New cluster/input' button (94) to create a new cluster for the selection and enter its code. Alternatively, when a segmentation error or clustering error is found, an error code may be entered so that the affected character image or cluster is corrected later using the segment verification module (112) or the cluster verification module (122).

Also depending on the embodiment, the apparatus may be configured so that the user can flag Chinese characters that the user does not know, for an expert to enter later. For example, if some character images in a cluster need to be entered separately later, such as when the user does not know the character, the user selects those character images on the screen of FIG. 9 and clicks the 'Enter later' button (94); the selected character images are then removed from the current cluster and added to the list of characters to be entered individually.

Character images for which no code can be entered, such as those with segmentation errors, characters that are smudged or too low in resolution to read, and special symbols, can be excluded from input. For example, the user selects such character images on the screen of FIG. 9 and clicks the 'Exclude from input' button (95) to exclude them.

Once only images of the same character remain in a cluster and the character text is entered, the entered character text is linked to and stored with every character image in that cluster. When character text input for one cluster is finished, the procedure moves on to the next cluster and is repeated.

The character images for which character text has been entered in the code input unit (130) may be stored as a JPEG + JSON file pair and may be used as training data for training the neural network of the character recognition module.

When character text input is finished for all clusters, the inspection and correction unit (140) provides a tool for inspecting and correcting the segments and character text of the document image. As shown in FIG. 10, the inspection and correction unit (140) displays the document image, the segment marks applied to the individual characters in the document image, and the character text assigned to each character in the document image. It also provides a tool with which the user can select a character and correct its bounding box or character text.

Depending on the embodiment, as shown in the example screen of FIG. 10, when the mouse pointer is placed over a character, the character text entered for the characters in the column or row containing that character is displayed. If the entered character text or a segment contains an error, the user selects the corresponding segment and corrects either the segment or the character text.
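The hover behavior can be sketched as a hit test followed by a lookup of the characters that share a column or row. The source does not describe how columns and rows are represented, so the `Segment` fields `line_id` and `order` and the function names below are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    x: int        # left edge of the bounding box
    y: int        # top edge of the bounding box
    w: int        # width
    h: int        # height
    line_id: int  # hypothetical index of the column (or row) the character is in
    order: int    # reading position within that column or row
    text: str     # character text assigned in the code input step


def segment_at(segments: List[Segment], px: int, py: int) -> Optional[Segment]:
    """Return the first segment whose bounding box contains the point (px, py)."""
    for seg in segments:
        if seg.x <= px < seg.x + seg.w and seg.y <= py < seg.y + seg.h:
            return seg
    return None


def line_text_at(segments: List[Segment], px: int, py: int) -> Optional[str]:
    """Return the text of the whole column or row under the mouse position."""
    hit = segment_at(segments, px, py)
    if hit is None:
        return None
    line = sorted(
        (s for s in segments if s.line_id == hit.line_id),
        key=lambda s: s.order,
    )
    return "".join(s.text for s in line)
```

Showing the whole column or row rather than a single character gives the reviewer reading context, which makes a wrong character easier to spot.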

For example, as shown in FIG. 11, the user deletes a segment that has a segmentation error (①), performs segmentation again (②), and then enters the character text for the corresponding character image.

Although the present invention has been described above with several examples, the fact that all components constituting an embodiment of the present invention are described as being combined into one or as operating in combination does not mean that the present invention is necessarily limited to such embodiments. Within the scope of the object of the present invention, one or more of the components may be selectively combined and operated. In addition, although each of the components may be implemented as a separate piece of independent hardware, some or all of the components may be selectively combined and implemented as a computer program having program modules that perform some or all of the combined functions on one or more pieces of hardware. The codes and code segments constituting the computer program can be readily inferred by those skilled in the art. Such a computer program may be stored in a computer- or processor-readable storage medium (Computer Readable Media) and read and executed by a computer or processor, thereby implementing an embodiment of the present invention.

Terms such as "include", "comprise", or "have" used above mean, unless specifically stated otherwise, that the corresponding component may be present. They should therefore be construed as allowing further components to be included, not as excluding other components.

The above description merely illustrates the technical idea of the present invention, and those of ordinary skill in the art to which the present invention pertains will be able to make various modifications and variations without departing from its essential characteristics. Accordingly, the embodiments disclosed herein are intended to explain, not to limit, the technical idea of the present invention, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present invention should be construed according to the claims below, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present invention.

110: segment processing unit
120: cluster processing unit
130: code input unit
140: inspection and proofreading unit

Claims

An apparatus for constructing training data for artificial intelligence character recognition, comprising:
a segment processing unit including a segmenting module that draws a bounding box around each character in a scanned document image, and a segmenting verification module that provides a tool for correcting the bounding boxes;
a cluster processing unit including a clustering module that performs clustering on the character images separated by the bounding boxes so that characters with high similarity are grouped into one cluster, and a cluster verification module that provides a tool for correcting the clusters;
a code input unit that recognizes characters in the clustered character images, matches them with character text, and provides a tool for correcting the matched character text; and
an inspection and proofreading unit that provides a tool for inspecting and correcting the segments and character text of the document image.

The apparatus of claim 1, wherein
the segmenting module also outputs a reliability score when segmenting is performed by artificial intelligence, and
the segmenting verification module displays a segmenting result having low reliability with a bounding box whose color, thickness, background color, or the like differs from that of other segments.

The apparatus of claim 1, wherein the segmenting verification module provides:
a vertical cutting tool that, when the user clicks between two characters arranged one above the other, separates them into an upper and a lower character at the clicked position;
a horizontal cutting tool that, when the user clicks between two characters arranged side by side, separates them into a left and a right character at the clicked position; and
a merge tool that selects two segments and merges them into one segment.

The apparatus of claim 1, wherein
the clustering module extracts a feature vector from each character image, uses it as the feature vector of that character, calculates the similarity between characters as the cosine similarity between their feature vectors, and forms clusters of similar characters based on that similarity.

The apparatus of claim 1, wherein
the clustering module also outputs a reliability score when clustering of the character images is performed by artificial intelligence, and
the cluster verification module displays a character image having low reliability with a color, thickness, background color, border, or the like that differs from that of other character images.

The apparatus of claim 1, wherein
the cluster processing unit provides a cluster optimization function that rearranges the clusters to reflect the edits made so far, and
in the cluster optimization function, the cluster verification module finds and displays, for a given character cluster, other clusters judged to be the same cluster, and allows the user to select from among them those that are in fact the same.

The apparatus of claim 6, wherein
the cluster verification module provides, for remaining clusters that contain no more than a predetermined number of characters, a tool for performing a 'merge' operation that selects and combines the characters identical to the first character, and a 'create' operation that, when characters among the remaining ones differ from the first character but form a cluster with one another, combines them into a new cluster.

The apparatus of claim 1, wherein
the code input unit includes a character recognition module that recognizes the characters in the character images by running a character recognition algorithm, and provides a tool for correcting the recognized character text cluster by cluster.

The apparatus of claim 1, wherein
the inspection and proofreading unit displays the document image, the segment marks applied to the individual characters in the document image, and the character text assigned to each character in the document image, and provides a tool for correcting them.

The apparatus of claim 9, wherein
the inspection and proofreading unit, when the user places the mouse pointer over a character, displays the character text entered for the characters in the column or row to which that character belongs.