KR102627591B1

KR102627591B1 - Operating Method Of Apparatus For Extracting Document Information AND Apparatus Of Thereof

Info

Publication number: KR102627591B1
Application number: KR1020210010713A
Authority: KR
Inventors: 주진선; 김인성; 강새암; 육근식
Original assignee: 주식회사 엘지유플러스
Priority date: 2021-01-26
Filing date: 2021-01-26
Publication date: 2024-01-19
Also published as: KR20220107717A

Abstract

An embodiment relates to a method of operating a device for extracting information from a document. The method may include: detecting at least one character region from a document image of the document; recognizing the characters shown in each detected character region based on a trained OCR model; searching the recognized characters for predefined definition words corresponding to the type of the document; comparing the coordinates of a definition word in the document with the coordinates of the remaining words in the document; and entering the recognized characters into a digital document of a predefined format based on the comparison result.

Description

Operating method of a device for extracting information from a document and the device {Operating Method Of Apparatus For Extracting Document Information AND Apparatus Of Thereof}

Embodiments relate to a method of operating a device for extracting information from a document, and to the device itself.

As digital transformation has become a major topic, there are growing attempts to digitize analog data (paper documents) and unprocessed data, both to make information easier to manage and to use it as input for big data analysis.

Conventionally, converting such data into digital data meant creating a standardized document form and having a person type the content into it by hand. Some organizations introduced OCR (optical character reader) to convert documents digitally, but OCR recognition accuracy drops for paper documents printed on low-quality printers, for low-resolution image files, for noise introduced during scanning such as creases and skew, and for watermarks and background colors in the document. In addition, OCR converts the entire content of a document into characters without capturing the relationships between the recognized characters, so producing a document in a standardized form still requires a person to manually pick out and edit the needed information after OCR.

For documents such as business cards, invoices, and receipts, where the content is short, the format is fixed, and the information to be entered into a standardized form is well defined and always in the same place (business card: name, contact details, email; receipt: date of use, amount, place of use), specific positions can be recognized and entered into the form automatically. However, no technique has been disclosed for converting documents whose format is not fixed and whose contents differ from one another into digital documents.

A related prior publication, Korean Patent No. 10-2012-0111954, discloses a cloud OCR business card information management system.

The invention according to the embodiment aims to provide a method of automatically converting documents of a single type that are written in different layouts into digital documents in a standardized form.

The invention according to the embodiment can provide a method of automatically converting documents of a single type that are written in different layouts into digital documents in a standardized form.

Accordingly, having a computer handle work previously done by people can shorten working time and improve productivity.

Figure 1 is a flowchart of a method of operating a device for extracting information from a document, in an embodiment.
Figure 2 is a diagram for explaining a method of preprocessing a document image, in an embodiment.
Figure 3 is a diagram for explaining a method of detecting character regions through labeling, in an embodiment.
Figure 4 is a diagram for explaining the training of an OCR model, in an embodiment.
Figure 5 is an example of characters recognized in a document image, in an embodiment.
Figure 6 is a diagram for explaining a method of recognizing a definition word that contains spaces and recognizing the characters corresponding to the definition word, in an embodiment.
Figure 7 is a diagram for explaining definition words and the information for definition words, in an embodiment.
Figure 8 is a diagram for explaining definition words and the information for definition words, in an embodiment.
Figure 9 is a diagram for explaining a method of recognizing definition words and the information for definition words based on coordinates, in an embodiment.
Figure 10 is a block diagram for explaining the configuration of a device for extracting information from a document, in an embodiment.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes can be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. All changes, equivalents, and substitutes for the embodiments should be understood to fall within the scope of rights.

The terms used in the embodiments are for description only and should not be construed as limiting. Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification exists, and should be understood not to preclude the existence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as generally understood by a person of ordinary skill in the technical field to which the embodiments belong. Terms defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related technology and, unless explicitly defined in the present application, should not be interpreted in an idealized or excessively formal sense.

In addition, in the description with reference to the accompanying drawings, identical components are given the same reference numerals regardless of the figure, and duplicate descriptions of them are omitted. In describing the embodiments, detailed descriptions of related known technologies are omitted where they would unnecessarily obscure the gist of the embodiments.

Terms such as first, second, A, B, (a), and (b) may be used in describing components of the embodiments. These terms serve only to distinguish one component from another and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that yet another component may also be "connected", "coupled", or "joined" between them.

A component included in one embodiment and a component with a common function are described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment may apply to other embodiments, and detailed descriptions are omitted where they would overlap.

Figure 1 is a flowchart of a method of operating a device for extracting information from a document, in an embodiment.

In an embodiment, the device detects at least one character region from a document image of a document.

The device may receive a document image obtained by scanning a document. A document according to the embodiment may include an academic transcript. The file extension of the document image may be jpg, png, bmp, tiff, and so on. To extract information from the document image, the device may resize it to a predetermined size. For example, a portrait document may be resized to at least 800 pixels wide and at least 1000 pixels high, and a landscape document to at least 1000 pixels wide and at least 800 pixels high.
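The minimum-size rule above can be sketched as follows. This is only an illustration: it assumes the aspect ratio is preserved, that images already large enough are left unchanged, and that a square image is treated as portrait. The function name `target_size` is hypothetical and does not come from the source.

```python
import math
from fractions import Fraction

def target_size(width, height):
    """Return (new_width, new_height) meeting the minimum size for the orientation."""
    # Portrait: at least 800 x 1000 pixels; landscape: at least 1000 x 800 pixels.
    min_w, min_h = (800, 1000) if height >= width else (1000, 800)
    # Smallest uniform scale that satisfies both minimums; never shrink.
    # Fractions keep the arithmetic exact so rounding up is reliable.
    scale = max(Fraction(1), Fraction(min_w, width), Fraction(min_h, height))
    return math.ceil(width * scale), math.ceil(height * scale)
```

For instance, a 400 x 500 scan would be doubled to 800 x 1000, while a 1200 x 900 scan already meets the landscape minimum and keeps its size.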

In an embodiment, because OCR recognizes characters more accurately when there is a clear boundary between characters and background, the device may preprocess the scanned document image in order to detect character regions accurately. Preprocessing may include, for example, removing noise, watermarks, background colors, and frames so that only the characters and a white background remain.

When character regions are detected, part of a table in the document image may be fed into the OCR model and cause characters to be misrecognized, so the device may delete the table.

In the preprocessing according to the embodiment, the document image is differentiated along the x-axis and the y-axis to detect pixels where the brightness changes; those pixels are used to detect at least one straight line forming a table in the document image, and the lines forming the table may then be deleted. Long horizontal and vertical straight lines are detected from the pixel derivative values and removed.

In an embodiment, the Canny Edge Detection algorithm may be used to detect edges in the image.

Because brightness changes sharply around pixels at an image edge, pixels whose values change sharply are detected: Equation 1 is used to differentiate the image along the X-axis and the Y-axis, and Equation 2 is used to compute the magnitude and angle at each detected pixel, so that finally only the longest horizontal and vertical straight lines are detected.

The straight lines detected in this way as belonging to the table area (black, pixel value 0) may be deleted from the corresponding part of the document image by changing the pixel value to 255.
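The deletion step can be illustrated with a much simpler stand-in for the derivative-based detection described above. The sketch below does not implement Canny edge detection; it assumes the image is already a grid of 0 (black) and 255 (white) values and treats any sufficiently long horizontal or vertical run of black pixels as a table line. The function name `remove_long_lines` and the `min_run` parameter are hypothetical.

```python
def remove_long_lines(img, min_run):
    """Whiten horizontal and vertical runs of black pixels at least min_run long."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]

    def runs(values):
        # Yield (start, end) index pairs of consecutive black (0) pixels.
        start = None
        for i, v in enumerate(list(values) + [255]):  # sentinel closes a trailing run
            if v == 0 and start is None:
                start = i
            elif v != 0 and start is not None:
                yield start, i
                start = None

    # Horizontal lines: scan each row of the original image.
    for y in range(h):
        for s, e in runs(img[y]):
            if e - s >= min_run:
                for x in range(s, e):
                    out[y][x] = 255  # delete by painting the line white
    # Vertical lines: scan each column of the original image.
    for x in range(w):
        for s, e in runs(img[y][x] for y in range(h)):
            if e - s >= min_run:
                for y in range(s, e):
                    out[y][x] = 255
    return out
```

Short strokes that belong to characters survive as long as `min_run` is chosen larger than the longest stroke in the text.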

In addition, preprocessing may include binarizing the document image. For example, the document image may be separated into a white background and black text.

In an embodiment, the device may binarize the document image using an adaptive threshold algorithm. Binarization separates the background from the characters, but the brightness and color of the background and of the document image vary with the form and with the scanning or printing conditions. Therefore, instead of binarizing the whole image with a fixed threshold, an adaptive binarization algorithm whose threshold varies with the state of the input image may be applied.

Then, morphological filtering is applied to the binarized document image, and the preprocessing result may be obtained by assigning the pixel values of the document image to the region where the filtered result and the pixel values of the document image overlap.

This embodiment is described in detail with reference to Figure 2.

Figure 2 is a diagram for explaining a method of preprocessing a document image, in an embodiment.

Figure 2(a) is an example of a document image from which the table has been removed, and performing adaptive binarization on the document image of Figure 2(a) yields the result shown in Figure 2(b).

The adaptive binarization algorithm refers to the values of neighboring pixels when setting the binarization threshold. It uses a different threshold for each pixel: the threshold is the mean brightness of the block-size (N × N pixel) neighborhood centered on the pixel, minus a fixed constant.

The adaptive binarization threshold depends on the block size (blocksize) and the correction constant C in Equation 3. blocksize is set to an odd number so that a center point exists; in the embodiment, based on experiments, blocksize may be set to the quotient of the image width divided by 90. If blocksize comes out even for a given document image size, 1 is added to make it odd. The correction constant C may be fixed at 2 in the embodiment.

The device applies a different threshold to each block of the image: if a pixel inside the block is higher than the threshold, it is assigned the value 255 to mark it as background; otherwise it is assigned the value 0 to mark it as part of a character region.
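A minimal pure-Python sketch of the thresholding rule described above (local mean of the block minus C, with blocksize derived from the image width). The function names are hypothetical, and clipping the neighborhood at the image border is an assumption; the source does not say how borders are handled.

```python
def block_size_for(width):
    """blocksize = image width // 90, bumped to the next odd number if even."""
    size = width // 90
    return size + 1 if size % 2 == 0 else size

def adaptive_binarize(gray, blocksize, c=2):
    """Binarize with a per-pixel threshold: mean of the surrounding block minus c."""
    h, w = len(gray), len(gray[0])
    r = blocksize // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighborhood centered on (x, y), clipped at the border (assumption).
            ys = range(max(0, y - r), min(h, y + r + 1))
            xs = range(max(0, x - r), min(w, x + r + 1))
            total = sum(gray[j][i] for j in ys for i in xs)
            thresh = total / (len(ys) * len(xs)) - c
            # Brighter than the threshold: background (255); otherwise text (0).
            out[y][x] = 255 if gray[y][x] > thresh else 0
    return out
```

With these definitions an 800-pixel-wide image gives a quotient of 8, which becomes a blocksize of 9, and a 1000-pixel-wide image gives 11. Image libraries generally provide a faster mean-based adaptive threshold; the loop here is only meant to make the rule explicit.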

Figure 2(c) is the result of applying morphological filtering, and Figure 2(d) is the final preprocessing result.

Binarizing the original image can produce a staircase (aliasing) effect along character outlines, which can destroy the characters' original shapes. To preserve the character shapes of the original document image, morphological filtering (a dilation of the character parts) is applied to the binarization result as in Figure 2(c); the region it has in common with the area where the original image's pixel values are at or below a given level (for example, 100) is found, and the original image's pixel values are assigned to that common region, producing the final preprocessing result shown in Figure 2(d).
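The dilation-and-restore step can be sketched as below. The 3 x 3 neighborhood and the single dilation pass are assumptions (the source does not give the kernel), and the function names are hypothetical; the limit of 100 is the example value from the text.

```python
def dilate_text(binary):
    """Grow black (0) character regions by one pixel using a 3x3 neighborhood."""
    h, w = len(binary), len(binary[0])
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if any(binary[j][i] == 0
                   for j in range(max(0, y - 1), min(h, y + 2))
                   for i in range(max(0, x - 1), min(w, x + 2))):
                out[y][x] = 0
    return out

def restore_strokes(original, binary, dark_limit=100):
    """Keep original pixel values where the dilated text mask and dark original pixels overlap."""
    dilated = dilate_text(binary)
    h, w = len(original), len(original[0])
    return [[original[y][x]
             if dilated[y][x] == 0 and original[y][x] <= dark_limit
             else 255
             for x in range(w)]
            for y in range(h)]
```

The effect is that anti-aliased edge pixels next to a stroke get their original gray values back, while everything else is set to white.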

Returning to Figure 1, the device may detect character regions in the preprocessed image.

In an embodiment, connected component labeling may be performed to detect character regions. Labeling assigns the same number (label) to a character pixel (a pixel whose value is not 255 in the preprocessed document image) and to the non-background pixels adjacent to it, and assigns different numbers to character pixels that are not adjacent.

In the embodiment, 8-connected component labeling was applied. In this method, for a pixel (x, y), any non-background pixel among the neighbors above, below, left, right, and on the diagonals receives the same label. Labels obtained in this way may then be grouped with adjacent labels by distance to detect at least one character region.
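The 8-connected labeling described above can be sketched with a breadth-first flood fill. This is one common way to implement it, not necessarily the one used in the embodiment; the function name `label_components` is hypothetical.

```python
from collections import deque

def label_components(img, background=255):
    """Return (labels, count): 0 marks background, 1..count mark 8-connected components."""
    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if img[y][x] == background or labels[y][x]:
                continue
            # Found an unlabeled character pixel: start a new component.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                # Visit the 8 neighbors: up, down, left, right, and diagonals.
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] != background
                                and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count
```

Two pixels that touch only at a corner end up in the same component, which is the difference from 4-connected labeling.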

After the second labeling pass, word regions can be detected by merging adjacent labels according to the gap between them. If the gap between two labels is at or below a threshold, they are merged into the same label; otherwise the gap is treated as a space. Because label sizes differ from one document image to another, the threshold is relative: for example, two labels may be recognized as one word if the gap between them is smaller than half of the average of the two label widths, in pixels.
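The word-merging rule (gap smaller than half the average width of the two labels) can be sketched as follows, assuming the syllable boxes lie on one text line and are given as hypothetical `(x_start, x_end)` pairs.

```python
def group_into_words(boxes):
    """Group syllable boxes (x_start, x_end) on one text line into words."""
    words = []
    for box in sorted(boxes):
        if words:
            prev = words[-1][-1]
            gap = box[0] - prev[1]
            avg_width = ((prev[1] - prev[0]) + (box[1] - box[0])) / 2
            # Same word if the gap is smaller than half the average label width.
            if gap < avg_width / 2:
                words[-1].append(box)
                continue
        # Otherwise the gap is treated as a space and a new word starts.
        words.append([box])
    return words
```

For four 10-pixel-wide boxes with gaps of 2, 18, and 2 pixels, the result is two words of two syllables each.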

The method of detecting character regions is described in detail with reference to Figure 3.

Figure 3 is a diagram for explaining a method of detecting character regions through labeling, in an embodiment.

Hangul syllables are composed of consonants and vowels, so several phoneme-level labels may be generated even for a single-syllable character. Because the labeling result is used as the input to the OCR model that recognizes characters, labeling may be performed at the syllable level so that one label is produced per syllable.

In an embodiment, a first labeling pass may label at the phoneme level by assigning the same label to pixels that have the same pixel value as a given pixel. A second pass may then use the first-pass result to compute the spacing between labels and merge into one any labels that lie within a certain distance of each other or that overlap.

The labeling results shown in Figure 3 are the results of the first and second labeling passes. After the first pass, the device may perform the second pass by merging adjacent labels so that one label is produced per syllable.
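The second pass can be sketched as repeated merging of bounding boxes that overlap or lie within a small distance of each other. Boxes are assumed to be `(x0, y0, x1, y1)` tuples, and `max_gap` is a hypothetical parameter standing in for the "certain distance" in the text, whose value the source does not give.

```python
def box_gap(a, b):
    """Largest axis-wise gap between two boxes (x0, y0, x1, y1); <= 0 means overlap or contact."""
    dx = max(a[0], b[0]) - min(a[2], b[2])
    dy = max(a[1], b[1]) - min(a[3], b[3])
    return max(dx, dy)

def merge_close_boxes(boxes, max_gap):
    """Merge phoneme-level boxes into syllable-level boxes until nothing more merges."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if box_gap(boxes[i], boxes[j]) <= max_gap:
                    a, b = boxes[i], boxes[j]
                    # Replace the pair with the bounding box that covers both.
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```

For example, three phoneme boxes that make up one syllable collapse into a single box, while a box belonging to the next syllable, 14 pixels away, stays separate when `max_gap` is 2.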

Next, in step 120, the device recognizes the characters shown in each of the detected character regions based on the trained OCR model.

In an embodiment, the trained OCR model is a model trained as an MLP, taking as input training data in which character images are normalized and assigned pixel values and class labels, and producing characters as output.

The OCR model is described in detail with reference to Figure 4.

Figure 4 is a diagram for explaining the training of an OCR model, in an embodiment.

A machine learning method may be used for character recognition. The model is trained in advance for the embodiment, and once training is complete the resulting OCR model can recognize characters in real time.

In an embodiment, characters may be recognized using an MLP (Multi-Layer Perceptron). As shown in Figure 4, the MLP consists of an Input Layer, two Hidden Layers, and an Output Layer. The Input Layer receives the pixel values and class label of the character image to be learned as a vector and passes them to the Hidden Layer, and the output of the Hidden Layer is passed to the Output Layer.

Each training vector fed to the Input Layer consists of the pixel values of a character image and a target value. The output of the output layer is compared with the target value of the training vector, and if their difference exceeds the allowed tolerance, the weights are adjusted according to the learning rule as training proceeds.

The training data may be images of black characters on a white background. In the embodiment, training data may be built from document data rendered as images in various fonts and from the labeling results of document images. Each character is converted into one image, its size is normalized (for example, to 30 pixels wide by 30 pixels high), and pixel values and a class label are assigned to produce a training sample.
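Building one training sample as described above can be sketched like this. Nearest-neighbor resampling is an assumption (the source only says the size is normalized), and the function names are hypothetical.

```python
def normalize_glyph(img, size=30):
    """Nearest-neighbor resize of a glyph image to size x size, flattened row by row."""
    h, w = len(img), len(img[0])
    return [img[y * h // size][x * w // size]
            for y in range(size)
            for x in range(size)]

def make_sample(img, class_label):
    """Pair the normalized pixel vector with its class label."""
    return normalize_glyph(img), class_label
```

A 30 x 30 glyph yields a 900-element pixel vector, which would be the input size of the MLP under this normalization.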

In the embodiment, 11,257 classes were defined so that Hangul, English, and some special characters can be recognized, with 100 training samples per class (11,172 Hangul syllable combinations, 52 upper- and lower-case English letters, and 23 special characters).
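The class counts can be checked with a few lines of arithmetic. The figure of 11,172 matches the number of precomposed modern Hangul syllables (19 initial consonants, 21 vowels, and 28 final options including "no final"). Note that the three listed groups add up to 11,247, which is ten fewer than the stated 11,257 classes; the text does not explain the difference.

```python
# Modern Hangul syllable blocks: 19 initials x 21 vowels x 28 finals
# (27 final consonants plus "no final").
hangul_syllables = 19 * 21 * 28
english_letters = 26 * 2          # upper and lower case
special_characters = 23           # figure quoted in the text

listed_total = hangul_syllables + english_letters + special_characters
stated_classes = 11_257           # class count quoted in the text
samples = stated_classes * 100    # 100 training samples per class

# Cross-check against the Unicode Hangul Syllables range U+AC00..U+D7A3.
unicode_range_size = ord("힣") - ord("가") + 1
```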

Table 1 describes the MLP training procedure.

The value of each neuron is computed using Equation 4, whose terms are the size of the input vector, the input values of the input vector (the pixel values of the OCR training data), and the weights.

The activation function in Equation 5 activates a neuron's output if its value is greater than a threshold and deactivates it otherwise; the embodiment uses the sigmoid function.
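The neuron computation described above (a weighted sum of the inputs followed by a sigmoid activation) can be sketched in plain Python. The equations themselves are not reproduced in this text, so the sketch follows the standard form and omits a bias term, which is an assumption; the function names are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def layer_forward(inputs, weights):
    """weights[j][i] is the weight from input i to neuron j; returns activated outputs."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def mlp_forward(pixels, layers):
    """Pass a pixel vector through each weight matrix in turn (hidden layers, then output)."""
    values = pixels
    for weights in layers:
        values = layer_forward(values, weights)
    return values
```

For the network in Figure 4, `layers` would hold three weight matrices (two hidden layers and the output layer), and the index of the largest output would select the recognized class.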

Equation 6 expresses the learning rule for the j-th weight of the k-th output layer neuron for the n-th training vector.

In Equation 6, v_kj denotes the j-th weight of the k-th Output Layer neuron, and η denotes the learning rate. z_nk is the output value of the j-th Hidden neuron for the n-th training vector, and t_nk denotes the k-th target value of the n-th training vector. net_nk denotes the value of the k-th output layer neuron for the n-th training vector.

Equation 7 is the learning rule for the i-th weight of the j-th Hidden Layer neuron for the n-th training vector; it also involves the size of the output layer.

Here, w_ji denotes the i-th weight of the j-th Hidden Layer neuron, η is the learning rate, x_ni is the i-th input value of the n-th training vector, t_nk is the k-th target value of the n-th training vector, and O_nk is the output value of the k-th Output Layer neuron for the n-th training vector.
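For orientation, the textbook gradient-descent (backpropagation) rules for a sigmoid network trained on squared error can be written with the symbols defined above. This is a sketch under stated assumptions, not a reproduction of Equations 6 and 7: it shows a single hidden layer, writes $z_{nj}$ for the output of the j-th hidden neuron, and uses $K$ for the size of the output layer and $\delta_{nk}$ as a helper symbol; $K$ and $\delta_{nk}$ are not from the source.

```latex
\begin{aligned}
net_{nk} &= \sum_{j} v_{kj}\, z_{nj}, \qquad O_{nk} = \frac{1}{1 + e^{-net_{nk}}} \\
\delta_{nk} &= (t_{nk} - O_{nk})\, O_{nk}\,(1 - O_{nk}) \\
v_{kj} &\leftarrow v_{kj} + \eta\, \delta_{nk}\, z_{nj} \\
w_{ji} &\leftarrow w_{ji} + \eta \left( \sum_{k=1}^{K} \delta_{nk}\, v_{kj} \right) z_{nj}\,(1 - z_{nj})\, x_{ni}
\end{aligned}
```

The factor $O_{nk}(1 - O_{nk})$ is the derivative of the sigmoid, and the sum over the $K$ output neurons in the last line is where the size of the output layer enters the hidden-layer rule.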

When training is complete, the OCR model is generated; feeding it the parts of a document image that were labeled as character regions allows the characters to be recognized in real time.

Figure 5 is an example of characters recognized in a document image, in an embodiment.

This shows the result of recognizing characters by applying the trained OCR model to a document image in an embodiment. The second-pass labeling result was used as the input, and the character for each label is output as the result. Based on the word-region detection result, the number of sub-labels each word label contains is checked, and spacing is then applied to the recognition result.
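Applying spacing from the sub-label counts can be sketched as follows, assuming the recognized characters arrive in reading order and each word label reports how many syllable labels it contains. The function name and the example strings are hypothetical.

```python
def apply_spacing(chars, sub_label_counts):
    """Join recognized characters into words using the number of sub-labels per word label."""
    words, pos = [], 0
    for count in sub_label_counts:
        # Each word label covers the next `count` recognized characters.
        words.append("".join(chars[pos:pos + count]))
        pos += count
    return " ".join(words)
```

For example, seven recognized syllables with word labels of sizes 5 and 2 would be emitted as a five-syllable word, a space, and a two-syllable word.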

다만, 양식이 다양한 문서의 유형, 예컨대 성적증명서는 학교마다 그 양식이 상이하다. 예를 들어, 성적증명서의 글자를 OCR로 모두 인식하여 디지털 문서로 생성하더라도 필요한 정보(이름, 학교, 생년월일, 전공, 과목명, 성적, 학점 등)를 문서화하기 위해서는 OCR 결과를 사람이 일일이 편집해야 하는 번거로움이 존재한다.However, types of documents with various formats, such as transcripts, have different formats for each school. For example, even if all letters in a transcript are recognized by OCR and created as a digital document, the OCR results must be manually edited by a person in order to document the necessary information (name, school, date of birth, major, subject name, grades, credits, etc.) There is a hassle to do this.

실시예에서는 하나의 유형에 대해 각기 다른 양식의 문서를 분석하고, 해당 유형의 문서에서 필요한 공통 정보를 정형화된 양식으로 정의하여 활용할 수 있다. 이에, 실시예는 OCR 결과를 재편집하지 않고 자동으로 정의된 양식에 입력할 수 있는 방법을 제안할 수 있다.In an embodiment, documents in different formats for one type can be analyzed, and common information required from documents of that type can be defined and utilized in a standardized format. Accordingly, the embodiment may propose a method of automatically entering the OCR results into a defined form without re-editing them.

도 1로 돌아가, 단계(130)에서 장치는, 인식된 글자들로부터 문서의 유형에 대응하여 미리 정의된 정의 단어를 검색한다.Returning to Figure 1, in step 130 the device searches for a predefined definition word corresponding to the type of document from the recognized characters.

단계(140)에서 장치는, 문서 내 정의 단어의 좌표; 및 문서 내 정의 단어를 제외한 나머지 단어의 좌표를 인식하여 두 좌표를 비교할 수 있다.In step 140 the device determines: the coordinates of the definition word in the document; And the two coordinates can be compared by recognizing the coordinates of words other than the defined words in the document.

실시예에서, 문서의 유형에 대해 공통으로 입력되는 정보를 분석하여 정형화된 디지털 문서를 자동으로 생성하기 위해 성적 증명서에 대해서 분석된 결과를 이용하여 아래의 표 2와 같은 정의 단어를 정의할 수 있다.In an embodiment, definition words as shown in Table 2 below can be defined using the analysis results for the transcript to automatically generate a standardized digital document by analyzing information commonly input for the type of document. .

정의 단어는 문서 이미지 내에서 단어의 음절 사이에 공백이 있어서 단어 영역으로 검출되지 않는 경우가 있다. 이에 실시예에서는, 문서 이미지 내에 하나의 음절로 단어 영역의 라벨링이 구성된 라벨 박스에 대해서 전후에 있는 라벨 박스와의 병합 여부를 정의 단어를 참조하여 확인할 수 있다.Definition Words may not be detected as word areas because there are spaces between syllables of the word in the document image. Accordingly, in the embodiment, it is possible to check whether a label box in which a word area is labeled with a single syllable in a document image is merged with label boxes before and after it by referring to the definition word.

This is described with reference to FIG. 6.

FIG. 6 is a diagram for explaining, in an embodiment, a method of recognizing a definition word that contains spaces and recognizing the characters corresponding to the definition word.

As described above, for label boxes in the document image whose word-area labeling consists of a single syllable, the device checks whether each should be merged with the label boxes before and after it, and may merge them into a single label box.
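The merge check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `LabelBox` type, the function name, and the assumption that the boxes of one text line arrive sorted left to right are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabelBox:
    text: str  # OCR result for this box
    x0: int
    y0: int
    x1: int
    y1: int


def merge_spaced_definition_words(boxes: List[LabelBox],
                                  definition_words: List[str]) -> List[LabelBox]:
    """Merge runs of single-syllable boxes that spell a definition word.

    Assumes `boxes` holds the label boxes of one text line, sorted left to right.
    """
    merged: List[LabelBox] = []
    i = 0
    while i < len(boxes):
        matched = False
        if len(boxes[i].text) == 1:
            # Try longer definition words first so the longest match wins.
            for word in sorted(definition_words, key=len, reverse=True):
                run = boxes[i:i + len(word)]
                if (len(run) == len(word)
                        and all(len(b.text) == 1 for b in run)
                        and "".join(b.text for b in run) == word):
                    # Replace the run with one box spanning all of its members.
                    merged.append(LabelBox(word,
                                           min(b.x0 for b in run),
                                           min(b.y0 for b in run),
                                           max(b.x1 for b in run),
                                           max(b.y1 for b in run)))
                    i += len(word)
                    matched = True
                    break
        if not matched:
            merged.append(boxes[i])
            i += 1
    return merged
```

For example, the boxes "성", "명", "홍길동" with the definition word "성명" would come back as "성명" followed by "홍길동".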

In an embodiment, the device may look for a neighboring label box that satisfies three conditions: the difference between the y-coordinate of the center point of the definition word's label box and the y-coordinate of the center point of the neighboring label box is smaller than the y-coordinate threshold; the neighboring label box lies to the right of the end-point x-coordinate of the definition word's label box; and the difference between the end-point x-coordinate of the definition word's label box and the start-point x-coordinate of the neighboring label box is within the x-coordinate threshold. Such a label box may be recognized as the information for that definition word.

In Equation 8, one threshold applies to the y-coordinate and the other to the x-coordinate. P_h denotes the height of the definition word's label box, and P_w denotes its width. C denotes a constant, which in the embodiment may be set to 2 through experiment.

FIG. 7 is a diagram for explaining a definition word and the information for the definition word in an embodiment.

In an embodiment, the device may recognize the coordinates of a definition word 710 in order to recognize the basic information of Table 2 for a transcript. As illustrated, the information 720 corresponding to the definition word 710 is placed to the right of the definition word 710. Accordingly, a label box located within the threshold to the right of the coordinates of the definition word's label box may be recognized as the information 720 corresponding to the definition word 710.

FIG. 8 is a diagram for explaining a definition word and the information for the definition word in an embodiment.

In an embodiment, the content may be recognized after the definition words corresponding to the basic information have been recognized. In an embodiment, the device recognizes the y-coordinate of the lowest label box 901 among the basic information and then, below that coordinate, recognizes the coordinates of the label box 902 corresponding to the first definition word defined for the content that appears from the left of the document image.

For example, because "credits" ("학점") and "grade" ("성적") appear two or more times depending on the number of columns in the transcript, the device may search, among the label boxes within the y-coordinate threshold of the center point of the first such label box, for label boxes with the same OCR result, and recognize their coordinates.

In one embodiment, based on the coordinates of the definition words, the area 910 corresponding to the basic information and the area 920 corresponding to the content are each designated as a region of interest (ROI), and the content of the document may be recognized ROI by ROI. In an embodiment, when there are multiple label boxes for "credits" and "grade", a region of interest may be designated for each. For example, the coordinates of the right-hand word of the two may be assigned as the end point of the ROI, and the start point of the first ROI may be the leftmost pixel, which is the starting pixel of the document image.
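The column ROI assignment can be sketched as follows. The text states only where each ROI ends and where the first ROI starts; starting every later ROI at the end of the previous one is an assumption, and the function name and tuple layout are hypothetical.

```python
from typing import List, Tuple


def column_rois(right_word_ends: List[int], image_left: int = 0) -> List[Tuple[int, int]]:
    """Build one (x_start, x_end) ROI per repeated "credits"/"grade" column pair.

    `right_word_ends` holds, for each pair, the end x-coordinate of the
    right-hand word of the two.
    """
    rois = []
    start = image_left  # the first ROI starts at the leftmost pixel of the image
    for end in sorted(right_word_ends):
        rois.append((start, end))
        start = end  # assumption: the next ROI begins where this one ends
    return rois
```

With right-hand words ending at x = 400 and x = 800, this gives the ROIs (0, 400) and (400, 800).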

FIG. 9 is a diagram for explaining, in an embodiment, a method of recognizing a definition word and the information for the definition word based on coordinates.

In an embodiment, after the regions of interest are designated, subject names, credits, and grades may be recognized. To this end, the device may first recognize the credit and grade entries by finding the label boxes that lie within the x-coordinate threshold of Equation 8 of the center-point x-coordinate of the label box corresponding to the definition word in the region of interest.

In an embodiment, after the label boxes for credits and grades are found, a label box that belongs to the same ROI and lies within the y-coordinate threshold of the center-point y-coordinate of each such label box may be found and recognized as the subject name.
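The two-pass matching inside one ROI (columns first by x, then rows by y) can be sketched as follows. It assumes the boxes passed in already belong to a single ROI, that image y-coordinates grow downward, and that the thresholds `t_x` and `t_y` are supplied by the caller; the `Box` type and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Box:
    text: str
    cx: float  # center-point x
    cy: float  # center-point y


def below_header(header: Box, boxes: List[Box], t_x: float) -> List[Box]:
    """Boxes under a column header whose center x is within t_x of the header's."""
    return [b for b in boxes
            if b is not header and b.cy > header.cy and abs(b.cx - header.cx) <= t_x]


def extract_rows(boxes: List[Box], credit_header: Box, grade_header: Box,
                 t_x: float, t_y: float) -> List[Tuple[Optional[str], str, Optional[str]]]:
    """Return (subject, credit, grade) rows for one ROI."""
    credits = below_header(credit_header, boxes, t_x)
    grades = below_header(grade_header, boxes, t_x)
    used = {id(b) for b in credits + grades + [credit_header, grade_header]}
    rows = []
    for c in sorted(credits, key=lambda b: b.cy):
        # A grade or subject belongs to this row if its center y is within t_y.
        grade = next((g.text for g in grades if abs(g.cy - c.cy) <= t_y), None)
        subject = next((b.text for b in boxes
                        if id(b) not in used and abs(b.cy - c.cy) <= t_y), None)
        rows.append((subject, c.text, grade))
    return rows
```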

In an embodiment, if there is a label box that is not a credit, grade, or subject name and that contains the text "semester" ("학기"), the device may create a region of interest for each semester and match the credits, grades, and subject names within each region of interest.

In step 150, the device inputs the recognized characters into a digital document in a predefined format based on the comparison result.

In an embodiment, a dictionary may be defined in advance to correct OCR errors in the document content, or to automatically correct recognized content that requires normalization (for example, normalizing the course classifications in transcripts, which differ from school to school).

OCR errors take forms such as substitution (a character is recognized as a different character), missing (a character is not recognized), insertion (a character that is not present is added), combination (two characters are merged into one), and decomposition (one character is split into two). In an embodiment, the error types that occur frequently among the characters recognized by the trained OCR model are analyzed, and a dictionary for correcting them may be generated.

In an embodiment, the dictionary may be created in .csv format, and in the correction step the device may read the dictionary and correct or normalize the characters according to its rules.

Table 3 above shows the types of dictionaries created for the transcript example.

In an embodiment, the OCR result is compared against the dictionary's character correction list; when a word matching a left-hand entry of the list is found, it may be replaced with the corresponding right-hand word.
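The left-to-right replacement can be sketched as follows. The two-column CSV layout (misrecognized word, corrected word) is an assumption based on the "left" and "right" wording, and the sample entries and function names are made up for illustration.

```python
import csv
import io
from typing import Dict, List


def load_corrections(csv_text: str) -> Dict[str, str]:
    """Parse a correction list where each row is: wrong_word,right_word."""
    reader = csv.reader(io.StringIO(csv_text))
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def correct_words(words: List[str], corrections: Dict[str, str]) -> List[str]:
    """Replace each word found in the left column with its right-column entry."""
    return [corrections.get(w, w) for w in words]


# Hypothetical dictionary content; a real list would be built from observed OCR errors.
sample_csv = "학접,학점\n성젹,성적\n"
corrections = load_corrections(sample_csv)
print(correct_words(["학접", "성젹", "A+"], corrections))
```

In practice the CSV would be read from a file with `open(path, newline="", encoding="utf-8")` instead of from an in-memory string.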

In an embodiment, a course classification recognition list may be consulted in order to recognize the course classification markings, which differ among the transcripts issued by each school, and normalize them into two categories, "liberal arts" ("교양") and "major" ("전공"). For example, if the school name on the transcript matches the school name in the course classification recognition list, and the text recognized as the subject name in the content recognition result matches the text that appears second, the entry may be corrected to the normalized course classification in accordance with the predefined digital document form.

In an embodiment, the longest common subsequence and the edit distance algorithm may be used together for text comparison. If the school name and major on the transcript are identical to those in the automatic course classification recognition list, and a text comparison between the recognized subject name and the subject name in the list yields a longest common subsequence (LCS) similarity of 80% or more and an edit distance smaller than one third of the string length, the course classification written in the list may be assigned.
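The combined test can be sketched as follows. The text does not say how the LCS similarity is normalized or which string's length the edit distance is compared against; dividing the LCS length by the longer string's length and using the list entry's length for the one-third bound are both assumptions, as are the function names.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-wise DP)."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def is_same_subject(ocr_name: str, list_name: str) -> bool:
    """True if the OCR'd subject name is close enough to the list entry."""
    if not ocr_name or not list_name:
        return False
    # Assumption: similarity = LCS length / length of the longer string.
    similarity = lcs_length(ocr_name, list_name) / max(len(ocr_name), len(list_name))
    # Assumption: the one-third bound uses the list entry's length.
    return similarity >= 0.8 and edit_distance(ocr_name, list_name) < len(list_name) / 3
```

Requiring both conditions means a name has to share most of its characters in order and also be only a few edits away, which guards against long strings that merely share a scattered subsequence.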

In an embodiment, the device may generate a digital document by inputting the recognized characters into a digital form in a predefined format.

FIG. 10 is a block diagram for explaining, in an embodiment, the configuration of a device for extracting information from a document.

The device 1000 according to an embodiment may include a memory 1010 and a processor 1020, and may include a program executed by the processor 1020. In an embodiment, the program may include the method of operating the device described with reference to FIGS. 1 to 9.

In an embodiment, the device 1000 detects at least one character area from a document image of a document.

The device 1000 may receive a document image obtained by scanning a document. The document according to an embodiment may include a transcript. The file extension of the document image may include jpg, png, bmp, tiff, and the like. The device 1000 may resize the document image to a predetermined size in order to extract information from it. For example, a portrait document may be sized to at least 800 pixels wide and at least 1000 pixels high, and a landscape document to at least 1000 pixels wide and at least 800 pixels high.

In an embodiment, because OCR accuracy improves when there is a clear boundary between the characters and the background, the device 1000 may preprocess the scanned document image in order to detect character areas accurately. Preprocessing may include, for example, removing noise, watermarks, background color, and frames so that only the characters of the document and a white background remain.

In an embodiment, the device 1000 may binarize the document image using an adaptive threshold algorithm. Binarization separates the background from the characters. Because the brightness and color of the background and the document image vary with the form and with the scan or print condition, the image is not binarized as a whole with a fixed threshold; instead, an adaptive binarization algorithm whose threshold varies with the state of the input image may be applied.
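The idea of a threshold that follows local image conditions can be illustrated with a plain-Python mean-based adaptive threshold. The text does not specify which adaptive variant is used, so the local-mean rule, the window size `block`, and the offset `c` below are assumptions chosen for illustration; a production system would more likely call an image library.

```python
from typing import List


def adaptive_threshold(gray: List[List[int]], block: int = 3, c: int = 5) -> List[List[int]]:
    """Binarize a grayscale image (rows of 0-255 values).

    A pixel becomes 0 (character) if it is darker than the mean of its
    block x block neighborhood minus c; otherwise it becomes 255 (background).
    """
    h, w = len(gray), len(gray[0])
    r = block // 2
    out = [[255] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Neighborhood clipped at the image border.
            window = [gray[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            local_mean = sum(window) / len(window)
            if gray[y][x] < local_mean - c:
                out[y][x] = 0
    return out
```

On a page whose background darkens gradually from one side to the other, a single fixed threshold tends either to miss faint characters on the bright side or to mark background on the dark side as text, whereas the local mean adapts to both.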

The device 1000 may detect character areas in the preprocessed image.

In an embodiment, after the secondary labeling, the device 1000 may detect word areas by merging adjacent labels according to the gap between them. If the gap between two labels is at or below a threshold, they are merged into the same label; otherwise the gap is treated as a space between words. Because label sizes differ from one document image to another, the threshold is relative: for example, two labels may be recognized as one word if the gap between them, in pixels, is smaller than half the average of the two labels' widths.
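The gap rule can be sketched as follows, using the "smaller than half the average width" example as the merge condition. The representation of a label as an `(x_start, x_end)` pair on a single text line, sorted left to right, and the function name are assumptions.

```python
from typing import List, Tuple


def merge_labels_into_words(labels: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Group character labels into word areas by the gap between neighbors.

    `labels` are (x_start, x_end) extents on one text line, sorted left to right.
    """
    words: List[List[Tuple[int, int]]] = []
    for x0, x1 in labels:
        if words:
            px0, px1 = words[-1][-1]           # previous character label
            gap = x0 - px1
            avg_width = ((px1 - px0) + (x1 - x0)) / 2
            if gap < avg_width / 2:            # small gap: same word
                words[-1].append((x0, x1))
                continue
        words.append([(x0, x1)])               # large gap: a space, start a new word
    # Each word area spans from its first label's start to its last label's end.
    return [(w[0][0], w[-1][1]) for w in words]
```

For four labels of width 10 separated by gaps of 2, 18, and 2 pixels, the threshold is 5 pixels, so the result is two word areas.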

The device 1000 recognizes the characters represented in each of the detected character areas based on the trained OCR model.

The device 1000 searches the recognized characters for a predefined definition word corresponding to the type of the document.

The device 1000 may recognize the coordinates of the definition word in the document and the coordinates of the remaining words other than the definition word, and compare the two sets of coordinates.

In an embodiment, in order to automatically generate a standardized digital document by analyzing the information commonly entered for a document type, definition words may be defined using the results of analyzing transcripts.

A definition word may fail to be detected as a single word area when the document image has spaces between the syllables of the word. Accordingly, in an embodiment, for a label box in the document image whose word-area labeling consists of a single syllable, whether it should be merged with the label boxes before and after it may be checked by referring to the definition words.

In one embodiment, the device 1000 may designate, based on the coordinates of the definition words, the area corresponding to each kind of content as a region of interest, and recognize the content of the document region by region.

The device 1000 inputs the recognized characters into a digital document in a predefined format based on the comparison result.

OCR errors take forms such as substitution (a character is recognized as a different character), missing (a character is not recognized), insertion (a character that is not present is added), combination (two characters are merged into one), and decomposition (one character is split into two). In an embodiment, the error types that occur frequently among the characters recognized by the trained OCR model may be analyzed, and a dictionary for correcting them may be generated.

The embodiment relates to a technique for converting documents such as transcripts into standardized digital documents. More specifically, it may provide an RPA technique that applies computer vision to an image file to remove noise and watermarks and detect character areas, applies machine-learning-based OCR (Optical Character Recognition) to recognize the characters, and converts the result into a digital document in a predefined standardized format based on specific recognized words.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code, such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiment, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to it. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art can apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A method of operating a device for extracting information from a document, the method comprising:
detecting at least one character area from a document image of the document;
recognizing the characters represented in each of the detected character areas based on a trained OCR model;
searching the recognized characters for a predefined definition word corresponding to the type of the document;
comparing coordinates of the definition word in the document with coordinates of the remaining words in the document other than the definition word; and
inputting the recognized characters into a digital document in a predefined format based on the comparison result,
wherein the searching of the recognized characters for the predefined definition word corresponding to the type of the document comprises:
detecting a definition word that contains a space by referring to the predefined definition word to determine whether to merge a label box labeled with a single syllable with the syllables of the label boxes located before and after it; and
designating, based on the coordinates of the label box of the definition word, a region of interest corresponding to basic information and a region of interest corresponding to content, and recognizing a plurality of label boxes based on the x-coordinate of the center point of the label box corresponding to the definition word within each region of interest.

The method of claim 1,
wherein the detecting of the at least one character area from the document image comprises:
classifying, into the same label, a first pixel having a pixel value in the document image and at least one pixel, among the pixels adjacent to the first pixel, having the same pixel value as the first pixel; and
detecting the at least one character area by classifying adjacent labels based on their distance from the classified label.

The method of claim 2,
further comprising detecting a word area based on the gaps between the at least one character area.

The method of claim 1,
wherein the trained OCR model is a model trained with an MLP that takes as input training data in which character images are normalized and pixel values and class labels are assigned to the characters, and produces the characters as output.

The method of claim 1,
wherein the document includes a transcript.

The method of claim 1,
wherein the searching for the predefined definition word corresponding to the type of the document comprises:
checking whether a first character labeled as a single syllable among the recognized characters is to be labeled together with a character located before or after the first character;
merging the first character and the characters located before and after the first character into one word based on the result of the check; and
recognizing the definition word from the merged word.

The method of claim 1,
wherein the comparing of the coordinates of the definition word in the document with the coordinates of the remaining words other than the definition word comprises
recognizing the coordinates of each of the recognized characters based on its position within the document image.

The method of claim 1,
wherein the comparing of the coordinates of the definition word in the document with the coordinates of the remaining words other than the definition word comprises:
generating a region of interest based on the coordinates of the definition word; and
recognizing, among the remaining words, at least one word within a threshold of the x-coordinate corresponding to the center point of the region of interest.

The method of claim 8,
wherein the inputting of the recognized characters into the digital document in the predefined format further comprises
matching the definition word with the at least one word within the threshold.

The method of claim 1,
wherein the inputting of the recognized characters into the digital document in the predefined format comprises
inputting the definition word and the remaining words at predetermined positions in the digital document.

The method of claim 1,
wherein the inputting of the recognized characters into the digital document in the predefined format comprises:
detecting whether a recognition error has occurred among the recognized characters by referring to a dictionary in which recognition errors of the recognized characters and corrections for those errors are defined; and
correcting the detected recognition error by referring to the dictionary.

The method of claim 1,
further comprising removing noise from the document image in order to detect the at least one character area from the document image of the document.

The method of claim 12,
wherein the removing of the noise from the document image comprises:
differentiating the document image along the x-axis and the y-axis to detect pixels at which a change in brightness appears;
detecting, using the detected pixels, at least one straight line constituting a table in the document image; and
deleting the at least one straight line.

The method of claim 12,
wherein the removing of the noise from the document image comprises:
binarizing the document image;
extracting a common portion of the binarized document image and the pixel values of the document image; and
assigning a pixel value of the document image to the common portion.

A computer program stored in a computer-readable medium in combination with hardware to execute the method of any one of claims 1 to 14.

A device for extracting information from a document, the device comprising:
one or more processors;
a memory; and
one or more programs stored in the memory and configured to be executed by the one or more processors,
wherein the program comprises:
detecting at least one character area from a document image of the document;
recognizing the characters represented in each of the detected character areas based on a trained OCR model;
searching the recognized characters for a predefined definition word corresponding to the type of the document;
comparing coordinates of the definition word in the document with coordinates of the remaining words in the document other than the definition word; and
inputting the recognized characters into a digital document in a predefined format based on the comparison result,
wherein the searching of the recognized characters for the predefined definition word corresponding to the type of the document comprises:
detecting a definition word that contains a space by referring to the predefined definition word to determine whether to merge a label box labeled with a single syllable with the syllables of the label boxes located before and after it; and
designating, based on the coordinates of the label box of the definition word, a region of interest corresponding to basic information and a region of interest corresponding to content, and recognizing a plurality of label boxes based on the x-coordinate of the center point of the label box corresponding to the definition word within each region of interest.

The device of claim 16,
wherein the detecting of the at least one character area from the document image comprises:
classifying, into the same label, a first pixel having a pixel value in the document image and at least one pixel, among the pixels adjacent to the first pixel, having the same pixel value as the first pixel; and
detecting the at least one character area by classifying adjacent labels based on their distance from the classified label.

The device of claim 17,
wherein the program further comprises detecting a word area based on the gaps between the at least one character area.

The device of claim 16,
wherein the trained OCR model is a model trained with an MLP that takes as input training data in which character images are normalized and pixel values and class labels are assigned to the characters, and produces the characters as output.

The device of claim 16,
wherein the document includes a transcript.

The device of claim 16,
wherein the searching for the predefined definition word corresponding to the type of the document comprises:
checking whether a first character labeled as a single syllable among the recognized characters is to be labeled together with a character located before or after the first character;
merging the first character and the characters located before and after the first character into one word based on the result of the check; and
recognizing the definition word from the merged word.

The device of claim 16,
wherein the comparing of the coordinates of the definition word in the document with the coordinates of the remaining words other than the definition word comprises
recognizing the coordinates of each of the recognized characters based on its position within the document image.

The device of claim 16,
wherein the comparing of the coordinates of the definition word in the document with the coordinates of the remaining words other than the definition word comprises:
generating a region of interest based on the coordinates of the definition word; and
recognizing, among the remaining words, at least one word within a threshold of the x-coordinate corresponding to the center point of the region of interest.

The device of claim 23,
wherein the inputting of the recognized characters into the digital document in the predefined format further comprises
matching the definition word with the at least one word within the threshold.

The device of claim 16,
wherein the inputting of the recognized characters into the digital document in the predefined format comprises
inputting the definition word and the remaining words at predetermined positions in the digital document.

The device of claim 16,
wherein the inputting of the recognized characters into the digital document in the predefined format comprises:
detecting whether a recognition error has occurred among the recognized characters by referring to a dictionary in which recognition errors of the recognized characters and corrections for those errors are defined; and
correcting the detected recognition error by referring to the dictionary.

The device of claim 16,
wherein the program further comprises removing noise from the document image in order to detect the at least one character area from the document image of the document.

The device of claim 27,
wherein the removing of the noise from the document image comprises:
differentiating the document image along the x-axis and the y-axis to detect pixels at which a change in brightness appears;
detecting, using the detected pixels, at least one straight line constituting a table in the document image; and
deleting the at least one straight line.

The device of claim 27,
wherein the removing of the noise from the document image comprises:
binarizing the document image;
extracting a common portion of the binarized document image and the pixel values of the document image; and
assigning a pixel value of the document image to the common portion.