KR102629133B1

KR102629133B1 - Document recognition device using optical character recognition and document structuring tags for building AI learning dataset

Info

Publication number: KR102629133B1
Application number: KR1020230107353A
Authority: KR
Inventors: 황선희; 조창희; 고형석; 이홍재
Original assignee: (주)유알피
Priority date: 2023-08-17
Filing date: 2023-08-17
Publication date: 2024-01-25

Abstract

The present invention relates to a document recognition device that uses optical character recognition and document structuring tags to build an artificial intelligence learning dataset. The device extracts text through OCR, identifies the format and structure of a document, and attaches document structuring tags, so that the content of the document and of its tables can be understood as in the original document according to the document's format. The device comprises: an OCR analysis unit that identifies format items according to the type of an input target document and extracts the text corresponding to the format items; a document structuring unit that generates document structured data including at least one format item identified in the target document and relationship information about the text of the format item; a table recognition unit that determines the structure of a table included in the target document, extracts the text contained in the cell areas of the table, and generates table format data; and a document tagging unit that attaches document structuring tags identifying the document structure to the document structured data and converts it into text data in a predetermined format.

Description

Document recognition device using optical character recognition and document structuring tags for building artificial intelligence learning dataset {DOCUMENT RECOGNITION DEVICE USING OPTICAL CHARACTER RECOGNITION AND DOCUMENT STRUCTURING TAGS FOR BUILDING AI LEARNING DATASET}

The present invention relates to a document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset. The device receives an image file of a scanned document, extracts text through optical character recognition (OCR), identifies the format and structure of the document, and attaches document structuring tags, so that the content of the document and of its tables can be understood as in the original document according to the document's format, and so that the document's format and tables can be reproduced and edited.

As optical character recognition (OCR) technology advances, companies are automating their image and document processing tasks to maximize process efficiency.

Instead of having people enter text content into a system by hand in order to turn collected images into internal data, it has become possible to extract text automatically by applying OCR technology. More recently, with advances in artificial-intelligence-based OCR, more sophisticated text extraction has become possible through context recognition over the recognized characters and words.

OCR technology extracts text from an image by identifying characters in the scanned image, recognizing and extracting the identified characters as text, and arranging the text according to its image coordinates.

OCR is generally implemented by reading the entire document and recognizing its characters. As a result, unnecessary parts that repeat throughout the document, such as headers, footers, and page numbers, are sometimes recognized as well, and the inconvenience of removing or classifying them by hand remains.

In addition, because document formatting is discarded during text extraction, documents that contain formatting or tables often cannot be understood as the original document intended.

Accordingly, there is a need for a technique that, when performing OCR, recognizes a document in a way that reflects its format alongside the extracted text, so that the structure of the document can be identified.

To solve the above problems, an object of the present invention is to provide a device that, when performing OCR, identifies the structure of a document according to its type, extracts text according to the document's format, and attaches document structure tags, thereby recognizing the document as the original intended and converting it into data.

A document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention may include: an OCR analysis unit that identifies format items according to the type of an input target document and extracts the text corresponding to the format items; a document structuring unit that generates document structured data including at least one format item identified in the target document and relationship information about the text of the format item; a table recognition unit that determines the structure of a table included in the target document, extracts the text contained in the cell areas of the table, and generates table format data; and a document tagging unit that attaches document structuring tags identifying the document structure to the document structured data and converts it into text data in a predetermined format.

In addition, the target document includes at least one of an image containing text and a PDF file, and the OCR analysis unit applies optical character recognition (OCR) technology to recognize and extract text from a document in text-image form.

In addition, the OCR analysis unit determines the document type from the meta information of the target document, looks up the template data for that document type, and locates the position area of each format item in the target document according to the position areas and pattern rules of the format items defined in the template data.

In addition, for each format item identified in the target document, the OCR analysis unit extracts the text contained in the position area of the format item and generates pair data of format item and text content.

In addition, the table recognition unit identifies each cell area of a table in the image data of the area where the table is located in the target document, determines the matrix (row and column) structure of the whole table, extracts the text of each identified cell area, attaches a table tag that identifies the cell, and links the result to the document structured data.

In addition, the table tag expresses the relative position information of cells within the table by formulating it as a matrix structure.

In addition, the table tag includes a cell tag, and the cell tag indicates information that a specific cell has been added or that a plurality of cells have been merged.

In addition, the document tagging unit writes the document structured data in a markup language using predefined document structuring tags and stores it.

When text is extracted from PDF documents and scanned document images, the format of the document is identified and tags are attached according to that format, so that only the text of the required format items can be selected for editing and processing.

In addition, when text extracted from PDF documents and scanned document images is used as training data, unnecessary text that appears repeatedly, such as headers, footers, and page numbers, can easily be excluded.

In addition, the structure of a table in the document and the content of each cell can be understood in connection with each other, and the table can be reproduced from the tagged document data.

Figure 1 is an overall relationship diagram of a document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.
Figure 2 is a functional block diagram of the document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.
Figure 3 is an example of table format data in which a table included in a document is represented as a matrix structure and table tags are attached, in the document recognition device according to an embodiment of the present invention.
Figure 4 shows data obtained by recognizing the document structure and text in an image file containing text, attaching document structure tags, and converting the result into text data in a predetermined format, in the document recognition device according to an embodiment of the present invention.
Figure 5 shows the hardware structure of the document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention are described in detail with reference to the drawings. However, the spirit of the present invention is not limited to the embodiments presented. A person skilled in the art who understands the spirit of the present invention could readily propose other regressive inventions, or other embodiments falling within the scope of the spirit of the present invention, by adding, changing, or deleting components within the scope of the same spirit, and these are also to be regarded as falling within the scope of the spirit of the present invention.

The terms used below are defined in consideration of their functions in the present invention and may vary according to the inventor's intention or custom, so they should be defined on the basis of the content of this specification as a whole. Where a detailed description of a well-known configuration or function related to the present invention is judged likely to obscure the gist of the invention, that description is omitted.

Hereinafter, a document recognition device 100 using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to the present invention is described with reference to the drawings.

Figure 1 is an overall relationship diagram of a document recognition device using optical character recognition and document structuring tags for building an artificial intelligence learning dataset (hereinafter, the "document recognition device") according to an embodiment of the present invention.

Referring to Figure 1, the document recognition device 100 is connected over a network to at least one user terminal 200 and at least one application server 300 and can communicate with them.

The network referred to in the present invention may be a core network integrated with a wired public network, a wireless mobile communication network, the mobile Internet, or the like. It may also mean the worldwide open computer network architecture that provides the TCP/IP protocol and the services that exist in its upper layers, such as HTTP (HyperText Transfer Protocol), HTTPS (HyperText Transfer Protocol Secure), Telnet, and FTP (File Transfer Protocol). It is not limited to these examples and broadly means any data communication network capable of transmitting and receiving data in various forms.

Because formatting is discarded when text is extracted from a PDF or image document, making it impossible to understand the document as the original intended, the document recognition device 100 of the present invention extracts text from the input document, determines the structure of the input document, such as header, footer, page number, and body, and attaches document structure tags to the extracted text. If the body contains a table, the device formulates the link information between the table's structure and the content of each cell as a matrix structure and attaches table tags, converting the table into data so that its content can be understood as in the original document.

To this end, the device receives a PDF document or a scanned document (image) from at least one of the user terminal 200 and the external server 300, identifies format items according to the type of the received target document, extracts the text corresponding to the format items, and generates document structured data including at least one format item identified in the target document and relationship information about the text of the format item.

In addition, it determines the structure of a table included in the target document, extracts the text contained in the cell areas of the table, generates table format data, and links that data to the document structured data.

When document structuring is complete, document structuring tags that identify the document structure are attached to the document structured data, which is then converted into text data in a predetermined format.

In the present invention, the user terminal 200 or the external server 300 can register a target document through a user interface or an integration interface provided by the document recognition device 100.

Here, the target document is at least one of an image containing text and a PDF file, and may include an image file produced by scanning a paper document, a scan saved as a PDF, and a PDF converted from an electronic document such as a Word, PPT, or Hangul word-processor document.

The external server 300 may be a server that produces documents or processes collected documents, or a system that archives and stores documents.

Figure 2 is a functional block diagram of the document recognition device 100 using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.

Referring to Figure 2, the document recognition device 100 may include an OCR analysis unit 110, a document structuring unit 120, a table recognition unit 130, and a document tagging unit 140.

The OCR analysis unit 110 identifies format items according to the type of the input target document and extracts the text corresponding to the format items.

The target document input to the OCR analysis unit 110 is an image or PDF document containing text, and may be a paper, report, press release, administrative document, or the like that contains formatting and tables.

The OCR analysis unit 110 applies optical character recognition (OCR) technology to recognize and extract text from an input target document that is in text-image form.

As an example, optical character recognition may proceed in the following steps.

First, if the target document is corrupted by noise, or its image is skewed or rotated, a preprocessing step may be performed to restore the image to a form suitable for analysis.

Next, text is detected in the preprocessed image. Because a document contains not only text but also various objects such as pictures, graphs, and lines, the text regions are identified first, and text recognition is then performed to determine which characters the detected regions contain.

Deep-learning-based OCR models such as a CNN (convolutional neural network), an RNN (recurrent neural network), or a CRNN, which combines a CNN and an RNN, can be applied to the text recognition step.

The OCR analysis unit 110 includes a format identification unit 111 and a text extraction unit 112.

The format identification unit 111 determines the document type from the meta information of the target document, looks up the template data for that document type, and extracts text according to the position areas and pattern rules of the format items defined in the template data.

For example, a general document may be composed of format items such as a title, body, header, footer, and page.

As another example, if the input target document is a research paper, it may consist of a title, abstract, introduction, methods, results, discussion, acknowledgements, references, and so on.

The format identification unit 111 registers and manages templates that define the document formats used in each domain. When it receives a document for OCR analysis, it checks the meta information received with the document to determine the document's type.

It then checks the format items defined in the template data for that document type and looks up the extraction rules used to identify each format item.

The format identification rules included in the template data may include the position area (coordinates) within the document image, the character style (size, font type), and whether special characters or symbols are present.
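The template rules described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `FormatRule`, `TextBlock`, `classify`, and `REPORT_TEMPLATE`, the use of page-fraction coordinates, and the specific thresholds and regular expression are all assumptions introduced here. The sketch only shows how a position area, a character-size condition, and a character pattern could be combined to assign a format item to a detected text block.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FormatRule:
    """Hypothetical rule for one format item in a document-type template."""
    item: str                                   # e.g. "title", "header"
    region: tuple[float, float, float, float]   # (x0, y0, x1, y1) as page fractions
    min_font_size: float = 0.0                  # character-style condition
    pattern: Optional[str] = None               # regex the text must match, if set


@dataclass
class TextBlock:
    """A detected text block with its bounding box and estimated font size."""
    text: str
    box: tuple[float, float, float, float]
    font_size: float


def inside(box, region):
    """True if box lies entirely within region."""
    return (box[0] >= region[0] and box[1] >= region[1]
            and box[2] <= region[2] and box[3] <= region[3])


def classify(block, rules):
    """Return the first format item whose rule matches the block, else None."""
    for rule in rules:
        if not inside(block.box, rule.region):
            continue
        if block.font_size < rule.min_font_size:
            continue
        if rule.pattern and not re.search(rule.pattern, block.text):
            continue
        return rule.item
    return None


# Assumed template for a generic report; rule order encodes priority.
REPORT_TEMPLATE = [
    FormatRule("header", (0.0, 0.0, 1.0, 0.08)),
    FormatRule("page", (0.0, 0.92, 1.0, 1.0), pattern=r"^\s*-?\s*\d+\s*-?\s*$"),
    FormatRule("footer", (0.0, 0.92, 1.0, 1.0)),
    FormatRule("title", (0.0, 0.08, 1.0, 0.25), min_font_size=16.0),
    FormatRule("body", (0.0, 0.08, 1.0, 0.92)),
]
```

In this sketch a block at the bottom of the page is treated as a page number only if its text looks like one; otherwise it falls through to the footer rule.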

Once the format identification unit 111 has determined the area of a format item, the text extraction unit 112 recognizes and extracts the text in that area.

The text extraction unit 112 can extract the text of a specific area of an image.

The extracted text is mapped to and stored as pair data of format item and text content, and the document structuring unit 120 then structures and stores it according to the relationships between format items.

The document structuring unit 120 generates document structured data that includes at least one format item identified by the OCR analysis unit 110 and relationship information about the extracted text.

For example, the document structured data may take a form that includes pair data consisting of at least one attribute name and attribute value.

The document structured data may also include information indicating the relationships among multiple pieces of pair data.
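One possible shape for such document structured data is sketched below. The patent does not specify a concrete data layout, so the function name `build_structured_data`, the dictionary keys, and the choice to express relationships as a parent-child mapping are assumptions made for illustration. The sketch also assumes each format item name appears at most once.

```python
def build_structured_data(pairs, parent_of):
    """Nest (format item, text) pairs into a tree.

    pairs:     list of (item, text) tuples, item names assumed unique
    parent_of: dict mapping an item name to the name of its parent item
    Returns the list of top-level nodes.
    """
    nodes = {item: {"item": item, "text": text, "children": []}
             for item, text in pairs}
    roots = []
    for item, _ in pairs:
        parent = parent_of.get(item)
        if parent in nodes:
            # Relationship information: this item is a substructure of its parent.
            nodes[parent]["children"].append(nodes[item])
        else:
            roots.append(nodes[item])
    return roots


# Usage: a table located in the body becomes a child of the body node.
doc = build_structured_data(
    [("title", "Annual Report"), ("body", "Revenue grew."), ("table", "<table2*4>")],
    {"table": "body"},
)
```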

Meanwhile, the format identification unit 111 can recognize a table contained in the target document and extract the table's position area.

The extracted position area is analyzed by the table recognition unit 130, which structures the table and extracts the text within its cells.

The table recognition unit 130 determines the structure of a table included in the target document, extracts the text contained in the cell areas of the table, and generates table format data.

The table recognition unit 130 includes a table structure identification unit 131 and a cell text extraction unit 132.

The table structure identification unit 131 analyzes the position area of the table extracted by the format identification unit 111 to identify the area of the whole table, and identifies each cell area of the table in the image data of the area where the table is located.

At this time, the relative position information of the cells within the table is formulated and expressed as a matrix structure.

Figure 3 is an example of table format data in which a table included in a document is represented as a matrix structure and table tags are attached, in the document recognition device 100 using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.

With reference to Figure 3, the process of recognizing a table contained in a document and formulating it as a matrix structure is described.

Figure 3(a) is an example of a table contained in a document.

The table structure identification unit 131 identifies the outline of the table to determine the whole area of the table within the document, and recognizes the cell lines within the table area to determine the area of each cell.

The overall structure of the table can then be formulated as a matrix structure such as (b).

In addition, for each identified cell, the table structure identification unit 131 defines a table tag from which the cell's position and merge status can be determined.

In particular, where one or more cells are merged, the table tag is defined to include the range of the merged cells.

For example, the structure of table (a) in Figure 3 is determined to be 2 rows by 4 columns and can be defined with the table tag '<table2*4>'.

For the first cell of table (a) in Figure 3, that is, the cell containing the text '구분' ('Category'), cells (1,1) and (1,2) of the matrix structure are recognized as merged, and the corresponding table tag can be defined as '<data:(1,1)(1,2)>'.
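The two tag forms quoted above, '<table2*4>' and '<data:(1,1)(1,2)>', can be generated with a few lines of code. The sketch below is illustrative only: the function names and the row-span/column-span inputs are assumptions, and the form of the tag for an unmerged cell (a single coordinate pair) is inferred from the merged-cell example rather than stated in the text.

```python
def table_tag(n_rows, n_cols):
    """Tag for the whole table, e.g. 2 rows by 4 columns -> '<table2*4>'."""
    return f"<table{n_rows}*{n_cols}>"


def cell_positions(row, col, row_span=1, col_span=1):
    """1-based (row, col) grid slots covered by a cell, in row-major order."""
    return [(r, c)
            for r in range(row, row + row_span)
            for c in range(col, col + col_span)]


def cell_tag(positions):
    """Tag listing every grid slot a cell covers; merged cells list several."""
    return "<data:" + "".join(f"({r},{c})" for r, c in positions) + ">"


# The merged first cell of the example table covers (1,1) and (1,2).
merged = cell_tag(cell_positions(1, 1, col_span=2))
```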

Figure 3(c) is an example in which table (a) has been recognized and table tags attached.

For example, as in Figure 3(c), the table tags may follow the structure of a markup language similar to HTML or XML, with start tags, end tags, elements, and attributes.

However, the tags are not limited to this and may be tags representing any text-based document format that includes tags of various forms.

By formulating and defining the table structure as a matrix structure in this way, the tagged table format data can be reproduced accurately as a table, and table edits such as merging and adding cells become possible in the reproduced table.
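To illustrate how tagged table data could be turned back into a table, the sketch below parses the tag forms '<table2*4>' and '<data:(1,1)(1,2)>' and fills a grid. The way tags and cell text are paired (a list of `(tag, text)` tuples) and the function names are assumptions for this example; the exact serialization in Figure 3(c) is not reproduced here.

```python
import re


def parse_table_tag(tag):
    """'<table2*4>' -> (2, 4)."""
    m = re.fullmatch(r"<table(\d+)\*(\d+)>", tag)
    if m is None:
        raise ValueError(f"not a table tag: {tag!r}")
    return int(m.group(1)), int(m.group(2))


def parse_cell_tag(tag):
    """'<data:(1,1)(1,2)>' -> [(1, 1), (1, 2)]."""
    m = re.fullmatch(r"<data:((?:\(\d+,\d+\))+)>", tag)
    if m is None:
        raise ValueError(f"not a cell tag: {tag!r}")
    return [(int(r), int(c)) for r, c in re.findall(r"\((\d+),(\d+)\)", m.group(1))]


def rebuild_grid(table, cells):
    """Rebuild a row-major grid; every slot of a merged cell gets the same text."""
    n_rows, n_cols = parse_table_tag(table)
    grid = [[None] * n_cols for _ in range(n_rows)]
    for tag, text in cells:
        for r, c in parse_cell_tag(tag):
            grid[r - 1][c - 1] = text   # tags are 1-based, lists are 0-based
    return grid


grid = rebuild_grid("<table2*4>", [("<data:(1,1)(1,2)>", "Category"),
                                   ("<data:(2,1)>", "A")])
```

Because a merged cell lists every slot it covers, the merge can be recovered by grouping slots that came from the same tag, which is what makes editing operations such as splitting or re-merging possible on the rebuilt table.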

In addition, when cells have been added or merged in the edited table, the cell tag can indicate information that a specific cell has been added or that multiple cells have been merged.

The cell text extraction unit 132 extracts text from within each cell area that has been recognized and defined by a table tag.

If the text within a cell area spans multiple lines, it is preferable to recognize the cell's outline and wrap to the next line within the cell area when reading the text, so that the text is read in the same order as the original.
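A simple way to approximate this reading order is to group the words recognized inside one cell into lines by their vertical position, then read each line from left to right. The sketch below assumes word-level OCR output as `(text, x, y)` tuples already restricted to a single cell, with `y` increasing downward; the function name and the `line_tol` tolerance are hypothetical, and the grouping heuristic stands in for whatever method the device actually uses.

```python
def read_cell_text(words, line_tol=5.0):
    """Join the words of one cell into lines, top to bottom, left to right.

    words:    list of (text, x, y) tuples for a single cell
    line_tol: maximum y difference for two words to share a line (assumed)
    """
    lines = []
    for text, x, y in sorted(words, key=lambda w: (w[2], w[1])):
        if lines and abs(y - lines[-1]["y"]) <= line_tol:
            lines[-1]["words"].append((x, text))
        else:
            # Start a new line once the vertical gap exceeds the tolerance.
            lines.append({"y": y, "words": [(x, text)]})
    return "\n".join(
        " ".join(text for _, text in sorted(line["words"])) for line in lines
    )


cell_text = read_cell_text([("world", 60, 10), ("Hello", 5, 11), ("again", 5, 30)])
```

Restricting the input to words inside the cell boundary is what keeps text from a neighboring cell on the same visual line from being interleaved.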

한편, 표인식부(130)는 표 태그가 부착된 표 서식 데이터를 문서 구조화 데이터에 연결한다.Meanwhile, the table recognition unit 130 connects table format data with a table tag attached to document structured data.

일례로, 제목, 머리말, 꼬리말, 본문의 구조로 인식되어 문서 구조화 데이터에 저장되고, 본문 내에 표가 위치하는 경우 본문의 하위 구조로 표 태그가 부착된 텍스트가 문서 구조화 데이터에 포함될 수 있다. For example, if the structure of the title, header, footer, and body is recognized and stored in the document structured data, and a table is located within the body, text with a table tag attached as a substructure of the body may be included in the document structured data.

문서 태깅부(140)는 문서 구조화 데이터에 문서 구조를 식별하는 문서 구조화 태그를 부착하여 정해진 형식의 텍스트 데이터로 변환한다.The document tagging unit 140 attaches a document structuring tag that identifies the document structure to the document structured data and converts it into text data in a predetermined format.

즉, 사전에 정의된 문서 구조화 태그를 사용하여 문서 구조화 데이터를 마크업 언어(Markup Language)로 작성하여 저장할 수 있다.In other words, document structured data can be written and stored in markup language using predefined document structuring tags.

Here, a markup language, such as HTML or XML, includes start tags, end tags, elements, and attributes, and can express the structure of the document.
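
As a rough sketch of this conversion, the following uses Python's standard `xml.etree.ElementTree` module to write structured data out as XML. The tag names (`document`, `title`, `header`, `body`, `table`, `cell`, `footer`), the dictionary layout, and the sample values are all assumptions made for illustration; the description does not list the actual predefined tags.

```python
import xml.etree.ElementTree as ET

# Made-up document structured data; keys and values are illustrative only.
structured = {
    "title": "Purchase Order",
    "header": "Sample Company",
    "body": "Please supply the items listed below.",
    "footer": "Page 1",
    "table": [
        {"row": 0, "col": 0, "text": "Item"},
        {"row": 0, "col": 1, "text": "Qty"},
    ],
}


def to_markup(data: dict) -> str:
    """Attach (hypothetical) structuring tags and return XML text."""
    doc = ET.Element("document")
    for name in ("title", "header"):
        ET.SubElement(doc, name).text = data[name]

    body = ET.SubElement(doc, "body")
    body.text = data["body"]

    # The table is nested under the body, as a substructure of it.
    table = ET.SubElement(body, "table")
    for cell in data["table"]:
        el = ET.SubElement(table, "cell", row=str(cell["row"]), col=str(cell["col"]))
        el.text = cell["text"]

    ET.SubElement(doc, "footer").text = data["footer"]
    return ET.tostring(doc, encoding="unicode")


markup = to_markup(structured)
```

Here elements carry the text of each format item, while attributes such as `row` and `col` carry positional information about table cells.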

FIG. 4 shows data produced by the document recognition device 100 using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention. The device recognizes the document structure and text in an image file containing text and attaches document structure tags to convert the result into text data in a predetermined format. The predefined document structure tag for each format item is attached to the generated document structured data, which is then expressed in a markup language.

As shown in FIG. 4, a document that has been tagged and converted into markup language form can be stored flexibly as text regardless of the document's structure, and can also be converted back into structured data for use.

In addition, the structure of a document can be defined in advance as a rule (schema) for each document type. Tagged document text can be parsed according to this schema and converted into document structured data, and only the required format text can be selectively extracted for editing and processing.
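
Schema-driven parsing and selective extraction might look like the sketch below. A plain dictionary stands in for the schema here, which is a simplification; the document type name, the tag names, and the functions `parse_document` and `extract` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical schema: for each document type, the elements it must contain.
SCHEMAS: Dict[str, List[str]] = {
    "purchase_order": ["title", "header", "body", "footer"],
}


def parse_document(markup: str, doc_type: str) -> Dict[str, str]:
    """Parse tagged text into structured data according to the type's schema."""
    root = ET.fromstring(markup)
    result: Dict[str, str] = {}
    for tag in SCHEMAS[doc_type]:
        el = root.find(tag)
        if el is None:
            raise ValueError(f"missing required element: {tag}")
        result[tag] = (el.text or "").strip()
    return result


def extract(markup: str, doc_type: str, wanted: List[str]) -> Dict[str, str]:
    """Return only the requested format items."""
    parsed = parse_document(markup, doc_type)
    return {name: parsed[name] for name in wanted}


# Made-up tagged document text.
sample = (
    "<document><title>Purchase Order</title><header>Sample Company</header>"
    "<body>Please supply the items listed below.</body>"
    "<footer>Page 1</footer></document>"
)
selected = extract(sample, "purchase_order", ["title", "body"])
```

Because the schema is keyed by document type, the same parser can serve several document types by adding entries to `SCHEMAS`.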

For example, headers, footers, and pages that recur throughout a document can be identified and excluded when the document is processed.

As another example, when extracting keywords for a document, the text recognized as the title can be used instead of analyzing the entire document content.
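
The two examples above can be illustrated with a short sketch. It assumes the tagged text is XML with top-level `title`, `header`, `footer`, and `page` elements; these tag names, the functions `strip_repeated` and `title_keywords`, and the length-based word filter are hypothetical choices rather than part of the described device.

```python
import xml.etree.ElementTree as ET
from typing import List

REPEATED_TAGS = {"header", "footer", "page"}  # hypothetical tag names


def strip_repeated(markup: str) -> str:
    """Drop top-level header, footer, and page elements before processing."""
    root = ET.fromstring(markup)
    for child in list(root):  # copy the list so removal is safe while iterating
        if child.tag in REPEATED_TAGS:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")


def title_keywords(markup: str) -> List[str]:
    """Take keyword candidates from the title only, not the whole document."""
    root = ET.fromstring(markup)
    title = root.findtext("title", default="")
    # Naive filter (an assumption): keep lower-cased words longer than two letters.
    return [w.lower() for w in title.split() if len(w) > 2]


# Made-up tagged document text.
sample = (
    "<document><title>Annual Safety Report</title><header>Sample Company</header>"
    "<body>Text</body><footer>Confidential</footer><page>1</page></document>"
)
cleaned = strip_repeated(sample)
keywords = title_keywords(sample)
```

Note that `strip_repeated` only removes direct children of the root element; nested occurrences would need a recursive walk.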

FIG. 5 is a diagram showing the hardware structure of the document recognition device 100 using optical character recognition and document structuring tags for building an artificial intelligence learning dataset according to an embodiment of the present invention.

Referring to FIG. 5, the hardware structure of the document recognition device 100 includes a central processing unit 1000, a memory 2000, a user interface 3000, a database interface 4000, a network interface 5000, a web server 6000, and the like.

The user interface 3000 provides input and output interfaces to the user by means of a graphical user interface (GUI).

The database interface 4000 provides an interface between the database and the hardware structure.

The network interface 5000 provides network connections between devices owned by users.

The web server 6000 provides a means for users to access the hardware structure over a network. Most users can use the document recognition device 100 by connecting to the web server remotely.

Each of the components or method steps described above may be implemented as computer-readable code on a computer-readable recording medium or transmitted through a transmission medium. A computer-readable recording medium is a data storage device capable of storing data that can be read by a computer system.

Examples of computer-readable recording media include, but are not limited to, databases, ROM, RAM, CD-ROMs, DVDs, magnetic tape, floppy disks, and optical data storage devices. Transmission media may include carrier waves transmitted over the Internet or various types of communication channels. The computer-readable recording medium may also be distributed across network-coupled computer systems so that the computer-readable code is stored and executed in a distributed manner.

In addition, at least one component applied to the present invention may include, or be implemented by, a processor such as a central processing unit (CPU) or a microprocessor that performs the respective function, and two or more of the components may be combined into a single component that performs all operations or functions of the combined components. Part of at least one component applied to the present invention may also be performed by another of these components. Communication between the components may be performed through a bus (not shown).

100: Document recognition device
110: OCR analysis unit
111: Format identification unit 112: Text extraction unit
120: Document structuring unit
130: Table recognition unit
131: Table structure identification unit 132: Cell text extraction unit
140: Document tagging unit
200: User terminal
300: External server

Claims

A document recognition device utilizing optical character recognition and document structuring tags, comprising:
an OCR analysis unit that identifies format items according to the type of an input target document and extracts text corresponding to the format items;
a document structuring unit that generates document structured data including at least one format item identified in the target document and relationship information about the text of the format item;
a table recognition unit that determines the structure of a table included in the target document, extracts text included in a cell area of the table, generates table format data, and links the generated table format data to the document structured data; and
a document tagging unit that attaches a document structuring tag identifying the document structure to the document structured data and converts it into text data in a predetermined format,
wherein the target document includes at least one of an image and a PDF file containing text,
wherein the OCR analysis unit identifies the document type from the meta information of the target document and, according to the format item identification rules defined in the template data for that document type, applies optical character recognition (OCR) to recognize and extract the text located in each format item area,
wherein the format item identification rules defined in the template data include location coordinates on the document image and character styles,
wherein the document tagging unit generates XML-based document text by attaching document structuring tags to the document structured data, based on a schema that defines the document structure for each target document type, and
wherein the XML-based document text can be parsed according to the schema and converted into document structured data, and the text of specific format items can be extracted from it.

(Deleted)

The document recognition device utilizing optical character recognition and document structuring tags of claim 1,
wherein the OCR analysis unit, for each format item identified in the target document, extracts the text included in the location area of the format item to generate format item-text content pair data.

The document recognition device utilizing optical character recognition and document structuring tags of claim 1,
wherein the table recognition unit identifies each cell area of the table from the image data of the area where the table is located in the target document, identifies the matrix structure of the entire table, extracts the text of each identified cell area, attaches a table tag that identifies the cell, and links the result to the document structured data.

The document recognition device utilizing optical character recognition and document structuring tags of claim 5,
wherein the table tag expresses the relative position information of cells within the table, formulated as a matrix structure.

The document recognition device utilizing optical character recognition and document structuring tags of claim 6,
wherein the table tag includes a cell tag, and the cell tag indicates that a specific cell has been added or that a plurality of cells have been merged.

(Deleted)