KR102309562B1

KR102309562B1 - Apparatus for pdf table reconstruction and method thereof

Info

Publication number: KR102309562B1
Application number: KR1020200188392A
Authority: KR
Inventors: 김태윤; 이안준; 어진솔; 신은성
Original assignee: 주식회사 애자일소다
Priority date: 2020-12-30
Filing date: 2020-12-30
Publication date: 2021-10-06

Abstract

The present invention relates to an apparatus for reconstructing a table included in a PDF document and a method of operating the same. According to one aspect of the present invention, a method of operating a PDF table reconstruction apparatus is provided. The method comprises the steps of: receiving a PDF document including a table to be reconstructed; extracting PDF data including coordinate information of characters and lines included in the PDF document; searching for a table area to be reconstructed based on the PDF data; generating a primary data structure of a reconstruction table composed of a plurality of cells, based on the coordinate information of the characters and lines corresponding to the table area to be reconstructed in the PDF data; searching for a merged cell among the plurality of cells constituting the primary data structure; and splitting the merged cell to generate a final data structure of the reconstruction table. According to the present invention, the PDF table reconstruction apparatus makes it possible to reconstruct a table included in a PDF document.

Description

PDF table reconstruction apparatus and method of operating the same

The present invention relates to an apparatus for reconstructing a table included in a PDF document and a method of operating the same.

Securities companies and similar firms regularly publish reports summarizing the management and financial status of companies. In these reports, data on items such as the income statement, financial position, cash flow, and key indicators for each company are often included in a Portable Document Format (PDF) document in table form so that they can be read at a glance. However, when the outlines of the cells constituting a table in a PDF document are not clearly drawn, or when the table contains cells formed by merging multiple cells, there are limits to reusing the table in other types of documents.

Therefore, it is necessary to clarify the outlines of the cells and to split the merged cells in a table included in a PDF document.

An object of the present invention is to provide a technique for reconstructing a table included in a PDF document.

Another object of the present invention is to provide a technique for reconstructing a table included in an image file of a PDF document.

According to one aspect of the present invention, there is provided a method of operating a PDF table reconstruction apparatus, the method comprising: receiving a PDF document including a table to be reconstructed; extracting PDF data including coordinate information of characters and lines included in the PDF document; searching for a table area to be reconstructed based on the PDF data; generating a primary data structure of a reconstruction table composed of a plurality of cells, based on the coordinate information of the characters and lines corresponding to the table area to be reconstructed among the PDF data; searching for a merged cell among the plurality of cells constituting the primary data structure; and splitting the merged cell to generate a final data structure of the reconstruction table.

In an embodiment, the extracting of the PDF data may include extracting the PDF data by parsing the PDF document.

In an embodiment, the extracting of the PDF data may include extracting the PDF data by parsing the PDF document using PDFMiner.

In an embodiment, the final data structure may be in JavaScript Object Notation (JSON) format.

In an embodiment, the merged cell may be a cell in which L horizontal cells are merged, and the generating of the final data structure may include generating the final data structure by splitting the merged cell into L horizontal cells.

In an embodiment, the merged cell may be a cell in which M vertical cells are merged, and the generating of the final data structure may include generating the final data structure by splitting the merged cell into M vertical cells.

In an embodiment, the merged cell may be an upper cell corresponding to N lower cells, and the generating of the final data structure may include generating the final data structure by splitting the merged cell into N vertical cells.

In an embodiment, in the step of searching for the table area to be reconstructed, if the table area to be reconstructed is not found, the PDF file is converted into an image file, coordinate information of the characters and lines included in the image file is extracted using a deep-learning-based image analysis model applied to the image file, and the table area to be reconstructed is searched for based on the coordinate information of the characters and lines extracted from the image file. In the step of generating the primary data structure, the primary data structure of the reconstruction table composed of a plurality of cells may then be generated based on the coordinate information of the characters and lines that correspond to the table area to be reconstructed, among the coordinate information extracted from the image file.

In an embodiment, the deep-learning-based image analysis model may be MMDetection.

According to another aspect of the present invention, there is provided a method of operating a PDF table reconstruction apparatus, the method comprising: receiving an image file of a PDF document; extracting coordinate information of characters and lines included in the image file based on a deep-learning-based image analysis model; searching for a table area to be reconstructed based on the coordinate information of the characters and lines included in the image file; generating a primary data structure of a reconstruction table composed of a plurality of cells, based on the coordinate information of the characters and lines corresponding to the table area to be reconstructed among the coordinate information of the characters and lines included in the image file; searching for a merged cell among the plurality of cells constituting the primary data structure; and splitting the merged cell to generate a final data structure of the reconstruction table.

According to another aspect of the present invention, there is provided a PDF table reconstruction apparatus comprising: a memory storing instructions for reconstructing a table included in a PDF document; and a processor that reconstructs the table of the PDF document according to the instructions stored in the memory, wherein the instructions are instructions for extracting PDF data including coordinate information of characters and lines included in the PDF document, searching for a table area to be reconstructed based on the PDF data, generating a primary data structure of a reconstruction table composed of a plurality of cells based on the coordinate information of the characters and lines corresponding to the table area to be reconstructed among the PDF data, searching for a merged cell among the plurality of cells constituting the primary data structure, and splitting the merged cell to generate a final data structure of the reconstruction table.

According to another aspect of the present invention, there is provided a PDF table reconstruction apparatus comprising: a memory storing instructions for reconstructing a table included in an image file of a PDF document; and a processor that reconstructs the table included in the image file of the PDF document according to the instructions stored in the memory, wherein the instructions are instructions for extracting coordinate information of characters and lines included in the image file based on a deep-learning-based image analysis model, searching for a table area to be reconstructed based on the coordinate information of the characters and lines included in the image file, generating a primary data structure of a reconstruction table composed of a plurality of cells based on the coordinate information of the characters and lines corresponding to the table area to be reconstructed among the coordinate information of the characters and lines included in the image file, searching for a merged cell among the plurality of cells constituting the primary data structure, and splitting the merged cell to generate a final data structure of the reconstruction table.

According to an embodiment of the present invention, it becomes possible to reconstruct a table included in a PDF document.

Further, according to another embodiment of the present invention, it becomes possible to reconstruct a table included in an image file of a PDF document.

FIG. 1 is a flowchart of a method of operating a PDF table reconstruction apparatus according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method of operating a PDF table reconstruction apparatus according to another embodiment of the present invention.
FIGS. 3 to 7 are diagrams illustrating examples of merged cells according to an embodiment of the present invention.
FIG. 8 is a block diagram of a PDF table reconstruction apparatus according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating a result of parsing PDF data according to an embodiment of the present invention.

The present invention may be modified in various ways and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention. In describing the present invention, detailed descriptions of related known techniques are omitted where they would unnecessarily obscure the gist of the present invention. In addition, singular expressions used in this specification and the claims should generally be construed to mean "one or more" unless stated otherwise.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. In the description with reference to the drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions thereof are omitted.

FIG. 1 is a flowchart of a method of operating a PDF table reconstruction apparatus that reconstructs a table of a PDF document according to an embodiment of the present invention.

In step S1010, a PDF document is received. Specifically, the PDF table reconstruction apparatus may receive a PDF document including a table to be reconstructed from an external device via wired or wireless communication.

In step S1020, PDF data is extracted. Specifically, the PDF table reconstruction apparatus may extract, from the received PDF document, PDF data including coordinate information of the characters, lines, and other elements included in the PDF document.

In an embodiment, the PDF table reconstruction apparatus may extract the PDF data from the received PDF document by parsing it. For example, the apparatus may extract the PDF data by parsing the PDF document using PDFMiner or the like. As shown in FIG. 9, the data extracted by parsing with PDFMiner may include layout information for the characters, lines, and other elements in the PDF document (LTRect denotes line information in the PDF document, and LTTextLineHorizontal denotes the coordinates and content of text in the PDF document).
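The extraction step can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it assumes the pdfminer.six package and its `extract_pages` helper, and the `TextItem` and `LineItem` records and both function names are hypothetical. Layout objects are matched by class name so that the walk can also be exercised on stand-in objects without a PDF at hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextItem:  # hypothetical record for one horizontal text line
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


@dataclass(frozen=True)
class LineItem:  # hypothetical record for one ruling line or rectangle
    x0: float
    y0: float
    x1: float
    y1: float


def collect_items(layout_objects):
    """Recursively split layout objects into text items and line items."""
    texts, lines = [], []
    for obj in layout_objects:
        name = type(obj).__name__
        if name == "LTTextLineHorizontal":
            texts.append(TextItem(*obj.bbox, obj.get_text().strip()))
        elif name == "LTRect":
            lines.append(LineItem(*obj.bbox))
        elif hasattr(obj, "__iter__"):  # containers such as pages and text boxes
            sub_texts, sub_lines = collect_items(obj)
            texts.extend(sub_texts)
            lines.extend(sub_lines)
    return texts, lines


def extract_pdf_data(path):
    """Parse a PDF with pdfminer.six and return (texts, lines) for all pages."""
    from pdfminer.high_level import extract_pages  # assumes pdfminer.six is installed

    return collect_items(extract_pages(path))
```

Keeping only bounding boxes and text is a deliberate simplification; the later steps described in this document rely on coordinates, not on fonts or colors.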

In step S1030, a table area to be reconstructed is searched for. Specifically, the PDF table reconstruction apparatus may search for the table area to be reconstructed based on the PDF data extracted from the received PDF document.

In an embodiment, when the PDF data includes information that identifies a table area, the PDF table reconstruction apparatus may use this information to identify the table area to be reconstructed.

In an embodiment, the PDF table reconstruction apparatus may identify the table area to be reconstructed based on the spacing and positions of the characters and lines included in the PDF data. For example, when the vertical and horizontal arrangement of the characters corresponds to the vertical cells and horizontal cells constituting a table, the apparatus may identify that region as the table area to be reconstructed. That is, when (A) of FIG. 3 is the table area to be reconstructed in the PDF document, the apparatus recognizes that the horizontal and vertical intervals at which the horizontal items (horizontal items 1 to 3), the vertical items (vertical items 1 to 3), and their corresponding values (A to G) are arranged form cells as shown in (B) of FIG. 3, and thereby identifies the region of the PDF document corresponding to (A) of FIG. 3 as the table area to be reconstructed.
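One way to picture detection from text arrangement alone is the sketch below. The patent does not specify the criterion, so the rule used here is an assumption: text lines whose left edges repeat across several rows, and whose bottom edges repeat across several columns, are treated as a grid. The function names, the tolerance, and the tuple layout `(x0, y0, x1, y1, text)` are all hypothetical.

```python
def cluster(values, tol):
    """Group sorted 1-D values; a new group starts when the gap exceeds tol."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return groups


def find_table_area(texts, tol=2.0, min_rows=2, min_cols=2):
    """Return (x0, y0, x1, y1) of a grid-like text region, or None.

    texts: iterable of (x0, y0, x1, y1, text) tuples.
    """
    texts = list(texts)
    if not texts:
        return None
    # A column is a left edge shared by several rows; a row is a bottom
    # edge shared by several columns.
    cols = [g for g in cluster([t[0] for t in texts], tol) if len(g) >= min_rows]
    rows = [g for g in cluster([t[1] for t in texts], tol) if len(g) >= min_cols]
    if len(cols) < min_cols or len(rows) < min_rows:
        return None
    return (min(t[0] for t in texts), min(t[1] for t in texts),
            max(t[2] for t in texts), max(t[3] for t in texts))
```

A single left-aligned paragraph produces only one column group and is therefore rejected, which is the main reason for requiring alignment in both directions.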

In an embodiment, when the PDF data includes coordinate information of the outer border lines of a table, the PDF table reconstruction apparatus may identify it as the table area to be reconstructed.

In step S1040, the PDF document is converted into an image file. Specifically, if the table area to be reconstructed is not found in the PDF data, the PDF table reconstruction apparatus may convert the PDF document into an image file (e.g., jpg).

In an embodiment, the PDF table reconstruction apparatus may be unable to recognize the table area when the table included in the PDF file has no outer border lines.

In step S1050, image analysis is performed on the image file. Specifically, the PDF table reconstruction apparatus may analyze the image file converted from the PDF file and extract the characters and lines included in the image file together with their coordinate information.

In an embodiment, the PDF table reconstruction apparatus may extract the characters, the lines, and their coordinate information using a deep-learning-based image analysis model. For example, the apparatus may use a deep-learning-based image analysis model such as MMDetection (an object detection model library for the deep learning framework PyTorch) to extract the characters and lines included in the image file and their coordinate information.

In step S1060, the table area to be reconstructed is searched for based on the image analysis result. Specifically, the PDF table reconstruction apparatus may identify the table area to be reconstructed based on the characters, lines, and coordinate information extracted from the image file.

In an embodiment, the PDF table reconstruction apparatus may identify the table area to be reconstructed based on the spacing and positions of the characters and lines, as given by the characters, lines, and coordinate information extracted from the image file. For example, when the vertical and horizontal arrangement of the characters corresponds to the vertical cells and horizontal cells constituting a table, the apparatus may identify that region as the table area to be reconstructed. That is, when (A) of FIG. 3 is the table area to be reconstructed in the PDF document, the apparatus recognizes that the horizontal and vertical intervals at which the horizontal items (horizontal items 1 to 3), the vertical items (vertical items 1 to 3), and their corresponding values (A to G) are arranged form cells as shown in (B) of FIG. 3, and thereby identifies the region of the PDF document corresponding to (A) of FIG. 3 as the table area to be reconstructed.
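The control flow of steps S1020 to S1060 (vector data first, image analysis only as a fallback) can be sketched as below. The four callables are hypothetical hooks supplied by the caller; in particular, no specific rendering or detection API is assumed.

```python
def locate_table(pdf_path, extract_pdf_data, find_table_area,
                 render_to_image, analyze_image):
    """Try the parsed PDF data first, then fall back to image analysis.

    Returns (source, items, area); area is None if neither path finds a table.
    """
    items = extract_pdf_data(pdf_path)      # S1020
    area = find_table_area(items)           # S1030
    if area is not None:
        return "pdf", items, area
    image = render_to_image(pdf_path)       # S1040
    items = analyze_image(image)            # S1050
    return "image", items, find_table_area(items)  # S1060
```

Returning the source label alongside the items lets later steps know which coordinate system (PDF points or image pixels) the items use.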

In step S1070, a primary data structure of the reconstruction table is generated. Specifically, the PDF table reconstruction apparatus may generate a primary data structure composed of a plurality of cells based on the coordinate information of the characters and lines that correspond to the found table area among the PDF data.

In an embodiment, the PDF table reconstruction apparatus may generate the primary data structure composed of a plurality of cells by allocating cells whose size and number correspond to the arrangement intervals of the characters and lines in the found table area. For example, the intervals at which the characters in (A) of FIG. 3 are arranged in the horizontal and vertical directions correspond to a plurality of cells with the size and number shown in (B) of FIG. 3. Accordingly, the apparatus generates a primary data structure of the form shown in (B) of FIG. 3 from the coordinate information of the characters arranged as in (A) of FIG. 3.
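As a rough illustration of allocating cells from arrangement intervals, the sketch below clusters the left and bottom edges of text items into column and row positions and drops each text into the nearest slot. The clustering rule, tolerance, and function names are assumptions; the tuple layout `(x0, y0, x1, y1, text)` is hypothetical, and PDF-style coordinates (y increasing upward) are assumed.

```python
def cluster_centers(values, tol):
    """Merge nearby 1-D values into groups and return each group's mean."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers, v):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def build_primary_grid(texts, tol=2.0):
    """Return a list of rows (top row first); empty slots are None."""
    texts = list(texts)
    col_centers = cluster_centers([t[0] for t in texts], tol)
    row_centers = cluster_centers([t[1] for t in texts], tol)
    row_centers.reverse()  # y grows upward, so the largest y is the top row
    grid = [[None] * len(col_centers) for _ in row_centers]
    for x0, y0, _x1, _y1, text in texts:
        grid[nearest(row_centers, y0)][nearest(col_centers, x0)] = text
    return grid
```

Slots left as `None` are where a merged cell may be covering more than one grid position, which is what the later merged-cell search looks for.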

In addition, the PDF table reconstruction apparatus may generate a primary data structure composed of a plurality of cells based on the coordinate information of the characters and lines that correspond to the found table area, among the characters, lines, and coordinate information extracted from the image file.

In an embodiment, the primary data structure may take various forms, such as JavaScript Object Notation (JSON), text (txt), Pandas (Python Data Analysis Library), or NumPy (NPY) data. In this case, the primary data structure may be stored keyed by cell position, by vertical cell name, by horizontal cell name, and so on.
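The storage options can be illustrated with a small grid serialized to JSON in three layouts. The layouts, key names, and function names below are hypothetical; the sketch also assumes that the first row and first column of the grid hold the header labels.

```python
import json


def by_position(grid):
    """One record per cell, keyed by row and column index."""
    return [{"row": r, "col": c, "value": v}
            for r, row in enumerate(grid) for c, v in enumerate(row)]


def by_row_name(grid):
    """Nested mapping: row label -> column label -> value."""
    header = grid[0][1:]
    return {row[0]: dict(zip(header, row[1:])) for row in grid[1:]}


def by_col_name(grid):
    """Nested mapping: column label -> row label -> value."""
    header = grid[0][1:]
    return {h: {row[0]: row[i + 1] for row in grid[1:]}
            for i, h in enumerate(header)}


grid = [[None, "c1", "c2"],
        ["r1", "A", "B"],
        ["r2", "C", "D"]]
print(json.dumps(by_row_name(grid), ensure_ascii=False))
```

Passing `ensure_ascii=False` keeps Korean labels readable in the output instead of escaping them.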

In step S1080, a merged cell is searched for. Specifically, the PDF table reconstruction apparatus may search, among the plurality of cells constituting the primary data structure of the reconstruction table, for a cell in which two or more cells are merged.

In an embodiment, the merged cell may be a cell in which L (a natural number of 2 or more) horizontal cells are merged. For example, as shown in (A) of FIG. 4, it may be a cell in which the three horizontal cells corresponding to horizontal item 2 are merged.

In an embodiment, the merged cell may be a cell in which M (a natural number of 2 or more) vertical cells are merged. For example, as shown in (A) of FIG. 5, it may be a cell in which the two vertical cells corresponding to vertical item 3 are merged.

In an embodiment, the merged cell may be a cell in which L horizontal cells and M vertical cells are merged. For example, as shown in (A) of FIG. 6, it may be a cell in which the four cells corresponding to vertical items 1 and 2 and horizontal items 1 and 2 are merged.

In an embodiment, the merged cell may be a cell corresponding to N (a natural number of 2 or more) lower cells. For example, as shown in (A) of FIG. 7, it may be an upper cell corresponding to two lower cells (lower cell 1 and lower cell 2).
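The text does not state how merged cells are recognized. One plausible approach, sketched here as an assumption, is to compare each cell's bounding box with the grid boundaries and count how many column and row bands it covers; any cell covering more than one band is reported as merged. The names and the tuple layout `(x0, y0, x1, y1, text)` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def span(lo, hi, edges, tol=1.0):
    """Count the bands between consecutive sorted edges that [lo, hi] covers."""
    start = bisect_right(edges, lo + tol) - 1
    end = bisect_left(edges, hi - tol)
    return max(end - start, 1)


def find_merged_cells(cells, x_edges, y_edges, tol=1.0):
    """Return [(cell, col_span, row_span)] for cells covering more than one band."""
    merged = []
    for cell in cells:
        x0, y0, x1, y1, _text = cell
        col_span = span(x0, x1, x_edges, tol)
        row_span = span(y0, y1, y_edges, tol)
        if col_span > 1 or row_span > 1:
            merged.append((cell, col_span, row_span))
    return merged
```

The tolerance absorbs small coordinate jitter so that a cell ending slightly past a boundary is not counted as spanning the next band.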

In step S1090, the merged cell is split and the final data structure is generated. Specifically, the PDF table reconstruction apparatus may generate the final data structure of the reconstruction table by splitting the merged cell among the plurality of cells constituting the primary data structure.

In an embodiment, when the merged cell is a cell in which L horizontal cells are merged, the PDF table reconstruction apparatus may generate the final data structure by splitting the merged cell into L horizontal cells. For example, when three horizontal cells are merged as shown in (A) of FIG. 4, the apparatus may split the merged cell into three horizontal cells as shown in (B) of FIG. 4. In addition, since the values of vertical items 1 to 3 corresponding to horizontal item 2 are all the same value A, the apparatus may assign the same value A to each of the split regions.
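A horizontal split of this kind amounts to repeating the merged value once per covered column. The sketch below assumes a hypothetical row representation in which each entry is a `(value, col_span)` pair.

```python
def split_horizontal(row):
    """Expand (value, col_span) pairs into a flat row, repeating merged values."""
    flat = []
    for value, col_span in row:
        flat.extend([value] * col_span)
    return flat


# A row label followed by one cell that spans three columns.
print(split_horizontal([("horizontal item 2", 1), ("A", 3)]))
```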

In an embodiment, when the merged cell is a cell in which M vertical cells are merged, the PDF table reconstruction apparatus may generate the final data structure by splitting the merged cell into M vertical cells. For example, when two vertical cells are merged as shown in (A) of FIG. 5, the apparatus may split the merged cell into two vertical cells as shown in (B) of FIG. 5. In addition, since the values of horizontal items 1 and 2 corresponding to vertical item 3 are all the same value A, the apparatus may assign the same value A to each of the split regions.

In an embodiment, when the merged cell is a cell in which K (a natural number of 2 or more) horizontal and vertical cells are merged, the PDF table reconstruction apparatus may generate the final data structure by splitting the merged cell into K horizontal and vertical cells. For example, when four horizontal and vertical cells are merged as shown in (A) of FIG. 6, the apparatus may split the merged cell into four horizontal and vertical cells as shown in (B) of FIG. 6. In addition, since the values of vertical items 1 and 2 corresponding to horizontal items 1 and 2 are all the same value A, the apparatus may assign the same value A to each of the split regions.
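The vertical and block cases generalize to filling a rectangular region of the grid with the merged value. The sketch below assumes a hypothetical representation in which the merged value sits in the top-left slot of its block, the covered slots are `None`, and each merged cell is described by `(row, col, row_span, col_span)`.

```python
def split_merged(grid, merged):
    """Return a copy of grid with every merged block filled with its value.

    grid:   list of rows; slots covered by a merged cell hold None.
    merged: list of (row, col, row_span, col_span) for each merged cell.
    """
    out = [list(row) for row in grid]  # leave the primary structure untouched
    for r, c, row_span, col_span in merged:
        value = out[r][c]
        for i in range(r, r + row_span):
            for j in range(c, c + col_span):
                out[i][j] = value
    return out
```

Copying the grid first keeps the primary data structure available alongside the final one, which can help when checking a reconstruction by hand.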

In an embodiment, when the merged cell is an upper cell corresponding to N lower cells, the PDF table reconstruction apparatus may generate the final data structure by splitting the merged cell into N cells. For example, in the case of a merged cell corresponding to two lower cells as shown in (A) of FIG. 7, the apparatus may split it into two cells. In this case, the text (A) of the upper cell and the text (B, C) of the lower cells may be combined (AB and AC, or BA and CA). That is, because a lower cell further narrows the concept of its upper cell, if A is "male", B is "teens", and C is "20s", the apparatus may assign "male teens" (or "teens male") and "male 20s" (or "20s male") to the respective split cells.
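Combining an upper header with its lower headers can be sketched as a simple join. The input shapes here are hypothetical: `upper` is a list of `(label, number_of_lower_cells)` pairs and `lower` is the flat list of lower labels in left-to-right order.

```python
def flatten_headers(upper, lower, sep=" ", upper_first=True):
    """Join each upper label with the lower labels it covers."""
    out, i = [], 0
    for label, n in upper:
        for child in lower[i:i + n]:
            parts = (label, child) if upper_first else (child, label)
            out.append(sep.join(parts))
        i += n
    return out


print(flatten_headers([("male", 2)], ["teens", "20s"]))
```

The `upper_first` flag mirrors the two orderings the text allows (AB and AC, or BA and CA).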

도 2은 본 발명의 다른 실시예에 따른 PDF 문서의 테이블 재구성하는 PDF 테이블 재구성 장치의 동작 방법의 흐름도이다. 2 is a flowchart of a method of operating a PDF table reconstructing apparatus for reconstructing a table of a PDF document according to another embodiment of the present invention.

단계 S2010에서, PDF 문서의 이미지 파일이 수신된다. 구체적으로, PDF 테이블 재구성 장치는 재구성될 테이블 포함하는 PDF 문서의 이미지 파일을 유무선 통신 방식을 통해 외부 기기로부터 수신할 수 있다. In step S2010, an image file of the PDF document is received. Specifically, the PDF table reconstructing apparatus may receive an image file of a PDF document including a table to be reconstructed from an external device through a wired/wireless communication method.

단계 S2020에서, 이미지 파일에 대한 이미지 분석이 수행된다. 구체적으로, PDF 테이블 재구성 장치는 PDF 파일이 변환된 이미지 파일을 분석하여, 이미지 파일에 포함된 문자, 선 및 이들의 좌표 정보를 추출할 수 있다. In step S2020, image analysis is performed on the image file. Specifically, the PDF table reconstructing apparatus may analyze the image file converted from the PDF file, and extract characters, lines, and coordinate information thereof included in the image file.

일 실시예에서, PDF 테이블 재구성 장치는 딥러닝 기반의 이미지 분석 모델을 이용하여 문자, 선 및 이들의 좌표 정보를 추출할 수 있다. 예를 들어, PDF 테이블 재구성 장치는 MM detection 등의 딥러닝 기반의 이미지 분석 모델을 이용하여, 이미지 파일에 포함된 문자, 선 및 이들의 좌표 정보를 추출할 수 있다. In an embodiment, the PDF table reconstruction apparatus may extract characters, lines, and coordinate information thereof using a deep learning-based image analysis model. For example, the PDF table reconstruction apparatus may extract characters, lines, and coordinate information thereof included in an image file by using a deep learning-based image analysis model such as MM detection.

단계 S2030에서, 이미지 분석 결과를 기반으로 재구성될 테이블 영역이 검색된다. 구체적으로, PDF 테이블 재구성 장치는 이미지 파일로부터 추출된 문자, 선 및 이들의 좌표 정보를 기반으로 재구성될 테이블 영역을 확인할 수 있다.In step S2030, a table area to be reconstructed is searched for based on the image analysis result. Specifically, the PDF table reconstructing apparatus may identify a table area to be reconstructed based on characters, lines, and coordinate information thereof extracted from an image file.

In an embodiment, the PDF table reconstruction apparatus may identify the table area to be reconstructed based on the arrangement interval and position of the characters and lines contained in the characters, lines, and coordinate information extracted from the image file. For example, when the vertical and horizontal arrangement of the characters corresponds to the vertical cells and horizontal cells that make up a table, the apparatus may identify that region as the table area to be reconstructed. That is, when (A) of FIG. 3 is the table area to be reconstructed in the PDF document, the apparatus recognizes that the horizontal and vertical intervals at which the horizontal items (horizontal items 1 to 3), the vertical items (vertical items 1 to 3), and their corresponding values (A to G) are arranged form cells as in (B) of FIG. 3, and thereby confirms that the region of the PDF document corresponding to (A) of FIG. 3 is the table area to be reconstructed.
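The alignment check described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the item format `(text, x, y)`, the function names `cluster` and `looks_like_table`, and the tolerance and fill-ratio thresholds are all assumptions made for illustration. The idea is simply that text positions which fall onto a small number of shared x and y coordinates, and fill most of the resulting grid, are likely to form a table.

```python
from typing import List, Tuple

# Hypothetical item format: (text, x, y), e.g. as produced by an image-analysis step.
Item = Tuple[str, float, float]


def cluster(values: List[float], tol: float) -> List[float]:
    """Group nearly equal coordinates and return one center per group."""
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centers: List[float], v: float) -> int:
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda i: abs(centers[i] - v))


def looks_like_table(items: List[Item], tol: float = 3.0,
                     min_rows: int = 2, min_cols: int = 2,
                     min_fill: float = 0.5) -> bool:
    """Return True if the items line up on a grid of rows and columns."""
    if not items:
        return False
    cols = cluster([x for _, x, _ in items], tol)
    rows = cluster([y for _, _, y in items], tol)
    if len(rows) < min_rows or len(cols) < min_cols:
        return False
    # Count distinct grid positions that hold at least one item.
    occupied = {(nearest(rows, y), nearest(cols, x)) for _, x, y in items}
    return len(occupied) / (len(rows) * len(cols)) >= min_fill


# A layout in the spirit of FIG. 3: a header row, a label column, and values.
grid = [("H%d" % c, 100.0 * c, 0.0) for c in (1, 2, 3)]
grid += [("v", 100.0 * c, 20.0 * r) for r in (1, 2, 3) for c in (0, 1, 2, 3)]
paragraph = [("line", 0.0, 20.0 * i) for i in range(5)]
print(looks_like_table(grid), looks_like_table(paragraph))
```

In this sketch a single column of text lines is rejected because it yields only one column cluster, while the grid layout is accepted.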

In step S2040, a primary data structure of the reconstruction table is generated. Specifically, the PDF table reconstruction apparatus may generate a primary data structure composed of a plurality of cells based on the coordinate information of the characters and lines that correspond to the retrieved table area in the PDF data.

In an embodiment, the primary data structure may be JSON (JavaScript Object Notation) data.
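As an illustration of what a JSON primary data structure might look like, the following Python sketch builds one cell per grid interval from line coordinates and places each text item into the cell that contains it. The field names (`row`, `col`, `bbox`, `text`), the function name `build_primary_structure`, and the input format are assumptions for this example; the patent does not specify a schema.

```python
import json
from bisect import bisect_right
from typing import Dict, List, Tuple


def build_primary_structure(xs: List[float], ys: List[float],
                            texts: List[Tuple[str, float, float]]) -> Dict:
    """Build a cell grid from sorted line coordinates and place texts in it.

    xs, ys: sorted x positions of vertical lines and y positions of
            horizontal lines (hypothetical input format).
    texts:  (text, x, y) items located somewhere inside the table area.
    """
    n_rows, n_cols = len(ys) - 1, len(xs) - 1
    cells = [
        {"row": r, "col": c,
         "bbox": [xs[c], ys[r], xs[c + 1], ys[r + 1]],
         "text": ""}
        for r in range(n_rows) for c in range(n_cols)
    ]
    for text, x, y in texts:
        c = bisect_right(xs, x) - 1
        r = bisect_right(ys, y) - 1
        if 0 <= r < n_rows and 0 <= c < n_cols:
            cell = cells[r * n_cols + c]
            cell["text"] = (cell["text"] + " " + text).strip()
    return {"n_rows": n_rows, "n_cols": n_cols, "cells": cells}


table = build_primary_structure(
    xs=[0, 100, 200], ys=[0, 20, 40],
    texts=[("A", 50, 10), ("B", 150, 30)],
)
print(json.dumps(table, indent=2))
```

Keeping the bounding box with each cell is a design choice of this sketch: it lets a later step compare cell extents against the grid lines.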

In step S2050, a merged cell is searched for. Specifically, the PDF table reconstruction apparatus may search the plurality of cells constituting the primary data structure of the reconstruction table for a cell in which two or more cells are merged.

In an embodiment, the merged cell may be a cell in which L horizontal cells are merged. For example, as shown in (A) of FIG. 4, it may be a cell in which the three horizontal cells corresponding to horizontal item 2 are merged.

In an embodiment, the merged cell may be a cell in which M vertical cells are merged. For example, as shown in (A) of FIG. 5, it may be a cell in which the two vertical cells corresponding to vertical item 3 are merged.

In an embodiment, the merged cell may be a cell corresponding to N lower cells. For example, as shown in (A) of FIG. 7, it may be an upper cell corresponding to two lower cells (lower cell 1 and lower cell 2).
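One possible way to search for merged cells, sketched here under stated assumptions, is to count how many grid intervals each cell's bounding box covers. The patent does not say how the search is performed; the `bbox` field, the grid-line lists `xs` and `ys`, and the function names below are hypothetical.

```python
from typing import Dict, List


def interval_span(lo: float, hi: float, lines: List[float],
                  tol: float = 1.0) -> int:
    """Number of grid intervals covered by [lo, hi].

    Every grid line lying strictly inside the interval adds one more
    interval to the span.
    """
    return 1 + sum(1 for v in lines if lo + tol < v < hi - tol)


def find_merged_cells(cells: List[Dict], xs: List[float],
                      ys: List[float]) -> List[Dict]:
    """Return the cells that cover two or more grid positions."""
    merged = []
    for cell in cells:
        x0, y0, x1, y1 = cell["bbox"]
        col_span = interval_span(x0, x1, xs)
        row_span = interval_span(y0, y1, ys)
        if row_span * col_span >= 2:
            merged.append({**cell, "row_span": row_span, "col_span": col_span})
    return merged


xs = [0, 100, 200, 300]   # x positions of the vertical grid lines
ys = [0, 20, 40]          # y positions of the horizontal grid lines
cells = [
    {"text": "A", "bbox": [0, 0, 300, 20]},    # spans three columns
    {"text": "B", "bbox": [0, 20, 100, 40]},   # ordinary cell
]
merged = find_merged_cells(cells, xs, ys)
print([(m["text"], m["row_span"], m["col_span"]) for m in merged])
```

The same span counts cover the horizontal (L cells), vertical (M cells), and combined cases, since a span is computed independently in each direction.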

In step S2060, the merged cell is divided to generate a final data structure. Specifically, the PDF table reconstruction apparatus may generate the final data structure of the reconstruction table by dividing the merged cell among the plurality of cells constituting the primary data structure.

In an embodiment, when the merged cell is a form in which L horizontal cells are merged, the PDF table reconstruction apparatus may generate the final data structure by dividing the merged cell into L horizontal cells. For example, when three horizontal cells are merged as shown in (A) of FIG. 4, the apparatus may divide the merged cell into three horizontal cells as shown in (B) of FIG. 4. In addition, since the values of vertical items 1 to 3 corresponding to horizontal item 2 are all the same value A, the apparatus may assign the same value A to every divided region.

In an embodiment, when the merged cell is a form in which K horizontal and vertical cells are merged, the PDF table reconstruction apparatus may generate the final data structure by dividing the merged cell into K horizontal and vertical cells. For example, when four horizontal and vertical cells are merged as shown in (A) of FIG. 6, the apparatus may divide the merged cell into four horizontal and vertical cells as shown in (B) of FIG. 6. In addition, since the values of vertical items 1 and 2 corresponding to horizontal items 1 and 2 are all the same value A, the apparatus may assign the same value A to every divided region.
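The division step can be sketched in Python as expanding each merged cell over the grid positions it covers and copying its single value into each of them. The cell fields (`row`, `col`, `row_span`, `col_span`, `text`) and the function name `split_merged_cells` are assumptions for illustration, not the format used by the apparatus.

```python
from typing import Dict, List, Optional


def split_merged_cells(cells: List[Dict], n_rows: int,
                       n_cols: int) -> List[List[Optional[str]]]:
    """Expand merged cells so that every grid position holds a value.

    A cell without row_span or col_span is treated as an ordinary 1x1 cell.
    Each position covered by a merged cell receives the merged cell's value.
    """
    grid: List[List[Optional[str]]] = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        row_end = cell["row"] + cell.get("row_span", 1)
        col_end = cell["col"] + cell.get("col_span", 1)
        for r in range(cell["row"], row_end):
            for c in range(cell["col"], col_end):
                grid[r][c] = cell["text"]
    return grid


# A 2x2 merged cell holding "A" (K = 4), next to two ordinary cells.
cells = [
    {"row": 0, "col": 0, "row_span": 2, "col_span": 2, "text": "A"},
    {"row": 0, "col": 2, "text": "B"},
    {"row": 1, "col": 2, "text": "C"},
]
final = split_merged_cells(cells, n_rows=2, n_cols=3)
print(final)
```

The resulting list of rows has one entry per grid position, so it can be written out as JSON or loaded into a tabular library without further handling of spans.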

FIG. 8 is a block diagram of a PDF table reconstruction apparatus 800 according to an embodiment of the present invention.

As shown in FIG. 8, the PDF table reconstruction apparatus 800 may include at least one of a processor 810, a memory 820, a storage unit 830, a user interface input unit 840, and a user interface output unit 850, which may communicate with one another via a bus 860. The PDF table reconstruction apparatus 800 may also include a network interface 870 for connecting to a network. The processor 810 may be a CPU or a semiconductor device that executes processing instructions stored in the memory 820 and/or the storage unit 830. The memory 820 and the storage unit 830 may include various types of volatile and nonvolatile storage media. For example, the memory may include a ROM 824 and a RAM 825.

In an embodiment, the memory may include instructions for performing the methods shown in FIGS. 1 and 2, and the processor 810 may reconstruct a table included in a PDF document according to the instructions stored in the memory.

In addition, when values corresponding to a specific item must be extracted from a plurality of PDF documents, the PDF table reconstruction apparatus 800 may parse a first PDF document to find the text of the specific item and set the position of the corresponding value relative to that text. For example, the apparatus 800 may record that, in the PDF document, the value 2019.07.01 corresponding to the item "date" is located to the lower left of the date text, and set this as the position of the value corresponding to the date. Thereafter, when a PDF document from which the value of the specific item is to be extracted is received, the apparatus 800 may parse the PDF file to locate the specific item and extract the data found at the set position relative to the located item. For example, the apparatus 800 may parse the received PDF document to find the date text and extract the data located to its lower left. In this way, the PDF table reconstruction apparatus 800 can quickly extract the value corresponding to a specific item from a plurality of PDF documents.
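The relative-position extraction described above can be sketched as follows. This is an illustrative Python example under assumptions: parsed documents are represented as lists of `(text, x, y)` items with y increasing downward, and the function names `learn_offset` and `extract_value` and the distance tolerance are hypothetical.

```python
from math import hypot
from typing import List, Optional, Tuple

# Hypothetical parsed-document format: (text, x, y), with y growing downward.
Item = Tuple[str, float, float]


def find(items: List[Item], text: str) -> Optional[Item]:
    """Return the first item whose text matches exactly, or None."""
    return next((it for it in items if it[0] == text), None)


def learn_offset(items: List[Item], label: str,
                 value: str) -> Tuple[float, float]:
    """Record where the value sits relative to the label in a first document."""
    a, b = find(items, label), find(items, value)
    if a is None or b is None:
        raise ValueError("label or value not found in the first document")
    return (b[1] - a[1], b[2] - a[2])


def extract_value(items: List[Item], label: str,
                  offset: Tuple[float, float],
                  tol: float = 5.0) -> Optional[str]:
    """Find the label in a new document and read the item at the learned offset."""
    anchor = find(items, label)
    if anchor is None:
        return None
    tx, ty = anchor[1] + offset[0], anchor[2] + offset[1]
    candidates = [it for it in items if it is not anchor]
    if not candidates:
        return None
    best = min(candidates, key=lambda it: hypot(it[1] - tx, it[2] - ty))
    return best[0] if hypot(best[1] - tx, best[2] - ty) <= tol else None


first_doc = [("Date", 100.0, 50.0), ("2019.07.01", 80.0, 70.0)]
offset = learn_offset(first_doc, "Date", "2019.07.01")   # lower left of the label

next_doc = [("Date", 200.0, 150.0), ("2020.01.15", 180.0, 170.0),
            ("Other", 300.0, 150.0)]
print(extract_value(next_doc, "Date", offset))
```

Because the offset is learned once and reused, later documents with the same layout need only a label lookup and a nearest-item search rather than a full table reconstruction.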

The present invention has been described above with a focus on its embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within the equivalent scope should be construed as being included in the present invention.

Claims

A method of operating an apparatus for reconstructing a table of a PDF document, the method comprising:
receiving a PDF document comprising a table to be reconstructed;
extracting PDF data including coordinate information of characters and lines included in the PDF document;
searching for a table area to be reconstructed based on the PDF data;
generating a primary data structure of a reconstruction table including a plurality of cells based on coordinate information of characters and lines corresponding to the table area to be reconstructed among the PDF data;
searching for a merged cell among the plurality of cells constituting the primary data structure; and
dividing the merged cell to generate a final data structure of the reconstruction table,
wherein the merged cell has one value,
wherein generating the final data structure comprises assigning the one value to each of the cells into which the merged cell is divided,
wherein, if the table area to be reconstructed is not found in the step of searching for the table area to be reconstructed, the PDF document is converted into an image file, coordinate information of characters and lines included in the converted image file is extracted based on a deep learning-based image analysis model, the table area to be reconstructed is identified based on the arrangement interval and position of the characters and lines included in the extracted coordinate information, and the step of generating the primary data structure is performed, and
wherein generating the primary data structure comprises generating the primary data structure of the reconstruction table composed of a plurality of cells by recognizing, as vertical cells and horizontal cells, the arrangement interval and position of the lines corresponding to the table area to be reconstructed, among the coordinate information of characters and lines extracted from the image file, and the horizontal and vertical intervals at which the vertical and horizontal arrangements of characters and the values based thereon are arranged.

The method of claim 1,
wherein extracting the PDF data comprises extracting the PDF data by parsing the PDF document.

The method of claim 1,
wherein extracting the PDF data comprises extracting the PDF data by parsing the PDF document using a PDF miner.

The method of claim 1,
wherein the final data structure is in the format of at least one of JSON (JavaScript Object Notation), text (txt), Pandas (Python Data Analysis Library), and NumPy (NPY).

The method of claim 1,
wherein the merged cell is a form in which L horizontal cells are merged, and
wherein generating the final data structure comprises generating the final data structure by dividing the merged cell into L horizontal cells.

The method of claim 1,
wherein the merged cell is a form in which M vertical cells are merged, and
wherein generating the final data structure comprises generating the final data structure by dividing the merged cell into M vertical cells.

The method of claim 1,
wherein the merged cell is an upper cell corresponding to N lower cells, and
wherein generating the final data structure comprises generating the final data structure by dividing the merged cell into N vertical cells.

(Deleted)

The method of claim 1,
wherein the deep learning-based image analysis model is MM detection.

A method of operating an apparatus for reconstructing a table of a PDF document, the method comprising:
receiving an image file of the PDF document;
extracting coordinate information of characters and lines included in the image file based on a deep learning-based image analysis model;
searching for a table area to be reconstructed based on the arrangement interval and position of the characters and lines included in the coordinate information of characters and lines extracted from the image file;
generating a primary data structure of a reconstruction table composed of a plurality of cells by recognizing, as vertical cells and horizontal cells, the arrangement interval and position of the lines corresponding to the retrieved table area to be reconstructed, and the horizontal and vertical intervals at which the vertical and horizontal arrangements of characters and the values are arranged;
searching for a merged cell among the plurality of cells constituting the primary data structure; and
dividing the merged cell to generate a final data structure of the reconstruction table,
wherein the merged cell has one value, and
wherein generating the final data structure comprises generating the final data structure by assigning the one value to each of the cells into which the merged cell is divided.

11. The method of claim 10,
wherein the deep learning-based image analysis model is MM detection.

A PDF table reconstruction apparatus comprising:
a memory for storing instructions for reconstructing a table included in a PDF document; and
a processor for reconstructing the table of the PDF document according to the instructions stored in the memory,
wherein the instructions are instructions for:
extracting PDF data including coordinate information of characters and lines included in the PDF document, searching for a table area to be reconstructed based on the PDF data, and generating a primary data structure of a reconstruction table composed of a plurality of cells based on the coordinate information of characters and lines corresponding to the table area to be reconstructed among the PDF data;
if the table area to be reconstructed is not found, converting the PDF document into an image file, extracting coordinate information of characters and lines included in the converted image file based on a deep learning-based image analysis model, identifying the table area to be reconstructed based on the arrangement interval and position of the characters and lines included in the extracted coordinate information, and generating the primary data structure,
the primary data structure being composed of a plurality of cells by recognizing, as vertical cells and horizontal cells, the arrangement interval and position of the lines corresponding to the table area to be reconstructed, among the coordinate information of characters and lines extracted from the image file, and the horizontal and vertical intervals at which the vertical and horizontal arrangements of characters and the values based thereon are arranged; and
generating a final data structure of the reconstruction table by searching for a merged cell among the plurality of cells constituting the primary data structure and dividing the merged cell.

a memory for storing instructions for reconstructing a table included in an image file of a PDF document; and
a processor for reconstructing the table included in the image file of the PDF document according to the instructions stored in the memory,
wherein the instructions are instructions for:
extracting coordinate information of characters and lines included in the image file based on a deep learning-based image analysis model;
searching for a table area to be reconstructed based on the arrangement interval and position of the characters and lines included in the coordinate information of characters and lines extracted from the image file; and
generating a primary data structure of a reconstruction table composed of a plurality of cells by recognizing, as vertical cells and horizontal cells, the arrangement interval and position of the lines corresponding to the retrieved table area to be reconstructed, and the horizontal and vertical intervals at which the vertical and horizontal arrangements of characters and the values are arranged,
wherein the merged cell has one value, and
wherein the final data structure is generated by assigning the one value to each of the cells into which the merged cell is divided.