CN117454851B - A method and device for extracting table data from PDF documents

Info

Publication number: CN117454851B
Application number: CN202311786233.4A
Authority: CN
Inventors: 朱海洋; 陈为; 储诚灿; 胡健; 谈旭炜; 应石磊; 苏轶; 王牡丹; 潘奇豪; 朱凌军; 沈萍平
Original assignee: Products Zhongda Digital Technology Co ltd; Zhejiang University ZJU
Current assignee: Products Zhongda Digital Technology Co ltd; Zhejiang University ZJU
Priority date: 2023-12-25
Filing date: 2023-12-25
Publication date: 2024-03-12
Anticipated expiration: 2043-12-25
Also published as: CN117454851A

Abstract

In this extraction method, after an initial table is obtained by parsing a PDF document, the text list corresponding to the page on which the initial table is located is first segmented to obtain a two-dimensional text list. A table category of the initial table is then determined based on the numbers of rows and columns of the initial table and the number of columns of the two-dimensional text list. Finally, the initial table is reconstructed based on the determined table category and the text list, and the reconstructed table is taken as the table data extracted from the PDF document. This can greatly improve the efficiency and accuracy of table data extraction.

Description

A method and device for extracting table data from PDF documents

Technical field

One or more embodiments of this specification relate to the field of computer technology, and in particular to a method and device for extracting table data from PDF documents.

Background

In most cases, multi-source, heterogeneous, multi-dimensional supply chain data contains a wealth of valuable information that is of great significance for guiding enterprise operations and management, decision support, and business model innovation. The Portable Document Format (PDF), a widely used form of unstructured data, offers significant advantages in cross-platform compatibility, fidelity, and security, and is therefore widely adopted for producing and distributing all kinds of documents. In enterprise applications in particular, PDF documents are an important carrier for internal and external communication, for example prospectuses, periodic reports of listed companies (annual, semi-annual, and quarterly reports), contracts and agreements, and product manuals. These PDF documents contain a large amount of corporate information, such as operating conditions, financial indicators, market competitiveness, and product characteristics, which is of great value to the company itself and its stakeholders. However, because PDF documents are usually not editable and contain multiple kinds of unstructured data such as tables, images, and text, extracting data from them effectively is complicated and time-consuming. Current methods for extracting data from PDF documents mainly include manual extraction and entry, PDF converters, open-source tools, and intelligent algorithms, but all of these have certain limitations and shortcomings, as follows:

(1) Data complexity. PDF documents usually consist of complex and diverse unstructured data such as tables, images, and text. Common data conversion methods and tools are inefficient and costly and do not provide visual analysis functions, so they are inconvenient to operate and of limited usability.

(2) Data quality. Owing to factors such as subjective human judgment, negligence, or fatigue, manually extracting unstructured data from PDF documents is prone to omissions and errors, and important data may even be overlooked, which can negatively affect subsequent analysis and applications.

(3) Data completeness. Automated tools that extract data from PDF documents can often extract only conventional financial indicator data, while ignoring information such as financial notes, images, and text that is highly valuable for data analysis, which impairs data completeness and analysis accuracy.

(4) Data comparison. Structured data extracted manually from PDF documents is usually stored in Excel or Word tables. When statistical analysis of indicators such as year-on-year change, period-on-period change, and year-to-date totals is later required, historical data cannot be quickly retrieved and reused.

(5) Data fusion. Structured data extracted from PDF documents with traditional extraction methods and tools is usually difficult to summarize, classify, and store sensibly by business topic, and its usability is low, which poses challenges for data fusion.

To solve the above problems effectively, a more effective data extraction method for PDF documents is needed.

Summary of the invention

One or more embodiments of this specification describe a method and device for extracting table data from PDF documents, which can greatly improve the efficiency and accuracy of table data extraction.

In a first aspect, a method for extracting table data from PDF documents is provided, including:

parsing a PDF document to obtain the initial table and the multi-page text content contained therein;

converting the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text;

selecting, from the text lists, the target text list corresponding to the page on which the initial table is located;

segmenting the target text list according to a preset symbol to obtain a two-dimensional text list;

determining the table category of the initial table according to a first row count and a first column count of the initial table and a second column count of the two-dimensional text list;

where determining the table category of the initial table includes: if the first row count is less than a preset row count and the first column count equals the second column count, determining that the table category is a three-line table; if the difference between the second column count and the first column count equals a preset column count, determining that the table category is a border-missing table; and if the difference between the second column count and the first column count is greater than the preset column count, determining that the table category is a color ladder table;

reconstructing the initial table according to the determined table category to obtain a reconstructed table; and

determining the reconstructed table as the table data extracted from the PDF document.

In a second aspect, a device for extracting table data from PDF documents is provided, including:

a parsing unit, configured to parse a PDF document to obtain the initial table and the multi-page text content contained therein;

a conversion unit, configured to convert the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text;

a selection unit, configured to select, from the text lists, the target text list corresponding to the page on which the initial table is located;

a segmentation unit, configured to segment the target text list according to a preset symbol to obtain a two-dimensional text list;

a determination unit, configured to determine the table category of the initial table according to a first row count and a first column count of the initial table and a second column count of the two-dimensional text list;

where the determination unit is specifically configured to: if the first row count is less than a preset row count and the first column count equals the second column count, determine that the table category is a three-line table; if the difference between the second column count and the first column count equals a preset column count, determine that the table category is a border-missing table; and if the difference between the second column count and the first column count is greater than the preset column count, determine that the table category is a color ladder table;

a reconstruction unit, configured to reconstruct the initial table according to the determined table category to obtain a reconstructed table;

the determination unit being further configured to determine the reconstructed table as the table data extracted from the PDF document.

In the method and device for extracting table data from PDF documents provided by one or more embodiments of this specification, after the initial table is obtained by parsing the PDF document, the text list corresponding to the page on which the initial table is located is first segmented to obtain a two-dimensional text list. The table category of the initial table is then determined based on the numbers of rows and columns of the initial table and the number of columns of the two-dimensional text list. Finally, the initial table is reconstructed based on the determined table category and the text list, and the reconstructed table is taken as the table data extracted from the PDF document. This can greatly improve the efficiency and accuracy of table data extraction.

Description of the drawings

To explain the technical solutions of the embodiments of this specification more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this specification; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic diagram of an implementation scenario disclosed in one embodiment of this specification;

Figure 2 shows a flow chart of a method for extracting table data from PDF documents according to one embodiment;

Figure 3 shows a schematic diagram of the PDF document parsing process in one example;

Figure 4 shows a schematic diagram of a text list in one example;

Figure 5a shows a schematic diagram of a target text list in one example;

Figure 5b shows a schematic diagram of a two-dimensional text list in one example;

Figure 6 shows a schematic diagram of the method for extracting table data from PDF documents in one example;

Figure 7a shows a schematic diagram of the document overview view in the visual analysis system;

Figure 7b shows a schematic diagram of the data extraction view in the visual analysis system;

Figure 7c shows a schematic diagram of the data conversion review view in the visual analysis system;

Figure 8 shows a schematic diagram of a device for extracting table data from PDF documents according to one embodiment.

Detailed description

The solutions provided in this specification are described below with reference to the accompanying drawings.

Generally speaking, the PDF documents of listed companies' periodic reports contain a wealth of data, usually presented in tables such as balance sheets, income statements, cash flow statements, and notes to the financial statements. Extracting this table data from PDF documents provides a more reliable data basis for corporate decision-making, makes it easier to compare data across points in time or across companies, and gives a better understanding of the financial changes of benchmark companies, so that plans and decisions can be made in a more targeted way.

To automate the extraction of table data from PDF documents, existing solutions have proposed many data conversion techniques for structuring documents.

PDF documents are usually stored as images or binary encodings, and document parsing methods can be used to decode the document structure and parse data types. Strouthopoulos et al. proposed a document parsing method based on PDF document structure and keywords, which automatically identifies and extracts text information and accurately determines paragraph boundaries and sentence completeness. Zhang et al. studied a rule-based document parsing method that converts PDF documents into XML format and extracts metadata from them. Nguyen et al. introduced a method that converts PDF documents into image format and uses computer vision (CV) and image processing (IP) techniques to identify tables, images, and text. Grijalva et al. developed a data conversion platform that first extracts text cells, bitmap images, and lines from scanned PDF documents and then parses the document content with machine learning classification methods. Rizvi et al. proposed the BRExSys system framework, which uses a mask region-based convolutional neural network (Mask R-CNN) to analyze the page layout of PDF documents. In addition, Ahmed et al. proposed a document parsing method based on multi-dimensional features such as text blocks, typography, and geometric information. However, that method has low accuracy and requires considerable storage space and computing resources when processing large-scale PDF documents.

Data extraction refers to identifying and extracting specific types of information from the tables, images, or text of PDF documents. For table data extraction, the table structure must first be detected and understood before the data in it can be extracted. Traditional methods rely mainly on predefined templates and rule matching to extract specific fields, but they are limited by template creation and adapt poorly to different table structures. Machine learning methods use image segmentation and recognition algorithms such as YOLO and UNet to detect the table structure and then apply optical character recognition (OCR) to extract the table data. Hashmi et al. proposed a guided-anchor-based method for accurately locating rows and columns in table images, with strong generalization ability. Jiang et al. proposed a deep learning model based on table cell structure that improves the accuracy of processing heterogeneous table data by learning the features of cells of different types and contents.

Unlike the above methods, this solution proposes a pipeline-style table data extraction method that can not only extract the topic information of tables in PDF documents but also perform structure parsing and data extraction for complex tables.

Figure 1 is a schematic diagram of an implementation scenario disclosed in one embodiment of this specification. In Figure 1, the PDF document is first parsed to obtain the initial table and the text content. The initial table can then be reconstructed based on the text content to obtain the extracted table data. Finally, the extracted table data can be displayed visually for data analysts to view and review, and the table data that passes review can be saved to a database.

Figure 2 shows a flow chart of a method for extracting table data from PDF documents according to one embodiment. The method can be executed by any apparatus, device, platform, or device cluster with computing and processing capabilities. As shown in Figure 2, the method may include the following steps.

Step S202: parse the PDF document to obtain the initial table and the text content of multiple pages contained therein.

In one embodiment, a Python-based open-source tool (pdfplumber) can be used to parse the PDF document to obtain the initial table and the text content of multiple pages (multi-page text content for short). It should be understood that the multi-page text content includes the text content of the page on which the initial table is located.

Figure 3 shows a schematic diagram of the PDF document parsing process in one example. In Figure 3, a given PDF document is first read as a binary content stream and converted into Python objects; the document is then traversed page by page, and the objects on each page, such as lines, rectangles, points, images, and characters, are parsed. For tables, following the idea of the Nurminen algorithm, the table lines that actually exist are first obtained from information such as one-dimensional lines, two-dimensional rectangles, and connection points. Virtual lines that may exist are then inferred by analyzing the positions at which text is aligned, and these lines are merged to construct table cells. The text characters in the cells are then extracted, and the table data is saved as a two-dimensional text list.

For text content, the text is first extracted using methods such as PDFMiner: the text content stream is decoded to extract characters, the horizontal and vertical distances between characters are calculated, and spaces and line breaks are inserted between characters to rebuild the structure of the text content. The text is then saved as line-by-line strings.
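The parsing step can be sketched with pdfplumber as follows. This is a minimal sketch rather than the embodiment's actual implementation: the function name `parse_pdf` and the shape of the returned values are illustrative assumptions, and pdfplumber's default table-detection settings are used instead of any tuning the embodiment may apply.

```python
def parse_pdf(path):
    """Return (tables, page_texts) for the PDF at `path` (hypothetical helper)."""
    import pdfplumber  # third-party package; imported lazily inside the function

    tables = []      # (page_index, table) pairs; each table is a list of rows
    page_texts = []  # one string per page, with lines separated by "\n"
    with pdfplumber.open(path) as pdf:
        for page_index, page in enumerate(pdf.pages):
            # extract_text() can return None for a page without a text layer
            page_texts.append(page.extract_text() or "")
            # extract_tables() returns every table found on the page
            for table in page.extract_tables():
                tables.append((page_index, table))
    return tables, page_texts
```

Keeping the page index next to each table makes it straightforward to look up the text of the page on which that table is located in the later steps.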

Returning to Figure 2, the method may further include the following steps.

Step S204: convert the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text.

In one embodiment, the text content of any given page can be split into multiple lines of text at line breaks, and the lines can then be organized into a list to obtain the corresponding text list. In a more specific embodiment, the text list may also indicate the index (Index), data type (Type), size (Size), and so on of each line of text.

Figure 4 shows a schematic diagram of a text list in one example. In Figure 4, the text list includes an Index column, a Type column, a Size column, and a Value column. The Index column holds the identifier of each line of text, which can be numbered from 0. The Type column holds the data type of the text, for example string (Str). The Size column holds the number of strings contained in the text. The Value column holds the text itself (also called a string).
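A text list of this kind can be sketched as below. The function name `to_text_list` and the dictionary layout are assumptions made for illustration; in particular, Size is taken here to be the character count of the line, which is only one possible reading of the description.

```python
def to_text_list(page_text):
    """Split one page of text into row records (hypothetical layout)."""
    lines = page_text.split("\n")  # cut the page text at line breaks
    return [
        {
            "index": i,                   # identifier, numbered from 0
            "type": type(line).__name__,  # data type name, e.g. "str"
            "size": len(line),            # assumption: character count
            "value": line,                # the line of text itself
        }
        for i, line in enumerate(lines)
    ]


page = "Balance sheet\nItem 2023 2022\nCash 100 90"
text_list = to_text_list(page)
print(text_list[1]["value"])  # prints "Item 2023 2022"
```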

Step S206: select, from the text lists, the target text list corresponding to the page on which the initial table is located.

As described above, the text content of each page is converted into a corresponding text list. This step retrieves the target text list converted from the text content of the page on which the initial table is located.

Step S208: segment the target text list according to a preset symbol to obtain a two-dimensional text list.

In one embodiment, the preset symbol may be, for example, a space (blank).

As described above, the target text list includes multiple lines of text, each recorded as a string. Segmenting the target text list can be understood as splitting the string of each line into multiple substrings, which form a sublist.

Figure 5a shows a schematic diagram of a target text list in one example. After the target text list in Figure 5a is segmented, the resulting two-dimensional text list can be as shown in Figure 5b. In Figure 5b, the sublist for each line includes four substrings separated by commas, so the two-dimensional text list can also be understood as having 4 columns.
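The segmentation can be sketched in a few lines. The helper names are hypothetical, and taking the column count of the two-dimensional list to be the length of its longest row is an assumption; the description does not say how rows of unequal length are handled.

```python
def split_lines(target_lines):
    """Split each line on runs of whitespace to form a two-dimensional list."""
    # str.split() with no argument splits on whitespace and drops empty strings
    return [line.split() for line in target_lines]


def column_count(rows_2d):
    """Assumption: the column count is the length of the longest row."""
    return max((len(row) for row in rows_2d), default=0)


lines = ["Item 2023 2022 Change", "Cash 100 90 10"]
rows_2d = split_lines(lines)
print(column_count(rows_2d))  # prints 4
```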

It should be noted that, because PDF documents differ in format, tables come in many categories, such as three-line tables, border-missing tables, color ladder tables, cross-page tables, continuous tables, nested tables, and multi-header tables. Different table categories often call for different extraction approaches, so the table category is determined first, as follows.

Step S210: determine the table category of the initial table according to the numbers of rows and columns of the initial table and the number of columns of the two-dimensional text list.

Specifically, if the number of rows of the initial table D_t is less than a preset row count (for example, 2) and the number of columns of D_t equals the number of columns of the two-dimensional text list D_{l,t}, the table category of D_t is determined to be a three-line table. If the difference n between the number of columns of D_{l,t} and the number of columns of D_t equals a preset column count (for example, 2), the table category of D_t is determined to be a border-missing table. If the difference n between the number of columns of D_{l,t} and the number of columns of D_t is greater than the preset column count, the table category of D_t is determined to be a color ladder table.
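The three rules can be written down directly. The sketch below uses the example thresholds from the text (2 rows, 2 columns) as defaults; the function name, the category labels, the order in which the rules are checked, and the `None` fallback for tables that match no rule are assumptions for illustration.

```python
def classify_table(rows_t, cols_t, cols_l, preset_rows=2, preset_cols=2):
    """Classify the initial table from its row/column counts (rows_t, cols_t)
    and the column count of the two-dimensional text list (cols_l)."""
    diff = cols_l - cols_t  # the difference n in the description
    if rows_t < preset_rows and cols_t == cols_l:
        return "three_line"
    if diff == preset_cols:
        return "border_missing"
    if diff > preset_cols:
        return "color_ladder"
    return None  # no rule matched (fallback not specified in the text)


print(classify_table(rows_t=1, cols_t=4, cols_l=4))   # three_line
print(classify_table(rows_t=10, cols_t=3, cols_l=5))  # border_missing
print(classify_table(rows_t=10, cols_t=2, cols_l=6))  # color_ladder
```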

It should be noted that the initial table is obtained by parsing with the open-source tool, and tables parsed by that tool may have the following problems. A three-line table usually uses three horizontal lines to separate the header from the body, but the tool may recognize the entire body as a single row. A border-missing table (also called a table with both ends missing) usually lacks the lines on its left and right sides, so the tool recognizes only the middle part of the table. A color ladder table usually uses alternating shades to distinguish adjacent rows, but the tool is insensitive to table colors and tends to recognize two adjacent rows of data as a single cell.

Because the initial tables parsed by the open-source tool all have such problems, this solution reconstructs the initial table.

Step S212: reconstruct the initial table according to the determined table category to obtain a reconstructed table.

Specifically, for a three-line table, each line of the region of the target text list that corresponds to the initial table is split on spaces; the resulting one-dimensional lists are clustered to determine a target column count; and the contents of the initial table are filled into a table that has the target column count and the number of rows contained in the corresponding region, which yields the reconstructed table.

The region of the target text list that corresponds to the initial table can be determined as follows. The first i rows (for example, the first 2 rows) of the initial table are matched against the lines of the target text list (for example, by computing similarity) to determine the starting line of the initial table in the target text list. Starting from that line and moving downward, each line of the target text list is checked for spaces; if a line contains no space, that line is taken as the ending line of the initial table in the target text list. Finally, the corresponding region of the initial table in the target text list is determined from the starting line and the ending line.
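The region search can be sketched with the standard-library `difflib.SequenceMatcher` as the similarity measure. The choice of measure, the function names, and the treatment of the ending line as an exclusive boundary are assumptions; the description only says that similarity may be computed and that the first line without a space marks the end.

```python
from difflib import SequenceMatcher


def row_to_text(row):
    """Join the non-empty cells of a table row with single spaces."""
    return " ".join(cell for cell in row if cell)


def find_table_region(initial_table, text_lines, i=2):
    """Return (start, end): the best-matching starting line and the index of
    the first following line that contains no space (exclusive end)."""
    head = [row_to_text(row) for row in initial_table[:i]]
    best_start, best_score = 0, -1.0
    for s in range(len(text_lines) - len(head) + 1):
        # total similarity of the first rows against consecutive text lines
        score = sum(
            SequenceMatcher(None, h, text_lines[s + k]).ratio()
            for k, h in enumerate(head)
        )
        if score > best_score:
            best_start, best_score = s, score
    end = best_start
    # scan downward until a line without an inner space is reached
    while end < len(text_lines) and " " in text_lines[end].strip():
        end += 1
    return best_start, end


lines = ["Balance sheet", "Item 2023 2022", "Cash 100 90", "Debt 50 40", "Notes"]
table = [["Item", "2023", "2022"], ["Cash", "100", "90"]]
print(find_table_region(table, lines))  # prints (1, 4)
```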

In addition, the one-dimensional lists obtained by the above segmentation can be regarded as sets of cells. The target column count can be obtained by clustering the cells split from each line with a position-based clustering algorithm such as K-means. It should be understood that a new table can be created from the target column count and the number of rows contained in the corresponding region.

Finally, filling the contents of the initial table into the table that has the target column count and the number of rows of the corresponding region means placing the content of each cell of the initial table at the corresponding position in the new table. For example, the content at row i, column j of the initial table is filled into row i, column j of the new table. It should be understood that once the content of every cell of the initial table has been filled into the new table, the reconstructed table corresponding to the initial table is obtained.
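The construction of the new table can be sketched as below. Note the substitution: where the description clusters cells with an algorithm such as K-means, this sketch uses a much simpler stand-in, the most frequent per-line cell count, to pick the target column count. The function names are hypothetical, and cells of the initial table that fall outside the new table are simply ignored.

```python
from collections import Counter


def target_column_count(region_lines):
    """Stand-in for clustering: the most common number of cells per line."""
    if not region_lines:
        return 0
    counts = Counter(len(line.split()) for line in region_lines)
    return counts.most_common(1)[0][0]


def rebuild_table(initial_table, region_lines):
    """Create a rows x cols table and copy cell (i, j) of the initial table
    into cell (i, j) of the new table."""
    n_rows = len(region_lines)
    n_cols = target_column_count(region_lines)
    new_table = [[None] * n_cols for _ in range(n_rows)]
    for i, row in enumerate(initial_table[:n_rows]):
        for j, cell in enumerate(row[:n_cols]):
            new_table[i][j] = cell
    return new_table


region = ["Item 2023 2022", "Cash 100 90", "Debt 50 40"]
initial = [["Item", "2023", "2022"], ["Cash", "100", "90"]]
print(rebuild_table(initial, region)[2])  # prints [None, None, None]
```

Rows of the new table that have no counterpart in the initial table stay filled with `None` in this sketch.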

Of course, in practice, after the content of each cell of the initial table has been filled into the new table, the row spacing of the new table can also be checked for variation. The row-spacing difference (position difference) and whether the first cells are aligned can be used to judge whether a single logical row has been split across multiple lines, and such rows can then be merged. The new table after row merging is taken as the reconstructed table.

A color ladder table is reconstructed in much the same way as a three-line table. The difference is that, before the region of the target text list corresponding to the initial table is segmented, the initial table can be preprocessed, for example by removing its None columns. A None column is a column that contains only None (null values), or only None and empty values.
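The preprocessing for color ladder tables can be sketched as follows. The function name is hypothetical, and "empty" is interpreted here as the empty string, which is an assumption.

```python
def drop_none_columns(table):
    """Remove columns whose cells are all None or empty strings."""
    if not table:
        return []
    width = max(len(row) for row in table)
    keep = [
        j
        for j in range(width)
        # keep column j if any row has a real value at that position
        if any(j < len(row) and row[j] not in (None, "") for row in table)
    ]
    return [[row[j] if j < len(row) else None for j in keep] for row in table]


table = [["Item", None, "2023"], ["Cash", "", "100"]]
print(drop_none_columns(table))  # prints [['Item', '2023'], ['Cash', '100']]
```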

For a border-missing table, the left and right columns of the initial table can be added back, and the content missing from the initial table after the columns are added is filled with None, which yields the corresponding reconstructed table.
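A minimal sketch of this padding step is shown below. It assumes that exactly one column is missing on each side, which matches the example preset column difference of 2; the function name is hypothetical, and recovering the actual text of the missing columns from the text list is left out.

```python
def pad_missing_borders(table):
    """Add one placeholder column on the left and one on the right."""
    return [[None] + list(row) + [None] for row in table]


middle = [["2023", "2022"], ["100", "90"]]
print(pad_missing_borders(middle))
# prints [[None, '2023', '2022', None], [None, '100', '90', None]]
```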

步骤S214，将重构表格确定为从PDF文档中抽取的表格数据。Step S214, determine the reconstructed table as table data extracted from the PDF document.

需要说明，本方案通过对从PDF文档中抽取的初始表格进行重构，可以得到准确的表格数据。It should be noted that this solution can obtain accurate table data by reconstructing the initial table extracted from the PDF document.

Of course, in practical applications, besides the table data itself, it is also necessary to obtain topic information associated with the table, such as the table name, unit of measurement, and currency unit. The method for obtaining this topic information is described below.

The first i rows of the initial table D_t are matched against the target text list L_p,i to determine the starting row P_s of the initial table D_t in the target text list L_p,i. It is then judged whether the number m of all rows preceding the starting row P_s in the target text list L_p,i is not less than a preset number ρ. If m is not less than ρ, the corresponding area is extracted from the target text list L_p,i according to the starting row P_s and the preset number ρ, and is taken as the area where the table topic information is located. Specifically, the corresponding area is the ρ rows immediately preceding the starting row P_s in the target text list L_p,i. If m is less than ρ, the difference ρ-m between the preset number ρ and m is calculated, and the area where the table topic information is located is determined from the difference ρ-m, the target text list L_p,i, and the other text list L_p,i-1, where the other text list L_p,i-1 is the text list corresponding to the text content of the page preceding the page on which the initial table is located. The table topic information of the initial table D_t is determined by extracting keywords from the area where the table topic information is located.

Determining the area where the table topic information is located from the difference ρ-m, the target text list L_p,i, and the other text list L_p,i-1 specifically includes: taking the ρ-m rows counted backward from the last row of the other text list L_p,i-1 as supplementary content placed before the target text list L_p,i, and determining the target text list L_p,i with the supplementary content added as the area where the table topic information is located.
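The selection of the topic-information area can be sketched as follows. The function and parameter names are hypothetical. The sketch assumes a 0-based starting row, so that m equals the starting row index, and it assumes that only the rows before the table are kept together with the supplement from the previous page.

```python
from typing import List


def topic_region(
    target_lines: List[str],
    prev_lines: List[str],
    start_row: int,
    rho: int,
) -> List[str]:
    """Return the rows assumed to hold the table's topic information.

    target_lines: text list of the page containing the table (L_p,i)
    prev_lines:   text list of the previous page (L_p,i-1)
    start_row:    0-based index of the table's first row (P_s)
    rho:          preset number of rows to look back
    """
    m = start_row  # number of rows preceding the table on its own page
    if m >= rho:
        return target_lines[start_row - rho:start_row]
    # Too few rows on this page: borrow the last rho - m rows of the previous page.
    supplement = prev_lines[-(rho - m):]
    return supplement + target_lines[:start_row]
```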

At this point, the table data and table topic information of every page of the PDF document have been extracted.

Since tables in a PDF document may be displayed across pages, for table data extracted from two or more adjacent pages it is also necessary to determine whether it forms a cross-page table or a continuous table, and to restore and merge it with the corresponding method. Continuous tables are merged because, in a continuous table, none of the per-page tables other than the first carries topic information; merging ensures the completeness and accuracy of the table topic information, which in turn supports data fusion and comparative analysis.

The judgment and merging process for cross-page tables and continuous tables is described below.

Assume that the table data extracted by the method shown in Figure 2 includes a first reconstructed table D_t,-1 and a second reconstructed table D_t,1, where the first reconstructed table D_t,-1 is located on the previous page and the second reconstructed table D_t,1 on the next page. It can first be judged whether a first condition is satisfied. The first condition may include: the last row of the first reconstructed table D_t,-1 matches the last row of the corresponding first text list L_t,i-1; the first row of the second reconstructed table D_t,1 matches the first row of the corresponding second text list L_t,i; and the number of columns of the first reconstructed table D_t,-1 equals that of the second reconstructed table D_t,1 (or the header data of the first reconstructed table D_t,-1 is consistent with that of the second reconstructed table D_t,1). That is, the first condition comprises three constraints.

If the first condition is satisfied, the first and second reconstructed tables are determined to be a cross-page table; if it is not satisfied, they are determined to be two independent tables.
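The three constraints of the first condition can be sketched as below. The text does not define how a table row is "matched" against a text line, so the rule used here (concatenated non-empty cells equal the line once whitespace is ignored) is an assumption, as are the names `row_matches` and `is_cross_page`. The sketch uses the column-count form of the third constraint.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def _normalize(text: str) -> str:
    """Drop all whitespace so that spacing differences do not matter."""
    return "".join(text.split())


def row_matches(row: List[Optional[str]], line: str) -> bool:
    """Assumed matching rule: joined non-empty cells equal the text line."""
    return _normalize("".join(cell for cell in row if cell)) == _normalize(line)


def is_cross_page(
    first: Table, second: Table, first_lines: List[str], second_lines: List[str]
) -> bool:
    """Check the three constraints of the first condition."""
    return (
        row_matches(first[-1], first_lines[-1])      # table ends the previous page
        and row_matches(second[0], second_lines[0])  # table opens the next page
        and len(first[0]) == len(second[0])          # same number of columns
    )
```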

When the first and second reconstructed tables form a cross-page table, it can be judged whether the similarity between the last row of the first reconstructed table and the first row of the second reconstructed table is greater than a preset threshold σ. If so, the two tables are determined to be a different-row cross-page table (the page break falls between two rows), so the first and second reconstructed tables can be merged after the duplicate header data is removed (that is, after the header data of the second reconstructed table is removed), yielding a merged table. If the similarity is not greater than the preset threshold σ, the two tables are a same-row cross-page table (the page break falls inside a row); in that case the last row of the first reconstructed table and the first row of the second reconstructed table are first extracted and merged, and then the remaining parts of the two tables are merged, yielding the merged table.
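The two merging branches can be sketched as follows. The text does not name a similarity measure, so the sketch substitutes `difflib.SequenceMatcher` from the Python standard library. The default threshold, the `header_rows` parameter, and the rule of dropping the second table's leading rows only when they repeat the first table's header are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Optional

Table = List[List[Optional[str]]]


def _row_text(row: List[Optional[str]]) -> str:
    return "".join(cell or "" for cell in row)


def merge_cross_page(
    first: Table, second: Table, sigma: float = 0.5, header_rows: int = 1
) -> Table:
    """Merge two reconstructed tables already judged to be a cross-page table."""
    last, head = first[-1], second[0]
    similarity = SequenceMatcher(None, _row_text(last), _row_text(head)).ratio()
    if similarity > sigma:
        # Different-row case: drop a header repeated at the top of `second`.
        repeated = second[:header_rows] == first[:header_rows]
        return first + (second[header_rows:] if repeated else second)
    # Same-row case: the boundary row was cut in two, so join it cell by cell.
    joined = [(a or "") + (b or "") or None for a, b in zip(last, head)]
    return first[:-1] + [joined] + second[1:]
```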

In addition, a continuous table is a special kind of cross-page table in which a sub-table fills an entire page, that is, the content of the initial table is identical to that of the text list. It is processed in a similar way to a cross-page table, so the details are not repeated here.

It should also be noted that a merged table obtained by the above method may be a table with a complex structure, such as a nested table or a multi-header table. This solution can also split such complex-structure tables.

Specifically, for the merged table, it can be judged whether it contains an intermediate row with exactly one non-None value. If so, the merged table is determined to be a nested table, and it can be split into an upper part and a lower part with that intermediate row as the boundary; if not, no splitting is performed.
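The nested-table split can be sketched as below. The text does not say which part receives the boundary row, so this sketch assumes it opens the lower part (for example as a sub-table title). The name `split_nested` is hypothetical, and only the first qualifying row is used.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def split_nested(table: Table) -> List[Table]:
    """Split at the first intermediate row with exactly one non-None cell.

    Returns [upper, lower] for a nested table, or [table] when no such
    row exists. The first and last rows are not treated as intermediate.
    """
    for k in range(1, len(table) - 1):
        if sum(cell is not None for cell in table[k]) == 1:
            return [table[:k], table[k:]]
    return [table]
```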

In this solution, after the merged table has been split, each table obtained from the split can be further checked to determine whether it is a multi-header table, as described below.

Assume the upper and lower parts obtained by splitting the merged table are a first split table and a second split table. For the first split table (or the second split table), the header data can be obtained and it is judged whether the header data has more than one row. If it does, and one of the rows contains None while the other does not, the first split table (or the second split table) is determined to be a multi-header table, and the two rows are merged to obtain the target table.

It should be noted that rows are merged when the first/second split table is judged to be a multi-header table because the open source tool used to parse the document predicts cells from the finest-grained lines, which causes merged cells in a multi-header table to be split and filled with None.
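A possible form of the header merge is sketched below. The text only says that the two header rows are merged, so the strategy here is an assumption: the first two rows are taken as the header, the None cells of the sparse row are filled from the nearest value to their left, and the two rows are then joined cell by cell with a space. The names are hypothetical.

```python
from typing import List, Optional

Table = List[List[Optional[str]]]


def _forward_fill(row: List[Optional[str]]) -> List[Optional[str]]:
    """Replace each None with the nearest non-None value to its left."""
    filled, last = [], None
    for cell in row:
        last = cell if cell is not None else last
        filled.append(last)
    return filled


def merge_multi_header(table: Table) -> Table:
    """Merge a two-row header when exactly one of the rows contains None."""
    if len(table) < 2:
        return table
    top, sub = table[0], table[1]
    top_sparse, sub_sparse = None in top, None in sub
    if top_sparse == sub_sparse:
        return table  # not the multi-header pattern described in the text
    rows = [
        _forward_fill(top) if top_sparse else top,
        _forward_fill(sub) if sub_sparse else sub,
    ]
    header = [" ".join(part for part in pair if part) for pair in zip(*rows)]
    return [header] + table[2:]
```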

It should also be noted that the reconstructed table, merged table, split tables, or target table obtained by the embodiments of this specification can be saved as a CSV file, or further converted into JSON and stored in a database. Data analysis tools can subsequently be used, as needed, for data cleaning, statistical analysis, visualization, and similar operations, in order to gain a deeper understanding of the financial and business conditions of the benchmarked enterprise.
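The two storage formats can be illustrated with the Python standard library. This is only a sketch: the function names are hypothetical, the first row is assumed to be the header, and None cells are assumed to be written as empty CSV fields.

```python
import csv
import io
import json
from typing import List, Optional

Table = List[List[Optional[str]]]


def to_csv_text(table: Table) -> str:
    """Serialize a table as CSV text, writing None cells as empty fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerows([["" if cell is None else cell for cell in row] for row in table])
    return buffer.getvalue()


def to_json_records(table: Table) -> str:
    """Serialize a table as a JSON array of records keyed by the header row."""
    header, *body = table
    records = [dict(zip(header, row)) for row in body]
    return json.dumps(records, ensure_ascii=False)
```

The JSON string produced this way could then be stored in whichever database the deployment uses; the text does not name one.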

Figure 6 is a schematic diagram of the table data extraction method for PDF documents in one example. In Figure 6, after the initial table and its corresponding text list are obtained, the table topic information, including the table name, unit of measurement, and currency unit, can be extracted for the initial table. In addition, for the initial table, data extraction for irregular tables (such as three-line tables, tables with missing borders, and color ladder tables), for cross-page and continuous tables, and for complex-structure tables (such as nested tables and multi-header tables) can be performed in sequence. The table data finally extracted comprises the table topic information and the table data itself, where the table data itself can be stored in JSON or a similar format.

In this solution, the extracted table data can also be shown to users, who can review and analyze it.

In one embodiment, the extracted table data can be displayed through a visual analysis system. The visual analysis system may include three views: a document overview view, a data extraction view, and a data conversion review view. The document overview view displays the PDF document. The data extraction view shows the distribution of tables of different categories extracted from the PDF document. The data conversion review view is used to review the table data extracted from the PDF document.

These three views are described in detail below.

Figure 7a is a schematic diagram of the document overview view in the visual analysis system. The document overview view includes an a1 area and an a2 area. The a1 area displays the PDF document, and the a2 area uses a two-level tree structure to summarize the composition of document elements (tables, pictures, and text) in each chapter and section of the PDF document. The root node represents the document name, the leaf nodes represent the chapters of the document, and each chapter name is shown on the line connecting the root node to the leaf node. The size of a leaf node indicates the number of document elements in the corresponding chapter. Each leaf node uses a circular treemap to show the sections contained in the chapter: each pie chart represents one section, the size of the pie chart indicates the number of document elements in that section, and the pie chart encodes the proportions of tables, pictures, and text in that section. Hovering the mouse over a pie chart displays the name of the corresponding section. Clicking a chapter name or a pie chart jumps to the corresponding position in the PDF document.

Figure 7b is a schematic diagram of the data extraction view in the visual analysis system. The left side of the view shows icons for the different table types: standard tables, three-line tables, tables with missing borders, color ladder tables, cross-page tables, continuous tables, nested tables, and multi-header tables. The right side shows, as a bar chart, the total number and review status of tables of each type. Users can select the table type to view on the right, and click the corresponding bar of interest to further inspect the review status.

Figure 7c is a schematic diagram of the data conversion review view in the visual analysis system, which lets users review and analyze the extracted table data. In the data extraction view, users can interactively select filters to view, trace, analyze, and correct the extracted table data. When reviewing table data, users can sort by clicking a column header and can drag column headers left or right to reorder columns, organizing the table content according to their own analysis habits. The "magnifying glass" icon in the upper right corner of a data table stands for conversion tracing: clicking it highlights the original PDF document corresponding to that data table in the document overview view, so that users can compare the data before and after conversion and confirm its accuracy.

Specifically, the data conversion review view may include four areas, c1 to c4, which are described below.

As shown in area c1, when the user hovers the mouse over a data row of the extracted table data, "Edit" and "Remarks" icons appear on the right. The user can click the "Edit" icon to modify and record the row as needed, or click the "Remarks" icon to directly record the content as accurate. As shown in area c2, data rows reviewed as accurate are marked with a light gray background. A "Remarks" icon is shown to the right of each reviewed data row and can be clicked at any time to view the review log. In addition, as shown in area c3, if the user finds an error in the data and clicks the "Edit" icon to the right of the data row, that row is marked with a dark gray background and a modification row with a light gray background is inserted below it. The data of the erroneous row is copied into the modification row unchanged, every data cell is editable so the user can correct it directly, and modified data is displayed in bold. Finally, as shown in area c4, by clicking the "Remarks" icon to the right of a data row, the user can record a modification log, including whether the data is correct and any review remarks.

In summary, this solution first parses the obtained PDF document and extracts the tables in it, and then reconstructs and otherwise processes the extracted tables to achieve data conversion. Specifically, for table data, this solution uses the data extraction method to obtain the topic information of a table and the table data itself. To further improve the quality of data conversion, and to address the accuracy and efficiency problems that may arise during conversion, this solution also provides a visual analysis system that makes the data comparable, traceable, and analyzable. Finally, the converted structured data is merged into a database for later retrieval and use.

All in all, this solution designs an intelligent processing strategy for PDF documents with a special content structure and style, namely the periodic reports of listed companies, which improves the quality and efficiency of structured conversion of PDF documents. It also builds a new visual analysis system for displaying the extracted table data, which additionally lets users review and analyze that data.

Corresponding to the above table data extraction method for PDF documents, an embodiment of this specification further provides a table data extraction device for PDF documents. As shown in Figure 8, the device may include:

A parsing unit 802, configured to parse the PDF document to obtain the initial table and the multi-page text content contained in it.

A conversion unit 804, configured to convert the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text.

A selection unit 806, configured to select, from the text lists, the target text list corresponding to the page on which the initial table is located.

A segmentation unit 808, configured to segment the target text list according to a preset symbol to obtain a two-dimensional text list.

A determining unit 810, configured to determine the table category of the initial table according to the first number of rows and the first number of columns of the initial table and the second number of columns of the two-dimensional text list.

The determining unit 810 is specifically configured to: determine that the table category is a three-line table if the first number of rows is less than a preset number of rows and the first number of columns equals the second number of columns; determine that the table category is a table with missing borders if the difference between the second number of columns and the first number of columns equals a preset number of columns; and determine that the table category is a color ladder table if that difference is greater than the preset number of columns.

A reconstruction unit 812, configured to reconstruct the initial table according to the determined table category to obtain a reconstructed table.

The determining unit 810 is further configured to determine the reconstructed table as the table data extracted from the PDF document.
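The category rule applied by the determining unit 810 can be sketched as a small function. The preset values used as defaults below (`preset_rows=3`, `preset_cols=2`) are placeholders, not values given in the text, and the "standard" fallback for tables matching none of the three rules is an assumption.

```python
def classify_table(
    first_rows: int,
    first_cols: int,
    second_cols: int,
    preset_rows: int = 3,
    preset_cols: int = 2,
) -> str:
    """Classify the initial table from its size and the 2-D text list width.

    first_rows / first_cols: rows and columns of the initial table
    second_cols:             columns of the two-dimensional text list
    """
    if first_rows < preset_rows and first_cols == second_cols:
        return "three-line"
    diff = second_cols - first_cols
    if diff == preset_cols:
        return "missing-border"
    if diff > preset_cols:
        return "color-ladder"
    return "standard"  # assumed fallback, not stated in the text
```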

In one embodiment, there are two reconstructed tables: a first reconstructed table located on the previous page and a second reconstructed table located on the next page. The device further includes:

A judgment unit 814, configured to judge whether a first condition is satisfied, the first condition including: the last row of the first reconstructed table matches the last row of the corresponding first text list; the first row of the second reconstructed table matches the first row of the corresponding second text list; and the first and second reconstructed tables have the same number of columns, or the header data of the first reconstructed table is consistent with that of the second reconstructed table;

The judgment unit 814 is further configured to judge, when the first condition is satisfied, whether the similarity between the last row of the first reconstructed table and the first row of the second reconstructed table is greater than a preset threshold; if so, to merge the first and second reconstructed tables after removing the duplicate header data, obtaining a merged table; and if not, to obtain the merged table by merging the last row of the first reconstructed table with the first row of the second reconstructed table.

In one embodiment, the device further includes a splitting unit 816;

The judgment unit 814 is further configured to judge whether the merged table contains an intermediate row with exactly one non-None value;

The splitting unit 816 is configured to split the merged table into an upper part and a lower part, with the intermediate row as the boundary, if the merged table is judged to contain an intermediate row with exactly one non-None value.

In one embodiment, the two parts include a first split table and a second split table; the device further includes a merging unit 818;

The merging unit 818 is configured to obtain the header data of the first/second split table and, if the header data has more than one row and one of the rows contains None while the other does not, to merge the two rows to obtain the target table.

In one embodiment, the reconstruction unit 812 is specifically configured to:

When the table category is a three-line table or a color ladder table, split each row of the area corresponding to the initial table in the target text list by spaces, cluster the resulting one-dimensional lists to determine the target number of columns, and fill the contents of the initial table into a table having the target number of columns and the number of rows contained in the corresponding area, to obtain the reconstructed table;

When the table category is a table with missing borders, pad the left and right columns of the initial table and fill the missing content in the padded initial table with None, to obtain the corresponding reconstructed table.
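For the three-line and color ladder case, the step of deriving the target number of columns from the space-split rows can be sketched as below. The text does not specify the clustering method, so this sketch substitutes a much simpler stand-in: it groups the split rows by their token count and takes the most frequent count. The name `target_column_count` is hypothetical.

```python
from collections import Counter
from typing import List


def target_column_count(region_lines: List[str]) -> int:
    """Estimate the target number of columns for a table region.

    Stand-in for the clustering step: rows are grouped by how many
    space-separated tokens they contain, and the largest group wins.
    """
    token_lists = [line.split() for line in region_lines if line.strip()]
    counts = Counter(len(tokens) for tokens in token_lists)
    return counts.most_common(1)[0][0] if counts else 0
```

Cells that themselves contain spaces would be over-split by this approach, which is presumably one reason the text clusters the rows instead of trusting any single one.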

In one embodiment, the device further includes:

A matching unit 820, configured to match the first i rows of the initial table against the target text list to determine the starting row of the initial table in the target text list;

An extraction unit 822, configured to extract, when the number of all rows preceding the starting row in the target text list is not less than a preset number, the corresponding area from the target text list according to the starting row and the preset number, as the area where the table topic information is located;

The determining unit 810 is further configured to calculate, when the number of all rows is less than the preset number, the difference between the preset number and the number of all rows, and to determine the area where the table topic information is located according to the difference, the target text list, and another text list, where the other text list is the text list corresponding to the text content of the page preceding the page on which the initial table is located;

The determining unit 810 is further configured to determine the table topic information by extracting keywords from the area where the table topic information is located.

In one embodiment, the determining unit 810 is specifically configured to:

Take the rows counted backward from the last row of the other text list, as many as the difference, as supplementary content placed before the target text list;

Determine the target text list with the supplementary content added as the area where the table topic information is located.

In one embodiment, the conversion unit 804 is specifically configured to:

For the text content of a given page, split it into multiple lines of text at line breaks; the multiple lines of text form the corresponding text list.
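The conversion of page text into a text list, together with the later segmentation into a two-dimensional text list, can be sketched as follows. The function names are hypothetical, and the sketch assumes that the preset symbol is a single space.

```python
from typing import List


def page_to_text_list(page_text: str) -> List[str]:
    """Split one page of text into lines at line breaks."""
    return page_text.splitlines()


def to_two_dimensional(text_list: List[str], symbol: str = " ") -> List[List[str]]:
    """Split every line at the preset symbol (assumed here to be a space)."""
    return [line.split(symbol) for line in text_list]
```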

The functions of the functional units of the device in the above embodiment of this specification can be implemented by the steps of the above method embodiment. Therefore, the specific working process of the device provided by an embodiment of this specification is not repeated here.

The table data extraction device for PDF documents provided by an embodiment of this specification can greatly improve the efficiency and accuracy of table data extraction.

The embodiments in this specification are described in a progressive manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the server embodiment is basically similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.

The steps of the methods or algorithms described in connection with the disclosure of this specification may be implemented in hardware, or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in RAM, flash memory, ROM, EPROM, EEPROM, registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be an integral part of the processor. The processor and the storage medium may reside in an ASIC. In addition, the ASIC may reside in a server. Of course, the processor and the storage medium may also reside in the server as discrete components.

Those skilled in the art should appreciate that, in one or more of the above examples, the functions described in the present invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium. Computer-readable media include computer storage media and communication media, where communication media include any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a general-purpose or special-purpose computer.

Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the figures do not necessarily require the particular order shown, or sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

The specific implementations described above further explain the purpose, technical solutions, and beneficial effects of this specification in detail. It should be understood that the above are merely specific implementations of this specification and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made on the basis of the technical solutions of this specification shall fall within the scope of protection of this specification.

Claims

1. A table data extraction method for PDF documents, including:

Parse the PDF document to obtain the initial table and multi-page text content contained in it;

Convert the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text;

From the text lists, select the target text list corresponding to the page where the initial table is located;

Segment the target text list according to preset symbols to obtain a two-dimensional text list;

Determine the table category of the initial table based on the first row number and first column number of the initial table and the second column number of the two-dimensional text list;

Determining the table category of the initial table includes: if the first row number is less than a preset number of rows and the first column number equals the second column number, determining that the table category is a three-line table; if the difference between the second column number and the first column number equals a preset number of columns, determining that the table category is a table with missing borders; if the difference between the second column number and the first column number is greater than the preset number of columns, determining that the table category is a color ladder table;

Reconstruct the initial table according to the determined table category to obtain a reconstructed table;

Determine the reconstructed table as table data extracted from the PDF document;

Reconstructing the initial table includes:

When the table category is a three-line table or a color ladder table, segment each row of the area of the target text list corresponding to the initial table by spaces to obtain several one-dimensional lists, cluster the one-dimensional lists to determine a target number of columns, and fill the content of the initial table into a table having the target number of columns and the number of rows of the corresponding area, to obtain the reconstructed table;

When the table category is a table with missing borders, fill in the left and right columns of the initial table, then fill any content still missing from the column-filled initial table with None, to obtain the corresponding reconstructed table.
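The category rules and the column-count step of claim 1 can be illustrated with a minimal Python sketch. The function names, the default values `preset_rows=3` and `preset_cols=2`, and the category labels are hypothetical and do not come from the claims. The claims say only that the one-dimensional lists are clustered; the sketch substitutes the most frequent token count per line as a simple stand-in for that clustering.

```python
from collections import Counter
from typing import List, Optional


def classify_table(first_rows: int, first_cols: int, second_cols: int,
                   preset_rows: int = 3, preset_cols: int = 2) -> Optional[str]:
    """Apply the three category rules of claim 1 in order.

    preset_rows and preset_cols are assumed values; the claims leave them open.
    Returns None when no rule applies.
    """
    if first_rows < preset_rows and first_cols == second_cols:
        return "three_line"
    diff = second_cols - first_cols
    if diff == preset_cols:
        return "missing_border"
    if diff > preset_cols:
        return "color_ladder"
    return None


def target_column_count(region_lines: List[str]) -> int:
    """Split each non-empty line on whitespace and return the most common
    token count (a stand-in for the clustering named in the claim)."""
    lengths = [len(line.split()) for line in region_lines if line.strip()]
    if not lengths:
        return 0
    return Counter(lengths).most_common(1)[0][0]
```

With these assumed presets, an initial table of 5 rows and 3 columns on a page whose two-dimensional text list has 5 columns would be labeled as a table with missing borders, because the column difference equals the preset of 2.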

2. The method according to claim 1, wherein the number of reconstructed tables is two, and the two reconstructed tables include a first reconstructed table located on a previous page and a second reconstructed table located on a next page; the method further comprises:

Determine whether a first condition is met, the first condition including: the last row of the first reconstructed table matches the last row of the corresponding first text list; the first row of the second reconstructed table matches the first row of the corresponding second text list; and the numbers of columns of the first reconstructed table and the second reconstructed table are equal, or the header data of the first reconstructed table and the second reconstructed table are consistent;

When the first condition is met, determine whether the similarity between the last row of the first reconstructed table and the first row of the second reconstructed table is greater than a preset threshold; if so, merge the first reconstructed table and the second reconstructed table after removing the repeated header data to obtain a merged table; if not, obtain the merged table by merging the last row and the first row.
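The merge decision of claim 2 might look as follows in Python, assuming the first condition has already been checked by the caller. The similarity measure (`difflib.SequenceMatcher` over the joined cell text), the threshold of 0.8, and the choice to drop the first row of the second table as the repeated data are all assumptions made for illustration; the claims do not specify them.

```python
from difflib import SequenceMatcher
from typing import List, Optional

Row = List[Optional[str]]


def row_text(row: Row) -> str:
    """Join the cells of a row into one string, treating None as empty."""
    return " ".join("" if cell is None else str(cell) for cell in row)


def row_similarity(a: Row, b: Row) -> float:
    """Assumed similarity measure: character-level ratio of the joined rows."""
    return SequenceMatcher(None, row_text(a), row_text(b)).ratio()


def merge_cells(a: Optional[str], b: Optional[str]) -> Optional[str]:
    """Concatenate two cell fragments, keeping whichever side is not None."""
    if a is None:
        return b
    if b is None:
        return a
    return a + b


def merge_across_pages(first: List[Row], second: List[Row],
                       threshold: float = 0.8) -> List[Row]:
    """Merge two reconstructed tables from consecutive pages."""
    if not first or not second:
        return first + second
    if row_similarity(first[-1], second[0]) > threshold:
        # Rows are near-identical: treat the second one as repeated and drop it.
        return first + second[1:]
    # Otherwise treat the two rows as one row broken by the page boundary.
    joined = [merge_cells(a, b) for a, b in zip(first[-1], second[0])]
    return first[:-1] + [joined] + second[1:]
```

The second branch covers a row whose cell text was cut by the page break, so its fragments are concatenated cell by cell rather than kept as two rows.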

3. The method of claim 2, further comprising:

Determine whether the merged table contains an intermediate row that has only one non-None value;

If so, divide the merged table into an upper part and a lower part at the intermediate row.
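A minimal Python sketch of the split in claim 3 follows. The claims do not say which part the intermediate row itself belongs to, so this sketch returns it separately and leaves that decision to the caller; the function names are hypothetical.

```python
from typing import List, Optional, Tuple

Row = List[Optional[str]]


def find_separator_row(table: List[Row]) -> Optional[int]:
    """Return the index of the first interior row with exactly one non-None
    cell, or None if there is no such row."""
    for i in range(1, len(table) - 1):
        if sum(cell is not None for cell in table[i]) == 1:
            return i
    return None


def split_table(table: List[Row]) -> Optional[Tuple[List[Row], Row, List[Row]]]:
    """Split a merged table into (upper part, separator row, lower part)."""
    idx = find_separator_row(table)
    if idx is None:
        return None
    return table[:idx], table[idx], table[idx + 1:]
```

Only interior rows are examined, so a sparse first or last row does not trigger a split.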

4. The method of claim 3, wherein the two parts comprise a first split table and a second split table;

For each of the first split table and the second split table, obtain the header data; if the header data has more than one row, and one row contains None while the other row does not, merge the two rows to obtain the target table.
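The header merge of claim 4 could be sketched as below. How the header rows are identified is not stated in the claims, so the function simply receives them as input. Joining the two cells of each column with a space is an assumed merge rule, and `merge_header_rows` is a hypothetical name.

```python
from typing import List, Optional

Row = List[Optional[str]]


def merge_header_rows(header: List[Row]) -> List[Row]:
    """Merge the first two header rows when exactly one of them contains None.

    Cells are merged column by column; the space separator is an assumption.
    """
    if len(header) < 2:
        return header
    top, bottom = header[0], header[1]
    top_has_none = any(cell is None for cell in top)
    bottom_has_none = any(cell is None for cell in bottom)
    if top_has_none == bottom_has_none:
        return header  # both complete or both sparse: leave unchanged
    merged: Row = []
    for a, b in zip(top, bottom):
        parts = [p for p in (a, b) if p is not None]
        merged.append(" ".join(parts) if parts else None)
    return [merged] + header[2:]
```

A two-level header such as `["Item", None, None]` over `["Name", "Q1", "Q2"]` collapses into the single row `["Item Name", "Q1", "Q2"]`.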

5. The method of claim 1, further comprising:

Match the first i rows of the initial table with the target text list to determine the starting row of the initial table in the target text list;

When the total number of lines in the target text list counted from the starting line is not less than a preset number, extract the corresponding area from the target text list based on the starting line and the preset number as the area where the table subject information is located;

When the total number of lines is less than the preset number, calculate the difference between the preset number and the total number of lines, and determine the area where the table subject information is located based on the difference, the target text list, and another text list, where the other text list is the text list corresponding to the text content of the page preceding the page where the initial table is located;

Determine the table subject information by extracting keywords from the area where the table subject information is located.

6. The method according to claim 5, wherein determining the area where the table subject information is located includes:

The lines of the other text list, counted backward from its last line and equal in number to the difference, are used as supplementary content placed before the target text list;

The target text list with the supplementary content added is determined as the area where the table subject information is located.
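Claims 5 and 6 can be illustrated with the following Python sketch. It reads the claims as selecting the lines that precede the table's starting line, borrowing from the end of the previous page's text list when the current page has too few; that reading is an interpretation of the translated wording, not something the claims state outright. The names `subject_area`, `start`, and `preset` are hypothetical, and keyword extraction is not shown.

```python
from typing import List, Optional


def subject_area(target: List[str], start: int, preset: int,
                 previous: Optional[List[str]] = None) -> List[str]:
    """Return up to `preset` lines of context located before the table.

    target   : text list of the page containing the table
    start    : index of the table's starting line in `target`
    preset   : assumed number of context lines to collect
    previous : text list of the preceding page, if any
    """
    above = target[:start]
    if len(above) >= preset:
        return above[-preset:]
    # Not enough lines on this page: take the shortfall from the end of the
    # previous page and place it before the lines already collected.
    missing = preset - len(above)
    supplement = (previous or [])[-missing:]
    return supplement + above
```

When the table sits at the very top of a page, all of the context therefore comes from the last lines of the previous page.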

7. The method of claim 1, wherein converting the multiple pages of text content into multiple text lists includes:

For each page of text content, split the text content of the page into multiple lines of text at line breaks, the multiple lines of text forming the corresponding text list.
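The conversion in claim 7, together with the segmentation step of claim 1, might be sketched in Python as follows. The claims refer only to "preset symbols"; using runs of whitespace as the delimiter is an assumption, and both function names are hypothetical.

```python
import re
from typing import List


def page_to_text_list(page_text: str) -> List[str]:
    """Split one page of text into lines at line breaks."""
    return page_text.splitlines()


def segment_text_list(text_list: List[str], pattern: str = r"\s+") -> List[List[str]]:
    """Split every non-empty line on the preset symbols (assumed here to be
    whitespace) to build a two-dimensional text list."""
    return [re.split(pattern, line.strip()) for line in text_list if line.strip()]
```

Each inner list then holds the candidate cells of one text line, which is the form the later column-count comparison works on.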

8. A visual analysis system, including:

A document overview view, used to display the target PDF document;

A data extraction view, used to display the distribution of different table categories extracted from the target PDF document;

A data conversion review view, used to display table data extracted from the target PDF document according to the method of claim 1.

9. A table data extraction device for PDF documents, including:

The parsing unit is used to parse the PDF document and obtain the initial table and multi-page text content contained in it;

A conversion unit used to convert the multi-page text content into corresponding text lists, where a single text list includes multiple lines of text;

A selection unit, configured to select, from the text lists, the target text list corresponding to the page where the initial table is located;

A segmentation unit, used to segment the target text list according to preset symbols to obtain a two-dimensional text list;

A determining unit, configured to determine the table category of the initial table based on the first row number and first column number of the initial table and the second column number of the two-dimensional text list;

The determining unit is specifically configured to: if the first row number is less than a preset number of rows and the first column number equals the second column number, determine that the table category is a three-line table; if the difference between the second column number and the first column number equals a preset number of columns, determine that the table category is a table with missing borders; if the difference between the second column number and the first column number is greater than the preset number of columns, determine that the table category is a color ladder table;

A reconstruction unit, configured to reconstruct the initial table according to the determined table category to obtain a reconstructed table;

The determining unit is also used to determine the reconstructed table as table data extracted from the PDF document;

The reconstruction unit is specifically used for: