WO2021042507A1 - Method and device for extracting table data from pdf file, and storage medium

Info

Publication number: WO2021042507A1
Authority: WO, WIPO (PCT)
Prior art keywords: header, row, keywords, data, extracting
Application number: PCT/CN2019/116528
Other languages: French (fr), Chinese (zh)
Inventors: 王凯, 邓会林, 顾杨
Original Assignee: 苏州朗动网络科技有限公司
Application filed by: 苏州朗动网络科技有限公司

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries

Landscapes

  • Engineering & Computer Science
  • Theoretical Computer Science
  • Computational Linguistics
  • Data Mining & Analysis
  • Databases & Information Systems
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Artificial Intelligence
  • Document Processing Apparatus
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

A method and a device for extracting table data from a PDF file, and a storage medium. The method comprises: extracting table information from a PDF file (S1); in the table information, searching for keywords of a table header, and positioning, according to weights or a combination of the keywords of a table header, a row where the table header of a table is located (S2); traversing, from the next row of the row where the table header is located, downwards the data format of cells in each row, and positioning, according to a change of the data format, a row where a table footer of the table is located (S3); and acquiring data information of the table according to the table header and the table footer of the table (S4). The method for extracting table data can automatically extract data from a PDF table in batches, solving the problems of time consumption and manpower consumption, and providing an extraction result having a small error and extracted data having a high accuracy.

Description

Method, device and storage medium for extracting table data from a PDF document
Technical field
The invention relates to the field of computers, and in particular to a method, device and storage medium for extracting table data from a PDF document.
Background
With the rapid development of digitization and informatization, extracting data from all kinds of unstructured documents has become a headache for many people.
If you try to find reports in a particular format among a large number of PDF files, you will find it very time-consuming and hard on the eyes. Storing the contents of the tables in a large number of PDF files in a database is an even bigger undertaking, and an error-prone one.
Summary of the invention
The purpose of the present invention is to provide a method, device and storage medium for extracting table data from a PDF document.
To achieve one of the above objects, an embodiment of the present invention provides a method for extracting table data from a PDF document, the method including:
extracting table information from the PDF document;
searching the table information for header keywords, and locating the header row of a table according to the weights or the combination of the header keywords;
traversing downward, starting from the row after the header row, the data format of the cells in each row, and locating the footer row of the table according to a change in the data format;
obtaining the data information of the table according to the header and the footer of the table.
As a further improvement of an embodiment of the present invention, the method further includes:
discarding the columns of the table that contain no header keyword.
As a further improvement of an embodiment of the present invention, the step "searching the table information for header keywords, and locating the header row of a table according to the weights of the header keywords" specifically includes:
finding one or more header keywords in a row of the table information;
obtaining the weights of the one or more header keywords, and calculating their overall weight;
if the overall weight exceeds a weight threshold, identifying the row containing the header keywords as the header row of the table.
As a further improvement of an embodiment of the present invention, the step "obtaining the weight of each header keyword" specifically includes:
obtaining the header keywords of tables in historical PDF documents and their word frequencies;
calculating the weights of the header keywords from the word frequencies to obtain a header keyword weight list;
looking up the header keyword weight list to obtain the weight of each header keyword.
As a further improvement of an embodiment of the present invention, the step "searching the table information for header keywords, and locating the header row of a table according to the combination of the header keywords" specifically includes:
finding multiple header keywords in a row of the table information;
determining whether the multiple header keywords form a combined keyword, and if so, identifying the row containing them as the header row of the table.
As a further improvement of an embodiment of the present invention, the step "determining whether the multiple header keywords form a combined keyword" specifically includes:
obtaining the combinations of header keywords of tables in historical PDF documents to obtain a header combination keyword list;
determining whether the multiple header keywords match a combined keyword in the header combination keyword list.
As a further improvement of an embodiment of the present invention, the step "traversing downward, starting from the row after the header row, the data format of the cells in each row, and locating the footer row of the table according to a change in the data format" specifically includes:
if the data format of a row differs from that of the previous row, identifying the previous row as the footer row of the table.
As a further improvement of an embodiment of the present invention, the method further includes:
checking whether the data information of the table meets the specifications, and if so, storing the data information in a database.
To achieve one of the above objects, an embodiment of the present invention provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of any one of the above methods for extracting table data from a PDF document.
To achieve one of the above objects, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of any one of the above methods for extracting table data from a PDF document.
Compared with the prior art, the method of the present invention for extracting table data from PDF documents can automatically extract the data in PDF tables in batches, which saves time and manpower; the error of the extraction results is small and the accuracy of the extracted data is high.
Description of the drawings
Fig. 1 is a schematic flowchart of the method of the present invention for extracting table data from a PDF document.
Fig. 2 is a schematic flowchart of an embodiment of step S2 in Fig. 1.
Detailed description
The present invention is described in detail below with reference to the specific embodiments shown in the drawings. These embodiments do not limit the present invention, and structural, methodological or functional changes made by those of ordinary skill in the art on the basis of these embodiments all fall within the protection scope of the present invention.
As shown in Fig. 1, the method of the present invention for extracting table data from a PDF document includes:
Step S1: Extract table information from the PDF document.
PDF originated from the Camelot project, whose purpose was to create a universal document exchange format supporting multiple machine platforms, operating systems and communication networks. The goal was for documents to be viewable on any display and printable on any modern printer. PDF is based on PostScript, a page description language that solves the problem of displaying and printing anywhere. A PDF contains the components a document needs to be "viewable and printable anywhere", such as characters, fonts, graphics and images.
A PDF document contains many instructions for placing text (or other components). These instructions position page elements using x and y coordinates whose origin is the lower-left corner of the page. A word is simulated by placing several characters close together; similarly, white space is simulated by increasing the spacing between characters. How, then, is a table simulated? By laying the characters out as they would appear in a spreadsheet.
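To make the coordinate-based layout concrete, the following is a minimal sketch of how positioned text fragments could be regrouped into table rows. It is not part of the described method (which delegates this work to tabula); the function name `group_into_rows`, the `(x, y, text)` fragment shape and the `y_tolerance` parameter are assumptions made for illustration.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of cells.

    Hypothetical helper: fragments whose y coordinates differ by at most
    y_tolerance from the first fragment of a row are treated as one row.
    Rows come out top to bottom (larger y first, because the PDF origin
    is the lower-left corner) and cells left to right.
    """
    rows = []  # each entry: (reference_y, [(x, text), ...])
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Illustrative fragments only; real coordinates come from the PDF content stream.
fragments = [
    (200.0, 700.0, "Sales Amount"),
    (50.0, 700.5, "Customer"),
    (50.0, 680.0, "Company A"),
    (200.0, 680.2, "1200.00"),
]
print(group_into_rows(fragments))
```

Because the file stores only positions, any row or column structure has to be inferred this way, which is why a dedicated extraction tool is used in step S1.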
PDF has no internal representation for a table, which makes tabular data hard to extract for analysis. Unfortunately, a great deal of open data is stored in PDF files, even though the format was not designed to support tabular data well. Third-party open source tools such as tabula or Camelot can, however, extract table data from PDF files.
The present invention preferably uses the third-party open source tool tabula to extract all the table information in a PDF file and gather it together. The extracted table information therefore includes one or more tables. Table 1 below is an example of a table:
Table 1 (provided as image PCTCN2019116528-appb-000001 in the original publication)
Step S2: In the table information, search for header keywords, and locate the header row of a table according to the weights or the combination of the header keywords.
In this step, the headers of tables in historical PDF documents are analyzed in advance to determine which keywords they contain and how often each keyword occurs, i.e. its word frequency. The weight of each header keyword is calculated from its word frequency, and the weights are compiled into a header keyword weight list. The list may look like: [{"Customer", 25%}, {"Sales Amount", 18%}, {"Proportion", 11%}…]. The header keywords are then searched for in the extracted table information, and the header row of a table is located according to their weights. Because some tables in PDF files are unusual, locating the header by keyword weight improves the accuracy of the positioning.
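The weight list could be built along the following lines. The text only says that weights are calculated from word frequency, so the formula used here (each keyword's share of all keyword occurrences) is an assumption, as are the function name `build_weight_list` and the sample headers.

```python
from collections import Counter


def build_weight_list(historical_headers):
    """Map each header keyword to its share of all keyword occurrences.

    Assumption: weight = occurrences of the keyword / total occurrences.
    The source does not state the exact formula.
    """
    counts = Counter(kw for header in historical_headers for kw in header)
    total = sum(counts.values())
    return {kw: n / total for kw, n in counts.items()}


# Illustrative historical headers, not data from the publication.
historical_headers = [
    ["Customer", "Sales Amount", "Proportion"],
    ["Customer", "Sales Amount"],
    ["Customer", "Amount"],
]
weights = build_weight_list(historical_headers)
for keyword, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{keyword}: {weight:.0%}")
```

With this normalization the weights sum to 1, so a threshold can be read as "the fraction of typical header vocabulary present in the row".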
As shown in Fig. 2, the specific steps are as follows:
Step S21: Find one or more header keywords in a row of the table information;
Referring to Table 1, the table information is searched and the row containing "Serial Number, Customer, Sales Amount, Annual Sales Proportion, Whether a Related-Party Relationship Exists" is located, in which the header keywords "Customer", "Sales Amount" and "Proportion" are found.
Step S22: Obtain the weights of the one or more header keywords, and calculate their overall weight;
The overall weight is the sum of the weights of the one or more header keywords. The weight of each header keyword is obtained by looking it up in the header keyword weight list, and adding the weights of all the keywords found gives the overall weight.
Step S23: If the overall weight exceeds the weight threshold, identify the row containing the header keywords as the header row of the table.
Because header keywords may also appear outside the header, a weight threshold is needed to delimit the header. The threshold may be set as follows: an initial weight threshold is derived from historical data and then corrected according to the accuracy of the extracted headers.
The above steps locate the header row of a table.
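Steps S21 to S23 can be sketched as follows. This is an illustration under assumptions: rows are lists of strings, a keyword counts as found when it occurs as a substring of a cell (as "Proportion" does in "Annual Sales Proportion"), and the name `find_header_row` and the threshold value 0.4 are hypothetical.

```python
def find_header_row(rows, weights, threshold):
    """Return the index of the first row whose overall weight exceeds threshold.

    Returns None when no row qualifies. Each keyword is counted once per row.
    """
    for index, row in enumerate(rows):
        # S21: find the header keywords present in this row.
        found = {kw for kw in weights if any(kw in cell for cell in row)}
        # S22: the overall weight is the sum of the found keywords' weights.
        overall = sum(weights[kw] for kw in found)
        # S23: compare against the weight threshold.
        if overall > threshold:
            return index
    return None


weights = {"Customer": 0.25, "Sales Amount": 0.18, "Proportion": 0.11}
rows = [
    ["Sales to the top five Customer accounts", "", "", "", ""],
    ["No.", "Customer", "Sales Amount", "Annual Sales Proportion", "Related Party"],
    ["1", "Company A", "1,200.00", "25.00%", "No"],
]
print(find_header_row(rows, weights, threshold=0.4))
```

The first row mentions only one keyword (overall weight 0.25) and is rejected, which is the case the threshold is meant to handle: keywords appearing outside the header.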
In addition, some header keywords appear in combination. A header combination keyword list can therefore be obtained by analyzing in advance which combinations of keywords occur in the headers of tables in historical PDF documents. For example, the header combination keyword list may be: [{supplier name, purchase amount, proportion, related-party relationship}, {customer, amount, proportion}, {entity name, operating income, amount incurred in the current period, relationship with the company}…]. Multiple header keywords are then searched for in the extracted table information, and it is determined whether they form a header combination keyword; if so, the header row of a table is located. The specific steps are as follows:
Step S24: Find multiple header keywords in a row of the table information;
Step S25: Determine whether the multiple header keywords form a combined keyword, and if so, identify the row containing them as the header row of the table;
The above steps locate the header row of a table.
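Steps S24 and S25 might look like the sketch below. It assumes that a row "has a combined keyword" when it contains every keyword of at least one listed combination, and that matching is by case-sensitive substring; both are interpretations, and the function names are hypothetical.

```python
def matches_combination(row, combinations):
    """Return True if the row contains every keyword of some combination."""
    def contains(keyword):
        return any(keyword in cell for cell in row)

    return any(all(contains(kw) for kw in combo) for combo in combinations)


def find_header_row_by_combination(rows, combinations):
    """Return the index of the first row matching a combination, else None."""
    for index, row in enumerate(rows):
        if matches_combination(row, combinations):
            return index
    return None


# Combinations adapted from the examples in the text.
combinations = [
    ("Supplier Name", "Purchase Amount", "Proportion", "Relationship"),
    ("Customer", "Amount", "Proportion"),
]
rows = [
    ["Sales to major customers", "", "", "", ""],
    ["No.", "Customer", "Sales Amount", "Annual Sales Proportion", "Related Party"],
    ["1", "Company A", "1,200.00", "25.00%", "No"],
]
print(find_header_row_by_combination(rows, combinations))
```

Unlike the weight-based route, this check needs no threshold: a row either contains a complete known combination or it does not.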
Step S3: Starting from the row after the header row, traverse downward the data format of the cells in each row, and locate the footer row of the table according to a change in the data format.
If the data format of a row differs from that of the previous row, the previous row is identified as the footer row of the table. Referring to Table 1, rows 2 to 6 have the same data format and row 7 has a different format from row 6, so row 6 is defined as the footer row. Note that the totals in row 7 are not needed and are therefore discarded.
Further, to increase accuracy, when the data format of a row differs from that of the previous row, it is determined whether the data in this row contains a footer keyword (such as "总计", "合计" or "共计", all of which mean "total"). If so, the previous row is the footer row of the table. If not, it is further determined whether the data format of the row after this row is the same as that of the row before it (this mainly handles merged cells appearing in the middle of a table). If the formats differ, the previous row is the footer row of the table; if they are the same, a merged cell has appeared in the middle, and the search for the footer row continues by the same method. In addition, "-" may appear in some tables. When the change in data format is found to be caused by a "-", that row is ignored and the check continues downward to locate the footer row.
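The footer search of step S3 can be sketched as below. Several details are assumptions: the cell format categories, the use of the first data row's format as the reference for later comparisons, treating "-" as a wildcard, and the English keyword "Total" added alongside the three keywords from the text. All names are hypothetical.

```python
import re

# Footer keywords from the text, plus "Total" as an assumed English equivalent.
FOOTER_KEYWORDS = ("总计", "合计", "共计", "Total")


def cell_format(cell):
    """Classify a cell as empty, dash, percent, number or text."""
    text = cell.strip()
    if text == "":
        return "empty"
    if text == "-":
        return "dash"
    if re.fullmatch(r"-?\d[\d,]*(\.\d+)?%", text):
        return "percent"
    if re.fullmatch(r"-?\d[\d,]*(\.\d+)?", text):
        return "number"
    return "text"


def row_format(row):
    return tuple(cell_format(cell) for cell in row)


def same_format(a, b):
    """Formats match cell by cell; a "-" placeholder matches anything."""
    return len(a) == len(b) and all(x == y or "dash" in (x, y) for x, y in zip(a, b))


def find_footer_row(rows, header_index):
    """Return the index of the last data row below header_index, or None."""
    first = header_index + 1
    if first >= len(rows):
        return None
    reference = row_format(rows[first])
    i = first + 1
    while i < len(rows):
        if same_format(row_format(rows[i]), reference):
            i += 1
            continue
        # Format changed: a footer keyword means the previous row is the footer.
        if any(kw in cell for cell in rows[i] for kw in FOOTER_KEYWORDS):
            return i - 1
        # Otherwise check for a merged cell: the next row matches again.
        if i + 1 < len(rows) and same_format(row_format(rows[i + 1]), reference):
            i += 2
            continue
        return i - 1
    return len(rows) - 1


rows = [
    ["No.", "Customer", "Sales Amount", "Annual Sales Proportion", "Related Party"],
    ["1", "Company A", "1,200.00", "25.00%", "No"],
    ["2", "Company B", "900.00", "-", "No"],
    ["3", "Company C", "800.00", "16.67%", "No"],
    ["Total", "", "2,900.00", "60.42%", ""],
]
print(find_footer_row(rows, header_index=0))
```

In this sample the "-" in the third row does not end the table, while the "Total" row does, so the last data row is the one at index 3.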
Step S4: Obtain the data information of the table according to the header and the footer of the table.
The table is traversed from the row after the header row down to the footer row, and the data of every row and every column of the table is extracted.
The above steps describe how the data information of one table is obtained. If there are multiple tables, the steps are repeated until the data information of all the tables has been extracted. The method of the present invention for extracting table data from a PDF document can automatically extract the data in PDF tables in batches, which saves time and manpower; the error of the extraction result is small and the accuracy of the extracted data is high.
In a preferred embodiment, the method further includes:
discarding the columns of the table that contain no header keyword.
Note that each header keyword indicates that the data in its column is needed; the data in columns without a header keyword can therefore be discarded.
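Extraction between header and footer, combined with the column filter just described, could be sketched like this. The function name `extract_table`, the dictionary-per-row output and the assumption that every row has as many cells as the header are illustrative choices, not details from the text.

```python
def extract_table(rows, header_index, footer_index, keywords):
    """Collect the data rows, keeping only columns whose title has a keyword.

    Assumes rectangular rows (each row has as many cells as the header).
    """
    header = rows[header_index]
    keep = [
        col for col, title in enumerate(header)
        if any(kw in title for kw in keywords)
    ]
    records = []
    # Data rows run from the row after the header through the footer row.
    for row in rows[header_index + 1 : footer_index + 1]:
        records.append({header[col]: row[col] for col in keep})
    return records


keywords = ["Customer", "Sales Amount", "Proportion"]
rows = [
    ["No.", "Customer", "Sales Amount", "Annual Sales Proportion", "Related Party"],
    ["1", "Company A", "1,200.00", "25.00%", "No"],
    ["2", "Company B", "900.00", "18.75%", "No"],
    ["Total", "", "2,100.00", "43.75%", ""],
]
for record in extract_table(rows, header_index=0, footer_index=2, keywords=keywords):
    print(record)
```

Here the "No." and "Related Party" columns are dropped because neither title contains a header keyword, and the "Total" row is excluded because it lies below the footer row.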
In a preferred embodiment, the method further includes:
checking whether the data information of the table meets the specifications, and if so, storing the data information in a database.
Each header keyword corresponds to a data format: for example, "Customer" corresponds to the name of a company or person, "Sales Amount" corresponds to a number, and "Proportion" should contain "%" (if it does not, its format is a number). The data information of the table is checked against these specifications and, if it conforms, stored in the database.
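The format check could be sketched as follows, using the three example keywords from the text. The regular expressions, the rule that a name is any non-empty value that is not purely numeric, and the function names are assumptions.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(\.\d+)?")
PERCENT = re.compile(r"-?\d[\d,]*(\.\d+)?%")


def is_valid(keyword, value):
    """Check one value against the format expected for its header keyword."""
    value = value.strip()
    if keyword == "Customer":
        # Assumed rule: a name is non-empty and not purely numeric.
        return value != "" and NUMBER.fullmatch(value) is None
    if keyword == "Sales Amount":
        return NUMBER.fullmatch(value) is not None
    if keyword == "Proportion":
        # Either a percentage or, without "%", a plain number.
        return (PERCENT.fullmatch(value) or NUMBER.fullmatch(value)) is not None
    return True  # no rule defined for other keywords


def record_is_valid(record):
    """A record (keyword -> value) is valid when every value passes its check."""
    return all(is_valid(keyword, value) for keyword, value in record.items())


good = {"Customer": "Company A", "Sales Amount": "1,200.00", "Proportion": "25.00%"}
bad = {"Customer": "Company B", "Sales Amount": "n/a", "Proportion": "18.75%"}
print(record_is_valid(good), record_is_valid(bad))
```

Only records that pass such a check would then be written to the database; the storage step itself is omitted here.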
The present invention also provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps of the above method for extracting table data from a PDF document.
The present invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the above method for extracting table data from a PDF document.
It should be understood that although this specification is organized by embodiment, not every embodiment contains only one independent technical solution. The specification is written this way only for clarity; those skilled in the art should treat it as a whole, and the technical solutions of the embodiments may be appropriately combined to form other embodiments that those skilled in the art can understand.
The detailed descriptions above are only specific descriptions of feasible embodiments of the present invention and are not intended to limit its protection scope. Any equivalent embodiment or modification that does not depart from the technical spirit of the present invention shall fall within the protection scope of the present invention.

Claims (10)

  1. A method for extracting table data from a PDF document, characterized in that the method includes:
    extracting table information from the PDF document;
    searching the table information for header keywords, and locating the header row of a table according to the weights or the combination of the header keywords;
    traversing the data formats of the cells in each row, starting from the row following the header row, and locating the footer row of the table according to a change in data format;
    obtaining the data information of the table according to the header and the footer of the table.
  2. The method for extracting table data from a PDF document according to claim 1, characterized in that the method further includes:
    discarding the columns of the table that contain no header keyword.
  3. The method for extracting table data from a PDF document according to claim 1, characterized in that said "searching the table information for header keywords, and locating the header row of a table according to the weights of the header keywords" specifically includes:
    finding one or more header keywords in a row of the table information;
    obtaining the weights of the one or more header keywords, and calculating the overall weight of the one or more header keywords;
    if the overall weight exceeds a weight threshold, identifying the row containing the header keywords as the header row of the table.
  4. The method for extracting table data from a PDF document according to claim 3, characterized in that said "obtaining the weight of each header keyword" specifically includes:
    obtaining the header keywords of tables in historical PDF documents and their word frequencies;
    calculating the weights of the header keywords from the word frequencies to obtain a header keyword weight list;
    looking up the header keyword weight list to obtain the weight of each header keyword.
  5. The method for extracting table data from a PDF document according to claim 1, characterized in that said "searching the table information for header keywords, and locating the header row of a table according to the combination of the header keywords" specifically includes:
    finding multiple header keywords in a row of the table information;
    determining whether the multiple header keywords include a combination keyword, and if so, identifying the row containing the multiple header keywords as the header row of the table.
  6. The method for extracting table data from a PDF document according to claim 5, characterized in that said "determining whether the multiple header keywords include a combination keyword" specifically includes:
    obtaining the combinations of header keywords of tables in historical PDF documents to obtain a header combination keyword list;
    determining whether the multiple header keywords include a combination keyword from the header combination keyword list.
  7. The method for extracting table data from a PDF document according to claim 1, characterized in that said "traversing the data formats of the cells in each row, starting from the row following the header row, and locating the footer row of the table according to a change in data format" specifically includes:
    if the data format of a row differs from that of the previous row, identifying the previous row as the footer row of the table.
  8. The method for extracting table data from a PDF document according to claim 1, characterized in that the method further includes:
    checking whether the data information of the table conforms to the specification, and if so, storing the data information in a database.
  9. An electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, characterized in that the processor, when executing the program, implements the steps of the method for extracting table data from a PDF document according to any one of claims 1-8.
  10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method for extracting table data from a PDF document according to any one of claims 1-8.
PCT/CN2019/116528 2019-09-02 2019-11-08 Method and device for extracting table data from pdf file, and storage medium WO2021042507A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910821962.6 2019-09-02
CN201910821962.6A CN110516048A (en) 2019-09-02 2019-09-02 The extracting method, equipment and storage medium of list data in pdf document

Publications (1)

Publication Number Publication Date
WO2021042507A1 (en) 2021-03-11

Family

ID=68629147

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116528 WO2021042507A1 (en) 2019-09-02 2019-11-08 Method and device for extracting table data from pdf file, and storage medium

Country Status (2)

Country Link
CN (1) CN110516048A (en)
WO (1) WO2021042507A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105589841A (en) * 2016-01-15 2016-05-18 同方知网(北京)技术有限公司 Portable document format (PDF) document form identification method
CN108197216A (en) * 2017-12-28 2018-06-22 深圳市巨鼎医疗设备有限公司 A kind of method of information processing
CN108470021A (en) * 2018-03-26 2018-08-31 阿博茨德(北京)科技有限公司 The localization method and device of table in PDF document
CN108595402A (en) * 2018-04-28 2018-09-28 西安极数宝数据服务有限公司 A kind of system of extraction PDF form datas
CN108734089A (en) * 2018-04-02 2018-11-02 腾讯科技(深圳)有限公司 Identify method, apparatus, equipment and the storage medium of table content in picture file
US10303938B2 (en) * 2016-12-29 2019-05-28 Factset Research Systems Inc Identifying a structure presented in portable document format (PDF)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103034633B (en) * 2011-09-30 2016-08-03 国际商业机器公司 Generate the method and device of the result of page searching summary of extension
KR101541306B1 (en) * 2013-11-11 2015-08-04 주식회사 엘지씨엔에스 Computer enabled method of important keyword extraction, server performing the same and storage media storing the same
CN105518667B (en) * 2014-06-30 2019-06-18 微软技术许可有限责任公司 Understand method, system and the computer storage medium of the table for search
US10078629B2 (en) * 2015-10-22 2018-09-18 International Business Machines Corporation Tabular data compilation
CN106709032B (en) * 2016-12-29 2019-12-20 深圳市华傲数据技术有限公司 Method and device for extracting structured information in electronic form document
CN107748803B (en) * 2017-11-20 2021-02-09 中国运载火箭技术研究院 Method for designing spatial situation characteristic event database

Also Published As

Publication number Publication date
CN110516048A (en) 2019-11-29

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944421

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944421

Country of ref document: EP

Kind code of ref document: A1