WO2021042507A1

WO2021042507A1 - Method and device for extracting table data from pdf file, and storage medium

Info

Publication number: WO2021042507A1
Application number: PCT/CN2019/116528
Authority: WO
Inventors: 王凯; 邓会林; 顾杨
Original assignee: 苏州朗动网络科技有限公司
Priority date: 2019-09-02
Filing date: 2019-11-08
Publication date: 2021-03-11
Also published as: CN110516048A

Abstract

A method and a device for extracting table data from a PDF file, and a storage medium. The method comprises: extracting table information from a PDF file (S1); in the table information, searching for keywords of a table header, and positioning, according to weights or a combination of the keywords of a table header, a row where the table header of a table is located (S2); traversing, from the next row of the row where the table header is located, downwards the data format of cells in each row, and positioning, according to a change of the data format, a row where a table footer of the table is located (S3); and acquiring data information of the table according to the table header and the table footer of the table (S4). The method for extracting table data can automatically extract data from a PDF table in batches, solving the problems of time consumption and manpower consumption, and providing an extraction result having a small error and extracted data having a high accuracy.

Description

Method, equipment and storage medium for extracting table data in pdf document

Technical field

The invention relates to the field of computers, and in particular to a method, equipment and storage medium for extracting table data in a pdf document.

Background technique

With the rapid development of digitization and informatization, extracting data from various unstructured documents has become a persistent difficulty.

Finding a report in a particular format among a large number of pdf files is time-consuming and tiring work. Storing the contents of the tables in a large number of pdf files in a database is a major undertaking and is prone to error.

Summary of the invention

The purpose of the present invention is to provide a method, equipment and storage medium for extracting table data in a pdf document.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides a method for extracting table data in a pdf document, the method including:

Extract table information from pdf documents;

In the table information, look up the table header keywords, and locate the row of the table header of a table according to the weight or combination of the table header keywords;

Traverse the data format of the cells in each row starting from the next row of the row where the header is located, and locate the row where the footer of the table is located according to the change of the data format;

Obtain the data information of the table according to the header and footer of the table.

As a further improvement of an embodiment of the present invention, the method further includes:

Discard the column in the table where the header keyword does not exist.

As a further improvement of an embodiment of the present invention, the "look up the header keywords in the table information, and locate the row where the header of a table is located according to the weight of the header keywords" specifically includes:

Find one or more header keywords in a certain row of the table information;

Acquiring the weight of the one or more header keywords, and calculating the overall weight of the one or more header keywords;

If the overall weight exceeds the weight threshold, identify the row containing the header keywords as the header row of the table.

As a further improvement of an embodiment of the present invention, the "obtaining the weight of each header keyword" specifically includes:

Get the header keywords and word frequencies of the table in the historical pdf document;

Calculate the weights of the header keywords by the word frequency to obtain a list of header keyword weights;

Look up the header keyword weight list, and obtain the weight of each header keyword.

As a further improvement of an embodiment of the present invention, the "look up the table header keywords in the table information, and locate the row where the table header of a table is located according to the combination of the table header keywords" specifically includes:

In a certain row of the table information, multiple table header keywords are found;

Determine whether the multiple header keywords form a combination keyword, and if so, identify the row containing them as the header row of the table.

As a further improvement of an embodiment of the present invention, the "determining whether the multiple header keywords have combined keywords" specifically includes:

Get the combination of header keywords of the table in the historical pdf document, and get the list of header combination keywords;

Determine whether the multiple header keywords match a combination keyword in the header combination keyword list.

As a further improvement of an embodiment of the present invention, the "traverse the data format of the cells in each row starting from the next row of the row where the header is located, and locate the row where the footer of the table is located according to the change of the data format" specifically includes:

If the data format of a certain row is different from that of the previous row, identify the previous row as the footer row of the table.

As a further improvement of an embodiment of the present invention, the method further includes:

Check whether the data information of the table meets the specifications, and if so, store the data information in the database.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides an electronic device including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps in the method for extracting table data in a pdf document described in any one of the above.

In order to achieve one of the above-mentioned objects of the invention, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps in the method for extracting table data in a pdf document described in any one of the above.

Compared with the prior art, the method for extracting table data in pdf documents of the present invention can automatically extract data from pdf tables in batches, which solves the problems of time and manpower consumption; the error of the extraction results is small, and the accuracy of the extracted data is high.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for extracting table data in a pdf document of the present invention.

FIG. 2 is a schematic flowchart of an embodiment of step S2 in FIG. 1.

Detailed description

Hereinafter, the present invention will be described in detail with reference to the specific embodiments shown in the drawings. However, these embodiments do not limit the present invention, and the structural, method, or functional changes made by those skilled in the art according to these embodiments are all included in the protection scope of the present invention.

As shown in Figure 1, the method for extracting table data in a pdf document of the present invention includes:

Step S1: Extract table information from the pdf document.

PDF was born from the Camelot project, whose purpose was to create a common document exchange format supporting multiple machine platforms, operating systems and communication networks. The goal is to make documents viewable on any monitor and printable on any modern printer. PDF is based on PostScript (a page description language), which solves the problem of displaying and printing anywhere. A PDF contains the components needed for the document to be "visible and printable anywhere", for example characters, fonts, graphics and pictures.

A PDF document contains many instructions for placing text (or other components). These instructions place page elements using x and y coordinates, with the lower left corner of the page as the origin. A word is simulated by placing several characters closely together, and white space is simulated by making the character spacing larger. A table is simulated in the same way, by placing the characters as they would appear in a spreadsheet.

PDF has no internal representation for a table, which makes it difficult to extract tabular data for analysis. Unfortunately, a lot of open data is stored in pdf files even though the PDF format does not support tabular data well. However, third-party open source tools such as tabula or Camelot can extract tabular data from pdf files.
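As a non-limiting sketch of this extraction step, the following assumes the tabula-py Python wrapper around tabula (which needs a Java runtime). The function names `extract_tables` and `aggregate_rows` are illustrative and do not come from the source.

```python
from typing import List

Row = List[str]


def extract_tables(pdf_path: str) -> List[List[Row]]:
    """Read every table tabula detects in the file, one list of rows per table."""
    import tabula  # tabula-py; imported lazily because it requires Java

    frames = tabula.read_pdf(
        pdf_path,
        pages="all",
        multiple_tables=True,
        pandas_options={"header": None},  # keep the first row as data
    )
    # Convert each DataFrame to rows of strings, with empty cells as "".
    return [df.astype(object).fillna("").astype(str).values.tolist() for df in frames]


def aggregate_rows(tables: List[List[Row]]) -> List[Row]:
    """Concatenate the rows of all tables into one list of rows."""
    return [row for table in tables for row in table]
```

Keeping the first row of each table as ordinary data is a deliberate choice here, since the header row still has to be located by the later steps.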

In the present invention, the third-party open source tool tabula is preferably used to extract all table information in the pdf file and aggregate it. The table information therefore includes one or more tables. Table 1 below is an example of a table:

Table 1

Step S2: In the table information, look up the table header keywords, and locate the row of the table header of a table according to the weight or combination of the table header keywords.

In this step, the keywords that appear in the table headers of historical pdf documents, and the frequency of each keyword (the word frequency), are analyzed in advance. The weight of each header keyword is calculated from its word frequency, and the weights are collected into a header keyword weight list. The header keyword weight list can be: [{"Customer", 25%}, {"Sales Amount", 18%}, {"Proportion", 11%}...]. The header keywords are then looked up in the extracted table information, and the row where the header of a table is located is determined according to the weights of the header keywords. Because some tables in the pdf are special, using the keyword weights to locate the header can improve the accuracy of positioning.
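The construction of the weight list can be sketched as follows. The source only states that weights are calculated from word frequency; normalizing each keyword's count by the total number of keyword occurrences is an assumption made for this illustration, and `build_weight_list` and the sample history are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def build_weight_list(historical_headers: Iterable[List[str]]) -> Dict[str, float]:
    """Map each header keyword to its share of all keyword occurrences."""
    counts = Counter(kw for header in historical_headers for kw in header)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kw: n / total for kw, n in counts.items()}


# Hypothetical header keywords observed in three historical tables.
history = [
    ["Customer", "Sales Amount", "Proportion"],
    ["Customer", "Sales Amount"],
    ["Customer"],
]
weights = build_weight_list(history)
```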

As shown in Figure 2, the specific steps include the following:

Step S21: Find one or more header keywords in a certain row of the table information;

Referring to Table 1, the table information is searched, the row containing "Serial Number, Customer, Sales Amount, Annual Sales Percentage, Whether There Is an Association Relationship" is located, and the header keywords "Customer", "Sales Amount" and "Percentage" are found in it.

Step S22: Obtain the weight of the one or more header keywords, and calculate the overall weight of the one or more header keywords;

The overall weight is the sum of the weights of the one or more header keywords. By looking up the table header keyword weight list, the weight of each header keyword can be obtained, and the weights of all the one or more header keywords are added together to obtain the overall weight.

Step S23: If the overall weight exceeds the weight threshold, identify the row containing the header keywords as the header row of the table.

Since the header keywords may also appear outside the header, it is necessary to set a weight threshold to define the header. The setting process of the weight threshold may be: an initial weight threshold is given through historical data, and then the initial weight threshold is corrected by the accuracy of the extracted header.

Through the above steps, you can locate the row where the header of a table is located.
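Steps S21 to S23 can be sketched as below. The weights reuse the example values from the text; the threshold of 0.4, the substring matching rule and the sample rows are assumptions made for this illustration only.

```python
from typing import Dict, List, Optional

# Example weights from the text; the threshold value used below is hypothetical.
WEIGHTS: Dict[str, float] = {"Customer": 0.25, "Sales Amount": 0.18, "Proportion": 0.11}


def row_weight(row: List[str], weights: Dict[str, float]) -> float:
    """Sum the weights of every header keyword found in some cell of the row."""
    return sum(w for kw, w in weights.items() if any(kw in cell for cell in row))


def find_header_row(
    rows: List[List[str]], weights: Dict[str, float], threshold: float
) -> Optional[int]:
    """Return the index of the first row whose overall weight exceeds the threshold."""
    for i, row in enumerate(rows):
        if row_weight(row, weights) > threshold:
            return i
    return None


rows = [
    ["Sales by Customer group", "", "", ""],  # a keyword outside the header
    ["Serial Number", "Customer", "Sales Amount", "Proportion"],
    ["1", "Company A", "500.00", "50%"],
]
header_index = find_header_row(rows, WEIGHTS, threshold=0.4)
```

The first sample row contains one keyword but stays below the threshold, which is the situation the weight threshold is meant to handle.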

In addition, some header keywords appear in combination. A list of header combination keywords can therefore be obtained by analyzing in advance which keyword combinations appear in the table headers of historical pdf documents. For example, the header combination keyword list can be: [{supplier name, purchase amount, proportion, relationship}, {customer, amount, proportion}, {unit name, operating income, amount incurred in the current period, relationship with the company}…]. Multiple header keywords are then searched for in the extracted table information; if the keywords found in a row match a header combination keyword, that row is identified as the header row of a table. The specific steps include the following:

Step S24: Find multiple table header keywords in a certain row of the table information;

Step S25: Determine whether the multiple header keywords form a combination keyword, and if so, identify the row containing them as the header row of the table;

Through the above steps, locate the row where the header of a table is located.
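Steps S24 and S25 can be sketched as a subset test against the combination list. The two combinations are taken from the example in the text; matching whole cells case-insensitively is an assumption, and the function names and sample rows are hypothetical.

```python
from typing import FrozenSet, List, Optional, Set

# Two of the example combinations from the text.
COMBINATIONS: List[FrozenSet[str]] = [
    frozenset({"supplier name", "purchase amount", "proportion", "relationship"}),
    frozenset({"customer", "amount", "proportion"}),
]
KEYWORDS: Set[str] = set().union(*COMBINATIONS)


def keywords_in_row(row: List[str]) -> Set[str]:
    """Header keywords that match a whole cell of the row (case-insensitive)."""
    cells = {cell.strip().lower() for cell in row}
    return KEYWORDS & cells


def find_header_by_combination(rows: List[List[str]]) -> Optional[int]:
    """Return the first row whose keywords cover a complete combination."""
    for i, row in enumerate(rows):
        found = keywords_in_row(row)
        if any(combo <= found for combo in COMBINATIONS):
            return i
    return None


rows = [
    ["Note", "amount", ""],  # a single keyword, no full combination
    ["No.", "Customer", "Amount", "Proportion"],
    ["1", "Company A", "100.00", "10%"],
]
header_index = find_header_by_combination(rows)
```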

Step S3: Traverse the data format of the cells in each row starting from the next row of the row where the header is located, and locate the row where the footer of the table is located according to the change of the data format.

If the data format of a certain row is different from that of the previous row, the previous row is identified as the footer row of the table. As shown in Table 1, rows 2 to 6 share the same data format, and the data format of row 7 differs from that of row 6, so row 6 is defined as the end of the table. Note that the total in row 7 is not needed, so it is discarded.

Further, to increase accuracy, if the data format of a row is different from that of the previous row, judge whether the data in this row contains an end-of-table keyword (the end-of-table keywords can be "total" and similar terms). If it does, the previous row is the footer row. If not, judge whether the data format of the row after this row is the same as that of the row before this row (this check mainly handles merged cells that appear in the middle of a table). If it is different, the previous row is the footer row. If it is the same, a merged cell has appeared in the middle of the table, and the search for the footer row continues in the same way. In addition, "-" may appear in some tables. When the change in data format is caused by the appearance of "-", the row is ignored and the search for the footer row continues.
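The footer search described above can be sketched as follows. The source does not define "data format" precisely, so classifying each cell as empty, dash, percent, number or text is an assumption, as are the keyword list, the function names and the sample table (a stand-in for Table 1, whose contents are not reproduced here).

```python
import re
from typing import List, Tuple

FOOTER_KEYWORDS = ("total",)  # hypothetical end-of-table keyword list
NUMBER = re.compile(r"^-?[\d,]+(\.\d+)?$")
PERCENT = re.compile(r"^-?\d+(\.\d+)?%$")


def cell_format(cell: str) -> str:
    """Classify one cell; these five categories are an assumption."""
    text = cell.strip()
    if text == "":
        return "empty"
    if text == "-":
        return "dash"
    if PERCENT.match(text):
        return "percent"
    if NUMBER.match(text):
        return "number"
    return "text"


def row_format(row: List[str]) -> Tuple[str, ...]:
    return tuple(cell_format(c) for c in row)


def same_format(a: Tuple[str, ...], b: Tuple[str, ...]) -> bool:
    """Formats agree cell by cell; a "-" cell is compatible with anything."""
    return len(a) == len(b) and all(x == y or "dash" in (x, y) for x, y in zip(a, b))


def has_footer_keyword(row: List[str]) -> bool:
    return any(k in c.lower() for c in row for k in FOOTER_KEYWORDS)


def find_footer_row(rows: List[List[str]], header: int) -> int:
    """Return the index of the last data row of the table starting at `header`."""
    if header + 1 >= len(rows):
        raise ValueError("no data row after the header")
    last = header + 1  # last row confirmed to belong to the table
    i = last + 1
    while i < len(rows):
        reference = row_format(rows[last])
        if same_format(reference, row_format(rows[i])):
            last = i
            i += 1
            continue
        if has_footer_keyword(rows[i]):
            return last  # a "total" row follows the table
        nxt = i + 1
        if nxt < len(rows) and same_format(reference, row_format(rows[nxt])):
            last = nxt  # row i is an irregular row inside the table (merged cell)
            i = nxt + 1
            continue
        return last  # the format changed for good
    return last


table = [
    ["Serial Number", "Customer", "Sales Amount", "Proportion"],
    ["1", "Company A", "500.00", "50%"],
    ["2", "Company B", "200.00", "20%"],
    ["3", "Company C", "-", "-"],
    ["4", "Company D", "150.00", "15%"],
    ["5", "Company E", "150.00", "15%"],
    ["", "Total", "1,000.00", "100%"],
]
footer_index = find_footer_row(table, header=0)
```

With zero-based indices, the header is row 0 and the footer is row 5, which corresponds to rows 1 and 6 in the numbering used in the text.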

Step S4: Obtain data information of the table according to the header and footer of the table.

The table is traversed from the next row of the table header to the row of the table footer, and the data of each row and each column of the table is extracted.

The above steps are the process of acquiring the data information of one table. If there are multiple tables, the above steps are repeated until the data information of all the tables is extracted. The method for extracting table data in a pdf document of the present invention can automatically extract data in a pdf table in batches, which solves the time-consuming and manpower-consuming problems, the error of the extraction result is small, and the accuracy of the extracted data is high.
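A minimal sketch of step S4 and of the repetition over several tables follows. The header and footer locators are passed in as functions, standing for steps S2 and S3; their signatures, and the names `extract_records` and `extract_all`, are assumptions made for this illustration.

```python
from typing import Callable, Dict, List, Optional

Row = List[str]


def extract_records(rows: List[Row], header: int, footer: int) -> List[Dict[str, str]]:
    """Turn rows header+1 .. footer (inclusive) into dicts keyed by the header cells."""
    names = rows[header]
    return [dict(zip(names, row)) for row in rows[header + 1 : footer + 1]]


def extract_all(
    rows: List[Row],
    find_header: Callable[[List[Row], int], Optional[int]],
    find_footer: Callable[[List[Row], int], int],
) -> List[List[Dict[str, str]]]:
    """Repeat header and footer location until no further header is found."""
    tables = []
    start = 0
    while True:
        header = find_header(rows, start)  # search from row `start` onwards
        if header is None:
            return tables
        footer = find_footer(rows, header)
        tables.append(extract_records(rows, header, footer))
        start = max(footer, header) + 1  # always move past the current table
```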

In a preferred embodiment, the method further includes:

Discard the column in the table where the header keyword does not exist.

It should be noted that each header keyword represents that the data in the column where the keyword is located is what we need. Therefore, the data in the column where the header keyword does not exist can be discarded.
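Discarding columns can be sketched as below, assuming the first row passed in is the header row and that a column is kept when its header cell contains any keyword as a substring. The function name and sample data are hypothetical.

```python
from typing import List, Set


def drop_unkeyed_columns(rows: List[List[str]], keywords: Set[str]) -> List[List[str]]:
    """Keep only the columns whose header cell (rows[0]) contains a header keyword."""
    keep = [j for j, name in enumerate(rows[0]) if any(kw in name for kw in keywords)]
    return [[row[j] for j in keep if j < len(row)] for row in rows]


table = [
    ["Serial Number", "Customer", "Sales Amount", "Proportion"],
    ["1", "Company A", "500.00", "50%"],
]
kept = drop_unkeyed_columns(table, {"Customer", "Sales Amount", "Proportion"})
```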

In a preferred embodiment, the method further includes:

Each header keyword corresponds to a data format: for example, "customer" corresponds to the name of a company or person, "sales amount" corresponds to a number, and "proportion" should contain "%" (if it does not, the format is a number). Check whether the data information of the table conforms to these specifications, and if it does, store the data information in the database.

The present invention also provides an electronic device, including a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps in the method for extracting table data in a pdf document.
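The check and storage can be sketched as follows, using SQLite from the Python standard library as a stand-in for the unspecified database. The regular expressions, the table name `sales` and the choice to store a table only when every record passes are assumptions made for this illustration.

```python
import re
import sqlite3
from typing import Callable, Dict, List

PLAIN_NUMBER = re.compile(r"^-?[\d,]+(\.\d+)?$")
NUMBER_OR_PERCENT = re.compile(r"^-?[\d,]+(\.\d+)?%?$")

# Hypothetical format rules, one per header keyword, following the examples in the text.
RULES: Dict[str, Callable[[str], bool]] = {
    "customer": lambda v: bool(v.strip()) and not NUMBER_OR_PERCENT.match(v.strip()),
    "sales amount": lambda v: bool(PLAIN_NUMBER.match(v.strip())),
    "proportion": lambda v: bool(NUMBER_OR_PERCENT.match(v.strip())),
}


def is_valid(record: Dict[str, str]) -> bool:
    """A record is valid when every keyed field matches its expected format."""
    return all(rule(record.get(key, "")) for key, rule in RULES.items())


def store_if_valid(records: List[Dict[str, str]], conn: sqlite3.Connection) -> bool:
    """Store the table only if all of its records conform; report whether it was stored."""
    if not all(is_valid(r) for r in records):
        return False
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (customer TEXT, sales_amount TEXT, proportion TEXT)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [(r["customer"], r["sales amount"], r["proportion"]) for r in records],
    )
    conn.commit()
    return True
```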

The present invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, the steps in the method for extracting table data in the pdf document are realized.

It should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. The specification is written this way only for the sake of clarity, and those skilled in the art should regard it as a whole. The technical solutions in the embodiments can also be appropriately combined to form other embodiments that can be understood by those skilled in the art.

The detailed descriptions above are only specific descriptions of feasible implementations of the present invention and are not intended to limit its scope of protection. Any equivalent implementation or change made without departing from the technical spirit of the present invention shall be included in the protection scope of the present invention.

Claims

A method for extracting table data in a pdf document, characterized in that the method includes:

Extract table information from pdf documents;

In the table information, look up the table header keywords, and locate the row of the table header of a table according to the weight or combination of the table header keywords;

Traverse the data format of the cells in each row starting from the next row of the row where the header is located, and locate the row where the footer of the table is located according to the change of the data format;

Obtain the data information of the table according to the header and footer of the table.
The method for extracting table data in a pdf document according to claim 1, wherein the method further comprises:

Discard the column in the table where the header keyword does not exist.
The method for extracting table data in a pdf document according to claim 1, wherein the "in the table information, look up the table header keywords, and locate the row of the table header of a table according to the weight of the table header keywords" specifically includes:

Find one or more header keywords in a certain row of the table information;

Acquiring the weight of the one or more header keywords, and calculating the overall weight of the one or more header keywords;

If the overall weight exceeds the weight threshold, identify the row containing the header keywords as the header row of the table.
The method for extracting table data in a pdf document according to claim 3, wherein the "obtaining the weight of each table header keyword" specifically includes:

Get the header keywords and word frequencies of the table in the historical pdf document;

Calculate the weights of the header keywords by the word frequency to obtain a list of header keyword weights;

Look up the header keyword weight list, and obtain the weight of each header keyword.
The method for extracting table data in a pdf document according to claim 1, wherein the "in the table information, look up the table header keywords, and locate the row of the table header of a table according to the combination of the table header keywords" specifically includes:

In a certain row of the table information, multiple table header keywords are found;

Determine whether the multiple header keywords form a combination keyword, and if so, identify the row containing them as the header row of the table.
The method for extracting table data in a pdf document according to claim 5, wherein the "determining whether the multiple header keywords have combined keywords" specifically includes:

Get the combination of header keywords of the table in the historical pdf document, and get the list of header combination keywords;

Determine whether the multiple header keywords match a combination keyword in the header combination keyword list.
The method for extracting table data in a pdf document according to claim 1, characterized in that the "traverse the data format of the cells in each row starting from the next row of the row where the header is located, and locate the row where the footer of the table is located according to the change of the data format" specifically includes:

If the data format of a certain row is different from that of the previous row, identify the previous row as the footer row of the table.
The method for extracting table data in a pdf document according to claim 1, wherein the method further comprises:

Check whether the data information of the table meets the specifications, and if so, store the data information in the database.
An electronic device comprising a memory and a processor, the memory storing a computer program that can run on the processor, wherein the processor, when executing the program, implements the steps in the method for extracting table data in a pdf document according to any one of claims 1-8.
A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps in the method for extracting table data in a pdf document according to any one of claims 1-8.