CN108664458B - PDF file table analysis method and system

Info

Publication number: CN108664458B
Application number: CN201710193060.3A
Authority: CN
Inventors: 裴泽光; 武海峰
Original assignee: Zhongke Yuntou Technology Co ltd
Current assignee: Zhongke Yuntou Technology Co ltd
Priority date: 2017-03-28
Filing date: 2017-03-28
Publication date: 2022-06-14
Anticipated expiration: 2037-03-28
Also published as: CN108664458A

Abstract

The invention discloses a PDF file table analysis method and a PDF file table analysis system, and relates to the field of data processing. The method comprises: acquiring a target PDF file and converting it into a Word document; converting the Word document into an HTML document; and identifying the table information in the HTML document, then reading and outputting it. While the table information in the HTML document is being identified, the identified table information is also converted into structured information. The system comprises a first conversion unit, a second conversion unit and a manufacturing unit. The method can accurately identify and read the text information in a PDF file, can also read the table information in the PDF file with an accuracy of at least 90%, and can convert the table information it has read into structured data.

Description

PDF file table analysis method and system

Technical Field

The invention relates to the field of data processing, in particular to a PDF file table analysis method and a PDF file table analysis system.

Background

PDF (Portable Document Format) is an electronic file format. Because PDF files can be used on all mainstream operating systems, PDF has become a mainstream format for delivering file-based information.

A PDF file contains a large amount of data, such as text, tables and pictures. However, because the PDF format is closed, the prior art can identify the text in a PDF file but identifies and reads table information poorly and with low accuracy.

Products developed by the applicant can raise the accuracy of PDF table identification to more than 90%.

Disclosure of Invention

The invention aims to provide a PDF file table analysis method and a PDF file table analysis system, so that the problems in the prior art are solved.

In order to achieve the above object, the PDF file table parsing method according to the present invention includes:

S1, acquiring a target PDF file and converting the target PDF file into a Word document;

S2, converting the Word document into an HTML document;

S3, identifying the table information in the HTML document, and reading and outputting the table information;

wherein, in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
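Steps S1 to S3 can be read as a three-stage pipeline. The following Python sketch is only an illustration under stated assumptions, not the patented implementation: `parse_pdf_tables` and its three callable parameters are hypothetical names, and the converters are passed in as functions because the text does not say how the conversion components are invoked.

```python
from pathlib import Path
from typing import Callable, List


def parse_pdf_tables(
    pdf_path: Path,
    pdf_to_word: Callable[[Path], Path],         # hypothetical: S1 converter
    word_to_html: Callable[[Path], Path],        # hypothetical: S2 converter
    extract_tables: Callable[[str], List[list]]  # hypothetical: S3 table reader
) -> List[list]:
    """Chain S1 -> S2 -> S3 and return whatever the table reader produces."""
    word_path = pdf_to_word(Path(pdf_path))      # S1: PDF -> Word document
    html_path = word_to_html(word_path)          # S2: Word document -> HTML document
    # Assumption: the HTML file is readable as UTF-8; undecodable bytes are replaced.
    html_text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    return extract_tables(html_text)             # S3: identify, read and output tables
```

Keeping the converters as parameters means the table-reading stage can be exercised with a hand-written HTML file, without either conversion product installed.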

Preferably, an underlying component of the Adobe Acrobat DC product is called to convert the target PDF file into a Word document.

Preferably, an underlying component of the Microsoft Office product is called to convert the Word document into the HTML document.

Preferably, in the process of identifying table information in the HTML document, every item of identified table information is converted into structured information, specifically according to the following steps:

assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of cells across columns, and a rowspan tag represents the merging of cells across rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase in steps of 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;

reading the information of each cell in each row, starting from the first tr tag, and, when the information of any cell A is read, judging whether cell A has a colspan tag or a rowspan tag;

if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;

if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];

if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];

after all cells under the table tag have been read, taking element 2 of each stored data storage form as the row number and element 3 as the column number, and filling element 1 into the corresponding row and column to complete the drawing of the two-dimensional table.
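The final drawing step can be sketched as follows. This is a minimal Python illustration, assuming each stored record is a three-item list [value, row number, column number] with 1-based numbering as described above; the function name `draw_table` and the `fill` parameter for empty positions are hypothetical.

```python
def draw_table(records, fill=""):
    """Place each [value, row, col] record into a rectangular 2D list."""
    if not records:
        return []
    n_rows = max(row for _, row, _ in records)
    n_cols = max(col for _, _, col in records)
    # Start from a grid of placeholders so that positions with no record stay visible.
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for value, row, col in records:
        grid[row - 1][col - 1] = value  # records are 1-based, Python lists are 0-based
    return grid


table = draw_table([["a", 1, 1], ["b", 1, 2], ["b", 1, 3], ["c", 2, 1]])
```

Here the value "b" appears twice in the first row because a merged cell has already been split into one record per column it covers.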

More preferably, when reading the row marked by each tr tag, if the data storage form of the cell corresponding to a td tag read in sequence is [a, b, c], judging whether the data storage forms of all previously read cells contain a cell with element 2 = b and element 3 = c; if so, modifying [a, b, c] to [a, b, c+1] and saving it; if not, saving [a, b, c] directly.
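A minimal Python sketch of this occupancy check is given here for illustration. The function name `store_record` is hypothetical. The text describes a single shift from c to c+1; the loop below repeats the shift until a free column is found, which is an assumption made so that two adjacent occupied positions are also handled.

```python
def store_record(records, record):
    """Save [a, b, c]; if position (b, c) is already taken, shift the column right."""
    value, row, col = record
    taken = {(r, c) for _, r, c in records}
    while (row, col) in taken:  # assumption: keep shifting, not just once
        col += 1
    stored = [value, row, col]
    records.append(stored)
    return stored


records = [["θ", 3, 6], ["θ", 4, 6]]            # a cell merged across rows 3 and 4
shifted = store_record(records, ["λ", 4, 6])    # position (4, 6) is already occupied
```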

The invention also discloses a system for implementing the PDF file table analysis method, the system comprising:

a first conversion unit: converting the target PDF file into a Word document;

a second conversion unit: converting the Word document into an HTML document;

a manufacturing unit: identifying the table information in the HTML document, and reading and outputting the table information; in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.

Preferably, the manufacturing unit includes:

a collecting unit: acquiring the number and the existing numerical value of each cell;

a judging unit: judging whether each cell has a colspan tag or a rowspan tag; if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;

and a drawing unit which finishes drawing the two-dimensional table according to the data storage form obtained from the judging unit.

The beneficial effects of the invention are as follows:

The method can accurately identify and read the text information in a PDF file, can also read the table information in the PDF file with an accuracy of at least 90%, and can convert the table information it has read into structured data.

Drawings

FIG. 1 is a flow chart of a PDF document table parsing method;

FIG. 2 is a table diagram of example 1;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

Examples

Referring to FIG. 1, the method for parsing a PDF file table according to this embodiment includes:

S2, converting the Word document into an HTML document;

More detailed explanation:

(I) An underlying component of the Adobe Acrobat DC product is called to convert the target PDF file into a Word document, and an underlying component of the Microsoft Office product is called to convert the Word document into the HTML document.

(II) In the process of identifying the table information in the HTML document, every item of identified table information is also converted into structured information, specifically according to the following steps:

assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of cells across columns, and a rowspan tag represents the merging of cells across rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase in steps of 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;

if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A. For example, if the data in the first td element of the first tr is β, this data is stored as [β,1,1].
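The reading stage can be sketched with Python's standard `html.parser` module. This is an illustrative sketch only: the class `CellReader`, the function `read_cells` and the dictionary keys are hypothetical names, a missing colspan or rowspan is recorded as 1, and nested tables are not handled.

```python
from html.parser import HTMLParser


class CellReader(HTMLParser):
    """Collects every td cell with its 1-based tr and td sequence numbers."""

    def __init__(self):
        super().__init__()
        self.cells = []    # one dict per td, in reading order
        self._tr = 0       # sequence number of the current tr
        self._td = 0       # sequence number of the current td within that tr
        self._open = None  # the cell currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._tr += 1
            self._td = 0
        elif tag == "td":
            self._td += 1
            a = dict(attrs)
            self._open = {
                "value": "", "tr": self._tr, "td": self._td,
                "colspan": int(a.get("colspan") or 1),  # 1 means "not merged"
                "rowspan": int(a.get("rowspan") or 1),
            }

    def handle_data(self, data):
        if self._open is not None:
            self._open["value"] += data  # also gathers text inside nested tags

    def handle_endtag(self, tag):
        if tag == "td" and self._open is not None:
            self._open["value"] = self._open["value"].strip()
            self.cells.append(self._open)
            self._open = None


def read_cells(html_text):
    reader = CellReader()
    reader.feed(html_text)
    reader.close()
    return reader.cells


cells = read_cells(
    "<table><tr><td>β</td><td colspan='2'>x</td></tr>"
    "<tr><td rowspan='2'>y</td></tr></table>"
)
```

For a cell with no merging, the stored form is simply `[cell["value"], cell["tr"], cell["td"]]`, which gives [β,1,1] for the first cell above.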

Referring to FIG. 2, an example is: if cell A has a colspan tag, cell A is split into as many cells as the value of the colspan tag; each resulting cell has the same content as the original cell, the row number is unchanged, and the column number increases by 1 from one cell to the next. For example, when the data in the 4th td element of the 2nd tr is ζ and the colspan tag value of that td element is 2, the data is stored as [ζ,2,4] and [ζ,2,5].

Thirdly, if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];

Referring to FIG. 2, an example is: if cell A has a rowspan tag, cell A is split into as many cells as the value of the rowspan tag, and the resulting cells are added to the following rows; the column number is unchanged, and the row number increases by 1 from one cell to the next. For example, if the data in the 6th td element of the 3rd tr is θ and the rowspan tag value of that td element is 2, the data is stored as [θ,3,6] and [θ,4,6].
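The two splitting rules can be sketched in Python as follows. The function names `split_colspan` and `split_rowspan` are hypothetical, and the sketch assumes the copies start at the cell's own position and advance by 1 per copy, which is how the ζ and θ examples behave.

```python
def split_colspan(value, row, col, m):
    """Split a cell merged across m columns: same row, columns col .. col+m-1."""
    return [[value, row, col + k] for k in range(m)]


def split_rowspan(value, row, col, n):
    """Split a cell merged across n rows: same column, rows row .. row+n-1."""
    return [[value, row + k, col] for k in range(n)]


zeta = split_colspan("ζ", 2, 4, 2)   # 4th td of the 2nd tr, colspan value 2
theta = split_rowspan("θ", 3, 6, 2)  # 6th td of the 3rd tr, rowspan value 2
```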

When reading the row marked by each tr tag, if the data storage form of the cell corresponding to a td tag read in sequence is [a, b, c], judging whether the data storage forms of all previously read cells contain a cell with element 2 = b and element 3 = c; if so, modifying [a, b, c] to [a, b, c+1] and saving it; if not, saving [a, b, c] directly.

Referring to FIG. 2, an example is: when td cells are read in order, if data already exists at the position a cell would occupy, the column number of that cell is increased by 1. For example, when the td elements in the 4th tr are read and the 6th td element is reached, assuming the content of the cell is λ, the data would normally be stored as [λ,4,6]; but since data already exists at position (4,6), the column number is increased by 1 and the data is stored as [λ,4,7].
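Putting the rules together, a whole-table pass might look like the Python sketch below. It is an illustration under assumptions, not the patented implementation: `assign_positions` is a hypothetical name, each input cell is a (value, colspan, rowspan) tuple with 1 meaning "not merged", the starting column is shifted right until it is free, and the merged copies are then written from that resolved column.

```python
def assign_positions(rows):
    """rows: one list per tr, each holding (value, colspan, rowspan) tuples.

    Returns [value, row, col] records with 1-based row and column numbers.
    """
    records, occupied = [], set()
    for r, cells in enumerate(rows, start=1):
        for c, (value, colspan, rowspan) in enumerate(cells, start=1):
            # c starts as the td sequence number; shift right past taken positions.
            while (r, c) in occupied:
                c += 1
            # Write one record per position covered by the (possibly merged) cell.
            for dr in range(rowspan):
                for dc in range(colspan):
                    occupied.add((r + dr, c + dc))
                    records.append([value, r + dr, c + dc])
    return records


rows = [
    [("r1", 1, 1)],
    [("r2", 1, 1)],
    [("a", 1, 1), ("b", 1, 1), ("c", 1, 1), ("d", 1, 1), ("e", 1, 1), ("θ", 1, 2)],
    [("f", 1, 1), ("g", 1, 1), ("h", 1, 1), ("i", 1, 1), ("j", 1, 1), ("λ", 1, 1)],
]
records = assign_positions(rows)
```

With this input, θ is recorded at (3,6) and (4,6), so the 6th td of the 4th tr, λ, is moved to column 7, matching the example in the text.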

A system for implementing the PDF file table parsing method in embodiment 1 comprises:

a first conversion unit: converting the target PDF file into a Word document;

a second conversion unit: converting the Word document into an HTML document;

Wherein the manufacturing unit includes:

a judging unit: judging whether each cell has a colspan tag or a rowspan tag; if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;

The technical solution disclosed by the invention provides the following beneficial effects: the method can accurately identify and read the text information in a PDF file and can also read the table information in the PDF file, with an accuracy of at least 90%.

The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims

1. A PDF file table analysis method is characterized by comprising the following steps:

S1, acquiring a target PDF file, calling an underlying component of the Adobe Acrobat DC product, and converting the target PDF file into a Word document;

S2, calling an underlying component of the Microsoft Office product, and converting the Word document into an HTML document;

in the process of identifying table information in the HTML document, every item of identified table information is converted into structured information, specifically according to the following steps:

assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of cells across columns, and a rowspan tag represents the merging of cells across rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase in steps of 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;

after all cells under the table tag have been read, taking element 2 of each stored data storage form as the row number and element 3 as the column number, and filling element 1 into the corresponding row and column to complete the drawing of the two-dimensional table.

2. The PDF file table parsing method according to claim 1, wherein, when reading the row marked by each tr tag, if the data storage form of the cell corresponding to a td tag read in sequence is [a, b, c], it is determined whether the data storage forms of all previously read cells contain a cell with element 2 = b and element 3 = c; if so, [a, b, c] is modified to [a, b, c+1] and saved; if not, [a, b, c] is saved directly.

3. A system for implementing the PDF file table parsing method of claim 2, wherein the system comprises:

a first conversion unit: converting the target PDF file into a Word document;

a second conversion unit: converting the Word document into an HTML document;

4. The system of claim 3, wherein the production unit comprises: