CN108664458B - PDF file table analysis method and system - Google Patents
PDF file table analysis method and system
- Publication number
- CN108664458B (application number CN201710193060.3A; also published as CN108664458A)
- Authority
- CN
- China
- Prior art keywords
- cell
- label
- tag
- information
- colspan
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a PDF file table analysis method and system, and relates to the field of data processing. The method comprises: acquiring a target PDF file and converting it into a Word document; converting the Word document into an HTML document; and identifying the table information in the HTML document, then reading and outputting it. While the table information in the HTML document is being identified, the identified table information is also converted into structured information. The system comprises a first conversion unit, a second conversion unit and a manufacturing unit. The method can accurately identify and read the text information in a PDF file, can also read the table information in the PDF file with an accuracy of at least 90%, and can convert the table information that has been read into structured data.
Description
Technical Field
The invention relates to the field of data processing, in particular to a PDF file table analysis method and a PDF file table analysis system.
Background
PDF (Portable Document Format) is an electronic file format. Because PDF is supported across the mainstream operating systems, it has become a mainstream way of delivering file information.
A PDF file contains a large amount of data, such as text, tables and pictures. However, because of the closed nature of the PDF format, the prior art can identify the text in a PDF file but identifies and reads table information poorly, with low accuracy.
Products developed by the company can raise the accuracy of PDF table identification to more than 90%.
Disclosure of Invention
The invention aims to provide a PDF file table analysis method and a PDF file table analysis system that solve the above problems in the prior art.
In order to achieve the above object, the PDF file table analysis method according to the present invention comprises:
S1, acquiring a target PDF file and converting the target PDF file into a Word document;
S2, converting the Word document into an HTML document;
S3, identifying the table information in the HTML document, and reading and outputting the table information;
wherein, in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
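The three steps can be sketched as a small orchestration function. This is a minimal sketch, not the patented implementation: `parse_pdf_tables` and its three callable parameters are hypothetical names, and the converters are supplied by the caller because the text does not say how the conversion components are invoked.

```python
from typing import Callable, List


def parse_pdf_tables(
    pdf_path: str,
    pdf_to_word: Callable[[str], str],          # S1: returns the path of the Word document
    word_to_html: Callable[[str], str],         # S2: returns the HTML text
    html_to_records: Callable[[str], List[list]],  # S3: returns [value, row, column] records
) -> List[list]:
    """Run S1 -> S2 -> S3 with caller-supplied converters (all hypothetical)."""
    word_path = pdf_to_word(pdf_path)     # S1: PDF -> Word document
    html_text = word_to_html(word_path)   # S2: Word document -> HTML document
    return html_to_records(html_text)     # S3: HTML tables -> structured information
```

Keeping the converters as parameters means step S3 can be exercised on plain HTML text without any PDF or Word tooling installed.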
Preferably, an underlying component of the Adobe Acrobat DC product is called to convert the target PDF file into a Word document.
Preferably, an underlying component of the Microsoft Office product is called to convert the Word document into the HTML document.
Preferably, in the process of identifying the table information in the HTML document, any identified table information is converted into structured information, specifically according to the following steps:
assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of columns of a cell, and a rowspan tag represents the merging of rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase by 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;
reading the information of each cell in each row, starting from the first tr tag, and, when the information of any cell A is read, judging whether cell A has a colspan tag or a rowspan tag;
if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;
if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
after all cells under the table tag have been read, taking element 2 as the row number and element 3 as the column number on the basis of the stored data storage forms, and filling element 1 into the corresponding row and column to complete the drawing of the two-dimensional table.
More preferably, when the row marked by each tr tag is read, if the data storage form of the cell corresponding to a td tag read in sequence is recorded as [a, b, c], it is judged whether a cell with element 2 = b and element 3 = c already exists among the data storage forms of all the cells read previously; if so, [a, b, c] is modified to [a, b, c+1] and stored; if not, [a, b, c] is stored directly.
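The position check in the preceding paragraph can be sketched as follows. `store_record` is a hypothetical helper name; the sketch applies the single +1 column shift exactly as stated and does not handle the case where the shifted position is itself already taken, which the text leaves open.

```python
def store_record(records, value, row, col):
    """Append [value, row, col]; if (row, col) is already stored, shift the column by 1.

    `records` is the list of [element 1, element 2, element 3] entries read so far.
    """
    # Check whether an earlier record already has element 2 = row and element 3 = col.
    taken = any(r == row and c == col for _, r, c in records)
    record = [value, row, col + 1] if taken else [value, row, col]
    records.append(record)
    return record
```

For instance, with `[["θ", 3, 6], ["θ", 4, 6]]` already stored, storing `"λ"` at row 4, column 6 yields `["λ", 4, 7]`.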
The invention also discloses a system for implementing the PDF file table analysis method, the system comprising:
a first conversion unit: converting the target PDF file into a Word document;
a second conversion unit: converting the Word document into an HTML document;
a manufacturing unit: identifying the table information in the HTML document, and reading and outputting the table information; in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
Preferably, the manufacturing unit includes:
a collecting unit: acquiring the number and the existing value of each cell;
a judging unit: judging whether each cell has a colspan tag or a rowspan tag; if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;
if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
and a drawing unit, which completes the drawing of the two-dimensional table according to the data storage forms obtained from the judging unit.
The beneficial effects of the invention are as follows:
the method can accurately identify and read the text information in a PDF file, can also read the table information in the PDF file with an accuracy of at least 90%, and can convert the table information that has been read into structured data.
Drawings
FIG. 1 is a flow chart of a PDF document table parsing method;
FIG. 2 is a schematic diagram of the table in Embodiment 1.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Embodiment 1
Referring to FIG. 1, the PDF file table analysis method according to this embodiment comprises:
S1, acquiring a target PDF file and converting the target PDF file into a Word document;
S2, converting the Word document into an HTML document;
S3, identifying the table information in the HTML document, and reading and outputting the table information;
wherein, in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
In more detail:
(I) An underlying component of the Adobe Acrobat DC product is called to convert the target PDF file into a Word document, and an underlying component of the Microsoft Office product is called to convert the Word document into the HTML document.
(II) In the process of identifying the table information in the HTML document, any identified table information is converted into structured information, specifically according to the following steps:
assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of columns of a cell, and a rowspan tag represents the merging of rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase by 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;
reading the information of each cell in each row, starting from the first tr tag, and, when the information of any cell A is read, judging whether cell A has a colspan tag or a rowspan tag;
if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A. For example, if the data in the 1st td element of the 1st tr is β, it is stored as [β, 1, 1].
If cell A has a colspan tag, the value m of the colspan tag and the value of cell A are obtained, wherein m is greater than or equal to 2, and the data storage form of cell A is recorded as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
referring to FIG. 2, for example: if cell A has a colspan tag, cell A is split into as many cells as the value of the colspan tag; the content of each of these cells is the same as that of the original cell, the row number is unchanged, and the column number increases by 1 from one cell to the next. For example, when the data in the 4th td element of the 2nd tr is ζ and the colspan value of that td element is 2, the data is stored as [ζ, 2, 4] and [ζ, 2, 5].
If cell A has a rowspan tag, the value n of the rowspan tag and the value of cell A are obtained, wherein n is greater than or equal to 2, and the data storage form of cell A is recorded as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
referring to FIG. 2, for example: if cell A has a rowspan tag, cell A is split into as many cells as the value of the rowspan tag, and these cells are added to the following rows; the column number is unchanged, and the row number increases by 1 from one cell to the next. For example, when the data in the 6th td element of the 3rd tr is θ and the rowspan value of that td element is 2, the data is stored as [θ, 3, 6] and [θ, 4, 6].
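The two splitting rules can be sketched in one helper, using the ζ and θ examples. `split_cell` and its parameters are hypothetical names; the starting position is taken to be the cell's tr and td sequence numbers, as in the examples, and a cell carrying both a column span and a row span is rejected because the text does not describe that case.

```python
def split_cell(value, row, col, colspan=1, rowspan=1):
    """Return the [value, row, column] records for one cell.

    A column span repeats the value across consecutive columns of the same row;
    a row span repeats it down consecutive rows of the same column.
    """
    if colspan >= 2 and rowspan >= 2:
        # Not covered by the described method; refusing is an assumption of this sketch.
        raise ValueError("cells with both colspan and rowspan are not handled here")
    if colspan >= 2:
        return [[value, row, col + k] for k in range(colspan)]
    if rowspan >= 2:
        return [[value, row + k, col] for k in range(rowspan)]
    return [[value, row, col]]
```

Under these assumptions, `split_cell("ζ", 2, 4, colspan=2)` gives `[["ζ", 2, 4], ["ζ", 2, 5]]`, and `split_cell("θ", 3, 6, rowspan=2)` gives `[["θ", 3, 6], ["θ", 4, 6]]`.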
After all cells under the table tag have been read, element 2 is taken as the row number and element 3 as the column number on the basis of the stored data storage forms, and element 1 is filled into the corresponding row and column to complete the drawing of the two-dimensional table.
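The drawing step can be sketched as filling a list of lists from the stored records. `draw_table` and the `fill` parameter are hypothetical names; positions for which no record exists are left as an empty string, which is an assumption of this sketch.

```python
def draw_table(records, fill=""):
    """Build a two-dimensional table from [value, row, column] records (1-based)."""
    if not records:
        return []
    n_rows = max(row for _, row, _ in records)
    n_cols = max(col for _, _, col in records)
    # Start from an empty grid, then place element 1 at (element 2, element 3).
    grid = [[fill] * n_cols for _ in range(n_rows)]
    for value, row, col in records:
        grid[row - 1][col - 1] = value
    return grid
```

Because each record carries its own row and column, the order in which records were stored does not affect the result.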
When the row marked by each tr tag is read, if the data storage form of the cell corresponding to a td tag read in sequence is recorded as [a, b, c], it is judged whether a cell with element 2 = b and element 3 = c already exists among the data storage forms of all the cells read previously; if so, [a, b, c] is modified to [a, b, c+1] and stored; if not, [a, b, c] is stored directly.
Referring to FIG. 2, for example: when td cells are read in sequence, if data already exists at a cell's column position, the column number of the cell is set to the existing column number plus 1. For example, when the td elements in the 4th tr are read and the 6th td element is reached, assuming that the content of the cell is λ, the data would be stored as [λ, 4, 6] in sequence; but since data already exists at position "4, 6", the column number is increased by 1 and the data is stored as [λ, 4, 7].
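Putting the reading rules together, the records can be collected directly from HTML text with Python's standard `html.parser` module. This is a sketch under stated assumptions rather than the patented implementation: `RecordCollector` and `html_to_records` are hypothetical names; in real HTML, colspan and rowspan appear as attributes of the td element; th cells are treated like td cells; the input holds a single, non-nested table; a cell with both spans is expanded into a block; and the +1 column shift is repeated until a free position is found, which generalises the single shift described in the text.

```python
from html.parser import HTMLParser


class RecordCollector(HTMLParser):
    """Collects [value, row, column] records while reading tr and td tags in order."""

    def __init__(self):
        super().__init__()
        self.records = []      # [element 1, element 2, element 3] entries
        self.occupied = set()  # (row, column) positions already stored
        self.row = 0           # sequence number of the current tr tag
        self.td = 0            # sequence number of the current td tag in that row
        self.text = None       # text fragments of the open cell, or None outside a cell
        self.spans = (1, 1)    # (colspan, rowspan) of the open cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row += 1
            self.td = 0
        elif tag in ("td", "th"):
            self.td += 1
            attr = dict(attrs)
            self.spans = (int(attr.get("colspan") or 1), int(attr.get("rowspan") or 1))
            self.text = []

    def handle_data(self, data):
        if self.text is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag not in ("td", "th") or self.text is None:
            return
        value = "".join(self.text).strip()
        colspan, rowspan = self.spans
        col = self.td
        # Collision rule: shift right while the position is already taken.
        while (self.row, col) in self.occupied:
            col += 1
        for r in range(self.row, self.row + rowspan):
            for c in range(col, col + colspan):
                self.records.append([value, r, c])
                self.occupied.add((r, c))
        self.text = None


def html_to_records(html_text):
    """Return the [value, row, column] records for the table in `html_text`."""
    collector = RecordCollector()
    collector.feed(html_text)
    collector.close()
    return collector.records
```

Starting each cell at its td sequence number and shifting right on collision follows the λ example: a 6th td in row 4 lands in column 7 when a row-spanning cell from row 3 already occupies column 6.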
A system for implementing the PDF file table analysis method of Embodiment 1 comprises:
a first conversion unit: converting the target PDF file into a Word document;
a second conversion unit: converting the Word document into an HTML document;
a manufacturing unit: identifying the table information in the HTML document, and reading and outputting the table information; in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
Wherein the manufacturing unit includes:
a collecting unit: acquiring the number and the existing value of each cell;
a judging unit: judging whether each cell has a colspan tag or a rowspan tag; if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;
if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
and a drawing unit, which completes the drawing of the two-dimensional table according to the data storage forms obtained from the judging unit.
By adopting the technical solution disclosed by the invention, the following beneficial effects are obtained: the method can accurately identify and read the text information in a PDF file, and can also read the table information in the PDF file with an accuracy of at least 90%.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.
Claims (4)
1. A PDF file table analysis method, characterized by comprising the following steps:
S1, acquiring a target PDF file, calling an underlying component of the Adobe Acrobat DC product, and converting the target PDF file into a Word document;
S2, calling an underlying component of the Microsoft Office product, and converting the Word document into an HTML document;
S3, identifying the table information in the HTML document, and reading and outputting the table information;
wherein, in the process of identifying the table information in the HTML document, any identified table information is converted into structured information, specifically according to the following steps:
assuming that, in the HTML file, a table tag represents a table, a tr tag represents a row, and a td tag represents a cell in the row; a colspan tag represents the merging of columns of a cell, and a rowspan tag represents the merging of rows; the sequence numbers of the table tags, the tr tags and the td tags each start at 1 and increase by 1; and the values of the colspan tag and the rowspan tag are both greater than or equal to 2;
reading the information of each cell in each row, starting from the first tr tag, and, when the information of any cell A is read, judging whether cell A has a colspan tag or a rowspan tag;
if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;
if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
after all cells under the table tag have been read, taking element 2 as the row number and element 3 as the column number on the basis of the stored data storage forms, and filling element 1 into the corresponding row and column to complete the drawing of the two-dimensional table.
2. The PDF file table analysis method according to claim 1, wherein, when the row marked by each tr tag is read, if the data storage form of the cell corresponding to a td tag read in sequence is recorded as [a, b, c], it is judged whether a cell with element 2 = b and element 3 = c already exists among the data storage forms of all the cells read previously; if so, [a, b, c] is modified to [a, b, c+1] and stored; if not, [a, b, c] is stored directly.
3. A system for implementing the PDF file table analysis method of claim 2, wherein the system comprises:
a first conversion unit: converting the target PDF file into a Word document;
a second conversion unit: converting the Word document into an HTML document;
a manufacturing unit: identifying the table information in the HTML document, and reading and outputting the table information; in the process of identifying the table information in the HTML document, the identified table information is also converted into structured information.
4. The system of claim 3, wherein the manufacturing unit comprises:
a collecting unit: acquiring the number and the existing value of each cell;
a judging unit: judging whether each cell has a colspan tag or a rowspan tag; if neither a colspan tag nor a rowspan tag exists, obtaining the value of cell A and recording the data storage form of cell A as [element 1, element 2, element 3], wherein element 1 represents the value of cell A, element 2 represents the sequence number of the tr tag in which cell A is located, and element 3 represents the sequence number of the td tag of cell A;
if cell A has a colspan tag, obtaining the value m of the colspan tag and the value of cell A, wherein m is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2, element 3 = m] and [element 1, element 2, element 3 = m+1];
if cell A has a rowspan tag, obtaining the value n of the rowspan tag and the value of cell A, wherein n is greater than or equal to 2, and recording the data storage form of cell A as [element 1, element 2 = n, element 3] and [element 1, element 2 = n+1, element 3];
and a drawing unit, which completes the drawing of the two-dimensional table according to the data storage forms obtained from the judging unit.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710193060.3A CN108664458B (en) | 2017-03-28 | 2017-03-28 | PDF file table analysis method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710193060.3A CN108664458B (en) | 2017-03-28 | 2017-03-28 | PDF file table analysis method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108664458A CN108664458A (en) | 2018-10-16 |
CN108664458B CN108664458B (en) | 2022-06-14 |
Family
ID=63785875
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710193060.3A Active CN108664458B (en) | 2017-03-28 | 2017-03-28 | PDF file table analysis method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108664458B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109829139B (en) * | 2019-01-30 | 2023-04-18 | 中国软件与技术服务股份有限公司 | Method and device for converting DOC/DOCX format streaming file into OFD format file |
CN112632940A (en) * | 2021-01-02 | 2021-04-09 | 浙江建达科技股份有限公司 | Method for automatically converting word format application form into online filling webpage |
CN113869014A (en) * | 2021-08-25 | 2021-12-31 | 盐城金堤科技有限公司 | Extraction method and device of table data, storage medium and electronic equipment |
CN114462393A (en) * | 2022-04-12 | 2022-05-10 | 安徽数智建造研究院有限公司 | Webpage text information extraction method and device, terminal equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1687926A (en) * | 2005-04-18 | 2005-10-26 | 福州大学 | Method of PDF file information extraction system based on XML |
CN101976232A (en) * | 2010-09-19 | 2011-02-16 | 深圳市万兴软件有限公司 | Method for identifying data form in document and device thereof |
CN102467378A (en) * | 2010-11-11 | 2012-05-23 | 深圳市金蝶友商电子商务服务有限公司 | HTML (Hypertext Markup Language) form processing method based on two-dimensional matrix and computer |
CN103198069A (en) * | 2012-01-06 | 2013-07-10 | 株式会社理光 | Method and device for extracting relational table |
CN104063364A (en) * | 2013-03-19 | 2014-09-24 | 福建福昕软件开发股份有限公司北京分公司 | PDF document recognition method |
CN105630916A (en) * | 2015-12-21 | 2016-06-01 | 浙江工业大学 | Method for extracting and organizing unstructured sheet document data under big data environment |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7289981B2 (en) * | 2002-12-10 | 2007-10-30 | International Business Machines Corporation | Using text search engine for parametric search |
US8972437B2 (en) * | 2009-12-23 | 2015-03-03 | Apple Inc. | Auto-population of a table |
US9047533B2 (en) * | 2012-02-17 | 2015-06-02 | Palo Alto Research Center Incorporated | Parsing tables by probabilistic modeling of perceptual cues |
Non-Patent Citations (1)
Title |
---|
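"Research and Implementation of Table Structure Recognition in Web Pages" (Web页中表格结构识别的研究与实现); 林科锵 (Lin Keqiang); China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology series; China Academic Journals (CD Edition) Electronic Publishing House; 2006-12-15; thesis pp. 17-19 and 54-59, Figs. 5-4, 5-5, 5-6 * |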
Also Published As
Publication number | Publication date |
---|---|
CN108664458A (en) | 2018-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108664458B (en) | PDF file table analysis method and system | |
Pletschacher et al. | The page (page analysis and ground-truth elements) format framework | |
CN101308488B (en) | Document stream type information processing method based on format document and device therefor | |
US20130174024A1 (en) | Method and device for converting document format | |
CN106709032A (en) | Method and device for extracting structured information from spreadsheet document | |
CN107168695B (en) | Excel data analysis method and system | |
US20140074878A1 (en) | Spreadsheet schema extraction | |
CN109002425B (en) | Method for acquiring upstream and downstream relations of enterprise, terminal device and medium | |
CN115391322A (en) | Data checking method, device, equipment, storage medium and program product | |
CN107066431A (en) | The storage method and storage processing equipment of a kind of model data | |
CN107526795B (en) | Knowledge base construction method and device, storage medium and computing equipment | |
US20070282804A1 (en) | Apparatus and method for extracting database information from a report | |
CN116415562B (en) | Method, apparatus and medium for parsing financial data | |
CN105574164A (en) | Excel document data analysis method and device | |
CN112463931A (en) | Intelligent analysis method for insurance product clauses and related equipment | |
US11836445B2 (en) | Spreadsheet table transformation | |
CN110941610B (en) | Excel data file processing method and device | |
CN111401007A (en) | Method for converting unstructured data into structured data | |
CN113807416B (en) | Model training method and device, electronic equipment and storage medium | |
CN113642291B (en) | Method, system, storage medium and terminal for constructing logical structure tree reported by listed companies | |
CN115935928A (en) | Method and device for extracting document information | |
CN113821555A (en) | Unstructured data collection processing method of intelligent supervision black box | |
JP2015191277A (en) | Data identification method, data identification program, and data identification device | |
CN114004209A (en) | PDF format data export method and device, electronic equipment and readable storage medium | |
WO2021036181A1 (en) | Data extraction method and device, storage medium and equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100089 Room 101, 1st floor, building 4, yard 6, Wanliu Middle Road, Haidian District, Beijing; Applicant after: Zhongke Yuntou Technology Co.,Ltd.; Address before: 100089 Room 101, 1st floor, building 4, yard 6, Wanliu Middle Road, Haidian District, Beijing; Applicant before: HUADUO JIUZHOU TECHNOLOGY CO.,LTD. |
CB02 | Change of applicant information | ||
GR01 | Patent grant | ||
GR01 | Patent grant |