CN112069991A

CN112069991A - PDF table information extraction method and related device

Info

Publication number: CN112069991A
Application number: CN202010922836.2A
Authority: CN
Inventors: 余昊旻; 张青龙; 陈强; 丁明; 蒋坡良
Original assignee: Servyou Software Group Co ltd
Current assignee: Servyou Software Group Co ltd
Priority date: 2020-09-04
Filing date: 2020-09-04
Publication date: 2020-12-11

Abstract

The application discloses a PDF table information extraction method, comprising: performing character analysis on a PDF file to obtain characters and character position information; performing closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array; and structuring the characters according to the rectangular contour array and the character position information to obtain table information. The table contours are identified from the picture corresponding to the PDF file by image recognition, and the table information is then spliced according to those contours, which improves the extraction efficiency of the table information and ensures the extraction effect. The application also discloses a PDF table information extraction device, a computing device and a computer-readable storage medium, which have the same beneficial effects.

Description

PDF table information extraction method and related device

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a method, an apparatus, a computing apparatus, and a computer-readable storage medium for extracting table information of a PDF.

Background

With the continuous development of information technology, various document file formats have emerged. Among them, the Portable Document Format (PDF) is a file format that presents a document in a manner independent of the application program, hardware, and operating system.

As can be appreciated from the PDF specification, the page image presented by a PDF is composed of vector graphics, bitmaps, text, and interactive elements. A table is likewise composed of vector graphics, bitmaps and text. Consequently, when reading a PDF, the rows and columns of a table cannot be read as directly as they can when reading an Excel file.

In the prior art, technical solutions such as PDFBox, Tabula and iText all read text data by parsing the file according to the PDF specification. However, since the PDF specification does not define a table, the content of a rendered table cannot be extracted directly. When the content of a table is relatively complicated and difficult to recognize, the recognition rate of the table content drops, and the content of the table may not be extracted at all.

Therefore, how to improve the efficiency of extracting PDF table information is a key issue for those skilled in the art.

Disclosure of Invention

The application aims to provide a PDF table information extraction method, a PDF table information extraction device, a computing device and a computer-readable storage medium.

In order to solve the above technical problem, the present application provides a method for extracting table information of PDF, including:

performing character analysis on the PDF file to obtain characters and character position information;

carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;

and carrying out structuralization processing on the characters according to the rectangular outline array and the character position information to obtain table information.

Optionally, performing text parsing on the PDF file to obtain text and text position information, including:

and performing character analysis on the PDF file according to a PDF analysis library to obtain the characters and the character position information.

Optionally, the image corresponding to the PDF file is subjected to closed contour recognition processing by using an image recognition algorithm to obtain a rectangular contour array, including:

converting the PDF file into a picture;

carrying out binarization processing on the picture to obtain a black and white picture;

identifying the black and white picture through a closed contour algorithm to obtain a rectangular area;

and converting the rectangular area into the rectangular outline array.

Optionally, performing structured processing on the text according to the rectangular outline array and the text position information to obtain table information, including:

inquiring in the rectangular outline array according to the character position information, and determining the position of each character in the rectangular array;

and splicing all the characters according to the positions to obtain the table information.

The present application further provides a device for extracting table information of PDF, including:

the character analysis module is used for carrying out character analysis on the PDF file to obtain characters and character position information;

the table contour identification module is used for carrying out closed contour identification processing on the picture corresponding to the PDF file through an image identification algorithm to obtain a rectangular contour array;

and the table structured processing module is used for carrying out structured processing on the characters according to the rectangular outline array and the character position information to obtain table information.

Optionally, the text parsing module is specifically configured to perform text parsing on the PDF file according to a PDF parsing library to obtain the text and the text position information.

Optionally, the form contour identification module includes:

a PDF conversion unit for converting the PDF file into a picture;

a binarization processing unit, configured to perform binarization processing on the picture to obtain a black-and-white picture;

the closed contour identification unit is used for identifying the black and white picture through a closed contour algorithm to obtain a rectangular area;

and the array conversion unit is used for converting the rectangular area into the rectangular outline array.

Optionally, the table structuring processing module includes:

the position query unit is used for querying in the rectangular outline array according to the character position information and determining the position of each character in the rectangular array;

and the character splicing unit is used for splicing all characters according to the positions to obtain the table information.

The present application further provides a computing device comprising:

a memory for storing a computer program;

a processor for implementing the steps of the table information extraction method as described above when executing the computer program.

The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the table information extraction method as described above.

The application provides a PDF table information extraction method, which comprises the following steps: performing character analysis on the PDF file to obtain characters and character position information; carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array; and carrying out structuralization processing on the characters according to the rectangular outline array and the character position information to obtain table information.

The method comprises the steps of firstly analyzing characters and character position information from a PDF file, then carrying out outline recognition on a picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular outline array, namely an outline array of a table, and finally carrying out structural processing on the characters according to the rectangular outline array and the character position information to obtain table information.

The present application further provides a device for extracting table information of PDF, a computing device, and a computer-readable storage medium, which have the above beneficial effects and are not described herein again.

Drawings

In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only embodiments of the present application, and those skilled in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flowchart of a method for extracting table information of a PDF according to an embodiment of the present application;

Fig. 2 is a schematic structural diagram of a PDF table information extraction device according to an embodiment of the present application.

Detailed Description

The core of the application is to provide a PDF table information extraction method, a PDF table information extraction device, a computing device and a computer-readable storage medium, in which the table contours are identified from the picture corresponding to the PDF file by image recognition and the table information is then spliced according to those contours, so that the extraction efficiency of the table information is improved and the extraction effect is ensured.

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Therefore, the method for extracting the table information of the PDF includes the steps of firstly analyzing characters and character position information from a PDF file, then carrying out outline recognition on a picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular outline array, namely obtaining the outline array of the table, and finally carrying out structural processing on the characters according to the rectangular outline array and the character position information to obtain the table information.

Referring to fig. 1, fig. 1 is a flowchart illustrating a method for extracting table information of a PDF according to an embodiment of the present disclosure.

In this embodiment, the method may include:

S101, performing character analysis on a PDF file to obtain characters and character position information;

This step aims to perform character analysis on the PDF file to obtain characters and character position information. That is, the PDF file is parsed to determine all the characters it contains. It should be understood that the characters include all the text in the PDF file, including all the text in the tables. At the same time, the character position information corresponding to each character is acquired.

The text position information generally refers to position information indicating the position of each text with respect to the PDF file. The coordinates of the rectangular coordinate system of the PDF file can be generally used for representation.
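Because the later steps compare character positions with rectangles found in a rendered picture, both must be expressed in the same coordinate space. The following is a minimal sketch of such a conversion, assuming the parsing library reports positions in default PDF user space (origin at the bottom-left corner of the page, unit 1/72 inch) and the picture has its origin at the top-left corner. The names `Glyph` and `to_pixel` and the `dpi` parameter are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Tuple

POINTS_PER_INCH = 72.0  # default PDF user space unit is 1/72 inch


@dataclass(frozen=True)
class Glyph:
    """One parsed character and its position in PDF user space (points)."""
    text: str
    x: float  # distance from the left edge of the page
    y: float  # distance from the BOTTOM edge of the page


def to_pixel(glyph: Glyph, page_height_pt: float, dpi: float) -> Tuple[int, int]:
    """Map a glyph position to (column, row) pixel coordinates of the rendered page.

    Assumption: the page was rendered at `dpi` dots per inch without rotation
    or cropping, so only a scale and a vertical flip are needed.
    """
    scale = dpi / POINTS_PER_INCH
    px = round(glyph.x * scale)
    py = round((page_height_pt - glyph.y) * scale)  # flip the y axis
    return px, py
```

Some parsing libraries already report top-left coordinates, in which case the vertical flip should be dropped.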

Optionally, this step may include:

performing character analysis on the PDF file according to a PDF analysis library to obtain the characters and the character position information.

It can be seen that this alternative mainly explains how to perform the character analysis. In this alternative, a PDF analysis library is used to analyze the characters in the PDF file to obtain the corresponding characters and character position information. The PDF analysis library may be any PDF analysis library provided in the prior art, which is not described in detail here.

S102, carrying out closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array;

On the basis of S101, this step aims to perform closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm to obtain a rectangular contour array. That is, the rectangle information corresponding to the table is extracted by means of image recognition, rather than by using an analysis library to analyze a table that it cannot normally recognize.

In this step, any image recognition algorithm provided in the prior art may be used as long as the table in the picture is recognized.

Furthermore, the image recognition algorithm identifies the outlines in the picture corresponding to the PDF file. Tables share a uniform shape characteristic, for example, they are all closed rectangles. The closed rectangles of each table can therefore be identified from the picture corresponding to the PDF file by this characteristic, and the identified closed rectangles are collected into the rectangular outline array. The rectangular outline array is an array representing the length, the size and the number of the tables. Depending on the number of kinds of information recorded, the array may be a binary array.

Optionally, this step may include:

step 1, converting the PDF file into a picture;

step 2, carrying out binarization processing on the picture to obtain a black and white picture;

step 3, identifying the black and white picture through a closed contour algorithm to obtain a rectangular area;

step 4, converting the rectangular area into the rectangular outline array.

It can be seen that this alternative mainly explains how to obtain the rectangular outline array. Specifically, the PDF file is first converted into a picture. The picture is then binarized to obtain a black and white picture; the binarization mainly serves to make the image features related to the table information more prominent. Next, the black and white picture is identified through a closed contour algorithm to obtain the rectangular area. Finally, the rectangular area is converted into the rectangular outline array.
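The binarization step described above can be sketched as a simple global threshold. This is only an illustration under stated assumptions: the picture is represented as nested lists of (R, G, B) tuples, brightness is taken as the mean of the three channels, and the threshold of 128 is arbitrary. None of these choices is specified by the application, and the function name `binarize` is hypothetical.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def binarize(image: List[List[Pixel]], threshold: int = 128) -> List[List[int]]:
    """Return a 0/1 grid: 1 where the pixel is dark (candidate table line or glyph).

    Assumption: brightness is the plain mean of the three channels and a single
    global threshold separates ink from background.
    """
    return [
        [1 if (r + g + b) / 3 < threshold else 0 for (r, g, b) in row]
        for row in image
    ]
```

In practice an image library would perform this step on whole arrays; the nested-list form is used here only to keep the sketch free of dependencies.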

It should be noted that, in this alternative, the step of identifying the black and white picture through a closed contour algorithm to obtain the rectangular area may include: identifying the black and white picture through the closed contour algorithm to obtain all closed contours in the picture; then eliminating, according to the correlation among all the closed contours, the closed regions unrelated to the largest closed region, and finally obtaining the rectangular area.
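One way to read the elimination step above is as a containment filter: find the largest closed region and discard every rectangle that does not lie inside it. The sketch below assumes that each closed contour has already been reduced to a bounding rectangle `(x, y, width, height)` and that the table's outer border is itself one of the detected rectangles. Both are simplifying assumptions, and the helper names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def area(rect: Rect) -> int:
    return rect[2] * rect[3]


def contains(outer: Rect, inner: Rect) -> bool:
    """True when `inner` lies entirely within `outer` (edges may touch)."""
    ox, oy, ow, oh = outer
    ix, iy, iw, ih = inner
    return ox <= ix and oy <= iy and ix + iw <= ox + ow and iy + ih <= oy + oh


def keep_table_cells(rects: List[Rect]) -> List[Rect]:
    """Keep only the rectangles inside the largest one, dropping the largest itself."""
    if not rects:
        return []
    largest = max(rects, key=area)
    return [r for r in rects if r != largest and contains(largest, r)]
```

A page with several tables would need this filter applied per table rather than once per page.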

S103, carrying out structuralization processing on the characters according to the rectangular outline array and the character position information to obtain table information.

On the basis of S102, this step aims to structure the characters according to the rectangular outline array and the character position information to obtain table information. That is, once the rectangular outline array has been acquired, in other words once all rectangular tables in the picture have been determined, the characters are spliced into the corresponding tables according to the character position information.

Optionally, this step may include:

step 1, inquiring in the rectangular outline array according to the character position information, and determining the position of each character in the rectangular array;

step 2, splicing all the characters according to the positions to obtain the table information.

It can be seen that the alternative scheme mainly explains how to splice the table information. In this alternative, the position of each character in the rectangular array is determined by query, and then all characters are spliced into the corresponding table through the position to obtain the table information.
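The query-and-splice idea can be sketched as a point-in-rectangle lookup followed by concatenation in reading order. The sketch assumes that characters and rectangles are already in the same pixel coordinate space, that each character is represented by a single point, and that characters on the same text line share the same y value. The function name `splice_cells` is hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels
Char = Tuple[str, int, int]       # (text, x, y) in the same pixel space


def splice_cells(rects: List[Rect], chars: List[Char]) -> Dict[Rect, str]:
    """Assign each character to the first rectangle containing it, then join per cell."""
    cells: Dict[Rect, List[Tuple[int, int, str]]] = {rect: [] for rect in rects}
    for text, cx, cy in chars:
        for x, y, w, h in rects:
            if x <= cx < x + w and y <= cy < y + h:
                cells[(x, y, w, h)].append((cy, cx, text))
                break  # characters outside every rectangle are ignored
    # Sort by (y, x): top to bottom, then left to right within a line.
    return {
        rect: "".join(text for _, _, text in sorted(found))
        for rect, found in cells.items()
    }
```

The linear scan over rectangles is adequate for a single page; a spatial index would only matter for very large tables.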

In summary, in the embodiment, the characters and the character position information are firstly analyzed from the PDF file, then the image corresponding to the PDF file is subjected to contour recognition through an image recognition algorithm to obtain a rectangular contour array, that is, a contour array of a table is obtained, and finally the characters are subjected to structural processing according to the rectangular contour array and the character position information to obtain the table information, so that the table information is extracted from the PDF file in an image recognition manner, the problem of information extraction errors caused by the fact that the table cannot be analyzed is solved, the extraction efficiency of the table information is improved, and the extraction effect is ensured.

The following further describes a method for extracting table information of a PDF provided by the present application by a specific embodiment.

In this embodiment, the method may include:

step 1, generating a picture corresponding to each page of a PDF according to a given PDF file;

step 2, analyzing the characters in the PDF and their position information through an existing PDF analysis library, and using them as a character array for subsequent processing; the PDF analysis library may include iText.

step 3, carrying out closed contour recognition processing on the picture file generated in step 1 through OpenCV (a cross-platform computer vision and machine learning software library) to obtain a rectangular array.

Preferably, this step may include: first, extracting the three RGB (red, green, blue) channels of the picture; then binarizing the picture using only the red channel, converting it into black and white; then searching for all closed contours through the FindContours function, discarding contours with small areas and retaining only the contours at the bottommost layer, and taking the rectangular area of each retained contour to obtain a rectangular array; then filling these rectangles onto a blank canvas, finding the largest black area by calling the FindContours function again, and removing from the rectangular array the rectangles that are not in that black area to obtain a new rectangular array; and finally, aggregating and de-duplicating the x and y values of each rectangle in the rectangular array to obtain a row array and a column array.
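The final aggregation step can be sketched as follows. The application only says that the x and y values are aggregated and de-duplicated; merging values that differ by a few pixels (the `tolerance` parameter) is an added assumption to absorb rendering jitter, and the function names are hypothetical.

```python
from typing import Iterable, List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def dedupe(values: Iterable[int], tolerance: int = 3) -> List[int]:
    """Sort the values and drop any value within `tolerance` of the last kept one."""
    merged: List[int] = []
    for value in sorted(values):
        if not merged or value - merged[-1] > tolerance:
            merged.append(value)
    return merged


def grid_lines(rects: List[Rect], tolerance: int = 3) -> Tuple[List[int], List[int]]:
    """Return (row array, column array): distinct top edges and distinct left edges."""
    rows = dedupe((y for _, y, _, _ in rects), tolerance)
    cols = dedupe((x for x, _, _, _ in rects), tolerance)
    return rows, cols
```

With `tolerance=0` the function reduces to plain sorting and exact de-duplication.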

step 4, data structuring: traversing the acquired rectangular array and comparing each rectangle with the row array and the column array to obtain the rows and columns occupied by the rectangle, the result being formatted as, for example, (A1: B3); then searching the obtained characters for the text located inside the rectangle and splicing it. In this way, the position in the table and the content of each rectangle are obtained.
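The comparison of a rectangle with the row and column arrays can be sketched with a binary search. The sketch assumes that the row and column arrays hold the sorted top and left edges of the cells, that the rectangle's edges coincide exactly with those values, and that spreadsheet-style labels (letters for columns, 1-based numbers for rows) are wanted. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def column_letters(index: int) -> str:
    """Convert a 0-based column index to spreadsheet letters (0 is A, 26 is AA)."""
    letters = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        letters = chr(ord("A") + remainder) + letters
    return letters


def cell_range(rect: Rect, rows: List[int], cols: List[int]) -> str:
    """Format the rows and columns covered by `rect`, e.g. 'A1:B3' or 'B2'."""
    x, y, w, h = rect
    first_col = bisect_right(cols, x) - 1     # last edge at or before the left side
    last_col = bisect_left(cols, x + w) - 1   # last edge strictly before the right side
    first_row = bisect_right(rows, y) - 1
    last_row = bisect_left(rows, y + h) - 1
    start = f"{column_letters(first_col)}{first_row + 1}"
    end = f"{column_letters(last_col)}{last_row + 1}"
    return start if start == end else f"{start}:{end}"
```

A merged cell spanning several rows or columns produces a range, while an ordinary cell produces a single label.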

It can be seen that, in the embodiment, the characters and the character position information are firstly analyzed from the PDF file, then the image corresponding to the PDF file is subjected to contour recognition through an image recognition algorithm to obtain a rectangular contour array, that is, a contour array of a table is obtained, and finally, the characters are subjected to structural processing according to the rectangular contour array and the character position information to obtain table information.

The following describes a table information extraction device of a PDF according to an embodiment of the present application, and the table information extraction device of a PDF described below and the table information extraction method of a PDF described above may be referred to in correspondence with each other.

Referring to Fig. 2, Fig. 2 is a schematic structural diagram of a PDF table information extraction device according to an embodiment of the present application.

In this embodiment, the apparatus may include:

the character analysis module 100 is used for performing character analysis on the PDF file to obtain characters and character position information;

the table contour identification module 200 is configured to perform closed contour identification processing on the picture corresponding to the PDF file through an image identification algorithm to obtain a rectangular contour array;

and the table structuring processing module 300 is configured to perform structuring processing on the characters according to the rectangular outline array and the character position information to obtain table information.

Optionally, the text parsing module 100 is specifically configured to perform text parsing on the PDF file according to a PDF parsing library to obtain the text and the text position information.

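Optionally, the table contour identifying module 200 may include:

a PDF conversion unit for converting the PDF file into a picture;

a binarization processing unit for performing binarization processing on the picture to obtain a black and white picture;

a closed contour identification unit for identifying the black and white picture through a closed contour algorithm to obtain a rectangular area;

an array conversion unit for converting the rectangular area into the rectangular outline array.

Optionally, the table structuring processing module 300 may include:

a position query unit for querying the rectangular outline array according to the character position information and determining the position of each character in the rectangular array;

a character splicing unit for splicing all characters according to the positions to obtain the table information.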

An embodiment of the present application further provides a computing apparatus, including:

a memory for storing a computer program;

a processor for implementing the steps of the table information extraction method as described in the above embodiments when the computer program is executed.

The present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the table information extraction method according to the above embodiments are implemented.

The embodiments are described in a progressive manner in the specification, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.

Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The method, device, and computer-readable storage medium for extracting table information of a PDF provided by the present application have been described in detail above. The principles and embodiments of the present application are explained herein using specific examples, which are provided only to help understand the method and the core idea of the present application. It should be noted that those skilled in the art can make several improvements and modifications to the present application without departing from its principle, and such improvements and modifications also fall within the scope of the claims of the present application.

Claims

1. A method for extracting form information of PDF is characterized by comprising the following steps:

2. The form information extraction method of claim 1, wherein performing text parsing on a PDF file to obtain text and text position information comprises:

3. The form information extraction method of claim 1, wherein the obtaining of the rectangular contour array by performing closed contour recognition processing on the picture corresponding to the PDF file through an image recognition algorithm comprises:

converting the PDF file into a picture;

4. The form information extraction method of claim 1, wherein the structuring of the text according to the rectangular outline array and the text position information to obtain form information comprises:

5. A form information extraction apparatus of a PDF, comprising:

6. The form information extraction apparatus of claim 5, wherein the text parsing module is specifically configured to perform text parsing on the PDF file according to a PDF parsing library to obtain the text and the text position information.

7. The form information extraction apparatus of claim 5, wherein the form outline recognition module comprises:

a PDF conversion unit for converting the PDF file into a picture;

8. The table information extraction apparatus according to claim 5, wherein the table structuring processing module includes:

9. A computing device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the table information extraction method according to any one of claims 1 to 4 when executing the computer program.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, implements the steps of the table information extraction method according to any one of claims 1 to 4.