WO2020252931A1

WO2020252931A1 - Pdf file data extraction method and apparatus, device, and storage medium

Info

Publication number: WO2020252931A1
Application number: PCT/CN2019/103580
Authority: WO
Inventors: 杨志鸿; 常河; 徐亮; 阮晓雯
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-06-17
Filing date: 2019-08-30
Publication date: 2020-12-24
Also published as: CN110377559A; CN110377559B

Abstract

A PDF file data extraction method and apparatus, a device, and a storage medium. The method comprises: parsing a PDF file, and generating LT sub-objects; acquiring the ordinate and the abscissa of each LT sub-object, correspondingly storing LT sub-objects of each page in a first list, extracting LT sub-objects in an ascending order of the ordinates, and longitudinally arranging, in the first list, the LT sub-objects in an ascending order of the ordinates, wherein the abscissa comprises a left boundary coordinate x0 and a right boundary coordinate x1; during reading performed in a row by row manner, determining whether the LT sub-objects are in the same row by means of a longitudinal distance, and sorting the LT sub-objects into respective rows; and sorting LT sub-objects of each row in an ascending order of x0, and if x1 of an LT sub-object on a left side is equal to x0 of an LT sub-object on a right side, combining the two LT sub-objects to form a combined character string. The method reduces difficulty in extracting information from a monthly statistical bulletin.

Description

A PDF file data extraction method, device, equipment and storage medium

This application claims priority to Chinese Patent Application No. 201910521031.4, filed on June 17, 2019, the entire contents of which are incorporated herein by reference.

Technical field

This application relates to the field of artificial intelligence, and specifically to a method and device, equipment and storage medium for extracting data from a PDF file.

Background technique

The monthly statistical reports of the Bureau of Statistics are all stored in PDF format, which makes data extraction inconvenient: viewing the reports and extracting the required data by hand is time-consuming and labor-intensive. A PDF file can also be converted to Word format and the data then extracted from the Word file. However, the inventor found that existing PDF-to-Word conversion often produces garbled characters and Chinese text in the wrong order. Monthly statistical reports also contain a large number of tables; during conversion, the positions of the tables shift and table contents go missing. The inventor realized that extracting data from PDF files requires a better solution.

Summary of the invention

In order to solve the above technical problems, this application provides a PDF file data extraction method, which is applied to electronic equipment, including:

S10. Use the pdfminer tool to parse the PDF file, and generate a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains an LT sub-object;

S20. Obtain the ordinate and abscissa of each LT sub-object, and store the LT sub-objects of each page in a corresponding first list, where the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object. The LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in that order, in the first list corresponding to that page;

S30. Read the first list row by row. For each type of LT sub-object, during row-by-row reading, determine the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;

S40. For each type of LT sub-object, sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0, and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, combine multiple LT sub-objects to form a combined character string.

This application also provides a PDF file data extraction device, including:

The PDF file parsing module uses the pdfminer tool to parse the PDF file, and generates a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains the LT sub-object;

The LT sub-object storage module is used to obtain the ordinate and abscissa of each LT sub-object and to store the LT sub-objects of each page in a corresponding first list, where the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in that order, in the first list corresponding to that page;

The branch reading module is used to read the first list row by row and, for each type of LT sub-object, to determine during row-by-row reading the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;

The LT sub-object sorting module is used, for each type of LT sub-object, to sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, to combine multiple LT sub-objects to form a combined character string.

The present application also provides an electronic device including a memory and a processor, where the memory stores a PDF file data extraction program, and the PDF file data extraction program, when executed by the processor, implements the steps of the PDF file data extraction method described above.

The present application also provides a computer non-volatile readable storage medium that stores a computer program, the computer program including program instructions which, when executed by a processor, implement the PDF file data extraction method described above.

This application converts the data of the PDF file into Excel format, which greatly reduces the difficulty of extracting information from the monthly statistical report with data analysis software such as Spyder and PyCharm.

Description of the drawings

By describing its embodiments in conjunction with the following drawings, the above-mentioned features and technical advantages of the present application will become clearer and easier to understand.

FIG. 1 is a flowchart showing a method for extracting data from a PDF file according to an embodiment of the present application;

FIG. 2 is a schematic diagram showing the framework of the pdfminer.layout object in an embodiment of the present application;

FIG. 3 is a schematic diagram showing LTChar in the PDF file of the first embodiment of the present application;

FIG. 4 is a schematic diagram showing a data extraction result obtained by row-by-row reading in the first embodiment of the present application;

FIG. 5 is a schematic diagram showing the data extraction result after sorting LTChar in the first embodiment of the present application;

FIG. 6 is a schematic diagram showing LTChar in the PDF file of the second embodiment of the present application;

FIG. 7 is a schematic diagram showing the data extraction result after comparing and combining string coordinates in the second embodiment of the present application;

FIG. 8 is a schematic diagram showing the data extraction result after adding LTLine in the third embodiment of the present application;

FIG. 9 is a schematic diagram showing the data extraction result after adjusting LTLine in the fourth embodiment of the present application;

FIG. 10 is a schematic diagram showing the LTLine in the PDF file of the fifth embodiment of the present application;

FIG. 11 is a schematic diagram showing the hardware architecture of an electronic device according to an embodiment of the present application;

Fig. 12 is a schematic diagram showing program modules of a PDF file data extraction program according to an embodiment of the present application.

Detailed description

Hereinafter, embodiments of the PDF file data extraction method and device, equipment, and storage medium described in this application will be described with reference to the accompanying drawings. A person of ordinary skill in the art may realize that the described embodiments can be modified in various different ways or combinations thereof without departing from the spirit and scope of the present application. Therefore, the drawings and descriptions are illustrative in nature and are not intended to limit the scope of protection of the claims. In addition, in this specification, the drawings are not drawn to scale, and the same reference numerals denote the same parts.

The PDF file data extraction method of this embodiment is applied to the extraction of text and tables in the PDF file. The text can be formed in a table, or it can be divided into paragraphs without a table. Take the monthly statistical report in PDF format as an example.

First embodiment

Fig. 1 shows a flowchart of a method for extracting data from a PDF file in this embodiment. The method includes the following steps:

Step S10: use the pdfminer tool (a tool for extracting information from a PDF document) to parse the PDF file and generate a pdfminer.layout object for each page of the PDF, as shown in Figure 2. The pdfminer.layout object may contain multiple LT sub-objects, and an LT sub-object may be at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object. An LTChar is a character with a bounding box. The following description mainly takes LTChar as the example, so the characters mentioned below are LT sub-objects.

The LT sub-objects may further include LTFigure (area box) and LTLine (separation line) sub-objects. An LTFigure represents the area occupied by an area box, which is used to embed a picture or another PDF document.
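As an illustration of step S10, the following is a minimal sketch that assumes the pdfminer.six distribution and its `extract_pages` helper; the source names only "the pdfminer tool", so the package choice and the helper names `load_layouts` and `describe` are assumptions made here.

```python
def load_layouts(pdf_path):
    """Return one layout object per page of the PDF.

    Assumes pdfminer.six is installed; the import is done lazily so the
    rest of this sketch can be used without it.
    """
    from pdfminer.high_level import extract_pages
    return list(extract_pages(pdf_path))


def describe(layout):
    """List (class name, bounding box) for each top-level child of a page layout."""
    return [(type(obj).__name__, getattr(obj, "bbox", None)) for obj in layout]
```

Calling `describe(load_layouts("report.pdf")[0])` on a real file (the file name is hypothetical) would list the class names, such as LTTextBoxHorizontal or LTLine, together with their (x0, y0, x1, y1) boxes.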

Step S20: Obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained; if it is an LTLine, the abscissa and ordinate of each dividing line are obtained; if it is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list: the first page corresponds to one first list, the second page to another. The LT sub-objects of each page's pdfminer.layout object are saved in the form {'pageN': [LTobjs of layout]}, where N denotes the Nth layout and [LTobjs of layout] is an array.

Preferably, the ordinate includes the ordinate y0 of the lower left corner of the LT sub-object and the ordinate y1 of the upper right corner of the LT sub-object, and the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object. The LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged vertically in the first list in ascending order of the ordinate of the lower left corner.
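The per-page storage of step S20 can be sketched as follows. `LTObj` is a hypothetical stand-in for a pdfminer LT sub-object that keeps only its text and bounding box, and `build_first_lists` is an illustrative name, not one taken from the source.

```python
from dataclasses import dataclass


@dataclass
class LTObj:
    """Hypothetical stand-in for an LT sub-object: text plus bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def build_first_lists(pages):
    """Map 'pageN' to that page's LT sub-objects in ascending order of y0."""
    return {
        f"page{n}": sorted(objs, key=lambda o: o.y0)
        for n, objs in enumerate(pages, start=1)
    }


pages = [
    [LTObj("b", 10, 50, 20, 60), LTObj("a", 10, 20, 20, 30)],
    [LTObj("c", 10, 5, 20, 15)],
]
first_lists = build_first_lists(pages)
```

Here the sort key is the lower-left ordinate y0, following the ascending order stated in the text.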

In step S30, the content of the first list is read row by row in the form [list of line(list)]. During row-by-row reading, the vertical distance is used to determine whether LT sub-objects are in the same row, and the LT sub-objects are divided into rows accordingly. The formula for judging by vertical distance whether LT sub-objects are in the same row is as follows:

|LTtext[i]_y0 - LTtext[i+1]_y0| < |LTtext[i]_y1 - LTtext[i]_y0|    (1)

where |LTtext[i]_y1 - LTtext[i]_y0| is the height of the LT sub-object (for example, if the LT sub-object is a character, its height is the height of the character);

|LTtext[i]_y0 - LTtext[i+1]_y0| is the difference between the y0 of the i-th LT sub-object and the y0 of the (i+1)-th LT sub-object;

i denotes the i-th LT sub-object.

According to formula (1), if the difference between the y0 of the i-th LT sub-object and the y0 of the (i+1)-th LT sub-object is less than the height of one LT sub-object, the vertical distance between the two is smaller than the height an LT sub-object occupies, so the i-th and (i+1)-th LT sub-objects should be in the same row. If the distance between the i-th and (i+1)-th LT sub-objects is greater than the height of an LT sub-object, the two should be placed in different rows.
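A minimal sketch of the row test in formula (1), applied to consecutive entries of a first list. `LTObj` is a hypothetical stand-in for an LT sub-object, and the function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class LTObj:
    """Hypothetical stand-in for an LT sub-object: text plus bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def same_row(cur, nxt):
    """Formula (1): |y0_i - y0_(i+1)| < |y1_i - y0_i|."""
    return abs(cur.y0 - nxt.y0) < abs(cur.y1 - cur.y0)


def split_rows(first_list):
    """Walk the list in order, starting a new row whenever formula (1) fails."""
    rows = []
    for obj in first_list:
        if rows and same_row(rows[-1][-1], obj):
            rows[-1].append(obj)
        else:
            rows.append([obj])
    return rows


objs = [
    LTObj("A", 0, 100, 10, 110),
    LTObj("B", 10, 101, 20, 111),
    LTObj("C", 0, 80, 10, 90),
]
rows = split_rows(objs)
```

In this example "A" and "B" differ in y0 by 1, which is below the character height of 10, so they share a row; "C" differs by 21 and starts a new one.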

Step S40: the LT sub-objects of the same row are sorted from left to right in ascending order of the left boundary coordinate x0. For each row, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, multiple LT sub-objects are combined to form a combined character string. Through step S40, the LT sub-objects in the same row are combined in the order they have in the original PDF file, restoring the text order of the PDF file.
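Step S40 can be sketched as follows, under the assumption that touching edges are detected by exact equality as the text states. `LTObj` is a hypothetical stand-in for an LT sub-object and `merge_row` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass
class LTObj:
    """Hypothetical stand-in for an LT sub-object: text plus bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge_row(row):
    """Sort by x0, then merge neighbours whose edges touch (left.x1 == right.x0)."""
    merged = []
    for obj in sorted(row, key=lambda o: o.x0):
        if merged and merged[-1].x1 == obj.x0:
            last = merged[-1]
            # The combined box keeps the left edge of the left part
            # and the right edge of the right part.
            merged[-1] = LTObj(last.text + obj.text, last.x0, last.y0, obj.x1, obj.y1)
        else:
            merged.append(obj)
    return merged


row = [LTObj("b", 10, 0, 20, 10), LTObj("a", 0, 0, 10, 10), LTObj("c", 25, 0, 35, 10)]
strings = merge_row(row)
```

"a" and "b" touch at x = 10 and are combined, while the gap before "c" keeps it as a separate string.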

The above describes extracting the content of LTChar objects. Because the main content of a monthly report usually also includes tables, LTLine objects must be extracted as well, and the boundary lines of the table are determined from the LTLine coordinates.

Further, the method includes step S50: the left boundary coordinate of the leftmost character of a combined character string is used as the left boundary coordinate of the combined character string, and the right boundary coordinate of its rightmost character is used as the right boundary coordinate of the combined character string;

the left boundary coordinates of the combined character strings are compared, and the combined character strings are sorted from left to right in ascending order of left boundary coordinate.

Further, the method includes step S60: the abscissa positions of all the vertical lines of LTLine are sorted from left to right in ascending order, and the ordinate positions of all the horizontal lines of LTLine are sorted from top to bottom in ascending order, thereby forming a table.

The following is a specific example to illustrate the data extraction process. Extract the list and text in the PDF file shown in Figure 3. As shown in Figure 3, the text includes the following three lines:

["其", "他"...... "-", "5", "8", ".", "7"]

["Total", "Count"......"Electricity", "Sub", "100 million", "Yuan"......"1","0",".","9"]

["Equipment", "Preparation", "System", "Making", "Industry", "General"]

During reading, the absolute value of the difference between the y0 values of "其" and "他" is less than the height of the LT sub-object, so "其" and "他" belong to the same line. In the same way, all LT sub-objects that belong to the same row are allocated to that row. When "Total" is read after "7", the absolute value of the difference between the y0 value of "7" and the y0 value of "Total" is greater than the height of the LT sub-object, so "7" and "Total" are not in the same row and "Total" starts a new line.

Because the words "Equipment Manufacturing" sit vertically between two lines, the absolute value of the difference between the y0 value of "9" and the y0 value of the first character of "Equipment Manufacturing" is less than the height of the LT sub-object. Therefore, when the text is extracted, "Total: Computers, Communications and Other Electronics" and "Equipment Manufacturing" are stored in the same row of the first list, with "Equipment Manufacturing" placed after "10.9" (reading proceeds row by row from top to bottom, so "10.9" is read before the first character of "Equipment Manufacturing"). The resulting file is shown in Figure 4.

Next, the characters on the same line are formed into combined character strings. For "Tong" and "Xin" in the second line, x1 of "Tong" is equal to x0 of "Xin", so "Tong" and "Xin" are combined. For "Sub" and "100 million" in the second line, x1 of "Sub" is not equal to x0 of "100 million", so the two are not combined and the gap between x1 of "Sub" and x0 of "100 million" is preserved. By comparing the values of x1 and x0 in this way, the LT sub-objects of each row are formed into combined character strings. For example, the second line forms the combined character strings "Total: Computers, Communications and Other Electronics", "100 million yuan", "490.31", "3202.49", "10.9", and "Equipment Manufacturing". The left boundary coordinate of the leftmost character of a combined character string is used as the left boundary coordinate of the combined character string, and the right boundary coordinate of its rightmost character is used as its right boundary coordinate. Likewise, the ordinate of the lower left corner of the leftmost character is used as the ordinate of the lower left corner of the combined character string, and the ordinate of the upper right corner of the rightmost character is used as the ordinate of its upper right corner.

The left and right boundary coordinates of the combined character strings are then compared. If the x0 value of the combined string on the right minus the x1 value of the combined string on the left is less than a preset splicing threshold, such as 0.01 (so that small errors do not prevent strings from being joined), the two combined strings are spliced together. The second line of this embodiment has no combined character strings that meet this condition.
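The splicing rule can be sketched as below. The tuple layout (text, x0, x1), the function name `splice`, and the sample coordinates are assumptions for illustration; the inequality is implemented literally as stated, so an overlapping pair (negative gap) is also spliced.

```python
def splice(strings, threshold=0.01):
    """Join neighbours whose horizontal gap is below the splicing threshold.

    strings: list of (text, x0, x1) tuples already sorted by x0.
    """
    out = []
    for text, x0, x1 in strings:
        # Gap = x0 of the right string minus x1 of the left string.
        if out and x0 - out[-1][2] < threshold:
            prev_text, prev_x0, _ = out[-1]
            out[-1] = (prev_text + text, prev_x0, x1)
        else:
            out.append((text, x0, x1))
    return out


parts = [("490", 0.0, 10.0), (".31", 10.005, 20.0), ("10.9", 30.0, 40.0)]
spliced = splice(parts)
```

The first two parts are separated by roughly 0.005, below the threshold, and are joined; the third part is 10 units away and stays separate.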

However, these combined character strings are currently arranged in the order "Total: Computers, Communications and Other Electronics", "100 million yuan", "490.31", "3202.49", "10.9", "Equipment Manufacturing", which does not match the original PDF file. Therefore, the left boundary coordinates of the combined character strings are compared, and the strings are sorted from left to right in ascending order of left boundary coordinate. For example, the left boundary coordinate of "Total: Computers, Communications and Other Electronics" is smaller than that of "Equipment Manufacturing", so the former should be to the left of the latter. The left boundary coordinate of "Equipment Manufacturing" is in turn smaller than those of "100 million yuan", "490.31", "3202.49", and "10.9", so "Equipment Manufacturing" is moved between "Total: Computers, Communications and Other Electronics" and "100 million yuan".

At this point, the combined character strings of the second line are arranged in the order "Total: Computers, Communications and Other Electronics", "Equipment Manufacturing", "100 million yuan", "490.31", "3202.49", "10.9", as shown in Figure 5.

The abscissa positions of all the vertical lines of LTLine are sorted from left to right in ascending order, and the ordinate positions of all the horizontal lines of LTLine are sorted from top to bottom in ascending order, thereby forming a table, as shown in Figure 8.
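One way to turn the sorted line positions into a table grid is sketched below. De-duplicating the positions and locating a cell with `bisect` are choices made for this sketch rather than steps stated in the source, and `build_grid` and `cell_of` are illustrative names.

```python
from bisect import bisect_right


def build_grid(vertical_xs, horizontal_ys):
    """Return sorted, de-duplicated column and row boundaries."""
    return sorted(set(vertical_xs)), sorted(set(horizontal_ys))


def cell_of(x, y, cols, rows):
    """Return (row_index, col_index) of the cell containing (x, y), or None if outside."""
    c = bisect_right(cols, x) - 1
    r = bisect_right(rows, y) - 1
    if 0 <= c < len(cols) - 1 and 0 <= r < len(rows) - 1:
        return r, c
    return None


cols, rows = build_grid([100, 0, 50, 50], [40, 0, 20])
inside = cell_of(60, 25, cols, rows)
outside = cell_of(120, 25, cols, rows)
```

Row index 0 here is the band with the smallest ordinate, matching the ascending sort; a string can be assigned to a cell by looking up its lower-left corner.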

The above description uses the second line as the example; the other lines are handled in the same way and are not described again.

In an optional embodiment, in step S20, for the LTFigure, the LT sub-objects therein are iteratively extracted to form a second list containing all the LT sub-objects in the LTFigure, and stored in the first list.
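The extraction of objects nested inside an LTFigure can be sketched with plain classes. `Leaf` and `Figure` are hypothetical stand-ins used so the example runs without pdfminer; the only property relied on is that a figure can be iterated to yield its children.

```python
class Leaf:
    """Hypothetical leaf object, such as a character or a line."""
    def __init__(self, name):
        self.name = name


class Figure:
    """Hypothetical container that is iterated like an LTFigure."""
    def __init__(self, children):
        self.children = children

    def __iter__(self):
        return iter(self.children)


def flatten_figure(figure):
    """Collect every leaf inside a figure, descending into nested figures."""
    second_list = []
    for child in figure:
        if isinstance(child, Figure):
            second_list.extend(flatten_figure(child))
        else:
            second_list.append(child)
    return second_list


fig = Figure([Leaf("a"), Figure([Leaf("b"), Leaf("c")]), Leaf("d")])
first_list = [Leaf("x")]
first_list.extend(flatten_figure(fig))  # store the second list in the first list
```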

Second embodiment

The second embodiment is basically the same as the first embodiment, and part of the content that is the same as the first embodiment is omitted here, and only features different from the first embodiment are described.

The left boundary coordinates of the combined character strings in the same line can also be compared. If the left boundary coordinates are the same, as shown in Figure 6, where the x0 values of "Total: Computers, Communications and Other Electronics" and "Equipment Manufacturing" are equal, the y0 values of the two combined character strings are further compared, and the combined character string with the higher y0 value is placed in front of the one with the lower y0 value. For example, in Figure 6 the y0 of "Total: Computers, Communications and Other Electronics" is greater than the y0 of "Equipment Manufacturing", so "Equipment Manufacturing" is spliced after "Total: Computers, Communications and Other Electronics", as shown in Figure 7.
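The tie-break of this embodiment amounts to a two-part sort key, sketched below. The tuple layout (text, x0, y0) and the coordinates are invented for illustration.

```python
def order_strings(strings):
    """Sort by x0 ascending; when x0 is equal, put the higher y0 first.

    strings: list of (text, x0, y0) tuples.
    """
    return sorted(strings, key=lambda s: (s[1], -s[2]))


row = [
    ("100 million yuan", 200, 95),
    ("Equipment Manufacturing", 50, 90),
    ("Total: Computers, Communications and Other Electronics", 50, 100),
]
ordered = [text for text, _, _ in order_strings(row)]
```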

Third embodiment

The third embodiment is basically the same as the first embodiment, and part of the same content as the first embodiment is omitted here, and only the features that are different from the first embodiment are described.

If positions overlap after sorting in ascending order of x0, that is, the interval between the left and right boundary coordinates of one combined character string falls within the interval between the left and right boundary coordinates of another combined character string, the first string may have been displaced by a line break in the other, as shown in FIG. 3. For example, if the interval between the left and right boundary coordinates of "Equipment Manufacturing" falls within the interval between the left and right boundary coordinates of "Total: Computers, Communications and Other Electronics", the y0 values of the two combined strings are further compared, and the string with the higher y0 value is placed in front of the one with the lower y0 value. This yields "Total: Computers, Communications and Other Electronic Equipment Manufacturing", as shown in Figure 7.
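The containment test and the resulting join can be sketched as follows. `Span`, `contained`, and `join_wrapped` are hypothetical names, and the shortened texts and coordinates are invented for the example.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    """Hypothetical combined character string with its horizontal extent and y0."""
    text: str
    x0: float
    x1: float
    y0: float


def contained(inner: Span, outer: Span) -> bool:
    """True if inner's [x0, x1] interval lies within outer's interval."""
    return outer.x0 <= inner.x0 and inner.x1 <= outer.x1


def join_wrapped(a: Span, b: Span) -> Optional[str]:
    """If one span lies within the other, join them with the higher y0 first."""
    if contained(a, b) or contained(b, a):
        first, second = (a, b) if a.y0 > b.y0 else (b, a)
        return first.text + second.text
    return None


upper = Span("Other Electronic ", 50, 180, 100)
lower = Span("Equipment Manufacturing", 60, 150, 90)
joined = join_wrapped(lower, upper)
apart = join_wrapped(upper, Span("10.9", 300, 320, 95))
```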

Fourth embodiment

The fourth embodiment is basically the same as the first embodiment, and part of the content that is the same as the first embodiment is omitted here, and only features different from the first embodiment are described.

After the positions of the combined character strings have been adjusted according to the overlap, the LTLine values corresponding to the combined character strings are also compared. If the vertical-line values of the LTLine corresponding to two combined character strings are exactly the same, the two strings are in the same cell in the original PDF file. For example, the LTLine value corresponding to "Total: Computers, Communications and Other Electronics" is exactly the same as the LTLine value corresponding to "Equipment Manufacturing", yet the two vertical lines now separate the two combined strings. Therefore, according to the number of shifted LT sub-objects that satisfy |LTtext[i]_y0 - LTtext[i+1]_y0| < |LTtext[i]_y1 - LTtext[i]_y0|, the right vertical line of the corresponding LTLine is moved to the right by a distance corresponding to that number. For example, "Equipment Manufacturing" consists of 5 LTChars, each of which satisfies |LTtext[i]_y0 - LTtext[i+1]_y0| < |LTtext[i]_y1 - LTtext[i]_y0| with respect to the other characters in the line, so the right vertical line is moved to the right by a distance corresponding to 5 LTChars. All the text of "Total: Computers, Communications and Other Electronics" and "Equipment Manufacturing" is then enclosed in the cell, forming "Total: Computers, Communications and Other Electronic Equipment Manufacturing", as shown in Figure 9. Correspondingly, the vertical lines and LTChars to the right of that vertical line are also moved to the right by a distance of 5 LTChars. In other words, to splice "Equipment Manufacturing" after "Total: Computers, Communications and Other Electronics", all objects behind it are moved to the right and the positions of the vertical lines are adjusted, so that a reasonable parsing result is obtained.
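The line shift can be sketched as below. The source speaks only of "a distance corresponding to 5 LTChars"; treating that distance as the character count multiplied by a single character width is an assumption of this sketch, as are the function name and the sample coordinates.

```python
def shift_right(vertical_xs, right_line_x, n_chars, char_width):
    """Move the cell's right vertical line, and every line to its right,
    by n_chars * char_width. Lines to the left are left untouched."""
    offset = n_chars * char_width
    return [x + offset if x >= right_line_x else x for x in vertical_xs]


lines = [0, 100, 150, 200]
# The cell's right edge is the line at x = 100; 5 characters of width 10 were appended.
shifted = shift_right(lines, right_line_x=100, n_chars=5, char_width=10)
```

The same offset would be applied to the x0 and x1 of every LTChar to the right of the moved line.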

Fifth embodiment

The fifth embodiment is basically the same as the first embodiment, and part of the same content as the first embodiment is omitted here, and only the features that are different from the first embodiment are described.

For each line, determine whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string. If it is greater, the leftmost vertical line is to the right of the leftmost character string, that is, the string is not fully enclosed in a cell; a vertical line is therefore added at the left boundary coordinate of the leftmost character string so that it is enclosed in a cell as well. Similarly, determine whether the abscissa of the rightmost vertical line is smaller than the abscissa of the rightmost character string, and if it is smaller, add a vertical line at the right boundary coordinate of the rightmost character string. As shown in Figure 10, there is no vertical line to the left of the leftmost character string "Total" in the second row, so a vertical line is added to the left of "Total".
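A minimal sketch of adding the missing outer lines for one row. Comparing the rightmost vertical line against the right boundary x1 of the rightmost string is an interpretation made here, since the text says only "the abscissa"; the name `close_row` and the coordinates are illustrative.

```python
def close_row(vertical_xs, strings):
    """Add vertical lines so the outermost strings of a row are enclosed.

    vertical_xs: x positions of the existing vertical lines.
    strings: non-empty list of (text, x0, x1) tuples for the row.
    """
    xs = sorted(vertical_xs)
    left = min(s[1] for s in strings)
    right = max(s[2] for s in strings)
    if not xs or xs[0] > left:
        xs.insert(0, left)   # leftmost line was to the right of the leftmost string
    if xs[-1] < right:
        xs.append(right)     # rightmost line was to the left of the rightmost string
    return xs


row_strings = [("Total", 20, 55), ("10.9", 130, 170)]
closed = close_row([60, 120, 180], row_strings)
```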

The above describes LTLine and LTChar; LTTextBoxHorizontal and LTFigure are handled in the same way. An LTTextBoxHorizontal, as an LT sub-object, is treated like a character: the parsed LTTextBoxHorizontal objects can be arranged by ordinate and abscissa using the method described above.

In an optional embodiment, a similarity is calculated between each line of the extracted document and the corresponding line of the original PDF document. If the similarity is lower than a similarity threshold, the line is cut into text blocks at the positions of the vertical lines of LTLine, and the similarity between each text block and the corresponding part of the PDF document is calculated. If that similarity is lower than the similarity threshold, some of the characters in the block are considered to have been garbled during recognition. A text block containing garbled characters is divided again by character width, and the similarity between each character and the corresponding part of the original text is calculated; a character whose similarity is lower than the similarity threshold is considered garbled.
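The source does not name a similarity measure or a threshold value. The sketch below uses the ratio from Python's standard `difflib.SequenceMatcher` and a threshold of 0.8 purely as illustrative assumptions, with invented text blocks.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1] between two strings (difflib ratio, an assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def find_garbled(extracted_blocks, reference_blocks, threshold=0.8):
    """Return indices of blocks whose similarity to the reference is below the threshold."""
    return [
        i
        for i, (ext, ref) in enumerate(zip(extracted_blocks, reference_blocks))
        if similarity(ext, ref) < threshold
    ]


extracted = ["490.31", "32?2.4?", "10.9"]
reference = ["490.31", "3202.49", "10.9"]
suspect = find_garbled(extracted, reference)
```

The same function can be reapplied at a finer granularity, first per line, then per text block, then per character, as the paragraph above describes.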

Usually, garbled characters arise during recognition because the fonts embedded in the PDF document use a custom encoding that either lacks a mapping to the standard encoding or has an incorrect one. A Word document contains only the standard encoding, so when the text is recognized in the Word document, the Unicode code of the recognized character cannot be found there and garbled characters are displayed.

Taking the font as the unit, a mapping can be established between the current Unicode encoding of the garbled characters in the embedded font and the standard Unicode encoding, thereby removing the garbled characters.
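Once such a per-font mapping exists, applying it is a character translation, as sketched below. How the mapping is obtained is not specified in the source; the font name, the private-use code points, and the target characters here are invented for illustration.

```python
def build_translation(font_maps, font_name):
    """Build a str.translate table for one font from its code-point mapping."""
    return str.maketrans(font_maps.get(font_name, {}))


# Hypothetical mapping: custom code points of an embedded font -> standard characters.
font_maps = {"CustomFont": {"\ue001": "计", "\ue002": "算"}}

table = build_translation(font_maps, "CustomFont")
fixed = "\ue001\ue002机".translate(table)
unchanged = "abc".translate(build_translation(font_maps, "OtherFont"))
```

Keeping one table per font follows the "font as the unit" idea: the same custom code point may stand for different characters in different embedded fonts.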

Further, recognition can be performed each time a line of text is extracted from the PDF document, so that the mapping between the current encoding of the PDF embedded font and the standard encoding of the Word document is established as early as possible and garbled characters are reduced in the subsequent extraction process.

Refer to FIG. 11, which is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present application. In this embodiment, the electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. For example, it can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers). As shown in FIG. 11, the electronic device 2 includes at least, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus.

The memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, and the like. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, for example, a hard disk or a memory of the electronic device 2. In other embodiments, the memory 21 may also be an external storage device of the electronic device 2, for example, a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 2. The memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device. In this embodiment, the memory 21 is generally used to store an operating system and various application software installed in the electronic device 2, such as the PDF file data extraction program code. In addition, the memory 21 can also be used to temporarily store various types of data that have been output or will be output.

The processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 22 is generally used to control the overall operation of the electronic device 2, for example, perform data interaction or communication-related control and processing with the electronic device 2. In this embodiment, the processor 22 is configured to run the program code or process data stored in the memory 21, for example, run the PDF file data extraction program.

Optionally, the electronic device 2 may also include a display, and the display may also be called a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display, etc. The display is used to display the information processed in the electronic device 2 and to display a visualized user interface.

It should be pointed out that FIG. 11 only shows the electronic device 2 with components 21-22, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.

The memory 21 containing a readable storage medium may include an operating system, a PDF file data extraction program 50, and the like. When the processor 22 executes the PDF file data extraction program 50 in the memory 21, the steps of the PDF file data extraction method described above are implemented. In this embodiment, the PDF file data extraction program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and may be executed by one or more processors (in this embodiment, the processor 22) to complete the application. For example, FIG. 12 shows a schematic diagram of the program modules of the PDF file data extraction program. In this embodiment, the PDF file data extraction program 50 can be divided into a PDF file parsing module 501, an LT sub-object storage module 502, a branch reading module 503, and an LT sub-object sorting module 504. A program module, as referred to in this application, is a series of computer program instruction segments that can complete a specific function, and is more suitable than a program for describing the execution process of the PDF file data extraction program in the electronic device 2. The following description introduces the specific functions of the program modules.

The PDF file parsing module 501 is used to parse the PDF file with the pdfminer tool (a tool that can extract information from PDF documents) and to generate a pdfminer.layout object corresponding to each page of the PDF. The pdfminer.layout object contains a plurality of LT sub-objects, and the LT sub-objects include at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object.

Further, LT sub-objects may also include LTFigure (area box) sub-objects and LTLine (separation line) sub-objects, where LTFigure represents an area occupied by the area box, and the area box is used to introduce, for example, a picture or another PDF document.

The LT sub-object storage module 502 is used to obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained; if the LT sub-object is an LTLine, the abscissa and ordinate of each dividing line are obtained; if the LT sub-object is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list, so that each page has its own first list: the first page corresponds to one first list, and the second page corresponds to another. To save the LT sub-objects of each page of the pdfminer.layout object into a first list, the corresponding command is {'pageN': [LTobjs of layout]}, where N represents the Nth layout and [LTobjs of layout] is an array.
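As an illustration of the {'pageN': [LTobjs of layout]} structure, the following is a minimal sketch rather than the applicant's implementation. The function names build_first_lists and parse_pdf are hypothetical. The sketch assumes the pdfminer.six distribution, whose pdfminer.high_level.extract_pages function yields one layout object per page; the import is deferred so that the dictionary-building part can be tried without the library installed.

```python
def build_first_lists(layouts):
    """Map each page layout to its own list of LT sub-objects.

    `layouts` is any iterable of per-page iterables; keys follow the
    'pageN' pattern described in the text, with N starting at 1.
    """
    return {f"page{n}": list(layout) for n, layout in enumerate(layouts, start=1)}


def parse_pdf(path):
    # Assumption: pdfminer.six is installed. extract_pages yields one
    # layout object per page, and iterating a layout yields its
    # top-level LT sub-objects.
    from pdfminer.high_level import extract_pages
    return build_first_lists(extract_pages(path))
```

Calling build_first_lists on two pages produces a dictionary with the keys 'page1' and 'page2', each holding that page's own list.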

The ordinate includes the ordinate y0 of the lower left corner of the LT sub-object and the ordinate y1 of the upper right corner of the LT sub-object, and the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object. The LT sub-objects of the pdfminer.layout object are extracted in ascending order of the ordinate and arranged in the first list in ascending order of the lower-left ordinate y0.
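The ordering by lower-left ordinate can be sketched as follows. The Box class is a hypothetical stand-in for an LT sub-object that carries the same x0, y0, x1, y1 attributes, and sort_by_ordinate is an illustrative name, not part of pdfminer.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def sort_by_ordinate(objs):
    # Ascending order of the lower-left ordinate y0; sorted() is stable,
    # so objects with equal y0 keep their original relative order.
    return sorted(objs, key=lambda o: o.y0)
```

In pdfminer's coordinate system the origin is at the bottom-left of the page, so ascending y0 runs from the bottom of the page upward.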

The branch reading module 503 is used to read the content in the first list row by row through the [list of line(list)] command. During the row-by-row reading, it also judges from the vertical distance between LT sub-objects whether they are in the same row, so that the LT sub-objects are divided into rows. The formula for judging whether LT sub-objects are in the same row by vertical distance is as follows:

$$\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert \qquad (1)$$

i represents the i-th LT sub-object.

With this formula, if the difference between the y0 of the i-th LT sub-object and that of the (i+1)-th LT sub-object is less than the height of one LT sub-object, the vertical distance between them is smaller than the height an LT sub-object occupies, so the i-th and (i+1)-th LT sub-objects should be in the same row. If the distance between the i-th LT sub-object and the (i+1)-th LT sub-object is greater than the height of an LT sub-object, the two should be placed in different rows.
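The same-row test of formula (1) can be sketched in a few lines. Box is a hypothetical stand-in for an LT sub-object, and same_row and split_into_rows are illustrative names. The sketch assumes the input list is already ordered by y0, as the first list is.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def same_row(cur, nxt):
    # Formula (1): |y0[i] - y0[i+1]| < |y1[i] - y0[i]|
    return abs(cur.y0 - nxt.y0) < abs(cur.y1 - cur.y0)


def split_into_rows(objs):
    """Compare each object with its predecessor and start a new row
    whenever formula (1) fails."""
    rows = []
    for obj in objs:
        if rows and same_row(rows[-1][-1], obj):
            rows[-1].append(obj)
        else:
            rows.append([obj])
    return rows
```

Because each object is compared only with the one read immediately before it, a slowly drifting baseline still stays in one row, while a jump larger than one object height opens a new row.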

The LT sub-object sorting module 504 is used to sort the LT sub-objects of the same row in ascending order of x0. For each row, if x1 of the LT sub-object on the left is equal to x0 of the adjacent LT sub-object on the right, the two LT sub-objects are combined to form a combined character string. Through step S40, the LT sub-objects in the same row can be combined according to their order in the original PDF file, restoring the text order of the PDF file.
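A minimal sketch of the sort-and-combine step follows. Box and merge_row are hypothetical names. The tol parameter is an assumption added here to absorb floating-point noise; with the default of 0 the comparison is the exact equality x1 = x0 described in the text.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge_row(row, tol=0.0):
    """Sort one row by x0 and join neighbours whose edges touch."""
    groups = []
    for obj in sorted(row, key=lambda o: o.x0):
        # Touching edges: x1 of the left object equals x0 of the right one.
        if groups and abs(groups[-1][-1].x1 - obj.x0) <= tol:
            groups[-1].append(obj)
        else:
            groups.append([obj])
    return ["".join(o.text for o in g) for g in groups]
```

A gap between two objects therefore ends one combined character string and starts the next.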

The following is a specific example to illustrate the data extraction process. Extract the text in the list in the PDF file shown in Figure 3. As shown in Figure 3, the text includes the following three lines:

["其", "他"...... "-", "5", "8", ".", "7"......]

["Equipment", "Preparation", "System", "Making", "Industry", "General"...]

During the reading process, the absolute value of the difference between the y0 values of "其" and "他" is less than the height of the LT sub-object, so "其" and "他" should be in the same line. Similarly, all LT sub-objects that belong in the same row are allocated to the same row. When "7" is read, the absolute value of ("7" y0 value - "total" y0 value) is greater than the height of the LT sub-object, so "7" and "total" are not on the same line, and "total" starts a new line.

Because the words "equipment manufacturing" sit between the two lines, the absolute value of ("9" y0 value - "set" y0 value) is less than the height of the LT sub-object. Therefore, when the text is extracted, "In total: computer, communication and other electronic" and "equipment manufacturing" are stored in the same row of the first list, and "equipment manufacturing" is added after "10.9" (because reading proceeds line by line from top to bottom, "10.9" must be read first, and then "set"). The resulting file is shown in Figure 4.

The method also includes step S50, in which the characters in the same row form combined character strings. For "Tong" and "Xin" in the second line, since the x1 of "Tong" is equal to the x0 of "Xin", "Tong" and "Xin" are combined. As for "zi" and "billion" in the second row, since the x1 of "zi" is not equal to the x0 of "billion", "zi" and "billion" are not combined; instead, the interval between the x1 of "zi" and the x0 of "billion" is maintained. By comparing the values of x1 and x0, the LT sub-objects of each row can be formed into combined character strings. For example, the second line can form the combined character strings "in total: computers, communications and other electronics", "100 million yuan", "490.31", "3202.49", "10.9", and "equipment manufacturing". The left boundary coordinate of the leftmost character string is used as the left boundary coordinate of the combined character string, and the right boundary coordinate of the rightmost character string is used as the right boundary coordinate of the combined character string. The ordinate of the lower left corner of the leftmost character string is used as the ordinate of the lower left corner of the combined character string, and the ordinate of the upper right corner of the rightmost character string is used as the ordinate of the upper right corner of the combined character string.
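The bounding box of a combined character string can be sketched as below. Box and combine are hypothetical names, and the coordinates in the usage example are invented for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object with a bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def combine(group):
    """Merge touching objects into one combined character string.

    x0 and y0 come from the leftmost object, x1 and y1 from the
    rightmost one, as described in the text.
    """
    ordered = sorted(group, key=lambda o: o.x0)
    left, right = ordered[0], ordered[-1]
    return Box(
        text="".join(o.text for o in ordered),
        x0=left.x0,
        y0=left.y0,
        x1=right.x1,
        y1=right.y1,
    )
```

For example, combining a "Tong" box spanning x from 0 to 5 with a "Xin" box spanning x from 5 to 10 gives a single box spanning 0 to 10.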

In an optional embodiment, a first position correction module 505 is further included. The first position correction module 505 compares the left boundary coordinates of the combined character strings in the same line. If the left boundary coordinates are the same, for example, as shown in Figure 6, if the x0 values of "Total: Computer, Communication and Other Electronics" and "Equipment Manufacturing" are equal, the y0 values of the two combined character strings are further compared, and the one with the higher y0 value is arranged before the one with the lower y0 value. For example, in Figure 5, the y0 of "Total: Computer, Communication and Other Electronics" is greater than the y0 of "Equipment Manufacturing", so "Equipment Manufacturing" is spliced after "Total: Computer, Communication and Other Electronics", as shown in Figure 7.
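The tie-breaking rule of the first position correction can be expressed as a two-part sort key. This is a sketch under the assumption that combined character strings carry x0 and y0 attributes; Box and order_strings are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a combined character string."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def order_strings(strings):
    # Primary key: ascending left boundary x0.
    # Tie-break: the higher y0 comes first, hence the negated y0.
    return sorted(strings, key=lambda s: (s.x0, -s.y0))
```

Two strings that share the same x0 therefore come out with the upper one first, so a wrapped fragment follows the text it continues.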

In an optional embodiment, a second position correction module 506 is further included. If positions overlap after sorting in ascending order of x0, that is, if the interval between the left and right boundary coordinates of one combined character string falls within the interval between the left and right boundary coordinates of another combined character string, the combined character string may have been displaced by a line break in the other combined character string. For example, as shown in Figure 3, the interval between the left and right boundary coordinates of "equipment manufacturing industry" falls within the interval between the left and right boundary coordinates of "Total: Computers, Communications and Other Electronics". In this case, the second position correction module 506 further compares the y0 values of the two combined character strings and arranges the combined character string with the higher y0 value before the one with the lower y0 value. This results in "in total: computer, communications and other electronic equipment manufacturing industries", as shown in Figure 7.
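The overlap check of the second position correction might look like the following sketch. Box, contained, and splice_wrapped are hypothetical names, and returning None when there is no overlap is a design choice made here for illustration.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a combined character string."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def contained(inner, outer):
    # True when inner's [x0, x1] interval lies within outer's interval.
    return outer.x0 <= inner.x0 and inner.x1 <= outer.x1


def splice_wrapped(a, b):
    """Join two strings when one x-interval lies inside the other.

    The string with the higher y0 is placed first. Returns None when
    the intervals do not overlap in this way.
    """
    if contained(a, b) or contained(b, a):
        first, second = (a, b) if a.y0 >= b.y0 else (b, a)
        return first.text + second.text
    return None
```

The caller can then replace the two original strings with the spliced result, or leave them untouched when None is returned.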

Further, a table forming module 507 is included. The table forming module 507 sorts the abscissa positions of all the vertical lines of the LTLine sub-objects in ascending order from left to right, and sorts the ordinate positions of all the horizontal lines of the LTLine sub-objects in ascending order from top to bottom, to form a table, as shown in Figure 8.
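The grid positions of the table can be derived from the line coordinates roughly as follows. Line and table_grid are hypothetical names. Treating a line as vertical when x0 equals x1 and as horizontal when y0 equals y1, and removing duplicate positions, are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical stand-in for an LTLine sub-object."""
    x0: float
    y0: float
    x1: float
    y1: float


def table_grid(lines):
    """Return the sorted x positions of vertical lines and the sorted
    y positions of horizontal lines."""
    xs = sorted({ln.x0 for ln in lines if ln.x0 == ln.x1})
    ys = sorted({ln.y0 for ln in lines if ln.y0 == ln.y1})
    return xs, ys
```

Adjacent pairs of values in xs and ys then delimit the columns and rows of the table.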

Further, a table adjustment module 508 is included. The table adjustment module 508 compares the LTLine values corresponding to the combined character strings. If the values of the LTLine vertical lines of two LT sub-objects are exactly the same, the LT sub-objects are in the same cell in the original PDF file. For example, the LTLine value corresponding to "Total: Computer, Communication and Other Electronics" is the same as the LTLine value corresponding to "Equipment Manufacturing". Therefore, according to the number of LTChars satisfying $\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert$, the vertical line of the LTLine corresponding to the LTChars satisfying the condition is moved to the right by the corresponding distance. For example, "Equipment Manufacturing" consists of 5 LTChars, each of which satisfies $\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert$ with respect to the other characters bounded by the same vertical line, so the vertical line is moved to the right by the width of 5 LTChars, framing all the words of "Total: Computer, Communication and Other Electronics" and "Equipment Manufacturing" to form "Total: Computer, Communication and Other Electronic Equipment Manufacturing", as shown in Figure 9. Correspondingly, the vertical lines to the right of that vertical line are also moved to the right by a distance of 5 LTChars.
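The shifting of vertical lines can be sketched as below. The function name shift_lines_right and its parameters are hypothetical, and a single uniform char_width is a simplifying assumption: the text only states that the line moves by the distance corresponding to the number of LTChars.

```python
def shift_lines_right(xs, index, n_chars, char_width):
    """Move the vertical line at `index`, and every line to its right,
    by the width of `n_chars` characters.

    `xs` is the list of vertical-line abscissas in ascending order.
    """
    shift = n_chars * char_width
    return [x + shift if i >= index else x for i, x in enumerate(xs)]
```

With lines at 10, 60 and 90, moving the second line by 5 characters of width 4 shifts it and the line to its right by 20, while the first line stays in place.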

In an optional embodiment, the LT sub-object storage module 502 is further used, in step S20, to iteratively extract the LT sub-objects within an LTFigure to form a second list containing all LT sub-objects in the LTFigure, and to store it in the first list.
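The iterative extraction from nested area boxes can be sketched with a small recursive helper. Figure is a hypothetical stand-in for LTFigure, modelled here as a plain container of child objects, and flatten is an illustrative name.

```python
class Figure(list):
    """Hypothetical stand-in for LTFigure: a container of LT sub-objects
    that may itself contain further Figure containers."""


def flatten(objs):
    """Return all non-Figure objects, descending into nested Figures
    and keeping the original reading order."""
    out = []
    for obj in objs:
        if isinstance(obj, Figure):
            out.extend(flatten(obj))  # the "second list" for this figure
        else:
            out.append(obj)
    return out
```

The flattened result can then be stored in the page's first list alongside the objects that were not inside any figure.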

In an optional embodiment, the table adjustment module 508 is also used to determine, for each row, whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string. If it is greater, the leftmost vertical line is located to the right of the leftmost character string, that is, not all the character strings are framed in cells, so a vertical line is added at the left boundary coordinate position of the leftmost character string so that the leftmost character string is also framed in a cell. Similarly, it is judged whether the abscissa of the rightmost vertical line is smaller than the abscissa of the rightmost character string, and if it is smaller, a vertical line is added at the right boundary coordinate position of the rightmost character string. As shown in Figure 10, the leftmost character string "total" in the second row has no vertical line on its left side, so a vertical line is added at the left boundary coordinate position of the character string "total".
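A possible sketch of the border check follows. Box and add_border_lines are hypothetical names, the coordinates in the usage note are invented, and comparing against the left boundary x0 on the left side and the right boundary x1 on the right side is this sketch's reading of "the abscissa of the character string".

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for a combined character string."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def add_border_lines(xs, strings):
    """Add a vertical line at the row's outer text boundaries when the
    existing vertical lines do not enclose all strings.

    `xs` holds the abscissas of the vertical lines; `strings` must be
    a non-empty list of the row's combined character strings.
    """
    xs = sorted(xs)
    left = min(s.x0 for s in strings)
    right = max(s.x1 for s in strings)
    if not xs or xs[0] > left:
        xs.insert(0, left)   # leftmost string was outside the table
    if xs[-1] < right:
        xs.append(right)     # rightmost string was outside the table
    return xs
```

With vertical lines at 30 and 60 and strings spanning 10 to 25 and 35 to 70, the function adds lines at 10 and 70.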

This application also provides a PDF file data extraction device, including:

The PDF file parsing module 501 uses the pdfminer tool to parse the PDF file, and generates a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains the LT sub-object;

The LT sub-object storage module 502 is used to obtain the ordinate and abscissa of each LT sub-object and to store the LT sub-objects of each page in the corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of the ordinate and arranged vertically, in ascending order of the ordinate, in the first list corresponding to each page;

The branch reading module 503 is configured to perform a row-by-row reading operation on the first list and, for each type of LT sub-object, to judge during the row-by-row reading, from the vertical distance between LT sub-objects, the row to which each LT sub-object belongs, so that the LT sub-objects are divided into rows;

The LT sub-object sorting module 504 is used to sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0 and, by judging whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, to combine multiple LT sub-objects into a combined character string.

In addition, the embodiments of the present application also propose a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disk read-only memory (CD-ROM), a USB memory, etc. The computer-readable storage medium includes a PDF file data extraction program and the like; when the PDF file data extraction program 50 is executed by the processor 22, the following operations are implemented:

Step S10: use the pdfminer tool (a tool for extracting information from a PDF document) to parse the PDF file and generate a pdfminer.layout object corresponding to each page of the PDF, as shown in Figure 2. The pdfminer.layout object contains multiple LT sub-objects, and the LT sub-objects include LTTextBoxHorizontal (horizontal text box) sub-objects and LTChar (character) sub-objects. An LTChar is a character with boundaries. The following description mainly takes LTChar as an example, so the characters mentioned are LT sub-objects.

Of course, the LT sub-objects can also include LTFigure (area box) sub-objects and LTLine (separation line) sub-objects. An LTFigure represents an area occupied by an area box, which is used to introduce a picture or another PDF document.

Step S20: obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained; if the LT sub-object is an LTLine, the abscissa and ordinate of each dividing line are obtained; if the LT sub-object is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list, so that each page has its own first list: the first page corresponds to one first list, and the second page corresponds to another. To save the LT sub-objects of each page of the pdfminer.layout object into a first list, the corresponding command is {'pageN': [LTobjs of layout]}, where N represents the Nth layout and [LTobjs of layout] is an array.

Step S30: read the content of the PDF file row by row through the [list of line(list)] command. During the row-by-row reading, it is also judged from the vertical distance between LT sub-objects whether they are in the same row, so that the LT sub-objects are divided into rows. The formula for judging whether LT sub-objects are in the same row by vertical distance is as follows:

$$\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert \qquad (1)$$

i represents the i-th LT sub-object.

Step S40: sort the LT sub-objects in the same row in ascending order of x0. For each row, determine whether x1 of the LT sub-object on the left is equal to x0 of the adjacent LT sub-object on the right; if so, the two LT sub-objects are combined to form a combined character string. Through step S40, the LT sub-objects in the same row can be combined according to their order in the original PDF file, restoring the text order of the PDF file.

The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the PDF file data extraction method and the electronic device 2 described above, and will not be repeated here.

The foregoing descriptions are only preferred embodiments of the application, and are not intended to limit the application. For those skilled in the art, the application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall be included in the protection scope of this application.

Claims

A method for extracting data from a PDF file, applied to an electronic device, is characterized in that it comprises the following steps:

S10. Use the pdfminer tool to parse the PDF file, and generate a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains an LT sub-object;

S20. Obtain the ordinate and abscissa of each LT sub-object, and store the LT sub-objects of each page in a corresponding first list, where the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of the ordinate and arranged vertically, in ascending order of the ordinate, in the first list corresponding to each page;

S30. Perform a row-by-row reading operation on the first list and, for each type of LT sub-object, judge during the row-by-row reading, according to the vertical distance between LT sub-objects, the row to which each LT sub-object belongs, so as to divide the LT sub-objects into rows;

S40. For each type of LT sub-object, in each row, sort the LT sub-objects in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, combine multiple LT sub-objects to form a combined character string.
The PDF file data extraction method of claim 1, wherein the method further comprises:

S50, taking the left boundary coordinate of the leftmost character string of the combined character string as the left boundary coordinate of the combined character string;

In each row, compare the left boundary coordinates of each combined character string, and then sort the combined character strings from left to right in the order of the left boundary coordinates of the combined character string from small to large.
The method for extracting data from a PDF file according to claim 2, wherein the formula for judging whether the LT sub-objects are in the same row by vertical distance is as follows:

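$$\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert$$

Among them, $\lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert$ is the height of the LT sub-object;

$\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert$ is the difference between y0 of the i-th LT sub-object and the i+1-th LT sub-object;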

y0 is the ordinate of the lower left corner of the LT sub-object; y1 is the ordinate of the upper right corner of the LT sub-object;

i represents the i-th LT sub-object.
The PDF file data extraction method according to claim 1, wherein the step of storing the LT sub-objects of each page in the corresponding first list in step S20 comprises:

Use the command {'pageN': [LTobjs of layout]} to store the LT sub-objects of the pdfminer.layout object of each page into a first list, where N represents the Nth layout, and [LTobjs of layout] is an array.
The PDF file data extraction method according to claim 3, wherein the LT sub-object includes at least one of the LTTextBoxHorizontal sub-object, the LTChar sub-object, and the LTFigure sub-object, wherein the LTFigure sub-object represents an area occupied by an area box, and the area box is used to import another PDF document.
The PDF file data extraction method according to claim 5, characterized in that,

In step S20, LTTextBoxHorizontal sub-objects are stored directly in the first list. For LTFigure sub-objects, the LT sub-objects inside are iteratively extracted to form a second list containing all the LT sub-objects inside the LTFigure, which is stored in the first list.
The method for extracting data from a PDF file according to claim 3, wherein:

In step S50, the left boundary coordinates of each combined character string are compared, and the combined character strings are sorted from left to right according to the left boundary coordinates of the combined character string from small to large,

If the left boundary coordinates are the same, the y0 values of the two combined character strings are further compared, and the character string with the higher y0 value is arranged in front of the combined character string with the lower y0 value.
The PDF file data extraction method according to claim 5, characterized in that,

In step S50, the left boundary coordinates of each combined character string are compared, and the combined character strings are sorted from left to right according to the left boundary coordinates of the combined character string from small to large,

If the positions overlap, the y0 values of the two combined character strings are further compared, and the character string with the higher y0 value is arranged in front of the combined character string with the lower y0 value.
The method for extracting data from a PDF file according to claim 8, wherein the LT sub-object further comprises an LTLine sub-object, and the method further comprises:

Step S60: according to the coordinates of the LTLine sub-objects, sort the abscissa positions of all the vertical lines of the LTLine sub-objects in ascending order from left to right, and sort the ordinate positions of all the horizontal lines of the LTLine sub-objects in ascending order from top to bottom, to form a table.
The PDF file data extraction method according to claim 9, characterized in that,

The combined character strings are sorted from left to right in ascending order of their left boundary coordinates; if positions overlap, the combined character string with the higher y0 value is arranged in front of the one with the lower y0 value.

The LTLine values corresponding to the combined character strings in the same row are also compared. If the values of the LTLine vertical lines corresponding to the combined character strings are the same, the combined character strings are in the same cell in the PDF file. The number of LT sub-objects that satisfy $\lvert \mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0} \rvert < \lvert \mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0} \rvert$ is taken as the number of shifts; the right vertical line of the LTLine corresponding to the combined character strings with the same vertical-line value is moved to the right by the distance corresponding to the number of shifts, and all LT sub-objects to the right of the right-hand combined character string among those with the same vertical-line value are moved to the right by the distance corresponding to the number of shifts.
The PDF file data extraction method according to claim 9, characterized in that,

After forming the table, for each row, determine whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string, and if it is greater, add a vertical line at the left boundary coordinate position of the leftmost character string;

Determine whether the abscissa of the rightmost vertical line is less than the abscissa of the rightmost character string. If it is smaller, add a vertical line at the right boundary coordinate position of the rightmost character string.
The method for extracting data from a PDF file according to claim 3, wherein:

The left and right boundary coordinates of the combined character strings in the same line are compared. If (x0 value of the right combined character string - x1 value of the left combined character string) < the preset splicing threshold, the two combined character strings are spliced together.
The PDF file data extraction method according to claim 5, characterized in that,

Each row extracted from the PDF document forms a row of LTChar sub-objects, and the similarity between that row of LTChar sub-objects and the corresponding row of the PDF document is calculated. If the similarity is lower than the similarity threshold, the row is divided into text blocks according to the positions of the LTLine vertical lines of the row, and the similarity between each text block and the corresponding part of the PDF document is calculated. If that similarity is lower than the similarity threshold, it is determined that the text block contains garbled characters caused by the recognition process; the text block is then further divided according to the width of a single character, and the similarity between each character and the corresponding part of the original text is calculated. If the similarity is lower than the similarity threshold, the character is considered garbled.
The method for extracting data from a PDF file according to claim 13, wherein:

For the identified garbled codes, a mapping relationship between the current Unicode encoding of the garbled codes and the standard Unicode code is established to remove the garbled codes.
A PDF file data extraction device, characterized in that it comprises:

The PDF file parsing module uses the pdfminer tool to parse the PDF file, and generates a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains the LT sub-object;

The LT sub-object storage module is used to obtain the ordinate and abscissa of each LT sub-object and to store the LT sub-objects of each page in the corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of the ordinate and arranged vertically, in ascending order of the ordinate, in the first list corresponding to each page;

The branch reading module is used to perform a row-by-row reading operation on the first list and, for each type of LT sub-object, to judge during the row-by-row reading, from the vertical distance between LT sub-objects, the row to which each LT sub-object belongs, thereby dividing the LT sub-objects into rows;

The LT sub-object sorting module is used, for each type of LT sub-object, to sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0 and, by judging whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, to combine multiple LT sub-objects into a combined character string.
The PDF file data extraction device according to claim 15, further comprising a first position correction module for comparing the left boundary coordinates of the combined character strings in the same line and, if the left boundary coordinates are the same, further comparing the y0 values of the two combined character strings and arranging the combined character string with the higher y0 value in front of the one with the lower y0 value.
The PDF file data extraction device according to claim 15, characterized in that it further comprises a second position correction module, which is used, if positions overlap after sorting in ascending order of x0, to further compare the y0 values of the two combined character strings and to arrange the combined character string with the higher y0 value in front of the one with the lower y0 value.
An electronic device, characterized in that it includes a memory and a processor, the memory stores a PDF file data extraction program, and the PDF file data extraction program is executed by the processor to implement the following steps:

S10. Use the pdfminer tool to parse the PDF file, and generate a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains an LT sub-object;

S20. Obtain the ordinate and abscissa of each LT sub-object, and store the LT sub-objects of each page in a corresponding first list, where the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of the ordinate and arranged vertically, in ascending order of the ordinate, in the first list corresponding to each page;

S30. Perform a row-by-row reading operation on the first list and, for each type of LT sub-object, judge during the row-by-row reading, according to the vertical distance between LT sub-objects, the row to which each LT sub-object belongs, so as to divide the LT sub-objects into rows;

S40. For each type of LT sub-object, in each row, sort the LT sub-objects in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, combine multiple LT sub-objects to form a combined character string.
The electronic device according to claim 18, wherein said PDF file data extraction program further comprises step S50 when being executed by said processor:

Take the left boundary coordinate of the leftmost character string of the combined character string as the left boundary coordinate of the combined character string;

Compare the left boundary coordinates of each combined character string, and then sort the combined character strings from left to right according to the left boundary coordinates of the combined character string from small to large.
A computer nonvolatile readable storage medium, wherein the computer nonvolatile readable storage medium stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, The PDF file data extraction method according to any one of claims 1-14 is realized.