WO2020252931A1 - Pdf file data extraction method and apparatus, device, and storage medium - Google Patents

PDF file data extraction method and apparatus, device, and storage medium

Info

Publication number
WO2020252931A1
WO2020252931A1 PCT/CN2019/103580 CN2019103580W WO2020252931A1 WO 2020252931 A1 WO2020252931 A1 WO 2020252931A1 CN 2019103580 W CN2019103580 W CN 2019103580W WO 2020252931 A1 WO2020252931 A1 WO 2020252931A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
character string
objects
pdf file
combined
Prior art date
Application number
PCT/CN2019/103580
Other languages
French (fr)
Chinese (zh)
Inventor
杨志鸿
常河
徐亮
阮晓雯
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2020252931A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/116 Details of conversion of file system types or formats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines

Definitions

  • This application relates to the field of artificial intelligence, and specifically to a method and device, equipment and storage medium for extracting data from a PDF file.
  • this application provides a PDF file data extraction method, which is applied to electronic equipment, including:
  • This application also provides a PDF file data extraction device, including:
  • the PDF file parsing module uses the pdfminer tool to parse the PDF file and generates a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains LT sub-objects;
  • the LT sub-object storage module is used to obtain the ordinate and abscissa of each LT sub-object and to store the LT sub-objects of each page in the corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in that order, in the first list corresponding to the page;
  • the branch reading module is used to read the first list row by row and, for each type of LT sub-object, to determine during the row-by-row reading which row each LT sub-object belongs to from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
  • the LT sub-object sorting module is used to sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0 and, by judging whether the right boundary coordinate x1 of the left LT sub-object is equal to the left boundary coordinate x0 of the adjacent right LT sub-object, to combine multiple LT sub-objects into a combined character string.
  • the present application also provides an electronic device including a memory and a processor, the memory stores a PDF file data extraction program, and the PDF file data extraction program is executed by the processor to implement the following steps:
  • the present application also provides a computer non-volatile readable storage medium that stores a computer program, the computer program includes program instructions, and when the program instructions are executed by a processor, the PDF file data extraction method described above is implemented.
  • This application converts the data of the PDF file into Excel format, which greatly reduces the difficulty of extracting information from the monthly statistical report with data analysis software such as Spyder and PyCharm.
  • FIG. 1 is a flowchart showing a method for extracting data from a PDF file according to an embodiment of the present application
  • FIG. 2 is a schematic diagram showing the framework of the pdfminer.layout object in an embodiment of the present application
  • FIG. 3 is a schematic diagram showing LTChar in the PDF file of the first embodiment of the present application.
  • FIG. 4 is a schematic diagram showing a data extraction result obtained by branch reading in the first embodiment of the present application.
  • FIG. 5 is a schematic diagram showing the data extraction result after sorting LTChar in the first embodiment of the present application.
  • FIG. 6 is a schematic diagram showing LTChar in the PDF file of the second embodiment of the present application.
  • FIG. 7 is a schematic diagram showing the data extraction result after comparing and combining string coordinates in the second embodiment of the present application.
  • FIG. 8 is a schematic diagram showing the data extraction result after adding LTline in the third embodiment of the present application.
  • FIG. 9 is a schematic diagram showing the data extraction result after adjusting LTline in the fourth embodiment of the present application.
  • FIG. 10 is a schematic diagram showing the LTLine in the PDF file of the fifth embodiment of the present application.
  • FIG. 11 is a schematic diagram showing the hardware architecture of an electronic device according to an embodiment of the present application.
  • Fig. 12 is a schematic diagram showing program modules of a PDF file data extraction program according to an embodiment of the present application.
  • the PDF file data extraction method of this embodiment is applied to the extraction of text and tables in the PDF file.
  • the text can be formed in a table, or it can be divided into paragraphs without a table. Take the monthly statistical report in PDF format as an example.
  • Fig. 1 shows a flowchart of a method for extracting data from a PDF file in this embodiment. The method includes the following steps:
  • Step S10: use the pdfminer tool (a tool for extracting information from a PDF document) to parse the PDF file and generate a pdfminer.layout object for each page of the PDF, as shown in Figure 2. The pdfminer.layout object may contain multiple LT sub-objects, and an LT sub-object may be at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object.
  • LTChar is a character with boundaries. The following is mainly based on LTChar as an example, so the characters mentioned are LT sub-objects.
  • LTFigure represents the area occupied by an area box, which is used to embed a picture or another PDF document.
  • Step S20: obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained. If the LT sub-object is an LTLine, the abscissa and ordinate of each dividing line are obtained. If the LT sub-object is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are then stored in a corresponding first list; for example, the first page has its own first list and the second page has its own first list.
  • the corresponding command is {'pageN': [LTobjs of layout]}, where N represents the N-th layout and [LTobjs of layout] is an array.
  • the ordinate includes the ordinate y0 of the lower left corner of the LT sub-object and the ordinate y1 of the upper right corner of the LT sub-object;
  • the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object;
  • the LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged vertically in the first list in ascending order of the ordinate of the lower left corner.
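  • The per-page first list of step S20 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `Box` is a hypothetical stand-in for an LT sub-object (it only carries the `x0`, `y0`, `x1`, `y1` attributes the text relies on), and `build_first_lists` and `load_layouts` are illustrative names. `load_layouts` assumes the pdfminer.six distribution is installed and is not exercised in the example.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object (LTChar, LTLine, ...)."""
    text: str
    x0: float  # left boundary coordinate
    y0: float  # ordinate of the lower left corner
    x1: float  # right boundary coordinate
    y1: float  # ordinate of the upper right corner


def build_first_lists(layouts):
    """Return {'pageN': [objects in ascending order of y0]} per layout."""
    return {
        f"page{n}": sorted(layout, key=lambda obj: obj.y0)
        for n, layout in enumerate(layouts, start=1)
    }


def load_layouts(path):
    """Yield one layout per page (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_pages
    return extract_pages(path)


# Two objects on one page, given out of order on purpose.
page = [Box("b", 10, 50, 20, 60), Box("a", 0, 20, 10, 30)]
first_lists = build_first_lists([page])
print([obj.text for obj in first_lists["page1"]])
```

  • With a real file, `build_first_lists(load_layouts(path))` would apply the same ordering to the top-level objects of each page layout; which object types appear at the top level depends on the layout analysis settings.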
  • in step S30, the content of the first list is read row by row through the [list of line(list)] command; during the row-by-row reading, the vertical distance is used to determine whether LT sub-objects are in the same row, so that the LT sub-objects are divided into rows.
  • the formula for judging whether LT sub-objects are in the same row by vertical distance is as follows:
  • $\left| y0_{i+1} - y0_{i} \right| < h$
  • where the LT sub-object is a character and h, the height of the LT sub-object, is the height of the character;
  • i represents the i-th LT sub-object.
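  • A minimal sketch of the row split in step S30, assuming the test is the one given in the module description later in the text: two consecutive objects belong to the same row when their y0 values differ by less than one object height. `Char` is a hypothetical stand-in for an LTChar, and taking the height from the previous object is an assumption of this sketch.

```python
from collections import namedtuple

# Hypothetical stand-in for an LTChar: text plus bounding-box coordinates.
Char = namedtuple("Char", "text x0 y0 x1 y1")


def split_into_rows(first_list):
    """Group objects (already sorted by y0) into rows.

    An object joins the current row while the gap between its y0 and the
    previous object's y0 is smaller than the previous object's height.
    """
    rows = []
    for obj in first_list:
        if rows:
            prev = rows[-1][-1]
            height = prev.y1 - prev.y0
            if abs(obj.y0 - prev.y0) < height:
                rows[-1].append(obj)
                continue
        rows.append([obj])
    return rows


chars = [
    Char("A", 0, 10, 5, 20),
    Char("B", 5, 10.5, 10, 20.5),
    Char("C", 0, 30, 5, 40),
]
rows = split_into_rows(sorted(chars, key=lambda c: c.y0))
print([[c.text for c in row] for row in rows])
```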
  • Step S40: the LT sub-objects of the same row are sorted from left to right in ascending order of the left boundary coordinate x0. For each row, it is determined whether the right boundary coordinate x1 of the left LT sub-object is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right; if so, the LT sub-objects are combined to form a combined character string.
  • the LT sub-objects in the same row can be combined according to the order in the original PDF file to restore the text order in the PDF file.
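  • The sorting and combining of step S40 can be sketched as below. `Char` is again a hypothetical stand-in for an LTChar. The text requires x1 to be equal to the neighbouring x0; the small tolerance `tol` is an assumption added here because coordinates are floating-point values.

```python
import math
from collections import namedtuple

Char = namedtuple("Char", "text x0 y0 x1 y1")  # hypothetical LTChar stand-in


def combine_row(row, tol=1e-6):
    """Sort a row by x0 and merge objects whose boxes touch (x1 == next x0)."""
    combined = []
    for obj in sorted(row, key=lambda o: o.x0):
        last = combined[-1] if combined else None
        if last is not None and math.isclose(last.x1, obj.x0, abs_tol=tol):
            # The merged string keeps the leftmost x0/y0 and the rightmost x1/y1.
            combined[-1] = Char(last.text + obj.text,
                                last.x0, last.y0, obj.x1, obj.y1)
        else:
            combined.append(obj)
    return combined


row = [
    Char("i", 5, 0, 10, 10),
    Char("h", 0, 0, 5, 10),
    Char("4", 30, 0, 35, 10),
    Char("2", 35, 0, 40, 10),
]
strings = combine_row(row)
print([s.text for s in strings])
```

  • Characters separated by a gap stay in separate combined character strings, so the spacing between table columns is preserved.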
  • since the main content of the monthly report usually also includes a table, LTLine objects also need to be extracted. The boundary lines of the table are determined from the coordinates of the LTLine objects.
  • the method further includes step S50: using the left boundary coordinate of the leftmost character string of the combined character string as the left boundary coordinate of the combined character string, and the right boundary coordinate of the rightmost character string of the combined character string as the right boundary coordinate of the combined character string;
  • step S60: sorting the abscissa positions of all the vertical lines of the LTLine objects from left to right in ascending order, and sorting the ordinate positions of all the horizontal lines of the LTLine objects from top to bottom in ascending order, to form a table.
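  • A minimal sketch of how sorted line positions can act as a table grid in step S60. The function names are hypothetical, dropping duplicate positions is an assumption of this sketch, and rows are indexed here by ascending ordinate; how that index maps to the visual top or bottom of the page is left open.

```python
from bisect import bisect_right


def build_grid(vertical_xs, horizontal_ys):
    """Sort line positions in ascending order, dropping duplicates."""
    return sorted(set(vertical_xs)), sorted(set(horizontal_ys))


def cell_of(x, y, xs, ys):
    """Return (row, col) of the cell containing point (x, y), or None."""
    col = bisect_right(xs, x) - 1
    row = bisect_right(ys, y) - 1
    if 0 <= col < len(xs) - 1 and 0 <= row < len(ys) - 1:
        return row, col
    return None  # the point lies outside the outer borders


xs, ys = build_grid([200, 0, 100], [50, 0, 25])
print(cell_of(120, 30, xs, ys))
```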
  • the second line can form the combined character strings of "in total: computers, communications and other electronics", "100 million yuan", “490.31", “3202.49”, “10.9", and "equipment manufacturing”.
  • the left boundary coordinate of the leftmost character string of the combined character string is used as the left boundary coordinate of the combined character string
  • the right boundary coordinate of the rightmost character string is used as the right boundary coordinate of the combined character string
  • the ordinate of the lower left corner of the leftmost character string of the combined character string is used as the ordinate of the lower left corner of the combined character string
  • the ordinate of the upper right corner of the rightmost character string is used as the ordinate of the upper right corner of the combined character string.
  • the two combined strings can be spliced together.
  • the second line of this embodiment does not have a combined character string that meets this condition.
  • the combined character string "in total: computer, communications and other electronics" should be on the left side of the combined character string "equipment manufacturing industry"; however, because the left boundary coordinate of "equipment manufacturing industry" is less than the left boundary coordinates of "100 million yuan", "490.31", "3202.49", and "10.9", the combined character string "equipment manufacturing" is placed between "in total: computers, communications and other electronics" and "100 million yuan".
  • in step S20, for an LTFigure, the LT sub-objects inside it are extracted iteratively to form a second list containing all the LT sub-objects in the LTFigure, and the second list is stored in the first list.
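  • The iterative extraction of figure contents can be sketched with a recursive flatten. `Leaf` and `Figure` are hypothetical stand-ins used so the example runs without pdfminer; with the real library, a comparable check would be an `isinstance` test against the figure class before iterating over its children.

```python
class Leaf:
    """Hypothetical stand-in for a leaf object such as LTChar."""
    def __init__(self, text, y0):
        self.text, self.y0 = text, y0


class Figure(list):
    """Hypothetical stand-in for LTFigure: a container of sub-objects."""


def flatten(objs):
    """Recursively replace every Figure by the sub-objects it contains."""
    flat = []
    for obj in objs:
        if isinstance(obj, Figure):
            flat.extend(flatten(obj))  # the "second list" for this figure
        else:
            flat.append(obj)
    return flat


page = [Leaf("a", 5), Figure([Leaf("b", 1), Figure([Leaf("c", 3)])])]
first_list = sorted(flatten(page), key=lambda o: o.y0)
print([o.text for o in first_list])
```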
  • the second embodiment is basically the same as the first embodiment, and part of the content that is the same as the first embodiment is omitted here, and only features different from the first embodiment are described.
  • the third embodiment is basically the same as the first embodiment, and part of the same content as the first embodiment is omitted here, and only the features that are different from the first embodiment are described.
  • if positions overlap after sorting in ascending order of x0, that is, the interval between the left and right boundary coordinates of one combined character string falls within the interval between the left and right boundary coordinates of another combined character string, the position of the first combined character string may have been changed by a line break in the other, as shown in FIG. 3.
  • for example, the interval between the left and right boundary coordinates of "equipment manufacturing" falls within the interval between the left and right boundary coordinates of "Total: Computers, Communications, and Other Electronics".
  • in this case the y0 values of the two combined character strings are compared, and the combined character string with the higher y0 value is arranged in front of the one with the lower y0 value. This results in "in total: computer, communications and other electronic equipment manufacturing industries", as shown in Figure 7.
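  • The overlap rule of this embodiment can be sketched as follows. `Str` is a hypothetical stand-in for a combined character string, and the way the merged bounding box is computed is an assumption of this sketch; the text only fixes the containment test and the ordering by y0.

```python
from collections import namedtuple

Str = namedtuple("Str", "text x0 y0 x1 y1")  # hypothetical combined string


def splice_wrapped(strings):
    """Join strings whose [x0, x1] interval lies inside another's interval.

    Within a joined pair, the string with the higher y0 comes first.
    """
    result = []
    for s in sorted(strings, key=lambda s: (s.x0, -s.y0)):
        for i, r in enumerate(result):
            # The narrower string is the candidate for being contained.
            inner, outer = (s, r) if (s.x1 - s.x0) <= (r.x1 - r.x0) else (r, s)
            if outer.x0 <= inner.x0 and inner.x1 <= outer.x1:
                first, second = (r, s) if r.y0 >= s.y0 else (s, r)
                result[i] = Str(first.text + second.text, outer.x0,
                                min(r.y0, s.y0), outer.x1, max(r.y1, s.y1))
                break
        else:
            result.append(s)
    return result


parts = [
    Str("equipment", 10, 8, 60, 18),
    Str("490.31", 150, 14, 190, 24),
    Str("Total: electronic ", 0, 20, 100, 30),
]
spliced = splice_wrapped(parts)
print([s.text for s in spliced])
```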
  • the fourth embodiment is basically the same as the first embodiment, and part of the content that is the same as the first embodiment is omitted here, and only features different from the first embodiment are described.
  • the value of the LTLine corresponding to each combined character string is also compared. If the values of the vertical lines of the LTLine corresponding to two combined character strings are exactly the same, the combined character strings are in the same cell in the original PDF file. For example, the value of the LTLine corresponding to "Total: Computer, Communication and Other Electronics" is exactly the same as that corresponding to "Equipment Manufacturing", but at this point their corresponding vertical lines separate the two combined character strings.
  • equipment manufacturing is 5 LTChars, where the vertical distance between each character and other characters in the line satisfies
  • the multiple vertical lines and the LTChar objects to the right of the vertical line are also moved to the right by a distance of 5 LTChar. That is, in order to splice "equipment manufacturing industry" after "total: computer, communications and other electronics", all objects after it are moved to the right and the positions of the vertical lines are adjusted to obtain a reasonable parsing result.
  • the fifth embodiment is basically the same as the first embodiment, and part of the same content as the first embodiment is omitted here, and only the features that are different from the first embodiment are described.
  • for each row, it is determined whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string. If it is greater, the leftmost vertical line lies to the right of the leftmost character string, that is, not all character strings are enclosed in cells, so a vertical line is added at the left boundary coordinate position of the leftmost character string so that this character string is also enclosed in a cell.
  • determine whether the abscissa of the rightmost vertical line is smaller than the abscissa of the rightmost character string, and if it is smaller, add a vertical line at the right boundary coordinate position of the rightmost character string. As shown in Figure 10, there is no vertical line on the left side of the leftmost character string "total" in the second row, so a vertical line is added to the left side of the string "total".
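  • A minimal sketch of the border check in this embodiment. The function name and the `(x0, x1)` pair representation of a row's combined character strings are assumptions made for illustration.

```python
def add_missing_borders(vertical_xs, row_strings):
    """Return sorted vertical-line abscissas, adding outer borders if needed.

    row_strings holds (x0, x1) pairs for the combined strings of one row.
    """
    xs = sorted(vertical_xs)
    left = min(x0 for x0, _ in row_strings)
    right = max(x1 for _, x1 in row_strings)
    if not xs or xs[0] > left:
        xs.insert(0, left)   # leftmost string was not enclosed in a cell
    if xs[-1] < right:
        xs.append(right)     # rightmost string was not enclosed in a cell
    return xs


borders = add_missing_borders([50, 120, 200], [(10, 45), (60, 110), (130, 190)])
print(borders)
```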
  • LTLine, LTTextBoxHorizontal and LTFigure are handled in the same way as LTChar.
  • An LTTextBoxHorizontal as an LT sub-object is equivalent to a character.
  • the ordinate and abscissa can be arranged according to the above-mentioned ordinate and abscissa method.
  • the similarity between each line of the extracted document and the corresponding line of the original PDF document is calculated. If the similarity is lower than the similarity threshold, the line is cut into text blocks at the positions of the vertical lines of the LTLine objects, and the similarity between each text block and the corresponding part of the PDF document is calculated. If the similarity of a text block is lower than the similarity threshold, some of its characters are considered to have been garbled by the recognition process. In that case the text block is divided again according to character width, and the similarity between each character and the corresponding part of the original text is calculated; a character whose similarity is lower than the similarity threshold is considered garbled.
  • garbled characters arise in the recognition process because the embedded fonts in the PDF document use a custom encoding but either lack a mapping to the standard encoding or have a wrong mapping. A Word document contains only the standard encoding, so when a recognized character is placed in the Word document, its Unicode code cannot be found there and garbled characters are displayed.
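  • The threshold check can be sketched as below. The application does not say which similarity measure is used, so `difflib.SequenceMatcher` from the Python standard library is an assumed stand-in, and the threshold value 0.8 and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; SequenceMatcher is an assumed stand-in metric."""
    return SequenceMatcher(None, a, b).ratio()


def find_garbled(extracted_blocks, reference_blocks, threshold=0.8):
    """Return indexes of blocks whose similarity is below the threshold."""
    return [
        i
        for i, (got, want) in enumerate(zip(extracted_blocks, reference_blocks))
        if similarity(got, want) < threshold
    ]


extracted = ["100 million yuan", "49#.@1", "3202.49"]
reference = ["100 million yuan", "490.31", "3202.49"]
suspect = find_garbled(extracted, reference)
print(suspect)
```

  • The same function can be applied at each level described in the text: first per line, then per text block, then per character.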
  • the mapping relationship between the current Unicode encoding of the garbled characters in the embedded font and the standard Unicode encoding can be established to remove garbled characters.
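  • Once such a mapping exists, applying it is a plain character translation. In the sketch below the mapping table is entirely hypothetical (two private-use code points standing in for custom-encoded glyphs); building the real table from the embedded font is outside the scope of this example.

```python
# Hypothetical mapping: custom code points used by an embedded font,
# mapped to the standard Unicode characters they are meant to display.
CUSTOM_TO_STANDARD = {0xE001: "4", 0xE002: "9"}


def repair(text, mapping=CUSTOM_TO_STANDARD):
    """Replace custom-encoded characters using the mapping table."""
    return text.translate(mapping)


fixed = repair("\ue001\ue0020.31")
print(fixed)
```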
  • the electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to pre-set or stored instructions.
  • it can be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including an independent server or a server cluster composed of multiple servers).
  • the electronic device 2 at least includes, but is not limited to, a memory 21 and a processor 22 that can be communicatively connected to each other through a system bus.
  • the memory 21 includes at least one type of computer-readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), programmable read only memory (PROM), magnetic memory, magnetic disks, optical disks, etc.
  • the memory 21 may be an internal storage unit of the electronic device 2, for example, a hard disk or a memory of the electronic device 2.
  • the memory 21 may also be an external storage device of the electronic device 2, for example, a plug-in hard disk equipped on the electronic device 2, a smart media card (SMC), a secure digital (Secure Digital, SD) card, flash card (Flash Card), etc.
  • the memory 21 may also include both the internal storage unit of the electronic device 2 and its external storage device.
  • the memory 21 is generally used to store an operating system and various application software installed in the electronic device 2, such as the PDF file data extraction program code.
  • the memory 21 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 22 may be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments.
  • the processor 22 is generally used to control the overall operation of the electronic device 2, for example, perform data interaction or communication-related control and processing with the electronic device 2.
  • the processor 22 is configured to run the program code or process data stored in the memory 21, for example, run the PDF file data extraction program.
  • the electronic device 2 may also include a display, and the display may also be called a display screen or a display unit.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display, etc.
  • the display is used to display the information processed in the electronic device 2 and to display a visualized user interface.
  • FIG. 11 only shows the electronic device 2 with components 21-22, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.
  • the memory 21 containing a readable storage medium may include an operating system, a PDF file data extraction program 50, and the like.
  • the processor 22 executes the PDF file data extraction program 50 in the memory 21, the steps described in the above PDF file data extraction method are implemented.
  • the PDF file data extraction program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and executed by one or more processors (in this embodiment, the processor 22) to complete the application.
  • FIG. 12 shows a schematic diagram of the program modules of the PDF file data extraction program.
  • the PDF file data extraction program 50 can be divided into a PDF file parsing module 501, an LT sub-object storage module 502, a branch reading module 503, and an LT sub-object sorting module 504.
  • the program module referred to in this application refers to a series of computer program instruction segments that can complete specific functions, and is more suitable than a program to describe the execution process of the PDF file data extraction program in the electronic device 2. The following description will specifically introduce the specific functions of the program modules.
  • the PDF file parsing module 501 is used to parse the PDF file with the pdfminer tool (a tool that can extract information from PDF documents) and to generate a pdfminer.layout object for each page of the PDF. The pdfminer.layout object contains a plurality of LT sub-objects, and the LT sub-objects include at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object.
  • LT sub-objects may also include LTFigure (area box) sub-objects and LTLine (separation line) sub-objects, where LTFigure represents the area occupied by an area box, and the area box is used to embed, for example, a picture or another PDF document.
  • the LT sub-object storage module 502 is used to obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained. If the LT sub-object is an LTLine, the abscissa and ordinate of each dividing line are obtained. If the LT sub-object is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list; for example, the first page has its own first list and the second page has its own first list.
  • the corresponding command is {'pageN': [LTobjs of layout]}, where N represents the N-th layout and [LTobjs of layout] is an array.
  • the ordinate includes the ordinate y0 of the lower left corner of the LT sub-object and the ordinate y1 of the upper right corner of the LT sub-object;
  • the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object;
  • the LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged in the first list in ascending order of the ordinate of the lower left corner.
  • the branch reading module 503 is used to read the content of the first list row by row through the [list of line(list)] command and, during the row-by-row reading, to judge from the vertical distance between LT sub-objects whether they are in the same row, so that the LT sub-objects are divided into rows.
  • the formula for judging whether LT sub-objects are in the same row by vertical distance is as follows:
  • $\left| y0_{i+1} - y0_{i} \right| < h$
  • where the LT sub-object is a character and h, the height of the LT sub-object, is the height of the character;
  • i represents the i-th LT sub-object.
  • if the difference between the y0 of the i-th LT sub-object and that of the (i+1)-th LT sub-object is less than the height of one LT sub-object, the distance between them is less than the height that one LT sub-object requires, so the i-th and (i+1)-th LT sub-objects should be in the same row. If the distance between the i-th LT sub-object and the (i+1)-th LT sub-object is greater than the height of an LT sub-object, the two should be placed in different rows.
  • the LT sub-object sorting module 504 is used to sort the LT sub-objects of the same row in ascending order of x0. For each row, if the x1 of the LT sub-object on the left is equal to the x0 of the adjacent LT sub-object on the right, the two LT sub-objects are combined to form a combined character string. Through step S40, the LT sub-objects in the same row can be combined in the order of the original PDF file to restore the text order of the PDF file.
  • step S50 the characters in the same row form a combined character string.
  • for "Tong" and "Xin" in the second row, since the x1 of "Tong" is equal to the x0 of "Xin", "Tong" and "Xin" are combined.
  • for "zi" and "billion" in the second row, since the x1 of "zi" is not equal to the x0 of "billion", "zi" and "billion" are not combined; instead, the interval between the x1 of "zi" and the x0 of "billion" is maintained. By comparing the values of x1 and x0 in this way, the LT sub-objects of each row are formed into combined character strings.
  • the second line can form the combined character strings of "in total: computers, communications and other electronics", "100 million yuan", “490.31", “3202.49”, “10.9", and "equipment manufacturing”.
  • the left boundary coordinates of the leftmost character string are used as the left boundary coordinates of the combined character string
  • the right boundary coordinates of the rightmost character string are used as the right boundary coordinates of the combined character string.
  • the combined character string "in total: computer, communications and other electronics" should be on the left side of the combined character string "equipment manufacturing industry"; however, because the left boundary coordinate of "equipment manufacturing industry" is less than the left boundary coordinates of "100 million yuan", "490.31", "3202.49", and "10.9", the combined character string "equipment manufacturing" is placed between "in total: computers, communications and other electronics" and "100 million yuan".
  • a first position correction module 505 is further included.
  • the first position correction module 505 compares the left boundary coordinates of the combined character strings in the same row. If the left boundary coordinates are the same, for example if, as shown in Figure 6, the x0 values of "Total: Computer, Communication and Other Electronics" and "Equipment Manufacturing" are equal, the y0 values of the two combined character strings are further compared, and the one with the higher y0 value is arranged before the one with the lower y0 value. For example, in Figure 5, the y0 of "Total: Computer, Communication and Other Electronics" is greater than the y0 of "Equipment Manufacturing", so "Equipment Manufacturing" is spliced after "Total: Computer, Communication and Other Electronics", as shown in Figure 7.
  • a second position correction module 506 is further included. If positions overlap after sorting in ascending order of x0, that is, the interval between the left and right boundary coordinates of one combined character string falls within the interval between the left and right boundary coordinates of another combined character string, the position of the first combined character string may have been changed by a line break in the other. For example, as shown in Figure 3, the interval between the left and right boundary coordinates of "equipment manufacturing industry" falls within the interval between the left and right boundary coordinates of "Total: Computers, Communications and Other Electronics".
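  • The rule applied by the first position correction module can be sketched as a grouping by x0. `Str` is a hypothetical stand-in for a combined character string, and comparing x0 values for exact equality is an assumption taken directly from the wording of the text.

```python
from collections import namedtuple
from itertools import groupby

Str = namedtuple("Str", "text x0 y0")  # hypothetical combined string


def merge_same_left_edge(strings):
    """Join strings that share the same x0, placing the higher y0 first."""
    ordered = sorted(strings, key=lambda s: (s.x0, -s.y0))
    merged = []
    for x0, group in groupby(ordered, key=lambda s: s.x0):
        group = list(group)
        merged.append(Str("".join(s.text for s in group), x0, group[0].y0))
    return merged


row = [
    Str("equipment manufacturing", 10, 5),
    Str("100 million yuan", 80, 10),
    Str("Total: electronics ", 10, 15),
]
corrected = merge_same_left_edge(row)
print([s.text for s in corrected])
```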
  • the second position correction module 506 further compares the y0 values of the two combined character strings, and arranges the combined character string with the higher y0 value before the combined character string with the lower y0 value. This results in “in total: computer, communications and other electronic equipment manufacturing industries", as shown in Figure 7.
  • since the main content of the monthly report usually also includes a table, LTLine objects also need to be extracted. The boundary lines of the table are determined from the coordinates of the LTLine objects.
  • the table forming module 507 sorts the abscissa positions of all the vertical lines of the LTLine objects from left to right in ascending order and sorts the ordinate positions of all the horizontal lines of the LTLine objects from top to bottom in ascending order to form a table, as shown in Figure 8.
  • the table adjustment module 508 compares the values of the LTLine corresponding to the combined character strings. If the values of the vertical lines of the LTLine corresponding to two LT sub-objects are exactly the same, the LT sub-objects are in the same cell in the original PDF file. For example, the value of the LTLine corresponding to "Total: Computer, Communication and Other Electronics" is the same as that corresponding to "Equipment Manufacturing".
  • the vertical line of the LTLine corresponding to the LTChar objects that satisfy the condition is moved to the right by the corresponding distance.
  • Equipment Manufacturing is a 5 LTChar, wherein each character from other characters with the vertical line satisfies
  • the LT sub-object storage module 502 is further used to iteratively extract the LT sub-objects of each LTFigure in step S20 to form a second list containing all the LT sub-objects in the LTFigure, and to store the second list in the first list.
  • the table adjustment module 508 is also used to determine, for each row, whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string. If it is greater, the leftmost vertical line lies to the right of the leftmost character string, that is, not all character strings are enclosed in cells, so a vertical line is added at the left boundary coordinate position of the leftmost character string so that this character string is also enclosed in a cell. Similarly, it is determined whether the abscissa of the rightmost vertical line is smaller than the abscissa of the rightmost character string, and if it is smaller, a vertical line is added at the right boundary coordinate position of the rightmost character string. As shown in Figure 10, the leftmost character string "total" in the second row has no vertical line on its left side, so a vertical line is added at the left boundary coordinate position of the character string "total".
  • This application also provides a PDF file data extraction device, including:
  • the PDF file parsing module 501 uses the pdfminer tool to parse the PDF file, and generates a pdfminer.layout object for each page of the PDF, where the pdfminer.layout object contains the LT sub-object;
  • the LT sub-object storage module 502 is used to obtain the ordinate and abscissa of each LT sub-object and store the LT sub-objects of each page in the corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object on each page are extracted in ascending order of ordinate and arranged vertically, in that order, in the first list corresponding to each page;
  • the row reading module 503 is configured to read the first list row by row and, for each type of LT sub-object, to determine during the row-by-row reading the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, so that the LT sub-objects are divided into rows;
  • the LT sub-object sorting module 504 is used to sort the LT sub-objects in each row in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, to combine multiple LT sub-objects into a combined character string.
  • The embodiments of the present application also propose a computer-readable storage medium, which may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, etc.
  • the computer-readable storage medium includes a PDF file data extraction program, etc., when the PDF file data extraction program 50 is executed by the processor 22, the following operations are implemented:
  • Step S10: use the pdfminer tool (a tool for extracting information from a PDF document) to parse the PDF file and generate a pdfminer.layout object for each page of the PDF, as shown in Figure 2, where the pdfminer.layout object contains multiple LT sub-objects, and the LT sub-objects include LTTextBoxHorizontal (horizontal text box) sub-objects and LTChar (character) sub-objects.
  • LTChar is a character with boundaries. The following is mainly based on LTChar as an example, so the characters mentioned are LT sub-objects.
  • LTFigure represents an area occupied by an area box, which is used to embed a picture or another PDF document.
  • Step S20 Obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is LTChar, then the ordinate and abscissa of each character are obtained. If the LT sub-object is LTLine, get the abscissa and ordinate of each dividing line. If the LT sub-object is a horizontal text box, obtain the ordinate and abscissa of each horizontal text box. And the LT sub-object of each page is correspondingly stored in a first list, for example, the first page corresponds to a first list, and the second page corresponds to a first list.
  • the corresponding command is {'pageN': [LTobjs of layout]}, where N represents the Nth layout and [LTobjs of layout] is an array.
  • the ordinate includes the ordinate y0 of the lower left corner of the LT sub-object and the ordinate y1 of the upper right corner of the LT sub-object, and the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged in the first list in ascending order of the lower-left ordinate.
  • Step S30: read the content of the PDF file row by row through the [list of line(list)] command and, during the row-by-row reading, determine from the vertical distance between LT sub-objects whether they are in the same row, so that the LT sub-objects are divided into rows. The formula for determining by vertical distance whether LT sub-objects are in the same row is as follows:
  • the LT sub-object is a character, and the corresponding height of the LT sub-object is the height of the character;
  • i represents the i-th LT sub-object.
  • Step S40: sort the LT sub-objects in the same row in ascending order of x0. For each row, determine whether x1 of the LT sub-object on the left is equal to x0 of the adjacent LT sub-object on the right, and combine the LT sub-objects that satisfy this into a combined character string. Through step S40, the LT sub-objects in the same row can be combined in the order they have in the original PDF file, restoring the text order of the PDF file.

Abstract

A PDF file data extraction method and apparatus, a device, and a storage medium. The method comprises: parsing a PDF file, and generating LT sub-objects; acquiring the ordinate and the abscissa of each LT sub-object, correspondingly storing LT sub-objects of each page in a first list, extracting LT sub-objects in an ascending order of the ordinates, and longitudinally arranging, in the first list, the LT sub-objects in an ascending order of the ordinates, wherein the abscissa comprises a left boundary coordinate x0 and a right boundary coordinate x1; during reading performed in a row by row manner, determining whether the LT sub-objects are in the same row by means of a longitudinal distance, and sorting the LT sub-objects into respective rows; and sorting LT sub-objects of each row in an ascending order of x0, and if x1 of an LT sub-object on a left side is equal to x0 of an LT sub-object on a right side, combining the two LT sub-objects to form a combined character string. The method reduces difficulty in extracting information from a monthly statistical bulletin.

Description

A PDF file data extraction method, apparatus, device, and storage medium
This application claims priority to Chinese Patent Application No. 201910521031.4, filed on June 17, 2019, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and specifically to a PDF file data extraction method, apparatus, device, and storage medium.
Background
The monthly statistical reports of statistics bureaus currently store their data in PDF format. Extracting data from a PDF is inconvenient: the required data often has to be located and extracted by hand, which is time-consuming and labor-intensive. A PDF file can also be converted to Word format and the data then extracted from the Word file. However, the inventor found that existing PDF-to-Word conversion often produces garbled characters and misordered Chinese text. Monthly statistical reports also contain a large number of tables, and during conversion these tables are even more prone to shifting position or losing content. The inventor realized that a better solution is needed for extracting data from PDF files.
Summary of the Invention
To solve the above technical problems, this application provides a PDF file data extraction method, applied to an electronic device, including:
S10: using the pdfminer tool to parse the PDF file and generating a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
S20: obtaining the ordinate and abscissa of each LT sub-object and storing the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object on each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
S30: reading the first list row by row and, for each type of LT sub-object, determining during the row-by-row reading the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
S40: for each type of LT sub-object, in each row, sorting the LT sub-objects in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, combining multiple LT sub-objects into a combined character string.
This application also provides a PDF file data extraction apparatus, including:
a PDF file parsing module, which uses the pdfminer tool to parse the PDF file and generates a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
an LT sub-object storage module, used to obtain the ordinate and abscissa of each LT sub-object and store the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object on each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
a row reading module, used to read the first list row by row and, for each type of LT sub-object, to determine during the row-by-row reading the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
an LT sub-object sorting module, used, for each type of LT sub-object, in each row, to sort the LT sub-objects in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, to combine multiple LT sub-objects into a combined character string.
This application also provides an electronic device including a memory and a processor, wherein the memory stores a PDF file data extraction program which, when executed by the processor, implements the following steps:
S10: using the pdfminer tool to parse the PDF file and generating a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
S20: obtaining the ordinate and abscissa of each LT sub-object and storing the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object; the LT sub-objects of the pdfminer.layout object on each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
S30: reading the first list row by row and, for each type of LT sub-object, determining during the row-by-row reading the row to which each LT sub-object belongs from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
S40: for each type of LT sub-object, in each row, sorting the LT sub-objects in ascending order of the left boundary coordinate x0 and, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, combining multiple LT sub-objects into a combined character string.
This application also provides a non-volatile computer-readable storage medium storing a computer program, the computer program including program instructions which, when executed by a processor, implement the PDF file data extraction method described above.
This application converts the data of a PDF file into Excel format, which greatly reduces the difficulty of extracting information from monthly statistical reports with data analysis software such as spider and pycharm.
Description of the Drawings
The above features and technical advantages of this application will become clearer and easier to understand from the following description of its embodiments in conjunction with the accompanying drawings.
Figure 1 is a flowchart of the PDF file data extraction method according to an embodiment of this application;
Figure 2 is a schematic diagram of the structure of the pdfminer.layout object according to an embodiment of this application;
Figure 3 is a schematic diagram of the LTChar objects in the PDF file of the first embodiment of this application;
Figure 4 is a schematic diagram of the data extraction result obtained by row-by-row reading in the first embodiment of this application;
Figure 5 is a schematic diagram of the data extraction result after sorting the LTChar objects in the first embodiment of this application;
Figure 6 is a schematic diagram of the LTChar objects in the PDF file of the second embodiment of this application;
Figure 7 is a schematic diagram of the data extraction result after comparing the coordinates of the combined character strings in the second embodiment of this application;
Figure 8 is a schematic diagram of the data extraction result after adding LTLine objects in the third embodiment of this application;
Figure 9 is a schematic diagram of the data extraction result after adjusting the LTLine objects in the fourth embodiment of this application;
Figure 10 is a schematic diagram of the LTLine objects in the PDF file of the fifth embodiment of this application;
Figure 11 is a schematic diagram of the hardware architecture of an electronic device according to an embodiment of this application;
Figure 12 is a schematic diagram of the program modules of the PDF file data extraction program according to an embodiment of this application.
Detailed Description
Embodiments of the PDF file data extraction method, apparatus, device, and storage medium described in this application are described below with reference to the accompanying drawings. A person of ordinary skill in the art will recognize that the described embodiments can be modified in various ways, or combinations thereof, without departing from the spirit and scope of this application. The drawings and description are therefore illustrative in nature and are not intended to limit the scope of protection of the claims. In addition, in this specification the drawings are not drawn to scale, and the same reference numerals denote the same parts.
The PDF file data extraction method of this embodiment is applied to extracting the text and tables in a PDF file. The text may be inside a table, or it may be paragraph text outside any table. A monthly statistical report in PDF format is used as the example below.
First Embodiment
Figure 1 shows a flowchart of the PDF file data extraction method in this embodiment. The method includes the following steps:
Step S10: the pdfminer tool (a tool for extracting information from PDF documents) is used to parse the PDF file, and a pdfminer.layout object is generated for each page of the PDF, as shown in Figure 2. The pdfminer.layout object may contain multiple LT sub-objects, and an LT sub-object may be at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object. An LTChar is a character with a boundary. The description below mainly uses LTChar as the example, so the characters mentioned are LT sub-objects.
The LT sub-objects may further include LTFigure (area box) sub-objects and LTLine (separator line) sub-objects. An LTFigure represents a region occupied by an area box, which is used to embed a picture or another PDF document.
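Step S10 can be sketched as follows, assuming the pdfminer.six distribution of pdfminer is installed. `extract_pages` comes from that library; `load_layouts` and `bbox_of` are hypothetical helper names, and the import is deferred so that the coordinate helper can be tried without the library or a PDF file.

```python
from types import SimpleNamespace


def load_layouts(pdf_path):
    """Return one layout object per page (assumes pdfminer.six is installed)."""
    from pdfminer.high_level import extract_pages  # deferred: optional dependency
    return list(extract_pages(pdf_path))


def bbox_of(lt_obj):
    """Return (x0, y0, x1, y1) from an object exposing those attributes."""
    return (lt_obj.x0, lt_obj.y0, lt_obj.x1, lt_obj.y1)


# Stand-in with the same coordinate attributes as a character sub-object;
# the numbers are made up for illustration.
fake_char = SimpleNamespace(x0=72.0, y0=700.0, x1=80.0, y1=712.0)
char_box = bbox_of(fake_char)
```

With a real file, each element returned by `load_layouts` can be iterated to reach its LT sub-objects.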
Step S20: the ordinate and abscissa of each LT sub-object are obtained. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained; if it is an LTLine, the abscissa and ordinate of each separator line are obtained; if it is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list, so the first page has its own first list, the second page has its own first list, and so on. The command for storing the LT sub-objects of each page's pdfminer.layout object in a first list is {'pageN': [LTobjs of layout]}, where N denotes the Nth layout and [LTobjs of layout] is an array.
Preferably, the ordinate includes the lower-left ordinate y0 and the upper-right ordinate y1 of the LT sub-object, and the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object. The LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged vertically in the first list in ascending order of the lower-left ordinate.
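A minimal sketch of the per-page first list described in step S20. `Box` is a hypothetical stand-in for an LT sub-object, `build_first_lists` is an illustrative name, and numbering pages from 1 is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object: text plus bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def build_first_lists(pages):
    """Map 'pageN' to that page's objects, sorted by ascending lower-left y0."""
    return {
        f"page{n}": sorted(objs, key=lambda b: b.y0)
        for n, objs in enumerate(pages, start=1)  # 1-based numbering is assumed
    }


# One page with two objects given out of order (made-up coordinates).
pages = [[Box("b", 10, 50, 20, 60), Box("a", 10, 30, 20, 40)]]
first_lists = build_first_lists(pages)
```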
Step S30: the contents of the first list are read row by row through the [list of line(list)] command, and during the row-by-row reading the vertical distance is used to determine whether LT sub-objects are in the same row, so that the LT sub-objects are divided into rows. The formula for determining by vertical distance whether LT sub-objects are in the same row is:
$$\left|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}\right| < \left|\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}\right| \tag{1}$$
where $\left|\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}\right|$ is the height of the LT sub-object (for example, if the LT sub-object is a character, it is the height of that character);
$\left|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}\right|$ is the difference between the y0 values of the i-th and (i+1)-th LT sub-objects;
and i denotes the i-th LT sub-object.
According to formula (1), if the difference between the y0 values of the i-th and (i+1)-th LT sub-objects is less than the height of one LT sub-object, the distance between the two is less than the height that one LT sub-object occupies, so the i-th and (i+1)-th LT sub-objects should be in the same row. If the distance between the i-th and (i+1)-th LT sub-objects is greater than the height of one LT sub-object, the two should be in different rows.
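The row test of formula (1) can be sketched as below. `Box` is a hypothetical stand-in for an LT sub-object and the coordinates are made up; the function compares each object only with the one read immediately before it, as the formula does.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_into_rows(objs):
    """Group an ordered list of objects into rows using formula (1)."""
    rows = []
    for obj in objs:
        if rows:
            prev = rows[-1][-1]  # the object read immediately before this one
            if abs(prev.y0 - obj.y0) < abs(prev.y1 - prev.y0):
                rows[-1].append(obj)  # y0 gap smaller than one object height
                continue
        rows.append([obj])  # otherwise start a new row
    return rows


objs = [
    Box("总", 10, 80, 20, 90),
    Box("计", 20, 80, 30, 90),
    Box("其", 10, 100, 20, 110),
]
rows = split_into_rows(objs)
row_texts = [[b.text for b in row] for row in rows]
```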
Step S40: the LT sub-objects in the same row are sorted from left to right in ascending order of the left boundary coordinate x0. For each row, by determining whether the right boundary coordinate x1 of the LT sub-object on the left is equal to the left boundary coordinate x0 of the adjacent LT sub-object on the right, multiple LT sub-objects are combined into a combined character string. Through step S40, the LT sub-objects in the same row are combined in the order they have in the original PDF file, restoring the text order of the PDF file.
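A minimal sketch of step S40, using a hypothetical `Box` stand-in and made-up integer coordinates so that the equality test x1 == x0 is exact. The merged box takes its left edge and y0 from the left neighbour and its right edge and y1 from the right neighbour.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical stand-in for an LT sub-object."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def combine_row(row):
    """Sort a row by x0 and merge neighbours whose edges touch (x1 == x0)."""
    merged = []
    for box in sorted(row, key=lambda b: b.x0):
        if merged and merged[-1].x1 == box.x0:
            last = merged[-1]
            merged[-1] = Box(last.text + box.text, last.x0, last.y0, box.x1, box.y1)
        else:
            merged.append(box)  # a gap remains, so start a new string
    return merged


row = [Box("信", 20, 80, 30, 90), Box("通", 10, 80, 20, 90), Box("亿", 50, 80, 60, 90)]
combined = combine_row(row)
combined_texts = [b.text for b in combined]
```

Real PDF coordinates are floating-point values, so an exact comparison may miss characters that touch only approximately; a small tolerance is one possible adjustment.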
The above extracts the LTChar content. The main content of a monthly report usually also includes tables, so the LTLine objects also need to be extracted, and the boundary lines of the table are determined from the LTLine coordinates.
Further, the method includes step S50: the left boundary coordinate of the leftmost character of a combined character string is taken as the left boundary coordinate of the combined character string, and the right boundary coordinate of its rightmost character is taken as the right boundary coordinate of the combined character string;
the left boundary coordinates of the combined character strings are compared, and the combined character strings are sorted from left to right in ascending order of their left boundary coordinates.
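Step S50 can be sketched as follows. Characters and strings are represented as plain tuples, and the function names and coordinates are illustrative assumptions.

```python
def combined_bounds(chars):
    """Bounds of a combined string from its characters, ordered left to right.

    Each character is an (x0, y0, x1, y1) tuple; the result takes x0 and y0
    from the leftmost character and x1 and y1 from the rightmost one.
    """
    left, right = chars[0], chars[-1]
    return (left[0], left[1], right[2], right[3])


def sort_by_left_edge(strings):
    """Sort (text, bounds) pairs by the left boundary coordinate x0."""
    return sorted(strings, key=lambda s: s[1][0])


bounds = combined_bounds([(10, 80, 20, 90), (20, 80, 30, 90)])
strings = [
    ("亿元", (200, 86, 220, 96)),
    ("设备制造业", (40, 80, 90, 90)),
    ("总计中", (10, 92, 40, 102)),
]
ordered_texts = [text for text, _ in sort_by_left_edge(strings)]
```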
Further, the method includes step S60: the abscissa positions of all the vertical lines among the LTLine objects are sorted from left to right in ascending order, and the ordinate positions of all the horizontal lines among the LTLine objects are sorted from top to bottom in ascending order, thereby forming a table.
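A minimal sketch of step S60, assuming each line is an (x0, y0, x1, y1) segment, that a vertical line has x0 == x1, and that a horizontal line has y0 == y1. Removing duplicate positions is an added assumption, and `table_grid` is an illustrative name.

```python
def table_grid(lines):
    """Return sorted x positions of vertical lines and y positions of horizontal lines.

    lines is a list of (x0, y0, x1, y1) segments.
    """
    xs = sorted({x0 for x0, y0, x1, y1 in lines if x0 == x1})  # vertical lines
    ys = sorted({y0 for x0, y0, x1, y1 in lines if y0 == y1})  # horizontal lines
    return xs, ys


lines = [
    (50, 0, 50, 100),   # right border
    (10, 0, 10, 100),   # left border
    (10, 100, 50, 100), # top border
    (10, 0, 50, 0),     # bottom border
]
xs, ys = table_grid(lines)
```

Consecutive pairs in `xs` and `ys` then delimit the columns and rows of the table.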
The data extraction process is illustrated below with a specific example: extracting the table in the PDF file shown in Figure 3 and the text it contains. As shown in Figure 3, the text includes the following three rows:
["其", "他", …, "-", "5", "8", ".", "7"]
["总", "计", …, "电", "子", "亿", "元", …, "1", "0", ".", "9"]
["设", "备", "制", "造", "业", "总"]
During reading, the absolute value of the difference between the y0 values of "其" and "他" (together "其他", "other") is less than the height of an LT sub-object, so "其" and "他" belong in the same row. In the same way, all LT sub-objects that belong in the same row are assigned to it. When "7" is read, the absolute value of the difference between the y0 value of "7" and the y0 value of "总" is greater than the height of an LT sub-object, so "7" and "总" are not in the same row, and "总" starts a new row.
Because the characters of "设备制造业" ("equipment manufacturing") sit vertically between two rows, the absolute value of the difference between the y0 value of "9" and the y0 value of "设" is less than the height of an LT sub-object. As a result, when the text is extracted, "总计中:计算机、通信和其他电子" ("of the total: computers, communications and other electronic") and "设备制造业" are stored in the same row of the first list, and "设备制造业" is appended after "10.9" (because reading proceeds row by row from top to bottom, "10.9" is read first and "设" afterwards). The resulting file is shown in Figure 4.
Next, the characters in the same row are formed into combined character strings. For "通" and "信" in the second row, the x1 of "通" equals the x0 of "信", so "通" and "信" are combined. For "子" and "亿" in the second row, the x1 of "子" does not equal the x0 of "亿", so "子" and "亿" are not combined; instead, the gap between the x1 of "子" and the x0 of "亿" is preserved. By comparing x1 and x0 values in this way, the LT sub-objects of each row are formed into combined character strings. For example, the second row yields the combined character strings "总计中:计算机、通信和其他电子" ("of the total: computers, communications and other electronic"), "亿元" ("100 million yuan"), "490.31", "3202.49", "10.9", and "设备制造业" ("equipment manufacturing"). The left boundary coordinate of the leftmost character of a combined character string is taken as its left boundary coordinate, and the right boundary coordinate of its rightmost character as its right boundary coordinate. Likewise, the lower-left ordinate of the leftmost character is taken as the lower-left ordinate of the combined character string, and the upper-right ordinate of the rightmost character as its upper-right ordinate.
The left and right boundary coordinates of the combined character strings are then compared. If the x0 value of the combined string on the right minus the x1 value of the combined string on the left is less than a preset splicing threshold, for example 0.01 (so that small errors do not prevent joining), the two combined strings can be spliced together. In the second row of this embodiment, no combined strings satisfy this condition.
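The splicing test can be sketched as follows, with combined strings held as (text, x0, x1) tuples already sorted by x0. The default threshold of 0.01 is the example value from the text; the function name and coordinates are illustrative assumptions.

```python
def splice_close(strings, threshold=0.01):
    """Join neighbouring strings when right.x0 - left.x1 is below the threshold."""
    out = []
    for text, x0, x1 in strings:
        if out and x0 - out[-1][2] < threshold:
            prev_text, prev_x0, _ = out[-1]
            out[-1] = (prev_text + text, prev_x0, x1)  # extend the previous string
        else:
            out.append((text, x0, x1))
    return out


strings = [("490", 100.0, 118.0), (".31", 118.005, 130.0), ("亿元", 150.0, 170.0)]
spliced = splice_close(strings)
```

Here the gap of about 0.005 between "490" and ".31" is below the threshold, so the two are joined, while the gap of 20 before "亿元" keeps it separate.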
At this point, however, the combined character strings are in the order "总计中:计算机、通信和其他电子" ("of the total: computers, communications and other electronic"), "亿元" ("100 million yuan"), "490.31", "3202.49", "10.9", "设备制造业" ("equipment manufacturing"), which does not fully match the original PDF file. The left boundary coordinates of the combined character strings are therefore compared again, and the combined character strings are sorted from left to right in ascending order of left boundary coordinate. For example, the left boundary coordinate of "总计中:计算机、通信和其他电子" is smaller than that of "设备制造业", so "总计中:计算机、通信和其他电子" belongs to the left of "设备制造业". The left boundary coordinate of "设备制造业" is in turn smaller than those of "亿元", "490.31", "3202.49", and "10.9", so "设备制造业" is moved between "总计中:计算机、通信和其他电子" and "亿元".
The combined character strings of the second row are then in the order "总计中:计算机、通信和其他电子", "设备制造业", "亿元", "490.31", "3202.49", "10.9", as shown in Figure 5.
将LTline的所有竖线的横坐标位置按照从小到大的顺序从左往右排序,将LTline的所有横线的纵坐标位置按照从小到大的顺序从上往下排序,从而形成表格,其形式如图8所示。Sort the abscissa positions of all the vertical lines of LTline from left to right in the order from small to large, and sort the ordinate positions of all the horizontal lines of LTline from top to bottom in the order of small to large, thus forming a table. As shown in Figure 8.
The above takes the second row as an example; the other rows are processed in the same way, and the details are not repeated.
In an optional embodiment, in step S20, the LT sub-objects inside an LTFigure are extracted iteratively to form a second list containing all the LT sub-objects inside that LTFigure, and the second list is stored in the first list.
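The iterative extraction from an LTFigure might look like the sketch below. `Char` and `Figure` are hypothetical stand-ins for pdfminer's layout classes, so the example runs without a PDF. Extending the first list with the second list, rather than nesting it, is also an assumption, since the text does not say which is intended.

```python
from dataclasses import dataclass, field


@dataclass
class Char:
    """Hypothetical stand-in for a leaf LT sub-object such as LTChar."""
    text: str


@dataclass
class Figure:
    """Hypothetical stand-in for LTFigure: a container of LT sub-objects."""
    children: list = field(default_factory=list)


def flatten_figure(figure):
    """Collect every non-figure descendant, depth-first, into the 'second list'."""
    second_list = []
    stack = [figure]
    while stack:
        node = stack.pop()
        if isinstance(node, Figure):
            # Push children reversed so they are popped in their original order.
            stack.extend(reversed(node.children))
        else:
            second_list.append(node)
    return second_list


first_list = [Char("a")]
fig = Figure([Char("b"), Figure([Char("c"), Char("d")]), Char("e")])
first_list.extend(flatten_figure(fig))  # assumption: extend rather than nest
texts = [c.text for c in first_list]
```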
Second embodiment
The second embodiment is substantially the same as the first embodiment. The content shared with the first embodiment is omitted here, and only the differing features are described.
The left boundary coordinates of the combined strings in the same row can also be compared. If they are equal, as in Figure 6, where "总计中:计算机、通信和其他电子" ("Of the total: computers, communications and other electronic") and "设备制造业" ("equipment manufacturing") have the same x0 value, the y0 values of the two combined strings are compared next, and the string with the higher y0 is placed before the string with the lower y0. In Figure 6, for example, the y0 of "总计中:计算机、通信和其他电子" is greater than the y0 of "设备制造业", so "设备制造业" is spliced after "总计中:计算机、通信和其他电子", as shown in Figure 7.
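One way to express this tie-break is a compound sort key: ascending x0 first, then descending y0. The tuple layout and the coordinate values below are hypothetical and serve only to illustrate the rule.

```python
def sort_row(row):
    """row: list of (text, x0, y0). Ascending x0; ties broken by descending y0."""
    return sorted(row, key=lambda s: (s[1], -s[2]))


# Invented coordinates: the first and last entries share the same x0.
row = [
    ("设备制造业", 50, 98),
    ("亿元", 260, 104),
    ("总计中:计算机、通信和其他电子", 50, 110),
]
ordered = [text for text, _, _ in sort_row(row)]
spliced = ordered[0] + ordered[1]
```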
Third embodiment
The third embodiment is substantially the same as the first embodiment. The content shared with the first embodiment is omitted here, and only the differing features are described.
If positions overlap after sorting in ascending order of x0, that is, if the interval between the left and right boundary coordinates of one combined string falls within the interval between the left and right boundary coordinates of another combined string, the first string may have been displaced by a line wrap of the other, as shown in Figure 3. For example, the interval between the left and right boundary coordinates of "设备制造业" ("equipment manufacturing") falls within that of "总计中:计算机、通信和其他电子" ("Of the total: computers, communications and other electronic"). In that case the y0 values of the two combined strings are compared, and the string with the higher y0 is placed before the string with the lower y0. This yields "总计中:计算机、通信和其他电子设备制造业" ("Of the total: manufacturing of computers, communications and other electronic equipment"), as shown in Figure 7.
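The containment test and the resulting splice can be sketched as below. The dictionary layout, the function name `merge_wrapped` and the coordinates are assumptions. The sketch only handles the case where a string is contained in the string immediately before it in x0 order.

```python
def merge_wrapped(row):
    """row: list of dicts with text, x0, x1, y0, already sorted by ascending x0."""
    merged = []
    for item in row:
        prev = merged[-1] if merged else None
        contained = (
            prev is not None
            and prev["x0"] <= item["x0"]
            and item["x1"] <= prev["x1"]
        )
        if contained:
            # The string with the higher y0 comes first in the spliced text.
            upper, lower = (prev, item) if prev["y0"] >= item["y0"] else (item, prev)
            merged[-1] = {**prev, "text": upper["text"] + lower["text"], "y0": lower["y0"]}
        else:
            merged.append(dict(item))
    return merged


# Invented coordinates: the second string lies inside the x-range of the first.
row = [
    {"text": "总计中:计算机、通信和其他电子", "x0": 50, "x1": 200, "y0": 110},
    {"text": "设备制造业", "x0": 60, "x1": 120, "y0": 98},
    {"text": "亿元", "x0": 260, "x1": 284, "y0": 104},
]
result = merge_wrapped(row)
```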
Fourth embodiment
The fourth embodiment is substantially the same as the first embodiment. The content shared with the first embodiment is omitted here, and only the differing features are described.
After the positions of the combined strings have been adjusted according to the overlap, the LTLine values corresponding to the combined strings are also compared. If the vertical-line values of the LTLines corresponding to two combined strings are identical, the two strings lie in the same cell of the original PDF file. For example, the LTLine values corresponding to "总计中:计算机、通信和其他电子" ("Of the total: computers, communications and other electronic") and to "设备制造业" ("equipment manufacturing") are identical, yet their two vertical lines currently separate the two combined strings. Therefore, based on the number of LT sub-objects to be shifted, namely those satisfying $\left|\mathrm{LTtext}[i]_{y0}-\mathrm{LTtext}[i+1]_{y0}\right| < \left|\mathrm{LTtext}[i]_{y1}-\mathrm{LTtext}[i]_{y0}\right|$, the right-hand vertical line of the LTLine corresponding to those LTChars is moved to the right by a distance corresponding to that number. For example, "设备制造业" consists of 5 LTChars, and the vertical distance between each of them and the other characters of the row satisfies the condition above, so the right-hand vertical line is moved to the right by a distance corresponding to 5 LTChars. All the text of "总计中:计算机、通信和其他电子" and "设备制造业" is then enclosed together, forming "总计中:计算机、通信和其他电子设备制造业" ("Of the total: manufacturing of computers, communications and other electronic equipment"), as shown in Figure 9. Correspondingly, the vertical lines and LTChars to the right of that vertical line are also moved to the right by a distance corresponding to 5 LTChars. In other words, to splice "设备制造业" after "总计中:计算机、通信和其他电子", every object after it is shifted to the right and the vertical lines are repositioned, which yields a reasonable parsing result.
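The shift of the cell's right-hand line and of everything to its right can be sketched as follows. The text speaks only of "a distance corresponding to 5 LTChars"; the explicit `char_width` parameter, the function name and the sample positions are assumptions for illustration.

```python
def shift_right(vertical_xs, boundary_x, n_chars, char_width):
    """Move boundary_x and every vertical line to its right by n_chars * char_width."""
    offset = n_chars * char_width
    return [x + offset if x >= boundary_x else x for x in vertical_xs]


# Hypothetical line positions; the cell's right-hand line sits at x = 210.
shifted = shift_right([40, 210, 300, 380], boundary_x=210, n_chars=5, char_width=12)
```

The same offset would be added to the x0 and x1 of each LTChar to the right of the boundary, so that text and lines stay aligned.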
Fifth embodiment
The fifth embodiment is substantially the same as the first embodiment. The content shared with the first embodiment is omitted here, and only the differing features are described.
For each row, it is determined whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost string. If so, the leftmost vertical line lies to the right of the leftmost string, meaning that the string is not fully enclosed in a cell. A vertical line is therefore added at the left boundary coordinate of the leftmost string so that this string is enclosed as well. Likewise, it is determined whether the abscissa of the rightmost vertical line is smaller than the abscissa of the rightmost string, and if so, a vertical line is added at the right boundary coordinate of the rightmost string. As shown in Figure 10, there is no vertical line to the left of "总", the leftmost string of the second row, so a vertical line is added at the left boundary coordinate of "总".
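A minimal sketch of this boundary check follows. It assumes that the "abscissa of the rightmost string" means its right boundary x1, which the text does not state explicitly. The function name and the sample values are hypothetical.

```python
def add_boundary_lines(vertical_xs, row):
    """row: list of (x0, x1) spans of the strings in one row."""
    xs = sorted(vertical_xs)
    left_edge = min(x0 for x0, _ in row)
    right_edge = max(x1 for _, x1 in row)
    if not xs or xs[0] > left_edge:
        xs.insert(0, left_edge)   # enclose the leftmost string
    if xs[-1] < right_edge:
        xs.append(right_edge)     # enclose the rightmost string
    return xs


# No line to the left of the first string, which starts at x = 50.
with_left = add_boundary_lines([210, 300, 520], [(50, 200), (260, 284), (480, 504)])
# No line to the right of the last string, which ends at x = 284.
with_right = add_boundary_lines([40, 210], [(50, 200), (260, 284)])
```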
The above describes LTLine and LTChar. LTTextBoxHorizontal and LTFigure are handled in the same way: an LTTextBoxHorizontal, as an LT sub-object, is treated like a single character, and the parsed LTTextBoxHorizontal objects are arranged by ordinate and abscissa as described above.
In an optional embodiment, the similarity between each row of the extracted document and the corresponding row of the original PDF document is computed. A row whose similarity falls below the similarity threshold is split into text blocks at the positions of the vertical LTLine lines, and the similarity between each text block and the corresponding part of the PDF document is computed. A text block whose similarity falls below the threshold is considered to contain characters that were garbled during recognition. Such a block is split again by character width, the similarity between each character and the corresponding part of the original is computed, and a character whose similarity falls below the threshold is considered garbled.
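The coarse-to-fine check (row, then block, then character) can be sketched as below. The text does not name a similarity measure or say how the reference text is obtained, so the use of `difflib.SequenceMatcher`, the threshold of 0.9 and the availability of per-block reference strings are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in measure; the text does not specify one.
    return SequenceMatcher(None, a, b).ratio()


def find_garbled(extracted_blocks, reference_blocks, threshold=0.9):
    """Return (block_index, char_index) pairs for characters judged garbled."""
    # Row level: skip the row entirely when it is similar enough as a whole.
    if similarity("".join(extracted_blocks), "".join(reference_blocks)) >= threshold:
        return []
    suspects = []
    for b, (ext, ref) in enumerate(zip(extracted_blocks, reference_blocks)):
        # Block level: only low-similarity blocks are examined per character.
        if similarity(ext, ref) >= threshold:
            continue
        for c, (e_ch, r_ch) in enumerate(zip(ext, ref)):
            if similarity(e_ch, r_ch) < threshold:
                suspects.append((b, c))
    return suspects


# "\ue001" is a private-use code point standing in for a garbled glyph.
suspects = find_garbled(["设备制造业", "亿\ue001"], ["设备制造业", "亿元"])
```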
Garbled characters in the recognition process usually arise because an embedded font in the PDF document uses a custom encoding but lacks a mapping to the standard encoding, or has an incorrect one. A Word document uses only the standard encoding, so when the recognized characters are written to the Word document, their Unicode code points cannot be found there and they are displayed as garbled characters.
Garbled characters can be removed by establishing, font by font, a mapping between the current Unicode code points of the garbled characters in the embedded font and the standard Unicode code points.
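A per-font mapping of this kind can be sketched as a dictionary keyed by font name. The font names, the private-use code point and the mapping entry below are invented for illustration. How the mapping is actually built is not described in the text.

```python
def repair(chars, font_maps):
    """chars: list of (font_name, char); font_maps: {font_name: {bad_char: good_char}}."""
    return "".join(font_maps.get(font, {}).get(ch, ch) for font, ch in chars)


# Hypothetical mapping: in this embedded font, U+E001 should read as "元".
font_maps = {"ABCDEF+SimSun": {"\ue001": "元"}}
chars = [
    ("ABCDEF+SimSun", "亿"),
    ("ABCDEF+SimSun", "\ue001"),
    ("Helvetica", "\ue001"),  # other fonts are left untouched
]
fixed = repair(chars, font_maps)
```

Keying the table by font keeps a code point that is garbled in one embedded font from being rewritten in another font where it may be valid.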
Furthermore, recognition can be performed each time a row of text is extracted from the PDF document, so that the mapping between the current encoding of the PDF's embedded font and the standard encoding of the Word document is established as early as possible. This reduces garbled characters in the rest of the extraction.
Figure 11 is a schematic diagram of the hardware architecture of an embodiment of the electronic device of the present application. In this embodiment, the electronic device 2 is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions. It may be, for example, a smartphone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server or a cabinet server (including a standalone server or a server cluster composed of multiple servers). As shown in Figure 11, the electronic device 2 includes at least, but is not limited to, a memory 21 and a processor 22 that are communicatively connected to each other through a system bus. The memory 21 includes at least one type of computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk or an optical disc. In some embodiments, the memory 21 may be an internal storage unit of the electronic device 2, such as its hard disk or main memory. In other embodiments, the memory 21 may be an external storage device of the electronic device 2, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the electronic device 2. The memory 21 may also include both an internal storage unit of the electronic device 2 and an external storage device. In this embodiment, the memory 21 is generally used to store the operating system and the application software installed on the electronic device 2, such as the code of the PDF file data extraction program. The memory 21 can also be used to temporarily store data that has been output or is to be output.
In some embodiments, the processor 22 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 22 is generally used to control the overall operation of the electronic device 2, for example to perform control and processing related to data interaction or communication with the electronic device 2. In this embodiment, the processor 22 is used to run the program code stored in the memory 21 or to process data, for example to run the PDF file data extraction program.
Optionally, the electronic device 2 may also include a display, which may also be called a display screen or display unit. In some embodiments it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) display or the like. The display is used to show the information processed in the electronic device 2 and to present a visual user interface.
It should be noted that Figure 11 shows only the electronic device 2 with components 21 and 22. Not all of the illustrated components need to be implemented, and more or fewer components may be implemented instead.
The memory 21, which contains a readable storage medium, may hold an operating system, the PDF file data extraction program 50 and so on. When the processor 22 executes the PDF file data extraction program 50 in the memory 21, the steps of the PDF file data extraction method described above are carried out. In this embodiment, the PDF file data extraction program stored in the memory 21 may be divided into one or more program modules, which are stored in the memory 21 and can be executed by one or more processors (the processor 22 in this embodiment) to implement the present application. For example, Figure 12 is a schematic diagram of the program modules of the PDF file data extraction program. In this embodiment, the PDF file data extraction program 50 may be divided into a PDF file parsing module 501, an LT sub-object storage module 502, a line-by-line reading module 503 and an LT sub-object sorting module 504. A program module in the sense of this application is a series of computer program instruction segments capable of performing a specific function, and is better suited than the program as a whole to describing how the PDF file data extraction program is executed in the electronic device 2. The specific functions of the program modules are described below.
The PDF file parsing module 501 uses the pdfminer tool (a tool that extracts information from PDF documents) to parse the PDF file and generates one pdfminer.layout object for each page of the PDF. Each pdfminer.layout object contains multiple LT sub-objects, which include at least one of an LTTextBoxHorizontal (horizontal text box) sub-object and an LTChar (character) sub-object.
The LT sub-objects may further include LTFigure (area box) sub-objects and LTLine (separator line) sub-objects. An LTFigure represents a region occupied by an area box, which is used to embed, for example, an image or another PDF document.
The LT sub-object storage module 502 obtains the ordinate and abscissa of each LT sub-object. If the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained. If it is an LTLine, the abscissa and ordinate of each separator line are obtained. If it is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a corresponding first list: the first page has its own first list, the second page has its own first list, and so on. The command for storing the LT sub-objects of each page's pdfminer.layout object in a first list is {'pageN': [LTobjs of layout]}, where N denotes the N-th layout and [LTobjs of layout] is an array.
The ordinates comprise y0, the ordinate of the lower-left corner of the LT sub-object, and y1, the ordinate of its upper-right corner. The abscissas comprise x0, the left boundary coordinate of the LT sub-object, and x1, its right boundary coordinate. The LT sub-objects of the pdfminer.layout object are extracted one by one in ascending order of ordinate and arranged in the first list in ascending order of the lower-left ordinate.
The line-by-line reading module 503 reads the content of the first list row by row through the [list of line(list)] command. While reading, it also uses the vertical distance between LT sub-objects to decide whether they are in the same row, and thereby assigns the LT sub-objects to rows. Whether two LT sub-objects are in the same row is decided from the vertical distance by the following formula:
$$\left|\mathrm{LTtext}[i]_{y0}-\mathrm{LTtext}[i+1]_{y0}\right| < \left|\mathrm{LTtext}[i]_{y1}-\mathrm{LTtext}[i]_{y0}\right| \tag{1}$$
where $\left|\mathrm{LTtext}[i]_{y1}-\mathrm{LTtext}[i]_{y0}\right|$ is the height of the LT sub-object (for example, if the LT sub-object is a character, this is the height of the character);
$\left|\mathrm{LTtext}[i]_{y0}-\mathrm{LTtext}[i+1]_{y0}\right|$ is the difference between the y0 values of the i-th and the (i+1)-th LT sub-objects;
and i denotes the i-th LT sub-object.
By this formula, if the difference between the y0 values of the i-th and the (i+1)-th LT sub-objects is smaller than the height of one LT sub-object, the distance between the two is smaller than the height one LT sub-object occupies, so the i-th and the (i+1)-th LT sub-objects belong to the same row. If the distance between the i-th and the (i+1)-th LT sub-objects is greater than the height of one LT sub-object, the two belong to different rows.
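Formula (1) translates into a short row-splitting routine, sketched below. The tuple layout and the coordinate values are hypothetical. They are chosen so that a character lying midway between two rows is still attached to the row before it, as in the worked example that follows.

```python
def split_rows(chars):
    """chars: list of (text, y0, y1) in reading order. Returns a list of rows."""
    rows = []
    for i, ch in enumerate(chars):
        if i > 0:
            _, prev_y0, prev_y1 = chars[i - 1]
            # Formula (1): |y0[i] - y0[i+1]| < |y1[i] - y0[i]|
            same_row = abs(prev_y0 - ch[1]) < abs(prev_y1 - prev_y0)
        else:
            same_row = False
        if same_row:
            rows[-1].append(ch)
        else:
            rows.append([ch])
    return rows


# Invented coordinates; every character is 12 units high.
chars = [
    ("其", 200, 212), ("他", 200, 212), ("7", 200, 212),
    ("总", 110, 122), ("计", 110, 122), ("9", 104, 116),
    ("设", 98, 110),
]
row_texts = ["".join(t for t, _, _ in row) for row in split_rows(chars)]
```

Because each character is compared only with its predecessor, "设" (y0 = 98) joins the second row through "9" (y0 = 104), although it is a full character height below "总" (y0 = 110).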
The LT sub-object sorting module 504 sorts the LT sub-objects of the same row in ascending order of x0. Within each row, if the x1 of an LT sub-object on the left equals the x0 of the adjacent LT sub-object on its right, the two LT sub-objects are joined to form a combined string. Through step 4, the LT sub-objects of a row can be joined in the order they have in the original PDF file, restoring the text order of the PDF file.
The data extraction process is illustrated below with a concrete example, in which the list in the PDF file shown in Figure 3, that is, the text it contains, is extracted. As shown in Figure 3, the text comprises the following three rows:
["其", "他", …, "-", "5", "8", ".", "7", …]
["总", "计", …, "电", "子", "亿", "元", …, "1", "0", ".", "9"]
["设", "备", "制", "造", "业", "总", …]
During reading, the absolute value of the difference between the y0 values of "其" and "他" is smaller than the height of the LT sub-object, so "其" and "他" belong to the same row. In the same way, all LT sub-objects that belong together are assigned to the same row. When "7" has been read, the absolute value of (y0 of "7" minus y0 of "总") is greater than the height of the LT sub-object, so "7" and "总" are not in the same row and "总" starts a new row.
Because the characters "设备制造业" ("equipment manufacturing") sit midway between two rows, the absolute value of (y0 of "9" minus y0 of "设") is smaller than the height of the LT sub-object. During extraction, "总计中:计算机、通信和其他电子" ("Of the total: computers, communications and other electronic") and "设备制造业" are therefore stored in the same row of the first list, with "设备制造业" appended after "10.9" (reading proceeds row by row from top to bottom, so "10.9" is read before "设"). The resulting file is shown in Figure 4.
Step S50 is also included, in which the characters of the same row are formed into combined strings. For "通" and "信" in the second row, the x1 of "通" equals the x0 of "信", so "通" and "信" are joined. For "子" and "亿" in the second row, the x1 of "子" does not equal the x0 of "亿", so the two are not joined and the gap between the x1 of "子" and the x0 of "亿" is preserved. By comparing x1 and x0 values in this way, the LT sub-objects of each row are formed into combined strings. The second row, for example, yields the combined strings "总计中:计算机、通信和其他电子" ("Of the total: computers, communications and other electronic"), "亿元" ("100 million yuan"), "490.31", "3202.49", "10.9" and "设备制造业" ("equipment manufacturing"). The left boundary coordinate of the leftmost character is used as the left boundary coordinate of the combined string, and the right boundary coordinate of the rightmost character as its right boundary coordinate. The lower-left ordinate of the leftmost character is used as the lower-left ordinate of the combined string, and the upper-right ordinate of the rightmost character as its upper-right ordinate.
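The joining rule and the bounding-box rule can be sketched together as below. The dictionary layout and the coordinates are hypothetical. The small tolerance `tol` is an assumption added because floating-point coordinates rarely match exactly, whereas the text requires strict equality.

```python
def combine(chars, tol=0.01):
    """chars: list of dicts with text, x0, x1, y0, y1 for one row, in order."""
    combined = []
    for ch in chars:
        if combined and abs(combined[-1]["x1"] - ch["x0"]) <= tol:
            last = combined[-1]
            last["text"] += ch["text"]
            last["x1"] = ch["x1"]  # right boundary comes from the rightmost character
            last["y1"] = ch["y1"]  # upper ordinate comes from the rightmost character
            # x0 and y0 stay those of the leftmost character
        else:
            combined.append(dict(ch))
    return combined


# Invented coordinates: "电" touches "子", but "子" and "亿" are separated.
chars = [
    {"text": "电", "x0": 176, "x1": 188, "y0": 110, "y1": 122},
    {"text": "子", "x0": 188, "x1": 200, "y0": 110, "y1": 122},
    {"text": "亿", "x0": 260, "x1": 272, "y0": 104, "y1": 116},
    {"text": "元", "x0": 272, "x1": 284, "y0": 104, "y1": 116},
]
strings = combine(chars)
```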
At this point, however, the combined strings are in the order "总计中:计算机、通信和其他电子", "亿元", "490.31", "3202.49", "10.9", "设备制造业", which does not fully match the original PDF file. The left boundary coordinates of the combined strings are therefore compared, and the combined strings are re-sorted from left to right in ascending order of left boundary coordinate. For example, the left boundary coordinate of "总计中:计算机、通信和其他电子" is smaller than that of "设备制造业", so the former belongs to the left of the latter. The left boundary coordinate of "设备制造业" is in turn smaller than those of "亿元", "490.31", "3202.49" and "10.9", so "设备制造业" is moved to between "总计中:计算机、通信和其他电子" and "亿元".
The combined strings of the second row are now arranged as "总计中:计算机、通信和其他电子", "设备制造业", "亿元", "490.31", "3202.49", "10.9", as shown in Figure 5.
The above takes the second row as an example; the other rows are processed in the same way, and the details are not repeated.
In an optional embodiment, a first position correction module 505 is further included. The first position correction module 505 can compare the left boundary coordinates of the combined strings in the same row. If they are equal, as in Figure 6, where "总计中:计算机、通信和其他电子" and "设备制造业" have the same x0 value, the y0 values of the two combined strings are compared next, and the string with the higher y0 is placed before the string with the lower y0. In Figure 5, for example, the y0 of "总计中:计算机、通信和其他电子" is greater than the y0 of "设备制造业", so "设备制造业" is spliced after "总计中:计算机、通信和其他电子", as shown in Figure 7.
In an optional embodiment, a second position correction module 506 is further included. If positions overlap after sorting in ascending order of x0, that is, if the interval between the left and right boundary coordinates of one combined string falls within the interval between the left and right boundary coordinates of another combined string, the first string may have been displaced by a line wrap of the other. For example, as shown in Figure 3, the interval between the left and right boundary coordinates of "设备制造业" falls within that of "总计中:计算机、通信和其他电子". The second position correction module 506 then compares the y0 values of the two combined strings and places the string with the higher y0 before the string with the lower y0. This yields "总计中:计算机、通信和其他电子设备制造业" ("Of the total: manufacturing of computers, communications and other electronic equipment"), as shown in Figure 7.
The above extracts the LTChar content. The main content of a monthly report usually also includes tables, so the LTLine objects must be extracted as well. The boundary lines of the table are determined from the LTLine coordinates.
A table forming module 507 is further included. The table forming module 507 sorts the abscissas of all vertical LTLine lines in ascending order, from left to right, and the ordinates of all horizontal LTLine lines in ascending order, from top to bottom, thereby forming a table, as shown in Figure 8.
A table adjustment module 508 is further included. The table adjustment module 508 compares the LTLine values corresponding to the combined strings. If the vertical-line values of the LTLines of two LT sub-objects are identical, the LT sub-objects lie in the same cell of the original PDF file. For example, the LTLine values corresponding to "总计中:计算机、通信和其他电子" and to "设备制造业" are the same. Therefore, based on the number of LTChars satisfying $\left|\mathrm{LTtext}[i]_{y0}-\mathrm{LTtext}[i+1]_{y0}\right| < \left|\mathrm{LTtext}[i]_{y1}-\mathrm{LTtext}[i]_{y0}\right|$, the vertical line of the LTLine corresponding to those LTChars is moved to the right by the corresponding distance. For example, "设备制造业" consists of 5 LTChars, and the vertical distance between each of them and the other characters of the row satisfies the condition above, so the vertical line is moved to the right by a distance corresponding to 5 LTChars. All the text of "总计中:计算机、通信和其他电子" and "设备制造业" is then enclosed together, forming "总计中:计算机、通信和其他电子设备制造业", as shown in Figure 9. Correspondingly, the vertical lines to the right of that vertical line are also moved to the right by a distance corresponding to 5 LTChars.
In an optional embodiment, the LT sub-object storage module 502 is further configured, in step S20, to iteratively extract the LT sub-objects inside each LTFigure, form a second list containing all LT sub-objects inside that LTFigure, and store the second list in the first list.
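A minimal sketch of this iterative extraction is given below. It deliberately avoids the pdfminer classes and uses a hypothetical `Node` stand-in with a `kind` string and a `children` list; `flatten_figure` and `build_first_list` are likewise illustrative names, not functions from pdfminer or from the application.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for an LT sub-object."""
    kind: str                                   # e.g. "LTChar", "LTLine", "LTFigure"
    children: List["Node"] = field(default_factory=list)


def flatten_figure(figure: Node) -> List[Node]:
    """Iteratively collect every non-figure LT sub-object inside an LTFigure."""
    out: List[Node] = []
    stack = list(reversed(figure.children))     # reversed so pop() keeps document order
    while stack:
        node = stack.pop()
        if node.kind == "LTFigure":             # nested figure: descend into it
            stack.extend(reversed(node.children))
        else:
            out.append(node)
    return out


def build_first_list(page_objects: List[Node]) -> list:
    """Store plain sub-objects directly and each figure as a nested second list."""
    first_list: list = []
    for obj in page_objects:
        if obj.kind == "LTFigure":
            first_list.append(flatten_figure(obj))   # the "second list"
        else:
            first_list.append(obj)
    return first_list
```

An explicit stack is used instead of recursion so that deeply nested figures do not hit the interpreter's recursion limit.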
In an optional embodiment, the table adjustment module 508 is further configured to determine, for each row, whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string. If it is greater, the leftmost vertical line lies to the right of the leftmost character string, that is, the character string is not fully enclosed in a cell. A vertical line is therefore added at the left boundary coordinate of the leftmost character string, so that this character string is also enclosed in a cell. Likewise, the module determines whether the abscissa of the rightmost vertical line is less than the abscissa of the rightmost character string and, if it is, adds a vertical line at the right boundary coordinate of the rightmost character string. As shown in Figure 10, there is no vertical line to the left of the leftmost character string "总" in the second row, so a vertical line is added at the left boundary coordinate of "总".
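The border check for a single row can be sketched as below. The function name `add_border_lines` and the data layout are assumptions made for illustration: vertical lines are given as a list of abscissas, and each character string of the row as an `(x0, x1)` pair. The sketch also assumes that the comparison on the left uses x0 and the comparison on the right uses x1, which the text does not state explicitly.

```python
from typing import List, Sequence, Tuple


def add_border_lines(vlines: Sequence[float],
                     strings: Sequence[Tuple[float, float]]) -> List[float]:
    """Return the vertical-line abscissas for one row, extended so that
    every character string of the row is enclosed by a line on each side.

    vlines  -- abscissas of the existing vertical lines (any order)
    strings -- (x0, x1) boundaries of the character strings in the row
    """
    result = sorted(vlines)
    if not strings:
        return result
    left = min(x0 for x0, _ in strings)      # left boundary of leftmost string
    right = max(x1 for _, x1 in strings)     # right boundary of rightmost string
    if not result or result[0] > left:       # leftmost line is right of the text
        result.insert(0, left)
    if result[-1] < right:                   # rightmost line is left of the text
        result.append(right)
    return result
```

For instance, with lines at 50 and 100 and strings spanning 20 to 45 and 60 to 120, the function adds a line at 20 and another at 120.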
This application also provides a PDF file data extraction apparatus, comprising:
a PDF file parsing module 501, configured to parse a PDF file with the pdfminer tool and generate a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
an LT sub-object storage module 502, configured to obtain the ordinate and abscissa of each LT sub-object and store the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object, and wherein the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
a row reading module 503, configured to read the first list row by row and, for each type of LT sub-object, to determine during the row-by-row reading which row each LT sub-object belongs to from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
an LT sub-object sorting module 504, configured, for each type of LT sub-object and within each row, to sort the LT sub-objects in ascending order of the left boundary coordinate x0, and to combine multiple LT sub-objects into a combined character string by determining whether the right boundary coordinate x1 of the left LT sub-object equals the left boundary coordinate x0 of the adjacent LT sub-object on its right.
In addition, an embodiment of this application provides a computer-readable storage medium, which may be any one, or any combination, of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer-readable storage medium contains a PDF file data extraction program and the like. When the PDF file data extraction program 50 is executed by the processor 22, the following operations are implemented:
Step S10: parse the PDF file with the pdfminer tool (a tool for extracting information from PDF documents) and generate a pdfminer.layout object for each page of the PDF, as shown in Figure 2. The pdfminer.layout object contains multiple LT sub-objects, including LTTextBoxHorizontal (horizontal text box) sub-objects and LTChar (character) sub-objects. An LTChar is a character with a bounding box. The description below mainly uses LTChar as the example, so the characters it mentions are LT sub-objects.
The LT sub-objects may further include LTFigure (area box) sub-objects and LTLine (separator line) sub-objects. An LTFigure represents a region occupied by an area box, which is used to embed a picture or another PDF document.
Step S20: obtain the ordinate and abscissa of each LT sub-object. For example, if the LT sub-object is an LTChar, the ordinate and abscissa of each character are obtained; if it is an LTLine, the abscissa and ordinate of each separator line are obtained; if it is a horizontal text box, the ordinate and abscissa of each horizontal text box are obtained. The LT sub-objects of each page are stored in a first list of their own: the first page has one first list, the second page has another, and so on. The command for storing the LT sub-objects of each page's pdfminer.layout object in a first list is {'pageN':[LTobjs of layout]}, where N denotes the N-th layout and [LTobjs of layout] is an array.
The ordinate includes the lower-left ordinate y0 and the upper-right ordinate y1 of the LT sub-object, and the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object. The LT sub-objects of the pdfminer.layout object are extracted in ascending order of ordinate and arranged in the first list in ascending order of the lower-left ordinate y0.
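The per-page first list of step S20 can be illustrated with a short sketch. The `Box` tuple and the `build_page_lists` function are hypothetical names; `Box` stands in for an LT sub-object and carries only the text and the four coordinates named in the description. The sketch takes already-parsed pages as input and does not call pdfminer itself.

```python
from typing import Dict, List, NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical stand-in for an LT sub-object and its bounding box."""
    text: str
    x0: float   # left boundary
    y0: float   # lower-left ordinate
    x1: float   # right boundary
    y1: float   # upper-right ordinate


def build_page_lists(pages: Sequence[Sequence[Box]]) -> Dict[str, List[Box]]:
    """Build {'pageN': [LT sub-objects of layout N]} with each list
    sorted in ascending order of the lower-left ordinate y0."""
    return {
        f"page{n}": sorted(objs, key=lambda b: b.y0)
        for n, objs in enumerate(pages, start=1)
    }
```

Python's `sorted` is stable, so sub-objects sharing the same y0 keep their original relative order; the later sort on x0 within each row then fixes their left-to-right order.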
Step S30: read the content of the PDF file row by row with the [list of line(list)] command. During the row-by-row reading, the vertical distance between LT sub-objects is used to determine whether they are in the same row, so that the LT sub-objects are divided into rows. The formula for deciding from the vertical distance whether LT sub-objects are in the same row is:
$$|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}| < |\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}| \tag{1}$$
where $|\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}|$ is the height of the LT sub-object (for example, if the LT sub-object is a character, this is the height of the character);
$|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}|$ is the difference between the y0 values of the i-th and the (i+1)-th LT sub-objects;
i denotes the i-th LT sub-object.
By formula (1), if the difference between the y0 values of the i-th and the (i+1)-th LT sub-objects is less than the height of one LT sub-object, the distance between the two is less than the height that one LT sub-object occupies, so the i-th and the (i+1)-th LT sub-objects belong to the same row. If instead the distance between the i-th and the (i+1)-th LT sub-objects is greater than the height of one LT sub-object, the two belong to different rows.
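Formula (1) translates directly into a row-splitting loop. The sketch below is one possible reading, assuming the input is already sorted by y0 as in step S20; `Box` and `split_rows` are hypothetical names introduced for illustration.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical stand-in for an LT sub-object and its bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_rows(objs: Sequence[Box]) -> List[List[Box]]:
    """Split LT sub-objects (sorted by y0) into rows using formula (1):
    |y0[i] - y0[i+1]| < |y1[i] - y0[i]| means objects i and i+1 share a row."""
    if not objs:
        return []
    rows: List[List[Box]] = [[objs[0]]]
    for cur, nxt in zip(objs, objs[1:]):
        height = abs(cur.y1 - cur.y0)            # height of the i-th sub-object
        if abs(cur.y0 - nxt.y0) < height:        # condition of formula (1)
            rows[-1].append(nxt)                 # same row as its predecessor
        else:
            rows.append([nxt])                   # start a new row
    return rows
```

Because each object is compared only with its immediate predecessor, the loop runs in linear time over the sorted list.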
Step S40: sort the LT sub-objects of the same row in ascending order of x0. For each row, combine multiple LT sub-objects into a combined character string by determining whether the x1 of the left LT sub-object equals the x0 of the adjacent LT sub-object on its right. Step S40 thus joins the LT sub-objects of a row in the order they have in the original PDF file, restoring the text order of the PDF file.
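The sort-and-merge of step S40 might look like the sketch below. `Box` and `merge_row` are hypothetical names. The description tests x1 and x0 for equality; because these coordinates are floating-point values, the sketch compares them within a small tolerance `tol`, which is an assumption added here and not part of the source.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical stand-in for an LT sub-object and its bounding box."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def merge_row(row: Sequence[Box], tol: float = 1e-6) -> List[Box]:
    """Sort one row by x0, then join neighbours whose edges touch
    (left.x1 == right.x0 within tol) into combined character strings."""
    merged: List[Box] = []
    for box in sorted(row, key=lambda b: b.x0):
        if merged and abs(merged[-1].x1 - box.x0) <= tol:
            last = merged[-1]
            # Extend the current combined string and its bounding box.
            merged[-1] = Box(last.text + box.text, last.x0,
                             min(last.y0, box.y0), box.x1,
                             max(last.y1, box.y1))
        else:
            merged.append(box)                   # gap: start a new string
    return merged
```

Each combined string keeps the x0 of its leftmost member, which is the left boundary coordinate that the later ordering of combined character strings relies on.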
The specific implementation of the computer-readable storage medium of this application is substantially the same as that of the PDF file data extraction method and the electronic device 2 described above, and is not repeated here.
The above are only preferred embodiments of this application and do not limit it. Those skilled in the art may make various modifications and changes to this application. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (20)

  1. A PDF file data extraction method, applied to an electronic device, characterized by comprising the following steps:
    S10: parsing a PDF file with the pdfminer tool and generating a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
    S20: obtaining the ordinate and abscissa of each LT sub-object and storing the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object, and wherein the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
    S30: reading the first list row by row and, for each type of LT sub-object, determining during the row-by-row reading which row each LT sub-object belongs to from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
    S40: for each type of LT sub-object and within each row, sorting the LT sub-objects in ascending order of the left boundary coordinate x0, and combining multiple LT sub-objects into a combined character string by determining whether the right boundary coordinate x1 of the left LT sub-object equals the left boundary coordinate x0 of the adjacent LT sub-object on its right.
  2. The PDF file data extraction method according to claim 1, characterized in that the method further comprises:
    S50: taking the left boundary coordinate of the leftmost character string of a combined character string as the left boundary coordinate of the combined character string;
    in each row, comparing the left boundary coordinates of the combined character strings and sorting the combined character strings from left to right in ascending order of their left boundary coordinates.
  3. The PDF file data extraction method according to claim 2, characterized in that the formula for deciding from the vertical distance whether LT sub-objects are in the same row is:
    $$|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}| < |\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}|$$
    where $|\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}|$ is the height of the LT sub-object;
    $|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}|$ is the difference between the y0 values of the i-th and the (i+1)-th LT sub-objects;
    y0 is the lower-left ordinate of the LT sub-object and y1 is the upper-right ordinate of the LT sub-object;
    i denotes the i-th LT sub-object.
  4. The PDF file data extraction method according to claim 1, characterized in that the step of storing the LT sub-objects of each page in the corresponding first list in step S20 comprises:
    storing the LT sub-objects of each page's pdfminer.layout object in a first list with the command {'pageN':[LTobjs of layout]}, where N denotes the N-th layout and [LTobjs of layout] is an array.
  5. The PDF file data extraction method according to claim 3, characterized in that the LT sub-objects include at least one of an LTTextBoxHorizontal sub-object, an LTChar sub-object and an LTFigure sub-object, wherein the LTFigure sub-object represents a region occupied by an area box, and the area box is used to embed another PDF document.
  6. The PDF file data extraction method according to claim 5, characterized in that,
    in step S20, an LTTextBoxHorizontal sub-object is stored directly in the first list, and for an LTFigure sub-object the LT sub-objects inside it are iteratively extracted to form a second list containing all LT sub-objects inside the LTFigure, and the second list is stored in the first list.
  7. The PDF file data extraction method according to claim 3, characterized in that,
    in step S50, after the left boundary coordinates of the combined character strings have been compared and the combined character strings sorted from left to right in ascending order of their left boundary coordinates,
    if two left boundary coordinates are the same, the y0 values of the two combined character strings are further compared, and the combined character string with the higher y0 value is placed before the one with the lower y0 value.
  8. The PDF file data extraction method according to claim 5, characterized in that,
    in step S50, after the left boundary coordinates of the combined character strings have been compared and the combined character strings sorted from left to right in ascending order of their left boundary coordinates,
    if positions overlap, the y0 values of the two combined character strings are further compared, and the combined character string with the higher y0 value is placed before the one with the lower y0 value.
  9. The PDF file data extraction method according to claim 8, characterized in that the LT sub-objects further include an LTLine sub-object, and the method further comprises:
    step S60: according to the coordinates of the LTLine sub-objects, sorting the abscissas of all vertical LTLine lines in ascending order from left to right and sorting the ordinates of all horizontal LTLine lines in ascending order from top to bottom, thereby forming a table.
  10. The PDF file data extraction method according to claim 9, characterized in that,
    after sorting the combined character strings from left to right in ascending order of their left boundary coordinates has produced overlapping positions, and the combined character string with the higher y0 value has been placed before the one with the lower y0 value,
    the LTLine values corresponding to the combined character strings of the same row are further compared; if the vertical-line values of the LTLine corresponding to the combined character strings are the same, the combined character strings are in the same cell of the PDF file; according to the shift count, namely the number of LT sub-objects satisfying $|\mathrm{LTtext}[i]_{y0} - \mathrm{LTtext}[i+1]_{y0}| < |\mathrm{LTtext}[i]_{y1} - \mathrm{LTtext}[i]_{y0}|$, the right vertical line of the LTLine corresponding to the combined character strings having the same vertical-line values is moved to the right by the distance corresponding to the shift count, and all LT sub-objects to the right of the right-hand combined character string among those combined character strings are moved to the right by the distance corresponding to the shift count.
  11. The PDF file data extraction method according to claim 9, characterized in that,
    after the table has been formed, for each row, it is determined whether the abscissa of the leftmost vertical line is greater than the abscissa of the leftmost character string, and if it is, a vertical line is added at the left boundary coordinate of the leftmost character string;
    it is determined whether the abscissa of the rightmost vertical line is less than the abscissa of the rightmost character string, and if it is, a vertical line is added at the right boundary coordinate of the rightmost character string.
  12. The PDF file data extraction method according to claim 3, characterized in that,
    the left and right boundary coordinates of the combined character strings of the same row are compared, and if the x0 value of the right combined character string minus the x1 value of the left combined character string is less than a preset splicing threshold, the two combined character strings are spliced together.
  13. The PDF file data extraction method according to claim 5, characterized in that,
    each time a row of LTChar sub-objects is extracted from the PDF document, the similarity between that row of LTChar sub-objects and the corresponding row of the PDF document is computed; if the similarity is below a similarity threshold, the row is split into text blocks at the positions of the vertical LTLine lines of that row, and the similarity between each text block and the corresponding part of the PDF document is computed; a text block whose similarity is below the similarity threshold is deemed to contain characters garbled during recognition, and is split again by single-character width, and the similarity between each character and the corresponding part of the original text is computed; a character whose similarity is below the similarity threshold is deemed garbled.
  14. The PDF file data extraction method according to claim 13, characterized in that,
    for each identified garbled character, a mapping is established between the current Unicode encoding of the garbled character and the standard Unicode encoding, thereby removing the garbled characters.
  15. A PDF file data extraction apparatus, characterized by comprising:
    a PDF file parsing module, configured to parse a PDF file with the pdfminer tool and generate a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
    an LT sub-object storage module, configured to obtain the ordinate and abscissa of each LT sub-object and store the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object, and wherein the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
    a row reading module, configured to read the first list row by row and, for each type of LT sub-object, to determine during the row-by-row reading which row each LT sub-object belongs to from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
    an LT sub-object sorting module, configured, for each type of LT sub-object and within each row, to sort the LT sub-objects in ascending order of the left boundary coordinate x0, and to combine multiple LT sub-objects into a combined character string by determining whether the right boundary coordinate x1 of the left LT sub-object equals the left boundary coordinate x0 of the adjacent LT sub-object on its right.
  16. The PDF file data extraction apparatus according to claim 15, characterized by further comprising a first position correction module, configured to compare the left boundary coordinates of the combined character strings of the same row and, if two left boundary coordinates are the same, to further compare the y0 values of the two combined character strings and place the combined character string with the higher y0 value before the one with the lower y0 value.
  17. The PDF file data extraction apparatus according to claim 15, characterized by further comprising a second position correction module, configured, if positions overlap after sorting in ascending order of x0, to further compare the y0 values of the two combined character strings and place the combined character string with the higher y0 value before the one with the lower y0 value.
  18. An electronic device, characterized in that the electronic device comprises a memory and a processor, the memory stores a PDF file data extraction program, and the PDF file data extraction program, when executed by the processor, implements the following steps:
    S10: parsing a PDF file with the pdfminer tool and generating a pdfminer.layout object for each page of the PDF, wherein the pdfminer.layout object contains LT sub-objects;
    S20: obtaining the ordinate and abscissa of each LT sub-object and storing the LT sub-objects of each page in a corresponding first list, wherein the abscissa includes the left boundary coordinate x0 and the right boundary coordinate x1 of the LT sub-object, and wherein the LT sub-objects of the pdfminer.layout object of each page are extracted in ascending order of ordinate and arranged vertically, in ascending order of ordinate, in the first list corresponding to that page;
    S30: reading the first list row by row and, for each type of LT sub-object, determining during the row-by-row reading which row each LT sub-object belongs to from the vertical distance between LT sub-objects, thereby dividing the LT sub-objects into rows;
    S40: for each type of LT sub-object and within each row, sorting the LT sub-objects in ascending order of the left boundary coordinate x0, and combining multiple LT sub-objects into a combined character string by determining whether the right boundary coordinate x1 of the left LT sub-object equals the left boundary coordinate x0 of the adjacent LT sub-object on its right.
  19. The electronic device according to claim 18, characterized in that the PDF file data extraction program, when executed by the processor, further implements step S50:
    taking the left boundary coordinate of the leftmost character string of a combined character string as the left boundary coordinate of the combined character string;
    comparing the left boundary coordinates of the combined character strings and sorting the combined character strings from left to right in ascending order of their left boundary coordinates.
  20. A computer non-volatile readable storage medium, characterized in that the computer non-volatile readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, implement the PDF file data extraction method according to any one of claims 1-14.
PCT/CN2019/103580 2019-06-17 2019-08-30 Pdf file data extraction method and apparatus, device, and storage medium WO2020252931A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910521031.4 2019-06-17
CN201910521031.4A CN110377559B (en) 2019-06-17 2019-06-17 PDF file data extraction method, device and storage medium

Publications (1)

Publication Number Publication Date
WO2020252931A1 true WO2020252931A1 (en) 2020-12-24

Family

ID=68248967

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/103580 WO2020252931A1 (en) 2019-06-17 2019-08-30 Pdf file data extraction method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN110377559B (en)
WO (1) WO2020252931A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361257B (en) * 2021-06-29 2022-10-11 深圳壹账通智能科技有限公司 PDF document analysis method, system, electronic device and storage medium
CN115618847B (en) * 2022-12-20 2023-03-14 浙江保融科技股份有限公司 Method and device for analyzing PDF document and readable storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101866335A (en) * 2010-06-14 2010-10-20 深圳市万兴软件有限公司 Form processing method and device in document conversion
CN108038426A (en) * 2017-11-29 2018-05-15 阿博茨德(北京)科技有限公司 The method and device of chart-information in a kind of extraction document
CN108415887A (en) * 2018-02-09 2018-08-17 武汉大学 A kind of method that pdf document is converted to OFD files
US20190179885A1 (en) * 2017-12-13 2019-06-13 Think Research Corporation Automated Generation of Web Forms Using Fillable Electronic Documents

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8861856B2 (en) * 2007-09-28 2014-10-14 Abbyy Development Llc Model-based methods of document logical structure recognition in OCR systems
CN102722475A (en) * 2012-05-09 2012-10-10 深圳市万兴软件有限公司 Method for converting form in portable document format (PDF) document into Excel form
JP6719862B2 (en) * 2015-03-20 2020-07-08 株式会社島津製作所 PDF data retrieval system and program for PDF data retrieval system
CN109446487A (en) * 2018-11-01 2019-03-08 北京神州泰岳软件股份有限公司 A kind of method and device parsing portable document format document table

Also Published As

Publication number Publication date
CN110377559A (en) 2019-10-25
CN110377559B (en) 2022-09-16

Similar Documents

Publication Publication Date Title
US10592184B2 (en) Method and device for parsing tables in PDF document
WO2021189803A1 (en) Text error correction method and apparatus, electronic device, and storage medium
WO2021147252A1 (en) Ocr-based table format recovery method and apparatus, electronic device, and storage medium
US7853869B2 (en) Creation of semantic objects for providing logical structure to markup language representations of documents
US11829401B2 (en) Method for table extraction from journal literature based on text state characteristics
CN110659527B (en) Form detection in electronic forms
CN102375807B (en) Method and device for proofing characters
WO2021208703A1 (en) Method and apparatus for question parsing, electronic device, and storage medium
CN108536745B (en) Shell-based data table extraction method, terminal, equipment and storage medium
WO2020252931A1 (en) Pdf file data extraction method and apparatus, device, and storage medium
CN111310426A (en) Form format recovery method and device based on OCR and storage medium
JP5380040B2 (en) Document processing device
CN116644729A (en) Table file processing method, apparatus, computer device and storage medium
CN112417899A (en) Character translation method, device, computer equipment and storage medium
WO2022178994A1 (en) Table structure recognition method and apparatus, electronic device, and storage medium
CN114201620A (en) Method, apparatus and medium for mining PDF tables in PDF file
US10643022B2 (en) PDF extraction with text-based key
US10970478B2 (en) Tabular data analysis method, recording medium storing tabular data analysis program, and information processing apparatus
CN114385679A (en) Meter structure inspection method, meter structure inspection device and electronic equipment
CN103176956A (en) Method and device for extracting file structure
CN104536947A (en) Layout document processing method and device
CN116860747A (en) Training sample generation method and device, electronic equipment and storage medium
CN112528599A (en) Multi-page document processing method, apparatus, computer device and medium based on XML
US20190005038A1 (en) Method and apparatus for grouping documents based on high-level features clustering
CN117151106A (en) Method and device for generating document outline, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19933599; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 19933599; Country of ref document: EP; Kind code of ref document: A1)