CN117994804A - Method for extracting form information from PDF file and electronic equipment - Google Patents

Method for extracting form information from PDF file and electronic equipment

Info

Publication number
CN117994804A
Authority
CN
China
Prior art keywords
bounding box
bounding
row
bounding boxes
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410070624.4A
Other languages
Chinese (zh)
Inventor
邓高伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202410070624.4A
Publication of CN117994804A

Landscapes

  • Character Input (AREA)

Abstract

The application discloses a method for extracting form information from a PDF file, comprising the following steps: parsing the PDF file to be processed to obtain bounding boxes of connected characters in the PDF file to be processed; expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, thereby obtaining extended bounding boxes; extracting candidate cells from the extended bounding boxes according to the row alignment relationship and the column connection relationship between the extended bounding boxes; for a candidate table formed by the candidate cells, cutting the candidate table using the column boundary lines on both sides of the row with the minimum width in the candidate table as cutting lines, and taking the candidate cells located between the two cutting lines as target cells; and generating, from the target cells, the target table information extracted from the PDF file to be processed. With this technical solution, table information can be extracted accurately even when table borders are hidden or cells in the table are misaligned, which improves the accuracy of table extraction.

Description

Method for extracting form information from PDF file and electronic equipment
Technical Field
The application belongs to the technical field of data processing, and particularly relates to a method for extracting form information from a PDF file and electronic equipment.
Background
PDF (Portable Document Format) is a file format; files in this format are commonly referred to as PDF files. PDF is now a common file format in daily work and life because its display stays consistent across software and across platforms. In some cases, character content needs to be extracted from a PDF file, including table content. Typically, when table information is extracted from a PDF file, the border lines of the table are intersected to obtain the cells they enclose, and the structure of the whole table is then restored from the position information of the cells. However, when the table's border lines are incomplete, this method extracts tables with low accuracy, or cannot extract them at all.
Disclosure of Invention
The application aims to provide a method for extracting form information from a PDF file and electronic equipment so as to improve the accuracy of extracting the form from the PDF file.
Other features and advantages of the application will be apparent from the following detailed description, or may be learned by the practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for extracting form information from a PDF file, including:
analyzing a PDF file to be processed to obtain bounding boxes of characters connected in the PDF file to be processed;
Expanding the bounding boxes to enable the boundaries of adjacent bounding boxes to coincide, so as to obtain an expanded bounding box;
Extracting candidate cells from the extended bounding boxes according to the row alignment relationship and the column connection relationship between the extended bounding boxes, wherein the candidate cells comprise the extended bounding boxes with aligned boundaries in the row direction and connected in the column direction;
According to a candidate table formed by the candidate cells, respectively cutting the candidate table by taking column boundary lines at two sides of a row with the minimum width in the candidate table as cutting lines, and taking the candidate cells positioned between the two cutting lines as target cells;
and generating target table information extracted from the PDF file to be processed according to the target cell.
In one embodiment of the application, parsing the PDF file to be processed also yields the dividing lines contained in the PDF file to be processed; expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain an extended bounding box, comprises:
detecting whether 4 boundaries of the bounding box are intersected with boundaries of other bounding boxes or dividing lines in the PDF file to be processed;
if any boundary of the bounding box is intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, stopping expanding the boundary of the bounding box;
If any boundary of the bounding box is not intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, expanding the boundary of the bounding box in a direction away from the center of the bounding box according to a set step length, and returning to the step of detecting whether 4 boundaries of the bounding box are intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed;
And after all 4 boundaries of the bounding box stop expanding, generating an expanded bounding box according to the expanded boundaries.
In one embodiment of the present application, extracting candidate cells from the extended bounding box according to a row alignment relationship and a column connection relationship between the extended bounding boxes includes:
Detecting whether an upper boundary of the first extended bounding box and an upper boundary of the second extended bounding box are aligned or detecting whether a lower boundary of the first extended bounding box and a lower boundary of the second extended bounding box are aligned;
If the upper boundary of the first extended bounding box and the upper boundary of the second extended bounding box are aligned, or the lower boundary of the first extended bounding box and the lower boundary of the second extended bounding box are aligned, the first extended bounding box and the second extended bounding box are used as valid cells in the same row;
detecting whether the effective cell has a connection boundary with effective cells of other rows in the column direction;
And if the effective cell has a connection boundary with the effective cells of other rows in the column direction, taking the effective cell as a candidate cell.
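The row-alignment and column-connection checks above can be sketched as follows. This is a minimal illustration only, assuming each extended bounding box is a tuple (x, y, w, h) with (x, y) the lower-left corner; the function names and the tolerance `tol` are hypothetical and do not come from the application.

```python
from itertools import combinations

# Each box is (x, y, w, h) with (x, y) the lower-left corner (assumed layout).

def row_aligned(a, b, tol=0.5):
    """True if a and b share an upper boundary or a lower boundary."""
    top_a, top_b = a[1] + a[3], b[1] + b[3]
    return abs(top_a - top_b) <= tol or abs(a[1] - b[1]) <= tol

def column_connected(a, b, tol=0.5):
    """True if one box sits directly on the other and their x ranges overlap."""
    touching = (abs(a[1] - (b[1] + b[3])) <= tol
                or abs(b[1] - (a[1] + a[3])) <= tol)
    overlapping = a[0] < b[0] + b[2] and b[0] < a[0] + a[2]
    return touching and overlapping

def candidate_cells(boxes, tol=0.5):
    """Return indices of boxes that are row-aligned with some other box
    (valid cells) and also connected to a valid cell of another row."""
    valid = set()
    for i, j in combinations(range(len(boxes)), 2):
        if row_aligned(boxes[i], boxes[j], tol):
            valid.update((i, j))
    return [i for i in sorted(valid)
            if any(j != i and column_connected(boxes[i], boxes[j], tol)
                   for j in valid)]

boxes = [
    (0, 10, 10, 10), (10, 10, 10, 10),      # upper row of a 2x2 grid
    (0, 0, 10, 10), (10, 0, 10, 10),        # lower row of the same grid
    (100, 100, 10, 10), (110, 100, 10, 10), # aligned pair with no row below or above
    (50, 50, 5, 5),                         # isolated box
]
print(candidate_cells(boxes))
```

In this sketch the aligned pair at y = 100 counts as valid cells but is not returned, because no valid cell of another row touches it in the column direction.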
In one embodiment of the present application, before detecting whether the valid cell has a connection boundary with valid cells of other rows in the column direction, the method further includes:
ordering each row formed by the effective cells from small to large according to row boundary coordinates of each row;
detecting whether the number of the effective cells contained in the first row is larger than a preset threshold value;
If the number of the effective cells contained in the first row reaches a preset threshold, executing a step of detecting whether a connecting boundary exists between the effective cells and the effective cells of other rows in the column direction;
If the number of the effective cells contained in the first row does not reach the preset threshold, eliminating the effective cells contained in the first row, taking the next row as the first row, and returning to the step of detecting whether the number of the effective cells contained in the first row is larger than the preset threshold.
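The first-row filter described above might look like the sketch below. It assumes each row is a pair (row_boundary, cells), and it reads "reaches the preset threshold" as "greater than or equal to"; the text uses both "larger than" and "reaches", so that reading is an assumption, as are the function and parameter names.

```python
def drop_sparse_leading_rows(rows, min_cells=2):
    """Sort rows by their row boundary coordinate (ascending), then discard
    leading rows until one contains at least `min_cells` valid cells.

    rows: list of (row_boundary, [cell, ...]) pairs (hypothetical layout).
    """
    ordered = sorted(rows, key=lambda row: row[0])
    start = 0
    while start < len(ordered) and len(ordered[start][1]) < min_cells:
        start += 1  # this row is too sparse to be the first table row
    return ordered[start:]

rows = [
    (40, ["a", "b", "c"]),
    (10, ["title"]),        # a lone caption line sorted first
    (25, ["x", "y"]),
]
print(drop_sparse_leading_rows(rows))
```

Only leading rows are removed; a sparse row that appears after the first accepted row is kept, since the text only describes re-testing the current first row.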
In one embodiment of the present application, according to a candidate table formed by the candidate cells, the candidate table is cut by using column boundary lines on two sides of a row with a minimum width in the candidate table as cutting lines, and a candidate cell located between two cutting lines is used as a target cell, including:
expanding the boundary, which is not connected with other candidate cells, in the candidate cells until the boundary of the candidate cell is connected with the boundary of a bounding box near the candidate cell, so as to obtain an expanded candidate cell;
And according to an expansion candidate table formed by the expansion candidate cells, respectively cutting the expansion candidate table by taking column boundary lines at two sides of a row with the minimum width in the expansion candidate table as cutting lines, and taking the expansion candidate cells which are not cut by the cutting lines as target cells.
In one embodiment of the present application, according to an extended candidate table formed by the extended candidate cells, cutting the extended candidate table with column boundary lines on both sides of a row with a minimum width in the extended candidate table as cutting lines, respectively, includes:
According to an expansion candidate table formed by the expansion candidate cells, taking a left column boundary line of a row with the minimum width in the expansion candidate table as a first cutting line and taking a right column boundary line of the row with the minimum width in the expansion candidate table as a second cutting line;
And if the left column boundary of the bounding box corresponding to the expansion candidate cell in the expansion candidate table is positioned on the right side of the first cutting line and the right column boundary of the bounding box corresponding to the expansion candidate cell is positioned on the left side of the second cutting line, determining that the expansion candidate cell is not cut.
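As an illustration of the cutting rule, the sketch below finds the narrowest row, uses its left and right column boundaries as the two cutting lines, and keeps the cells lying between them. It is a simplification under stated assumptions: cells are (x, y, w, h) tuples, the test is applied to the cell rectangle itself rather than to its original character bounding box, and a cell whose boundary lies exactly on a cutting line is treated as not cut. The names `cut_table` and `tol` are hypothetical.

```python
def cut_table(rows, tol=0.5):
    """rows: list of rows, each a non-empty list of (x, y, w, h) cells.

    Returns ((left_cut, right_cut), kept_cells).
    """
    def extent(row):
        left = min(x for x, _, _, _ in row)
        right = max(x + w for x, _, w, _ in row)
        return left, right

    # The row with the minimum width supplies both cutting lines.
    left_cut, right_cut = min((extent(r) for r in rows),
                              key=lambda e: e[1] - e[0])
    kept = [cell for row in rows for cell in row
            if cell[0] >= left_cut - tol            # left edge not left of cut 1
            and cell[0] + cell[2] <= right_cut + tol]  # right edge not right of cut 2
    return (left_cut, right_cut), kept

rows = [
    [(0, 20, 10, 10), (10, 20, 10, 10), (20, 20, 30, 10)],  # width 50
    [(0, 10, 10, 10), (10, 10, 10, 10)],                    # width 20 (narrowest)
]
cuts, kept = cut_table(rows)
print(cuts, len(kept))
```

Here the third cell of the wider row extends past the second cutting line and is dropped, while the other four cells are kept as target cells.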
In one embodiment of the present application, generating the target table information extracted from the PDF file to be processed according to the target cell includes:
Expanding the boundaries of the target cells which are not aligned with other target cells so as to align the boundaries of the target cells with the boundaries of other target cells;
and generating target table information extracted from the PDF file to be processed according to the expanded target cell.
In one embodiment of the present application, after acquiring the bounding boxes of the connected characters in the PDF file to be processed, the method further includes:
Clustering the bounding boxes to obtain a plurality of class clusters, wherein one class cluster corresponds to one table;
And executing the steps of expanding the bounding boxes aiming at the bounding boxes contained in each class cluster so as to enable the boundaries of adjacent bounding boxes to coincide, and obtaining the expanded bounding boxes.
In one embodiment of the present application, after expanding the bounding boxes so that the boundaries of adjacent bounding boxes overlap to obtain an expanded bounding box, the method further includes:
Extracting a bounding box to be detected from a table formed by the extended bounding boxes, and acquiring the extended bounding box connected to the right side of the bounding box to be detected; wherein the bounding box to be detected starts from a first extended bounding box in the table;
If the extension bounding box connected to the right side of the bounding box to be detected is one, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to the step of recursively acquiring the extension bounding box connected to the right side of the bounding box to be detected;
If the number of the extension bounding boxes connected to the right side of the bounding box to be detected is multiple, combining the multiple extension bounding boxes, wherein the combining operation is used for combining the multiple extension bounding boxes into one extension bounding box;
If the multiple extension bounding boxes are successfully combined, reserving combined extension bounding boxes corresponding to the multiple extension bounding boxes, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to obtain the extension bounding box connected to the right side of the bounding box to be detected;
if the merging of the plurality of extension bounding boxes fails, keeping the plurality of extension bounding boxes unchanged, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to the step of acquiring the extension bounding box connected to the right side of the bounding box to be detected.
In one embodiment of the present application, after expanding the bounding boxes so that the boundaries of adjacent bounding boxes overlap to obtain an expanded bounding box, the method further includes:
Extracting a bounding box to be adjusted from a table formed by the extended bounding boxes, and recursively acquiring a plurality of extended bounding boxes connected to the right side of the bounding box to be detected; wherein the bounding box to be adjusted starts from a first extended bounding box in the table;
detecting whether a plurality of extension bounding boxes with aligned row boundaries exist in the bounding boxes to be detected and a plurality of extension bounding boxes connected to the right side of the bounding boxes to be detected;
If a plurality of extension bounding boxes with aligned row boundaries exist, adjusting the row boundaries of other extension bounding boxes with misaligned row boundaries by taking the aligned row boundaries of the extension bounding boxes with aligned row boundaries as references, so that the row boundaries of the extension bounding boxes to be detected and the extension bounding boxes connected to the right side of the extension bounding boxes to be detected are aligned;
if a plurality of extension bounding boxes with aligned row boundaries do not exist, adjusting the row boundaries of the extension bounding boxes connected to the right side of the bounding box to be detected by taking the row boundaries of the bounding box to be detected as references, so that the row boundaries of the bounding box to be detected and the extension bounding boxes connected to the right side of the bounding box to be detected are aligned;
And returning the next unprocessed extension bounding box to the step of recursively acquiring a plurality of extension bounding boxes connected to the right side of the bounding box to be detected as the bounding box to be adjusted.
According to an aspect of an embodiment of the present application, there is provided an apparatus for extracting form information from a PDF file, including:
The file analysis module is used for analyzing the PDF file to be processed to obtain bounding boxes of the characters connected in the PDF file to be processed;
The bounding box expansion module is used for expanding the bounding boxes so as to enable the boundaries of adjacent bounding boxes to coincide and obtain an expanded bounding box;
A candidate cell extraction module, configured to extract candidate cells from the extended bounding boxes according to a row alignment relationship and a column connection relationship between the extended bounding boxes, where the candidate cells include extended bounding boxes that are aligned in a boundary in a row direction and connected in a column direction;
The table cutting module is used for cutting the candidate table by taking column boundary lines at two sides of the row with the minimum width in the candidate table as cutting lines respectively according to the candidate table formed by the candidate cells, and taking the candidate cell positioned between the two cutting lines as a target cell;
and the table information generating module is used for generating target table information extracted from the PDF file to be processed according to the target cell.
According to an aspect of the embodiments of the present application, there is provided a computer-readable medium having stored thereon a computer program which, when executed by a processor, implements a method of extracting form information from a PDF file as in the above technical solution.
According to an aspect of an embodiment of the present application, there is provided an electronic apparatus including: a processor; and a memory for storing executable instructions of the processor; wherein execution of the executable instructions by the processor causes the electronic device to perform the method of extracting form information from a PDF file as in the above technical solution.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method of extracting form information from a PDF file as in the above technical solution.
In the technical solution provided by the embodiments of the application, the PDF file to be processed is first parsed to obtain bounding boxes of connected characters in the PDF file to be processed. The bounding boxes are then expanded so that the boundaries of adjacent bounding boxes coincide, yielding extended bounding boxes. Candidate cells are extracted from the extended bounding boxes according to the row alignment relationship and the column connection relationship between the extended bounding boxes, where the candidate cells comprise extended bounding boxes whose boundaries are aligned in the row direction and connected in the column direction. Next, for a candidate table formed by the candidate cells, the candidate table is cut using the column boundary lines on both sides of the row with the minimum width in the candidate table as cutting lines, and the candidate cells located between the two cutting lines are taken as target cells. Finally, the target table information extracted from the PDF file to be processed is generated from the target cells. Because the method starts from the bounding boxes of the characters and finds the target cells that make up the table through bounding box expansion, row alignment, column connection, table cutting and similar processing, without relying on the table's border lines, table information can be extracted accurately even when table borders are hidden or cells in the table are misaligned.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application. It is evident that the drawings in the following description are only some embodiments of the present application and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
Fig. 2 schematically shows a flowchart of a method for extracting form information from a PDF file according to an embodiment of the present application.
Fig. 3 schematically shows a schematic diagram of a PDF file to be processed according to an embodiment of the present application.
Fig. 4 schematically illustrates a schematic diagram of an extended bounding box provided by an embodiment of the present application.
Fig. 5A-5J schematically illustrate diagrams of table data in a table extraction process according to an embodiment of the present application.
Fig. 6 schematically illustrates a flow diagram of a clustering process provided by one embodiment of the application.
Fig. 7A schematically illustrates pre-cluster table data provided by one embodiment of the application.
Fig. 7B schematically illustrates clustered table data provided by one embodiment of the application.
Fig. 8A schematically illustrates table data before the merging process according to an embodiment of the present application.
Fig. 8B schematically illustrates a table data after the merging process according to an embodiment of the present application.
Fig. 9 schematically illustrates a table diagram of an alignment adjustment process provided by an embodiment of the present application.
Fig. 10A schematically illustrates table data before alignment adjustment according to an embodiment of the present application.
Fig. 10B schematically illustrates alignment-adjusted tabular data provided by an embodiment of the present application.
Fig. 11 schematically shows a block diagram of an apparatus for extracting form information from a PDF file according to an embodiment of the present application.
Fig. 12 schematically shows a block diagram of a computer system suitable for use in implementing embodiments of the application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the application may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the application.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
In the present embodiment, the term "module" or "unit" refers to a computer program or a part of a computer program having a predetermined function and working together with other relevant parts to achieve a predetermined object, and may be implemented in whole or in part by using software, hardware (such as a processing circuit or a memory), or a combination thereof. Also, a processor (or multiple processors or memories) may be used to implement one or more modules or units. Furthermore, each module or unit may be part of an overall module or unit that incorporates the functionality of the module or unit.
It will be appreciated that in particular embodiments of the present application, where data relating to customer information (e.g., transaction information, reconciliation data) and the like is involved, when the above embodiments of the present application are applied to particular products or technologies, customer approval or consent is required and the collection, use and processing of the relevant data is required to comply with relevant laws and regulations and standards of the relevant country and region.
Fig. 1 schematically shows a block diagram of an exemplary system architecture to which the technical solution of the present application is applied.
As shown in Fig. 1, system architecture 100 may include a terminal device 110, a network 120, and a server 130. Terminal device 110 may include a smart phone, tablet, notebook, smart voice interaction device, smart home appliance, vehicle terminal, and the like. The server 130 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. Network 120 may be a communication medium of various connection types capable of providing a communication link between terminal device 110 and server 130, and may be, for example, a wired communication link or a wireless communication link.
The system architecture in embodiments of the present application may have any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 130 may be a server group composed of a plurality of server devices. In addition, the technical solution provided in the embodiment of the present application may be applied to the terminal device 110, or may be applied to the server 130, or may be implemented by the terminal device 110 and the server 130 together, which is not limited in particular.
The method for extracting form information from a PDF file provided by the present application will be described in detail with reference to the following embodiments.
Fig. 2 schematically shows a flowchart of a method for extracting form information from a PDF file, which may be implemented by a terminal device or a server, according to an embodiment of the present application. As shown in Fig. 2, the method for extracting form information from a PDF file provided in this embodiment includes steps 210 to 250, which are specifically as follows:
Step 210: parsing the PDF file to be processed to obtain bounding boxes of the connected characters in the PDF file to be processed.
Specifically, the PDF file to be processed is a PDF file from which form information needs to be extracted. The PDF file to be processed can be parsed with a PDF parsing tool; many open source tools currently provide PDF parsing functions. The main purpose of parsing is to obtain the character information stored in the PDF file to be processed. The character information covers text characters and dividing lines, for example the code of a text character, its pixel position, its font size, and the length and pixel position of a dividing line. The bounding box of connected characters is the rectangular frame that encloses those characters, and connected characters are characters arranged closely together with no gap between them.
The PDF file to be processed may include visible and invisible characters. In this embodiment, invisible characters are not processed; after the PDF file to be processed is parsed, bounding boxes are obtained only from the character information of visible characters. As for dividing lines, since the dividing lines that make up a table are generally horizontal or vertical, the horizontal and vertical dividing lines are retained after parsing. If the character information is a text character, the code of the text character, its pixel position, and its width and height are recorded, where the pixel position is usually the coordinate of the lower-left corner of a run of connected characters, comprising a horizontal-axis coordinate and a vertical-axis coordinate. The pixel position together with the width and height thus defines the bounding box of the connected characters: if the pixel position is denoted (x, y), the height h, and the width w, then the corresponding bounding box has its lower-left vertex at (x, y), its upper-left vertex at (x, y+h), its lower-right vertex at (x+w, y), and its upper-right vertex at (x+w, y+h). For example, for the PDF file to be processed shown in Fig. 3, the parsed information is represented as follows:
{"id":"4","text":"ranking","rect":{"x":186,"y":726,"w":30,"h":10}},
{"id":"5","text":"name","rect":{"x":250,"y":726,"w":30,"h":10}},
{"id":"6","text":"language","rect":{"x":304,"y":726,"w":30,"h":10}},
Here id is the label of the bounding box (the label may be the code of a text character, or may be numbered separately), text is the character content contained in the bounding box, x is the abscissa of the lower-left corner of the bounding box, y is the ordinate of the lower-left corner of the bounding box, w is the width in the x direction, and h is the height in the y direction.
The parsed information of the dividing lines can be expressed as follows:
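As a small worked example of the vertex formulas, the sketch below derives the four corners of a bounding box from a rect record with fields x, y, w and h. The function name `corners` and the dictionary keys of its result are illustrative assumptions; the numeric values are those of the "ranking" record.

```python
def corners(rect):
    """Return the four vertices of a bounding box given its lower-left
    corner (x, y), width w and height h."""
    x, y, w, h = rect["x"], rect["y"], rect["w"], rect["h"]
    return {
        "lower_left": (x, y),
        "upper_left": (x, y + h),
        "lower_right": (x + w, y),
        "upper_right": (x + w, y + h),
    }

ranking = {"x": 186, "y": 726, "w": 30, "h": 10}
print(corners(ranking))
```

For this record the upper-right vertex is (216, 736), which is the point used when testing the upper and right boundaries against neighbouring boxes and dividing lines.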
{"from":{"x":160,"y":629},"to":{"x":448,"y":629}},
{"from":{"x":160,"y":651},"to":{"x":448,"y":651}},
{"from":{"x":160,"y":696},"to":{"x":448,"y":696}},
{"from":{"x":160,"y":719},"to":{"x":448,"y":719}},
Where from denotes the start point of the dividing line, and to denotes the end point of the dividing line.
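Since only horizontal and vertical dividing lines are retained, a parser might normalise the from/to records as in the sketch below. The tuple layouts (x1, x2, y) for horizontal lines and (x, y1, y2) for vertical lines, and the function name `classify_lines`, are assumptions made for illustration.

```python
def classify_lines(segments):
    """Split from/to segments into horizontal and vertical dividing lines.

    Horizontal lines are returned as (x1, x2, y) with x1 < x2, vertical
    lines as (x, y1, y2) with y1 < y2; slanted or zero-length segments
    are discarded.
    """
    horizontal, vertical = [], []
    for seg in segments:
        x1, y1 = seg["from"]["x"], seg["from"]["y"]
        x2, y2 = seg["to"]["x"], seg["to"]["y"]
        if y1 == y2 and x1 != x2:
            horizontal.append((min(x1, x2), max(x1, x2), y1))
        elif x1 == x2 and y1 != y2:
            vertical.append((x1, min(y1, y2), max(y1, y2)))
    return horizontal, vertical

segments = [
    {"from": {"x": 160, "y": 629}, "to": {"x": 448, "y": 629}},
    {"from": {"x": 160, "y": 651}, "to": {"x": 448, "y": 651}},
    {"from": {"x": 160, "y": 696}, "to": {"x": 448, "y": 696}},
    {"from": {"x": 160, "y": 719}, "to": {"x": 448, "y": 719}},
]
horizontal, vertical = classify_lines(segments)
print(horizontal, vertical)
```

All four example segments are horizontal, so the vertical list is empty for this input.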
Step 220: expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes.
Specifically, step 210 yields the minimum bounding boxes of the text characters. These bounding boxes are separated by gaps and neither touch nor intersect one another, whereas adjacent cells in a table are connected. This step therefore expands the bounding boxes until the boundaries of adjacent bounding boxes coincide, so that the extended bounding boxes are connected to one another and the real cells that make up the table can be identified later.
In one embodiment of the present application, the bounding box is expanded, that is, 4 boundaries of the bounding box are respectively expanded to the outside of the bounding box, so that the bounding box gradually becomes larger until intersecting with other bounding boxes, and expansion is stopped. Specifically, the expansion process includes: detecting whether 4 boundaries of the bounding box are intersected with boundaries of other bounding boxes or dividing lines in the PDF file to be processed; if any boundary of the bounding box is intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, stopping expanding the boundary of the bounding box; if any boundary of the bounding box is not intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, expanding the boundary of the bounding box in a direction away from the center of the bounding box according to a set step length, and returning to the step of detecting whether 4 boundaries of the bounding box are intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed; and after all 4 boundaries of the bounding box stop expanding, generating an expanded bounding box according to the expanded boundaries.
The dividing lines parsed from the PDF file to be processed may be the borders of some cells in the table. Therefore, when a boundary of the bounding box meets a dividing line in the PDF file to be processed during expansion, the bounding box has been expanded to the border position of the cell, and expansion of that boundary stops. Likewise, if a boundary of the expanded bounding box coincides with the boundary of another bounding box, the bounding box has also reached the cell border position, and expansion stops. The bounding box is expanded according to a set step length: for each boundary of the bounding box, the distance of each expansion is the set step length, and the expansion direction is toward the outside of the bounding box, that is, away from the center of the bounding box. When all 4 boundaries of the bounding box have stopped expanding, the bounding box formed by the 4 expanded boundaries is denoted as an extended bounding box.
By way of example, the specific procedure for bounding box expansion is described below using the two bounding boxes shown in fig. 4. The expansion process involves two checks: whether the bounding box intersects a dividing line, and whether the bounding box intersects other bounding boxes. In the example shown in fig. 4, since the y coordinates along a horizontal dividing line are equal, a horizontal dividing line M is represented by two distinct x coordinates and one y coordinate, denoted M(x1, x2, y); correspondingly, a vertical dividing line N is represented by one x coordinate and two distinct y coordinates, denoted N(x, y1, y2).
Taking bounding box A as an example, judging whether bounding box A intersects a dividing line includes the following 4 checks. Here A_{y+h} denotes the value of y+h in the coordinates of bounding box A, M_y denotes the y coordinate of dividing line M, and the other symbols are read in the same way; "==" means "is equal to" and "&&" means "and".
1) Upper direction of bounding box A (the upper boundary of bounding box A in the row direction). Connected with horizontal dividing line M: (A_{y+h} == M_y) && (A_{x+w} > M_{x1}) && (A_x < M_{x2}). Intersecting vertical dividing line N: (A_{y+h} == N_{y1}) && (A_x < N_x) && (A_{x+w} > N_x). For example, when the three conditions A_{y+h} == M_y, A_{x+w} > M_{x1}, and A_x < M_{x2} hold at the same time, the upper boundary of bounding box A is connected with horizontal dividing line M.
2) Lower direction of bounding box A (the lower boundary of bounding box A in the row direction). Connected with horizontal dividing line M: (A_y == M_y) && (A_{x+w} > M_{x1}) && (A_x < M_{x2}). Intersecting vertical dividing line N: (A_y == N_{y2}) && (A_x < N_x) && (A_{x+w} > N_x).
3) Left direction of bounding box A (the left boundary of bounding box A in the column direction). Connected with vertical dividing line N: (A_x == N_x) && (A_{y+h} > N_{y1}) && (A_y < N_{y2}). Intersecting horizontal dividing line M: (A_x == M_{x2}) && (A_y < M_y) && (A_{y+h} > M_y).
4) Right direction of bounding box A (the right boundary of bounding box A in the column direction). Connected with vertical dividing line N: (A_{x+w} == N_x) && (A_{y+h} > N_{y1}) && (A_y < N_{y2}). Intersecting horizontal dividing line M: (A_{x+w} == M_{x1}) && (A_y < M_y) && (A_{y+h} > M_y).
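The four dividing-line checks can be sketched in Python as follows. This is only an illustrative sketch: the names `Box`, `HLine`, `VLine`, `meets_hline`, and `meets_vline` are hypothetical and do not appear in the application, and the y axis is assumed to grow upward so that y+h is the upper boundary.

```python
from typing import NamedTuple

class Box(NamedTuple):      # bounding box (x, y, w, h); y is assumed to grow upward
    x: float
    y: float
    w: float
    h: float

class HLine(NamedTuple):    # horizontal dividing line M(x1, x2, y)
    x1: float
    x2: float
    y: float

class VLine(NamedTuple):    # vertical dividing line N(x, y1, y2)
    x: float
    y1: float
    y2: float

def meets_hline(a: Box, side: str, m: HLine) -> bool:
    """Check one side of box a against horizontal dividing line m."""
    top, right = a.y + a.h, a.x + a.w
    if side == "up":       # upper boundary lies on M
        return top == m.y and right > m.x1 and a.x < m.x2
    if side == "down":     # lower boundary lies on M
        return a.y == m.y and right > m.x1 and a.x < m.x2
    if side == "left":     # left boundary meets the right end of M
        return a.x == m.x2 and a.y < m.y and top > m.y
    if side == "right":    # right boundary meets the left end of M
        return right == m.x1 and a.y < m.y and top > m.y
    raise ValueError(f"unknown side: {side}")

def meets_vline(a: Box, side: str, n: VLine) -> bool:
    """Check one side of box a against vertical dividing line n."""
    top, right = a.y + a.h, a.x + a.w
    if side == "up":       # upper boundary meets the lower end of N
        return top == n.y1 and a.x < n.x and right > n.x
    if side == "down":     # lower boundary meets the upper end of N
        return a.y == n.y2 and a.x < n.x and right > n.x
    if side == "left":     # left boundary lies on N
        return a.x == n.x and top > n.y1 and a.y < n.y2
    if side == "right":    # right boundary lies on N
        return right == n.x and top > n.y1 and a.y < n.y2
    raise ValueError(f"unknown side: {side}")
```

A boundary would stop expanding as soon as either function returns true for that side against any dividing line parsed from the page.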
Taking bounding box A as an example, when judging whether bounding box A intersects other bounding boxes, it is determined whether bounding box A is connected with any other bounding box B in the up, down, left, or right direction. This includes the following 4 checks:
1) Upper direction of bounding box A connected with bounding box B: (A_{y+h} == B_y) && (A_{x+w} > B_x) && (A_x < B_{x+w}).
2) Lower direction of bounding box A connected with bounding box B: (A_y == B_{y+h}) && (A_{x+w} > B_x) && (A_x < B_{x+w}).
3) Left direction of bounding box A connected with bounding box B: (A_x == B_{x+w}) && (A_{y+h} > B_y) && (A_y < B_{y+h}).
4) Right direction of bounding box A connected with bounding box B: (A_{x+w} == B_x) && (A_{y+h} > B_y) && (A_y < B_{y+h}).
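The four box-to-box checks share one pattern: the touching edges are equal and the two boxes overlap along the other axis. A minimal Python sketch follows; the names `Box` and `connected` are hypothetical, and the y axis is assumed to grow upward.

```python
from typing import NamedTuple

class Box(NamedTuple):      # bounding box (x, y, w, h); y is assumed to grow upward
    x: float
    y: float
    w: float
    h: float

def connected(a: Box, b: Box, side: str) -> bool:
    """Return True if the given side of box a is connected with box b."""
    top_a, right_a = a.y + a.h, a.x + a.w
    top_b, right_b = b.y + b.h, b.x + b.w
    overlap_x = right_a > b.x and a.x < right_b   # shared horizontal extent
    overlap_y = top_a > b.y and a.y < top_b       # shared vertical extent
    if side == "up":
        return top_a == b.y and overlap_x
    if side == "down":
        return a.y == top_b and overlap_x
    if side == "left":
        return a.x == right_b and overlap_y
    if side == "right":
        return right_a == b.x and overlap_y
    raise ValueError(f"unknown side: {side}")
```

Because the overlap tests are strict inequalities, two boxes that touch only at a corner are not treated as connected.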
If bounding box A is not connected to or intersected by a dividing line or another bounding box in a given direction (up, down, left, or right), bounding box A (x, y, w, h) is expanded in that direction by one step. For example, with a step length of 1: expanding upward gives (x, y, w, h+1); expanding downward gives (x, y-1, w, h+1); expanding to the left gives (x-1, y, w+1, h); expanding to the right gives (x, y, w+1, h). If the boundary in a direction is connected to or intersects a dividing line or another bounding box, the coordinates in that direction are kept unchanged and expansion in that direction stops. The two kinds of checks above are repeated until every bounding box is connected to or intersects a dividing line or another bounding box and can no longer be expanded.
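The expansion loop can be sketched as below. This is a simplified illustration under stated assumptions, not the application's implementation: it uses integer coordinates with a step length of 1, checks only other bounding boxes (dividing lines are omitted for brevity), and stops a side after a hypothetical `max_steps` limit when nothing is found. The names `Box`, `blocked`, `grow`, and `expand_all` are hypothetical.

```python
from typing import List, NamedTuple

class Box(NamedTuple):      # bounding box (x, y, w, h); y is assumed to grow upward
    x: int
    y: int
    w: int
    h: int

SIDES = ("up", "down", "left", "right")

def blocked(a: Box, others: List[Box], side: str) -> bool:
    """True if the given side of a already touches another box."""
    for b in others:
        over_x = a.x + a.w > b.x and a.x < b.x + b.w
        over_y = a.y + a.h > b.y and a.y < b.y + b.h
        if side == "up" and a.y + a.h == b.y and over_x:
            return True
        if side == "down" and a.y == b.y + b.h and over_x:
            return True
        if side == "left" and a.x == b.x + b.w and over_y:
            return True
        if side == "right" and a.x + a.w == b.x and over_y:
            return True
    return False

def grow(a: Box, side: str) -> Box:
    """Move one boundary outward by one unit, as in the text."""
    if side == "up":
        return Box(a.x, a.y, a.w, a.h + 1)
    if side == "down":
        return Box(a.x, a.y - 1, a.w, a.h + 1)
    if side == "left":
        return Box(a.x - 1, a.y, a.w + 1, a.h)
    return Box(a.x, a.y, a.w + 1, a.h)

def expand_all(boxes: List[Box], max_steps: int = 50) -> List[Box]:
    """Expand every box side by side until each side is blocked or has
    moved max_steps units (assumed cut-off for sides that find nothing)."""
    boxes = list(boxes)
    moved = [{s: 0 for s in SIDES} for _ in boxes]
    changed = True
    while changed:
        changed = False
        for i in range(len(boxes)):
            others = boxes[:i] + boxes[i + 1:]
            for side in SIDES:
                if moved[i][side] >= max_steps or blocked(boxes[i], others, side):
                    continue
                boxes[i] = grow(boxes[i], side)
                moved[i][side] += 1
                changed = True
    return boxes
```

Boxes are updated in place, one unit at a time, so that two approaching boundaries meet exactly instead of passing each other.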
The information of the extended bounding box may be expressed as follows:
{"id": "4", "text": "rank", "rect": {"x": 186, "y": 726, "w": 30, "h": 10}, "rect_explore": {"x": 168, "y": 719, "w": 65, "h": 30}},
{"id": "5", "text": "sexual name", "rect": {"x": 250, "y": 726, "w": 30, "h": 10}, "rect_explore": {"x": 233, "y": 719, "w": 65, "h": 30}},
{"id": "6", "text": "language", "rect": {"x": 304, "y": 726, "w": 30, "h": 10}, "rect_explore": {"x": 292, "y": 719, "w": 65, "h": 30}},
Here, rect represents the coordinate information of the bounding box before expansion, and rect_explore represents the coordinate information of the bounding box after expansion.
For example, expanding the bounding boxes shown in fig. 3 yields the extended bounding boxes shown in fig. 5A. During expansion, for a boundary of a bounding box that has no dividing line or other bounding box to connect with, for example the upper, left, and right boundaries of the bounding box in the first row in fig. 3, expansion stops once the boundary has been expanded by a set distance without finding a dividing line or another bounding box that intersects it. Since the specific content of the text characters is not considered in the table extraction process, the text characters can be removed after the bounding boxes are expanded, leaving only the extended bounding boxes for subsequent operations. For instance, removing the text characters from the table data shown in fig. 5A yields the table data shown in fig. 5B, which consists only of extended bounding boxes.
Step 230: extract candidate cells from the extended bounding boxes according to the row alignment relationship and the column connection relationship among the extended bounding boxes, where the candidate cells comprise extended bounding boxes whose boundaries are aligned in the row direction and which are connected in the column direction.
Specifically, a candidate cell is a cell that is preliminarily determined as likely to be part of a table. The row alignment relationship refers to whether extended bounding boxes are aligned in the row direction, more specifically, whether their upper or lower boundaries are aligned. The column connection relationship refers to whether there is a connection in the column direction, more specifically, whether an extended bounding box is connected with the extended bounding boxes of other rows in the column direction. A table generally consists of a plurality of cells aligned in the row direction and a plurality of cells connected in the column direction. Extracting candidate cells from the extended bounding boxes therefore means extracting the extended bounding boxes that satisfy these characteristics, so that each extracted candidate cell is an extended bounding box whose boundaries are aligned in the row direction and which is connected in the column direction.
In one embodiment of the present application, the process of extracting candidate cells specifically includes: detecting whether an upper boundary of the first extended bounding box and an upper boundary of the second extended bounding box are aligned or detecting whether a lower boundary of the first extended bounding box and a lower boundary of the second extended bounding box are aligned; if the upper boundary of the first extension bounding box is aligned with the upper boundary of the second extension bounding box, or the lower boundary of the first extension bounding box is aligned with the lower boundary of the second extension bounding box, the first extension bounding box and the second extension bounding box are used as effective cells in the same row; detecting whether the effective cell has a connection boundary with effective cells of other rows in the column direction; and if the effective cell has a connection boundary with the effective cells of other rows in the column direction, taking the effective cell as a candidate cell.
Specifically, it is first detected whether the extended bounding boxes are aligned in the row direction. Alignment in the row direction can be determined from one boundary of the extended bounding boxes in the row direction, for example from the upper boundary or from the lower boundary. The alignment standard of the extended bounding boxes within the same row should be the same, and may be either upper boundary alignment or lower boundary alignment. The alignment standards of extended bounding boxes in different rows may be different or the same. Extended bounding boxes that are aligned in the row direction are marked as effective cells, and the column connection relationship of the effective cells is judged next. The effective cells of each row are compared with the effective cells of other rows. If an effective cell is not connected with the effective cells of any other row in the column direction, it is regarded as an isolated cell; an isolated cell does not belong to the table and therefore cannot be used as a candidate cell. Conversely, if an effective cell has a connection boundary with effective cells of other rows in the column direction, it can be used as a candidate cell.
The judgment process for the row alignment relationship is described below with an example. Each extended bounding box obtained in step 220 is scanned to determine whether extended bounding box A has a row alignment relationship with another extended bounding box B, that is, whether the two are connected left and right and their upper or lower boundaries are aligned: (A_{x+w} == B_x) && (A_{y+h} > B_y) && (A_y < B_{y+h}) && ((A_y == B_y) || (A_{y+h} == B_{y+h})), where "||" means "or"; "(A_y == B_y) || (A_{y+h} == B_{y+h})" means that either A_y == B_y or A_{y+h} == B_{y+h} is satisfied, A_y == B_y meaning that the lower boundaries are aligned and A_{y+h} == B_{y+h} meaning that the upper boundaries are aligned. If extended bounding box A and extended bounding box B have a row alignment relationship, they form a row, and both are valid cells. If extended bounding box C does not form a row with any other extended bounding box, C forms a row on its own, and it is then determined whether C has an up-down connection relationship with the valid cells Q of other rows, namely: ((Q_{y+h} == C_y) || (Q_y == C_{y+h})) && (Q_{x+w} > C_x) && (Q_x < C_{x+w}). If the number of valid cells having an up-down connection relationship with extended bounding box C exceeds 2, extended bounding box C is a valid cell; otherwise, extended bounding box C is regarded as an invalid cell. This operation is repeated until every extended bounding box has been judged to be a valid or invalid cell; finally, the invalid cells are removed and the valid cells are retained.
For example, as shown in fig. 5B, extended bounding boxes 1-39 are traversed, and extended bounding boxes that are connected left and right with aligned upper or lower boundaries are grouped into rows. Extended bounding box 1 forms a row on its own. Extended bounding box 2 is row-aligned with extended bounding boxes 4, 5, 6, 7, 8, and 3, and together they form one row. Extended bounding boxes 9 and 11 are connected left and right but not aligned, so each forms a row on its own. Extended bounding boxes 1 and 11, although each forms a row on its own, have an up-down connection relationship with more than two effective cells (2, 4, 5, and so on) and are therefore regarded as effective cells, whereas extended bounding box 9 has an up-down connection relationship with only one effective cell, extended bounding box 2, and is therefore regarded as an invalid cell. By analogy, extended bounding boxes 9, 10, 12, 13, 19, 20, 26, 27, 29, and 30 are all determined to be invalid cells. After screening by the row alignment relationship, the effective cells shown in fig. 5C are obtained.
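The row alignment screening can be sketched in Python as follows. The sketch makes some assumptions that the text leaves open: a stand-alone box is compared only against the cells that were found valid through row alignment, in a single pass, and "exceeds 2" is exposed as a hypothetical parameter `link_threshold`. The names `Box`, `row_aligned`, `vertically_connected`, and `screen_valid` are hypothetical.

```python
from typing import List, NamedTuple

class Box(NamedTuple):      # extended bounding box (x, y, w, h); y grows upward
    x: float
    y: float
    w: float
    h: float

def row_aligned(a: Box, b: Box) -> bool:
    """b is connected to the right of a and shares its lower or upper boundary."""
    return (a.x + a.w == b.x
            and a.y + a.h > b.y and a.y < b.y + b.h
            and (a.y == b.y or a.y + a.h == b.y + b.h))

def vertically_connected(q: Box, c: Box) -> bool:
    """q touches c from above or below and overlaps it horizontally."""
    return ((q.y + q.h == c.y or q.y == c.y + c.h)
            and q.x + q.w > c.x and q.x < c.x + c.w)

def screen_valid(boxes: List[Box], link_threshold: int = 2) -> List[Box]:
    # Boxes that take part in at least one row-aligned pair are valid.
    in_row = set()
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i != j and row_aligned(a, b):
                in_row.update((i, j))
    valid = set(in_row)
    # A stand-alone box is kept only if it is vertically connected with
    # more than link_threshold valid cells.
    for k, c in enumerate(boxes):
        if k in in_row:
            continue
        links = sum(1 for i in in_row if vertically_connected(boxes[i], c))
        if links > link_threshold:
            valid.add(k)
    return [b for k, b in enumerate(boxes) if k in valid]
```

In this sketch a wide header box spanning three aligned cells survives the screening, while an isolated box with no vertical neighbours is discarded.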
In one embodiment of the present application, after the valid cells are screened out by the row alignment relationship, the first row of the table may also be detected according to the number of valid cells. A table generally has a certain number of rows and columns, so the first row of a table typically contains a plurality of cells. The process of determining the first row specifically includes: sorting the rows formed by the effective cells from small to large according to the row boundary coordinates of each row; detecting whether the number of effective cells contained in the first row reaches a preset threshold; if the number of effective cells contained in the first row reaches the preset threshold, executing the step of detecting whether the effective cells have a connection boundary with effective cells of other rows in the column direction; if the number of effective cells contained in the first row does not reach the preset threshold, removing the effective cells contained in the first row, taking the next row as the first row, and returning to the step of detecting whether the number of effective cells contained in the first row reaches the preset threshold. Specifically, when the rows are sorted from small to large by row boundary coordinates, they may be sorted uniformly by upper boundary coordinates or uniformly by lower boundary coordinates. If the number of effective cells in the first row is smaller than the preset threshold, that row is not regarded as the real first row; it is removed, and the number of effective cells in the next row is judged, until a first row whose number of effective cells reaches the preset threshold is found. For example, for the table data shown in fig. 5C, assuming that the preset threshold is 3, the first row, which consists only of effective cell 1, does not satisfy the condition of at least three cells, so effective cell 1 is not regarded as the starting row of the table. The second row, formed by effective cells 2, 4, 5, 6, 7, 8, and 3, satisfies the condition that the number of effective cells reaches 3, so effective cell 1 is removed, and the table data shown in fig. 5D is obtained.
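The first-row detection can be sketched as below. The representation is an assumption made for illustration: each row is a pair of a row boundary coordinate and the list of cell ids in that row, sorted from small to large as the text describes. The function name `drop_leading_rows` and the sample coordinates are hypothetical.

```python
from typing import List, Tuple

Row = Tuple[float, List[str]]   # (row boundary coordinate, cell ids in the row)

def drop_leading_rows(rows: List[Row], threshold: int = 3) -> List[Row]:
    """Sort rows by boundary coordinate and discard leading rows until one
    contains at least `threshold` effective cells."""
    ordered = sorted(rows, key=lambda row: row[0])
    start = 0
    while start < len(ordered) and len(ordered[start][1]) < threshold:
        start += 1
    return ordered[start:]

# Rows loosely modelled on fig. 5C; the coordinates are made up.
rows = [
    (20.0, ["2", "4", "5", "6", "7", "8", "3"]),
    (10.0, ["1"]),
    (30.0, ["14", "15", "16"]),
]
table_rows = drop_leading_rows(rows, threshold=3)
```

With a threshold of 3, the single-cell row is dropped and the seven-cell row becomes the first row of the table.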
In one embodiment of the application, column connection screening continues after the row alignment screening or the effective cell count screening. The column connection judgment process is described below with an example. For the effective cells screened out in the foregoing process, each effective cell A in each row is scanned to check whether it has an up-down connection relationship with any effective cell B in another row, namely: ((A_{y+h} == B_y) || (A_y == B_{y+h})) && (A_{x+w} > B_x) && (A_x < B_{x+w}). When this condition is met, effective cell A has an up-down connection relationship with effective cell B of another row in the column direction, and effective cell A can be used as a candidate cell; otherwise, effective cell A cannot be used as a candidate cell and should be removed. This operation is repeated until all candidate cells are found. For example, in the table data shown in fig. 5D, effective cells 2 and 3 form one row together with effective cells 4, 5, 6, 7, and 8, but have no up-down connection relationship with effective cells of other rows in the column direction, so effective cells 2 and 3 are not candidate cells; removing them yields the table data shown in fig. 5E.
Step 240: for the candidate table formed by the candidate cells, cut the candidate table by taking the column boundary lines on both sides of the row with the minimum width in the candidate table as cutting lines, and take the candidate cells located between the two cutting lines as target cells.
Specifically, the foregoing steps in effect handle the regularity of the table in the row direction, and this step handles the regularity of the table in the column direction by cutting. The candidate cells obtained through the foregoing screening may constitute a candidate table. Considering that every row of a table should have the same width, the width of the candidate table is based on the width of its narrowest row. The column boundary lines on both sides of the row with the minimum width in the candidate table are used as cutting lines to cut the candidate table, so that the distance between the two cutting lines is the width of the table. Candidate cells outside the cutting lines are not target cells, and candidate cells between the two cutting lines are target cells.
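The column connection screening can be sketched as follows, assuming the effective cells have already been grouped into rows. The names `Box`, `vertically_connected`, and `column_screen` are hypothetical, and the y axis is assumed to grow upward.

```python
from typing import List, NamedTuple

class Box(NamedTuple):      # effective cell (x, y, w, h); y grows upward
    x: float
    y: float
    w: float
    h: float

def vertically_connected(a: Box, b: Box) -> bool:
    """a touches b from above or below and overlaps it horizontally."""
    return ((a.y + a.h == b.y or a.y == b.y + b.h)
            and a.x + a.w > b.x and a.x < b.x + b.w)

def column_screen(rows: List[List[Box]]) -> List[List[Box]]:
    """Keep only the cells that touch at least one cell of another row."""
    kept = []
    for i, row in enumerate(rows):
        others = [b for j, r in enumerate(rows) if j != i for b in r]
        kept.append([a for a in row
                     if any(vertically_connected(a, b) for b in others)])
    return kept
```

A cell that sticks out beyond the rows above and below it has no vertical neighbour and is removed, which mirrors the removal of cells 2 and 3 in the fig. 5D example.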
In one embodiment of the present application, when dicing is performed, there may be a case where a dicing line passes through a candidate cell, and at this time, whether the candidate cell is outside or inside the dicing line is determined by the bounding box corresponding to the candidate cell and the position of the dicing line. When the bounding box corresponding to the candidate cell is outside the cutting line or is penetrated by the cutting line, the candidate cell is considered to be outside the cutting line, so the candidate cell is not the target cell; when the bounding box corresponding to the candidate cell is inside the cutting line and is not penetrated by the cutting line, the candidate cell is considered to be the target cell.
In one embodiment of the present application, when the bounding box corresponding to the candidate cell is inside the scribe line and is not penetrated by the scribe line, but the candidate cell is penetrated by the scribe line, the portion of the candidate cell outside the scribe line may be removed at this time, and the scribe line is used as a new boundary of the candidate cell to generate the target cell.
In one embodiment of the present application, after the candidate cells are obtained, the candidate cells may be arranged irregularly: a candidate cell may differ in height from the other candidate cells in the same row, so that gaps exist between a candidate cell of lower height and the candidate cells of other rows. This may harm the appearance of the subsequently extracted table and may affect the accuracy of table extraction. The candidate cells may therefore be further expanded. Specifically, the boundaries of the candidate cells that are not connected with other candidate cells are expanded until they connect with the boundaries of bounding boxes near the candidate cells, obtaining extended candidate cells. Then, for the extended candidate table formed by the extended candidate cells, the extended candidate table is cut by taking the column boundary lines on both sides of the row with the minimum width in the extended candidate table as cutting lines, and the extended candidate cells that are not cut by the cutting lines are taken as target cells. When expanding the candidate cells, it is not necessary to expand all boundaries of a candidate cell; only the boundaries not connected with other candidate cells are expanded. The expansion terminates when the boundary connects with the boundary of a bounding box near the candidate cell, and the expansion mode is the same as in step 220, that is, expansion proceeds by a set step length until the termination condition is reached. Expansion enlarges the area of the candidate cells, so that the effective table area is retained to the greatest extent, improving the appearance and accuracy of the extracted table. For example, the cells included in the table data shown in fig. 5E are candidate cells; the bounding boxes near the candidate cells are marked as shown in fig. 5F, and the candidate cells are expanded with the boundaries of the nearby bounding boxes as the termination condition, obtaining the table data shown in fig. 5G.
In one embodiment of the present application, the candidate cells after expansion are denoted as extended candidate cells, and the extended candidate table is cut in the same manner as the candidate table, that is: the row with the minimum width in the extended candidate table is obtained (the width of a row is the sum of the widths of all cells in the row); the x coordinate of the left boundary of that row (i.e., the minimum x value of the cell coordinates in the row) is taken as the first cutting line X_left, and the x coordinate of its right boundary (i.e., the maximum x value of the cell coordinates in the row) is taken as the second cutting line X_right. Then each extended candidate cell A in each row is scanned, and the bounding box a(x, y, w, h) of extended candidate cell A is obtained. If bounding box a lies outside the cutting lines, that is, to the left of the first cutting line X_left or to the right of the second cutting line X_right, namely (a_x > X_right) || (a_{x+w} < X_left), or bounding box a is cut by a cutting line, namely (a_{x+w} > X_right) || (a_x < X_left), it is determined that cell A is not a target cell. In other words, if the left column boundary of the bounding box corresponding to an extended candidate cell in the extended candidate table is located on the right side of the first cutting line and the right column boundary of that bounding box is located on the left side of the second cutting line, the extended candidate cell is determined not to be cut and is a target cell. The above operation is repeated until all cells have been judged; the non-target cells are removed, leaving all the target cells. For example, cutting the table data shown in fig. 5G yields the table data shown in fig. 5H.
Step 250: generate the target table information extracted from the PDF file to be processed according to the target cells.
Specifically, the table formed by the target cells is the target table, and the target table combined with the character information it contains is the target table information.
In one embodiment of the present application, considering that the shape of the table may not be regular after the candidate table is cut, the target cell may be further expanded, that is: expanding the boundaries of the target cells that are not aligned with other target cells to align the boundaries of the target cells with the boundaries of other target cells; and then generating target table information extracted from the PDF file to be processed according to the expanded target cells. The expansion process is similar to the expansion process involved in the previous step, and will not be described in detail here. Illustratively, as shown in the table data of fig. 5H, the target cells 32 and 36 do not coincide with the table boundaries, resulting in irregular tables, and the target table shown in fig. 5I is obtained by expansion. For the target table shown in fig. 5I, the character information contained therein is combined to obtain the target table information shown in fig. 5J. It can be seen that the expansion step in the embodiment of the application can be performed at any time when needed, and can be performed for multiple times, so that the extraction of the table is more accurate.
In the technical scheme provided by the embodiment of the application, the PDF file to be processed is first parsed to obtain the bounding boxes of the connected characters in the PDF file to be processed; the bounding boxes are then expanded so that the boundaries of adjacent bounding boxes coincide, obtaining extended bounding boxes; candidate cells are extracted from the extended bounding boxes according to the row alignment relationship and the column connection relationship among them, the candidate cells comprising extended bounding boxes whose boundaries are aligned in the row direction and which are connected in the column direction; then, for the candidate table formed by the candidate cells, the candidate table is cut by taking the column boundary lines on both sides of the row with the minimum width as cutting lines, and the candidate cells located between the two cutting lines are taken as target cells; finally, the target table information extracted from the PDF file to be processed is generated according to the target cells. Starting from the bounding boxes of the characters and without relying on table ruling lines, the scheme finds the target cells that constitute the table through bounding box expansion, row alignment, column connection, and table cutting, so that table information can be extracted accurately even when the table borders are hidden or the cells in the table are misaligned.
In one embodiment of the present application, considering that one PDF file may contain a plurality of tables, after the bounding box is expanded, clustering may be performed on the expanded bounding box to obtain a plurality of class clusters, where one class cluster corresponds to one table; then, for the extended bounding box included in each class cluster, the operations of step 230 and thereafter of the present application are performed to extract information of each target table.
In one embodiment of the present application, FIG. 6 schematically illustrates a flow chart of a clustering process provided by one embodiment of the present application. As shown in fig. 6, the clustering process includes:
S601, extracting the center point of each extended bounding box as a sample point.
S602, scanning any unlabeled sample point P.
S603, judging whether the number of sample points in the radius R range of the P point is larger than N. If so, S605 is entered; if not, S604 is entered.
S604, marking the P point as a noise point, and returning to S603. A noise point is a sample point that is not one of the sample points making up a table.
S605, marking the P point as a core point, establishing a new class cluster C, and adding the points Q within radius R of the P point to the new class cluster. That is, the sample points whose distance from the P point is within radius R are classified into the class cluster C corresponding to the P point.
S606, scanning unlabeled points in the radius R range of the point Q.
S607, judging whether the number of sample points in the radius R range of the Q point is larger than N. If so, S609 is entered; if not, the process proceeds to S608.
S608, marking the Q point as a boundary point, and returning to S607.
S609, marking the Q point as a core point, and adding the core point into the cluster C.
S610, judging whether all sample points have been marked. If yes, all sample points have been clustered, and the clustering flow ends; if not, returning to S602 to continue clustering the unmarked points.
For example, the table data shown in fig. 7A is formed by extended bounding boxes; after clustering, the two pieces of table data shown in fig. 7B are obtained. The operations of step 230 and thereafter are then performed on each piece of table data.
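The flow S601 to S610 resembles density-based clustering. A minimal Python sketch under that reading follows. It is an assumption-laden illustration rather than the application's implementation: a point counts as a core point when more than `min_pts` other points lie within `radius`, noise is labelled -1, and a noise point later reached from a core point is relabelled as a boundary member of that cluster. The names `center` and `cluster_centers` are hypothetical.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
NOISE = -1

def center(x: float, y: float, w: float, h: float) -> Point:
    """Center point of an extended bounding box (S601)."""
    return (x + w / 2, y + h / 2)

def cluster_centers(points: List[Point], radius: float, min_pts: int) -> List[int]:
    """Return one cluster label per point; NOISE marks noise points."""
    labels: List[Optional[int]] = [None] * len(points)

    def neighbors(i: int) -> List[int]:
        return [j for j, q in enumerate(points)
                if j != i and math.dist(points[i], q) <= radius]

    cluster = 0
    for p in range(len(points)):
        if labels[p] is not None:          # S602: only unlabeled points
            continue
        near = neighbors(p)
        if len(near) <= min_pts:           # S603/S604: not dense enough
            labels[p] = NOISE
            continue
        labels[p] = cluster                # S605: new cluster from a core point
        queue = list(near)
        while queue:                       # S606 to S609: grow the cluster
            q = queue.pop()
            if labels[q] == NOISE:
                labels[q] = cluster        # boundary point of this cluster
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster
            q_near = neighbors(q)
            if len(q_near) > min_pts:      # q is itself a core point
                queue.extend(q_near)
        cluster += 1
    return [NOISE if label is None else label for label in labels]
```

Each resulting cluster label would correspond to one table, and the boxes whose centers are labelled as noise would be left out of step 230.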
In one embodiment of the present application, in some tables the text characters in a cell may wrap across lines. In that case, parsing yields multiple bounding boxes and expansion yields multiple extended bounding boxes that in fact correspond to a single cell in the table, so these extended bounding boxes need to be merged. The process specifically comprises the following steps: extracting a bounding box to be detected from the table formed by the extended bounding boxes, and acquiring the extended bounding boxes connected to the right side of the bounding box to be detected, where the bounding box to be detected starts from the first extended bounding box in the table; if one extended bounding box is connected to the right side of the bounding box to be detected, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected; if multiple extended bounding boxes are connected to the right side of the bounding box to be detected, performing a merging operation on the multiple extended bounding boxes, the merging operation being used to merge the multiple extended bounding boxes into one extended bounding box; if the multiple extended bounding boxes are merged successfully, retaining the merged extended bounding box corresponding to the multiple extended bounding boxes, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected; if merging of the multiple extended bounding boxes fails, keeping the multiple extended bounding boxes unchanged, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected.
Specifically, detection is performed for each extended bounding box. And acquiring the number of the extension bounding boxes connected to the right side of the bounding box to be detected, and when the right boundary of the extension bounding box A is coincident with the left boundary of the extension bounding box B, then the extension bounding box B is called as the extension bounding box connected to the right side of the extension bounding box A. When the number of extension bounding boxes connected to the right side of the bounding box to be detected is multiple, in this case, at least one extension bounding box is smaller than the height of the bounding box to be detected, and it is considered that the multiple extension bounding boxes connected to the right side may be caused by line feed in the same cell, the multiple extension bounding boxes need to be combined, the multiple extension bounding boxes are combined into one extension bounding box, and the subsequent processing is performed based on the combined extension bounding boxes, without processing the multiple extension bounding boxes before combination. If the merging is successful, continuing to detect the next unprocessed extension bounding box; if the merging fails, the method indicates that the plurality of extension bounding boxes may not be caused by line feed, and at this time, the merging operation on the plurality of extension bounding boxes is not needed to be continued, the plurality of extension bounding boxes are kept unchanged, and the detection on the next unprocessed extension bounding box is continued.
In one embodiment of the present application, merging multiple extended bounding boxes means taking the outermost boundaries of the multiple extended bounding boxes as the boundaries of the merged extended bounding box, which is equivalent to finding the smallest bounding box that encloses all of the multiple extended bounding boxes and using it as the merged extended bounding box.
In one embodiment of the present application, whether the multiple extended bounding boxes are merged successfully may be determined from the relationship between the bounding box to be detected and the result of merging the bounding boxes that correspond to the multiple extended bounding boxes. That is, the bounding boxes corresponding to the multiple extended bounding boxes are merged, and in the column direction, when the y coordinates of the merged bounding box lie within the y coordinate range of the bounding box to be detected, namely when the upper boundary of the merged bounding box is lower than the upper boundary of the bounding box to be detected and the lower boundary of the merged bounding box is higher than the lower boundary of the bounding box to be detected, the merging is considered successful; otherwise, the merging is considered failed.
The following describes the merging procedure with a specific example. With the extended bounding box A as the bounding box to be detected, the extended bounding boxes B and C connected to the right side of A, and the extended bounding box D connected to the right side of B and C, are acquired recursively (that is, right-connected extended bounding boxes are acquired repeatedly until no further extended bounding box is connected on the right). Right-side connection is judged by: (A_x + A_w == B_x) && (A_y + A_h > B_y) && (A_y < B_y + B_h). The extended bounding boxes A, B, C and D are then scanned. If only the extended bounding box B is connected to the right side of A, B is taken as the next bounding box to be detected, and the extended bounding boxes connected to the right side of B are judged in turn. If multiple extended bounding boxes, B and C, are connected to the right side of A and their widths are equal (that is, B_w == C_w), an attempt is made to merge B and C into one merged extended bounding box E: E_x = min(B_x, C_x), E_y = min(B_y, C_y), E_w = max(B_x + B_w, C_x + C_w) - E_x, E_h = max(B_y + B_h, C_y + C_h) - E_y. After merging, the bounding box e corresponding to E is acquired (obtained by merging the bounding boxes of the extended bounding boxes B and C), and it is judged whether e lies within the Y-axis range of the cell of the extended bounding box A, namely: (e_y > A_y) && (e_y + e_h < A_y + A_h). If e lies within this range, the merging succeeds, and processing continues with the next bounding box to be detected (the extended bounding box D); otherwise, the merging fails, the merged extended bounding box E is invalid, and the original extended bounding boxes B and C are retained.
Further, it is also necessary to judge whether the merged extended bounding box E intersects a horizontal dividing line M (a dividing line obtained by parsing), namely: (M_x2 >= E_x) && (M_x1 <= E_x + E_w) && (M_y >= E_y) && (M_y <= E_y + E_h). If they intersect, the merging also fails. If they do not intersect, the merging is deemed successful, and the extended bounding boxes B and C are replaced with the merged extended bounding box E. The foregoing operations are repeated until no extended bounding boxes satisfying the merging rule can be found. For example, in the table data shown in fig. 8A, the content of the number column wraps onto a second line because of its length, so two independent extended bounding boxes are formed after the bounding box expansion processing; using the merging rule, the two can be recombined into a new extended bounding box. When the extended bounding box "Zhang san" is scanned, two extended bounding boxes, "20230001" and "001024", are found connected to its right side. An attempt is made to merge them into one cell, and it is detected whether the smallest bounding box enclosing the characters "20230001" and "001024" lies within the height range of the cell "Zhang san". If so, "20230001" and "001024" are merged into one cell. After the merging processing, the table data shown in fig. 8B is obtained.
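The merging rule of this embodiment can be illustrated with a minimal Python sketch. It is not the implementation of the application: the names `Box`, `HLine`, `right_connected`, `enclose`, `line_crosses` and `try_merge` are hypothetical, boxes are assumed to be stored as (x, y, w, h) with y growing downward, and the coordinates are invented to imitate the "Zhang san" example.

```python
from typing import NamedTuple, Optional, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


class HLine(NamedTuple):
    x1: float  # left end of a horizontal dividing line
    x2: float  # right end
    y: float   # vertical position


def right_connected(a: Box, b: Box) -> bool:
    # b starts exactly where a ends, and their vertical ranges overlap.
    return a.x + a.w == b.x and a.y + a.h > b.y and a.y < b.y + b.h


def enclose(boxes: Sequence[Box]) -> Box:
    # Smallest box that encloses every box in the sequence.
    x = min(b.x for b in boxes)
    y = min(b.y for b in boxes)
    right = max(b.x + b.w for b in boxes)
    bottom = max(b.y + b.h for b in boxes)
    return Box(x, y, right - x, bottom - y)


def line_crosses(m: HLine, e: Box) -> bool:
    # Same test as in the text; a line lying on the border also counts.
    return m.x2 >= e.x and m.x1 <= e.x + e.w and e.y <= m.y <= e.y + e.h


def try_merge(a: Box, ext: Sequence[Box], text: Sequence[Box],
              h_lines: Sequence[HLine]) -> Optional[Box]:
    """Return the merged extended box, or None when the merge fails.

    ext  : extended boxes connected to the right side of a
    text : the unexpanded character boxes belonging to ext
    """
    if len({b.w for b in ext}) != 1:       # widths must be equal
        return None
    merged_ext = enclose(ext)
    merged_text = enclose(text)
    inside = (merged_text.y > a.y
              and merged_text.y + merged_text.h < a.y + a.h)
    if not inside:
        return None
    if any(line_crosses(m, merged_ext) for m in h_lines):
        return None
    return merged_ext


# Hypothetical coordinates: a name cell with a wrapped number to its right.
name_cell = Box(0, 0, 50, 20)
ext = [Box(50, 0, 60, 10), Box(50, 10, 60, 10)]
text = [Box(55, 2, 40, 6), Box(55, 12, 30, 6)]

merged = try_merge(name_cell, ext, text, [])
blocked = try_merge(name_cell, ext, text, [HLine(50, 110, 10)])
```

In this sketch the first call yields one box covering both lines of the number, while the second call fails because a dividing line runs between the two lines, which suggests they are separate cells rather than a wrapped one.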
In one embodiment of the present application, the characters in different cells of a table may use different alignment modes, which can leave the bounding boxes of different runs of connected characters irregularly arranged, so that the extended bounding boxes are not regularly aligned. Therefore, after the extended bounding boxes are obtained, alignment adjustment may be performed on them so that they are regularly aligned. It should be noted that, in this embodiment, alignment refers to alignment in the row direction. The process specifically comprises the following steps: extracting a bounding box to be adjusted from the table formed by the extended bounding boxes, and recursively acquiring the extended bounding boxes connected to the right side of the bounding box to be adjusted, wherein the bounding box to be adjusted starts from the first extended bounding box in the table; detecting whether, among the bounding box to be adjusted and the extended bounding boxes connected to its right side, there are multiple extended bounding boxes whose row boundaries are aligned; if such aligned extended bounding boxes exist, adjusting the row boundaries of the remaining, misaligned extended bounding boxes with the aligned row boundaries as the reference, so that the row boundaries of the bounding box to be adjusted and of the extended bounding boxes connected to its right side are all aligned; if no such aligned extended bounding boxes exist, adjusting the row boundaries of the extended bounding boxes connected to the right side of the bounding box to be adjusted with the row boundaries of the bounding box to be adjusted as the reference, so that the row boundaries of the bounding box to be adjusted and of the extended bounding boxes connected to its right side are all aligned; and taking the next unprocessed extended bounding box as the bounding box to be adjusted, and returning to the step of recursively acquiring the extended bounding boxes connected to the right side of the bounding box to be adjusted.
First, a bounding box to be adjusted is extracted from the table formed by the extended bounding boxes; when processing starts, the bounding box to be adjusted is the first extended bounding box in the table. The extended bounding boxes connected to the right side of the bounding box to be adjusted are then acquired recursively, which means that, starting from the bounding box to be adjusted, right-connected extended bounding boxes are acquired repeatedly until no further one can be found. For example, assume that the extended bounding boxes A, B, C and D are connected in sequence from left to right, and that the bounding box to be adjusted is A. When the extended bounding boxes connected to the right side of A are acquired recursively, B is acquired first; C is then detected on the right side of B and acquired; D is acquired next; and since no further extended bounding box is connected to the right side of D, the extended bounding boxes connected to the right side of A are B, C and D. Right-side connection is judged in the same way as in the cell merging embodiment and is not described again here. Recursive acquisition yields all the extended bounding boxes of the row that begins with the bounding box to be adjusted; alignment adjustment is then performed on all of them, so that the extended bounding boxes of each row are regularly aligned, which facilitates subsequent processing.
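As an illustration of the recursive acquisition, the following Python sketch collects every extended bounding box reachable from a starting box through right-side connections. The names `Box`, `right_connected` and `collect_right` are hypothetical, the coordinates are invented, and boundaries are assumed to coincide exactly after expansion (an implementation working with floating-point PDF coordinates would probably need a tolerance).

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


def right_connected(a: Box, b: Box) -> bool:
    # Right-side connection test from the cell merging embodiment.
    return a.x + a.w == b.x and a.y + a.h > b.y and a.y < b.y + b.h


def collect_right(start: Box, boxes: Sequence[Box]) -> List[Box]:
    """Follow right-side connections from `start` until none remain."""
    found: List[Box] = []
    frontier = [start]
    while frontier:
        current = frontier.pop(0)
        for b in boxes:
            if b != start and b not in found and right_connected(current, b):
                found.append(b)
                frontier.append(b)
    return found


# Four boxes connected in sequence, plus one box from the next row.
A = Box(0, 0, 10, 10)
B = Box(10, 0, 10, 10)
C = Box(20, 0, 10, 10)
D = Box(30, 0, 10, 10)
next_row = Box(0, 10, 10, 10)

row = collect_right(A, [A, B, C, D, next_row])
```

Here `row` contains B, C and D in left-to-right order, and the box from the next row is left out because nothing is connected to it from the left.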
Next, it is detected whether, among the bounding box to be adjusted and the extended bounding boxes connected to its right side, there are multiple extended bounding boxes whose row boundaries are aligned. If so, the aligned extended bounding boxes serve as the alignment reference, and the other, unaligned extended bounding boxes are adjusted to the aligned row boundaries, so that all the extended bounding boxes of the row are regularly aligned. If no aligned extended bounding boxes exist, the first extended bounding box of the row, that is, the bounding box to be adjusted, is taken as the reference, and the row boundaries of the other extended bounding boxes are aligned with its row boundaries.
After one row of extension bounding boxes is processed, the next unprocessed extension bounding box is taken as a bounding box to be adjusted, and the operation is repeated to align and adjust the extension bounding box of the next row.
The following describes the alignment adjustment with a specific example. As shown in fig. 9, the first extended bounding box A on the left side of each row is acquired as the bounding box to be adjusted, and the extended bounding boxes B and C connected to the right side of A, and the extended bounding box D connected to the right side of B and C, are acquired recursively. The extended bounding boxes A, B, C and D are scanned to detect whether aligned extended bounding boxes exist among them. If the extended bounding box B is not aligned, that is, (A_y ≠ B_y) || (A_y + A_h ≠ B_y + B_h), and the extended bounding boxes A and D already have an alignment relationship, that is, (A_y == D_y) || (A_y + A_h == D_y + D_h), then, with A and D as the alignment reference, an attempt is made to adjust B so that B_y = A_y and B_y + B_h = A_y + A_h. If B is not aligned, that is, (A_y ≠ B_y) || (A_y + A_h ≠ B_y + B_h), and A and D have no alignment relationship either, the extended bounding boxes B, C and D are adjusted with the first extended bounding box A as the reference.
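The choice of the alignment reference can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `Box`, `aligned`, `pick_reference` and `align_row` are hypothetical, two boxes are treated as aligned only when both their upper and lower boundaries coincide (stricter than the "||" form written in the example), and the check of whether an adjustment is valid is left out.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


def aligned(p: Box, q: Box) -> bool:
    # Assumption: aligned means both row boundaries coincide.
    return p.y == q.y and p.y + p.h == q.y + q.h


def pick_reference(row: Sequence[Box]) -> Box:
    """Prefer a box aligned with a later box; else use the first box."""
    for i, p in enumerate(row):
        if any(aligned(p, q) for q in row[i + 1:]):
            return p
    return row[0]


def align_row(row: Sequence[Box]) -> List[Box]:
    # Give every box of the row the reference's upper boundary and height.
    ref = pick_reference(row)
    return [Box(b.x, ref.y, b.w, ref.h) for b in row]


# A and D are aligned, B and C are not: A (with D) becomes the reference.
row_1 = [Box(0, 0, 10, 10), Box(10, 2, 10, 6),
         Box(20, 1, 10, 8), Box(30, 0, 10, 10)]
adjusted_1 = align_row(row_1)

# The first box is the odd one out: the aligned pair is the reference.
row_2 = [Box(0, 3, 10, 5), Box(10, 0, 10, 10), Box(20, 0, 10, 10)]
reference_2 = pick_reference(row_2)

# No aligned pair at all: fall back to the first box of the row.
row_3 = [Box(0, 0, 10, 10), Box(10, 2, 10, 6)]
reference_3 = pick_reference(row_3)
```

Only the row boundaries are changed by `align_row`; the left boundary and width of each box are kept, since the alignment in this embodiment concerns the row direction only.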
It is then judged whether the adjustment of the extended bounding box is successful: if the adjusted extended bounding box does not intersect any other extended bounding box or dividing line, the adjustment succeeds; otherwise, the adjustment fails. Taking the adjusted extended bounding box B as an example, if B intersects neither the minimum bounding box c of another extended bounding box C nor the dividing lines M and N, the adjustment of B is successful. The intersection of the extended bounding box B and the bounding box c is judged by: (c_x + c_w > B_x) && (c_y + c_h > B_y) && (c_x < B_x + B_w) && (c_y < B_y + B_h). The intersection of the extended bounding box B and a horizontal dividing line M is judged by: (M_x2 >= B_x) && (M_x1 <= B_x + B_w) && (M_y >= B_y) && (M_y <= B_y + B_h). The intersection of the extended bounding box B and a vertical dividing line N is judged by: (N_y2 >= B_y) && (N_y1 <= B_y + B_h) && (N_x >= B_x) && (N_x <= B_x + B_w). When one of the above formulas holds, the two objects concerned intersect.
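The three intersection tests can be written compactly in Python, as in the sketch below. It is an illustration, not the implementation of the application: the names `Box`, `Segment`, `boxes_overlap`, `segment_hits_box` and `adjustment_ok` are hypothetical. Dividing lines are modeled as axis-aligned segments, which shows that the horizontal and vertical line tests in the text are the same range-overlap test.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


class Segment(NamedTuple):
    # Axis-aligned dividing line with x1 <= x2 and y1 <= y2.
    # Horizontal line M: y1 == y2. Vertical line N: x1 == x2.
    x1: float
    y1: float
    x2: float
    y2: float


def boxes_overlap(c: Box, b: Box) -> bool:
    # Strict overlap: boxes that merely share a border do not intersect.
    return (c.x + c.w > b.x and c.y + c.h > b.y
            and c.x < b.x + b.w and c.y < b.y + b.h)


def segment_hits_box(s: Segment, b: Box) -> bool:
    # Non-strict, as in the text: a line on the border also counts.
    return (s.x2 >= b.x and s.x1 <= b.x + b.w
            and s.y2 >= b.y and s.y1 <= b.y + b.h)


def adjustment_ok(adjusted: Box, other_text_boxes: Sequence[Box],
                  lines: Sequence[Segment]) -> bool:
    """True when the adjusted box hits no other text box and no line."""
    return (not any(boxes_overlap(c, adjusted) for c in other_text_boxes)
            and not any(segment_hits_box(s, adjusted) for s in lines))


# Hypothetical coordinates for an adjusted box B.
B_adjusted = Box(10, 0, 10, 10)
c_below = Box(12, 11, 5, 4)        # text box entirely below B
c_inside = Box(12, 8, 5, 4)        # text box reaching into B
N = Segment(15, -5, 15, 5)         # vertical line passing through B

ok = adjustment_ok(B_adjusted, [c_below], [])
hits_text = adjustment_ok(B_adjusted, [c_inside], [])
hits_line = adjustment_ok(B_adjusted, [c_below], [N])
```

In this example only the first call reports a successful adjustment; the other two fail because of the overlapping text box and the crossing vertical line respectively.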
The above steps are repeated until no extended bounding box can be adjusted further. When an extended bounding box is adjusted, the boundaries of the other extended bounding boxes connected to it are determined from its adjusted boundaries. For example, for the extended bounding box B described above, the lower boundary of B is the upper boundary of the extended bounding box C; after B is adjusted, the upper boundary of C, which is connected to B, is changed to the lower boundary of the adjusted B.
Illustratively, the table data shown in fig. 10A is subjected to the alignment adjustment processing, and the table data shown in fig. 10B is obtained.
In one embodiment of the present application, after the target table information is determined, it may be output in a readable format. Two output modes are available: one is a CSV table file; the other is formatted data, such as a two-dimensional array or JSON, from which an upper-layer application can extract the information it requires.
By way of example, the target table information may be output as a two-dimensional array structure as shown below:
[ "ranking", "name", "Chinese", "math", "English" ]
[ "Mid-period score" ]
[ "1", "Zhangsan", "100", "99", "98" ]
[ "2", "Lisi", "89", "90", "91" ]
[ "End of period score" ]
[ "1", "Wangwu", "88", "85", "76" ]
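Both output modes can be produced from such a two-dimensional array with the Python standard library, as in the minimal sketch below. The helper names `to_csv` and `to_json` are hypothetical, and only the first rows of the example are used.

```python
import csv
import io
import json
from typing import List

# First rows of the example table; rows may have different lengths.
rows: List[List[str]] = [
    ["ranking", "name", "Chinese", "math", "English"],
    ["Mid-period score"],
    ["1", "Zhangsan", "100", "99", "98"],
]


def to_csv(table: List[List[str]]) -> str:
    # Write the rows to an in-memory buffer in CSV format.
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(table)
    return buf.getvalue()


def to_json(table: List[List[str]]) -> str:
    # ensure_ascii=False keeps non-ASCII cell text readable.
    return json.dumps(table, ensure_ascii=False)


csv_text = to_csv(rows)
json_text = to_json(rows)
```

A row that holds a single merged cell, such as the score heading, simply becomes a one-field line in the CSV text and a one-element array in the JSON text.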
It should be noted that although the steps of the methods of the present application are depicted in the accompanying drawings in a particular order, this does not require or imply that the steps must be performed in that particular order, or that all illustrated steps be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step to perform, and/or one step decomposed into multiple steps to perform, etc.
The following describes an embodiment of the apparatus of the present application, which may be used to perform the method for extracting form information from PDF files in the above-described embodiment of the present application. Fig. 11 schematically shows a block diagram of an apparatus for extracting form information from a PDF file according to an embodiment of the present application. As shown in fig. 11, an apparatus for extracting form information from a PDF file according to an embodiment of the present application includes:
the file analysis module 1110 is configured to analyze a PDF file to be processed to obtain bounding boxes of characters connected to the PDF file to be processed;
the bounding box expansion module 1120 is configured to expand the bounding boxes so that boundaries of adjacent bounding boxes overlap to obtain an expanded bounding box;
a candidate cell extraction module 1130, configured to extract candidate cells from the extended bounding boxes according to a row alignment relationship and a column connection relationship between the extended bounding boxes, where the candidate cells include extended bounding boxes aligned at a boundary in a row direction and connected in a column direction;
A table cutting module 1140, configured to cut the candidate table with column boundary lines on two sides of a row with a minimum width in the candidate table as cutting lines according to the candidate table formed by the candidate cells, and use the candidate cell located between two cutting lines as a target cell;
The table information generating module 1150 is configured to generate target table information extracted from the PDF file to be processed according to the target cell.
In one embodiment of the application, after the PDF file to be processed is analyzed, a parting line contained in the PDF file to be processed is also obtained; the bounding box expansion module 1120 is specifically configured to:
detecting whether 4 boundaries of the bounding box are intersected with boundaries of other bounding boxes or dividing lines in the PDF file to be processed;
if any boundary of the bounding box is intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, stopping expanding the boundary of the bounding box;
If any boundary of the bounding box is not intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed, expanding the boundary of the bounding box in a direction away from the center of the bounding box according to a set step length, and returning to the step of detecting whether 4 boundaries of the bounding box are intersected with the boundary of other bounding boxes or the dividing line in the PDF file to be processed;
And after all 4 boundaries of the bounding box stop expanding, generating an expanded bounding box according to the expanded boundaries.
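The step-wise expansion can be illustrated with the simplified Python sketch below. It departs from the module in ways that should be read as assumptions: the other bounding boxes are treated as fixed obstacles rather than expanding at the same time, dividing lines are left out, and a page rectangle is added so that a boundary with nothing in its way still stops. The names `Box`, `overlaps` and `expand` are hypothetical.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


def overlaps(a: Box, b: Box) -> bool:
    # Strict overlap: touching borders are allowed.
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.h and b.y < a.y + a.h)


def expand(box: Box, obstacles: Sequence[Box], page: Box,
           step: float = 1.0) -> Box:
    """Push each boundary outward by `step` until it would hit something."""
    left, top = box.x, box.y
    right, bottom = box.x + box.w, box.y + box.h
    active = ["left", "top", "right", "bottom"]
    while active:
        for side in list(active):
            l, t, r, b = left, top, right, bottom
            if side == "left":
                l -= step
            elif side == "top":
                t -= step
            elif side == "right":
                r += step
            else:
                b += step
            trial = Box(l, t, r - l, b - t)
            outside = (l < page.x or t < page.y
                       or r > page.x + page.w or b > page.y + page.h)
            if outside or any(overlaps(trial, o) for o in obstacles):
                active.remove(side)      # this boundary stops expanding
            else:
                left, top, right, bottom = l, t, r, b
    return Box(left, top, right - left, bottom - top)


page = Box(0, 0, 100, 50)
text_box = Box(10, 10, 10, 10)
neighbour = Box(40, 0, 20, 50)     # a column of content to the right

expanded = expand(text_box, [neighbour], page)
```

With these invented coordinates the right boundary stops where it meets the neighbouring box, and the other three boundaries stop at the page rectangle.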
In one embodiment of the application, the candidate cell extraction module 1130 is specifically configured to:
Detecting whether an upper boundary of the first extended bounding box and an upper boundary of the second extended bounding box are aligned or detecting whether a lower boundary of the first extended bounding box and a lower boundary of the second extended bounding box are aligned;
If the upper boundary of the first extended bounding box and the upper boundary of the second extended bounding box are aligned, or the lower boundary of the first extended bounding box and the lower boundary of the second extended bounding box are aligned, the first extended bounding box and the second extended bounding box are used as valid cells in the same row;
detecting whether the effective cell has a connection boundary with effective cells of other rows in the column direction;
And if the effective cell has a connection boundary with the effective cells of other rows in the column direction, taking the effective cell as a candidate cell.
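The two-stage filter of this module can be sketched in Python as follows. The names `Box`, `row_aligned`, `column_connected` and `candidate_cells` are hypothetical, and the requirement that column-connected cells overlap horizontally is an assumption added for the sketch; the text only requires a connection boundary in the column direction.

```python
from typing import List, NamedTuple, Sequence


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


def row_aligned(a: Box, b: Box) -> bool:
    # Upper boundaries coincide, or lower boundaries coincide.
    return a.y == b.y or a.y + a.h == b.y + b.h


def column_connected(a: Box, b: Box) -> bool:
    # One box ends vertically where the other begins ...
    touches = a.y + a.h == b.y or b.y + b.h == a.y
    # ... and their horizontal ranges overlap (assumption of this sketch).
    overlaps_x = a.x < b.x + b.w and b.x < a.x + a.w
    return touches and overlaps_x


def candidate_cells(boxes: Sequence[Box]) -> List[Box]:
    # Stage 1: valid cells share a row boundary with another box.
    valid = [a for a in boxes
             if any(a is not b and row_aligned(a, b) for b in boxes)]
    # Stage 2: candidates also connect to a valid cell of another row.
    return [a for a in valid
            if any(a is not b and column_connected(a, b) for b in valid)]


grid = [Box(0, 0, 10, 10), Box(10, 0, 10, 10),
        Box(0, 10, 10, 10), Box(10, 10, 10, 10)]
stray_paragraph = Box(0, 40, 30, 5)

candidates = candidate_cells(grid + [stray_paragraph])
```

The isolated paragraph box is dropped at the first stage because no other box shares a row boundary with it, while the four grid boxes pass both stages.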
In one embodiment of the application, the apparatus further comprises:
The cell number detection module is used for sequencing each row formed by the effective cells from small to large according to row boundary coordinates of each row; detecting whether the number of the effective cells contained in the first row is larger than a preset threshold value; if the number of the effective cells contained in the first row reaches a preset threshold, executing a step of detecting whether a connecting boundary exists between the effective cells and the effective cells of other rows in the column direction; if the number of the effective cells contained in the first row does not reach the preset threshold, eliminating the effective cells contained in the first row, taking the next row as the first row, and returning to the step of detecting whether the number of the effective cells contained in the first row is larger than the preset threshold.
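The row filter performed by the cell number detection module can be sketched as below. The function name `drop_sparse_leading_rows` and the (row coordinate, cells) representation are hypothetical. The text says both "larger than" and "reaches" the threshold; the sketch assumes the latter, so a row is kept when its cell count is at least the threshold.

```python
from typing import List, Tuple

Row = Tuple[float, List[str]]  # (row boundary coordinate, cells in the row)


def drop_sparse_leading_rows(rows: List[Row], threshold: int) -> List[Row]:
    """Sort rows by coordinate, then discard leading rows with too few cells."""
    ordered = sorted(rows, key=lambda r: r[0])
    while ordered and len(ordered[0][1]) < threshold:
        ordered.pop(0)   # e.g. a title line above the real table
    return ordered


rows = [
    (30.0, ["1", "Zhangsan", "100"]),
    (0.0, ["Score table"]),
    (15.0, ["ranking", "name", "Chinese"]),
]

kept = drop_sparse_leading_rows(rows, threshold=2)
```

Only leading rows are removed: once a first row reaches the threshold, later rows are passed on unchanged to the column connection check, which matches the loop described for the module.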
In one embodiment of the present application, the form cutting module 1140 is specifically configured to:
expanding the boundary, which is not connected with other candidate cells, in the candidate cells until the boundary of the candidate cell is connected with the boundary of a bounding box near the candidate cell, so as to obtain an expanded candidate cell;
And according to an expansion candidate table formed by the expansion candidate cells, respectively cutting the expansion candidate table by taking column boundary lines at two sides of a row with the minimum width in the expansion candidate table as cutting lines, and taking the expansion candidate cells which are not cut by the cutting lines as target cells.
In one embodiment of the present application, according to an extended candidate table formed by the extended candidate cells, cutting the extended candidate table with column boundary lines on both sides of a row with a minimum width in the extended candidate table as cutting lines, respectively, includes:
According to an expansion candidate table formed by the expansion candidate cells, taking a left column boundary line of a row with the minimum width in the expansion candidate table as a first cutting line and taking a right column boundary line of the row with the minimum width in the expansion candidate table as a second cutting line;
And if the left column boundary of the bounding box corresponding to the expansion candidate cell in the expansion candidate table is positioned on the right side of the first cutting line and the right column boundary of the bounding box corresponding to the expansion candidate cell is positioned on the left side of the second cutting line, determining that the expansion candidate cell is not cut.
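The cutting step can be illustrated with the following Python sketch. The names `Box`, `Cell`, `row_span` and `target_cells` are hypothetical, each row is assumed to be a list of expanded candidate cells paired with the bounding box of their characters, and the comparisons are strict, following the wording "on the right side" and "on the left side" of the cutting lines.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    x: float  # left boundary
    y: float  # upper boundary (y grows downward, an assumption)
    w: float  # width
    h: float  # height


class Cell(NamedTuple):
    cell: Box  # expanded candidate cell
    text: Box  # bounding box of the characters inside the cell


def row_span(row: Sequence[Cell]) -> Tuple[float, float]:
    # Left and right column boundary lines of a row.
    left = min(c.cell.x for c in row)
    right = max(c.cell.x + c.cell.w for c in row)
    return left, right


def target_cells(rows: Sequence[Sequence[Cell]]) -> List[Cell]:
    # The narrowest row supplies the two cutting lines.
    left_cut, right_cut = min((row_span(r) for r in rows),
                              key=lambda s: s[1] - s[0])
    # A cell is kept when its text box lies strictly between the lines.
    return [c for r in rows for c in r
            if c.text.x > left_cut and c.text.x + c.text.w < right_cut]


wide_cell = Cell(Box(40, 0, 30, 10), Box(45, 2, 20, 6))
rows = [
    [Cell(Box(0, 0, 20, 10), Box(2, 2, 10, 6)),
     Cell(Box(20, 0, 20, 10), Box(22, 2, 10, 6)),
     wide_cell],                                   # spills past the table
    [Cell(Box(0, 10, 20, 10), Box(2, 12, 10, 6)),
     Cell(Box(20, 10, 20, 10), Box(22, 12, 10, 6))],
]

kept = target_cells(rows)
```

In this invented layout the second row is the narrowest, so the cutting lines sit at x = 0 and x = 40, and the extra cell at the end of the first row is excluded from the target cells.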
In one embodiment of the present application, the table information generating module 1150 is specifically configured to:
Expanding the boundaries of the target cells which are not aligned with other target cells so as to align the boundaries of the target cells with the boundaries of other target cells;
and generating target table information extracted from the PDF file to be processed according to the expanded target cell.
In one embodiment of the application, the apparatus further comprises:
The clustering module is used for carrying out clustering processing on the bounding boxes to obtain a plurality of class clusters, wherein one class cluster corresponds to one table; and executing the steps of expanding the bounding boxes aiming at the bounding boxes contained in each class cluster so as to enable the boundaries of adjacent bounding boxes to coincide, and obtaining the expanded bounding boxes.
In one embodiment of the application, the apparatus further comprises:
the cell merging module is used for extracting a bounding box to be detected from a table formed by the extended bounding boxes and acquiring the extended bounding boxes connected to the right side of the bounding box to be detected; wherein the bounding box to be detected starts from a first extended bounding box in the table; if the extension bounding box connected to the right side of the bounding box to be detected is one, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to the step of recursively acquiring the extension bounding box connected to the right side of the bounding box to be detected; if the number of the extension bounding boxes connected to the right side of the bounding box to be detected is multiple, combining the multiple extension bounding boxes, wherein the combining operation is used for combining the multiple extension bounding boxes into one extension bounding box; if the multiple extension bounding boxes are successfully combined, reserving combined extension bounding boxes corresponding to the multiple extension bounding boxes, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to obtain the extension bounding box connected to the right side of the bounding box to be detected; if the merging of the plurality of extension bounding boxes fails, keeping the plurality of extension bounding boxes unchanged, taking the next unprocessed extension bounding box as the bounding box to be detected, and returning to the step of acquiring the extension bounding box connected to the right side of the bounding box to be detected.
In one embodiment of the application, the apparatus further comprises:
The cell alignment module is used for extracting a bounding box to be adjusted from a table formed by the extended bounding boxes and recursively acquiring a plurality of extended bounding boxes connected to the right side of the bounding box to be detected; wherein the bounding box to be adjusted starts from a first extended bounding box in the table; detecting whether a plurality of extension bounding boxes with aligned row boundaries exist in the bounding boxes to be detected and a plurality of extension bounding boxes connected to the right side of the bounding boxes to be detected; if a plurality of extension bounding boxes with aligned row boundaries exist, adjusting the row boundaries of other extension bounding boxes with misaligned row boundaries by taking the aligned row boundaries of the extension bounding boxes with aligned row boundaries as references, so that the row boundaries of the extension bounding boxes to be detected and the extension bounding boxes connected to the right side of the extension bounding boxes to be detected are aligned; if a plurality of extension bounding boxes with aligned row boundaries do not exist, adjusting the row boundaries of the extension bounding boxes connected to the right side of the bounding box to be detected by taking the row boundaries of the bounding box to be detected as references, so that the row boundaries of the bounding box to be detected and the extension bounding boxes connected to the right side of the bounding box to be detected are aligned; and returning the next unprocessed extension bounding box to the step of recursively acquiring a plurality of extension bounding boxes connected to the right side of the bounding box to be detected as the bounding box to be adjusted.
Specific details of the device for extracting form information from PDF files provided in the embodiments of the present application are described in the corresponding method embodiments, and are not described herein.
Fig. 12 schematically shows a block diagram of a computer system of an electronic device for implementing an embodiment of the application.
It should be noted that, the computer system 1200 of the electronic device shown in fig. 12 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a central processing unit 1201 (Central Processing Unit, CPU) that can perform various appropriate actions and processes according to a program stored in a read-only memory 1202 (Read-Only Memory, ROM) or a program loaded from a storage section 1208 into a random access memory 1203 (Random Access Memory, RAM). The random access memory 1203 also stores various programs and data necessary for system operation. The central processing unit 1201, the read-only memory 1202, and the random access memory 1203 are connected to each other via a bus 1204. An input/output interface 1205 (Input/Output interface, I/O interface) is also connected to the bus 1204.
The following components are connected to the input/output interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a cathode ray tube (Cathode Ray Tube, CRT), a liquid crystal display (Liquid Crystal Display, LCD), a speaker, and the like; a storage section 1208 including a hard disk or the like; and a communication section 1209 including a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the input/output interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read from it can be installed into the storage section 1208 as needed.
In particular, the processes described in the various method flowcharts may be implemented as computer software programs according to embodiments of the application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. The computer programs, when executed by the central processor 1201, perform the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), a flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. 
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functions of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the application. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or in software combined with the necessary hardware. Thus, the technical solution according to the embodiments of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions that cause a computing device (which may be a personal computer, a server, a touch terminal, a network device, etc.) to perform the method according to the embodiments of the present application.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings, which have been described above, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A method of extracting form information from a PDF file, comprising:
parsing a PDF file to be processed to obtain bounding boxes of connected characters in the PDF file to be processed;
expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes;
extracting candidate cells from the extended bounding boxes according to the row alignment relationship and the column connection relationship between the extended bounding boxes, wherein the candidate cells comprise the extended bounding boxes whose boundaries are aligned in the row direction and which are connected in the column direction;
cutting a candidate table formed by the candidate cells, using the column boundary lines on the two sides of the row with the minimum width in the candidate table as cutting lines, and taking the candidate cells located between the two cutting lines as target cells; and
generating, according to the target cells, target table information extracted from the PDF file to be processed.
2. The method for extracting form information from a PDF file according to claim 1, wherein dividing lines contained in the PDF file to be processed are also obtained after the PDF file to be processed is parsed, and wherein expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes, comprises:
detecting whether each of the 4 boundaries of a bounding box intersects a boundary of another bounding box or a dividing line in the PDF file to be processed;
if a boundary of the bounding box intersects a boundary of another bounding box or a dividing line in the PDF file to be processed, stopping the expansion of that boundary;
if a boundary of the bounding box does not intersect a boundary of another bounding box or a dividing line in the PDF file to be processed, expanding that boundary in a direction away from the center of the bounding box by a set step length, and returning to the step of detecting whether each of the 4 boundaries of the bounding box intersects a boundary of another bounding box or a dividing line in the PDF file to be processed;
after all 4 boundaries of the bounding box have stopped expanding, generating an extended bounding box according to the expanded boundaries.
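By way of non-limiting illustration, the step-wise expansion recited in claim 2 may be sketched as follows. The names `Box`, `blocked`, `grow` and `expand`, the clipping to a page rectangle (added here only so that the loop terminates), the coordinate convention (y growing downward), and the treatment of dividing lines as zero-thickness boxes are all assumptions of this sketch and are not taken from the application. The sketch further assumes that coordinates fall on multiples of the step length, so that a boundary cannot step over a thin dividing line.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    x0: float  # left
    y0: float  # top (y is assumed to grow downward)
    x1: float  # right
    y1: float  # bottom

def blocked(box, side, other):
    """True if the given boundary of box meets other (corner-only contact is ignored)."""
    if side in ("left", "right"):
        x = box.x0 if side == "left" else box.x1
        return other.x0 <= x <= other.x1 and box.y0 < other.y1 and other.y0 < box.y1
    y = box.y0 if side == "top" else box.y1
    return other.y0 <= y <= other.y1 and box.x0 < other.x1 and other.x0 < box.x1

def at_page_limit(box, side, page):
    """True if the given boundary has reached the page rectangle (sketch-only stop rule)."""
    return {"left": box.x0 <= page.x0, "top": box.y0 <= page.y0,
            "right": box.x1 >= page.x1, "bottom": box.y1 >= page.y1}[side]

def grow(box, side, step, page):
    """Move one boundary away from the box center by one step, clipped to the page."""
    if side == "left":
        return replace(box, x0=max(page.x0, box.x0 - step))
    if side == "top":
        return replace(box, y0=max(page.y0, box.y0 - step))
    if side == "right":
        return replace(box, x1=min(page.x1, box.x1 + step))
    return replace(box, y1=min(page.y1, box.y1 + step))

def expand(box, obstacles, page, step=1.0):
    """Expand each boundary until it meets another box, a dividing line, or the page edge."""
    active = ["left", "top", "right", "bottom"]
    while active:
        for side in list(active):
            if any(blocked(box, side, o) for o in obstacles) or at_page_limit(box, side, page):
                active.remove(side)  # this boundary stops expanding
            else:
                box = grow(box, side, step, page)
    return box

# A neighbouring box to the left and a vertical dividing line to the right.
neighbour = Box(1, 4, 2, 5)
divider = Box(8, 0, 8, 10)
print(expand(Box(5, 4, 6, 5), [neighbour, divider], Box(0, 0, 10, 10)))
```

In this example the left boundary stops where it coincides with the right boundary of the neighbouring box, and the right boundary stops on the dividing line, which is the coincidence of adjacent boundaries that the claim aims for.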
3. The method of extracting form information from a PDF file according to claim 1, wherein extracting candidate cells from the extended bounding boxes according to the row alignment relationship and the column connection relationship between the extended bounding boxes comprises:
detecting whether an upper boundary of a first extended bounding box and an upper boundary of a second extended bounding box are aligned, or detecting whether a lower boundary of the first extended bounding box and a lower boundary of the second extended bounding box are aligned;
if the upper boundary of the first extended bounding box and the upper boundary of the second extended bounding box are aligned, or the lower boundary of the first extended bounding box and the lower boundary of the second extended bounding box are aligned, taking the first extended bounding box and the second extended bounding box as valid cells in the same row;
detecting whether a valid cell has a connecting boundary, in the column direction, with valid cells of other rows;
if the valid cell has a connecting boundary, in the column direction, with valid cells of other rows, taking the valid cell as a candidate cell.
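As a hedged illustration of claim 3, the sketch below first keeps boxes whose upper or lower boundary is aligned with that of another box (valid cells), and then keeps the valid cells that share a horizontal boundary with a valid cell of another row (candidate cells). Boxes are plain `(x0, y0, x1, y1)` tuples with y growing downward. The function names and the tolerance `tol` are assumptions of this sketch; the application does not specify a tolerance.

```python
def row_aligned(a, b, tol=0.5):
    """Upper boundaries aligned, or lower boundaries aligned, within tol."""
    return abs(a[1] - b[1]) <= tol or abs(a[3] - b[3]) <= tol

def column_connected(a, b, tol=0.5):
    """One box sits directly above the other and their x-ranges overlap."""
    stacked = abs(a[3] - b[1]) <= tol or abs(b[3] - a[1]) <= tol
    shared = min(a[2], b[2]) - max(a[0], b[0]) > 0
    return stacked and shared

def candidate_cells(boxes, tol=0.5):
    # Valid cells: row-aligned with at least one other box.
    valid = [b for i, b in enumerate(boxes)
             if any(row_aligned(b, o, tol) for j, o in enumerate(boxes) if i != j)]
    # Candidate cells: valid cells connected to a valid cell of another row.
    return [c for i, c in enumerate(valid)
            if any(column_connected(c, d, tol) for j, d in enumerate(valid) if i != j)]

boxes = [
    (0, 0, 5, 2), (5, 0, 10, 2),      # row 1 of a 2x2 grid
    (0, 2, 5, 4), (5, 2, 10, 4),      # row 2 of the grid
    (0, 10, 5, 12),                   # isolated text, aligned with nothing
    (0, 20, 5, 22), (5, 20, 10, 22),  # aligned pair with no row above or below
]
print(candidate_cells(boxes))
```

Only the four grid boxes are returned: the isolated box fails the row alignment test, and the aligned pair fails the column connection test.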
4. The method of extracting form information from a PDF file according to claim 3, wherein before detecting whether the valid cell has a connecting boundary, in the column direction, with valid cells of other rows, the method further comprises:
sorting the rows formed by the valid cells in ascending order of the row boundary coordinates of each row;
detecting whether the number of the valid cells contained in the first row is larger than a preset threshold value;
if the number of the valid cells contained in the first row reaches the preset threshold, executing the step of detecting whether the valid cell has a connecting boundary, in the column direction, with valid cells of other rows;
if the number of the valid cells contained in the first row does not reach the preset threshold, eliminating the valid cells contained in the first row, taking the next row as the first row, and returning to the step of detecting whether the number of the valid cells contained in the first row is larger than the preset threshold.
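The row filtering of claim 4 may be sketched as follows, purely as an illustration. Rows are lists of `(x0, y0, x1, y1)` tuples, and the row boundary coordinate is taken to be the smallest `y0` in the row. The function name `drop_sparse_leading_rows` is hypothetical. Because the claim speaks both of a count "larger than" and of a count that "reaches" the threshold, the sketch assumes the latter reading (`>=`).

```python
def drop_sparse_leading_rows(rows, min_cells):
    """Sort rows top to bottom, then discard leading rows with too few cells."""
    ordered = sorted(rows, key=lambda row: min(cell[1] for cell in row))
    # Assumption: a row qualifies as the first table row once it reaches min_cells.
    while ordered and len(ordered[0]) < min_cells:
        ordered.pop(0)  # e.g. a title line sitting above the table
    return ordered

rows = [
    [(0, 10, 5, 12), (5, 10, 10, 12)],  # second table row
    [(0, 0, 10, 2)],                    # single wide box above the table
    [(0, 5, 5, 7), (5, 5, 10, 7)],      # first table row
]
print(drop_sparse_leading_rows(rows, min_cells=2))
```

Only leading rows are removed; a sparse row further down the table is left in place, which matches the loop structure of the claim.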
5. The method according to claim 1, wherein cutting the candidate table formed by the candidate cells, using the column boundary lines on the two sides of the row with the minimum width in the candidate table as cutting lines, and taking the candidate cells located between the two cutting lines as target cells, comprises:
expanding each boundary of a candidate cell that is not connected with other candidate cells until that boundary is connected with the boundary of a bounding box near the candidate cell, so as to obtain expanded candidate cells;
cutting an expanded candidate table formed by the expanded candidate cells, using the column boundary lines on the two sides of the row with the minimum width in the expanded candidate table as cutting lines, and taking the expanded candidate cells that are not cut by the cutting lines as target cells.
6. The method according to claim 5, wherein cutting the expanded candidate table formed by the expanded candidate cells, using the column boundary lines on the two sides of the row with the minimum width in the expanded candidate table as cutting lines, comprises:
taking the left column boundary line of the row with the minimum width in the expanded candidate table as a first cutting line, and taking the right column boundary line of the row with the minimum width in the expanded candidate table as a second cutting line;
if the left column boundary of the bounding box corresponding to an expanded candidate cell in the expanded candidate table is located on the right side of the first cutting line, and the right column boundary of the bounding box corresponding to the expanded candidate cell is located on the left side of the second cutting line, determining that the expanded candidate cell is not cut.
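The cutting rule of claims 5 and 6 may be illustrated by the following sketch, in which a table is a list of rows and each row is a list of `(x0, y0, x1, y1)` tuples. The function name `target_cells` is hypothetical. The sketch assumes that a cell whose boundary lies exactly on a cutting line counts as not cut, since otherwise the cells of the narrowest row would themselves be discarded.

```python
def target_cells(rows):
    """Keep, in every row, the cells lying between the two cutting lines."""
    def width(row):
        return max(c[2] for c in row) - min(c[0] for c in row)

    narrowest = min(rows, key=width)
    first_cut = min(c[0] for c in narrowest)   # left column boundary line
    second_cut = max(c[2] for c in narrowest)  # right column boundary line
    # Assumption: boundaries lying on a cutting line are treated as inside.
    return [[c for c in row if c[0] >= first_cut and c[2] <= second_cut]
            for row in rows]

rows = [
    [(0, 0, 5, 2), (5, 0, 10, 2), (10, 0, 20, 2)],  # width 20, third cell overhangs
    [(0, 2, 5, 4), (5, 2, 10, 4)],                  # width 10, narrowest row
]
print(target_cells(rows))
```

Here the cutting lines fall at x = 0 and x = 10, so the overhanging cell `(10, 0, 20, 2)` is dropped and the remaining cells form a regular 2 x 2 table.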
7. The method for extracting form information from a PDF file according to claim 1, wherein generating, according to the target cells, the target table information extracted from the PDF file to be processed comprises:
expanding the boundaries of the target cells that are not aligned with other target cells, so that the boundaries of the target cells are aligned with the boundaries of the other target cells;
generating, according to the expanded target cells, the target table information extracted from the PDF file to be processed.
8. The method for extracting form information from a PDF file according to any one of claims 1 to 7, wherein after acquiring the bounding boxes of connected characters in the PDF file to be processed, the method further comprises:
clustering the bounding boxes to obtain a plurality of clusters, wherein one cluster corresponds to one table;
for the bounding boxes contained in each cluster, executing the step of expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes.
9. The method for extracting form information from a PDF file according to any one of claims 1 to 7, wherein after expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes, the method further comprises:
extracting a bounding box to be detected from a table formed by the extended bounding boxes, and acquiring the extended bounding boxes connected to the right side of the bounding box to be detected, wherein the bounding box to be detected starts from the first extended bounding box in the table;
if only one extended bounding box is connected to the right side of the bounding box to be detected, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected;
if a plurality of extended bounding boxes are connected to the right side of the bounding box to be detected, performing a merging operation on the plurality of extended bounding boxes, wherein the merging operation is used for merging the plurality of extended bounding boxes into one extended bounding box;
if the plurality of extended bounding boxes are successfully merged, retaining the merged extended bounding box corresponding to the plurality of extended bounding boxes, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected;
if the merging of the plurality of extended bounding boxes fails, keeping the plurality of extended bounding boxes unchanged, taking the next unprocessed extended bounding box as the bounding box to be detected, and returning to the step of acquiring the extended bounding boxes connected to the right side of the bounding box to be detected.
10. The method for extracting form information from a PDF file according to any one of claims 1 to 7, wherein after expanding the bounding boxes so that the boundaries of adjacent bounding boxes coincide, to obtain extended bounding boxes, the method further comprises:
extracting a bounding box to be adjusted from a table formed by the extended bounding boxes, and recursively acquiring a plurality of extended bounding boxes connected to the right side of the bounding box to be adjusted, wherein the bounding box to be adjusted starts from the first extended bounding box in the table;
detecting whether, among the bounding box to be adjusted and the plurality of extended bounding boxes connected to its right side, there are a plurality of extended bounding boxes whose row boundaries are aligned;
if there are a plurality of extended bounding boxes whose row boundaries are aligned, adjusting the row boundaries of the other extended bounding boxes, whose row boundaries are not aligned, by taking the aligned row boundaries as a reference, so that the row boundaries of the bounding box to be adjusted and of the extended bounding boxes connected to its right side are aligned;
if there are no extended bounding boxes whose row boundaries are aligned, adjusting the row boundaries of the extended bounding boxes connected to the right side of the bounding box to be adjusted by taking the row boundaries of the bounding box to be adjusted as a reference, so that the row boundaries of the bounding box to be adjusted and of the extended bounding boxes connected to its right side are aligned;
taking the next unprocessed extended bounding box as the bounding box to be adjusted, and returning to the step of recursively acquiring a plurality of extended bounding boxes connected to the right side of the bounding box to be adjusted.
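The row boundary adjustment of claim 10 may be sketched, under stated assumptions, as follows. The input `chain` is the starting bounding box followed by the extended bounding boxes connected to its right, each an `(x0, y0, x1, y1)` tuple. The function name `align_row` is hypothetical, row boundaries are compared for exact equality (a tolerance could be used instead), and when several groups of aligned boxes exist the largest group is taken as the reference, a choice the application does not specify.

```python
from collections import Counter

def align_row(chain):
    """Give every box in the chain the same upper and lower row boundaries."""
    counts = Counter((b[1], b[3]) for b in chain)
    reference, n = counts.most_common(1)[0]
    if n < 2:
        # No two boxes share row boundaries: fall back to the starting box.
        reference = (chain[0][1], chain[0][3])
    top, bottom = reference
    return [(b[0], top, b[2], bottom) for b in chain]

# Two boxes agree on y = 0..2, so the middle box is adjusted to match them.
print(align_row([(0, 0, 5, 2), (5, 0.2, 10, 2.1), (10, 0, 15, 2)]))
# No boxes agree, so the starting box (y = 1..3) is the reference.
print(align_row([(0, 1, 5, 3), (5, 0, 10, 2)]))
```

Column boundaries are left untouched, so the adjustment only straightens the row without changing which cells are horizontally adjacent.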
CN202410070624.4A 2024-01-17 2024-01-17 Method for extracting form information from PDF file and electronic equipment Pending CN117994804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410070624.4A CN117994804A (en) 2024-01-17 2024-01-17 Method for extracting form information from PDF file and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410070624.4A CN117994804A (en) 2024-01-17 2024-01-17 Method for extracting form information from PDF file and electronic equipment

Publications (1)

Publication Number Publication Date
CN117994804A true CN117994804A (en) 2024-05-07

Family

ID=90893556

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410070624.4A Pending CN117994804A (en) 2024-01-17 2024-01-17 Method for extracting form information from PDF file and electronic equipment

Country Status (1)

Country Link
CN (1) CN117994804A (en)

Legal Events

Date Code Title Description
PB01 Publication