CN114004204A - Table structure reconstruction and character extraction method and system based on computer vision


Info

Publication number
CN114004204A
CN114004204A (application CN202111263283.5A)
Authority
CN
China
Prior art keywords
line, lines, inner frame, module, character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111263283.5A
Other languages
Chinese (zh)
Inventor
沈逸飞
李明泽
李琦
王海文
傅洛伊
王新兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University
Priority to CN202111263283.5A
Publication of CN114004204A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Character Input (AREA)

Abstract

The invention provides a table structure reconstruction and character extraction method and system based on computer vision, comprising the following steps: step 1: identifying and locating tables in a PDF document through a neural network to obtain the outer frame area where each table is located; step 2: parsing the character layer of the PDF document to obtain the text spacing in the PDF document; step 3: reconstructing the inner frame line structure of the table within the table area through computer vision, according to the framed table area and the text spacing; step 4: extracting text information from the corresponding positions in the PDF document according to the inner frame line structure of the table; step 5: generating an editable table file from the inner frame line structure of the table and the corresponding text information. The invention identifies table outer frames through a neural network, so that all tables in a PDF document can be extracted automatically without manually specifying the table outer frame areas, and large numbers of tables in PDF data can be extracted in batch without supervision.

Description

Table structure reconstruction and character extraction method and system based on computer vision
Technical Field
The invention relates to the technical field of document reconstruction, in particular to a table structure reconstruction and character extraction method and system based on computer vision.
Background
PDF (Portable Document Format) is a format for saving, displaying, and printing documents, developed by Adobe and widely used in fields such as economics, finance, education, and scientific research. However, because the PDF format is designed only for faithful presentation and accurate printing, it does not preserve the relationships between individual pieces of text in structured data such as tables. With the continuing development of deep learning, more raw data is needed as support, and document reconstruction is itself an important task for the publishing industry. Table data, as highly structured data, carries great informational value. Extracting the various tables found in various PDFs quickly and accurately is therefore an important foundation and prerequisite for higher-level tasks. Existing table extraction techniques suffer from low extraction accuracy, poor universality, and poor performance.
Patent document CN106897690A (application number: 201710095978.4) discloses a PDF table extraction method comprising the following steps: step A, parsing a PDF document to obtain image data, first line data, and character data; step B, processing the image data obtained in step A with an image recognition algorithm and obtaining, from image data containing table data, second line data corresponding to that table data; step C, processing the first line data from step A and the second line data from step B with a graphics algorithm to obtain table frame data containing table row data and table column data; step D, clustering the character data from step A with a clustering algorithm to obtain text data containing character string sets; and step E, deriving table cells from the row and column data in the table frame data obtained in step C and matching them against the character string sets in the text data obtained in step D to obtain the table data in the PDF document. This method performs poorly on general tables and cannot recognize them.
Patent document CN110516208A (application number: 201910738531.3) discloses a system for extracting tables from PDF documents, comprising a table feature extraction module, a table positioning module, and a table internal structure analysis module; the corresponding table extraction method comprises: s1, table feature extraction; s2, table positioning; s3, table internal structure analysis; the final table is divided into a two-dimensional grid structure, the position and size of each cell of the resulting two-dimensional grid table are known, and the table is output in HTML format. That invention considers special table forms such as missing lines and cells distinguished by background color, and can extract PDF table data with high accuracy; however, it does not consider picture-form tables in PDFs, so its universality is limited.
Patent document CN105988979A (application number: 201510083646.5) provides a method and device for extracting tables from a PDF file. After the character information of each character and the line information of each line in the PDF file are obtained by parsing, the horizontal lines extracted from a given page of the PDF file are sorted by line position, and it is determined whether two adjacent horizontal lines belong to the same table on that page; each horizontal line of a table is drawn according to its line information, each vertical line extracted from the page is filled into the drawn table according to its line information, and finally the character information of each character is filled, according to its position information, into the cell formed by the horizontal and vertical lines. Because both the horizontal and vertical line information of the table is considered, the accuracy of table extraction from PDF files is improved. This invention likewise does not consider picture-form tables in PDFs, so its universality is limited.
Patent document CN109635268A (application number: CN201811630768.1) discloses a method for extracting table information from a PDF file, comprising: reading the PDF file; parsing the attributes of the PDF file; finding and sorting the sets of all horizontal and vertical lines on a page; judging whether the horizontal and vertical line sets of the current page can form a complete table frame, and if so, processing the table as framed, otherwise as frameless; obtaining the rows and columns of the table and the meta-information of the cells; judging whether the table spans pages and, if so, merging the cross-page parts; if the table does not span pages, storing it directly; and storing the row and column information of the table, the page on which it is located, its position within the page, and so on. This invention does not consider picture-form tables in PDFs, so its universality is limited.
Disclosure of Invention
In view of the defects in the prior art, the object of the invention is to provide a table structure reconstruction and character extraction method and system based on computer vision.
The table structure reconstruction and character extraction method based on computer vision provided by the invention comprises the following steps:
step 1: identifying and locating tables in the PDF document through a neural network to obtain the outer frame area where each table is located;
step 2: parsing the character layer of the PDF document to obtain the text spacing in the PDF document;
step 3: reconstructing the inner frame line structure of the table within the table area through computer vision, according to the framed table area and the text spacing;
step 4: extracting text information from the corresponding positions in the PDF document according to the inner frame line structure of the table;
step 5: generating an editable table file from the inner frame line structure of the table and the corresponding text information.
Preferably, the step 1 comprises:
step 1.1: training and configuring a table detection neural network;
step 1.2: converting each page of the PDF document containing a table target into a picture and inputting each picture into the table detection neural network; if a table target exists, returning the page number of the PDF document on which it is located and the relative position of the table outer frame on that page.
Preferably, the step 2 comprises:
step 2.1: judging whether the PDF page on which the table is located contains a character layer;
step 2.2: if there is no character layer, embedding one into the page through optical character recognition, with each embedded character placed at the position of that character in the picture;
step 2.3: computing statistics over the sizes of all characters in the PDF document and taking the average character width as the estimate of the text spacing.
Preferably, the step 3 comprises:
step 3.1: cropping the table out as a picture according to the table outer frame area and the PDF page on which the table is located;
step 3.2: preprocessing the cropped picture, the preprocessing comprising thresholding and morphological processing, and removing noise other than the characters and frame lines of the table;
step 3.3: performing vertical line detection on the table; if the vertical line pixels exceed a preset value, the table contains frame lines and step 3.4 is executed, otherwise step 3.5 is executed;
step 3.4: reconstructing the table structure of a framed table: extracting all vertical and horizontal lines of the table, obtaining the set of their intersection points, and forming the set of table inner frame intersection points after removing redundant points; judging, from the obtained intersection point set, whether a table inner frame line runs between adjacent points and, if so, connecting the two points to form an edge; and forming the table structure of the framed table from the points and edges;
step 3.5: preprocessing the picture by removing the horizontal and vertical lines whose length exceeds a preset threshold, and thresholding the picture so that blank pixels have value 0 and character pixels have value 255; scanning the picture row by row, where a row whose pixel values sum to 0 belongs to a horizontal table inner frame line, the position of that inner frame line is the middle of all consecutive rows summing to 0, and the region between two adjacent horizontal inner frame lines is one row of the table; scanning vertically between two adjacent horizontal inner frame lines and summing the pixel values of each column, where the scanned area is marked blank if a run of consecutive columns wider than the text spacing sums to 0 and is marked as a character area otherwise, thereby obtaining a coordinate set of character and non-character areas for each row; merging, from left to right, blank areas that are connected to each other and can be fully traversed from top to bottom by a single vertical line into blank blocks, recording the height of each blank block, and removing all blank blocks whose height is one row; traversing all vertical lines that pass through blank blocks and recording the total height of the blank blocks each one passes through; taking the vertical line whose traversed blank blocks have the largest total height as a vertical table inner frame line and marking the blank blocks it passes through as traversed; selecting, among the vertical lines that pass through untraversed blank blocks, the one with the largest total height of such blocks as the next vertical inner frame line, marking its blank blocks as traversed, and continuing to obtain vertical lines in this way until all blank blocks have been traversed; and establishing the minimum cells of the table from the obtained horizontal and vertical inner frame lines, scanning whether the vertical line of each cell passes through a character area, deleting that segment of the vertical line and merging the left and right cells if it does, and finally forming the table inner frame line structure after cell merging.
Preferably, the step 4 comprises: obtaining, from the reconstructed table inner frame line structure, the rectangular frame coordinates of each cell in the PDF document, extracting the character information in the region at the same position from the PDF document containing the character layer, and obtaining the content of each table cell after whitespace-removal adjustment;
the step 5 comprises: building an Excel table from all the table inner frame lines and the contents of the corresponding table cells, and preserving merged-cell information.
The invention provides a table structure reconstruction and character extraction system based on computer vision, which comprises:
module M1: identifying and locating tables in the PDF document through a neural network to obtain the outer frame area where each table is located;
module M2: parsing the character layer of the PDF document to obtain the text spacing in the PDF document;
module M3: reconstructing the inner frame line structure of the table within the table area through computer vision, according to the framed table area and the text spacing;
module M4: extracting text information from the corresponding positions in the PDF document according to the inner frame line structure of the table;
module M5: generating an editable table file from the inner frame line structure of the table and the corresponding text information.
Preferably, the module M1 includes:
module M1.1: training and configuring a table detection neural network;
module M1.2: converting each page of the PDF document containing a table target into a picture and inputting each picture into the table detection neural network; if a table target exists, returning the page number of the PDF document on which it is located and the relative position of the table outer frame on that page.
Preferably, the module M2 includes:
module M2.1: judging whether the PDF page on which the table is located contains a character layer;
module M2.2: if there is no character layer, embedding one into the page through optical character recognition, with each embedded character placed at the position of that character in the picture;
module M2.3: computing statistics over the sizes of all characters in the PDF document and taking the average character width as the estimate of the text spacing.
Preferably, the module M3 includes:
module M3.1: cropping the table out as a picture according to the table outer frame area and the PDF page on which the table is located;
module M3.2: preprocessing the cropped picture, the preprocessing comprising thresholding and morphological processing, and removing noise other than the characters and frame lines of the table;
module M3.3: performing vertical line detection on the table; if the vertical line pixels exceed a preset value, the table contains frame lines and module M3.4 is invoked, otherwise module M3.5 is invoked;
module M3.4: reconstructing the table structure of a framed table: extracting all vertical and horizontal lines of the table, obtaining the set of their intersection points, and forming the set of table inner frame intersection points after removing redundant points; judging, from the obtained intersection point set, whether a table inner frame line runs between adjacent points and, if so, connecting the two points to form an edge; and forming the table structure of the framed table from the points and edges;
module M3.5: preprocessing the picture by removing the horizontal and vertical lines whose length exceeds a preset threshold, and thresholding the picture so that blank pixels have value 0 and character pixels have value 255; scanning the picture row by row, where a row whose pixel values sum to 0 belongs to a horizontal table inner frame line, the position of that inner frame line is the middle of all consecutive rows summing to 0, and the region between two adjacent horizontal inner frame lines is one row of the table; scanning vertically between two adjacent horizontal inner frame lines and summing the pixel values of each column, where the scanned area is marked blank if a run of consecutive columns wider than the text spacing sums to 0 and is marked as a character area otherwise, thereby obtaining a coordinate set of character and non-character areas for each row; merging, from left to right, blank areas that are connected to each other and can be fully traversed from top to bottom by a single vertical line into blank blocks, recording the height of each blank block, and removing all blank blocks whose height is one row; traversing all vertical lines that pass through blank blocks and recording the total height of the blank blocks each one passes through; taking the vertical line whose traversed blank blocks have the largest total height as a vertical table inner frame line and marking the blank blocks it passes through as traversed; selecting, among the vertical lines that pass through untraversed blank blocks, the one with the largest total height of such blocks as the next vertical inner frame line, marking its blank blocks as traversed, and continuing to obtain vertical lines in this way until all blank blocks have been traversed; and establishing the minimum cells of the table from the obtained horizontal and vertical inner frame lines, scanning whether the vertical line of each cell passes through a character area, deleting that segment of the vertical line and merging the left and right cells if it does, and finally forming the table inner frame line structure after cell merging.
Preferably, the module M4 comprises: obtaining, from the reconstructed table inner frame line structure, the rectangular frame coordinates of each cell in the PDF document, extracting the character information in the region at the same position from the PDF document containing the character layer, and obtaining the content of each table cell after whitespace-removal adjustment;
the module M5 comprises: building an Excel table from all the table inner frame lines and the contents of the corresponding table cells, and preserving merged-cell information.
Compared with the prior art, the invention has the following beneficial effects:
(1) the invention identifies table outer frames through a neural network, so that all tables in a PDF can be extracted automatically without manually specifying the table outer frame areas, and large numbers of tables in PDF data can be extracted in batch without supervision;
(2) the invention adds a character layer, through OCR, to tables that lack one, so that tables in picture-type PDFs can also be recognized, giving stronger universality;
(3) the invention divides tables into framed and frameless tables and applies a different recognition algorithm to each, so the table structure is reconstructed with higher accuracy;
(4) the invention obtains the text spacing by parsing the PDF data and uses it to assist in reconstructing the table inner frame line structure, so the table structure is reconstructed more accurately;
(5) the invention balances recognition speed against recognition precision by reasonably setting the size of the picture cropped from the PDF for processing when recognizing the inner frame lines.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is an example of the table detection network detecting table outer frames on a PDF page;
FIG. 3 is an example of a table cropped from the original PDF;
FIG. 4 shows the horizontal frame lines of a framed table;
FIG. 5 shows the vertical frame lines of a framed table;
FIG. 6 is a schematic view of the intersection points of the horizontal and vertical frame lines;
FIG. 7 is a schematic diagram of the inner frame line structure generated from a framed table;
FIG. 8 shows a table judged to be frameless, with frame lines removed and after enhancement processing;
FIG. 9 is an example of the horizontal frame lines and region sets obtained after step 3.5.3;
FIG. 10 is an example of all the vertical inner frame lines obtained in step 3.5.5;
FIG. 11 is an example of the table inner frame line structure obtained after deleting the vertical lines inside merged cells in step 3.5.6;
FIG. 12 shows the contents of a framed table after classifying all characters located within each cell region as that cell's content;
FIG. 13 shows the contents of a frameless table after classifying all characters located within each cell region as that cell's content;
FIG. 14 shows the editable Excel file generated from the frameless table;
FIG. 15 shows the editable Excel file generated from the framed table.
Detailed Description
The present invention will be described in detail below with reference to specific embodiments. The following embodiments will help those skilled in the art to further understand the invention, but do not limit it in any way. It should be noted that various changes and modifications, all falling within the scope of the present invention, will be apparent to those skilled in the art without departing from the spirit of the invention.
Embodiment:
The method for reconstructing the structure of tables inside a PDF and extracting their characters based on computer vision, as shown in FIG. 1, comprises the following steps:
step 1: identifying and locating tables in the PDF through a neural network to obtain the outer frame area where each table is located; step 2: judging whether the PDF contains a character layer and, if not, embedding one through OCR, and automatically obtaining the text spacing by parsing the PDF; step 3: drawing the table inner frame lines within the table area, based on computer vision, according to the table area framed in the previous step and the text spacing, thereby reconstructing the table structure; step 4: extracting text information from the corresponding positions in the PDF according to the table inner frame line structure reconstructed in step 3 to obtain the table contents; step 5: automatically generating an editable table file from the table inner frame line structure and the corresponding text information.
The step 1 comprises: training and configuring a table detection neural network, preprocessing a PDF containing table targets by converting each page into a picture, inputting each picture into the neural network to detect whether a table exists, and, if so, returning the PDF page on which the table is located and the relative coordinates of its position on that page. Specifically:
step 1.1: a custom table data set is created and a target detection neural network is trained for tables; picture data is input to the neural network, which returns the recognition result;
step 1.2: the PDF pages containing table targets are converted into pictures and input to the neural network, which detects and returns the positions of the table outer frames on each PDF page. A page may contain multiple tables, as in FIG. 2.
The returned result takes the form of a list, formatted as follows:
[
[page of table, [abscissa of upper-left corner, ordinate of upper-left corner, abscissa of lower-right corner, ordinate of lower-right corner]],
[page of table, [abscissa of upper-left corner, ordinate of upper-left corner, abscissa of lower-right corner, ordinate of lower-right corner]],
...
]
The four coordinates describing the table outer frame are ratios relative to the width and height of the page, each a floating-point number between 0 and 1.
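For illustration, a minimal Python sketch of this detection loop follows; it assumes the pdf2image package for page rendering and a hypothetical detect_tables() wrapper around the trained network, neither of which is prescribed by the embodiment:

from pdf2image import convert_from_path

def locate_tables(pdf_path, detect_tables):
    """Return [[page_no, [x1, y1, x2, y2]], ...] with coordinates as 0-1 ratios."""
    results = []
    for page_no, img in enumerate(convert_from_path(pdf_path)):  # one PIL image per page
        w, h = img.size
        for (px1, py1, px2, py2) in detect_tables(img):  # pixel boxes from the network
            results.append([page_no, [px1 / w, py1 / h, px2 / w, py2 / h]])
    return results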
The step 2 comprises: judging whether the PDF contains a character layer, embedding one through OCR if it does not, and automatically obtaining the text spacing in the PDF. Specifically:
step 2.1: first, the pdf2txt module of PDFMiner converts the PDF page containing the target table into XML format; this file contains all the text information in the PDF, and all text characters are extracted with a regular expression;
step 2.2: if the result returned in the previous step is empty, the PDF is a picture-form PDF; a text layer is then added to it, and all subsequent operations are performed on the PDF with the text layer;
step 2.3: the position information of all characters is extracted with the regular expression; the horizontal-coordinate differences of all characters are averaged, and the average is multiplied by an empirical factor to obtain the estimate of the character spacing:
WordGap = 1.3 × Mean(abscissa of the lower-right vertex of a character − abscissa of the lower-left vertex of that character)
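A short Python sketch of this estimate, assuming the character boxes have already been parsed from the PDFMiner XML output:

def estimate_word_gap(chars, factor=1.3):
    """chars: list of (x1, y1, x2, y2) character boxes from the PDFMiner XML."""
    widths = [x2 - x1 for (x1, y1, x2, y2) in chars]
    return factor * sum(widths) / len(widths)  # WordGap = 1.3 * mean character width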
Step 3 redraws the frame lines within the table area, based on computer vision, according to the table area framed in the previous step, thereby reconstructing the table structure. Specifically:
step 3.1: the specified table area is cropped out of the PDF according to the table outer frame area and the PDF page on which the table is located. To balance the speed and accuracy of the subsequent recognition of the inner frame lines, the invention empirically requires the cropped picture to have the smaller of its width and height at 1700 pixels or more. Since most PDF content is recorded as vectors, the content can be rendered at scale without loss of accuracy, as shown in FIG. 3;
step 3.2: the cropped picture is preprocessed. Since table reconstruction needs no color, the picture is converted from three RGB channels to single-channel grayscale to speed up processing, with pixel values ranging from 0 to 255. Thresholding and color inversion are then applied: pixels whose value exceeds 200 are set to 0, and all others to 255;
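A possible OpenCV rendering of this preprocessing (a sketch; the embodiment does not name a library):

import cv2

def preprocess(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)  # drop color channels
    # pixels brighter than 200 become 0 (background), the rest 255 (ink)
    _, binary = cv2.threshold(gray, 200, 255, cv2.THRESH_BINARY_INV)
    return binary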
step 3.3: vertical line detection is performed on the image to judge whether the table is a framed table. First, a rectangle 40 pixels high and 1 pixel wide is taken as the convolution kernel, and the image from the previous step is morphologically processed with this kernel by erosion followed by dilation, yielding all vertical lines longer than 40 pixels. The total length of these vertical lines is counted; if it exceeds an empirically set 1000 pixels, the table is judged to contain frame lines and the process goes to step 3.4.1; otherwise the table is considered frameless and the process goes to step 3.5.1.
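The framed/frameless decision could be sketched in OpenCV as follows, with the 40-pixel kernel and the empirical 1000-pixel threshold taken from the text:

import cv2
import numpy as np

def has_frame_lines(binary, min_total=1000):
    kernel = np.ones((40, 1), np.uint8)                       # tall, thin structuring element
    vertical = cv2.dilate(cv2.erode(binary, kernel), kernel)  # keep strokes >= 40 px tall
    return cv2.countNonZero(vertical) > min_total             # empirical 1000-pixel threshold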
Step 3.4.1: the inner frame line structure of the framed table is reconstructed. Using a convolution kernel 40 pixels high and 1 pixel wide, the result of step 3.2 is eroded and then dilated to obtain a picture in which all vertical frame-line pixels are 255 and all other pixels are 0, as shown in FIG. 5. Likewise, with a kernel 40 pixels wide and 1 pixel high, a picture in which all horizontal frame-line pixels are 255 is obtained, as shown in FIG. 4. The set of points whose pixel value is 255 in both images is then taken, and all points less than 5 pixels apart are merged into one point until every pair of coordinates in the set is more than 5 pixels apart.
In the present invention, the distance between points is computed as d(X, Y) = sqrt((x1 - y1)^2 + (x2 - y2)^2), where x1 and x2 are the horizontal and vertical coordinates of point X, and y1 and y2 those of point Y.
All isolated points are removed: when no other point lies within 2 pixels of a point's abscissa or ordinate, i.e. the point shares no axis coordinate with any other point, it is treated as isolated and deleted. The abscissas of all points that share an abscissa are then set to their average, and similarly the ordinates of all points that share an ordinate are set to their average.
After removing these redundant coordinates, the remaining coordinates can be considered as the intersection of the horizontal and vertical box lines of the table, as shown in FIG. 6.
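A Python sketch of the intersection extraction, assuming the horizontal- and vertical-line images produced above; the 5-pixel merge radius follows the text:

import cv2
import numpy as np

def intersection_points(horizontal, vertical, min_dist=5):
    joints = cv2.bitwise_and(horizontal, vertical)     # 255 only where both lines meet
    ys, xs = np.nonzero(joints)
    merged = []
    for p in zip(xs.tolist(), ys.tolist()):
        for i, q in enumerate(merged):
            if (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 < min_dist ** 2:
                merged[i] = ((p[0] + q[0]) // 2, (p[1] + q[1]) // 2)  # fuse nearby points
                break
        else:
            merged.append(p)
    return merged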
Step 3.4.2: the table inner frame lines are reconstructed from the intersection points. The intersections from the previous step are sorted along the horizontal and vertical axes to obtain the sets of all horizontal and vertical coordinates.
All intersection points are traversed, and whether adjacent intersections are connected by a frame line is judged from the pixel values along the straight line joining them in the image: if the average pixel value exceeds 200, a frame line is deemed to exist. All frame lines are recorded.
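The edge test between adjacent intersections might be sketched as follows, sampling the segment between two points and applying the 200 average-value criterion of step 3.4.2:

import numpy as np

def is_edge(line_img, p, q):
    """True when the mean pixel value along segment p-q exceeds 200."""
    n = max(abs(q[0] - p[0]), abs(q[1] - p[1])) + 1
    xs = np.linspace(p[0], q[0], n).astype(int)  # sample the segment point by point
    ys = np.linspace(p[1], q[1], n).astype(int)
    return line_img[ys, xs].mean() > 200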
Step 3.4.3: from the recorded frame lines, all the table cells are finally obtained; sorting and integrating all the cells of the table yields its inner frame line structure, as shown in FIG. 7.
The data structure of the table inner frame lines is defined as follows:
TableStructure (table inner frame):
OutLine: [float x1, float y1, float x2, float y2], the proportional position of the table outer frame within the whole PDF page, each coordinate being a floating-point number between 0 and 1;
Rows: List[float], the proportional positions of all horizontal frame lines within the table area;
Columns: List[float], the proportional positions of all vertical frame lines within the table area;
Units: List[Unit], the list of table cells;
where Unit is defined as follows:
Unit (cell):
float X1: abscissa of the upper-left corner of the cell;
float Y1: ordinate of the upper-left corner of the cell;
float X2: abscissa of the lower-right corner of the cell;
float Y2: ordinate of the lower-right corner of the cell;
int RowId: index in Rows of the ordinate of the upper-left corner of the cell area;
int ColId: index in Columns of the abscissa of the upper-left corner of the cell area;
int MergeRow: merged height of the cell;
int MergeCol: merged width of the cell;
boolean Type: whether the cell contains text.
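These structures could be rendered in Python as follows (a sketch; the patent prescribes no implementation language):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Unit:
    X1: float = 0.0        # abscissa of the upper-left corner
    Y1: float = 0.0        # ordinate of the upper-left corner
    X2: float = 0.0        # abscissa of the lower-right corner
    Y2: float = 0.0        # ordinate of the lower-right corner
    RowId: int = 0         # index of the cell's top line in Rows
    ColId: int = 0         # index of the cell's left line in Columns
    MergeRow: int = 1      # merged height of the cell
    MergeCol: int = 1      # merged width of the cell
    Type: bool = False     # whether the cell contains text

@dataclass
class TableStructure:
    OutLine: List[float] = field(default_factory=list)   # [x1, y1, x2, y2], 0-1 ratios
    Rows: List[float] = field(default_factory=list)      # horizontal frame-line positions
    Columns: List[float] = field(default_factory=list)   # vertical frame-line positions
    Units: List[Unit] = field(default_factory=list)      # all table cells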
Step 3.5.1: the picture obtained in step 3.2 is processed further. Using a convolution kernel 40 pixels high and 1 pixel wide, the result of step 3.2 is eroded and then dilated to obtain a picture in which all vertical frame-line pixels are 255 and all other pixels are 0. Subtracting this picture pixel-by-pixel from the picture of step 3.2 yields a picture with the vertical-line pixels removed; the horizontal-line pixels are removed in the same way, leaving a picture with the frame lines removed. This picture is then eroded and dilated with a 3 × 3 convolution kernel so that nearby pixels become connected, and finally median filtering is applied, as shown in FIG. 8.
Step 3.5.2: the picture from the previous step is scanned horizontally from top to bottom, summing the pixel values of each row. If the sum is 0, a horizontal table inner frame line has been found; its position is set to the middle of all consecutive rows whose pixel sums are 0. All horizontal inner frame lines are thus obtained, and the region between two adjacent horizontal inner frame lines is one row of the table.
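A Python sketch of this row scan over the binarized, frame-line-free picture:

import numpy as np

def horizontal_inner_lines(binary):
    blank = binary.sum(axis=1) == 0              # True for rows whose pixels sum to 0
    lines, start = [], None
    for y, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = y                            # a blank run begins
        elif not is_blank and start is not None:
            lines.append((start + y - 1) // 2)   # middle of the finished blank run
            start = None
    if start is not None:                        # a run touching the bottom edge
        lines.append((start + len(blank) - 1) // 2)
    return lines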
Step 3.5.3: each table row of the picture is scanned vertically, summing all pixel values column by column between that row's bounding lines. If a run of consecutive columns wider than the text spacing sums to 0, the scanned area is marked blank; otherwise it is marked as a text area. This yields, for each row, the set of text and non-text regions, as shown in FIG. 9.
Step 3.5.4: blank areas that are connected with each other across rows are merged into blank blocks, the heights of the blank blocks are counted, and all blank blocks one row high are removed.
Step 3.5.5: a vertical scan finds the vertical line that passes through blank blocks with the largest total height; this line is a vertical inner frame line. Subsequent searches exclude blank blocks already traversed, and the search is repeated until no blank block remains to be scanned. The resulting vertical lines are all the vertical inner frame lines of the table, as shown in FIG. 10.
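The greedy selection of vertical inner frame lines could be sketched as follows; blocks and heights are assumed precomputed from step 3.5.4 (blocks maps a candidate x position to the ids of the blank blocks a vertical line at x would traverse, heights maps a block id to its height in rows):

def vertical_inner_lines(blocks, heights):
    remaining = set(heights)                 # blank blocks not yet traversed
    chosen = []
    while remaining:
        # the candidate whose line crosses the largest total height of untraversed blocks
        best = max(blocks, key=lambda x: sum(heights[b] for b in blocks[x] & remaining))
        covered = blocks[best] & remaining
        if not covered:                      # no remaining block can be traversed
            break
        chosen.append(best)
        remaining -= covered
    return sorted(chosen)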
Step 3.5.6: the minimum cells of the table are built from the obtained horizontal and vertical inner frame lines. Each cell's vertical line is scanned for passage through a text area; if it passes through one, that vertical line is deleted, indicating that the table has undergone a merge operation there. Redundant table inner frame lines whose leading row or column is empty are deleted, and the table inner frame line structure after cell merging is finally formed. The cell structure and inner frame line structure definitions are the same as in step 3.4.3, as shown in FIG. 11.
Step 4 extracts the text information of each cell from the corresponding position in the PDF according to the table inner frame line structure obtained by the preceding reconstruction, finally obtaining the content of each cell. Specifically:
and extracting the information of the PDF specified page through the PDFMiner, and outputting the information in an XML format, wherein the information comprises the position information of each character. The position information of each character is obtained through a regular expression, and the text and the position information of each character are obtained, wherein the format is the form of [ upper left-corner abscissa, upper left-corner ordinate, lower right-corner abscissa, lower right-corner ordinate, character ], for example [100.1,100.2,120.3,130.4, "table ]. The information of the characters is formed into a list, and the boxes described by the positions of the characters are reduced by 50% in place, so that the extraction omission operation caused by errors in the subsequent character extraction process is avoided.
Each cell of the inner frame line structure is traversed to obtain its position in the PDF, and all characters located within the cell area are classified as that cell's content, yielding the text corresponding to every table cell, as shown in FIG. 12 and FIG. 13.
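A Python sketch of this assignment, including the 50% box shrink described above; chars holds the parsed [x1, y1, x2, y2, character] entries:

def fill_cells(cells, chars):
    """cells: list of (x1, y1, x2, y2); chars: list of [x1, y1, x2, y2, ch]."""
    contents = ["" for _ in cells]
    for x1, y1, x2, y2, ch in chars:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # shrink the character box by 50% around its center before the test
        sx1, sy1 = cx - (x2 - x1) / 4, cy - (y2 - y1) / 4
        sx2, sy2 = cx + (x2 - x1) / 4, cy + (y2 - y1) / 4
        for i, (a1, b1, a2, b2) in enumerate(cells):
            if a1 <= sx1 and b1 <= sy1 and sx2 <= a2 and sy2 <= b2:
                contents[i] += ch              # character belongs to this cell
                break
    return contents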
Step 5 builds an Excel table from the table inner frame line structure and the corresponding text content, preserving merged-cell information. This step uses the xlwt module, creating the Excel file from the cell positions, merge information, and text, as shown in FIG. 14 and FIG. 15.
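A minimal xlwt sketch of this step, assuming the Unit fields defined in step 3.4.3:

import xlwt

def write_excel(units, contents, path="table.xls"):
    book = xlwt.Workbook()
    sheet = book.add_sheet("Table")
    for unit, text in zip(units, contents):
        r, c = unit.RowId, unit.ColId
        if unit.MergeRow > 1 or unit.MergeCol > 1:
            # write_merge takes inclusive first/last row and column indices
            sheet.write_merge(r, r + unit.MergeRow - 1, c, c + unit.MergeCol - 1, text)
        else:
            sheet.write(r, c, text)
    book.save(path)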
First, the computer-vision-based method for reconstructing the structure of tables inside PDFs and extracting their text provides automatic table detection, table structure reconstruction, and content extraction into a generated Excel file, so that large numbers of tables in PDF data can be extracted without supervision. Second, the invention distinguishes framed from frameless tables and reconstructs the inner frame line structure by two different methods, while also making full use of text-spacing statistics, so the table reconstruction is more accurate. Finally, the invention extracts text directly from the PDF and, through OCR, brings non-text PDFs into the same mode, making the text extraction process unified and more efficient.
Those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (10)

1. A table structure reconstruction and character extraction method based on computer vision is characterized by comprising the following steps:
step 1: identifying and locating tables in the PDF document through a neural network to obtain the outer frame area where each table is located;
step 2: parsing the character layer of the PDF document to obtain the text spacing in the PDF document;
step 3: reconstructing the inner frame line structure of the table within the table area through computer vision, according to the framed table area and the text spacing;
step 4: extracting text information from the corresponding positions in the PDF document according to the inner frame line structure of the table;
step 5: generating an editable table file from the inner frame line structure of the table and the corresponding text information.
2. The table structure reconstruction and character extraction method based on computer vision according to claim 1, wherein the step 1 comprises:
step 1.1: training and configuring a table detection neural network;
step 1.2: converting each page of the PDF document containing a table target into a picture and inputting each picture into the table detection neural network; if a table target exists, returning the page number of the PDF document on which it is located and the relative position of the table outer frame on that page.
3. The table structure reconstruction and character extraction method based on computer vision according to claim 1, wherein the step 2 comprises:
step 2.1: judging whether the PDF page on which the table is located contains a character layer;
step 2.2: if there is no character layer, embedding one into the page through optical character recognition, with each embedded character placed at the position of that character in the picture;
step 2.3: computing statistics over the sizes of all characters in the PDF document and taking the average character width as the estimate of the text spacing.
4. The table structure reconstruction and character extraction method based on computer vision according to claim 1, wherein the step 3 comprises:
step 3.1: cropping the table out as a picture according to the table outer frame area and the PDF page on which the table is located;
step 3.2: preprocessing the cropped picture, the preprocessing comprising thresholding and morphological processing, and removing noise other than the characters and frame lines of the table;
step 3.3: performing vertical line detection on the table; if the vertical line pixels exceed a preset value, the table contains frame lines and step 3.4 is executed, otherwise step 3.5 is executed;
step 3.4: reconstructing the table structure of a framed table: extracting all vertical and horizontal lines of the table, obtaining the set of their intersection points, and forming the set of table inner frame intersection points after removing redundant points; judging, from the obtained intersection point set, whether a table inner frame line runs between adjacent points and, if so, connecting the two points to form an edge; and forming the table structure of the framed table from the points and edges;
step 3.5: preprocessing the picture by removing the horizontal and vertical lines whose length exceeds a preset threshold, and thresholding the picture so that blank pixels have value 0 and character pixels have value 255; scanning the picture row by row, where a row whose pixel values sum to 0 belongs to a horizontal table inner frame line, the position of that inner frame line is the middle of all consecutive rows summing to 0, and the region between two adjacent horizontal inner frame lines is one row of the table; scanning vertically between two adjacent horizontal inner frame lines and summing the pixel values of each column, where the scanned area is marked blank if a run of consecutive columns wider than the text spacing sums to 0 and is marked as a character area otherwise, thereby obtaining a coordinate set of character and non-character areas for each row; merging, from left to right, blank areas that are connected to each other and can be fully traversed from top to bottom by a single vertical line into blank blocks, recording the height of each blank block, and removing all blank blocks whose height is one row; traversing all vertical lines that pass through blank blocks and recording the total height of the blank blocks each one passes through; taking the vertical line whose traversed blank blocks have the largest total height as a vertical table inner frame line and marking the blank blocks it passes through as traversed; selecting, among the vertical lines that pass through untraversed blank blocks, the one with the largest total height of such blocks as the next vertical inner frame line, marking its blank blocks as traversed, and continuing to obtain vertical lines in this way until all blank blocks have been traversed; and establishing the minimum cells of the table from the obtained horizontal and vertical inner frame lines, scanning whether the vertical line of each cell passes through a character area, deleting that segment of the vertical line and merging the left and right cells if it does, and finally forming the table inner frame line structure after cell merging.
5. The table structure reconstruction and character extraction method based on computer vision according to claim 1, wherein the step 4 comprises: obtaining, from the reconstructed table inner frame line structure, the rectangular frame coordinates of each cell in the PDF document, extracting the character information in the region at the same position from the PDF document containing the character layer, and obtaining the content of each table cell after whitespace-removal adjustment;
the step 5 comprises: building an Excel table from all the table inner frame lines and the contents of the corresponding table cells, and preserving merged-cell information.
6. A table structure reconstruction and character extraction system based on computer vision, comprising:
module M1: identifying and locating tables in the PDF document through a neural network to obtain the outer frame area where each table is located;
module M2: parsing the character layer of the PDF document to obtain the text spacing in the PDF document;
module M3: reconstructing the inner frame line structure of the table within the table area through computer vision, according to the framed table area and the text spacing;
module M4: extracting text information from the corresponding positions in the PDF document according to the inner frame line structure of the table;
module M5: generating an editable table file from the inner frame line structure of the table and the corresponding text information.
7. The system according to claim 6, wherein the module M1 comprises:
module M1.1: training and configuring a table detection neural network;
module M1.2: converting each page of the PDF document containing a table target into a picture and inputting each picture into the table detection neural network; if a table target exists, returning the page number of the PDF document on which it is located and the relative position of the table outer frame on that page.
8. The system according to claim 6, wherein the module M2 comprises:
module M2.1: judging whether the PDF page on which the table is located contains a character layer;
module M2.2: if there is no character layer, embedding one into the page through optical character recognition, with each embedded character placed at the position of that character in the picture;
module M2.3: computing statistics over the sizes of all characters in the PDF document and taking the average character width as the estimate of the text spacing.
9. The system according to claim 6, wherein the module M3 comprises:
module M3.1: cropping the table out as a picture according to the table outer frame area and the PDF page on which the table is located;
module M3.2: preprocessing the cropped picture, the preprocessing comprising thresholding and morphological processing, and removing noise other than the characters and frame lines of the table;
module M3.3: performing vertical line detection on the table; if the vertical line pixels exceed a preset value, the table contains frame lines and module M3.4 is invoked, otherwise module M3.5 is invoked;
module M3.4: reconstructing the table structure of a framed table: extracting all vertical and horizontal lines of the table, obtaining the set of their intersection points, and forming the set of table inner frame intersection points after removing redundant points; judging, from the obtained intersection point set, whether a table inner frame line runs between adjacent points and, if so, connecting the two points to form an edge; and forming the table structure of the framed table from the points and edges;
module M3.5: preprocessing the picture by removing the horizontal and vertical lines whose length exceeds a preset threshold, and thresholding the picture so that blank pixels have value 0 and character pixels have value 255; scanning the picture row by row, where a row whose pixel values sum to 0 belongs to a horizontal table inner frame line, the position of that inner frame line is the middle of all consecutive rows summing to 0, and the region between two adjacent horizontal inner frame lines is one row of the table; scanning vertically between two adjacent horizontal inner frame lines and summing the pixel values of each column, where the scanned area is marked blank if a run of consecutive columns wider than the text spacing sums to 0 and is marked as a character area otherwise, thereby obtaining a coordinate set of character and non-character areas for each row; merging, from left to right, blank areas that are connected to each other and can be fully traversed from top to bottom by a single vertical line into blank blocks, recording the height of each blank block, and removing all blank blocks whose height is one row; traversing all vertical lines that pass through blank blocks and recording the total height of the blank blocks each one passes through; taking the vertical line whose traversed blank blocks have the largest total height as a vertical table inner frame line and marking the blank blocks it passes through as traversed; selecting, among the vertical lines that pass through untraversed blank blocks, the one with the largest total height of such blocks as the next vertical inner frame line, marking its blank blocks as traversed, and continuing to obtain vertical lines in this way until all blank blocks have been traversed; and establishing the minimum cells of the table from the obtained horizontal and vertical inner frame lines, scanning whether the vertical line of each cell passes through a character area, deleting that segment of the vertical line and merging the left and right cells if it does, and finally forming the table inner frame line structure after cell merging.
10. The system according to claim 6, wherein the module M4 comprises: obtaining, from the reconstructed table inner frame line structure, the rectangular frame coordinates of each cell in the PDF document, extracting the character information in the region at the same position from the PDF document containing the character layer, and obtaining the content of each table cell after whitespace-removal adjustment;
the module M5 comprises: building an Excel table from all the table inner frame lines and the contents of the corresponding table cells, and preserving merged-cell information.
CN202111263283.5A 2021-10-28 2021-10-28 Table structure reconstruction and character extraction method and system based on computer vision Pending CN114004204A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111263283.5A CN114004204A (en) 2021-10-28 2021-10-28 Table structure reconstruction and character extraction method and system based on computer vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111263283.5A CN114004204A (en) 2021-10-28 2021-10-28 Table structure reconstruction and character extraction method and system based on computer vision

Publications (1)

Publication Number Publication Date
CN114004204A true CN114004204A (en) 2022-02-01

Family

ID=79924592

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111263283.5A Pending CN114004204A (en) 2021-10-28 2021-10-28 Table structure reconstruction and character extraction method and system based on computer vision

Country Status (1)

Country Link
CN (1) CN114004204A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114677683A (en) * 2022-04-06 2022-06-28 电子科技大学 Background preprocessing method applied to microscopic character recognition of optical communication laser chip
CN114677683B (en) * 2022-04-06 2023-04-25 电子科技大学 Background preprocessing method applied to optical communication laser chip microscopic character recognition
CN116311259A (en) * 2022-12-07 2023-06-23 中国矿业大学(北京) Information extraction method for PDF business document
CN116311259B (en) * 2022-12-07 2024-03-12 中国矿业大学(北京) Information extraction method for PDF business document
CN115618836A (en) * 2022-12-15 2023-01-17 杭州恒生聚源信息技术有限公司 Wireless table structure restoration method and device, computer equipment and storage medium
CN115909369A (en) * 2023-02-15 2023-04-04 南京信息工程大学 Method and system for extracting binary slice image of Chinese character font

Similar Documents

Publication Publication Date Title
CN110516208B (en) System and method for extracting PDF document form
CN114004204A (en) Table structure reconstruction and character extraction method and system based on computer vision
CN113158808B (en) Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
AU2009281901B2 (en) Segmenting printed media pages into articles
EP2275973A2 (en) System and method for segmenting text lines in documents
CN101122952A (en) Picture words detecting method
US6532302B2 (en) Multiple size reductions for image segmentation
CN113239818B (en) Table cross-modal information extraction method based on segmentation and graph convolution neural network
CN115761773A (en) Deep learning-based in-image table identification method and system
CN111753706A (en) Complex table intersection point clustering extraction method based on image statistics
CN109213886B (en) Image retrieval method and system based on image segmentation and fuzzy pattern recognition
CN112241730A (en) Form extraction method and system based on machine learning
CN114463767A (en) Credit card identification method, device, computer equipment and storage medium
CN114170423B (en) Image document layout identification method, device and system
CN114581928A (en) Form identification method and system
JP2926066B2 (en) Table recognition device
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN115019310B (en) Image-text identification method and equipment
Naz et al. Challenges in baseline detection of cursive script languages
CN115223172A (en) Text extraction method, device and equipment
Radzid et al. Framework of page segmentation for mushaf Al-Quran based on multiphase level segmentation
CN112633116A (en) Method for intelligently analyzing PDF (Portable document Format) image-text
Rao et al. Script identification of telugu, english and hindi document image
Nazemi et al. Converting Optically Scanned Regular or Irregular Tables to a Standardised Markup Format to Be Accessible to Vision-Impaired.
JP7370574B2 (en) Frame extraction method and program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination