CN112528724A - Table cell extraction method, device, equipment and computer readable storage medium - Google Patents

Info

Publication number
CN112528724A
Authority
CN
China
Prior art keywords
cell
picture
line segments
information
line segment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010981003.3A
Other languages
Chinese (zh)
Inventor
时慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Hailong Software Co ltd
Original Assignee
Shanghai Hailong Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Hailong Software Co ltd
Priority to CN202010981003.3A
Publication of CN112528724A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G06F16/258 Data format conversion from or to a database
    • G06T5/80
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition


Abstract

The invention belongs to the technical field of optical character recognition in pattern recognition and devices, and in particular relates to a table cell extraction method, apparatus, device and computer-readable storage medium. The disclosed table cell extraction method comprises the following steps: S1: detecting the horizontal line segments of the picture and correcting the angle of the picture using the detected segments; S2: identifying the horizontal and vertical table line segments of the picture after the angle correction of S1, and completing the table using those segments; S3: performing flood-fill processing on the table picture completed in S2 and identifying the cell information in turn; S4: using the cell information identified in S3 to exclude erroneously identified cells; S5: reconstructing the table from the cell information remaining after S4 according to existing information.

Description

Table cell extraction method, device, equipment and computer readable storage medium
Technical Field
The invention belongs to the technical field of optical character recognition in pattern recognition and devices, and in particular relates to a table cell extraction method, apparatus, device and computer-readable storage medium.
Background
The following methods are currently available on the market:
1. Identification based on line detection. The vertical and horizontal table lines are identified by a straight-line detection algorithm, the intersections of the horizontal and vertical lines are calculated, and the table and cell information is derived from the intersections. This method places high demands on the line segments: if a line is no longer straight because the original paper is bent or the lens introduces deformation during shooting, the line cannot be recognized and recognition errors result. In addition, East Asian scripts such as Chinese, Japanese and Korean contain a large number of horizontal and vertical strokes, so closely spaced strokes in several consecutive characters are easily mistaken for table line segments.
2. Table identification based on deep learning. A model is obtained by training on a large number of table images and is then used to identify the tables in a picture. This method needs massive training data and has low reliability; because deep learning models are hard to interpret, recognition accuracy cannot be improved in a targeted way by simply adjusting code after a problem occurs. It also demands powerful hardware and takes a long time to run.
When OCR is performed on an image, if the table in the image cannot be accurately identified so that its cells can be extracted and processed individually, the overall character recognition result is seriously degraded. The invention therefore aims to provide a new scheme, based on existing mature techniques, for identifying tables and extracting their content, so that most tables can be effectively identified and processed.
Disclosure of Invention
In view of the problems raised in the background above, the present invention aims to provide a table cell extraction method, apparatus, device and computer-readable storage medium.
In order to achieve the technical purpose, the technical scheme adopted by the invention is as follows:
A table cell extraction method, comprising the following steps:
S1: detecting the horizontal line segments of the picture, and correcting the angle of the picture using the detected segments;
S2: identifying the horizontal and vertical table line segments of the picture after the angle correction of S1, and completing the table using those segments;
S3: performing flood-fill processing on the table picture completed in S2, and identifying the cell information in turn;
S4: using the cell information identified in S3 to exclude erroneously identified cells;
S5: reconstructing the table from the cell information remaining after S4 according to existing information.
As a preferred embodiment of the present invention, step S1, detecting the horizontal line segments of the picture and correcting the angle of the picture using the detected segments, includes:
The picture is binarized, and the Fast Line Detector algorithm is used to obtain the set of line segments in the picture. Vertical segments are filtered out and horizontal segments are retained; the slope of each retained segment is calculated and the middle value over all horizontal segments is taken. With the center of the picture as the rotation point, the picture is rotated in the reverse direction by that slope angle to correct it to the horizontal.
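The angle estimate in S1 can be illustrated with a small Python sketch that uses only the standard library. It assumes the line detector has already returned segments as `(x1, y1, x2, y2)` tuples, reads the "middle value" as the median, and uses a hypothetical `max_tilt_deg` cutoff to separate horizontal from vertical segments; none of these names or values come from the patent.

```python
import math
from statistics import median


def skew_angle_degrees(segments, max_tilt_deg=30.0):
    """Return the median angle (degrees) of the near-horizontal segments.

    segments: iterable of (x1, y1, x2, y2) endpoints from a line detector.
    max_tilt_deg: hypothetical cutoff; steeper segments count as vertical.
    """
    angles = []
    for x1, y1, x2, y2 in segments:
        # Orient every segment left-to-right so its angle lies in [-90, 90].
        if x2 < x1:
            x1, y1, x2, y2 = x2, y2, x1, y1
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1))
        if abs(angle) <= max_tilt_deg:
            angles.append(angle)
    # With no horizontal segments there is nothing to correct.
    return median(angles) if angles else 0.0


def rotate_point(x, y, cx, cy, angle_deg):
    """Rotate (x, y) about (cx, cy) by angle_deg."""
    t = math.radians(angle_deg)
    dx, dy = x - cx, y - cy
    return (cx + dx * math.cos(t) - dy * math.sin(t),
            cy + dx * math.sin(t) + dy * math.cos(t))
```

Rotating by the negated angle is the "reverse rotation" of the text: a point on a tilted table line lands back on a horizontal line. A real implementation would apply the same rotation to the whole image about its center.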
As a preferred embodiment of the present invention, step S2, identifying the horizontal and vertical table line segments of the picture after the angle correction of S1 and completing the table using those segments, includes:
The Line Segment Detector algorithm is used to obtain the set of line segments in the picture. Segments shorter than a threshold are excluded, as are segments that are neither horizontal nor vertical. Finally, both ends of each remaining segment are suitably lengthened and the segments are redrawn into the picture in black.
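The filtering and lengthening in S2 might look like the following Python sketch. The picture is modelled as a list of rows of gray values, and the thresholds `min_length`, `max_offset` and `extend` are hypothetical parameters chosen for illustration; the patent does not give concrete values.

```python
def keep_and_extend(segments, min_length=20.0, max_offset=2.0, extend=5):
    """Keep long axis-aligned segments and lengthen both ends.

    segments: iterable of (x1, y1, x2, y2).
    Returns a list of (kind, x1, y1, x2, y2) with kind "h" or "v".
    """
    kept = []
    for x1, y1, x2, y2 in segments:
        dx, dy = abs(x2 - x1), abs(y2 - y1)
        if max(dx, dy) < min_length:
            continue  # too short, likely a character stroke
        if dy <= max_offset:  # horizontal segment
            y = round((y1 + y2) / 2)
            left, right = sorted((x1, x2))
            kept.append(("h", round(left) - extend, y, round(right) + extend, y))
        elif dx <= max_offset:  # vertical segment
            x = round((x1 + x2) / 2)
            top, bottom = sorted((y1, y2))
            kept.append(("v", x, round(top) - extend, x, round(bottom) + extend))
        # Anything else is neither horizontal nor vertical and is dropped.
    return kept


def draw_segments(grid, segments, black=0):
    """Draw the segments into grid in black, clipping at the picture border."""
    height, width = len(grid), len(grid[0])
    for kind, x1, y1, x2, y2 in segments:
        if kind == "h" and 0 <= y1 < height:
            for x in range(max(x1, 0), min(x2, width - 1) + 1):
                grid[y1][x] = black
        elif kind == "v" and 0 <= x1 < width:
            for y in range(max(y1, 0), min(y2, height - 1) + 1):
                grid[y][x1] = black
    return grid
```

Lengthening the ends closes small gaps at table corners, which matters because the later flood fill would otherwise leak from one cell into its neighbor.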
As a preferred embodiment of the present invention, step S3, performing flood-fill processing on the table picture completed in S2 and identifying the cell information in turn, includes:
The completed picture is scanned from left to right and from top to bottom, advancing by a specific step along the x axis and the y axis, and a flood-fill operation is performed on each non-black area encountered. The minimum bounding rectangle of the filled area is then calculated: if its length and width match those of the picture, the area lies outside the table; otherwise the area is a detected cell.
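A minimal Python sketch of the scan-and-fill loop in S3 follows. It is an illustration under assumptions, not the patented implementation: the picture is a list of rows with 0 for black, the fill is 4-connected, filled pixels are painted black immediately so they are never visited twice, and `step` and `tolerance` are hypothetical parameters.

```python
from collections import deque

BLACK = 0


def flood_fill_bbox(grid, sx, sy):
    """Paint the 4-connected non-black region at (sx, sy) black.

    Returns the bounding box (left, top, width, height) of the region.
    """
    height, width = len(grid), len(grid[0])
    left = right = sx
    top = bottom = sy
    grid[sy][sx] = BLACK
    queue = deque([(sx, sy)])
    while queue:
        x, y = queue.popleft()
        left, right = min(left, x), max(right, x)
        top, bottom = min(top, y), max(bottom, y)
        for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] != BLACK:
                grid[ny][nx] = BLACK  # mark as visited
                queue.append((nx, ny))
    return left, top, right - left + 1, bottom - top + 1


def find_cells(grid, step=2, tolerance=0):
    """Scan left to right, top to bottom, and collect cell bounding boxes."""
    height, width = len(grid), len(grid[0])
    cells = []
    for y in range(0, height, step):
        for x in range(0, width, step):
            if grid[y][x] == BLACK:
                continue
            box = flood_fill_bbox(grid, x, y)
            if box[2] >= width - tolerance and box[3] >= height - tolerance:
                continue  # region spans the whole picture: outside the table
            cells.append(box)
    return cells
```

The step trades speed against the risk of skipping cells narrower than the step, so it should stay below the smallest cell dimension that is expected.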
As a preferred embodiment of the present invention, step S4, using the cell information identified in S3 to exclude erroneously identified cells, includes:
The height and width of each cell are calculated, and any cell smaller than a specified threshold is treated as a falsely identified cell. The gaps between cells are also calculated; when the gaps on all sides of a cell are larger than a set threshold, the cell is judged to be falsely identified.
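The two filters in S4 can be sketched in Python as below. The gap measure (the larger of the horizontal and vertical separations between two boxes), the threshold values, and the rule that a lone cell is kept are all assumptions made for illustration; the patent only states that small cells and cells with large gaps on every side are excluded.

```python
def gap(a, b):
    """Pixel gap between boxes (left, top, width, height); 0 if they touch."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    dx = max(bx - (ax + aw), ax - (bx + bw), 0)
    dy = max(by - (ay + ah), ay - (by + bh), 0)
    return max(dx, dy)


def drop_false_cells(cells, min_width=8, min_height=8, max_gap=6):
    """Remove cells that are too small or isolated from every other cell."""
    big_enough = [c for c in cells if c[2] >= min_width and c[3] >= min_height]
    kept = []
    for i, cell in enumerate(big_enough):
        others = big_enough[:i] + big_enough[i + 1:]
        if others and all(gap(cell, other) > max_gap for other in others):
            continue  # no neighbor within max_gap: treated as a false cell
        kept.append(cell)
    return kept
```

Real table cells are separated from their neighbors only by the thickness of a border line, so `max_gap` would normally be a little larger than the widest line expected in the picture.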
As a preferred embodiment of the present invention, step S5, reconstructing the table from the cell information remaining after S4 according to existing information, includes:
When a specific area of the table is known to consist of M rows and N columns, the complete cell information can be inferred backwards from the currently detected cell information, which resolves any misidentification that occurred earlier.
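The patent does not say how the backward inference in S5 is carried out. One possible approach, sketched below in Python, clusters the left and top edges of the detected cells into columns and rows and then fills every position of the M x N grid from the per-column and per-row medians. The function names, the `tolerance` parameter, and the requirement that each row and column contain at least one detected cell are assumptions of this sketch.

```python
from statistics import median


def cluster_indices(values, tolerance):
    """Chain-cluster values; return (cluster centres, cluster label per value)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    groups = []
    for i in order:
        if groups and values[i] - values[groups[-1][-1]] <= tolerance:
            groups[-1].append(i)
        else:
            groups.append([i])
        labels[i] = len(groups) - 1
    centres = [median(values[i] for i in g) for g in groups]
    return centres, labels


def rebuild_grid(cells, rows, cols, tolerance=3):
    """Infer all rows x cols cells from a partial list of detected cells.

    cells: list of (left, top, width, height).
    Returns a list of rows, each a list of (left, top, width, height).
    """
    lefts, col_of = cluster_indices([c[0] for c in cells], tolerance)
    tops, row_of = cluster_indices([c[1] for c in cells], tolerance)
    if len(lefts) != cols or len(tops) != rows:
        raise ValueError("detected cells do not cover every row and column")
    widths = [median(c[2] for c, k in zip(cells, col_of) if k == j)
              for j in range(cols)]
    heights = [median(c[3] for c, k in zip(cells, row_of) if k == i)
               for i in range(rows)]
    return [[(lefts[j], tops[i], widths[j], heights[i]) for j in range(cols)]
            for i in range(rows)]
```

A cell that the flood fill missed is recovered from the column it shares with cells above or below it and the row it shares with cells to its left or right.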
A table cell extraction apparatus, comprising:
an image enhancement module, configured to perform graying, binarization and angle correction on the original image to obtain an enhanced image suited to cell identification;
a cell identification module, configured to obtain the information of each cell using a flood-fill algorithm;
a cell reconstruction module, configured to identify and exclude cells outside the table according to the information between cells, and to reconstruct the complete cell information of a region from the detected cell information according to known local information about the table, thereby resolving misidentified and missed cells.
An electronic device, comprising: a processor and a memory for storing executable instructions of the processor; the processor is configured to perform the table cell extraction method of any of claims 1-6 by executing the executable instructions.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a table cell extraction method according to any one of claims 1 to 6.
The invention has the beneficial effects that:
In today's paperless offices, paper documents from past years or even decades need to be imported into computers for digital storage and processing, and identifying tables and extracting their information has always been a major problem. With this scheme, the most common kind of table, one drawn with line segments, can be detected quickly and accurately, and the position information of each cell can be extracted. Combined with OCR technology, historical paper files containing tables can be efficiently converted into digital documents that are easy to search and consult.
Drawings
The invention is further illustrated by the non-limiting examples given in the accompanying drawings;
FIG. 1 is a schematic flow chart diagram illustrating a method for extracting table cells according to an embodiment of the present invention;
FIG. 2 is a block diagram of an embodiment of a table cell extraction apparatus according to the present invention;
Detailed Description
In order that those skilled in the art can better understand the present invention, the following technical solutions are further described with reference to the accompanying drawings and examples.
A table cell extraction method, comprising the following steps:
S1: detecting the horizontal line segments of the picture, and correcting the angle of the picture using the detected segments, which includes:
The picture is binarized, and the Fast Line Detector algorithm is used to obtain the set of line segments in the picture. Vertical segments are filtered out and horizontal segments are retained; the slope of each retained segment is calculated and the middle value over all horizontal segments is taken. With the center of the picture as the rotation point, the picture is rotated in the reverse direction by that slope angle to correct it to the horizontal.
S2: identifying the horizontal and vertical table line segments of the picture after the angle correction of S1, and completing the table using those segments, which includes:
The Line Segment Detector algorithm is used to obtain the set of line segments in the picture. Segments shorter than a threshold are excluded, as are segments that are neither horizontal nor vertical. Finally, both ends of each remaining segment are suitably lengthened and the segments are redrawn into the picture in black.
S3: performing flood-fill processing on the table picture completed in S2, and identifying the cell information in turn, which includes:
The completed picture is scanned from left to right and from top to bottom, advancing by a specific step along the x axis and the y axis, and a flood-fill operation is performed on each non-black area encountered. The minimum bounding rectangle of the filled area is then calculated: if its length and width match those of the picture, the area lies outside the table; otherwise the area is a detected cell.
S4: using the cell information identified in S3 to exclude erroneously identified cells, which includes:
The height and width of each cell are calculated, and any cell smaller than a specified threshold is treated as a falsely identified cell. The gaps between cells are also calculated; when the gaps on all sides of a cell are larger than a set threshold, the cell is judged to be falsely identified.
S5: reconstructing the table from the cell information remaining after S4 according to existing information, which includes:
When a specific area of the table is known to consist of M rows and N columns, the complete cell information can be inferred backwards from the currently detected cell information, which resolves any misidentification that occurred earlier.
In this embodiment, the specific operation steps are as follows:
S1: Graying and binarization are performed on the picture to be recognized, yielding a black-and-white picture.
S2: The picture is corrected. If the picture has an obvious boundary, for example because it was placed on a material with a dark background color, the four corners of the picture are located and restored to the four corners of a rectangle. If the picture has no obvious boundary, it is corrected by identifying the horizontal and vertical table lines in the picture and restoring them to the horizontal and vertical directions.
S3: Starting at the top-left corner of the picture, a pixel is sampled and tested. If it is not black, a flood fill is performed from it using a gray level (e.g., 127) distinct from black and white (i.e., 0 and 255). If the pixel is black, the sampling point moves right by the specified step and sampling repeats until a non-black pixel is encountered. If the point moves past the right edge of the picture, it returns to the leftmost position, moves down by the specified step, and the operation repeats.
S4: The region just filled is extracted and the length and width of its bounding rectangle are examined. If they equal or are very close to the picture size (within a set threshold), the region is content outside the table and is painted black (i.e., set to 0).
S5: If the length and width of the outline are smaller than a specific threshold, the region is too small. Such a region is probably produced when the flood fill enters the enclosed interior of certain characters (for example the characters glossed as 'mouth' and 'saying'); it is not treated as a cell and is painted black directly.
S6: If the region falls under neither S4 nor S5, it is regarded as a cell: the coordinates of its top-left corner and its length and width are recorded, and the region is then painted black.
S7: S3-S6 are repeated, moving right by the specified step each time; on reaching the right edge, the sampling point returns to the left edge and moves down by the specified step, until the whole picture has been scanned.
S8: All detected cell information is normalized. Because the borders of cells adjacent above, below, left and right may deviate from each other by several pixels, each side is set to the average of the corresponding border lines of the adjacent cells.
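The normalization in S8 can be sketched in Python as follows. The sketch groups edge coordinates that lie within a hypothetical `tolerance` of each other and replaces each by the group average, treating left, top, right and bottom edges separately; how the original implementation decides which cells are adjacent is not stated, so the clustering rule is an assumption.

```python
def snap(values, tolerance=3):
    """Replace each value by the mean of its cluster of near-equal values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = []
    for i in order:
        if groups and values[i] - values[groups[-1][-1]] <= tolerance:
            groups[-1].append(i)
        else:
            groups.append([i])
    snapped = [0.0] * len(values)
    for group in groups:
        mean = sum(values[i] for i in group) / len(group)
        for i in group:
            snapped[i] = mean
    return snapped


def normalize_cells(cells, tolerance=3):
    """Align the borders of cells given as (left, top, width, height)."""
    lefts = snap([c[0] for c in cells], tolerance)
    tops = snap([c[1] for c in cells], tolerance)
    rights = snap([c[0] + c[2] for c in cells], tolerance)
    bottoms = snap([c[1] + c[3] for c in cells], tolerance)
    return [(round(l), round(t), round(r - l), round(b - t))
            for l, t, r, b in zip(lefts, tops, rights, bottoms)]
```

After this pass, cells in the same column share one left and one right coordinate and cells in the same row share one top and one bottom coordinate, which makes mapping cells to spreadsheet rows and columns straightforward.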
The invention relates to a tool for automatically converting a paper form into an Excel spreadsheet. The tool takes a photograph, corrects it, identifies the table information in it, then performs OCR and records the position of each field. It then generates a corresponding Excel file from the text and table information and offers it for download.
The foregoing embodiments are merely illustrative of the principles of the present invention and its efficacy, and are not to be construed as limiting the invention. Any person skilled in the art can modify or change the above-mentioned embodiments without departing from the spirit and scope of the present invention. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical spirit of the present invention be covered by the claims of the present invention.

Claims (9)

1. A table cell extraction method, comprising the following steps:
S1: detecting the horizontal line segments of the picture, and correcting the angle of the picture using the detected segments;
S2: identifying the horizontal and vertical table line segments of the picture after the angle correction of S1, and completing the table using those segments;
S3: performing flood-fill processing on the table picture completed in S2, and identifying the cell information in turn;
S4: using the cell information identified in S3 to exclude erroneously identified cells;
S5: reconstructing the table from the cell information remaining after S4 according to existing information.
2. The table cell extraction method according to claim 1, wherein step S1, detecting the horizontal line segments of the picture and correcting the angle of the picture using the detected segments, includes:
binarizing the picture; using the Fast Line Detector algorithm to obtain the set of line segments in the picture; filtering out vertical segments and retaining horizontal segments; calculating the slope of each retained segment and taking the middle value over all horizontal segments; and, with the center of the picture as the rotation point, rotating the picture in the reverse direction by that slope angle to correct it to the horizontal.
3. The table cell extraction method according to claim 1, wherein step S2, identifying the horizontal and vertical table line segments of the picture after the angle correction of S1 and completing the table using those segments, includes:
using the Line Segment Detector algorithm to obtain the set of line segments in the picture; excluding segments shorter than a threshold; excluding segments that are neither horizontal nor vertical; and finally suitably lengthening both ends of each remaining segment and redrawing the segments into the picture in black.
4. The table cell extraction method according to claim 1, wherein step S3, performing flood-fill processing on the table picture completed in S2 and identifying the cell information in turn, includes:
scanning the completed picture from left to right and from top to bottom, advancing by a specific step along the x axis and the y axis, performing a flood-fill operation on each non-black area encountered, and calculating the minimum bounding rectangle of the filled area, wherein if the length and width of the bounding rectangle match those of the picture, the area lies outside the table, and otherwise the area is a detected cell.
5. The table cell extraction method according to claim 1, wherein step S4, using the cell information identified in S3 to exclude erroneously identified cells, includes:
calculating the height and width of each cell and treating any cell smaller than a specified threshold as a falsely identified cell; and calculating the gaps between cells, wherein a cell whose gaps on all sides are larger than a set threshold is judged to be falsely identified.
6. The table cell extraction method according to claim 1, wherein step S5, reconstructing the table from the cell information remaining after S4 according to existing information, includes:
when a specific area of the table is known to consist of M rows and N columns, inferring the complete cell information backwards from the currently detected cell information, so as to resolve any misidentification that occurred earlier.
7. A table cell extraction apparatus, comprising:
an image enhancement module, configured to perform graying, binarization and angle correction on the original image to obtain an enhanced image suited to cell identification;
a cell identification module, configured to obtain the information of each cell using a flood-fill algorithm;
a cell reconstruction module, configured to identify and exclude cells outside the table according to the information between cells, and to reconstruct the complete cell information of a region from the detected cell information according to known local information about the table, thereby resolving misidentified and missed cells.
8. An electronic device, comprising: a processor and a memory for storing executable instructions of the processor; the processor is configured to perform the table cell extraction method of any of claims 1-6 by executing the executable instructions.
9. A computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, carries out the table cell extraction method according to any one of claims 1 to 6.
CN202010981003.3A 2020-09-17 2020-09-17 Table cell extraction method, device, equipment and computer readable storage medium Pending CN112528724A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010981003.3A CN112528724A (en) 2020-09-17 2020-09-17 Table cell extraction method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010981003.3A CN112528724A (en) 2020-09-17 2020-09-17 Table cell extraction method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112528724A true CN112528724A (en) 2021-03-19

Family

ID=74978847

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010981003.3A Pending CN112528724A (en) 2020-09-17 2020-09-17 Table cell extraction method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112528724A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688688A (en) * 2021-07-28 2021-11-23 达观数据(苏州)有限公司 Completion method of table lines in picture and identification method of table in picture

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101127081A (en) * 2006-08-14 2008-02-20 富士通株式会社 Table data processing method and apparatus
CN106156761A (en) * 2016-08-10 2016-11-23 北京交通大学 The image form detection of facing moving terminal shooting and recognition methods
CN110136069A (en) * 2019-05-07 2019-08-16 语联网(武汉)信息技术有限公司 Text image antidote, device and electronic equipment

Similar Documents

Publication Publication Date Title
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
CN106778996B (en) It is embedded with the generation system and method for the two dimensional code of visual pattern and reads system
CN106599028B (en) Book content searching and matching method based on video image processing
CN107220640B (en) Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN110647882A (en) Image correction method, device, equipment and storage medium
CN111737478B (en) Text detection method, electronic device and computer readable medium
CN105469026A (en) Horizontal and vertical line detection and removal for document images
CN113392669B (en) Image information detection method, detection device and storage medium
CN110598566A (en) Image processing method, device, terminal and computer readable storage medium
CN112446262A (en) Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium
CN113888756A (en) Method for determining effective area parameters, image acquisition method and test system
CN111461070B (en) Text recognition method, device, electronic equipment and storage medium
CN106682670B (en) Station caption identification method and system
CN113436222A (en) Image processing method, image processing apparatus, electronic device, and storage medium
CN111626145A (en) Simple and effective incomplete form identification and page-crossing splicing method
CN112528724A (en) Table cell extraction method, device, equipment and computer readable storage medium
CN111126266A (en) Text processing method, text processing system, device, and medium
US8891822B2 (en) System and method for script and orientation detection of images using artificial neural networks
CN112036294A (en) Method and device for automatically identifying paper table structure
CN109635798B (en) Information extraction method and device
JP5271956B2 (en) Document orientation detection method and apparatus
CN111046770A (en) Automatic annotation method for photo file figures
CN106648171B (en) A kind of interactive system and method based on lettering pen
CN113076952A (en) Method and device for automatically identifying and enhancing text
JP2871590B2 (en) Image extraction method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination