WO2023045277A1 - Method and device for converting a table in an image into a spreadsheet - Google Patents

Method and device for converting a table in an image into a spreadsheet Download PDF

Info

Publication number
WO2023045277A1
WO2023045277A1 (PCT/CN2022/080926, CN2022080926W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
lines
line
cell
algorithm
Prior art date
Application number
PCT/CN2022/080926
Other languages
English (en)
French (fr)
Inventor
郭丰俊
龙伟
丁凯
龙腾
Original Assignee
上海合合信息科技股份有限公司
上海临冠数据科技有限公司
上海生腾数据科技有限公司
上海盈五蓄数据科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海合合信息科技股份有限公司, 上海临冠数据科技有限公司, 上海生腾数据科技有限公司, 上海盈五蓄数据科技有限公司 filed Critical 上海合合信息科技股份有限公司
Publication of WO2023045277A1 publication Critical patent/WO2023045277A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Definitions

  • the present application relates to an image recognition method, in particular to a method for recognizing a table in an image and converting it into a spreadsheet (such as an Excel file).
  • as a common document form, tables are widely used in office work and daily life.
  • in financial processing, data analysis, and similar work, there is a large demand for converting tables in images (pictures) into spreadsheets.
  • the technical problem to be solved by this application is to provide a method for converting tables of different layouts, in images of varying quality, into spreadsheets with good layout restoration.
  • Step S1: Deskew and rectify the image according to the text lines and ruled lines in the image.
  • Step S2: Use an anchor-free object detection method to determine the position of the table in the image, also called the table area of the image.
  • Step S3: Detect table lines in the table area of the image.
  • Step S4: Filter the table lines detected in step S3 according to the text line information obtained by performing optical character recognition on the table area of the image, removing false table lines to obtain real table lines.
  • Step S5: According to the positional relationships between the table lines, classify all table lines into row and column groups.
  • Step S6: Construct cells according to the groups to which the table lines belong, and save the OCR result within each cell as the text information of that cell.
  • Step S7: Judge whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells; if there are missing cells, fill them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet.
  • the method described above provides multiple detection and correction measures for poor image quality, and the converted spreadsheet has the same layout as the table in the image.
  • in step S1, the angles of the text lines and table lines in the image are detected, and the entire image is deskewed and rectified so that each line of text is arranged roughly horizontally, the horizontal table lines are roughly horizontal, and the vertical table lines are roughly vertical.
  • the anchor-free object detection method includes any one or more of the CornerNet, CenterNet, ExtremeNet, DenseBox, YOLO, FSAF, FCOS, FoveaBox, RepPoints, Sparse RCNN, CentripetalNet, and SaccadeNet algorithms.
  • step S3 specifically includes the following sub-steps.
  • Step S31: Predict and extract the table line area within the table area of the image using an algorithm based on a semantic segmentation network.
  • the table line area refers to the positions where table lines may appear, namely a set of isolated pixels.
  • Step S32: Detect the table lines in the table line area of the image by curve fitting, that is, connect the isolated pixels predicted in the previous step into line segments using a curve fitting method. This is a detailed description of one specific implementation of step S3.
  • the algorithm based on the semantic segmentation network is first trained on labeled table line data, and the trained algorithm is then used to predict and extract the table line area. This reflects the data-driven character of this application.
  • in step S4, the optical character recognition performed on the table area of the image to obtain text line information may be carried out in this step or in any earlier step; it also includes the alternative of performing optical character recognition on the original image to obtain text line information and then narrowing it down to the text line information within the table area of the image.
  • in step S5, the horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is small and whose horizontal extents overlap are merged and deduplicated, so that lines which logically belong to the same horizontal line but were detected as several lines are assembled into one horizontal line. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more horizontal lines depending on whether cells are merged; vertical lines are handled by a similar method. This is a detailed description of step S5.
  • in step S6, the optical character recognition performed on the cells to obtain recognition results may be carried out in this step or in any earlier step; it includes performing optical character recognition on the table area of the image and then narrowing the results down to each cell, and also includes performing optical character recognition on the original image and then narrowing the results down to each cell.
  • preferably, the optical character recognition of the table area of the image to obtain text line information and the optical character recognition of the cells to obtain recognition results are performed at the same time.
  • This application also proposes a device for converting a table in an image into a spreadsheet, including a deskew and rectification processing unit, a table position detection unit, a table line detection unit, a table line filtering unit, a table line grouping unit, a cell construction unit, and a cell completion unit.
  • the deskew and rectification processing unit is used to deskew and rectify the image according to the text lines and ruled lines in the image.
  • the table position detection unit is used to determine the position of the table in the image by using an anchor-free target detection method in the image, also called the table area of the image.
  • the table line detection unit is used to detect table lines in a table area of an image.
  • the table line filtering unit is used to remove false table lines and obtain real table lines according to the text line information obtained by performing optical character recognition on the table area of the image.
  • the table line grouping unit is used to classify all table lines into groups of rows and columns according to the positional relationship between the table lines.
  • the cell construction unit is used to construct cells according to the group to which the table lines belong, and save the optical character recognition results within the range of each cell as text information in the cell.
  • the cell completion unit is used to judge whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells; if there are missing cells, it fills them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet.
  • the above-mentioned device provides multiple detection and correction measures for poor image quality, and the converted spreadsheet has the same layout as the table in the image.
  • the technical effect achieved by this application is that images of poor or good quality can be converted into spreadsheets with high accuracy; the layout is kept consistent, and the spreadsheet has good integrity.
  • FIG. 1 is a schematic flowchart of a method for converting a table in an image into an electronic table proposed in this application.
  • FIG. 2 is a schematic diagram of a sub-flow of step S3 in FIG. 1 .
  • FIG. 3 is a schematic structural diagram of a device for converting a form in an image into an electronic form proposed in this application.
  • 1 is the deskew and rectification processing unit
  • 2 is the table position detection unit
  • 3 is the table line detection unit
  • 4 is the table line filtering unit
  • 5 is the table line grouping unit
  • 6 is the cell construction unit
  • 7 is the cell completion unit.
  • the method for converting a form in an image into an electronic form proposed in this application includes the following steps.
  • Step S1: Deskew and rectify the image according to the text line and table line information in the image.
  • the text in the image is usually arranged horizontally, and the table lines usually include horizontal lines and vertical lines. Due to the shooting angle and the bending of the paper, the text and table lines in the image may be tilted and distorted.
  • This step detects the angles of the text lines and table lines and rectifies the entire image so that each line of text is arranged approximately horizontally, lines close to horizontal become approximately horizontal, and lines close to vertical become approximately vertical.
  • the image processed in this way can improve the accuracy of the subsequent detection of the position of the form and the correct rate of the structured electronic form.
  • Step S2 An anchor free target detection method is used in the image to determine the position of the table in the image, which is also called the table area of the image.
  • the anchor-free object detection method includes, for example, the CornerNet, CenterNet, ExtremeNet, DenseBox, YOLO, FSAF, FCOS, FoveaBox, RepPoints, Sparse RCNN, CentripetalNet, and SaccadeNet algorithms; after training, these algorithms can recognize tables of different layouts in an image and thereby detect the position of the table in the image.
  • Step S3 Detect table lines in the table area of the image.
  • the table lines include the outer border line used to separate the inside of the table from the outside of the table, and the inner divider line used to distinguish rows and columns inside the table.
  • Step S4 According to the text line information obtained by optical character recognition (OCR, Optical character recognition) on the table area of the image, filter the table lines detected in step S3, remove false table lines, and obtain clean real table lines.
  • the text line information includes the height of a text line, the width of a single character, the angle of a text line, and so on. The OCR of the table area of the image to obtain text line information may be performed in this step or in any earlier step; it also includes the non-preferred alternative of performing OCR on the original image to obtain text line information and then narrowing it down to the text line information within the table area of the image.
  • for example, some character strokes are long, or the strokes of adjacent characters join together, and these may be detected as table lines in step S3; they are false table lines and can be filtered out according to the text line height and the width of a single character.
  • as another example, when the length of a vertical table line detected in step S3 is shorter than the text line height, that vertical line is judged to be a false table line.
  • as yet another example, taking the angle of the text lines as horizontal fixes the vertical direction; if a table line detected in step S3 falls outside the allowable angle range of horizontal lines and also outside the allowable angle range of vertical lines, it is judged to be a false table line.
  • the allowable angle range of the horizontal line is, for example, plus or minus 15 degrees from the horizontal line.
  • the allowable angle range of the vertical line is, for example, plus or minus 15 degrees of the vertical line.
  • Step S5: According to the positional relationships between the table lines, classify all table lines into row and column groups. Because of poor image quality and other factors, it is inevitable that a single table line is sometimes detected as several table lines; at the same time, for layout reasons, table lines belonging to the same row or column are sometimes drawn as several separate lines. To accurately restore the row and column to which each cell belongs, this step groups the horizontal lines into row groups according to the positional relationships between them, and groups the vertical lines into column groups according to the positional relationships between them.
  • for example, horizontal and vertical lines are distinguished by computing the angle of each table line. The horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is small and whose horizontal extents overlap are merged and deduplicated, so that lines which logically belong to the same horizontal line but were detected as several lines are assembled into one horizontal line; this processing can be accelerated with the Union-Find algorithm. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more horizontal lines depending on whether cells are merged. Vertical lines are handled by a similar method.
  • Step S6: Construct cells according to the groups to which the table lines belong, and save the OCR result within each cell as the text information of that cell. This keeps the layout of the spreadsheet consistent with that of the table in the original image. The OCR of the cells to obtain recognition results may be performed in this step or in any earlier step; it includes the preferred approach of performing OCR on the table area of the image and then narrowing the results down to each cell, and also the non-preferred approach of performing OCR on the original image and then narrowing the results down to each cell.
  • preferably, the OCR of the table area of the image to obtain text line information (step S4 or any earlier step) and the OCR of the cells to obtain recognition results (step S6 or any earlier step) are performed at the same time.
  • Step S7 According to whether the structure of the cells in the outermost circle of the table is complete and whether there is a gap between adjacent cells, it is judged whether there is a missing cell. If there are missing cells, fill in the cells at the corresponding positions, so that the structure of the cells in the outermost circle of the table is complete, and there is no gap between adjacent cells, and a complete structured spreadsheet is obtained. Table lines are missing due to lack of outer borders in the table layout, poor image quality, or incomplete table captures, which can cause some cells to fail to build. This step improves the integrity of the structured spreadsheet by padding the cells.
  • for example, based on the table position, text line positions, and table line positions detected in the image, it is judged whether the outer border lines of the table need to be supplemented. If so, the outer border lines are supplemented according to the existing line segment information in the orthogonal direction: if the leftmost vertical outer border line needs to be supplemented, it is obtained by fitting the left endpoints of the existing horizontal line segments, and outer border lines at other positions are handled similarly.
  • as another example, after cell construction is complete, the row and column numbers of each cell are known. Because the table is a rectangular structure that cannot contain holes, missing cells can be identified from the existing row and column numbering.
  • if a cell is missing, its row and column numbers and position can be deduced from its existing neighbors and filled in. As yet another example, when the gap between adjacent cells exceeds the text line height, a missing cell is judged to exist.
  • the step S3 specifically includes the following sub-steps.
  • Step S31: Use an algorithm based on a semantic segmentation (Semantic Segmentation) network to predict and extract the table line area within the table area of the image; the table line area refers to the positions where table lines may appear, namely a set of isolated pixels.
  • the algorithm based on the semantic segmentation network adopts, for example, a pixel classification method based on U-Net, a convolutional neural network originally developed for biomedical image segmentation.
  • the semantic segmentation network-based algorithm is firstly trained through labeled table line data, and then uses the trained algorithm to predict and extract table line areas.
  • the labeled table line data refers to images explicitly labeled as containing table lines and images explicitly labeled as not containing table lines.
  • Step S32 Detecting the table lines in the table line area of the image by means of a curve fitting method, that is, using a traditional curve fitting method to connect the isolated pixel points predicted in the previous step into line segments.
  • the method shown in Figure 2 combines a data-driven approach (train the algorithm first, then use it for prediction and extraction) with a classical image processing algorithm (curve fitting); it not only suppresses noise effectively but also has good robustness in detecting table lines of different layouts.
  • the device proposed by this application for converting a table in an image into a spreadsheet includes a deskew and rectification processing unit 1, a table position detection unit 2, a table line detection unit 3, a table line filtering unit 4, a table line grouping unit 5, a cell construction unit 6, and a cell completion unit 7.
  • the deskew and rectification processing unit 1 is used to deskew and rectify the image according to the text lines and ruled lines in the image.
  • the table position detection unit 2 is used to determine the position of the table in the image by using an anchor-free object detection method in the image, which is also called the table area of the image.
  • the table line detection unit 3 is used to detect table lines in the table area of the image.
  • the table line filtering unit 4 is used to remove false table lines and obtain real table lines according to the text line information obtained by performing optical character recognition on the table area of the image.
  • the table line grouping unit 5 is used to classify all table lines into groups of rows and columns according to the positional relationship between the table lines.
  • the cell construction unit 6 is used to construct cells according to the group to which the table lines belong, and save the OCR results within each cell as text information in the cell.
  • the cell completion unit 7 is used to determine whether there is a missing cell according to whether the structure of the cells in the outermost circle of the table is complete and whether there is a gap between adjacent cells. If there are missing cells, fill in the cells at the corresponding positions, so that the structure of the cells in the outermost circle of the table is complete, and there is no gap between adjacent cells, and a complete structured spreadsheet is obtained.
  • the method and device for converting tables in images into electronic tables proposed by this application have the following beneficial technical effects.
  • the image is first deskewed and rectified, and the table in the image is then detected, which improves the accuracy of table detection and of the subsequent structured spreadsheet.
  • an anchor-free object detection method is used to detect tables in images, which can accurately detect tables with different aspect ratios and different separation styles, whether or not they have visible table lines.
  • a table without table lines means that the document content is separated according to a table layout, but no table lines are drawn.
  • for table lines that are easily mis-detected or missed because of poor image quality and text interference, the algorithm based on the semantic segmentation network is first trained on labeled data and then used to detect table line areas, which removes interference; table line detection is then completed in combination with the curve fitting method.
  • the table lines are grouped based on distance, and the row and column positions of the cells are obtained according to the groups to construct the cells.
  • the missing cells are judged and filled, which improves the integrity of the spreadsheet.


Abstract

This application discloses a method for converting a table in an image into a spreadsheet. Step S1: deskew and rectify the image according to the text lines and ruled lines in the image. Step S2: use an anchor-free object detection method to determine the position of the table in the image, also called the table area of the image. Step S3: detect table lines in the table area of the image. Step S4: according to the text line information obtained by performing optical character recognition on the table area of the image, remove false table lines to obtain real table lines. Step S5: according to the positional relationships between the table lines, classify all table lines into row and column groups. Step S6: construct cells according to the groups to which the table lines belong, and save the optical character recognition result within each cell as the text information of that cell. Step S7: if there are missing cells, fill them in at the corresponding positions to obtain a complete structured spreadsheet.

Description

Method and Device for Converting a Table in an Image into a Spreadsheet
Technical Field
This application relates to an image recognition method, and in particular to a method for recognizing a table in an image and converting it into a spreadsheet (for example, an Excel file).
Background Art
As a common document form, tables are widely used in office work and daily life. In financial processing, data analysis, and similar work, there is a large demand for converting tables in images (pictures) into spreadsheets. Owing to problems such as print quality, shooting angle, lighting, and paper bending, existing conversion methods often suffer from falsely detected table lines, missed table lines, misplaced cells, and lost cells, so that the layout restoration of the spreadsheet is erroneous.
Summary of the Invention
The technical problem to be solved by this application is to provide a method for converting tables of different layouts, in images of varying quality, into spreadsheets with good layout restoration.
To solve the above technical problem, the method proposed by this application for converting a table in an image into a spreadsheet includes the following steps. Step S1: deskew and rectify the image according to the text lines and ruled lines in the image. Step S2: use an anchor-free object detection method to determine the position of the table in the image, also called the table area of the image. Step S3: detect table lines in the table area of the image. Step S4: according to the text line information obtained by performing optical character recognition on the table area of the image, filter the table lines detected in step S3, removing false table lines to obtain real table lines. Step S5: according to the positional relationships between the table lines, classify all table lines into row and column groups. Step S6: construct cells according to the groups to which the table lines belong, and save the optical character recognition result within each cell as the text information of that cell. Step S7: judge whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells; if there are missing cells, fill them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet. The above method provides multiple detection and correction measures for poor image quality, and the converted spreadsheet has the same layout as the table in the image.
Further, in step S1, the angles of the text lines and table lines in the image are detected, and the entire image is deskewed and rectified so that each line of text is arranged roughly horizontally, the horizontal table lines are roughly horizontal, and the vertical table lines are roughly vertical. This is a detailed description of step S1.
Further, in step S2, the anchor-free object detection method includes any one or more of the CornerNet, CenterNet, ExtremeNet, DenseBox, YOLO, FSAF, FCOS, FoveaBox, RepPoints, Sparse RCNN, CentripetalNet, and SaccadeNet algorithms. These are some preferred examples of the algorithm used in step S2.
Further, step S3 specifically includes the following sub-steps. Step S31: use an algorithm based on a semantic segmentation network to predict and extract the table line area within the table area of the image; the table line area refers to the positions where table lines may appear, namely a set of isolated pixels. Step S32: detect the table lines in the table line area of the image by curve fitting, that is, connect the isolated pixels predicted in the previous step into line segments using a curve fitting method. This is a detailed description of one specific implementation of step S3.
Further, in step S31, the algorithm based on the semantic segmentation network is first trained on labeled table line data, and the trained algorithm is then used to predict and extract the table line area. This reflects the data-driven character of this application.
Further, in step S4, the optical character recognition performed on the table area of the image to obtain text line information may be carried out in this step or in any earlier step; it also includes the alternative of performing optical character recognition on the original image to obtain text line information and then narrowing it down to the text line information within the table area of the image.
Further, in step S5, the horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is small and whose horizontal extents overlap are merged and deduplicated, so that lines which logically belong to the same horizontal line but were detected as several lines are assembled into one horizontal line. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more horizontal lines depending on whether cells are merged; vertical lines are handled by a similar method. This is a detailed description of step S5.
Further, in step S6, the optical character recognition performed on the cells to obtain recognition results may be carried out in this step or in any earlier step; it includes performing optical character recognition on the table area of the image and then narrowing the results down to each cell, and also includes performing optical character recognition on the original image and then narrowing the results down to each cell.
Preferably, the optical character recognition of the table area of the image to obtain text line information and the optical character recognition of the cells to obtain recognition results are performed at the same time.
This application also proposes a device for converting a table in an image into a spreadsheet, comprising a deskew and rectification processing unit, a table position detection unit, a table line detection unit, a table line filtering unit, a table line grouping unit, a cell construction unit, and a cell completion unit. The deskew and rectification processing unit deskews and rectifies the image according to the text lines and ruled lines in the image. The table position detection unit determines the position of the table in the image, also called the table area of the image, using an anchor-free object detection method. The table line detection unit detects table lines in the table area of the image. The table line filtering unit removes false table lines and obtains real table lines according to the text line information obtained by performing optical character recognition on the table area of the image. The table line grouping unit classifies all table lines into row and column groups according to the positional relationships between the table lines. The cell construction unit constructs cells according to the groups to which the table lines belong, and saves the optical character recognition result within each cell as the text information of that cell. The cell completion unit judges whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells; if there are missing cells, it fills them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet. The above device provides multiple detection and correction measures for poor image quality, and the converted spreadsheet has the same layout as the table in the image.
The technical effect achieved by this application is that images of poor or good quality can be converted into spreadsheets with high accuracy; the layout is kept consistent, and the spreadsheet has good integrity.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the method proposed by this application for converting a table in an image into a spreadsheet.
FIG. 2 is a schematic diagram of the sub-flow of step S3 in FIG. 1.
FIG. 3 is a schematic structural diagram of the device proposed by this application for converting a table in an image into a spreadsheet.
Reference numerals in the figures: 1 is the deskew and rectification processing unit, 2 is the table position detection unit, 3 is the table line detection unit, 4 is the table line filtering unit, 5 is the table line grouping unit, 6 is the cell construction unit, and 7 is the cell completion unit.
Detailed Description of Embodiments
Referring to FIG. 1, the method proposed by this application for converting a table in an image into a spreadsheet includes the following steps.
Step S1: deskew and rectify the image according to the text line and table line information in the image. For example, the text in an image is usually arranged horizontally, and table lines usually include horizontal and vertical lines; because of shooting angle and paper bending, the text and table lines in the image may be tilted or distorted. This step detects the angles of the text lines and table lines and rectifies the entire image so that each line of text is arranged roughly horizontally, lines close to horizontal become roughly horizontal, and lines close to vertical become roughly vertical. An image processed in this way improves the accuracy of the subsequent table position detection and the correctness of the structured spreadsheet.
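The deskew step can be sketched as follows, assuming line segments (text baselines or horizontal table lines) have already been detected as endpoint pairs; the function names and the median-angle heuristic are illustrative choices, not taken from the patent:

```python
import math

def estimate_skew_deg(segments):
    """Median angle (degrees) of the roughly horizontal segments,
    e.g. detected text baselines or horizontal table lines."""
    angles = sorted(
        a for a in (
            math.degrees(math.atan2(y2 - y1, x2 - x1))
            for (x1, y1), (x2, y2) in segments
        ) if abs(a) < 45.0          # keep only near-horizontal candidates
    )
    return angles[len(angles) // 2] if angles else 0.0

def rotate_point(point, angle_deg, center=(0.0, 0.0)):
    """Rotate `point` counter-clockwise by `angle_deg` about `center`.
    Rotating every pixel by -estimate_skew_deg(...) deskews the image."""
    t = math.radians(angle_deg)
    x, y = point[0] - center[0], point[1] - center[1]
    return (x * math.cos(t) - y * math.sin(t) + center[0],
            x * math.sin(t) + y * math.cos(t) + center[1])
```

In a full pipeline the rotation would be applied as an image warp rather than per point; the sketch only shows the geometry.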
Step S2: use an anchor-free object detection method to determine the position of the table in the image, also called the table area of the image. The anchor-free object detection method includes, for example, the CornerNet, CenterNet, ExtremeNet, DenseBox, YOLO, FSAF, FCOS, FoveaBox, RepPoints, Sparse RCNN, CentripetalNet, and SaccadeNet algorithms; after training, these algorithms can recognize tables of different layouts in an image and thereby detect the position of the table. Subsequent table line detection and spreadsheet structuring are performed only within the table area of the image.
Step S3: detect table lines in the table area of the image. Table lines include the outer border lines that separate the inside of the table from the outside, and the inner separator lines that distinguish rows and columns inside the table.
Step S4: according to the text line information obtained by performing optical character recognition (OCR) on the table area of the image, filter the table lines detected in step S3, removing false table lines to obtain clean real table lines. The text line information includes the height of a text line, the width of a single character, the angle of a text line, and so on. The optical character recognition of the table area to obtain text line information may be performed in this step or in any earlier step; it also includes the (non-preferred) alternative of performing optical character recognition on the original image to obtain text line information and then narrowing it down to the text line information within the table area of the image.
For example, some character strokes are long, or the strokes of adjacent characters join together, and these may be detected as table lines in step S3; they are false table lines and can be filtered out according to the text line height and the width of a single character. As another example, when the length of a vertical table line detected in step S3 is shorter than the text line height, that vertical line is judged to be a false table line. As yet another example, taking the angle of the text lines as horizontal fixes the vertical direction; if a table line detected in step S3 falls outside the allowable angle range of horizontal lines and also outside the allowable angle range of vertical lines, it is judged to be a false table line. The allowable angle range of horizontal lines is, for example, plus or minus 15 degrees from horizontal; the allowable angle range of vertical lines is, for example, plus or minus 15 degrees from vertical.
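The angle and length heuristics above can be combined into a single filter; this is a minimal sketch assuming detected lines are given as endpoint pairs, using the ±15° tolerance from the text (the function name is mine):

```python
import math

def is_real_table_line(p1, p2, text_line_height, tol_deg=15.0):
    """Heuristic filter: keep a detected line only if it is close to
    horizontal or close to vertical (within tol_deg), and, for vertical
    lines, at least as long as one text line's height."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    angle = abs(math.degrees(math.atan2(dy, dx))) % 180.0
    near_horizontal = angle <= tol_deg or angle >= 180.0 - tol_deg
    near_vertical = abs(angle - 90.0) <= tol_deg
    if not (near_horizontal or near_vertical):
        return False          # outside both tolerance bands: false line
    if near_vertical and math.hypot(dx, dy) < text_line_height:
        return False          # vertical stroke shorter than one text line
    return True
```

Thresholds such as `tol_deg` would in practice be tuned per document class.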
Step S5: according to the positional relationships between the table lines, classify all table lines into row and column groups. Because of poor image quality and other factors, it is inevitable that a single table line is sometimes detected as several table lines. At the same time, for layout reasons, table lines belonging to the same row or column are sometimes drawn as several separate lines. To accurately restore the row and column to which each cell belongs, this step groups the horizontal lines into row groups according to the positional relationships between them, and groups the vertical lines into column groups according to the positional relationships between them.
For example, horizontal and vertical lines are distinguished by computing the angle of each table line. The horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is small and whose horizontal extents overlap are merged and deduplicated, so that lines which logically belong to the same horizontal line but were detected as several lines are assembled into one horizontal line; this processing can be accelerated with the union-find (Union-Find) algorithm. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more horizontal lines depending on whether cells are merged. Vertical lines are handled by a similar method.
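A minimal sketch of the union-find merging described above, assuming horizontal segments are represented as (x1, x2, y) triples; the pairwise O(n²) scan and the tolerance value are illustrative simplifications:

```python
def merge_horizontal_segments(segments, y_tol=5.0):
    """Group horizontal segments (x1, x2, y) that overlap in x and sit at
    nearly the same y, using union-find, then fuse each group into one line."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            (ax1, ax2, ay), (bx1, bx2, by) = segments[i], segments[j]
            overlap = min(ax2, bx2) >= max(ax1, bx1)
            if abs(ay - by) <= y_tol and overlap:
                union(i, j)                  # same logical table line

    groups = {}
    for i, seg in enumerate(segments):
        groups.setdefault(find(i), []).append(seg)
    return [(min(s[0] for s in g),            # merged left end
             max(s[1] for s in g),            # merged right end
             sum(s[2] for s in g) / len(g))   # averaged y position
            for g in groups.values()]
```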
Step S6: construct cells according to the groups to which the table lines belong, and save the optical character recognition result within each cell as the text information of that cell. This keeps the layout of the spreadsheet consistent with that of the table in the original image. The optical character recognition of the cells to obtain recognition results may be performed in this step or in any earlier step; it includes the (preferred) approach of performing optical character recognition on the table area of the image and then narrowing the results down to each cell, and also the (non-preferred) approach of performing optical character recognition on the original image and then narrowing the results down to each cell.
Preferably, the optical character recognition of the table area of the image to obtain text line information (step S4 or any earlier step) and the optical character recognition of the cells to obtain recognition results (step S6 or any earlier step) are performed at the same time.
Step S7: judge whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells. If there are missing cells, fill them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet. A table layout that lacks outer borders, poor image quality, or an incompletely captured table all cause table lines to be lost, which prevents some cells from being constructed. By filling in cells, this step improves the integrity of the structured spreadsheet.
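Because the table is rectangular with no interior holes, the missing-cell check reduces to finding unoccupied grid positions; a sketch, assuming cells have already been assigned row and column numbers (the set-of-pairs representation is my assumption):

```python
def find_missing_cells(cells, n_rows, n_cols):
    """cells: set of (row, col) positions covered by constructed cells
    (a merged cell contributes every grid position it spans). Since the
    table grid is rectangular with no holes, any absent position marks
    a missing cell to be filled in."""
    return sorted((r, c)
                  for r in range(n_rows)
                  for c in range(n_cols)
                  if (r, c) not in cells)
```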
For example, based on the table position, text line positions, and table line positions detected in the image, it is judged whether the outer border lines of the table need to be supplemented. If so, the outer border lines are supplemented according to the existing line segment information in the orthogonal direction: if the leftmost vertical outer border line needs to be supplemented, it is obtained by fitting the left endpoints of the existing horizontal line segments, and outer border lines at other positions are handled similarly. As another example, after cell construction is complete, the row and column numbers of each cell are known; because the table is a rectangular structure that cannot contain holes, missing cells can be identified from the existing row and column numbering, and a missing cell's row and column numbers and position can be deduced from its existing neighbors and filled in. As yet another example, when the gap between adjacent cells exceeds the text line height, a missing cell is judged to exist.
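Fitting the missing leftmost vertical border from the left endpoints of the existing horizontal lines, as described above, can be sketched like this (using the median endpoint rather than a full least-squares fit — an illustrative simplification):

```python
def fit_left_border(horizontal_lines):
    """horizontal_lines: list of ((x1, y1), (x2, y2)) segments.
    Returns the supplemented left vertical border as a segment at the
    median of the left endpoints, spanning the table's vertical extent."""
    lefts = sorted(min(p1[0], p2[0]) for p1, p2 in horizontal_lines)
    x = lefts[len(lefts) // 2]                      # robust left x estimate
    ys = [p[1] for line in horizontal_lines for p in line]
    return (x, min(ys)), (x, max(ys))
```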
Referring to FIG. 2, step S3 specifically includes the following sub-steps.
Step S31: use an algorithm based on a semantic segmentation (Semantic Segmentation) network to predict and extract the table line area within the table area of the image; the table line area refers to the positions where table lines may appear, namely a set of isolated pixels. The algorithm based on the semantic segmentation network adopts, for example, a pixel classification method based on U-Net, a convolutional neural network (CNN) originally developed for biomedical image segmentation.
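Whatever segmentation network is used, its per-pixel output can be reduced to the "isolated pixels" of the table line area by thresholding the probability map; a toy sketch (the 0.5 threshold is an assumption):

```python
def extract_line_pixels(prob_map, thresh=0.5):
    """Binarize a per-pixel table-line probability map (rows of floats in
    [0, 1], e.g. the output of a U-Net-style network) into the set of
    predicted table-line pixel coordinates (row, col)."""
    return {(r, c)
            for r, row in enumerate(prob_map)
            for c, p in enumerate(row)
            if p >= thresh}
```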
Preferably, the algorithm based on the semantic segmentation network is first trained on labeled table line data, and the trained algorithm is then used to predict and extract the table line area. The labeled table line data refers to images explicitly labeled as containing table lines and images explicitly labeled as not containing table lines.
Step S32: detect the table lines in the table line area of the image by curve fitting, that is, connect the isolated pixels predicted in the previous step into line segments using a traditional curve fitting method.
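The curve fitting of step S32 can be illustrated with an ordinary least-squares line through the predicted pixels, clipped to their x-range; a real pipeline would first cluster pixels into per-line point sets, which this sketch skips, and it assumes a near-horizontal cluster (a vertical one would need the axes swapped):

```python
def fit_segment(points):
    """Least-squares fit y = a*x + b through predicted table-line pixels
    (x, y), returning the fitted segment clipped to the points' x-range.
    Assumes the points are not all at the same x (non-vertical cluster)."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)   # slope
    b = (sy - a * sx) / n                           # intercept
    x0 = min(x for x, _ in points)
    x1 = max(x for x, _ in points)
    return (x0, a * x0 + b), (x1, a * x1 + b)
```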
The method shown in FIG. 2 combines a data-driven approach (train the algorithm first, then use it for prediction and extraction) with a classical image processing algorithm (curve fitting); it not only suppresses noise effectively but also has good robustness in detecting table lines of different layouts.
Referring to FIG. 3, the device proposed by this application for converting a table in an image into a spreadsheet includes a deskew and rectification processing unit 1, a table position detection unit 2, a table line detection unit 3, a table line filtering unit 4, a table line grouping unit 5, a cell construction unit 6, and a cell completion unit 7.
The deskew and rectification processing unit 1 deskews and rectifies the image according to the text lines and ruled lines in the image.
The table position detection unit 2 determines the position of the table in the image, also called the table area of the image, using an anchor-free object detection method.
The table line detection unit 3 detects table lines in the table area of the image.
The table line filtering unit 4 removes false table lines and obtains real table lines according to the text line information obtained by performing optical character recognition on the table area of the image.
The table line grouping unit 5 classifies all table lines into row and column groups according to the positional relationships between the table lines.
The cell construction unit 6 constructs cells according to the groups to which the table lines belong, and saves the optical character recognition result within each cell as the text information of that cell.
The cell completion unit 7 judges whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells. If there are missing cells, it fills them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, yielding a complete structured spreadsheet.
The method and device proposed by this application for converting a table in an image into a spreadsheet have the following beneficial technical effects.
First, the position of the table in the image is detected first, and table line detection, optical character recognition, and cell construction are then performed only on the table area of the image (which is only part of the original image). Compared with performing these operations on the entire image, this reduces the workload and processing time of each operation.
Second, to counter the effect of image distortion on table detection, the image is deskewed and rectified before the table is detected, improving the accuracy of table detection and of the subsequent structured spreadsheet.
Third, given the diversity of table layouts, an anchor-free object detection method is used to detect tables in the image, which can accurately detect tables with different aspect ratios and different separation styles, whether or not they have table lines. A table without table lines is one in which the document content is separated according to a table layout but no table lines are drawn.
Fourth, to address table lines that are easily mis-detected or missed because of poor image quality and text interference, an algorithm based on a semantic segmentation network is first trained on labeled data and then used to detect table line areas, which removes interference; table line detection is then completed in combination with the curve fitting method.
Fifth, table lines are grouped by distance, and the row and column positions of cells are obtained from the groups to construct the cells. At the same time, to avoid cells that cannot be constructed because cell boundary lines are missing, missing cells are identified and filled in, improving the integrity of the spreadsheet.
The above are only preferred embodiments of this application and are not intended to limit it. Various modifications and variations are possible for those skilled in the art. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

  1. A method for converting a table in an image into a spreadsheet, characterized by comprising the following steps:
    Step S1: deskew and rectify the image according to the text lines and ruled lines in the image;
    Step S2: use an anchor-free object detection method to determine the position of the table in the image, also called the table area of the image;
    Step S3: detect table lines in the table area of the image;
    Step S4: according to the text line information obtained by performing optical character recognition on the table area of the image, filter the table lines detected in step S3, removing false table lines to obtain real table lines;
    Step S5: according to the positional relationships between the table lines, classify all table lines into row and column groups;
    Step S6: construct cells according to the groups to which the table lines belong, and save the optical character recognition result within each cell as the text information of that cell;
    Step S7: judge whether any cells are missing according to whether the outermost ring of cells is structurally complete and whether there are gaps between adjacent cells; if there are missing cells, fill them in at the corresponding positions so that the outermost ring of cells is complete and there are no gaps between adjacent cells, to obtain a complete structured spreadsheet.
  2. 根据权利要求1所述的将图像中表格转换为电子表格的方法,其特征是,所述步骤S1中,检测图像中的文字行及表格线的角度,并使图像中的每一行文字大致为水平排列,使表格线中的水平线大致为水平,使表格线中的竖直线大致为竖直的方式对整幅图像进行转正及矫正处理。
  3. 根据权利要求1所述的将图像中表格转换为电子表格的方法,其特征是,所述步骤S2中,所述无锚的目标检测方法包括CornerNet算法、CenterNet算法、ExtremeNet算法、DenseBox算法、YOLO算法、FSAF算法、FCOS算法、FoveaBox算法、RepPoints算法、Sparse RCNN算法、CentripetalNet算法、SaccadeNet算法的任意一种或多种。
  4. The method for converting a table in an image into a spreadsheet according to claim 1, characterized in that step S3 specifically comprises the following sub-steps:
    Step S31: predicting and extracting table-line regions within the table region of the image using an algorithm based on a semantic segmentation network, a table-line region being the positions where table lines may appear, namely a set of isolated pixels;
    Step S32: detecting table lines in the table-line regions of the image by curve fitting, that is, using a curve-fitting method to connect the isolated pixels predicted in the previous sub-step into line segments.
  5. The method for converting a table in an image into a spreadsheet according to claim 4, characterized in that in step S31, the algorithm based on a semantic segmentation network is first trained on annotated table-line data, and the trained algorithm is then used to predict and extract table-line regions.
  6. The method for converting a table in an image into a spreadsheet according to claim 1, characterized in that in step S4, the optical character recognition performed on the table region of the image to obtain text-line information is carried out in this step or in any preceding step, and also includes performing optical character recognition on the original image to obtain text-line information and then narrowing it down to the text-line information within the table region of the image.
  7. The method for converting a table in an image into a spreadsheet according to claim 1, characterized in that in step S5, the horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that horizontal lines that logically belong to the same horizontal line but were detected as multiple lines are assembled into a single horizontal line; finally, the horizontal lines of each table row form one group, a group containing one or more horizontal lines depending on whether cells are merged; the vertical lines are processed in a similar manner.
  8. The method for converting a table in an image into a spreadsheet according to claim 1, characterized in that in step S6, the optical character recognition performed on the cells to obtain recognition results is carried out in this step or in any preceding step, including performing optical character recognition on the table region of the image to obtain recognition results and then narrowing them down to the recognition results within each cell's extent, and also including performing optical character recognition on the original image to obtain recognition results and then narrowing them down to the recognition results within each cell's extent.
  9. The method for converting a table in an image into a spreadsheet according to claim 6 or 8, characterized in that the optical character recognition performed on the table region of the image to obtain text-line information and the optical character recognition performed on the cells to obtain recognition results are carried out simultaneously.
  10. An apparatus for converting a table in an image into a spreadsheet, characterized by comprising a deskewing and rectification unit, a table position detection unit, a table line detection unit, a table line filtering unit, a table line grouping unit, a cell construction unit, and a cell completion unit;
    the deskewing and rectification unit is configured to deskew and rectify the image according to the text lines and lines in the image;
    the table position detection unit is configured to apply an anchor-free object detection method to the image to determine the position of the table in the image, also called the table region of the image;
    the table line detection unit is configured to detect table lines within the table region of the image;
    the table line filtering unit is configured to remove false table lines according to text-line information obtained by performing optical character recognition on the table region of the image, obtaining true table lines;
    the table line grouping unit is configured to assign all table lines to row groups and column groups according to the positional relationships among the table lines;
    the cell construction unit is configured to construct cells according to the groups to which the table lines belong, and to store the optical character recognition result within each cell's extent as that cell's text content;
    the cell completion unit is configured to determine whether any cells are missing according to whether the outermost ring of cells of the table is structurally complete and whether there are gaps between adjacent cells; if cells are missing, it fills in cells at the corresponding positions so that the outermost ring of cells of the table is structurally complete and there are no gaps between adjacent cells, obtaining a complete, structured spreadsheet.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111134361.1A CN113688795A (zh) 2021-09-27 2021-09-27 Method and apparatus for converting a table in an image into a spreadsheet
CN202111134361.1 2021-09-27

Publications (1)

Publication Number Publication Date
WO2023045277A1 (zh)


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612487A (zh) * 2023-07-21 2023-08-18 亚信科技(南京)有限公司 Table recognition method and apparatus, electronic device, and storage medium


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6006240A (en) * 1997-03-31 1999-12-21 Xerox Corporation Cell identification in table analysis
CN109685052A (zh) * 2018-12-06 2019-04-26 泰康保险集团股份有限公司 Text image processing method and apparatus, electronic device, and computer-readable medium
CN110135218A (zh) * 2018-02-02 2019-08-16 兴业数字金融服务(上海)股份有限公司 Method, apparatus, device, and computer storage medium for image recognition
CN111814722A (zh) * 2020-07-20 2020-10-23 电子科技大学 Table recognition method and apparatus for images, electronic device, and storage medium
CN113688795A (zh) * 2021-09-27 2021-11-23 上海合合信息科技股份有限公司 Method and apparatus for converting a table in an image into a spreadsheet






Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE