WO2023045298A1 - Method and device for detecting table lines in an image - Google Patents

Method and device for detecting table lines in an image

Info

Publication number
WO2023045298A1
Authority
WO
WIPO (PCT)
Prior art keywords
lines
line
image
semantic segmentation
unit
Prior art date
Application number
PCT/CN2022/085400
Other languages
English (en)
French (fr)
Inventor
龙伟
郭丰俊
丁凯
龙腾
Original Assignee
上海合合信息科技股份有限公司
上海临冠数据科技有限公司
上海生腾数据科技有限公司
上海盈五蓄数据科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海合合信息科技股份有限公司, 上海临冠数据科技有限公司, 上海生腾数据科技有限公司, 上海盈五蓄数据科技有限公司 filed Critical 上海合合信息科技股份有限公司
Publication of WO2023045298A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Definitions

  • the present application relates to a method for detecting table lines in an image (picture).
  • Tables are widely used in daily life and office work, and there is a large demand for converting tables in pictures into spreadsheets, and such automatic conversion technologies usually rely heavily on the detection of table lines.
  • the table lines include the outer border line used to separate the inside of the table from the outside of the table, and the inner divider line used to distinguish rows and columns inside the table.
  • the technical problem to be solved in this application is to propose a method for detecting table lines in an image, which has the characteristics of high accuracy and can effectively assist table structure restoration.
  • the method for detecting table lines in an image proposed by this application includes the following steps.
  • Step S10 Input the image into the semantic segmentation network to obtain a set of pixels in regions adjacent to potential table lines; this pixel set refers to isolated pixel points in regions where table lines may exist.
  • Step S20 Perform line segment fitting on the pixel set in the vicinity of the table line to obtain the table line.
  • Step S30 Filter the table lines obtained in step S20 according to the text line information obtained by performing optical character recognition on the image, remove false table lines, and obtain the real table lines.
  • Step S40 According to the positional relationship between the table lines, classify all the table lines into groups of rows and columns.
  • Step S50 Construct cells according to the group to which the table lines belong, and save the OCR results within each cell as text information in the cell, and finally obtain a complete structured electronic form.
  • Step S60 If the spreadsheet structuring in step S50 fails because of a table-line detection error, extract the typical features of the failure scene, generate difficult samples from them, retrain the semantic segmentation network, and use the retrained network to repeat steps S10 to S50 until the spreadsheet structuring in step S50 succeeds.
  • the above method improves the accuracy of table line detection through repeated training of the semantic segmentation network, which helps to improve the success rate of spreadsheet structuring.
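The retrain-until-success loop of steps S10 to S60 can be sketched as follows; every function name here (`segment`, `fit_lines`, `filter_lines`, `group_lines`, `build_cells`, `retrain`) is a hypothetical placeholder for the corresponding stage, not a name taken from the patent:

```python
# Hypothetical sketch of the S10-S60 pipeline; every callable is a
# placeholder standing in for the stage the patent describes.

def structure_spreadsheet(image, segment, fit_lines, filter_lines,
                          group_lines, build_cells, retrain, max_rounds=3):
    """Run steps S10-S50, retraining the segmentation network (S60)
    after each failure caused by a table-line detection error."""
    for _ in range(max_rounds):
        pixels = segment(image)              # S10: pixels near potential lines
        lines = fit_lines(pixels)            # S20: fit segments to pixels
        lines = filter_lines(lines, image)   # S30: drop false lines via OCR info
        rows, cols = group_lines(lines)      # S40: group into rows and columns
        sheet = build_cells(rows, cols)      # S50: build cells, fill OCR text
        if sheet is not None:                # structuring succeeded
            return sheet
        segment = retrain(segment, image)    # S60: retrain on hard samples
    return None
```

The `max_rounds` cap is an added safeguard; the patent itself simply repeats until success.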
  • the semantic segmentation of the image is to classify each pixel in the image, determine the category of each point, and then perform region division;
  • the semantic segmentation network is based on a deep learning algorithm and includes any one or more of a convolutional neural network, a deep convolutional neural network, and a fully convolutional network. This is a detailed description of step S10.
  • the character line information includes any one or more of the height of the character line, the width of a single character, and the angle of the character line.
  • step S40 sorts the horizontal lines by their starting endpoints and processes them in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that lines that logically belong to one horizontal line but were detected as several are assembled into one; finally, the horizontal lines of each table row are grouped together, and a group contains one or more lines depending on whether cells are merged; vertical lines are processed in a similar way. This is a specific description of step S40.
  • the processing is accelerated using a union-find algorithm.
  • step S60 further includes the following sub-steps.
  • Step S61 Prepare a general sample synthesis tool; this difficult-sample synthesis tool has multiple adjustable parameters, and by adjusting these parameters, samples and labels with various characteristics can be generated.
  • Step S62 Collect and analyze typical features in the scenario where the electronic form structure fails due to form line detection errors.
  • Step S63 According to the typical characteristics of the failure scene obtained in step S62, adjust the parameters in the general sample synthesis tool to generate difficult samples and labels with the same characteristics.
  • Step S64 using the generated difficult samples to retrain the semantic segmentation network used to obtain the pixel set of the adjacent area of the potential table line in the image. This is a specific description of step S60.
  • the difficult-sample synthesis tool abstracts the sample generation process into five parts: basic background texture, table structure, body content and style, table line position and style, and stamp/watermark synthesis;
  • the parameters of the basic background texture part include any one or more of background image, background color, texture pattern, and texture color;
  • the parameters of the table structure part include any one or more of the number of tables, size, position, number of rows and columns, and merged cells;
  • the parameters of the body content and style part include any one or more of font size, font, color, position, and alignment;
  • the parameters of the table line position and style part include any one or more of the type/style, thickness, and pixel area of the table lines;
  • the parameters of the stamp/watermark synthesis part include any one or more of the number, position, angle, and color of the stamp or watermark.
  • the typical features of the failure scene include any one or more of: characters overlapping lines due to printing misalignment or handwriting, false lines caused by vertically repeated long-stroke Chinese characters, missing lines caused by stamp occlusion, stamp edges wrongly recognized as table lines, table lines hard to distinguish from the background because of strong light during shooting, cells separated by colored lines or color blocks in samples with complex textures, adjacent cells separated by two parallel lines, and very short table lines missed in low, dense cells.
  • the general sample synthesis tool first generates the base image from the parameters of the basic background texture part, then the table structure from the table structure parameters, then the text content and style from the body content and style parameters, then the frame lines and their style from the table line position and style parameters, then superimposes the stamp or watermark according to the stamp/watermark synthesis parameters, and finally composites the image, table structure, body text, table lines, and stamp/watermark into one annotated image.
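The fixed composition order described above can be sketched as a small configurable pipeline; the part names and parameter keys below are illustrative assumptions, since the tool's real interface is not disclosed:

```python
# Illustrative sketch of the difficult-sample synthesis order described above.
# The part names and parameter keys are assumptions, not the tool's own API.

SYNTHESIS_ORDER = [
    "background",   # base background texture (image, color, pattern)
    "structure",    # table count, size, position, rows/cols, merged cells
    "text",         # body content and style (font, size, color, alignment)
    "lines",        # table-line type/style, thickness, pixel area
    "stamp",        # stamp/watermark count, position, angle, color
]

def synthesize(params, renderers):
    """Compose one annotated sample by applying each part's renderer in the
    fixed order; every renderer returns a (layer, labels) pair."""
    layers, labels = [], {}
    for part in SYNTHESIS_ORDER:
        layer, part_labels = renderers[part](params.get(part, {}))
        layers.append(layer)         # layers are composited in order
        labels[part] = part_labels   # annotations accumulate per part
    return layers, labels
```

Because the tool emits labels alongside each layer, the resulting sample is annotated for free, which is exactly what lets it replace manual data labeling.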
  • the present application also proposes a device for detecting table lines in an image, including a semantic segmentation unit, a line segment fitting unit, a table line filtering unit, a table line grouping unit, a spreadsheet structuring unit, and a retraining unit.
  • the semantic segmentation unit is used to obtain a set of pixels in the vicinity of potential table lines in the input image by using a semantic segmentation network.
  • the line segment fitting unit is used to perform line segment fitting on the pixel set in the vicinity of the table line to obtain the table line.
  • the table line filtering unit is used to filter the table lines according to the text line information obtained by optical character recognition on the image, remove false table lines, and obtain real table lines.
  • the table line grouping unit is used for grouping all table lines into groups of rows and columns according to the positional relationship between the table lines.
  • the spreadsheet structuring unit is used to construct cells according to the groups to which the table lines belong, save the optical character recognition results within each cell as the text of that cell, and finally obtain a complete structured spreadsheet.
  • the retraining unit is used to extract the typical features of the failure scene when the spreadsheet structuring unit fails because of a table-line detection error, generate difficult samples from them, and retrain the semantic segmentation network; the retrained network is sent to the semantic segmentation unit, and the semantic segmentation unit, line segment fitting unit, table line filtering unit, table line grouping unit, and spreadsheet structuring unit execute repeatedly until the spreadsheet structuring unit succeeds.
  • the above-mentioned device improves the accuracy of table-line detection through repeated training of the semantic segmentation network, which helps to improve the success rate of spreadsheet structuring.
  • the form line is obtained by combining the semantic segmentation network and the line segment fitting, which effectively reduces the problem of false lines and missing lines in the form line detection;
  • For table-line detection in difficult scenes such as stamp occlusion, light-colored lines, colored lines, color blocks, dotted lines, double-line separation, and ultra-short lines, typical features of failure scenes are extracted and difficult samples are generated to repeatedly train the semantic segmentation network, thereby improving table-line detection accuracy.
  • FIG. 1 is a schematic flowchart of a method for detecting table lines in an image proposed by the present application.
  • FIG. 2 is a schematic subflow diagram of step S60 in FIG. 1 .
  • FIG. 3 is a schematic structural diagram of a device for detecting table lines in an image proposed by the present application.
  • 10 is a semantic segmentation unit
  • 20 is a line segment fitting unit
  • 30 is a table line filtering unit
  • 40 is a table line grouping unit
  • 50 is a spreadsheet structuring unit
  • 60 is a retraining unit.
  • the method for detecting table lines in an image proposed by this application includes the following steps.
  • Step S10 Input the image into the Semantic Segmentation network to obtain a set of pixels in the vicinity of potential table lines, that is, some isolated pixel points in areas where table lines may exist.
  • the semantic segmentation of the image is to classify each pixel in the image, determine the category of each point, and then divide the region, which is an existing technology.
  • Common semantic segmentation networks are based on deep learning algorithms, including convolutional neural network (CNN), deep convolutional neural network, and fully convolutional network (FCN). This step can effectively remove non-table lines in the image, remove text or background stripe interference, and effectively reduce false lines and missing lines in table line detection.
  • CNN: convolutional neural network
  • FCN: fully convolutional network
  • Step S20 Perform line segment fitting on the pixel set in the vicinity of the table line to obtain the table line, that is, use the traditional line segment fitting method to connect the isolated pixel points predicted in the previous step into a line segment.
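As one toy stand-in for the "traditional line segment fitting" mentioned above (the patent does not specify the exact method), isolated pixels lying on nearly the same row can be greedily joined into horizontal segments:

```python
# A toy stand-in for "traditional line segment fitting": isolated pixels that
# lie on nearly the same y and are close in x are joined into one horizontal
# segment. The tolerances are illustrative, not taken from the patent.

def fit_horizontal_segments(pixels, y_tol=2, gap_tol=5):
    """pixels: iterable of (x, y) points; returns [(x0, y, x1), ...]."""
    segments = []
    for x, y in sorted(pixels, key=lambda p: (p[1], p[0])):
        for seg in segments:
            x0, sy, x1 = seg
            if abs(y - sy) <= y_tol and x0 - gap_tol <= x <= x1 + gap_tol:
                seg[0], seg[2] = min(x0, x), max(x1, x)  # extend the segment
                break
        else:
            segments.append([x, y, x])                   # start a new segment
    return [tuple(s) for s in segments]
```

In practice a robust fitter (e.g. a Hough-style voting scheme or least-squares fit) would replace this greedy scan; the point is only that isolated predicted pixels become connected segments.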
  • Step S30 According to the text line information obtained by optical character recognition (OCR, Optical character recognition) on the image, filter the table lines obtained in step S20, remove false table lines, and obtain clean real table lines.
  • the character line information includes the height of the character line, the width of a single character, the angle of the character line, and the like.
  • some character strokes are long, or the strokes of adjacent characters join together, so they may be detected as table lines in step S20; these are false table lines and can be filtered out according to the text line height and the width of a single character.
  • If the length of a vertical table line detected in step S20 is smaller than the text line height, that vertical line is judged to be a false table line.
  • If the angle of the text line is taken as horizontal, the vertical direction is thereby determined; if a table line detected in step S20 exceeds both the allowable angle range of horizontal lines and the allowable angle range of vertical lines, it is judged to be a false table line.
  • the allowable angle range of the horizontal line is, for example, plus or minus 15 degrees from the horizontal line.
  • the allowable angle range of the vertical line is, for example, plus or minus 15 degrees of the vertical line.
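The two filters just described, the length-versus-text-line-height check and the plus-or-minus-15-degree angle windows, can be sketched in a few lines; the segment format and tolerance handling are assumptions:

```python
import math

# Sketch of the step-S30 filters described above: a vertical line shorter than
# the text-line height is false, and a line outside both the horizontal and
# vertical angle windows is false. Segment format (x0, y0, x1, y1) is assumed.

def is_false_line(seg, text_line_height, angle_tol=15.0):
    x0, y0, x1, y1 = seg
    angle = math.degrees(math.atan2(y1 - y0, x1 - x0)) % 180.0
    near_horizontal = angle <= angle_tol or angle >= 180.0 - angle_tol
    near_vertical = abs(angle - 90.0) <= angle_tol
    if not (near_horizontal or near_vertical):
        return True                    # outside both allowable angle windows
    if near_vertical and abs(y1 - y0) < text_line_height:
        return True                    # vertical line shorter than a text line
    return False
```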
  • Step S40 According to the positional relationships between the table lines, assign all table lines to row and column groups. Because of poor image quality and other factors, it is inevitable that one table line is sometimes detected as several. Meanwhile, for formatting reasons, table lines belonging to the same row or column may genuinely be split into several lines. This step accurately restores the row and column each cell belongs to: according to the positional relationships between the horizontal lines, the horizontal lines are grouped into rows; according to the positional relationships between the vertical lines, the vertical lines are grouped into columns.
  • horizontal and vertical lines are distinguished by computing the angle of each table line. The horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that lines that logically belong to one horizontal line but were detected as several are assembled into one; the processing can be accelerated with the Union-Find algorithm. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more lines depending on whether cells are merged. Vertical lines are processed in a similar way.
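The merge-and-group pass for horizontal lines can be sketched with a small union-find structure; the segment format (x0, y, x1) and the vertical-distance threshold are illustrative assumptions:

```python
# Sketch of the step-S40 grouping: horizontal segments whose vertical distance
# is small and whose horizontal extents overlap are unioned into one logical
# row line. The threshold and segment format (x0, y, x1) are assumptions.

def group_rows(segments, y_tol=3):
    parent = list(range(len(segments)))

    def find(i):                       # find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    for i, (ax0, ay, ax1) in enumerate(segments):
        for j, (bx0, by, bx1) in enumerate(segments[:i]):
            overlap = ax0 <= bx1 and bx0 <= ax1      # horizontal overlap
            if overlap and abs(ay - by) <= y_tol:    # close vertical distance
                union(i, j)

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(segments[i])
    return list(groups.values())
```

Union-find makes the merging near-linear after the pairwise overlap tests, which is why the text singles it out as the acceleration step.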
  • Step S50 Construct cells according to the group to which the table lines belong, and save the OCR results within each cell as text information in the cell, and finally obtain a complete structured electronic form. This makes the layout of the spreadsheet match the layout of the table in the original image.
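For the simple case without merged cells, step S50 can be illustrated by treating the row and column group coordinates as a grid and dropping each OCR result into the cell containing its anchor point (a deliberate simplification of the general case; the coordinate and OCR formats are assumptions):

```python
# Minimal sketch of step S50 without merged cells: row/column group
# coordinates define a grid, and each OCR result lands in the cell that
# contains its anchor point. All data formats here are assumptions.

def build_cells(row_ys, col_xs, ocr_results):
    """row_ys, col_xs: sorted line coordinates; ocr_results: [(x, y, text)].
    Returns a 2-D list of cell texts."""
    rows, cols = len(row_ys) - 1, len(col_xs) - 1
    grid = [["" for _ in range(cols)] for _ in range(rows)]
    for x, y, text in ocr_results:
        for r in range(rows):
            if not (row_ys[r] <= y < row_ys[r + 1]):
                continue
            for c in range(cols):
                if col_xs[c] <= x < col_xs[c + 1]:
                    grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid
```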
  • Step S60 If the electronic form structuring in step S50 fails and is caused by a form line detection error, then extract the typical features of the failure scene, and generate difficult samples, retrain the semantic segmentation network, and use the re- The trained semantic segmentation network repeats step S10 to step S50 until the electronic form is successfully structured in step S50.
  • the step S60 further includes the following sub-steps.
  • Step S61 Prepare a general sample synthesis tool; this difficult-sample synthesis tool controls, through parameters, the presence, size, position, and style of the graphic elements in the generated samples. When generating samples, one only needs to adjust the parameters according to the desired sample characteristics to obtain samples and labels with those characteristics, avoiding costly data collection and labeling.
  • the difficult sample synthesis tool abstracts the sample generation process into five parts: basic background texture, table structure, text content and style, table line position and style, and stamp watermark synthesis.
  • the parameters of the table structure part include the number of tables, size, position, number of rows and columns, merged cells, etc.
  • the parameters of the body content and style include font size, font, color, position, alignment, etc.
  • the parameters of the table line position and style include the type, style, thickness, and pixel area of the table line.
  • the parameters of the stamp watermark synthesis part include the number, position, angle, color, etc. of the stamp watermark.
  • Step S62 Collect and analyze typical features in the scenario where the electronic form structure fails due to form line detection errors.
  • Typical features of the failure scenes include, for example, characters overlapping lines due to printing misalignment or handwriting, false lines caused by vertically repeated long-stroke Chinese characters, missing lines caused by stamp occlusion, stamp edges wrongly recognized as table lines, table lines hard to distinguish from the background because of strong light during shooting, cells separated by colored lines or color blocks in samples with complex textures, adjacent cells separated by two parallel lines, and very short table lines missed in low, dense cells.
  • Step S63 According to the typical characteristics of the failure scene obtained in step S62, adjust the parameters in the general sample synthesis tool to generate difficult samples and labels with the same characteristics.
  • the general sample synthesis tool also marks the generated difficult samples while generating the difficult samples.
  • Data labeling refers to processing the training data of artificial-intelligence algorithms with labeling tools operated by data annotators.
  • the general sample synthesis tool first generates the base image from the parameters of the basic background texture part, then the table structure from the table structure parameters, then the text content and style from the body content and style parameters, then the frame lines and their style from the table line position and style parameters, then superimposes the stamp or watermark according to the stamp/watermark synthesis parameters, and finally composites the image, table structure, body text, table lines, stamp/watermark, and so on into one image, which carries annotations of the table structure, table lines, and other content.
  • Step S64 using the generated difficult samples to retrain the semantic segmentation network used to obtain the pixel set of the adjacent area of the potential table line in the image.
  • the retrained semantic segmentation network will be used to repeat step S10 to step S50, and can bring more accurate line segment fitting results, thereby improving the success rate of overall spreadsheet structuring.
  • the device proposed by the present application for detecting table lines in an image comprises a semantic segmentation unit 10, a line segment fitting unit 20, a table line filtering unit 30, a table line grouping unit 40, a spreadsheet structuring unit 50, and a retraining unit 60.
  • the semantic segmentation unit 10 is configured to use a semantic segmentation network to obtain, in the input image, a set of pixels in areas adjacent to potential table lines, that is, some isolated pixel points in areas where table lines may exist.
  • the line segment fitting unit 20 is used to perform line segment fitting on the pixel set in the vicinity of the table line to obtain the table line, that is, the traditional line segment fitting method is used to connect the isolated pixel points predicted in the previous step into a line segment.
  • the table line filtering unit 30 is used to filter the table lines according to the text line information acquired by optical character recognition on the image, remove false table lines, and obtain real table lines.
  • the table line grouping unit 40 is used for grouping all table lines into groups of rows and columns according to the positional relationship between the table lines.
  • the spreadsheet structuring unit 50 is used to construct cells according to the groups to which the table lines belong, save the optical character recognition results within each cell as the text of that cell, and finally obtain a complete structured spreadsheet.
  • the retraining unit 60 is used to extract the typical features of the failure scene when the spreadsheet structuring unit 50 fails because of a table-line detection error, generate difficult samples from them, and retrain the semantic segmentation network.
  • the semantic segmentation network after retraining is sent to the semantic segmentation unit 10, and is repeatedly executed by the semantic segmentation unit 10, the line segment fitting unit 20, the table line filtering unit 30, the table line grouping unit 40, and the spreadsheet structuring unit 50 until the electronic form structuring unit 50 executes the electronic form structuring successfully.
  • the method and device for detecting table lines in images proposed by this application combine a data-driven approach (the semantic segmentation network is trained before use, and difficult samples generated from failure scenes are used for retraining before reuse) with line segment fitting, and therefore have strong robustness.


Abstract

This application discloses a method for detecting table lines in an image. Step S10: input the image into a semantic segmentation network to obtain a set of pixels in regions adjacent to potential table lines. Step S20: perform line segment fitting on this pixel set to obtain table lines. Step S30: remove false table lines to obtain the real table lines. Step S40: assign all table lines to row and column groups. Step S50: obtain a complete structured spreadsheet. Step S60: if the spreadsheet structuring of step S50 fails because of a table-line detection error, extract the typical features of the failure scene, generate difficult samples from them, and retrain the semantic segmentation network. By repeatedly training the semantic segmentation network, the method improves the accuracy of table-line detection and helps raise the success rate of spreadsheet structuring.

Description

Method and device for detecting table lines in an image
Technical field
This application relates to a method for detecting table lines in an image.
Background art
Tables are widely used in daily life and office work, and there is great demand for converting tables in images into spreadsheets; such automatic conversion usually depends heavily on the detection of table lines. Table lines include the outer border lines that separate the inside of the table from the outside, and the inner divider lines that distinguish rows and columns within the table.
Image quality, shooting angle, uneven lighting, curled or creased paper, misaligned text regions, interference from stamps and watermarks, and the diversity of table-line colors, thicknesses, and styles all pose great challenges to table-line detection, and in turn affect the accuracy of table structure restoration.
Summary of the invention
The technical problem to be solved by this application is to propose a method for detecting table lines in an image that has high accuracy and can effectively assist table structure restoration.
To solve the above technical problem, the method for detecting table lines in an image proposed by this application includes the following steps. Step S10: input the image into a semantic segmentation network to obtain a set of pixels in regions adjacent to potential table lines; this pixel set refers to isolated pixel points in regions where table lines may exist. Step S20: perform line segment fitting on the pixel set adjacent to the table lines to obtain the table lines. Step S30: filter the table lines obtained in step S20 according to the text line information obtained by performing optical character recognition on the image, remove false table lines, and obtain the real table lines. Step S40: according to the positional relationships between the table lines, assign all table lines to row and column groups. Step S50: construct cells according to the groups to which the table lines belong, save the optical character recognition results within each cell as the text of that cell, and finally obtain a complete structured spreadsheet. Step S60: if the spreadsheet structuring of step S50 fails because of a table-line detection error, extract the typical features of the failure scene, generate difficult samples from them, retrain the semantic segmentation network, and repeat steps S10 to S50 with the retrained network until the spreadsheet structuring of step S50 succeeds. By repeatedly training the semantic segmentation network, this method improves the accuracy of table-line detection and helps raise the success rate of spreadsheet structuring.
Further, in step S10, semantic segmentation of the image classifies every pixel, determines the category of each point, and thereby partitions the image into regions; the semantic segmentation network is based on a deep learning algorithm and includes any one or more of a convolutional neural network, a deep convolutional neural network, and a fully convolutional network. This is a detailed description of step S10.
Further, in step S30, the text line information includes any one or more of the height of the text line, the width of a single character, and the angle of the text line.
Further, in step S40, the horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that lines that logically belong to one horizontal line but were detected as several are assembled into one; finally, the horizontal lines of each table row are grouped together, and a group contains one or more lines depending on whether cells are merged; vertical lines are processed in a similar way. This is a specific description of step S40.
Optionally, in step S40, the processing is accelerated using a union-find algorithm.
Further, step S60 includes the following sub-steps. Step S61: prepare a general sample synthesis tool; this difficult-sample synthesis tool has multiple adjustable parameters, and by adjusting them, samples and labels with various characteristics can be generated. Step S62: collect and analyze the typical features of scenes in which spreadsheet structuring fails because of table-line detection errors. Step S63: according to the typical features of the failure scenes obtained in step S62, adjust the parameters of the general sample synthesis tool to generate difficult samples and labels with the same characteristics. Step S64: use the generated difficult samples to retrain the semantic segmentation network that obtains the set of pixels adjacent to potential table lines in the image. This is a specific description of step S60.
Further, in step S61, the difficult-sample synthesis tool abstracts the sample generation process into five parts: basic background texture, table structure, body content and style, table line position and style, and stamp/watermark synthesis. The parameters of the basic background texture part include any one or more of background image, background color, texture pattern, and texture color; the parameters of the table structure part include any one or more of the number of tables, size, position, number of rows and columns, and merged cells; the parameters of the body content and style part include any one or more of font size, font, color, position, and alignment; the parameters of the table line position and style part include any one or more of the type/style, thickness, and pixel area of the table lines; the parameters of the stamp/watermark synthesis part include any one or more of the number, position, angle, and color of the stamp or watermark.
Further, in step S62, the typical features of the failure scenes include any one or more of: characters overlapping lines due to printing misalignment or handwriting, false lines caused by vertically repeated long-stroke Chinese characters, missing lines caused by stamp occlusion, stamp edges wrongly recognized as table lines, table lines hard to distinguish from the background because of strong light during shooting, cells separated by colored lines or color blocks in samples with complex textures, adjacent cells separated by two parallel lines, and very short table lines missed in low, dense cells.
Further, in step S63, the general sample synthesis tool first generates the base image from the parameters of the basic background texture part, then the table structure from the table structure parameters, then the text content and style from the body content and style parameters, then the frame lines and their style from the table line position and style parameters, then superimposes the stamp or watermark according to the stamp/watermark synthesis parameters, and finally composites the image, table structure, body text, table lines, and stamp/watermark into one image with annotations.
This application also proposes a device for detecting table lines in an image, comprising a semantic segmentation unit, a line segment fitting unit, a table line filtering unit, a table line grouping unit, a spreadsheet structuring unit, and a retraining unit. The semantic segmentation unit uses a semantic segmentation network to obtain, in the input image, a set of pixels in regions adjacent to potential table lines. The line segment fitting unit performs line segment fitting on this pixel set to obtain the table lines. The table line filtering unit filters the table lines according to the text line information obtained by optical character recognition on the image, removes false table lines, and obtains the real table lines. The table line grouping unit assigns all table lines to row and column groups according to the positional relationships between them. The spreadsheet structuring unit constructs cells according to the groups to which the table lines belong, saves the optical character recognition results within each cell as the text of that cell, and finally obtains a complete structured spreadsheet. The retraining unit, when the spreadsheet structuring unit fails because of a table-line detection error, extracts the typical features of the failure scene, generates difficult samples from them, and retrains the semantic segmentation network; the retrained network is sent to the semantic segmentation unit, and the semantic segmentation unit, line segment fitting unit, table line filtering unit, table line grouping unit, and spreadsheet structuring unit execute repeatedly until the spreadsheet structuring unit succeeds. By repeatedly training the semantic segmentation network, this device improves the accuracy of table-line detection and helps raise the success rate of spreadsheet structuring.
The technical effects achieved by this application are: obtaining table lines by combining a semantic segmentation network with line segment fitting effectively reduces false and missing lines in table-line detection; and for table-line detection in difficult scenes such as characters overlapping lines, false lines from repeated characters, stamp occlusion, light-colored lines, colored lines, color blocks, dotted lines, double-line separation, and ultra-short lines, extracting typical features of failure scenes and generating difficult samples for repeated training of the semantic segmentation network improves detection accuracy.
Brief description of the drawings
The features and performance of the present invention are further described by the following embodiments and their accompanying drawings.
FIG. 1 is a schematic flowchart of the method for detecting table lines in an image proposed by this application.
FIG. 2 is a schematic sub-flowchart of step S60 in FIG. 1.
FIG. 3 is a schematic structural diagram of the device for detecting table lines in an image proposed by this application.
Reference numerals in the figures: 10 is the semantic segmentation unit, 20 is the line segment fitting unit, 30 is the table line filtering unit, 40 is the table line grouping unit, 50 is the spreadsheet structuring unit, and 60 is the retraining unit.
Preferred embodiments of the invention
Referring to FIG. 1, the method for detecting table lines in an image proposed by this application includes the following steps.
Step S10: input the image into a semantic segmentation network to obtain a set of pixels in regions adjacent to potential table lines, that is, isolated pixel points in regions where table lines may exist. Semantic segmentation of an image classifies every pixel, determines the category of each point, and thereby partitions the image into regions; this is an existing technique. Common semantic segmentation networks are based on deep learning algorithms, such as convolutional neural networks (CNN), deep convolutional neural networks, and fully convolutional networks (FCN). This step can effectively remove non-table lines in the image, remove interference from text or background stripes, and effectively reduce false and missing lines in table-line detection.
Step S20: perform line segment fitting on the pixel set adjacent to the table lines to obtain the table lines, that is, connect the isolated pixel points predicted in the previous step into line segments using a traditional line segment fitting method.
Step S30: filter the table lines obtained in step S20 according to the text line information obtained by optical character recognition (OCR) on the image, remove false table lines, and obtain clean, real table lines. The text line information includes the height of the text line, the width of a single character, the angle of the text line, and so on.
For example, some character strokes are long, or the strokes of adjacent characters join together, and may therefore be detected as table lines in step S20; these are false table lines and can be filtered out according to the text line height and the width of a single character. As another example, if the length of a vertical table line detected in step S20 is smaller than the text line height, that vertical line is judged to be a false table line. As a further example, if the angle of the text line is taken as horizontal, the vertical direction is thereby determined; if a table line detected in step S20 exceeds both the allowable angle range of horizontal lines and the allowable angle range of vertical lines, it is judged to be a false table line. The allowable angle range of a horizontal line is, for example, plus or minus 15 degrees around horizontal; the allowable angle range of a vertical line is, for example, plus or minus 15 degrees around vertical.
Step S40: according to the positional relationships between the table lines, assign all table lines to row and column groups. Because of poor image quality and other factors, it is inevitable that one table line is sometimes detected as several. Meanwhile, for formatting reasons, table lines belonging to the same row or column may genuinely be split into several lines. This step accurately restores the row and column each cell belongs to: horizontal lines are grouped into rows according to their positional relationships, and vertical lines are grouped into columns according to theirs.
For example, horizontal and vertical lines are distinguished by computing the angle of each table line. The horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that lines that logically belong to one horizontal line but were detected as several are assembled into one; the processing can be accelerated with a union-find algorithm. Finally, the horizontal lines of each table row are grouped together, and a group contains one or more lines depending on whether cells are merged. Vertical lines are processed in a similar way.
Step S50: construct cells according to the groups to which the table lines belong, save the optical character recognition results within each cell as the text of that cell, and finally obtain a complete structured spreadsheet. This keeps the layout of the spreadsheet consistent with the layout of the table in the original image.
Step S60: if the spreadsheet structuring of step S50 fails because of a table-line detection error, extract the typical features of the failure scene, generate difficult samples from them, retrain the semantic segmentation network, and repeat steps S10 to S50 with the retrained network until the spreadsheet structuring of step S50 succeeds.
Referring to FIG. 2, step S60 includes the following sub-steps.
Step S61: prepare a general sample synthesis tool; this difficult-sample synthesis tool controls, through parameters, the presence, size, position, and style of the graphic elements in the generated samples. When generating samples, one only needs to adjust the parameters according to the desired sample characteristics to obtain samples and labels with those characteristics, avoiding costly data collection and labeling.
As an example, the difficult-sample synthesis tool abstracts the sample generation process into five parts: basic background texture, table structure, body content and style, table line position and style, and stamp/watermark synthesis; by flexibly configuring the parameters of each part, all kinds of samples can be generated. The parameters of the table structure part include the number of tables, size, position, number of rows and columns, merged cells, and so on. The parameters of the body content and style part include font size, font, color, position, alignment, and so on. The parameters of the table line position and style part include the type/style, thickness, and pixel area of the table lines. The parameters of the stamp/watermark synthesis part include the number, position, angle, and color of the stamp or watermark.
Step S62: collect and analyze the typical features of scenes in which spreadsheet structuring fails because of table-line detection errors. Typical features of such failure scenes include, for example, characters overlapping lines due to printing misalignment or handwriting, false lines caused by vertically repeated long-stroke Chinese characters, missing lines caused by stamp occlusion, stamp edges wrongly recognized as table lines, table lines hard to distinguish from the background because of strong light during shooting, cells separated by colored lines or color blocks in samples with complex textures, adjacent cells separated by two parallel lines, and very short table lines missed in low, dense cells.
Step S63: according to the typical features of the failure scenes obtained in step S62, adjust the parameters of the general sample synthesis tool to generate difficult samples and labels with the same characteristics. The general sample synthesis tool labels the difficult samples while generating them. Data labeling refers to processing the training data of artificial-intelligence algorithms with labeling tools operated by data annotators.
As an example, the general sample synthesis tool first generates the base image from the parameters of the basic background texture part, then the table structure from the table structure parameters, then the text content and style from the body content and style parameters, then the frame lines and their style from the table line position and style parameters, then superimposes the stamp or watermark according to the stamp/watermark synthesis parameters, and finally composites the image, table structure, body text, table lines, stamp/watermark, and so on into one image, which carries annotations of the table structure, table lines, and other content.
Step S64: use the generated difficult samples to retrain the semantic segmentation network that obtains the set of pixels adjacent to potential table lines in the image. The retrained network is used to repeat steps S10 to S50 and yields more accurate line segment fitting results, thereby improving the overall success rate of spreadsheet structuring.
Referring to FIG. 3, the device for detecting table lines in an image proposed by this application comprises a semantic segmentation unit 10, a line segment fitting unit 20, a table line filtering unit 30, a table line grouping unit 40, a spreadsheet structuring unit 50, and a retraining unit 60.
The semantic segmentation unit 10 uses a semantic segmentation network to obtain, in the input image, a set of pixels in regions adjacent to potential table lines, that is, isolated pixel points in regions where table lines may exist.
The line segment fitting unit 20 performs line segment fitting on the pixel set adjacent to the table lines to obtain the table lines, that is, it connects the isolated pixel points predicted in the previous step into line segments using a traditional line segment fitting method.
The table line filtering unit 30 filters the table lines according to the text line information obtained by optical character recognition on the image, removes false table lines, and obtains the real table lines.
The table line grouping unit 40 assigns all table lines to row and column groups according to the positional relationships between them.
The spreadsheet structuring unit 50 constructs cells according to the groups to which the table lines belong, saves the optical character recognition results within each cell as the text of that cell, and finally obtains a complete structured spreadsheet.
The retraining unit 60, when the spreadsheet structuring unit 50 fails because of a table-line detection error, extracts the typical features of the failure scene, generates difficult samples from them, and retrains the semantic segmentation network. The retrained network is sent to the semantic segmentation unit 10, and the semantic segmentation unit 10, line segment fitting unit 20, table line filtering unit 30, table line grouping unit 40, and spreadsheet structuring unit 50 execute repeatedly until the spreadsheet structuring unit 50 succeeds.
The method and device for detecting table lines in an image proposed by this application combine a data-driven approach (the semantic segmentation network is trained before use, and difficult samples generated from failure scenes are used for retraining before reuse) with line segment fitting, and therefore have strong robustness.
The above are only preferred embodiments of this application and are not intended to limit it. For those skilled in the art, this application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

  1. A method for detecting table lines in an image, characterized by comprising the following steps:
    Step S10: inputting the image into a semantic segmentation network to obtain a set of pixels in regions adjacent to potential table lines; this pixel set refers to isolated pixel points in regions where table lines may exist;
    Step S20: performing line segment fitting on the pixel set adjacent to the table lines to obtain the table lines;
    Step S30: filtering the table lines obtained in step S20 according to the text line information obtained by performing optical character recognition on the image, removing false table lines, and obtaining the real table lines;
    Step S40: according to the positional relationships between the table lines, assigning all table lines to row and column groups;
    Step S50: constructing cells according to the groups to which the table lines belong, saving the optical character recognition results within each cell as the text of that cell, and finally obtaining a complete structured spreadsheet;
    Step S60: if the spreadsheet structuring of step S50 fails because of a table-line detection error, extracting the typical features of the failure scene, generating difficult samples from them, retraining the semantic segmentation network, and repeating steps S10 to S50 with the retrained network until the spreadsheet structuring of step S50 succeeds.
  2. The method for detecting table lines in an image according to claim 1, characterized in that, in step S10, semantic segmentation of the image classifies every pixel in the image and determines the category of each point, thereby partitioning the image into regions; the semantic segmentation network is based on a deep learning algorithm and includes any one or more of a convolutional neural network, a deep convolutional neural network, and a fully convolutional network.
  3. The method for detecting table lines in an image according to claim 1, characterized in that, in step S30, the text line information includes any one or more of the height of the text line, the width of a single character, and the angle of the text line.
  4. The method for detecting table lines in an image according to claim 1, characterized in that, in step S40, the horizontal lines are sorted by their starting endpoints and processed in a loop: horizontal lines whose vertical distance is close and whose horizontal extents overlap are merged and deduplicated, so that lines that logically belong to one horizontal line but were detected as several are assembled into one; finally, the horizontal lines of each table row are grouped together, and a group contains one or more lines depending on whether cells are merged; vertical lines are processed in a similar way.
  5. The method for detecting table lines in an image according to claim 4, characterized in that, in step S40, the processing is accelerated using a union-find algorithm.
  6. 根据权利要求1所述的在图像中检测表格线的方法,其特征是,所述步骤S60进一步包括如下子步骤;
    步骤S61:准备通用样本合成工具,所述困难样本合成工具具有多个可调整的参数,通过调整这些参数可生成各种特征的样本及标注;
    步骤S62:收集并分析由于表格线检测错误造成的电子表格结构化失败的场景下的典型特征;
    步骤S63:根据步骤S62得到的失败场景的典型特征,调整通用样本合成工具中的参数以生成具有相同特征的困难样本及标注;
    步骤S64:利用所生成的困难样本重新训练用于在图像中获得潜在表格线临近区域像素集合的所述语义分割网络。
  7. 根据权利要求6所述的在图像中检测表格线的方法,其特征是,所述步骤S61中,所述困难样本合成工具将样本生成过程抽象为基础背景纹理、表格结构、正文内容与样式、表格线位置与样式、图章水印合成这五个部分;基础背景纹理部分的参数包括背景图片、背景颜色、纹理图案、纹理颜色的任一种或多种;表格结构部分的参数包括表格数目、大小、位置、行列数、合并单元格情况的任一种或多种;正文内容与样式部分的参数包括字号、字体、颜色、位置、对齐方式的任一种或多种;表格线位置与样式部分的参数包括表格线的类型风格、粗细、像素区域的任一种或多种;图章水印合成部分的参数包含图章水印的数目、位置、角度、色彩的任一种或多种。
  8. 根据权利要求6所述的在图像中检测表格线的方法,其特征是,所述步骤S62中,所述失败场景的典型特征包括印刷错位或手写造成的字压线、长笔划汉字纵向重复排列造成的假线、图章遮挡引起的漏线、错误地将图章边缘识别为表格线、强光线拍摄造成表线与背景难区分、复杂纹理样本中通过彩色线或颜色块分隔单元格、使用两根平行线分隔邻接单元格、低矮稠密单元格中很短的表格线识别丢失的任一种或多种。
  9. The method for detecting table lines in an image according to claim 7, characterized in that in step S63, the general-purpose sample synthesis tool first generates a base image from the parameters of the base background texture part, then generates the table structure from the parameters of the table structure part, then generates text content and styles from the parameters of the body text content and style part, then generates border lines and their styles from the parameters of the table-line position and style part, then overlays stamps and watermarks from the parameters of the stamp/watermark compositing part, and finally composites the base image, table structure, body text, table lines, and stamps/watermarks into a single picture with annotations.
  10. An apparatus for detecting table lines in an image, characterized in that it comprises a semantic segmentation unit, a line-segment fitting unit, a table-line filtering unit, a table-line grouping unit, a spreadsheet structuring unit, and a retraining unit;
    the semantic segmentation unit is configured to use a semantic segmentation network to obtain, from an input image, a set of pixels in regions adjacent to potential table lines;
    the line-segment fitting unit is configured to perform line-segment fitting on the set of pixels adjacent to potential table lines to obtain table lines;
    the table-line filtering unit is configured to filter the table lines according to text-line information obtained by optical character recognition of the image, removing false table lines to obtain real table lines;
    the table-line grouping unit is configured to assign all table lines to row groups and column groups according to the positional relationships among them;
    the spreadsheet structuring unit is configured to construct cells according to the groups to which the table lines belong, and to save the optical character recognition results within the extent of each cell as the text of that cell, finally obtaining a complete structured spreadsheet;
    the retraining unit is configured to, when the spreadsheet structuring unit fails to structure the spreadsheet and the failure is caused by a table-line detection error, extract the typical features of the failure scenario, generate hard samples from them, and retrain the semantic segmentation network; the retrained semantic segmentation network is fed to the semantic segmentation unit, and the semantic segmentation unit, line-segment fitting unit, table-line filtering unit, table-line grouping unit, and spreadsheet structuring unit run again until the spreadsheet structuring unit succeeds.
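Claims 4 and 5 above describe assembling fragmented horizontal detections into one logical line, accelerated with union-find. A minimal sketch under stated assumptions: the `(x1, y, x2)` tuple representation and the `y_tol` tolerance are illustrative choices, and the pairwise overlap test is written quadratically for clarity, whereas sorting by starting endpoint permits the linear scan the claim describes:

```python
def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

def merge_horizontal(segments, y_tol=3.0):
    """segments: list of (x1, y, x2) horizontal pieces with x1 <= x2.
    Pieces whose vertical distance is within y_tol and whose horizontal
    extents overlap are unioned, then each group is assembled into one
    spanning line."""
    segs = sorted(segments, key=lambda s: s[0])  # sort by left endpoint
    parent = list(range(len(segs)))
    for i in range(len(segs)):
        for j in range(i + 1, len(segs)):
            (x1a, ya, x2a), (x1b, yb, x2b) = segs[i], segs[j]
            if abs(ya - yb) <= y_tol and x1b <= x2a and x1a <= x2b:
                union(parent, i, j)
    # assemble each union-find group into a single spanning line
    groups = {}
    for i, s in enumerate(segs):
        groups.setdefault(find(parent, i), []).append(s)
    merged = []
    for group in groups.values():
        xs = [x for s in group for x in (s[0], s[2])]
        y = sum(s[1] for s in group) / len(group)
        merged.append((min(xs), y, max(xs)))
    return sorted(merged)
```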
PCT/CN2022/085400 2021-09-27 2022-04-06 Method and apparatus for detecting table lines in an image WO2023045298A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111134050.5A CN113723362A (zh) 2021-09-27 2021-09-27 Method and apparatus for detecting table lines in an image
CN202111134050.5 2021-09-27

Publications (1)

Publication Number Publication Date
WO2023045298A1 (zh) 2023-03-30

Family

ID=78685034

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/085400 WO2023045298A1 (zh) 2021-09-27 2022-04-06 Method and apparatus for detecting table lines in an image

Country Status (2)

Country Link
CN (1) CN113723362A (zh)
WO (1) WO2023045298A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116311310A (zh) * 2023-05-19 2023-06-23 之江实验室 General table recognition method and apparatus combining semantic segmentation and sequence prediction
CN117475459A (zh) * 2023-12-28 2024-01-30 杭州恒生聚源信息技术有限公司 Table information processing method and apparatus, electronic device, and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113723362A (zh) 2021-09-27 2021-11-30 上海合合信息科技股份有限公司 Method and apparatus for detecting table lines in an image

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110163198A (zh) * 2018-09-27 2019-08-23 腾讯科技(深圳)有限公司 Table recognition and reconstruction method, apparatus, and storage medium
CN110363095A (zh) * 2019-06-20 2019-10-22 华南农业大学 Recognition method for table fonts
US20200042785A1 (en) * 2018-07-31 2020-02-06 International Business Machines Corporation Table Recognition in Portable Document Format Documents
CN110796031A (zh) * 2019-10-11 2020-02-14 腾讯科技(深圳)有限公司 Artificial-intelligence-based table recognition method and apparatus, and electronic device
CN111860502A (zh) * 2020-07-15 2020-10-30 北京思图场景数据科技服务有限公司 Picture table recognition method and apparatus, electronic device, and storage medium
CN112507876A (zh) * 2020-12-07 2021-03-16 数地科技(北京)有限公司 Semantic-segmentation-based method and apparatus for parsing images of ruled tables
CN112528863A (zh) * 2020-12-14 2021-03-19 中国平安人寿保险股份有限公司 Table structure recognition method and apparatus, electronic device, and storage medium
CN113723362A (zh) * 2021-09-27 2021-11-30 上海合合信息科技股份有限公司 Method and apparatus for detecting table lines in an image

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101676930A (zh) * 2008-09-17 2010-03-24 北大方正集团有限公司 Method and apparatus for recognizing table cells in a scanned image
CN107943956A (zh) * 2017-11-24 2018-04-20 北京金堤科技有限公司 Page conversion method, apparatus, and page conversion device
US11593552B2 (en) * 2018-03-21 2023-02-28 Adobe Inc. Performing semantic segmentation of form images using deep learning
CN110569489B (zh) * 2018-06-05 2023-08-11 北京国双科技有限公司 Method and apparatus for parsing table data in PDF files
CN112396047B (zh) * 2020-10-30 2022-03-08 中电金信软件有限公司 Training sample generation method and apparatus, computer device, and storage medium
CN113221743B (zh) * 2021-05-12 2024-01-12 北京百度网讯科技有限公司 Table parsing method and apparatus, electronic device, and storage medium

Also Published As

Publication number Publication date
CN113723362A (zh) 2021-11-30

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE