CN114926851A - Method, system and storage medium for identifying table structure in table picture

Method, system and storage medium for identifying table structure in table picture

Info

Publication number
CN114926851A
CN114926851A
Authority
CN
China
Prior art keywords
table structure
cells
mask
horizontal
global
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210558928.6A
Other languages
Chinese (zh)
Inventor
喻晨曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qizhidao Network Technology Co Ltd
Original Assignee
Qizhidao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qizhidao Network Technology Co Ltd filed Critical Qizhidao Network Technology Co Ltd
Priority to CN202210558928.6A priority Critical patent/CN114926851A/en
Publication of CN114926851A publication Critical patent/CN114926851A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level, of extracted features
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V30/15 Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The application relates to a method, a system and a storage medium for identifying a table structure in a table picture. The method comprises the following steps: acquiring a feature map corresponding to the table picture; for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column; performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions; and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure. The method and the device can identify empty cells during table structure detection and avoid the visual ambiguity caused by cells spanning rows/columns.

Description

Method, system and storage medium for identifying table structure in table picture
Technical Field
The present application relates to the field of table picture identification technologies, and in particular, to a method, a system, and a storage medium for identifying a table structure in a table picture.
Background
Accurate detection of the table structure in the table picture plays a crucial role in high-precision content identification of the table picture data.
In the prior art, detection of the table structure in a table picture is usually realized by detecting grid boundaries, but this method has an obvious limitation: it cannot process tables without grid boundaries. A method that first detects the positions of text blocks and then restores the relations between bounding boxes through a graph neural network can also handle tables without grid boundaries, but it not only requires the support of a huge number of data samples, it also has difficulty obtaining empty cells, so it easily falls into the problem of visual ambiguity for cells spanning rows or columns.
Disclosure of Invention
In order to solve the problems in the prior art, in particular that empty cells are difficult to obtain when detecting the table structure of a table picture, which easily leads to visual ambiguity for cells spanning rows/columns, the present application provides a method, a system and a storage medium for identifying the table structure in a table picture.
In a first aspect, the method for identifying a table structure in a table picture provided by the present application adopts the following technical solutions: a method for identifying a table structure in a table picture comprises the following steps:
acquiring a feature map corresponding to the table picture;
for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure.
With this technical scheme, all aligned cells, including non-empty cells and empty cells, are segmented; the ground truth of an empty cell is generated according to the maximum height/width of the non-empty cells in the same row/column; global mask regression is performed; and soft labels are assigned to the pixels in all non-empty cells in the horizontal and vertical directions within the proposed bounding box region. Empty cells can therefore be identified during table structure detection, and the visual ambiguity of cells spanning rows/columns is avoided.
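As an illustration of the ground-truth generation described above, the following minimal Python sketch builds an empty cell's base box from the non-empty cells in its row and column; the function name, the (x1, y1, x2, y2) box format and the anchor arguments are illustrative assumptions, not taken from the patent:

```python
# Illustrative sketch (not the patent's reference implementation): build a
# ground-truth box for an empty cell from the non-empty cells sharing its
# row/column, using the maximum height in the row and maximum width in the
# column. Box format: (x1, y1, x2, y2).

def empty_cell_ground_truth(row_cells, col_cells, anchor_x, anchor_y):
    """row_cells / col_cells: non-empty cell boxes in the empty cell's row
    and column; (anchor_x, anchor_y): top-left corner of the empty cell."""
    max_h = max(y2 - y1 for (x1, y1, x2, y2) in row_cells)  # max height in the row
    max_w = max(x2 - x1 for (x1, y1, x2, y2) in col_cells)  # max width in the column
    return (anchor_x, anchor_y, anchor_x + max_w, anchor_y + max_h)

# Example: a row of 20-px-high cells and a column of 80-px-wide cells
print(empty_cell_ground_truth([(0, 0, 50, 20), (60, 0, 120, 20)],
                              [(130, 0, 210, 20), (130, 30, 210, 55)],
                              anchor_x=130, anchor_y=0))
```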
Preferably, the method further comprises:
for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
and performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
With this method and the mask re-scoring strategy, the advantages of local and global features in object perception can be integrated and their predictions fused: local features predict a more reliable text-region mask, while global prediction provides more reliable long-range visual information, so the finally identified table structure is more accurate. In addition, compared with prior art that recovers tables simply by pixel segmentation such as UNet, the algorithm model of the present application locates empty cells by fusing local and global information, including binary segmentation and mask regression, which reduces bias and the risk of overfitting when fitting data, so the amount of supporting data samples required is relatively small.
Preferably, for mask regression, the pixels are assigned soft labels in the horizontal and vertical directions in the following way:
assume that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask is (2, Y, X); taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
By the method, local and global mask alignment can be accurately realized.
Preferably, mask re-scoring is performed on the prediction results of the local mask and the global mask by the following method: for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
With this method, the soft-label prediction results of the local mask and the global mask undergo mask re-scoring, the local and global information is balanced, and the predicted boundaries are refined, so that the finally identified table structure is more accurate.
Preferably, the adjusted boundary coordinate points are obtained as follows:
for the proposed bounding box region, find the matching connected region in the global boundary segmentation map, and then fit two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
and compute the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
With these steps, a more accurately aligned bounding box can be obtained, making the finally recovered table structure more accurate.
Preferably, the model is trained to learn a binary segmentation task; specifically, a prediction mark is obtained by dynamically adjusting a probability map of pixels being text, a threshold map of pixels, and an approximate binary map, and is used to distinguish text regions from non-text regions.
More preferably, the prediction mark B_{i,j} of the approximate binary map is calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
Setting the parameter k facilitates optimization, allows the three output maps to be optimized better, and yields a superior final segmentation result.
Preferably, the recovery of the table structure includes a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction; and the aligned bounding boxes are connected vertically or horizontally.
Preferably, the recovery of the table structure includes an empty cell positioning step, specifically:
after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
searching all maximal cliques in the subgraph using a maximum clique search algorithm; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
By adopting the method, the empty cells can be more accurately positioned.
Preferably, the recovery of the table structure further includes an empty cell merging step, specifically:
designating the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
and, according to the cues learned by the global segmentation task and using a pixel voting mechanism to determine the result: calculating the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merging the adjacent empty cells if this proportion is greater than a preset probability threshold. In this way, the processing result of the original feature map, which contains global boundary information, is fused with the empty-cell positioning result to complete the empty cells.
Preferably, the probability threshold is obtained from a sensitivity analysis; sensitivity here refers to how the loss of each branch decreases under different probability thresholds, and the threshold yielding the lowest loss is taken as the optimal probability threshold.
Preferably, a VoVNetV2-39-FPN model is used as the backbone network to process the table picture and obtain its corresponding feature map. Under the condition of the same model inference speed, the accuracy is 5% higher than ResNet50-FPN, the inference speed is faster than ResNeXt-FPN, and small targets (i.e., where the target text font is small or the table occupies only a small part of the picture) can be detected with higher accuracy.
Preferably, the model is trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic, enabling more accurate segmentation into content cells.
In a second aspect, the present application provides a system for identifying a table structure in a table picture, which adopts the following technical solutions: a system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
In a third aspect, the present application provides an electronic device, which adopts the following technical solution:
an electronic device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to any of the preceding claims.
In a fourth aspect, the present application provides a computer-readable storage medium, which adopts the following technical solutions:
a computer-readable storage medium storing a computer program that can be loaded by a processor and executed to perform a method as any one of the preceding.
In summary, the present application includes at least one of the following beneficial technical effects:
1. The application segments all aligned cells, including non-empty cells and empty cells, generates the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column, performs global mask regression, and assigns soft labels to the pixels in all non-empty cells in the horizontal and vertical directions within the proposed bounding box region, so that empty cells can be identified during table structure detection and the visual ambiguity of cells spanning rows/columns is avoided.
2. With this method, the table structure of most table pictures can be identified with a precision of more than 90%, and based on accurate table structure identification, table content can be detected and identified with a precision of 95%.
Drawings
FIG. 1 is a flow chart of a method of an embodiment of the present application.
Fig. 2 is a schematic diagram of a table structure in a table picture identified by the method of the present application.
Fig. 3 is a flow chart of a method of another embodiment of the present application.
Detailed Description
The present application is described in further detail below with reference to figures 1-3.
The embodiment of the application discloses a method for identifying a table structure in a table picture. Referring to fig. 1, a method for identifying a table structure in a table picture includes the following steps:
S1, acquiring a feature map corresponding to the table picture;
S2, for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column (only this task can learn the empty-cell segmentation information, and the most reasonable cell-splitting scheme is captured during global boundary segmentation);
S3, performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
S4, aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, thereby obtaining adjusted boundary coordinate points and realizing recovery of the table structure.
With these steps, the empty cells in FIG. 2 can be detected and identified, and the visual ambiguity of cells spanning rows/columns can be avoided.
To further improve the recognition accuracy of the table picture, before step S1 the method may further include preprocessing the table picture, such as adjusting its definition, size, and angle.
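The text does not fix a concrete preprocessing recipe; the sketch below shows one plausible realisation with OpenCV (upscaling plus Otsu-based deskew) of the "definition, size, angle" adjustments mentioned above. The function name and target size are assumptions:

```python
import cv2
import numpy as np

# Hedged sketch only: upscale for definition/size, then estimate and remove
# the dominant skew angle of the ink pixels.
def preprocess_table_picture(img_bgr, target_long_side=1600):
    h, w = img_bgr.shape[:2]
    scale = target_long_side / max(h, w)
    img = cv2.resize(img_bgr, None, fx=scale, fy=scale,
                     interpolation=cv2.INTER_CUBIC)            # adjust size
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    bw = cv2.threshold(gray, 0, 255,
                       cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(bw > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]                        # dominant skew
    if angle > 45:
        angle -= 90
    M = cv2.getRotationMatrix2D((img.shape[1] / 2, img.shape[0] / 2), angle, 1.0)
    return cv2.warpAffine(img, M, (img.shape[1], img.shape[0]),
                          flags=cv2.INTER_CUBIC, borderValue=(255, 255, 255))
```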
To prevent boxes from overlapping, all ground-truth labels of the aligned bounding boxes can be shrunk by 5%-9%; the specific shrink ratio that works best can be determined experimentally.
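For illustration, shrinking a box symmetrically about its centre by a ratio r in the stated 5%-9% range could look like the following sketch (the helper name is an assumption):

```python
# Shrink an (x1, y1, x2, y2) box about its centre by ratio r.
def shrink_box(box, r=0.07):
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * r / 2, (y2 - y1) * r / 2
    return (x1 + dx, y1 + dy, x2 - dx, y2 - dy)

print(shrink_box((0, 0, 100, 40)))   # (3.5, 1.4, 96.5, 38.6)
```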
To further improve the accuracy of identifying the table structure in the table picture, the method may further include the following steps (as shown in fig. 3):
S2', for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
S3', performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
In step S2', in order to make the obtained text-region mask information more accurate, the feature map may be processed by an FCN before the model is trained to learn the binary segmentation task.
In order to make the original mask positions more accurate, the feature map may be processed by RoI Align before the local mask regression task is performed.
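As a sketch of this step, torchvision's RoI Align can crop a fixed-size feature window per proposed box; the tensor shapes, feature stride and 14x14 output size below are illustrative choices, not values from the patent:

```python
import torch
from torchvision.ops import roi_align

feat = torch.randn(1, 256, 64, 64)                 # (N, C, H, W) feature map
boxes = [torch.tensor([[4.0, 4.0, 28.0, 12.0]])]   # per-image (x1, y1, x2, y2)
rois = roi_align(feat, boxes, output_size=(14, 14),
                 spatial_scale=0.25, aligned=True)  # feature stride 4 assumed
print(rois.shape)                                   # torch.Size([1, 256, 14, 14])
```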
For the mask regression in step S3 and step S2', the pixels are assigned soft labels in the horizontal and vertical directions as follows:
assume that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask is (2, Y, X); taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
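The piecewise-linear pyramid label above can be rendered as a (2, H, W) soft-label mask; the following numpy sketch (with assumed function and argument names) implements it for a single box:

```python
import numpy as np

# Channel 0: horizontal pyramid label; channel 1: vertical pyramid label.
# Values rise linearly from 0 at the box border to 1 at the midpoint.
def pyramid_labels(x1, y1, x2, y2, W, H):
    x_mid, y_mid = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    xs, ys = np.arange(W), np.arange(H)
    fx = np.where(xs <= x_mid, (xs - x1) / (x_mid - x1), (x2 - xs) / (x2 - x_mid))
    fy = np.where(ys <= y_mid, (ys - y1) / (y_mid - y1), (y2 - ys) / (y2 - y_mid))
    fx = np.clip(fx, 0.0, 1.0)          # zero outside [x1, x2]
    fy = np.clip(fy, 0.0, 1.0)          # zero outside [y1, y2]
    mask = np.zeros((2, H, W))
    mask[0] = np.broadcast_to(fx, (H, W))
    mask[1] = np.broadcast_to(fy[:, None], (H, W))
    return mask

print(pyramid_labels(2, 1, 8, 5, W=10, H=6)[0, 3])  # horizontal slice through the box
```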
In step S3', the prediction results of the local mask and the global mask may be re-scored specifically by the following method:
for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions (i.e., the updated pixel-assigned soft labels described above) are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
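Because the closed-form fusion expressions are not reproduced here, the following sketch only assumes one natural one-dimensional realisation of the re-scoring idea: trust the local prediction inside the text-region box and blend linearly toward the global prediction near the aligned-box border. It is explicitly not the patent's formula:

```python
# ASSUMED fusion rule, not taken from the patent: weight local vs. global
# predictions by a pixel's position between the text-region box B1 and the
# aligned box B, for one axis at a time.
def rescore_1d(t, lo, hi, lo_txt, hi_txt, local_val, global_val):
    """t: pixel coordinate; [lo, hi]: aligned box B; [lo_txt, hi_txt]:
    text-region box B1 (assumed to lie inside B)."""
    if lo_txt <= t <= hi_txt:                       # inside text region: local
        w_local = 1.0
    elif t < lo_txt:                                # left/top margin: blend
        w_local = (t - lo) / max(lo_txt - lo, 1e-6)
    else:                                           # right/bottom margin: blend
        w_local = (hi - t) / max(hi - hi_txt, 1e-6)
    return w_local * local_val + (1.0 - w_local) * global_val

print(rescore_1d(12, 10, 40, 15, 35, local_val=0.3, global_val=0.6))  # 0.48
```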
In step S3', the adjusted boundary coordinate points are obtained as follows:
S3'1, for the proposed bounding box region, find the matching connected region in the global boundary segmentation map, and then fit two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
S3'2, compute the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
In step S2', the model is trained to learn a binary segmentation task; specifically, the prediction mark is obtained by dynamically adjusting the probability map of pixels being text, the threshold map of pixels and the approximate binary map, and is used to distinguish text regions from non-text regions, where text regions may be marked as 1 and other regions as 0.
Specifically, the prediction mark B_{i,j} of the approximate binary map can be calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude: during training, the gradient with, for example, k = 50 is much larger than with k = 1, and the gradient in wrongly predicted regions is amplified; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
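The formula above is the standard differentiable binarization; a one-line PyTorch sketch with toy values:

```python
import torch

# B = 1 / (1 + exp(-k * (P - T))): large k (e.g. 50) makes the map nearly
# binary while keeping useful gradients near the decision boundary P = T.
def approx_binary_map(P, T, k=50):
    return torch.sigmoid(k * (P - T))

P = torch.tensor([[0.9, 0.2], [0.6, 0.4]])   # probability map
T = torch.tensor([[0.5, 0.5], [0.5, 0.5]])   # learned threshold map
print(approx_binary_map(P, T))               # ~1 where P > T, ~0 where P < T
```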
The loss function L_lm of the training model for the binary segmentation task is the sum of the loss L_s of the probability-map prediction mark of pixels being text, the loss L_b of the approximate binary map, and the loss L_t of the threshold-map prediction mark of pixels:
L_lm = L_s + α·L_b + β·L_t
where α and β are hyper-parameters; the probability-map prediction loss L_s can use BCE loss; the approximate-binary-map loss L_b can use dice loss, which alleviates the imbalance between positive and negative pixel samples; and the threshold-map prediction loss L_t can use L1 loss.
In step S4, the recovery of the table structure includes a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction, and the aligned bounding boxes are connected vertically or horizontally.
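A minimal sketch of this interval-overlap matching rule (box format and function names assumed):

```python
# Boxes are (x1, y1, x2, y2). Two boxes match vertically if their x-intervals
# overlap, and horizontally if their y-intervals overlap.
def overlap_1d(a0, a1, b0, b1):
    return min(a1, b1) - max(a0, b0) > 0

def match(box_a, box_b):
    edges = []
    if overlap_1d(box_a[0], box_a[2], box_b[0], box_b[2]):
        edges.append("vertical")     # same-column connection
    if overlap_1d(box_a[1], box_a[3], box_b[1], box_b[3]):
        edges.append("horizontal")   # same-row connection
    return edges

print(match((0, 0, 50, 20), (5, 30, 60, 50)))   # ['vertical']
```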
In step S4, the recovery of the table structure further includes an empty cell positioning step, specifically:
S41, after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
S42, a maximum clique search algorithm is used to find all maximal cliques in the subgraph; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
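For illustration, the row-recovery side of steps S41-S42 using networkx's maximal-clique search; the toy boxes and edges are invented for the example:

```python
import networkx as nx

# Nodes are aligned boxes; edges connect horizontally matched boxes. Every
# maximal clique is a candidate row; cliques are sorted by mean y-centre to
# obtain row indices, and a node in several cliques spans several rows.
boxes = {0: (0, 0, 50, 20), 1: (60, 0, 120, 20),      # row 0
         2: (0, 30, 50, 50), 3: (60, 30, 120, 50),    # row 1
         4: (130, 0, 210, 50)}                         # spans rows 0 and 1

G = nx.Graph()
G.add_nodes_from(boxes)
G.add_edges_from([(0, 1), (0, 4), (1, 4), (2, 3), (2, 4), (3, 4)])

cliques = list(nx.find_cliques(G))                     # maximal cliques
cliques.sort(key=lambda c: sum((boxes[n][1] + boxes[n][3]) / 2 for n in c) / len(c))
row_ids = {}
for r, clique in enumerate(cliques):
    for n in clique:
        row_ids.setdefault(n, []).append(r)            # multi-row nodes
print(row_ids)   # node 4 receives both row indices 0 and 1
```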
In step S4, the recovery of the table structure further includes an empty cell merging step, specifically:
S441, designate the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
S442, according to the cues learned by the global segmentation task and using a pixel voting mechanism to determine the result: calculate the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merge the adjacent empty cells if this proportion is greater than a preset probability threshold. The probability threshold is obtained from a sensitivity analysis; sensitivity refers to how the loss of each branch decreases under different probability thresholds, and the threshold with the lowest loss is the optimal probability threshold, generally between 0.5 and 0.6.
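A small sketch of the pixel-voting merge test in step S442; the strip coordinates and the 0.55 threshold are illustrative:

```python
import numpy as np

# Look at the global-segmentation prediction in the strip between two
# adjacent empty cells; merge them if the fraction of pixels predicted as 1
# exceeds the threshold (0.5-0.6 per the text above).
def should_merge(global_seg, strip, threshold=0.55):
    """strip: (x1, y1, x2, y2) region separating two adjacent empty cells."""
    x1, y1, x2, y2 = strip
    patch = global_seg[y1:y2, x1:x2]
    return patch.mean() > threshold       # ratio of pixels voting "1"

seg = np.zeros((100, 100)); seg[10:40, 20:80] = 1
print(should_merge(seg, (30, 15, 60, 35)))   # True: strip lies inside one cell
```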
In step S1, the VoVNetV2-39-FPN model may be used as a backbone network to process the table pictures, so as to obtain the feature maps corresponding to the table pictures.
In step S2', the model may be trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic. Other methods may also be employed to learn the binary segmentation task, such as Ridler-Calvard thresholding.
In this application, before the step S1 of obtaining the feature map corresponding to the table picture, a process of distinguishing foreground from background may also be performed; when performing table structure recovery training, box classification and box regression may also be included after step S4. The total loss function L is then:
L = L_rpn + γ_1·(L_cl + L_box) + γ_2·(L_lm + L_lp) + γ_3·(L_seg + L_gp)
where L_rpn is the binary cross entropy for distinguishing foreground from background; γ_1, γ_2 and γ_3 are adjustable hyper-parameters; L_cl is the binary cross entropy for box classification; L_box is the box regression loss; L_lm is the binary segmentation loss of the local mask; L_lp is the local mask regression loss in the horizontal and vertical directions; L_seg is the global mask binary segmentation loss; and L_gp is the global mask regression loss in the horizontal and vertical directions. The methods for obtaining L_rpn, L_cl and L_box are prior art.
The embodiment of the application also discloses a system for identifying the table structure in the table picture. A system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above.
The embodiment of the application further discloses the electronic equipment. An electronic device comprising a memory and a processor, said memory having stored thereon a computer program that can be loaded by the processor and that executes any of the methods described above.
The electronic device may be an electronic device such as a desktop computer, a notebook computer, or a cloud server, and the electronic device includes but is not limited to a processor and a memory, for example, the electronic device may further include an input/output device, a network access device, a bus, and the like.
A processor in the present application may include one or more processing cores. The processor executes or executes the instructions, programs, code sets, or instruction sets stored in the memory, calls data stored in the memory, performs various functions of the present application, and processes the data. The Processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor. It is understood that the electronic devices for implementing the above processor functions may be other devices, and the embodiments of the present application are not limited in particular.
The memory may be an internal storage unit of the electronic device, such as a hard disk or internal memory of the electronic device, or an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the electronic device; the memory may also be a combination of an internal storage unit and an external storage device of the electronic device. The memory is used for storing the computer program and other programs and data required by the electronic device, and may also be used for temporarily storing data that has been output or will be output, which is not limited in this application.
The embodiment of the application also discloses a computer readable storage medium. A computer readable storage medium storing a computer program capable of being loaded by a processor and performing any of the methods described above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The above embodiments are preferred embodiments of the present application, and the protection scope of the present application is not limited by them; therefore, equivalent variations made according to the methods and principles of the present application are intended to fall within the protection scope of the present application.

Claims (16)

1. A method for identifying a table structure in a table picture, characterized in that the method comprises the following steps:
acquiring a feature map corresponding to the table picture;
for the feature map, performing global boundary segmentation and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
performing global mask regression, and assigning soft labels to the pixels in all non-empty cells in the horizontal and vertical directions;
and aligning the corresponding table structure in the horizontal and vertical directions according to the soft labels, so as to obtain adjusted boundary coordinate points and realize recovery of the table structure.
2. The method of claim 1, further comprising: for the feature map, performing local mask alignment, that is, training a model to learn a binary segmentation task to align the cell regions, while a local mask regression task assigns soft labels to the pixels within the proposed bounding box region in the horizontal and vertical directions;
and performing mask re-scoring on the prediction results of the local mask and the global mask to obtain updated pixel-assigned soft labels.
3. The method for identifying a table structure in a table picture according to claim 1 or 2, characterized in that, for mask regression, the pixels are assigned soft labels in the horizontal and vertical directions in the following way:
assuming that the shape of a non-empty cell or proposed bounding box is rectangular, with its upper-left and lower-right corners denoted {(x_1, y_1), (x_2, y_2)}, where 0 ≤ x_1 ≤ x_2 ≤ X and 0 ≤ y_1 ≤ y_2 ≤ Y, and the shape of the mask being (2, Y, X), taking the midpoint of the text as the maximum regression target, with training targets in (0, 1], the local or global horizontal pyramid label prediction score F_h(x) and the local or global vertical pyramid label prediction score F_v(y) of any pixel (y, x) are respectively calculated as:
F_h(x) = (x - x_1) / (x_mid - x_1) for x_1 ≤ x ≤ x_mid; F_h(x) = (x_2 - x) / (x_2 - x_mid) for x_mid < x ≤ x_2; F_h(x) = 0 otherwise;
F_v(y) = (y - y_1) / (y_mid - y_1) for y_1 ≤ y ≤ y_mid; F_v(y) = (y_2 - y) / (y_2 - y_mid) for y_mid < y ≤ y_2; F_v(y) = 0 otherwise;
where 0 ≤ x < X, 0 ≤ y < Y, x_mid = (x_1 + x_2) / 2 and y_mid = (y_1 + y_2) / 2; x and y denote the horizontal and vertical coordinates of the pixel, (x_mid, y_mid) denotes the midpoint coordinates of the bounding box, and X and Y denote the maximum width and maximum height of the proposed bounding box, respectively.
4. The method of claim 2, wherein mask re-scoring is performed on the prediction results of the local mask and the global mask by the following method:
for a certain predicted aligned bounding box B = {(x_0, y_0), (x_1, y_1)}, the bounding box (bbox) of its text-region mask is first obtained and denoted B_1 = {(x'_0, y'_0), (x'_1, y'_1)}; the matching connected regions P = {P_1, P_2, ..., P_n} are then found in the global boundary segmentation map, where P_i = (x, y) denotes a pixel; let P_0 = {p | x_0 ≤ p·x ≤ x_1, y_0 ≤ p·y ≤ y_1} denote the overlap region; then, for a point (x, y) ∈ P_0, the re-scored pyramid mask labels F(x) and F(y) in the horizontal and vertical directions are computed by fusing the local horizontal, global horizontal, local vertical and global vertical pyramid label prediction values F_h^l(x), F_h^g(x), F_v^l(y) and F_v^g(y) (the closed-form fusion expressions are given as formula images in the original publication and are not reproduced here).
5. The method of identifying a table structure in a table picture of claim 2, wherein the adjusted boundary coordinate points are obtained by:
for the proposed bounding box region, finding the matching connected region in the global boundary segmentation map, and then fitting two planes in three-dimensional space using the horizontal and vertical pyramid labels obtained from mask re-scoring, respectively;
and computing the intersection lines of the fitted planes with the zero plane to obtain the adjusted boundary coordinate points of the proposed bounding box.
6. The method of claim 2, characterized in that the model is trained to learn a binary segmentation task, and specifically a prediction mark is obtained by dynamically adjusting a probability map of pixels being text, a threshold map of pixels and an approximate binary map, for distinguishing text regions from non-text regions.
7. The method for identifying a table structure in a table picture according to claim 6, characterized in that the prediction mark B_{i,j} of the approximate binary map is calculated by the following formula:
B_{i,j} = 1 / (1 + e^(-k(P_{i,j} - T_{i,j})))
where P_{i,j} is the prediction mark of the probability map of pixels being text, and T_{i,j} is the prediction mark of the threshold map of pixels; k is an integer parameter for adjusting the gradient amplitude; when x = P_{i,j} - T_{i,j} < 0, the prediction mark B_{i,j} approaches 0, indicating a non-text region; when x = P_{i,j} - T_{i,j} > 0, the prediction mark B_{i,j} approaches 1, indicating a text region.
8. The method of claim 1, wherein the recovery of the table structure comprises a cell matching step, specifically: if a pair of aligned bounding boxes overlap in the abscissa or the ordinate, they are matched in the corresponding vertical or horizontal direction; and the aligned bounding boxes are connected vertically or horizontally.
9. The method of claim 8, wherein the recovery of the table structure comprises an empty cell positioning step, specifically:
after cell matching is completed, the connection relations between aligned bounding boxes are represented by edges, and all nodes in the same row or column form a complete subgraph;
searching all maximal cliques in the subgraph using a maximum clique search algorithm; during the search over rows or columns, all nodes belonging to the same row or column fall into the same clique, and a cell spanning multiple rows or columns appears multiple times in different cliques; the cliques are sorted by their average y-coordinate or x-coordinate, each node is labeled with the row index or column index of its clique, and nodes appearing in multiple cliques are labeled with multiple row or column indexes, thereby determining the vacant positions corresponding to empty cells.
10. The method of claim 9, wherein the recovery of the table structure further comprises an empty cell merging step, specifically:
designating the shape of a single empty cell's aligned bounding box as the maximum height/width of the cells in the same row/column;
and calculating the proportion of pixels predicted as 1 in the separating region of each pair of adjacent empty cells, and merging the adjacent empty cells if this proportion is greater than a preset probability threshold.
11. The method of claim 10, wherein the probability threshold is obtained from a sensitivity analysis; the sensitivity refers to how the loss of each branch decreases under different probability thresholds, and the threshold yielding the lowest loss is the optimal probability threshold.
12. The method for identifying the table structure in the table picture according to claim 1, wherein a VoVNetV2-39-FPN model is used as a backbone network to process the table picture, so as to obtain a feature map corresponding to the table picture.
13. The method of claim 2, wherein the model is trained to learn the binary segmentation task for aligning the cell regions using a differentiable binarization loss statistic.
14. A system for identifying a table structure in a table picture, comprising:
the feature map acquisition module is used for acquiring a feature map corresponding to the table picture;
the global boundary segmentation module is used for performing global boundary segmentation on the feature map and learning empty-cell segmentation information, that is, segmenting all aligned cells, including non-empty cells and empty cells, and generating the ground truth of empty cells according to the maximum height/width of the non-empty cells in the same row/column;
the global mask regression module is used for performing global mask regression and distributing soft labels to pixels in all non-empty cells in the horizontal and vertical directions;
and the table structure aligning and recovering module is used for aligning the corresponding table structure in the horizontal and vertical directions according to the soft label so as to recover the table structure.
15. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that can be loaded by the processor and that executes the method according to any of claims 1 to 13.
16. A computer-readable storage medium, characterized in that a computer program is stored which can be loaded by a processor and which executes a method according to any one of claims 1 to 13.
CN202210558928.6A 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture Pending CN114926851A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210558928.6A CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210558928.6A CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Publications (1)

Publication Number Publication Date
CN114926851A true CN114926851A (en) 2022-08-19

Family

ID=82810762

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210558928.6A Pending CN114926851A (en) 2022-05-21 2022-05-21 Method, system and storage medium for identifying table structure in table picture

Country Status (1)

Country Link
CN (1) CN114926851A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115331245A (en) * 2022-10-12 2022-11-11 中南民族大学 Table structure identification method based on image instance segmentation
CN115331245B (en) * 2022-10-12 2023-02-03 中南民族大学 Table structure identification method based on image instance segmentation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
Address after: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
Applicant after: Qizhi Technology Co.,Ltd.
Address before: 518051 2201, block D, building 1, bid section 1, Chuangzhi Yuncheng, Liuxian Avenue, Xili community, Xili street, Nanshan District, Shenzhen, Guangdong
Applicant before: Qizhi Network Technology Co.,Ltd.