CN116311310A - Universal form identification method and device combining semantic segmentation and sequence prediction - Google Patents



Publication number
CN116311310A
Authority
CN
China
Prior art keywords
line, text, cell, semantic segmentation, cells
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310566244.5A
Other languages
Chinese (zh)
Inventor
李炜铭
邵研
段曼妮
王永恒
巫英才
王芷霖
王超
刘冰洁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202310566244.5A priority Critical patent/CN116311310A/en
Publication of CN116311310A publication Critical patent/CN116311310A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • G06V30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173 Classification techniques

Abstract

The invention discloses a universal table recognition method and device combining semantic segmentation and sequence prediction. The method comprehensively uses the YOLO, VGG, UNet, SLANet, DBNet, and SVTR deep learning models, combines a two-stage scheme based on semantic segmentation with an end-to-end scheme based on sequence prediction, and can recognize various tables in picture format, including wired tables, less-wired tables, and wireless tables. The method can identify both the structural information and the text information of a table. The picture types that can be recognized include scanned pictures and pictures taken from any angle. The invention trains a single target detection model that performs table detection and table classification at the same time, and, addressing the inaccurate recognition of wired tables by existing table recognition methods, provides a simple and effective method for merging cells that improves the TEDS metric on the TableBank dataset by 9.34 percentage points over the end-to-end scheme, reaching 79.24%.

Description

Universal form identification method and device combining semantic segmentation and sequence prediction
Technical Field
The invention relates to the field of image recognition, in particular to a universal form recognition method and device combining semantic segmentation and sequence prediction.
Background
Documents are usually composed of text, pictures, tables, patterns, etc., and are one of the most common information carriers in daily life. For paper documents, obtaining the elements of a document requires scanning or photographing it and converting it into an editable electronic document. Most commonly, OCR (Optical Character Recognition) technology is used, but OCR can only extract text from a document picture; other techniques are needed for elements other than text. Tables often carry the key information in a document, but owing to their varied structures and styles, table recognition has long been a research difficulty in document reconstruction. Table recognition includes two subtasks: table detection and table structure recognition. Table detection means locating the table within the document; table structure recognition means extracting the structural information and text information of the table. Table detection can be regarded as a pre-task of table structure recognition, which in turn can be regarded as adding the recognition of geometric properties on top of OCR. According to whether borders exist, tables can be divided into tables with a complete frame, tables with a partial frame, and tables without a frame, respectively called wired tables, less-wired tables, and wireless tables. According to the font, tables are divided into printed tables and handwritten tables.
Table structure recognition comprises text recognition and structure recognition. Text recognition uses general text-line recognition techniques to recognize the text information in a table, typically with a two-stage approach of first detecting text lines and then recognizing them. Text line detection methods are mainly regression-based or segmentation-based. Regression-based methods build on general object detection algorithms (e.g., Faster R-CNN, YOLO, SSD) and optimize the bbox size and convolution kernel size to fit text targets. Segmentation-based methods obtain the text region through pixel-level instance segmentation and then derive the bounding box through post-processing; in contrast to regression-based methods, they can detect curved lines of text. Text line recognition is generally divided into 4 phases: image correction (rectifying inclined or curved text into horizontal text), visual feature extraction (typically using a CNN to extract image features), sequence feature extraction (extracting a sequence containing contextual information from the visual features, typically using BiLSTM or a Transformer), and post-processing (predicting characters from the sequence features, typically using CTC (Connectionist Temporal Classification) or attention).
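As an illustration of the CTC post-processing mentioned above, the following is a minimal greedy decoding sketch (collapse repeated symbols, then drop blanks); the per-frame argmax indices and the charset are hypothetical inputs, not taken from any model in this document.

```python
BLANK = 0  # index conventionally reserved for the CTC blank symbol

def ctc_greedy_decode(frame_indices, charset):
    """Map per-frame label indices to a text string: collapse runs of the
    same index, then remove blanks. `charset` holds the non-blank symbols,
    so index i maps to charset[i - 1]."""
    out = []
    prev = None
    for idx in frame_indices:
        if idx != prev and idx != BLANK:
            out.append(charset[idx - 1])
        prev = idx
    return "".join(out)
```

A blank between two identical indices keeps them as two characters, which is exactly how CTC distinguishes repeated letters from a held prediction.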
Structure recognition methods can be classified into traditional methods and deep learning methods according to whether a deep learning model is used. Traditional methods are mainly based on rules and classical image processing, such as dilation, erosion, binarization, detecting straight lines and contours, computing intersection points, merging cells, and filtering by size. There are top-down schemes that detect the table area and then repeatedly cut and split it to obtain cells, as well as bottom-up schemes that detect text blocks and then determine cells from the candidate table lines and their intersections. For wired tables with a simple structure (such as csv-style tables) these usually work well, and rules can conveniently be added or sub-modules replaced, but the effect is poor for tables with complex structure, especially less-wired and wireless tables. Deep learning methods include semantic segmentation, object detection, sequence prediction, and other approaches. Representative of semantic segmentation is Tencent's table recognition scheme (https:// closed. Content. Com/vector/1452973), which separates line boxes (including invisible line boxes) from non-line-box regions by training a segmentation model; this plays the role of the straight-line detection in traditional methods, after which cells are determined with a top-down scheme similar to the traditional ones. It recognizes wired tables well, but generalizes poorly to less-wired and wireless tables, because the invisible lines it has learned often cause line boxes to be detected where no line actually appears.
The object detection approach is represented by Hikvision's LGPMA (Local and Global Pyramid Mask Alignment) scheme (Qiao, Liang, et al. "LGPMA: Complicated table structure recognition with local and global pyramid mask alignment." International Conference on Document Analysis and Recognition. Springer, Cham, 2021.), which splits two branches off the instance segmentation model Mask R-CNN: one branch learns local alignment boundaries and the other learns global alignment boundaries; the soft masks of the two branches are fused, and the final table structure is obtained through post-processing. The LGPMA scheme uses a large-scale backbone network, imposes certain limits on hardware and output size in training and inference, and is difficult to deploy in practice. Representative of sequence prediction methods is the TableMaster solution (Ye, J., Qi, X., He, Y., et al. "PingAn-VCGroup's Solution for ICDAR 2021 Competition on Scientific Literature Parsing Task B: Table Recognition to HTML." 2021.), which uses an encoder-decoder architecture to convert images into html tag sequences that can conveniently be restored to a table. An improved ResNet network serves as the encoder to extract visual features; after a Transformer layer in the decoding stage, two branches split off: one performs supervised learning of the table structure sequence, and the other performs supervised learning of cell position regression within the table. After the cell positions are obtained, they are matched against the text boxes obtained through text line detection, and the text information is filled into the corresponding html tags according to the matching results.
Sequence prediction methods genuinely address the difficulty of less-wired and wireless tables, but because they depend entirely on data, rules cannot be added nor sub-modules replaced; tag misalignment easily occurs on wired tables, and such methods are far less robust than rule-based methods when handling merged cells.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a universal table recognition method and device combining semantic segmentation and sequence prediction. The method can identify both the structural information and the text information of a table. The picture types that can be recognized include scanned pictures and pictures taken from any angle. The invention trains a single target detection model that performs table detection and table classification at the same time, and, addressing the inaccurate recognition of wired tables by existing table recognition methods, provides a simple and effective method for merging cells that improves the TEDS (Tree Edit Distance Similarity) metric on the TableBank dataset by 9.34 percentage points over the end-to-end scheme, reaching 79.24%. In addition, image blocks of similar size are grouped by clustering and fed into the GPU (Graphics Processing Unit) together for computation, achieving a QPS (Queries Per Second) of 1.144, i.e., second-level response.
The aim of the invention is realized by the following technical scheme: a general table identification method combining semantic segmentation and sequence prediction comprises the following steps:
(1) Inputting a form image;
(2) Preprocessing the table image input in step (1); the preprocessing operations include min-max normalization, increasing contrast, converting to a grayscale map, and using a classification model to detect the image orientation and rotate the image to the upright orientation;
(3) Extracting a table area in the table image processed in the step (2) by using the target detection model and judging whether the table is a wired table or other tables;
(4) If the table is one of the other tables, obtaining the html label of each cell and the coordinates of each cell by using an end-to-end deep learning model;
(5) If the table is a wired table, detecting the straight line in the table area extracted in the step (3) by using a table straight line detection model, and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html labels of each cell and coordinates of each cell;
(6) Detecting a text line area in the table area extracted in the step (3) by using a text detection model, and obtaining coordinates of the text line area; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm to obtain correct text information;
(7) And (3) matching the cell coordinates obtained in the step (4) or the step (5) with the coordinates of the text line region extracted in the step (6), and after matching is correct, filling the correct text information into html labels of the corresponding cells, and converting the html labels into excel output.
Further, in the step (4), the end-to-end deep learning model is a sequence prediction model.
Further, in the step (5), the table straight line detection model is a semantic segmentation model.
Further, in the step (5), the completing of the incomplete line segment is specifically:
dividing the detected straight lines into horizontal lines and vertical lines according to slope; a line whose absolute slope is at most 1 is horizontal, and one whose absolute slope exceeds 1 is vertical;
for each horizontal line, calculating its intersection with the vertical line closest to its endpoint; if the intersection is not on the horizontal line and the distance between the intersection and the endpoint is within a threshold, replacing the endpoint with the intersection to complete the line segment; each vertical line is completed in the same way.
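The completion rule above can be sketched as follows, under the simplifying assumption of axis-aligned segments (the helper `complete_horizontal` and its tuple representation are illustrative, not the patent's implementation).

```python
def complete_horizontal(h_seg, v_segs, tol):
    """Extend a horizontal segment so it meets nearby vertical lines.

    h_seg: ((x1, y), (x2, y)) with x1 <= x2; v_segs: list of ((x, y1), (x, y2)).
    Axis-aligned segments are assumed for brevity; slope-classified lines
    reduce to this case after rectification.
    """
    (x1, y), (x2, _) = h_seg
    for (vx, _), _ in v_segs:
        if vx < x1 and x1 - vx <= tol:    # vertical line just left of the start
            x1 = vx                        # snap the start point to the intersection
        elif vx > x2 and vx - x2 <= tol:  # vertical line just right of the end
            x2 = vx
    return ((x1, y), (x2, y))
```

Vertical segments would be completed symmetrically against the horizontal lines.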
Further, in the step (5), four vertices of the table are extracted according to the boundary line, specifically:
taking the intersection of the leftmost vertical line and the topmost horizontal line, or the intersection of their extension lines, as the upper-left vertex; the upper-right, lower-left, and lower-right vertices are determined in the same way.
Further, in the step (5), overlapping cells are eliminated, specifically:
calculating the IOU between every pair of cells and, if the IOU exceeds a threshold, removing the cell with the larger area; the threshold is 0.5.
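A minimal sketch of this elimination rule; the box-tuple representation and helper names are illustrative assumptions.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

def drop_overlapping(cells, thr=0.5):
    """Remove the larger of any pair of cells whose IoU exceeds thr."""
    def area(c):
        return (c[2] - c[0]) * (c[3] - c[1])
    keep = list(cells)
    removed = True
    while removed:          # restart scanning after each removal
        removed = False
        for i in range(len(keep)):
            for j in range(i + 1, len(keep)):
                if iou(keep[i], keep[j]) > thr:
                    keep.pop(i if area(keep[i]) > area(keep[j]) else j)
                    removed = True
                    break
            if removed:
                break
    return keep
```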
Further, in the step (5), the merging of cells specifically includes:
determining rows and columns according to the horizontal and vertical lines; using DBSCAN to cluster the abscissas and the ordinates of the upper-left vertices of all cells, the numbers of clusters being the column count col and the row count row respectively; taking the median of each abscissa cluster as that cluster's abscissa x, sorting all x in ascending order, and averaging each pair of adjacent x to obtain the abscissa of each column; the ordinate of each row is obtained in the same way; establishing a row x col matrix whose entries are the abscissa and ordinate values;
computing from the matrix the number of columns c and rows r that each cell spans and the subscript (x, y) of its upper-left vertex in the matrix, thereby obtaining the html label corresponding to each cell.
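The coordinate-clustering step can be sketched in simplified form. Because the clustered values are one-dimensional, a gap-based grouping is used below as a stand-in for the DBSCAN named in the text; helper names and the `eps` parameter are illustrative.

```python
from statistics import median

def cluster_1d(values, eps):
    """Group sorted 1-D values whose gaps are <= eps (a 1-D stand-in for
    DBSCAN) and return one representative per cluster: the median,
    as the merging step prescribes."""
    clusters, current = [], []
    for v in sorted(values):
        if current and v - current[-1] > eps:
            clusters.append(current)
            current = []
        current.append(v)
    clusters.append(current)
    return [median(c) for c in clusters]

def grid_lines(coords, eps):
    """Cluster the top-left coordinates, then take midpoints of adjacent
    cluster medians as the column (or row) boundary coordinates."""
    xs = cluster_1d(coords, eps)
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]
```

Run once on the abscissas and once on the ordinates; the cluster counts give col and row.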
Further, in the step (6),
the text line detection model is a semantic segmentation model; before detection, clustering the length and width of the cells by using DBSCAN;
the text line recognition model is a sequence prediction model;
the text correction algorithm is an edit distance algorithm; a dynamic programming algorithm can also be used to accelerate the computation.
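The correction step can be sketched with the standard dynamic-programming edit distance; the vocabulary-lookup helper `correct` is a hypothetical illustration of how a misread token would be snapped to its nearest dictionary entry.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (rolling 1-D array)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # prev holds dp[i-1][j-1]; dp[j] still holds dp[i-1][j]
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def correct(token, vocabulary):
    """Replace token by the nearest vocabulary word (hypothetical helper)."""
    return min(vocabulary, key=lambda w: edit_distance(token, w))
```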
A universal form recognition device combining semantic segmentation and sequence prediction, comprising:
the preprocessing module is used for preprocessing the input form image; the preprocessing operation comprises min-max normalization, contrast increase, conversion into gray level images, detection of image direction by using a classification model and rotation of the image to a positive direction; the preprocessing module is provided with an interface for inputting images;
the table detection module is used for extracting a table area in the table image and judging whether the table area is a wired table or other tables; if the table is a wired table, a two-stage module is used for processing; if the table is the other table, using an end-to-end module for processing;
the end-to-end module is used for obtaining the html label of each cell and the coordinates of each cell by using an end-to-end deep learning model;
the two-stage module is used for detecting the straight line in the table area by using a table straight line detection model and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html tag sequences of the cells and cell coordinates;
an OCR module for detecting text line regions in the form region using a text detection model; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm;
and the matching module is used for matching the cell coordinates obtained in the end-to-end module or the two-stage module with the text line area extracted from the OCR module, and after the matching is correct, filling the text information in the text line area into the html label of the corresponding cell, and converting the html label into excel output.
A universal form recognition device combining semantic segmentation and sequence prediction comprises one or more processors, and is used for realizing the universal form recognition method combining semantic segmentation and sequence prediction.
A computer readable storage medium having stored thereon a program which, when executed by a processor, is adapted to carry out a method of universal form recognition combining semantic segmentation and sequence prediction as described above.
The above-described orientation classification model, straight-line segmentation model, end-to-end deep learning model, text line detection model, and text line recognition model are not limited to a specific model or algorithm; any similar model or algorithm implementing the same function may be substituted.
The beneficial effects of the invention are as follows: the embodiment of the invention combines the accuracy and robustness of the two-stage scheme on wired-table recognition with the generalization of the end-to-end scheme on less-wired and wireless tables, uses a single target detection model to detect the table position and identify the table category at the same time, and switches between the two-stage and end-to-end schemes according to the table category, thereby achieving universal table recognition. The two-stage scheme optimizes the key difficulty of cell merging and improves the TEDS metric on the TableBank dataset by 9.34 percentage points over the end-to-end scheme, reaching 79.24%. In addition, image blocks of similar size are grouped by clustering and fed into the GPU together for computation, achieving a QPS of 1.144, i.e., second-level response.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort to a person skilled in the art.
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of an original table in an embodiment of the present invention;
FIG. 3 is a chart of a colspan matrix in an embodiment of the present invention;
FIG. 4 is a rowspan matrix diagram in an embodiment of the invention;
FIG. 5 is a diagram of an html sequence derived from an original table, a colspan matrix, and a rowspan matrix in accordance with the present invention;
fig. 6 is a hardware configuration diagram of the present invention.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatus and methods consistent with aspects of the invention as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
The present invention will be described in detail with reference to the accompanying drawings. The features of the examples and embodiments described below may be combined with each other without conflict.
The invention discloses a universal form identification method combining semantic segmentation and sequence prediction, which comprises the following steps:
(1) Acquiring a form image;
(2) Preprocessing the table image obtained in the step (1);
the preprocessing operations include min-max normalization, increasing contrast, turning into gray maps, and using classification models to detect image orientation and rotate the image to the positive orientation. Specifically, a four classifier was trained using the VGG model (Simonyan, k., and a. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image recognition." arXiv e-print (2014)), and the output classification was: 0 °, 90 °, 180 ° and 270 °. VGG is a classical CNN architecture that learns more abstract features by stacking multiple 3x3 convolutional and pooling layers to make the model depth deeper than it was before.
(3) Extracting a table area in the table image processed in the step (2) by using the target detection model and judging whether the table is a wired table or other tables;
specifically, YOLO (Wang, c.y., a. Bochkovskiy, and h. Liao. "YOLO v7: trafficable bag-of-freebies sets new state-of-the-art for real-time object detectors" (2022)) was used. YOLO is a single-stage target detection algorithm, and can obtain the coordinates and types of the bounding box of the object by running once, which is much faster than two-stage algorithms such as the master RCNN, and can realize real-time detection. It divides the image into several grids, which are responsible for predicting an object if the center of the object falls in a grid. Each grid needs to predict bboxes of multiple different aspect ratios and scales, each bbox needs to predict boundary coordinates, confidence that objects exist, and classification category, and finally non-maxima suppression is used to remove redundant bboxes.
YOLO was trained using the public dataset TableBank. TableBank is an image-based table detection and recognition dataset created from internet Word and LaTeX documents using a weakly supervised approach, containing 471K high-quality annotated tables altogether. The annotations include the table bounding-box coordinates and the html tag sequence of the table structure, but not the table category; nor does any current public dataset carry labels for table categories. For YOLO to detect table bounding boxes and identify table categories (wired or other) simultaneously, the tables in TableBank must be labeled by category. The specific method: first detect straight lines using UNet; since the tables in TableBank are all axis-aligned, divide the lines into horizontal and vertical by slope; if the number of horizontal or vertical lines is smaller than 2, a wired table cannot be formed, and the table is marked as an other table. This simple screening initially yields 41k wired tables, which after one round of manual screening leaves 16k wired tables. During training, to expose the model to more variation, a random projective transformation matrix is used to augment the images, with the bounding-box coordinates transformed in the same way; the brightness, contrast, and scale of the images are changed randomly, and random erasing covers partial regions. Since wired tables are only a small fraction of the dataset, a greater weight is applied to the wired-table class in the loss function, increasing the cost of misclassifying a wired table.
(4) If the table is one of the other tables, obtaining the html label of each cell and the coordinates of each cell by using an end-to-end deep learning model;
if other tables, an end-to-end deep learning model is used. Specifically, SLANet (https:// github. Com/PaddlePaddle/PaddleOCR/blob/release/2.6/ppstructure/docs/PP-structure v2_introduction. Md#42-% -E8% A1% A8% E6% A0% BC%E8% AF%86%E5% 88%AB) of a flying oar is used, a lightweight model LCNet is used as a backbone model, four feature maps with different scales are extracted, a CSP-PAN module is used for fusing features of a high layer and a low layer, the fused features of the last layer are input into a feature decoding module SLAHead with the structure aligned with position information, then two branches are divided, one branch carries out supervised learning of an html tag sequence of a table structure, and the other branch carries out supervised learning of cell coordinate regression in the table.
(5) If the table is a wired table, detecting the straight line in the table area extracted in the step (3) by using a table straight line detection model, and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html labels of each cell and coordinates of each cell;
lines in the table region are detected using a semantic segmentation model, specifically UNet segmentation lines and background. UNet is a classical semantic segmentation model consisting of two parts, symmetric downsampling for feature extraction and upsampling for restoring the original resolution of the feature map. In the up sampling process, feature maps of different stages in the down sampling process are fused through jump connection, so that edge features of a shallow layer can be well recovered.
The segmented straight lines are divided by slope into horizontal lines (absolute slope at most 1) and vertical lines (absolute slope greater than 1).
Incomplete line segments are completed. An incomplete segment means that two lines that should intersect do not, owing to detection imperfections; without completion, cells would be missed when extracting cell contours. For each horizontal line, the intersection with the vertical line closest to its endpoint is computed; if the intersection is not on the horizontal line and its distance to the endpoint is within a threshold, the endpoint is replaced by the intersection to complete the segment. The threshold is set to 1/200 of the sum of the image height and width.
The four vertices, upper-left, upper-right, lower-left, and lower-right, are determined from the boundary lines. The boundary lines are the leftmost vertical line, the rightmost vertical line, the topmost horizontal line, and the bottommost horizontal line; the intersection of the leftmost vertical line and the topmost horizontal line, or the intersection of their extension lines, is taken as the upper-left vertex, and the other three vertices are computed in the same manner.
Perspective transformation is performed according to the four vertexes to correct images photographed from different angles.
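The correction relies on the standard four-point perspective transform. Below is a minimal numpy sketch that solves for the 3x3 matrix from the four detected vertices and their rectified targets; it is equivalent in spirit to OpenCV's getPerspectiveTransform, but it is an illustrative reimplementation, not the patent's code.

```python
import numpy as np

def perspective_matrix(src, dst):
    """Solve the 3x3 perspective matrix H (with h33 = 1) mapping the four
    source points onto the four destination points, via an 8x8 linear system:
    u = (h11 x + h12 y + h13) / (h31 x + h32 y + 1), similarly for v."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b += [u, v]
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def warp_point(H, p):
    """Apply H to a single 2-D point (homogeneous divide included)."""
    x, y, w = H @ np.array([p[0], p[1], 1.0])
    return (x / w, y / w)
```

With the four table vertices as `src` and the corners of the output rectangle as `dst`, every pixel of the table image can then be remapped through H.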
The detected line segments are drawn into a mask, and the mask contours are extracted as cells. Overlapping cells are then removed: the IOU between every pair of cells is computed, and if the IOU exceeds 0.5, the cell with the larger area is removed.
The merging of cells is specifically as follows:
All possible rows and columns are determined from the horizontal and vertical lines. DBSCAN clusters the abscissas and ordinates of the upper-left vertices of all cells; the numbers of clusters are the column count col and row count row. The median of each abscissa cluster gives that cluster's abscissa x; all x are sorted in ascending order, and the average of each pair of adjacent x gives the abscissa of each column. The ordinate of each row is obtained in the same way. A row x col matrix is established whose entries are the abscissa and ordinate values.
For each cell, the number of columns c it spans, the number of rows r it spans, and the index (x, y) of its upper left vertex are computed from the abscissa and ordinate matrices. If html tags are used to represent the table, the tag of the cell at row x, column y is <td rowspan="r" colspan="c"></td>.
Figs. 2 to 5 are schematic diagrams of cell merging. Fig. 2 shows the original table; the broken lines are the extensions of all solid lines. The broken lines divide the original table into a 4-row, 4-column grid: cell a is split into 4 sub-cells, spanning 4 rows and 1 column with index (0, 0); cell b is split into 3 sub-cells, spanning 1 row and 3 columns with index (0, 1). In this way the row and column indices and the spans of any cell can be determined, and the geometric relationships can be conveniently encoded with html tags. For this example, the implementation uses two 4 x 4 matrices (Figs. 3 and 4) to store the rowspan and colspan of each cell, respectively; both matrices are initialized to all 1s. Taking cell a as an example, its rowspan is 4, so 4 is written at coordinate (0, 0) of the rowspan matrix and the 3 entries below it in the same column are set to 0, because no cells start at those positions. Likewise, 3 is written at coordinate (0, 1) of the colspan matrix and the 2 entries to its right in the same row are set to 0. When generating the html representation, the corresponding elements of the two matrices are examined together: if either element is 0, no cell starts at that position; otherwise the tag at that position is <td rowspan="r" colspan="c"></td>. Fig. 5 shows the html representation of the table, where rowspan and colspan attributes equal to 1 are omitted.
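The matrix bookkeeping of Figs. 2 to 5 can be sketched as below. For simplicity this variant zeroes the entire rectangle covered by a span (the description zeroes down the column for rowspan and along the row for colspan separately); the input format and names are assumptions:

```python
def grid_to_html(cells, n_rows, n_cols):
    """cells: list of (row, col, rowspan, colspan) for each visible cell.
    Builds rowspan/colspan matrices initialized to 1, zeroes the slots
    covered by a span, and emits the <td> sequence row by row."""
    rs = [[1] * n_cols for _ in range(n_rows)]
    cs = [[1] * n_cols for _ in range(n_rows)]
    for r, c, rspan, cspan in cells:
        rs[r][c], cs[r][c] = rspan, cspan
        for dr in range(rspan):
            for dc in range(cspan):
                if (dr, dc) != (0, 0):
                    rs[r + dr][c + dc] = 0
                    cs[r + dr][c + dc] = 0
    rows = []
    for r in range(n_rows):
        tds = []
        for c in range(n_cols):
            if rs[r][c] == 0 or cs[r][c] == 0:
                continue  # slot covered by a span starting elsewhere
            attrs = ""
            if rs[r][c] > 1:
                attrs += f' rowspan="{rs[r][c]}"'
            if cs[r][c] > 1:
                attrs += f' colspan="{cs[r][c]}"'
            tds.append(f"<td{attrs}></td>")
        rows.append("<tr>" + "".join(tds) + "</tr>")
    return "".join(rows)
```

As in Fig. 5, attributes equal to 1 are omitted from the emitted tags.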
(6) Detecting a text line area in the table area extracted in the step (3) by using a text detection model, and obtaining coordinates of the text line area; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm to obtain correct text information;
text line detection uses DBNet, a segmentation-based algorithm (Liao, M., et al., "Real-Time Scene Text Detection with Differentiable Binarization and Adaptive Scale Fusion" (2022)). It uses a ResNet50-vd feature pyramid and performs adaptive binarization on each pixel individually, with the threshold learned by the network, which avoids the difficulty of choosing a fixed threshold that generalizes. To improve detection efficiency, DBSCAN is used to cluster the cells by height and width, and cell images of similar size are zero-padded to a uniform size and fed to the GPU in a single batch.
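The zero-padding/batching step can be sketched with NumPy. Grouping crops of similar size (via DBSCAN on height and width) happens before this step, so the function only handles one such group; the name is illustrative:

```python
import numpy as np

def pad_batch(crops):
    """Zero-pad a group of similarly sized grayscale crops (H, W) to a
    common size so they can be stacked into one tensor for a single
    GPU forward pass."""
    h = max(c.shape[0] for c in crops)
    w = max(c.shape[1] for c in crops)
    out = np.zeros((len(crops), h, w), dtype=crops[0].dtype)
    for i, c in enumerate(crops):
        out[i, :c.shape[0], :c.shape[1]] = c  # top-left aligned, rest zero
    return out
```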
Text line recognition uses the Transformer-based SVTR (Du, Y., et al., "SVTR: Scene Text Recognition with a Single Visual Model" (2022)). Similar to the Swin Transformer, it adopts a three-stage down-sampling architecture, using local mixing to extract stroke features and global mixing to capture correlations between characters, forming a multi-scale feature description.
Words with confidence below 0.5 are replaced by higher-confidence words from a preset dictionary using an edit-distance algorithm. Edit distance measures the similarity of strings as the minimum number of editing operations needed to convert one string into another; the larger the edit distance, the less similar the two strings. Common editing operations are replacing one character with another, inserting a character, and deleting a character. The cost of computing edit distances depends on the dictionary used and becomes heavy when the dictionary is large. To accelerate the computation, a Viterbi-style dynamic programming algorithm is used: paths that cannot be the answer are pruned as the computation proceeds, and the best path is selected among those remaining.
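A minimal version of the dictionary-based correction, using a plain Levenshtein dynamic program rather than the pruned Viterbi-style search mentioned above; function names and the confidence interface are assumptions:

```python
def edit_distance(a, b):
    """Levenshtein distance (insert/delete/replace) via a rolling DP row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            # dp[j] (above) = delete, dp[j-1] (left) = insert, prev (diag) = replace
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def correct(word, dictionary, confidence, thresh=0.5):
    """Replace a low-confidence recognition result with the nearest
    dictionary word; keep the original when confidence is high."""
    if confidence >= thresh:
        return word
    return min(dictionary, key=lambda w: edit_distance(word, w))
```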
(7) The cell coordinates obtained in step (4) or step (5) are matched with the text line coordinates extracted in step (6); after a correct match, the corrected text information is filled into the html labels of the corresponding cells, and the html labels are converted into excel output.
A cell is matched to a text box according to coordinate position and is considered matched when their IoU > 0.5. The matched text content is filled into the html label of the corresponding cell, and finally the html tag sequence is converted into excel output.
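The matching step can be sketched as below. Note that a strict IoU > 0.5 between a text box and a possibly much larger cell is a demanding criterion; the sketch follows the description as written, with names assumed:

```python
def iou(a, b):
    # intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def match_text_to_cells(cells, texts, thresh=0.5):
    """cells: list of boxes; texts: list of (box, string).
    Assigns each recognized text line to the cell it overlaps most,
    requiring IoU > thresh, and returns the text content per cell."""
    filled = [[] for _ in cells]
    for tbox, s in texts:
        scores = [iou(c, tbox) for c in cells]
        best = max(range(len(cells)), key=scores.__getitem__)
        if scores[best] > thresh:
            filled[best].append(s)
    return [" ".join(parts) for parts in filled]
```

The per-cell strings are then written into the <td> tags of the structure, and the resulting html is exported to excel (e.g. with a spreadsheet library).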
The invention also relates to a universal form recognition device combining semantic segmentation and sequence prediction, shown in Fig. 1, comprising six modules: a preprocessing module, a table detection module, an end-to-end module, a two-stage module, an OCR module, and a matching module. The preprocessing module performs preprocessing operations on the images. The table detection module detects the table region and judges whether it is a wired table. If it is not a wired table, the end-to-end module directly produces the table structure information and cell positions; otherwise the two-stage module is used, which first detects lines and extracts cells, then merges the cells. The OCR module extracts text line positions and content. Finally, the matching module associates text lines with cells and outputs the result in excel format. Each module is described in detail below.
The preprocessing module is used for preprocessing the input form image; the preprocessing operation comprises min-max normalization, contrast increase, conversion into gray level images, detection of image direction by using a classification model and rotation of the image to a positive direction; the preprocessing module is provided with an interface for inputting images;
the table detection module is used for extracting a table area in the table image and judging whether the table area is a wired table or other tables; if the table is a wired table, a two-stage module is used for processing; if the table is the other table, using an end-to-end module for processing;
the end-to-end module is used for obtaining an html tag sequence and cell coordinates of the table structure by using the end-to-end deep learning model;
the two-stage module is used for detecting the straight line in the table area by using a table straight line detection model and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html tag sequences of the cells and cell coordinates;
an OCR module for detecting text line regions in the form region using a text detection model; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm;
and the matching module is used for matching the cell coordinates obtained in the end-to-end module or the two-stage module with the text line area extracted from the OCR module, and after the matching is correct, filling the text information in the text line area into the html label of the corresponding cell, and converting the html label into excel output.
Evaluating:
evaluation was performed on the public dataset TableBank. The test set contains 145,519 samples, of which 4,858 (3.34%) are wired tables. TEDS, an edit-distance similarity, is taken as the evaluation metric, computed as
TEDS(str1, str2) = 1 - d(str1, str2) / max(len(str1), len(str2))

where d is the edit distance between the character strings str1 and str2 and len denotes the string length.
TEDS is a real number no greater than 1; the larger the value, the more similar the prediction is to the ground truth. With the end-to-end model SLANet as the baseline, the evaluation results are shown in Table 1. The TEDS of the proposed scheme is 81.71%, an improvement of 0.29% over SLANet overall, and an improvement of 9.34% over SLANet (79.24%) on wired-table recognition. The QPS of the proposed scheme is 1.144, which is 0.303 lower than SLANet's but still within an acceptable range.
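A sketch of the similarity metric as defined above. The original TEDS metric is a tree edit distance over HTML structures; the code below implements the string-level edit-distance variant given by the formula in this document:

```python
def edit_distance(a, b):
    """Levenshtein distance via a rolling 1-D dynamic-programming row."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[len(b)]

def teds_similarity(a, b):
    # 1 - d / max(len), as in the formula above; identical strings score 1.0
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))
```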
Table 1: evaluation results of the scheme and the end-to-end scheme of the invention on TableBank
Corresponding to the embodiment of the universal form identification method combining semantic segmentation and sequence prediction, the invention also provides an embodiment of a universal form identification device combining semantic segmentation and sequence prediction.
Referring to fig. 6, a generic form recognition device combining semantic segmentation and sequence prediction according to an embodiment of the present invention includes one or more processors configured to implement a generic form recognition method combining semantic segmentation and sequence prediction in the above embodiment.
The embodiment of the universal form recognition device combining semantic segmentation and sequence prediction can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented in software, in hardware, or in a combination of the two. Taking software implementation as an example, the device in the logical sense is formed by the processor of the host device reading the corresponding computer program instructions from nonvolatile storage into memory and running them. In terms of hardware, Fig. 6 shows the hardware structure of a device hosting the universal form recognition device combining semantic segmentation and sequence prediction according to the present invention; besides the processor, memory, network interface, and nonvolatile storage shown in Fig. 6, the host device generally includes other hardware according to its actual function, which is not described here.
The implementation process of the functions and roles of each unit in the above device is specifically shown in the implementation process of the corresponding steps in the above method, and will not be described herein again.
For the device embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant points. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present invention. Those of ordinary skill in the art can understand and implement the invention without undue burden.
The embodiment of the invention also provides a computer readable storage medium, on which a program is stored, which when executed by a processor, implements a universal form recognition method combining semantic segmentation and sequence prediction in the above embodiment.
The computer readable storage medium may be an internal storage unit, such as a hard disk or memory, of any of the devices with data processing capability described in the previous embodiments. It may also be an external storage device of such a device, for example a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card (Flash Card) provided on the device, or it may include both the internal storage unit and an external storage device. The computer readable storage medium is used to store the computer program and the other programs and data required by the device, and may also be used to temporarily store data that has been or will be output.
It should be understood that the specific order or hierarchy of steps in the processes disclosed are examples of exemplary approaches. Based on design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented.
The above embodiments are merely for illustrating the design concept and features of the present invention, and are intended to enable those skilled in the art to understand the content of the present invention and implement the same, the scope of the present invention is not limited to the above embodiments. Therefore, all equivalent changes or modifications according to the principles and design ideas of the present invention are within the scope of the present invention.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. The specification and examples are to be regarded in an illustrative manner only.

Claims (11)

1. A universal form identification method combining semantic segmentation and sequence prediction is characterized by comprising the following steps:
(1) Inputting a form image;
(2) Preprocessing the table image input in the step (1); the preprocessing operation comprises min-max normalization, contrast increase, conversion into gray level images, detection of image direction by using a classification model and rotation of the image to a positive direction;
(3) Extracting a table area in the table image processed in the step (2) by using the target detection model and judging whether the table is a wired table or other tables;
(4) If the table is one of the other tables, obtaining the html label of each cell and the coordinates of each cell by using an end-to-end deep learning model;
(5) If the table is a wired table, detecting the straight line in the table area extracted in the step (3) by using a table straight line detection model, and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html labels of each cell and coordinates of each cell;
(6) Detecting a text line area in the table area extracted in the step (3) by using a text detection model, and obtaining coordinates of the text line area; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm to obtain correct text information;
(7) Matching the cell coordinates obtained in the step (4) or the step (5) with the coordinates of the text line region extracted in the step (6), and after the matching is correct, filling the correct text information into the html labels of the corresponding cells, and converting the html labels into excel output.
2. The method for identifying a universal form combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (4), the end-to-end deep learning model is a sequence prediction model.
3. The method for identifying a universal form combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (5), the form straight line detection model is a semantic segmentation model.
4. The method for identifying a universal form by combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (5), the completing of the incomplete line segment is specifically:
dividing the detected straight line into a transverse line and a longitudinal line according to the slope; wherein, the absolute value of the slope is smaller than or equal to 1 and is a horizontal line, and the absolute value of the slope is larger than 1 and is a vertical line;
for each transverse line, calculating an intersection point of a longitudinal line closest to the end point, and if the intersection point is not on the transverse line and the distance between the intersection point and the end point is within a threshold value, replacing the end point by the intersection point to complement the line segment; and each longitudinal line is complemented in the same way.
5. The method for identifying a universal table combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (5), four vertices of the table are extracted according to boundary lines, specifically:
taking the intersection point of the leftmost longitudinal line and the topmost transverse line, or the intersection point of their extension lines, as the upper left vertex; the upper right vertex, lower left vertex, and lower right vertex are determined in the same way.
6. The method for identifying a universal form by combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (5), overlapping cells are eliminated, specifically:
calculating IOU between every two cells, and removing the cells with larger areas if the IOU is larger than a threshold value; the threshold is 0.5.
7. The method for identifying a universal form combining semantic segmentation and sequence prediction according to claim 1, wherein in the step (5), the merging unit cells are specifically:
determining rows and columns according to the transverse lines and the longitudinal lines; using DBSCAN to cluster the abscissa and the ordinate of the left top vertex of all the cells respectively, wherein the category numbers are the column numbers col and row numbers row respectively; solving a median value of each abscissa class to obtain an abscissa x of the class, sequencing all x from small to large, and solving the average value of two adjacent x to obtain an abscissa of each column; the ordinate of each row is obtained by the same method; establishing a row column matrix; wherein, the matrix content is an abscissa value and an ordinate value;
calculating, for each cell, the number of columns c spanned, the number of rows r spanned, and the subscripts (x, y) of the upper left vertex of the cell in the matrix; and obtaining the html label corresponding to each cell.
8. The method for universal form identification combining semantic segmentation and sequence prediction according to claim 1, wherein in step (6),
the text line detection model is a semantic segmentation model; before detection, clustering the length and width of the cells by using DBSCAN;
the text line recognition model is a sequence prediction model;
the text correction algorithm is an edit distance algorithm; dynamic programming algorithms can also be used simultaneously to accelerate the solution process.
9. A universal form recognition device combining semantic segmentation and sequence prediction, comprising:
the preprocessing module is used for preprocessing the input form image; the preprocessing operation comprises min-max normalization, contrast increase, conversion into gray level images, detection of image direction by using a classification model and rotation of the image to a positive direction; the preprocessing module is provided with an interface for inputting images;
the table detection module is used for extracting a table area in the table image and judging whether the table area is a wired table or other tables; if the table is a wired table, a two-stage module is used for processing; if the table is the other table, using an end-to-end module for processing;
the end-to-end module is used for obtaining the html label of each cell and the coordinates of each cell by using an end-to-end deep learning model;
the two-stage module is used for detecting the straight line in the table area by using a table straight line detection model and complementing the incomplete line segment; determining four vertexes of the table according to the boundary line, calculating a perspective transformation matrix, and correcting the table image; drawing the detected line segments into masks, extracting mask outlines as cells, and eliminating overlapped cells; merging the cells to obtain html tag sequences of the cells and cell coordinates;
an OCR module for detecting text line regions in the form region using a text detection model; identifying text information in the text line area using a text identification model; correcting the identification errors in the text information by combining a text correction algorithm;
and the matching module is used for matching the cell coordinates obtained in the end-to-end module or the two-stage module with the text line area extracted from the OCR module, and after the matching is correct, filling the text information in the text line area into the html label of the corresponding cell, and converting the html label into excel output.
10. A generic form recognition device combining semantic segmentation and sequence prediction, comprising one or more processors configured to implement a generic form recognition method combining semantic segmentation and sequence prediction as claimed in any one of claims 1-8.
11. A computer readable storage medium having stored thereon a program which, when executed by a processor, is adapted to carry out a method of universal form recognition combining semantic segmentation and sequence prediction as claimed in any one of claims 1 to 8.
CN202310566244.5A 2023-05-19 2023-05-19 Universal form identification method and device combining semantic segmentation and sequence prediction Pending CN116311310A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310566244.5A CN116311310A (en) 2023-05-19 2023-05-19 Universal form identification method and device combining semantic segmentation and sequence prediction

Publications (1)

Publication Number Publication Date
CN116311310A true CN116311310A (en) 2023-06-23

Family

ID=86801760

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310566244.5A Pending CN116311310A (en) 2023-05-19 2023-05-19 Universal form identification method and device combining semantic segmentation and sequence prediction

Country Status (1)

Country Link
CN (1) CN116311310A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612487A (en) * 2023-07-21 2023-08-18 亚信科技(南京)有限公司 Table identification method and device, electronic equipment and storage medium
CN116977436A (en) * 2023-09-21 2023-10-31 小语智能信息科技(云南)有限公司 Burmese text image recognition method and device based on Burmese character cluster characteristics
CN117173725A (en) * 2023-11-03 2023-12-05 之江实验室 Table information processing method, apparatus, computer device and storage medium
CN117475458A (en) * 2023-12-28 2024-01-30 深圳智能思创科技有限公司 Table structure restoration method, apparatus, device and storage medium

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190266394A1 (en) * 2018-02-26 2019-08-29 Abc Fintech Co., Ltd. Method and device for parsing table in document image
CN111626146A (en) * 2020-05-08 2020-09-04 西安工业大学 Merging cell table segmentation and identification method based on template matching
CN112528863A (en) * 2020-12-14 2021-03-19 中国平安人寿保险股份有限公司 Identification method and device of table structure, electronic equipment and storage medium
CN112800731A (en) * 2021-02-23 2021-05-14 浪潮云信息技术股份公司 Table repairing method for dealing with distorted graphs in image table extraction
CN113435240A (en) * 2021-04-13 2021-09-24 北京易道博识科技有限公司 End-to-end table detection and structure identification method and system
CN114419647A (en) * 2021-12-31 2022-04-29 北京译图智讯科技有限公司 Table information extraction method and system
CN114529925A (en) * 2022-04-22 2022-05-24 华南理工大学 Method for identifying table structure of whole line table
CN114611494A (en) * 2022-03-17 2022-06-10 平安科技(深圳)有限公司 Text error correction method, device, equipment and storage medium
CN114724153A (en) * 2022-03-31 2022-07-08 壹沓科技(上海)有限公司 Table reduction method and device and related equipment
CN114782970A (en) * 2022-06-22 2022-07-22 广州市新文溯科技有限公司 Table extraction method, system and readable medium
CN115359501A (en) * 2022-07-28 2022-11-18 华南理工大学 Table data enhancement method, table identification method, table data enhancement device and storage medium
CN115424282A (en) * 2022-09-28 2022-12-02 山东省计算中心(国家超级计算济南中心) Unstructured text table identification method and system
CN115588208A (en) * 2022-09-29 2023-01-10 浙江工业大学 Full-line table structure identification method based on digital image processing technology
CN115618154A (en) * 2022-12-19 2023-01-17 华南理工大学 Robust alignment method for markup language tags and cell anchor frames of tables
CN115661848A (en) * 2022-07-11 2023-01-31 上海通办信息服务有限公司 Form extraction and identification method and system based on deep learning
WO2023045298A1 (en) * 2021-09-27 2023-03-30 上海合合信息科技股份有限公司 Method and apparatus for detecting table lines in image




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination