CN113221523A - Method of processing table, computing device, and computer-readable storage medium - Google Patents


Info

Publication number
CN113221523A
CN113221523A (application CN202110529807.4A)
Authority
CN
China
Prior art keywords
block
blocks
determining
training
column
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110529807.4A
Other languages
Chinese (zh)
Inventor
钟韵山
刘蒙蒙
张钰
孙怀玉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Berry Hekang Biotechnology Co ltd
Berry Genomics Co Ltd
Original Assignee
Beijing Berry Hekang Biotechnology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Berry Hekang Biotechnology Co ltd filed Critical Beijing Berry Hekang Biotechnology Co ltd
Priority to CN202110529807.4A
Publication of CN113221523A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/177 Editing, e.g. inserting or deleting of tables; using ruled lines
    • G06F40/18 Editing, e.g. inserting or deleting of tables; using ruled lines of spreadsheets
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, a computing device, and a computer-readable storage medium for processing a table. The method comprises the following steps: cropping one or more table subgraphs from a picture using a target detection model, wherein each table subgraph contains one table; performing optical character recognition on each table subgraph to detect a plurality of word blocks in the table subgraph, wherein each word block contains one or more characters; predicting, using a deep neural network model, the row probability that any two of the plurality of word blocks are in the same row and the column probability that any two are in the same column; and structurally reorganizing the plurality of word blocks based on these row and column probabilities to reconstruct the table as a structured table.

Description

Method of processing table, computing device, and computer-readable storage medium
Technical Field
The present invention relates generally to the field of machine learning, and more particularly to a method, computing device, and computer-readable storage medium for processing a table.
Background
An academic paper is a scientific record of research results or innovative insights on an academic topic obtained experimentally, theoretically, or predictively, or a scientific summary of new progress made by applying a known principle. In scientific research, a large number of academic papers often need to be read for knowledge extraction, induction, and organization. In particular, experimental results in academic papers are usually presented in table form, so systematically extracting and reconstructing the table information in academic papers is very important for acquiring and summarizing knowledge.
Therefore, Table Structure Recognition (TSR) is one of the most challenging tasks in academic-paper information extraction; it attempts to represent a table's structure in a uniform format so that the table's information can be extracted and used automatically by a computer.
Currently, there are two main schemes for obtaining table text from an academic paper in PDF or picture format: one is to upload the papers in batches to online conversion software or similar tools, which return structured text after automatic processing, with the table portion converted into a string of characters; the other is to call an open-source API, for example, installing Python modules such as PyPDF2 or pdfplumber to automatically extract table contents from PDF-format papers.
Furthermore, there are current schemes that handle only the TSR task, such as GraphTSR, which address a single TSR task based on rules (e.g., identifying table ruling lines) or machine learning (e.g., a graph convolutional network (GCN)).
However, the above schemes that use online software or call an open-source API usually suffer from the character order being inconsistent with the table layout, which leads to content confusion between different columns or rows of the table. As a result, these schemes often only recognize the characters in the table but cannot reconstruct the cells, and thus cannot implement a complete table-structuring process. As for schemes that handle only the TSR task: first, a stand-alone table structure recognition algorithm usually supports only word blocks as input and cannot take an original academic paper (e.g., in picture or PDF format) as input; second, current schemes have poor recognition accuracy for complex tables; and third, because no pre-trained language model incorporating prior knowledge is introduced into the TSR algorithm, semantic analysis of the text in the table is not supported, so TSR accuracy is insufficient.
Disclosure of Invention
In view of at least one of the above problems, the present invention provides a scheme for extracting a structured table from a picture, which reconstructs a table in a picture into a structured table by constructing features of more dimensions for the table region in the picture and predicting the row probabilities and/or column probabilities of the word blocks in the table using a deep neural network model. The invention thus provides a complete technical scheme for reconstructing a structured table from an original picture.
According to one aspect of the present invention, a method of processing a table is provided. The method comprises the following steps: cropping one or more table subgraphs from a picture using a target detection model, wherein each table subgraph contains one table; performing optical character recognition on each table subgraph to detect a plurality of word blocks in the table subgraph, wherein each word block contains one or more characters; predicting, using a deep neural network model, the row probability that any two of the plurality of word blocks are in the same row and the column probability that any two are in the same column; and structurally reorganizing the plurality of word blocks based on these row and column probabilities to reconstruct the table as a structured table.
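The four steps above can be sketched end to end. In the toy sketch below, the target detection and OCR stages are assumed to have already produced word blocks, and a simple geometric heuristic (`toy_row_prob`, a hypothetical stand-in) replaces the deep neural network's row-probability output; the structured reorganization is reduced to union-find grouping of blocks whose pairwise same-row probability exceeds a threshold. All names are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class WordBlock:
    text: str
    x0: float
    y0: float
    x1: float
    y1: float  # bounding-box corners

def group_by_probability(blocks, pair_prob, threshold=0.5):
    """Union-find grouping: blocks whose pairwise probability exceeds the
    threshold end up in the same group (here, the same row)."""
    parent = list(range(len(blocks)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(blocks)), 2):
        if pair_prob(blocks[i], blocks[j]) > threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(len(blocks)):
        groups.setdefault(find(i), []).append(blocks[i])
    return list(groups.values())

# Hypothetical stand-in for the deep neural network: two blocks are taken to
# be "in the same row" when their bounding boxes overlap vertically.
def toy_row_prob(a, b):
    return 1.0 if min(a.y1, b.y1) - max(a.y0, b.y0) > 0 else 0.0

# Word blocks as OCR might return them for a 2 x 2 table
blocks = [WordBlock("Gene", 0, 0, 40, 10), WordBlock("p-value", 50, 0, 90, 10),
          WordBlock("BRCA1", 0, 20, 40, 30), WordBlock("0.01", 50, 20, 90, 30)]
rows = group_by_probability(blocks, toy_row_prob)
```

The same grouping function applies to columns by swapping in a column-probability callable.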
According to another aspect of the invention, a computing device is provided. The computing device includes: at least one processor; and at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform steps according to the above-described method.
According to yet another aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer program code, which when executed performs the method as described above.
In some embodiments, the deep neural network model comprises an input layer, a BioBERT network layer, a first fusion vector layer, a GCN network layer, a second fusion vector layer, a fully-connected network layer, and an output layer, wherein predicting the row probability and the column probability that any two of the plurality of blocks are in the same row and the same column using the deep neural network model comprises: determining, at the input layer, input data of the deep neural network model for a first block and a second block of the two blocks to be predicted, wherein the input data includes a first text ID and a first position vector of the first block, a second text ID and a second position vector of the second block, a relative position vector between the first block and the second block, and an adjacency matrix and a weight matrix of the table subgraph; determining, at the BioBERT network layer, a first feature vector of the first block and a second feature vector of the second block based on the first text ID and the second text ID, respectively; at the first fusion vector layer, splicing the first position vector and the first feature vector to generate a first fusion vector of the first block, and splicing the second position vector and the second feature vector to generate a second fusion vector of the second block; at the GCN network layer, determining a first convolution output vector of the first block and a second convolution output vector of the second block based on the first fusion vector, the second fusion vector, and the adjacency matrix and weight matrix of the table subgraph; at the second fusion vector layer, splicing the relative position vector between the first block and the second block, the first fusion vector and first convolution output vector of the first block, and the second fusion vector and second convolution output vector of the second block to determine a fused feature vector of the first block and the second block; at the fully-connected network layer, predicting the row probability that the first block and the second block are in the same row based on the fused feature vector and a first fully-connected network, and predicting the column probability that they are in the same column based on the fused feature vector and a second fully-connected network; and outputting, at the output layer, the row probability and the column probability of the first block and the second block.
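The second fusion vector layer and the fully-connected head amount to a concatenation followed by an affine map. The sketch below assumes plain concatenation in the order listed above (the disclosure does not fix the order) and a single two-logit fully-connected layer; `dense` and `pair_logits` are hypothetical names.

```python
def dense(vec, W, b):
    """One fully-connected layer: y = W x + b, with W as a list of rows."""
    return [sum(wi * xi for wi, xi in zip(row, vec)) + bi
            for row, bi in zip(W, b)]

def pair_logits(rel, fuse1, conv1, fuse2, conv2, W, b):
    """Second fusion vector layer plus one fully-connected head: the fused
    feature vector is the plain concatenation of the five inputs (order
    assumed), fed to a layer producing 2 logits (same row / different row)."""
    fused = rel + fuse1 + conv1 + fuse2 + conv2
    return dense(fused, W, b)

# 1-dimensional toy vectors with weights that pick out two components
logits = pair_logits([1.0], [2.0], [3.0], [4.0], [5.0],
                     W=[[1, 0, 0, 0, 0], [0, 0, 0, 0, 1]], b=[0.0, 0.0])
```

In the full model a second weight matrix would produce the column logits from the same fused feature vector.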
In some embodiments, determining input data for the deep neural network model for a first block and a second block of two blocks to be predicted comprises: converting the text of the first block and the second block into the first text ID and the second text ID, respectively; acquiring a first position vector of the first block and a second position vector of the second block based on the position information of the first block and the second block, respectively; determining a relative position vector between the first block and the second block based on the first position vector of the first block and the second position vector of the second block; determining an adjacency matrix of the table subgraph based on distances between the plurality of word blocks of the table subgraph; and determining a weight matrix for the table subgraph based on the adjacency matrix.
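The disclosure states only that the adjacency matrix is derived from distances between word blocks and the weight matrix from the adjacency matrix; one plausible construction, sketched below, links each block to its k nearest neighbours by centre distance and weights edges inversely with distance. The k-nearest-neighbour rule and the 1/(1+d) weighting are assumptions.

```python
import math

def build_graph_matrices(centers, k=2):
    """Adjacency matrix (k nearest neighbours by Euclidean distance between
    block centres) and a distance-based weight matrix, both symmetric."""
    n = len(centers)
    adj = [[0.0] * n for _ in range(n)]
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        dists = sorted(
            (math.dist(centers[i], centers[j]), j) for j in range(n) if j != i)
        for d, j in dists[:k]:
            adj[i][j] = adj[j][i] = 1.0          # undirected edge
            w[i][j] = w[j][i] = 1.0 / (1.0 + d)  # closer blocks weigh more
    return adj, w

# Block centres of a small 2 x 2 grid
centers = [(0, 0), (10, 0), (0, 10), (10, 10)]
adj, w = build_graph_matrices(centers, k=2)
```

With k=2 each corner block is linked to its two axis-aligned neighbours but not to the diagonally opposite block.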
In some embodiments, obtaining a first location vector of the first block and a second location vector of the second block based on location information of the first block and the second block, respectively, comprises: determining normalized coordinate information of the first block based on the position information of the first block, determining a normalized center position of the first block and a normalized width and a normalized height of the first block based on the normalized coordinate information of the first block, and determining a first position vector for the first block based on the normalized coordinate information, normalized center position, normalized width, and normalized height for the first block, and determining normalized coordinate information of the second block based on the position information of the second block, determining a normalized center position of the second block and a normalized width and a normalized height of the second block based on the normalized coordinate information of the second block, and determining a second position vector for the second block based on the normalized coordinate information, the normalized center position, the normalized width, and the normalized height for the second block.
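Under the assumption that the position vector simply concatenates the normalized quantities listed above (corner coordinates, centre position, width, and height), a minimal sketch:

```python
def position_vector(box, page_w, page_h):
    """8-dimensional position vector from a block bounding box (x0, y0, x1, y1):
    normalized corner coordinates, centre, width, and height. The exact
    ordering and dimension are assumptions consistent with the description."""
    x0, y0, x1, y1 = box
    nx0, ny0 = x0 / page_w, y0 / page_h
    nx1, ny1 = x1 / page_w, y1 / page_h
    cx, cy = (nx0 + nx1) / 2, (ny0 + ny1) / 2  # normalized centre position
    return [nx0, ny0, nx1, ny1, cx, cy, nx1 - nx0, ny1 - ny0]

vec = position_vector((50, 20, 150, 40), page_w=200, page_h=100)
```

A relative position vector for a block pair can then be taken, for example, as the element-wise difference of the two position vectors.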
In some embodiments, structurally reorganizing the plurality of blocks to reconstruct the table into a structured table based on row probabilities and column probabilities that any two of the plurality of blocks are in a same row and a same column comprises: for each block of the plurality of blocks, determining a number of rows and a number of columns of the structured table based on a row probability and a column probability that the block is in a same row and a same column with other blocks of the plurality of blocks and a positional relationship between the block and the other blocks; determining boundaries of each candidate cell in the structured table based on the location information of each block of the plurality of blocks and the number of rows and columns of the structured table; determining whether two adjacent candidate cells in the structured table should be merged based on row probabilities and column probabilities between blocks contained by the two adjacent candidate cells; in response to determining that the two neighboring candidate cells should be merged, merging the two neighboring candidate cells into one cell; in response to determining that the two neighboring candidate cells should not be merged, determining the two neighboring candidate cells as two separate cells; and combining the blocks contained in each cell based on the position information of the plurality of blocks to reconstruct the structured table.
In some embodiments, determining the number of rows and columns of the structured table comprises: for each target block of the plurality of blocks, determining a candidate right block set, a candidate left block set, a candidate upper block set, and a candidate lower block set of the target block; determining a right block, a left block, an upper block, and a lower block of the target block based on the position information of each block in these candidate sets and the position information of the target block, respectively; determining a rightmost block set of the table subgraph based on the right block of each of the plurality of blocks; determining the number of left blocks of each rightmost block in the rightmost block set; determining the number of columns of the structured table based on the number of left blocks of each rightmost block in the rightmost block set; determining a bottommost block set of the table subgraph based on the lower block of each of the plurality of blocks; determining the number of upper blocks of each bottommost block in the bottommost block set; and determining the number of rows of the structured table based on the number of upper blocks of each bottommost block in the bottommost block set.
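The rightmost-block logic can be sketched as follows. Here `right_of` maps each block to its right neighbour (or `None` for a rightmost block), and the column count is taken as the longest left-chain ending at a rightmost block; taking the maximum over rightmost blocks is an assumption, since the disclosure only says the count is based on the per-block left counts. Rows are handled symmetrically with a `below_of` map.

```python
def n_columns(right_of):
    """right_of maps each block id to its right-neighbour id (or None).
    Rightmost blocks are those with no right neighbour; the column count is
    the maximum number of blocks on any left-chain ending at a rightmost
    block (chains are walked via the inverse mapping)."""
    left_of = {r: b for b, r in right_of.items() if r is not None}
    cols = 0
    for b, r in right_of.items():
        if r is None:                # rightmost block
            length, cur = 1, b
            while cur in left_of:    # walk leftwards through the row
                cur = left_of[cur]
                length += 1
            cols = max(cols, length)
    return cols

# 2 x 3 toy table: a-b-c on one row, d-e-f on the next
right_of = {"a": "b", "b": "c", "c": None, "d": "e", "e": "f", "f": None}
```

`n_columns(right_of)` yields 3 for this layout; the analogous `n_rows(below_of)` would yield 2.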
In some embodiments, determining the boundary of each candidate cell in the structured table comprises: constructing a row feature vector of each block based on the upper-bound ordinate, the lower-bound ordinate, and the center-position ordinate of each block; clustering the plurality of blocks based on the row feature vectors of the plurality of blocks and the number of rows of the structured table to determine a plurality of row categories for the plurality of blocks, wherein the number of row categories is equal to the number of rows of the structured table; for each of the plurality of row categories, determining the average center ordinate of the blocks contained in the row category; sorting the plurality of row categories by the average center ordinate of the blocks contained in each row category; determining the average ordinate of each sorted row category; and determining the ordinate of the row dividing line between two adjacent row categories based on the average ordinates of the two adjacent sorted row categories.
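Assuming the blocks have already been clustered into row categories (e.g. by K-means on the row feature vectors), the sorting and divider placement can be sketched as below; placing each divider at the midpoint of adjacent category means is an assumption, as the disclosure only says the divider ordinate is derived from the two adjacent averages.

```python
def row_dividers(row_groups):
    """Each block is a (top, bottom, centre_y) triple and each group is one
    row category. Sort categories by mean centre ordinate and place each row
    dividing line halfway between adjacent category means."""
    means = sorted(sum(c for _, _, c in g) / len(g) for g in row_groups)
    return [(a + b) / 2 for a, b in zip(means, means[1:])]

# Three row categories of a toy table (two header blocks share the top row)
groups = [[(0, 10, 5), (0, 10, 5)], [(20, 30, 25)], [(40, 50, 45)]]
dividers = row_dividers(groups)
```

Column dividers follow the same pattern using abscissas instead of ordinates.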
In some embodiments, determining the boundary of each candidate cell in the structured table comprises: constructing a column feature vector of each block based on the left-boundary abscissa, the right-boundary abscissa, and the center-position abscissa of each block; clustering the plurality of blocks based on the column feature vectors of the plurality of blocks and the number of columns of the structured table to determine a plurality of column categories for the plurality of blocks, wherein the number of column categories is equal to the number of columns of the structured table; for each of the plurality of column categories, determining the average center abscissa of the blocks contained in the column category; sorting the plurality of column categories by the average center abscissa of the blocks contained in each column category; determining the average abscissa of each sorted column category; and determining the abscissa of the column dividing line between two adjacent column categories based on the average abscissas of the two adjacent sorted column categories.
In some embodiments, the two neighboring candidate cells are located in the same column and include a first candidate cell located in the i-th row and a second candidate cell located in the (i+1)-th row, and determining whether the two neighboring candidate cells should be merged includes: determining a third set of blocks in the i-th row other than the first candidate cell; determining a fourth set of blocks in the (i+1)-th row other than the second candidate cell; determining a row merge value between the first candidate cell and the second candidate cell based on the row probability between any block in the i-th row and any block in the (i+1)-th row, the number of blocks included in the first candidate cell, the number of blocks included in the second candidate cell, the number of blocks included in the third set of blocks, and the number of blocks included in the fourth set of blocks; determining whether the row merge value is greater than a predetermined threshold; in response to determining that the row merge value is greater than the predetermined threshold, determining that the first candidate cell and the second candidate cell should be merged; and in response to determining that the row merge value is less than or equal to the predetermined threshold, determining that the first candidate cell and the second candidate cell should not be merged.
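The disclosure does not give a closed-form row merge value; one plausible instantiation, sketched below, scores a merge as the average same-row probability between the two cells' blocks minus the average between the remaining blocks of the two rows, so a merge is favoured only where the model links the cells themselves but not the rest of the rows. Both the formula and `p_row` are hypothetical.

```python
def row_merge_value(p_row, cell1, cell2, rest1, rest2):
    """Hypothetical row-merge score for two vertically adjacent candidate
    cells: average same-row probability between their blocks, discounted by
    the average probability between the remaining blocks of the two rows."""
    def avg(A, B):
        pairs = [(a, b) for a in A for b in B]
        return sum(p_row(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0
    return avg(cell1, cell2) - avg(rest1, rest2)

# Toy probability: blocks sharing a label prefix count as "the same row"
p = lambda a, b: 1.0 if a[0] == b[0] else 0.0
score = row_merge_value(p, ["x1"], ["x2"], ["y1"], ["z1"])
```

The cells would then be merged when the score exceeds the predetermined threshold (e.g. 0.5).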
In some embodiments, the two neighboring candidate cells are located in the same row and include a first candidate cell located in the j-th column and a second candidate cell located in the (j+1)-th column, and determining whether the two neighboring candidate cells should be merged includes: determining a fifth set of blocks in the j-th column other than the first candidate cell; determining a sixth set of blocks in the (j+1)-th column other than the second candidate cell; determining a column merge value between the first candidate cell and the second candidate cell based on the column probability between any block in the j-th column and any block in the (j+1)-th column, the number of blocks included in the first candidate cell, the number of blocks included in the second candidate cell, the number of blocks included in the fifth set of blocks, and the number of blocks included in the sixth set of blocks; determining whether the column merge value is greater than a predetermined threshold; in response to determining that the column merge value is greater than the predetermined threshold, determining that the first candidate cell and the second candidate cell should be merged; and in response to determining that the column merge value is less than or equal to the predetermined threshold, determining that the first candidate cell and the second candidate cell should not be merged.
In some embodiments, the method further comprises: acquiring a training data set of the deep neural network model, wherein the training data set comprises a plurality of training data, each training data comprises information of a plurality of training blocks contained in a training table corresponding to the training data, and the information of each training block comprises the characters contained in the training block, the coordinate position of the training block, and the row and column information of the training block in the training table; determining, at the input layer of the deep neural network model, training input data of the deep neural network model for a first training block and a second training block of two training blocks of the training table, wherein the training input data includes the text ID and position vector of the first training block, the text ID and position vector of the second training block, the relative position vector between the first training block and the second training block, and the adjacency matrix and weight matrix of the training table; determining, at the BioBERT network layer of the deep neural network model, a feature vector of the first training block and a feature vector of the second training block based on the text ID of the first training block and the text ID of the second training block, respectively; at the first fusion vector layer of the deep neural network model, splicing the position vector and feature vector of the first training block to generate a fusion vector of the first training block, and splicing the position vector and feature vector of the second training block to generate a fusion vector of the second training block; determining, at the GCN network layer of the deep neural network model, a convolution output vector of the first training block and a convolution output vector of the second training block based on the fusion vector of the first training block, the fusion vector of the second training block, and the adjacency matrix and weight matrix of the training table, respectively; at the second fusion vector layer of the deep neural network model, splicing the relative position vector between the first training block and the second training block, the fusion vector and convolution output vector of the first training block, and the fusion vector and convolution output vector of the second training block to determine a fused feature vector of the first training block and the second training block; determining, at the fully-connected network layer of the deep neural network model, the row probability that the first training block and the second training block are in the same row based on the fused feature vector and a first fully-connected network, and the column probability that they are in the same column based on the fused feature vector and a second fully-connected network; performing Softmax regression on the row probability and the column probability, and determining a row loss value and a column loss value for the first training block and the second training block using a cross-entropy loss function; and determining a loss value of the deep neural network model based on the row loss value and the column loss value and updating the parameter matrices of the deep neural network model based on the loss value.
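The Softmax-plus-cross-entropy step at the end of the training pass can be written out directly; treating the total loss as the plain sum of the row loss and the column loss is an assumption, since the disclosure only says the model loss is based on both.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Cross-entropy of a 2-way prediction (same row / not same row)."""
    return -math.log(softmax(logits)[label])

# Hypothetical model outputs for one block pair; label 0 = same row,
# label 1 = same column in this toy setup.
row_logits, col_logits = [2.0, 0.5], [0.1, 1.2]
loss = cross_entropy(row_logits, 0) + cross_entropy(col_logits, 1)
```

The parameter matrices would then be updated by backpropagating this loss with a standard optimizer.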
Drawings
The invention will be better understood and other objects, details, features and advantages thereof will become more apparent from the following description of specific embodiments of the invention given with reference to the accompanying drawings.
FIG. 1 shows a schematic diagram of a system for implementing a method of processing a table according to an embodiment of the invention.
FIG. 2 illustrates a flow diagram of a method for processing a table according to some embodiments of the invention.
FIG. 3 shows a schematic structural diagram of a Mask R-CNN model according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating a table subgraph cropped according to an embodiment of the invention.
FIG. 5 shows a schematic structural diagram of a deep neural network model according to an embodiment of the present invention.
FIG. 6 shows a flowchart of the steps of determining the row probability of two blocks being in the same row and the column probability of two blocks being in the same column according to an embodiment of the invention.
FIG. 7 shows a flow diagram of substeps for determining input data for a deep neural network model according to an embodiment of the invention.
FIG. 8 shows a flowchart of steps for reconstructing a structured table, according to an embodiment of the invention.
FIG. 9 shows a flow diagram of sub-steps of determining the number of rows and columns of a structured table according to an embodiment of the invention.
FIG. 10 illustrates a flow diagram of a process for determining row boundaries between candidate cells according to an embodiment of the present invention.
FIG. 11 illustrates a flow diagram of a process for determining column boundaries between candidate cells according to an embodiment of the invention.
FIG. 12 shows a flow diagram of a process for determining whether two neighboring candidate cells should be row merged according to an embodiment of the invention.
FIG. 13 illustrates a flow diagram of a process for determining whether two neighboring candidate cells should be column merged according to an embodiment of the invention.
FIG. 14 is a flowchart illustrating the steps of training a deep neural network model according to an embodiment of the present invention.
FIG. 15 illustrates a block diagram of a computing device suitable for implementing embodiments of the present invention.
Detailed Description
Preferred embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms first, second, third, fourth, etc. used in the description and in the claims, are used for distinguishing between various objects for clarity of description only and do not limit the size, other order, etc. of the objects described therein.
Fig. 1 shows a schematic diagram of a system 1 for implementing a method of processing a table according to an embodiment of the invention. As shown in fig. 1, system 1 includes a computing device 10, a server 20, and a network 30. Computing device 10 and server 20 may exchange data via network 30. Here, the server 20 may be, for example, a server of a service provider dedicated to providing a table reconstruction service, and the computing device 10 is connected to the server 20 to perform corresponding operations based on commands from the server 20. The computing device 10 may include at least one processor 110 and at least one memory 120 coupled to the at least one processor 110, the memory 120 having stored therein instructions 130 executable by the at least one processor 110, the instructions 130, when executed by the at least one processor 110, performing at least a portion of the method 200 described below. Note that in this context, computing device 10 may be part of server 20 or may be separate from server 20. The specific structure of computing device 10 or server 20 may be, for example, as described below in connection with FIG. 15.
FIG. 2 illustrates a flow diagram of a method 200 for processing a table according to some embodiments of the invention. Method 200 may be performed, for example, by computing device 10 or server 20 in system 1 shown in fig. 1. The method 200 is described below in conjunction with fig. 1-14, taking execution in the computing device 10 as an example.
As shown in fig. 2, method 200 includes step 210, in which computing device 10 may utilize a target detection model to crop one or more table subgraphs from a picture to be detected, wherein each table subgraph contains one table.
The target detection model refers to a machine learning model for detecting a specific target object in a picture. Depending on the actual application requirements, either a two-stage (2-stage) target detection model (e.g., R-CNN (Region-based Convolutional Neural Networks), Fast R-CNN, Mask R-CNN, etc.) or a single-stage (1-stage) target detection model (e.g., YOLO (You Only Look Once), SSD (Single Shot MultiBox Detector), etc.) may be used.
For the single-stage target detection model, model training may be performed only for target objects of a required table type, and the trained model is also used only for detecting target objects of the table type.
For the two-stage target detection model, model training can be performed on target objects of various types (including table types) as required, and the trained model can detect the target objects of various types.
In step 210, computing device 10 may use a trained object detection model to cut out one or more target regions from the picture to be detected, each target region containing a table. Of course, it is also possible that no target area is detected. In this case, it may be judged that no table exists in the picture, and the processing of the picture is skipped.
Herein, the training and use of the target detection model are described by taking Mask R-CNN model as an example. Fig. 3 shows a schematic structural diagram of a Mask R-CNN model 300 according to an embodiment of the present invention. However, those skilled in the art will appreciate that the object detection model described herein is not limited to the Mask R-CNN model, and that the various object detection models described above or other conventional object detection algorithms may be used.
Specifically, for example, the PubLayNet data set may be used as the training data set of the target detection model. The PubLayNet data set is derived from PubMed Central and includes more than 360,000 document image layouts, where the images are all pictures related to academic papers and carry associated annotation information. The annotation information mainly comprises the location information and category of each target region, the categories including, for example, a Text category, a Title category, a Table category, a Figure category, and a List category.
As shown in fig. 3, the model 300 may include a convolutional layer 310, an RPN (Region Proposal Network) layer 320, an ROI (Region of Interest) alignment layer 330, a target region prediction layer 340, a mask (Mask) layer 350, and an output layer 360.
Convolutional layer 310 may be implemented, for example, using a residual network (ResNet) to perform feature extraction on an input training picture and generate a feature map of the picture. For example, convolutional layer 310 may be implemented using a 101-layer residual network (ResNet-101) combined with an FPN (Feature Pyramid Network), or the like.
RPN layer 320 may, for example, employ a full convolutional network to detect a feature map from convolutional layer 310 to generate a predicted target object region.
ROI alignment layer 330 may align each region of interest of the feature map output by convolutional layer 310 based on the predicted target object region output by RPN layer 320 to generate a ROI region.
The target region prediction layer 340 is used to classify and regress the region features of the ROI region generated by the ROI alignment layer 330 to determine a predicted target region.
Mask layer 350 is used to perform target detection and segmentation on the ROI region generated by ROI alignment layer 330 to determine a binary mask of the ROI region.
The output layer 360 determines a loss value of each ROI region based on the predicted target region output by the target region prediction layer 340 and the binary mask output by the mask layer 350 and iteratively updates the parameter matrix of the entire model 300.
After the model 300 is trained, in step 210 the ROI regions of the various categories and their coordinate positions in the picture are determined from the picture to be detected in the above manner; the ROI regions of the table category are recognized as table subgraphs and are cropped out in sequence according to their coordinate positions. Each table subgraph identified in this way contains one table.
In some cases, the file to be detected is not in a picture (e.g., jpg) format. In this case, step 210 may further include, or be preceded by, converting the file to be detected into a picture format. For example, a common format for academic papers is PDF. A PDF file can be converted into picture files by various file-format conversion software (e.g., the pdf2image module in Python); where the PDF file contains multiple pages, each page can be converted into a separate picture.
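For illustration, the page-to-picture conversion can be sketched as follows; this is a minimal sketch assuming the third-party pdf2image package (a Python wrapper around poppler's pdftoppm) is available, and the output file names are illustrative rather than prescribed by the method:

```python
from pathlib import Path

def pdf_to_page_images(pdf_path, out_dir, dpi=200):
    """Convert each page of a PDF into a separate JPEG picture.

    A minimal sketch of the format-conversion step; `pdf2image` is a
    third-party dependency, so it is imported lazily inside the function.
    """
    from pdf2image import convert_from_path  # requires poppler installed

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    pages = convert_from_path(str(pdf_path), dpi=dpi)  # one PIL image per page
    paths = []
    for i, page in enumerate(pages, start=1):
        path = out_dir / f"page_{i:03d}.jpg"
        page.save(path, "JPEG")
        paths.append(path)
    return paths
```

Each returned path then serves as one picture to be detected in step 210.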
Continuing with method 200 of FIG. 2, at step 220, computing device 10 may perform Optical Character Recognition (OCR) on each table subgraph cropped in step 210 to detect a plurality of blocks in the table subgraph, wherein each block contains one or more characters.
In step 220, various known or later-developed OCR schemes may be used to detect the blocks in the table subgraph. For example, in one implementation, the text in each table subgraph may be identified using Tesseract, Google's open-source OCR engine. Specifically, Tesseract may be run to perform optical character recognition on each table subgraph cropped in step 210 and recognize all characters therein as a plurality of word blocks. In this case, the recognition result includes the characters contained in each block and the coordinate position information of the block in the picture. FIG. 4 shows a schematic diagram of a table subgraph 400 cropped according to an embodiment of the invention. As shown in FIG. 4, a number of blocks in the table subgraph (shown as boxes around letters or numbers in FIG. 4) may be detected after OCR is performed on the table subgraph.
Here, the table subgraph may be processed with various existing or later-developed OCR tools to detect the blocks therein. For example, the table subgraph 400 shown in fig. 4 is the result of an OCR operation using Tesseract. Note that different OCR tools use different recognition strategies, so the resulting output may differ; this does not affect the scope of protection of the inventive concept.
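As a concrete illustration of the OCR step, the following sketch assumes the third-party pytesseract wrapper is used; `image_to_data` returns parallel lists of word texts and box coordinates, which a small helper can regroup into the (characters, coordinate position) blocks described above. The helper name and data layout are illustrative, not the patent's exact implementation:

```python
def blocks_from_ocr_dict(data):
    """Regroup a Tesseract `image_to_data` dictionary into word blocks.

    `data` holds parallel lists under the keys 'text', 'left', 'top',
    'width', 'height' and 'conf' (the layout produced by
    pytesseract.image_to_data(..., output_type=Output.DICT)). Returns
    (text, (x1, y1, x2, y2)) pairs for non-empty words.
    """
    blocks = []
    for text, l, t, w, h, conf in zip(data["text"], data["left"],
                                      data["top"], data["width"],
                                      data["height"], data["conf"]):
        if text.strip() and float(conf) >= 0:   # conf == -1 marks layout rows
            blocks.append((text, (l, t, l + w, t + h)))
    return blocks

def detect_blocks(image):
    """Run OCR on a table subgraph (requires the pytesseract package)."""
    import pytesseract
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    return blocks_from_ocr_dict(data)
```

The resulting (text, box) pairs correspond to the word blocks shown as boxes in FIG. 4.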
Next, at step 230, computing device 10 may use a deep neural network model to predict, for any two blocks of the plurality of blocks detected in the table subgraph in step 220, a row probability that the two blocks are in the same row and a column probability that they are in the same column.
Fig. 5 shows a schematic structural diagram of a deep neural network model 500 according to an embodiment of the present invention. As shown in fig. 5, the deep neural network model 500 may include an input layer 510, a BioBERT network layer 520, a first fused vector layer 530, a GCN (Graph Convolutional Network) network layer 540, a second fused vector layer 550, a fully connected network layer 560, and an output layer 570. The input layer 510 is used to determine or input the data for the deep neural network model 500, including the text ID and position vector of each block in the table subgraph 400, the relative position vector between two blocks, and an adjacency matrix and a weight matrix over all blocks of the table subgraph. The BioBERT network layer 520 takes the text ID of each block as input and encodes the characters in the block through the BioBERT framework to obtain the feature vector of the block. The first fused vector layer 530 is used to fuse the position vector of each block with its feature vector to obtain a fused vector of the block. The GCN network layer 540 determines the convolution output vector of each block based on the fused vector of that block and the adjacency matrix and weight matrix of the table subgraph. The second fused vector layer 550 is used to fuse the fused vectors and convolution output vectors of the two blocks to be predicted with their relative position vector to determine the fused feature vector of the two blocks. The fully connected network layer 560 includes a first fully connected network 562 and a second fully connected network 564 for row prediction and column prediction, respectively, which can predict, based on the fused feature vector of two blocks, the row probability that they are in the same row and the column probability that they are in the same column.
The operation of the various layers of the deep neural network model 500 is described in detail below in conjunction with fig. 6 and 7. FIG. 6 shows a flowchart of step 230 of determining the row probability that two blocks are in the same row and the column probability that they are in the same column, according to an embodiment of the present invention. Note that the following description focuses only on the data processing for the two blocks A and B to be predicted (also referred to as the first block and the second block) in one prediction pass; the processing for the other blocks in the table subgraph (denoted by X in fig. 5) is omitted.
As shown in fig. 6, step 230 may include a sub-step 231 in which, at the input layer 510, the computing device 10 may determine input data for the deep neural network model 500 for two blocks a and B to be predicted. The input data may include a first text ID for a first block a, a first position vector, a second text ID for a second block B, a second position vector, a relative position vector between the first block a and the second block B, and an adjacency matrix and a weight matrix of the table subgraph 400.
Fig. 7 shows a flow diagram of sub-step 231 for determining input data for the deep neural network model 500, according to an embodiment of the present invention.
As shown in fig. 7, sub-step 231 may include sub-step 2311, where computing device 10 may convert the text of first block a and second block B to a first text ID and a second text ID, respectively. Sub-step 2311 may have different conversion methods depending on the network structure of the deep neural network model 500. Herein, since the next layer of the input layer 510 is the BioBERT network layer 520, the text in the block may be converted using a method matching the BioBERT at sub-step 2311.
At sub-step 2312, computing device 10 may obtain a first position vector n(A) for the first block A and a second position vector n(B) for the second block B based on the position information of the first block A and the second block B, respectively.
In some embodiments herein, the position vector for each block may contain three parts of information: 1) the normalized coordinate information of the block in the picture; 2) a normalized center position of the block; and 3) the normalized width and the normalized height of the block.
Specifically, computing device 10 may determine normalized coordinate information for a block based on the position information of the block. For example, it is assumed that the position information of one block can be expressed as absolute coordinates (x1, x2, y1, y2), where x2 > x1 and y2 > y1, representing the rectangle formed by the four vertices (x1, y1), (x1, y2), (x2, y1), and (x2, y2). The width and height of the picture (in pixels) are W and H, respectively. The normalized coordinate information of the block may then be expressed as relative coordinates (x1', x2', y1', y2'), where x1' = x1/W, x2' = x2/W, y1' = y1/H, y2' = y2/H.
Computing device 10 may then, based on the normalized coordinate information (x1', x2', y1', y2') of the block, determine its normalized center position ((x1'+x2')/2, (y1'+y2')/2), normalized width (x2'-x1'), and normalized height (y2'-y1').
Based on the normalized coordinate information, the normalized center position, the normalized width, and the normalized height of the block, computing device 10 may determine a position vector for the block. For example, the position vector may be the 8-dimensional vector formed by concatenating the 4-dimensional normalized coordinate information (x1', x2', y1', y2'), the 2-dimensional normalized center position ((x1'+x2')/2, (y1'+y2')/2), the 1-dimensional normalized width (x2'-x1'), and the 1-dimensional normalized height (y2'-y1').
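The construction above can be sketched in a few lines; the function name and tuple layout are illustrative:

```python
def position_vector(box, W, H):
    """Build the 8-dimensional position vector of a word block.

    `box` holds the absolute coordinates (x1, x2, y1, y2) of the block;
    W and H are the pixel width and height of the picture.
    """
    x1, x2, y1, y2 = box
    x1n, x2n, y1n, y2n = x1 / W, x2 / W, y1 / H, y2 / H   # normalized coords
    cx, cy = (x1n + x2n) / 2, (y1n + y2n) / 2             # normalized center
    return (x1n, x2n, y1n, y2n, cx, cy, x2n - x1n, y2n - y1n)
```

For a block at (100, 300, 50, 100) in a 1000 × 500 picture, this yields (0.1, 0.3, 0.1, 0.2, 0.2, 0.15, 0.2, 0.1).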
For the first block A and the second block B as shown in fig. 5, the computing device 10 may determine the position vector of the first block A (referred to as the first position vector n(A)) and the position vector of the second block B (referred to as the second position vector n(B)), respectively, in the same manner as described above.
Continuing with fig. 7, at sub-step 2313, computing device 10 may determine a relative position vector between the first block A and the second block B based on the first position vector n(A) of the first block A and the second position vector n(B) of the second block B.
In particular, computing device 10 may determine the relative position vector between the first block A and the second block B based on the normalized center position of the first block A, ((xA1'+xA2')/2, (yA1'+yA2')/2), and the normalized center position of the second block B, ((xB1'+xB2')/2, (yB1'+yB2')/2). The relative position vector w(AB) may be expressed as ((xB1'+xB2')/2 - (xA1'+xA2')/2, (yB1'+yB2')/2 - (yA1'+yA2')/2).
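Assuming the 8-dimensional position-vector layout sketched earlier (indices 4 and 5 holding the normalized center position), w(AB) reduces to a difference of the two centers:

```python
def relative_position_vector(pos_a, pos_b):
    """w(AB): difference of the normalized center positions of two blocks,
    taken from their 8-dimensional position vectors (indices 4 and 5)."""
    return (pos_b[4] - pos_a[4], pos_b[5] - pos_a[5])
```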
At sub-step 2314, computing device 10 may determine a adjacency matrix for table sub-graph 400 based on distances between all of the word blocks of table sub-graph 400.
Specifically, each word block in the table subgraph 400 is regarded as a node; for any node, the spatial distances to the other nodes are computed, and an undirected edge is constructed between the node and each of the several (e.g., 10) other nodes closest to it in space, thereby generating an undirected graph over all word blocks of the table subgraph 400. The adjacency matrix among all the word blocks of the table subgraph 400 can then be determined from the undirected graph: the element corresponding to two nodes joined by an undirected edge has value 1, and all other elements have value 0. For example, for a table subgraph 400 containing Nt word blocks, the adjacency matrix is an Nt × Nt matrix.
Next, at sub-step 2315, computing device 10 may determine a weight matrix for table sub-graph 400 based on the adjacency matrix.
In some embodiments of the invention, each weight value in the weight matrix may be determined by dividing the value of an element in the adjacency matrix by the sum of all elements of the row in which the element is located.
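Sub-steps 2314 and 2315 can be sketched as follows; the neighbour count k is kept small here for brevity, whereas the text suggests a value such as 10:

```python
import math

def adjacency_and_weight(centers, k=2):
    """Build the adjacency and weight matrices of the table subgraph.

    Each word block is a node; an undirected edge joins a node to its k
    spatially nearest neighbours, and each weight is the adjacency entry
    divided by the sum of its row.
    """
    n = len(centers)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        nearest = sorted((j for j in range(n) if j != i),
                         key=lambda j: math.dist(centers[i], centers[j]))[:k]
        for j in nearest:            # undirected: mark both directions
            adj[i][j] = adj[j][i] = 1
    weight = []
    for row in adj:
        s = sum(row)
        weight.append([v / s if s else 0.0 for v in row])
    return adj, weight
```

Because edges are marked in both directions, a node may end up with more than k edges, which matches the undirected-graph construction described above.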
Continuing with fig. 6, at sub-step 232 of step 230, at the BioBERT network layer 520, the computing device 10 may determine a feature vector of the first block A (hereinafter referred to as the first feature vector m(A)) and a feature vector of the second block B (hereinafter referred to as the second feature vector m(B)) based on the first text ID of the first block A and the second text ID of the second block B, respectively.
Specifically, computing device 10 may encode the characters in each block using the BioBERT framework with the text ID of each block as input to derive a feature vector for the block.
BERT is a context-based word representation model that is based on a masked language model and pre-trained using a bidirectional Transformer; using a masked language model to predict randomly masked words in a sequence makes it possible to learn a bidirectional representation. The BioBERT model has almost the same structure as the BERT model, except that it is pre-trained on a biomedical literature database and can be used for text mining in the biomedical field. The BioBERT network layer 520 herein uses a well-trained existing BioBERT model, and thus its structure and training will not be described in detail here.
Depending on the actual application requirements, different BioBERT model parameters may be selected. In one example, the size of the hidden layer of the selected BioBERT model is 768, so the size of the feature vector (embedding) generated for the input (word or sentence) to the BioBERT network layer 520 is 768 dimensions.
Next, at sub-step 233, at the first fused vector layer 530, computing device 10 may concatenate the first position vector n(A) and the first feature vector m(A) of the first block A to generate a first fused vector u(A) of the first block A, and concatenate the second position vector n(B) and the second feature vector m(B) of the second block B to generate a second fused vector u(B) of the second block B.
In one example, for the 8-dimensional position vector and the 768-dimensional feature vector described above, the resulting fusion vector has dimensions 776.
At sub-step 234, at the GCN network layer 540, computing device 10 may determine a first convolution output vector v(A) of the first block A and a second convolution output vector v(B) of the second block B, respectively, based on the first fused vector u(A) of the first block A, the second fused vector u(B) of the second block B, and the adjacency matrix and weight matrix of the table subgraph 400. The GCN network layer 540 uses a well-trained existing GCN model, and thus its structure and training will not be described in detail here.
Depending on the actual application requirements, different GCN model parameters may be selected. In one example, in the matrix operation of the GCN network layer 540, the selected weight matrix has dimensions 776 × 223, so it can convert the first fused vector u(A) of dimension 776 into a first convolution output vector v(A) of dimension 223, and convert the second fused vector u(B) of dimension 776 into a second convolution output vector v(B) of dimension 223.
Through the GCN network layer 540, the information of the blocks in the table subgraph 400 other than the two blocks A and B to be predicted is filtered out, so as to obtain the convolution output vectors of the two blocks A and B to be predicted.
Continuing with fig. 6, at sub-step 235, at the second fused vector layer 550, computing device 10 may concatenate the relative position vector w(AB) between the first block A and the second block B, the first fused vector u(A) and first convolution output vector v(A) of the first block A, and the second fused vector u(B) and second convolution output vector v(B) of the second block B to determine the fused feature vector of the first block A and the second block B.
In one example, for the 2-dimensional relative position vector w(AB), the 776-dimensional first and second fused vectors u(A) and u(B), and the 223-dimensional first and second convolution output vectors v(A) and v(B), the dimension of the resulting fused feature vector is 2000.
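The dimension bookkeeping can be checked directly: 2 + 776 + 223 + 776 + 223 = 2000. A sketch of the concatenation (the exact ordering inside the model is an assumption):

```python
def fuse_features(w_ab, u_a, v_a, u_b, v_b):
    """Concatenate the relative position vector with the fused vectors and
    convolution output vectors of both blocks; the concatenation order is
    an assumption, the dimensions follow the example in the text."""
    return list(w_ab) + list(u_a) + list(v_a) + list(u_b) + list(v_b)
```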
At sub-step 236, at the fully-connected network layer 560, the computing device 10 may predict a row probability that the first block a and the second block B are in the same row based on the fused feature vector of the first block a and the second block B and the first fully-connected network 562, and predict a column probability that the first block a and the second block B are in the same column based on the fused feature vector and the second fully-connected network 564.
The first fully-connected network 562 may be a fully-connected network composed of a plurality of fully-connected layers and activation functions, which may determine probability values that the first block a and the second block B belong to the same row and probability values that do not belong to the same row, respectively, and predict a probability value that the first block a and the second block B belong to the same row as a row probability that the first block a and the second block B are in the same row. For example, suppose that the two-dimensional probability vector [0.2,0.8] is output after the first block a and the second block B pass through the first fully-connected network 562, which means that the probability value of the first block a and the second block B belonging to the same row is 0.2, and the probability value of the first block a and the second block B not belonging to the same row is 0.8, so that the row probability that the first block a and the second block B are in the same row is determined to be 0.2.
Similarly, the second fully-connected network 564 may also be a fully-connected network composed of a plurality of fully-connected layers and activation functions, which may determine the probability values that the first block A and the second block B belong to the same column and do not belong to the same column, respectively, and output the probability value that they belong to the same column as the column probability that the first block A and the second block B are in the same column. For example, suppose the two-dimensional probability vector [0.66, 0.33] is output after the first block A and the second block B pass through the second fully-connected network 564; this means that the probability value of the first block A and the second block B belonging to the same column is 0.66 and the probability value of them not belonging to the same column is 0.33, so the column probability that the first block A and the second block B are in the same column is determined to be 0.66.
Finally, at sub-step 237, at output layer 570, computing device 10 may output row probabilities and column probabilities of first block a and second block B for use in subsequent reconstruction of the structured table.
For the plurality of blocks of the table sub-graph 400 detected in step 220, the operation of step 230 described in fig. 6 may be repeatedly performed on any two blocks thereof to determine the row probability of any two blocks being in the same row and the column probability of any two blocks being in the same column.
Continuing with FIG. 2, at step 240, computing device 10 may structurally reorganize the plurality of word blocks in the table subgraph 400 based on the row probability that any two of the plurality of word blocks are in the same row and the column probability that they are in the same column, determined in step 230, so as to reconstruct the table in the table subgraph 400 into a structured table.
FIG. 8 shows a flowchart of step 240 for reconstructing a structured table, according to an embodiment of the present invention.
As shown in fig. 8, step 240 may include sub-step 241, where computing device 10 may determine, for each of a plurality of blocks of the table sub-graph 400, a number of rows and a number of columns of the structured table based on a row probability and a column probability that the block is in the same row and the same column as other blocks of the plurality of blocks of the table sub-graph 400 and a positional relationship between the block and the other blocks.
FIG. 9 shows a flowchart of sub-step 241 of determining the number of rows and columns of a structured table according to an embodiment of the present invention.
As shown in fig. 9, sub-step 241 may include sub-step 2411, in which computing device 10 may determine, for each target block of the plurality of blocks of the table subgraph 400, a set of candidate right blocks, a set of candidate left blocks, a set of candidate upper blocks, and a set of candidate lower blocks for the target block.
In particular, in some embodiments, computing device 10 may traverse all blocks of the table subgraph 400 other than the block in question (also referred to as the target block) and select each block whose left boundary is greater than the right boundary of the target block as one of the set of candidate right blocks for the target block. Further, the computing device 10 may retain, from the candidate right block set, only those candidate right blocks whose row probability of being in the same row as the target block is greater than a predetermined threshold (e.g., 0.5), as the final candidate right block set.
Similarly, computing device 10 may also determine a set of candidate left blocks, a set of candidate top blocks, and a set of candidate bottom blocks for the target block.
In sub-step 2412, computing device 10 may determine a right block, a left block, a top block, and a bottom block of the target block based on the position information of each block in the set of candidate right blocks, the set of candidate left blocks, the set of candidate top blocks, and the set of candidate bottom blocks of the target block, respectively, and the position information of the target block.
Here, the right block, the left block, the top block, and the bottom block of the target block refer to the right block, the left block, the top block, and the bottom block immediately adjacent to the target block.
Specifically, in some embodiments, computing device 10 may select, from the set of candidate right blocks of the target block, the candidate right block whose center position has the smallest abscissa (i.e., the leftmost one) as the right block of the target block. For example, when block A is the target block, its right block may be labeled right(A).
Similarly, the computing device 10 may also determine the left block (left(A)), upper block (up(A)), and lower block (down(A)) of the target block.
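Sub-steps 2411 and 2412 can be sketched for the right-block case as follows; the data layout (a dict of boxes plus a row-probability callback) is illustrative:

```python
def right_block(target, blocks, row_prob, threshold=0.5):
    """Return the block immediately to the right of `target`, or None.

    `blocks` maps a block id to its box (x1, x2, y1, y2), and
    `row_prob(a, b)` gives the predicted probability that blocks a and b
    are in the same row (this data layout is illustrative).
    """
    tx2 = blocks[target][1]                       # right boundary of target
    candidates = [b for b in blocks
                  if b != target
                  and blocks[b][0] > tx2          # left boundary beyond it
                  and row_prob(target, b) > threshold]
    if not candidates:
        return None                               # target is a rightmost block
    # the immediate neighbour has the smallest center abscissa (leftmost)
    return min(candidates, key=lambda b: (blocks[b][0] + blocks[b][1]) / 2)
```

Mirroring the comparisons (left boundary vs. right boundary, or the ordinates with the column probability) yields left(A), up(A), and down(A).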
In sub-step 2413, computing device 10 may determine a set of rightmost word blocks of the tabular sub-graph 400 based on a right word block of each word block of the plurality of word blocks of the tabular sub-graph 400.
Specifically, taking block A as an example, in some embodiments it is determined whether the right block right(A) of block A is empty. If right(A) is empty, block A is taken as one of the rightmost blocks in the rightmost block set of the table subgraph 400.
Next, in sub-step 2414, computing device 10 may determine, for each rightmost block in the rightmost block set of the table sub-graph 400, a number of left blocks thereof, and in sub-step 2415, a number of columns of the structured table based on the number of left blocks of each rightmost block in the rightmost block set.
Here, assume that the rightmost block set determined in sub-step 2413 is {x1, x2, ..., xn}, where xi (1 ≤ i ≤ n, n being a positive integer greater than 1) represents one rightmost block in the set. For the rightmost block xi, its left block left(xi) is recursively found among all blocks in the table subgraph 400, and the predicted number of columns number(xi) for the rightmost block xi is determined.
Specifically, for the rightmost block xi, it may be determined whether its left block left(xi) exists; if not, number(xi) = 1; if the left block left(xi) exists, 1 is added to number(xi) and the determination continues as to whether a further left block exists.
In this way, the number of predicted columns { number (x1), number (x2),.. number (xn) } for each rightmost block in the rightmost block set { x1, x 2.., xn } may be determined.
In one example, the number of columns ncol of the structured table may be determined as:
ncol=median(number(x1),number(x2),...,number(xn))
wherein median() denotes the median operation.
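The left-block counting and the median of sub-steps 2414 and 2415 can be sketched as follows (an iterative loop stands in for the recursion; applying the same logic to upper blocks yields nrow):

```python
from statistics import median

def count_chain(block, left_of):
    """Count `block` plus its recursively-found left blocks (one table row)."""
    n = 1
    while left_of.get(block) is not None:
        block = left_of[block]
        n += 1
    return n

def column_count(rightmost_blocks, left_of):
    """ncol = median of the per-row counts over the rightmost block set."""
    return median(count_chain(x, left_of) for x in rightmost_blocks)
```

Here `left_of` maps each block to its left block (or None), i.e. the left(·) relation determined in sub-step 2412.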
On the other hand, similarly to sub-step 2413, in sub-step 2416 computing device 10 may determine the lowermost block set of the table subgraph 400 based on the lower block of each of the plurality of blocks of the table subgraph 400.
Specifically, taking block A as an example, in some embodiments it is determined whether the lower block down(A) of block A is empty. If down(A) is empty, block A is taken as one of the lowermost blocks in the lowermost block set of the table subgraph 400.
In sub-step 2417, similarly to sub-step 2414, computing device 10 may determine, for each lowermost block in the lowermost block set of the table subgraph 400, the number of its upper blocks, and in sub-step 2418, similarly to sub-step 2415, determine the number of rows of the structured table based on the number of upper blocks of each lowermost block in the lowermost block set.
Here, assume that the lowermost block set determined in sub-step 2416 is {y1, y2, ..., ym}, where yi (1 ≤ i ≤ m, m being a positive integer greater than 1) represents one lowermost block in the set. For the lowermost block yi, its upper block up(yi) is recursively found among all blocks in the table subgraph 400, and the predicted number of rows number(yi) for the lowermost block yi is determined.
Specifically, for the lowermost block yi, it may be determined whether its upper block up(yi) exists; if not, number(yi) = 1; if the upper block up(yi) exists, 1 is added to number(yi) and the determination continues as to whether a further upper block exists.
In this way, the predicted number of rows { number (y1), number (y2),.. number (ym) } for each of the lowermost word blocks in the set of lowermost word blocks { y1, y 2.., ym }, may be determined.
In one example, the number of rows nrow of the structured table may be determined as:
nrow=median(number(y1),number(y2),...,number(ym))
wherein median() denotes the median operation.
Note that while sub-steps 2416 to 2418 are shown in fig. 9 as following sub-steps 2413 to 2415, those skilled in the art will appreciate that the order of sub-steps shown in the figure is merely illustrative and that sub-steps 2416 to 2418 may be performed before sub-steps 2413 to 2415 or in parallel with sub-steps 2413 to 2415.
Continuing with FIG. 8, at sub-step 242, computing device 10 may determine the boundaries of each candidate cell in the structured table based on the position information of each of the plurality of blocks of the table subgraph 400 and the number of rows and columns of the structured table determined in sub-step 241. As is well known to those skilled in the art, a cell is the intersection of a row and a column in a table; it is the smallest unit making up the table and can be split or merged. Entry and modification of individual data are both done in cells. When a table is generated using a tool such as MS Word, Excel, or WPS, it is usually operated on in units of cells, and the text data in each cell represents independent semantics; therefore, when the structured table is reconstructed, the range/boundary of each cell should also be determined in order to obtain an accurate semantic representation of the table.
Determining boundaries between cells in the structured table may include determining row and column boundaries between cells. Depending on the specific style and reconstruction requirements of the table, only the row demarcation line may be determined, only the column demarcation line may be determined, or both the row demarcation line and the column demarcation line may be determined. Since there may be a split or merge of cells when a table is generated, the range of cells determined by the row and column boundaries between cells is only the smallest granularity cell in the structured table, also referred to as a candidate cell in the following description.
FIG. 10 illustrates a flow diagram of a process for determining row boundaries between candidate cells according to an embodiment of the present invention.
As shown in fig. 10, determining row boundaries between candidate cells may include sub-step 2421, where computing device 10 may construct a row feature vector for each block based on the upper-bound coordinates, the lower-bound coordinates, and the center position ordinate of the block.
As previously described, it is assumed that the normalized coordinate information of one block is represented as (x1', x2', y1', y2') and its normalized center position as ((x1'+x2')/2, (y1'+y2')/2). In this case, the row feature vector constructed for the block can be represented as (y1', y2', (y1'+y2')/2).
Next, at sub-step 2422, computing device 10 may cluster the blocks of the table subgraph 400 based on their row feature vectors and the row number nrow of the structured table to determine their row categories.
Here, these blocks may be clustered using a clustering algorithm such as K-means (referred to as row clustering), where the number of clusters is set to the number of rows nrow of the structured table. Row clustering the plurality of blocks of the table sub-graph 400 thus yields nrow row categories, each row category containing one or more blocks.
In sub-step 2423, computing device 10 may determine, for each of the nrow row categories, an average center ordinate of the blocks contained in that row category.
Here, the average center ordinate of the blocks included in the row category may be determined, for example, from the center position ordinate in the row feature vector of each block determined in sub-step 2421 and the blocks included in the row category.
In sub-step 2424, computing device 10 may sort the nrow row categories by the size of the average center ordinate of the blocks contained in each row category determined in sub-step 2423. The sequence numbers of the sorted row categories may be represented as 1, 2, 3, ..., nrow, for example.
Next, at sub-step 2425, computing device 10 may determine an average ordinate for each row category after sorting. For example, for any row category i (1 ≦ i ≦ nrow), the center position ordinates of all blocks in the row category may be averaged to determine the average ordinate for the row category.
At sub-step 2426, computing device 10 may determine an ordinate of a line demarcation between the sorted two adjacent line categories based on the average ordinates of the two adjacent line categories.
In one embodiment, the ordinate y_axis of the row dividing line between two adjacent row categories can be expressed as:
y_axis(line_j) = (y_j + y_(j+1))/2
wherein line_j represents the row dividing line between the j-th row category and the (j+1)-th row category, y_j represents the average ordinate of the j-th row category, y_(j+1) represents the average ordinate of the (j+1)-th row category, and 1 ≤ j ≤ nrow-1.
In this way, the ordinate of the row borderline between two adjacent row categories, i.e. the row borderline between candidate cells, can be determined.
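A minimal sketch of sub-steps 2423 to 2426, assuming the row categories have already been produced by row clustering and that each block carries a hypothetical center-ordinate field `cy`:

```python
def row_boundaries(row_categories):
    """Sort the row categories by the average center ordinate of their
    blocks and place each row dividing line midway between the average
    ordinates of adjacent categories."""
    def avg(cat):
        return sum(block["cy"] for block in cat) / len(cat)

    ordered = sorted(row_categories, key=avg)          # sub-step 2424
    avgs = [avg(cat) for cat in ordered]               # sub-step 2425
    # sub-step 2426: y_axis(line_j) = (y_j + y_(j+1)) / 2
    return [(avgs[j] + avgs[j + 1]) / 2 for j in range(len(avgs) - 1)]

cats = [[{"cy": 0.8}, {"cy": 0.9}], [{"cy": 0.1}], [{"cy": 0.45}, {"cy": 0.55}]]
lines = row_boundaries(cats)   # sorted averages 0.1, 0.5, 0.85 -> lines near 0.3 and 0.675
```

The column dividing lines of FIG. 11 follow the same pattern with the center abscissa in place of the ordinate.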
FIG. 11 illustrates a flow diagram of a process for determining column boundaries between candidate cells according to an embodiment of the invention. The manner of determining the column boundaries may be similar to the manner of determining the row boundaries described above, and therefore reference may be made to the description of FIG. 10.
As shown in fig. 11, determining column boundaries between candidate cells may include sub-step 2421', where computing device 10 may construct a column feature vector for each block based on the left boundary coordinate, the right boundary coordinate, and the center position abscissa of the block.
As previously described, it is assumed that the normalized coordinate information of one block is represented as (x1', x2', y1', y2') and its normalized center position as ((x1'+x2')/2, (y1'+y2')/2). In this case, the column feature vector constructed for the block can be represented as (x1', x2', (x1'+x2')/2).
Next, at sub-step 2422', computing device 10 may cluster the blocks of the table subgraph 400 based on their column feature vectors and the number of columns ncol of the structured table to determine column categories for the blocks.
Here, these blocks may be clustered using a clustering algorithm such as K-means (referred to as column clustering), where the number of clusters is set to the number of columns ncol of the structured table. Column clustering the blocks of the table sub-graph 400 thus yields ncol column categories, each column category containing one or more blocks.
At sub-step 2423', computing device 10 may determine, for each of the ncol column categories, an average center abscissa of the block contained in that column category.
Here, the average center abscissa of the blocks included in the column category may be determined, for example, from the center position abscissa in the column feature vector of each block determined in sub-step 2421' and the blocks included in the column category.
At sub-step 2424', computing device 10 may sort the ncol column categories by the size of the average center abscissa of the blocks contained in each column category determined in sub-step 2423'. The sequence numbers of the sorted column categories may be represented as 1, 2, 3, ..., ncol, for example.
Next, at sub-step 2425', computing device 10 may determine the sorted average abscissa for each column category. For example, for any column category i (1 ≦ i ≦ ncol), the average abscissa for that column category may be determined by averaging the central position abscissas for all blocks in that column category.
At sub-step 2426', computing device 10 may determine an abscissa of the column boundary between the two sorted adjacent column categories based on the average abscissas of the two adjacent column categories.
In one embodiment, the abscissa x_axis of the column dividing line between two adjacent column categories may be expressed as:
x_axis(line_i) = (x_i + x_(i+1))/2
wherein line_i represents the column dividing line between the i-th column category and the (i+1)-th column category, x_i represents the average abscissa of the i-th column category, x_(i+1) represents the average abscissa of the (i+1)-th column category, and 1 ≤ i ≤ ncol-1.
In this way, the abscissa of the column boundary between two adjacent column categories, i.e. the column boundary between candidate cells, can be determined.
After determining the row boundaries and/or column boundaries between candidate cells, the range of each candidate cell is determined, and computing device 10 may correspond each block to a corresponding candidate cell based on the normalized center position of the block.
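Assigning a block to its candidate cell by its normalized center position can be sketched as a lookup against the sorted dividing-line coordinates; the coordinates and the indexing convention below are illustrative:

```python
from bisect import bisect_left

def cell_of(block_center, row_lines, col_lines):
    """Map a block's normalized center (cx, cy) to its candidate cell
    (row index, column index) via the sorted dividing-line coordinates:
    the number of dividing lines below/left of the center is the index."""
    cx, cy = block_center
    return bisect_left(row_lines, cy), bisect_left(col_lines, cx)

row_lines = [0.3, 0.675]   # ordinates of the row dividing lines
col_lines = [0.5]          # abscissa of the single column dividing line
cell = cell_of((0.2, 0.4), row_lines, col_lines)   # -> (1, 0)
```

Whether the row index grows downward or upward depends on the image coordinate convention; only the ordering of dividing lines matters here.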
Continuing with FIG. 8, at sub-step 243, computing device 10 may determine whether two adjacent candidate cells in the structured table should be merged based on row probabilities and column probabilities between blocks contained by the two adjacent candidate cells.
Whether the two adjacent candidate cells should be row-merged or column-merged may be determined according to whether the two adjacent candidate cells are in the same row or column.
FIG. 12 shows a flow diagram of a process for determining whether two neighboring candidate cells should be row merged according to an embodiment of the invention.
It is assumed that the two neighboring candidate cells include a first candidate cell M located at the ith row and a second candidate cell N located at the (i + 1) th row, and that the two neighboring candidate cells are located at the same column.
As shown in fig. 12, the process for determining whether two adjacent candidate cells should be row merged may include sub-step 2431, where computing device 10 may determine the set of blocks in the i-th row other than those contained in the first candidate cell M (hereinafter also referred to as the third set of blocks, to distinguish it from other sets of blocks).
In sub-step 2432, computing device 10 may determine the set of blocks in the (i+1)-th row other than those contained in the second candidate cell N (hereinafter also referred to as the fourth set of blocks, to distinguish it from other sets of blocks).
In sub-step 2433, computing device 10 may determine a row merge value between first candidate cell M and second candidate cell N based on a row probability between any block in row i and any block in row i +1, a number of blocks included in first candidate cell M, a number of blocks included in second candidate cell N, a number of blocks included in the third set of blocks, and a number of blocks included in the fourth set of blocks.
Specifically, in one embodiment, assume that the first candidate cell M contains m blocks {X1, X2, ..., Xm}, the second candidate cell N contains n blocks {Y1, Y2, ..., Yn}, the third set of blocks contains a blocks {U1, U2, ..., Ua}, and the fourth set of blocks contains b blocks {V1, V2, ..., Vb}. Denoting by prob_row(A, B) the row probability predicted in step 230 that any two blocks A and B are in the same row, the row merge value score_row between the first candidate cell M and the second candidate cell N may be determined as follows:
[score_row formula, presented as an image in the original publication]
in sub-step 2434, computing device 10 may determine whether the row merge value score row is greater than a predetermined threshold. The predetermined threshold may be, for example, 0.5.
If it is determined that the row merge value is greater than the predetermined threshold, at sub-step 2435, computing device 10 may determine that the first candidate cell M and the second candidate cell N should be merged; conversely, if it is determined that the row merge value is less than or equal to the predetermined threshold, at sub-step 2436, computing device 10 may determine that the first candidate cell M and the second candidate cell N should not be merged.
Further, computing device 10 repeats the above process to determine whether the merged cell can be further row merged. Computing device 10 traverses all row-adjacent candidate cells in the same manner to arrive at the final row-merged cells.
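A minimal sketch of the merge decision follows. Since the patent's score_row formula appears only as an image, the sketch assumes, purely for illustration, that the merge value is the average pairwise row probability between the blocks of the two candidate cells; the actual formula also involves the third and fourth block sets:

```python
def score_row(M, N, prob_row, threshold=0.5):
    """Hypothetical merge value: the average of prob_row over all block
    pairs drawn from candidate cells M (row i) and N (row i+1). The
    patent's exact formula is not reproduced here."""
    total = sum(prob_row(x, y) for x in M for y in N)
    score = total / (len(M) * len(N))
    return score, score > threshold      # sub-steps 2433 and 2434

# toy probability: blocks sharing a leading tag are likely in the same row
def prob(a, b):
    return 0.9 if a[0] == b[0] else 0.1

score, merge = score_row(["a1", "a2"], ["a3"], prob)   # high score -> merge
```

The column merge decision of FIG. 13 is the mirror image with prob_col and score_col.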
FIG. 13 illustrates a flow diagram of a process for determining whether two neighboring candidate cells should be column merged according to an embodiment of the invention. The manner of determining whether to perform column merging may be similar to the manner of determining whether to perform row merging described above, and therefore reference may be made to the description of FIG. 12.
It is assumed that the two neighboring candidate cells include a first candidate cell M located at a j-th column and a second candidate cell N located at a j + 1-th column, and that the two neighboring candidate cells are located at the same row.
As shown in fig. 13, the process for determining whether two adjacent candidate cells should be column merged may include sub-step 2431', where computing device 10 may determine the set of blocks in the j-th column other than those contained in the first candidate cell M (hereinafter also referred to as the fifth set of blocks, to distinguish it from other sets of blocks).
In sub-step 2432', computing device 10 may determine the set of blocks in the (j+1)-th column other than those contained in the second candidate cell N (hereinafter also referred to as the sixth set of blocks, to distinguish it from other sets of blocks).
In sub-step 2433', computing device 10 may determine a column merge value between first candidate cell M and second candidate cell N based on a column probability between any block in the jth column and any block in the j +1 th column, a number of blocks included in first candidate cell M, a number of blocks included in second candidate cell N, a number of blocks included in the fifth set of blocks, and a number of blocks included in the sixth set of blocks.
Specifically, in one embodiment, assume that the first candidate cell M contains m blocks {X1, X2, ..., Xm}, the second candidate cell N contains n blocks {Y1, Y2, ..., Yn}, the fifth set of blocks contains c blocks {U1, U2, ..., Uc}, and the sixth set of blocks contains d blocks {V1, V2, ..., Vd}. Denoting by prob_col(A, B) the column probability predicted in step 230 that any two blocks A and B are in the same column, the column merge value score_col between the first candidate cell M and the second candidate cell N may be determined as follows:
[score_col formula, presented as an image in the original publication]
in sub-step 2434', computing device 10 may determine whether the column merge value score _ col is greater than a predetermined threshold. The predetermined threshold may be, for example, 0.5.
If it is determined that the column merge value is greater than the predetermined threshold, at sub-step 2435', computing device 10 may determine that the first candidate cell M and the second candidate cell N should be merged; conversely, if it is determined that the column merge value is less than or equal to the predetermined threshold, at sub-step 2436', computing device 10 may determine that the first candidate cell M and the second candidate cell N should not be merged.
Further, computing device 10 repeats the above process to determine whether the merged cell can be further column merged. Computing device 10 traverses all column-adjacent candidate cells in the same manner to arrive at the final column-merged cells.
Continuing with fig. 8, upon determining in sub-step 243 that two adjacent candidate cells should be merged, computing device 10 may merge the two adjacent candidate cells into one cell in sub-step 244, whereas upon determining in sub-step 243 that the two adjacent candidate cells should not be merged, computing device 10 may determine the two adjacent candidate cells as two separate cells in sub-step 245.
After the above-described operations have been performed on the candidate cells, the actual cells of the structured table are obtained; these are no longer merely the candidate cells divided by the row and column dividing lines, but actual cells that may each contain multiple candidate cells.
Finally, in sub-step 246, computing device 10 may merge the blocks contained in each cell based on the location information of the plurality of blocks to reconstruct the structured table.
At this point, the method 200 of the present invention reconstructs a structured table by intercepting a table sub-graph from a picture using a target detection model, detecting a plurality of blocks in the table sub-graph by optical character recognition, predicting with a deep neural network model the row probabilities and column probabilities that any two blocks are in the same row and column, and structurally recombining the blocks accordingly.
In the above method, when predicting the row probability and the column probability, the prediction is performed using the trained deep neural network model 500. In some embodiments of the invention, the method 200 may further include the step 250 of training the deep neural network model 500.
From the perspective of the entire model, the relationship of the inputs and outputs of the deep neural network model 500 can be expressed as:
Xout=W*Xin+b
where W is a weight function of the deep neural network model 500, b is a bias function of the deep neural network model 500, and the purpose of training the deep neural network model 500 is to continuously update the weight function W and the bias function b (which may be collectively referred to as a parameter matrix) to a convergence value. Here, the initial value of the parameter matrix may be arbitrarily set, or may be set empirically.
FIG. 14 shows a flowchart of the step 250 of training the deep neural network model 500, according to an embodiment of the present invention. The training process of the deep neural network model 500 is substantially the same as the process of the step 230 of determining the row probabilities and the column probabilities described above in connection with fig. 6, except that the input data and the output data of the deep neural network model 500 are different, and the step 250 further includes a process of iterating and updating the weight parameters of the deep neural network model 500 based on the output data.
As shown in fig. 14, step 250 may include sub-step 251, where computing device 10 may acquire a training data set of deep neural network model 500. The training data set may include a plurality of training data, each of which includes information of a plurality of training blocks included in a training table corresponding to the training data, and the information of each training block includes characters included in the training block, a coordinate position of the training block, and row and column information of the training block in the training table.
In one example, the SciTSR dataset can be used as the training dataset for the deep neural network model 500. The dataset comprises 12,000 training data, each corresponding to a training table (2,885 of which are complex tables); each training table comprises a plurality of blocks (referred to as training blocks), and the information of each training block includes the characters contained in the block, the coordinate position of the block, and the row and column information of the block in the training table. In addition, the dataset includes 3,000 test tables (716 of them complex tables) for testing the effect of the trained model or further optimizing it.
At sub-step 252, similar to sub-step 231 described above, at input layer 510 of deep neural network model 500, computing device 10 may determine training input data for deep neural network model 500 for a first training block and a second training block of two training blocks of a training table. Wherein the training input data comprises a text ID and position vector of the first training block, a text ID and position vector of the second training block, a relative position vector between the first training block and the second training block, and an adjacency matrix and a weight matrix of the training table.
At sub-step 253, similar to sub-step 232 described above, at the BioBERT network layer 520 of the deep neural network model 500, the computing device 10 may determine a feature vector of the first training block and a feature vector of the second training block based on the text ID of the first training block and the text ID of the second training block, respectively.
At sub-step 254, similar to sub-step 233 described above, at the first fused vector layer 530 of the deep neural network model 500, the computing device 10 may splice the position vector and the feature vector of the first training block to generate a fused vector of the first training block and splice the position vector and the feature vector of the second training block to generate a fused vector of the second training block.
At sub-step 255, similar to sub-step 234 described above, at the GCN network layer 540 of the deep neural network model 500, the computing device 10 may determine a convolved output vector for the first training block and a convolved output vector for the second training block based on the fused vector for the first training block, the fused vector for the second training block, and the adjacency matrix and the weight matrix of the training table, respectively.
At sub-step 256, similar to sub-step 235 described above, at the second fused vector layer 550 of the deep neural network model 500, the computing device 10 may concatenate the relative position vector between the first training block and the second training block, the fused vector and the convolved output vector of the first training block, and the fused vector and the convolved output vector of the second training block to determine the fused feature vector of the first training block and the second training block.
At sub-step 257, similar to sub-step 236 described above, at fully-connected network layer 560 of deep neural network model 500, computing device 10 may determine row probabilities that the first training block and the second training block are in the same row based on the fused feature vector of the first training block and the second training block and first fully-connected network 562, and determine column probabilities that the first training block and the second training block are in the same column based on the fused feature vector of the first training block and the second training block and second fully-connected network 564.
Unlike step 230 above, in sub-step 258, computing device 10 may perform Softmax regression on the row probabilities and the column probabilities of the first and second training blocks being in the same row and column, and determine row and column penalty values for the first and second training blocks using a cross entropy penalty function.
Here, as described in sub-step 236 above, the outputs of the first fully-connected network 562 and the second fully-connected network 564 are each a two-dimensional probability vector. Accordingly, in sub-step 258, computing device 10 may perform a Softmax regression on the two-dimensional probability vectors, respectively. The results of the Softmax regression may indicate a probability distribution that the first training block and the second training block are in the same row and a probability distribution in the same column.
Softmax regression converts the multi-class output values into a probability distribution over the range [0, 1] that sums to 1. That is, the sum of the probabilities that two blocks are in the same row and not in the same row is 1, and similarly the sum of the probabilities that two blocks are in the same column and not in the same column is also 1. The following equation applies:
p(y_i = j | x_i; θ) = exp(θ_j^T · x_i) / Σ_(l=1..k) exp(θ_l^T · x_i)
wherein x_i represents the input data (here, the two-dimensional probability vector generated in sub-step 257), θ represents the parameter vector of the Softmax model, and θ_l represents the l-th parameter in the parameter vector θ, where l = 1, 2, ..., k; j represents the class of x_i (here, in the same row versus not in the same row, or in the same column versus not in the same column), where j = 1, 2. Thus, p(y_i = j | x_i; θ) calculated in the above manner represents the probability distribution of the input data x_i over being in the same row or not, or over being in the same column or not.
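The Softmax regression above can be sketched for a two-dimensional probability vector as follows:

```python
import math

def softmax(logits):
    """Exponentiate each component of the vector and normalise so the
    outputs form a probability distribution summing to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

p_same, p_not_same = softmax([2.0, 0.0])
# p_same is roughly 0.881, p_not_same roughly 0.119, and the two sum to 1
```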
In one embodiment, the row penalty value or column penalty value calculated based on the cross-entropy loss function may be expressed as:
Loss = -[y·log(ŷ) + (1 - y)·log(1 - ŷ)]
wherein ŷ represents the probability that the sample label of the input data (here, the first training block and the second training block) is 1 (i.e., that the first training block and the second training block are in the same row or column), and y represents the true label.
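A minimal sketch of the binary cross-entropy penalty for a single sample, following the description above:

```python
import math

def cross_entropy(y_hat, y):
    """Binary cross-entropy for one sample: y is 1 when the two training
    blocks are in the same row (or column), and y_hat is the predicted
    probability of that label after Softmax regression."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# a confident correct prediction yields a small penalty value,
# a confident wrong prediction a large one
low, high = cross_entropy(0.9, 1), cross_entropy(0.1, 1)
```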
In sub-step 259, computing device 10 may determine a penalty value for deep neural network model 500 based on the row penalty value and the column penalty value for the first training block and the second training block and update the parameter matrix of deep neural network model 500 based on the penalty value.
In some embodiments of the present invention, the penalty value for the deep neural network model 500 is determined as the sum of the row penalty value and the column penalty value.
In some embodiments of the invention, a back propagation algorithm is used to update the weight function W_k and the bias function b_k (k = 1, 2, ..., K) of each layer of the deep neural network model 500, where K is the number of layers of the deep neural network model 500. Thus, in sub-step 259, the gradient values ∂Loss/∂W_K and ∂Loss/∂b_K of the last (K-th) layer of the deep neural network model 500 may be determined based on the penalty value and the weight function W_K and bias function b_K of the K-th layer.
computing device 20 may then update the weight function for each of the plurality of layers of deep neural network model 500 based on the gradient values of the last layer of deep neural network model 500.
For example, based on any one of the batch, mini-batch, or stochastic gradient descent methods, the gradient values of the (K-1)-th layer, the (K-2)-th layer, ..., and the 1st layer may be determined in turn from the gradient values ∂Loss/∂W_K and ∂Loss/∂b_K of the K-th layer, and the gradient value of each layer is used to update that layer's weight function W_k (and bias function b_k).
The operation of sub-step 259 above is repeated with a preset iteration step size until a maximum number of iterations or a stopping threshold is reached. At this point, the weight function W (and bias function b) of the deep neural network model 500 has been trained to a convergence value and can be used to calculate the probability that two blocks are in the same row or column.
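A minimal sketch of one gradient-descent parameter update, with the per-layer matrices W_k and b_k simplified to a flat list of hypothetical scalar parameters:

```python
def sgd_step(params, grads, lr=0.01):
    """One stochastic-gradient-descent update: move every parameter a
    small step against its gradient, as is done layer by layer for the
    weight functions W_k and bias functions b_k during back propagation."""
    return [p - lr * g for p, g in zip(params, grads)]

params = sgd_step([0.5, -0.3], [1.0, -2.0], lr=0.1)
# params moves to approximately [0.4, -0.1]
```

In practice this step is repeated over the training set until the maximum iteration count or the stopping threshold described above is reached.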
FIG. 15 illustrates a block diagram of a computing device 1500 suitable for implementing embodiments of the present invention. Computing device 1500 may be, for example, computing device 10 or server 20 as described above.
As shown in fig. 15, the computing device 1500 may include one or more Central Processing Units (CPUs) 1510 (only one shown schematically) that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM)1520 or loaded from a storage unit 1580 into a Random Access Memory (RAM) 1530. In the RAM 1530, various programs and data required for operation of the computing device 1500 may also be stored. The CPU 1510, ROM 1520, and RAM 1530 are connected to each other via a bus 1540. An input/output (I/O) interface 1550 is also connected to bus 1540.
A number of components in computing device 1500 connect to I/O interface 1550, including: an input unit 1560 such as a keyboard, a mouse, or the like; an output unit 1570 such as various types of displays, speakers, and the like; a storage unit 1580 such as a magnetic disk, an optical disk, or the like; and a communication unit 1590 such as a network card, a modem, a wireless communication transceiver, etc. The communication unit 1590 allows the computing device 1500 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks.
The method 200 described above may be performed, for example, by the CPU 1510 of a computing device 1500, such as computing device 10 or server 20. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1580. In some embodiments, part or all of the computer program can be loaded and/or installed onto the computing device 1500 via the ROM 1520 and/or the communication unit 1590. When the computer program is loaded into RAM 1530 and executed by CPU 1510, one or more operations of method 200 described above may be performed. Further, the communication unit 1590 may support wired or wireless communication functions.
Those skilled in the art will appreciate that the computing device 1500 illustrated in FIG. 15 is merely illustrative. In some embodiments, computing device 10 or server 20 may contain more or fewer components than computing device 1500.
The method 200 for processing a form and the computing device 1500 usable as the computing device 10 or the server 20 according to the present invention are described above with reference to the drawings. However, it will be understood by those skilled in the art that the steps of method 200 and its sub-steps are not limited to the order shown in the figures and described above, but may be performed in any other reasonable order. Further, the computing device 1500 need not include all of the components shown in FIG. 15, it may include only some of the components necessary to perform the functions described in the present disclosure, and the manner in which these components are connected is not limited to the form shown in the figures.
The present invention may be methods, apparatus, systems and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
In one or more exemplary designs, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. For example, if implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium.
The units of the apparatus disclosed herein may be implemented using discrete hardware components, or may be integrally implemented on a single hardware component, such as a processor. For example, the various illustrative logical blocks, modules, and circuits described in connection with the invention may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both.
The previous description of the invention is provided to enable any person skilled in the art to make or use the invention. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the present invention is not intended to be limited to the examples and designs described herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (13)

1. A method of processing a form, comprising:
intercepting one or more table subgraphs from the picture by using a target detection model, wherein each table subgraph comprises a table;
performing optical character recognition on each table subgraph to detect a plurality of word blocks in the table subgraph, wherein each word block contains one or more characters;
predicting the row probability of any two blocks in the plurality of blocks in the same row and the column probability of any two blocks in the same column by using a deep neural network model; and
and carrying out structural reorganization on the plurality of word blocks based on the row probability and the column probability of any two word blocks in the plurality of word blocks in the same row and the same column so as to reconstruct the table into a structural table.
2. The method of claim 1, wherein the deep neural network model comprises an input layer, a BioBERT network layer, a first fused vector layer, a GCN network layer, a second fused vector layer, a fully-connected network layer, and an output layer, wherein predicting row probabilities and column probabilities of any two of the plurality of blocks being in a same row and column with the deep neural network model comprises:
determining, at the input layer, input data of the deep neural network model for a first block and a second block of two blocks to be predicted, wherein the input data includes a first text ID and a first position vector of the first block, a second text ID and a second position vector of the second block, a relative position vector between the first block and the second block, and an adjacency matrix and a weight matrix of the table subgraph;
determining, at the BioBERT network layer, a first feature vector of the first block and a second feature vector of the second block based on a first text ID of the first block and a second text ID of the second block, respectively;
at the first fused vector layer, splicing a first position vector and a first feature vector of the first block to generate a first fused vector of the first block, and splicing a second position vector and a second feature vector of the second block to generate a second fused vector of the second block;
at the GCN network layer, determining a first convolution output vector of the first block and a second convolution output vector of the second block based on the first fused vector of the first block, the second fused vector of the second block, and the adjacency matrix and the weight matrix of the table subgraph, respectively;
at the second fused vector layer, splicing the relative position vector between the first block and the second block, the first fused vector and the first convolution output vector of the first block, and the second fused vector and the second convolution output vector of the second block to determine a fused feature vector of the first block and the second block;
at the fully-connected network layer, predicting row probabilities that the first block and the second block are in the same row based on a fused feature vector of the first block and the second block and a first fully-connected network, and predicting column probabilities that the first block and the second block are in the same column based on the fused feature vector and a second fully-connected network; and
at the output layer, outputting the row probabilities and the column probabilities of the first block and the second block.
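A shape-level sketch of the forward pass described in claim 2 (from the GCN layer through the two fully-connected heads), using randomly initialized stand-ins for the trained weights; the dimensions and the normalized adjacency `A_hat` are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def gcn_layer(X, A_hat, W):
    """One graph-convolution step: neighbourhood aggregation then ReLU."""
    return np.maximum(A_hat @ X @ W, 0.0)

def predict_pair(fused, A_hat, W_gcn, W_row, W_col, i, j, rel_pos):
    """Predict (row, column) same-group probabilities for blocks i and j.

    fused: (n_blocks, d) position+text fused vectors (first fused vector
    layer); W_row / W_col stand in for the two fully-connected heads."""
    conv = gcn_layer(fused, A_hat, W_gcn)            # GCN network layer
    pair = np.concatenate([rel_pos,                  # second fused vector layer
                           fused[i], conv[i],
                           fused[j], conv[j]])
    return softmax(W_row @ pair)[1], softmax(W_col @ pair)[1]
```

In a trained model the fused vectors would come from BioBERT text features spliced with position vectors; here they are random placeholders.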
3. The method of claim 2, wherein determining input data for the deep neural network model for a first block and a second block of two blocks to be predicted comprises:
converting the text of the first block and the second block into the first text ID and the second text ID, respectively;
acquiring a first position vector of the first block and a second position vector of the second block based on the position information of the first block and the second block respectively;
determining a relative position vector between the first block and the second block based on a first position vector of the first block and a second position vector of the second block;
determining an adjacency matrix of the table subgraph based on distances between the plurality of word blocks of the table subgraph; and
determining a weight matrix for the table subgraph based on the adjacency matrix.
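Claim 3 derives the adjacency and weight matrices from inter-block distances without fixing a formula; the k-nearest-neighbour adjacency and the 1/(1+d) distance-decayed weighting below are therefore assumptions made for this sketch:

```python
import numpy as np

def build_graph(centers, k=3):
    """Build an adjacency matrix A (k-nearest-neighbour, symmetrised)
    and a weight matrix W (weights decaying with distance) from the
    centre coordinates of the word blocks."""
    centers = np.asarray(centers, dtype=float)
    n = len(centers)
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)

    A = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dist[i])[1:k + 1]  # skip self at index 0
        A[i, nearest] = 1.0
    A = np.maximum(A, A.T)                      # make edges undirected

    W = A / (1.0 + dist)                        # closer blocks weigh more
    np.fill_diagonal(W, 0.0)
    return A, W
```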
4. The method of claim 3, wherein obtaining a first location vector for the first block and a second location vector for the second block based on location information for the first block and the second block, respectively, comprises:
determining normalized coordinate information of the first block based on the position information of the first block, determining a normalized center position of the first block and a normalized width and a normalized height of the first block based on the normalized coordinate information of the first block, and determining a first position vector of the first block based on the normalized coordinate information, the normalized center position, the normalized width, and the normalized height of the first block, and
determining normalized coordinate information of the second block based on the position information of the second block, determining a normalized center position of the second block and normalized width and normalized height of the second block based on the normalized coordinate information of the second block, and determining a second position vector of the second block based on the normalized coordinate information, normalized center position, normalized width, and normalized height of the second block.
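The normalization in claim 4 can be sketched as follows, assuming a bounding box (x1, y1, x2, y2) in pixels normalized by the subgraph's width and height; the concrete vector layout is an assumption:

```python
def position_vector(box, img_w, img_h):
    """Build a block's position vector: normalized corner coordinates,
    normalized centre position, and normalized width and height."""
    x1, y1, x2, y2 = box
    nx1, ny1 = x1 / img_w, y1 / img_h
    nx2, ny2 = x2 / img_w, y2 / img_h
    cx, cy = (nx1 + nx2) / 2, (ny1 + ny2) / 2   # normalized centre
    w, h = nx2 - nx1, ny2 - ny1                 # normalized size
    return [nx1, ny1, nx2, ny2, cx, cy, w, h]
```

The relative position vector of claim 3 can then be taken as the element-wise difference of two such vectors.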
5. The method of claim 1, wherein structurally reorganizing the plurality of blocks to reconstruct the table into a structured table based on row probabilities and column probabilities of any two of the plurality of blocks being in a same row and a same column comprises:
for each block of the plurality of blocks, determining a number of rows and a number of columns of the structured table based on a row probability and a column probability that the block is in a same row and a same column with other blocks of the plurality of blocks and a positional relationship between the block and the other blocks;
determining boundaries of each candidate cell in the structured table based on the location information of each block of the plurality of blocks and the number of rows and columns of the structured table;
determining whether two adjacent candidate cells in the structured table should be merged based on row probabilities and column probabilities between blocks contained by the two adjacent candidate cells;
in response to determining that the two neighboring candidate cells should be merged, merging the two neighboring candidate cells into one cell;
in response to determining that the two neighboring candidate cells should not be merged, determining the two neighboring candidate cells as two separate cells; and
merging the word blocks contained in each cell based on the location information of the plurality of word blocks to reconstruct the structured table.
6. The method of claim 5, wherein determining the number of rows and columns of the structured table comprises:
for each target block of the plurality of blocks, determining a set of candidate right blocks, a set of candidate left blocks, a set of candidate top blocks, and a set of candidate bottom blocks of the target block;
determining a right block, a left block, an upper block and a lower block of the target block based on the position information of each block in the candidate right block set, the candidate left block set, the candidate upper block set and the candidate lower block set of the target block and the position information of the target block respectively;
determining a rightmost block set of the table subgraph based on a right block of each block of the plurality of blocks;
determining the number of left word blocks of each rightmost word block in the rightmost word block set;
determining a number of columns of the structured table based on a number of left blocks of each rightmost block in the set of rightmost blocks;
determining a lowest set of blocks of the table subgraph based on a lower block of each block of the plurality of blocks;
determining the number of the upper word blocks of each lowermost word block in the lowermost word block set; and
determining a number of rows of the structured table based on a number of top word blocks of each bottom word block in the set of bottom word blocks.
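Claim 6's column count can be illustrated by walking left-neighbour chains back from each rightmost block; the `right_of` mapping is a hypothetical stand-in for the neighbour relations determined from the position information:

```python
def count_columns(right_of):
    """right_of[b] gives block b's right-neighbour block, or None if b is
    a rightmost block. The number of columns is one plus the longest
    chain of left blocks ending at any rightmost block."""
    left_of = {v: k for k, v in right_of.items() if v is not None}
    n_cols = 0
    for b, r in right_of.items():
        if r is None:                   # b is in the rightmost block set
            chain, cur = 0, b
            while cur in left_of:       # count left blocks of b
                cur = left_of[cur]
                chain += 1
            n_cols = max(n_cols, chain + 1)
    return n_cols
```

The row count follows symmetrically from a `below_of` mapping over the lowermost block set.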
7. The method of claim 5, wherein determining the boundary of each candidate cell in the structured table comprises:
constructing a row feature vector of each block based on an upper boundary coordinate, a lower boundary coordinate and a center position ordinate of the block;
clustering the plurality of blocks based on the row feature vectors of the plurality of blocks and the number of rows of the structured table to determine a plurality of row categories for the plurality of blocks, wherein the number of row categories is equal to the number of rows of the structured table;
for each row category in the plurality of row categories, determining an average center ordinate of the blocks contained in the row category;
sorting the plurality of row categories by the size of the average center ordinate of the blocks contained in each row category;
determining the average ordinate of each sorted row category; and
determining an ordinate of a row dividing line between two sorted adjacent row categories based on the average ordinates of the two sorted adjacent row categories.
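Claim 7 leaves the clustering algorithm unspecified; the sketch below substitutes a simple gap-based grouping of center ordinates for the clustering step, then places each row dividing line midway between the average ordinates of adjacent row categories. The dict-based block representation is an assumption:

```python
def row_dividers(blocks, n_rows):
    """Return the ordinates of the n_rows-1 row dividing lines.

    Sort blocks by centre ordinate, cut at the n_rows-1 largest gaps
    (a stand-in for clustering into n_rows row categories), then put
    each divider midway between adjacent categories' average ordinates."""
    ys = sorted((b["y1"] + b["y2"]) / 2 for b in blocks)
    cuts = sorted(sorted(range(len(ys) - 1),
                         key=lambda i: ys[i + 1] - ys[i],
                         reverse=True)[:n_rows - 1])
    groups, start = [], 0
    for c in cuts:                       # split at the chosen gaps
        groups.append(ys[start:c + 1])
        start = c + 1
    groups.append(ys[start:])
    means = [sum(g) / len(g) for g in groups]
    return [(means[i] + means[i + 1]) / 2 for i in range(len(means) - 1)]
```

Column dividing lines (claim 8) follow by the same procedure over abscissas.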
8. The method of claim 5, wherein determining the boundary of each candidate cell in the structured table comprises:
constructing a column feature vector of each block based on a left boundary coordinate, a right boundary coordinate and a center position abscissa of the block;
clustering the plurality of blocks based on the column feature vectors of the plurality of blocks and the number of columns of the structured table to determine a plurality of column categories for the plurality of blocks, wherein the number of column categories is equal to the number of columns of the structured table;
for each column category in the plurality of column categories, determining an average center abscissa of the blocks contained in the column category;
sorting the plurality of column categories by the size of the average center abscissa of the blocks contained in each column category;
determining the average abscissa of each sorted column category; and
determining an abscissa of a column dividing line between two sorted adjacent column categories based on the average abscissas of the two sorted adjacent column categories.
9. The method of claim 5, wherein the two neighboring candidate cells are located in a same column and include a first candidate cell located in an i-th row and a second candidate cell located in an (i+1)-th row, and determining whether the two neighboring candidate cells should be merged comprises:
determining a third set of blocks in the i-th row other than the first candidate cell;
determining a fourth set of blocks in the (i+1)-th row other than the second candidate cell;
determining a row merge value between the first candidate cell and the second candidate cell based on a row probability between any block in the i-th row and any block in the (i+1)-th row, a number of blocks included in the first candidate cell, a number of blocks included in the second candidate cell, a number of blocks included in the third set of blocks, and a number of blocks included in the fourth set of blocks;
determining whether the row merge value is greater than a predetermined threshold;
in response to determining that the row merge value is greater than the predetermined threshold, determining that the first candidate cell and the second candidate cell should be merged; and
determining that the first candidate cell and the second candidate cell should not be merged in response to determining that the row merge value is less than or equal to the predetermined threshold.
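Claims 9 and 10 do not give the exact merge-value formula; the sketch below assumes the merge value is the mean pairwise probability between the blocks of the two candidate cells, thresholded as the claims describe:

```python
def should_merge(cell_a, cell_b, pair_prob, threshold=0.5):
    """Decide whether two adjacent candidate cells should be merged.

    cell_a / cell_b: block indices in each candidate cell.
    pair_prob: dict mapping (block_i, block_j) to the row (or column)
    probability predicted by the model. The mean-probability merge
    value is an assumption; the patent only states the value depends
    on pairwise probabilities and block counts."""
    pairs = [(a, b) for a in cell_a for b in cell_b]
    merge_value = sum(pair_prob[p] for p in pairs) / len(pairs)
    return merge_value > threshold
```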
10. The method of claim 5, wherein the two neighboring candidate cells are located in a same row and include a first candidate cell located in a j-th column and a second candidate cell located in a (j+1)-th column, and determining whether the two neighboring candidate cells should be merged comprises:
determining a fifth set of blocks in the j-th column other than the first candidate cell;
determining a sixth set of blocks in the (j+1)-th column other than the second candidate cell;
determining a column merge value between the first candidate cell and the second candidate cell based on a column probability between any word block in the j-th column and any word block in the (j+1)-th column, a number of word blocks included in the first candidate cell, a number of word blocks included in the second candidate cell, a number of word blocks included in the fifth set of word blocks, and a number of word blocks included in the sixth set of word blocks;
determining whether the column merge value is greater than a predetermined threshold;
in response to determining that the column merge value is greater than the predetermined threshold, determining that the first candidate cell and the second candidate cell should be merged; and
determining that the first candidate cell and the second candidate cell should not be merged in response to determining that the column merge value is less than or equal to the predetermined threshold.
11. The method of claim 1, further comprising:
acquiring a training data set for the deep neural network model, wherein the training data set comprises a plurality of pieces of training data, each piece of training data comprises information of a plurality of training word blocks contained in a training table corresponding to the piece of training data, and the information of each training word block comprises characters contained in the training word block, a coordinate position of the training word block, and row and column information of the training word block in the training table;
determining, at an input layer of the deep neural network model, training input data of the deep neural network model for a first training block and a second training block of two training blocks of the training table, wherein the training input data includes a text ID and a position vector of the first training block, a text ID and a position vector of the second training block, a relative position vector between the first training block and the second training block, and an adjacency matrix and a weight matrix of the training table;
determining, at a BioBERT network layer of the deep neural network model, a feature vector of the first training block and a feature vector of the second training block based on the text ID of the first training block and the text ID of the second training block, respectively;
at a first fused vector layer of the deep neural network model, splicing the position vector and the feature vector of the first training block to generate a fused vector of the first training block, and splicing the position vector and the feature vector of the second training block to generate a fused vector of the second training block;
determining, at a GCN network layer of the deep neural network model, a convolution output vector of the first training block and a convolution output vector of the second training block based on the fused vector of the first training block, the fused vector of the second training block, and the adjacency matrix and the weight matrix of the training table, respectively;
at a second fused vector layer of the deep neural network model, splicing the relative position vector between the first training block and the second training block, the fused vector and the convolution output vector of the first training block, and the fused vector and the convolution output vector of the second training block to determine a fused feature vector of the first training block and the second training block;
determining, at a fully-connected network layer of the deep neural network model, row probabilities that the first training word block and the second training word block are in a same row based on a fused feature vector of the first training word block and the second training word block and a first fully-connected network, and column probabilities that the first training word block and the second training word block are in a same column based on a fused feature vector of the first training word block and the second training word block and a second fully-connected network;
performing Softmax regression on row probabilities and column probabilities of the first and second training blocks being in a same row and a same column, and determining row and column loss values for the first and second training blocks using a cross entropy loss function; and
determining a loss value for the deep neural network model based on the row loss value and the column loss value for the first training block and the second training block, and updating a parameter matrix of the deep neural network model based on the loss value.
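The loss computation in claim 11 (Softmax followed by cross entropy on the two-class row and column heads) can be sketched as follows; summing the two losses with equal weight is an assumption, since the claim does not state how they are combined:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def pair_loss(row_logits, col_logits, same_row, same_col):
    """Cross-entropy loss for one training block pair.

    row_logits / col_logits: two-class logits from the row and column
    fully-connected heads; same_row / same_col: ground-truth labels
    (True if the pair shares a row / column)."""
    row_p = softmax(np.asarray(row_logits, dtype=float))
    col_p = softmax(np.asarray(col_logits, dtype=float))
    row_loss = -np.log(row_p[int(same_row)])   # cross entropy, row head
    col_loss = -np.log(col_p[int(same_col)])   # cross entropy, column head
    return row_loss + col_loss                 # equal weighting assumed
```

In training, this scalar would be backpropagated to update the parameter matrices of the fully-connected, GCN, and fusion layers.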
12. A computing device, comprising:
at least one processor; and
at least one memory coupled to the at least one processor and storing instructions for execution by the at least one processor, the instructions when executed by the at least one processor causing the computing device to perform the steps of the method of any of claims 1-11.
13. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 11.
CN202110529807.4A 2021-05-14 2021-05-14 Method of processing table, computing device, and computer-readable storage medium Pending CN113221523A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110529807.4A CN113221523A (en) 2021-05-14 2021-05-14 Method of processing table, computing device, and computer-readable storage medium


Publications (1)

Publication Number Publication Date
CN113221523A true CN113221523A (en) 2021-08-06

Family

ID=77092011

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110529807.4A Pending CN113221523A (en) 2021-05-14 2021-05-14 Method of processing table, computing device, and computer-readable storage medium

Country Status (1)

Country Link
CN (1) CN113221523A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705175A (en) * 2021-08-18 2021-11-26 厦门海迈科技股份有限公司 Method, server and storage medium for simplifying rows and columns of electronic forms
CN113705175B (en) * 2021-08-18 2024-02-23 厦门海迈科技股份有限公司 Method, server and storage medium for simplifying rows and columns of electronic forms
CN113936287A (en) * 2021-10-20 2022-01-14 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium
CN113936287B (en) * 2021-10-20 2024-07-12 平安国际智慧城市科技股份有限公司 Table detection method and device based on artificial intelligence, electronic equipment and medium

Similar Documents

Publication Publication Date Title
CN108549893B (en) End-to-end identification method for scene text with any shape
CN113297975B (en) Table structure identification method and device, storage medium and electronic equipment
US20180137349A1 (en) System and method of character recognition using fully convolutional neural networks
CN109993102B (en) Similar face retrieval method, device and storage medium
US11288324B2 (en) Chart question answering
CN112819686B (en) Image style processing method and device based on artificial intelligence and electronic equipment
CN113011186B (en) Named entity recognition method, named entity recognition device, named entity recognition equipment and computer readable storage medium
CN107683469A (en) A kind of product classification method and device based on deep learning
CN111488826A (en) Text recognition method and device, electronic equipment and storage medium
CN108334805B (en) Method and device for detecting document reading sequence
CN112434691A (en) HS code matching and displaying method and system based on intelligent analysis and identification and storage medium
EP3539051A1 (en) System and method of character recognition using fully convolutional neural networks
CN110968692B (en) Text classification method and system
CN114283350B (en) Visual model training and video processing method, device, equipment and storage medium
CN114612921B (en) Form recognition method and device, electronic equipment and computer readable medium
CN113221523A (en) Method of processing table, computing device, and computer-readable storage medium
CN112528894A (en) Method and device for distinguishing difference items
CN111898704B (en) Method and device for clustering content samples
CN115131613A (en) Small sample image classification method based on multidirectional knowledge migration
CN117083605A (en) Iterative training for text-image-layout transformer models
Al-Barhamtoshy et al. An arabic manuscript regions detection, recognition and its applications for OCRing
CN113240033A (en) Visual relation detection method and device based on scene graph high-order semantic structure
US20230134218A1 (en) Continuous learning for document processing and analysis
CN111768214A (en) Product attribute prediction method, system, device and storage medium
CN115410185A (en) Method for extracting specific name and unit name attributes in multi-modal data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination