CN109389027B - Form structure extraction network - Google Patents


Info

Publication number
CN109389027B
CN109389027B (application CN201810483302.7A)
Authority
CN
China
Prior art keywords
rnn
feature map
document
image
tile
Prior art date
Legal status
Active
Application number
CN201810483302.7A
Other languages
Chinese (zh)
Other versions
CN109389027A (en)
Inventor
M. Sarkar
B. Krishnamurthy
Current Assignee
Adobe Inc
Original Assignee
Adobe Systems Inc
Priority date
Filing date
Publication date
Application filed by Adobe Systems Inc filed Critical Adobe Systems Inc
Publication of CN109389027A publication Critical patent/CN109389027A/en
Application granted granted Critical
Publication of CN109389027B publication Critical patent/CN109389027B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/414Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/2137Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps
    • G06F18/21375Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps involving differential geometry, e.g. embedding of pattern manifold
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2148Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/2163Partitioning the feature space
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4046Scaling of whole images or parts thereof, e.g. expanding or contracting using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4053Scaling of whole images or parts thereof, e.g. expanding or contracting based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4092Image resolution transcoding, e.g. by using client-server architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/11Region-based segmentation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • G06T7/143Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/22Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/50Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19147Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274Syntactic or semantic context, e.g. balancing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/413Classification of content, e.g. text, photographs or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/416Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20021Dividing image into blocks, subimages or windows
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30176Document
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/43Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A method and system for detecting and extracting accurate and precise structures in a document. A high resolution image of the document is partitioned into a set of tiles. Each tile is processed by a convolutional network and then by a set of recurrent networks that process each row and each column. A global lookup process is disclosed that allows the recurrent neural networks to consider the "future" information required for accurate evaluation. The use of high resolution images allows for accurate and precise feature extraction, while segmentation into tiles facilitates processing high resolution images within reasonable computational resources.

Description

Form structure extraction network
Technical Field
The present disclosure relates to techniques for identifying the structure and semantics of form documents, such as PDFs. In particular, the present disclosure relates to techniques for processing documents using deep learning and deep neural networks ("DNNs") to extract structure and semantics.
Background
The use of forms for capturing and disseminating information has become ubiquitous. Typically, these forms have not been digitized and exist only in hard copy format. Even when forms have been digitized and converted to an electronic format, they may support interaction only via a particular electronic device (such as a personal computer) and may not be accessible on mobile devices. An adaptive form is an electronic form that can automatically adapt to viewing and input on a variety of devices, each with a different form factor, such as a personal computer, tablet computer, smart phone, and so on.
Enterprises and governments are undergoing digital transformations in which mobile is central to the digital strategy for all new offerings. This trend in digital technology is driven by a range of attractive business and revenue incentives. Thus, organizations need to digitize their offerings and provide multi-channel experiences. However, many existing account registration and service request flows are still paper-based. Currently, to adopt digital adaptive form technology, businesses must employ form/content authors to manually replicate the current experience and build a mobile-ready experience on a segment-by-segment basis, which is time consuming, expensive, and requires IT ("information technology") skills.
The elements in a form are typically arranged in a hierarchy. For example, a document is a top-level element. Below the document there may be sections, which constitute the next level in the hierarchy, and so on.
A field is another important form structural element. Fields may include a combination of widgets and captions. A widget is an area of a form that facilitates and prompts a user to enter information. Each widget may have a caption associated with it. A caption is a piece of text or other signaling information that helps a user provide input in a widget. Examples of widgets may include sections and option groups. An option group is a set of items that allows a user to select one or more items via check boxes or radio buttons. A table is another example of a structural element; it may include column headings, row headings, and the actual widgets in which the user fills in information. In addition, a form typically contains text portions made up of paragraphs, lines of text, and words. Images may even be embedded in a form.
One of the main problems in rapidly converting paper forms to adaptive forms is identifying the structure and semantics of the form document from an image or similar format. Once the form structure is extracted and its hierarchical properties captured, this structure information can be used for various purposes, such as creating an adaptive electronic form.
Machine learning and deep neural networks ("DNNs") have been applied to document structure extraction. However, due to the computational cost of using high resolution images (e.g., memory requirements and limitations on efficient information propagation), known methods for applying DNNs to extract document structure from images require the use of lower resolution input images. Thus, typically, the input image provided to the DNN for structure extraction is first downsampled from a higher resolution image. While the use of lower resolution document images addresses the practical problem of reducing the computational cost of performing form identification and extraction, it also places a significant limitation on the ability of DNNs to extract very fine structures in the document. Therefore, techniques are needed for extracting document structure from high resolution document images in a computationally efficient and tractable manner using machine learning and DNNs.
Drawings
FIG. 1a is a flowchart depicting the operation of a form structure extraction network in accordance with an embodiment of the present disclosure;
FIG. 1b is a flow chart depicting more detailed operation of a form structure extraction network in accordance with an embodiment of the present disclosure;
FIG. 2a is a block diagram of a form extraction network according to an embodiment of the present disclosure;
FIG. 2b is a detailed block diagram of global lookup block 216 according to an embodiment of the present disclosure;
FIG. 2c is a flow chart of a global lookup process according to an embodiment of the present disclosure;
FIG. 3a depicts 2D RNN processing of a portion of a high resolution image that has been segmented into tile sets according to one embodiment of the present invention;
FIG. 3b depicts an architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure;
FIG. 3c depicts an alternative architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure;
FIG. 3d depicts a single threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure;
FIG. 3e depicts a multi-threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure;
FIG. 4 depicts an input image and an output image processed by a form extraction network according to an embodiment of the present disclosure;
FIG. 5 depicts an input image and an output image processed by a form extraction network according to an embodiment of the present disclosure;
FIG. 6a illustrates an example computing system executing a form extraction network 200 according to various embodiments of the disclosure; and
FIG. 6b illustrates an example integration of the document extraction network 200 into a network environment according to one embodiment of this disclosure.
Detailed Description
In accordance with embodiments described in this disclosure, techniques are described for identifying and extracting the structure and semantics of a form document from a high resolution image of the form document. For purposes of discussion, the terms "form document" and "form" will be used interchangeably. After extracting the structure of the form, the structure information may be utilized to adjust the form to be used in the desired context. Examples of form structures may include logical portions of forms, personal information such as credit card or address information, financial information, form titles, headers, footers, and the like.
According to embodiments described in this disclosure, the form extraction network includes a deep neural network ("DNN") architecture that can automatically identify various form elements and larger semantic structures based on high resolution images of the form. According to embodiments of the present disclosure, the form extraction network provides an end-to-end differentiable pipeline for detecting and extracting document structures. According to embodiments of the present disclosure, the form extraction network receives a high resolution image (comprising raw pixels) of the document form to be analyzed and generates classification features corresponding to the form elements. In particular, according to one embodiment, each pixel of the high resolution image is associated with a classification vector that indicates the probability that the pixel belongs to a particular class. The entire set of classified pixels of the high resolution document image may then be utilized to classify larger groupings of pixels into particular form elements.
To reduce the computational resource requirements of processing high resolution images, the form extraction network may use an iterative process to process subsets of the document image. Each subset of the document form image is referred to herein as a tile and includes a subset of the pixels in the entire document form image. The form extraction network may include a convolutional network for detecting features of individual tiles of the form, a multi-dimensional recurrent neural network ("RNN") for maintaining spatial state information across the tiles, and a global lookup module for modifying state information of the multi-dimensional RNN based on a global lookup of form features from a lower dimensional image of the form document. It will be appreciated that an RNN is a type of neural network that is well suited to processing sequences.
Briefly, in accordance with embodiments of the present disclosure, an architecture for performing form extraction from high resolution document images may include two branches: (1) a first branch that generates a global tensor representation of the entire image via an auto-encoder, and (2) a second branch that includes convolutional and 2D-RNN layers that operate on the image in a tile-by-tile manner. According to various embodiments, the state of the RNN is stored at tile boundaries and is then used to initialize the RNNs of subsequent tiles. The RNN is also equipped with an attention mechanism that can find and retrieve information from the global document representation of the first branch.
According to various embodiments, the global lookup function may be performed by extracting features from a lower resolution representation of the high resolution image. The global lookup can be performed on the smaller dimensional image, which provides significant computational benefits. This allows the 2D RNN to look ahead based on features detected in the lower dimensional representation of the whole image. Thus, a 2D RNN running on a high resolution image may access features that have been extracted from the low-resolution branch and perform a lookup to make a decision about the current pixel, thereby utilizing "future" information relative to the direction in which the 2D RNN is running.
Thus, according to embodiments described in this disclosure, a convolutional network that processes individual tiles of a high resolution document image is combined with a multi-dimensional RNN to account for information across tiles. According to various embodiments, a global look-up function is provided that allows the 2D RNN to look ahead (i.e., consider "future" information in the context of the direction of 2D RNN operation).
FIG. 1a is a flowchart depicting the operation of a form extraction network in accordance with an embodiment of the present disclosure. The process begins at 122. At 124, a high resolution document image comprising a plurality of pixels is partitioned into a set of tiles, each tile comprising a subset of the pixels of the high resolution document image. At 126, a determination is made as to whether all tiles have been processed. If not (the "NO" branch of 126), the current tile is updated at 128. At 130, the tile is then processed by the neural network to classify the pixels in the tile as particular document elements. A process and system for performing such classification is described below with respect to FIGS. 1b and 2a through 2c. Flow then continues to 126.
If all tiles have been processed (the "YES" branch of 126), flow continues to 132, where an editable version of the document is generated from the classified pixels. The process ends at 134.
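Purely as an illustrative sketch of the control flow of FIG. 1a, and not as the patented implementation, the tile iteration of steps 124 through 132 might be organized as follows in Python; the function name `classify_tile` is a hypothetical stand-in for the neural network processing of step 130:

```python
import numpy as np

def extract_structure(image: np.ndarray, tile_h: int, tile_w: int, classify_tile):
    """Sketch of FIG. 1a: partition a page image into tiles, classify every
    pixel of each tile, and assemble a full-page label map (non-overlapping
    tiles are assumed here for simplicity)."""
    H, W = image.shape[:2]
    labels = np.zeros((H, W), dtype=np.int64)            # per-pixel class ids
    for top in range(0, H, tile_h):                      # 124: partition into tiles
        for left in range(0, W, tile_w):
            tile = image[top:top + tile_h, left:left + tile_w]
            tile_labels = classify_tile(tile)            # 130: classify pixels in the tile
            labels[top:top + tile_h, left:left + tile_w] = tile_labels
    return labels                                        # 132: basis for an editable version
```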
FIG. 1b is a flowchart depicting detailed operation of a form structure extraction network according to an embodiment of the present disclosure. The process begins at 102. At 104, the high resolution image is segmented into a plurality of tiles. According to embodiments described in the present disclosure, the input image provided to the form extraction network is a high resolution image of the document. Because a high resolution image is used, a larger convolutional neural network would be required to process the entire image at once; otherwise, the image would have to be downsampled to a lower resolution. However, as previously mentioned, larger convolutional neural networks present significant computational challenges, particularly with respect to available computer memory and the information propagation requirements within the computing architecture.
To address these computational challenges, according to embodiments described in this disclosure, the high resolution image is partitioned into a set of tiles. Each tile may be a subset of pixels from the original high resolution image, and each tile may then be processed separately from the others. Because each tile retains the resolution of the original image, the high resolution quality of the image is not degraded. And because each tile includes only a subset of the original high resolution image and is processed independently of the other tiles, the instantaneous memory and other computational demands of processing the entire high resolution image are alleviated. According to embodiments described herein, tiles are generated from the image by dividing the image into rows and columns, each having a respective height and width, as sketched below. According to some embodiments, tiles may overlap each other.
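As a minimal sketch of the tiling step only, assuming 227 × 227 tiles and an arbitrarily chosen stride to produce the overlap (the disclosure states only that tiles may overlap, not the stride value):

```python
import numpy as np

def make_tiles(image: np.ndarray, tile: int = 227, stride: int = 200):
    """Split a high resolution page image into overlapping tiles without
    downsampling, so each tile keeps the original pixel resolution.
    Padding of the right/bottom edges is omitted for brevity."""
    H, W = image.shape[:2]
    tiles, origins = [], []
    for top in range(0, max(H - tile, 0) + 1, stride):
        for left in range(0, max(W - tile, 0) + 1, stride):
            tiles.append(image[top:top + tile, left:left + tile])
            origins.append((top, left))  # origins let per-tile outputs be stitched back
    return tiles, origins
```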
At 106, a determination is made as to whether all tiles have been processed. If yes (yes branch of 106), a global feature map of the entire image is generated at 118. Techniques for generating a global feature map of an entire image are described below. The process then ends at 120.
If all tiles have not been processed ("NO" branch of 106), the currently pending tile is updated from the pool of all available tiles for the document image. At 110, a current tile is processed by a convolutional neural network to generate a first feature map. Example embodiments of convolutional neural networks are described below.
Because the convolutional network "sees" or processes only individual tiles at a time, features across multiple tiles cannot be extracted. To address this problem, a state-preserving network, such as an RNN, may be used to leverage information across multiple tiles. In particular, as will be described, according to various embodiments, a 2D RNN may be employed to maintain state information across the horizontal and vertical spatial dimensions of a document image using hidden state representations. It will become apparent that the 2D RNN can be decomposed into a vertical RNN and a horizontal RNN. Further, the vertical RNNs may include a set of RNNs, and the horizontal RNNs may also include a set of RNNs, such that both vertical and horizontal RNNs may operate in parallel. A description of parallel operation of the vertical and horizontal RNNs is provided below.
Thus, at 112, the vertical RNN processes the first feature map of the current tile row by row in the vertical dimension. According to various embodiments, all columns of the first feature map of the current tile may be processed in parallel by a corresponding set of RNNs making up the vertical RNN. In this way, the vertical RNN generates a second feature map from the first feature map.
In a manner similar to the vertical RNN, at 114 the horizontal RNN processes the second feature map column by column to generate a third feature map. As with the vertical RNN, since the horizontal RNN may be composed of a separate set of RNNs, the horizontal RNN may process each row of the second feature map in parallel.
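A hedged PyTorch sketch of this decomposition follows; the LSTM hidden sizes and the use of `torch.nn.LSTM` are assumptions, and the carrying of hidden state across tile boundaries (described with respect to FIG. 3a below) is omitted:

```python
import torch
import torch.nn as nn

class TwoDRNN(nn.Module):
    """Vertical-then-horizontal recurrent passes over an H x W x C feature map."""
    def __init__(self, c_in: int, s_vert: int, s_horiz: int):
        super().__init__()
        self.vertical = nn.LSTM(input_size=c_in, hidden_size=s_vert, batch_first=True)
        self.horizontal = nn.LSTM(input_size=s_vert, hidden_size=s_horiz, batch_first=True)

    def forward(self, fmap: torch.Tensor) -> torch.Tensor:
        H, W, C = fmap.shape
        # Vertical pass (step 112): each column is a sequence of H steps, and all
        # W columns are processed in parallel as one batch.
        cols = fmap.permute(1, 0, 2)                 # (W, H, C)
        v_out, _ = self.vertical(cols)               # (W, H, S)
        second_map = v_out.permute(1, 0, 2)          # (H, W, S), the second feature map
        # Horizontal pass (step 114): each row is a sequence of W steps, and all
        # H rows are processed in parallel as one batch.
        third_map, _ = self.horizontal(second_map)   # (H, W, S'), the third feature map
        return third_map
```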
According to some embodiments, the 2D RNN may operate in a left to right manner and then in a top to bottom manner. Although information from the top pixel may be propagated to the bottom pixel, there is an inherent asymmetry in the information flow, and thus information propagation cannot occur in the opposite direction, i.e., from bottom to top using the current example. Similarly, while information may flow from left to right, there is no mechanism to facilitate information flow from right to left. Alternatively, the 2D RNN may operate from right to left and/or from bottom to top. In any event, the particular direction in which the RNN operates limits the direction in which information flows. This limits the ability of the network to develop accurate inferences because look-ahead may be required to accurately classify the current pixel. That is, current reasoning may require "future" information about the direction of network operation.
One possible solution to this problem would be to run the 2D RNN in two directions, e.g. bottom-up, top-down, right-to-left and left-to-right. However, this approach incurs additional computational costs.
Instead, according to one embodiment, an additional branch is introduced into the network (described below) to perform global lookups, thereby enabling look-ahead and allowing "future" features to be considered. Thus, at 116, a determination is made as to whether a global lookup is to be performed. According to one embodiment, the global lookup may be performed based on a predetermined cadence (number of steps) of the 2D RNN. If a global lookup does not need to be performed (the "NO" branch of 116), flow continues at 122.
If a global lookup is to be performed ("yes" branch of 116), flow continues to 118 and the state of the 2D RNN is updated with the global lookup. Techniques for performing global lookups are described below with respect to fig. 2b and related discussion.
At 122, the third feature map is processed by the second convolutional neural network to generate a class prediction for each pixel in the current tile. Flow then continues to 106, where a determination is made as to whether all tiles have been processed.
FIG. 2a is a block diagram of a form extraction network according to an embodiment of the present disclosure. The form extraction network 200 includes a first branch 222(a), a second branch 222(b), an optimizer 220, and a global lookup block 216. The first branch 222(a) further includes a tile extraction block 204, a convolutional network 222, a 2D RNN 208, a classifier 236, a softmax block 218, and a classification loss block 210. The 2D RNN 208 further includes a vertical RNN 206(a) and a horizontal RNN 206(b). The second branch 222(b) further includes an auto-encoder block 210 and a reconstruction loss block 214. The auto-encoder block 210 further includes an encoder 208(a) and a decoder 208(b).
It should be appreciated that FIG. 2a depicts a high-level view of the form extraction network 200. According to various embodiments, the form extraction network 200 is associated with an underlying model architecture (not shown in FIG. 2a) that includes a set of artificial neural network layers. Each layer may consist of a collection of nodes or units that implement artificial neurons. The arrangement of layers and the interconnection of nodes between layers form the architectural model of the form extraction network 200. Each interconnection between two neurons may be associated with a weight that may be learned during a learning or training phase (described below). Each neuron may also be associated with a bias term, which may also be learned during training.
Each artificial neuron may receive a set of signals from the other artificial neurons to which it is connected. Typically, a neuron forms a weighted sum of these signals, using the weight associated with each interconnection, plus a bias term associated with the artificial neuron, to generate a scalar value. Each artificial neuron may also be associated with an activation function, which is typically a nonlinear univariate function with a smooth derivative. The activation function may then be applied to the scalar value to generate an output value comprising the output signal of the artificial neuron, which may then be provided to the other artificial neurons to which it is connected.
It will be further appreciated that the form extraction network 200 is used in at least two different phases: (1) a learning or training phase, and (2) an inference phase. As previously described, during the training phase, a set of weights associated with each interconnection between two artificial neurons and a bias term associated with each artificial neuron are calculated. In general, the training phase may utilize training and validation sets that include training and validation examples. One or more loss functions may be associated with various outputs of the form extraction network; these represent distance metrics between the target output values associated with the respective training examples and the actual calculated output values. Typical loss functions include the cross entropy classification loss. An optimization algorithm is then applied to the form extraction network 200 to generate an optimal set of weights and biases for the provided training and validation sets. The optimization algorithm may be a gradient descent variant, such as stochastic gradient descent. Typically, during the training phase, a backpropagation algorithm is applied to learn the weights of all artificial neurons in the network.
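A minimal sketch of one such training step, assuming per-pixel cross-entropy classification loss and an SGD optimizer (the learning rate and the name `model` are assumptions, not values from the disclosure):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, tile, target_labels):
    """One optimization step: forward pass over a tile, cross-entropy loss over
    per-pixel class scores, backpropagation, and an SGD parameter update."""
    logits = model(tile)                                # (num_classes, H, W) per-pixel scores
    loss = F.cross_entropy(logits.unsqueeze(0),         # add a batch dimension
                           target_labels.unsqueeze(0))  # (1, H, W) integer class ids
    optimizer.zero_grad()
    loss.backward()                                     # backpropagation
    optimizer.step()                                    # SGD update
    return loss.item()

# optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # illustrative setup only
```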
Once the form extraction network 200 has been trained, it can be used in the inference phase. In the inference phase, actual real-world inputs, including actual form document images, are provided to the form extraction network 200 to generate a classification of form elements. The inference phase uses the weights and biases learned during the training phase.
As shown in FIG. 2a, a high resolution document image 202 is received by the first and second branches [222(a) and 222(b)] of the form extraction network 200. As will be appreciated, the high resolution document image 202 may comprise a pixel map corresponding to a digital image of the document. The pixel map may, for example, represent the gray scale intensities associated with each of a plurality of spatial points of the image. According to one embodiment, each pixel may encode a gray scale intensity value. According to alternative embodiments, each pixel may encode a color value comprising red, green, and blue intensity values, which may be represented as channels in the context of DNNs.
The processing performed by the first branch 222(a) of the form extraction network 200 will now be described. The segmentation block 204 receives the high resolution document image 202 and segments the high resolution document image 202 into tiles 224(1) through 224(N). Each tile 224(1) through 224(N) may be a subset of the high resolution document image 202 and thus comprise a pixel map of a respective area of the high resolution image 202. According to one embodiment, the segmentation of the high resolution document image into tiles 224(1) through 224(N) may be performed as a batch processing step or may be performed in a pipelined fashion as each tile is processed by the first branch 222(a). According to one embodiment, overlapping tiles having dimensions of 227 pixels by 227 pixels are generated from the high resolution document image 202. However, other dimensions are possible.
According to one embodiment, each tile 224(1) through 224(N) is processed separately by the convolutional network 222 to generate a feature map 226(a). According to one embodiment, the feature map 226(a) is a tensor of dimensions H × W × C. The convolutional network 222 may comprise a convolutional neural network that operates in a translation- and rotation-invariant manner to process a multi-dimensional array of input pixels and generate the feature map 226(a) (also a multi-dimensional array). The feature map 226(a) may be referred to as a tensor, although the term is not used here with its formal mathematical meaning; rather, it will be appreciated that the feature map 226(a) is a multi-dimensional array of dimension at least 2. Example embodiments of the feature map 226(a) and illustrative dimensions are discussed below.
According to one embodiment of the present disclosure, a convolutional network may exhibit the following architecture:
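The layer table itself is not reproduced in this text. Purely as a hedged stand-in (the layer count, channel widths, and kernel sizes below are assumptions, not the disclosed values), a fully convolutional feature extractor of this kind might look like:

```python
import torch.nn as nn

# Illustrative only: stride-1, padded convolutions and no pooling, so the output
# feature map 226(a) keeps one feature vector per input pixel of the tile.
conv_net = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
)   # a 1 x H x W grayscale tile in, a 64 x H x W feature map (H x W x C) out
```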
According to embodiments described in this disclosure, the convolutional network 222 does not employ any reduction elements or layers, such as max pooling layers. In this way, for each pixel in a given tile 224(1) through 224(N), there is a corresponding feature in the feature map.
The first feature map 226(a) is then processed by the 2D RNN 208. As can be appreciated, the 2D RNN 208 maintains state information such that it can process input sequences using saved state information. Because the 2D RNN may utilize saved state information generated during processing of a previous tile, the 2D RNN 208 may utilize historical information from previously processed tiles 224(1) through 224(N) during processing of the current tile.
As previously described, the 2D RNN 208 may further include a vertical RNN 206(a) and a horizontal RNN 206(b). According to one embodiment, the vertical RNN 206(a) and the horizontal RNN 206(b) may be internally identical. However, the vertical RNN 206(a) may be configured to process the rows of the first feature map 226(a), while the horizontal RNN 206(b) may be configured to process the columns, each in a particular order. According to one embodiment, the feature map 226(a) is processed by the vertical RNN 206(a) to generate a second feature map 226(b), which may also be understood as a multi-dimensional array. According to one embodiment, as described below, the vertical RNN 206(a) may itself include a set of RNNs such that each RNN may process a column of the first feature map 226(a) independently and in parallel. According to one embodiment, each RNN making up the vertical RNN 206(a) may be an LSTM ("long short-term memory") network.
The second feature map 226(b) is then processed by the horizontal RNN 206(b) to generate the feature map 226(c). Similar to the vertical RNN 206(a), the horizontal RNN 206(b) may include a set of RNNs, which may thus process each row of the second feature map 226(b) independently and in parallel. Also, similar to the vertical RNN 206(a), each of the RNNs making up the horizontal RNN 206(b) may be an LSTM network.
The feature map 226(c) is then processed by the classifier 236 to generate class predictions for each pixel in the current tile. For each pixel of the tile [i.e., of tiles 224(1) through 224(N)], the classifier generates an associated vector whose components indicate the pixel's relevance to particular document element classes. For example, according to one embodiment, the document element classes include text fields, tables, text input fields, and the like. That is, each component of the vector may indicate a degree of correlation that a given pixel belongs to a particular class. According to one embodiment, the classifier 236 is a 1×1 convolutional network.
The output of the classifier (not shown in FIG. 2a) is then processed by the softmax block 218. The concept of the softmax function is well understood in the art of machine learning and deep neural networks and is not discussed in detail herein. For purposes of this discussion, it is sufficient to understand that the softmax block 218 normalizes the vector, in which each component represents a particular class, so that the components sum to one. In this way, the output of the softmax represents a probability distribution.
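As a hedged illustration of the classifier-plus-softmax stage (the channel count of the incoming feature map and the three-class label set are assumptions):

```python
import torch
import torch.nn as nn

num_classes = 3                       # e.g. background, text, widget (assumed label set)
classifier = nn.Conv2d(in_channels=64, out_channels=num_classes, kernel_size=1)  # 1x1 conv

def classify_pixels(feature_map_c: torch.Tensor) -> torch.Tensor:
    """feature_map_c: (64, H, W) tile feature map from the 2D RNN branch.
    Returns per-pixel class probabilities that sum to one at every pixel."""
    logits = classifier(feature_map_c.unsqueeze(0))    # (1, num_classes, H, W)
    return torch.softmax(logits, dim=1).squeeze(0)     # (num_classes, H, W)
```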
The softmax block 218 generates a normalized classifier vector (not shown in fig. 2). The classification loss block 210 uses the loss function to process the output of the softmax block 218. According to one embodiment, the classification loss block 210 may utilize a cross entropy loss function. The classification loss block 210 may generate a loss metric value (not shown in fig. 2) that represents the performance of the form extraction network 200 in successfully classifying a given training element.
The optimizer 220 is utilized during the training phase of the form extraction network 200. In particular, the optimizer 220 receives loss metric values from the classification loss block 210 and iteratively utilizes them during the training phase to improve the weights and biases of the form extraction network 200. According to one embodiment, the optimizer 220 may use stochastic gradient descent ("SGD") or any other optimization method. In addition, the optimizer 220 may employ a backpropagation algorithm to improve the weights and biases of the artificial neurons comprising the form extraction network.
The processing performed by the second branch 222(b) of the form extraction network 200 will now be described. As shown in FIG. 2a, the high resolution document image 202 is received by a downsampler 228, which generates the scaled image 212. It will be understood that the scaled image 212 is a lower dimensional representation of the high resolution document image 202. The scaled image 212 is then processed by the auto-encoder 210. According to embodiments described in this disclosure, in a first stage the auto-encoder 210 processes the scaled image using the encoder 208(a) to generate a feature map 226(d), which may be a lower dimensional representation of the scaled image 212, commonly referred to as a latent space. The encoder 208(a) effectively maps the higher-dimensional input of the scaled image 212 to the feature map 226(d) via a bottleneck layer. In a second stage, the auto-encoder uses the decoder 208(b) to map the latent-space representation [i.e., feature map 226(d)] back to the higher dimensional space associated with the scaled image 212, generating the reconstructed scaled image 222.
In particular, in the first stage, the encoder 208 (a) generates a feature map 226 (d) that is provided to the decoder 208 (b). According to one embodiment, encoder 208 (a) may utilize the following architecture.
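As with the convolutional network above, the encoder's layer table is not reproduced here; a hedged sketch of a strided convolutional encoder (all sizes assumed) is:

```python
import torch.nn as nn

# Illustrative encoder 208(a): strided convolutions compress the scaled image
# into a compact latent feature map 226(d); channel widths are assumptions.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
```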
However, other architectures are possible.
According to one embodiment, decoder 208 (b) may utilize the following architecture:
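Again only as a hedged sketch (all sizes assumed), a matching transposed-convolution decoder is:

```python
import torch.nn as nn

# Illustrative decoder 208(b): transposed convolutions expand the latent feature
# map 226(d) back to the resolution of the scaled image, producing 222.
decoder = nn.Sequential(
    nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
)
```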
However, other architectures are possible.
The reconstruction loss block 214 is utilized in conjunction with the optimizer (previously described) during the training phase to determine the weights and biases associated with the second branch 222(b) of the form extraction network 200. According to one embodiment, the reconstruction loss block 214 may utilize, for example, an L2 (squared error) loss to calculate the loss between the scaled image 212 and the reconstructed scaled image 222 generated by the auto-encoder 210. Any other loss function may be used, such as an L1 loss. In particular, the reconstruction loss block 214 may generate a scalar output characterizing the reconstruction loss, which is provided to the optimizer 220. As previously described, the optimizer 220 may utilize a backpropagation algorithm in conjunction with an optimization algorithm such as SGD to generate the weights and biases of the form extraction network 200 during the training phase.
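A brief sketch of the reconstruction loss, assuming mean squared error as the L2 form and an assumed equal weighting when combining it with the classification loss:

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(scaled_image: torch.Tensor, reconstructed: torch.Tensor) -> torch.Tensor:
    """L2 (squared error) loss between the scaled image 212 and the
    auto-encoder's reconstruction 222."""
    return F.mse_loss(reconstructed, scaled_image)

# total = classification_loss + reconstruction_loss(scaled, recon)  # weighting assumed
```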
As previously described, because the 2D RNN 208 runs in a particular direction (e.g., top-to-bottom and left-to-right), "future" features (in terms of the direction in which the 2D RNN runs) are not available during processing of any given tile unless the 2D RNN 208 is also run in the opposite direction. To avoid the computational inefficiency of running the 2D RNN in both directions, according to embodiments of the present disclosure, the global lookup function is implemented via the global lookup block 216, which allows the 2D RNN 208 to perform look-ahead and thereby consider "future" information from tiles that have not yet been processed by the 2D RNN.
According to one embodiment, to determine the "future" information, a mapping between features in the scaled image 212 and the high resolution tiles 224(1) through 224(N) is generated. This mapping is referred to herein as a global lookup and is performed by the global lookup block 216. According to embodiments of the present disclosure, learning the mapping needed to perform the global lookup is a task solved by the form extraction network 200, and in particular by the global lookup block 216.
In particular, after a limited number of steps, the horizontal RNN 206(b) may attempt to generate an approximately Gaussian or pseudo-Gaussian mask that is multiplied with the feature map 226(d) output by the encoder. According to one embodiment, the limited number of steps is 16, but any other value is possible. The Gaussian or pseudo-Gaussian mask is referred to as an attention map and is generated based on the feature map 226(c) output by the horizontal RNN 206(b). According to one embodiment, the mask is normalized in the manner of a softmax, and thus effectively represents a probability distribution. By calculating an expected value under this probability distribution, an expected feature can be determined. The expected feature is used by the RNN in making its predictions. This is repeated at the periodic step interval of the horizontal RNN 206(b). The global lookup block 216 determines the mask, or attention map, in the manner described below.
More specifically, according to one embodiment, global lookup block 216 receives feature map 226 (c) (the output of horizontal RNN 206 (b)) and generates N simultaneous attention maps (not shown in fig. 2 a) based on feature map 226 (c).
The meaning of an attention map will be understood by a skilled practitioner. The attention mechanism is implemented via dynamic mask generation by each RNN (depending on the current position in the high resolution tile), which is used to identify spatial positions in the global tensor representation. In addition, the global lookup block 216 receives the feature map 226(d) (the output of the encoder 208(a)). Using the N simultaneous attention maps and the feature map 226(d), the global lookup block 216 generates state modification information 252 that is used to modify the state information of the 2D RNN 208. More details of how the state modification information is generated are described below with respect to FIG. 2b.
By modifying the state of the 2D RNN 208, the global lookup block effectively causes the 2D RNN 208 to perform look-ahead, and thus to consider "future" information from tiles that it has not yet "seen". As previously mentioned, "future" information refers to information that is otherwise unavailable due to the direction in which the 2D RNN 208 operates. For example, if the 2D RNN 208 operates from left to right and top to bottom, the "future" information would be right-to-left and/or bottom-to-top data. Further details regarding the generation of the state modification information are described below with respect to FIG. 2b.
According to one embodiment, global lookup block 216 performs a global lookup operation using the output of horizontal RNN 206 (b) (feature map 226 (c)). However, according to other embodiments, the global lookup block 216 may perform a global lookup using output generated by the vertical RNN 206 (a) or both the horizontal 206 (b) and vertical RNN 206 (a).
FIG. 2b is a detailed block diagram of the global lookup block 216 according to an embodiment of the present disclosure. As shown in FIG. 2b, the global lookup block 216 may further include an attention generation network 230, an average context vector calculation block 232, and a feedback network 234. The output of the horizontal RNN [feature map 226(c)] is provided to the attention generation network 230. The attention generation network 230 processes the feature map 226(c) to generate one or more attention maps (denoted by p), each of which is provided to the average context vector calculation block 232. The attention generation network 230 may include a DNN with multiple layers and may utilize the following architecture:
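The architecture table for the attention generation network is not reproduced in this text. As a hedged sketch only (a single linear projection per map is an assumption), a generator that maps an RNN state to a spatially normalized attention map over the global feature map z might look like:

```python
import torch
import torch.nn as nn

class AttentionMapGenerator(nn.Module):
    """Map one RNN state vector to one attention map over the global encoder
    feature map z of spatial size h_g x w_g; N such maps can be produced by
    N generators (or a single projection of size N * h_g * w_g)."""
    def __init__(self, state_size: int, h_g: int, w_g: int):
        super().__init__()
        self.h_g, self.w_g = h_g, w_g
        self.proj = nn.Linear(state_size, h_g * w_g)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        scores = self.proj(state)              # (h_g * w_g,)
        p = torch.softmax(scores, dim=-1)      # normalized like a softmax output,
        return p.view(self.h_g, self.w_g)      # so p is a spatial probability map
```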
the encoder output represented by z [ feature map 226 (d) ] is also provided to the average context vector calculation block 232. According to one embodiment, the encoder output [ feature map 226 (d) ] z is a tensor of dimension H W C, where C indicates the number of channels. On the other hand, each attention graph generated by the network 230 may be a tensor of dimension h×w×1.
For each attention map, the average context vector calculation block 232 calculates an average context vector E(z) according to: E(z) = Σ_ij p_ij · z_ij. This yields N context vectors E(z), each of dimension 1 × C. Each of the N E(z) vectors is provided to the feedback network 234, which generates the state modification information 252; the state modification information 252 is provided to the 2D RNN 208 to modify the state information associated with the 2D RNN 208. According to embodiments described herein, the feedback network 234 may include an RNN and may utilize the following architecture:
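The feedback network's architecture table is likewise not reproduced; the following hedged sketch shows the E(z) computation and a stand-in recurrent feedback step (the use of a GRU cell here is an assumption, not the disclosed design):

```python
import torch
import torch.nn as nn

def expected_context(p: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """E(z) = sum over i,j of p_ij * z_ij: average the global feature map z
    (H x W x C) under the attention distribution p (H x W), giving a 1 x C vector."""
    return torch.einsum('hw,hwc->c', p, z).unsqueeze(0)

class FeedbackNetwork(nn.Module):
    """Fold a context vector into state modification information for the 2D RNN."""
    def __init__(self, c: int, state_size: int):
        super().__init__()
        self.cell = nn.GRUCell(input_size=c, hidden_size=state_size)  # assumed stand-in

    def forward(self, context: torch.Tensor, rnn_state: torch.Tensor) -> torch.Tensor:
        # context: (1, C) expected feature; rnn_state: (1, state_size) current 2D RNN state
        return self.cell(context, rnn_state)  # modified state information
```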
fig. 2c is a flow chart of a global lookup process according to an embodiment of the present disclosure. The process depicted in fig. 2c may be performed by the global lookup block 216 previously described with respect to fig. 2 b. The process begins at 240. At 250, a determination is made as to whether a global lookup needs to be performed. According to embodiments described herein, the global lookup may be performed repeatedly after a limited number of steps (e.g., after a limited number of tiles are processed). According to one embodiment, a global lookup is performed every 16 steps. However, any other limited spacing is possible. If it is not time to perform a global lookup ("no" branch of 250), flow continues with 250.
If a global lookup is to be performed ("yes" branch of 250), flow continues to 242. At 242, a map of interest (p) is generated based on the output (p) of the horizontal RNN 206 (b). At 244, an average context vector [ E (z) ] is generated based on the attention map (p) and the encoder output (z). The generation of the average context vector is described above with respect to fig. 2 b. At 246, the average context vector is processed via the feedback network 234 to generate state modification information 252. At 248, state vector information associated with the 2d RNN 208 is modified based on the state modification information 252. Flow then continues to 250 where a determination is made as to whether a global lookup should be performed at 250.
FIG. 3a depicts 2D RNN processing of a portion of a high resolution image that has been segmented into tile sets according to one embodiment of the present invention. Fig. 3a shows feature maps 226 (a) (1) through 226 (a) (16) corresponding to each output of the convolutional network 222 for each respective tile 224 (1) through 224 (N). For purposes of this discussion, the feature maps 226 (a) (1) through 226 (a) (16) are represented as tiles in FIG. 3a because there is a one-to-one correspondence between tiles 224 (1) through 224 (N) of the high resolution document image 202 and the feature maps 226 (a) (1) through 226 (a) (N). That is, each feature map 226 (a) (1) through 226 (a) (N) represents a respective output of the convolutional network 222 for the respective tile 224 (1) through 224 (N). Although fig. 3a only shows feature maps 226 (a) (1) through 226 (a) (16), it will be appreciated that these feature maps correspond to only a portion of tiles 224 (1) through 224 (N), and in fact, the high resolution document image 202 may be segmented into a smaller or greater number of tiles, in which case the number of feature maps 226 shown in fig. 3a will be greater or smaller, and will correspond exactly to the number of segmented tiles of the high resolution document image 202.
FIG. 3a also shows horizontal RNN initial state vectors 308 (1) through 308 (4), vertical RNN initial state vectors 310 (1) through 310 (4), vertical inter-tile RNN state vectors 312 (1) through 312 (16), and horizontal inter-tile RNN state vectors 314 (1) through 314 (16).
For the purposes of this discussion, the processing of a particular feature map [e.g., 226(a)(1)] will be described. It will be appreciated that the processing of the other feature maps, such as 226(a)(2) through 226(a)(16), proceeds in a similar and analogous manner. Accordingly, all of the discussion regarding the feature map 226(a)(1) and its associated processing also applies to the feature maps 226(a)(2) through 226(a)(16). According to one embodiment, each feature map 226(a)(1) has tensor dimensions H × W × C, where H corresponds to the height, W to the width, and C to the number of channels of the feature map 226(a). For the purposes of this example, assume H = W = N. According to one embodiment, N = 227. However, N may take any value.
As previously described, according to some embodiments, the vertical RNNs 206 (a) may be associated with a set of RNNs (not shown). During processing of each feature map 226 (a) (1), the vertical RNN sets associated with the vertical RNNs 206 (a) may act in parallel to process each column of the feature map 226 (a) (1). According to alternative embodiments, the vertical RNN 206 (a) is associated with a single RNN, in which case each row of the feature map 226 (a) (1) may be processed one after the other. Each RNN associated with the vertical RNN 206 (a) is assumed to have a respective state size S.
As previously described with respect to FIG. 2a, the vertical RNN 206(a) processes the feature map 226(a)(1) to generate the feature map 226(b) (not shown in FIG. 3a).
According to one embodiment, the vertical RNN 206(a) processes each row of the feature map 226(a)(1) and emits a state vector of size W × S. In other words, a state vector having tensor dimensions W × S is generated for each row of the feature map 226(a)(1). In particular, according to one embodiment, at each step the vertical RNN 206(a) processes all C channels present at that location in the H × W × C feature map. Thus, over all rows in the feature map 226(a)(1), the vertical RNN 206(a) generates the feature map 226(b) (not shown in FIG. 3a) with tensor dimensions H × W × S.
The last row of the feature map 226(b) is then utilized to generate the vertical inter-tile state vector 312(1), which will be used when processing the feature map 226(a)(5) corresponding to the tile below.
The horizontal RNN 206(b) then processes the feature map 226(b) to generate the feature map 226(c) (not shown in FIG. 3a). Similar to the vertical RNN 206(a), according to some embodiments the horizontal RNN 206(b) may be associated with a set of RNNs (not shown). During processing of each feature map 226(b), the RNN set associated with the horizontal RNN 206(b) may operate in parallel to process each row of the feature map 226(b). According to alternative embodiments, the horizontal RNN 206(b) is associated with a single RNN, in which case each column of the feature map 226(b) may be processed one after another. Assume that each RNN associated with the horizontal RNN 206(b) has a corresponding state size S'.
According to one embodiment, each RNN associated with the horizontal RNN 206 (b) processes the feature map 226 (b) column by column and emits a state vector of size H × S'. In other words, a state vector with tensor dimensions H × S' is generated for each column of the feature map 226 (b). Thus, over all columns in the feature map 226 (b), the horizontal RNN 206 (b) generates a feature map 226 (c) (not shown in fig. 3 a) with tensor dimensions H × W × S'.
The last column of the feature map 226 (c) is then utilized to generate a horizontal inter-tile state vector 314 (1), which will be used in processing the feature map 226 (a) (2) corresponding to the tile to the right.
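A companion sketch for the horizontal pass, under the same assumptions as the vertical sketch above: the feature map is scanned column by column with a per-row state of size S', and the state after the last column stands in for the horizontal inter-tile state vector 314 (1) passed to the tile on the right.

```python
# Illustrative horizontal RNN scan; mirrors the vertical sketch.
import numpy as np

def horizontal_rnn(fmap, Wx, Wh, b, h0=None):
    """fmap: (H, W, C); Wx: (C, S'); Wh: (S', S'); b: (S',).
    Returns the (H, W, S') output and the final (H, S') state."""
    H, W, C = fmap.shape
    S = Wx.shape[1]
    h = np.zeros((H, S)) if h0 is None else h0    # one state vector per row
    out = np.empty((H, W, S))
    for c in range(W):                            # step across the columns
        h = np.tanh(fmap[:, c, :] @ Wx + h @ Wh + b)
        out[:, c, :] = h
    return out, h                                 # h: horizontal inter-tile state
```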
Fig. 3b depicts an architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure. As shown in fig. 3b, the feature map 226 (a) is processed by the vertical RNN 206 (a). The output of the vertical RNN 206 (a) (not shown in fig. 3 b) is then processed by the horizontal RNN 206 (b).
Fig. 3c depicts an alternative architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure. FIG. 3c is similar to FIG. 3b, but adds a concatenation layer 316, which receives input from the feature map 226 (a) via the skip connection 218 and from the vertical RNN 206 (a). The output of the concatenation layer 316 (not shown in fig. 3 c) is then provided to the horizontal RNN 206 (b). The embodiment depicted in fig. 3c allows for potentially higher accuracy because it combines lower level features [ i.e., the feature map 226 (a) ] with higher level features (i.e., the output of the vertical RNN 206 (a)) for processing by the horizontal RNN 206 (b).
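For the FIG. 3c variant, a minimal sketch of the concatenation step might look as follows, reusing the shapes assumed in the earlier sketches. The only downstream change it implies is that the horizontal RNN's input projection must accept C + S channels rather than S.

```python
# Illustrative skip-connection concatenation before the horizontal pass.
import numpy as np

def concat_skip(cnn_fmap, vert_out):
    """cnn_fmap: (H, W, C) from the convolutional network; vert_out: (H, W, S)
    from the vertical RNN -> (H, W, C + S) input for the horizontal RNN."""
    return np.concatenate([cnn_fmap, vert_out], axis=-1)
```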
FIG. 3d depicts a single threaded processing sequence for a vertical RNN according to embodiments of the present disclosure. Each box shown in fig. 3d may represent a single element of the feature map 226 (a). As shown in fig. 3d, the columns are processed one at a time, with the rows of each column processed sequentially [ e.g., 320 (1) to 320 (4), 320 (5) to 320 (8), 320 (9) to 320 (12), 320 (13) to 320 (16) ].
Fig. 3e depicts a multi-threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure. As shown in fig. 3e, the elements of each row are processed in parallel by multiple threads, with each thread associated with a respective column. That is, for example, each element 320 (1) in the first row is processed by a separate thread (not shown in FIG. 3 e). Once the elements in the first row have been processed, the elements in the second row [ i.e., 320 (2) ] are processed by their associated threads.
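The two processing orders can be contrasted with a small illustrative sketch (same assumed tanh cell as the vertical sketch above): the single-threaded order of FIG. 3d finishes one column top to bottom before starting the next, whereas the vectorized loop shown earlier advances every column by one row per step, as in FIG. 3e. Because the state is kept per column, both orders visit the same cells and yield the same output.

```python
# Single-threaded order (FIG. 3d): column after column, rows within a column.
import numpy as np

def vertical_rnn_single_thread(fmap, Wx, Wh, b):
    H, W, C = fmap.shape
    S = Wx.shape[1]
    out = np.empty((H, W, S))
    for c in range(W):                      # finish one column...
        h = np.zeros(S)
        for r in range(H):                  # ...top to bottom, then move on
            h = np.tanh(fmap[r, c] @ Wx + h @ Wh + b)
            out[r, c] = h
    return out
```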
Fig. 4 depicts an input image and an output image processed by the form extraction network according to an embodiment of the present disclosure. As shown in fig. 4, the final output is a set of labeled pixels of the image; that is, the output of the network is a label for each pixel. The example depicted in fig. 4 shows a simplified scenario in which only three labels are detected: background, text, and widgets. Green indicates text. Yellow indicates widgets into which data is to be entered. As an example, in fig. 4, 401 indicates text and 402 indicates widgets. Although fig. 4 depicts only two detected feature types (text and widgets), it will be understood that any number of features may be detected by the form extraction network 200.
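A toy sketch of the per-pixel labeling shown in FIG. 4 follows; it assumes a simple linear classification head, a softmax, and an argmax over three classes. The class names, head shape, and random weights are illustrative only and are not the classification block of network 200.

```python
# Illustrative per-pixel classification into background / text / widget.
import numpy as np

CLASSES = ["background", "text", "widget"]

def label_pixels(features, Wcls, bcls):
    """features: (H, W, S') -> integer label map of shape (H, W)."""
    logits = features @ Wcls + bcls                   # (H, W, 3)
    logits -= logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)        # softmax over classes
    return probs.argmax(axis=-1)                      # one class index per pixel

rng = np.random.default_rng(1)
feats = rng.normal(size=(227, 227, 32))
labels = label_pixels(feats, rng.normal(size=(32, 3)), np.zeros(3))
print(labels.shape, [CLASSES[i] for i in np.unique(labels)])
```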
Fig. 5 depicts an input image and an output image that have been processed by the form extraction network according to an embodiment of the present disclosure. Similar to fig. 4, in fig. 5, 401 indicates text and 402 indicates widgets.
Fig. 6a illustrates an example computing system executing the form extraction network 200 according to various embodiments of the disclosure. As shown in fig. 6a, computing device 600 includes CPU/GPU 612, training subsystem 622, and test/inference subsystem 624. Training subsystem 622 and test/inference subsystem 624 may be understood as programming structures for performing training and testing of the form extraction network 200. In particular, CPU/GPU 612 may be further configured via programmed instructions to perform training and/or testing of the form extraction network 200 (as variously described herein, such as with respect to fig. 3-4). Other components and modules typical of computing systems are not shown but will be apparent, such as, for example, coprocessors, processing cores, graphics processing units, mice, touch pads, touch screens, displays, and the like. Many variations of computing environments will be apparent in light of this disclosure. For example, the item store 106 may be external to the computing device 600. Computing device 600 may be any standalone computing platform, such as a desktop or workstation computer, a laptop computer, a tablet computer, a smart phone or personal digital assistant, a gaming machine, a set-top box, or other suitable computing platform.
Training subsystem 622 also includes document image training/validation data store 610 (a), which stores training and validation document images. Training algorithm 616 represents program instructions for performing training of the form extraction network 200 as described herein. As shown in fig. 6a, training algorithm 616 receives training and validation document form images from the training/validation data store 610 (a) and generates optimal weights and biases, which are then stored in the weight/bias data store 610 (b). As previously described, training may utilize a back-propagation algorithm with gradient descent or some other optimization method.
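The schematic training step below is a stand-in, not form extraction network 200 itself: it illustrates per-pixel cross-entropy optimized by back-propagation and gradient descent, as described above, using a tiny convolutional classifier and synthetic data as assumptions.

```python
# Illustrative training loop: per-pixel cross-entropy + SGD (PyTorch).
import torch
import torch.nn as nn

model = nn.Sequential(                        # stand-in for the real network
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(16, 3, kernel_size=1),          # 3 classes: background/text/widget
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(4, 3, 227, 227)          # mock training tiles
targets = torch.randint(0, 3, (4, 227, 227))  # mock per-pixel labels

for step in range(10):
    optimizer.zero_grad()
    loss = criterion(model(images), targets)  # per-pixel cross-entropy
    loss.backward()                           # back-propagation
    optimizer.step()                          # gradient descent update

torch.save(model.state_dict(), "weights_biases.pt")  # analogous to store 610 (b)
```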
The test/inference subsystem 624 also includes a test/inference algorithm 626 that utilizes the form extraction network 200 and the optimal weights/biases generated by the training subsystem 622. CPU/GPU 612 may then execute test/inference algorithm 626 based on the model architecture and the generated weights and biases previously described. In particular, the test/inference subsystem 624 may receive a test document image 614 and use the network 200 to generate a classified document image 620 from the test document image 614.
FIG. 6b illustrates an example integration of the form extraction network 200 into a network environment according to one embodiment of this disclosure. As shown in fig. 6b, computing device 600 may be collocated in a cloud environment, a data center, a local area network ("LAN"), or the like. The structure of computing device 600 of fig. 6b is the same as the example embodiment described with respect to fig. 6 a. In this case, for example, computing device 600 may be a server or a cluster of servers. As shown in fig. 6b, client 630 interacts with computing device 600 via network 632. In particular, the client 630 may make requests and receive responses via API calls received at the API server 628, which are transmitted via the network 632 and the network interface 626.
It will be further readily appreciated that the network 632 may comprise any type of public and/or private network, including the Internet, a LAN, a WAN, or some combination of such networks. In this example case, computing device 600 is a server computer and client 630 may be any typical personal computing platform.
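Purely as an illustration of this client/server interaction, a client such as client 630 might upload a scanned form over HTTP and receive the classification result; the endpoint path, payload, and response fields below are assumptions, not part of the disclosure.

```python
# Hypothetical client-side API call; the URL and fields are illustrative only.
import requests

with open("scanned_form.png", "rb") as f:
    resp = requests.post(
        "https://api.example.com/v1/form-extraction",  # illustrative endpoint
        files={"document": f},
        timeout=60,
    )
resp.raise_for_status()
print(resp.json())  # e.g. a per-pixel label map or an editable-form description
```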
As will be further appreciated, computing device 600 (whether the computing device shown in fig. 6a or 6 b) includes and/or otherwise has access to one or more non-transitory computer-readable media or storage devices having encoded thereon one or more computer-executable instructions or software for implementing the techniques variously described in this disclosure. The storage devices may include any number of durable storage devices (e.g., any electronic, optical, and/or magnetic storage device, including RAM, ROM, flash memory, USB drives, on-board CPU cache, hard drives, server storage, magnetic tape, CD-ROM, or other physical computer-readable storage media) for storing data and computer-readable instructions and/or software that implement the various embodiments provided herein. Any combination of memories may be used, and the various storage components may be located in a single computing device or distributed across multiple computing devices. In addition, as previously described, one or more storage devices may be provided separately or remotely from one or more computing devices. Many configurations are possible.
Further exemplary embodiments
The following examples relate to further embodiments, from which many variations and configurations will become apparent.
Example 1 is a method for extracting a structure from an image of a document, the method comprising: receiving a high resolution image of the document, the high resolution image comprising a plurality of pixels; generating a plurality of tiles from the image, each of the tiles comprising a subset of pixels from the high resolution image; processing tiles through a neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, the elements including fillable form fields and text content associated with the fillable form fields; and generating an editable digital version of the document using the classified pixels, the editable digital version including the fillable form fields and the text content.
Example 2 includes the subject matter of example 1, wherein processing each tile separately through the neural network includes: for each tile: processing the tiles through a convolutional network to generate a first feature map; processing the first feature map through a 2D recurrent neural network ("RNN") to generate a second feature map; processing the second feature map to generate class predictions for each pixel in the tile; and aggregating each of the respective predictions for each pixel of the high resolution image to generate a global feature map for the document.
Example 3 includes the subject matter of example 2, wherein the 2D RNN further includes a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map, and the horizontal RNN generates the second feature map from the third feature map.
Example 4 includes the subject matter of example 2, and further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises: modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a resolution lower than a resolution of the high resolution image.
Example 5 includes the subject matter of example 4, wherein modifying state information associated with the 2D RNN further comprises: generating an attention map from the second feature map; generating an average context vector using the second feature map and the potential spatial representation; generating state modification information using the average context vector; and modifying state information associated with the 2D RNN using the state modification information.
Example 6 includes the subject matter of example 5, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is generated from the potential spatial representation and p is the attention map.
Example 7 includes the subject matter of example 6, wherein the potential spatial representation is generated by an automatic encoder.
Example 8 is a network for performing extraction and classification of document forms, comprising: a first branch, the first branch further comprising: a segmentation block for segmenting a high resolution document image comprising a plurality of pixels into a plurality of tiles, wherein each tile comprises a subset of pixels of the high resolution document image; a convolutional network for processing each tile to generate a first feature map; a 2D RNN, wherein the 2D RNN processes the first feature map to generate a second feature map; a classification block, wherein the classification block processes the second feature map to generate a classification vector for pixels in a tile; and a softmax block to generate a probability distribution for a pixel in a tile, the probability distribution indicating a probability that the pixel is associated with a document element class; a second branch, the second branch further comprising: an image scaler block, wherein the image scaler block generates a lower resolution document image from the high resolution document image; and an auto encoder, wherein the auto encoder processes the lower resolution document image to generate a potential spatial representation of the lower resolution document image; and a global lookup block, wherein the global lookup block causes the 2D RNN to consider tiles associated with the high resolution document image that are not currently processed by the 2D RNN.
Example 9 includes the subject matter of example 8, wherein the automatic encoder further includes an encoder and a decoder, and the potential spatial representation is generated by the encoder.
Example 10 includes the subject matter of example 9, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN processes tiles along a vertical direction and the horizontal RNN processes tiles along a horizontal direction.
Example 11 includes the subject matter of example 10, wherein the 2D RNN stores state information including vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
Example 12 includes the subject matter of example 11, wherein the global lookup block utilizes the potential spatial representation and an output of the horizontal RNN to modify the state information of the 2D RNN.
Example 13 includes the subject matter of example 12, wherein the second feature map is processed by the attention generation network to generate an attention map.
Example 14 includes the subject matter of example 13, wherein the attention map and the state information are used in accordance with the relationship E(z) = Σ_ij p_ij z_ij to generate an average context vector, where z is generated from the potential spatial representation and p is the attention map.
Example 15 is a computer program product comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be performed for performing document form extraction and classification from a high resolution image of an input document, the process comprising: generating a high resolution image of the document, the high resolution image comprising a plurality of pixels; generating a plurality of tiles from the high resolution image, each of the tiles comprising a subset of pixels from the high resolution image; for each tile: processing the tiles through a convolutional network to generate a first feature map; processing the first feature map through a 2D recurrent neural network ("RNN") to generate a second feature map; processing the second feature map to generate class predictions for each pixel in the tile; and aggregating, for each pixel of the high resolution image, each of the respective predictions to generate a global feature map for the document.
Example 16 includes the subject matter of example 15, wherein the 2D RNN further includes a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map, and the horizontal RNN generates the second feature map from the third feature map.
Example 17 includes the subject matter of example 15, and further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises: modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a resolution lower than a resolution of the high resolution image.
Example 18 includes the subject matter of example 17, wherein modifying state information associated with the 2D RNN further comprises: generating an attention map from the second feature map; generating an average context vector using the second feature map and the potential spatial representation; generating state modification information using the average context vector; and modifying state information associated with the 2D RNN using the state modification information.
Example 19 includes the subject matter of example 18, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is generated from the potential spatial representation and p is the attention map.
Example 20 includes the subject matter of example 19, wherein the potential spatial representation is generated by an automatic encoder.
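To make the average-context-vector relationship recited in examples 6, 14, and 19 concrete, the following minimal NumPy sketch computes E(z) = Σ_ij p_ij z_ij from an attention map p and a latent spatial grid z (the potential spatial representation). The softmax normalization of the attention scores and all shapes are assumptions for illustration.

```python
# Illustrative average context vector E(z) = sum_ij p_ij * z_ij.
import numpy as np

def average_context_vector(scores, z):
    """scores: (H, W) unnormalized attention; z: (H, W, D) latent grid -> (D,)."""
    p = np.exp(scores - scores.max())
    p /= p.sum()                             # softmax -> attention map p_ij
    return np.einsum("ij,ijd->d", p, z)      # weighted sum over all positions

rng = np.random.default_rng(2)
scores = rng.normal(size=(14, 14))           # attention map over the latent grid
z = rng.normal(size=(14, 14, 128))           # potential (latent) spatial representation
context = average_context_vector(scores, z)  # (128,) vector used to modify RNN state
```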
In some example embodiments of the present disclosure, the various functional modules described herein, particularly for training and/or testing the network 200, may be implemented in software, such as a set of instructions (e.g., HTML, XML, C, C++, Objective-C, JavaScript, Java, BASIC, etc.) encoded on any non-transitory computer-readable medium or computer program product (e.g., hard disk drive, server, optical disk, or other suitable non-transitory memory or memory set) that, when executed by one or more processors, cause the various form extraction methods provided herein to be performed.
In other embodiments, the techniques provided herein are implemented using a software-based engine. In such embodiments, the engine is a functional unit comprising one or more processors programmed or otherwise configured with instructions encoding the form extraction process provided herein. In this way, the software-based engine is a functional circuit.
In other embodiments, the techniques provided herein are implemented in hardware circuitry, such as gate level logic (e.g., an FPGA) or an application-specific semiconductor (e.g., an application specific integrated circuit, or ASIC). Other embodiments are implemented with a microcontroller having a processor, a plurality of input/output ports for receiving and outputting data, and a plurality of embedded routines used by the processor to perform the functions provided herein. It will be apparent that any suitable combination of hardware, software, and firmware may be used in a more general sense. As used herein, a circuit is one or more physical components used to perform a task. For example, the circuitry may be one or more processors programmed or otherwise configured with a software module, or logic-based hardware circuitry that provides a set of outputs in response to a set of input stimuli. Many configurations will be apparent.
The foregoing description of the exemplary embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims appended hereto.

Claims (19)

1. A method for extracting structure from an image of a document, the method comprising:
receiving a high resolution image of the document, the high resolution image comprising a plurality of pixels;
generating a plurality of tiles from the image, each of the tiles comprising a subset of pixels from the high resolution image;
processing each tile individually through a neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, the elements including fillable form fields and text content associated with the fillable form fields,
wherein processing each tile individually through the neural network comprises: for each tile, processing the tile through a convolutional network to generate a first feature map, processing the first feature map through a 2D recurrent neural network, 2D RNN, to generate a second feature map, and processing the second feature map to generate a class prediction for each pixel in the tile; and
aggregating, for each pixel of the high resolution image, each of the class predictions to generate a global feature map for the document; and
an editable digital version of the document is generated using the classified pixels, the editable digital version including the fillable form fields and the text content.
2. The method of claim 1, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map and the horizontal RNN generates the second feature map from the third feature map.
3. The method of claim 1, further comprising periodically performing a global lookup process after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises:
modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a lower resolution than a resolution of the high resolution image, wherein the state information comprises vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
4. The method of claim 3, wherein modifying state information associated with the 2D RNN further comprises:
generating an attention map from the second feature map;
generating an average context vector using the second feature map and the potential spatial representation;
generating state modification information using the average context vector; and
the state modification information is used to modify state information associated with the 2D RNN.
5. The method of claim 4, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is a feature map generated from the potential spatial representation and p is an attention map.
6. The method of claim 5, wherein the potential spatial representation is generated by an automatic encoder.
7. A network for performing extraction and classification of document forms, comprising:
a first branch, the first branch further comprising:
a segmentation block for segmenting a high resolution document image comprising a plurality of pixels into a plurality of tiles, wherein each tile comprises a subset of pixels of the high resolution document image;
a convolutional network for processing each tile to generate a first feature map;
a 2D RNN, wherein the 2D RNN processes the first feature map to generate a second feature map;
a classification block, wherein the classification block processes the second feature map to generate a classification vector for pixels in a tile;
a softmax block to generate a probability distribution for a pixel in a tile, the probability distribution indicating a probability that the pixel is associated with a document element class;
a second branch, the second branch further comprising:
an image scaler block, wherein the image scaler block generates a lower resolution document image from the high resolution document image; and
an auto encoder, wherein the auto encoder processes the lower resolution document image to generate a potential spatial representation of the lower resolution document image; and a global lookup block, wherein the global lookup block causes the 2D RNN to consider tiles associated with the high resolution document image that are not currently processed by the 2D RNN.
8. The network of claim 7, wherein the automatic encoder further comprises an encoder and a decoder, and the potential spatial representation is generated by the encoder.
9. The network of claim 8, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN processes tiles along a vertical direction and the horizontal RNN processes tiles along a horizontal direction.
10. The network of claim 9, wherein the 2D RNN stores state information comprising vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
11. The network of claim 10, wherein the global lookup block utilizes the potential spatial representation and an output of the horizontal RNN to modify the state information of the 2D RNN.
12. The network of claim 11, wherein the second feature map is processed by an attention generation network to generate an attention map.
13. The network of claim 12, wherein the attention map and the potential spatial representation are used according to the relation E(z) = Σ_ij p_ij z_ij to generate an average context vector, where z is a feature map generated from the potential spatial representation and p is an attention map.
14. A non-transitory machine-readable medium encoded with instructions that, when executed by one or more processors, cause a process to be performed for processing a document, the process comprising:
generating a high resolution image of the document, the high resolution image comprising a plurality of pixels;
generating a plurality of tiles from the high resolution image, each of the tiles comprising a subset of pixels from the high resolution image;
processing each tile individually through the neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, wherein for each tile:
processing the tiles through a convolutional network to generate a first feature map,
processing the first feature map by a 2D recurrent neural network 2D RNN to generate a second feature map, and
processing the second feature map to generate class predictions for each pixel in the tile; and
for each pixel of the high resolution image, aggregating each of the class predictions to generate a global feature map for the document.
15. The non-transitory machine readable medium of claim 14, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map and the horizontal RNN generates the second feature map from the third feature map.
16. The non-transitory machine readable medium of claim 14, further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises:
modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a lower resolution than a resolution of the high resolution image, wherein the state information comprises vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
17. The non-transitory machine readable medium of claim 16, wherein modifying state information associated with the 2D RNN further comprises:
generating an attention map from the second feature map;
generating an average context vector using the second feature map and the potential spatial representation;
generating state modification information using the average context vector; and
the state modification information is used to modify state information associated with the 2D RNN.
18. The non-transitory machine readable medium of claim 17, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is a feature map generated from the potential spatial representation and p is an attention map.
19. The non-transitory machine readable medium of claim 18, wherein the potential spatial representation is generated by an automatic encoder.
CN201810483302.7A 2017-08-10 2018-05-18 List structure extraction network Active CN109389027B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/674,100 2017-08-10
US15/674,100 US10268883B2 (en) 2017-08-10 2017-08-10 Form structure extraction network

Publications (2)

Publication Number Publication Date
CN109389027A CN109389027A (en) 2019-02-26
CN109389027B true CN109389027B (en) 2023-11-21

Family

ID=62812163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810483302.7A Active CN109389027B (en) 2017-08-10 2018-05-18 List structure extraction network

Country Status (5)

Country Link
US (1) US10268883B2 (en)
CN (1) CN109389027B (en)
AU (1) AU2018203368B2 (en)
DE (1) DE102018004117A1 (en)
GB (1) GB2565401B (en)

Families Citing this family (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10984315B2 (en) * 2017-04-28 2021-04-20 Microsoft Technology Licensing, Llc Learning-based noise reduction in data produced by a network of sensors, such as one incorporated into loose-fitting clothing worn by a person
EP3540610B1 (en) * 2018-03-13 2024-05-01 Ivalua Sas Standardized form recognition method, associated computer program product, processing and learning systems
US11087177B2 (en) * 2018-09-27 2021-08-10 Salesforce.Com, Inc. Prediction-correction approach to zero shot learning
CA3123317A1 (en) * 2018-12-21 2020-06-25 Sightline Innovation Inc. Systems and methods for computer-implemented data trusts
US11003909B2 (en) * 2019-03-20 2021-05-11 Raytheon Company Neural network trained by homographic augmentation
CN110222752B (en) * 2019-05-28 2021-11-16 北京金山数字娱乐科技有限公司 Image processing method, system, computer device, storage medium and chip
CN110490199A (en) * 2019-08-26 2019-11-22 北京香侬慧语科技有限责任公司 A kind of method, apparatus of text identification, storage medium and electronic equipment
US11570030B2 (en) * 2019-10-11 2023-01-31 University Of South Carolina Method for non-linear distortion immune end-to-end learning with autoencoder—OFDM
US11481605B2 (en) 2019-10-25 2022-10-25 Servicenow Canada Inc. 2D document extractor
JP7293157B2 (en) * 2020-03-17 2023-06-19 株式会社東芝 Image processing device
CN111598844B (en) * 2020-04-24 2024-05-07 理光软件研究所(北京)有限公司 Image segmentation method and device, electronic equipment and readable storage medium
US11657306B2 (en) * 2020-06-17 2023-05-23 Adobe Inc. Form structure extraction by predicting associations
CN111815515B (en) * 2020-07-01 2024-02-09 成都智学易数字科技有限公司 Object three-dimensional drawing method based on medical education
CA3195077A1 (en) * 2020-10-07 2022-04-14 Dante DE NIGRIS Systems and methods for segmenting 3d images
US20220147843A1 (en) * 2020-11-12 2022-05-12 Samsung Electronics Co., Ltd. On-device knowledge extraction from visually rich documents
US11961094B2 (en) * 2020-11-15 2024-04-16 Morgan Stanley Services Group Inc. Fraud detection via automated handwriting clustering
US12056945B2 (en) 2020-11-16 2024-08-06 Kyocera Document Solutions Inc. Method and system for extracting information from a document image
CN112766073B (en) * 2020-12-31 2022-06-10 贝壳找房(北京)科技有限公司 Table extraction method and device, electronic equipment and readable storage medium
CN113435240B (en) * 2021-04-13 2024-06-14 北京易道博识科技有限公司 End-to-end form detection and structure identification method and system
US20230029335A1 (en) * 2021-07-23 2023-01-26 Taiwan Semiconductor Manufacturing Company, Ltd. System and method of convolutional neural network
JP7393509B2 (en) * 2021-11-29 2023-12-06 ネイバー コーポレーション Deep learning-based method and system for extracting structured information from atypical documents

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1673995A (en) * 2004-03-24 2005-09-28 微软公司 Method and apparatus for populating electronic forms from scanned documents
US9298981B1 (en) * 2014-10-08 2016-03-29 Xerox Corporation Categorizer assisted capture of customer documents using a mobile device

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0622863D0 (en) * 2006-11-16 2006-12-27 Ibm Automated generation of form definitions from hard-copy forms
US8566349B2 (en) * 2009-09-28 2013-10-22 Xerox Corporation Handwritten document categorizer and method of training
US8788930B2 (en) * 2012-03-07 2014-07-22 Ricoh Co., Ltd. Automatic identification of fields and labels in forms
US10223344B2 (en) * 2015-01-26 2019-03-05 Adobe Inc. Recognition and population of form fields in an electronic document
US9659213B2 (en) * 2015-07-03 2017-05-23 Cognizant Technology Solutions India Pvt. Ltd. System and method for efficient recognition of handwritten characters in documents

Also Published As

Publication number Publication date
US20190050640A1 (en) 2019-02-14
CN109389027A (en) 2019-02-26
US10268883B2 (en) 2019-04-23
GB2565401A (en) 2019-02-13
DE102018004117A1 (en) 2019-02-14
GB201808406D0 (en) 2018-07-11
GB2565401B (en) 2020-05-27
AU2018203368A1 (en) 2019-02-28
AU2018203368B2 (en) 2021-07-08

Similar Documents

Publication Publication Date Title
CN109389027B (en) List structure extraction network
AU2019200270B2 (en) Concept mask: large-scale segmentation from semantic concepts
US10839543B2 (en) Systems and methods for depth estimation using convolutional spatial propagation networks
US20220284546A1 (en) Iterative multiscale image generation using neural networks
CN114008663A (en) Real-time video super-resolution
US10997464B2 (en) Digital image layout training using wireframe rendering within a generative adversarial network (GAN) system
US20210216874A1 (en) Radioactive data generation
CN117597703B (en) Multi-scale converter for image analysis
US20200111214A1 (en) Multi-level convolutional lstm model for the segmentation of mr images
JP6612486B1 (en) Learning device, classification device, learning method, classification method, learning program, and classification program
CN114021696A (en) Conditional axial transform layer for high fidelity image transformation
US20240119697A1 (en) Neural Semantic Fields for Generalizable Semantic Segmentation of 3D Scenes
AU2021354030B2 (en) Processing images using self-attention based neural networks
CN115601759A (en) End-to-end text recognition method, device, equipment and storage medium
US20190156182A1 (en) Data inference apparatus, data inference method and non-transitory computer readable medium
WO2021245287A2 (en) Cross-transformer neural network system for few-shot similarity determination and classification
Li et al. Image Semantic Space Segmentation Based on Cascaded Feature Fusion and Asymmetric Convolution Module
US12125247B2 (en) Processing images using self-attention based neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant