CN109389027B - Form structure extraction network - Google Patents
Form structure extraction network
- Publication number
- CN109389027B (application CN201810483302.7A)
- Authority
- CN
- China
- Prior art keywords
- rnn
- feature map
- document
- image
- tile
- Prior art date
- Legal status
- Active
Classifications
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
- G06F18/21375—Feature extraction based on criteria of topology preservation, e.g. multidimensional scaling or self-organising maps, involving differential geometry, e.g. embedding of pattern manifold
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/2163—Partitioning the feature space
- G06F18/2411—Classification techniques relating to the classification model based on the proximity to a decision surface, e.g. support vector machines
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/084—Backpropagation, e.g. using gradient descent
- G06T3/4046—Scaling of whole images or parts thereof using neural networks
- G06T3/4053—Scaling based on super-resolution, i.e. the output image resolution being higher than the sensor resolution
- G06T3/4092—Image resolution transcoding, e.g. by using client-server architectures
- G06T7/11—Region-based segmentation
- G06T7/143—Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
- G06V10/22—Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
- G06V10/50—Extraction of image or video features by performing operations within image blocks; by using histograms, e.g. histogram of oriented gradients [HoG]; by summing image-intensity values; Projection analysis
- G06V30/19147—Obtaining sets of training patterns; Bootstrap methods, e.g. bagging or boosting
- G06V30/274—Syntactic or semantic context, e.g. balancing
- G06V30/412—Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
- G06V30/413—Classification of content, e.g. text, photographs or tables
- G06V30/416—Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
- G06V30/43—Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR
- G06T2207/20021—Dividing image into blocks, subimages or windows
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
- G06T2207/30176—Document
Abstract
A method and system for detecting and extracting accurate and precise structures in a document. A high resolution image of the document is partitioned into a set of tiles. Each tile is processed by a convolutional network and then by a set of recurrent networks that operate over each row and each column. A global lookup process is disclosed that allows the recurrent neural networks to take into account the "future" information required for accurate evaluation. The use of high resolution images allows for accurate and precise feature extraction, while segmentation into tiles makes it feasible to process high resolution images within reasonable computational resources.
Description
Technical Field
The present disclosure relates to techniques for identifying the structure and semantics of a form document such as a PDF. In particular, the present disclosure relates to techniques for processing documents using deep learning and deep neural networks ("DNNs") to extract structure and semantics.
Background
The use of forms for capturing and disseminating information has become ubiquitous. Typically, these forms have not been digitized and exist only in hard copy format. Even when forms have been digitized and converted to an electronic format, they may support interaction only on a particular electronic device (such as a personal computer) and may not be accessible on a mobile device. An adaptive form is an electronic form that can automatically adapt to viewing and input on a variety of devices, each having a different form factor, such as a personal computer, tablet computer, smart phone, and so on.
Enterprises and governments are undergoing digital transformations in which mobile is the primary digital strategy for all new offerings. The trend is driven by a range of attractive business and revenue incentives. Thus, organizations need to digitize and provide multi-channel experiences. However, many existing account registration and service request flows are still paper based. Currently, to adopt digital adaptive form technology, businesses must employ form/content authors to manually replicate the current experience and build a mobile-ready experience piece by piece, which is time consuming, expensive, and requires IT ("information technology") skills.
The elements in a form are typically arranged in a hierarchy. For example, the document is the top-level element. Below the document there may be parts, which constitute the next level in the hierarchy, and so on.
A field is another important form structural element. A field may include a combination of a widget and a caption. A widget is an area of a form that facilitates and prompts a user to enter information. Each widget may have a caption associated with it. A caption is a piece of text or other signaling information that may help the user provide input in the widget. Examples of widgets include sections and option groups. An option group is a set of items that allows a user to select one or more items via check boxes or radio buttons. A table is another example of a structural element; it may include column headings, row headings, and the actual widgets in which the user may fill in information. In addition, forms typically contain text portions made up of paragraphs, lines of text, and words. Images may even be embedded in a form.
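By way of illustration only, the hierarchy just described might be represented as shown below; the element names and nesting used here are assumptions for exposition, and the disclosure does not prescribe any particular representation.

```python
# Hypothetical, illustrative representation of an extracted form hierarchy
# (document > sections > fields/tables > widgets and captions); the element
# names are assumptions, not part of the disclosure.
form_structure = {
    "document": {
        "sections": [
            {
                "title": "Applicant information",
                "fields": [
                    {"caption": "Full name", "widget": {"type": "text_input"}},
                    {"caption": "Contact preference",
                     "widget": {"type": "option_group",
                                "options": ["Email", "Phone"]}},
                ],
                "tables": [
                    {"column_headings": ["Year", "Employer"],
                     "row_headings": ["1", "2", "3"]},
                ],
            }
        ]
    }
}
```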
One of the main problems in rapidly converting paper forms to adaptive forms is identifying the structure and semantics of the form document from an image or similar format. Once the form structure is extracted and its hierarchical properties are captured, this structure information can be used for various purposes, such as creating an adaptive electronic form.
Machine learning and deep neural networks ("DNNs") have been applied to document structure extraction. However, due to the computational cost of using high resolution images (e.g., memory requirements and limitations on efficient information propagation), known methods for applying DNNs to extract document structure from images require the use of lower resolution input images. Thus, typically, the input image provided to the DNN for structure extraction is first downsampled from a higher resolution image. While the use of lower resolution document images does reduce the computational cost of performing form identification and extraction, it also places a significant limitation on the ability of DNNs to detect very fine structures in the document. Techniques are therefore needed for extracting document structure from high resolution document images in a computationally efficient and tractable manner using machine learning and DNNs.
Drawings
FIG. 1a is a flowchart depicting the operation of a form structure extraction network in accordance with an embodiment of the present disclosure;
FIG. 1b is a flow chart depicting more detailed operation of a form structure extraction network in accordance with an embodiment of the present disclosure;
FIG. 2a is a block diagram of a form extraction network according to an embodiment of the present disclosure;
FIG. 2b is a detailed block diagram of global lookup block 216 according to an embodiment of the present disclosure;
FIG. 2c is a flow chart of a global lookup process according to an embodiment of the present disclosure;
FIG. 3a depicts 2D RNN processing of a portion of a high resolution image that has been segmented into a set of tiles according to one embodiment of the present invention;
FIG. 3b depicts an architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure;
FIG. 3c depicts an alternative architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure;
FIG. 3d depicts a single threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure;
FIG. 3e depicts a multi-threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure;
FIG. 4 depicts an input image and an output image processed by a form extraction network according to an embodiment of the present disclosure;
FIG. 5 depicts an input image and an output image processed by a form extraction network according to an embodiment of the present disclosure;
FIG. 6a illustrates an example computing system executing a form extraction network 200 according to various embodiments of the disclosure; and
FIG. 6b illustrates an example integration of the form extraction network 200 into a network environment according to one embodiment of this disclosure.
Detailed Description
In accordance with embodiments described in this disclosure, techniques are described for identifying and extracting the structure and semantics of a form document from a high resolution image of the form document. For purposes of discussion, the terms "form document" and "form" will be used interchangeably. After the structure of the form has been extracted, the structure information may be utilized to adapt the form for use in a desired context. Examples of form structures may include logical portions of forms, personal information such as credit card or address information, financial information, form titles, headers, footers, and the like.
According to embodiments described in this disclosure, the form extraction network includes a deep neural network ("DNN") architecture that can automatically identify various form elements and larger semantic structures based on a high resolution image of the form. According to embodiments of the present disclosure, the form extraction network provides an end-to-end differentiable pipeline for detecting and extracting document structure. According to embodiments of the present disclosure, the form extraction network receives a high resolution image (including original pixels) of the document form to be analyzed and generates classification features corresponding to the form elements. In particular, according to one embodiment, each pixel of the high resolution image is associated with a classification vector that indicates the probability that the pixel belongs to a particular class. The entire set of classified pixels of the high resolution document image may then be utilized to classify larger groupings of pixels into particular form elements.
To reduce the computational resource requirements of processing high resolution images, the form extraction network may use an iterative process to process subsets of the document image. Each subset of the document form image is referred to herein as a tile and includes a subset of the pixels in the entire document form image. The form extraction network may include a convolutional network for detecting features of individual tiles of the form, a multi-dimensional recurrent neural network ("RNN") for maintaining spatial state information that spans the tiles, and a global lookup module for modifying the state information of the multi-dimensional RNN based on a global lookup of form features from a lower dimensional image of the form document. It will be appreciated that an RNN is a type of neural network that is well suited to processing sequences.
Briefly, in accordance with embodiments of the present disclosure, an architecture for performing form extraction from high resolution document images may include two branches: (1) a first branch that generates a global tensor representation of the entire image via an auto-encoder, and (2) a second branch that includes convolutional and 2D-RNN layers that operate on the image in a tile-by-tile manner. According to various embodiments, the state of the RNN is stored at tile boundaries and is then used to initialize the RNNs of subsequent tiles. The RNN is also equipped with an attention mechanism that can find and retrieve information from the global document representation of the first branch.
According to various embodiments, the global lookup function may be performed by extracting features from a lower resolution representation of the high resolution image. The global lookup can be performed on the lower resolution image, which provides significant computational benefits. This allows the 2D RNN to look ahead based on features detected in the lower dimensional representation of the whole image. Thus, a 2D RNN running on a high resolution image may access features that have been extracted from the low resolution trunk and perform a lookup to make a decision about the current pixel, utilizing "future" information relative to the direction in which the 2D RNN is running.
Thus, according to embodiments described in this disclosure, a convolutional network that processes individual tiles of a high resolution document image is combined with a multi-dimensional RNN to account for information across tiles. According to various embodiments, a global look-up function is provided that allows the 2D RNN to look ahead (i.e., consider "future" information in the context of the direction of 2D RNN operation).
Fig. 1a is a flowchart depicting the operation of a form extraction network in accordance with an embodiment of the present disclosure. The process begins at 122. At 124, a high resolution document image comprising a plurality of pixels is partitioned into a set of tiles, each tile comprising a subset of the pixels of the high resolution document image. At 126, a determination is made as to whether all tiles have been processed. If not (the "no" branch of 126), the current tile is updated at 128. At 130, the tile is then processed by the neural network to classify the pixels in the tile as particular document elements. A process and system for performing such classification is described below with respect to figs. 1b and 2a through 2c. Flow then continues to 126.
If all tiles have been processed (the "yes" branch of 126), flow continues to 132, where an editable version of the document is generated from the classified pixels. The process ends at 134.
Fig. 1b is a flowchart depicting detailed operation of a form structure extraction network according to an embodiment of the present disclosure. The process begins at 102. At 104, the high resolution image is segmented into a plurality of tiles. According to embodiments described in the present disclosure, the input image provided to the form extraction network is a high resolution image of the document. Because a high resolution image is used, a larger convolutional neural network is required to process the image than would be needed for a lower resolution image. However, as previously mentioned, larger convolutional neural networks present significant computational challenges, particularly with respect to available computer memory and the information propagation requirements within a computing structure.
To address these computational challenges, according to embodiments described in this disclosure, the high resolution image is partitioned into a set of tiles. Each tile is a subset of pixels from the original high resolution image, and each tile may be processed separately from the others. However, since each tile retains the resolution of the original image, the high resolution quality of the image is not degraded. Thus, because each tile includes a subset of the original high resolution image and is processed independently of the other tiles, the instantaneous memory and other computational requirements that would otherwise be needed to process the entire high resolution image are alleviated. According to embodiments described herein, tiles are generated from the image by dividing the image into rows and columns, each having a respective height and width. According to some embodiments, tiles may overlap each other.
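As an illustration of this tiling step, the following sketch splits a grayscale page image into overlapping 227 x 227 tiles; the stride value and the helper name are assumptions introduced for exposition, not part of the disclosure.

```python
import numpy as np

def split_into_tiles(image: np.ndarray, tile: int = 227, stride: int = 200):
    """Split an H x W grayscale page image into overlapping tile x tile patches.

    Returns the patches and their top-left (row, col) offsets so that per-pixel
    predictions can later be stitched back onto the full page.
    """
    height, width = image.shape[:2]
    # Step positions; the final position is clamped so the right and bottom
    # edges of the page are covered.
    rows = sorted(set(list(range(0, max(height - tile, 1), stride)) + [max(height - tile, 0)]))
    cols = sorted(set(list(range(0, max(width - tile, 1), stride)) + [max(width - tile, 0)]))
    patches, offsets = [], []
    for row in rows:
        for col in cols:
            patches.append(image[row:row + tile, col:col + tile])
            offsets.append((row, col))
    return patches, offsets

# Example: a synthetic 1000 x 800 page yields a grid of overlapping tiles.
page = np.zeros((1000, 800), dtype=np.float32)
tiles, offsets = split_into_tiles(page)
```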
At 106, a determination is made as to whether all tiles have been processed. If yes (yes branch of 106), a global feature map of the entire image is generated at 118. Techniques for generating a global feature map of an entire image are described below. The process then ends at 120.
If all tiles have not been processed ("NO" branch of 106), the currently pending tile is updated from the pool of all available tiles for the document image. At 110, a current tile is processed by a convolutional neural network to generate a first feature map. Example embodiments of convolutional neural networks are described below.
Because the convolutional network "sees," or processes, only an individual tile at a time, it cannot extract features that span multiple tiles. To address this problem, a state-preserving network, such as an RNN, may be used to leverage information across multiple tiles. In particular, as will be described, according to various embodiments, a 2D RNN may be employed to maintain state information across the horizontal and vertical spatial dimensions of the document image using hidden state representations. As will become apparent, the 2D RNN can be decomposed into a vertical RNN and a horizontal RNN. Further, the vertical RNN may itself comprise a set of RNNs, and the horizontal RNN may likewise comprise a set of RNNs, such that the RNNs within both the vertical and horizontal RNNs may operate in parallel. A description of the parallel operation of the vertical and horizontal RNNs is provided below.
Thus, at 112, the vertical RNN processes each row of the current tile in the vertical dimension. According to various embodiments, all columns of the first feature map of the current tile may be processed in parallel by a corresponding set of RNNs comprising the vertical RNN. In this way, the vertical RNN generates a second feature map from the first feature map.
At 114, in a manner similar to the vertical RNN, the horizontal RNN sequentially processes each column of the second feature map to generate a third feature map. As with the vertical RNN, since the horizontal RNN may be composed of its own set of RNNs, the horizontal RNN may process each row of the second feature map in parallel.
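A compact sketch of this two-pass arrangement is shown below, using PyTorch LSTMs as the constituent RNNs; the layer and state sizes are illustrative assumptions rather than the disclosed configuration. The vertical pass treats each column as a sequence over rows (all columns processed in parallel as a batch), and the horizontal pass then treats each row as a sequence over columns.

```python
import torch
import torch.nn as nn

class TwoDimensionalRNN(nn.Module):
    """Vertical pass over rows followed by a horizontal pass over columns."""
    def __init__(self, channels: int, state: int = 64, state_h: int = 64):
        super().__init__()
        self.vertical = nn.LSTM(channels, state, batch_first=True)
        self.horizontal = nn.LSTM(state, state_h, batch_first=True)

    def forward(self, feature_map: torch.Tensor) -> torch.Tensor:
        # feature_map: (H, W, C) for a single tile
        h, w, c = feature_map.shape
        # Vertical RNN: each of the W columns is a length-H sequence,
        # processed in parallel as a batch of W sequences.
        cols = feature_map.permute(1, 0, 2)          # (W, H, C)
        cols, _ = self.vertical(cols)                # (W, H, S)
        second = cols.permute(1, 0, 2)               # (H, W, S) second feature map
        # Horizontal RNN: each of the H rows is a length-W sequence.
        rows, _ = self.horizontal(second)            # (H, W, S') third feature map
        return rows

tile_features = torch.randn(227, 227, 32)            # output of the conv trunk for one tile
out = TwoDimensionalRNN(32)(tile_features)            # (227, 227, 64)
```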
According to some embodiments, the 2D RNN may operate in a left-to-right manner and then in a top-to-bottom manner. Although information from top pixels may be propagated to bottom pixels, there is an inherent asymmetry in the information flow: propagation cannot occur in the opposite direction, i.e., from bottom to top in the current example. Similarly, while information may flow from left to right, there is no mechanism to facilitate information flow from right to left. Alternatively, the 2D RNN may operate from right to left and/or from bottom to top. In any event, the particular direction in which the RNN operates limits the direction in which information flows. This limits the ability of the network to make accurate inferences, because look-ahead may be required to accurately classify the current pixel. That is, the current inference may require "future" information relative to the direction of network operation.
One possible solution to this problem would be to run the 2D RNN in two directions, e.g. bottom-up, top-down, right-to-left and left-to-right. However, this approach incurs additional computational costs.
Instead, according to one embodiment, an additional trunk is introduced into the network (described below) to perform global lookups, thereby enabling look-ahead and allowing "future" features to be considered. Thus, at 116, a determination is made as to whether a global lookup is to be performed. According to one embodiment, the global lookup may be performed at a predetermined cadence (number of steps) of the 2D RNN. If a global lookup does not need to be performed (the "no" branch of 116), flow continues at 122.
If a global lookup is to be performed ("yes" branch of 116), flow continues to 118 and the state of the 2D RNN is updated with the global lookup. Techniques for performing global lookups are described below with respect to fig. 2b and related discussion.
At 122, the third feature map is processed by a second convolutional neural network to generate a class prediction for each pixel in the current tile. Flow then continues to 106, where a determination is made as to whether all tiles have been processed.
Fig. 2a is a block diagram of a form extraction network according to an embodiment of the present disclosure. The form extraction network 200 includes a first branch 222 (a), a second branch 222 (b), an optimizer 220, and a global lookup block 216. The first branch 222 (a) includes a tile extraction block 204, a convolutional network 222, a 2D RNN 208, a classifier 236, a softmax block 218, and a classification loss block 210. The 2D RNN 208 includes a vertical RNN 206 (a) and a horizontal RNN 206 (b). The second branch 222 (b) includes an auto-encoder block 210 and a reconstruction loss block 214. The auto-encoder block 210 includes an encoder 208 (a) and a decoder 208 (b).
It should be appreciated that fig. 2a depicts a high-level view of a form extraction network 200. According to various embodiments, the form extraction network 200 is associated with an infrastructure model architecture (not shown in fig. 2 a) that includes a set of artificial neural network layers. Each layer may consist of a collection of nodes or units that implement artificial neurons. The arrangement of layers and the interconnection of nodes between layers form an architectural model for the form extraction network 200. Each interconnection between two neurons may be associated with a weight that may be learned during a learning or training phase (described below). Each neuron may also be associated with a bias term, which may also be learned during training.
Each artificial neuron may receive a set of signals from the other artificial neurons to which it is connected. Typically, a neuron forms a weighted sum of the incoming signals, using the weight associated with each interconnection, and adds the bias term associated with the neuron to generate a scalar value. Each artificial neuron may also be associated with an activation function, which is typically a nonlinear univariate function with a smooth derivative. The activation function is applied to the scalar value to generate the output signal of the artificial neuron, which may then be provided to the other artificial neurons to which it is connected.
It will be further appreciated that the form extraction network 200 is used in at least two different phases: (1) a learning or training phase, and (2) an inference phase. As previously described, during the training phase, the set of weights associated with each interconnection between two artificial neurons and the bias term associated with each artificial neuron are calculated. In general, the training phase may utilize training and validation sets that include training and validation examples. One or more loss functions may be associated with various outputs of the form extraction network; each represents a distance metric between the target output values associated with the respective training examples and the actual calculated output values. A typical loss function is the cross-entropy classification loss. An optimization algorithm is then applied to the form extraction network 200 to generate an optimal set of weights and biases for the provided training and validation sets. The optimization algorithm may include some variant of gradient descent, such as stochastic gradient descent. Typically, during the training phase, a backpropagation algorithm is applied to learn the weights of all artificial neurons in the network.
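A minimal sketch of one such training step follows; the model, class count, and learning rate are assumptions for illustration, but the step follows the pattern described above: per-pixel cross-entropy loss, backpropagation, and a stochastic gradient descent update.

```python
import torch
import torch.nn as nn

def train_step(model: nn.Module, optimizer: torch.optim.Optimizer,
               tile: torch.Tensor, labels: torch.Tensor) -> float:
    """One SGD/backpropagation step on a single tile.

    tile:   (1, C, H, W) image patch
    labels: (1, H, W) integer class index per pixel
    """
    optimizer.zero_grad()
    logits = model(tile)                       # (1, num_classes, H, W)
    loss = nn.functional.cross_entropy(logits, labels)
    loss.backward()                            # backpropagate gradients
    optimizer.step()                           # update weights and biases
    return loss.item()

# Usage with an illustrative stand-in model (any per-pixel classifier with matching shapes):
model = nn.Conv2d(1, 5, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_value = train_step(model, optimizer,
                        torch.randn(1, 1, 227, 227),
                        torch.randint(0, 5, (1, 227, 227)))
```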
Once the form extraction network 200 has been trained, it can be used in the inference phase. In the inference phase, actual real-world inputs, including actual form document images, may be provided to the form extraction network 200 to generate a classification of form elements. The inference phase uses the weights and biases learned during the training phase.
As shown in fig. 2a, a high resolution document image 202 is received by the first and second branches [222 (a) and 222 (b)] of the form extraction network 200. As will be appreciated, the high resolution document image 202 may include a pixel map corresponding to a digital image of the document. The pixel map may, for example, represent the gray scale intensity associated with each of a plurality of spatial points of the image. According to one embodiment, each pixel may encode a gray scale intensity value. According to alternative embodiments, each pixel may encode a color value comprising red, green, and blue intensity values, which may be represented as channels in the context of a DNN.
The processing performed by the first branch 222 (a) of the form extraction network 200 will now be described. The segmentation block 204 receives the high resolution document image 202 and segments it into tiles 224 (1) through 224 (N). Each tile 224 (1) through 224 (N) may be a subset of the high resolution document image 202 and thus include a pixel map of a respective region of the high resolution image 202. According to one embodiment, the segmentation of the high resolution document image into tiles 224 (1) through 224 (N) may be performed as a batch processing step or may be performed in a pipelined fashion as each tile is processed by the first branch 222 (a). According to one embodiment, overlapping tiles having dimensions of 227 pixels by 227 pixels are generated from the high resolution document image 202. However, any other dimensions are possible.
According to one embodiment, each tile 224 (1) through 224 (N) is processed separately by the convolutional network 222 to generate a feature map 226 (a). According to one embodiment, feature map 226 (a) is a tensor of dimensions H × W × C. Convolutional network 222 may comprise a convolutional neural network that operates in a translation-invariant manner to process a multi-dimensional array of input pixels and generate feature map 226 (a) (also a multi-dimensional array). The feature map 226 (a) may be referred to as a tensor, although the term is not used here with the same formal meaning it has in mathematics. Rather, it will be appreciated that the feature map 226 (a) comprises a multi-dimensional array of dimension at least 2. An example embodiment of feature map 226 (a) and illustrative dimensions are discussed below.
According to one embodiment of the present disclosure, a convolutional network may exhibit the following architecture:
according to embodiments described in this disclosure, convolutional network 222 does not employ any reduction elements or layers, such as maximum pools, etc. In this way, for each pixel in a given tile 224 (1) through 224 (N), there will be some feature in the feature map.
The first feature map 226 (a) is then processed by the 2D RNN 208. As can be appreciated, the 2D RNN 208 can maintain state information such that it can process input sequences using saved state information. Because the 2D RNN may utilize saved state information generated during processing of a previous tile, the 2D RNN 208 may utilize historical information from previously processed tiles 224 (1) through 224 (N) during processing of the current tile.
As previously described, the 2D RNN 208 may include a vertical RNN 206 (a) and a horizontal RNN 206 (b). According to one embodiment, the vertical RNN 206 (a) and the horizontal RNN 206 (b) may be internally identical. However, the vertical RNN 206 (a) may be configured to process the rows of the first feature map 226 (a), while the horizontal RNN 206 (b) may be configured to process columns in a particular order. According to one embodiment, the feature map 226 (a) is processed by the vertical RNN 206 (a) to generate a second feature map 226 (b), which may also be understood as a multi-dimensional array. According to one embodiment, as described below, the vertical RNN 206 (a) may itself comprise a set of RNNs such that each RNN may process a column of the first feature map 226 (a) independently and in parallel. According to one embodiment, each RNN comprising the vertical RNN 206 (a) may be an LSTM ("long short-term memory") network.
The second feature map 226 (b) is then processed by the horizontal RNN 206 (b) to generate the feature map 226 (c). Similar to the vertical RNN 206 (a), the horizontal RNN 206 (b) may comprise a set of RNNs, which may thus process each row of the second feature map 226 (b) independently and in parallel. Also, similar to the vertical RNN 206 (a), each of the RNNs comprising the horizontal RNN 206 (b) may be an LSTM network.
The feature map 226 (c) is then processed by the classifier 236 to generate class predictions for each pixel in the current tile. For each pixel in the tile [i.e., in tiles 224 (1) through 224 (N)], the classifier generates an associated vector whose components indicate the relevance of that pixel to particular document element classes. For example, according to one embodiment, the document element classes include text fields, forms, text input fields, and the like. That is, each component of the vector may indicate the degree to which a given pixel belongs to a particular class. According to one embodiment, classifier 236 is a 1×1 convolutional network.
The output of the classifier (not shown in fig. 2a) is then processed by the softmax block 218. The concept of the softmax function is well understood in the art of machine learning and deep neural networks and will not be discussed in detail herein. For purposes of this discussion, it is sufficient to understand that the softmax block 218 normalizes the vector, in which each component represents a particular class, so that the components sum to one. In this way, the output of the softmax represents a probability distribution.
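The classifier and softmax stages can be illustrated as follows; the channel count and the number of document element classes are assumptions for illustration.

```python
import torch
import torch.nn as nn

num_classes = 5                                    # e.g. text, widget, caption, table, background (assumed)
classifier = nn.Conv2d(64, num_classes, kernel_size=1)   # 1 x 1 convolution

rnn_features = torch.randn(1, 64, 227, 227)        # feature map 226 (c) for one tile
logits = classifier(rnn_features)                  # (1, num_classes, 227, 227)
probabilities = torch.softmax(logits, dim=1)       # per-pixel class distribution; each pixel sums to 1
```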
The softmax block 218 generates a normalized classifier vector (not shown in fig. 2a). The classification loss block 210 processes the output of the softmax block 218 using a loss function. According to one embodiment, the classification loss block 210 may utilize a cross-entropy loss function. The classification loss block 210 may generate a loss metric value (not shown in fig. 2a) that represents how well the form extraction network 200 classifies a given training element.
The optimizer 220 is utilized during the training phase of the form extraction network 200. In particular, the optimizer 220 receives loss metric values from the classification loss block 210 and iteratively utilizes them during the training phase to improve the weights and biases of the form extraction network 200. According to one embodiment, the optimizer 220 may use a stochastic gradient descent ("SGD") method or any other optimization method. In addition, the optimizer 220 may employ a backpropagation algorithm to improve the weights and biases of the artificial neurons comprising the form extraction network.
The processing performed by the second branch 222 (b) of the form extraction network 200 will now be described. As shown in fig. 2a, the high resolution document image 202 is received by a downsampler 228, which generates the scaled image 212. It will be understood that the scaled image 212 is a lower dimensional representation of the high resolution document image 202. The scaled image 212 is then processed by the auto-encoder 210. According to embodiments described in this disclosure, the auto-encoder 210 processes the scaled image in a first stage using the encoder 208 (a) to generate a feature map 226 (d), which may be a lower dimensional representation of the scaled image 212, commonly referred to as a latent space. The encoder 208 (a) effectively maps the higher dimensional input of the scaled image 212 to the feature map 226 (d) via a bottleneck layer. In a second stage, the auto-encoder maps the latent space representation [i.e., feature map 226 (d)] back to the higher dimensional space associated with the scaled image 212 using the decoder 208 (b) to generate the reconstructed scaled image 222.
In particular, in the first stage, the encoder 208 (a) generates a feature map 226 (d) that is provided to the decoder 208 (b). According to one embodiment, encoder 208 (a) may utilize the following architecture.
However, other architectures are possible.
According to one embodiment, decoder 208 (b) may utilize the following architecture:
however, other architectures are possible.
The reconstruction loss block 214 is utilized in conjunction with the optimizer (previously described) during the training phase to determine the weights and biases associated with the second branch 222 (b) of the form extraction network 200. According to one embodiment, the reconstruction loss block 214 may utilize, for example, an L2 (squared error) loss to calculate the loss between the scaled image 212 and the reconstructed scaled image 222 generated by the auto-encoder 210. Any other loss function may be used, such as an L1 loss. In particular, the reconstruction loss block 214 may generate a scalar output characterizing the reconstruction loss, which is provided to the optimizer 220. As previously described, the optimizer 220 may utilize a backpropagation algorithm in conjunction with an optimization algorithm such as SGD to generate the weights and biases of the form extraction network 200 during the training phase.
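For illustration, a strided convolutional encoder, a transposed-convolution decoder, and an L2 reconstruction loss of the general kind described might be sketched as follows; the layer sizes, strides, and input resolution are assumptions and do not reproduce the disclosed encoder and decoder tables.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative encoder/decoder pair: the encoder maps the downsampled page to a
# low-resolution latent feature map (226 (d)); the decoder maps it back so a
# reconstruction loss can be computed.
encoder = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
)
decoder = nn.Sequential(
    nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
)

scaled_image = torch.randn(1, 1, 256, 256)      # downsampled document image 212 (assumed size)
latent = encoder(scaled_image)                  # global feature map 226 (d): (1, 32, 64, 64)
reconstruction = decoder(latent)                # reconstructed scaled image 222: (1, 1, 256, 256)

# L2 (squared-error) reconstruction loss; an L1 loss could be substituted.
reconstruction_loss = F.mse_loss(reconstruction, scaled_image)
```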
As previously described, because the 2D RNN 208 runs in a particular direction (e.g., top to bottom and left to right), "future" features (in terms of the direction in which the 2D RNN runs) are not available during processing of any given tile unless the 2D RNN 208 is also run in the opposite direction. However, to avoid the computational inefficiency of running the 2D RNN in both directions, according to embodiments of the present disclosure, a global lookup function is implemented via the global lookup block 216, which allows the 2D RNN 208 to perform look-ahead and thereby consider "future" information from tiles that have not yet been processed by the 2D RNN.
According to one embodiment, to determine "future" information, a mapping between features in the scaled image 212 and the high resolution tiles 224 (1) through 224 (N) is generated. This mapping is referred to herein as a global lookup and is performed by the global lookup block 216. According to embodiments of the present disclosure, learning this mapping in order to perform the global lookup is a task solved by the form extraction network 200, and in particular by the global lookup block 216.
In particular, after a limited number of steps, the horizontal RNN 206 (b) may attempt to generate an approximately Gaussian or pseudo-Gaussian mask that is multiplied by the feature map 226 (d) output from the auto-encoder. According to one embodiment, the limited number of steps is 16, but any other value is possible. The Gaussian or pseudo-Gaussian mask is referred to as an attention map and is generated based on the feature map 226 (c) output by the horizontal RNN 206 (b). According to one embodiment, the mask operates like a softmax, and thus the output is effectively a probability distribution. By calculating an expected value using the probability distribution, an expected feature can be determined. The expected feature is used by the RNN to perform its predictions. This is repeated at the periodic step interval of the horizontal RNN 206 (b). The global lookup block 216 determines the mask, or attention map, in the manner described below.
More specifically, according to one embodiment, global lookup block 216 receives feature map 226 (c) (the output of horizontal RNN 206 (b)) and generates N simultaneous attention maps (not shown in fig. 2 a) based on feature map 226 (c).
The meaning of the attention map will be understood by a skilled practitioner. The attention mechanism is implemented via dynamic mask generation for each RNN (depending on the current position in the high resolution tile), which is used to identify spatial positions in the global tensor representation. In addition, the global lookup block 216 receives the feature map 226 (d) (the output of the encoder 208 (a)). Using the N simultaneous attention maps and the feature map 226 (d), the global lookup block 216 generates state modification information 252 that is used to modify the state information of the 2D RNN 208. More details of how the state modification information is generated are described below with respect to fig. 2b.
By modifying the state of the 2D RNN 208, the global lookup block effectively causes the 2D RNN 208 to perform look-ahead, and thus to consider "future" information from tiles that it has not yet "seen." As previously mentioned, "future" information relates to information that is otherwise unavailable due to the direction in which the 2D RNN 208 operates. For example, if the 2D RNN 208 operates from left to right and top to bottom, the "future" information would be the right-to-left and/or bottom-to-top data. Further details regarding the generation of the state modification information are described below with respect to fig. 2b.
According to one embodiment, global lookup block 216 performs a global lookup operation using the output of horizontal RNN 206 (b) (feature map 226 (c)). However, according to other embodiments, the global lookup block 216 may perform a global lookup using output generated by the vertical RNN 206 (a) or both the horizontal 206 (b) and vertical RNN 206 (a).
Fig. 2b is a detailed block diagram of global lookup block 216 according to an embodiment of the present disclosure. As shown in fig. 2b, the global lookup block 216 may include an attention generation network 230, an average context vector calculation block 232, and a feedback network 234. The output of the horizontal RNN [feature map 226 (c)] is provided to the attention generation network 230. The attention generation network 230 processes the feature map 226 (c) to generate one or more attention maps (denoted by p), each of which is provided to the average context vector calculation block 232. The attention generation network 230 may include a DNN with multiple layers and may utilize the following architecture:
the encoder output represented by z [ feature map 226 (d) ] is also provided to the average context vector calculation block 232. According to one embodiment, the encoder output [ feature map 226 (d) ] z is a tensor of dimension H W C, where C indicates the number of channels. On the other hand, each attention graph generated by the network 230 may be a tensor of dimension h×w×1.
For each attention map, the average context vector calculation block 232 calculates an average context vector E(z) according to E(z) = Σ_{i,j} p_ij z_ij, which yields N context vectors E(z), each of dimension 1 × C. Each of the N vectors E(z) is provided to the feedback network 234, which generates state modification information 252; the state modification information 252 is provided to the 2D RNN 208 to modify the state information associated with the 2D RNN 208. According to embodiments described herein, the feedback network 234 may include an RNN and may utilize the following architecture:
fig. 2c is a flow chart of a global lookup process according to an embodiment of the present disclosure. The process depicted in fig. 2c may be performed by the global lookup block 216 previously described with respect to fig. 2 b. The process begins at 240. At 250, a determination is made as to whether a global lookup needs to be performed. According to embodiments described herein, the global lookup may be performed repeatedly after a limited number of steps (e.g., after a limited number of tiles are processed). According to one embodiment, a global lookup is performed every 16 steps. However, any other limited spacing is possible. If it is not time to perform a global lookup ("no" branch of 250), flow continues with 250.
If a global lookup is to be performed ("yes" branch of 250), flow continues to 242. At 242, a map of interest (p) is generated based on the output (p) of the horizontal RNN 206 (b). At 244, an average context vector [ E (z) ] is generated based on the attention map (p) and the encoder output (z). The generation of the average context vector is described above with respect to fig. 2 b. At 246, the average context vector is processed via the feedback network 234 to generate state modification information 252. At 248, state vector information associated with the 2d RNN 208 is modified based on the state modification information 252. Flow then continues to 250 where a determination is made as to whether a global lookup should be performed at 250.
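A minimal sketch of the average context vector computation E(z) = Σ p_ij z_ij used in this lookup is shown below; the tensor shapes are illustrative assumptions.

```python
import torch

def average_context_vector(attention_map: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Compute E(z) = sum_{i,j} p_ij * z_ij for one attention map.

    attention_map: (H, W) weights that sum to 1 (a softmax-like distribution)
    z:             (H, W, C) encoder feature map 226 (d)
    Returns a (C,) expected feature vector.
    """
    return (attention_map.unsqueeze(-1) * z).sum(dim=(0, 1))

# Illustrative shapes (assumed, not from the disclosure):
z = torch.randn(64, 64, 32)                                # encoder output
raw = torch.randn(64, 64)                                  # unnormalized attention scores
p = torch.softmax(raw.flatten(), dim=0).reshape(64, 64)    # normalize to a distribution
context = average_context_vector(p, z)                     # (32,) expected feature
```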
FIG. 3a depicts 2D RNN processing of a portion of a high resolution image that has been segmented into tile sets according to one embodiment of the present invention. Fig. 3a shows feature maps 226 (a) (1) through 226 (a) (16) corresponding to each output of the convolutional network 222 for each respective tile 224 (1) through 224 (N). For purposes of this discussion, the feature maps 226 (a) (1) through 226 (a) (16) are represented as tiles in FIG. 3a because there is a one-to-one correspondence between tiles 224 (1) through 224 (N) of the high resolution document image 202 and the feature maps 226 (a) (1) through 226 (a) (N). That is, each feature map 226 (a) (1) through 226 (a) (N) represents a respective output of the convolutional network 222 for the respective tile 224 (1) through 224 (N). Although fig. 3a only shows feature maps 226 (a) (1) through 226 (a) (16), it will be appreciated that these feature maps correspond to only a portion of tiles 224 (1) through 224 (N), and in fact, the high resolution document image 202 may be segmented into a smaller or greater number of tiles, in which case the number of feature maps 226 shown in fig. 3a will be greater or smaller, and will correspond exactly to the number of segmented tiles of the high resolution document image 202.
FIG. 3a also shows horizontal RNN initial state vectors 308 (1) through 308 (4), vertical RNN initial state vectors 310 (1) through 310 (4), vertical inter-tile RNN state vectors 312 (1) through 312 (16), and horizontal inter-tile RNN state vectors 314 (1) through 314 (16).
For the purposes of this discussion, the processing of a particular feature map [e.g., 226 (a) (1)] will be described. It will be appreciated that the processing of the other feature maps, such as 226 (a) (2) through 226 (a) (16), proceeds in a similar and analogous manner. Accordingly, all of the discussion regarding the feature map 226 (a) (1) and its associated processing also applies to the feature maps 226 (a) (2) through 226 (a) (16). According to one embodiment, each feature map 226 (a) (1) has tensor dimensions H × W × C, where H corresponds to the height, W to the width, and C to the number of channels of the feature map 226 (a). For the purposes of this example, assume H = W = N. According to one embodiment, N = 227. However, N may take any value.
As previously described, according to some embodiments, the vertical RNN 206 (a) may be associated with a set of RNNs (not shown). During processing of each feature map 226 (a) (1), the set of RNNs associated with the vertical RNN 206 (a) may operate in parallel to process each column of the feature map 226 (a) (1). According to alternative embodiments, the vertical RNN 206 (a) is associated with a single RNN, in which case each row of the feature map 226 (a) (1) may be processed one after the other. Each RNN associated with the vertical RNN 206 (a) is assumed to have a respective state size S.
As previously described with respect to FIG. 2a, the vertical RNN 206(a) processes the feature map 226(a)(1) to generate the feature map 226(b) (not shown in FIG. 3a).
According to one embodiment, each RNN associated with the vertical RNN 206(a) processes each row of the feature map 226(a)(1) and issues a state vector of size W×S. In other words, a state vector having tensor dimensions W×S is generated for each row of the feature map 226(a)(1). In particular, according to one embodiment, at each step the vertical RNN 206(a) processes all C channels present at that location in the H×W×C feature map. Thus, over all rows in the feature map 226(a)(1), the vertical RNN 206(a) generates the feature map 226(b) (not shown in FIG. 3a) with tensor dimensions H×W×S.
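An informal numpy sketch of this vertical sweep is shown below. It uses a plain tanh RNN cell with randomly initialized weights purely for illustration; the embodiments are not limited to any particular cell type, and the shapes and the function name vertical_rnn are assumptions.

```python
import numpy as np

def vertical_rnn(feature_map, Wx, Wh, b, h0=None):
    """Sweep a plain tanh RNN cell down the rows of an (H, W, C) feature map.

    One state of size S is kept per column, so all W columns advance together
    at each row step, producing an (H, W, S) output.  h0, if given, is the
    (W, S) vertical inter-tile state carried over from the tile above.
    """
    H, W, C = feature_map.shape
    S = Wh.shape[0]
    h = np.zeros((W, S)) if h0 is None else h0
    out = np.empty((H, W, S))
    for row in range(H):
        x = feature_map[row]                  # (W, C): all C channels at each location
        h = np.tanh(x @ Wx + h @ Wh + b)      # advance every column by one step
        out[row] = h
    return out                                # feature map 226(b), (H, W, S)

# Illustrative shapes only: H = W = 227, C = 64 channels, S = 32 state units.
rng = np.random.default_rng(0)
fm_a = rng.standard_normal((227, 227, 64))
fm_b = vertical_rnn(fm_a,
                    Wx=rng.standard_normal((64, 32)) * 0.1,
                    Wh=rng.standard_normal((32, 32)) * 0.1,
                    b=np.zeros(32))
```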
The last row of the feature map 226(b) is then utilized to generate the vertical inter-tile state vector 312(1), which will be used to process the feature map 226(a)(5) corresponding to the subsequent tile.
The horizontal RNN 206(b) then processes the feature map 226(b) to generate a feature map 226(c) (not shown in FIG. 3a). Similar to the vertical RNN 206(a), according to some embodiments, the horizontal RNN 206(b) may be associated with a set of RNNs (not shown). During processing of each feature map 226(b), the set of RNNs associated with the horizontal RNN 206(b) may function in parallel to process each row of the feature map 226(b). According to alternative embodiments, the horizontal RNN 206(b) is associated with a single RNN, in which case each row of the feature map 226(b) may be processed one after the other. Assume that each RNN associated with the horizontal RNN 206(b) has a respective state size S'.
According to one embodiment, each RNN associated with the horizontal RNN 206(b) processes each column of the feature map 226(b) and issues a state vector of size H×S'. In other words, a state vector having tensor dimensions H×S' is generated for each column of the feature map 226(b). Thus, over all columns in the feature map 226(b), the horizontal RNN 206(b) generates the feature map 226(c) (not shown in FIG. 3a) with tensor dimensions H×W×S'.
The last column of the feature map 226(c) is then utilized to generate the horizontal inter-tile state vector 314(1), which will be used to process the feature map 226(a)(2) corresponding to the subsequent tile.
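The extraction of the two inter-tile state vectors can be illustrated as follows; random arrays stand in for the RNN outputs, and the shapes (H, W, S) and (H, W, S') are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
fm_b = rng.standard_normal((227, 227, 32))     # feature map 226(b), output of the vertical RNN
fm_c = rng.standard_normal((227, 227, 48))     # feature map 226(c), output of the horizontal RNN

# Vertical inter-tile state [e.g., 312(1)]: the last row of 226(b), seeding the tile below.
vertical_intertile_state = fm_b[-1, :, :]      # (W, S)

# Horizontal inter-tile state [e.g., 314(1)]: the last column of 226(c), seeding the tile to the right.
horizontal_intertile_state = fm_c[:, -1, :]    # (H, S')
```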
FIG. 3b depicts an architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure. As shown in FIG. 3b, the feature map 226(a) is processed by the vertical RNN 206(a). The output of the vertical RNN 206(a) (not shown in FIG. 3b) is then processed by the horizontal RNN 206(b).
FIG. 3c depicts an alternative architecture for processing feature maps generated by a convolutional network in accordance with an embodiment of the present disclosure. FIG. 3c is similar to FIG. 3b, but includes an additional concatenation layer 316, which receives input from the feature map 226(a) via skip connection 218 and from the vertical RNN 206(a). The output of the concatenation layer 316 (not shown in FIG. 3c) is then provided to the horizontal RNN 206(b). The embodiment depicted in FIG. 3c allows for potentially higher accuracy because it combines lower level features [i.e., the feature map 226(a)] with higher level features [i.e., the output of the vertical RNN 206(a)] for processing via the horizontal RNN 206(b).
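A minimal sketch of the concatenation layer 316 fed by the skip connection 218 follows; the channel counts are illustrative assumptions.

```python
import numpy as np

def concatenation_layer(fm_a, fm_vertical):
    """Join the convolutional feature map 226(a) (lower level features, arriving
    via the skip connection) with the vertical RNN output (higher level features)
    along the channel axis, so the horizontal RNN sees both."""
    return np.concatenate([fm_a, fm_vertical], axis=-1)

# Illustrative channel counts: C = 64 convolutional channels, S = 32 RNN state units.
fm_a = np.zeros((227, 227, 64))
fm_vertical = np.zeros((227, 227, 32))
combined = concatenation_layer(fm_a, fm_vertical)   # (227, 227, 96), fed to the horizontal RNN
```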
FIG. 3d depicts a single-threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure. Each box shown in FIG. 3d may represent a single element of the feature map 226(a). As shown in FIG. 3d, for each column, the associated rows are processed sequentially [e.g., 320(1) to 320(4), 320(5) to 320(8), 320(9) to 320(12), 320(13) to 320(16)].
FIG. 3e depicts a multi-threaded processing sequence for a vertical RNN according to an embodiment of the present disclosure. As shown in FIG. 3e, the elements of each row are processed in parallel by multiple threads, with each thread associated with a respective column. That is, for example, each element 320(1) in the first row is processed by a separate thread (not shown in FIG. 3e). Once the elements in the first row have been processed, each element in the second row [i.e., 320(2)] is processed by its associated thread.
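The two traversal orders of FIGS. 3d and 3e can be made concrete with a small sketch; the 4×4 grid size is illustrative.

```python
# FIG. 3d (single RNN): one column is traversed top to bottom before the next column begins.
def single_threaded_order(H, W):
    return [(row, col) for col in range(W) for row in range(H)]

# FIG. 3e (one thread per column): all columns advance together, one row at a time.
def row_synchronous_order(H, W):
    return [[(row, col) for col in range(W)] for row in range(H)]

assert single_threaded_order(4, 4)[:4] == [(0, 0), (1, 0), (2, 0), (3, 0)]
assert row_synchronous_order(4, 4)[0] == [(0, 0), (0, 1), (0, 2), (0, 3)]
```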
FIG. 4 depicts an input image and an output image processed by the form extraction network according to an embodiment of the present disclosure. As shown in FIG. 4, the final output is a set of labeled pixels of the image; that is, the output is a label for each pixel. The example depicted in FIG. 4 shows a simplified scenario in which only three labels corresponding to features are detected: background, text, and widgets. Green represents regions of text. Yellow represents widgets into which data is to be entered. As an example, in FIG. 4, 401 indicates text and 402 indicates a widget. Although FIG. 4 depicts only two detected feature types, it will be understood that any number of features may be detected by the form extraction network 200.
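A sketch of this final labeling step is shown below. The class names follow FIG. 4, while the specific RGB values and the argmax decision rule are illustrative assumptions rather than the claimed classification block.

```python
import numpy as np

LABELS = ["background", "text", "widget"]        # the three classes shown in FIG. 4
PALETTE = {"background": (0, 0, 0),              # RGB values are illustrative;
           "text": (0, 255, 0),                  # FIG. 4 renders text in green
           "widget": (255, 255, 0)}              # and widgets in yellow

def label_pixels(class_probs):
    """Turn per-pixel softmax output (H, W, num_classes) into a label index map
    and an RGB visualization similar to the output image of FIG. 4."""
    labels = class_probs.argmax(axis=-1)                          # (H, W) label indices
    colors = np.array([PALETTE[name] for name in LABELS], dtype=np.uint8)
    return labels, colors[labels]                                 # (H, W) and (H, W, 3)
```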
FIG. 5 depicts another input image and an output image that has been processed by the form extraction network according to an embodiment of the present disclosure. Similar to FIG. 4, in FIG. 5, 401 indicates text and 402 indicates a widget.
FIG. 6a illustrates an example computing system that executes the form extraction network 200 according to various embodiments of the present disclosure. As shown in FIG. 6a, computing device 600 includes CPU/GPU 612, training subsystem 622, and test/inference subsystem 624. Training subsystem 622 and test/inference subsystem 624 may be understood as programmatic structures for performing training and testing of the form extraction network 200. In particular, CPU/GPU 612 may be further configured, via programmed instructions, to perform training and/or testing of the form extraction network 200 (as variously described herein, such as with respect to FIGS. 3-4). Other components and modules typical of computing systems, such as, for example, coprocessors, processing cores, graphics processing units, mice, touch pads, touch screens, displays, and the like, are not shown but will be apparent. Many variations of computing environments will be apparent in light of this disclosure. For example, the item store 106 may be external to the computing device 600. Computing device 600 may be any standalone computing platform, such as a desktop or workstation computer, a laptop computer, a tablet computer, a smart phone or personal digital assistant, a gaming console, a set-top box, or other suitable computing platform.
Training subsystem 622 also includes document image training/validation data store 610(a), which stores training and validation document images. Training algorithm 616 represents program instructions for performing training of the form extraction network 200 in accordance with the training described herein. As shown in FIG. 6a, training algorithm 616 receives training and validation document form images from training/validation data store 610(a) and generates the optimal weights and biases, which are then stored in weight/bias data store 610(b). As previously described, training may utilize a back-propagation algorithm with gradient descent or some other optimization method.
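For illustration, training algorithm 616 might follow a loop of the following general shape. The params dict and the loss_fn/grad_fn interfaces are hypothetical placeholders standing in for the forward pass and back-propagation through the form extraction network 200, and the embodiments are not limited to plain gradient descent.

```python
import numpy as np

def train(params, loss_fn, grad_fn, train_batches, val_batches, lr=1e-3, epochs=10):
    """Minimal gradient-descent loop in the spirit of training algorithm 616.

    params is a dict of weight/bias arrays; loss_fn(params, x, y) and
    grad_fn(params, x, y) are hypothetical interfaces for the forward pass
    and back-propagation.  The weights that score best on the validation
    images are kept.
    """
    best_params, best_val = None, np.inf
    for _ in range(epochs):
        for images, labels in train_batches:
            grads = grad_fn(params, images, labels)        # back-propagation
            for name, g in grads.items():                  # gradient descent update
                params[name] -= lr * g
        val_loss = float(np.mean([loss_fn(params, x, y) for x, y in val_batches]))
        if val_loss < best_val:                            # retain the best weights/biases
            best_val = val_loss
            best_params = {k: v.copy() for k, v in params.items()}
    return best_params                                     # stored in data store 610(b)
```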
The test/inference subsystem 624 also includes a test/inference algorithm 626 that utilizes the form extraction network 200 and the optimal weights/biases generated by the training subsystem 622. CPU/GPU 612 may then execute the test/inference algorithm 626 based on the model architecture and the generated weights and biases previously described. In particular, the test/inference subsystem 624 may receive a test document image 614 and use the network 200 to generate a classified document image 620 from the test document image 614.
FIG. 6b illustrates an example integration of the form extraction network 200 into a network environment according to one embodiment of the present disclosure. As shown in FIG. 6b, computing device 600 may be colocated in a cloud environment, a data center, a local area network ("LAN"), or the like. The structure of computing device 600 of FIG. 6b is the same as the example embodiment described with respect to FIG. 6a. In this case, for example, computing device 600 may be a server or a cluster of servers. As shown in FIG. 6b, client 630 interacts with computing device 600 via network 632. In particular, the client 630 may make requests and receive responses via API calls received at the API server 628, which are transmitted via the network 632 and the network interface 626.
It will be readily appreciated that network 632 may comprise any type of public and/or private network, including the Internet, a LAN, a WAN, or some combination of such networks. In this example case, computing device 600 is a server computer, and client 630 may be any typical personal computing platform.
As will be further appreciated, computing device 600 (whether the computing device shown in FIG. 6a or 6b) includes and/or otherwise has access to one or more non-transitory computer-readable media or storage devices having encoded thereon one or more computer-executable instructions or software for implementing the techniques variously described in this disclosure. The storage devices may include any number of durable storage devices (e.g., any electronic, optical, and/or magnetic storage device, including RAM, ROM, flash memory, USB drives, on-board CPU cache, hard drives, server storage, magnetic tape, CD-ROM, or other physical computer-readable storage media) for storing data and computer-readable instructions and/or software that implement the various embodiments provided herein. Any combination of memories may be used, and the various storage components may be located in a single computing device or distributed across multiple computing devices. In addition, as previously described, the one or more storage devices may be provided separately or remotely from the one or more computing devices. Many configurations are possible.
Further exemplary embodiments
The following examples relate to further embodiments, from which many variations and configurations will become apparent.
Example 1 is a method for extracting a structure from an image of a document, the method comprising: receiving a high resolution image of the document, the high resolution image comprising a plurality of pixels; generating a plurality of tiles from the image, each of the tiles comprising a subset of pixels from the high resolution image; processing tiles through a neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, the elements including fillable form fields and text content associated with the fillable form fields; and generating an editable digital version of the document using the classified pixels, the editable digital version including the fillable form fields and the text content.
Example 2 includes the subject matter of example 1, wherein processing each tile separately through the neural network includes: for each tile: processing the tiles through a convolutional network to generate a first feature map; processing the first feature map through a 2D recurrent neural network ("RNN") to generate a second feature map; processing the second feature map to generate class predictions for each pixel in the tile; and aggregating each of the respective predictions for each pixel of the high resolution image to generate a global feature map for the document.
Example 3 includes the subject matter of example 2, wherein the 2D RNN further includes a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map, and the horizontal RNN generates the second feature map from the third feature map.
Example 4 includes the subject matter of example 2, and further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises: state information associated with the 2D RNN is modified based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a resolution lower than a resolution of the high resolution image.
Example 5 includes the subject matter of example 4, wherein modifying state information associated with the 2D RNN further comprises: generating an attention map from the second feature map; generating an average context vector using the second feature map and the potential spatial representation; generating state modification information using the average context vector; and modifying state information associated with the 2D RNN using the state modification information.
Example 6 includes the subject matter of example 5, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is generated from the potential spatial representation and p is the attention map.
Example 7 includes the subject matter of example 6, wherein the potential spatial representation is generated by an autoencoder.
Example 8 is a network for performing extraction and classification of document forms, comprising: a first branch, the first branch further comprising: a segmentation block for segmenting a high resolution document image comprising a plurality of pixels into a plurality of tiles, wherein each tile comprises a subset of pixels of the high resolution document image; a convolutional network for processing each tile to generate a first feature map; a 2D RNN, wherein the 2D RNN processes the first feature map to generate a second feature map; a classification block, wherein the classification block processes the second feature map to generate a classification vector for pixels in a tile; and a softmax block to generate a probability distribution for a pixel in a tile, the probability distribution indicating a probability that the pixel is associated with a document element class; a second branch, the second branch further comprising: an image scaler block, wherein the image scaler block generates a lower resolution document image from the high resolution document image; and an autoencoder, wherein the autoencoder processes the lower resolution document image to generate a potential spatial representation of the lower resolution document image; and a global lookup block, wherein the global lookup block causes the 2D RNN to consider tiles associated with the high resolution document image that are not currently processed by the 2D RNN.
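An informal sketch of the second branch follows. The block-averaging downscale and the single-projection encoder are stand-ins chosen for brevity, not the claimed image scaler block or autoencoder, and the scale factor, latent width, and page size are assumptions.

```python
import numpy as np

def image_scaler(image, factor=8):
    """Image scaler block: produce a lower resolution document image by block
    averaging (the particular downscaling method and factor are assumptions)."""
    H, W, C = image.shape
    H, W = H - H % factor, W - W % factor
    blocks = image[:H, :W].reshape(H // factor, factor, W // factor, factor, C)
    return blocks.mean(axis=(1, 3))

def encoder(low_res, w):
    """Stand-in for the encoder half of the autoencoder: a single per-pixel linear
    projection with ReLU, producing a latent spatial representation z of the page."""
    return np.maximum(low_res @ w, 0.0)        # (h, w, C') latent representation

rng = np.random.default_rng(0)
page = rng.random((2339, 1654, 3))             # high resolution document image (illustrative size)
z = encoder(image_scaler(page), rng.standard_normal((3, 32)) * 0.1)
```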
Example 9 includes the subject matter of example 8, wherein the autoencoder further includes an encoder and a decoder, and the potential spatial representation is generated by the encoder.
Example 10 includes the subject matter of example 9, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN processes tiles along a vertical direction and the horizontal RNN processes tiles along a horizontal direction.
Example 11 includes the subject matter of example 10, wherein the 2D RNN stores state information including vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
Example 12 includes the subject matter of example 11, wherein the global lookup block utilizes the potential spatial representation and an output of the horizontal RNN to modify the state information of the 2D RNN.
Example 13 includes the subject matter of example 12, wherein the second feature map is processed by the attention generation network to generate an attention map.
Example 14 includes the subject matter of example 13, wherein the attention map and the state information are used according to the relationship E(z) = Σ_ij p_ij z_ij to generate an average context vector, where z is generated from the potential spatial representation and p is the attention map.
Example 15 is a computer program product comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be performed for performing document form extraction and classification from a high resolution image of an input document, the process comprising: generating a high resolution image of the document, the high resolution image comprising a plurality of pixels; generating a plurality of tiles from the high resolution image, each of the tiles comprising a subset of pixels from the high resolution image; for each tile: processing the tiles through a convolutional network to generate a first feature map; processing the first feature map through a 2D recurrent neural network ("RNN") to generate a second feature map; processing the second feature map to generate class predictions for each pixel in the tile; and aggregating, for each pixel of the high resolution image, each of the respective predictions to generate a global feature map for the document.
Example 16 includes the subject matter of example 15, wherein the 2D RNN further includes a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map, and the horizontal RNN generates the second feature map from the third feature map.
Example 17 includes the subject matter of example 15, and further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises: state information associated with the 2D RNN is modified based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a resolution lower than a resolution of the high resolution image.
Example 18 includes the subject matter of example 17, wherein modifying state information associated with the 2D RNN further comprises: generating an attention map from the second feature map; generating an average context vector using the second feature map and the potential spatial representation; generating state modification information using the average context vector; and modifying state information associated with the 2D RNN using the state modification information.
Example 19 includes the subject matter of example 18, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is generated from the potential spatial representation and p is the attention map.
Example 20 includes the subject matter of example 19, wherein the potential spatial representation is generated by an autoencoder.
In some example embodiments of the present disclosure, the various functional modules described herein, particularly for training and/or testing the network 200, may be implemented in software, such as a set of instructions (e.g., HTML, XML, C, C++, Objective-C, JavaScript, Java, BASIC, etc.) encoded on any non-transitory computer-readable medium or computer program product (e.g., hard disk drive, server, optical disk, or other suitable non-transitory memory or set of memories), which, when executed by one or more processors, cause the various methods provided herein to be performed.
In other embodiments, the techniques provided herein are implemented using a software-based engine. In such embodiments, the engine is a functional unit comprising one or more processors programmed or otherwise configured with instructions encoding the form extraction processes provided herein. In this way, the software-based engine is a functional circuit.
In other embodiments, the techniques provided herein are implemented in hardware circuitry, such as gate-level logic (e.g., an FPGA) or a purpose-built semiconductor (e.g., an application-specific integrated circuit, or ASIC). Still other embodiments are implemented with a microcontroller having a processor, a number of input/output ports for receiving and outputting data, and a number of embedded routines used by the processor to carry out the functionality provided herein. In a more general sense, any suitable combination of hardware, software, and firmware may be used, as will be apparent. As used herein, a circuit is one or more physical components and is used to perform a task. For example, the circuit may be one or more processors programmed or otherwise configured with a software module, or logic-based hardware circuitry that provides a set of outputs in response to a set of input stimuli. Numerous configurations will be apparent.
The foregoing description of the exemplary embodiments of the present disclosure has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the disclosure to the precise form disclosed. Many modifications and variations are possible in light of this disclosure. It is intended that the scope of the disclosure be limited not by this detailed description, but rather by the claims appended hereto.
Claims (19)
1. A method for extracting structure from an image of a document, the method comprising:
receiving a high resolution image of the document, the high resolution image comprising a plurality of pixels;
generating a plurality of tiles from the image, each of the tiles comprising a subset of pixels from the high resolution image;
processing each tile individually through a neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, the elements including fillable form fields and text content associated with the fillable form fields,
wherein processing each tile individually through the neural network comprises: for each tile, processing the tile through a convolutional network to generate a first feature map, processing the first feature map through a 2D recurrent neural network, 2D RNN, to generate a second feature map, and processing the second feature map to generate a class prediction for each pixel in the tile; and
aggregating, for each pixel of the high resolution image, each of the class predictions to generate a global feature map for the document; and
generating an editable digital version of the document using the classified pixels, the editable digital version including the fillable form fields and the text content.
2. The method of claim 1, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map and the horizontal RNN generates the second feature map from the third feature map.
3. The method of claim 1, further comprising periodically performing a global lookup process after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises:
modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a lower resolution than a resolution of the high resolution image, wherein the state information comprises vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
4. The method of claim 3, wherein modifying status information associated with the 2D RNN further comprises:
generating an attention map from the second feature map;
generating an average context vector using the second feature map and the potential spatial representation;
generating state modification information using the average context vector; and
modifying state information associated with the 2D RNN using the state modification information.
5. The method of claim 4, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is a feature map generated from the potential spatial representation and p is an attention map.
6. The method of claim 5, wherein the potential spatial representation is generated by an autoencoder.
7. A network for performing extraction and classification of document forms, comprising:
a first branch, the first branch further comprising:
a segmentation block for segmenting a high resolution document image comprising a plurality of pixels into a plurality of tiles, wherein each tile comprises a subset of pixels of the high resolution document image;
a convolutional network for processing each tile to generate a first feature map;
a 2D RNN, wherein the 2D RNN processes the first feature map to generate a second feature map;
a classification block, wherein the classification block processes the second feature map to generate a classification vector for pixels in a tile;
a softmax block to generate a probability distribution for a pixel in a tile, the probability distribution indicating a probability that the pixel is associated with a document element class;
a second branch, the second branch further comprising:
an image scaler block, wherein the image scaler block generates a lower resolution document image from the high resolution document image; and
an autoencoder, wherein the autoencoder processes the lower resolution document image to generate a potential spatial representation of the lower resolution document image; and a global lookup block, wherein the global lookup block causes the 2D RNN to consider tiles associated with the high resolution document image that are not currently processed by the 2D RNN.
8. The network of claim 7, wherein the autoencoder further comprises an encoder and a decoder, and the potential spatial representation is generated by the encoder.
9. The network of claim 8, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN processes tiles along a vertical direction and the horizontal RNN processes tiles along a horizontal direction.
10. The network of claim 9, wherein the 2D RNN stores state information comprising vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
11. The network of claim 10, wherein the global lookup block utilizes the potential spatial representation and an output of the horizontal RNN to modify the state information of the 2D RNN.
12. The network of claim 11, wherein the second feature map is processed by an attention generation network to generate an attention map.
13. The network of claim 12, wherein the attention map and the potential spatial representation are used according to the relationship E(z) = Σ_ij p_ij z_ij to generate an average context vector, where z is a feature map generated from the potential spatial representation and p is the attention map.
14. A non-transitory machine-readable medium encoded with instructions that, when executed by one or more processors, cause a process to be performed for processing a document, the process comprising:
generating a high resolution image of the document, the high resolution image comprising a plurality of pixels;
generating a plurality of tiles from the high resolution image, each of the tiles comprising a subset of pixels from the high resolution image;
processing each tile individually through the neural network, wherein processing each tile includes classifying pixels as being associated with document elements of the document, wherein for each tile:
processing the tiles through a convolutional network to generate a first feature map,
processing the first feature map by a 2D recurrent neural network 2D RNN to generate a second feature map, and
processing the second feature map to generate class predictions for each pixel in the tile; and
for each pixel of the high resolution image, aggregating each of the class predictions to generate a global feature map for the document.
15. The non-transitory machine readable medium of claim 14, wherein the 2D RNN further comprises a vertical RNN and a horizontal RNN, wherein the vertical RNN generates a third feature map from the first feature map and the horizontal RNN generates the second feature map from the third feature map.
16. The non-transitory machine readable medium of claim 14, further comprising performing a global lookup process periodically after a predetermined number of steps performed by the 2D RNN, wherein the global lookup process further comprises:
modifying state information associated with the 2D RNN based on a potential spatial representation of the document, wherein the potential spatial representation is generated based on a second image of the document, wherein the second image has a lower resolution than a resolution of the high resolution image, wherein the state information comprises vertical inter-tile state information and horizontal inter-tile state information, wherein the state information is used to correlate information between at least two tiles.
17. The non-transitory machine readable medium of claim 16, wherein modifying status information associated with the 2D RNN further comprises:
generating an attention map from the second feature map;
generating an average context vector using the second feature map and the potential spatial representation;
generating state modification information using the average context vector; and
modifying state information associated with the 2D RNN using the state modification information.
18. The non-transitory machine readable medium of claim 17, wherein the average context vector is generated according to the following relationship: E(z) = Σ_ij p_ij z_ij, where z is a feature map generated from the potential spatial representation and p is an attention map.
19. The non-transitory machine readable medium of claim 18, wherein the potential spatial representation is generated by an autoencoder.