WO2021171274A1 - Machine learned structured data extraction from document image - Google Patents
- Publication number
- WO2021171274A1 (PCT/IB2021/051702)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- computing device
- text
- document
- image
- structured data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
Definitions
- the present disclosure relates in general to machine learning and in particular to computer vision and natural language processing.
- Structured data is data that resides within fixed fields (or “classes”) of a document, record, or file (“document” herein).
- a driver’s license is a document that has structured data in classes such as “given name,” “surname,” “date of birth,” “license number,” “State,” and so on.
- Other examples of documents with structured data include restaurant menus, invoices, receipts, and various standardized forms.
- Structured data is often manually extracted from images of documents and entered into a software application or database. For example, recruiters at a company may manually enter information from resumes into a database of potential candidates, or users of a software application for local restaurants may manually enter dishes and corresponding prices into the software application. Manual extraction of structured data from documents can be expensive and time consuming, especially when the number of documents is high. Existing optical character recognition and natural language processing technology fails to reliably and autonomously transcribe structured data from images of documents with high accuracy and precision.
- a method involves a document transcription application receiving an image of a document that includes structured data of one or more classes.
- the document transcription application performs optical character recognition upon the image of the document to produce a block of text.
- the block of text is subdivided into text chunks corresponding to particular locations upon the image of the document identified by bounding boxes.
- the method further involves the document transcription application applying the block of text to a first machine learning model to determine a heat map for a class of data in the image of the document.
- the first machine learning model performs named entity recognition upon the block of text to predict a set of text chunks that contain the class of data, where each text chunk in the block of text is assigned a probability of containing the class of data by the first machine learning model.
- the document transcription application generates a heat map for predicted locations of the class in the image of the document by matching the set of text chunks to locations upon the document image using the corresponding bounding boxes of the set of text chunks.
- the heat map is an image channel corresponding to the image of the document, where the value of each pixel of the image channel is the probability of the pixel containing the class of data.
- the first machine learning model may be a bi-directional long short-term memory neural network with a conditional random field layer.
- the method further involves the document transcription application applying the heat map and the image of the document to a second machine learning model to identify a region of the image of the document that contains the class of data.
- the heat map and the image of the document act as complementary signals that result in an identified region with greater accuracy and precision than would be obtained solely from the image of the document.
- the second machine learning model may be a convolutional neural network.
- the method further involves the document transcription application generating a structured data file containing the class of data using the output of the second machine learning model.
- the document transcription application matches the identified region of the image of the document to a particular bounding box of the block of text.
- the document transcription application records the respective text chunk of the particular bounding box as data of the class of data in the structured data file.
- the document transcription application may send the structured data file to a computing device, such as a database or a server.
- a non-transitory computer-readable storage medium stores instructions that when executed by a processor causes the processor to execute the above-described method.
- a computer system includes a processor and a non-transitory computer-readable storage medium that stores instructions for executing the above-described method.
- the method involves the document transcription application applying the block of text to a graph neural network, which identifies the region of the image of the document that represents the class of data.
- FIG. 1 is a block diagram illustrating a system for autonomous document image transcription, according to one embodiment.
- FIG. 2 is a data flow diagram for autonomous document image transcription, according to one embodiment.
- FIG. 3 is a simplified illustration of class heat map generation, according to one embodiment.
- FIG. 4 is a flowchart illustrating a process for autonomous document image transcription, according to one embodiment.
- FIG. 5 is a block diagram that illustrates a computer system, according to one embodiment.
- FIG. 1 is a block diagram illustrating a system for autonomous document image transcription, according to one embodiment.
- the system includes a document transcription application (“DTA”) 110, an analyst 120, and a source 130 coupled to a network 140.
- the DTA 110 autonomously transcribes the document, extracting structured data and formatting it into a file that the DTA 110 sends to the analyst 120.
- the DTA 110, analyst 120, and/or source 130 may be separate devices or a singular device.
- the source 130 is a computing device that sends an image of a document to the DTA 110.
- the source 130 may be a personal computer, tablet, or smart phone that submits a photo of a menu or a driver’s license to the DTA 110.
- the image of the document, or “document image,” is a photo of a document containing structured data of one or more classes.
- a document image contains pixels of three color channels, those being red, green, and blue.
- alternative embodiments may involve document images of alternative numbers or types of color channels, such as a greyscale image containing pixels of one channel, e.g., an intensity of light.
- a driver’s license may contain some or all of a license number, an expiration date, a state, a first name, a last name, a date of birth, an address, a height, an eye color, a hair color, and a sex.
- a menu may contain some or all of an item name, an item description, an item price, and modifiers (such as a discounted rate for particular orders).
- a document may contain one or more classes of data, and there may be one or more instances of each of the one or more classes of data in the document.
- the analyst 120 is a computing device used to access a file of structured data extracted by the DTA 110, and may generate and present one or more user interfaces including the structured data and/or perform one or more analyses using the structured data.
- the analyst 120 may be a database to which the file of structured data is sent for storage by the DTA 110.
- the analyst 120 is multiple computing devices, such as a database for storage of the file and a computing device for accessing the stored data.
- there may be multiple analysts 120, e.g., multiple computing devices with access to a database that stores files of structured data received from the DTA 110.
- the network 140 may comprise any combination of local area and/or wide area networks, using both wired and/or wireless communication systems.
- the network 140 uses standard communications technologies and/or protocols.
- the network 140 includes communication links using technologies such as Ethernet, 802.11, worldwide interoperability for microwave access (WiMAX), 3G, 4G, code division multiple access (CDMA), digital subscriber line (DSL), etc.
- networking protocols used for communicating via the network 140 include multiprotocol label switching (MPLS), transmission control protocol/Internet protocol (TCP/IP), hypertext transport protocol (HTTP), simple mail transfer protocol (SMTP), and file transfer protocol (FTP).
- Data exchanged over the network 140 may be represented using any suitable format, such as hypertext markup language (HTML) or extensible markup language (XML).
- all or some of the communication links of the network 140 may be encrypted using any suitable technique or techniques.
- the analyst 120, source 130, and DTA 110 exchange data among one another via the network 140.
- the DTA 110 automatically transcribes document images into structured data files.
- the DTA 110 includes an Optical Character Recognition (“OCR”) module 112, a context modeling module 114, a recognition modeling module 116, and a post-processing module 118.
- the DTA 110 receives an image of a document from the source 130 and transcribes it into a file including structured data extracted from the image of the document.
- the file may be a JavaScript Object Notation (“JSON”) Object.
- the DTA 110 may send the file to the analyst 120 for storage, analysis and/or presentation.
- the OCR module 112 performs optical character recognition on the image of the document. This results in an unstructured block of text, a “text block,” extracted from the image.
- the OCR module 112 sends the text block to the context modeling module 114.
- the OCR module 112 sends the document image to the context modeling module 114.
- the text block contains one or more text chunks, each of which is a subset of the text in the text block.
- Different text chunks are associated with different locations upon the document image, e.g., in terms of pixels of the document image from which the characters of the text chunk were extracted, and/or in terms of a bounding box enclosing the text chunk.
- a text chunk extracted from the image of the document may be associated with a particular set of pixels of the image containing the text chunk, or with a set of pixels or coordinates that define a bounding box enclosing the text chunk.
- the OCR module 112 performs optical character recognition using one or more techniques, depending upon the embodiment.
- the OCR module 112 may employ matrix matching or feature extraction techniques, or a two-pass technique with adaptive recognition.
- the OCR module 112 uses TESSERACT for optical character recognition.
- the OCR module 112 uses the GOOGLE CLOUD VISION API.
- the order in which the OCR 112 identifies text in the document image and adds it to the text block proceeds according to a preset direction, e.g., as set by the analyst 120 or an implementer of the DTA 110.
- the OCR 112 may first perform optical character recognition to generate bounding boxes for text chunks in the document image, then write each text chunk into the text block according to the order of the bounding boxes.
- the OCR 112 may write text chunks into the text block such that bounding boxes are evaluated from left to right and from the top of the document image to the bottom of the document image. In this manner, text in the document image may be entered into the text block sequentially.
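The reading-order heuristic above can be sketched as follows. The data shapes (a list of text/bounding-box pairs with `(x_min, y_min, x_max, y_max)` coordinates) and the row tolerance are illustrative assumptions, not details from this document.

```python
# Sketch of the reading-order heuristic: group bounding boxes whose top
# edges are close into rows, then emit each row left to right, rows top
# to bottom. Shapes and the tolerance value are assumptions.

def reading_order(chunks, row_tolerance=10):
    """Sort (text, (x_min, y_min, x_max, y_max)) pairs into reading order."""
    rows = []
    for text, box in sorted(chunks, key=lambda c: c[1][1]):  # by y_min
        for row in rows:
            # Join this box to an existing row if the top edges align.
            if abs(row[0][1][1] - box[1]) <= row_tolerance:
                row.append((text, box))
                break
        else:
            rows.append([(text, box)])
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda c: c[1][0]))  # by x_min
    return " ".join(text for text, _ in ordered)
```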
- the context modeling module 114 generates heat maps for structured data classes, which is illustrated in greater detail below with reference to FIG. 3.
- the context modeling module 114 includes a first machine learning model that performs named entity recognition.
- the context modeling module 114 applies the text block to the first machine learning model.
- the first machine learning model receives as input the block of text extracted from the document image by the OCR 112 and outputs, for each text chunk and each class for which the model is trained, a probability that the text chunk is data of the class.
- the first machine learning model may predict a text chunk “Callie” is 84% likely to be a first name, 10% likely to be a last name, 5% likely to be a State, 1% likely to be a license number, and 0% likely to be a date of birth.
- the probability that a text chunk belongs to a class may be output by the first machine learning model as a value between 0 and 1, e.g., 0.84 for 84%.
- the first machine learning model is trained to make predictions for different numbers of classes. For example, the first machine learning model may be trained to predict the likelihood that a text chunk belongs to one class, or it may be trained to predict the likelihood that a text chunk belongs to each of a dozen classes.
- the first machine learning model may be trained on labeled training data.
- the labeled training data may be one or more blocks of text with text chunks labeled by class.
- the first machine learning model is a bi-directional long short-term memory (“LSTM”) neural network with a conditional random field (“CRF”) layer.
- the context modeling module 114 generates, for each class on which the first machine learning model is trained, a heat map representing the likelihood that a pixel of the document image contains data of the class.
- the heat map for a class is an image channel corresponding to the image of the document, where the value of each pixel of the image channel is the probability of the pixel containing the class of data.
- the context modeling module 114 generates a heat map for a class by identifying each text chunk with at least a threshold probability of corresponding to the class, as determined by the first machine learning model. Alternatively, the context modeling module 114 may identify each text chunk for which the probability that the text chunk was of the class was greater than any other probabilities assigned to the text chunk for other classes. In yet another alternative embodiment, the context modeling module 114 identifies all text chunks for the purpose of generating a heat map, regardless of probability value.
- for each identified text chunk, the context modeling module 114 identifies the set of pixels corresponding to the text chunk or the pixels contained within the bounding box corresponding to the text chunk. The context modeling module 114 identifies the corresponding pixels of the heat map and assigns those pixels the probability value of the text chunk produced by the first machine learning model. In an embodiment, each pixel of a heat map that is not assigned the probability value of a text chunk is assigned a probability value of 0.
- a text block may contain two text chunks.
- the first text chunk occupies a first ten pixels of a document image
- the second text chunk occupies a second ten pixels of the document image.
- the first machine learning model may assign the first text chunk a probability value of 0.9 for a “name” class and a probability value of 0.1 for a “state” class.
- the first machine learning model may assign the second text chunk a probability value of 0.2 for “name” and 0.8 for “state.”
- the context modeling module 114 may assign each of the first ten pixels the value 0.9 and the second ten pixels the value of 0.2.
- the context modeling module 114 may assign each of the first ten pixels the value 0.1, and the second ten pixels the value 0.8.
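The two-chunk worked example above can be sketched as a small heat-map builder. The 20-pixel flat image and the chunk representation are illustrative assumptions; the document specifies only that the heat map is an image channel whose pixel values are class probabilities, with unassigned pixels set to 0.

```python
import numpy as np

def class_heat_map(image_size, chunks, class_name):
    """chunks: list of (pixel_indices, {class: probability}) pairs."""
    heat_map = np.zeros(image_size)  # pixels covered by no chunk stay 0
    for pixels, probs in chunks:
        heat_map[pixels] = probs.get(class_name, 0.0)
    return heat_map

# The example from the text: the first chunk covers the first ten pixels,
# the second chunk covers the second ten pixels.
chunks = [
    (range(0, 10),  {"name": 0.9, "state": 0.1}),
    (range(10, 20), {"name": 0.2, "state": 0.8}),
]
name_map = class_heat_map(20, chunks, "name")    # 0.9 then 0.2
state_map = class_heat_map(20, chunks, "state")  # 0.1 then 0.8
```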
- the context modeling module 114 may modify one or more of the probability values in a heat map. For example, the context modeling module 114 may perform an arithmetic operation upon the probability values to scale them, or the context modeling module 114 may overwrite probability values of less than a threshold value with a secondary value, such as 0, or the context modeling module 114 may overwrite probability values of at least a threshold value with a secondary value, such as 1.
- the context modeling module 114 sends the one or more generated heat maps to the recognition modeling module 116.
- the recognition modeling module 116 receives as input the document image and one or more heat maps, which it applies to a second machine learning model that performs computer vision.
- the second machine learning model outputs a set of regions of the document image, where each region in the set of regions corresponds to a particular identified class and includes one or more pixels.
- the sets of regions output by the second machine learning model include, for each region of each set, a probability value representing a probability that the region contains data of the respective class of the region, as predicted by the second machine learning model.
- the second machine learning model of the recognition modeling module 116 is a multimodal convolutional neural network (“CNN”).
- the second machine learning model is trained on images of documents with labeled bounding boxes or pixels for classes in the images as well as respective heat maps.
- the second machine learning model may be trained to output one or more types of sets of regions, e.g., responsive to a number of classes for which the model has been trained. For example, if the model is trained to identify “first name” classes and “surname” classes in the image, the second machine learning model outputs two sets of regions, one corresponding to instances of the “first name” class in the image and a second corresponding to instances of the “surname” class in the image.
- the document image and one or more heat maps input to the second machine learning model act as complementary signals.
- a machine learning model to which is input a document image and a heat map for a class can identify a region in the document image corresponding to the class with greater accuracy and precision than a machine learning model to which is input solely a document image.
- results obtained using the techniques described herein provide for greater accuracy and precision of regions in an image of a document where data is predicted to be of a particular class.
- the recognition modeling module 116 sends the sets of regions to the post-processing module 118.
- the post-processing module 118 uses sets of regions from the recognition modeling module 116 and blocks of text from the OCR module 112 to generate structured data.
- the post-processing module 118 may perform one or more intermediate steps to further process a set of regions before generating structured data. For example, the post-processing module 118 may remove a region if no text chunk has a bounding box containing pixels with at least a threshold amount of overlap with the pixels of the region. For example, if no bounding box overlaps at least 70% with the region, the post-processing module 118 removes the region from the respective set of regions. Additionally, the post-processing module 118 may remove regions with a probability value below a threshold probability value.
- if the post-processing module 118 determines that a region has a probability value indicating less than 70%, the post-processing module 118 removes the region from the respective set of regions. In an embodiment, if, for a class, no region has at least the threshold probability needed to avoid removal, the post-processing module 118 sends a notification to the analyst 120 that structured data of the class could not be extracted from the document image.
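The two pruning steps described above can be sketched as follows, assuming regions and bounding boxes are `(x_min, y_min, x_max, y_max)` tuples, each region carries a probability value, and overlap is measured as the fraction of the region's area covered by a box (one plausible reading of "overlaps at least 70% with the region"). The 70% thresholds are the example values from the text; the function names are ours.

```python
def overlap_fraction(box, region):
    """Fraction of the region's area covered by the bounding box."""
    x0, y0 = max(box[0], region[0]), max(box[1], region[1])
    x1, y1 = min(box[2], region[2]), min(box[3], region[3])
    inter = max(0, x1 - x0) * max(0, y1 - y0)
    area = (region[2] - region[0]) * (region[3] - region[1])
    return inter / area if area else 0.0

def prune_regions(regions, boxes, min_overlap=0.7, min_prob=0.7):
    """Keep (region, prob) pairs that pass both checks described in the text."""
    return [
        (region, prob) for region, prob in regions
        if prob >= min_prob
        and any(overlap_fraction(box, region) >= min_overlap for box in boxes)
    ]
```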
- the post-processing module 118 may include rules for each class which are verified by checking the text chunks corresponding to the bounding boxes that overlap the regions for that class.
- a “price” class may correspond to a rule that text chunks corresponding to bounding boxes that overlap the regions from the set of regions corresponding to the “price” class cannot contain letters.
- if the post-processing module 118 identifies that a region corresponding to the “price” class is overlapped by a bounding box for a text chunk that contains letters, the region is removed from the set of regions.
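A per-class rule table of this kind might look as follows. The table itself is illustrative: the document gives only the "price" example (text chunks overlapping a price region cannot contain letters), so any other rules would be implementation choices.

```python
# Hypothetical rule table: maps a class name to a predicate over the
# text chunk. Only the "price" rule comes from the text above.
CLASS_RULES = {
    "price": lambda text: not any(ch.isalpha() for ch in text),
}

def passes_rule(class_name, chunk_text):
    """True if the chunk satisfies the class's rule (or the class has none)."""
    rule = CLASS_RULES.get(class_name)
    return rule(chunk_text) if rule else True
```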
- the post-processing module 118 may eliminate duplicate regions in a set of regions, e.g., regions that overlap at least a threshold amount.
- for one or more classes, the post-processing module 118 eliminates from the respective set of regions all regions except for a certain number of regions of highest probability.
- the post-processing module 118 may retain, for a set of regions, only the region with the highest probability value, or only the two regions with the two highest probability values.
- a specific example of the former is for “license number”, where the post-processing module 118 only retains one region, as there is only one license number on the driver’s license.
- after performing the one or more intermediate steps, the post-processing module 118 generates structured data by matching bounding boxes of text chunks from the block of text to the regions in the sets of regions. For example, for a region corresponding to a “name” class, the post-processing module 118 may identify a bounding box for the text chunk “Joseph Example” overlapping at least a threshold number or percentage of pixels of the region, and therefore include in the structured data the text “Joseph Example” for the “name” class.
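The matching step described above can be sketched as follows. We assume regions and boxes are `(x_min, y_min, x_max, y_max)` tuples and measure overlap as the fraction of the region covered by the box; the data shapes and the 70% default are illustrative.

```python
def overlap_fraction(box, region):
    """Fraction of the region's area covered by the bounding box."""
    x0, y0 = max(box[0], region[0]), max(box[1], region[1])
    x1, y1 = min(box[2], region[2]), min(box[3], region[3])
    area = (region[2] - region[0]) * (region[3] - region[1])
    return max(0, x1 - x0) * max(0, y1 - y0) / area if area else 0.0

def build_structured_data(regions_by_class, chunks, min_overlap=0.7):
    """regions_by_class: {class: [region, ...]}; chunks: [(text, box), ...]."""
    structured = {}
    for class_name, regions in regions_by_class.items():
        values = []
        for region in regions:
            # Pair the region with the most-overlapping text chunk.
            best = max(chunks, key=lambda c: overlap_fraction(c[1], region))
            if overlap_fraction(best[1], region) >= min_overlap:
                values.append(best[0])
        structured[class_name] = values
    return structured
```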
- the generated structured data is a JavaScript Object Notation (“JSON”) file.
- the DTA 110 sends the structured data to the analyst 120.
- the structured data file may store the structured data as attribute-value pairs, such as the following for an example driver’s license:
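A hypothetical structured data file for the driver's license example used throughout this document might look as follows. The attribute names and date format are illustrative; the document does not fix a schema beyond attribute-value pairs.

```python
import json

# Illustrative record built from the example values in this document:
# "Hawaii", "H12345678", "Joseph Example", July 27, 1985. Key names
# are assumptions.
record = {
    "state": "Hawaii",
    "license_number": "H12345678",
    "name": "Joseph Example",
    "date_of_birth": "1985-07-27",
}
print(json.dumps(record, indent=2))
```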
- the DTA 110 includes a single neural network that receives as input the text block and outputs a probability that each word in the text block belongs to each of one or more classes, depending upon the number of classes for which the single neural network is trained.
- the single neural network may be a graph neural network (“GNN”).
- the GNN may be trained on text blocks where each word is labeled with a class of data.
- the DTA 110 performs word embedding upon each word of the block of text to generate a set of word embeddings.
- a word embedding is a vector representing the word.
- Each word embedding is used by the DTA 110 as a node and the DTA 110 connects the nodes using edges to form a graph.
- the graph is a fully connected graph, where each node is connected to all other nodes.
- the DTA 110 may use one or more heuristics to connect nodes. For example, words in the same row of text or column of text may be connected, or words within a threshold distance of one another are connected, where the distance between words is determined by the DTA 110 using the bounding boxes of each word.
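The edge heuristics above (same row of text, or within a threshold distance) can be sketched as follows. The row tolerance and distance threshold are illustrative assumptions, and distance is measured here between bounding-box centers.

```python
import math

def box_center(box):
    """Center of an (x_min, y_min, x_max, y_max) bounding box."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)

def should_connect(box_a, box_b, row_tolerance=5, max_distance=100):
    same_row = abs(box_a[1] - box_b[1]) <= row_tolerance  # compare top edges
    (ax, ay), (bx, by) = box_center(box_a), box_center(box_b)
    close = math.hypot(ax - bx, ay - by) <= max_distance
    return same_row or close

def build_edges(boxes):
    """Return index pairs (i, j), i < j, of words that should share an edge."""
    return [
        (i, j)
        for i in range(len(boxes))
        for j in range(i + 1, len(boxes))
        if should_connect(boxes[i], boxes[j])
    ]
```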
- the DTA 110 applies the graph to the GNN, which generates as output an updated word embedding for each node in the graph.
- Each of the output word embeddings not only contains information for the respective word itself, but also local context from nearby words learned by the model.
- a second layer of the GNN takes as input the output word embeddings and generates, for each word embedding, a probability that the word embedding belongs to each of one or more classes.
- the DTA 110 uses the produced probabilities of the word embeddings as the probabilities for the respective words of the text block.
- FIG. 2 is a data flow diagram for autonomous document image transcription, according to one embodiment.
- the OCR module 112 receives a document image 205 and outputs an extracted text block 210.
- the extracted text block is input to the context modeling module 114, which outputs a set of heatmaps, one for each class 215.
- the recognition modeling module 116 receives as input the heatmaps 215 and the document image 205 and outputs class location estimates 220, e.g., the sets of regions.
- the post-processing module 118 receives as input the class location estimates 220 and outputs the structured text file 225.
- FIG. 3 is a simplified illustration of class heat map generation, according to one embodiment.
- the document image is a driver’s license 305.
- the driver’s license 305 has structured text including a state name “Hawaii,” a license number “H12345678”, a name “Joseph Example,” and a date of birth “July 27, 1985”.
- the DTA 110 performs OCR 310 upon the driver’s license 305 to extract text 320.
- the DTA 110 identifies text in the driver’s license 305, places bounding boxes around each text chunk 315, and writes the text contained in each bounding box into a text block 325.
- the DTA 110 tracks the pixels contained by each bounding box 330.
- the DTA 110 performs named entity recognition 335 upon the text block, e.g., using the first machine learning model. This produces, for each text chunk and each class, a probability that the text chunk belongs to the class. This is illustrated with each text chunk being associated with its highest probability class 340, such as “H12345678” having a highest probability value for being a license number, at 0.83 or 83%.
- the DTA 110 generates heatmaps 345 using the probability values and the text block. For example, for a “State” heatmap 355, the DTA 110 identifies a set of pixels in the heat map corresponding to the pixels contained by the bounding box for the text chunk “Hawaii”. As the heat map is a channel with the same number of pixels as the document image, the Xth pixel in the document image is the Xth pixel in the heat map. Each pixel in the heat map identified as corresponding to a pixel in the bounding box is assigned the probability value of the respective text chunk. In this example, the text chunk is “Hawaii”, so each of the corresponding pixels in the “state” heat map is assigned the probability value 0.9.
- the DTA 110 repeats this for one or more bounding boxes for each heat map in the set of heat maps 350.
- the text chunk “Joseph Example” has a probability value of 0.07 for the “state” class, and so the pixels in the state heat map corresponding to the bounding box for “Joseph Example” are assigned the value 0.07.
- FIG. 4 is a flowchart illustrating a process for autonomous document image transcription, according to one embodiment.
- the DTA 110 receives 410 an image of a document from the source 130.
- the DTA 110 performs 415 optical character recognition upon the image of the document. This produces a text block including one or more text chunks, each corresponding to a set of pixels in the document image.
- the set of pixels may be specified in terms of a bounding box that contains the pixels representing the text of the text chunk in the document image.
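The OCR output described above — a text block of text chunks, each with a bounding box — might be represented as follows. This is a minimal illustrative sketch; the `TextChunk` structure and all coordinates are hypothetical, not part of the disclosure.

```python
from dataclasses import dataclass

@dataclass
class TextChunk:
    """One OCR-extracted chunk: its text plus the bounding box
    (pixel coordinates) that contains it in the document image."""
    text: str
    x0: int  # left edge
    y0: int  # top edge
    x1: int  # right edge (exclusive)
    y1: int  # bottom edge (exclusive)

# A text block is an ordered list of chunks, e.g. for the
# driver's license example (coordinates made up):
text_block = [
    TextChunk("Hawaii", 40, 20, 160, 50),
    TextChunk("H12345678", 40, 60, 220, 90),
    TextChunk("Joseph Example", 40, 100, 280, 130),
]
```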
- the DTA 110 determines 420 one or more heat maps corresponding to one or more classes of data in the image of the document using a first machine learning model.
- the DTA 110 applies the text block to the first machine learning model to generate probabilities that each text chunk corresponds to each class.
- the DTA 110 uses the generated probabilities to generate heat maps for each class, where the heat map for a class stores, at each pixel corresponding to a pixel of a text chunk’s bounding box, the probability that the text chunk belongs to that class.
- the DTA 110 identifies 425 regions representing specific classes in the image of the document using a second machine learning model.
- the DTA 110 applies the one or more heat maps and the document image to the second machine learning model to generate, for each class, a set of regions where data of the respective class is predicted to exist within the document image.
- the DTA 110 generates 430 a structured data file using the identified regions.
- the DTA 110 may first perform one or more intermediate steps to pare down or otherwise modify the sets of regions.
- the DTA 110 matches, for each class, one or more of the regions of the respective set of regions to one or more text chunks from the text block, based on overlap between the regions and the bounding boxes of the text chunks.
- the DTA 110 associates the text of the matched text chunks with the class of the respective matched regions, then adds the text to the structured data file 430 in association with the classes, e.g., as attribute-value pair entries.
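The matching and assembly steps above — overlap-testing each class’s predicted regions against the OCR bounding boxes, then emitting attribute-value pairs — can be sketched as follows. This is an illustrative Python sketch; the region and chunk coordinates are hypothetical, and JSON is just one plausible structured-data format (the disclosure does not mandate one).

```python
import json

def overlap_area(a, b):
    """Pixel overlap between two boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def build_structured_file(regions_by_class, chunks):
    """regions_by_class: {class: [bbox, ...]} from the second model;
    chunks: [(text, bbox), ...] from OCR. A chunk's text is associated
    with a class when its bounding box overlaps one of that class's
    predicted regions; results become attribute-value pair entries."""
    out = {}
    for cls, regions in regions_by_class.items():
        matched = [text for text, bbox in chunks
                   if any(overlap_area(bbox, r) > 0 for r in regions)]
        if matched:
            out[cls] = " ".join(matched)
    return json.dumps(out)

# Made-up predicted regions and OCR chunks for the license example:
record = build_structured_file(
    {"state": [(30, 15, 170, 55)], "name": [(30, 95, 290, 135)]},
    [("Hawaii", (40, 20, 160, 50)),
     ("Joseph Example", (40, 100, 280, 130))],
)
```

A real system would likely also threshold the overlap (e.g., require a minimum intersection-over-union) rather than accept any nonzero overlap.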
- the DTA 110 may send the structured data file to the analyst 120.
- FIG. 5 is a block diagram illustrating components of an example machine able to read instructions from a machine-readable medium and execute them in a processor (or controller).
- FIG. 5 shows a diagrammatic representation of a machine in the example form of a computer system 500.
- the computer system 500 can be used to execute instructions 524 (e.g., program code or software) for causing the machine to perform any one or more of the methodologies (or processes) described herein.
- the machine operates as a standalone device or a connected (e.g., networked) device that connects to other machines. In a networked deployment, the machine may operate in the capacity of a server machine or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
- the machine may be a server computer, a client computer, a personal computer (PC), a tablet PC, a set-top box (STB), a smartphone, an internet of things (IoT) appliance, a network router, switch or bridge, or any machine capable of executing instructions 524 (sequential or otherwise) that specify actions to be taken by that machine.
- the example computer system 500 includes one or more processing units (generally processor 502).
- the processor 502 is, for example, a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a controller, a state machine, one or more application specific integrated circuits (ASICs), one or more radio-frequency integrated circuits (RFICs), or any combination of these.
- the computer system 500 also includes a main memory 504.
- the computer system may include a storage unit 516.
- the processor 502, memory 504 and the storage unit 516 communicate via a bus 508.
- the computer system 500 can include a static memory 506, a display driver 510 (e.g., to drive a plasma display panel (PDP), a liquid crystal display (LCD), or a projector).
- the computer system 500 may also include alphanumeric input device 512 (e.g., a keyboard), a cursor control device 514 (e.g., a mouse, a trackball, a joystick, a motion sensor, or other pointing instrument), a signal generation device 518 (e.g., a speaker), and a network interface device 520, which also are configured to communicate via the bus 508.
- the storage unit 516 includes a machine-readable medium 522 on which is stored instructions 524 (e.g., software) embodying any one or more of the methodologies or functions described herein.
- the instructions 524 may also reside, completely or at least partially, within the main memory 504 or within the processor 502 (e.g., within a processor’s cache memory) during execution thereof by the computer system 500, the main memory 504 and the processor 502 also constituting machine-readable media.
- the instructions 524 may be transmitted or received over a network 526 via the network interface device 520.
- while the machine-readable medium 522 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, or associated caches and servers) able to store the instructions 524.
- the term “machine-readable medium” shall also be taken to include any medium that is capable of storing instructions 524 for execution by the machine and that causes the machine to perform any one or more of the methodologies disclosed herein.
- the term “machine-readable medium” includes, but is not limited to, data repositories in the form of solid-state memories, optical media, and magnetic media.
Additional Considerations
- aspects of the invention such as software for implementing the processes described herein, may be embodied in a non-transitory tangible computer readable storage medium or any type of media suitable for storing electronic instructions which may be coupled to a computer system bus.
- any computing systems referred to in the specification may include a single processor or may be architectures employing multiple processor designs for increased computing capability.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- General Engineering & Computer Science (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Geometry (AREA)
- Computer Graphics (AREA)
- Character Discrimination (AREA)
- Character Input (AREA)
- Document Processing Apparatus (AREA)
- Image Analysis (AREA)
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2021226214A AU2021226214A1 (en) | 2020-02-28 | 2021-03-01 | Machine learned structured data extraction from document image |
BR112022017004A BR112022017004A2 (en) | 2020-02-28 | 2021-03-01 | STRUCTURED DATA EXTRACTION BY MACHINE LEARNING FROM DOCUMENT IMAGE |
EP21761757.0A EP4097654A4 (en) | 2020-02-28 | 2021-03-01 | Machine learned structured data extraction from document image |
CA3168501A CA3168501A1 (en) | 2020-02-28 | 2021-03-01 | Machine learned structured data extraction from document image |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202062983302P | 2020-02-28 | 2020-02-28 | |
US62/983,302 | 2020-02-28 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021171274A1 true WO2021171274A1 (en) | 2021-09-02 |
Family
ID=77462854
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2021/051702 WO2021171274A1 (en) | 2020-02-28 | 2021-03-01 | Machine learned structured data extraction from document image |
Country Status (6)
Country | Link |
---|---|
US (1) | US20210271872A1 (en) |
EP (1) | EP4097654A4 (en) |
AU (1) | AU2021226214A1 (en) |
BR (1) | BR112022017004A2 (en) |
CA (1) | CA3168501A1 (en) |
WO (1) | WO2021171274A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2721189C1 (en) | 2019-08-29 | 2020-05-18 | Общество с ограниченной ответственностью "Аби Продакшн" | Detecting sections of tables in documents by neural networks using global document context |
US11403488B2 (en) * | 2020-03-19 | 2022-08-02 | Hong Kong Applied Science and Technology Research Institute Company Limited | Apparatus and method for recognizing image-based content presented in a structured layout |
RU2760471C1 (en) * | 2020-12-17 | 2021-11-25 | АБИ Девелопмент Инк. | Methods and systems for identifying fields in a document |
US20230036217A1 (en) * | 2021-07-27 | 2023-02-02 | Pricewaterhousecoopers Llp | Systems and methods for using a structured data database and for exchanging electronic files containing unstructured or partially structered data |
US11830264B2 (en) * | 2022-01-31 | 2023-11-28 | Intuit Inc. | End to end trainable document extraction |
US11720605B1 (en) * | 2022-07-28 | 2023-08-08 | Intuit Inc. | Text feature guided visual based document classifier |
DE102023135247A1 (en) * | 2022-12-15 | 2024-06-20 | Carefusion 303, Inc. | EXTRACTION OF UNSTRUCTURED CLINICAL DATA ENABLED BY MACHINE LEARNING |
US11804057B1 (en) * | 2023-03-23 | 2023-10-31 | Liquidx, Inc. | Computer systems and computer-implemented methods utilizing a digital asset generation platform for classifying data structures |
US12020140B1 (en) | 2023-10-24 | 2024-06-25 | Mckinsey & Company, Inc. | Systems and methods for ensuring resilience in generative artificial intelligence pipelines |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190005322A1 (en) * | 2017-01-14 | 2019-01-03 | Innoplexus Ag | Method and system for generating parsed document from digital document |
US20190114743A1 (en) * | 2017-07-17 | 2019-04-18 | Open Text Corporation | Systems and methods for image modification and image based content capture and extraction in neural networks |
WO2019092672A2 (en) * | 2017-11-13 | 2019-05-16 | Way2Vat Ltd. | Systems and methods for neuronal visual-linguistic data retrieval from an imaged document |
US10402640B1 (en) * | 2017-10-31 | 2019-09-03 | Intuit Inc. | Method and system for schematizing fields in documents |
US20190332937A1 (en) * | 2017-03-10 | 2019-10-31 | Adobe Inc. | Recurrent neural network architectures which provide text describing images |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070094296A1 (en) * | 2005-10-25 | 2007-04-26 | Peters Richard C Iii | Document management system for vehicle sales |
US11080808B2 (en) * | 2017-12-05 | 2021-08-03 | Lendingclub Corporation | Automatically attaching optical character recognition data to images |
2021
- 2021-03-01 AU AU2021226214A patent/AU2021226214A1/en not_active Abandoned
- 2021-03-01 CA CA3168501A patent/CA3168501A1/en active Pending
- 2021-03-01 EP EP21761757.0A patent/EP4097654A4/en active Pending
- 2021-03-01 WO PCT/IB2021/051702 patent/WO2021171274A1/en unknown
- 2021-03-01 US US17/188,339 patent/US20210271872A1/en not_active Abandoned
- 2021-03-01 BR BR112022017004A patent/BR112022017004A2/en not_active Application Discontinuation
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190005322A1 (en) * | 2017-01-14 | 2019-01-03 | Innoplexus Ag | Method and system for generating parsed document from digital document |
US20190332937A1 (en) * | 2017-03-10 | 2019-10-31 | Adobe Inc. | Recurrent neural network architectures which provide text describing images |
US20190114743A1 (en) * | 2017-07-17 | 2019-04-18 | Open Text Corporation | Systems and methods for image modification and image based content capture and extraction in neural networks |
US10402640B1 (en) * | 2017-10-31 | 2019-09-03 | Intuit Inc. | Method and system for schematizing fields in documents |
WO2019092672A2 (en) * | 2017-11-13 | 2019-05-16 | Way2Vat Ltd. | Systems and methods for neuronal visual-linguistic data retrieval from an imaged document |
Non-Patent Citations (1)
Title |
---|
See also references of EP4097654A4 * |
Also Published As
Publication number | Publication date |
---|---|
US20210271872A1 (en) | 2021-09-02 |
BR112022017004A2 (en) | 2022-10-11 |
EP4097654A4 (en) | 2024-01-31 |
EP4097654A1 (en) | 2022-12-07 |
CA3168501A1 (en) | 2021-09-02 |
AU2021226214A1 (en) | 2022-09-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20210271872A1 (en) | Machine Learned Structured Data Extraction From Document Image | |
RU2699687C1 (en) | Detecting text fields using neural networks | |
US11816710B2 (en) | Identifying key-value pairs in documents | |
US20200004815A1 (en) | Text entity detection and recognition from images | |
CN111615702B (en) | Method, device and equipment for extracting structured data from image | |
US20180189950A1 (en) | Generating structured output predictions using neural networks | |
CN111488732B (en) | Method, system and related equipment for detecting deformed keywords | |
US11687704B2 (en) | Method, apparatus and electronic device for annotating information of structured document | |
CN115631205B (en) | Method, device and equipment for image segmentation and model training | |
US11972625B2 (en) | Character-based representation learning for table data extraction using artificial intelligence techniques | |
US20220408155A1 (en) | System and method for providing media content | |
CN113837157B (en) | Topic type identification method, system and storage medium | |
WO2022100401A1 (en) | Image recognition-based price information processing method and apparatus, device, and medium | |
US20230115091A1 (en) | Method and system for providing signature recognition and attribution service for digital documents | |
CN114661904A (en) | Method, apparatus, device, storage medium, and program for training document processing model | |
CN113011410A (en) | Training method of character recognition model, character recognition method and device | |
CN112801030B (en) | Target text region positioning method and device | |
US11763589B1 (en) | Detection of blanks in documents | |
US20240249084A1 (en) | Servers, systems and methods for machine translation as a computerized and/or network-provided service | |
US20240289551A1 (en) | Domain adapting graph networks for visually rich documents | |
US20240185339A1 (en) | Enabling Electronic Loan Documents | |
CN111373416B (en) | Enhancing neural network security through discrete neural network input | |
CA3232796A1 (en) | Enabling electronic loan documents | |
CN117876751A (en) | Image processing method, image processing system, and computer readable medium | |
CN116975191A (en) | Entity relation extraction method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21761757 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 3168501 Country of ref document: CA |
|
ENP | Entry into the national phase |
Ref document number: 2021761757 Country of ref document: EP Effective date: 20220831 |
|
REG | Reference to national code |
Ref country code: BR Ref legal event code: B01A Ref document number: 112022017004 Country of ref document: BR |
|
ENP | Entry into the national phase |
Ref document number: 2021226214 Country of ref document: AU Date of ref document: 20210301 Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 112022017004 Country of ref document: BR Kind code of ref document: A2 Effective date: 20220825 |