US10489644B2 - System and method for automatic detection and verification of optical character recognition data - Google Patents

System and method for automatic detection and verification of optical character recognition data

Info

Publication number
US10489644B2
Authority
US
United States
Prior art keywords
text
layer
ocr
detected
normalized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US16/047,346
Other versions
US20190286896A1
Inventor
David Wyle
Srinivas Lingineni
Will Hosek
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sureprep LLC
Original Assignee
Sureprep LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US15/922,821 (external priority, US10489645B2)
Application filed by Sureprep LLC
Priority to US16/047,346 (US10489644B2)
Assigned to SUREPREP, LLC (assignment of assignors' interest; see document for details). Assignors: HOSEK, WILL; LINGINENI, SRINIVAS; WYLE, DAVID
Publication of US20190286896A1
Priority to US16/659,193 (US11232300B2)
Application granted
Publication of US10489644B2
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40: Document-oriented image-based pattern recognition
    • G06V30/41: Analysis of document content
    • G06V30/416: Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G06K9/00469
    • G06K9/42
    • G06K2209/015
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V2201/01: Solutions for problems related to non-uniform document background
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Definitions

  • This specification relates to a system and a method for automatically detecting and verifying data obtained by optical character recognition performed on a native digital document.
  • Optical character recognition (OCR) is an electronic conversion of images of text into machine-encoded text.
  • OCR is necessarily rooted in computer technology.
  • OCR is performed on a scanned or photographed document to detect the text of the document.
  • the text may be selected, searched, or edited by software executed by a computer.
  • OCR may be susceptible to errors, particularly when the image of the document is of poor quality.
  • the lowercase letter “l” may be detected by OCR when the document has a lowercase letter “i” or the number 1. These errors may prevent OCR from being used reliably to process documents efficiently where accuracy is important.
  • the method includes obtaining a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters.
  • the method also includes obtaining normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document.
  • the method also includes determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document.
  • the method also includes applying the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text.
  • the method also includes applying normalization processing to the detected text in the text layer to generate normalized text-layer text.
  • the method also includes rendering only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
  • a method for automatically verifying text detected by optical character recognition includes receiving a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters.
  • the method also includes receiving, from an OCR device, normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document.
  • the method also includes determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document.
  • the method also includes applying, by the verification device, the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text.
  • the method also includes applying, by the verification device, normalization processing to the detected text in the text layer to generate normalized text-layer text.
  • the method also includes producing only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
  • a computer program product embodied on a computer readable storage medium for processing native digital document information to produce a native digital document record corresponding to the native digital document information includes computer code for detecting image-layer data from an image layer of the native digital document.
  • the computer program product also includes computer code for receiving location data associated with the detected image-layer data.
  • the computer program product also includes computer code for detecting text-layer data associated with the received location data.
  • the computer program product also includes computer code for comparing the detected image-layer data and the text-layer data.
  • the computer program product also includes computer code for using the comparison of the detected image-layer data and the text-layer data to enhance the native digital document record by indicating whether the detected image-layer data matches the detected text-layer data.
  • FIG. 1 illustrates an example native digital document, according to embodiments of the invention.
  • FIG. 2 illustrates a process diagram of a workflow between various components of the system, according to embodiments of the invention.
  • FIG. 3A illustrates the image layer of the native digital document of FIG. 1 , according to embodiments of the invention.
  • FIG. 3B illustrates the text layer of the native digital document of FIG. 1 , according to embodiments of the invention.
  • FIG. 4 illustrates an example output XML file of the text and location detected by optical character recognition, according to embodiments of the invention.
  • FIG. 5A illustrates the image layer of a native digital document, according to embodiments of the invention.
  • FIG. 5B illustrates the text layer of the native digital document of FIG. 5A , according to embodiments of the invention.
  • FIG. 6 illustrates an example output XML file of the text and location detected by optical character recognition, according to embodiments of the invention.
  • FIG. 7 illustrates an example user interface output after the system has traversed the native digital document, according to embodiments of the invention.
  • FIG. 8 illustrates an example system for automatically verifying text detected by optical character recognition, according to embodiments of the invention.
  • FIG. 9 illustrates a flow diagram of a process of automatically verifying text detected by optical character recognition, according to embodiments of the invention.
  • text may refer to letters, numbers, symbols, or any other character that may be read by a user.
  • a non-native digital document is one which is created based on a scan or photograph of a physical document and has only an image layer, whereas a native digital document is one which is created by a computer program and includes both an image layer and a text layer.
  • Optical character recognition has conventionally been used in detecting text or numbers in digital representations of physical documents.
  • a user may scan or photograph a physical document to create a digital representation of the document (i.e., a non-native digital document).
  • the non-native digital document may be comprised of a matrix of computer-renderable pixels having various color values, and this non-native digital document has an image layer only.
  • Optical character recognition software is capable of detecting text contained in the non-native digital document based on an analysis of the pixels of the digital document.
  • a text layer may be added to the image layer of the digital document, so that the document may be searchable, and parts of the text may be copied and pasted to another computer application.
  • the text layer may comprise computer-readable encodings of a sequence of characters representing the characters in the matrix of computer-renderable pixels which make up the image layer.
  • a physical document that is a letter may be scanned by a digital scanner, and the scanner may create a non-native digital document (e.g., a PDF) with an image layer.
  • the image layer may have a matrix of pixels each having a color value.
  • a computer may receive the non-native digital document and perform optical character recognition on the non-native digital document using optical character recognition software.
  • the optical character recognition software detects all of the text on the non-native digital document based on an analysis of the pixels in the image layer.
  • the optical character recognition software may add a text layer to the non-native digital document containing all of the detected text of the non-native digital document so that the digital document now contains an image layer and a text layer.
  • the image layer may have pixels in an arrangement corresponding to the letters of “Dear Arnold,” and the text layer may contain text data corresponding to the detected letters of “Dear Arnold,” that were detected by the optical character recognition software.
  • this digital document remains a non-native digital document, as it did not have its text layer upon creation of the non-native digital document.
  • the OCR-generated text may be searched, selected, or even edited by computer software.
  • conventional optical character recognition software may be error-prone, and in many cases, a human being may review the results of the optical character recognition to determine whether the optical character recognition software has accurately detected the text or numbers on the physical document. For example, when the original physical document contains the text “illustrations of St. Mark's Square,” the optical character recognition software may instead detect “Illustra1on5 of St, Mark;s 5quare” because of a low-quality scan or a stray mark on the physical page of the physical document. Now, when a user conducts a search of the digital document for “illustration” or “Mark's,” the user will not be provided with the correct result.
  • Some digital documents may be created using computer software (i.e., native digital documents), and not scans or photographs from physical documents. These native digital documents may commonly be in the Portable Document Format (PDF) developed by Adobe®. While PDF is described herein, any digital document format having an image layer and a text layer may be used by the systems and methods described herein. These native digital documents may be created by an originating computer software and converted or output into the digital document format. For example, a user may create a document using Microsoft Word™ and output the document as a PDF. The output PDF may contain an image layer and a text layer. The image layer may be displayed on a display of the computer, and the text layer may be used to search the document or select parts of the text within the document, for example.
  • these digital documents created using computer software and not created based on a physical document may be referred to herein as “native digital documents” or “true digital documents.”
  • a PDF created using computer software and not based on a scan of a physical document may be referred to as a “native PDF” or a “true PDF.”
  • These native digital documents have a text layer created from a firsthand input of the intended text from the originating computer software. Accordingly, these native digital documents do not require optical character recognition to detect the text included in the document. Thus, these native digital documents have text in the text layer that is more reliable than a digital document created based on a scan of a physical document and having optical character recognition performed on it (i.e., a “non-native digital document”).
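  • As an illustration of how directly the text layer of a native digital document can be read, the following is a minimal Python sketch; PyMuPDF is assumed as the PDF toolkit, since the specification does not prescribe any particular library.

```python
# Minimal sketch: read the text layer of a native PDF directly, without OCR.
# PyMuPDF ("fitz") is an assumed toolkit; the specification does not name one.
import fitz  # pip install pymupdf

def read_text_layer(pdf_path: str) -> str:
    """Return the raw text layer of every page as one string."""
    doc = fitz.open(pdf_path)
    pages = [page.get_text() for page in doc]  # empty strings for image-only pages
    doc.close()
    return "\n".join(pages)

def looks_native(pdf_path: str) -> bool:
    """Heuristic: a native PDF has a non-empty text layer."""
    return bool(read_text_layer(pdf_path).strip())
```

  • As the paragraphs below note, the string returned this way is one large, loosely formatted block of text, which is why the location-based lookup described later is needed.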
  • FIG. 1 is a portion of a completed tax form 100 .
  • the tax form 100 may be a native digital document created from a computer program, such as payroll management software, and the native digital document includes a text layer and an image layer.
  • the image layer may be viewable by the user on a computer display
  • the text layer may be an underlying layer containing all of the text in the document. If the user would like to extract the data from the document, the user could select all of the text (made available by the text layer), and copy and paste the text into another digital document (e.g., a DOC file) or digital data recording system (e.g., a tax return preparation software).
  • copying all of the data in the text layer may provide an output of all of the text in the page, without regard for the blocked formatting of the fields. Copying the entire text layer may result in a string of characters similar to:
  • the spacing and formatting of the data within the document provides a challenge in being able to readily use the text from the text layer.
  • this data has been manually processed by a human being using a computer.
  • the human being may select the values of interest in the native digital document and copy and paste the values to another computer program.
  • the human being may manually use a computer mouse to: (1) select the Payer's name and address information, line by line, (2) copy the text to a short-term memory (e.g., a clipboard of the operating system), and (3) paste the text to another computer software (e.g., a tax return preparation program).
  • this manual process is tiresome and error-prone.
  • a human being may select the wrong text, not enough text, or may have errors when executing the copy and paste functions of the computer.
  • correctly identifying the correct text to be copied is also a potential source of errors, as a human being may incorrectly copy the text from an adjacent field (e.g., “Early withdrawal penalty”) when intending to copy the text from an intended field (e.g., “Interest income”).
  • the computer system may conduct a local search within a particular number of pixels of the known location of “Interest income” and optical character recognition may be performed on the detected local pixels where text corresponding to “Interest income” is likely to be found.
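  • A minimal sketch of such a local search follows, assuming Pillow for image handling and a generic ocr() callable standing in for the OCR engine (both are assumptions; the specification names neither a library nor an engine).

```python
# Sketch: run OCR only on a local pixel window near the known location of a
# field label such as "Interest income". Pillow and the ocr() callable are
# assumptions; the specification does not prescribe a library or OCR engine.
from PIL import Image

def ocr_near(image_path, label_box, margin=150, ocr=lambda region: ""):
    """label_box is (left, top, right, bottom) in pixels for the field label.
    The window is widened by `margin` pixels to the right and below, where
    the field's value is likely to appear."""
    page = Image.open(image_path)
    left, top, right, bottom = label_box
    window = page.crop((left, top,
                        min(right + margin, page.width),
                        min(bottom + margin, page.height)))
    return ocr(window)  # OCR runs only on the local region
```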
  • the systems and methods described herein provide solutions to the errors involved in conventional optical character recognition systems, human processing, and human review systems by using native digital documents, as further described herein.
  • the present invention overcomes many of the deficiencies of the conventional methods and systems and obtains its objectives by providing an integrated method embodied in computer software for use with a computer for the rapid, efficient, and automatic verification of text in a native digital document, thereby allowing for rapid, accurate, and efficient data verification of digital documents. Accordingly, it is the objective of the invention to improve verification of the text of native digital documents, which is integrated with computers for producing accurate verification of digital documents.
  • FIG. 2 illustrates a process diagram of a workflow 200 between various components of the system, according to some embodiments of the invention.
  • the workflow begins at Step 1 with a user 202 uploading a document 204 to a system 206 using the systems and methods described herein.
  • the document 204 may be automatically retrieved from a third-party server or automatically communicated by a third-party server.
  • the document 204 is a form having a plurality of fields, and the system 206 may be configured to determine and use the data corresponding to each of the fields in the document 204 .
  • the fields within the document 204 and the identifiers used to label each of the fields in the document 204 may be known to the system 206 before the document 204 is analyzed.
  • the user 202 may be a taxpayer or an automated document retrieval system
  • the document 204 may be a tax document provided to the taxpayer by an employer or a bank
  • the system 206 may be a tax document automation software (or scan and populate software) used by the taxpayer or tax preparer to automate the organization of, and data entry from, the taxpayer source documents.
  • the document 204 may contain 25 different fields at different locations in the document 204 .
  • the 25 different fields may have unique field identifiers, such as “Payer's name,” “Interest income,” “Early withdrawal penalty,” and “Payer's federal identification number,” for example, and each field may have a corresponding value to be determined.
  • the document 204 may be a native digital document, such as a native PDF, having a text layer and an image layer, or may be a non-native digital document having only an image layer.
  • the document 204 is analyzed and optical character recognition is performed on the image layer of the document 204 .
  • the optical character recognition may be performed by an optical character recognition device 208 configured to perform optical character recognition on a given document.
  • the optical character recognition process results in data associated with the given document being detected, such as the values of various fields (e.g., $259.54).
  • the optical character recognition process further processes the detected data associated with the document.
  • the further processing of the documents may include one or more business rules being applied to the detected data.
  • the detected data may include currency symbols, such as $.
  • the currency symbols may not be important to store in memory, as the database may already associate particular detected data values as being in corresponding units, such as dollars.
  • the system may be instructed that data detected within the box associated with “Interest income” will be dollar amounts.
  • the detected data may be a number with a decimal point, and the number may be rounded to the closest whole number (e.g., $97 for $97.22 and $98 for $97.99).
  • the preciseness of the detected data may not have consequence, such as for tax preparation purposes, and computing resources may be saved by transforming the detected data and rounding to the nearest whole number.
  • the optical character recognition may detect multiple pieces of information in a single field.
  • the optical character recognition software may detect DYK 100 from a single field.
  • this detected data may include the stock symbol (DYK) as well as the number of shares sold (100).
  • particular data types (e.g., numbers or letters) may be expected for particular fields. For example, the system may associate any detected letters with the stock symbol, and may associate any numbers with the number of shares sold. In this way, even if stray letters are detected by the optical character recognition in a field associated with a number value, the stray letters may be ignored and may not affect what value is detected, as the system may identify only the numbers from the detected data; a sketch of this processing appears below.
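  • The following is a minimal Python sketch of this normalization (“business rule”) processing; the rule set and the expected_type labels are illustrative assumptions drawn from the examples above, not a definitive implementation.

```python
import re

def normalize(raw: str, expected_type: str = "amount") -> str:
    """Apply the same business rules to OCR-detected and text-layer text.
    expected_type ("amount", "letters", "digits") is an illustrative label."""
    text = raw.strip()
    if expected_type == "amount":
        text = text.replace("$", "").replace(",", "")  # drop currency formatting
        try:
            return str(round(float(text)))             # $97.22 -> 97, $97.99 -> 98
        except ValueError:
            return text                                # leave non-numeric text alone
    if expected_type == "letters":
        return "".join(re.findall(r"[A-Za-z]", text))  # "DYK 100" -> "DYK"
    if expected_type == "digits":
        return "".join(re.findall(r"\d", text))        # "DYK 100" -> "100"
    return text
```

  • Applying the identical function to both layers keeps the values comparable; for example, $27.05 detected by OCR and $27.05 read from the text layer both normalize to 27.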
  • At Step 3, it is determined whether the document 204 is a native digital document.
  • An analysis of the metadata associated with the document 204 or an analysis of the content or structure of the document 204 may be performed to determine whether the document 204 is a native digital document.
  • When the document 204 is not a native digital document, optical character recognition is used to detect the text within the various fields of the document 204, and a human being 216 manually verifies at Step 4 that the correct text was detected by the optical character recognition.
  • the image layer is analyzed to determine a location of text corresponding to a given field.
  • the identifier of the given field may be considered search text corresponding with the target text (or sought-after text). For example, in FIG. 1, “Interest income” may be the search text and “$259.54” may be the target text.
  • a list of search text may be provided to the optical character recognition device 208 by the system 206 .
  • Coordinates corresponding to the location of the target text are determined (Step 3(a)), as well as an OCR-based detection of the target text. These steps are all performed on the image layer of the document 204. These coordinates may be pixel-based and computer-interpretable, such that a human would be unable to determine the location of the target text based on the coordinates alone.
  • At Step 3(b), the location of the target text in the text layer is determined based on the coordinates corresponding to the location of the target text in the image layer.
  • Because the coordinates corresponding to the location of the target text in the image layer are in terms of pixels, a pixel transformation converting the pixels to another mapping convention (e.g., dots or points) may be applied when the text layer uses a different convention, as sketched below.
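  • One possible pixel transformation is sketched below, assuming the image layer was rasterized at a known resolution and the text layer is addressed in PDF points (72 per inch); the 300 DPI figure is an assumption, since the specification only requires that some pixel-to-point mapping exist.

```python
def pixels_to_points(box_px, dpi=300):
    """Convert a (left, top, right, bottom) box from image-layer pixels to
    PDF points (1 point = 1/72 inch). The 300 DPI value is an assumption."""
    scale = 72.0 / dpi
    return tuple(edge * scale for edge in box_px)

# Example: a pixel box from the OCR output mapped into the text layer's units.
left, top, right, bottom = pixels_to_points((1650, 505, 1930, 560))
```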
  • the detected data from the text layer is processed in a similar manner as the data detected from the image layer using optical character recognition.
  • the detected data from the text layer may include currency symbols, such as $, which may be removed.
  • the detected data may be a number with a decimal point, and the number may be rounded to the closest whole number (e.g., $97 for $97.22 and $98 for $97.99).
  • the processing performed on the detected data from the image layer is also performed on the detected data from the text layer in order to maintain a normalization of data. For example, when the system processes the detected data from the image layer by removing currency symbols and rounding to the closest whole number, the detected data from the text layer is also processed by the system in the same way. Otherwise, the detected data from the image layer may not match the detected data from the text layer, even when the optical character recognition had properly detected the data. For example, a detected value on a particular native digital document may be $27.05.
  • the image layer detected data (detected by optical character recognition) may be $27.05, which is processed by removing the currency symbol and rounding to the nearest whole integer, resulting in a detected and processed value of 27.
  • the text layer detected data may also be $27.05.
  • Without the same processing applied to both layers, the system may wrongly determine that the OCR-based detected text does not match the text-layer text because the values (e.g., 27 and $27.05) do not match.
  • With the same processing applied to both layers, the system may correctly determine that the values match (e.g., 27 and 27).
  • The detected text from the text layer is then compared with the OCR-based detected text from the image layer; in some embodiments, it is the detected and processed (normalized) text from each layer that is compared.
  • In some embodiments, when the compared values do not match, the document 204 is reviewed by a human being 216 at Step 4.
  • In other embodiments, the detected text from the text layer is automatically used instead of the detected text from the image layer, with no human review.
  • the detected text from the document is output at Step 5 .
  • the detected text is output to another software, such as a spreadsheet software, a database software, or a tax preparation software.
  • the detected text is stored in a file on a memory of the system.
  • the detected text is rendered by a computing device and shown on a display screen of the system.
  • FIG. 3A illustrates the image layer 300 of the same native digital document shown in FIG. 1 after an optical character recognition process has been performed on the native digital document 302 to identify a location in the image layer of the target text.
  • the optical character recognition process may be performed by an optical character recognition device, as described herein.
  • the target text 310 is associated with a search text 308 .
  • the search text 308 is known by the optical character recognition device and provided to the optical character recognition device.
  • the text value and the location of the search text 308 may be known by the optical character recognition device.
  • the optical character recognition device may use the known text value and location of the search text 308 to locate the target text 310 .
  • the search text 308 is “Interest income” and the target text 310 is “$259.54.”
  • the optical character recognition device may locate the target text 310 by defining a search area based on the search text 308 and one or more separators present on the document.
  • the optical character recognition device identifies data by separating the spaced text in the document into tables.
  • the optical character recognition device locates all of the header text and generates columns based on the respective header text.
  • the optical character recognition device then defines a footer element to restrict the table by using a text or separator element.
  • the optical character recognition device is able to detect the data for each respective row based on the location of the determined columns.
  • the optical character recognition device detects text appearing multiple times in the document.
  • the optical character recognition device may achieve this by locating the header text and capturing unique data appearing multiple times in the document. Once the unique data is captured, based on the unique element, other required information may be detected by taking the right and left search area of the respective header.
  • the optical character recognition device identifies data conforming to a standardized format, such as XXX-XX-XXXX for a social security number, to identify the target text 310.
  • the optical character recognition device may know the text value of the target text expected to be on the document based on historical data associated with the document 302 or the user associated with the document 302 , and identifies target text 310 that is within a particular percentage match of the expected text value.
  • the optical character recognition device determines location coordinates associated with the location of the target text 310 on the image layer of the native digital document 302 .
  • the location of the target text 310 is represented by a four-sided box 330 surrounding the target text 310, and the coordinates associated with the location of the target text 310 may be a set of four pixel values, each representing a respective distance from an edge of the native digital document to an edge of the box 330.
  • the top edge 312 of the box 330 surrounding the target text 310 is a distance 322 away from the top of the digital document 302 .
  • the bottom edge 314 of the box 330 surrounding the target text 310 is a distance 324 from the top of the digital document 302 .
  • the left edge 316 of the box 330 surrounding the target text 310 is a distance 326 from the left of the digital document 302 .
  • the right edge 318 of the box 330 surrounding the target text 310 is a distance 328 from the left of the digital document 302 .
  • the coordinate system illustrated herein is merely illustrative, and any system of locating the box 330 in the two-dimensional plane of the image layer of the document 302 may be used.
  • the optical character recognition device may output those location coordinates.
  • the optical character recognition device may also detect the target text 310 using optical character recognition on the image layer of the native digital document 302 , and output this OCR-detected target text value.
  • These outputs of the optical character recognition device may be in the form of an Extensible Markup Language (XML) file or any other file for communicating metadata.
  • FIG. 3B illustrates the text layer 350 of the document 302. Also shown is the box 330, which is not a part of the text layer 350.
  • the text layer contains only the text of the document 302 , and is accurate because the document 302 is a native digital document created by computer software.
  • conventionally, however, a human being is required to select the appropriate text data from the text layer to output or export the text data, as the text of the text layer is essentially one large, unformatted string of text.
  • the automatically determined box of the target text may be overly large and may wrongly include other text in addition to the desired text.
  • the box 330 may also include “alty” which is part of the word “penalty” below the target text.
  • the target text of $259.54 may be distinguished from the stray letters “alty” and correctly identified because the system associates the target text of the “Interest income” field as being a numerical value, not a letter value.
  • the system may use associated value types to ensure that stray text or numbers are not incorporated into the detected data.
  • FIG. 4 illustrates an example output XML file 400 corresponding to the target text 310 .
  • the output XML file 400 is output by the optical character recognition device.
  • the XML file 400 includes OCR-detected text 402 based on optical character recognition of the target text 310 .
  • the XML file 400 also includes a left value 404 corresponding to the distance 326 from the left of the digital document 302 to the left side 316 of the box 330 , a top value 406 corresponding to the distance 322 from the top of the digital document 302 to the top side 312 of the box 330 , a right value 408 corresponding to the distance 328 from the left of the digital document to the right side 318 of the box 330 , and a bottom value 410 corresponding to the distance 324 from the top of the digital document to the bottom side 314 of the box 330 .
  • the values 404 , 406 , 408 , and 410 may be in pixels or any other unit of measuring a distance on the digital document 302 .
  • the OCR-detected text 402 is “5,259.54” but the actual target text is “$259.54.”
  • a human being may, at this point, compare the detected text 402 to the target text 310 and overwrite or correct the detected text 402 to correctly read “$259.54.”
  • the execution of this verification step by a human being is prone to error, and the systems and methods described herein provide an automatic way to verify whether the detected text 402 is accurate, and an automatic way to determine the correct value of the target text 310 .
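  • A minimal sketch of reading such an OCR output file with Python's standard xml.etree module follows; the element names (Field, Text, Left, Top, Right, Bottom) are hypothetical, since the specification does not reproduce the XML schema of FIG. 4.

```python
import xml.etree.ElementTree as ET

def parse_ocr_xml(xml_path):
    """Yield (detected_text, (left, top, right, bottom)) pairs from an OCR
    output XML. All element names here are hypothetical placeholders."""
    root = ET.parse(xml_path).getroot()
    for field in root.iter("Field"):
        text = field.findtext("Text", default="")
        box = tuple(float(field.findtext(side, default="0"))
                    for side in ("Left", "Top", "Right", "Bottom"))
        yield text, box
```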
  • a verification device receives the OCR-detected text 402 , the left value 404 , the top value 406 , the right value 408 , and the bottom value 410 .
  • the left value 404 , the top value 406 , the right value 408 , and the bottom value 410 represent the location of the target text 310 within the image layer of the native digital document.
  • the verification device receives this data in an XML file, as shown in FIG. 4 .
  • the verification device may convert the left value 404 , the top value 406 , the right value 408 , and the bottom value 410 to respective text layer values, if the text layer has a different coordinate or measurement system than the image layer.
  • the verification device may perform a pixel transformation sequence to convert the values 404 - 410 in pixels to another digital document mapping convention, such as dots or points.
  • the transformation sequence may not be performed by an individual human without using a computing device because the digital document mapping systems are not replicable on a physical document.
  • the units of pixels, dots, or points may not be accurately translatable to a physical document, and a physical document may be incapable of representing the computer-specific concepts of pixels, dots, or points.
  • the digital document mapping conventions used for the native digital document may be more precise than a human being is capable of being.
  • the same values corresponding to the location of the target text 310 in the image layer may be used for the text layer.
  • the verification device After the verification device determines the location of the target text 310 on the text layer of the native digital document, the verification device detects the text value of the text layer at that location. Referring back to FIG. 3B , the text value of the text layer is shown in box 330 . In this case, the text layer has a text value of “$259.54” at the location of the target text 310 . The text value detected from the text layer ($259.54) is compared against the text value 402 from the image layer ( 5 , 259 . 54 ), as detected by the optical character recognition device, and the verification device determines that the text values do not match.
  • In some embodiments, when the text values do not match, the entire document is flagged for review by a human being. In other embodiments, when the text values do not match, the text value detected from the text layer is used, and the text value 402 from the image layer is discarded or disregarded.
  • the text value detected from the text layer may be output by the verification device. In some embodiments, the text value detected from the text layer is output to another computer software, such as tax preparation software or patient management software. In some embodiments, the text value detected from the text layer is rendered by a computing device and displayed on a display screen for the user to view. In some embodiments, the text value detected from the text layer is saved in a database on a non-volatile memory connected to the verification device.
  • In some embodiments, when the text values exceed a threshold percentage of similarity (e.g., 70%, 80%, 85%, or 90% similar), the text from the text layer is used, and when the text values do not exceed the threshold percentage of similarity, an alert is generated and a human being may review the document manually; a sketch of this comparison appears below.
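  • A minimal sketch of this comparison using the Python standard library's difflib follows; the 0.85 threshold corresponds to the 85% example above, and both inputs are assumed to have already been normalized with the same business rules.

```python
from difflib import SequenceMatcher

def compare_layers(ocr_text: str, layer_text: str, threshold: float = 0.85):
    """Return (use_text_layer, similarity) for two already-normalized values."""
    similarity = SequenceMatcher(None, ocr_text, layer_text).ratio()
    return similarity >= threshold, similarity

# Identical values match exactly; near matches still allow the text-layer value
# to be used, while low similarity would trigger an alert for manual review.
use_layer, score = compare_layers("259.54", "259.54")                 # True, 1.0
use_layer, score = compare_layers("BIG OOMPANY A", "BIG COMPANY A")   # high similarity
```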
  • the process may be repeated on the document until all of the desired text on the document is detected.
  • the desired text on the document to be detected may be identified by a user.
  • the identification may be a list of names of the values (e.g., Payer Name or Interest Income) and associated search text or associated locations on the document where the desired text may be located.
  • FIG. 5A illustrates a portion of the native digital document shown in FIG. 1 after an optical character recognition process has been performed on the native digital document 502 to identify a location of the target text 506 in the image layer 500 .
  • the optical character recognition process may be performed by an optical character recognition device, as described herein.
  • the target text 506 is associated with a search text 504 .
  • the search text 504 is known by the optical character recognition device and provided to the optical character recognition device.
  • the text value and the location of the search text 504 may be known by the optical character recognition device.
  • the optical character recognition device may use the known text value and location of the search text 504 to locate the target text 506 .
  • the search text 504 is “PAYER'S name” and the target text 506 is “BIG COMPANY A.”
  • the optical character recognition device may locate the target text 506 by defining a search area based on the search text 504 and one or more separators present on the document.
  • the optical character recognition device determines location coordinates associated with the location of the target text 506 on the image layer 500 of the native digital document 502 .
  • the location of the target text 506 is represented by a four-sided box 530 surrounding the target text 506
  • the coordinates associated with the location of the target text 506 may be a set of four pixel values representing a respective distance from an edge of the native digital document to an edge of the box 530 .
  • the top edge 512 of the box 530 surrounding the target text 506 is a distance 522 away from the top of the digital document 502 .
  • the bottom edge 514 of the box 530 surrounding the target text 506 is a distance 524 from the top of the digital document 502.
  • the left edge 516 of the box 530 surrounding the target text 506 is a distance 526 from the left of the digital document 502.
  • the right edge 518 of the box 530 surrounding the target text 506 is a distance 528 from the left of the digital document 502.
  • the optical character recognition device may output those location coordinates, along with a detected text of the target text 506 using optical character recognition on the image layer of the native digital document 502 .
  • These outputs of the optical character recognition device may be in the form of an Extensible Markup Language (XML) file.
  • FIG. 5B illustrates the text layer 550 of the native digital document 502 . Also illustrated is the box 530 surrounding the target text 506 . The box 530 is not a part of the text layer 550 . As described herein with respect to FIG. 3B , the text layer 550 contains only the text of the document 502 , and is accurate because the document 502 is a native digital document created by computer software. However, as described herein, conventionally a human being is required to select the appropriate text data from the text layer to output or export the text data, as the text of the text layer is essentially one large, unformatted string of text.
  • FIG. 6 illustrates an example output XML file 600 corresponding to the target text 506 .
  • the output XML file 600 is output by the optical character recognition device.
  • the XML file 600 includes a detected text 602 based on optical character recognition of the target text 506 .
  • the XML file 600 also includes a left value 604 corresponding to the distance 526 from the left of the digital document 502 to the left side 516 of the box 530 , a top value 606 corresponding to the distance 522 from the top of the digital document 502 to the top side 512 of the box 530 , a right value 608 corresponding to the distance 528 from the left of the digital document to the right side 518 of the box 530 , and a bottom value 610 corresponding to the distance 524 from the top of the digital document to the bottom side 514 of the box 530 .
  • the values 604 , 606 , 608 , and 610 may be in pixels or any other unit of measuring a distance on the digital document 502 .
  • the detected text 602 is “BIG OOMPANY A” but the actual target text is “BIG COMPANY A.”
  • a human being may, at this point, compare the detected text 602 to the target text 506 and overwrite or correct the detected text 602 to correctly read “BIG COMPANY A.”
  • the execution of this verification step by a human being is prone to error, and the systems and methods described herein provide an automatic way to verify whether the detected text 602 is accurate, and an automatic way to determine the correct value of the target text 506.
  • a verification device receives the OCR-detected text 602 , the left value 604 , the top value 606 , the right value 608 , and the bottom value 610 .
  • the left value 604, the top value 606, the right value 608, and the bottom value 610 represent the location of the target text 506 within the image layer 500 of the native digital document 502.
  • the verification device receives this data in an XML file, as shown in FIG. 6 .
  • the verification device may convert the left value 604 , the top value 606 , the right value 608 , and the bottom value 610 to respective text layer values, if the text layer has a different coordinate or measurement system than the image layer.
  • When the text layer uses the same coordinate or measurement system as the image layer, the same values corresponding to the location of the target text 506 may be used.
  • the verification device After the verification device determines the location of the target text 506 on the text layer 550 of the native digital document, the verification device detects the text value of the text layer 550 at that location. Referring back to FIG. 5B , the text value of the text layer is shown in box 530 . In this case, the text layer has a text value of “BIG COMPANY A” at the location of the target text 506 .
  • the text value detected from the text layer (BIG COMPANY A) is compared against the text value 602 from the image layer (B1G OOMPANY A), as detected by the optical character recognition device, and the verification device determines that the text values do not match.
  • In some embodiments, when the text values do not match, the entire document is flagged for review by a human being. In other embodiments, when the text values do not match, the text value detected from the text layer is used, and the text value 602 from the image layer is discarded or disregarded.
  • the text value detected from the text layer may be output by the verification device. In some embodiments, the text value detected from the text layer is output to another computer software, such as tax preparation software or patient management software. In some embodiments, the text value detected from the text layer is displayed on a display screen for the user to view. In some embodiments, the text value detected from the text layer is saved in a database on a non-volatile memory connected to the verification device.
  • the system traverses the native digital document one text item at a time to verify each of the text items detected by performing optical character recognition on the image layer. That is, in these embodiments, there is no search text, and the steps described herein are repeated as the native digital document is traversed, with the system identifying a new target text with each iteration of the steps.
  • the system may separate groups of text based on the presence of separating elements (e.g., lines or borders), based on the whitespace separating the groups of text, or based on a machine-learning-tuned automatic determination of the type of document represented by the native digital document. For example, over time, and with sufficient training data, the system may be able to recognize various types of documents and may automatically be able to identify the target text locations without being provided the search text associated with each of the target texts.
  • FIG. 7 illustrates an example user interface output after the system has traversed the native digital document 702 , according to embodiments of the invention.
  • a computing device may render a display 700 to be shown on a display screen.
  • the display 700 may be a graphical user interface showing a representation of the native digital document 702 .
  • the display 700 may be rendered based on the image layer of the native digital document.
  • the display 700 includes confirmatory indicators 704 A- 704 C and non-confirmatory indicators 706 .
  • the confirmatory indicators 704 are located adjacent to text in the native digital document where the OCR-detected text matches with the text in the text layer. For example, when the optical character recognition device detects the Payer name as “BIG COMPANY A” and the corresponding text of the text layer of the native digital document is “BIG COMPANY A”, the OCR-detected text matches the text in the text layer. Accordingly, a confirmatory indicator 704 A is rendered and displayed adjacent to the text that was confirmed.
  • the non-confirmatory indicators 706 are located adjacent to text in the native digital document where the OCR-detected text does not match the text in the text layer. For example, when the optical character recognition device detects the Interest income as being “5259.54” and the corresponding text of the text layer is “$259.54”, the OCR-detected text does not match the text in the text layer. Accordingly, a non-confirmatory indicator 706 is rendered and displayed adjacent to the text that was not confirmed.
  • the display 700 may be displayed to human reviewers reviewing data extraction from native digital documents.
  • the human reviewers were tasked with viewing extracted data from the native digital document and reviewing the image layer of the native digital document to determine whether the extracted data was accurately detected.
  • the human reviewer reviewing native digital document 702 may have had to go back and forth between the extracted data and the image layer of the native digital document to determine whether each field was properly detected by optical character recognition. This process is prone to error and extremely time consuming.
  • the human eye may not be capable of detecting some errors.
  • a current hospital of a patient may complete a form, using a computer, requesting records of the patient from a previous hospital.
  • When optical character recognition erroneously detects “Patient” with the Greek lowercase letter alpha instead of a lowercase “a” in the patient records request form, a significant delay in obtaining the records of the patient may occur, if the records can ever be obtained at all. This significant delay or inability to properly locate the patient's records may prevent the current hospital from being able to administer the best care to the patient.
  • This difference between the Greek lowercase letter alpha and the Latin lowercase letter “a” may be unrecognizable to a human being reviewing dozens of forms every hour, but is easily and readily recognized by the computing devices of the systems and methods described herein.
  • the display 700 provides a streamlined user interface for the human reviewer by indicating, using the confirmatory indicators 704 , which fields have already been confirmed, and indicating, using the non-confirmatory indicators 706 , which fields have not been confirmed.
  • This improved display 700 focuses the human reviewer on the fields that the human reviewer should manually review.
  • Without the display 700, the human reviewer may have to have two windows open on the display screen: one for the detected text values and one for the image layer of the native digital document.
  • Display 700, which may be shown on a single page of a display screen, allows the human reviewer to view the annotated image layer of the native digital document and to quickly determine which fields to manually check.
  • the human reviewer may click, using an input device such as a computer mouse, on the text adjacent to the non-confirmatory indicator 706 , and edit the OCR-detected text in real-time.
  • the human reviewer may click an icon 708 directing the system to discard the conflicting OCR-detected text and to use the text from the text layer of the native digital document for any OCR-detected text that does not match the text from the text layer.
  • the system may use the text from the text layer, disregard or delete the OCR-detected text, and not show the pages for verification if all the required fields are located and their respective text layers extracted.
  • FIG. 8 illustrates an example system 800 for automatically verifying text detected by optical character recognition.
  • the system 800 includes a verification device 802 , an optical character recognition (OCR) device 804 , a memory 806 , a user device 808 , an output device 810 , a correction device 812 , and an input device 814 .
  • any of the devices may be a separate hardware device having a processor and a non-volatile memory, the processor configured to execute instructions stored on the non-volatile memory.
  • the devices described herein may alternatively be a part of a single device having multiple software devices executed by a processor and a non-volatile memory, the processor configured to execute instructions stored on the non-volatile memory.
  • the devices described herein are special purpose machines configured to perform their respective tasks described herein.
  • the verification device 802 , the optical character recognition device 804 , the memory 806 , the output device 810 , the correction device 812 , and the input device 814 are computing modules of a single computing system having a processor and non-transitory memory.
  • the user device 808 may be a computing device communicatively coupled to the verification device 802 .
  • the user device 808 may be, for example, a smartphone, a laptop, or a tablet computer.
  • the user device 808 may have its own display and memory, and is capable of generating a native digital document.
  • the user device 808 may be a computer which has software for generating invoices or account statements in the PDF format, and the generated invoices or account statements contain an image layer and a text layer.
  • the user device 808 may communicate the generated native digital document to the verification device 802 for extraction of the text data within the native digital document.
  • the verification device 802 may provide the native digital document to the optical character recognition device 804 .
  • the optical character recognition device 804 may execute special-purpose optical character recognition software to detect text data in the image layer of the native digital document and the location of the text data in the image layer of the native digital document.
  • the verification device 802 receives the OCR-detected text and the location of the OCR-detected text from the optical character recognition device 804 , and determines the corresponding location of the text values in the text layer of the native digital document. The verification device 802 compares the text value in the text layer to the OCR-detected text, and determines whether the text values match.
  • processing is performed on both the OCR-detected text and the text values detected from the text layer.
  • the processing may include removing symbols, rounding values to the nearest whole number, or discerning value types (e.g., text or numbers).
  • the processing of the OCR-detected text may be performed by the OCR device 804 or by the verification device 802 .
  • the processing of the text values detected from the text layer may be performed by the verification device 802 .
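  • A minimal sketch of the text-layer lookup performed by the verification device follows, again assuming PyMuPDF; page.get_text("words") returns per-word boxes in PDF points, so the OCR box is assumed to have already been converted to the same units (see the pixel transformation above).

```python
import fitz  # PyMuPDF; an assumed toolkit, not named by the specification

def text_layer_value(pdf_path: str, page_number: int, box) -> str:
    """Collect text-layer words whose boxes intersect `box`
    (left, top, right, bottom, in PDF points)."""
    doc = fitz.open(pdf_path)
    page = doc[page_number]
    target = fitz.Rect(box)
    words = [w[4] for w in page.get_text("words")     # w = (x0, y0, x1, y1, word, ...)
             if fitz.Rect(w[:4]).intersects(target)]
    doc.close()
    return " ".join(words)
```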
  • the memory 806 may be a non-transitory memory configured to store multiple native digital documents, lists of search text to use for various different types of documents, or any other data described herein.
  • the memory 806 may also store the computer code used by the verification device 802 for executing the functions described herein.
  • the user device 808 , output device 810 , OCR device 804 , correction device 812 , and the input device 814 may have respective non-transitory memories storing computer code for executing the respective functions of the respective devices described herein.
  • “computer code” may refer to instructions stored in non-transitory memory to be executed by a processor of a device.
  • the output device 810 may be a display screen configured to display the results of the verification of text values between the OCR-detected text and the text detected from the text layer of the native digital document.
  • the display screen may display the image layer of the native digital document and may also display icons where the OCR-detected text was verified (e.g., a green check mark) and where the OCR-detected text was not verified (e.g., a red X mark).
  • the output device 810 may be a separate computing device executing software which collects or uses the text detected from the native digital document.
  • the output device 810 may be a computing device of a tax return preparation service, which processes tax documents received by a user, extracts the data from the tax documents, and either stores the text data or populates one or more tax-related forms based on the text data of the tax documents.
  • the output device 810 is a computing device executing database software, and the extracted data may be organized and stored by the database software.
  • the correction device 812 may render a graphical user interface to be displayed on the output device 810 .
  • the graphical user interface rendered by the correction device 812 may be similar to display 700 of FIG. 7 .
  • the correction device 812 may provide for a human review and correction of any OCR-detected text that does not match the corresponding text in the text layer.
  • the correction device 812 may receive, from an input device 814 , an indication from the user to adjust or correct the OCR-detected values to a value entered by the user or to the value of the text layer.
  • the input device 814 may be one or more of a computer mouse, a computer keyboard, a microphone, or any other device or apparatus for communicating with the system 800 .
  • FIG. 9 illustrates a flow diagram of a process 900 used by the system described herein.
  • the system receives, from a user, a native digital document having an image layer and a text layer (step 902 ).
  • the user device 808 may communicate the native digital document to the verification device 802 .
  • the native digital document is provided to an optical character recognition device 804 (step 904 ), which detects text in the image layer of the native digital document and a location of the text in the image layer (step 906 ).
  • the optical character recognition device 804 performs processing (or “normalization processing”) on the text detected from the image layer (step 908 ).
  • the processing may include removing symbols, rounding values to the nearest whole number, or discerning value types (e.g., text or numbers).
  • the verification device 802 determines a location of the text in the text layer of the native digital document based on the location received from the optical character recognition device 804 (step 910 ).
  • the text in the text layer may be considered reliable, as the document is a native digital document.
  • the verification device 802 detects the text in the text layer of the native digital document (step 912 ), performs the same normalization processing on the detected text from the text layer as was performed on the detected text from the image layer in step 908 by the optical character recognition device 804 (step 914 ), and compares the OCR-detected text to the text in the text layer of the native digital document (step 916 ).
  • When the OCR-detected text does not match the text in the text layer of the native digital document, the text from the text layer may be output (step 918), and an indication that the two values did not match may be displayed on a user interface.
  • When the OCR-detected text does match the text in the text layer of the native digital document, the text from the text layer may be output, and an indication that the two values did match may be displayed on a user interface.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

Disclosed are systems and methods for automatically verifying text of a native digital document having an image layer and a text layer. The text is detected by optical character recognition (OCR) of the image layer, and is compared to text at a corresponding location in the text layer. Normalization processing is performed on both the detected image-layer text and the text-layer text. When the image-layer text and the text-layer text do not match, the text-layer text may be used or an icon indicating that the image-layer text and the text-layer text do not match is rendered and displayed.

Description

CROSS REFERENCE TO RELATED APPLICATIONS
This application is a continuation-in-part of U.S. patent application Ser. No. 15/922,821 entitled “System and Method for Automatic Detection and Verification of Optical Character Recognition Data,” filed Mar. 15, 2018, the contents of which are herein incorporated by reference in their entirety.
BACKGROUND 1. Field
This specification relates to a system and a method for automatically detecting and verifying data obtained by optical character recognition performed on a native digital document.
2. Description of the Related Art
Optical character recognition (OCR) is an electronic conversion of images of text into machine-encoded text. Thus, use of OCR is necessarily rooted in computer technology. In its most common application, OCR is performed on a scanned or photographed document to detect the text of the document. After the text is detected using OCR, the text may be selected, searched, or edited by software executed by a computer. However, OCR may be susceptible to errors, particularly when the image of the document is of poor quality. For example, the lowercase letter “l” may be detected by OCR when the document actually contains a lowercase letter “i” or the numeral 1. These errors may prevent OCR from being reliably used to efficiently process documents, where accuracy is important. Thus, there is a need for an improved system of detecting text from a document and/or verifying the text detected using OCR.
SUMMARY
What is described is a method for automatically verifying text detected by optical character recognition (OCR). The method includes obtaining a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters. The method also includes obtaining normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document. The method also includes determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document. The method also includes applying the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text. The method also includes applying normalization processing to the detected text in the text layer to generate normalized text-layer text. The method also includes rendering only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
A method for automatically verifying text detected by optical character recognition (OCR) is also described. The method includes receiving a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters. The method also includes receiving, from an OCR device, normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document. The method also includes determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document. The method also includes applying, by the verification device, the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text. The method also includes applying, by the verification device, normalization processing to the detected text in the text layer to generate normalized text-layer text. The method also includes producing only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
A computer program product embodied on a computer readable storage medium for processing native digital document information to produce a native digital document record corresponding to the native digital document information is disclosed. The computer program product includes computer code for detecting image-layer data from an image layer of the native digital document. The computer program product also includes computer code for receiving location data associated with the detected image-layer data. The computer program product also includes computer code for detecting text-layer data associated with the received location data. The computer program product also includes computer code for comparing the detected image-layer data and the text-layer data. The computer program product also includes computer code for using the comparison of the detected image-layer data and the text-layer data to enhance the native digital document record by indicating whether the detected image-layer data matches the detected text-layer data.
BRIEF DESCRIPTION OF THE DRAWINGS
Other systems, methods, features, and advantages of the present invention will be apparent to one skilled in the art upon examination of the following figures and detailed description. Component parts shown in the drawings are not necessarily to scale, and may be exaggerated to better illustrate the important features of the present invention.
FIG. 1 illustrates an example native digital document, according to embodiments of the invention.
FIG. 2 illustrates a process diagram of a workflow between various components of the system, according to embodiments of the invention.
FIG. 3A illustrates the image layer of the native digital document of FIG. 1, according to embodiments of the invention.
FIG. 3B illustrates the text layer of the native digital document of FIG. 1, according to embodiments of the invention.
FIG. 4 illustrates an example output XML file of the text and location detected by optical character recognition, according to embodiments of the invention.
FIG. 5A illustrates the image layer of a native digital document, according to embodiments of the invention.
FIG. 5B illustrates the text layer of the native digital document of FIG. 5A, according to embodiments of the invention.
FIG. 6 illustrates an example output XML file of the text and location detected by optical character recognition, according to embodiments of the invention.
FIG. 7 illustrates an example user interface output after the system has traversed the native digital document, according to embodiments of the invention.
FIG. 8 illustrates an example system for automatically verifying text detected by optical character recognition, according to embodiments of the invention.
FIG. 9 illustrates a flow diagram of a process of automatically verifying text detected by optical character recognition, according to embodiments of the invention.
DETAILED DESCRIPTION
Disclosed herein are systems and methods for automatically detecting text or numbers in a native digital document having an image layer and a text layer. More specifically, the systems and methods described herein are an improvement to existing computer technologies for detecting text or numbers in a digital document. As used herein, the term “text” may refer to letters, numbers, symbols, or any other character that may be read by a user. A non-native digital document is one which is created based on a scan or photograph of a physical document and has only an image layer, and a native digital document is one which is created by a computer program and includes an image layer and a text layer.
Optical character recognition (OCR) has conventionally been used in detecting text or numbers in digital representations of physical documents. A user may scan or photograph a physical document to create a digital representation of the document (i.e., a non-native digital document). The non-native digital document may be comprised of a matrix of computer-renderable pixels having various color values, and this non-native digital document has an image layer only. Optical character recognition software is capable of detecting text contained in the non-native digital document based on an analysis of the pixels of the digital document. When the optical character recognition process is completed, a text layer may be added to the image layer of the digital document, so that the document may be searchable, and parts of the text may be copied and pasted to another computer application. The text layer may comprise computer-readable encodings of a sequence of characters representing the characters in the matrix of computer-renderable pixels which make up the image layer.
For example, a physical document that is a letter may be scanned by a digital scanner, and the scanner may create a non-native digital document (e.g., a PDF) with an image layer. The image layer may have a matrix of pixels each having a color value. A computer may receive the non-native digital document and perform optical character recognition on the non-native digital document using optical character recognition software. The optical character recognition software detects all of the text on the non-native digital document based on an analysis of the pixels in the image layer. The optical character recognition software may add a text layer to the non-native digital document containing all of the detected text of the non-native digital document so that the digital document now contains an image layer and a text layer. For example, if the letter contained the words “Dear Arnold,” the image layer may have pixels in an arrangement corresponding to the letters of “Dear Arnold,” and the text layer may contain text data corresponding to the detected letters of “Dear Arnold,” that were detected by the optical character recognition software. However, despite the addition of the text layer by optical character recognition, this digital document remains a non-native digital document, as it did not have its text layer upon creation of the non-native digital document.
When the optical character recognition software scans pixels to detect text of a non-native digital document, the OCR-generated text may be searched, selected, or even edited by computer software. However, depending on the quality of the scan of the physical document, conventional optical character recognition software may be error-prone, and in many cases, a human being may review the results of the optical character recognition to determine whether the optical character recognition software has accurately detected the text or numbers on the physical document. For example, when the original physical document contains the text “illustrations of St. Mark's Square” the optical character recognition software may instead detect “Illustra1on5 of St, Mark;s 5quare” because of a low-quality scan or a stray mark on the physical page of the physical document. Now, when a user conducts a search of the digital document for “illustration” or “Mark's” the user will not be provided with the correct result.
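The patent does not prescribe any particular OCR engine. As a non-limiting illustration only, the following Python sketch uses the open-source Tesseract engine (via the pytesseract wrapper, an assumed library choice) to produce machine-encoded text from a scanned page image; the file name is hypothetical.

```python
# Illustration only; the patent does not specify an OCR engine or library.
from PIL import Image       # pip install pillow
import pytesseract          # pip install pytesseract (requires Tesseract)

# OCR a scanned page that has an image layer only. The result is
# machine-encoded text that may contain recognition errors such as
# "Illustra1on5" where the page actually reads "Illustrations".
page_image = Image.open("scanned_page.png")   # hypothetical input file
detected_text = pytesseract.image_to_string(page_image)
print(detected_text)
```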
Some digital documents may be created using computer software (i.e., native digital documents), and not scans or photographs from physical documents. These native digital documents may commonly be in the Portable Document Format (PDF) developed by Adobe®. While PDF is described herein, any digital document format having an image layer and a text layer may be used by the systems and methods described herein. These native digital documents may be created by an originating computer software and converted or output into the digital document format. For example, a user may create a document using Microsoft Word™ and output the document as a PDF. The output PDF may contain an image layer and a text layer. The image layer may be displayed on a display of the computer, and the text layer may be used to search the document or select parts of the text within the document, for example. Again, these digital documents created using computer software and not created based on a physical document may be referred to herein as “native digital documents” or “true digital documents.” For example, a PDF created using computer software and not based on a scan of a physical document may be referred to as a “native PDF” or a “true PDF.”
These native digital documents have a text layer created from a firsthand input of the intended text from the originating computer software. Accordingly, these native digital documents do not require optical character recognition to detect the text included in the document. Thus, these native digital documents have text in the text layer that is more reliable than a digital document created based on a scan of a physical document and having optical character recognition performed on it (i.e., a “non-native digital document”).
While these native digital documents may be improvements over non-native digital documents created based on physical documents, they may still have shortcomings. For example, when data is to be extracted from a form, simply selecting all of the text in the text layer may not provide a suitable data output, as there may be spacing within the text and other objects separating the text, which provides context for the text in the form. An example is shown in FIG. 1.
FIG. 1 is a portion of a completed tax form 100. The tax form 100 may be a native digital document created from a computer program, such as payroll management software, and the native digital document includes a text layer and an image layer. The image layer may be viewable by the user on a computer display, and the text layer may be an underlying layer containing all of the text in the document. If the user would like to extract the data from the document, the user could select all of the text (made available by the text layer), and copy and paste the text into another digital document (e.g., a DOC file) or digital data recording system (e.g., a tax return preparation software). However, copying all of the data in the text layer may provide an output of all of the text in the page, without regard for the blocked formatting of the fields. Copying the entire text layer may result in a string of characters similar to:
PAYER'S name, street address, city or town, state of province, country, Payer's RTN (optional) ZIP or foreign postal code, and telephone no. BIG COMPANY A 1 Interest income 100 BIG COMPANY ST., $259.54 COMPANYVILLE, USA 99999-000 2 Early withdrawal penalty PAYER'S federal identification number RECIPIENT'S identification number 3 Interest on U.S. Savings Bonds and 99-999999 ***_**_RECIPIENT'S name, street address, city or town, state or province, 4 Federal income tax withheld 5 country, and ZIP or foreign postal code 6 Foreign tax paid 7 8 Tax-exempt interest 9.
While the accuracy of the text is ensured by the document being a native digital document, the spacing and formatting of the data within the document provides a challenge in being able to readily use the text from the text layer.
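The effect described above can be reproduced with any PDF text-extraction library; the patent does not name one. A minimal sketch, assuming the pypdf library and a hypothetical file name:

```python
# Sketch only; pypdf is an assumed library choice, not part of the patent.
from pypdf import PdfReader

reader = PdfReader("form_1099_int.pdf")    # hypothetical native PDF
raw_text = reader.pages[0].extract_text()
# raw_text is one largely unformatted string in which field labels and
# values run together, much like the example string shown above.
print(raw_text)
```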
Conventionally, this data has been manually processed by a human being using a computer. The human being may select the values of interest in the native digital document and copy and paste the values to another computer program. For example, when preparing a tax return, the human being may manually use a computer mouse to: (1) select the Payer's name and address information, line by line, (2) copy the text to a short-term memory (e.g., a clipboard of the operating system), and (3) paste the text to another computer software (e.g., a tax return preparation program). However, this manual process is tiresome and error-prone. A human being may select the wrong text, not enough text, or may have errors when executing the copy and paste functions of the computer. In addition, correctly identifying the correct text to be copied is also a potential source of errors, as a human being may incorrectly copy the text from an adjacent field (e.g., “Early withdrawal penalty”) when intending to copy the text from an intended field (e.g., “Interest income”).
Other computer-based systems exist where the image layer of a digital document is analyzed, and using objects present in the document, such as lines and other dividing elements, various text may be detected. In some of these systems, particularly ones for determining values entered into a standardized form, known text present in the document, such as a field description, may be located. A localized search for a corresponding value may then be performed around the area of the known field description. For example, in FIG. 1, the descriptions of the fields (e.g., “Payer's name,” “Payer's federal identification number,” or “Interest income”) may be known by the system. When text corresponding to “Interest income” is sought, the computer system may conduct a local search within a particular number of pixels of the known location of “Interest income” and optical character recognition may be performed on the detected local pixels where text corresponding to “Interest income” is likely to be found.
While this may result in a correct detection of the location of “$259.54” as shown in FIG. 1, because of the potentially unacceptably high error rate of optical character recognition, which may interpret “$259.54” as 5259.54 or $25954 or $2$59.54, a human being may be tasked with reviewing the detected text and comparing it against the digital document displayed on a screen, to ensure the correct value was detected. Again, this human-being-conducted review is prone to error, as the human reviewer may miss errors due to fatigue, poor judgment, lack of motivation, or for any other reason a human being is prone to error.
The systems and methods described herein provide solutions to the errors involved in conventional optical character recognition systems, human processing, and human review systems by using native digital documents, as further described herein. The present invention overcomes many of the deficiencies of the conventional methods and systems and obtains its objectives by providing an integrated method embodied in computer software for use with a computer for the rapid, efficient, and automatic verification of text in a native digital document, thereby allowing for rapid, accurate, and efficient data verification of digital documents. Accordingly, it is the objective of the invention to improve verification of the text of native digital documents, which is integrated with computers for producing accurate verification of digital documents.
FIG. 2 illustrates a process diagram of a workflow 200 between various components of the system, according to some embodiments of the invention.
The workflow begins at Step 1 with a user 202 uploading a document 204 to a system 206 using the systems and methods described herein. In some embodiments, the document 204 may be automatically retrieved from a third-party server or automatically communicated by a third-party server. In many embodiments, the document 204 is a form having a plurality of fields, and the system 206 may be configured to determine and use the data corresponding to each of the fields in the document 204. The fields within the document 204 and the identifiers used to label each of the fields in the document 204 may be known to the system 206 before the document 204 is analyzed.
In an example embodiment, the user 202 may be a taxpayer or an automated document retrieval system, the document 204 may be a tax document provided to the taxpayer by an employer or a bank, and the system 206 may be a tax document automation software (or scan and populate software) used by the taxpayer or tax preparer to automate the organization of, and data entry from, the taxpayer source documents. The document 204 may contain 25 different fields at different locations in the document 204. The 25 different fields may have unique field identifiers, such as “Payer's name,” “Interest income,” “Early withdrawal penalty,” and “Payer's federal identification number,” for example, and each field may have a corresponding value to be determined.
The document 204 may be a native digital document, such as a native PDF, having a text layer and an image layer, or may be a non-native digital document having only an image layer. At Step 2, the document 204 is analyzed and optical character recognition is performed on the image layer of the document 204. The optical character recognition may be performed by an optical character recognition device 208 configured to perform optical character recognition on a given document. The optical character recognition process results in data associated with the given document being detected, such as the values of various fields (e.g., $259.54).
In some embodiments, the optical character recognition process further processes the detected data associated with the document. The further processing of the documents may include one or more business rules being applied to the detected data. For example, the detected data may include currency symbols, such as $. In some embodiments, the currency symbols may not be important to store in memory, as the database may already associate particular detected data values as being in corresponding units, such as dollars. For example, in the tax form 100 of FIG. 1, the system may be instructed that data detected within the box associated with “Interest income” will be dollar amounts.
In another example, the detected data may be a number with a decimal point, and the number may be rounded to the closest whole number (e.g., $97 for $97.22 and $98 for $97.99). In some situations, the preciseness of the detected data may not have consequence, such as for tax preparation purposes, and computing resources may be saved by transforming the detected data and rounding to the nearest whole number.
In yet another example, the optical character recognition may detect multiple pieces of information in a single field. For example, when the document is a listing of stock identifiers and number of shares sold, the optical character recognition software may detect DYK 100 from a single field. However, this detected data may include the stock symbol (DYK) as well as the number of shares sold (100). In some embodiments, particular data types (e.g., numbers or letters) may be associated with each piece of data detected by the optical character recognition software. For example, the system may associate any detected letters with the stock symbol, and may associate any numbers with the number of shares sold. In this way, even if stray letters are detected by the optical character recognition in a field associated with a number value, the stray letters may be ignored and may not affect what value is detected, as the system may identify only the numbers from the detected data.
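A minimal sketch of the value-type rule described above, under the assumption that letters belong to the stock symbol and digits to the number of shares; the helper name is hypothetical:

```python
import re

def split_by_value_type(field_text: str) -> dict:
    """Separate the letters and digits detected in a single field.

    Hypothetical helper illustrating the rule above: letters are assigned
    to the stock symbol and digits to the number of shares sold.
    """
    letters = "".join(re.findall(r"[A-Za-z]+", field_text))
    digits = "".join(re.findall(r"\d+", field_text))
    return {"symbol": letters, "shares": int(digits) if digits else None}

print(split_by_value_type("DYK 100"))  # {'symbol': 'DYK', 'shares': 100}
```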
At Step 3, it is determined if the document 204 is a native digital document. An analysis of the metadata associated with the document 204 or an analysis of the content or structure of the document 204 may be performed to determine whether the document 204 is a native digital document. When the document 204 is a non-native digital document, optical character recognition is used to detect the text within the various fields of the document 204, and a human being 216 manually verifies that the correct text was detected by the optical character recognition at Step 4.
When the document 204 is a native digital document, the image layer is analyzed to determine a location of text corresponding to a given field. The given field may be considered search text corresponding to the target text (or sought-after text). For example, in FIG. 1, “Interest income” may be the search text and “$259.54” may be the target text. A list of search text may be provided to the optical character recognition device 208 by the system 206. Coordinates corresponding to the location of the target text are determined (Step 3(a)), as well as an OCR-based detection of the target text. These steps are all performed on the image layer of the document 204. These coordinates may be pixel-based and computer-interpretable, such that a human would be unable to detect the location of the target text based on the coordinates alone.
In Step 3(b), the location of the target text in the text layer is determined based on the coordinates corresponding to the location of the target text in the image layer. In some embodiments, when the coordinates corresponding to the location of the target text in the image layer are in terms of pixels, a pixel transformation converting the pixels to another mapping convention (e.g., dots or points) is used. Once the location of the target text in the text layer is determined, the target text in the text layer is detected.
In some embodiments, the detected data from the text layer is processed in a similar manner as the data detected from the image layer using optical character recognition. For example, the detected data from the text layer may include currency symbols, such as $, which may be removed. In another example, the detected data may be a number with a decimal point, and the number may be rounded to the closest whole number (e.g., $97 for $97.22 and $98 for $97.99).
The processing performed on the detected data from the image layer is also performed on the detected data from the text layer in order to maintain a normalization of data. For example, when the system processes the detected data from the image layer by removing currency symbols and rounding to the closest whole number, the detected data from the text layer is also processed by the system in the same way. Otherwise, the detected data from the image layer may not match the detected data from the text layer, even when the optical character recognition had properly detected the data. For example, a detected value on a particular native digital document may be $27.05. The image layer detected data (detected by optical character recognition) may be $27.05, which is processed by removing the currency symbol and rounding to the nearest whole integer, resulting in a detected and processed value of 27. The text layer detected data may also be $27.05. If the same processing (removing the currency symbol and rounding to the nearest whole integer) is not performed on the text-layer detected data, the system may wrongly determine that the OCR-based detected text does not match the text layer text because the values (e.g., 27 and $27.05) do not match. When the same normalization processing is performed on both the OCR-based detected text and the text-layer text, the system may correctly determine that the values match (e.g., 27 and 27).
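A minimal sketch of the normalization described above, applied identically to the OCR-detected value and the text-layer value. The specific rules (stripping currency symbols and separators, rounding to the nearest whole number) are the examples given in this description, not an exhaustive specification:

```python
import re

def normalize(value: str) -> str:
    """Apply the same normalization to OCR-detected and text-layer values."""
    cleaned = re.sub(r"[$,\s]", "", value)   # remove currency symbols/separators
    try:
        return str(round(float(cleaned)))    # round to the nearest whole number
    except ValueError:
        return cleaned                       # non-numeric values: cleaned text only

ocr_value = "$27.05"     # detected from the image layer by OCR
layer_value = "$27.05"   # detected from the text layer
print(normalize(ocr_value) == normalize(layer_value))   # True: "27" == "27"
```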
At Step 3(c), the detected text from the text layer is compared with the OCR-based detected text from the image layer. As described above, in some embodiments, the detected and processed text from the text layer is compared with the OCR-based detected and processed text from the image layer. In some embodiments, when the two detected text values do not match, the document 204 is reviewed by a human being 216 at Step 4. In some embodiments, when the two detected text values do not match, the detected text from the text layer is automatically used instead of the detected text from the image layer with no human review.
Once the text is determined, the detected text from the document is output at Step 5. In some embodiments, the detected text is output to other software, such as spreadsheet software, database software, or tax preparation software. In some embodiments, the detected text is stored in a file on a memory of the system. In some embodiments, the detected text is rendered by a computing device and shown on a display screen of the system.
FIG. 3A illustrates the image layer 300 of the same native digital document shown in FIG. 1 after an optical character recognition process has been performed on the native digital document 302 to identify a location in the image layer of the target text. The optical character recognition process may be performed by an optical character recognition device, as described herein.
The target text 310 is associated with a search text 308. The search text 308 is known by the optical character recognition device and provided to the optical character recognition device. In particular, the text value and the location of the search text 308 may be known by the optical character recognition device. The optical character recognition device may use the known text value and location of the search text 308 to locate the target text 310.
As shown in FIG. 3A, the search text 308 is “Interest income” and the target text 310 is “$259.54.” The optical character recognition device may locate the target text 310 by defining a search area based on the search text 308 and one or more separators present on the document.
In some embodiments, the optical character recognition device identifies data by separating the spaced text in the document into tables. The optical character recognition device locates all of the header text and generates columns based on the respective header text. The optical character recognition device then defines a footer element to restrict the table by using a text or separator element. The optical character recognition device is then able to detect the data for each respective row based on the location of the determined columns.
In some embodiments, the optical character recognition device detects text appearing multiple times in the document. The optical character recognition device may achieve this by locating the header text and capturing unique data appearing multiple times in the document. Once the unique data is captured, other required information may be detected, based on the unique element, by searching the areas to the right and left of the respective header.
In some embodiments, the optical character recognition device identifies data conforming to a standardized format, such as XXX-XX-XXXX for a social security number, to identify the target text 310.
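For the standardized-format case, a simple pattern match suffices. The sketch below checks the social security number format mentioned above; it is an illustration, not the claimed implementation:

```python
import re

# Sketch: identify candidate text conforming to the XXX-XX-XXXX format.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")

def looks_like_ssn(candidate: str) -> bool:
    return SSN_PATTERN.fullmatch(candidate) is not None

print(looks_like_ssn("123-45-6789"))   # True
print(looks_like_ssn("123-456-789"))   # False
```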
In some embodiments, the optical character recognition device may know the text value of the target text expected to be on the document based on historical data associated with the document 302 or the user associated with the document 302, and identifies target text 310 that is within a particular percentage match of the expected text value.
Once the target text 310 is located, the optical character recognition device determines location coordinates associated with the location of the target text 310 on the image layer of the native digital document 302. In some embodiments, the location of the target text 310 is represented by a four-sided box 330 surrounding the target text 310, and the coordinates associated with the location of the target text 310 may be a set of four pixel values representing a respective distance from an edge of the native digital document to an edge of the box 330.
For example, the top edge 312 of the box 330 surrounding the target text 310 is a distance 322 away from the top of the digital document 302. The bottom edge 314 of the box 330 surrounding the target text 310 is a distance 324 from the top of the digital document 302. The left edge 316 of the box 330 surrounding the target text 310 is a distance 326 from the left of the digital document 302. The right edge 318 of the box 330 surrounding the target text 310 is a distance 328 from the left of the digital document 302. The coordinate system illustrated herein is merely illustrative, and any system of locating the box 330 in the two-dimensional plane of the image layer of the document 302 may be used.
Once the optical character recognition device has determined the location coordinates associated with the location of the target text 310, the optical character recognition device may output those location coordinates. The optical character recognition device may also detect the target text 310 using optical character recognition on the image layer of the native digital document 302, and output this OCR-detected target text value. These outputs of the optical character recognition device may be in the form of an Extensible Markup Language (XML) file or any other file for communicating metadata.
FIG. 3B illustrates the text layer 350 of the document 302. Also shown is the box 330, which is not a part of the text layer 350. The text layer contains only the text of the document 302, and is accurate because the document 302 is a native digital document created by computer software. However, as described herein, conventionally a human being is required to select the appropriate text data from the text layer to output or export the text data, as the text of the text layer is essentially one large, unformatted string of text.
In some situations, the automatically determined box of the target text may be overly large and may wrongly include other text in addition to the desired text. For example, the box 330 may also include “alty” which is part of the word “penalty” below the target text. In these situations, the target text of $259.54 may be distinguished from the stray letters “alty” and correctly identified because the system associates the target text of the “Interest income” field as being a numerical value, not a letter value. Thus, the system may use associated value types to ensure that stray text or numbers are not incorporated into the detected data.
FIG. 4 illustrates an example output XML file 400 corresponding to the target text 310. The output XML file 400 is output by the optical character recognition device. The XML file 400 includes OCR-detected text 402 based on optical character recognition of the target text 310. The XML file 400 also includes a left value 404 corresponding to the distance 326 from the left of the digital document 302 to the left side 316 of the box 330, a top value 406 corresponding to the distance 322 from the top of the digital document 302 to the top side 312 of the box 330, a right value 408 corresponding to the distance 328 from the left of the digital document to the right side 318 of the box 330, and a bottom value 410 corresponding to the distance 324 from the top of the digital document to the bottom side 314 of the box 330. The values 404, 406, 408, and 410 may be in pixels or any other unit of measuring a distance on the digital document 302.
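A sketch of how a downstream component might read such an output file. The element names below are assumptions made for illustration; FIG. 4 defines example values, not a schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical XML in the spirit of FIG. 4; tag names are illustrative only.
sample = """<field>
  <text>5,259.54</text>
  <left>1205</left><top>410</top><right>1316</right><bottom>442</bottom>
</field>"""

root = ET.fromstring(sample)
detected_text = root.findtext("text")
box = {side: int(root.findtext(side)) for side in ("left", "top", "right", "bottom")}
print(detected_text, box)   # 5,259.54 {'left': 1205, 'top': 410, ...}
```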
As shown in FIG. 4, the OCR-detected text 402 is “5,259.54” but the actual target text is “$259.54.” This is an example situation where the optical character recognition device has inaccurately detected a text value. As described herein, conventionally, a human being may, at this point, compare the detected text 402 to the target text 310 and overwrite or correct the detected text 402 to correctly read “$259.54.” However, the execution of this verification step by a human being is prone to error, and the systems and methods described herein provide an automatic way to verify whether the detected text 402 is accurate, and an automatic way to determine the correct value of the target text 310.
A verification device receives the OCR-detected text 402, the left value 404, the top value 406, the right value 408, and the bottom value 410. The left value 404, the top value 406, the right value 408, and the bottom value 410 represent the location of the target text 310 within the image layer of the native digital document. In some embodiments, the verification device receives this data in an XML file, as shown in FIG. 4.
The verification device may convert the left value 404, the top value 406, the right value 408, and the bottom value 410 to respective text layer values, if the text layer has a different coordinate or measurement system than the image layer. For example, when the image layer is a matrix of computer-renderable pixels and the values 404-410 are in terms of pixels, the verification device may perform a pixel transformation sequence to convert the values 404-410 in pixels to another digital document mapping convention, such as dots or points. The transformation sequence may not be performed by an individual human without using a computing device because the digital document mapping systems are not replicable on a physical document. The units of pixels, dots, or points, may not be accurately translatable to a physical document, and a physical document may be incapable of representing the computer-specific concepts of pixels, dots, or points.
The digital document mapping conventions used for the native digital document may be more precise than a human being is capable of being. In some embodiments, when the text layer uses the same coordinate or measurement system as the image layer, the same values corresponding to the location of the target text 310 in the image layer may be used for the text layer.
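A minimal sketch of such a transformation, under the assumption that the image layer was rendered at a known resolution (here 300 pixels per inch) and that the text layer is addressed in PDF points (72 per inch); both figures are illustrative and not prescribed by the patent:

```python
def pixels_to_points(box_px: dict, render_dpi: float = 300.0) -> dict:
    """Convert a pixel-based bounding box to points (assumed 72 per inch)."""
    scale = 72.0 / render_dpi
    return {side: value * scale for side, value in box_px.items()}

box_px = {"left": 1205, "top": 410, "right": 1316, "bottom": 442}
print(pixels_to_points(box_px))
# approximately {'left': 289.2, 'top': 98.4, 'right': 315.8, 'bottom': 106.1}
```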
After the verification device determines the location of the target text 310 on the text layer of the native digital document, the verification device detects the text value of the text layer at that location. Referring back to FIG. 3B, the text value of the text layer is shown in box 330. In this case, the text layer has a text value of “$259.54” at the location of the target text 310. The text value detected from the text layer ($259.54) is compared against the text value 402 from the image layer (5,259.54), as detected by the optical character recognition device, and the verification device determines that the text values do not match.
In some embodiments, when the text values do not match, the entire document is flagged for review by a human being. In some embodiments, when the text values do not match, the text value detected from the text layer is used, and the text value 402 from the image layer is discarded or disregarded. The text value detected from the text layer may be output by the verification device. In some embodiments, the text value detected from the text layer is output to another computer software, such as tax preparation software or patient management software. In some embodiments, the text value detected from the text layer is rendered by a computing device and displayed on a display screen for the user to view. In some embodiments, the text value detected from the text layer is saved in a database on a non-volatile memory connected to the verification device. In some embodiments, when the text values exceed a particular threshold percentage of similarity (e.g., 70%, 80%, 85%, 90% similar), the text from the text layer is used, and when the text values do not exceed the threshold percentage of similarity, an alert is generated, and a human being may review the document manually.
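A minimal sketch of the threshold comparison, assuming difflib's sequence-matching ratio as the similarity measure and 80% as the threshold; both are illustrative choices, as the description above only calls for some percentage of similarity:

```python
from difflib import SequenceMatcher

def similar_enough(ocr_text: str, layer_text: str, threshold: float = 0.80) -> bool:
    """Return True when the two values are at least `threshold` similar."""
    return SequenceMatcher(None, ocr_text, layer_text).ratio() >= threshold

print(similar_enough("B1G OOMPANY A", "BIG COMPANY A"))     # True: ratio is about 0.85
print(similar_enough("5259.54", "259.54", threshold=0.95))  # False at a stricter threshold
```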
The process may be repeated on the document until all of the desired text on the document is detected. The desired text on the document to be detected may be identified by a user. The identification may be a list of names of the values (e.g., Payer Name or Interest Income) and associated search text or associated locations on the document where the desired text may be located.
FIG. 5A illustrates a portion of the native digital document shown in FIG. 1 after an optical character recognition process has been performed on the native digital document 502 to identify a location of the target text 506 in the image layer 500. The optical character recognition process may be performed by an optical character recognition device, as described herein.
The target text 506 is associated with a search text 504. The search text 504 is known by the optical character recognition device and provided to the optical character recognition device. In particular, the text value and the location of the search text 504 may be known by the optical character recognition device. The optical character recognition device may use the known text value and location of the search text 504 to locate the target text 506.
As shown in FIG. 5A, the search text 504 is “PAYER'S name” and the target text 506 is “BIG COMPANY A.” The optical character recognition device may locate the target text 506 by defining a search area based on the search text 504 and one or more separators present on the document.
Once the target text 506 is located, the optical character recognition device determines location coordinates associated with the location of the target text 506 on the image layer 500 of the native digital document 502. In some embodiments, the location of the target text 506 is represented by a four-sided box 530 surrounding the target text 506, and the coordinates associated with the location of the target text 506 may be a set of four pixel values representing a respective distance from an edge of the native digital document to an edge of the box 530.
For example, the top edge 512 of the box 530 surrounding the target text 506 is a distance 522 away from the top of the digital document 502. The bottom edge 514 of the box 530 surrounding the target text 506 is a distance 524 from the top of the digital document 502. The left edge 516 of the box 530 surrounding the target text 506 is a distance 526 from the left of the digital document 502. The right edge 518 of the box 530 surrounding the target text 506 is a distance 528 from the left of the digital document 502.
Once the optical character recognition device has determined the location coordinates associated with the location of the target text 506, the optical character recognition device may output those location coordinates, along with a detected text of the target text 506 using optical character recognition on the image layer of the native digital document 502. These outputs of the optical character recognition device may be in the form of an Extensible Markup Language (XML) file.
FIG. 5B illustrates the text layer 550 of the native digital document 502. Also illustrated is the box 530 surrounding the target text 506. The box 530 is not a part of the text layer 550. As described herein with respect to FIG. 3B, the text layer 550 contains only the text of the document 502, and is accurate because the document 502 is a native digital document created by computer software. However, as described herein, conventionally a human being is required to select the appropriate text data from the text layer to output or export the text data, as the text of the text layer is essentially one large, unformatted string of text.
FIG. 6 illustrates an example output XML file 600 corresponding to the target text 506. The output XML file 600 is output by the optical character recognition device. The XML file 600 includes a detected text 602 based on optical character recognition of the target text 506. The XML file 600 also includes a left value 604 corresponding to the distance 526 from the left of the digital document 502 to the left side 516 of the box 530, a top value 606 corresponding to the distance 522 from the top of the digital document 502 to the top side 512 of the box 530, a right value 608 corresponding to the distance 528 from the left of the digital document to the right side 518 of the box 530, and a bottom value 610 corresponding to the distance 524 from the top of the digital document to the bottom side 514 of the box 530. The values 604, 606, 608, and 610 may be in pixels or any other unit of measuring a distance on the digital document 502.
As shown in FIG. 6, the detected text 602 is “B1G OOMPANY A” but the actual target text is “BIG COMPANY A.” This is an example situation where the optical character recognition device has inaccurately detected a text value. As described herein, conventionally, a human being may, at this point, compare the detected text 602 to the target text 506 and overwrite or correct the detected text 602 to correctly read “BIG COMPANY A.” However, the execution of this verification step by a human being is prone to error, and the systems and methods described herein provide an automatic way to verify whether the detected text 602 is accurate, and an automatic way to determine the correct value of the target text 506.
A verification device receives the OCR-detected text 602, the left value 604, the top value 606, the right value 608, and the bottom value 610. The left value 604, the top value 606, the right value 608, and the bottom value 610 represent the location of the target text 506 within the image layer 500 of the native digital document 502. In some embodiments, the verification device receives this data in an XML file, as shown in FIG. 6.
The verification device may convert the left value 604, the top value 606, the right value 608, and the bottom value 610 to respective text layer values, if the text layer has a different coordinate or measurement system than the image layer. When the text layer uses the same coordinate or measurement system as the image layer, the same values corresponding to the location of the target text 506 may be used.
After the verification device determines the location of the target text 506 on the text layer 550 of the native digital document, the verification device detects the text value of the text layer 550 at that location. Referring back to FIG. 5B, the text value of the text layer is shown in box 530. In this case, the text layer has a text value of “BIG COMPANY A” at the location of the target text 506. The text value detected from the text layer (BIG COMPANY A) is compared against the text value 602 from the image layer (B1G OOMPANY A), as detected by the optical character recognition device, and the verification device determines that the text values do not match.
In some embodiments, when the text values do not match, the entire document is flagged for review by a human being. In some embodiments, when the text values do not match, the text value detected from the text layer is used, and the text value 602 from the image layer is discarded or disregarded. The text value detected from the text layer may be output by the verification device. In some embodiments, the text value detected from the text layer is output to another computer software, such as tax preparation software or patient management software. In some embodiments, the text value detected from the text layer is displayed on a display screen for the user to view. In some embodiments, the text value detected from the text layer is saved in a database on a non-volatile memory connected to the verification device.
While the examples illustrated herein have search text uniquely associated with the target text, in some embodiments, the system traverses the native digital document one text item at a time to verify each of the text items detected by performing optical character recognition in the image layer. That is, in these embodiments, there is no search text, and the steps described herein are repeated as the native digital document is traversed, with the system identifying a new target text with each iteration of the steps. The system may separate groups of text based on the presence of separating elements (e.g., lines or borders), based on the whitespace separating the groups of text, or based on a machine-learning-tuned automatic determination of the type of document represented by the native digital document. For example, over time, and with sufficient training data, the system may be able to recognize various types of documents and may automatically be able to identify the target text locations without being provided the search text associated with each of the target texts.
FIG. 7 illustrates an example user interface output after the system has traversed the native digital document 702, according to embodiments of the invention.
A computing device may render a display 700 to be shown on a display screen. The display 700 may be a graphical user interface showing a representation of the native digital document 702. The display 700 may be rendered based on the image layer of the native digital document. The display 700 includes confirmatory indicators 704A-704C and non-confirmatory indicators 706. The confirmatory indicators 704 are located adjacent to text in the native digital document where the OCR-detected text matches with the text in the text layer. For example, when the optical character recognition device detects the Payer name as “BIG COMPANY A” and the corresponding text of the text layer of the native digital document is “BIG COMPANY A”, the OCR-detected text matches the text in the text layer. Accordingly, a confirmatory indicator 704A is rendered and displayed adjacent to the text that was confirmed.
The non-confirmatory indicators 706 are located adjacent to text in the native digital document where the OCR-detected text does not match the text in the text layer. For example, when the optical character recognition device detects the Interest income as being “5259.54” and the corresponding text of the text layer is “$259.54”, the OCR-detected text does not match the text in the text layer. Accordingly, a non-confirmatory indicator 706 is rendered and displayed adjacent to the text that was not confirmed.
The display 700 may be displayed to human reviewers reviewing data extraction from native digital documents. Conventionally, the human reviewers were tasked with viewing extracted data from the native digital document and reviewing the image layer of the native digital document to determine whether the extracted data was accurately detected. For example, the human reviewer reviewing native digital document 702 may have had to go back and forth between the extracted data and the image layer of the native digital document to determine whether each field was properly detected by optical character recognition. This process is prone to error and extremely time consuming. In addition, the human eye may not be capable of detecting some errors. For example, when the OCR-detected text is “BIG COMPANY A” with a Greek capital letter Iota detected instead of an uppercase I, a human being is, in practically all cases, unable to recognize this difference on a document. However, a computer capable of comparing the character-encoding values (e.g., the Unicode code points) associated with the Greek capital letter Iota and an uppercase I is able to detect the erroneous detection performed by optical character recognition. This erroneous detection of a Greek capital letter Iota instead of an uppercase I may result in a mistaken detection of data from the digital document when text data is extracted from the digital document. This may cause inaccuracy and significant delays in the larger systems using the systems and methods described herein. For example, a current hospital of a patient may complete a form, using a computer, requesting records of the patient from a previous hospital. If optical character recognition erroneously detects “Patient” with the Greek lowercase letter Alpha instead of a lowercase A in the patient records request form, a significant delay in obtaining the records of the patient may occur, if the records can ever be obtained at all. This significant delay or inability to properly locate the patient's records may prevent the current hospital from being able to administer the best care to the patient. This difference between the Greek lowercase letter Alpha and a lowercase A may be unrecognizable to a human being reviewing dozens of forms every hour, but it is easily and readily recognized by the computing devices of the systems and methods described herein.
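A short sketch of the character-level comparison a computer can perform where the human eye cannot: two strings that render identically are distinguished by their underlying code points.

```python
# Sketch: visually identical strings can differ in their underlying code points.
latin = "BIG COMPANY A"                              # Latin capital I (U+0049)
greek = "BIG COMPANY A".replace("I", "\u0399")       # Greek capital Iota (U+0399)

print(latin == greek)                                # False, despite identical appearance
print([hex(ord(c)) for c in latin if c == "I"])      # ['0x49']
print([hex(ord(c)) for c in greek if c == "\u0399"]) # ['0x399']
```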
The display 700 provides a streamlined user interface for the human reviewer by indicating, using the confirmatory indicators 704, which fields have already been confirmed, and indicating, using the non-confirmatory indicators 706, which fields have not been confirmed. This improved display 700 focuses the human reviewer on the fields that the human reviewer should manually review. Conventionally, the human reviewer may have to have two windows open on the display screen—one for the detected text values and one for the image layer of the native digital document. Display 700, which may be shown on a single page of a display screen, allows the human reviewer to view the annotated image layer of the native digital document and to quickly determine which fields to manually check.
In some embodiments, the human reviewer may click, using an input device such as a computer mouse, on the text adjacent to the non-confirmatory indicator 706, and edit the OCR-detected text in real-time. Alternatively, the human reviewer may click an icon 708 directing the system to discard the conflicting OCR-detected text and to use the text from the text layer of the native digital document for any OCR-detected text that does not match the text from the text layer. In some embodiments, the system may use the text from the text layer, disregard or delete the OCR-detected text, and not show the pages for verification if all the required fields are located and their respective text layers extracted.
FIG. 8 illustrates an example system 800 for automatically verifying text detected by optical character recognition. The system 800 includes a verification device 802, an optical character recognition (OCR) device 804, a memory 806, a user device 808, an output device 810, a correction device 812, and an input device 814.
Any of the devices (e.g., verification device, optical character recognition device, user device, or correction device) described herein may be a separate hardware device having a processor and a non-volatile memory, the processor configured to execute instructions stored on the non-volatile memory. The devices described herein may alternatively be a part of a single device having multiple software devices executed by a processor and a non-volatile memory, the processor configured to execute instructions stored on the non-volatile memory. The devices described herein are special purpose machines configured to perform their respective tasks described herein. In some embodiments, the verification device 802, the optical character recognition device 804, the memory 806, the output device 810, the correction device 812, and the input device 814 are computing modules of a single computing system having a processor and non-transitory memory.
The user device 808 may be a computing device communicatively coupled to the verification device 802. The user device 808 may be, for example, a smartphone, a laptop, or a tablet computer. The user device 808 may have its own display and memory, and is capable of generating a native digital document. For example, the user device 808 may be a computer which has software for generating invoices or account statements in the PDF format, and the generated invoices or account statements contain an image layer and a text layer. The user device 808 may communicate the generated native digital document to the verification device 802 for extraction of the text data within the native digital document.
The verification device 802 may provide the native digital document to the optical character recognition device 804. The optical character recognition device 804 may execute special-purpose optical character recognition software to detect text data in the image layer of the native digital document and the location of the text data in the image layer of the native digital document.
The verification device 802 receives the OCR-detected text and the location of the OCR-detected text from the optical character recognition device 804, and determines the corresponding location of the text values in the text layer of the native digital document. The verification device 802 compares the text value in the text layer to the OCR-detected text, and determines whether the text values match.
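One way to realize this location lookup is to treat the text layer as a set of words with bounding boxes in the page's coordinate space and to collect the words that overlap the OCR bounding box. The sketch below assumes a hypothetical `TextLayerWord` record and a fixed rendering resolution for the pixel-to-point conversion; a real extractor would obtain the word boxes and rendering DPI from the PDF and the OCR engine rather than from these hard-coded assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)

@dataclass
class TextLayerWord:
    """Hypothetical word record extracted from the PDF text layer."""
    text: str
    box: Box  # in the text layer's coordinate space

def scale_box(ocr_box_px: Box, dpi: float = 300.0, points_per_inch: float = 72.0) -> Box:
    """Convert a pixel-based OCR box to the text layer's coordinate space.

    Assumes the image layer was rendered at a known, uniform DPI; a real
    system would take the resolution from the OCR output or the PDF itself.
    """
    scale = points_per_inch / dpi
    left, top, right, bottom = ocr_box_px
    return (left * scale, top * scale, right * scale, bottom * scale)

def overlaps(a: Box, b: Box) -> bool:
    """True when two (left, top, right, bottom) boxes intersect."""
    return not (a[2] < b[0] or b[2] < a[0] or a[3] < b[1] or b[3] < a[1])

def text_at_location(ocr_box_px: Box, words: List[TextLayerWord], dpi: float = 300.0) -> str:
    """Collect the text-layer words that fall within the OCR bounding box."""
    target = scale_box(ocr_box_px, dpi)
    return " ".join(w.text for w in words if overlaps(w.box, target))
```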
In some embodiments, processing is performed on both the OCR-detected text and the text values detected from the text layer. The processing may include removing symbols, rounding values to the nearest whole number, or discerning value types (e.g., text or numbers). The processing of the OCR-detected text may be performed by the OCR device 804 or by the verification device 802. The processing of the text values detected from the text layer may be performed by the verification device 802.
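The description does not prescribe exact normalization rules, so the sketch below is one plausible reading of the processing described above: strip currency symbols and separators, round values that parse as numbers to the nearest whole number, and pass text values through with minor cleanup. The regular expressions and rounding behavior are assumptions.

```python
import re

def normalize(value: str) -> str:
    """One possible normalization of a detected value (an assumed rule set).

    Strips currency symbols, commas, parentheses, and whitespace; rounds
    anything that parses as a number to the nearest whole number; otherwise
    returns the text with stray symbols removed.
    """
    cleaned = value.strip()
    numeric = re.sub(r"[$,()\s]", "", cleaned)
    try:
        return str(round(float(numeric)))           # numeric value type
    except ValueError:
        return re.sub(r"[^\w\s./-]", "", cleaned)   # text value type

# Both representations of the same amount normalize to the same string.
assert normalize("$12,345.60") == normalize("12345.6") == "12346"
```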
The memory 806 may be a non-transitory memory configured to store multiple native digital documents, lists of search text to use for various types of documents, or any other data described herein. The memory 806 may also store the computer code used by the verification device 802 for executing the functions described herein. The user device 808, output device 810, OCR device 804, correction device 812, and the input device 814 may have respective non-transitory memories storing computer code for executing the respective functions of the respective devices described herein. As used herein, "computer code" may refer to instructions stored in non-transitory memory to be executed by a processor of a device.
The output device 810 may be a display screen configured to display the results of the verification of text values between the OCR-detected text and the text detected from the text layer of the native digital document. The display screen may display the image layer of the native digital document and may also display icons where the OCR-detected text was verified (e.g., a green check mark) and where the OCR-detected text was not verified (e.g., a red X mark). In this way, a system that requires human review of the document when the OCR-detected text does not match the text-layer text can operate more accurately, as the system is capable of automatically verifying at least a portion of the native digital document.
The output device 810 may be a separate computing device executing software which collects or uses the text detected from the native digital document. For example, the output device 810 may be a computing device of a tax return preparation service, which processes tax documents received by a user, extracts the data from the tax documents, and either stores the text data or populates one or more tax-related forms based on the text data of the tax documents. In another example, the output device 810 is a computing device executing database software, and the extracted data may be organized and stored by the database software.
The correction device 812 may render a graphical user interface to be displayed on the output device 810. The graphical user interface rendered by the correction device 812 may be similar to display 700 of FIG. 7. The correction device 812 may provide for human review and correction of any OCR-detected text that does not match the corresponding text in the text layer. The correction device 812 may receive, from the input device 814, an indication from the user to adjust or correct the OCR-detected values to a value entered by the user or to the value of the text layer. The input device 814 may be one or more of a computer mouse, a computer keyboard, a microphone, or any other device or apparatus for communicating with the system 800.
FIG. 9 illustrates a flow diagram of a process 900 used by the system described herein. The system receives, from a user, a native digital document having an image layer and a text layer (step 902). As described herein, the user device 808 may communicate the native digital document to the verification device 802.
The native digital document is provided to an optical character recognition device 804 (step 904), which detects text in the image layer of the native digital document and a location of the text in the image layer (step 906).
The optical character recognition device 804 performs processing (or “normalization processing”) on the text detected from the image layer (step 908). The processing may include removing symbols, rounding values to the nearest whole number, or discerning value types (e.g., text or numbers).
The verification device 802 determines a location of the text in the text layer of the native digital document based on the location received from the optical character recognition device 804 (step 910). The text in the text layer may be considered reliable, as the document is a native digital document.
The verification device 802 detects the text in the text layer of the native digital document (step 912), performs the same normalization processing on the detected text from the text layer as was performed on the detected text from the image layer in step 908 by the optical character recognition device 804 (step 914), and compares the OCR-detected text to the text in the text layer of the native digital document (step 916). When the OCR-detected text does not match the text in the text layer of the native digital document, the text from the text layer may be output (step 918). In addition, an indication that the two values did not match may be displayed on a user interface. When the OCR-detected text does match the text in the text layer of the native digital document, the text from the text layer may be output. In addition, an indication that the two values did match may be displayed on a user interface.
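Tying the steps together, the sketch below composes the `normalize` and `text_at_location` helpers (and the `Box` and `TextLayerWord` types) sketched earlier in this section into a single pass over the OCR output. The `OcrResult` record and the shape of the returned report are hypothetical; the point is the ordering of steps 906 through 918, not a definitive implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class OcrResult:
    """Hypothetical OCR output: detected text plus its pixel bounding box."""
    text: str
    box_px: Box  # (left, top, right, bottom), in image-layer pixels

def verify_document(ocr_results: List[OcrResult],
                    text_layer_words: List[TextLayerWord]) -> List[dict]:
    """Steps 906-918: compare each OCR value with the text layer at the same location."""
    report = []
    for result in ocr_results:
        ocr_value = normalize(result.text)                               # step 908
        layer_raw = text_at_location(result.box_px, text_layer_words)    # steps 910-912
        layer_value = normalize(layer_raw)                               # step 914
        report.append({
            "matched": ocr_value == layer_value,   # step 916, drives the UI indicator
            "output_text": layer_value,            # step 918: the text-layer value is output
        })
    return report
```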
Exemplary embodiments of the methods/systems have been disclosed in an illustrative style. Accordingly, the terminology employed throughout should be read in a non-limiting manner. Although minor modifications to the teachings herein will occur to those well versed in the art, it shall be understood that what is intended to be circumscribed within the scope of the patent warranted hereon are all such embodiments that reasonably fall within the scope of the advancement to the art hereby contributed, and that that scope shall not be restricted, except in light of the appended claims and their equivalents.

Claims (19)

What is claimed is:
1. A method for automatically verifying text detected by optical character recognition (OCR), the method comprising:
obtaining a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters;
obtaining normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document;
determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document;
applying the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text;
applying normalization processing to the detected text in the text layer to generate normalized text-layer text; and
rendering only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
2. The method of claim 1, wherein the normalized OCR-detected text and the pixel-based coordinate location of the OCR-detected text in the image layer is determined by an optical character recognition device executing optical character recognition computer software.
3. The method of claim 1, wherein the pixel-based coordinate location of the OCR-detected text in the image layer is associated with a four-sided box surrounding the OCR-detected text, the four-sided box having a left side, a top side, a right side, and a bottom side.
4. The method of claim 3, wherein the pixel-based coordinate location of the OCR-detected text in the image layer includes a left value corresponding to a distance from a left edge of the native digital document to the left side of the four-sided box, a top value corresponding to a distance from the top edge of the native digital document to the top side of the four-sided box, a right value corresponding to a distance from the left edge of the native digital document to the right side of the four-sided box, and a bottom value corresponding to a distance from the top edge of the native digital document to the bottom side of the four-sided box.
5. The method of claim 1, further comprising providing a display of the image layer of the native digital document and a confirmatory indicator adjacent to the location of the OCR-detected text when the normalized OCR-detected text matches the normalized text-layer text or a non-confirmatory indicator adjacent to the location of the OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
6. The method of claim 5, further comprising:
determining, for each text of the native digital document, whether a normalized OCR-detected text for each text matches a corresponding normalized text-layer text, and
providing, on the display, a respective confirmatory indicator for each text where the normalized OCR-detected text matches the corresponding normalized text-layer text, and a respective non-confirmatory indicator for each text where the normalized OCR-detected text does not match the corresponding normalized text-layer text,
wherein the display is limited to a single page of a display screen.
7. The method of claim 5, further comprising receiving, in real-time from the user via an input unit, a correction for the normalized OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
8. The method of claim 1, wherein the native digital document is obtained by receiving the native digital document from a user device of the user, by receiving the native digital document from a third-party server, or by retrieving the native digital document from the third-party server.
9. The method of claim 1, further comprising discarding or deleting the normalized OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
10. The method of claim 1, further comprising automatically outputting, to a computer software, the normalized text-layer text to improve accuracy of text extraction from the native digital document.
11. A method for automatically verifying text detected by optical character recognition (OCR), the method comprising:
receiving a native digital document having an image layer comprising a matrix of computer-renderable pixels and a text layer comprising computer-readable encodings of a sequence of characters;
receiving, from an OCR device, normalized OCR-detected text corresponding to OCR-detected text from the image layer of the native digital document and a pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document;
determining, using a pixel transformation, a computer-interpretable location of the OCR-detected text in the text layer of the native digital document based on the pixel-based coordinate location of the OCR-detected text in the image layer of the native digital document;
applying, by the verification device, the computer-interpretable location of the OCR-detected text to the text layer of the native digital document to detect text in the text layer corresponding to the OCR-detected text;
applying, by the verification device, normalization processing to the detected text in the text layer to generate normalized text-layer text; and
producing only the normalized text-layer text as an output when the normalized OCR-detected text does not match the normalized text-layer text, to improve accuracy of the output text.
12. The method of claim 11, wherein the pixel-based coordinate location of the OCR-detected text in the image layer is associated with a four-sided box surrounding the OCR-detected text, the four-sided box having a left side, a top side, a right side, and a bottom side.
13. The method of claim 12, wherein the pixel-based coordinate location of the OCR-detected text in the image layer includes a left value corresponding to a distance from a left edge of the native digital document to the left side of the four-sided box, a top value corresponding to a distance from the top edge of the native digital document to the top side of the four-sided box, a right value corresponding to a distance from the left edge of the native digital document to the right side of the four-sided box, and a bottom value corresponding to a distance from the top edge of the native digital document to the bottom side of the four-sided box.
14. The method of claim 11, further comprising providing a display of the native digital document and a confirmatory indicator adjacent to the location of the OCR-detected text when the normalized OCR-detected text matches the normalized text-layer text or a non-confirmatory indicator adjacent to the location of the OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
15. The method of claim 14, further comprising:
determining, for each text of the native digital document, whether a normalized OCR-detected text for each text matches a corresponding normalized text-layer text, and
providing, on the display, a respective confirmatory indicator for each text where the normalized OCR-detected text matches the corresponding normalized text-layer text, and a respective non-confirmatory indicator for each text where the normalized OCR-detected text does not match the corresponding normalized text-layer text,
wherein the display is limited to a single page of a display screen.
16. The method of claim 14, further comprising receiving, in real-time from the user via an input unit, a corrected text for the normalized OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
17. The method of claim 11, wherein the native digital document is received from a user device of the user, received from a third-party server, or retrieved from the third-party server.
18. The method of claim 11, further comprising discarding or deleting the normalized OCR-detected text when the normalized OCR-detected text does not match the normalized text-layer text.
19. The method of claim 11, further comprising automatically outputting, to a computer software, the normalized text-layer text to improve accuracy of text extraction from the native digital document.
US16/047,346 2018-03-15 2018-07-27 System and method for automatic detection and verification of optical character recognition data Active US10489644B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
US16/047,346 US10489644B2 (en) 2018-03-15 2018-07-27 System and method for automatic detection and verification of optical character recognition data
US16/659,193 US11232300B2 (en) 2018-03-15 2019-10-21 System and method for automatic detection and verification of optical character recognition data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US15/922,821 US10489645B2 (en) 2018-03-15 2018-03-15 System and method for automatic detection and verification of optical character recognition data
US16/047,346 US10489644B2 (en) 2018-03-15 2018-07-27 System and method for automatic detection and verification of optical character recognition data

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
US15/922,821 Continuation-In-Part US10489645B2 (en) 2018-03-15 2018-03-15 System and method for automatic detection and verification of optical character recognition data

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/659,193 Continuation US11232300B2 (en) 2018-03-15 2019-10-21 System and method for automatic detection and verification of optical character recognition data

Publications (2)

Publication Number Publication Date
US20190286896A1 US20190286896A1 (en) 2019-09-19
US10489644B2 true US10489644B2 (en) 2019-11-26

Family

ID=67905761

Family Applications (2)

Application Number Title Priority Date Filing Date
US16/047,346 Active US10489644B2 (en) 2018-03-15 2018-07-27 System and method for automatic detection and verification of optical character recognition data
US16/659,193 Active 2038-10-09 US11232300B2 (en) 2018-03-15 2019-10-21 System and method for automatic detection and verification of optical character recognition data

Family Applications After (1)

Application Number Title Priority Date Filing Date
US16/659,193 Active 2038-10-09 US11232300B2 (en) 2018-03-15 2019-10-21 System and method for automatic detection and verification of optical character recognition data

Country Status (1)

Country Link
US (2) US10489644B2 (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037238B1 (en) * 2019-06-03 2021-06-15 Intuit Inc. Machine learning tax based credit score prediction
CN110796135B (en) * 2019-09-20 2024-07-16 平安科技(深圳)有限公司 Target positioning method and device, computer equipment and computer storage medium
CN111291753B (en) * 2020-01-22 2024-05-28 平安科技(深圳)有限公司 Text recognition method and device based on image and storage medium
US11880435B2 (en) * 2020-02-12 2024-01-23 Servicenow, Inc. Determination of intermediate representations of discovered document structures
US11461164B2 (en) 2020-05-01 2022-10-04 UiPath, Inc. Screen response validation of robot execution for robotic process automation
US11080548B1 (en) 2020-05-01 2021-08-03 UiPath, Inc. Text detection, caret tracking, and active element detection
KR102297355B1 (en) * 2020-05-01 2021-09-01 유아이패스, 인크. Text detection, caret tracking, and active element detection
US11200441B2 (en) 2020-05-01 2021-12-14 UiPath, Inc. Text detection, caret tracking, and active element detection
US11366962B2 (en) * 2020-08-11 2022-06-21 Jpmorgan Chase Bank, N.A. Method and apparatus for template authoring and execution
CN112101386B (en) * 2020-09-25 2024-04-23 腾讯科技(深圳)有限公司 Text detection method, device, computer equipment and storage medium
US12008830B2 (en) * 2022-01-07 2024-06-11 Infrrd Inc. System for template invariant information extraction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020037097A1 (en) * 2000-05-15 2002-03-28 Hector Hoyos Coupon recognition system
US20140161365A1 (en) * 2012-12-12 2014-06-12 Qualcomm Incorporated Method of Perspective Correction For Devanagari Text
US20160055376A1 (en) * 2014-06-21 2016-02-25 iQG DBA iQGATEWAY LLC Method and system for identification and extraction of data from structured documents
US10229314B1 (en) * 2015-09-30 2019-03-12 Groupon, Inc. Optical receipt processing

Family Cites Families (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH0887495A (en) 1994-09-16 1996-04-02 Ibm Japan Ltd Cut amd paste method for table data and data processing system
US6167370A (en) 1998-09-09 2000-12-26 Invention Machine Corporation Document semantic analysis/selection with knowledge creativity capability utilizing subject-action-object (SAO) structures
US20140180883A1 (en) 2000-04-26 2014-06-26 Accenture Llp System, method and article of manufacture for providing tax services in a network-based tax architecture
US7058630B2 (en) 2002-08-12 2006-06-06 International Business Machines Corporation System and method for dynamically controlling access to a database
US8388440B2 (en) 2003-10-20 2013-03-05 Sony Computer Entertainment America Llc Network account linking
US7742958B1 (en) 2004-11-08 2010-06-22 Hrb Tax Group, Inc. System and method for preparing a tax return using electronically distributed tax return data
US20060107206A1 (en) 2004-11-12 2006-05-18 Nokia Corporation Form related data reduction
US8606665B1 (en) 2004-12-30 2013-12-10 Hrb Tax Group, Inc. System and method for acquiring tax data for use in tax preparation software
US7925553B2 (en) 2006-04-14 2011-04-12 Intuit Inc. System and method for preparing a tax liability projection
US7752092B1 (en) 2006-06-16 2010-07-06 Intuit Inc. System and method for indicating previous document source information for current document fields
USRE47533E1 (en) 2006-10-04 2019-07-23 Aaa Internet Publishing Inc. Method and system of securing accounts
US7818222B2 (en) 2006-11-30 2010-10-19 Hrb Innovations, Inc. Method and system for organizing tax information and providing tax advice
US8190499B1 (en) 2009-08-21 2012-05-29 Intuit Inc. Methods systems and articles of manufacture for collecting data for future electronic tax return
US20110255788A1 (en) 2010-01-15 2011-10-20 Copanion, Inc. Systems and methods for automatically extracting data from electronic documents using external data
US9558521B1 (en) 2010-07-29 2017-01-31 Intuit Inc. System and method for populating a field on a form including remote field level data capture
US8204805B2 (en) 2010-10-28 2012-06-19 Intuit Inc. Instant tax return preparation
US10249004B2 (en) 2010-10-01 2019-04-02 Hrb Tax Group, Inc. System, computer program, and method for online, real-time delivery of consumer tax services
US8452048B2 (en) 2011-02-28 2013-05-28 Intuit Inc. Associating an object in an image with an asset in a financial application
US8676689B1 (en) 2011-03-28 2014-03-18 Keith Whelan Financial status measurement and management tool
US8943096B2 (en) 2011-06-22 2015-01-27 Stone Vault, LLC Method and apparatus for storing, sharing, and/or organizing personal information
US11455350B2 (en) 2012-02-08 2022-09-27 Thomson Reuters Enterprise Centre Gmbh System, method, and interfaces for work product management
US9350599B1 (en) 2012-06-26 2016-05-24 Google Inc. User content access management and control
US20150178856A1 (en) 2013-12-20 2015-06-25 Alfredo David Flores System and Method for Collecting and Submitting Tax Related Information
US10339527B1 (en) 2014-10-31 2019-07-02 Experian Information Solutions, Inc. System and architecture for electronic fraud detection
US9922070B2 (en) 2015-05-04 2018-03-20 International Business Machines Corporation Maintaining consistency between a transactional database system and a non-transactional content repository for document objects
US10210580B1 (en) 2015-07-22 2019-02-19 Intuit Inc. System and method to augment electronic documents with externally produced metadata to improve processing
US20170178199A1 (en) 2015-12-22 2017-06-22 Intuit Inc. Method and system for adaptively providing personalized marketing experiences to potential customers and users of a tax return preparation system
US9672487B1 (en) 2016-01-15 2017-06-06 FinLocker LLC Systems and/or methods for providing enhanced control over and visibility into workflows where potentially sensitive data is processed by different operators, regardless of current workflow task owner
US10628495B2 (en) 2016-03-30 2020-04-21 Hrb Innovations, Inc. Document importation, analysis, and storage
US10592994B1 (en) 2016-05-31 2020-03-17 Intuit Inc. Orchestrating electronic signature, payment, and filing of tax returns
US11087411B2 (en) 2016-07-27 2021-08-10 Intuit Inc. Computerized tax return preparation system and computer generated user interfaces for tax topic completion status modifications
US10621678B1 (en) 2017-01-25 2020-04-14 Intuit Inc. Systems, methods and articles for automating access of tax documents for preparing an electronic tax return
US10482170B2 (en) 2017-10-17 2019-11-19 Hrb Innovations, Inc. User interface for contextual document recognition
US10489645B2 (en) * 2018-03-15 2019-11-26 Sureprep, Llc System and method for automatic detection and verification of optical character recognition data

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020037097A1 (en) * 2000-05-15 2002-03-28 Hector Hoyos Coupon recognition system
US20140161365A1 (en) * 2012-12-12 2014-06-12 Qualcomm Incorporated Method of Perspective Correction For Devanagari Text
US20160055376A1 (en) * 2014-06-21 2016-02-25 iQG DBA iQGATEWAY LLC Method and system for identification and extraction of data from structured documents
US10229314B1 (en) * 2015-09-30 2019-03-12 Groupon, Inc. Optical receipt processing

Also Published As

Publication number Publication date
US20200050848A1 (en) 2020-02-13
US20190286896A1 (en) 2019-09-19
US11232300B2 (en) 2022-01-25

Similar Documents

Publication Publication Date Title
US11232300B2 (en) System and method for automatic detection and verification of optical character recognition data
US10489645B2 (en) System and method for automatic detection and verification of optical character recognition data
US11816165B2 (en) Identification of fields in documents with neural networks without templates
CN109101469B (en) Extracting searchable information from digitized documents
US7668372B2 (en) Method and system for collecting data from a plurality of machine readable documents
US20210366055A1 (en) Systems and methods for generating accurate transaction data and manipulation
Clausner et al. The ENP image and ground truth dataset of historical newspapers
RU2760471C1 (en) Methods and systems for identifying fields in a document
US9454545B2 (en) Automated field position linking of indexed data to digital images
US20160253303A1 (en) Digital processing and completion of form documents
US9740995B2 (en) Coordinate-based document processing and data entry system and method
US11880435B2 (en) Determination of intermediate representations of discovered document structures
US11379690B2 (en) System to extract information from documents
CN112418812A (en) Distributed full-link automatic intelligent clearance system, method and storage medium
US11315353B1 (en) Systems and methods for spatial-aware information extraction from electronic source documents
US9047533B2 (en) Parsing tables by probabilistic modeling of perceptual cues
CN112418813A (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
CN111914729A (en) Voucher association method and device, computer equipment and storage medium
CN112926577B (en) Medical bill image structuring method and device and computer readable medium
US11521408B2 (en) Systems and methods for dynamic digitization and extraction of aviation-related data
US11335108B2 (en) System and method to recognise characters from an image
Mariner Optical Character Recognition (OCR)
US20240143632A1 (en) Extracting information from documents using automatic markup based on historical data
RU2774653C1 (en) Methods and systems for identifying fields in a document
WO2022254560A1 (en) Data matching using text data generated by optical character recognition

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

AS Assignment

Owner name: SUREPREP, LLC, CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:WYLE, DAVID;LINGINENI, SRINIVAS;HOSEK, WILL;SIGNING DATES FROM 20180725 TO 20180726;REEL/FRAME:050399/0019

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

MAFP Maintenance fee payment

Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

Year of fee payment: 4