US20210019511A1 - Systems and methods for extracting data from an image - Google Patents
- Publication number
- US20210019511A1 (application US17/039,628)
- Authority
- US
- United States
- Prior art keywords
- text
- line
- lines
- item
- ocr
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G06K9/00456—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G06K9/00463—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/19—Recognition using electronic means
- G06V30/191—Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
- G06V30/19173—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
-
- G06K2209/01—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- the present disclosure relates to data extraction and classification, and in particular, to systems and methods for extracting data from an image.
- Embodiments of the present disclosure pertain to systems and methods for extracting data from an image.
- a method of extracting data from an image comprises receiving, from an optical character recognition (OCR) system, OCR text in response to sending an image to the OCR system.
- OCR text comprises a plurality of lines of text. Each line of text is classified as either a line item or not a line item using a machine learning algorithm, and a plurality of data fields are extracted from each line of text classified as a line item.
- FIG. 1 illustrates an architecture for extracting data from an image according to one embodiment.
- FIG. 2 illustrates a method of extracting data from an image according to one embodiment.
- FIG. 3 illustrates an example of extracting data from a hotel folio image according to one embodiment.
- FIG. 4 illustrates a method of extracting data from an image according to another embodiment.
- FIG. 5 illustrates a method of extracting data from an image according to yet another embodiment.
- FIG. 6 illustrates hardware of a special purpose computing machine configured according to the above disclosure.
- an “image” refers to an electronic image, which may include electronic photographs or pictures stored in one of a variety of digital formats, for example.
- a mobile device 120 may include a camera 121 .
- Camera 121 may be used to take a picture and create an image 123 , which may be stored on mobile device 120 .
- the following description uses an example image of a hotel folio 101 to describe various aspects of the disclosure. However, it is to be understood that this is not the only embodiment that may use the features and techniques described herein.
- mobile device 120 includes an application 122 (aka “App”), which, when accessed, automatically accesses the camera.
- the App may be an “Expense App” that includes functionality for accessing the camera to take a picture of a receipt or folio and sending the image to a backend system, for example.
- the image 123 is sent to a backend software system that includes functionality for extracting data from the image.
- the backend software system may include a process controller component 110 , optical character recognition (OCR) component 111 (e.g., which may be local or remote), image repository 150 , data services 130 , an Expense application 140 , and one or more databases 160 .
- Process controller 110 may receive images from App 122, via email, or through a variety of other image transfer mechanisms (e.g., text, links, etc.).
- Process controller 110 may control storing images in repository 150, sending images to OCR system 111, interfacing with data services 130 that analyze data, and forwarding extracted data to application 140 and database 160, which process and store the data, respectively, so users can interact with the data through application 140, for example.
- some or all of the data sent to the application and database may be transformed at 112 .
- OCR system 111 may be a remote system provided by a third party, for example.
- Process controller 110 may send an image to OCR system 111 , and the OCR system returns OCR text, for example.
- One example OCR system performs character recognition and produces OCR text comprising a plurality of lines of text (e.g., lines of text that each end in a new line character, “\n”).
- OCR text may include all the characters in the image of the hotel folio arranged in lines of text followed by a new line character, for example, substantially based on how the characters appeared in the folio image (e.g., top to bottom/left to right, where lines comprise text appearing in the same row of the image left to right, and different lines are successive rows of text from the top to the bottom of the image).
- the lines of text from the OCR text may be classified using a trained machine learning model (e.g., a random forest model), where the model outputs specify that a particular input line of text is either a line item or not a line item.
- Line items are entries of a list describing elements of an aggregated whole. For example, line items may be entries in a hotel folio that specify a particular expense, such as a room charge, valet parking, room service, TV entertainment, or the like. In any given image, some portions of the image may correspond to line items, while other portions of the image may not correspond to line items. It can be challenging to automate a system to determine which elements of the image are line items and which are not.
- each line of text from the OCR text is classified, line by line, into one of two categories: is a line item or is not a line item.
- line items from a portion of an image may each contain the same data fields. Accordingly, once all the line items from the image are determined, a plurality of data fields may be extracted from each line of text classified as a line item. For example, as illustrated below, data fields for a date, an amount, a description, and even an expense type may be extracted once the line items are identified.
- OCR text is received from an optical character recognition (OCR) system, for example, in response to sending an image to the OCR system.
- the OCR text comprises a plurality of lines of text, which may be rows of characters recognized by the OCR system, for example.
- each line of text is classified as either a line item or not a line item using a machine learning algorithm.
- One example machine learning algorithm that may be used is a random forest model, for example.
- FIGS. 3-4 illustrate an example of extracting data from a hotel folio image according to one embodiment.
- an image 301 may be a hotel folio image including a name and address of the guest, name and address of the hotel, a header specifying columns for date, description, and amount, a series of line items for room, bar, TV, tax, parking, and resort fee, and a footer showing a credit card charge, for example.
- the image may be processed by an OCR system to produce recognized characters in OCR text 302 .
- OCR text is received at 401 .
- the image is transformed into lines of text followed by new lines “\n” for each line. For example, a top line has “Name Hotel \n”, an adjacent line below the top line has text from the address, the next line has text from the header, and so on down to the footer text line and any additional lines that might fall below the footer.
- Each line of text may be preprocessed and analyzed by a machine learning algorithm, such as a random forest model, for example.
- Each line of text may be preprocessed prior to classification.
- Example embodiments of classification, illustrated at 402 in FIG. 4, may include such preprocessing.
- the example line of text shall be “03-17-18 Room 79.95.”
- the text in each line may be normalized as illustrated at 403 in FIG. 4 .
- all numbers may be set to the same number (e.g., 03-17-18 may be set to 77-77-77 and 79.95 may be set to 77.77).
- all letters may be set to lower case (e.g., “Room” may be set to “room”).
- Normalization advantageously reduces the number of different patterns and may improve classification results, for example.
- a classification software component performs said classifying step, including said normalizing numbers step.
- the normalizing number step may occur as the lines of text are processed. Accordingly, a version of the line with the actual numeric values is retained.
- the numbers in the lines of text are not normalized when input to the data extracting process so that the actual data values may be extracted from the lines and stored in an application database, for example.
- the lines of text may be tokenized as illustrated at 404 in FIG. 4 .
- the line of text may be as follows “77-77-77 room 77.77” (where digits are normalized to “7” and alphabetical characters set to lower case).
- Tokens may be determined by setting each token to successive sequences of characters between each space (or whitespace). Thus, in this example, the following three (3) tokens are generated: “77-77-77,” “room,” and “77.77.”
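- The normalization and tokenization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the function names `normalize` and `tokenize` are hypothetical, and the choice of “7” as the replacement digit simply follows the example in the text.

```python
import re

def normalize(line: str) -> str:
    """Replace every digit with '7' and lower-case all letters."""
    return re.sub(r"\d", "7", line).lower()

def tokenize(line: str) -> list[str]:
    """Split a line into tokens on runs of whitespace."""
    return line.split()

raw_line = "03-17-18 Room 79.95"
normalized = normalize(raw_line)
tokens = tokenize(normalized)
# raw_line is left untouched so the actual values can still be extracted later.
print(normalized)  # 77-77-77 room 77.77
print(tokens)      # ['77-77-77', 'room', '77.77']
```

- Keeping `raw_line` alongside the normalized copy mirrors the point made above that a version of the line with the actual numeric values is retained for data extraction.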
- a term frequency-inverse document frequency (tf-idf) is determined for each of the plurality of tokens from each line of text. This is illustrated at 405 in FIG. 4 .
- the tf-idf may be performed per line and per token, for example.
- the tf-idf includes a plurality of parameters comprising a total number of lines of text, n, from a corpus of lines of text used to train the classification model, a term frequency specifying a number of times the term, t, shows up in a document, d, and a document frequency specifying a number of documents, d, that contain the term t.
- Documents in this example may be individual lines of text from the OCR text, and terms, t, are the tokens. Tf-idf for each token may be calculated as follows:
- t are terms (here, tokens)
- d are documents (e.g., here, individual lines from the OCR text)
- tf(t) is the term frequency equal to the number of times a term, t, appears in a document
- idf(t) is the inverse document frequency (e.g., the equation here is referred to as a “smooth” idf, but other similar equations could be used)
- df(d,t) is the document frequency equal to the number of documents in the training set that contain term t
- n is the total sample size of training documents, which in this example are all the lines of OCR text used to train the model.
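- The equation referred to above is not reproduced in this text. A plausible reconstruction is sketched here under a stated assumption: that the “smooth” idf takes the widely used form implemented, for example, by scikit-learn when `smooth_idf=True`, with n the total number of training lines. The exact expression and the logarithm base in the original figure may differ.

```latex
\text{tf-idf}(t, d) = \text{tf}(t) \cdot \text{idf}(t),
\qquad
\text{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \text{df}(d, t)}\right) + 1
```

- Adding one to the numerator and the denominator acts as if one extra document contained every term, which avoids division by zero for terms that never occur in the training set.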
- the system may not keep track of which lines came from which hotel folio, or how many lines a given hotel folio has. Rather, the system processes each line to determine if a line of OCR text is a line item or not as further illustrated below.
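- Because every line is treated as its own document, the per-token weights can be computed without any notion of which folio a line came from. The following sketch uses only the Python standard library and assumes a smooth idf of the form ln((1 + n) / (1 + df)) + 1; the names `fit_idf` and `tfidf`, the tiny corpus, and the decision to skip tokens absent from the training corpus are all assumptions made for illustration.

```python
import math
from collections import Counter

def fit_idf(training_lines: list[list[str]]) -> dict[str, float]:
    """Compute a smooth idf per token; each tokenized line is one document."""
    n = len(training_lines)
    df = Counter()
    for tokens in training_lines:
        df.update(set(tokens))  # count each token at most once per line
    return {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}

def tfidf(tokens: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Return tf(t) * idf(t) for each known token of a single line."""
    tf = Counter(tokens)
    return {t: c * idf[t] for t, c in tf.items() if t in idf}

# Hypothetical, already normalized and tokenized training lines.
corpus = [
    ["date", "description", "amount"],
    ["77-77-77", "room", "77.77"],
    ["77-77-77", "bar", "77.77"],
]
idf = fit_idf(corpus)
weights = tfidf(["77-77-77", "room", "77.77"], idf)
print(weights)
```

- In this toy corpus the token “room” appears in fewer lines than “77-77-77”, so it receives the larger weight.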
- the tf-idf of the plurality of tokens from each line of text are processed by classification component (or “classifier”) 304 using a trained classification model to produce an output for each line of text.
- Classifier 304 may determine if each line is/is not a line item based on the tf-idf of each token in each line as shown at 406 .
- the output of classifier 304 may have a first value (e.g., 1) corresponding to the line of text being a line item, and the output has a second value (e.g., 0) corresponding to the line of text being not a line item.
- the line with text “Date Description Amount \n” may be preprocessed, converted to three (3) tf-idf values for “date,” “description,” and “amount,” and input to classifier 304 .
- the output of classifier 304 may be one of two values corresponding to “is a line item” and “not a line item.”
- Tf-idf values for “date,” “description,” and “amount” may produce an output corresponding to “not a line item.”
- the line with text “03-17-18 Room 79.95” may be normalized and converted to three (3) tf-idf values for the tokens “77-77-77,” “room,” and “77.77,” and input to the classifier 304.
- the output of classifier 304 may correspond to “is a line item.” Similarly, all the lines of text are classified line by line. Each line may be associated with either “is a line item” or “not a line item” (e.g., the lines may be tagged).
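- The line-by-line tagging described above can be sketched as a loop that pairs each OCR line with a classifier output of 1 or 0. In this sketch the classifier is passed in as a callable, and `stand_in_classifier` is a hand-written placeholder so that the example runs on its own; it is not the trained random forest model, and all names here are hypothetical.

```python
import re
from typing import Callable

IS_LINE_ITEM, NOT_LINE_ITEM = 1, 0

def preprocess(line: str) -> list[str]:
    """Normalize digits to '7', lower-case, and split on whitespace."""
    return re.sub(r"\d", "7", line).lower().split()

def tag_lines(
    ocr_text: str, classify: Callable[[list[str]], int]
) -> list[tuple[str, int]]:
    """Pair every non-empty OCR line with the classifier output (1 or 0)."""
    tagged = []
    for line in ocr_text.split("\n"):
        if line.strip():
            # The original line is kept so the real values remain available.
            tagged.append((line, classify(preprocess(line))))
    return tagged

def stand_in_classifier(tokens: list[str]) -> int:
    """Hypothetical placeholder rule; a trained model would go here."""
    has_date = "77-77-77" in tokens
    has_amount = any(re.fullmatch(r"7+\.77", t) for t in tokens)
    return IS_LINE_ITEM if has_date and has_amount else NOT_LINE_ITEM

ocr_text = "Date Description Amount\n03-17-18 Room 79.95\n03-17-18 Bar 12.50\n"
for line, label in tag_lines(ocr_text, stand_in_classifier):
    print(label, line)
```

- Passing the classifier as a parameter keeps the tagging loop independent of the model, so a trained model could replace the placeholder without changing the loop.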
- FIG. 5 illustrates an example process flow for extracting data fields according to an embodiment.
- a center line in the lines of text is determined.
- the center line may be found by dividing the lines of text by two or finding a midpoint line (e.g., line N/2 in FIG. 3 ).
- the process moves up one line from the center line at 502 .
- the current line is classified as either “Header” or “Not a Header.” Classification may include preprocessing similar to that described above for determining a line item (e.g., normalizing and tokenizing).
- classification may use a logistic regression model as a machine learning model, for example, which returns one value (e.g., 1) corresponding to “Header” and another value (e.g., 0) corresponding to “Not a Header” as illustrated at 504. If the line is not a header, the process returns to 502, moves up another line, and classifies that line at 503. When a header is found, the process returns to the center line at 505. At 506, the process moves down one line from the center line.
- the current line is classified as either “Footer” or “Not a Footer.”
- Classification may include preprocessing similar to that described above for determining a line item (e.g., normalizing and tokenizing).
- classification may use a logistic regression model as a machine learning model, for example, which returns one value (e.g., 1) corresponding to “Footer” and another value (e.g., 0) corresponding to “Not a Footer” as illustrated at 508. If the line is not a footer, the process returns to 506, moves down another line, and classifies that line at 507. When a footer is found, the process examines the lines between the header and the footer.
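- The search outward from the center line can be sketched as two simple scans. In the sketch below, `find_bounds` is a hypothetical name, the header and footer tests are passed in as callables, and simple keyword checks stand in for the two trained logistic regression models. Returning `None` when no header or footer is found is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable, Optional

def find_bounds(
    lines: list[str],
    is_header: Callable[[str], bool],
    is_footer: Callable[[str], bool],
) -> tuple[Optional[int], Optional[int]]:
    """Return (header_index, footer_index), or None where no match is found."""
    center = len(lines) // 2

    header = None
    for i in range(center - 1, -1, -1):  # move up one line at a time
        if is_header(lines[i]):
            header = i
            break

    footer = None
    for i in range(center + 1, len(lines)):  # move down one line at a time
        if is_footer(lines[i]):
            footer = i
            break

    return header, footer

folio = [
    "Name Hotel",
    "123 Main Street",
    "Date Description Amount",
    "03-17-18 Room 79.95",
    "03-17-18 Bar 12.50",
    "03-17-18 Tax 9.25",
    "Credit Card Charge 101.70",
]
# Keyword checks stand in for the trained header and footer models.
header, footer = find_bounds(
    folio,
    is_header=lambda s: s.lower().startswith("date"),
    is_footer=lambda s: "credit card" in s.lower(),
)
candidates = folio[header + 1 : footer]
print(header, footer, candidates)
```

- Starting from the middle relies on the observation that line items tend to sit between the header and the footer, so the center line is likely to fall inside the region of interest.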
- Certain embodiments may include finding and appending hanging lines.
- a hanging line is illustrated in FIG. 3 where one data field, here the description “TV entertainment,” has been placed on a different line than another data field, here amount “21.00.”
- Embodiments of the disclosure may examine lines that have been identified as line items to determine if some, but not all, of the data fields are included. If a line identified as a line item has a plurality of expected data fields, but is missing one or more other data fields, then the process may examine the next line to determine if the missing data field is in the next line. If so, the line is determined to be a hanging line. Hanging lines between the header and footer are appended at 509 . Hanging lines are then processed again to determine if the lines are in fact line items as illustrated at 510 . Hanging lines may be normalized, tokenized, and classified using the techniques described above to determine if such lines are line items or not, for example.
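- Appending hanging lines can be sketched as a single pass over the lines between the header and footer. The sketch below handles only the case from the example, where the amount has dropped onto the following line; the regular expressions for dates and amounts and the name `append_hanging_lines` are assumptions made for illustration.

```python
import re

DATE = re.compile(r"\b\d{2}-\d{2}-\d{2}\b")  # assumed date format, e.g. 03-17-18
AMOUNT = re.compile(r"\b\d+\.\d{2}\b")       # assumed amount format, e.g. 21.00

def append_hanging_lines(lines: list[str]) -> list[str]:
    """Join a line that lacks an amount with a following line that supplies one."""
    merged = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else None
        missing_amount = DATE.search(line) and not AMOUNT.search(line)
        next_fills_gap = (
            nxt is not None and AMOUNT.search(nxt) and not DATE.search(nxt)
        )
        if missing_amount and next_fills_gap:
            merged.append(f"{line} {nxt}")  # append the hanging line
            i += 2
        else:
            merged.append(line)
            i += 1
    return merged

body = [
    "03-17-18 Room 79.95",
    "03-17-18 TV entertainment",
    "21.00",
    "03-17-18 Tax 9.25",
]
print(append_hanging_lines(body))
```

- As described above, the merged lines would then be normalized, tokenized, and classified again to confirm that they are in fact line items.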
- Identification of headers, footers, and hanging text is illustrated in FIG. 3 at 305 - 307 , for example.
- each line of text identified as a line item may have a date, description, and amount extracted from the line item.
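- Once a line has been classified as a line item, the date, description, and amount can be pulled from the original (non-normalized) text. The following is a minimal sketch using a regular expression; the pattern, the assumed field order, and the name `extract_fields` are illustrative assumptions, and the text does not state how the disclosed system performs this step.

```python
import re
from typing import Optional

# Assumed layout: date, free-text description, amount, separated by whitespace.
LINE_ITEM = re.compile(r"^\s*(\d{2}-\d{2}-\d{2})\s+(.+?)\s+(-?\d+\.\d{2})\s*$")

def extract_fields(line: str) -> Optional[dict[str, str]]:
    """Return date, description, and amount, or None if the layout does not match."""
    match = LINE_ITEM.match(line)
    if match is None:
        return None
    date, description, amount = match.groups()
    return {"date": date, "description": description, "amount": amount}

print(extract_fields("03-17-18 Room 79.95"))
# {'date': '03-17-18', 'description': 'Room', 'amount': '79.95'}
```

- Extraction works on the retained original line rather than the normalized copy, since the normalized copy no longer holds the actual values.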
- the line items may be processed by yet another classifier to determine an expense type, for example.
- Classification of each line item to determine expense type may include normalizing and tokenizing the line item text, and classifying tf-idfs for the tokens using a random forest model, for example, that performs a multi-class determination.
- the output corresponds to one of a plurality of expense types, for example.
- the classifier outputs corresponding to expense types are translated into FLI type keys (“Folio Line Items”), which may be translated to particular descriptions of expenses when sent to the backend application, for example.
- the extracted data may be sent to the backend application and stored in a database, for example.
- the classification model is trained using a corpus of lines of text, for example.
- Each line of text in the corpus of lines of text may be associated with an indicator specifying that a line of text is a line item or is not a line item, for example.
- the training may include normalizing numbers in each line of text in the corpus to a same value, tokenizing each line of text in the corpus to produce a plurality of training tokens, determining a term frequency-inverse document frequency (tf-idf) of the plurality of tokens from each line of text in the corpus; and processing the tf-idf of the plurality of training tokens from each line of text in the corpus using a classification model to produce the trained classification model.
- the model to determine line items is a random forest model.
- Header and footer classification may use separate models. Headers in a training set are tagged as “Header” and other lines in the corpus tagged with “Not Header” to train the “header” model. Similarly, footers in a training set are tagged as “Footer” and other lines in the corpus tagged with “Not Footer” to train the “footer” model, for example.
- the following hardware description is merely one example. It is to be understood that a variety of computer topologies may be used to implement the above-described techniques.
- An example computer system 610 is illustrated in FIG. 6 .
- Computer system 610 includes a bus 605 or other communication mechanism for communicating information, and one or more processor(s) 601 coupled with bus 605 for processing information.
- Computer system 610 also includes a memory 602 coupled to bus 605 for storing information and instructions to be executed by processor 601 , including information and instructions for performing some of the techniques described above, for example.
- Memory 602 may also be used for storing programs executed by processor(s) 601 .
- memory 602 may be, but is not limited to, random access memory (RAM), read only memory (ROM), or both.
- a storage device 603 is also provided for storing information and instructions. Common forms of storage devices include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash or other non-volatile memory, a USB memory card, or any other medium from which a computer can read.
- Storage device 603 may include source code, binary code, or software files for performing the techniques above, for example.
- Storage device 603 and memory 602 are both examples of non-transitory computer readable storage mediums.
- Computer system 610 may be coupled via bus 605 to a display 612 for displaying information to a computer user.
- An input device 611 such as a keyboard, touchscreen, and/or mouse is coupled to bus 605 for communicating information and command selections from the user to processor 601 .
- the combination of these components allows the user to communicate with the system.
- bus 605 represents multiple specialized buses for coupling various components of the computer together, for example.
- Computer system 610 also includes a network interface 604 coupled with bus 605 .
- Network interface 604 may provide two-way data communication between computer system 610 and a local network 620 .
- Network 620 may represent one or multiple networking technologies, such as Ethernet, local wireless networks (e.g., WiFi), or cellular networks, for example.
- the network interface 604 may be a wireless or wired connection, for example.
- Computer system 610 can send and receive information through the network interface 604 across a wired or wireless local area network, an Intranet, or a cellular network to the Internet 630 , for example.
- a browser may access data and features on backend software systems that may reside on multiple different hardware servers on-prem 631 or across the Internet 630 on servers 632 - 635 .
- servers 632 - 635 may also reside in a cloud computing environment, for example.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Multimedia (AREA)
- Medical Informatics (AREA)
- Computational Linguistics (AREA)
- Probability & Statistics with Applications (AREA)
- Algebra (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Character Discrimination (AREA)
- Character Input (AREA)
Abstract
Description
- This application is a continuation of U.S. patent application Ser. No. 16/011,554, entitled “Systems and Methods for Extracting Data from an Image,” filed Jun. 18, 2018, the entirety of which is incorporated herein by reference.
- The present disclosure relates to data extraction and classification, and in particular, to systems and methods for extracting data from an image.
- The proliferation of cameras and other electronic image capture devices has led to massive growth in the availability of images. For example, cameras can be found on almost all mobile devices, and such ready access to a camera allows users to capture an ever increasing amount of electronic images. Interestingly, images often contain data, and such data can be useful for a wide range of applications. However, extracting data from an image is no simple task. For example, an image of a receipt, such as a hotel receipt (or folio, a list of charges) may include data about the particular expenses incurred during a hotel stay. However, accurately extracting such data from the image is challenging. Accordingly, it would be advantageous to discover efficient and effective techniques for extracting data from electronic images.
- Embodiments of the present disclosure pertain to systems and methods for extracting data from an image. In one embodiment, a method of extracting data from an image comprises receiving, from an optical character recognition (OCR) system, OCR text in response to sending an image to the OCR system. The OCR text comprises a plurality of lines of text. Each line of text is classified as either a line item or not a line item using a machine learning algorithm, and a plurality of data fields are extracted from each line of text classified as a line item.
- The following detailed description and accompanying drawings provide a better understanding of the nature and advantages of the present disclosure.
- FIG. 1 illustrates an architecture for extracting data from an image according to one embodiment.
- FIG. 2 illustrates a method of extracting data from an image according to one embodiment.
- FIG. 3 illustrates an example of extracting data from a hotel folio image according to one embodiment.
- FIG. 4 illustrates a method of extracting data from an image according to another embodiment.
- FIG. 5 illustrates a method of extracting data from an image according to yet another embodiment.
- FIG. 6 illustrates hardware of a special purpose computing machine configured according to the above disclosure.
- In the following description, for purposes of explanation, numerous examples and specific details are set forth in order to provide a thorough understanding of the present disclosure. Such examples and details are not to be construed as unduly limiting the elements of the claims or the claimed subject matter as a whole. It will be evident to one skilled in the art, based on the language of the different claims, that the claimed subject matter may include some or all of the features in these examples, alone or in combination, and may further include modifications and equivalents of the features and techniques described herein.
- FIG. 1 illustrates an architecture for extracting data from an image according to one embodiment. As used herein, an “image” refers to an electronic image, which may include electronic photographs or pictures stored in one of a variety of digital formats, for example. As illustrated in FIG. 1, a mobile device 120 may include a camera 121. Camera 121 may be used to take a picture and create an image 123, which may be stored on mobile device 120. The following description uses an example image of a hotel folio 101 to describe various aspects of the disclosure. However, it is to be understood that this is not the only embodiment that may use the features and techniques described herein. In this example, mobile device 120 includes an application 122 (aka “App”), which, when accessed, automatically accesses the camera. The App may be an “Expense App” that includes functionality for accessing the camera to take a picture of a receipt or folio and sending the image to a backend system, for example.
- In this example, the image 123 is sent to a backend software system that includes functionality for extracting data from the image. The backend software system may include a process controller component 110, optical character recognition (OCR) component 111 (e.g., which may be local or remote), image repository 150, data services 130, an Expense application 140, and one or more databases 160. Process controller 110 may receive images from App 122, via email, or through a variety of other image transfer mechanisms (e.g., text, links, etc.). Process controller 110 may control storing images in repository 150, sending images to OCR system 111, interfacing with data services 130 that analyze data, and forwarding extracted data to application 140 and database 160, which process and store the data, respectively, so users can interact with the data through application 140, for example. In this example, some or all of the data sent to the application and database may be transformed at 112. In one embodiment, OCR system 111 may be a remote system provided by a third party, for example. Process controller 110 may send an image to OCR system 111, and the OCR system returns OCR text, for example. One example OCR system performs character recognition and produces OCR text comprising a plurality of lines of text (e.g., lines of text that each end in a new line character, “\n”).
- Features and advantages of the present disclosure include classifying each line of text as either a line item or not a line item using a machine learning algorithm. For example, in the case of hotel folios, it may be desirable to extract a number of specific data elements embedded in the image of a hotel folio. Accordingly, OCR text may include all the characters in the image of the hotel folio arranged in lines of text followed by a new line character, for example, substantially based on how the characters appeared in the folio image (e.g., top to bottom/left to right, where lines comprise text appearing in the same row of the image left to right, and different lines are successive rows of text from the top to the bottom of the image). The lines of text from the OCR text may be classified using a trained machine learning model (e.g., a random forest model), where the model outputs specify that a particular input line of text is either a line item or not a line item. Line items are entries of a list describing elements of an aggregated whole. For example, line items may be entries in a hotel folio that specify a particular expense, such as a room charge, valet parking, room service, TV entertainment, or the like. In any given image, some portions of the image may correspond to line items, while other portions of the image may not correspond to line items. It can be challenging to automate a system to determine which elements of the image are line items and which are not. In this example, each line of text from the OCR text is classified, line by line, into one of two categories: is a line item or is not a line item. In one embodiment, line items from a portion of an image may each contain the same data fields. Accordingly, once all the line items from the image are determined, a plurality of data fields may be extracted from each line of text classified as a line item. For example, as illustrated below, data fields for a date, an amount, a description, and even an expense type may be extracted once the line items are identified.
-
FIG. 2 illustrates a method of extracting data from an image according to one embodiment. At 201, OCR text is received from an optical character recognition (OCR) system, for example, in response to sending an image to the OCR system. The OCR text comprises a plurality of lines of text, which may be rows of characters recognized by the OCR system, for example. At 202, each line of text is classified as either a line item or not a line item using a machine learning algorithm. One example machine learning algorithm that may be used is a random forest model, for example. At 203, a plurality of data fields are extracted from each line of text classified as a line item. For example, if a line of text includes the characters “03-17-18 Room 79.95,” then the line of text may be classified as a line item and the following data fields extracted: date: “03-17-18,” description: “room,” amount: “79.95.” -
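The extraction step at 203 can be sketched for the example line above. This is a minimal illustration, not the disclosed implementation: the regular expression, the name `LINE_ITEM_PATTERN`, and the function `extract_fields` are assumptions that only cover lines shaped like "date description amount", and lower-casing the description simply mirrors the example output.

```python
import re

# Hypothetical pattern (an assumption): <MM-DD-YY> <description> <amount with two decimals>.
LINE_ITEM_PATTERN = re.compile(
    r"^\s*(?P<date>\d{2}-\d{2}-\d{2})\s+"
    r"(?P<description>.+?)\s+"
    r"(?P<amount>\d+\.\d{2})\s*$"
)


def extract_fields(line):
    """Return the date, description, and amount of a line item, or None if the line does not fit."""
    match = LINE_ITEM_PATTERN.match(line)
    if match is None:
        return None
    return {
        "date": match.group("date"),
        # Lower-cased to mirror the example in the text (description: "room").
        "description": match.group("description").lower(),
        "amount": match.group("amount"),
    }


fields = extract_fields("03-17-18 Room 79.95")
print(fields)
```

The original numeric values are kept as strings here, so a downstream application could decide how to parse dates and amounts.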
FIGS. 3-4 illustrate an example of extracting data from a hotel folio image according to one embodiment. In this example, an image 301 may be a hotel folio image including a name and address of the guest, name and address of the hotel, a header specifying columns for date, description, and amount, a series of line items for room, bar, TV, tax, parking, and resort fee, and a footer showing a credit card charge, for example. The image may be processed by an OCR system to produce recognized characters in OCR text 302. As illustrated in the process flow of FIG. 4, OCR text is received at 401. Referring again to FIG. 3, in this example the image is transformed into lines of text followed by new lines “\n” for each line. For example, a top line has “Name Hotel \n”, an adjacent line below the top line has text from the address, the next line has text from the header, and so on down to the footer text line and any additional lines that might fall below the header, for example. - Each line of text may be preprocessed and analyzed by a machine learning algorithm, such as a random forest model, for example. Each line of text may be preprocessed prior to classification. Example embodiments of classification, illustrated at 402 in
FIG. 4, may include such preprocessing. For the following description, the example line of text shall be “03-17-18 Room 79.95.” For example, in one embodiment the text in each line may be normalized as illustrated at 403 in FIG. 4. In one example normalization scheme, all numbers may be set to the same number (e.g., 03-17-18 may be set to 77-77-77 and 79.95 may be set to 77.77). As another example, all letters may be set to lower case (e.g., “Room” may be set to “room”). Normalization advantageously reduces the number of different patterns and may improve classification results, for example. In one embodiment, a classification software component performs said classifying step, including said normalizing numbers step. However, the normalizing number step may occur as the lines of text are processed. Accordingly, a version of the line with the actual numeric values is retained. Thus, the numbers in the lines of text are not normalized when input to the data extracting process so that the actual data values may be extracted from the lines and stored in an application database, for example. - In addition to normalization, the lines of text may be tokenized as illustrated at 404 in
FIG. 4. For example, after normalization, the line of text may be as follows: “77-77-77 room 77.77” (where digits are normalized to “7” and alphabetical characters are set to lower case). Tokens may be determined by splitting the line at each space (or other whitespace), so that each token is a successive sequence of characters between spaces. Thus, in this example, the following three (3) tokens are generated: “77-77-77,” “room,” and “77.77.” - After preprocessing, a term frequency-inverse document frequency (tf-idf) is determined for each of the plurality of tokens from each line of text. This is illustrated at 405 in
FIG. 4. The tf-idf may be performed per line and per token, for example. The tf-idf includes a plurality of parameters comprising a total number of lines of text, n, from a corpus of lines of text used to train the classification model, a term frequency specifying a number of times the term, t, shows up in a document, d, and a document frequency specifying a number of documents, d, that contain the term t. Documents in this example may be individual lines of text from the OCR text, and terms, t, are the tokens. Tf-idf for each token may be calculated as follows: -
Tf-idf(d,t) = tf(t) * idf(t), where idf(t) = log((1 + n)/(1 + df(d,t))) + 1, - Where t are terms (here, tokens), d are documents (e.g., here, individual lines from the OCR text), tf(t) is the term frequency equal to the number of times a term, t, appears in a document, idf(t) is the inverse document frequency (e.g., the equation here is referred to as a “smooth” idf, but other similar equations could be used), df(d,t) is the document frequency equal to the number of documents in the training set that contain term, t, and n is the total sample size of training documents, which in this example are all the lines of OCR text used to train the model, for example. In this example implementation, the system may not keep track of which lines came from which hotel folio, or how many lines a given hotel folio has. Rather, the system processes each line to determine if a line of OCR text is a line item or not as further illustrated below.
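The normalization, tokenization, and tf-idf steps described above (403-405) can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the four-line `corpus` are hypothetical, each corpus line is treated as one document as in the text, and the natural logarithm is assumed because the text does not state the base of the log.

```python
import math
import re
from collections import Counter


def normalize(line):
    """Set every digit to 7 and every letter to lower case."""
    return re.sub(r"\d", "7", line).lower()


def tokenize(line):
    """Split a line into tokens at whitespace."""
    return line.split()


def tf_idf(line, corpus_lines):
    """Return {token: tf * idf} for one line, using the smooth idf from the text.

    Each entry of corpus_lines is one document; n is the number of corpus lines.
    """
    n = len(corpus_lines)
    corpus_token_sets = [set(tokenize(normalize(doc))) for doc in corpus_lines]
    term_counts = Counter(tokenize(normalize(line)))
    scores = {}
    for token, tf in term_counts.items():
        # Document frequency: number of corpus lines that contain the token.
        df = sum(1 for tokens in corpus_token_sets if token in tokens)
        idf = math.log((1 + n) / (1 + df)) + 1  # natural log is an assumption
        scores[token] = tf * idf
    return scores


# Hypothetical training corpus of OCR lines (illustrative values only).
corpus = [
    "Date Description Amount",
    "03-17-18 Room 79.95",
    "03-17-18 Bar 12.50",
    "03-18-18 Parking 20.00",
]
scores = tf_idf("03-17-18 Room 79.95", corpus)
for token, score in scores.items():
    print(token, round(score, 4))
```

In this toy corpus the normalized date and amount tokens appear in three of the four lines, so they receive a lower weight than the rarer token "room".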
- Once the tf-idf values are determined, the tf-idf values of the plurality of tokens from each line of text are processed by a classification component (or “classifier”) 304 using a trained classification model to produce an output for each line of text.
Classifier 304 may determine if each line is/is not a line item based on the tf-idf of each token in each line as shown at 406. The output of classifier 304 may have a first value (e.g., 1) corresponding to the line of text being a line item, and a second value (e.g., 0) corresponding to the line of text being not a line item. For example, the line with text “Date Description Amount \n” may be preprocessed, converted to three (3) tf-idf values for “date,” “description,” and “amount,” and input to classifier 304. The output of classifier 304 may be one of two values corresponding to “is a line item” and “not a line item.” Tf-idf values for “date,” “description,” and “amount” may produce an output corresponding to “not a line item.” Next, the line with text “03-17-18 Room 79.95” may be converted to three (3) tf-idf values for the normalized tokens “77-77-77,” “room,” and “77.77,” and input to the classifier 304. In this case, the output of classifier 304 may correspond to “is a line item.” Similarly, all the lines of text are classified line by line. Each line may be associated with either “is a line item” or “not a line item” (e.g., the lines may be tagged). -
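The line-by-line tagging described above can be sketched as a loop that classifies a normalized copy of each line while retaining the original text for later extraction. The function `stand_in_classifier` is a hypothetical rule that inspects normalized tokens directly; it only stands in for the trained model, which in the text operates on tf-idf values, and the names `tag_lines` and `ocr_text` are likewise assumptions.

```python
import re


def normalize(line):
    """Set every digit to 7 and every letter to lower case."""
    return re.sub(r"\d", "7", line).lower()


def stand_in_classifier(tokens):
    """Hypothetical stand-in for the trained model: returns 1 (line item) or 0 (not a line item)."""
    has_date = any(re.fullmatch(r"77-77-77", token) for token in tokens)
    has_amount = any(re.fullmatch(r"7+\.77", token) for token in tokens)
    return 1 if has_date and has_amount else 0


def tag_lines(ocr_text, classify=stand_in_classifier):
    """Tag each non-empty OCR line with 1 or 0, keeping the un-normalized line."""
    tagged = []
    for raw in ocr_text.split("\n"):
        if not raw.strip():
            continue
        tokens = normalize(raw).split()
        # The original line is retained so actual values can be extracted later.
        tagged.append({"line": raw.strip(), "is_line_item": classify(tokens)})
    return tagged


ocr_text = "Name Hotel \nDate Description Amount \n03-17-18 Room 79.95\n03-17-18 Bar 12.50\n"
tagged = tag_lines(ocr_text)
for entry in tagged:
    print(entry["is_line_item"], entry["line"])
```

Passing the classifier as a parameter keeps the loop independent of the model, so a trained classifier could be substituted for the stand-in rule.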
FIG. 5 illustrates an example process flow for extracting data fields according to an embodiment. Referring to FIGS. 3 and 5, at 501 a center line in the lines of text is determined. For example, the center line may be found by dividing the number of lines of text by two or finding a midpoint line (e.g., line N/2 in FIG. 3). To find the header, the process moves up one line from the center line at 502. At 503, the current line is classified as either “Header” or “Not a Header.” Classification may include similar preprocessing as described above with respect to determining a line item (e.g., normalizing and tokenizing). In one embodiment, classification may use a logistic regression model as a machine learning model, for example, which returns one value (e.g., 1) corresponding to “Header” and another value (e.g., 0) corresponding to “Not a Header” as illustrated at 504. If the line is not a header, then the process returns to 502, where the system moves up another line, and classifies the next line at 503. When a header is found, the process returns to the center line at 505. At 506, the process moves down one line from the center line. At 507, the current line is classified as either “Footer” or “Not a Footer.” Classification may include similar preprocessing as described above with respect to determining a line item (e.g., normalizing and tokenizing). In one embodiment, classification may use a logistic regression model as a machine learning model, for example, which returns one value (e.g., 1) corresponding to “Footer” and another value (e.g., 0) corresponding to “Not a Footer” as illustrated at 508. If the line is not a footer, then the process returns to 506, where the system moves down another line, and classifies the next line at 507. When a footer is found, the process examines the lines between the header and the footer. - Certain embodiments may include finding and appending hanging lines. A hanging line is illustrated in
FIG. 3 where one data field, here the description “TV entertainment,” has been placed on a different line than another data field, here amount “21.00.” Embodiments of the disclosure may examine lines that have been identified as line items to determine if some, but not all, of the data fields are included. If a line identified as a line item has a plurality of expected data fields, but is missing one or more other data fields, then the process may examine the next line to determine if the missing data field is in the next line. If so, the line is determined to be a hanging line. Hanging lines between the header and footer are appended at 509. Hanging lines are then processed again to determine if the lines are in fact line items as illustrated at 510. Hanging lines may be normalized, tokenized, and classified using the techniques described above to determine if such lines are line items or not, for example. - Identification of headers, footers, and hanging text is illustrated in
FIG. 3 at 305-307, for example. - At 511, all the identified line items are then processed to extract data fields. For example, each line of text identified as a line item may have a date, description, and amount extracted from the line item. Additionally, the line items may be processed by yet another classifier to determine an expense type, for example. Classification of each line item to determine expense type may include normalizing and tokenizing the line item text, and classifying tf-idfs for the tokens using a random forest model, for example, that performs a multi-class determination. The output corresponds to one of a plurality of expense types, for example. In one embodiment, the classifier outputs corresponding to expense types are translated into FLI type keys (“Folio Line Items”), which may be translated to particular descriptions of expenses when sent to the backend application, for example. At 512, the extracted data may be sent to the backend application and stored in a database, for example.
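The header and footer search from the center line (501-508) and the appending of hanging lines (509) can be sketched as follows. This is a sketch under assumptions: `is_header` and `is_footer` are hypothetical keyword rules standing in for the trained header and footer models, the `DATE` and `AMOUNT` patterns are illustrative, and the hanging-line rule only handles the case from the text where an amount falls on the line after its description.

```python
import re

DATE = re.compile(r"^\d{2}-\d{2}-\d{2}")  # assumed date shape, e.g. 03-17-18
AMOUNT = re.compile(r"\d+\.\d{2}$")       # assumed amount shape, e.g. 21.00


def find_bounds(lines, is_header, is_footer):
    """Search up from the center line for a header, then down from it for a footer."""
    center = len(lines) // 2
    header = footer = None
    for i in range(center - 1, -1, -1):       # move up one line at a time
        if is_header(lines[i]):
            header = i
            break
    for i in range(center + 1, len(lines)):   # back at the center, move down
        if is_footer(lines[i]):
            footer = i
            break
    return header, footer


def append_hanging_lines(body):
    """Join a line that has a date but no amount with a following amount-only line."""
    merged, i = [], 0
    while i < len(body):
        line = body[i].strip()
        nxt = body[i + 1].strip() if i + 1 < len(body) else ""
        missing_amount = bool(DATE.search(line)) and not AMOUNT.search(line)
        next_has_amount = bool(nxt) and not DATE.search(nxt) and bool(AMOUNT.search(nxt))
        if missing_amount and next_has_amount:
            merged.append(line + " " + nxt)
            i += 2
        else:
            merged.append(line)
            i += 1
    return merged


# Hypothetical stand-ins for the header and footer classifiers.
def is_header(line):
    return line.lower().split()[:1] == ["date"]


def is_footer(line):
    return "credit card" in line.lower()


lines = [
    "Name Hotel", "Address", "Date Description Amount",
    "03-17-18 Room 79.95", "03-17-18 TV entertainment", "21.00",
    "03-17-18 Parking 20.00", "Credit card charge 120.95", "Thank you",
]
header, footer = find_bounds(lines, is_header, is_footer)
body = append_hanging_lines(lines[header + 1:footer])
print(header, footer, body)
```

After merging, each appended line would be classified again, as the text describes at 510, before any data fields are extracted.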
- In one embodiment, the classification model is trained using a corpus of lines of text, for example. Each line of text in the corpus of lines of text may be associated with an indicator specifying that a line of text is a line item or is not a line item, for example. The training may include normalizing numbers in each line of text in the corpus to a same value, tokenizing each line of text in the corpus to produce a plurality of training tokens, determining a term frequency-inverse document frequency (tf-idf) of the plurality of tokens from each line of text in the corpus; and processing the tf-idf of the plurality of training tokens from each line of text in the corpus using a classification model to produce the trained classification model.
- In one embodiment, the model to determine line items is a random forest model. Header and footer classification may use separate models. Headers in a training set are tagged as “Header” and other lines in the corpus tagged with “Not Header” to train the “header” model. Similarly, footers in a training set are tagged as “Footer” and other lines in the corpus tagged with “Not Footer” to train the “footer” model, for example.
-
FIG. 6 illustrates hardware of a special purpose computing machine configured according to the above disclosure. The following hardware description is merely one example. It is to be understood that a variety of computer topologies may be used to implement the above described techniques. An example computer system 610 is illustrated in FIG. 6. Computer system 610 includes a bus 605 or other communication mechanism for communicating information, and one or more processor(s) 601 coupled with bus 605 for processing information. Computer system 610 also includes a memory 602 coupled to bus 605 for storing information and instructions to be executed by processor 601, including information and instructions for performing some of the techniques described above, for example. Memory 602 may also be used for storing programs executed by processor(s) 601. Possible implementations of memory 602 may be, but are not limited to, random access memory (RAM), read only memory (ROM), or both. A storage device 603 is also provided for storing information and instructions. Common forms of storage devices include, for example, a hard drive, a magnetic disk, an optical disk, a CD-ROM, a DVD, a flash or other non-volatile memory, a USB memory card, or any other medium from which a computer can read. Storage device 603 may include source code, binary code, or software files for performing the techniques above, for example. Storage device 603 and memory 602 are both examples of non-transitory computer readable storage mediums. -
Computer system 610 may be coupled via bus 605 to a display 612 for displaying information to a computer user. An input device 611 such as a keyboard, touchscreen, and/or mouse is coupled to bus 605 for communicating information and command selections from the user to processor 601. The combination of these components allows the user to communicate with the system. In some systems, bus 605 represents multiple specialized buses for coupling various components of the computer together, for example. -
Computer system 610 also includes a network interface 604 coupled with bus 605. Network interface 604 may provide two-way data communication between computer system 610 and a local network 620. Network 620 may represent one or multiple networking technologies, such as Ethernet, local wireless networks (e.g., WiFi), or cellular networks, for example. The network interface 604 may be a wireless or wired connection, for example. Computer system 610 can send and receive information through the network interface 604 across a wired or wireless local area network, an Intranet, or a cellular network to the Internet 630, for example. In some embodiments, a browser, for example, may access data and features on backend software systems that may reside on multiple different hardware servers on-prem 631 or across the Internet 630 on servers 632-635. One or more of servers 632-635 may also reside in a cloud computing environment, for example. - The above description illustrates various embodiments of the present disclosure along with examples of how aspects of the particular embodiments may be implemented. The above examples should not be deemed to be the only embodiments, and are presented to illustrate the flexibility and advantages of the particular embodiments as defined by the following claims. Based on the above disclosure and the following claims, other arrangements, embodiments, implementations and equivalents may be employed without departing from the scope of the present disclosure as defined by the claims.
Claims (20)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US17/039,628 US20210019511A1 (en) | 2018-06-18 | 2020-09-30 | Systems and methods for extracting data from an image |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/011,554 US10824854B2 (en) | 2018-06-18 | 2018-06-18 | Systems and methods for extracting data from an image |
US17/039,628 US20210019511A1 (en) | 2018-06-18 | 2020-09-30 | Systems and methods for extracting data from an image |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/011,554 Continuation US10824854B2 (en) | 2018-06-18 | 2018-06-18 | Systems and methods for extracting data from an image |
Publications (1)
Publication Number | Publication Date |
---|---|
US20210019511A1 (en) | 2021-01-21 |
Family
ID=68839989
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/011,554 Active 2039-03-02 US10824854B2 (en) | 2018-06-18 | 2018-06-18 | Systems and methods for extracting data from an image |
US17/039,628 Abandoned US20210019511A1 (en) | 2018-06-18 | 2020-09-30 | Systems and methods for extracting data from an image |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/011,554 Active 2039-03-02 US10824854B2 (en) | 2018-06-18 | 2018-06-18 | Systems and methods for extracting data from an image |
Country Status (1)
Country | Link |
---|---|
US (2) | US10824854B2 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
RU2737720C1 (en) * | 2019-11-20 | 2020-12-02 | Общество с ограниченной ответственностью "Аби Продакшн" | Retrieving fields using neural networks without using templates |
CN112348472B (en) * | 2020-11-09 | 2023-10-31 | 浙江太美医疗科技股份有限公司 | Method, device and computer readable medium for inputting laboratory checklist |
IT202100018317A1 (en) * | 2021-07-12 | 2023-01-12 | Blu Srl | Method for the classification of documentation |
CN114120340A (en) * | 2021-10-21 | 2022-03-01 | 泰康保险集团股份有限公司 | Text image structured processing method and device |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7499588B2 (en) * | 2004-05-20 | 2009-03-03 | Microsoft Corporation | Low resolution OCR for camera acquired documents |
US8509534B2 (en) * | 2010-03-10 | 2013-08-13 | Microsoft Corporation | Document page segmentation in optical character recognition |
US8565474B2 (en) * | 2010-03-10 | 2013-10-22 | Microsoft Corporation | Paragraph recognition in an optical character recognition (OCR) process |
CN103455806B (en) * | 2012-05-31 | 2017-06-13 | 富士通株式会社 | Document processing device, document processing, document processing method and scanner |
US9626629B2 (en) * | 2013-02-14 | 2017-04-18 | 24/7 Customer, Inc. | Categorization of user interactions into predefined hierarchical categories |
- 2018-06-18: US16/011,554 (US10824854B2), Active
- 2020-09-30: US17/039,628 (US20210019511A1), Abandoned
Also Published As
Publication number | Publication date |
---|---|
US20190384972A1 (en) | 2019-12-19 |
US10824854B2 (en) | 2020-11-03 |
Legal Events
Code | Title | Description |
---|---|---|
STPP | Information on status: patent application and granting procedure in general | APPLICATION DISPATCHED FROM PREEXAM, NOT YET DOCKETED |
STPP | Information on status: patent application and granting procedure in general | DOCKETED NEW CASE - READY FOR EXAMINATION |
STPP | Information on status: patent application and granting procedure in general | NON FINAL ACTION MAILED |
STPP | Information on status: patent application and granting procedure in general | RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
STPP | Information on status: patent application and granting procedure in general | FINAL REJECTION MAILED |
STPP | Information on status: patent application and granting procedure in general | ADVISORY ACTION MAILED |
STCV | Information on status: appeal procedure | NOTICE OF APPEAL FILED |
STCV | Information on status: appeal procedure | APPEAL BRIEF (OR SUPPLEMENTAL BRIEF) ENTERED AND FORWARDED TO EXAMINER |
STCV | Information on status: appeal procedure | ON APPEAL -- AWAITING DECISION BY THE BOARD OF APPEALS |
STCV | Information on status: appeal procedure | BOARD OF APPEALS DECISION RENDERED |
STCB | Information on status: application discontinuation | ABANDONED -- AFTER EXAMINER'S ANSWER OR BOARD OF APPEALS DECISION |