CN111263943B - Semantic normalization in document digitization - Google Patents

Info

Publication number
CN111263943B
Authority
CN
China
Prior art keywords
key
candidate
document
class
keys
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201880069289.9A
Other languages
Chinese (zh)
Other versions
CN111263943A (en
Inventor
K·诺思罗普
C·特里姆
T·希克凯
A·阿德尼兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/109 Font handling; Temporal or kinetic typography
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/26 Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262 Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • G06V30/274 Syntactic or semantic context, e.g. balancing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables

Abstract

A method for normalizing keys in a document image. A candidate key corresponding to an object in the document image is identified as a key in key ontology data, based on the candidate key being semantically interchangeable with that key. The context, position, and style of each object in the document image are represented in document metadata. The candidate key is normalized to a normalized form. A key class corresponding to the normalized form is determined, and a confidence score is evaluated that indicates a likelihood that the key class is represented by the candidate key. Upon verification, a semantic database is updated with the key class to enhance the processing of future documents.

Description

Semantic normalization in document digitization
Technical Field
The present disclosure relates to document digitizing technology, and more particularly to a method, computer program product, and system for semantically normalizing keys appearing in document images.
Background
In conventional document processing, an ink-on-paper document is scanned page by page and prepared as corresponding visual images. The resulting document file is typically a series of visual images of pages. Each page image has objects representing words, phrases, sentences, and values in a variety of formats. The series of processes that identifies the data content of such visual objects and associates certain data content together, so as to produce computational data such as data field names and corresponding values in a relational database, is known as document digitizing or data extraction. The computational data can be accessed and further processed by many computer program applications. Given the amount of information held in conventional paper forms and scanned document images that is not yet computational, automatic and accurate data extraction from conventional documents can contribute significantly to industrial and social productivity.
Disclosure of Invention
One or more shortcomings of the prior art are addressed, and additional advantages are provided, through the provision of a method for normalizing keys in a document image, the method comprising: obtaining, by one or more processors of a computer, document metadata of the document image, wherein the document metadata includes a context, a position, and a style for each object appearing in the document image; identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key; normalizing the candidate key to a normalized form; determining a key class corresponding to the normalized form, wherein the key class is associated with the key in the key ontology data; evaluating a confidence score for the key class based on the document metadata, wherein the confidence score indicates a likelihood that the key class is represented by the candidate key; and updating a semantic database with the key class, based on verifying the key class according to a pre-configured verification manner, such that the key class can be effectively associated with semantically interchangeable text appearing in other document images.
Another aspect of the present invention provides a method for normalizing keys in a document image, comprising: identifying, by one or more processors of a computer, a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key; normalizing the candidate key to a normalized form; deriving one or more aliases of the candidate key from the normalized form, wherein the one or more aliases are not associated with the key in a semantic database; evaluating, based on document metadata of the document image, a respective confidence score for each of the one or more aliases, wherein the confidence score indicates a likelihood that each alias is represented by the candidate key; and updating the semantic database with the one or more aliases, based on verifying the one or more aliases according to a pre-configured verification manner, such that the one or more aliases can be effectively associated with text from other document images based on the text semantically matching the candidate key.
Another aspect of the invention provides a computer program product comprising a computer readable storage medium readable by one or more processors and storing instructions for execution by the one or more processors for performing a method for normalizing keys in a document image, the method comprising: obtaining document metadata of the document image, wherein the document metadata includes a context, a position, and a style for each object appearing in the document image; identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key; normalizing the candidate key to a normalized form; determining a key class corresponding to the normalized form, wherein the key class is associated with the key in the key ontology data; evaluating a confidence score for the key class based on the document metadata, wherein the confidence score indicates a likelihood that the key class is represented by the candidate key; and updating a semantic database with the key class, based on verifying the key class according to a pre-configured verification manner, such that the key class can be effectively associated with semantically interchangeable text appearing in other document images.
Another aspect of the invention provides a computer program product comprising a computer readable storage medium readable by one or more processors and storing instructions for execution by the one or more processors for performing a method for normalizing keys in a document image, the method comprising: identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key; normalizing the candidate key to a normalized form; deriving one or more aliases of the candidate key from the normalized form, wherein the one or more aliases are not associated with the key in a semantic database; evaluating, based on document metadata of the document image, a respective confidence score for each of the one or more aliases, wherein the confidence score indicates a likelihood that each alias is represented by the candidate key; and updating the semantic database with the one or more aliases, based on verifying the one or more aliases according to a pre-configured verification manner, such that the one or more aliases can be effectively associated with text from other document images based on the text semantically matching the candidate key.
Yet another aspect of the invention provides a system comprising: a memory; one or more processors in communication with the memory; and program instructions executable by the one or more processors via the memory to perform a method for normalizing keys in a document image, the method comprising: obtaining document metadata of the document image, wherein the document metadata includes a context, a position, and a style for each object appearing in the document image; identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key; normalizing the candidate key to a normalized form; determining a key class corresponding to the normalized form, wherein the key class is associated with the key in the key ontology data; evaluating a confidence score for the key class based on the document metadata, wherein the confidence score indicates a likelihood that the key class is represented by the candidate key; and updating a semantic database with the key class, based on verifying the key class according to a pre-configured verification manner, such that the key class can be effectively associated with semantically interchangeable text appearing in other document images.
Additional features are realized through the techniques set forth herein. Other embodiments and aspects, including but not limited to computer program products and systems, are described in detail herein and are considered a part of the claimed invention.
Drawings
One or more aspects of the present invention are particularly pointed out and distinctly claimed as examples in the claims at the conclusion of the specification. The foregoing and other objects, features, and advantages of the invention will become apparent from the following detailed description taken in conjunction with the accompanying drawings in which:
FIG. 1 depicts a system for semantically normalizing content during document digitization in accordance with one or more embodiments set forth herein;
FIG. 2 depicts a flowchart of operations performed by a semantic normalization engine according to one or more embodiments set forth herein;
FIG. 3 depicts an exemplary document image having objects subject to semantic normalization, as performed by a semantic normalization engine, in accordance with one or more embodiments set forth herein;
FIG. 4 depicts exemplary document metadata corresponding to a document image in accordance with one or more embodiments set forth herein;
FIG. 5 depicts exemplary inputs and outputs of a semantic normalization engine according to one or more embodiments set forth herein;
FIG. 6 illustrates a cloud computing node according to an embodiment of the present invention;
FIG. 7 illustrates a cloud computing environment according to an embodiment of the present invention; and
FIG. 8 illustrates an abstract model layer, according to an embodiment of the invention.
Detailed Description
FIG. 1 depicts a system 100 for semantically normalizing content during document digitization in accordance with one or more embodiments set forth herein.
For ease of computation using the data represented in a document, digital documents are generally preferred. When an ink-on-paper document is scanned, the result is a series of visual images of pages that is not yet usable as digital data. At the beginning of document processing, a document image is an unstructured collection of objects. Text, numbers, symbols, and combinations thereof shown in the objects are extracted as corresponding data. Some text and data may be associated to form key-value pairs in a relational database, so that the information in the document image becomes computational.
Because the time and cost required to manually digitize the image of a legacy document would be prohibitive, and because manual digitization may be imprecise and inconsistent due to human error factors and individual interpretation of words in the document, it is desirable to automate the process of accurately digitizing the document image to further utilize the data represented in the document image. For example, data from scanning invoices may be input to a relational database management system for searching and/or comparing against content from other documents of a database, website, or the like.
However, the many custom formats and unique organizations of documents, even among documents used for the same purpose, present challenges for existing document processing applications in processing document images and extracting computational data from them. In particular, during digitizing of a document, semantically interchangeable but distinct words used as data field names will be identified as separate data fields unless they are specifically normalized. For example, most invoices have an invoice number that identifies the invoice, an account number that identifies the customer receiving the invoice, and an order number that identifies the transaction for which the invoice is issued. One company may label these data fields "invoice No.", "account No.", and "serial No.", while another company may express the same data fields as "Inv_Num", "Accnt_Num", and "Purchase_Id". As described herein, certain embodiments of the present invention normalize various semantically interchangeable words from a number of document sources to normalized data field names, in order to accurately digitize the data associated with these semantically interchangeable words across various document formats so that the data is efficiently represented in a database. In this specification, normalized data field names are referred to as keys or key classes.
The system 100 includes a document digitizing engine 120. The document digitizing engine 120 receives a document image 181 from the user 101 via the user device 110. The document image 181 is an optically scanned visual image of a document, but the content of the document image 181 is not computational. For example, because the document image 181 does not carry any computational data, a scanned image of a paper document cannot be searched or read into another application as data input. The document image 181 has a plurality of objects corresponding to respective words, which can be extracted as computational data.
The document digitizing engine 120 includes a semantic normalization engine 160. The document digitizing engine 120 processes the document image 181 and determines document metadata 140, which specifies each object in the document image 181 by using parameters that are preconfigured for each object. An example of an object hierarchy within a document image is presented in FIG. 4. The parameters of the document metadata 140 may be, but are not limited to, object context 141, location coordinates 143, and object style 145. The document digitizing engine 120 generates one or more key-value pairs (KVP) 155 in a relational database (RDB) 150 by using keys, and the values corresponding to the keys, as represented in the document metadata 140. The document digitizing engine 120 reports to the user 101 the key class-confidence score tuples 191 produced by the semantic normalization performed by the semantic normalization engine 160. The user 101 may optionally provide feedback 199 on the key class-confidence score tuples 191, indicating whether the semantic database 130 is to be updated with the key class-confidence score tuples 191.
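As an illustration only, the per-object document metadata described above could be modeled as a small record. The class name, the field names, and the sample coordinates below are hypothetical and are not taken from the specification; the sketch merely shows how a key object and its value object might be paired into a KVP.

```python
from dataclasses import dataclass


@dataclass
class ObjectMetadata:
    """Metadata for one object of a document image (field names are hypothetical)."""
    text: str        # text extracted from the object, e.g. by an OCR tool
    context: str     # object context 141, e.g. the region the object sits in
    position: tuple  # location coordinates 143, assumed here as (x, y, width, height)
    style: dict      # object style 145, e.g. font type, size, bold


def build_key_value_pair(key_obj: ObjectMetadata, value_obj: ObjectMetadata) -> tuple:
    """Pair the text of a key object with the text of its value object."""
    return (key_obj.text, value_obj.text)


# Made-up sample objects: a data field name and the value printed to its right.
key_obj = ObjectMetadata("Inv_Num", "header", (40, 120, 80, 14), {"bold": True})
value_obj = ObjectMetadata("5123456789", "header", (130, 120, 90, 14), {"bold": False})
kvp = build_key_value_pair(key_obj, value_obj)
```

In this sketch the pairing is given explicitly; how the engine decides which value belongs to which key is not specified here.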
The document digitizing engine 120 is coupled to one or more external tools 170 and a semantic database 130. Examples of external tools 170 may include, but are not limited to, Optical Character Recognition (OCR) applications for capturing document metadata 140, language processing tools such as word classifiers and dictionaries for building the semantic database 130, and machine learning tools for training and improving the accuracy of the semantic database 130. The semantic database 130 includes one or more document categories 131, one or more key alias sets 135, and key ontology data 137.
Semantically similar variations of the keys 138 in the key ontology data 137 are stored in the one or more key alias sets 135, each corresponding to a key. When storing aliases in a key alias set 135, the document digitizing engine 120 examines the level of semantic similarity between the key and each alias, and discards any alias whose semantic similarity to the key is below the level of similarity configured as a threshold.
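The threshold-based filtering of aliases can be sketched as follows. The specification does not disclose how semantic similarity is measured, so this sketch substitutes a plain character-level ratio from Python's standard difflib module as a stand-in; the function names and the 0.5 threshold are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(key: str, alias: str) -> float:
    """Stand-in similarity: character-level ratio in [0, 1], not a true semantic measure."""
    return SequenceMatcher(None, key.lower(), alias.lower()).ratio()


def filter_aliases(key: str, candidate_aliases: list, threshold: float = 0.5) -> list:
    """Keep only the aliases whose similarity to the key meets the configured threshold."""
    return [alias for alias in candidate_aliases if similarity(key, alias) >= threshold]


kept = filter_aliases("account", ["Accnt", "Total"])
```

A character-level ratio cannot recognize cross-language or synonym aliases, which is one reason the specification also allows aliases to be associated with keys by other means.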
The key ontology data 137 is trained with supervised learning, by use of an external tool 170 operating a supervised machine learning tool, when semantic matches are found during processing by the document digitizing engine 120. Even when the semantic similarity between a key and an alias is not obvious, the alias may still be associated with the key in the key alias set 135 based on programmer input. Thus, semantic matches can be found across languages to support a broad base of known aliases.
In the semantic database 130, each document category of the one or more document categories 131 includes a class key set 133 that specifies the keys required for any document in that document category. The class key set 133 is uniquely defined by the user 101 so that the keys and corresponding KVPs in the RDB 150 are represented consistently throughout an application suite, making the KVPs available to each application without further conversion.
For example, when the document belongs to a purchase invoice category, the corresponding class key set may include, but is not limited to, a name, a transaction date, a list of items, a price of the items, a tax, and a total.
In the semantic database 130, each of the one or more key alias sets 135 includes the aliases corresponding to a key. As described above, a key is a normalized data field name to be used in the RDB 150, and all aliases in a key alias set 135 are semantically interchangeable with the key. Each alias is unique across all documents and may correspond to a class key set 133 via the document category of the key 138, as represented in the key ontology data 137.
The semantic database 130 includes key ontology data 137 that defines a set of constraints and meanings that model the knowledge domain represented by the document image 181. The key ontology data 137 includes respective attributes 139 of a plurality of keys that may exist in the document image 181. Examples of attributes 139 may include, but are not limited to, a document category associated with the key 138, a key class to which the key 138 belongs, and a data type and format of the value of the key 138.
The semantic normalization engine 160 automatically resolves the data field names that appear in the document image 181 by using parameters stored in the document metadata 140 and the semantic database 130, such as the relative context, relative style, and relative position of the data field names in relation to other objects in the document image 181. The semantic normalization engine 160 also utilizes various existing techniques, such as text matching, document classification, and vector space modeling, in order to increase the likelihood of correctly capturing various data field names as corresponding normalized keys. The detailed operation of the semantic normalization engine 160 is depicted in FIG. 2.
FIG. 2 depicts a flowchart of operations performed by the semantic normalization engine 160 of FIG. 1 according to one or more embodiments set forth herein.
Prior to block 210, the document digitizing engine 120 prepares the document category 131 and key ontology data 137 of the semantic database 130 based on the key-value pair specification of the RDB provided by the user 101 or by using machine learning based on previously processed document images. The document digitizing engine 120 classifies the document image 181 based on the data extracted from the document image 181, and determines the document type of the document image 181. When no candidate key from the document image 181 is found from the semantic database 130, the document digitizing engine 120 invokes the semantic normalization engine 160. If the document digitization engine 120 finds the exact text of a candidate key in one of the class key set 133, the key alias set 135, or the key ontology data 137, the document digitization engine 120 may not invoke the semantic normalization engine 160 because the candidate key is already established in the semantic database 130.
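The decision described above, invoking the semantic normalization engine only when the exact text of a candidate key is not already established, could look roughly like this. The function name, the in-memory sets, and the sample entries are hypothetical; the specification does not describe how the semantic database is stored or queried.

```python
def needs_semantic_normalization(candidate, class_key_set, key_alias_sets, ontology_keys):
    """Return True when the candidate's exact text is not yet in the semantic database."""
    known_texts = set(class_key_set) | set(ontology_keys)
    for aliases in key_alias_sets.values():
        known_texts.update(aliases)
    return candidate not in known_texts


# Made-up contents standing in for class key set 133, key alias sets 135,
# and the keys of key ontology data 137.
class_key_set = {"Invoice Number", "Account", "Total"}
key_alias_sets = {"Account": {"Accnt No.", "Accnt_Num"}}
ontology_keys = {"Invoice Number", "Account", "Total", "Tax"}

invoke_for_new = needs_semantic_normalization(
    "ACNT#", class_key_set, key_alias_sets, ontology_keys)
invoke_for_known = needs_semantic_normalization(
    "Accnt_Num", class_key_set, key_alias_sets, ontology_keys)
```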
In block 210, the semantic normalization engine 160 identifies candidate keys in the document image 181 input to the document digitizing engine 120 as semantically interchangeable with keys from the key ontology data 137. The semantic normalization engine 160 may identify more than one candidate key for a key found in the set of class keys 133 of the document category 131 of the document image 181. The semantic normalization engine 160 proceeds to block 220.
In this specification, identifying a key means that the value corresponding to that key is also identified, so that the key will be reflected as a key-value pair in the RDB 150. In one embodiment of the invention, the semantic normalization engine 160 examines the document metadata 140 and selects a group of key aliases and associated values from the document image 181 that resemble known keys from the semantic database 130 or the RDB 150, based on relative position and on relative style, which specifies color, font type, and text size. The semantic normalization engine 160 may semantically match the key text, as well as the values associated with keys whose data types are specified in the key ontology data 137.
The semantic matching function, as utilized by the semantic normalization engine 160, determines whether two input texts are semantically interchangeable. The semantic matching function is proprietary to the document digitizing engine 120, and its details are not presented in this specification.
In block 220, the semantic normalization engine 160 normalizes the candidate key identified in block 210 with respect to the key name specification by semantic matching and determines a key class corresponding to the candidate key. In some embodiments of the invention, the key name specification may be provided, for example, by the user 101, prepared by the document digitizing engine 120 based on the placement of candidate keys in the document image 181, an existing alias from the semantic database 130, or a combination thereof according to the configuration of the document digitizing engine 120. The semantic normalization engine 160 proceeds to block 230.
In one embodiment of the invention, the semantic normalization engine 160 determines whether the candidate key matches a key 138 in the key ontology data 137 or an alias corresponding to a key 138 in the key alias set 135. If the semantic normalization engine 160 does not find an exact match for the candidate key from the one or more keys from the semantic database 130, the semantic normalization engine 160 continues to determine if the one or more keys semantically match the candidate key. Candidate keys that are semantic matches of one or more key aliases may or may not be added to the key aliases set 135, depending on the confidence score as determined from block 240 and/or the user feedback as determined from block 250.
In one embodiment of the invention, for example for invoice number key-value pairs, the syntax is expressed in Extended Backus-Naur Form (EBNF) as follows:
letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" | "J" | "K" | "L" | "M"
| "N" | "O" | "P" | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z" | "a"
| "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q"
| "r" | "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z";
number = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8" | "9";
invoice number= "Inv", ("|" face "|" |_ "|.) and..;
value (invoice number) = "5", 9 * number;
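As a sketch of how the value rule above might be checked in practice, the following assumes the rule reads as the character "5" followed by exactly nine digits. The regular expression and the function name are illustrative only and are not part of the specification.

```python
import re

# Assumed reading of the value rule: "5" followed by exactly nine ASCII digits.
INVOICE_VALUE_PATTERN = re.compile(r"5[0-9]{9}")


def is_valid_invoice_value(text: str) -> bool:
    """Return True when the whole text conforms to the assumed value rule."""
    return INVOICE_VALUE_PATTERN.fullmatch(text) is not None


valid = is_valid_invoice_value("5123456789")
wrong_prefix = is_valid_invoice_value("4123456789")
too_short = is_valid_invoice_value("512345678")
```

A value that conforms to the expected data type and format is one of the signals that can support matching a candidate key to a known key.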
In block 230, the semantic normalization engine 160 learns all aliases represented by the normalized forms and/or key classes, as determined from block 220. In one embodiment of the invention, the semantic normalization engine 160 determines whether aliases for candidate keys derived from the normalized form and/or key class exist in the key alias set 135 corresponding to the keys 138 specified in the key ontology data 137. The semantic normalization engine 160 proceeds to block 240.
In block 240, the semantic normalization engine 160 determines a confidence score for the key class identified for the candidate key and/or the alias derived from the normalized form of block 220 based on comparing various attributes of the key specification specified in the semantic database 130 and/or the framework of the RDB150 to the candidate key. Attributes are represented in the contents of the semantic database 130 and the document metadata 140 respectively associated with the candidate keys. The semantic normalization engine 160 proceeds to block 250.
In one embodiment of the invention, the semantic normalization engine 160 determines the proximity between a candidate key and the key aliases, key classes, and/or keys defined in the key ontology data by comparing the character sequences of the candidate key and the key class. For example, suppose the semantic normalization engine 160 observes that a key/key class "Account" is defined in the key ontology data 137, that the candidate key "ACNT#" appears in the document image, and that the alias "Accnt No." is in the key alias set of the key class "Account". The semantic normalization engine 160 may then increase the confidence score of the candidate key "ACNT#" as an alias of the "Account" key, because the letters "A", "C", "N", and "T" of the candidate key appear in the same order as in the key "Account" among the keys 138.
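The character-order comparison in the "ACNT#" example can be sketched as an ordered-subsequence test. The function names, the case-insensitive comparison, and the fixed boost of 0.2 are assumptions for illustration; the specification does not state how much the confidence score is increased.

```python
def is_ordered_subsequence(candidate: str, key: str) -> bool:
    """True when the candidate's letters appear in the key in the same order."""
    letters = [ch.lower() for ch in candidate if ch.isalpha()]
    remaining = iter(key.lower())
    # `ch in remaining` advances the iterator, so relative order is enforced.
    return bool(letters) and all(ch in remaining for ch in letters)


def adjust_confidence(score: float, candidate: str, key: str, boost: float = 0.2) -> float:
    """Raise the score (capped at 1.0) when the ordered-letter test passes."""
    if is_ordered_subsequence(candidate, key):
        return min(1.0, score + boost)
    return score


boosted = adjust_confidence(0.5, "ACNT#", "Account")
unchanged = adjust_confidence(0.5, "Total", "Account")
```

Non-letter characters such as "#" are ignored here, which is a design choice of this sketch rather than something the specification prescribes.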
The semantic normalization engine 160 may also utilize the relative locations, contexts, and styles of the keys and new aliases in the document image to determine the similarity between the keys and the new aliases. For example, if the semantic normalization engine 160 observes that the key "account" is defined in the key ontology data 137 as appearing in bold font at the lower left corner of a document page with account name data fields, and the alias "ACCNT NO" appears at the lower left corner of a page from the document image 181 with "customer name", the semantic normalization engine 160 may increase the confidence score of the "ACCNT NO" as the alias of the "account" key.
In the same embodiment, the semantic normalization engine 160 may invoke a machine learning process to classify key aliases and respective text of keys to determine whether the text points to the same class. Furthermore, in the same embodiment of the present invention, the semantic normalization engine 160 may run the vector space modeling and topic modeling process to identify topics for document images, which may confirm or deny classification of document images. The semantic normalization engine 160 may use existing vector space modeling and topic modeling toolkits, such as Gensim implemented in the Python programming language.
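To make the idea of vector space modeling concrete without relying on any particular toolkit, the following minimal sketch compares two texts as bag-of-words vectors using cosine similarity. It is a simplified stand-in written with the Python standard library only; it does not use Gensim and does not perform topic modeling, and the function name and sample strings are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity of two texts represented as word-count vectors."""
    vec_a = Counter(text_a.lower().split())
    vec_b = Counter(text_b.lower().split())
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a.keys() & vec_b.keys())
    norm = (math.sqrt(sum(c * c for c in vec_a.values()))
            * math.sqrt(sum(c * c for c in vec_b.values())))
    return dot / norm if norm else 0.0


same_topic = cosine_similarity("tax invoice total", "tax invoice total")
partial = cosine_similarity("tax invoice", "tax receipt")
unrelated = cosine_similarity("tax invoice", "shipping label")
```

A higher score between the text of a document image and the typical vocabulary of a document category would tend to confirm the classification, while a low score would tend to deny it.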
In block 250, the semantic normalization engine 160 validates the key class and the alias derived from the normalized form/key class of block 230 and the associated confidence score determined in block 240 by reporting the key class and the alias, referred to as the key class confidence score tuple 191, to the user 101 for feedback 199. The semantic normalization engine 160 may update the semantic database 130 with new aliases represented in the key class-confidence score tuples 191 based on feedback 199 provided by the user 101. In the event that the user 101 does not provide feedback 199, the semantic normalization engine 160 can still update the semantic database 130 with content from the key class confidence score tuples 191 based on whether the confidence score meets a pre-configured threshold. The semantic normalization engine 160 terminates processing the candidate keys identified from block 220. The semantic normalization engine 160 may iterate through blocks 220 through 250 for all candidate keys identified from the document image 181.
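The validation step of block 250 can be sketched as a small decision function: explicit user feedback decides when it is given, and otherwise the confidence score is compared with a pre-configured threshold. The function name, the dictionary standing in for the semantic database, and the 0.8 threshold are assumptions for illustration.

```python
def validate_alias(alias_sets, key_class, alias, confidence, feedback=None, threshold=0.8):
    """Add the alias to the key class's alias set when it is validated.

    feedback: True or False when the user responded, None when no feedback was given.
    """
    accepted = feedback if feedback is not None else confidence >= threshold
    if accepted:
        alias_sets.setdefault(key_class, set()).add(alias)
    return accepted


alias_sets = {"Account": {"Accnt No."}}  # made-up stand-in for key alias sets 135
by_threshold = validate_alias(alias_sets, "Account", "ACNT#", confidence=0.9)
rejected_by_user = validate_alias(alias_sets, "Account", "ACT", confidence=0.95, feedback=False)
approved_by_user = validate_alias(alias_sets, "Total", "Sum", confidence=0.3, feedback=True)
```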
In particular embodiments of the invention, the semantic normalization engine 160 may have a presentation protocol that configures how particular aliases are presented for particular ranges of the confidence score. For example, the presentation protocol may show: an alias that semantically matches the key, in a green box on the document image; an alias that semantically matches a threshold portion or more of the key, in an orange box on the document image; and an alias that semantically matches less than the threshold portion of the key, and is therefore found not to match the key, in a red box on the document image. The user 101 may enter the threshold portion used to determine a semantic match, for example, half, i.e., 50%.
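The color-coded presentation protocol could be sketched as a mapping from a matched portion to a box color. The notion of a numeric matched portion between 0 and 1, the function name, and the default threshold of 0.5 (the 50% example above) are assumptions made for this illustration.

```python
def box_color(matched_portion: float, threshold: float = 0.5) -> str:
    """Map the matched portion of a key to the box color used on the document image."""
    if matched_portion >= 1.0:
        return "green"   # full semantic match
    if matched_portion >= threshold:
        return "orange"  # matches the threshold portion or more
    return "red"         # below the threshold: treated as not matching


colors = [box_color(p) for p in (1.0, 0.6, 0.5, 0.2)]
```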
In some embodiments of the invention, the semantic normalization engine 160 may have another presentation protocol that indicates how many keys in the class key set 133 have been found in the document image by using aliases derived from the normalized form, as well as pre-existing keys in the key alias set 135 and keys in the key ontology data 137. If all keys in the class key set 133 are present in the document image 181, the semantic normalization engine 160 may mark the boundaries of the document, as well as each key/alias, with a green box. If some keys in the class key set 133 are present in the document image 181 by semantic matching only and the confidence score of the alias is less than perfect (100%), the semantic normalization engine 160 may mark the boundaries of the word/document image with a corresponding orange box. If some keys in the class key set 133 are missing, the semantic normalization engine 160 may mark the boundaries of the word/document image with a corresponding red box.
FIG. 3 depicts an exemplary document image 300 having objects subject to semantic normalization, as performed by the semantic normalization engine 160 of FIG. 1, in accordance with one or more embodiments set forth herein.
The exemplary document image 300 depicts a tax invoice issued by an organization. As described herein, the document digitizing engine 120 processes the document image 300 and extracts data from various objects of the document image. It should be understood that: objects with solid-line boundaries represent keys of key-value pairs (KVPs) specified by the user 101, for example by configuring the RDB 150 or by specifying keys in the key ontology data 137; objects with dash-dot boundaries to the right of a key represent the values of KVPs; and objects with dashed boundaries each represent one or more key-value pairs.
The document image 300 has a first area 301 in the upper left corner representing an organization name and a logo. The document image 300 has a second area 302 in the upper right corner that represents contact information of the organization, such as an address, a phone number, a website Uniform Resource Locator (URL), and an email address. The document image 300 also has a third area 303 in the upper right corner below the second area 302, which represents a document title that identifies the document type, i.e., "TAX INVOICE". According to the predefined invoice/tax invoice document category in the semantic database 130, the tax invoice document category has a class key set that specifies the keys required for any document of the tax invoice document category.
Document image 300 has, side by side, a "SOLD TO" object 310K and a "_NAME & ADDRESS_" object 310V. A description enclosed in underscores indicates a placeholder for data of the corresponding type, which is not presented for the sake of brevity. The document digitizing engine 120 associates the objects 310K and 310V as the key and the value of a data field based on various parameters adapted to determine key-value pairs (KVPs) from adjacent objects in the document image. Similarly, the document digitizing engine 120 determines that: the "INVOICE#" object 320K acts as a key, with the corresponding value represented by object 320V; the "ORDER#" object 330K acts as a key, with the corresponding value represented by the "_ODN_" object 330V; and the "DATE" object 340K acts as a key, with the corresponding value represented by the "_MM/DD/YY_" object 340V.
The document image 300 has a fourth area 304, "PRP CODE", representing a proprietary code key and a proprietary code value. The document image 300 has a fifth area 305, "ITEMS LIST", describing the list of items covered by the tax invoice of the document image 300. The list of items may be represented in a tabular format, wherein each item in the list is described by respective attributes such as item code, item description, quantity, price adjustment, Goods and Services Tax (GST), and item total. The document image 300 has a sixth area 306, "INVOICE SUMMARY", which represents various details summarizing the tax invoice, such as the subtotal, the GST subtotal, and the total payable, produced by summing the prices of the items in the list of items.
The document image 300 has an area labeled "Payment Details" in the lower left corner of the image. In "Payment Details", the "BANK" object 350K and the "_B_CODE_" object 350V serve as a key-value pair for bank information. Similarly, the document digitizing engine 120 determines that: the "ACCOUNT NAME" object 360K acts as a key, with the corresponding value represented by the "_A_NAME_" object 360V, describing the account name data field; and the "ACCOUNT" object 370K acts as a key, with the corresponding value represented by the "573093486" object 370V, describing the account data field, according to the data type of the value as specified in the key ontology data 137.
The document digitizing engine 120 processes the document image 300 and creates the component fields of the document metadata 140. For example, the document digitizing engine 120 finds that both the invoice number (320K, 320V) and the order number (330K, 330V) appear in similar blocks that are isolated at their relative positions within the document image 300. Their respective positions, vertically above and horizontally to the right of the center line, are recorded in the document metadata 140 as the position coordinates 143 of the respective objects. Similarly, the object context 141, position coordinates 143, and object style 145 associated with each object are represented in the document metadata 140, which is generated by the document digitizing engine 120 and serves as input to the semantic normalization engine 160. Some components of the document metadata 140 are shown in FIG. 4.
In some embodiments of the present invention, if the semantic normalization engine 160 obtains a new document image instance for a document category, the semantic normalization engine 160 may develop a grammar for the document category that represents all class keys based on the document metadata, including: the context in which certain KVPs/objects appear in a certain space; the relative position of each object in the document image; and the specific style corresponding to each object. The document category grammar may be expressed in an EBNF-like notation. As the semantic normalization engine 160 cumulatively processes document images, the document category grammar may be learned and trained by using supervised machine learning. If the semantic normalization engine 160 obtains a new document image instance for a document category that already has a document category grammar, the semantic normalization engine 160 may add the document metadata, including object context, relative position, and object style, to the training data for the document category grammar.
As more invoices are processed by the semantic normalization engine 160, certain elements of the document category grammar may be reinforced as they are supported by actual invoices. When the evidence contrary to the document category grammar exceeds a threshold number, certain elements of the document category grammar may be discarded. For example, the document category grammar may specify that the invoice number data field appears in the context of other data fields, including order number, date of purchase, and amount due, near the top of the document image, in a font 10% larger than other text. The document category grammar may also specify that the total data field appears in the context of shipping and handling, tax, and purchase price, near the lower right corner of the document image, as a dollar amount in bold.
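The reinforcement and discarding of such elements can be illustrated with simple evidence counters. This is a deliberately simplified sketch and not the supervised machine learning contemplated above; the class name, the element descriptions, the sample observations, and the threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GrammarElement:
    """One learned expectation about a document category (hypothetical)."""
    description: str
    supporting: int = 0  # invoices consistent with the element
    contrary: int = 0    # invoices contradicting the element


def observe(element, consistent):
    """Record one processed invoice as supporting or contrary evidence."""
    if consistent:
        element.supporting += 1
    else:
        element.contrary += 1


def prune(elements, max_contrary):
    """Discard elements whose contrary evidence exceeds the threshold number."""
    return [e for e in elements if e.contrary <= max_contrary]


near_top = GrammarElement("invoice number appears near the top")
bold_total = GrammarElement("total appears in bold")

# Four hypothetical invoices: (invoice number near top?, total in bold?)
for seen_near_top, seen_bold in [(True, False), (True, False),
                                 (True, True), (True, False)]:
    observe(near_top, seen_near_top)
    observe(bold_total, seen_bold)

kept = prune([near_top, bold_total], max_contrary=2)
```

With these observations the first element is reinforced four times and kept, while the second accumulates three contrary observations, exceeds the assumed threshold of two, and is discarded.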
FIG. 4 depicts exemplary document metadata 400 corresponding to a document image 181 in accordance with one or more embodiments set forth herein.
Document digitizing engine 120 processes document image 181 and generates document metadata 140. In some embodiments of the present invention, the document digitizing engine 120 generates the document metadata 140 in JavaScript Object Notation (JSON) format, as shown in the exemplary document metadata 400. The document image 181 is hierarchically organized into one or more blocks, each comprising one or more lines. Each line has one or more words. Each block, line, and word may be considered a corresponding object within document image 181, the attributes of which are described in the document metadata 140.
Line L401 indicates that the list describes blocks, as represented by "blockList". Lines L402 and L403 represent the (x, y) coordinates of the start point of the block. Line L404 indicates that no comment is attached to the block. Line L405 indicates that the block has a certain width. Line L406 indicates that the block has lines, as represented by "LineList".
Line L407 indicates that the line in "LineList" has words, as represented by "WordList". Line L408 indicates that the word has the value "XYZ inc", and lines L409 and L410 represent the height and density of the word, respectively. Lines L411 and L412 represent the (x, y) coordinates of the start point of the word. Line L413 represents the font size of the word according to a particular custom font size group. Line L414 indicates that the word is identified by the name "word_0". Line L415 indicates that the word has eight (8) characters, and line L416 indicates that the word has a certain width. The measurements may be in pixels or in any other custom unit.
Lines L417 through L421 conclude the line in "LineList" introduced in line L406. They give the width of the line in line L417, the (x, y) coordinates of the start point of the line in lines L418 and L419, the height of the line in line L420, and the name "line_0" identifying the line in line L421.
The context of an object is represented by how each object appears together in a list. The relative position and size of the object may be determined based on various coordinates and size elements such as height and width. The document metadata 140 is used as input to the semantic normalization engine 160, particularly to evaluate confidence scores for the likelihood that candidate keys are aliases of known keys.
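The block, line, and word hierarchy described above can be sketched as follows. Only the names "blockList", "LineList", "WordList", "word_0", "line_0", and the value "XYZ inc" come from the description of FIG. 4; every other field name and every numeric value in the sample is an assumption made for illustration.

```python
import json

# Hypothetical metadata sample; field names other than "blockList",
# "LineList" and "WordList" are assumed, as are all numeric values.
METADATA_JSON = """
{
  "blockList": [
    {
      "x": 10, "y": 20, "width": 300, "name": "block_0",
      "LineList": [
        {
          "WordList": [
            {"value": "XYZ inc", "height": 12, "x": 10, "y": 20,
             "name": "word_0", "width": 64}
          ],
          "width": 64, "x": 10, "y": 20, "height": 12, "name": "line_0"
        }
      ]
    }
  ]
}
"""


def iter_words(metadata):
    """Yield (block name, line name, word dict) for every word, in order."""
    for block in metadata["blockList"]:
        for line in block["LineList"]:
            for word in line["WordList"]:
                yield block["name"], line["name"], word


def bounding_box(obj):
    """Return (left, top, right, bottom) from a start point and a size."""
    return (obj["x"], obj["y"],
            obj["x"] + obj["width"], obj["y"] + obj["height"])


metadata = json.loads(METADATA_JSON)
words = list(iter_words(metadata))
```

Because each word is yielded together with the names of its enclosing block and line, the context of an object (which objects share its list) and its relative position (from the bounding boxes) can both be derived from the same traversal.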
FIG. 5 depicts an exemplary input and output 500 of the semantic normalization engine 160 according to one or more embodiments set forth herein.
As in FIG. 4, the exemplary input and output 500 are represented in JSON. Lines L501 through L509 are the input for a candidate key from the document image 181, and lines L511 through L521 are the output produced after the semantic normalization engine 160 processes the candidate key from that input.
Line L502 indicates that the candidate key is a member of the block identified by the "block_16" name. As shown in FIG. 4 above, "block_16" is specified in the document metadata for context, location, and style. Line L503 indicates that the candidate key has a value of "573093486". Lines L504 and L505 represent the (x, y) coordinates of the starting point of the value of L503. Lines L506 and L507 represent the (x, y) coordinates of the start point of the candidate key. Line L508 indicates that the candidate key has the text "Accnt No".
In processing the input of lines L501 through L509, the semantic normalization engine 160 first looks at the key ontology data for matches, and then normalizes the text "Accnt No" for the key name specification by semantic matching and key alias searching. The semantic normalization engine 160 derives new aliases for the candidate keys from the normalized form. The semantic normalization engine 160 determines key classes for candidate keys and determines confidence scores on the key classes by examining the context, location, and style as specified in the document metadata.
Lines L512 through L518 are identical to lines L502 through L508 of the input, respectively. As in line L508, line L518 shows the candidate key text "Accnt No". Line L520 indicates that the semantic normalization engine 160 determines that the key class "CustomerAccountNumber" corresponds to the candidate key "Accnt No" after normalization. Line L519 indicates that the semantic normalization engine 160 determines a likelihood of 82.35% that "CustomerAccountNumber" is the key class corresponding to the candidate key "Accnt No", based on the context, relative position, and style represented in the document metadata, as well as text ordering, semantic matching, vector space modeling, and text classification.
In some embodiments, the semantic normalization engine 160 determines the key class "CustomerAccountNumber" of the candidate key "Accnt No" based on the objects grouped in "block_16", in which the candidate key appears, and the relative style of the candidate key "Accnt No" as compared to other objects in the same block. The semantic normalization engine 160 compares the text similarity between the candidate key "Accnt No" and all key classes, determines that the key class "CustomerAccountNumber" is closest to the candidate key "Accnt No", and takes the level of closeness as the confidence score.
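The text similarity comparison alone can be sketched with the `difflib` module of the Python standard library. This is a crude character-level proxy chosen for illustration, not the engine's actual scoring: it ignores context, position, and style, so it does not reproduce the 82.35% figure of FIG. 5. The key class is spelled here as `CustomerAccountNumber`, and the other key class names in the list are hypothetical.

```python
import difflib
import re


def normalize(text):
    """Lowercase the text and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def closest_key_class(candidate, key_classes):
    """Return (key class, similarity ratio) for the best textual match."""
    target = normalize(candidate)
    scored = [
        (difflib.SequenceMatcher(None, target, normalize(name)).ratio(), name)
        for name in key_classes
    ]
    score, best = max(scored)
    return best, score


# Hypothetical key classes for the tax invoice document category.
KEY_CLASSES = ["CustomerAccountNumber", "InvoiceNumber", "OrderNumber", "Date"]

best, score = closest_key_class("Accnt No", KEY_CLASSES)
```

Here `best` is `"CustomerAccountNumber"`, but the margin over unrelated short names is small, which suggests why the embodiments combine text similarity with the context, position, and style recorded in the document metadata.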
FIGS. 6-8 depict various aspects of computing, including cloud computing systems, in accordance with one or more aspects set forth herein.
It should be understood at the outset that although the present disclosure includes a detailed description of cloud computing, implementation of the technical solutions recited therein is not limited to cloud computing environments, but rather can be implemented in connection with any other type of computing environment now known or later developed.
Cloud computing is a service delivery model for enabling convenient, on-demand network access to a shared pool of configurable computing resources. Configurable computing resources, such as networks, network bandwidth, servers, processing, memory, storage, applications, virtual machines, and services, are resources that can be rapidly provisioned and released with minimal management effort or interaction with a provider of the service. This cloud model may include at least five characteristics, at least three service models, and at least four deployment models.
The characteristics include:
On-demand self-service: a cloud consumer can unilaterally and automatically provision computing capabilities, such as server time and network storage, as needed without requiring human interaction with the service provider.
Broad network access: capabilities are available over a network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, laptops, and personal digital assistants (PDAs)).
Resource pooling: the provider's computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to demand. There is a sense of location independence in that the consumer generally has no control over or knowledge of the exact location of the provided resources, but may be able to specify location at a higher level of abstraction (e.g., country, state, or data center).
Rapid elasticity: capabilities can be rapidly and elastically provisioned, in some cases automatically, to quickly scale out, and rapidly released to quickly scale in. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be obtained in any quantity at any time.
Measured service: cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the service.
The service models are as follows:
Software as a Service (SaaS): the capability provided to the consumer is to use the provider's applications running on a cloud infrastructure. The applications are accessible from various client devices through a thin client interface such as a web browser (e.g., web-based email). The consumer does not manage or control the underlying cloud infrastructure, including networks, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings.
Platform as a service (PaaS): the capability provided to the consumer is to deploy consumer created or obtained applications on the cloud infrastructure, which are created using programming languages and tools supported by the provider. The consumer does not manage nor control the underlying cloud infrastructure, including the network, server, operating system, or storage, but has control over the applications it deploys, and possibly also over the application hosting environment configuration.
Infrastructure as a Service (IaaS): the capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources on which the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure, but has control over operating systems, storage, and deployed applications, and possibly limited control over select networking components (e.g., host firewalls).
The deployment models are as follows:
Private cloud: the cloud infrastructure is operated solely for an organization. It may be managed by the organization or a third party and may exist on-premises or off-premises.
Community cloud: the cloud infrastructure is shared by several organizations and supports a specific community that has shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be managed by the organizations or a third party and may exist on-premises or off-premises.
Public cloud: the cloud infrastructure is made available to the general public or a large industry group and is owned by an organization selling cloud services.
Hybrid cloud: the cloud infrastructure is a composition of two or more clouds (private, community, or public) that remain unique entities but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).
A cloud computing environment is service-oriented, with a focus on statelessness, low coupling, modularity, and semantic interoperability. At the heart of cloud computing is an infrastructure that includes a network of interconnected nodes.
Referring now to FIG. 6, there is shown a schematic diagram of an example of a computer system/cloud computing node. Cloud computing node 10 is only one example of a suitable cloud computing node and is not intended to suggest any limitation as to the scope of use or functionality of embodiments of the invention described herein. Regardless, cloud computing node 10 is capable of being implemented and/or performing any of the functions set forth above.
In the cloud computing node 10, there is a computer system 12, which is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known computing systems, environments, and/or configurations that may be suitable for use with computer system 12 include, but are not limited to, personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems or devices, and the like.
Computer system 12 may be described in the general context of computer system-executable instructions, such as program processes, being executed by a computer system. Generally, program processes may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer system 12 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program processes may be located in both local and remote computer system storage media including memory storage devices.
As shown in FIG. 6, computer system 12 in cloud computing node 10 is shown in the form of a general purpose computing device. Components of computer system 12 may include, but are not limited to, one or more processors 16, a system memory 28, and a bus 18 that couples various system components including system memory 28 to processors 16.
Bus 18 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Computer system 12 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer system 12 and includes both volatile and nonvolatile media, removable and non-removable media.
The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer system 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be provided for reading from and writing to a non-removable, non-volatile magnetic medium (not shown, and commonly referred to as a "hard disk drive"). Although not shown, a magnetic disk drive for reading from and writing to a removable, non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable, non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 18 through one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
By way of example, and not limitation, one or more programs 40 having a set (at least one) of program processes 42, as well as an operating system, one or more application programs, other program processes, and program data, may be stored in memory 28. Each of the operating system, one or more application programs, other program processes, and program data, or some combination thereof, may include an implementation of the document digitizing engine 120 and/or the semantic normalization engine 160 of FIG. 1. Program processes 42, as in the document digitizing engine 120 and/or the semantic normalization engine 160, generally carry out the functions and/or methods of embodiments of the present invention as described herein.
The computer system 12 may also communicate with one or more external devices 14, such as a keyboard, a pointing device, a display 24, or the like; one or more devices that enable a user to interact with computer system 12; and/or any device (e.g., network card, modem, etc.) that enables computer system 12 to communicate with one or more other computing devices. Such communication may occur via an input/output (I/O) interface 22. Further, computer system 12 may communicate with one or more networks such as a Local Area Network (LAN), a general Wide Area Network (WAN), and/or a public network (e.g., the Internet) via network adapter 20. As depicted, the network adapter 20 communicates with other components of the computer system 12 via the bus 18. It should be appreciated that, although not shown, other hardware and/or software components may be used in conjunction with computer system 12. Examples include, but are not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data archive storage systems, among others.
Referring now to FIG. 7, an exemplary cloud computing environment 50 is depicted. As shown, cloud computing environment 50 includes one or more cloud computing nodes 10 with which local computing devices used by cloud computing consumers, such as Personal Digital Assistants (PDAs) or mobile telephones 54A, desktop computers 54B, notebook computers 54C, and/or automobile computer systems 54N, may communicate. Cloud computing nodes 10 may communicate with each other. Cloud computing nodes 10 may be physically or virtually grouped (not shown) in one or more networks including, but not limited to, private, community, public, or hybrid clouds as described above, or a combination thereof. In this way, cloud consumers can request infrastructure as a service (IaaS), platform as a service (PaaS), and/or software as a service (SaaS) provided by the cloud computing environment 50 without maintaining resources on the local computing device. It should be appreciated that the various computing devices 54A-N shown in fig. 7 are merely illustrative, and that cloud computing node 10 and cloud computing environment 50 may communicate with any type of computing device (e.g., using a web browser) over any type of network and/or network-addressable connection.
Referring now to FIG. 8, a set of functional abstraction layers provided by cloud computing environment 50 (FIG. 7) is shown. It should be understood in advance that the components, layers, and functions shown in FIG. 8 are intended to be illustrative only, and embodiments of the present invention are not limited thereto. As depicted, the following layers and corresponding functions are provided:
The hardware and software layer 60 includes hardware and software components. Examples of hardware components include: mainframes 61; RISC (Reduced Instruction Set Computer) architecture based servers 62; servers 63; blade servers 64; storage devices 65; and networks and networking components 66. Examples of software components include: web application server software 67 and database software 68.
The virtualization layer 70 provides an abstraction layer that may provide examples of the following virtual entities: virtual server 71, virtual storage 72, virtual network 73 (including a virtual private network), virtual applications and operating system 74, and virtual client 75.
In one example, management layer 80 may provide the following functions. Resource provisioning function 81: provides dynamic procurement of computing resources and other resources that are utilized to perform tasks within the cloud computing environment. Metering and pricing function 82: provides cost tracking as resources are utilized within the cloud computing environment, and billing or invoicing for consumption of these resources. In one example, these resources may include application software licenses. Security function: provides identity verification for cloud consumers and tasks, as well as protection for data and other resources. User portal function 83: provides access to the cloud computing environment for consumers and system administrators. Service level management function 84: provides cloud computing resource allocation and management such that required service levels are met. Service Level Agreement (SLA) planning and fulfillment function 85: provides pre-arrangement for, and procurement of, cloud computing resources for which a future requirement is anticipated in accordance with an SLA.
Workload layer 90 provides examples of functionality for which the cloud computing environment may be utilized. Examples of workloads and functions that may be provided from this layer include: mapping and navigation 91; software development and lifecycle management 92; virtual classroom education delivery 93; data analytics processing 94; transaction processing 95; and processing components for the document digitizing services provided by the document digitizing engine, including the semantic normalization engine 96, as described herein.
The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.
The computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium, or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGAs), or programmable logic arrays (PLAs) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.
Various aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprise" (and any form of comprise, such as "comprises" and "comprising"), "have" (and any form of have, such as "has" and "having"), "include" (and any form of include, such as "includes" and "including") and "contain" (and any form of contain, such as "contains" and "containing") are open-ended linking verbs. Thus, a method or apparatus that "comprises," "has," "includes" or "contains" one or more steps or elements possesses those one or more steps or elements, but is not limited to possessing only those one or more steps or elements. Likewise, a step of a method or an element of an apparatus that "comprises," "has," "includes" or "contains" one or more features possesses those one or more features, but is not limited to possessing only those one or more features. Furthermore, a device or structure that is configured in a certain way is configured in at least that way, but may also be configured in ways that are not listed.
The corresponding structures, materials, acts, and equivalents of all means or step plus function elements in the claims below, if any, are intended to include any structure, material, or act for performing the function in combination with other claimed elements as specifically claimed. The description set forth herein is presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The embodiments were chosen and described in order to best explain the principles of one or more aspects set forth herein and the practical application, and to enable others of ordinary skill in the art to understand one or more aspects of the various embodiments as described herein for various modifications as are suited to the particular use contemplated.

Claims (19)

1. A computer-implemented method for normalizing keys in a document image, comprising:
obtaining, by one or more processors of a computer, document metadata for the document image, wherein the document metadata includes a context, a location, and a style for each object appearing in the document image;
identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key;
normalizing the candidate key into a canonical form;
determining a key class corresponding to the canonical form, wherein the key class is associated with the key in the key ontology data;
evaluating a confidence score for the key class based on the document metadata, wherein the confidence score indicates a likelihood of the key class being represented by the candidate key; and
updating a semantic database with the key class, based on verifying the key class according to a pre-configured verification scheme, such that the key class can be effectively associated with semantically interchangeable text appearing in other document images.
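For illustration only, and not as part of the claims, the steps of claim 1 might be sketched in Python as follows. The ontology contents, the canonicalization rule (lowercasing and stripping punctuation), the threshold-based verification scheme, and every function name here are assumptions chosen for the example; the claim does not prescribe any of them.

```python
import re

# Hypothetical key ontology: canonical text -> key class.
KEY_ONTOLOGY = {
    "account number": "AccountNumber",
    "acct no": "AccountNumber",
    "invoice date": "InvoiceDate",
}


def canonicalize(candidate_key: str) -> str:
    """Assumed canonical form: lowercase, no punctuation, single spaces."""
    text = re.sub(r"[^a-z0-9 ]+", " ", candidate_key.lower())
    return " ".join(text.split())


def key_class_for(candidate_key: str):
    """Return the key class for a candidate key, or None if unknown."""
    return KEY_ONTOLOGY.get(canonicalize(candidate_key))


def record_if_verified(semantic_db, candidate_key, confidence, threshold=0.8):
    """Update the semantic database only when verification passes.

    Verification is modeled as a simple confidence threshold, which is
    one possible pre-configured scheme among many.
    """
    key_class = key_class_for(candidate_key)
    if key_class is None or confidence < threshold:
        return False
    semantic_db.setdefault(key_class, set()).add(canonicalize(candidate_key))
    return True
```

With an empty dictionary as the semantic database, `record_if_verified(db, "Acct. No.:", 0.9)` stores the canonical text `acct no` under `AccountNumber`, while a call with a confidence below the threshold leaves the database unchanged.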
2. The computer-implemented method of claim 1, further comprising:
deriving zero or more aliases of the key class;
comparing the derived aliases with aliases in a set of key aliases corresponding to the keys from the semantic database;
discovering that the derived aliases are not present in the key alias set and that the respective confidence scores corresponding to each of the derived aliases are greater than a pre-configured threshold; and
updating the semantic database with the derived aliases.
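As a rough sketch of the alias handling in claim 2, assuming the derived aliases arrive as a mapping from alias text to confidence score and the semantic database is a dictionary of alias sets (both representations are hypothetical):

```python
def new_aliases(derived, known_aliases, threshold=0.8):
    """Keep derived aliases that are unknown and score above the threshold.

    derived: dict mapping alias text -> confidence score (assumed shape).
    known_aliases: set of aliases already stored for the key.
    """
    return {
        alias
        for alias, score in derived.items()
        if alias not in known_aliases and score > threshold
    }


def update_aliases(semantic_db, key, derived, threshold=0.8):
    """Add the qualifying aliases to the key's alias set and return them."""
    known = semantic_db.setdefault(key, set())
    added = new_aliases(derived, known, threshold)
    known |= added
    return added
```

Because the claim recites "zero or more" aliases, an empty `derived` mapping is a valid input and simply adds nothing.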
3. The computer-implemented method of claim 1, wherein the key in the key ontology data is equal to the key class.
4. The computer-implemented method of claim 1, the evaluating comprising:
comparing the context of the candidate key specified in the document metadata with the context of the key class, wherein the context of the candidate key indicates other objects that appear with the candidate key in the document image, and wherein the context of the key class indicates other typical objects that appear with keys of the key class in past document images; and
adjusting the confidence score in proportion to a level of similarity between the context of the candidate key and the context of the key class.
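One way to picture the context comparison of claim 4 is with a set-overlap measure. The sketch below uses Jaccard similarity and an additive, weighted adjustment; both choices, along with the `weight` parameter, are assumptions made for illustration and are not taken from the claim.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of object labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def adjust_for_context(score, candidate_context, class_context, weight=0.2):
    """Raise the score in proportion to context similarity, capped at 1.0."""
    similarity = jaccard(candidate_context, class_context)
    return min(1.0, score + weight * similarity)
```

Here `candidate_context` would hold the other objects found near the candidate key in the current image, and `class_context` the objects that typically accompanied keys of that class in past images.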
5. The computer-implemented method of claim 1, the evaluating comprising:
comparing the location of the candidate key specified in the document metadata with the location of the key class, wherein the location of the candidate key indicates a relative position at which the candidate key appears in the document image, and wherein the location of the key class indicates a relative position at which a key of the key class appears in a past document image; and
adjusting the confidence score in proportion to a level of similarity between the location of the candidate key and the location of the key class.
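The location comparison of claim 5 could be sketched as follows, assuming that a relative location is an (x, y) pair of fractions of the page width and height. That representation, the distance-based similarity, and the `weight` parameter are hypothetical.

```python
import math


def location_similarity(p, q):
    """Similarity in [0, 1] between two relative (x, y) locations.

    Both points are assumed to lie in the unit square, so the largest
    possible distance is the diagonal, sqrt(2).
    """
    return 1.0 - math.dist(p, q) / math.sqrt(2)


def adjust_for_location(score, candidate_loc, class_loc, weight=0.2):
    """Raise the score in proportion to location similarity, capped at 1.0."""
    return min(1.0, score + weight * location_similarity(candidate_loc, class_loc))
```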
6. The computer-implemented method of claim 1, the evaluating comprising:
comparing the style of the candidate key specified in the document metadata with the style of the key class, wherein the style of the candidate key indicates a font type and a size of the candidate key related to styles of other objects appearing in the document image, and wherein the style of the key class indicates a font type and a size of keys of the key class related to styles of other objects commonly appearing in past document images; and
adjusting the confidence score in proportion to a level of similarity between the style of the candidate key and the style of the key class.
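For the style comparison of claim 6, a minimal sketch might represent a style as a font name plus a size ratio, meaning the key's font size divided by a typical size of the other objects on the page. The dictionary layout and the equal weighting of the font and size components are assumptions.

```python
def style_similarity(candidate, key_class):
    """Similarity in [0, 1] between two styles.

    Each style is a dict with 'font' (str) and 'size_ratio' (float),
    where size_ratio is the key size relative to surrounding objects.
    """
    font_match = 1.0 if candidate["font"] == key_class["font"] else 0.0
    size_gap = abs(candidate["size_ratio"] - key_class["size_ratio"])
    size_match = max(0.0, 1.0 - size_gap)
    return 0.5 * font_match + 0.5 * size_match


def adjust_for_style(score, candidate, key_class, weight=0.2):
    """Raise the score in proportion to style similarity, capped at 1.0."""
    return min(1.0, score + weight * style_similarity(candidate, key_class))
```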
7. The computer-implemented method of claim 1, wherein the canonical form is an Extended Backus-Naur Form (EBNF) representation, and wherein the document metadata is expressed in JavaScript Object Notation (JSON) format.
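To make the JSON and EBNF references concrete, the sketch below parses a small, invented metadata record with Python's standard `json` module and shows an illustrative EBNF rule in a comment. The field names (`objects`, `text`, `context`, `location`, `style`) and the grammar itself are hypothetical; the claim only states that the metadata is JSON and the canonical form is EBNF.

```python
import json

# Illustrative EBNF for a canonical key (an assumption, not from the patent):
#   key  = word , { " " , word } ;
#   word = letter , { letter | digit } ;

# Hypothetical document metadata for one object in a document image.
METADATA_JSON = """
{
  "objects": [
    {
      "text": "Acct. No.:",
      "context": ["Name", "Balance"],
      "location": {"x": 0.1, "y": 0.05},
      "style": {"font": "Helvetica-Bold", "size": 12}
    }
  ]
}
"""

metadata = json.loads(METADATA_JSON)
first = metadata["objects"][0]
print(first["text"], first["location"], first["style"]["font"])
```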
8. The computer-implemented method of claim 1, wherein the key specified in the key ontology data is a name of a data field in a relational database, wherein the key is associated with a value extracted from the document image, and the key and the value form a key-value pair in the relational database for future computation.
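As a small illustration of claim 8, the following sketch stores extracted values under normalized keys that double as column names in a relational table, using Python's standard `sqlite3` module with an in-memory database. The table name, column names, and the allow-list of fields are invented for the example.

```python
import sqlite3

# Hypothetical normalized keys that are also column names.
ALLOWED_FIELDS = ("account_number", "invoice_date")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE extracted ("
    "document_id TEXT, account_number TEXT, invoice_date TEXT)"
)


def store_pairs(conn, document_id, pairs):
    """Insert key-value pairs, ignoring keys that are not known columns.

    Column names come only from ALLOWED_FIELDS, so no untrusted text is
    interpolated into the SQL statement; values are bound as parameters.
    """
    fields = [name for name in ALLOWED_FIELDS if name in pairs]
    columns = ", ".join(["document_id"] + fields)
    placeholders = ", ".join("?" for _ in range(len(fields) + 1))
    conn.execute(
        f"INSERT INTO extracted ({columns}) VALUES ({placeholders})",
        [document_id] + [pairs[name] for name in fields],
    )
```

Calling `store_pairs(conn, "doc-1", {"account_number": "12345"})` leaves `invoice_date` as NULL for that row, which a later document or correction could fill in.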
9. A computer-implemented method for normalizing keys in a document image, comprising:
identifying, by one or more processors of a computer, a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key;
normalizing the candidate key into a canonical form;
deriving one or more aliases of the candidate key from the canonical form, wherein the one or more aliases are not associated with the key in a semantic database;
evaluating a respective confidence score for each of the one or more aliases based on document metadata of the document image, wherein the confidence score indicates a likelihood that each alias is represented by the candidate key; and
updating the semantic database with the one or more aliases, based on verifying the one or more aliases according to a pre-configured verification scheme, such that the one or more aliases may be effectively associated with text from other document images based on the text semantically matching the candidate key.
10. A computer-readable storage medium readable by one or more processors and storing instructions for execution by the one or more processors to perform a method for normalizing keys in a document image, the method comprising:
obtaining document metadata of the document image, wherein the document metadata includes a context, a location, and a style for each object appearing in the document image;
identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key;
normalizing the candidate key into a canonical form;
determining a key class corresponding to the canonical form, wherein the key class is associated with the key in the key ontology data;
evaluating a confidence score for the key class based on the document metadata, wherein the confidence score indicates a likelihood of the key class being represented by the candidate key; and
updating a semantic database with the key class, based on verifying the key class according to a pre-configured verification scheme, such that the key class can be effectively associated with semantically interchangeable text appearing in other document images.
11. The computer-readable storage medium of claim 10, further comprising:
deriving zero or more aliases of the key class;
comparing the derived aliases with aliases in a set of key aliases corresponding to the keys from the semantic database;
discovering that the derived aliases are not present in the key alias set and that a respective confidence score corresponding to each of the derived aliases is greater than a pre-configured threshold; and
updating the semantic database with the derived aliases.
12. The computer-readable storage medium of claim 10, wherein the key in the key ontology data is equal to the key class.
13. The computer-readable storage medium of claim 10, the evaluating comprising:
comparing the context of the candidate key specified in the document metadata with the context of the key class, wherein the context of the candidate key indicates other objects that appear with the candidate key in the document image, and wherein the context of the key class indicates other typical objects that appear with keys of the key class in past document images; and
adjusting the confidence score in proportion to a level of similarity between the context of the candidate key and the context of the key class.
14. The computer-readable storage medium of claim 10, the evaluating comprising:
comparing the location of the candidate key specified in the document metadata with the location of the key class, wherein the location of the candidate key indicates a relative portion of the document image in which the candidate key appears, and wherein the location of the key class indicates a relative portion of a past document image in which a key of the key class typically appears; and
adjusting the confidence score in proportion to a level of similarity between the location of the candidate key and the location of the key class.
15. The computer-readable storage medium of claim 10, the evaluating comprising:
comparing the style of the candidate key specified in the document metadata with the style of the key class, wherein the style of the candidate key indicates a font type and size of the candidate key that is related to styles of other objects that appear in the document image, and wherein the style of the key class indicates a font type and size of keys of the key class that are related to styles of other objects that commonly appear in past document images; and
adjusting the confidence score in proportion to a level of similarity between the style of the candidate key and the style of the key class.
16. The computer-readable storage medium of claim 10, wherein the canonical form is an Extended Backus-Naur Form (EBNF) representation, and wherein the document metadata is expressed in JavaScript Object Notation (JSON) format.
17. The computer-readable storage medium of claim 10, wherein the key specified in the key ontology data is a name of a data field in a relational database, wherein the key is associated with a value extracted from the document image, and the key and the value form a key-value pair in the relational database for future computation.
18. A computer-readable storage medium readable by one or more processors and storing instructions for execution by the one or more processors to perform a method for normalizing keys in a document image, the method comprising:
identifying a candidate key corresponding to an object in the document image as a key in key ontology data, based on the candidate key being semantically interchangeable with the key;
normalizing the candidate key into a canonical form;
deriving one or more aliases of the candidate key from the canonical form, wherein the one or more aliases are not associated with the key in a semantic database;
evaluating, based on document metadata of the document image, a respective confidence score for each of the one or more aliases, wherein the confidence score indicates a likelihood that each alias is represented by the candidate key; and
updating the semantic database with the one or more aliases, based on verifying the one or more aliases according to a pre-configured verification scheme, such that the one or more aliases may be effectively associated with text from other document images based on the text semantically matching the candidate key.
19. A system, comprising:
a memory;
one or more processors in communication with the memory; and
program instructions executable by the one or more processors via the memory to perform the method for normalizing keys in document images as recited in any one of claims 1-9.
CN201880069289.9A 2017-12-01 2018-11-30 Semantic normalization in document digitization Active CN111263943B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US15/829,078 2017-12-01
US15/829,078 US10963686B2 (en) 2017-12-01 2017-12-01 Semantic normalization in document digitization
PCT/IB2018/059490 WO2019106613A1 (en) 2017-12-01 2018-11-30 Semantic normalization in document digitization

Publications (2)

Publication Number Publication Date
CN111263943A CN111263943A (en) 2020-06-09
CN111263943B CN111263943B (en) 2023-10-10

Family

ID=66659258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201880069289.9A Active CN111263943B (en) 2017-12-01 2018-11-30 Semantic normalization in document digitization

Country Status (6)

Country Link
US (1) US10963686B2 (en)
JP (1) JP6964383B2 (en)
CN (1) CN111263943B (en)
DE (1) DE112018006131T5 (en)
GB (1) GB2581461A (en)
WO (1) WO2019106613A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11616816B2 (en) * 2018-12-28 2023-03-28 Speedchain, Inc. Distributed ledger based document image extracting and processing within an enterprise system
KR20210130790A (en) * 2019-02-27 2021-11-01 Google LLC Identification of key-value pairs in documents
US20230077289A1 (en) * 2021-09-09 2023-03-09 Bank Of America Corporation System for electronic data artifact testing using a hybrid centralized-decentralized computing platform
US11914943B1 (en) * 2022-08-22 2024-02-27 Oracle International Corporation Generating an electronic document with a consistent text ordering

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105630751A (en) * 2015-12-28 2016-06-01 厦门优芽网络科技有限公司 Method and system for rapidly comparing text content
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis
CN106469143A (en) * 2015-08-21 2017-03-01 国际商业机器公司 The estimation of file structure

Family Cites Families (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06119392A (en) * 1992-10-02 1994-04-28 Hitachi Ltd Data item identification system
US7362775B1 (en) * 1996-07-02 2008-04-22 Wistaria Trading, Inc. Exchange mechanisms for digital information packages with bandwidth securitization, multichannel digital watermarks, and key management
JP4042830B2 (en) * 1998-05-12 2008-02-06 日本電信電話株式会社 Content attribute information normalization method, information collection / service provision system, and program storage recording medium
US20150066895A1 (en) * 2004-06-18 2015-03-05 Glenbrook Networks System and method for automatic fact extraction from images of domain-specific documents with further web verification
JP2007058605A (en) * 2005-08-24 2007-03-08 Ricoh Co Ltd Document management system
WO2008074160A1 (en) 2006-12-21 2008-06-26 Cogniva Information Solutions Inc. Software for facet classification and information management
US7836396B2 (en) * 2007-01-05 2010-11-16 International Business Machines Corporation Automatically collecting and compressing style attributes within a web document
US9043367B2 (en) 2007-05-23 2015-05-26 Oracle International Corporation Self-learning data lenses for conversion of information from a first form to a second form
US9268849B2 (en) 2007-09-07 2016-02-23 Alexander Siedlecki Apparatus and methods for web marketing tools for digital archives—web portal advertising arts
US8379989B2 (en) * 2008-04-01 2013-02-19 Toyota Jidosha Kabushiki Kaisha Image search apparatus and image processing apparatus
JP5226425B2 (en) * 2008-08-13 2013-07-03 インターナショナル・ビジネス・マシーンズ・コーポレーション Information processing apparatus, information processing method, and program
US8520909B2 (en) 2009-03-11 2013-08-27 Hong Kong Baptist University Automatic and semi-automatic image classification, annotation and tagging through the use of image acquisition parameters and metadata
US9323784B2 (en) * 2009-12-09 2016-04-26 Google Inc. Image search using text-based elements within the contents of images
US8417651B2 (en) 2010-05-20 2013-04-09 Microsoft Corporation Matching offers to known products
US9104749B2 (en) 2011-01-12 2015-08-11 International Business Machines Corporation Semantically aggregated index in an indexer-agnostic index building system
US9064004B2 (en) 2011-03-04 2015-06-23 Microsoft Technology Licensing, Llc Extensible surface for consuming information extraction services
US9251180B2 (en) 2012-05-29 2016-02-02 International Business Machines Corporation Supplementing structured information about entities with information from unstructured data sources
US9674132B1 (en) 2013-03-25 2017-06-06 Guangsheng Zhang System, methods, and user interface for effectively managing message communications
US9876848B1 (en) * 2014-02-21 2018-01-23 Twitter, Inc. Television key phrase detection
US10394851B2 (en) * 2014-08-07 2019-08-27 Cortical.Io Ag Methods and systems for mapping data items to sparse distributed representations
JP6050843B2 (en) * 2015-01-30 2016-12-21 株式会社Pfu Information processing apparatus, method, and program
US10395325B2 (en) * 2015-11-11 2019-08-27 International Business Machines Corporation Legal document search based on legal similarity
WO2017082352A1 (en) * 2015-11-14 2017-05-18 Sharp Kabushiki Kaisha Service list
CN105787056A (en) 2016-02-29 2016-07-20 浪潮通用软件有限公司 Calculating method based on custom calculation model
US10438083B1 (en) * 2016-09-27 2019-10-08 Matrox Electronic Systems Ltd. Method and system for processing candidate strings generated by an optical character recognition process
JP6973782B2 (en) * 2017-09-27 2021-12-01 株式会社ミラボ Standard item name setting device, standard item name setting method and standard item name setting program

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106469143A (en) * 2015-08-21 2017-03-01 国际商业机器公司 The estimation of file structure
CN105630751A (en) * 2015-12-28 2016-06-01 厦门优芽网络科技有限公司 Method and system for rapidly comparing text content
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
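Generating XML keys based on normalized relational schemas; Yue Kun, Zhou Aoying; Computer Engineering (09); full text *
Supervised explicit semantic representation for text classification; Sun Fei; Guo Jiafeng; Lan Yanyan; Cheng Xueqi; Journal of Data Acquisition and Processing (03); full text *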

Also Published As

Publication number Publication date
JP6964383B2 (en) 2021-11-10
CN111263943A (en) 2020-06-09
GB2581461A (en) 2020-08-19
US10963686B2 (en) 2021-03-30
WO2019106613A1 (en) 2019-06-06
US20190171872A1 (en) 2019-06-06
GB202009248D0 (en) 2020-07-29
DE112018006131T5 (en) 2020-08-20
JP2021504779A (en) 2021-02-15

Similar Documents

Publication Publication Date Title
US10049270B1 (en) Using visual features to identify document sections
CN113807098B (en) Model training method and device, electronic equipment and storage medium
US10592738B2 (en) Cognitive document image digitalization
CN111263943B (en) Semantic normalization in document digitization
US10977486B2 (en) Blockwise extraction of document metadata
US11361030B2 (en) Positive/negative facet identification in similar documents to search context
CN112541359B (en) Document content identification method, device, electronic equipment and medium
US11500840B2 (en) Contrasting document-embedded structured data and generating summaries thereof
US9916361B2 (en) Dynamically mapping zones
US11308084B2 (en) Optimized search service
WO2022048535A1 (en) Reasoning based natural language interpretation
US20220309276A1 (en) Automatically classifying heterogenous documents using machine learning techniques
CN115618034A (en) Mapping application of machine learning model to answer queries according to semantic specifications
US10970533B2 (en) Methods and systems for finding elements in optical character recognition documents
US11881042B2 (en) Semantic template matching
CN112307134B (en) Entity information processing method, device, electronic equipment and storage medium
US11321409B2 (en) Performing a search based on position information
US20160062974A1 (en) Recording reasons for metadata changes
US20230222150A1 (en) Cognitive recognition and reproduction of structure graphs

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant