WO2021251972A1

WO2021251972A1 - Method to improve probability calculation of knowledge base construction

Info

Publication number: WO2021251972A1
Application number: PCT/US2020/037248
Authority: WO
Inventors: Yusuke Jin; Satoshi Oshima; Hirofumi Nagano
Original assignee: Hitachi, Ltd.
Priority date: 2020-06-11
Filing date: 2020-06-11
Publication date: 2021-12-16

Abstract

Example implementations improve the probability calculation of knowledge base construction involving detecting the knowledge probability to be adjusted and adjusting its probability based on Optical Character Recognition (OCR) output accuracy which is related to the knowledge. The example implementations involve a function for receiving OCR output which contains multiple recognized texts and its accuracy, a function for receiving extracted knowledge which contains knowledge description, correctness probability and result of adapting constraint condition of correct knowledge, a function identifying low accuracy text and low knowledge probability, detecting constraint condition which causes low probability, identifying text which causes low knowledge probability by comparing text location and text similarity, and a function re-calculating probability by replacing OCR output text with next candidate and adjusting probability by adjusting result of adapted constraint.

Description

METHOD TO IMPROVE PROBABILITY CALCULATION OF KNOWLEDGE

BASE CONSTRUCTION

BACKGROUND

Field

[0001] The present disclosure generally relates to knowledge bases for character recognition systems, and more specifically, to systems and methods for improving probability calculation of knowledge base construction.

Related Art

[0002] Improving operational efficiency and promoting rapid decision-making are important management issues. However, quite a bit of manual work is required for non routine processes such as processing or analyzing atypical formatted documents. In these situations, enterprises utilize dark data, which is stored within their organization and has not yet been integrated into the business. For example, a large volume of non-digitized paper documents is one type of dark data. Because of the volume, handling dark data manually is not realistic. Therefore, it can be important to extract essential information from documents efficiently and store extracted information for easier use. In order to extract knowledge from document type dark data, two functions are utilized: Optical Character Recognition (OCR) and Knowledge Base (KB) Construction.

[0003] OCR is utilized for recognizing characters in a document image. In the related art, the OCR system has a function recognizing a location in which some characters exist and a function to identify the character by comparing the character features. The OCR system also has a function comparing the recognized characters with terms in a dictionary and a function adjusting the recognized characters if the character string (text) is not defined in the dictionary. So, if there is some false in text recognition, OCR system can adjust the text based on the dictionary. The OCR system also has a function recognizing both printed and hand written texts, faint texts and text in dirty documents, but OCR system cannot always recognize texts perfectly. Therefore, the OCR system has a function calculating accuracy, which indicates the probability that the recognized text is accurate, and a function enumerating multiple candidates of recognized characters. [0004] The OCR system output is simply text, and the OCR system cannot extract a relationship among multiple texts. However, a KB construction system can extract such a relationship and maintain it as knowledge from the OCR output. The KB construction system has a function extracting a relationship among multiple texts, which matches with a defined knowledge schema, as knowledge candidates. Patterns of knowledge are defined in the knowledge schema. The KB construction system also has a function calculating the probability that the knowledge candidate is correct through judgement functions. Users can utilize these extracted knowledges with reference to their probability. For example, if the probability of a particular knowledge is high, then no visual confirmation from users are required, otherwise users will be required to visually confirm the extracted knowledge.

[0005] In the related art, there are various character recognition devices and systems. In such related art implementations, if there are various types of mixed characters such as letters, numerical and symbols, it is inefficient to extract text from the document. To solve the problem, such related art implementations integrate statistical information for the text and its reliability.

[0006] In related art implementations, there are methods for dividing the text image into multiple segments, adapting OCR processing to the multiple segments, getting statistical information of candidate text and/or statistical information of text combinations involving the candidate text, and determining the candidate text by integrating statistical information and reliability of OCR-ed candidate text. In this context, the reliability of the text and the accuracy of the text is almost synonymous.

SUMMARY

[0007] In KB systems, the calculation procedure is isolated from the OCR accuracy calculation. Therefore, the accuracy of the input data (OCR output) and its multiple candidates are not reflected to the probability calculation in the KB system. The highest accuracy candidate is used for the probability calculation which is assumed to be 100% accurate. Such related art systems cause several problems. For example, even if the accuracy of text in the OCR output is low, the probability calculation is conducted without considering the low text accuracy and the calculated probability is factually inaccurate. Users may misunderstand importance of the extracted knowledge. Further, if a false text candidate is scored as having the highest accuracy, a correct text candidate with lower accuracy is not used in the probability calculation and essential knowledge may not be extracted properly.

[0008] Thus, there is a need to adjust for the incorrect probability, but it is difficult to identify which knowledge probability is to be adjusted and how to adjust the probability.

[0009] Example implementations facilitate a solution to the above problem through a method for detecting the knowledge probability which is to be adjusted, and adjusting the probability based on the OCR output accuracy that is related to the knowledge. The method includes a function for receiving OCR output which contains multiple recognized texts and the corresponding accuracy. The method includes a function for receiving extracted knowledge which contains the knowledge description, correctness probability and results from adapting constraint conditions of the correct knowledge. The method also includes a function for identifying low accuracy text and row knowledge probability, detecting constraint conditions which cause low probability, and identifying text which causes low knowledge probability by comparing text location and text similarity. The method also includes a function for re-calculating probability by replacing OCR output with the next candidate and adjusting probability by adjusting result of adapted constraint.

[0010] The main concept is detecting constraint condition which causes low knowledge probability by using adapting result of constraint condition and identifying text which causes low probability by comparing text described in constraint condition with row accuracy text in view of location and similarity.

[0011] In example implementations described herein, there are systems and methods directed to detecting which OCR output text is causing low accuracy and utilizing the accuracy of OCR output text for improving accuracy for a combination of multiple texts.

[0012] Aspects of the present disclosure involve a method, involving obtaining optical character recognition (OCR) output, the OCR output including one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extracting, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculating a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and managing the knowledge information and the knowledge probability in a knowledge base (KB). [0013] Aspects of the present disclosure involve a computer program, storing instructions involving obtaining optical character recognition (OCR) output, the OCR output including one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extracting, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculating a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and managing the knowledge information and the knowledge probability in a knowledge base (KB). The computer program may be stored on a non-transitory computer readable medium and executed by one or more processors.

[0014] Aspects of the present disclosure involve a system involving means for obtaining optical character recognition (OCR) output, the OCR output including one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; means for extracting, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; means for calculating a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and means for managing the knowledge information and the knowledge probability in a knowledge base (KB).

[0015] Aspects of the present disclosure involve an apparatus involving a processor configured to obtain optical character recognition (OCR) output, the OCR output including one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extract, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculate a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and manage the knowledge information and the knowledge probability in a knowledge base (KB).

BRIEF DESCRIPTION OF DRAWINGS

[0016] FIG. 1 illustrates an example system architecture of the probability adjustment system, in accordance with an example implementation.

[0017] FIG. 2 illustrates an example physical configuration of OCR node, in accordance with an example implementation. [0018] FIG. 3 illustrates an example physical configuration of KB node, in accordance with an example implementation.

[0019] FIG. 4 illustrates an example physical configuration of adjustment node, in accordance with an example implementation.

[0020] FIG. 5 illustrates an example conceptual diagram of OCR process in OCR program, in accordance with an example implementation.

[0021] FIG. 6 is a flow diagram illustrating an example process of knowledge extractor, in accordance with an example implementation.

[0022] FIG. 7 is a flow diagram illustrating an example process of probability adjuster, in accordance with an example implementation.

[0023] FIG. 8 is a flow diagram illustrating an example process of target detector, in accordance with an example implementation.

[0024] FIG. 9 illustrates a flow diagram illustrating an example process of probability re calculator, in accordance with an example implementation.

[0025] FIG. 10 is a data structure illustrating example information of OCR output table, in accordance with an example implementation.

[0026] FIG. 11 is a data structure illustrating example information of knowledge probability table, in accordance with an example implementation.

DETAILED DESCRIPTION

[0027] The following detailed description provides details of the figures and example implementations of the present application. Reference numerals and descriptions of redundant elements between figures are omitted for clarity. Terms used throughout the description are provided as examples and are not intended to be limiting. For example, the use of the term “automatic” may involve fully automatic or semi-automatic implementations involving user or administrator control over certain aspects of the implementation, depending on the desired implementation of one of ordinary skill in the art practicing implementations of the present application. Selection can be conducted by a user through a user interface or other input means, or can be implemented through a desired algorithm. Example implementations as described herein can be utilized either singularly or in combination and the functionality of the example implementations can be implemented through any means according to the desired implementations.

[0028] The following example implementation illustrates how the method and apparatus of probability adjustment systems works.

[0029] FIG. 1 illustrates an example system architecture of the probability adjustment system, in accordance with an example implementation. System 101 has OCR nodes 102, KB nodes 103 and adjustment nodes 104. All components are connected though network 105. OCR nodes 102 are used to recognize text characters in a document and calculate the accuracy of the recognized text characters. KB nodes 103 are used to extract knowledges from the recognized text characters and calculate the possibility that the extracted knowledges are correct. Adjustment nodes 104 are used to adjust the possibility of the knowledges based on the accuracy of text characters. All components are connected though network 105.

[0030] FIG. 2 illustrates an example physical configuration of OCR node 102, in accordance with an example implementation. OCR node 102 can include memory 201, local storage 202, communication interface(s) 203, processor(s) 204 and I/O Devices(s) 205. Local storage 202 contains operating system 206, OCR program 207, OCR input data store 208 and OCR output data store 209. OCR program 207 is a software application providing function for the OCR to recognize printed or handwritten text characters in digital images of documents. OCR program 207 reads the input image files stored in the OCR input data store 208 and store the result of the recognition to output files to OCR output data store 209.

[0031] FIG. 3 illustrates an example physical configuration of KB node 103, in accordance with an example implementation. KB node 103 involves memory 301, local storage 302, communication interface(s) 303, processor(s) 304 and Input/Output (I/O) Devices(s) 305. Local storage 302 contains operating system 306, Knowledge extractor 307, Probability calculation functions (PCFs) 308, Categorized dictionary 309, Knowledge schema 310, PCF weight configuration 311 and Knowledge probability table 312. Knowledge extractor 307 is a software application providing function for extracting knowledge from OCR output and calculate probability that the extracted knowledge is correct. The detailed procedure of this function is described in FIG. 6 later. Probability calculation function (PCF) 308 is a software application providing function to define constraint conditions that the extracted knowledge is correct. Examples of PCF 308 are described in further detail at FIG. 6 and FIG. 11.

[0032] Categorized dictionary 309 is a dictionary which is used for comparing with text in OCR output. Knowledge schema 310 is the definition of the relationships among multiple texts to extract knowledge candidates. The details of these elements are described in FIG. 6.

[0033] PCF weight configuration 311 is a weighting condition which weights PCF results to calculate the probability according to the importance of the PCFs. Knowledge probability table 312 is the list of extracted knowledge with results of the related PCFs and the corresponding probability. The details of these elements are described in FIG. 11.

[0034] FIG. 4 illustrates an example physical configuration of adjustment node 104, in accordance with an example implementation. Adjustment node 104 involves memory 401, local storage 402, communication interface(s) 403, processor(s) 404 and EO Devices(s) 405. Local storage 402 contains operating system 406, probability adjuster 407, target detector 408, probability calculator 409, low accuracy text (LAT) list 410, low possibility knowledge (LPK) list 411 and threshold configuration 412. Probability adjuster 407 is a software application providing function for adjusting knowledge probability which is calculated by knowledge extractor 307. The detailed procedure of this function is described in FIG. 7.

[0035] Target detector 408 is a software application providing functionality for detecting knowledge probability which is to be adjusted. This function is called from probability adjuster 407 and the detailed procedure of this function is described in FIG. 8 later.

[0036] Probability re-calculator 409 is a software application providing functionality for re-calculating knowledge probability. This function is called from probability adjuster 407 and the detailed procedure of this function is described in FIG. 9.

[0037] Low accuracy text (LAT) list 410 is a list of records of OCR output data store 208 whose accuracy is below the OCR output threshold defined in threshold configuration 412. The detailed procedure to generate this list is described in FIG. 8.

[0038] Low possibility knowledge (LPK) list 411 is a list of records of knowledge probability table 312 whose probability is below the knowledge probability threshold defined in threshold configuration 412. The detailed procedure to generate this list is described in FIG. 8.

[0039] Threshold configuration 412 is the definition of the threshold used to adjust the probability. It contains at least an OCR output accuracy threshold and a knowledge probability threshold. The detailed usage of these threshold is described in FIG. 8.

[0040] As will be described herein, processor(s) 404 can be configured to obtain optical character recognition (OCR) output, the OCR output comprising one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability, extract, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculate a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and manage the knowledge information and the knowledge probability in a knowledge base (KB) as illustrated in FIG. 7.

[0041] Processor(s) 404 can be configured to, for the extracted knowledge information failing to satisfy a knowledge base rule from the plurality of knowledge base rules while other ones of the plurality of knowledge base rules are satisfied, utilizing location information associated with the knowledge base rule to identify text from the OCR output which relates to the extracted knowledge information, as illustrated at 809 of FIG. 8.

[0042] Processor(s) 404 can be configured to, for the extracted knowledge information failing to satisfy the knowledge base rule related to word spelling, further adjusting the knowledge probability for the knowledge information based on the associated accuracy probability of a candidate text from the one or more candidate text used for extracting the knowledge information as illustrated at 905 of FIG. 9.

[0043] Processor(s) 404 can be configured to identify the alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information by identifying a candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below another threshold; and identifying the alternate candidate text associated with the candidate text from the one or more candidate text as illustrated at 810 of FIG. 8. [0044] Processor(s) 404 can be configured to detecting a knowledge base rule from the plurality of knowledge base rules causing the knowledge probability for the knowledge information to fall below the threshold; wherein the identifying the candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below the another threshold is based on the candidate text from the one or more candidate text associated with the detected knowledge base rule as illustrated at 807 and 808 of FIG. 8.

[0045] Processor(s) 404 can be configured to, for the knowledge probability for the knowledge information falling below a threshold, identify an alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information; and extract the knowledge information from the alternate candidate text as illustrated at 903 of FIG. 9.

[0046] FIG. 5 illustrates an example conceptual diagram of OCR process in OCR program 207, in accordance with an example implementation. Digital image of document 501 is part of an example filled form which involves printed text “First name” 502, hand-written first name “John” 503, printed text “Last name” 504, hand-written last name “Smith” 505, printed text “Title” 506, hand-written title “Mr.” 507, printed text “Address” 508 and hand written address “1234 Hitachi street, Santa Clara” 509. Layout image 510 is an example of the layout image of the OCR output. Recognized text 511 is a recognition result of printed text “First name” 502 in OCR program 207. In this example, the recognized text 511 is incorrect and it shows “First”, not “First”. The recognized text also has locational information of the recognized area which is the top left location (horizontal and vertical coordination) 512, width 513 and height 514. OCR program 207 integrates the recognized text 511 and the locational information 512, 513, 514 and records the data to OCR output data store 209. In this example implementation, OCR is any standard OCR implementation as is known in the art.

[0047] FIG. 6 is a flow diagram illustrating an example process of knowledge extractor 307, in accordance with an example implementation. The procedure begins at 601.

[0048] At 602, the knowledge extractor 307 receives information from OCR output table 209 of the OCR node 102 via communication interface(s) 203, 303. The detailed data structure of the information is described in FIG. 10. The OCR output table 209 may contain multiple candidates of recognized text and its accuracy, but knowledge extractor 307 does not receive secondary candidate text 1005 and secondary candidate accuracy 1006.

[0049] At 603, the knowledge extractor 307 compares recognized text information in the received OCR output with categorized dictionary 309. If there are any primary candidate texts 1003 which matches with words defined in the categorized dictionary 309, knowledge extractor 307 looks up a category to which the matched word belongs to, and detects its category. For example, if characters of “name” is registered as the member of “personal attribute” and characters of “John” is registered as the member of “person’s name” in the dictionary, recognized a text “First name” is categorized as “personal attribute” and a text “John” as “person’s name”. This categorization may be conducted with other methods such as clustering analysis with unsupervised machine learning, depending on the desired implementation.

[0050] At 604, the knowledge extractor 307 gets a relation schema which defines relation among multiple texts by using category information. For example, if a relation between “personal attribute” and “person’s name” is defined as a relation schema, knowledge extractor 307 searches the combination of two texts whose categories are “personal attribute” and “person’s name” respectively. In this example, text combinations such as (“First name”, “John”), (“First name”, “Smith”), (“Last name”, “John”), (“Last name”, Smith), etc. are extracted as knowledge candidates.

[0051] At 605, the knowledge extractor 307 calculates a probability that knowledge candidates extracted in step 604 are correct by adapting probability calculation functions (PCFs) which describes constraint condition for correct knowledge. For example, PCFs for finding first name from document is defined as:

• PCF1 : if an argument text of the function is located next to and to the right of, a text “First name” or “Given name”, the result is 1, otherwise, the result is 0

• PCF2: if an argument text of the function is not located next to and to the right of, a text “Last name” or “Family name”, the result is 1, otherwise, the result is 0

• PCF3: if an argument text of the function is located near “Title”, the result is 1, otherwise, the result is 0 • PCF4: if an argument text of the function does not include numerical elements, the result is 1, otherwise, the result is 0

[0052] Knowledge extractor 307 gets results of PCFs. Examples of PCF adaptation to knowledge candidate are:

• “John” does not meet the condition of PCF1 and get 0, because the left text of “John” is “First”, not “First”

• “John” meets the condition of PCF2 and get 1, because “John” is located the left of “Last name”

• “John” meets the condition of PCF3 and get 1, because “John” is located near “Title”

• “John” meets the condition of PCF4 and get 1, because “John” does not contain any numerical.

[0053] This result is expressed as “PCF1: 0, PCF2: 1, PCF3: 1, PCF4: 1”. Knowledge extractor 307 adapts PCFs to all candidates and calculates their probability. For example, knowledge extractor 307 calculates knowledge probability from the PCF result and PCF weight configuration 311. If each PCF weight is defined as “PCF1: 0.9, PCF2: 0.6, PCF3: 0.5, PCF4: 0.2”, the probability is calculated by summing up weighted PCF result and adjusting its result to be between 0 and 1. In this case, the equation is (0*0.9+1 *0.6+1 *0.5+1 *0.2) / (0.9+0.6+0.5+0.2) =0.59. The probability of knowledge “John is first name” is 0.59. Finally, knowledge extractor 307 incorporates knowledge description “John is first name”, PCF result “PCF1: 0, PCF2: 1, PCF3: 1, PCF4: 1” and the probability 0.59, adds the identifier (ID) to the knowledge and stores this information to knowledge probability table 312.

[0054] At 606, the knowledge extractor 307 ends the process.

[0055] FIG. 7 is a flow diagram illustrating an example process of probability adjuster 407, in accordance with an example implementation. The process begins at 701.

[0056] At 702, the probability adjuster 407 receives information of OCR output table 209 of OCR node 102 via communication interface(s) 203, 403. The information may contain multiple candidates of recognized text and its accuracy. The detailed data structure of the information is described in FIG. 10.

[0057] At 703, the probability adjuster 407 receives extracted knowledges and relative information including at least the PCF weight configuration 311 and knowledge probability table 312. The knowledges must include knowledge generated from received OCR output in step 702. The detailed data structure of the information is described in FIG. 11.

[0058] At 704, the probability adjuster 407 detects knowledges whose probability is to be adjusted. The detailed procedure of this step is described in FIG. 8.

[0059] At 705, the probability adjuster 407 recalculates the probability of the detected knowledges in step 704 and adjusts it. The detailed procedure of this step is described in FIG. 9.

[0060] At 706, the probability adjuster 407 ends the process.

[0061] FIG. 8 is a flow diagram illustrating an example process of target detector 408, in accordance with an example implementation. The process begins at 801.

[0062] At 803, the target detector 408 gets the threshold parameter of the OCR output accuracy from threshold configuration 412 to narrow down records in the OCR output data received in step 702 through comparing with primary candidate accuracy 1004. The OCR output accuracy threshold, for example, is expressed between 0 to 1, similar to OCR output accuracy. Target detector 408 compares the selected OCR output with the OCR output accuracy threshold, makes a list of primary candidate text whose accuracy is below the threshold and stores the list to low accuracy text (LAT) list 410. For example, if the OCR output accuracy threshold is 0.8, the OCR output table shown in FIG. 10 will be narrowed down to only two records whose text ID 1001 is T001 and T004.

[0063] At 805, the target detector 408 gets the threshold parameter of the knowledge probability from threshold configuration 412 to narrow down records in extracted knowledges received in step 703 comparing with probability 1104. The knowledge probability threshold, for example, is expressed between 0 to 1, similar to knowledge probability. Target detector 408 compares the selected knowledge with the knowledge probability threshold, makes a list of knowledge having a probability below the threshold, and stores the list to low possibility knowledge (LPK) list 411. For example, if the knowledge probability threshold is 0.75, the knowledge probability table shown in FIG. 11 will be narrowed down to only one records whose knowledge ID is K001.

[0064] At 807, the target detector 408 detects PCFs, which are causing low knowledge probability, from PCFs. Target detector 408 selects knowledge records from LPK list made in step 806 and gets the PCF result 1103 of the selected knowledge records. This PCF result contains each result of PCF which is adapted to the knowledge to calculate its probability in step 605. For example, if the knowledge record of K001 is selected, target detector 408 will get the PCF result of “PCF 1: 0, PCF2: 1, PCF3: 1, PCF4: 1”. The result means that the knowledge of K001 was not matched to the condition of PCF1 and rather matched to the condition of PCF2, PCF3 and PCF4. Therefore, it can be assumed that the cause of being below the threshold is related to PCF1. In another example, if target detector 408 gets PCF weight information from PCF weight configuration 311, target detector 408 detect PCF in consideration of effect of from the PCF weight. A detailed calculation example of probability using PCF result and PCF weight is described in FIG. 11.

[0065] At 808, the target detector 408 identifies texts which are used in the detected PCF in step 807. For example, PCF1 is defined that if an argument text of the function is located next to the right of a text “First name” or “Given name”, the result is 1, otherwise, the result is 0. In this example, target detector 408 identifies two texts: “First name” and “Given name” in definition constraint in PCF1.

[0066] At 809, the target detector 408 identifies texts which have relation with extracted knowledge in LPK list 411. First, target detector 408 detects locational information described in the PCF. For example, the knowledge “John is first name” (ID=K001 in knowledge probability table 312) may have relation with recognized text “First name” (ID=T001 in OCR output table 209), “John” (ID=T002) and “Last name” (ID=T003), because a text on the left of “John” is judgement target in PCF1 and a text on the right of “John” is judgement target in PCF2. Target detector 408 memories the relation between a knowledge and texts, e.g. relation between K001 and (T001, T002, T003) as a knowledge-text relation.

[0067] At 810, the target detector 408 identifies texts in knowledge which causes low probability by comparing knowledge-text relation table 413 and LAT list 410. If texts in the knowledge-text relation match with texts listed in LAT list 410, target detector 408 identifies the matched text as a candidate of the cause of the low probability. For example, only “First name” (T001) is listed in LAT list 410 among T001, T002 and T003. If there are multiple candidate texts which causes low probability, target detector 408 compares their locational information with each other, and identifies the nearest candidate text as the cause of the low probability. As a result, target detector 408 identifies that “First name” (T001) may be the cause for low probability of “John is first name”.

[0068] At 811, the target detector 408 check the similarity between the text identified in step 810 (“First name”) and the text identified in step 808 (“First name” and “Given name”). If the similarity is above a certain level, target detector 408 identifies “First name” as a target that requires an adjusted probability.

[0069] At 813, the target detector 408 ends the process.

[0070] FIG. 9 illustrates a flow diagram illustrating an example process of probability re calculator 409, in accordance with an example implementation. The process starts at 901.

[0071] At 902, the probability re-calculator 409 checks whether there is any other candidate text except for primary candidate text 1003 or not. If there is (Yes), the next step is 903. If there is not (No), the next step is 905. For example, if there are primary candidate “First name” and secondary candidate “First name”, the next candidate is the secondary one. Furthermore, if there is no third candidate, probability re-calculator 409 proceeds to step 905.

[0072] At 903, the probability re-calculator 409 replaces the previous candidate text in OCR output with next candidate text and sends it to KB node 103 and executes knowledge extractor 307. For example, “First name” in OCR output is replaced with “First name” and PCF1 result in knowledge probability table changes from 0 to 1, because a replaced text “First name” meets the condition of PCF1. The probability 1104 is re-calculated based on the changed PCF result 1103 in knowledge extractor 307. For example, probability is calculated to 1.00 based on PCF result “PCF1: 1, PCF2: 1, PCF3: 1, PCF4: 1”. Probability re-calculator 409 gets the re-calculated probability.

[0073] At 904, the probability re-calculator 409 checks whether the re-calculated probability is above a threshold of re-calculator stored in the threshold configuration 412 or not. If it is (Yes), then the next step is 906. If it is not (No), then the next step is 902. For example, if the threshold is 0.8 and re-calculated probability is 1.00, the probability is above the threshold and probability re-calculator 409 proceeds to step 906 and ends this program.

[0074] At 905, the probability re-calculator 409 directory adjusts the PCF result 1103 of the detected PCF in step 807. Probability re-calculator 409 may adjust the result of PCF based on the primary candidate accuracy 0.42 of “First name”. For example, assuming that the result of PCF 1 should be adjusted with a probability of 0.58 (difference of 1 - 0.42), the equation (0*0.9+1 *0.6+1 *0.5+1 *0.2) / (0.9+0.6+0.5+0.2) =0.59 can be adjusted to (0.58 *0.9+1 *0.6+1 *0.5+1 *0.2) / (0.9+0.6+0.5+0.2) =0.82. As the result, the probability of knowledge “John is first name” is adjusted from 0.59 to 0.82.

[0075] At 906, the probability re-calculator 409 ends the process.

[0076] FIG. 10 is a data structure illustrating example information of OCR output table 209, in accordance with an example implementation. The information contains at least text ID 1001, locational information 1002, primary candidate text 1003, primary candidate accuracy 1004, secondary candidate text 1005 and secondary candidate accuracy 1006. The text ID 1001 is an identifier for each recognized information. The locational information 1002 is locational information which at least horizontal coordination (X) and vertical coordination (Y) of upper left position of recognized text area, and width and height of the text area. The locational information 1002 is generated in OCR program 207. The primary candidate text 1003 is a candidate of recognized text and the accuracy score, which is calculated when OCR program 207 recognizes texts from OCR input document, is highest among the multiple candidates. The primary candidate accuracy 1004 is an accuracy score of the primary candidate text 1003. The secondary candidate text 1005 is a candidate of recognized text and the accuracy score is second highest among the multiple candidates. If there are only one candidate, no information is stored in this column. The secondary candidate accuracy 1006 is an accuracy score of the secondary candidate text 1005. If there are only one candidate, no information is stored in this column. If there are three or more candidates, OCR output table 209 may contain information of their text and accuracy.

[0077] FIG. 11 is a data structure illustrating example information of knowledge probability table 312, in accordance with an example implementation. The information contains at least knowledge ID 1101, knowledge description 1102, PCF result 1103 and probability 1104. The knowledge ID 1101 is an identifier for each extracted knowledge. The knowledge description 1102 is the detail of extracted knowledge that is the PCF result 1103 contains each judgement result of probability calculation functions (PCFs) 308. For example, a PCF named PCF1 is defined that if an argument text of the function is located next to the right of a text “First name” or “Given name”, the result is 1, otherwise, the result is 0. The definition “is located next to the right of’ may be defined in detail by using locational information 1002. An example “PCF 1:0” means that the result of the PCF1 is 0. The probability 1104 is correctness of the knowledge stored in knowledge description 1102 and is scored between 0 and 1.

[0078] FIGS. 2 to 4 are example block diagrams. Their functional components can be placed in another node and can be executed, and their data stores can be placed in another node and can be transferred in accordance with the desired implementation. FIG. 6 is an example procedure of knowledge extraction. Another knowledge base construction procedure can be adapted. FIG. 10 and FIG. 11 are illustrated with a table structure, but these data stores can be implemented with other types of database, markup languages or other structures, in accordance with the desired implementation. Further, such data does not need to be a single data store. They can be divided into multiple data stores or can be integrated as a single data store depending on the desired implementation.

[0079] Some portions of the detailed description are presented in terms of algorithms and symbolic representations of operations within a computer. These algorithmic descriptions and symbolic representations are the means used by those skilled in the data processing arts to convey the essence of their innovations to others skilled in the art. An algorithm is a series of defined steps leading to a desired end state or result. In example implementations, the steps carried out require physical manipulations of tangible quantities for achieving a tangible result.

[0080] Unless specifically stated otherwise, as apparent from the discussion, it is appreciated that throughout the description, discussions utilizing terms such as “processing,” “computing,” “calculating,” “determining,” “displaying,” or the like, can include the actions and processes of a computer system or other information processing device that manipulates and transforms data represented as physical (electronic) quantities within the computer system’s registers and memories into other data similarly represented as physical quantities within the computer system’s memories or registers or other information storage, transmission or display devices. [0081] Example implementations may also relate to an apparatus for performing the operations herein. This apparatus may be specially constructed for the required purposes, or it may include one or more general-purpose computers selectively activated or reconfigured by one or more computer programs. Such computer programs may be stored in a computer readable medium, such as a computer-readable storage medium or a computer-readable signal medium. A computer-readable storage medium may involve tangible mediums such as, but not limited to optical disks, magnetic disks, read-only memories, random access memories, solid state devices and drives, or any other types of tangible or non-transitory media suitable for storing electronic information. A computer readable signal medium may include mediums such as carrier waves. The algorithms and displays presented herein are not inherently related to any particular computer or other apparatus. Computer programs can involve pure software implementations that involve instructions that perform the operations of the desired implementation.

[0082] Various general-purpose systems may be used with programs and modules in accordance with the examples herein, or it may prove convenient to construct a more specialized apparatus to perform desired method steps. In addition, the example implementations are not described with reference to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the example implementations as described herein. The instructions of the programming language(s) may be executed by one or more processing devices, e.g., central processing units (CPUs), processors, or controllers.

[0083] As is known in the art, the operations described above can be performed by hardware, software, or some combination of software and hardware. Various aspects of the example implementations may be implemented using circuits and logic devices (hardware), while other aspects may be implemented using instructions stored on a machine-readable medium (software), which if executed by a processor, would cause the processor to perform a method to carry out implementations of the present application. Further, some example implementations of the present application may be performed solely in hardware, whereas other example implementations may be performed solely in software. Moreover, the various functions described can be performed in a single unit, or can be spread across a number of components in any number of ways. When performed by software, the methods may be executed by a processor, such as a general purpose computer, based on instructions stored on a computer-readable medium. If desired, the instructions can be stored on the medium in a compressed and/or encrypted format.

[0084] Moreover, other implementations of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the teachings of the present application. Various aspects and/or components of the described example implementations may be used singly or in any combination. It is intended that the specification and example implementations be considered as examples only, with the true scope and spirit of the present application being indicated by the following claims.

Claims

CLAIMS What is claimed is:

1. A method, comprising: obtaining optical character recognition (OCR) output, the OCR output comprising one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extracting, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculating a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and managing the knowledge information and the knowledge probability in a knowledge base (KB).

2. The method of claim 1, further comprising: for the extracted knowledge information failing to satisfy a knowledge base rule from the plurality of knowledge base rules while other ones of the plurality of knowledge base rules are satisfied, utilizing location information associated with the knowledge base rule to identify text from the OCR output which relates to the extracted knowledge information.

3. The method of claim 2, wherein : for the extracted knowledge information failing to satisfy the knowledge base rule related to word spelling, further adjusting the knowledge probability for the knowledge information based on the associated accuracy probability of a candidate text from the one or more candidate text used for extracting the knowledge information.

4. The method of claim 2, wherein the identifying the alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information comprises: identifying a candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below another threshold; and identifying the alternate candidate text associated with the candidate text from the one or more candidate text.

5. The method of claim 2, further comprising: detecting a knowledge base rule from the plurality of knowledge base rules causing the knowledge probability for the knowledge information to fall below the threshold; wherein the identifying the candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below the another threshold is based on the candidate text from the one or more candidate text associated with the detected knowledge base rule.

6. The method of claim 1, further comprising, for the knowledge probability for the knowledge information falling below a threshold: identifying an alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information; and extracting the knowledge information from the alternate candidate text.

7. A computer program, storing instructions for executing a process comprising: obtaining optical character recognition (OCR) output, the OCR output comprising one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extracting, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculating a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and managing the knowledge information and the knowledge probability in a knowledge base (KB).

8. The computer program of claim 7, the instructions further comprising: for the extracted knowledge information failing to satisfy a knowledge base rule from the plurality of knowledge base rules while other ones of the plurality of knowledge base rules are satisfied, utilizing location information associated with the knowledge base rule to identify text from the OCR output which relates to the extracted knowledge information.

9. The computer program of claim 8, the instructions further comprising: for the extracted knowledge information failing to satisfy the knowledge base rule related to word spelling, further adjusting the knowledge probability for the knowledge information based on the associated accuracy probability of a candidate text from the one or more candidate text used for extracting the knowledge information.

10. The computer program of claim 8, wherein the identifying the alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information comprises: identifying a candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below another threshold; and identifying the alternate candidate text associated with the candidate text from the one or more candidate text.

11. The computer program of claim 8, the instructions further comprising: detecting a knowledge base rule from the plurality of knowledge base rules causing the knowledge probability for the knowledge information to fall below the threshold; wherein the identifying the candidate text from the one or more candidate text of the OCR output utilized to satisfy a knowledge base rule from the plurality of knowledge base rules having the associated accuracy probability below the another threshold is based on the candidate text from the one or more candidate text associated with the detected knowledge base rule.

12. The computer program of claim 8, the instructions further comprising, for the knowledge probability for the knowledge information falling below a threshold: identifying an alternate candidate text from the one or more candidate text of the OCR output associated with the knowledge information; and extracting the knowledge information from the alternate candidate text.

13. An apparatus, comprising: a processor, configured to: obtain optical character recognition (OCR) output, the OCR output comprising one or more candidate text, each of the one or more candidate text prioritized based on an associated accuracy probability; extract, from the OCR output, knowledge information that satisfies one or more knowledge base rules from a plurality of knowledge base rules; calculate a knowledge probability for the knowledge information based on matching the OCR output to the plurality of knowledge base rules; and manage the knowledge information and the knowledge probability in a knowledge base (KB).