EP4085343A1 - Domain-based text extraction - Google Patents

Domain-based text extraction

Info

Publication number
EP4085343A1
Authority
EP
European Patent Office
Prior art keywords
text
text entities
entities
entity
pattern
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
EP20910797.8A
Other languages
English (en)
French (fr)
Other versions
EP4085343A4 (de)
Inventor
Madhusudan Singh
Kaushik Halder
Nirmal VANAPALLI VENKATA RAMESH RAYULU
Aritra Ghosh Dastidar
Ajay SHA
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
L&T Technology Services Ltd
Original Assignee
L&T Technology Services Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by L&T Technology Services Ltd filed Critical L&T Technology Services Ltd
Publication of EP4085343A1
Publication of EP4085343A4
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/26Techniques for post-processing, e.g. correcting the recognition result
    • G06V30/262Techniques for post-processing, e.g. correcting the recognition result using context analysis, e.g. lexical, syntactic or semantic context
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/416Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/10Recognition assisted with metadata
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • This disclosure relates generally to data extraction, and more particularly to a method and a system for extracting text information from contents of an input file, using one or more data extraction approaches.
  • Text extraction techniques have assumed importance lately.
  • extraction techniques such as Optical Character Recognition (OCR) may allow a user to extract text data from a file, such as an image or a Portable Document Format (PDF) file. Further, it may be desirable to extract relevant information and generate expressions using the extracted relevant information.
  • OCR Optical Character Recognition
  • PDF Portable Document Format
  • Some available techniques may allow extraction of textual information from input files and generation of expressions using the extracted information, when the input files include tags or when a pattern can be identified in the identified text.
  • However, when no tags may be present or no pattern can be identified, it is difficult to extract relevant information or generate expressions.
  • FIG. 1 is a functional block diagram of an exemplary system for extracting information from contents of an input file, in accordance with some embodiments of the present disclosure.
  • FIG. 2 is a functional block diagram of a Natural Language Processing (NLP) metadata extraction framework, in accordance with some embodiments of the present disclosure.
  • FIG. 3 is a block diagram of a spell check system, in accordance with some embodiments of the present disclosure.
  • NLP Natural Language Processing
  • FIG. 4 is a block diagram of a recommendation system, in accordance with some embodiments of the present disclosure.
  • FIG. 5 is a block diagram of a metadata update system, in accordance with some embodiments of the present disclosure.
  • FIGS. 6-8 are flowcharts of a method of extracting information from contents of an input file, in accordance with various embodiments of the present disclosure.
  • FIG. 9 is a snapshot of an exemplary input file having a noisy entity, from which information is to be extracted, in accordance with some embodiments of the present disclosure.
  • FIG. 10 is a snapshot of another exemplary input file, from which information is to be extracted, in accordance with various embodiments of the present disclosure.
  • the system 100 may include a text identification device 102.
  • the text identification device 102 may identify text data from the contents of an input file.
  • the input file may be an image or a Portable Document Format (PDF) file.
  • PDF Portable Document Format
  • the identified text data may include a plurality of text entities.
  • the text identification device 102 may use Optical Character Recognition (OCR) technique for identifying the text data from the input file.
  • OCR Optical Character Recognition
  • the text identification device 102 may use any other technique known in the art for identifying the text data from the input file.
  • the system 100 may further include an information extraction device 104 communicatively coupled to the text identification device 102.
  • the information extraction device 104 may be configured to extract relevant information from the plurality of text entities identified from the input file by the text identification device 102.
  • the information extraction device 104 may receive a text input from a user for identifying relevant text entities from the plurality of text entities, and automatically generate a search pattern corresponding to the text input.
  • the information extraction device 104 may further determine a pattern associated with each of the plurality of text entities, and map the search pattern corresponding to the text input with patterns associated with the plurality of text entities.
  • the information extraction device 104 may further identify one or more matching patterns from the patterns associated with the plurality of text entities based on the mapping, and extract, from the plurality of text entities, relevant text entities corresponding to the one or more matching patterns.
  • the system 100 may further include a display 114.
  • the information extraction device 104 may include one or more processors 110 and a computer-readable medium (for example, a memory) 112.
  • the computer-readable storage medium 112 may store instructions that, when executed by the one or more processors 110, cause the one or more processors 110 to generate expressions based on contents of an input file, in accordance with aspects of the present disclosure.
  • the computer-readable storage medium 112 may also store various data (for example, identified text data, value entity data, pattern entity data, and the like) that may be captured, processed, and/or required by the system 100.
  • the system 100 may interact with a user via a user interface 116 accessible via the display 114.
  • the system 100 may also interact with one or more external devices 106 over a communication network 108 for sending or receiving various data.
  • the external devices 106 may include, but may not be limited to, a remote server, a digital device, or another computing system.
  • the NLP metadata extraction framework 200 may be implemented in the system 100 and, in particular, in the information extraction device 104.
  • the NLP metadata extraction framework 200 may include various modules that perform various functions so as to extract information from contents of an input file.
  • an extraction engine 202 may extract/identify text data from contents of an input file 204.
  • the input file 204 may be a flat file 206, or a PDF file 208, or a database file 210, or an OCR output 212.
  • the extraction engine 202 may further receive one or more domain rules from a domain rules database 214.
  • the domain rules database 214 may be based on “if-then” rules (rather than through conventional procedural code). It may be noted that domain rules may be used to determine a domain-based name associated with each of the plurality of text entities. The domain-based name may be determined based on mapping each of the plurality of text entities with a list of domain-based names stored in a data depository (not shown in FIG. 1).
  • the domain-based database may determine a domain name, i.e., name of the city having that PIN code, for example, by mapping the PIN code with a list of domain-based names stored in a data repository.
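  • By way of illustration only, a minimal Python sketch of such a domain rule lookup may look as follows; the sample PIN codes, the mapping, and the helper name are assumptions made for the example and not part of the description above:

        # Sketch of a domain rule lookup: a text entity (here, a PIN code) is mapped to a
        # domain-based name (the city) via a simple lookup table standing in for the data
        # repository. Data and names are illustrative.
        DOMAIN_NAMES = {
            "110001": "New Delhi",
            "400001": "Mumbai",
        }

        def domain_name_for(text_entity, repository=DOMAIN_NAMES):
            """Return the domain-based name for a text entity, or None if no rule applies."""
            return repository.get(text_entity.strip())

        print(domain_name_for("110001"))  # -> New Delhi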
  • the extraction engine 202 may further receive one or more location rules from a location rules database 216. It may be noted that, in some embodiments, the location rules may be used to determine location associated with an entity from the text data in the input file, based on an OCR output. The location may be determined upon first identifying a type associated with the input file, and determining a location of one or more relevant text entities in the input file, based on the type associated with the input file. Once the location of the relevant text entities is determined, the relevant text entities may be extracted from the input file, based on the location.
  • the one or more location rules may be more relevant in case of input data comprising a table, i.e., input data arranged in a tabular format; however, the one or more location rules may be applicable to any other types of input data as well, for example, documents with free text.
  • the type of the input file may be that of an “Aadhar” card. It may be further understood that in the documents like “Aadhar” card, the data may be present in a specific format.
  • the location of the relevant text entities may be predetermined based on a template corresponding to the type associated with the input file, stored in the data repository. For example, the location of text entities “Name” and “Address” (of the concerned person) may be in the middle of the document, and this information may be available and stored in the database of the system 100. In particular, the location may be based on (x, y) coordinates.
  • the location of the text entities “Name” and “Address” may be within a region defined by (x, y) coordinates (5, 10) to (15, 15). Therefore, once the type of the document is identified as “Aadhar” card, the system 100 may access the above location to extract the relevant text entities (i.e., “Name” and “Address”) from that location.
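  • As an illustrative sketch only (the template coordinates, the OCR token format, and the helper names are assumptions), the location rules described above may be approximated in Python as follows:

        # Sketch of location-rule extraction: a template stores an (x, y) region per field
        # for a known document type, and OCR tokens falling inside that region are collected.
        TEMPLATES = {
            "aadhar_card": {"Name": (5, 10, 15, 15), "Address": (5, 16, 15, 25)},
        }

        def extract_by_location(doc_type, field, ocr_tokens, templates=TEMPLATES):
            """Join the OCR tokens whose coordinates fall inside the field's template region."""
            x_min, y_min, x_max, y_max = templates[doc_type][field]
            hits = [t["text"] for t in ocr_tokens
                    if x_min <= t["x"] <= x_max and y_min <= t["y"] <= y_max]
            return " ".join(hits)

        tokens = [{"text": "Ajay", "x": 6, "y": 11}, {"text": "Thakur", "x": 9, "y": 11}]
        print(extract_by_location("aadhar_card", "Name", tokens))  # -> Ajay Thakur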
  • the extraction engine 202 may further receive one or more first NLP rules from a first NLP rules database 218.
  • the first NLP rules database 218 may create a Regular Expression (Regex) based on NLP rules provided for values of ground truth/attributes.
  • a Regex may be a special text string for describing a search pattern.
  • the extraction engine 202 may further receive one or more second NLP rules from a second NLP rules database 220.
  • the second NLP rules database 220 may define part of speech (POS) for values of ground truth/attributes.
  • POS part of speech
  • a POS associated with the plurality of text entities may be identified.
  • the POS may be one of a noun, a pronoun, a verb, an adverb, an adjective, a conjunction, a preposition, and an interjection.
  • one or more unique text entities may be selected from the plurality of text entities for which POS is a noun.
  • the one or more (selected) unique text entities for which POS is identified as a noun may be extracted as relevant data.
  • a salary slip document may include multiple text entities including the Name of the concerned person. Extracting the Name may be an objective of the extraction engine 202. Further, it will be understood that the Name shall fall under the POS category of a noun. As such, the text entities with noun as their POS shall be selected. The required name shall, therefore, be extracted from the selected entities.
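  • As an illustration, the POS-based selection described above may be sketched in Python using an off-the-shelf tagger (NLTK is used here only as one possible choice; any POS tagger may be substituted):

        # Sketch of POS-based selection of noun entities. Requires the NLTK tokenizer and
        # tagger resources to be downloaded once, e.g. nltk.download("punkt") and
        # nltk.download("averaged_perceptron_tagger").
        import nltk

        def noun_entities(text):
            """Return the tokens whose POS tag is a noun (NN, NNS, NNP, NNPS)."""
            tokens = nltk.word_tokenize(text)
            return [word for word, tag in nltk.pos_tag(tokens) if tag.startswith("NN")]

        print(noun_entities("Salary slip issued to Ajay Thakur for December"))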
  • one or more unique text entities may be identified for which POS is not identified.
  • An entity type associated with each of the one or more unique text entities may be determined.
  • the entity type may be a value entity or a pattern entity. Each value entity may have an associated value and each pattern entity has an associated pattern.
  • an identified text entity may be made up of a combination of text characters indicative of a code name, like that of an e-mail ID or an oil well name (e.g., “john@gmail.com” or “AB-123”, respectively).
  • a text entity may be identified either as a value entity or a pattern entity.
  • an e-mail ID (“john@gmail.com”) may be determined as a value entity, the value being the e-mail ID. This may be because the pattern of an e-mail ID is usually a typical and known pattern, and all the e-mail IDs may share the same or similar pattern.
  • a text entity like “AB-123” corresponding to an oil well name may be determined as a pattern entity, since oil well names may share the same pattern; however, such a pattern may not be generally known, i.e., the NLP metadata extraction framework 200 may not have such a pattern prestored.
  • text entities in the input file having associated value same as the associated value of the value entity may be automatically identified. In other words, all the e-mail ID text entities may be identified and extracted.
  • a search pattern (Regex) corresponding to a pattern associated with the pattern entity may be automatically generated. Further, a pattern associated with each of the plurality of text entities in the input file may be determined. The search pattern (Regex) corresponding to the pattern entity may be mapped with the patterns associated with the plurality of text entities, and one or more matching patterns from the patterns associated with the plurality of text entities may be determined based on the mapping. The matching text entities corresponding to the one or more matching patterns may then be extracted from the plurality of text entities.
  • a search pattern (Regex) “apd” (where “a” signifies one or more alphabets, “p” signifies one or more punctuations, and “d” signifies one or more digits) may be automatically generated. Further, patterns associated with the other text entities in the input file may be determined, the search pattern “apd” may be mapped with those patterns, and matching patterns may be identified. Thereafter, all the text entities corresponding to the matching patterns may be extracted. In this way, all the oil well names in the input file may be extracted.
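  • A minimal Python sketch of such automatic pattern generation and matching is given below; the signature alphabet (“a” for alphabets, “p” for punctuations, “d” for digits) follows the example above, while the helper names are illustrative:

        import re

        # Collapse runs of alphabets, digits and remaining characters into "a", "d" and "p"
        # to obtain an "apd"-style search pattern, then collect entities sharing it.
        def signature(entity):
            collapsed = re.sub(r"[A-Za-z]+", "a", entity)
            collapsed = re.sub(r"\d+", "d", collapsed)
            return re.sub(r"[^ad]+", "p", collapsed)

        def extract_matching(seed, candidates):
            """Return every candidate whose signature matches the seed's signature."""
            target = signature(seed)
            return [c for c in candidates if signature(c) == target]

        print(signature("AB-123"))                                      # -> apd
        print(extract_matching("AB-123", ["CD-987", "hello", "XY-4"]))  # -> ['CD-987', 'XY-4']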
  • the first NLP rules may cause to receive a text input from a user for identifying relevant text entities from the plurality of text entities, and automatically generate a search pattern corresponding to the text input.
  • the first NLP rules may further cause to determine a pattern associated with each of the plurality of text entities, and map the search pattern corresponding to the text input with patterns associated with the plurality of text entities.
  • the first NLP rules may further cause to identify one or more matching patterns from the patterns associated with the plurality of text entities based on the mapping, and extract, from the plurality of text entities, relevant text entities corresponding to the one or more matching patterns.
  • a user may provide a text input “AB-123” for identifying relevant text entities from the plurality of text entities extracted from the input file.
  • a search pattern (Regex) “apd” may be identified for the input text entity “AB-123”. Further, patterns associated with other text entities in the input file may be determined, the search pattern (Regex) “apd” may be mapped with patterns associated with the other text entities, and matching patterns may be identified. Thereafter, all the text entities corresponding to the matching patterns may be extracted. In this way, all the oil well names in the input file may be extracted.
  • the text input from the user may be received in the form of an entry in a “Microsoft Excel” sheet.
  • One or more rules may be defined that may automatically generate the search pattern (Regex) corresponding to the text input.
  • the one or more rules may determine a pattern associated with each of the plurality of text entities of the input file, map the search pattern corresponding to the text input with patterns associated with the plurality of text entities, identify one or more matching patterns from the patterns associated with the plurality of text entities based on the mapping, and extract, from the plurality of text entities, relevant text entities corresponding to the one or more matching patterns.
  • the “Microsoft Excel” data may be received from one or more “Microsoft Excel” sheets 222.
  • the various rules, i.e., domain rules, NLP rules, location rules, etc.
  • “Microsoft Excel” macros may be used for extraction.
  • the same macros may be embedded in Python (language).
  • the extracted data 226 may refer to relevant text entities extracted from the plurality of text entities, or expressions that may be generated based on the extracted relevant text entities using the various rules.
  • ML Machine Learning
  • AI Artificial Intelligence
  • Referring now to FIG. 3, a block diagram of a spell check system 300 is illustrated, in accordance with some embodiments of the present disclosure.
  • text entities extracted by an extraction module 302 may be received by the spell check system 300.
  • These expressions generated by extraction module 302 may act as input text 304 for the spell check system 300.
  • the input text 304 may be preprocessed by a preprocessing module 306.
  • a cosine similarity analysis may be performed on the preprocessed data, by a cosine similarity module 308.
  • the cosine similarity module 308 may perform the cosine similarity analysis using a threshold.
  • a minimum edit distance module 310 may perform minimum edit distance analysis on the data received from the cosine similarity module 308. In some embodiments, the minimum edit distance analysis may be performed on filtered words.
  • a maximum cosine similarity module 312 may perform maximum cosine similarity analysis on the data received from the minimum edit distance module 310. It may be noted that the maximum cosine similarity analysis may be performed on the words within the least edit distance. A replacement module 314 may replace an incorrect word with a corrected word, based on the analysis performed by the maximum cosine similarity module 312.
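  • The spell check stages above may be sketched in Python as follows; the character-bigram features, the vocabulary, and the threshold are assumptions chosen only to make the example self-contained:

        from collections import Counter
        from math import sqrt

        # Sketch of the FIG. 3 pipeline: cosine similarity over character bigrams shortlists
        # candidates, minimum edit distance filters them, and the candidate with maximum
        # cosine similarity among the least-distance words replaces the incorrect word.
        def bigrams(word):
            return Counter(word[i:i + 2] for i in range(len(word) - 1))

        def cosine(a, b):
            va, vb = bigrams(a), bigrams(b)
            dot = sum(va[k] * vb[k] for k in va)
            norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
            return dot / norm if norm else 0.0

        def edit_distance(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
                prev = cur
            return prev[-1]

        def correct(word, vocabulary, threshold=0.3):
            shortlist = [w for w in vocabulary if cosine(word, w) >= threshold]
            if not shortlist:
                return word
            best = min(edit_distance(word, w) for w in shortlist)
            nearest = [w for w in shortlist if edit_distance(word, w) == best]
            return max(nearest, key=lambda w: cosine(word, w))

        print(correct("Unitad", ["United", "Untied", "Unique"]))  # -> United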
  • the recommendation system 400 may be used for identifying an entity type, i.e., value entity or pattern entity.
  • the recommendation system 400 may use a classification-based machine learning (ML) model 402.
  • the recommendation system 400 may receive input data (NLP extraction and spell check data) 406. It may be noted that the input data may be data on which NLP and spell check have been performed.
  • the recommendation system 400 may include a configuration (config) file 408.
  • the config file 408 may trigger determining whether a value may be recommended or a pattern may be recommended for the input data.
  • the recommendation system 400 may feed the input data to the ML model 402.
  • the ML model 402 may use historical data from an archive database 404. Based on the historical data, the ML model 402 may either provide a prediction data 410, or provide a recommendation data 412. As such, a recommended value 414 may be generated based on the prediction data 410, or a recommended pattern 416 may be generated based on the recommendation data 412.
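  • A small Python sketch of such a classification-based recommendation is shown below; scikit-learn, the character n-gram features, and the historical labels are assumptions used only for illustration:

        # Sketch of a classification-based model that recommends whether an unresolved
        # entity should be treated as a value entity or a pattern entity.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        historical_entities = ["john@gmail.com", "mary@yahoo.com", "AB-123", "XY-987", "CD-456"]
        historical_labels = ["value", "value", "pattern", "pattern", "pattern"]

        model = make_pipeline(
            TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
            LogisticRegression(max_iter=1000),
        )
        model.fit(historical_entities, historical_labels)

        print(model.predict(["EF-321"]))             # likely ['pattern']
        print(model.predict(["alice@outlook.com"]))  # likely ['value']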
  • Referring now to FIG. 5, a block diagram of a metadata update system 500 is illustrated, in accordance with some embodiments of the present disclosure.
  • the metadata update system 500 may receive NLP-extracted data 502.
  • the metadata update system 500 may include a system date and time module 504.
  • the metadata update system 500 may further include a confidence module 506 and a probability module 508.
  • the confidence module 506 may determine a confidence score associated with the accuracy of data extracted and expressions generated by the NLP rules.
  • the probability module 508 may determine a probability score associated with the accuracy of extracted data and expressions generated by the NLP rules, based on the confidence score. In other words, the probability module 508 may determine a probability of how accurate/useful the extracted data or expressions generated using the NLP rules are.
  • If the probability as determined by the probability module 508 is high enough (i.e., greater than a threshold), then the data extracted and the expressions generated using the NLP rules, i.e., rule-based values 514, may be provided. However, if the probability, as determined by the probability module 508, is not high enough (i.e., less than the threshold), then data extracted and expressions generated using the ML model, i.e., Machine Learning (ML) values 516, may be provided. As mentioned earlier, the ML values may be generated by the ML model using historical data, which may be stored in an archive database 510. The data extracted and the generated expressions may be stored in a final database 512.
  • ML Machine Learning
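  • A minimal sketch of the selection logic described above, with an illustrative threshold, may be as follows:

        # Sketch of the FIG. 5 decision: prefer the rule-based extractions when their
        # probability score is high enough, otherwise fall back to the ML values.
        def select_values(rule_based_values, ml_values, probability_score, threshold=0.8):
            if probability_score >= threshold:
                return rule_based_values
            return ml_values

        print(select_values({"Name": "Ajay Thakur"}, {"Name": "A. Thakur"}, probability_score=0.92))
        # -> {'Name': 'Ajay Thakur'}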
  • a flowchart of a method 600 of extracting information from contents of an input file is illustrated, in accordance with an embodiment of the present disclosure.
  • the method 600 may provide for a Regex-based method of text extraction.
  • text data may be identified from the input file.
  • the identified text data may include a plurality of text entities.
  • the input file may include at least one of an image file and a Portable Document Format (PDF) file.
  • PDF Portable Document Format
  • a text input may be received from a user for identifying relevant text entities from the plurality of text entities. For example, the user may provide the text as an entry in a “Microsoft Excel” sheet.
  • a search pattern (Regex) may be automatically generated corresponding to the text input.
  • a pattern may be determined associated with each of the plurality of text entities.
  • the search pattern corresponding to the text input may be mapped with patterns associated with the plurality of text entities.
  • one or more matching patterns from the patterns associated with the plurality of text entities may be identified based on the mapping.
  • relevant text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities.
  • an expression may be generated using the extracted relevant text entities.
  • a Machine Learning (ML) model may be used for extracting relevant entities, for example, when the method 600 is not able to extract the relevant entities.
  • ML Machine Learning
  • text data may be identified from the input file.
  • the identified text data may include a plurality of text entities.
  • the input file may include at least one of an image file and a Portable Document Format (PDF) file.
  • PDF Portable Document Format
  • a domain-based name associated with each of the plurality of text entities may be determined.
  • the domain-based name may be determined based on mapping each of the plurality of text entities with a list of domain-based names stored in a data depository.
  • a type associated with the input file may be identified.
  • a location of one or more relevant text entities in the input file may be determined, based on the type associated with the input file.
  • the one or more relevant text entities may be extracted from the input file, based on the location.
  • a part-of-speech (POS) associated with each of the plurality of text entities may be identified. It may be understood that the POS may be one of a noun, a pronoun, a verb, an adverb, an adjective, a conjunction, a preposition, and an interjection.
  • POS part-of-speech
  • one or more text entities for which POS is identified as a noun may be determined.
  • one or more unique text entities for which POS is not identified may be selected.
  • an entity type associated with each of the one or more unique text entities may be determined. The entity type may be one of a value entity and a pattern entity. Further, each value entity may have an associated value and each pattern entity has an associated pattern.
  • a (ML-based) recommendation system 400 may be used for determining an entity type associated with each of the one or more unique text entities.
  • At step 720, for each value entity, one or more text entities having an associated value same as the associated value of the value entity may be automatically identified. For each pattern entity, steps 722-728 may be performed. As such, at step 722, a search pattern (Regex) corresponding to a pattern associated with the pattern entity may be automatically generated. At step 724, the search pattern corresponding to the pattern entity may be mapped with patterns associated with the plurality of text entities. At step 726, one or more matching patterns may be determined from the patterns associated with the plurality of text entities based on the mapping. At step 728, matching text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities.
  • a search pattern (Regex) corresponding to a pattern associated with the pattern entity may be automatically generated.
  • the search pattern corresponding to the pattern entity may be mapped with patterns associated with the plurality of text entities.
  • At step 726, one or more matching patterns may be determined from the patterns associated with the plurality of text entities based on the mapping.
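  • The handling of value entities and pattern entities in steps 720-728 may be sketched in Python as below; the signature helper repeats the illustrative “apd” convention used earlier, and all names are assumptions:

        import re

        def signature(entity):
            collapsed = re.sub(r"[A-Za-z]+", "a", entity)
            collapsed = re.sub(r"\d+", "d", collapsed)
            return re.sub(r"[^ad]+", "p", collapsed)

        def extract_for_entity(entity, entity_type, all_entities):
            if entity_type == "value":
                # step 720: collect every text entity carrying the same value
                return [e for e in all_entities if e == entity]
            # steps 722-728: build a search pattern and collect entities sharing it
            target = signature(entity)
            return [e for e in all_entities if signature(e) == target]

        entities = ["AB-123", "CD-987", "john@gmail.com", "hello"]
        print(extract_for_entity("AB-123", "pattern", entities))  # -> ['AB-123', 'CD-987']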
  • the method 700 may be performed in conjunction with method 600.
  • the method 600 may kick in after the step 710 of the method 700 is completed and before the step 712 of the method 700 starts.
  • At step 802, text data may be extracted from contents of the input file.
  • the extracted text data may include a plurality of text entities.
  • a domain-based name associated with each of the plurality of text entities may be determined.
  • the domain-based name may be determined based on mapping each of the plurality of text entities with a list of domain-based names stored in a data depository. For example, when a PIN code is available, the city may be determined from the PIN code using the domain-based approach, for example, by mapping the PIN code with a list (of PIN codes and associated city names) stored in a knowledgebase. Similarly, a country name (e.g., India) may be determined when a city name (e.g., Delhi) is identified.
  • tags may be assigned to each of the plurality of text entities, accordingly.
  • a check may be performed to determine whether the domain-based name is successfully determined or not. If the domain-based name is successfully determined, the method may proceed to step 822 (“Yes” path), where the identified text entities may be extracted. Further, expressions may be generated using the extracted text entities. If, at step 806, the domain-based name is not determined, the method may proceed to step 808 (“No” path). In other words, when a domain-based name cannot be identified, the method may proceed to take a next alternative approach.
  • a value of a predetermined field may be determined based on location of each of the plurality of text entities.
  • the location may be in the form of (x, y) coordinates.
  • the location may be predetermined based on a template stored in the data repository.
  • the value of a particular name may be written beside the name or beneath a particular name header, which may be identifiable based on the (x, y) coordinates.
  • the entities, for example, the name of the user, may be available at a specific location, from where they could be identified.
  • the location technique may be especially helpful in case of a Table, as the data in the Table is available in a structured format, and entities may be easily and accurately located.
  • a type associated with the input file may be identified, i.e., if the input file is an “Aadhar card”, or a “PAN card”, or a driving license. Further, location of one or more relevant text entities in the input file may be determined, based on the type associated with the input file.
  • a check may be performed to determine whether the location of one or more relevant text entities in the input file is successfully determined or not. If the location is successfully determined, the method 800 may once again proceed to step 822 (“Yes” path), where one or more relevant text entities from the input file may be extracted based on the location and expressions may be generated using the extracted text entities. If it is found that the location is not determined, at step 810, the method 800 may proceed to step 812.
  • a Regex-based approach may be attempted on the input file to extract relevant text entities.
  • a text input may be received from a user for identifying relevant text entities from the plurality of text entities, and a search pattern corresponding to the text input may be automatically generated.
  • a Regex may be generated.
  • a pattern associated with each of the plurality of text entities may be generated, the search pattern (Regex) corresponding to the text input may be mapped with patterns associated with the plurality of text entities, and one or more matching patterns from the patterns associated with the plurality of text entities may be identified based on the mapping.
  • the method 800 may include calculating a probability of determining the relevant entity from the extracted text data, corresponding to the Regex.
  • a check may be performed to determine if the Regex-based approach is successfully applied. If the Regex-based approach is successfully applied, the method 800 may proceed to step 822 (“Yes” path), where relevant text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities and expressions may be generated using the extracted text entities. If the Regex-based approach is not successful, the method 800 may proceed to step 816 (“No” path).
  • POS part-of-speech
  • POS associated with each of the plurality of text entities may be determined.
  • an attempt may be made to determine which POS the entity may be associated with; for example, the POS may be a noun, an adverb, an adjective, etc.
  • entity type associated with each of the one or more unique text entities.
  • one or more unique text entities for which POS is not identified may be selected. For example, for unique entities having a combination of alphabets, digits, punctuations, etc., the POS may not be identified.
  • an entity type associated with each of the one or more unique text entities may be determined.
  • the entity type may be one of a value entity and a pattern entity.
  • Each value entity may have an associated value and each pattern entity may have an associated pattern.
  • an attempt may be made to automatically identify one or more text entities having associated value (for example, e-mail ID) same as the associated value of the value entity.
  • a check may be performed to determine if the attempt is successful (i.e., the POS-based approach is successful). If the attempt is found to be successful, the method 800 may proceed to step 822 (“Yes” path), where the one or more text entities (relevant text entities) having associated value same as the associated value of the value entity may be extracted and expressions may be generated using the extracted text entities.
  • an attempt may be made to automatically generate a search pattern corresponding to a pattern associated with the pattern entity, map the search pattern corresponding to the pattern entity with patterns associated with the plurality of text entities, and determine one or more matching patterns from the patterns associated with the plurality of text entities based on the mapping.
  • a check may be performed to determine if the attempt is successful (i.e., the POS-based approach is successful). If the attempt is found to be successful, the method 800 may proceed to step 822 (“Yes” path), where the matching text entities corresponding to the one or more matching patterns may be extracted from the plurality of text entities and expressions may be generated using the extracted text entities. If at step 818, the POS-based approach is found to be not successful, the method 800 may proceed to step 820 (“No” path).
  • an attempt may be made to extract relevant text entities (information) from plurality of text entities using a machine learning (ML) model.
  • ML machine learning
  • the ML model may be first trained to extract a type of relevant text entities. Therefore, an ML-based classification may be applied to the input file to extract relevant text entities, based on the training of the ML model. Accordingly, at step 822, relevant text entities may be extracted based on the ML-based classification, and expressions may be generated using the extracted text entities.
  • steps 804, 808, 812, 816, and 820 may be performed in the sequence described above, or in any other sequence as well.
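  • One possible way to realize this sequence is a simple cascade in which each approach is tried in turn and the first successful result is kept; the extractor functions below are mere placeholders standing in for the domain-, location-, Regex-, POS- and ML-based steps described above:

        # Sketch of the FIG. 8 fallback cascade (placeholder extractors, illustrative only).
        def extract_with_fallback(text_entities, approaches):
            for approach in approaches:
                result = approach(text_entities)
                if result:          # the checks at steps 806/810/814/818
                    return result
            return None             # every approach, including step 820, failed

        def domain_based(entities):
            return None             # pretend no domain rule matched

        def regex_based(entities):
            return [e for e in entities if "@" in e]

        print(extract_with_fallback(["AB-123", "john@gmail.com"], [domain_based, regex_based]))
        # -> ['john@gmail.com']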
  • NLP-based approach may be used in the below scenarios:
  • search key i.e., text input from a user
  • associated text extraction i.e., relevant entity
  • a confidence score may be calculated in each of the approaches.
  • the maximum probability score (for example, 1)
  • the probability for location identification may either be 0 or 1 for the Regex-based approach, and the probability score of extracting (e) can be 1 or 0.
  • the extracted values can be multiple (n), and one out of those values may be selected.
  • the probability of the selected value may be “1/n”, and the confidence score may be e*(1/n).
  • the confidence score may be (w*1)/2 + (e*(1/n))/2.
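  • A worked example of the confidence score above, with w taken as the identification probability, e as the extraction probability, and n as the number of candidate extracted values (the variable names follow the description, the sample values are illustrative):

        # Worked example of the confidence score (w*1)/2 + (e*(1/n))/2.
        def confidence_score(w, e, n):
            return (w * 1) / 2 + (e * (1 / n)) / 2

        print(confidence_score(w=1, e=1, n=4))  # -> 0.625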
  • the probability score may be derived based on Confusion Matrix (Accuracy Rate).
  • P2 probability score of a Rules-based approach
  • P3 probability score of a ML-based approach.
  • Referring now to FIG. 9, a snapshot of an exemplary input file 900 with a noisy entity 902 present in the input file 900 is shown.
  • a first part of an entity may not be legible because of smudging, or scratching, etc.
  • a second part of the entity may include the word “United”.
  • the actual entity could possibly be “United Airlines” or “United States”, etc.
  • one or more approaches, i.e., domain-based, location-based, POS-based, Regex-based, or ML-based, may be used for determining the actual entity.
  • the ML-based approach may kick in, and may provide a text extraction based on historical data using a classification-based Machine Learning (ML) model.
  • the ML-model may be a deterministic model and/or a probabilistic model.
  • Referring now to FIG. 10, a snapshot of an input file “Local Order” 1000 is shown, from which various text entities may be required to be extracted. As shown in FIG. 10, the input file “Local Order” 1000 is in a tabular structure.
  • the “RO Number” may be extracted from a domain dictionary database (lookup table) 1002, by mapping the RO date attribute.
  • a key “Part Number” present in the input file 1000 may be identified or received as user input. Thereafter, by location-based approach, using the template information stored in the database, a possible location of the Part Number may be determined. For example, the location-based approach may suggest that the Part Number may be present on the “right” side of the input file 1000. Therefore, the required Part Number “32145643” may be extracted from the right cell coordinates of the tabular structure of input file “Local Order” 1000.
  • the Regex-based approach may be used.
  • a key (Regex) associated with the attribute e-mail may be identified or received from a database.
  • the Regex may be “apapapa” (“a” signifying one or more alphabets, “p” signifying one or more punctuations, and “d” signifying one or more digits), corresponding to the e-mail ID: “ajay.thakur@gmail.com”.
  • the text entity with the matching pattern, i.e., of the e-mail ID, may be identified, and the value (“ajay.thakur@gmail.com”) may be extracted from the corresponding cell of the tabular structure of the input file 1000.
  • a POS-based approach may be used. For example, a key present for the attribute (Name) may be first identified in the input file 1000. Thereafter, POS of the identified text entities in the input file 1000 may be determined, for example, using a Named Entity Recognizer (NER) tagger. A NER tagger specified in the database for the attribute Name may be searched. A text entity corresponding to the attribute Name (“Ajay Thakur”) having a POS noun may, therefore, be extracted. Accordingly, an expression using the extracted text entities “Ajay Thakur” may be generated.
  • NER Named Entity Recognizer
  • the techniques provide for various approaches, i.e., a domain-based, a location-based, a POS-based, a Regex-based, and an ML-based approach, for performing text extraction.
  • the text extraction, therefore, can be performed using a single approach, or multiple approaches applied in any sequence.
  • the techniques allow expressions to be generated using the extracted information.
  • the techniques allow for extracting information and generating expressions using the extracted information, even in complex documents, in which no tags are present or no pattern can be identified in the extracted information.
  • One or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure.
  • a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
  • a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
  • the term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
EP20910797.8A 2019-12-30 2020-12-30 Domain-based text extraction Pending EP4085343A4 (de)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201941054421 2019-12-30
PCT/IB2020/062535 WO2021137166A1 (en) 2019-12-30 2020-12-30 Domain based text extraction

Publications (2)

Publication Number Publication Date
EP4085343A1 true EP4085343A1 (de) 2022-11-09
EP4085343A4 EP4085343A4 (de) 2024-01-03

Family

ID=76685920

Family Applications (1)

Application Number Title Priority Date Filing Date
EP20910797.8A Pending EP4085343A4 (de) 2019-12-30 2020-12-30 Bereichsbasierte textextraktion

Country Status (5)

Country Link
EP (1) EP4085343A4 (de)
JP (1) JP2023507881A (de)
AU (1) AU2020418619A1 (de)
CA (1) CA3156204A1 (de)
WO (1) WO2021137166A1 (de)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912845B (zh) * 2023-06-16 2024-03-19 广东电网有限责任公司佛山供电局 Intelligent content recognition and analysis method and device based on NLP and AI

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10318804B2 (en) * 2014-06-30 2019-06-11 First American Financial Corporation System and method for data extraction and searching

Also Published As

Publication number Publication date
WO2021137166A1 (en) 2021-07-08
EP4085343A4 (de) 2024-01-03
CA3156204A1 (en) 2021-07-08
AU2020418619A1 (en) 2022-05-26
JP2023507881A (ja) 2023-02-28

Similar Documents

Publication Publication Date Title
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US11055327B2 (en) Unstructured data parsing for structured information
US8489388B2 (en) Data detection
CN111460827B (zh) Text information processing method, system, device and computer-readable storage medium
CN111767716B (zh) Method and apparatus for determining multi-level industry information of an enterprise, and computer device
US10733675B2 (en) Accuracy and speed of automatically processing records in an automated environment
US20190392038A1 (en) Methods, devices and systems for data augmentation to improve fraud detection
US9098487B2 (en) Categorization based on word distance
US20210264208A1 (en) Systems and methods for domain agnostic document extraction with zero-shot task transfer
RU2768233C1 (ru) Fuzzy search using word forms for working with big data
EP4141818A1 (de) Digitisation, conversion and validation of documents
CN114298035A (zh) Text recognition and desensitization method and system
CN112149387A (zh) Financial data visualization method and apparatus, computer device and storage medium
CN112258144A (zh) Policy document information matching and pushing method based on automatic construction of a target entity set
CN115223188A (zh) Bill information processing method and apparatus, electronic device and computer storage medium
EP4085343A1 (de) Domain-based text extraction
US20240020473A1 (en) Domain Based Text Extraction
CA3170100A1 (en) Text processing method and device and computer-readable storage medium
KS et al. Automatic error detection and correction in malayalam
US20240362939A1 (en) Method and system of extracting non-semantic entities
US20240143632A1 (en) Extracting information from documents using automatic markup based on historical data
US20240020479A1 (en) Training machine learning models for multi-modal entity matching in electronic records
CN116991983B (zh) Event extraction method and system for company news texts
US20240289557A1 (en) Self-Attentive Key-Value Extraction
Pandey et al. A Robust Approach to Plagiarism Detection in Handwritten Documents

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20211125

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
REG Reference to a national code

Ref country code: DE

Ref legal event code: R079

Free format text: PREVIOUS MAIN CLASS: G06F0016000000

Ipc: G06F0018220000

A4 Supplementary search report drawn up and despatched

Effective date: 20231205

RIC1 Information provided on ipc code assigned before grant

Ipc: G06F 40/216 20200101ALI20231129BHEP

Ipc: G06V 30/416 20220101ALI20231129BHEP

Ipc: G06V 30/262 20220101ALI20231129BHEP

Ipc: G06F 40/279 20200101ALI20231129BHEP

Ipc: G06F 18/2413 20230101ALI20231129BHEP

Ipc: G06F 18/22 20230101AFI20231129BHEP