WO2021186364A1 - Extracting text-entities from a document matching a received input - Google Patents
Extracting text-entities from a document matching a received input
- Publication number
- WO2021186364A1 (PCT/IB2021/052233)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- entity
- character
- received input
- document
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Definitions
- This disclosure relates generally to text extraction, and more particularly to a method of extracting, from a document, one or more text-entities matching a received input.
- NLP Natural Language Processing
- a method for extracting text from a document based on a received input may include deciphering a character pattern associated with the received input.
- the received input may include a plurality of characters.
- deciphering the character pattern may include identifying a character type from a plurality of character types associated with each of the plurality of characters in the received input, and assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types.
- deciphering the character pattern may further include creating one or more clusters for the received input in response to assigning the character code to each of the plurality of characters.
- Each of the one or more clusters may include at least one contiguous occurrence of the same character code.
- deciphering the character pattern may include replacing, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern.
- the method may further include extracting, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
- a system for extracting text from a document based on a received input includes a processor and a memory communicatively coupled to the processor.
- the memory is configured to store processor-executable instructions.
- the processor-executable instructions, on execution, may cause the processor to decipher a character pattern associated with the received input.
- the received input may include a plurality of characters.
- deciphering the character pattern by the processor may include identifying a character type from a plurality of character types associated with each of the plurality of characters in the received input, and assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types.
- deciphering the character pattern by the processor may further include creating one or more clusters for the received input in response to assigning the character code to each of the plurality of characters. Each of the one or more clusters may include at least one contiguous occurrence of the same character code. In accordance with an embodiment, deciphering the character pattern by the processor may further include replacing, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern. The processor-executable instructions may further cause the processor to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
- FIG. 1 is a block diagram that illustrates an environment for extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
- FIG. 2 is a functional block diagram of a text extraction system for extracting text from a document based on a received input, in accordance with some other embodiments of the present disclosure.
- FIG. 3 is a flowchart that illustrates an exemplary method for identifying a character pattern in a document and generating a set of rules, in accordance with an embodiment of the present disclosure.
- FIG. 4 illustrates a model/rule engine of a text extraction system for extracting a character pattern in a document based on a set of rules, in accordance with an embodiment of the present disclosure.
- FIG. 5 illustrates a flowchart of an exemplary method of extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
- FIG. 6 illustrates a flowchart of an exemplary method of deciphering a character pattern associated with the received input, in accordance with an embodiment of the present disclosure.
- FIG. 7 illustrates a flowchart of an exemplary method for extracting, from a document, at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
- FIG. 8 illustrates a flowchart of an exemplary method of deciphering a character pattern associated with each text-entity of the at least one text-entity from the document, in accordance with an embodiment of the present disclosure.
- FIG. 9 illustrates a flowchart of an exemplary method for removing at least one noise text-entity from the extracted at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
- the disclosed system may use a rule engine to extract text entities for improving the accuracy of extracting text-entities from the document without requiring manual intervention.
- the disclosed text extraction system may facilitate extracting, from a document, text-entities matching an input-token.
- the disclosed text extraction system may provide for time and labor efficient training of the rule engine for improving the accuracy of extracting text-entities from the document.
- the disclosed text extraction system may facilitate identification of various different type of relevant text-entities in a document, while minimizing chances of missing any relevant text-entities.
- Referring to FIG. 1, a block diagram illustrates an environment for extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
- the environment 100 includes a text extraction system 102, a data storage 104, external devices 106, and a communication network 108.
- the text extraction system 102 may include a processor 110 and a memory 112.
- a display unit 114 may include a user interface (UI) 116.
- the text extraction system 102 may be communicatively coupled to the data storage 104 and the external devices 106, via the communication network 108.
- a user (not shown in FIG. 1) may be associated with the text extraction system 102 or the external devices 106.
- the text extraction system 102 may include suitable logic, circuitry, interfaces, and/or code that may be configured to receive an input that may include a plurality of characters.
- the input may correspond to a user query to extract a relevant text entity from a document.
- the input may correspond to an input token that includes a plurality of characters.
- the text extraction system 102 may be configured to decipher the character pattern associated with the received input.
- the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
- the text extraction system 102 may use an application for application-specific deployment that includes software and/or logic to extract text from the document based on the received input.
- the text extraction system 102 may be implemented as a plurality of distributed cloud-based resources by use of several technologies that are well known to those skilled in the art.
- Other examples of implementation of the text extraction system 102 may include, but are not limited to, a web/cloud server, an application server, a media server, and a Consumer Electronic (CE) device (such as, but not limited to, a desktop, a laptop, a smartphone and a tablet).
- the processor 110 may be communicatively coupled to the memory 112.
- the processor 110 may include suitable logic, circuitry, interfaces, and/or code that may be configured to extract text from the document based on the input token or received input.
- the processor 110 may be implemented based on a number of processor technologies, which may be known to one ordinarily skilled in the art. Examples of implementations of the processor 110 may be a Graphics Processing Unit (GPU), a Reduced Instruction Set Computing (RISC) processor, an Application-Specific Integrated Circuit (ASIC) processor, a Complex Instruction Set Computing (CISC) processor, a microcontroller, Artificial Intelligence (AI) accelerator chips, a co-processor, a central processing unit (CPU), and/or a combination thereof.
- the processor 110 may be communicatively coupled to, and communicates with, the memory 112.
- the memory 112 may include suitable logic, circuitry, and/or interfaces that may be configured to store instructions executable by the processor 110. Additionally, the memory 112 may be configured to store program code of one or more software applications that may incorporate the program code of the one or more text extraction algorithms. The memory 112 may be configured to store any received data (such as, a received query as input token from a user) or generated data (such as, clusters of same character codes) associated with storing, maintaining, and executing the text extraction system 102 used to extract text entities from the document based on the received input.
- Examples of implementation of the memory 112 may include, but are not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Hard Disk Drive (HDD), a Solid-State Drive (SSD), a CPU cache, and/or a Secure Digital (SD) card.
- the data storage 104 may store documents and input received from users for the text extraction system 102, character patterns associated with text entities in documents, and data associated with the documents for access by users of the text extraction system 102.
- the data storage 104 may store metadata indicative of location of character patterns of text entities within documents along with the documents, and may be accessed via the communication network 108.
- the data storage 104 may store data structures for use in extraction of text entities from documents and input. While the example of FIG. 1 includes a single data storage (the data storage 104) located elsewhere in the environment 100 from the text extraction system 102, in some embodiments, the data storage 104 may also be included as a part of the text extraction system 102.
- the data storage 104 may use computer-readable storage media that includes tangible or non-transitory computer-readable storage media including, but not limited to, Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices (e.g., Hard-Disk Drive (HDD)), flash memory devices (e.g., Solid State Drive (SSD), Secure Digital (SD) card, other solid state memory devices), or any other storage medium which may be used to carry or store particular program code in the form of computer-executable instructions or data structures and which may be accessed by a general-purpose or special-purpose computer. Combinations of the above may also be included within the scope of computer-readable storage media.
- the external devices 106 may include suitable logic, circuitry, interfaces, and/or code that may be configured to transmit input token or a user query to the text extraction system 102 for extracting at least one text-entity matching the received input or the user query.
- the external devices 106 may be capable of communicating with the text extraction system 102 and the data storage 104 via the communication network 108.
- the external devices 106 and the text extraction system 102 are generally disparately located.
- the functionalities of the external devices 106 may be implemented in portable devices, such as a high-speed computing device, and/or non-portable devices, such as an application server.
- Examples of the external devices 106 may include, but are not limited to, a computing device, a smart phone, a camera, a mobile device, a laptop, a personal digital assistant (PDA), a microphone, a printer and a tablet.
- the communication network 108 may include a communication medium through which the text extraction system 102, the data storage 104, and the external devices 106 may communicate with each other.
- Examples of the communication network 108 may include, but are not limited to, the Internet, a cloud network, a Wireless Fidelity (Wi-Fi) network, a Personal Area Network (PAN), a Local Area Network (LAN), or a Metropolitan Area Network (MAN).
- Various devices in the environment 100 may be configured to connect to the communication network 108, in accordance with various wired and wireless communication protocols.
- Examples of such wired and wireless communication protocols may include, but are not limited to, a Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), Zigbee, EDGE, IEEE 802.11, light fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device to device communication, cellular communication protocols, and Bluetooth (BT) communication protocols.
- the display unit 114 may be configured to communicate with different operational components of the text extraction system 102.
- the display unit 114 may be configured to display data for the text extraction system 102.
- a user may interact in the environment 100 via the UI 116 accessible via the display unit 114.
- the text extraction system 102 may be configured to receive an input from the external devices 106.
- the text extraction system 102 may be further configured to decipher a character pattern associated with the received input.
- the text extraction system 102 may be configured to identify a character type from a plurality of character types associated with each of the plurality of characters in the received input.
- the text extraction system 102 may be configured to assign, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types.
- the text extraction system 102 may be further configured to create one or more clusters for the received input in response to assigning the character code to each of the plurality of characters.
- the one or more clusters may be stored in the memory 112 or the data storage 104, via the communication network 108.
- Each of the one or more clusters may include at least one contiguous occurrence of the same character code.
- the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern.
- the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
- the extracted at least one text entity may be displayed on the display unit 114 or the external devices 106 from the text extraction system 102, via the communication network 108.
- the text extraction system 102 may be configured to determine a location identifier of the character pattern associated with each text-entity of the at least one text-entity within the document.
- the character pattern may be localized based on the determination of the character pattern associated with each text-entity of the at least one text-entity within the document. Localization of the character pattern may correspond to determining the location of the character pattern in the document.
- the location identifier may be an attribute of the document corresponding to the character pattern and may be included as metadata of the document for quick extraction of text entities.
- the text extraction system 102 may be configured to create a mapping index, based on an association of the location identifier with the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the mapping index.
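- The mapping index described above can be pictured with a minimal sketch. It is not the disclosed implementation: the function names (`build_mapping_index`, `extract_by_index`), the use of character offsets as the location identifier, and the whitespace-based splitting of the document into text-entities are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from itertools import groupby


def char_code(ch: str) -> str:
    """Assumed coding: 'a' for letters, 'd' for digits, 'p' for everything else."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    return "p"


def decipher_pattern(text: str) -> str:
    """Collapse each contiguous run of the same character code into one code."""
    return "".join(code for code, _ in groupby(char_code(c) for c in text))


def build_mapping_index(document: str) -> dict[str, list[tuple[int, int]]]:
    """Associate each character pattern with the (start, end) offsets of its entities."""
    index = defaultdict(list)
    # Assumption: text-entities are whitespace-delimited tokens.
    for match in re.finditer(r"\S+", document):
        index[decipher_pattern(match.group())].append(match.span())
    return dict(index)


def extract_by_index(document: str, index: dict, received_input: str) -> list[str]:
    """Look up the input's pattern in the index and slice the entities out."""
    spans = index.get(decipher_pattern(received_input), [])
    return [document[start:end] for start, end in spans]


document = "Well abc-123 produced 40 barrels; well xy-9 did not."
index = build_mapping_index(document)
matches = extract_by_index(document, index, "qrs-777")
```

Because the index is built once per document, later inputs with the same pattern only need a dictionary lookup rather than a fresh scan of the text.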
- example embodiments described herein generally relate to extracting text from a document based on a received input
- example embodiments may also be implemented for extracting, without limitation, an icon, an image, an emoji, a logo, a barcode or a shape from the document, based on received input.
- All the components in the environment 100 may be coupled directly or indirectly to the communication network 108.
- the components described in the environment 100 may be further broken down into more than one component and/or combined together in any suitable arrangement. Further, one or more components may be rearranged, changed, added, and/or removed.
- FIG. 2 is a functional block diagram of a text extraction system for extracting text from a document based on a received input, in accordance with some other embodiments of the present disclosure.
- FIG. 2 is explained in conjunction with elements from FIG. 1.
- the text extraction system 102 may include a model/rule engine 202, a confidence layer 204, a model output storage 206, an application feedback module 208, an inference layer 210, and a corpus/knowledge repository 212.
- the model/rule engine 202 may be communicatively coupled to the confidence layer 204, the model output storage 206, the application feedback module 208, the inference layer 210, and the corpus/knowledge repository 212.
- Elements and features of the text extraction system 102 may be operatively associated with one another, coupled to one another, or otherwise configured to cooperate with one another as needed to support the desired functionality, as described herein.
- the various physical, electrical, and logical couplings and interconnections for the elements and the features are not depicted in FIG. 2.
- embodiments of the text extraction system 102 will include other elements, modules, and features that cooperate to support the desired functionality.
- FIG. 2 only depicts certain elements that relate to the techniques described in more detail below.
- the model/rule engine 202 may receive a set of feed files 214.
- a format for each of the set of feed files 214 may include, without limitation, a “.pdf” format, a “.txt” format, a “.doc” format, or a “.xlsx” format. It may be noted that the model/rule engine 202 may be developed based on the feed files 214.
- the confidence layer 204 may be an extension to the model/rule engine 202. It may be noted that the confidence layer 204 may remove noise text-entities from the extracted text-entities matching the received input. Further, the confidence layer 204 may facilitate extraction of relevant text data.
- the model/rule engine 202 may generate a model output based on the feed files 214 which may be stored in the model output storage 206. Further, the model output storage 206 may be accessed by the application feedback module 208. It may be noted that the application feedback module 208 may receive an application feedback from a user as a response to the model output. In some embodiments, the application feedback module 208 may receive the application feedback from the user via the UI 116. Further, the application feedback may be stored in an application feedback output storage 216. Further, the application feedback may be received by the inference layer 210.
- the inference layer 210 may identify a character pattern in the application feedback received via the application feedback module 208. Further, based on the character pattern, the inference layer 210 may define a set of rules for the model/rule engine 202.
- the corpus/knowledge repository 212 may receive the character pattern or the set of rules, or both, from the inference layer 210. It may be noted that information stored in the corpus/knowledge repository 212 may be used by the model/rule engine 202 for extracting text-entities.
- the model/rule engine 202, the confidence layer 204, the application feedback module 208, and the inference layer 210 may be implemented with (or cooperate with) each other to perform at least some of the functions and operations described in more detail herein.
- the model/rule engine 202, the confidence layer 204, the application feedback module 208, and the inference layer 210 may be realized as suitably written processing logic, application program code, or the like.
- FIG. 3 is a flowchart that illustrates an exemplary method 300 for identifying a character pattern in a document and generating a set of rules, in accordance with an embodiment of the present disclosure.
- FIG. 3 is explained in conjunction with elements from FIG. 1 and FIG. 2.
- This method 300 may be executed by any computing system, for example, by the text extraction system 102 of FIG. 1.
- an input may be received in the form of an application feedback or a new suggestion.
- the inference layer 210 of the text extraction system 102 may be configured to receive the application feedback or the new suggestion from a user.
- a set of character patterns may be uniquely identified for the feedback or a set of rules.
- the inference layer 210 of the text extraction system 102 may be configured to uniquely identify the set of character patterns for the feedback or the set of rules. As such, the steps 302-304 may be performed by the inference layer 210.
- a character type from a plurality of character types associated with each of the plurality of characters in the received input may be identified.
- a character code (from a plurality of character codes) may be assigned to each of the plurality of characters, based on the identified character type from the plurality of character types.
- One or more clusters may be created for the received input in response to assigning the character code to each of the plurality of characters, such that each of the one or more clusters includes at least one contiguous occurrence of the same character code.
- the at least one contiguous occurrence of the same character code may be replaced, for each of the one or more clusters, with a single occurrence of the same character code to generate the character pattern.
- the set of character patterns may be encoded for unification. In accordance with an embodiment, the unification may be indicative of which character patterns in the set are the same and which are different.
- the set of character patterns may be transmitted to the corpus/knowledge repository 212. Further, at step 306, the set of character patterns may be checked for unique patterns. In accordance with an embodiment, once the set of patterns are transmitted to the corpus/knowledge repository 212, the set of character patterns may be checked for unique patterns, and redundant patterns may be deleted.
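- The unification and uniqueness check described above can be sketched as a grouping of feedback tokens by their character pattern, so that identical patterns fall into one group and redundant patterns are stored only once. This is an illustrative assumption rather than the disclosed procedure; `unify` and the sample feedback tokens are hypothetical.

```python
from itertools import groupby


def decipher_pattern(text: str) -> str:
    """Assumed coding: 'a' for letters, 'd' for digits, 'p' for everything else."""
    codes = ("a" if c.isalpha() else "d" if c.isdigit() else "p" for c in text)
    return "".join(code for code, _ in groupby(codes))


def unify(tokens: list[str]) -> dict[str, list[str]]:
    """Group tokens by pattern; the keys are the unique patterns in first-seen order."""
    groups: dict[str, list[str]] = {}
    for token in tokens:
        groups.setdefault(decipher_pattern(token), []).append(token)
    return groups


# Hypothetical feedback tokens supplied by users.
feedback = ["abc-123", "xyz-9", "2021/03", "W-17"]
groups = unify(feedback)
unique_patterns = list(groups)  # redundant patterns are dropped by construction
```

Only the keys would need to be kept in a repository of patterns; the grouped tokens are retained here to show which inputs share a pattern.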
- the text extraction system 102 may be configured to generate a set of rules using the inference layer 210.
- For example, the text extraction system 102 may fail to extract or identify an oil well having the name “abc-123” from a document. In this scenario, the oil well name may be identified manually by a user, via a user interface. Thereafter, the user may input this oil well name to the text extraction system 102 to train it to extract, in the future, oil well names having a pattern similar to the pattern of this oil well name. As such, this oil well name may be received as input, feedback, or a new suggestion.
- a character pattern associated with the input-token may be deciphered. Accordingly, a character type of each character in the input-token “abc-123” may be identified as “alphabet”, “alphabet”, “alphabet”, “punctuation”, “numerical digit”, “numerical digit”, “numerical digit”, and the character codes may be assigned as “aaapddd”. Further, one or more clusters may be created using similar character codes positioned adjacent to each other (i.e., contiguous occurrence of the same character code).
- a first cluster of three “a” character codes, followed by a second cluster of a single “p” character code, followed by a third cluster of three “d” character codes may be identified. Further, a cluster code may be assigned to the first, second, and third clusters to obtain the character pattern associated with the input-token. Therefore, the character pattern deciphered for the received input “abc-123” is “apd”. The deciphering of the character pattern from the received input may also correspond to encoding.
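- The “abc-123” example can be traced step by step with a short sketch. The function names are hypothetical, and treating every character that is neither a letter nor a digit as punctuation is an assumption; the intermediate values follow the example in the text.

```python
from itertools import groupby


def character_codes(token: str) -> str:
    """Assign 'a' to letters, 'd' to digits and 'p' to any other character (assumed)."""
    return "".join(
        "a" if ch.isalpha() else "d" if ch.isdigit() else "p" for ch in token
    )


def clusters(codes: str) -> list[tuple[str, int]]:
    """Return each contiguous run of the same code as a (code, run_length) pair."""
    return [(code, len(list(run))) for code, run in groupby(codes)]


def character_pattern(token: str) -> str:
    """Keep one code per cluster to obtain the character pattern."""
    return "".join(code for code, _ in clusters(character_codes(token)))


codes = character_codes("abc-123")      # "aaapddd"
runs = clusters(codes)                  # [("a", 3), ("p", 1), ("d", 3)]
pattern = character_pattern("abc-123")  # "apd"
```

Since run lengths are discarded, inputs such as “xy-9” and “abc-123” share the same pattern, which is what lets one feedback example generalize to similarly shaped names.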
- the character pattern associated with the input-token may be mapped with the character patterns associated with the text-entities of the document to identify text-entities matching the input-token.
- the text extraction system 102 may be configured to determine a location identifier of the character pattern associated with each text-entity of the at least one text-entity within the document.
- the character pattern may be localized based on the determination of the character pattern associated with each text-entity of the at least one text-entity within the document. Localization of the character pattern may correspond to determining the location of the character pattern in the document.
- the location identifier may be an attribute of the document corresponding to the character pattern and may be included as metadata of the document for quick extraction of text entities.
- the text extraction system 102 may be configured to create a mapping index, based on an association of the location identifier with the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the mapping index.
- each of the text-entities in the document of the pattern “apd” may be extracted.
- the character pattern associated with the oil well name i.e., “apd”
- character patterns associated with text-entities of the document may be deciphered.
- the character pattern “apd” may be mapped with the character patterns associated with the text-entities of the document to identify text-entities matching the oil well name. In other words, all the text-entities having a character pattern matching with the pattern “apd” may be extracted.
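- The mapping of the pattern “apd” against the patterns of the document's text-entities can be sketched as a direct scan. The sketch assumes that text-entities are whitespace-delimited tokens and that `extract_matching_entities` is a hypothetical helper; the source does not specify how a document is split into text-entities.

```python
from itertools import groupby


def decipher_pattern(text: str) -> str:
    """Assumed coding: 'a' for letters, 'd' for digits, 'p' for everything else."""
    codes = ("a" if c.isalpha() else "d" if c.isdigit() else "p" for c in text)
    return "".join(code for code, _ in groupby(codes))


def extract_matching_entities(document: str, received_input: str) -> list[str]:
    """Return every whitespace-delimited entity whose pattern equals the input's."""
    target = decipher_pattern(received_input)
    return [
        entity for entity in document.split() if decipher_pattern(entity) == target
    ]


document = "The wells abc-123 and xyz-45 were drilled in 2019."
found = extract_matching_entities(document, "abc-123")
```

One limitation of this naive tokenization is that trailing punctuation becomes part of the entity, so “xyz-45.” would yield the pattern “apdp” and would not match.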
- Referring to FIG. 4, a block diagram of a model/rule engine 202 of a text extraction system for extracting a character pattern in a document based on a set of rules is illustrated, in accordance with an embodiment of the present disclosure.
- FIG. 4 is explained in conjunction with elements from FIG. 1 to FIG. 3.
- the model/rule engine 202 may receive an encoded regex pattern from an encoded regex pattern repository 402.
- the encoded regex pattern repository 402 may include character patterns associated with each text-entity within a document.
- a rules generation module 406 of the model/rule engine 202 may be configured to receive the encoded regex pattern from the encoded regex pattern repository 402. Further, in accordance with an embodiment, the rules generation module 406 may be configured to generate a set of rules by decoding the encoded patterns, based on the encoded regex pattern.
- a relevant data extraction model 408 of the model/rule engine 202 may receive the set of rules from the rules generation module 406.
- the rules generation module 406 may decode the encoded patterns (i.e., encoding performed via the method 300) which may help in extracting text entities. Further, the relevant data extraction model 408 may receive feed files 214. It may be noted that the relevant data extraction model 408 may extract a value from the feed files 214 based on the set of rules. The extracted value may later be stored in the extracted value storage 404.
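- One way to picture decoding an encoded pattern into a rule is to translate each character code into a regular-expression fragment. The source does not state how the rules generation module 406 performs this decoding, so the fragment table, the function names, and the requirement that a match span a whole whitespace-delimited token are all assumptions.

```python
import re

# Assumed translation of each character code into a regex fragment.
RULE_FRAGMENTS = {
    "a": "[A-Za-z]+",          # one or more letters
    "d": "[0-9]+",             # one or more digits
    "p": r"[^A-Za-z0-9\s]+",   # one or more other non-space characters
}


def generate_rule(encoded_pattern: str) -> re.Pattern:
    """Decode an encoded pattern such as 'apd' into a compiled regex rule."""
    body = "".join(RULE_FRAGMENTS[code] for code in encoded_pattern)
    # The lookarounds restrict matches to whole whitespace-delimited tokens.
    return re.compile(r"(?<!\S)" + body + r"(?!\S)")


def extract_values(feed_text: str, encoded_pattern: str) -> list[str]:
    """Apply the generated rule to feed text and return the extracted values."""
    return generate_rule(encoded_pattern).findall(feed_text)


feed_text = "Wells abc-123 and xyz-45 were logged on 2019-07-01."
values = extract_values(feed_text, "apd")
```

The fragments here are ASCII-only for readability; a rule set intended for other scripts would need broader character classes.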
- FIG. 5 is a flowchart that illustrates an exemplary method 500 of extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
- FIG. 5 is explained in conjunction with elements from FIG. 1 to FIG. 4.
- the operations of the exemplary method 500 may be executed by any computing system, for example, by the text extraction system 102 of FIG. 1.
- the operations of the method 500 may start at step 502 and proceed to step 504.
- a character pattern associated with the received input may be deciphered.
- the text extraction system 102 may be configured to decipher a character pattern associated with the received input.
- the received input (also referred to as the input-token) may include a plurality of characters.
- the character type associated with each of the plurality of characters in the received input may include at least one of: an alphabet, a punctuation, or a digit.
- FIG. 6 illustrates a flowchart of an exemplary method 600 of deciphering a character pattern associated with the received input, in accordance with an embodiment of the present disclosure.
- FIG. 6 is explained in conjunction with elements from FIG. 1 to FIG. 5.
- a character type, from a plurality of character types, associated with each of the plurality of characters in the received input may be identified.
- the text extraction system 102 may be configured to identify a character type from a plurality of character types associated with each of the plurality of characters in the received input.
- a character code from a plurality of character codes may be assigned to each of the plurality of characters.
- the text extraction system 102 may be configured to assign, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types. For example, the character code corresponding to an alphabet may be ‘a’. Similarly, the character code corresponding to a punctuation may be ‘p’, and to a digit may be ‘d’.
- one or more clusters may be created for the received input.
- the text extraction system 102 may be configured to create one or more clusters for the received input in response to assigning the character code to each of the plurality of characters.
- each of the one or more clusters may include the at least one contiguous occurrence of the same character code.
- the at least one contiguous occurrence of the same character code may be replaced with a single occurrence of the same character code to generate the character pattern.
- the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with the single occurrence of the same character code to generate the character pattern.
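By way of a non-limiting illustration, the steps of the method 600 (identifying character types, assigning character codes, creating clusters, and replacing each cluster with a single code) may be sketched as follows. The function names `char_code` and `character_pattern` are hypothetical, and treating every character that is neither an alphabet nor a digit as a punctuation is an assumption of this sketch.

```python
from itertools import groupby


def char_code(ch: str) -> str:
    """Assign a character code: 'a' for an alphabet, 'd' for a digit, 'p' otherwise."""
    if ch.isalpha():
        return "a"
    if ch.isdigit():
        return "d"
    return "p"  # assumption: anything else is treated as a punctuation


def character_pattern(token: str) -> str:
    """Collapse each contiguous run (cluster) of the same code into a single code."""
    codes = [char_code(ch) for ch in token]
    # groupby yields one group per cluster of adjacent, identical codes
    return "".join(code for code, _cluster in groupby(codes))


print(character_pattern("INV-2021/0042"))
```

For the input-token 'INV-2021/0042', the per-character codes are 'aaapddddpdddd', the clusters are 'aaa', 'p', 'dddd', 'p', 'dddd', and the resulting character pattern is 'apdpd'.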
- At step 504, at least one text-entity that matches the received input may be extracted from the document.
- the text extraction system 102 may be configured to extract, from the document, the at least one text-entity matching the received input, based on the character pattern deciphered for the received input. It may be noted that once the character pattern associated with the received input (or input-token) is deciphered, the character pattern may be used to extract, from the document, one or more text-entities which match the pattern associated with the input-token.
- FIG. 7 illustrates a flowchart of an exemplary method 700 of extracting, from a document, at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
- FIG. 7 is explained in conjunction with elements from FIG. 1 to FIG. 6.
- the operations of the exemplary method 700 may be executed by any computing system, for example, by the text extraction system 102.
- a character pattern associated with each text-entity of the at least one text-entity may be deciphered from the document.
- the text extraction system 102 may be configured to decipher the character pattern associated with each text-entity of the at least one text-entity from the document.
- the document may be parsed to obtain various text-entities, such as words, in the document.
- the character type associated with each of the plurality of characters in each text-entity of the document may include at least one of: an alphabet, a punctuation, or a digit.
- FIG. 8 illustrates a flowchart of an exemplary method 800 of deciphering a character pattern associated with each text-entity of the at least one text-entity from the document, in accordance with an embodiment of the present disclosure.
- FIG. 8 is explained in conjunction with elements from FIG. 1 to FIG. 7.
- the operations of the exemplary method 800 may be executed by any computing system, for example, by the text extraction system 102.
- a character type from a plurality of character types associated with each of a plurality of characters may be identified in each text-entity of the document.
- the text extraction system 102 may be configured to identify the character type from the plurality of character types associated with each of the plurality of characters in each text-entity of the document.
- the character type associated with text characters of each text-entity may include an alphabet, or a punctuation, or a digit.
- a character code from a plurality of character codes may be assigned to each of the plurality of characters.
- the text extraction system 102 may be configured to assign, to each of the plurality of characters, the character code from the plurality of character codes based on the identified character type from the plurality of character types. For example, the character code corresponding to an alphabet may be ‘a’, the character code corresponding to a punctuation may be ‘p’, and to a digit may be ‘d’.
- one or more clusters may be created for the document.
- the text extraction system 102 may be configured to create one or more clusters for the document in response to assigning the character code to each of the plurality of characters.
- each of the one or more clusters may include at least one contiguous occurrence of the same character code.
- one or more clusters may be created using similar character codes positioned adjacent to each other.
- At step 808, at least one contiguous occurrence of the same character code may be replaced with a single occurrence of the same character code to generate the character pattern.
- the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with the single occurrence of the same character code to generate the character pattern.
- the character pattern deciphered from the received input may be mapped to the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input.
- the text extraction system 102 may be configured to map the character pattern deciphered from the received input with the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input.
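By way of a non-limiting illustration, the mapping of the character pattern of the received input to the character patterns of the text-entities in the document may be sketched as follows. The sketch assumes that the document is parsed into text-entities by splitting on whitespace and that a match means the two character patterns are identical. The names `character_pattern` and `extract_matching` are hypothetical.

```python
from itertools import groupby


def character_pattern(token: str) -> str:
    """Encode a token as 'a'/'d'/'p' codes with contiguous repeats collapsed."""
    def code(ch: str) -> str:
        return "a" if ch.isalpha() else "d" if ch.isdigit() else "p"
    return "".join(key for key, _ in groupby(code(ch) for ch in token))


def extract_matching(document: str, received_input: str) -> list:
    """Return the text-entities whose character pattern equals that of the input."""
    target = character_pattern(received_input)
    # assumption: text-entities are whitespace-separated words
    return [entity for entity in document.split()
            if character_pattern(entity) == target]


doc = "Invoice INV-2021 dated 12/03/2021 ref PO-7781 total 450"
print(extract_matching(doc, "AB-1234"))
```

Here the input-token 'AB-1234' has the character pattern 'apd', which is shared by 'INV-2021' and 'PO-7781' but not by '12/03/2021' (pattern 'dpdpd') or '450' (pattern 'd').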
- FIG. 9 illustrates a flowchart of an exemplary method 900 of removing at least one noise text-entity from the extracted at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
- FIG. 9 is explained in conjunction with elements from FIG. 1 to FIG. 8.
- the operations of the exemplary method 900 may be executed by any computing system, for example, by the text extraction system 102.
- a reference text-entity having a semantic or a syntactic relationship with the received input may be identified from the document.
- the text extraction system 102 may be configured to identify a reference text-entity having a semantic or a syntactic relationship with the received input from the document.
- a distance of each of the extracted at least one text-entity may be determined from the reference text-entity.
- the text extraction system 102 may be configured to determine the distance of each of the extracted at least one text-entity from the reference text-entity.
- a weight may be assigned, to each of the at least one extracted text-entity, based on the distance of each of the extracted at least one text-entity from the reference text-entity.
- the text extraction system 102 may be configured to assign, to each of the at least one extracted text-entity, the weight, based on the distance of each of the extracted at least one text-entity from the reference text-entity.
- the weight may be inversely proportional to the distance, i.e., text-entity nearer to the reference text-entity may carry higher weight.
- the text extraction system 102 may be configured to calculate the weight for each of the at least one text-entity.
- the weight may be calculated based on a product of the distance of the one of the at least one text-entity from the reference text-entity and a similarity of the one of the at least one text-entity with the reference text-entity.
- the weight value (also referred to as the confidence value) is a function of distance (a distance value) and similarity (a match value), as given below:
- Weight Value = (distance value) * (match value)
- the highest weight value may be 1 and the lowest weight value may be 0.
- one or more text-entities may be selected from the extracted at least one text-entity, based on the weight.
- the text extraction system 102 may be configured to select one or more text-entities from the extracted at least one text-entity, based on the weight. In this way, text-entities lying farther away from the reference text-entity and, therefore, having lower relevance with the received input (or input-token) may be removed.
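By way of a non-limiting illustration, the weight calculation and the selection of text-entities may be sketched as follows. The disclosure states only that the weight is the product of a distance value and a match value and lies between 0 and 1. The linear decay used for the distance value, the maximum distance of 10 words, the selection threshold of 0.5, and the names `Candidate`, `distance_value`, `weight`, and `select` are all assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    text: str
    distance: int        # words between the candidate and the reference text-entity
    match_value: float   # similarity in [0, 1], assumed to be computed elsewhere


def distance_value(distance: int, max_distance: int = 10) -> float:
    """Assumed normalization: nearer candidates score closer to 1, far ones 0."""
    return max(0.0, 1.0 - distance / max_distance)


def weight(candidate: Candidate, max_distance: int = 10) -> float:
    """Weight Value = (distance value) * (match value), bounded to [0, 1]."""
    return distance_value(candidate.distance, max_distance) * candidate.match_value


def select(candidates, threshold: float = 0.5, max_distance: int = 10):
    """Keep candidates whose weight reaches the (assumed) threshold."""
    return [c.text for c in candidates if weight(c, max_distance) >= threshold]


candidates = [
    Candidate("450", distance=1, match_value=1.0),
    Candidate("2021", distance=8, match_value=1.0),
    Candidate("77", distance=2, match_value=0.5),
]
print(select(candidates))
```

Under these assumptions the candidate one word away from the reference text-entity is kept, while the distant candidate and the weakly matching candidate are treated as noise and removed.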
- the confidence layer 204 of the text extraction system 102 may be configured to remove one or more noise text-entities from the extracted text-entities matching the received input (or input-token), and narrow them down to legitimate text-entities.
- the confidence layer 204 of the text extraction system 102 may be configured to calculate a confidence associated with each text-entity based on parameters such as the distance of the reference text-entity (keyword) from an actual value (that is, how far the keyword is from the actual value).
- the search for keywords extends 10 words to the left and right of “VALUE”.
- the reference text-entity “VALUE” may occur in a document as shown below:
- a weight may be assigned to each of the extracted text-entities. It may be understood that weight may be inversely proportional to the distance, i.e., text-entity nearer to the reference text-entity may carry higher weight.
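By way of a non-limiting illustration, a search limited to a window of words on either side of the reference text-entity may be sketched as follows. The sketch assumes that the document is tokenized on whitespace and that distance is measured in word positions. The name `neighbours_within` is hypothetical.

```python
def neighbours_within(tokens, keyword, window=10):
    """Return (token, distance) pairs found within `window` words of each keyword hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        lo = max(0, i - window)                 # left edge of the search window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:
                hits.append((tokens[j], abs(j - i)))
    # nearest neighbours first; they would carry the higher weight
    return sorted(hits, key=lambda hit: hit[1])


tokens = "total VALUE : 450 USD".split()
print(neighbours_within(tokens, "VALUE", window=2))
```

With a window of 2, the token 'USD' at a distance of 3 words from 'VALUE' falls outside the search extent and is not returned.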
- repetition may be used as one of the confidence types, without limitation.
- the repetition may correspond to a frequency-based scoring algorithm that checks the number of occurrences of a value in the document found with respect to the character patterns provided through the inference layer 210.
- the repetition value may be calculated using the formula given below:
- five patterns may be stored in memory 112 of the text extraction system 102, and the text extraction system 102 may extract two unique values using the five patterns. Further, for example, value “1” is found in the document 15 times and value “2” is found in the document 24 times. Therefore,
- page number may be used as another confidence type, without limitation.
- a segment may be scored, based on page number as given below:
- Page weightage = (1 - page number/total pages + 1/total pages)
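By way of a non-limiting illustration, the page weightage expression may be evaluated as follows. The sketch assumes that page numbers start at 1 and that the expression is read as 1 minus the ratio of page number to total pages, plus the reciprocal of total pages. The function name `page_weightage` is hypothetical.

```python
def page_weightage(page_number: int, total_pages: int) -> float:
    """Score a segment by its page: 1 - page_number/total_pages + 1/total_pages."""
    if total_pages < 1 or not 1 <= page_number <= total_pages:
        raise ValueError("page_number must lie between 1 and total_pages")
    return 1 - page_number / total_pages + 1 / total_pages


for page in (1, 5, 10):
    print(page, round(page_weightage(page, 10), 2))
```

Under this reading, in a 10-page document a segment on the first page scores 1.0, a segment on page 5 scores 0.6, and a segment on the last page scores 0.1, so earlier pages carry a higher weightage.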
- One or more techniques of extracting, from a document, text-entities matching a received input are disclosed.
- the disclosed techniques provide for time- and labor-efficient training of a rule engine for improving the accuracy of extracting text-entities from the document.
- the disclosed text extraction system further facilitates identification of various types of relevant text-entities in a document, while minimizing chances of missing any relevant text-entities. Further, by using weight/confidence scoring algorithms, the techniques provide for eliminating or minimizing chances of extracting noise results.
- a computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored.
- a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein.
- the term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This disclosure relates to a method (500) and system (102) for extracting text from a document based on a received input. The method (500) includes deciphering (502) a character pattern associated with the received input. The deciphering of the character pattern includes identifying (602) a character type associated with each of a plurality of characters in the received input, assigning (604) a character code to each of the plurality of characters, creating (606) one or more clusters for the received input in response to assigning the character code to each of the plurality of characters, and replacing (608), for each of the one or more clusters, at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern. The method (500) further includes extracting (504), from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
Description
EXTRACTING TEXT-ENTITIES FROM A DOCUMENT MATCHING A RECEIVED INPUT
DESCRIPTION
TECHNICAL FIELD
[001] This disclosure relates generally to text extraction, and more particularly to a method of extracting, from a document, one or more text-entities matching a received input.
BACKGROUND
[002] Conventionally, in a knowledge-gathering approach, data is extracted from a document by reading the document, and this approach was prevalent well before the advent of the age of computation. However, manually browsing through a huge number of documents to gather data for analysis may lead to errors. Moreover, such an approach may be time consuming and labor-intensive.
[003] Some available techniques for extracting text-entities from documents may use Natural Language Processing (NLP) related to a given text-entity that may correspond to a received input or a user query. However, such techniques are limited to matching only certain attributes of the received input with the text-entities of the document. Consequently, such techniques may fail to extract all the text-entities from the document relevant to the received input. Further, these techniques may not provide for dynamic training of a rule engine for improving the accuracy of extracting text-entities from the document without requiring manual intervention.
[004] Accordingly, there is a need for an improved method of extracting relevant text-entities that match a received input.
SUMMARY
[005] In accordance with an embodiment, a method for extracting text from a document based on a received input is disclosed. The method may include deciphering a character pattern associated with the received input. The received input may include a plurality of characters. In accordance with an embodiment, deciphering the character pattern may include identifying a character type from a plurality of character types associated with each of the plurality of characters in the received input, and assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types. In accordance with an embodiment, deciphering the character pattern may further include creating one or more clusters for the received input in response to assigning the character code to each of the plurality of characters. Each of the one or more clusters may include at least one contiguous occurrence of the same character code. In accordance with an embodiment, deciphering the character pattern may include replacing, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern. The method may further include extracting, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
[006] In accordance with an embodiment, a system for extracting text from a document based on a received input is disclosed. The system includes a processor and a memory communicatively coupled to the processor. The memory is configured to store processor-executable instructions. The processor-executable instructions, on execution, may cause the processor to decipher a character pattern associated with the received input. The received input may include a plurality of characters. In accordance with an embodiment, deciphering the character pattern by the processor may include identifying a character type from a plurality of character types associated with each of the plurality of characters in the received input, and assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types. In accordance with an embodiment, deciphering the character pattern by the processor may further include creating one or more clusters for the received input in response to assigning the character code to each of the plurality of characters. Each of the one or more clusters may include at least one contiguous occurrence of the same character code. In accordance with an embodiment, deciphering the character pattern by the processor may further include replacing, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern. The processor instructions may further cause the processor to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
[007] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
BRIEF DESCRIPTION OF THE DRAWINGS
[008] The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate exemplary embodiments and, together with the description, serve to explain the disclosed principles.
[009] FIG. 1 is a block diagram that illustrates an environment for extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
[010] FIG. 2 is a functional block diagram of a text extraction system for extracting text from a document based on a received input, in accordance with some other embodiments of the present disclosure.
[011] FIG. 3 is a flowchart that illustrates an exemplary method for identifying a character pattern in a document and generating a set of rules, in accordance with an embodiment of the present disclosure.
[012] FIG. 4 illustrates a model/rule engine of a text extraction system for extracting a character pattern in a document based on a set of rules, in accordance with an embodiment of the present disclosure.
[013] FIG. 5 illustrates a flowchart of an exemplary method of extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure.
[014] FIG. 6 illustrates a flowchart of an exemplary method of deciphering a character pattern associated with the received input, in accordance with an embodiment of the present disclosure.
[015] FIG. 7 illustrates a flowchart of an exemplary method for extracting, from a document, at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
[016] FIG. 8 illustrates a flowchart of an exemplary method of deciphering a character pattern associated with each text-entity of the at least one text-entity from the document, in accordance with an embodiment of the present disclosure.
[017] FIG. 9 illustrates a flowchart of an exemplary method for removing at least one noise text-entity from the extracted at least one text-entity matching the received input, in accordance with an embodiment of the present disclosure.
DETAILED DESCRIPTION
[018] Exemplary embodiments are described with reference to the accompanying drawings. Wherever convenient, the same reference numbers are used throughout the drawings to refer to the same or like parts. While examples and features of disclosed principles are described herein, modifications, adaptations, and other implementations are possible without departing from the spirit and scope of the disclosed embodiments. It is intended that the following detailed description be considered as exemplary only, with the true scope and spirit being indicated by the following claims.
[019] The following described implementations may be found in the disclosed method and system for extracting text from a document based on a received input. The disclosed system (referred to as the text extraction system) may use a rule engine to extract text-entities for improving the accuracy of extracting text-entities from the document without requiring manual intervention. The disclosed text extraction system may facilitate extracting, from a document, text-entities matching an input-token. The disclosed text extraction system may provide for time- and labor-efficient training of the rule engine for improving the accuracy of extracting text-entities from the document. The disclosed text extraction system may facilitate identification of various types of relevant text-entities in a document, while minimizing chances of missing any relevant text-entities.
[020] Referring now to FIG. 1, a block diagram illustrates an environment for extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure. The environment 100 includes a text extraction system 102, a data storage 104, external devices 106, and a communication network 108. In accordance with an embodiment, the text extraction system 102 may include a processor 110 and a memory 112. There is further shown a display unit 114 that may include a user interface (UI) 116.
[021] The text extraction system 102 may be communicatively coupled to the data storage 104 and the external devices 106, via the communication network 108. A user (not shown in FIG. 1) may be associated with the text extraction system 102 or the external devices 106.
[022] The text extraction system 102 may include suitable logic, circuitry, interfaces, and/or code that may be configured to receive an input that may include a plurality of characters. The input may correspond to a user query to extract a relevant text entity from a document. In accordance with an embodiment, the input may correspond to an input token that includes a plurality of characters. In accordance with an embodiment, the text extraction system 102 may be configured to decipher the character pattern associated with the received input. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
[023] In accordance with an embodiment, the text extraction system 102 may use an application for application-specific deployment that includes software and/or logic to extract text from the document based on the received input. By way of example, the text extraction system 102 may be implemented as a plurality of distributed cloud-based resources by use of several technologies that are well known to those skilled in the art. Other examples of implementation of the text extraction system 102 may include, but are not limited to, a web/cloud server, an application server, a media server, and a Consumer Electronic (CE) device (such as, but not limited to, a desktop, a laptop, a smartphone and a tablet).
[024] The processor 110 may be communicatively coupled to the memory 112. The processor 110 may include suitable logic, circuitry, interfaces, and/or code that may be configured to extract text from the document based on the input token or received input. The processor 110 may be implemented based on a number of processor technologies, which may be known to one ordinarily skilled in the art. Examples of implementations of the processor 110 may be a Graphics Processing Unit (GPU), a Reduced Instruction Set Computing (RISC) processor, an Application-Specific Integrated Circuit (ASIC) processor, a Complex Instruction Set Computing (CISC) processor, a microcontroller, Artificial Intelligence (AI) accelerator chips, a co-processor, a central processing unit (CPU), and/or a combination thereof. The processor 110 may be communicatively coupled to, and communicates with, the memory 112.
[025] The memory 112 may include suitable logic, circuitry, and/or interfaces that may be configured to store instructions executable by the processor 110. Additionally, the memory 112 may be configured to store program code of one or more software applications that may incorporate the program code of the one or more text extraction algorithms. The memory 112 may be configured to store any received data (such as, a received query as input token from a user) or generated data (such as, clusters of same character codes) associated with storing, maintaining, and executing the text extraction system 102 used to extract text entities from the document based on the received input. Examples of implementation of the memory 112 may include, but are not limited to, Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Hard Disk Drive (HDD), a Solid-State Drive (SSD), a CPU cache, and/or a Secure Digital (SD) card.
[026] In accordance with an embodiment, the data storage 104 may store documents and input received from users for the text extraction system 102, character patterns associated with text entities in documents, and data associated with the documents for access by users of the text extraction system 102. For example, the data storage 104 may store metadata indicative of location of character patterns of text entities within documents along with the documents, and may be accessed via the communication network 108.
[027] In accordance with an embodiment, the data storage 104 may store data structures for use in extraction of text entities from documents and input. While the example of FIG. 1 includes a single data storage (the data storage 104) located elsewhere in the environment 100 from the text extraction system 102, in some embodiments, the data storage 104 may also be included as a part of the text extraction system 102.
[028] By way of an example, and not limitation, the data storage 104 may use computer-readable storage media that includes tangible or non-transitory computer-readable storage media including, but not limited to, Compact Disc Read-Only Memory (CD-ROM) or other optical disk storage, magnetic disk storage or other magnetic storage devices (e.g., Hard-Disk Drive (HDD)), flash memory devices (e.g., Solid State Drive (SSD), Secure Digital (SD) card, other solid state memory devices), or any other storage medium which may be used to carry or store particular program code in the form of computer-executable instructions or data structures and which may be accessed by a general-purpose or special-purpose computer. Combinations of the above may also be included within the scope of computer-readable storage media.
[029] The external devices 106 may include suitable logic, circuitry, interfaces, and/or code that may be configured to transmit input token or a user query to the text extraction system 102 for extracting at least one text-entity matching the received input or the user query. The external devices 106 may be capable of communicating with the text extraction system 102 and the data storage 104 via the communication network 108. The external devices 106 and the text extraction system 102 are generally disparately located.
[030] The functionalities of the external devices 106 may be implemented in portable devices, such as a high-speed computing device, and/or non-portable devices, such as an application server. Examples of the external devices 106 may include, but are not limited to, a computing device, a smart phone, a camera, a mobile device, a laptop, a personal digital assistant (PDA), a microphone, a printer and a tablet.
[031] The communication network 108 may include a communication medium through which the text extraction system 102, the data storage 104, and the external devices 106 may communicate with each other. Examples of the communication network 108 may include, but are not limited to, the Internet, a cloud network, a Wireless Fidelity (Wi-Fi) network, a Personal Area Network (PAN), a Local Area Network (LAN), or a Metropolitan Area Network (MAN). Various devices in the environment 100 may be configured to connect to the communication network 108, in accordance with various wired and wireless communication protocols. Examples of such wired and wireless communication protocols may include, but are not limited to, a Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, light fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device to device communication, cellular communication protocols, and Bluetooth (BT) communication protocols.
[032] The display unit 114 may be configured to communicate with different operational components of the text extraction system 102. The display unit 114 may be configured to display data for the text extraction system 102. A user may interact in the environment 100 via the UI 116 accessible via the display unit 114.
[033] During operation, the text extraction system 102 may be configured to receive an input from the external devices 106. The text extraction system 102 may be further configured to decipher a character pattern associated with the received input. For deciphering the character pattern, the text extraction system 102 may be configured to identify a character type from a plurality of character types associated with each of the plurality of characters in the received input. In accordance with an embodiment, the text extraction system 102 may be configured to assign, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types. In accordance with an embodiment, the text extraction system 102 may be further configured to create one or more clusters for the received input in response to assigning the character code to each of the plurality of characters. The one or more clusters may be stored in the memory 112 or the data storage 104, via the communication network 108. Each of the one or more clusters may include at least one contiguous occurrence of the same character code. In accordance with an embodiment, the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input. The extracted at least one text entity may be displayed on the display unit 114 or the external devices 106 from the text extraction system 102, via the communication network 108.
[034] Additionally, in accordance with an embodiment, for easy extraction of text entities, the text extraction system 102 may be configured to determine a location identifier of the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the character pattern may be localized based on the determination of the character pattern associated with each text-entity of the at least one text-entity within the document. Localization of the character pattern may correspond to determining the location of the character pattern in the document. In accordance with an embodiment, the location identifier may be an attribute of the document corresponding to the character pattern and may be included as metadata of the document for quick extraction of text entities.
[035] In accordance with an embodiment, the text extraction system 102 may be configured to create a mapping index, based on an association of the location identifier with the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the mapping index.
[036] While example embodiments described herein generally relate to extracting text from a document based on a received input, example embodiments may also be implemented for extracting, without limitation, an icon, an image, an emoji, a logo, a barcode or a shape from the document, based on received input.
[037] All the components in the environment 100 may be coupled directly or indirectly to the communication network 108. The components described in the environment 100 may be further broken down into more than one component and/or combined together in any suitable arrangement. Further, one or more components may be rearranged, changed, added, and/or removed.
[038] FIG. 2 is a functional block diagram of a text extraction system for extracting text from a document based on a received input, in accordance with some other embodiments of the present disclosure. FIG. 2 is explained in conjunction with elements from FIG. 1.
[039] With reference to FIG. 2, there is shown a functional block diagram of a text extraction system 200 (corresponding to the text extraction system 102). The text extraction system 102 may include a model/rule engine 202, a confidence layer 204, a model output storage 206, an application feedback module 208, an inference layer 210, and a corpus/knowledge repository 212. The model/rule engine 202 may be communicatively coupled to the confidence layer 204, the model output storage 206, the application feedback module 208, the inference layer 210, and the corpus/knowledge repository 212.
[040] Elements and features of the text extraction system 102 may be operatively associated with one another, coupled to one another, or otherwise configured to cooperate with one another as needed to support the desired functionality, as described herein. For ease of illustration and clarity, the various physical, electrical, and logical couplings and interconnections for the elements and the features are not depicted in FIG. 2. Moreover, it should be appreciated that embodiments of the text extraction system 102 will include other elements, modules, and features that cooperate to support the desired functionality. For simplicity, FIG. 2 only depicts certain elements that relate to the techniques described in more detail below.
[041] In some embodiments, the model/rule engine 202 may receive a set of feed files 214. By way of an example, a format for each of the set of feed files 214 may include, without limitation, a “.pdf” format, a “.txt” format, a “.doc” format, or a “.xlsx” format. It may be noted that the model/rule engine 202 may be developed based on the feed files 214.
[042] In some embodiments, the confidence layer 204 may be an extension to the model/rule engine 202. It may be noted that the confidence layer 204 may remove noise text-entities from the extracted text-entities matching the received input. Further, the confidence layer 204 may facilitate extraction of relevant text data.
[043] The model/rule engine 202 may generate a model output based on the feed files 214 which may be stored in the model output storage 206. Further, the model output storage 206 may be accessed by the application feedback module 208. It may be noted that the application feedback module 208 may receive an application feedback from a user as a response to the model output. In some embodiments, the application feedback module 208 may receive the application feedback from the user via the UI 116. Further, the application feedback may be stored in an application feedback output storage 216. Further, the application feedback may be received by the inference layer 210.
[044] In some embodiments, the inference layer 210 may identify a character pattern in the application feedback 208. Further, based on the pattern, the inference layer 210 may define a set of rules for the model/rule engine 202. The corpus/knowledge repository 212 may receive the character pattern or the set of rules, or both, from the inference layer 210. It may be noted that information stored in the corpus/knowledge repository 212 may be used by the model/rule engine 202 for extracting text-entities.
[045] In practice, the model/rule engine 202, the confidence layer 204, the application feedback module 208, and the inference layer 210 may be implemented with (or cooperate with) each other to perform at least some of the functions and operations described in more detail herein. In this regard, the model/rule engine 202, the confidence layer 204, the application feedback module 208, and the inference layer 210 may be realized as suitably written processing logic, application program code, or the like.
[046] FIG. 3 is a flowchart that illustrates an exemplary method 300 for identifying a character pattern in a document and generating a set of rules, in accordance with an embodiment of the present disclosure. FIG. 3 is explained in conjunction with elements from FIG. 1 and FIG. 2. This method 300 may be executed by any computing system, for example, by the text extraction system 102 of FIG. 1.
[047] At step 302, an input may be received in the form of an application feedback or a new suggestion. In accordance with an embodiment, the inference layer 210 of the text extraction system 102 may be configured to receive the application feedback or the new suggestion from a user. At step 304, a set of character patterns may be uniquely identified for the feedback or a set of rules. In accordance with an embodiment, the inference layer 210 of the text extraction system 102 may be configured to uniquely identify the set of character patterns for the feedback or the set of rules. As such, the steps 302-304 may be performed by the inference layer 210.
[048] For example, in order to identify the set of character patterns for the feedback or the set of rules, a character type from a plurality of character types associated with each of the plurality of characters in the received input may be identified. A character code (from a plurality of character codes) may be assigned to each of the plurality of characters, based on the identified character type from the plurality of character types. One or more clusters may be created for the received input in response to assigning the character code to each of the plurality of characters, such that each of the one or more clusters includes at least one contiguous occurrence of the same character code. The at least one contiguous occurrence of the same character code may be replaced, for each of the one or more clusters, with a single occurrence of the same character code to generate the character pattern. Further, the set of character patterns may be encoded for unification. In accordance with an embodiment, the unification may indicate which character patterns in the set are the same and which are different.
[049] At step 306, the set of character patterns may be transmitted to the corpus/knowledge repository 212. Further, at step 306, the set of character patterns may be checked for unique patterns. In accordance with an embodiment, once the set of patterns are transmitted to the corpus/knowledge repository 212, the set of character patterns may be checked for unique patterns, and redundant patterns may be deleted.
[050] Additionally, the text extraction system 102 may be configured to generate a set of rules using the inference layer 210. In an exemplary scenario, the text extraction system 102 may fail to extract/identify an oil well having a name “abc-123” from a document. Further, in this scenario, this oil well name may be identified manually by a user, via a user interface. Thereafter, the user may input this oil well name to the text extraction system 102 so as to train the text extraction system 102 to extract, in the future, oil well names having a pattern similar to the pattern of this oil well name. As such, this oil well name may be received as input, feedback, or a new suggestion.
[051] Upon receiving the input, a character pattern associated with the input-token may be deciphered. Accordingly, a character type of each character in the input-token “abc-123” may be identified as “alphabet”, “alphabet”, “alphabet”, “punctuation”, “numerical digit”, “numerical digit”, “numerical digit”, and character codes may be assigned as “aaapddd”. Further, one or more clusters may be created using similar character codes positioned adjacent to each other (i.e., contiguous occurrence of the same character code). Therefore, a first cluster of three “a” character codes, followed by a second cluster of a single “p” character code, and followed by a third cluster of three “d” character codes may be identified. Further, a cluster code may be assigned to the first, second, and third clusters to obtain the character pattern associated with the input-token. Therefore, the character pattern deciphered for the received input “abc-123” is “apd”. The deciphering of the character pattern from the received input may also correspond to encoding.
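The encoding walked through above can be sketched in a few lines of Python. This is a minimal illustration, not the claimed implementation: the function names `char_code` and `decipher_pattern` are hypothetical, and treating every character that is neither a letter nor a digit (including whitespace) as punctuation is an assumption.

```python
from itertools import groupby
from typing import Tuple


def char_code(ch: str) -> str:
    """Map one character to a character code: 'a', 'd', or 'p'."""
    if ch.isalpha():
        return "a"  # alphabet
    if ch.isdigit():
        return "d"  # numerical digit
    return "p"      # assumption: everything else counts as punctuation


def decipher_pattern(token: str) -> Tuple[str, str]:
    """Return (character codes, character pattern) for an input-token."""
    # Assign a character code to every character of the token.
    codes = "".join(char_code(ch) for ch in token)
    # groupby yields one group per run of identical adjacent codes (a cluster);
    # keeping only the group key replaces each cluster with a single code.
    pattern = "".join(code for code, _cluster in groupby(codes))
    return codes, pattern


print(decipher_pattern("abc-123"))  # ('aaapddd', 'apd')
```

Because each cluster collapses to one code, tokens of different lengths such as “abc-123” and “xy-9” share the pattern “apd”.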
[052] Further, the character pattern associated with the input-token may be mapped with the character patterns associated with the text-entities of the document to identify text-entities matching the input-token.
[053] Additionally, in accordance with an embodiment, for easy extraction of text entities, the text extraction system 102 may be configured to determine a location identifier of the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the character pattern may be localized based on the determination of the character pattern associated with each text-entity of the at least one text-entity within the document. Localization of the character pattern may correspond to determining the location of the character pattern in the document. In accordance with an embodiment, the location identifier may be an attribute of the document corresponding to the character pattern and may be included as metadata of the document for quick extraction of text entities. In accordance with an embodiment, the text extraction system 102 may be configured to create a mapping index, based on an association of the location identifier with the character pattern associated with each text-entity of the at least one text-entity within the document. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, at least one text-entity matching the received input, based on the mapping index.
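One possible shape for such a mapping index is sketched below. The disclosure does not define the location identifier, so this sketch assumes it is the character offset of a whitespace-delimited text-entity; `build_mapping_index` and `extract_by_index` are hypothetical names, and the sample sentence is invented for illustration.

```python
import re
from collections import defaultdict
from itertools import groupby


def decipher_pattern(token):
    """Collapse runs of identical character codes, e.g. 'abc-123' -> 'apd'."""
    codes = ("a" if c.isalpha() else "d" if c.isdigit() else "p" for c in token)
    return "".join(code for code, _run in groupby(codes))


def build_mapping_index(document):
    """Associate each character pattern with (location identifier, text-entity) pairs.

    Assumption: a text-entity is a whitespace-delimited token and its
    location identifier is its starting character offset in the document.
    """
    index = defaultdict(list)
    for match in re.finditer(r"\S+", document):
        entity = match.group()
        index[decipher_pattern(entity)].append((match.start(), entity))
    return dict(index)


def extract_by_index(index, received_input):
    """Look up the text-entities whose pattern equals that of the received input."""
    return [entity for _offset, entity in index.get(decipher_pattern(received_input), [])]


document = "Well abc-123 was drilled near xyz-9 in 2019"
index = build_mapping_index(document)
print(index["apd"])                        # [(5, 'abc-123'), (30, 'xyz-9')]
print(extract_by_index(index, "def-456"))  # ['abc-123', 'xyz-9']
```

Building the index once means later inputs can be answered with a dictionary lookup rather than a fresh pass over the document.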
[054] In continuation of the example above, each of the text-entities in the document of the pattern “apd” may be extracted. In order to extract, from a document, text-entities matching the oil well name, in some embodiments, the character pattern associated with the oil well name, i.e., “apd” may be decoded. Accordingly, character patterns associated with text-entities of the document may be deciphered. Once the character patterns associated with the text-entities are deciphered, the character pattern “apd” may be mapped with the character patterns associated with the text-entities of the document to identify text-entities matching the oil well name. In other words, all the text-entities having a character pattern matching with the pattern “apd” may be extracted.
[055] Referring now to FIG. 4, a block diagram of a model/rule engine 202 of a text extraction system for extracting a character pattern in a document based on a set of rules is illustrated, in accordance with an embodiment of the present disclosure. FIG. 4 is explained in conjunction with elements from FIG. 1 to FIG. 3.
[056] The model/rule engine 202 may receive an encoded regex pattern from an encoded regex pattern repository 402. The encoded regex pattern repository 402 may include character patterns associated with each text-entity within a document. In accordance with an embodiment, a rules generation module 406 of the model/rule engine 202 may be configured to receive the encoded regex pattern from the encoded regex pattern repository 402. Further, in accordance with an embodiment, the rules generation module 406 may be configured to generate a set of rules by decoding the encoded patterns, based on the encoded regex pattern.
[057] A relevant data extraction model 408 of the model/rule engine 202 may receive the set of rules from the rules generation module 406. The rules generation module 406 may decode the encoded patterns (i.e., the patterns encoded via the method 300), which may help in extracting text entities. Further, the relevant data extraction model 408 may receive feed files 214. It may be noted that the relevant data extraction model 408 may extract a value from the feed files 214 based on the set of rules. The extracted value may later be stored in the extracted value storage 404.
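The disclosure does not spell out how an encoded pattern is decoded into a rule. One plausible reading, sketched here as an assumption, is that each character code expands to a regular-expression fragment and the fragments are concatenated. The `CODE_TO_REGEX` table, the function names, and the sample feed text are all hypothetical; the ASCII-only character classes are a simplification.

```python
import re

# Assumed expansion of each character code into a regex fragment.
CODE_TO_REGEX = {
    "a": r"[A-Za-z]+",        # one or more letters
    "d": r"[0-9]+",           # one or more digits
    "p": r"[^A-Za-z0-9\s]+",  # one or more non-alphanumeric, non-space characters
}


def decode_pattern(pattern: str) -> re.Pattern:
    """Decode an encoded pattern such as 'apd' into a compiled regex rule."""
    body = "".join(CODE_TO_REGEX[code] for code in pattern)
    # The lookarounds require the match to span a whole whitespace-delimited token.
    return re.compile(rf"(?<!\S){body}(?!\S)")


def extract_values(rule: re.Pattern, feed_text: str) -> list:
    """Return every token in the feed text that satisfies the rule."""
    return rule.findall(feed_text)


rule = decode_pattern("apd")
feed_text = "Wells abc-123 and XY_7 produced 40 bbl; see abc-123x."
print(extract_values(rule, feed_text))  # ['abc-123', 'XY_7']
```

Note that “abc-123x.” is rejected because its own pattern is “apdap”, not “apd”. How trailing punctuation should be tokenized is a design decision the source leaves open.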
[058] FIG. 5 is a flowchart that illustrates an exemplary method 500 of extracting text from a document based on a received input, in accordance with an embodiment of the present disclosure. FIG. 5 is explained in conjunction with elements from FIG. 1 to FIG. 4. The operations of the exemplary method 500 may be executed by any computing system, for example, by the text extraction system 102 of FIG. 1. The operations of the method 500 may start at step 502 and proceed to step 504.
[059] At step 502, a character pattern associated with the received input may be deciphered. In accordance with an embodiment, the text extraction system 102 may be configured to decipher a character pattern associated with the received input. In accordance with an embodiment, the received input (also referred to as input-token) may include a plurality of characters. For example, the character type associated with each of the plurality of characters in the received input may include at least one of: an alphabet, a punctuation, or a digit. The step 502 of deciphering the character pattern associated with the received input is further explained in detail in conjunction with FIG. 6.
[060] Referring now to FIG. 6, a flowchart of an exemplary method 600 of deciphering a character pattern associated with the received input is illustrated, in accordance with an embodiment of the present disclosure. FIG. 6 is explained in conjunction with elements from FIG. 1 to FIG. 5.
[061] At step 602, a character type from a plurality of character types may be identified associated with each of the plurality of characters in the received input. In accordance with an embodiment, the text extraction system 102 may be configured to identify a character type from a plurality of character types associated with each of the plurality of characters in the received input.
[062] At step 604, a character code from a plurality of character codes may be assigned to each of the plurality of characters. In accordance with an embodiment, the text extraction system 102 may be configured to assign, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types. For example, the character code corresponding to an alphabet may be ‘a’. Similarly, the character code corresponding to a punctuation may be ‘p’, and to a digit may be ‘d’.
[063] At step 606, one or more clusters may be created for the received input. In accordance with an embodiment, the text extraction system 102 may be configured to create one or more clusters for the received input in response to assigning the character code to each of the plurality of characters. In accordance with an embodiment, each of the one or more clusters may include the at least one contiguous occurrence of the same character code.
[064] At step 608, for each of the one or more clusters, the at least one contiguous occurrence of the same character code may be replaced with a single occurrence of the same character code to generate the character pattern. In accordance with an embodiment, the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with the single occurrence of the same character code to generate the character pattern.
[065] Returning to FIG. 5, at step 504, at least one text-entity from the document may be extracted that matches the received input. In accordance with an embodiment, the text extraction system 102 may be configured to extract, from the document, the at least one text-entity matching the received input, based on the character pattern deciphered for the received input. It may be noted that once the character pattern associated with the received input (or input-token) is deciphered, the character pattern may be used to extract, from the document, one or more text-entities which match the pattern associated with the input-token.
[066] Referring now to FIG. 7, a flowchart of an exemplary method 700 of extracting, from a document, at least one text-entity matching the received input is illustrated, in accordance with an embodiment of the present disclosure. FIG. 7 is explained in conjunction with elements from FIG. 1 to FIG. 6. The operations of the exemplary method 700 may be executed by any computing system, for example, by the text extraction system 102.
[067] At step 702, a character pattern associated with each text-entity of the at least one text-entity may be deciphered from the document. In accordance with an embodiment, the text extraction system 102 may be configured to decipher the character pattern associated with each text-entity of the at least one text-entity from the document. For example, the document may be parsed to obtain various text-entities, like words, in the document. In accordance with an embodiment, the character type associated with each of the plurality of characters in each text-entity of the document may include at least one of: an alphabet, a punctuation, or a digit. The step 702 of deciphering a character pattern associated with each text-entity of the at least one text-entity from the document is further explained in conjunction with FIG. 8.
[068] Referring now to FIG. 8, a flowchart of an exemplary method 800 of deciphering a character pattern associated with each text-entity of the at least one text-entity from the document is illustrated, in accordance with an embodiment of the present disclosure. FIG. 8 is explained in conjunction with elements from FIG. 1 to FIG. 7. The operations of the exemplary method 800 may be executed by any computing system, for example, by the text extraction system 102.
[069] At step 802, a character type from a plurality of character types associated with each of a plurality of characters may be identified in each text-entity of the document. In accordance with an embodiment, the text extraction system 102 may be configured to identify the character type from the plurality of character types associated with each of the plurality of characters in each text-entity of the document. For example, the character type associated with text characters of each text-entity may include an alphabet, or a punctuation, or a digit.
[070] At step 804, a character code from a plurality of character codes may be assigned to each of the plurality of characters. In accordance with an embodiment, the text extraction system 102 may be configured to assign, to each of the plurality of characters, the character code from the plurality of character codes based on the identified character type from the plurality of character types. For example, the character code corresponding to an alphabet may be ‘a’, the character code corresponding to a punctuation may be ‘p’, and to a digit may be ‘d’.
[071] At step 806, one or more clusters may be created for the document. In accordance with an embodiment, the text extraction system 102 may be configured to create one or more clusters for the document in response to assigning the character code to each of the plurality of characters. In accordance with an embodiment, each of the one or more clusters may include at least one contiguous occurrence of the same character code. In other words, one or more clusters may be created using similar character codes positioned adjacent to each other.
[072] At step 808, at least one contiguous occurrence of the same character code may be replaced with a single occurrence of the same character code to generate the character pattern. In accordance with an embodiment, the text extraction system 102 may be configured to replace, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with the single occurrence of the same character code to generate the character pattern.
[073] Returning to FIG. 7, at step 704, the character pattern deciphered from the received input may be mapped with the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input. In accordance with an embodiment, the text extraction system 102 may be configured to map the character pattern deciphered from the received input with the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input.
[074] Referring now to FIG. 9, a flowchart of an exemplary method 900 of removing at least one noise text-entity from the extracted at least one text-entity matching the received input is illustrated, in accordance with an embodiment of the present disclosure. FIG. 9 is explained in conjunction with elements from FIG. 1 to FIG. 8. The operations of the exemplary method 900 may be executed by any computing system, for example, by the text extraction system 102.
[075] At step 902, a reference text-entity having a semantic or a syntactic relationship with the received input may be identified from the document. In accordance with an embodiment, the text extraction system 102 may be configured to identify a reference text-entity having a semantic or a syntactic relationship with the received input from the document. At step 904, a distance of each of the extracted at least one text-entity may be determined from the reference text-entity. In accordance with an embodiment, the text extraction system 102 may be configured to determine the distance of each of the extracted at least one text-entity from the reference text-entity.
[076] At step 906, a weight may be assigned, to each of the at least one extracted text-entity, based on the distance of each of the extracted at least one text-entity from the reference text-entity. In accordance with an embodiment, the text extraction system 102 may be configured to assign, to each of the at least one extracted text-entity, the weight, based on the distance of each of the extracted at least one text-entity from the reference text-entity. By way of an example, the weight may be inversely proportional to the distance, i.e., text-entity nearer to the reference text-entity may carry higher weight. In accordance with an embodiment, the text extraction system 102 may be configured to calculate the weight for each of the at least one text-entity. In accordance with an embodiment, the weight may be calculated based on a product of the distance of the one of the at least one text-entity from the reference text-entity and a similarity of the one of the at least one text-entity with the reference text-entity. For example, a weight value (also called confidence value) may be calculated, such that the weight value is a function of distance (a distance value) and similarity (a match value), as given below:
Weight Value = (distance value) * (match value)
[077] It may be noted that the highest weight value may be 1 and the lowest weight value may be 0. For example, if a keyword is at the 10th position and the match value (or a probability of match) is 0.85, then the weight value (also referred to as confidence value) may be calculated as below:
Weight Value = 1*0.85 = 0.85
[078] At step 908, one or more text-entities may be selected from the extracted at least one text-entity, based on the weight. In accordance with an embodiment, the text extraction system 102 may be configured to select one or more text-entities from the extracted at least one text-entity, based on the weight. In this way, text-entities lying farther away from the reference text-entity and, therefore, having lower relevance to the received input (or input-token) may be removed.
[079] For example, the confidence layer 204 of the text extraction system 102 may be configured to remove one or more noise text-entities from the extracted text-entities matching the received input (or input-token), and narrow them down to legitimate text-entities. The confidence layer 204 of the text extraction system 102 may be configured to calculate confidence associated with each text-entity based on parameters such as the distance of the reference text-entity (keyword) from an actual value (that is, how far the keywords are from the actual value).
[080] By way of an example, keywords may be searched within 10 words to the left and right of “VALUE”. The reference text-entity “VALUE” may occur in a document as shown below:
“X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 VALUE Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10”
[081] Distance of each of the extracted text-entities may be determined from the reference text-entity “VALUE”. It may be understood that the distance of text-entity “X10” and text-entity “Y1” from the reference text-entity “VALUE” will be the same = 10/10 = 1 (it is the 10th position out of 10, which is the nearest to “VALUE”). The value for X9 and Y2 will be the same = 9/10 = 0.9 (since it is the 9th position out of 10). Similarly, the distance of text-entity “X6” from the reference text-entity “VALUE” is 5.
[082] Based on the distance, a weight may be assigned to each of the extracted text-entities. It may be understood that weight may be inversely proportional to the distance, i.e., text-entity nearer to the reference text-entity may carry higher weight.
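The distance and weight computation can be sketched as follows. The formula for the distance value is inferred from the X10 (10/10) and X9 (9/10) figures above and is an assumption: an entity k words away from the reference, within a window of 10 words, scores (10 - k + 1)/10. The selection `threshold` and all function names are hypothetical; the source only states that entities are selected based on weight.

```python
def distance_value(entity_index, reference_index, window=10):
    """Proximity score in [0, 1]; 1.0 for a word adjacent to the reference."""
    offset = abs(entity_index - reference_index)  # number of words away
    if offset == 0 or offset > window:
        return 0.0  # assumption: outside the search window scores zero
    return (window - offset + 1) / window


def weight_value(distance, match):
    """Weight Value = (distance value) * (match value)."""
    return distance * match


def select_entities(words, reference, match_values, threshold):
    """Keep candidate entities whose weight reaches the (assumed) threshold."""
    ref = words.index(reference)
    selected = []
    for entity, match in match_values.items():
        weight = weight_value(distance_value(words.index(entity), ref), match)
        if weight >= threshold:
            selected.append(entity)
    return selected


words = ("X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 VALUE "
         "Y1 Y2 Y3 Y4 Y5 Y6 Y7 Y8 Y9 Y10").split()
ref = words.index("VALUE")

print(distance_value(words.index("X10"), ref))  # 1.0
print(distance_value(words.index("Y2"), ref))   # 0.9
print(weight_value(1.0, 0.85))                  # 0.85

# X2 is nine words from VALUE, so its weight falls below the threshold.
print(select_entities(words, "VALUE", {"X10": 0.85, "X2": 0.85}, threshold=0.5))
# ['X10']
```

Under this reading, the weight stays between 0 and 1 whenever the match value does, which agrees with the stated bounds on the weight value.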
[083] In accordance with an embodiment, repetition may be used as one of the confidence types, without limitation. The repetition may correspond to a frequency-based scoring algorithm that checks the number of occurrences of a value in the document found with respect to the character patterns provided through the inference layer 210. The repetition value may be calculated using the formula given below:
[084] Repetition = Occurrences in document / Total occurrences of values extracted from document
[085] By way of an example, five patterns may be stored in the memory 112 of the text extraction system 102, and the text extraction system 102 may extract two unique values using the five patterns. Further, for example, Value “1” is found in the document 15 times and Value “2” is found in the document 24 times. Therefore,
Confidence score for Value “1” = 15 / (15+24) = 0.385
Confidence score for Value “2” = 24 / (15+24) = 0.615
[086] In accordance with another embodiment, page number may be used as another confidence type, without limitation. In accordance with an embodiment, a segment may be scored, based on page number as given below:
Page weightage = (1 - page number/total pages + 1/total pages)
[087] By way of an example, when an attribute is present in the second page and the document has 5 pages, then,
Page weightage for that attribute = (1 - 2/5 + 1/5) = 0.8
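Both confidence types above, repetition and page weightage, reduce to short arithmetic. The sketch below reproduces the two worked examples. The function names are hypothetical, and returning zero scores when nothing was extracted is an assumption not addressed in the source.

```python
def repetition_scores(occurrences):
    """Repetition = occurrences of a value / total occurrences of all extracted values."""
    total = sum(occurrences.values())
    if total == 0:
        return {value: 0.0 for value in occurrences}  # assumption: nothing extracted
    return {value: count / total for value, count in occurrences.items()}


def page_weightage(page_number, total_pages):
    """Page weightage = 1 - page number/total pages + 1/total pages."""
    return 1 - page_number / total_pages + 1 / total_pages


scores = repetition_scores({"1": 15, "2": 24})
print({value: round(score, 3) for value, score in scores.items()})
# {'1': 0.385, '2': 0.615}

print(round(page_weightage(2, 5), 3))  # 0.8
```

By this formula, an attribute on the first page always scores 1, and the score falls by 1/total pages for each later page.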
[088] One or more techniques of extracting, from a document, text-entities matching a received input are disclosed. The disclosed techniques provide for time- and labor-efficient training of a rule engine for improving the accuracy of extracting text-entities from the document. The disclosed text extraction system further facilitates identification of various types of relevant text-entities in a document, while minimizing chances of missing any relevant text-entities. Further, by using weight/confidence scoring algorithms, the techniques provide for eliminating or minimizing chances of extracting noise results.
[089] Furthermore, one or more computer-readable storage media may be utilized in implementing embodiments consistent with the present disclosure. A computer-readable storage medium refers to any type of physical memory on which information or data readable by a processor may be stored. Thus, a computer-readable storage medium may store instructions for execution by one or more processors, including instructions for causing the processor(s) to perform steps or stages consistent with the embodiments described herein. The term “computer-readable medium” should be understood to include tangible items and exclude carrier waves and transient signals, i.e., be non-transitory. Examples include random access memory (RAM), read-only memory (ROM), volatile memory, nonvolatile memory, hard drives, CD ROMs, DVDs, flash drives, disks, and any other known physical storage media.
[090] It will be appreciated that, for clarity purposes, the above description has described embodiments with reference to different functional units and processors. However, it will be apparent that any suitable distribution of functionality between different functional units, processors or domains may be used without detracting from the disclosure. For example, functionality illustrated to be performed by separate processors or controllers may be performed by the same processor or controller. Hence, references to specific functional units are only to be seen as references to suitable means for providing the described functionality, rather than indicative of a strict logical or physical structure or organization.
[091] Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the claims. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in accordance with the disclosure.
[092] Furthermore, although individually listed, a plurality of means, elements or process steps may be implemented by, for example, a single unit or processor. Additionally, although individual features may be included in different claims, these may possibly be advantageously combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. Also, the inclusion of a feature in one category of claims does not imply a limitation to this category, but rather the feature may be equally applicable to other claim categories, as appropriate.
[093] It is intended that the disclosure and examples be considered as exemplary only, with a true scope and spirit of disclosed embodiments being indicated by the following claims.
Claims
1. A method (500) for extracting text from a document based on a received input, the method comprising: deciphering (502) a character pattern associated with the received input, wherein the received input comprises a plurality of characters, wherein deciphering the character pattern comprises: identifying (602) a character type from a plurality of character types associated with each of the plurality of characters in the received input; assigning (604), to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types; creating (606) one or more clusters for the received input in response to assigning the character code to each of the plurality of characters, wherein each of the one or more clusters comprises at least one contiguous occurrence of the same character code; and replacing (608), for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern; and extracting (504), from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
2. The method (500) as claimed in claim 1, wherein extracting (504) the at least one text-entity from the document further comprises: deciphering (702) a character pattern associated with each text-entity of the at least one text-entity from the document, wherein deciphering the character pattern comprises: identifying (802) a character type from a plurality of character types associated with each of a plurality of characters in each text-entity of the document; assigning (804), to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types; creating (806) one or more clusters for the document in response to assigning the character code to each of the plurality of characters, wherein each of the one or more clusters comprises at least one contiguous occurrence of the same character code; and replacing (808), for each of the one or more clusters, at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern; and mapping (704) the character pattern deciphered from the received input with the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input.
3. The method (500) as claimed in claim 2, wherein the character type associated with each of the plurality of characters in the received input or the character type associated with each of the plurality of characters in each text-entity of the document comprises at least one of: an alphabet, a punctuation, or a digit.
4. The method (500) as claimed in claim 1, further comprising removing at least one noise text-entity from the extracted at least one text-entity matching the received input, wherein removing the at least one noise text-entity comprises: identifying (902) from the document, a reference text-entity having a semantic or syntactic relationship with the received input; determining (904) a distance of each of the extracted at least one text-entity from the reference text-entity; assigning (906), to each of the at least one extracted text-entity, a weight, based on the distance of each of the extracted at least one text-entity from the reference text-entity; and selecting (908) one or more text-entities from the extracted at least one text-entity, based on the weight.
5. The method (500) as claimed in claim 4, further comprising calculating the weight for each of the at least one text-entity, wherein the weight is calculated based on a product of the distance of the one of the at least one text-entity from the reference text-entity and a similarity of the one of the at least one text-entity with the reference text-entity.
6. The method (500) as claimed in claim 1, further comprising: determining a location identifier of the character pattern associated with each text-entity of the at least one text-entity within the document; creating a mapping index, based on an association of the location identifier with the character pattern associated with each text-entity of the at least one text-entity within the document; and extracting, from the document, at least one text-entity matching the received input, based on the mapping index.
7. A system for extracting text from a document based on a received input, the system comprising: a processor (110); and a memory (112) communicatively coupled to the processor (110), wherein the memory (112) stores processor-executable instructions, which, on execution by the processor (110), cause the processor (110) to: decipher a character pattern associated with the received input, wherein the received input comprises a plurality of characters, wherein deciphering the character pattern comprises: identifying a character type from a plurality of character types associated with each of the plurality of characters in the received input; assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types; creating one or more clusters for the received input in response to assigning the character code to each of the plurality of characters, wherein each of the one or more clusters comprises at least one contiguous occurrence of the same character code; and replacing, for each of the one or more clusters, the at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern; and extract, from the document, at least one text-entity matching the received input, based on the character pattern deciphered for the received input.
8. The system as claimed in claim 7, wherein extracting the at least one text-entity from the document further comprises: deciphering a character pattern associated with each text-entity of the at least one text-entity from the document, wherein deciphering the character pattern comprises: identifying a character type from a plurality of character types associated with each of a plurality of characters in each text-entity of the document; assigning, to each of the plurality of characters, a character code from a plurality of character codes based on the identified character type from the plurality of character types; creating one or more clusters for the document in response to assigning the character code to each of the plurality of characters, wherein each of the one or more clusters comprises at least one contiguous occurrence of the same character code; and replacing, for each of the one or more clusters, at least one contiguous occurrence of the same character code with a single occurrence of the same character code to generate the character pattern; and mapping the character pattern deciphered from the received input with the character pattern associated with each of the at least one text-entity from the document to extract at least one text-entity from the document matching the received input.
9. The system as claimed in claim 8, wherein the character type associated with each of the plurality of characters in the received input or the character type associated with each of the plurality of characters in each text-entity of the document comprises at least one of: an alphabet, a punctuation, or a digit.
10. The system as claimed in claim 7, wherein the processor-executable instructions, on execution by the processor, further cause the processor to remove at least one noise text-entity from the extracted at least one text-entity matching the received input, wherein removing the at least one noise text-entity comprises: identifying from the document, a reference text-entity having a semantic or syntactic relationship with the received input; determining a distance of each of the extracted at least one text-entity from the reference text-entity; assigning, to each of the at least one extracted text-entity, a weight, based on the distance of each of the extracted at least one text-entity from the reference text-entity; and selecting one or more text-entities from the extracted at least one text-entity, based on the weight.
11. The system as claimed in claim 10, wherein the processor-executable instructions, on execution by the processor, further cause the processor to calculate the weight for each of the at least one text-entity, wherein the weight is calculated based on a product of the distance of the one of the at least one text-entity from the reference text-entity and a similarity of the one of the at least one text-entity with the reference text-entity.
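By way of illustration only, the steps recited in the claims above may be sketched in Python as follows. This is a minimal sketch under stated assumptions and is not the claimed implementation: the character codes `A`, `D` and `P`, the whitespace tokenization of the document into text-entities, the use of a token position as the location identifier, the token-offset distance, and the `difflib` similarity measure are all hypothetical choices made for the example. The claims do not state how a weight is used to select text-entities, so the sketch only computes the weights.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def char_code(ch):
    """Assign a character code by character type (codes are assumed, not from the source)."""
    if ch.isalpha():
        return "A"  # alphabet
    if ch.isdigit():
        return "D"  # digit
    return "P"      # punctuation, and any other character (assumption)


def char_pattern(text):
    """Code every character, then replace each contiguous run of a code with one occurrence."""
    pattern = []
    for ch in text:
        code = char_code(ch)
        if not pattern or pattern[-1] != code:
            pattern.append(code)
    return "".join(pattern)


def build_mapping_index(tokens):
    """Associate each character pattern with the token positions where it occurs."""
    index = defaultdict(list)
    for position, entity in enumerate(tokens):
        index[char_pattern(entity)].append(position)
    return dict(index)


def extract_matches(document, received_input):
    """Return (position, text-entity) pairs whose pattern equals the input's pattern."""
    tokens = document.split()  # assumption: text-entities are whitespace-separated
    index = build_mapping_index(tokens)
    positions = index.get(char_pattern(received_input), [])
    return [(pos, tokens[pos]) for pos in positions]


def entity_weights(matches, reference_pos, reference_text):
    """Weight each match as distance * similarity relative to a reference text-entity.

    Distance is a token offset and similarity is difflib's ratio; both are
    assumptions made for illustration.
    """
    weighted = []
    for pos, entity in matches:
        distance = abs(pos - reference_pos)
        similarity = SequenceMatcher(None, entity, reference_text).ratio()
        weighted.append((entity, distance * similarity))
    return weighted
```

In this sketch, `char_pattern("12/03/2021")` yields `"DPDPD"`, so a received input such as `01/01/2000` matches any date-like token in the document having the same digit and punctuation layout, regardless of the specific digits.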
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
IN202041011410 | 2020-03-17 | | |
IN202041011410 | 2020-03-17 | | |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021186364A1 (en) | 2021-09-23 |
Family
ID=77769180
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/IB2021/052233 (WO2021186364A1, en) | Extracting text-entities from a document matching a received input | 2020-03-17 | 2021-03-17 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021186364A1 (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019241422A1 (en) * | 2018-06-13 | 2019-12-19 | Oracle International Corporation | Regular expression generation using longest common subsequence algorithm on combinations of regular expression codes |
- 2021-03-17: WO application PCT/IB2021/052233, published as WO2021186364A1 (en), status: active, Application Filing
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108089974B (en) | Testing applications with defined input formats | |
US10922346B2 (en) | Generating a summary based on readability | |
JP6462970B1 (en) | Classification device, classification method, generation method, classification program, and generation program | |
US11763583B2 (en) | Identifying matching fonts utilizing deep learning | |
JP2017054509A (en) | Method and system for extracting sentence | |
US20160188569A1 (en) | Generating a Table of Contents for Unformatted Text | |
CN108280197B (en) | Method and system for identifying homologous binary file | |
CN111984792A (en) | Website classification method and device, computer equipment and storage medium | |
US20220269820A1 (en) | Artificial intelligence based data redaction of documents | |
CN114840869A (en) | Data sensitivity identification method and device based on sensitivity identification model | |
CN115862040A (en) | Text error correction method and device, computer equipment and readable storage medium | |
JP5430312B2 (en) | Data processing apparatus, data name generation method, and computer program | |
CN113850081A (en) | Text processing method, device, equipment and medium based on artificial intelligence | |
JP6805720B2 (en) | Data search program, data search device and data search method | |
JP2016110256A (en) | Information processing device and information processing program | |
JP7390442B2 (en) | Training method, device, device, storage medium and program for document processing model | |
WO2021186364A1 (en) | Extracting text-entities from a document matching a received input | |
CN116450111A (en) | Method, device, equipment and storage medium for creating business object | |
JP6194180B2 (en) | Text mask device and text mask program | |
CN108536713B (en) | Character string auditing method and device and electronic equipment | |
JP2016018279A (en) | Document file search program, document file search device, document file search method, document information output program, document information output device, and document information output method | |
JP6703698B1 (en) | Information provision system | |
JP6361472B2 (en) | Correspondence information generation program, correspondence information generation apparatus, and correspondence information generation method | |
JP2016091354A (en) | Information processing device and information processing program | |
JP7283112B2 (en) | Information processing device, information processing method and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21770520; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 21770520; Country of ref document: EP; Kind code of ref document: A1 |