WO2015080558A1 - A method and system for automated entity recognition

Publication number
WO2015080558A1
Authority
WO
WIPO (PCT)
Prior art keywords
ner
entity
entities
based
knowledge
Prior art date
Application number
PCT/MY2014/000153
Other languages
French (fr)
Inventor
Benjamin CHU MIN XIAN
Daniel Bahls
Dickson Lukose
Original Assignee
Mimos Berhad
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to MYPI2013004281
Application filed by Mimos Berhad
Publication of WO2015080558A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/27 Automatic analysis, e.g. parsing
    • G06F17/2765 Recognition
    • G06F17/2775 Phrasal analysis, e.g. finite state techniques, chunking
    • G06F17/278 Named entity recognition

Abstract

The present invention provides a system for extracting concepts and named-entities from a text-containing document. An entity recognition engine (102) is provided to process an entity with a Rule-based Named-Entity Recognition (NER) (122), a Natural-Language-Processing (NLP) based NER (124), and a knowledge-based NER (126). The NER results are further scored and weighted, and the result with the highest weighted score is taken. A method thereof is also provided.

Description

A Method and System for Automated Entity Recognition

Field of the Invention

[0001] The present invention relates to information extraction. More specifically, the present invention relates to a system and method for automated entity recognition.

Background

[0002] Entity Recognition, a subset of information extraction, recognizes and extracts entities such as personal names, locations, and organizations from a document. In general, features such as string labels and indicator functions for each word are used for entity recognition. The labeled training data and extracted features are then used to train a classifier to obtain optimized parameters through a numerical optimization algorithm. Given the features of a word in testing data, the classifier is able to recognize the label to which it belongs.

[0003] State-of-the-art Named-Entity Recognition (NER) techniques identify entities based on rules or statistical training methods. Such methods are limited in their classification abilities, because the language structure of a sentence itself does not give enough clues for determining the category of an entity at a fine-grained level. Misclassification is a main setback of the conventional methods.

[0004] The problems in state-of-the-art NER systems and methods stem mainly from the following challenges: 1) dependency on training datasets (limited gazetteers); and 2) inflexibility for fine-grained entity type classification.

[0005] In relation to training dataset dependency, it is known that NER systems developed for one domain do not typically perform well on other domains. Considerable effort is required to tune an NER system built for one domain to work in a new domain; this is true for both rule-based and trainable statistical systems. It remains a question whether the altered system is capable of recognizing both generic and domain-specific entities after the effort has been spent.

[0006] Further, state-of-the-art NER systems are adapted for fixed entity types, for example PERSON, LOCATION, ORGANIZATION, etc., and are usually inflexible to configure.

Summary

[0007] In accordance with one aspect of the present invention, there is a system for extracting concept and named-entities from a text-containing document having sentences therein. The system comprises a text processor adapted for tokenizing the sentences to extract possible entities from the document; an entity recognition module for identifying concepts and named-entities; an entity resolution module for resolving acronyms or abbreviations identified from the sentence into recognized entities based on a knowledge base; and an entity disambiguation module adapted for disambiguating entities based on a sentence context. The entity recognition module operationally carries out a Rule-based Named-Entity Recognition (NER), a Natural-Language-Processing (NLP) based NER, and a knowledge-based NER on each entity identified through the text processor, wherein a score is assigned to each NER, and a result of the NER having a highest score will be stored into a knowledge base.

[0008] In one embodiment, the system further comprises a weighted majority voting module, wherein each NER is pre-assigned with a weighted vote, and the score assigned to each NER on the entity is further weighted with the respective weighted vote to obtain ranking scores. A result of the NER with the highest ranking score for the entity will be stored into the knowledge base.

[0009] In another embodiment, the text processor comprises a tokenizer for tokenizing the sentences, a syntax analyzer for parsing the tokenized sentences to identify noun phrases, an acronym/abbreviation extractor adapted for identifying acronyms and abbreviations through the entity resolution module, and a lexical processor for identifying noun phrases and compound words.

[0010] In a further embodiment, the system may further comprise linguistic resources having an abbreviation/acronym repository, a named-entity and a lexicon database, wherein the linguistic resources are operationally accessed by the text processor and the entity recognition engine; and Linked Data that is operationally accessed by the knowledge-based NER for processing entities. There is further provided a pattern matcher adapted for searching entity patterns from the linguistic resources and a term finder adapted for finding and matching a target string from the Linked Data to process the target string to identify entities for processing the knowledge-based NER.

[0011] In another aspect of the present invention, there is also provided a method of extracting concept and named-entities from a text-containing document having sentences therein. The method comprises tokenizing the sentences for extracting possible entities from the document; performing a Rule-based Named-Entity Recognition (NER) on the entities; performing a Natural-Language-Processing (NLP) based NER on the entities; performing a knowledge-based NER on each entity identified through the text processor; scoring each of the NERs processed on the entities; resolving acronyms and abbreviations identified from the sentence into recognised entities based on a knowledge base; disambiguating entities based on a sentence context; and storing the entity that is processed by the NER with a highest score.

[0012] In another embodiment, the method further comprises defining a weighted vote to each NER; weighting each NER's score for each entity to obtain a ranking score for each NER; storing the results of the NER having the highest ranking score on the knowledge database.

[0013] The step of tokenizing the sentence may further comprise tokenizing the sentences; parsing the tokenized sentences to identify noun phrases; identifying acronyms and abbreviations through an entity resolution module; and identifying noun phrases and compound words through a lexical processor.

[0014] In yet another embodiment, the method may further comprise accessing the linguistic resources by the text processor and the entity recognition engine to extract and recognize entities; and accessing the Linked Data by the knowledge-based NER to process the entity. The linguistic resources have an abbreviation/acronym repository, a named-entity and a lexicon database. Further, it may also comprise searching entity patterns through a pattern matcher from the linguistic resources; and finding and matching a target string through a term finder from the Linked Data to process the target string to identify entities for processing the knowledge-based NER.

Brief Description of the Drawings

[0015] Preferred embodiments according to the present invention will now be described with reference to the figures accompanied herein, in which like reference numerals denote like elements;

[0016] FIG. 1 illustrates a block diagram of a Named-Entity Recognition (NER) system in accordance with one embodiment of the present invention;

[0017] FIG. 2 illustrates a process of recognizing entity in accordance with one embodiment of the present invention;

[0018] FIGs. 3A-C exemplify an example of a sentence that is being processed to extract its entities therefrom; and

[0019] FIG. 4 exemplifies a scoring and weighted voting process of an entity E1.

Detailed Description

[0020] Embodiments of the present invention shall now be described in detail, with reference to the attached drawings. It is to be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.

[0021] FIG. 1 illustrates a block diagram of a Named-Entity Recognition (NER) system 100 in accordance with one embodiment of the present invention. The NER system 100 is capable of automatically identifying concepts and named-entities from a document having natural language texts. The NER system 100 comprises a text processor 101, an entity-recognition engine 102, a linguistic resources manager 103, and a disambiguation module 104.

[0022] A document 190 having machine-readable text is inputted to the system 100 for processing. The document 190 can originate in any of various available formats. The text is first processed by the text processor 101, which extracts and tokenizes the sentences to find all possible entities. The text processor 101 includes a tokenizer 112, a syntax analyser 114, an acronym/abbreviation extractor 116 and a lexical processor 118 for processing the text. The tokenizer 112 splits the document into tokenized sentences and eventually into tokenized words. Its main role is to produce a series of word tokens as the basic units from which the syntax analyser produces the corresponding syntax or part-of-speech classes. Tokenization techniques are well known in the art, and any of them may be adapted for the present invention. The syntax analyser 114 is adapted for analysing syntax to produce syntax structure and entity patterns. It parses the tokenized sentences to identify all possible noun phrases to be matched by a pattern matcher for entity recognition. The acronym/abbreviation extractor 116 extracts all acronyms and abbreviations for resolution by the entity resolution module 148. The lexical processor 118 is adapted to identify noun phrases or compound words from the target string. It supplements the text preprocessing in generating noun phrases and compound words. It determines whether a single token forms a compound word or noun phrase, or whether several tokens make up the compound word or noun phrase. It helps the entity recognition module to identify entities.
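The front end of the text processor described above can be sketched as follows. This is a minimal illustration only, not the patented implementation: the function names `split_sentences`, `tokenize_words` and `extract_acronyms` are hypothetical, and the regular-expression rules are simplifying assumptions standing in for the tokenizer 112 and the acronym/abbreviation extractor 116.

```python
import re


def split_sentences(text):
    """Naive sentence splitter (assumption): break after ., ! or ? when the
    next non-space character is a capital letter."""
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize_words(sentence):
    """Split a sentence into word tokens and single punctuation marks.
    Internal periods, apostrophes and hyphens stay inside a word."""
    return re.findall(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", sentence)


def extract_acronyms(tokens):
    """Treat all-uppercase tokens of two or more letters as acronym
    candidates to hand to an entity resolution step."""
    return [t for t in tokens if re.fullmatch(r"[A-Z]{2,}", t)]


if __name__ == "__main__":
    text = "MIMOS builds NER tools. The system reads text."
    for sentence in split_sentences(text):
        tokens = tokenize_words(sentence)
        print(tokens, extract_acronyms(tokens))
```

A production tokenizer would need abbreviation-aware sentence splitting; the capital-letter lookahead here is only enough for short demonstrations.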

[0023] The linguistic manager 103 is connected to the linguistic resource knowledge base 132. The linguistic manager 103 is adapted to manage the interaction between the pattern matcher and linguistic resources, and the interaction between the term finder and the Linked Data 130. The linguistic resources knowledge base 132 has an abbreviation/acronym repository 134, a named-entity pattern repository 136 and a lexicon 138. The lexicon may include generic and specific lexicons for various domains. The linguistic manager 103 feeds the required information from the linguistic resources database 132 to the entity recognition engine 102 for processing the processed text at the entity recognition engine 102.

[0024] The Linked Data 130 is used with reference to the linguistic resources in identifying textual representations (such as names, labels, etc.) for each of the knowledge base concepts.

[0025] After the text has been processed by the text processor 101 to obtain a potential entity, the potential entity is further processed by the entity recognition engine 102, which is adapted for identifying concepts and named-entities. The entity recognition engine 102 has a rule-based NER 122, a natural-language-processing (NLP) based NER 124 and a knowledge-based NER 126 for processing the text. In this entity recognition engine, the text is processed under the rule-based, the NLP-based and the knowledge-based NER respectively, and a score is assigned under each NER process for the text. The scores are compared with each other to determine which NER result shall be taken.

[0026] To process the entities, the entity recognition module 102 comprises a pattern matcher and a term finder. The preprocessed text is processed by the pattern matcher to search for entity patterns from the linguistic resources 132, and subsequently by the term finder to find and match a target string from the Linked Data 130 in order to identify entities for processing by the knowledge-based NER 126.

[0027] The scores from the entity recognition engine 102 are further weighted through the weighted majority voting module 144. The weighted majority-voting module 144 computes a final ranking score for results filtering between the rule-based, NLP-based, and knowledge-based NERs. The results of the NER that obtained the highest ranking score will be annotated through the annotator 146 and stored into the knowledge base 142.

[0028] The Linked Data 130 provides the background knowledge for the knowledge-based NER 126. The Linked Data 130 may be taken or extracted from any open knowledge base, such as DBpedia, AGROVOC, Freebase, etc. When it is combined with the linguistic resources 132, which include a thesaurus, dictionary, etc., they can be configured dynamically to produce a knowledge pattern structure, i.e. dynamic configuration regardless of the domain of the knowledge base. This enriched knowledge pattern structure provides sufficient and rich background knowledge to the knowledge-based NER 126 for entity recognition.

[0029] The entity resolution module 148 is configured to resolve acronyms or abbreviations of the target sentence into the recognized entities from the knowledge base 142.

[0030] The entity disambiguation module 104 is configured to disambiguate entities based on the sentence context. Many systems and methods for disambiguating entities are well known in the art and they can be adapted for the present invention. In one embodiment of the present invention, the disambiguation system and method disclosed in the Malaysia patent application entitled "A METHOD AND SYSTEM FOR AUTOMATED WORD SENSE DISAMBIGUATION", filed on the same day as the present application, can also be adapted.

[0031] FIG. 2 illustrates a process of recognizing an entity in accordance with one embodiment of the present invention. The process starts with searching for an entity pattern through a pattern matcher at step 202. This is done by comparing the tested text, which is a potential entity determined through the text processor 101, against the linguistic resources from the linguistic resources database 132. It follows with a step 204 of finding and matching the potential entity from the Linked Data 130 through a term finder. When the potential entity can be found or matched with the Linked Data 130 at step 206, a property type for the potential entity is retrieved at step 209 for the knowledge-based NER 126. When no match can be found with the Linked Data 130, the potential entity is processed under the rule-based NER 122 and the NLP-based NER 124 at steps 207 and 208 respectively. Whether the potential entity is processed by the rule-based NER 122, the NLP-based NER 124 or the knowledge-based NER 126, the potential entity processed by the entity recognition engine 102 is assigned a matching score based on a similarity measure at step 210. The system 100 then weights the scores for majority voting at step 212. The potential entity with the best score is selected at step 214. The entity is then stored in the knowledge base 142, together with the related information resulting from the NERs.

[0032] FIGs. 3A-C exemplify a sentence that is being processed to extract its entities therefrom. The exemplified sentence is "The iPad Mini is a mini tablet computer designed, developed, and marketed by Apple Inc. which was announced on October 23, 2012.". The exemplified sentence is also referred to herein as a target string. As shown in FIG. 3A, the target string is parsed by the text processor 101 to retrieve its phrase syntax structure through the syntax analyzer 114. It uses the pattern matcher to search for the entity patterns from the linguistic resources. Within the target string, the words "Apple" and "Inc.", matching pattern number 1, will be identified as an entity, or named-entity. The string series "October", "23", "," and "2012", matching pattern number 2, will be identified as a date.
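The two patterns applied to the example sentence can be approximated with regular expressions. The sketch below is an assumption-laden illustration: the patent does not disclose the contents of the named-entity pattern repository 136, so `ENTITY_PATTERNS`, the `ORGANIZATION` and `DATE` labels, and `match_patterns` are hypothetical stand-ins for pattern number 1 and pattern number 2.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Hypothetical stand-ins for two entries of a named-entity pattern repository.
ENTITY_PATTERNS = [
    # Pattern 1 (assumed): capitalised word(s) followed by a company suffix.
    ("ORGANIZATION", re.compile(r"\b(?:[A-Z][A-Za-z]*\s)+(?:Inc|Corp|Ltd)\.?")),
    # Pattern 2 (assumed): "<Month> <day>, <year>".
    ("DATE", re.compile(r"\b(?:" + MONTHS + r")\s+\d{1,2},\s*\d{4}\b")),
]


def match_patterns(sentence):
    """Return (label, matched text) pairs for every pattern hit, in pattern order."""
    matches = []
    for label, pattern in ENTITY_PATTERNS:
        for m in pattern.finditer(sentence):
            matches.append((label, m.group(0)))
    return matches


if __name__ == "__main__":
    target = ("The iPad Mini is a mini tablet computer designed, developed, "
              "and marketed by Apple Inc. which was announced on October 23, 2012.")
    print(match_patterns(target))
```

Note that a surface pattern like this does not catch "iPad Mini"; in the described system that entity is recovered through noun-phrase analysis and the Linked Data lookup.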

[0033] As shown in FIG. 3B, the target string is further processed through the pattern matcher to identify all possible named-entities. In this case, "iPad Mini" and "Apple Inc" will be identified as the possible entities and "October 23, 2012" will be identified as a date. The system will go on to search and match for possible acronyms/abbreviations. The entity resolution module 148 is also used to resolve the existing entities when required. In addition to the syntax structure, the text processor also performs noun phrase or compound word identification through the lexical processor 118.

[0034] Returning to the entity resolution module 148, its main role is to resolve the entities to identify acronyms/abbreviations. Systems and methods for resolving entities to identify acronyms/abbreviations therefrom are well known in the art and they can be adapted for the present invention. In one proposed embodiment, the entity resolution may include searching and querying the Linked Data 130 and linguistic resources 132 to find all related information for each entity; storing this information as context vectors; and generating the number of salient words from the document or text, N ~ Poisson(x), where x is a fixed prior parameter. Salient words are important and meaningful words (e.g. nouns), excluding stopwords (e.g. prepositions). Based on the identified salient words, a document vector can be formed. A similarity function can be used to compute similarity scores by iterating through each of the elements of the context and document vectors, and the final score is averaged out to resolve entities.
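The comparison of context vectors with a document vector can be sketched as below. This is not the disclosed algorithm: the patent leaves the similarity function unspecified, so cosine similarity over bag-of-words counts is used here as an assumed substitute, and the Poisson prior on the number of salient words and the averaging step are omitted. The stopword list, the candidate descriptions and the names `salient_words`, `cosine` and `resolve` are all hypothetical.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real system would use a lexicon.
STOPWORDS = {"the", "a", "an", "of", "in", "on", "by", "and", "is", "was", "for", "to"}


def salient_words(text):
    """Lowercased alphabetic words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counter objects)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def resolve(document, candidates):
    """Pick the candidate whose context vector is most similar to the document vector.

    `candidates` maps an entity name to descriptive text standing in for the
    related information retrieved from Linked Data and linguistic resources."""
    doc_vec = Counter(salient_words(document))
    scores = {name: cosine(Counter(salient_words(ctx)), doc_vec)
              for name, ctx in candidates.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    doc = "The WHO issued new health guidance on vaccine safety."
    cands = {
        "World Health Organization": "international public health agency vaccine disease guidance",
        "The Who": "english rock band music guitar album",
    }
    print(resolve(doc, cands))
```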

[0035] As shown in FIG. 3C, with the potential named-entities identified, the term finder finds and matches the target string from the Linked Data. For example, the label extracted from the Linked Data 130 is matched with the target string. The label is defined in the Linked Data 130. Through the property type available in the Linked Data 130, such as rdfs:label, the string label can be obtained. Each of the potential entities is assigned a match score based on a similarity measure.
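A term finder of this kind can be illustrated with an in-memory label table. The sketch below makes several assumptions: `LINKED_DATA` is a hypothetical list standing in for rdfs:label values fetched from a real Linked Data source, the example.org URIs and type names are invented, and `difflib.SequenceMatcher` from the Python standard library is used as one possible similarity measure, since the patent does not name one.

```python
from difflib import SequenceMatcher

# Hypothetical label table; a real system would query a Linked Data store
# for rdfs:label values instead of using this in-memory list.
LINKED_DATA = [
    {"uri": "http://example.org/resource/Apple_Inc", "label": "Apple Inc.", "type": "Company"},
    {"uri": "http://example.org/resource/IPad_Mini", "label": "iPad Mini", "type": "Device"},
]


def find_term(candidate, resources=LINKED_DATA, min_score=0.8):
    """Return (resource, score) for the best-matching label, or None.

    The score is SequenceMatcher.ratio() on lowercased strings; `min_score`
    is an assumed cut-off, not a value taken from the patent."""
    best = None
    for res in resources:
        score = SequenceMatcher(None, candidate.lower(), res["label"].lower()).ratio()
        if score >= min_score and (best is None or score > best[1]):
            best = (res, score)
    return best


if __name__ == "__main__":
    for term in ("Apple Inc", "iPad Mini", "Kuala Lumpur"):
        print(term, "->", find_term(term))
```

When a match is found, the `type` field plays the role of the property type retrieved for the knowledge-based NER; when `find_term` returns `None`, the candidate would fall through to the rule-based and NLP-based recognizers.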

[0036] FIG. 4 exemplifies derivations of a ranking score for an entity for one embodiment of the present invention. As provided in table 402, an entity E1 that has been processed by the NLP-based, Rule-based and knowledge-based NERs is assigned similarity scores of 0.67, 0.51 and 0.70 respectively. The weighted votes for the NLP-based, Rule-based and knowledge-based NERs are preset as 1, 3 and 5 respectively, as shown in table 404. The scores in table 402 are then multiplied directly by the weighted votes in table 404 to obtain the ranking scores of the entity E1 as processed under the NLP-based, Rule-based and knowledge-based NERs. The highest ranking score that falls above a specific threshold 408 is taken, and the entity as processed under the corresponding NER is stored in the knowledge base 142. The specific threshold can be defined by the user.

[0037] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations, and combinations thereof could be made to the present invention without departing from the scope of the invention.
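The FIG. 4 derivation can be reproduced in a few lines. The similarity scores (0.67, 0.51, 0.70) and weighted votes (1, 3, 5) come from the description; the function name `rank` and the threshold value of 1.0 are assumptions, because the text states only that the threshold is user-defined.

```python
def rank(scores, votes, threshold):
    """Multiply each NER's similarity score by its preset weighted vote and
    return (winning NER or None, all ranking scores).

    The winner is reported only if its ranking score exceeds the threshold."""
    ranking = {ner: scores[ner] * votes[ner] for ner in scores}
    best = max(ranking, key=ranking.get)
    return (best if ranking[best] > threshold else None), ranking


# Values for entity E1 as given in the description of FIG. 4.
SCORES = {"NLP-based": 0.67, "Rule-based": 0.51, "Knowledge-based": 0.70}
VOTES = {"NLP-based": 1, "Rule-based": 3, "Knowledge-based": 5}

if __name__ == "__main__":
    # The threshold of 1.0 is a hypothetical user setting.
    winner, ranking = rank(SCORES, VOTES, threshold=1.0)
    for ner, value in ranking.items():
        print(f"{ner}: {value:.2f}")
    print("selected:", winner)
```

With these inputs the products are 0.67 × 1 = 0.67, 0.51 × 3 = 1.53 and 0.70 × 5 = 3.50, so the knowledge-based result is the one selected for storage.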

Claims

1. A system (100) for extracting concept and named-entities from a text-containing document having sentences therein, the system comprising:
a text processor (101) adapted for tokenizing the sentences to extract possible entities from the document;
an entity recognition module (102) for identifying concepts and named-entities, wherein the entity recognition module (102) operationally carries out a Rule-based Named-Entity Recognition (NER) (122), a Natural-Language-Processing (NLP) based NER (124), and a knowledge-based NER (126) on each entity identified through the text processor (101);
an entity resolution module (148) for resolving acronyms or abbreviations identified from the sentence into recognized entities based on a knowledge base (142); and
an entity disambiguation module (104) adapted for disambiguating entities based on a sentence context,
wherein a score is assigned to each NER, and a result of the NER having a highest score will be recorded on a knowledge base (142).
2. The system (100) according to claim 1, further comprising a weighted majority voting module (144), wherein each NER is pre-assigned with a weighted vote, and the score assigned to each NER on the entity is further weighted with the respective weighted vote to obtain ranking scores, wherein a result of the NER with the highest ranking score for the entity will be recorded on the knowledge base (142).
3. The system (100) according to claim 1, the system (100) further comprising:
linguistic resources (132) having an abbreviation/acronym repository (134), a named-entity (136) and a lexicon database (138), wherein the linguistic resources (132) are operationally accessed by the text processor (101) and the entity recognition engine (102); and
Linked Data (130) that is operationally accessed by the knowledge-based NER (126) for processing entities.
4. The system (100) according to claim 3, the system (100) further comprising: a pattern matcher adapted for searching entity patterns from the linguistic resources and a term finder adapted for finding and matching a target string from the Linked Data (130) to process the target string to identify entities for processing the knowledge-based NER.
5. A method of extracting concept and named-entities from a text-containing document (190) having sentences therein, the method comprising:
tokenizing the sentences for extracting possible entities from the document (190);
performing a Rule-based Named-Entity Recognition (NER) (122) on the entities;
performing a Natural-Language-Processing (NLP) based NER (124) on the entities;
performing a knowledge-based NER (126) on each entity identified through the text processor (101);
scoring each of the NERs processed on the entities;
resolving acronyms and abbreviations identified from the sentence into recognised entities based on a knowledge base (142);
disambiguating entities based on a sentence context; and
storing the entity that is processed by the NER with a highest score.
6. The method according to claim 5, the method further comprising:
defining a weighted vote to each NER;
weighting each NER's score for each entity to obtain a ranking score for each NER; and
storing the results of the NER having the highest ranking score on the knowledge database.
7. The method according to claim 5, wherein tokenizing the sentence further comprises:
tokenizing the sentences;
parsing the tokenized sentences to identify noun phrases;
identifying acronyms and abbreviations through an entity resolution module (148); and
identifying noun phrases and compound words through a lexical processor (118).
8. The method according to claim 5, the method further comprising:
accessing the linguistic resources (132) by the text processor (101) and the entity recognition engine (102) to extract and recognize entities; and
accessing the Linked Data (130) by the knowledge-based NER (126) to process the entity,
wherein the linguistic resources have an abbreviation/acronym repository (134), a named-entity (136) and a lexicon database (138).
9. The method according to claim 8, the method further comprising:
searching entity patterns through a pattern matcher from the linguistic resources (132); and
finding and matching a target string through a term finder from the Linked Data (130) to process the target string to identify entities for processing the knowledge-based NER (126).
PCT/MY2014/000153 2013-11-27 2014-05-29 A method and system for automated entity recognition WO2015080558A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
MYPI2013004281 2013-11-27
MYPI2013004281 2013-11-27

Publications (1)

Publication Number Publication Date
WO2015080558A1 (en) 2015-06-04

Family

ID=51703361

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000153 WO2015080558A1 (en) 2013-11-27 2014-05-29 A method and system for automated entity recognition

Country Status (1)

Country Link
WO (1) WO2015080558A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991085A (en) * 2017-04-01 2017-07-28 中国工商银行股份有限公司 Method and device for abbreviation generation of entity

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009052277A1 (en) * 2007-10-17 2009-04-23 Evri, Inc. Nlp-based entity recognition and disambiguation
US20130158979A1 (en) * 2011-12-14 2013-06-20 Purediscovery Corporation System and Method for Identifying Phrases in Text

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KAMEL N NEBHI: "Ontology-Based Information Extraction from Twitter", PROCEEDINGS OF THE WORKSHOP ON INFORMATION EXTRACTION AND ENTITY ANALYTICS ON SOCIAL MEDIA DATA COLING 2012, 9 December 2012 (2012-12-09), pages 17-22, XP055169825, Mumbai, India *
NING KANG ET AL: "Using an ensemble system to improve concept extraction from clinical records", JOURNAL OF BIOMEDICAL INFORMATICS, ACADEMIC PRESS, NEW YORK, NY, US, vol. 45, no. 3, 25 December 2011 (2011-12-25), pages 423-428, XP028519146, ISSN: 1532-0464, DOI: 10.1016/J.JBI.2011.12.009 [retrieved on 2012-01-03] *

Similar Documents

Publication Publication Date Title
Tanabe et al. Tagging gene and protein names in biomedical text
Cucerzan Large-scale named entity disambiguation based on Wikipedia data
US8594996B2 (en) NLP-based entity recognition and disambiguation
US9400838B2 (en) System and method for searching for a query
Barr et al. The linguistic structure of English web-search queries
US20040236566A1 (en) System and method for identifying special word usage in a document
US20090319257A1 (en) Translation of entity names
Chen et al. Unknown word extraction for Chinese documents
EP1793318A2 (en) Answer determination for natural language questionning
US7805303B2 (en) Question answering system, data search method, and computer program
Jiang et al. CRCTOL: A semantic‐based domain ontology learning system
US8370128B2 (en) Semantically-driven extraction of relations between named entities
US20070233656A1 (en) Disambiguation of Named Entities
Benajiba et al. Arabic named entity recognition using conditional random fields
Evert The statistics of word cooccurrences
JP4650072B2 (en) Question answering system, data retrieval method, and computer program
Jing et al. The decomposition of human-written summary sentences
CA2390784C (en) A method and system for theme-based word sense ambiguity reduction
Benajiba et al. Arabic named entity recognition using optimized feature sets
Lita et al. Truecasing
US8527522B2 (en) Confidence links between name entities in disparate documents
JP5825676B2 (en) Non-factoid question answering system and computer program
Chen et al. Language specific issue and feature exploration in Chinese event extraction
Erjavec et al. Machine learning of morphosyntactic structure: Lemmatizing unknown Slovene words
CA2536262A1 (en) System and method for processing text utilizing a suite of disambiguation techniques

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14784113

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14784113

Country of ref document: EP

Kind code of ref document: A1