WO2015080561A1 - A method and system for automated relation discovery from texts - Google Patents

A method and system for automated relation discovery from texts

Info

Publication number
WO2015080561A1
Authority
WO
WIPO (PCT)
Prior art keywords
entities
relation
noun phrases
sentence
verb
Prior art date
Application number
PCT/MY2014/000175
Other languages
French (fr)
Inventor
Chu Min Xian BENJAMIN
Qiang Liu
Ben Mohamed KHALIL
Lukose Dickson
Klaus TOCHTERMANN
Original Assignee
Mimos Berhad
Priority date
Filing date
Publication date
Application filed by Mimos Berhad
Publication of WO2015080561A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present invention relates to information extraction. More specifically, the present invention relates to a system and method for automated relation discovery from texts.
  • a system for carrying out a relation extraction from a machine-readable document having sentences comprises a text preprocessing module for lemmatizing the sentences into tokenized text and identifying paragraphs from the document; a coreference resolution module configured to resolve all possible anaphors; an entity recognition module for extracting entities from the sentence; an entity resolution and disambiguation module adapted for resolving all ambiguous entities/noun phrases, acronyms and abbreviations; and a relation extraction module configured to carry out a generic relation extraction and a semantic relations extraction for extracting triples for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple for a sentence for selecting and storing the most appropriate triple.
  • each triple is computed with a basic ranking score by matching the identified concepts with the respective verb schemas, and weighted with their respective recorded popularity scores to obtain the weighted ranking score for each triple, the highest ranking score of which is selected as the most appropriate triple.
  • the generic relation extraction among entities in an intra-sentential context extracts noun phrases, identifies the verb with the aid of linguistic resources, and matches the sentence with predefined patterns; the linguistic resources and a Linked Data are then searched for possible properties to generate the corresponding relation the sentence refers to.
  • the generic relation extraction among entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources and to identify relations from a knowledge base; the linguistic resources are searched for possible triples based on the properties found.
  • the semantic relations extraction for entities in an intra-sentential context matches predefined patterns to extract and match schemas based on the identified verb, and searches entities or noun phrases through a linked database.
  • the semantic relations extraction for entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository and another list of entities, noun phrases that are not within the same paragraph and the connecting verbs in order to extract and match schemas based on the verb identified from the linguistic resources.
  • a method of carrying out a relation extraction from a machine-readable document having sentences comprises lemmatizing texts of the sentences into tokenized text and identifying paragraphs from the document; resolving all possible anaphors through a coreference resolution module; extracting entities from the sentences through an entity recognition module; resolving all ambiguous entities/noun phrases, acronyms and abbreviations; and carrying out a generic relation extraction and a semantic relations extraction for extracting triples for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple for a sentence for selecting and storing the most appropriate triple.
  • the relation extraction further comprises computing a basic ranking score by matching the identified concepts to the respective verb schemas; retrieving a popularity score for each of the schemas, computing a popularity weighted score for each schema to obtain weighted ranking score for each schema; selecting a schema with the highest weighted ranking score; generating the relation triples based on the selected schema.
  • the generic relation extraction among entities in an intra-sentential context may extract noun phrases, identify the verb with the aid of linguistic resources, and match the sentence with predefined patterns, after which the linguistic resources and a Linked Data are searched for possible properties to generate the corresponding relation the sentence refers to; and the semantic relations extraction for entities in an intra-sentential context matches predefined patterns to extract and match schemas based on the identified verb, and searches entities or noun phrases through a linked database.
  • the generic relation extraction among entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources and to identify relations from a knowledge base, the linguistic resources being searched for possible triples based on the properties found; and the semantic relations extraction for entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources.
  • FIG. 1 illustrates a relation discovery system in accordance with one embodiment of the present invention
  • FIG. 2 illustrates a relation extraction process in accordance with one embodiment of the present invention
  • FIGs. 3A and 3B show an example of a sentence that is being processed to extract a generic relation
  • FIG. 4 exemplifies a further example of generic relation extractions
  • FIG. 5 exemplifies an example of semantic relation extractions
  • FIG. 6 exemplifies another example of generic relation extractions
  • FIG. 7 illustrates the ranking process with a popularity weight in accordance with one embodiment of the present invention.
  • FIG. 8 illustrates a process of inter-sentential discovery in accordance with one embodiment of the present invention.
  • FIG. 1 illustrates a relation discovery system 100 in accordance with one embodiment of the present invention.
  • the system 100 is adapted for automatically discovering relations within and across sentences from natural language texts.
  • the system is able to process both structured and unstructured texts. It is further capable of resolving intra-sentential and inter-sentential contexts. It is also capable of discovering generic and semantic relations in the intra-sentential and inter-sentential contexts.
  • the system comprises a text preprocessor 101, a relation discovery module 102, a Linked Data 103, a pattern database 104, linguistic resources 105 and triples 106.
  • the text preprocessor 101 receives machine-readable texts 190, or simple texts for processing.
  • the texts are to be first processed by the text preprocessor 101 and subsequently through the relation discovery module 102 to discover relations between the texts.
  • the text preprocessor 101 includes a tokenizer 112, a lemmatization module 114 and paragraph identifier 116.
  • the text preprocessor 101 is adapted for transforming a sentence into tokenized lemma form.
  • the relation discovery module 102 comprises a coreference resolution module 122, an entity recognition module 124, an entity resolution and disambiguation module 126 and a relation extractor 128.
  • the coreference resolution module 122 is configured to resolve all possible anaphors to their corresponding noun antecedents.
  • the entity recognition module 124 is configured to extract all possible entities from a sentence. The texts are then processed by the entity resolution and disambiguation module 126 to resolve all ambiguous entities/noun phrases, acronyms and abbreviations. Once the entities are identified and resolved, the relation extractor 128 determines and extracts the relationship that exists between entities.
  • the relation discovery module 102 is assisted through the Linked Data 103, the pattern database 104 and the linguistic resources 105. Specifically, the coreference resolution 122, entity recognition 124, and entity resolution and disambiguation 126 are processed with the use of the Linked Data 103; the coreference resolution 122 and the relation extraction 128 are processed with the use of the linguistic resources 105; and the relation extraction 128 is further processed with the use of the predefined pattern database 104.
  • through the paragraph identification 116, coreference resolution 122, entity recognition, and entity resolution and disambiguation, inter-sentential contexts can be resolved.
  • FIG. 2 illustrates a relation extraction process in accordance with one embodiment of the present invention.
  • the system 100 performs a coreference resolution to resolve all anaphors to the corresponding noun antecedents. For example, given the text "I saw Scott yesterday. He was fishing by the lake.", it can be identified that "Scott" and "he" are coreferent, and therefore the anaphor 'he' is resolved to refer to 'Scott'.
  • the texts are tokenized into sentences.
  • the entities/noun phrases are identified at step 203, and at step 204, the verbs are identified.
  • the identified entities/noun phrases and verbs are then stored onto the index repository 205.
  • the pattern of the sentence being processed is matched against the pattern database 104 to trigger a specified rule for extracting an appropriate relation.
  • the schemas of the verb are also extracted from integrated linguistic resources to trigger the specified rule for extracting the most appropriate relation.
  • the system 100 matches intra-sentential patterns to identify relation triples 212 for storing on a knowledge base 215.
  • the system 100 determines if more sentences are to be processed. If there are, at step 211, the system 100 returns to step 202 to identify the next sentence to be processed. If no further sentence is to be processed, the extraction type is specified at step 220. Subsequently, the system 100 determines at step 222 whether any intra-sentential context remains to be processed. If an intra-sentential pattern is matched, ranking/filtering is carried out at step 230. If no further intra-sentential context is to be processed, the system 100 retrieves the entity and verb indexes 205 at step 224.
  • the index is utilised for matching inter-sentential patterns at step 225.
  • the inter-sentential pattern matching also includes a generic relations extraction and a semantic relations extraction.
  • relation triples are generated at step 228 and stored on the knowledge base 215.
  • the candidate triples in the intra-sentential and/or inter-sentential context are ranked and filtered in step 230.
  • the system computes a basic ranking score by matching the identified entities with the verb schemas. It then retrieves a popularity weight table that provides the popularity scores for the respective schemas. A popularity-weighted score is computed accordingly for each schema. The schema with the highest weighted score is then selected, and the relation triples are generated based on it.
  • Generic relation extraction comprises the steps of retrieving a list of entities/noun phrase from the paragraph to be processed to form a List A from index and iterating each of the entities of the List A; selecting a next entry Ei from the List A and retrieving a list of entities/noun phrase which is not within the same paragraph (i.e.
  • the system 100 goes on searching the linguistic resources 105, such as WordNet, etc, and retrieves all possible properties related to the entities/noun phrases to generate holonym relation triples: Holonym(Germany, Europe); Holonym(Germany, European Union); and Holonym(Germany, European Economic Community).
  • a similarity measure between the additional property labels and all the entities/noun phrases identified in the sentence is computed, and a respective score is assigned to each holonym.
  • the candidate relation triples are filtered by comparing the computed similarity scores against a specified threshold.
  • FIG. 4 shows a further example of generic relation extraction based on the sentence "Samsung release Galaxy Mini, another smartphone for Android fans." in another embodiment of the present invention.
  • the system 100 first identifies the possible entities/noun phrases from the example: the entities Samsung, Galaxy Mini, etc., and the noun phrase "smartphone". The identified entities are then lemmatized accordingly.
  • the linguistic resources 105 are then used to identify the verb, if any. In this example, "release" is identified as a verb.
  • the identified verb(s), entities and nouns are marked up accordingly for matching with the pattern database 104 according to the structure of the sentence exemplified here.
  • a matching pattern, if one can be found, triggers the corresponding rule to generate the relation triples.
  • the relation triples are well known in the field of Resource Description Framework (RDF).
  • RDF Resource Description Framework
  • hyponymy or is-a relation, meronymy or part-of relation, synonymy, etc. may be utilized for determining relations between the phrases.
  • the rules and matching patterns are used to extract generic relations, which are preferably of a taxonomic type derivable from a knowledge base, e.g. hyponyms, meronyms, synonyms, hypernyms, holonyms, antonyms, etc.
  • two relation triples can be matched: (Samsung, release, Galaxy Mini) and (Samsung, release, smartphone), based on a predefined pattern: NP1+VP1+(NP2,[NP3+PP3]), where NP1, NP2 and NP3 refer to noun phrases, VP1 refers to a verb phrase and PP3 refers to a prepositional phrase.
  • the linguistic resources 105 are searched again to retrieve the possible properties related to the entities/noun phrases.
  • a similarity measure between the additional property labels and all the entities/noun phrases identified in the sentence is computed.
  • the candidate relation triples are filtered by comparing the computed similarity scores against a specified threshold.
  • the semantic relation extraction comprises the steps of retrieving a list of entities/noun phrases from the paragraph/sentence to be processed to form a List A from the index and iterating over each of the entities thereof; selecting a next entry Ei from the List A and retrieving a list of entities/noun phrases which are not within the same paragraph (i.e. other paragraphs) to generate a List B; searching the intra-sentential results stored in the knowledge base for a relation between Ei and List B, wherein an existing result can be reused; searching the Linked Data 103 for Ei and obtaining all possible candidate relations; removing a candidate triple if its subject/object does not match the List B; and repeating the above steps until the last entry Ei in List A.
  • generic relations extraction differs from the semantic relations extraction in that the Linked Data.
  • FIG. 5 exemplifies a sentence "The Bugatti Veyron EB 16.4 is a mid-engined grand touring car by the Volkswagen Group in France" which is to be processed by the system 100.
  • the entities/noun phrases "Bugatti", "Veyron", etc. can be identified.
  • No verb, on the other hand, can be identified.
  • no patterns can be matched too.
  • a class property label "Sports car" can be retrieved for the entity "Bugatti", and through a list of predefined property rules that denote which specific relation is to be triggered, an isType relation can be applied to generate a relation triple: isType(Bugatti, Veyron, Sports Car).
  • FIG. 6 exemplifies another sentence, "John bought Mary a Ferrari", which is to be processed by the system 100.
  • similar steps are followed as described in the previous examples. The only difference from the previous scenarios is that the verb is identified but no patterns are available for it. In that case, all possible schemas related to the target verb are extracted from the Linguistic Resources 105.
  • VerbNet for example, offers a comprehensive resource for verbs.
  • FrameNet for example can be used to identify all schemas to match all the identified concepts.
  • the selectional constraints for each of the semantic roles are to be met.
  • the identified entities are "John”, “Mary”,
  • FIG. 7 shows the process of ranking the schemas with popularity weight to select the schema to generate the corresponding relation triples with the example of FIG. 6.
  • the sentence 701 is processed to identify an appropriate schema, Schema 1, Schema 2 and Schema 3, from a schemas list 702.
  • each of the schemas is computed with a basic ranking score based on matches of the identified concepts.
  • the appropriate schema is selected based on a highest weighted score of those assigned to the schemas.
  • the schemas are further weighted with a weighted ranking score as shown in table 703.
  • the popularity weight is added to differentiate them. In this case, even though Schema 2 has a higher weighted ranking score than Schema 1, Schema 3 already had a higher basic score, so the popularity score for Schema 3 is incremented by 1 and the table 704 is updated accordingly.
  • FIG. 8 shows steps involved for relation discovery in the inter-sentential context.
  • a list of entities/NPs, List A, is retrieved from the index 222 and each item on the list is iterated over, starting from the first entry.
  • the intra-sentential results stored in the knowledge base 215 are searched for a relation between Ei and List B (reused if it exists).
  • the Linked Data 103 is searched for Ei to obtain all possible candidate relations.
  • a candidate triple is removed if its subject/object does not match List A or if it falls below a similarity threshold.
  • the matched results are stored on the knowledge base. The processes are repeated until the last entry in List A is processed.
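The inter-sentential steps listed above can be sketched as a loop over List A. This is a minimal illustration, not the patented implementation: the index, the Linked Data store and the knowledge base are modelled as plain in-memory Python structures, the function name `inter_sentential` and the sample data are hypothetical, candidates are kept only when their other end appears in List B (one reading of the filtering step), and the similarity threshold is omitted.

```python
Triple = tuple[str, str, str]  # (subject, predicate, object)


def inter_sentential(
    index: dict[int, list[str]],          # paragraph id -> entities/noun phrases
    paragraph_id: int,
    linked_data: dict[str, list[Triple]],  # stand-in for a Linked Data lookup
    knowledge_base: set[Triple],           # previously stored results
) -> set[Triple]:
    list_a = index[paragraph_id]
    # List B: everything that is NOT in the paragraph being processed.
    list_b = {e for pid, ents in index.items() if pid != paragraph_id for e in ents}
    found: set[Triple] = set()
    for e_i in list_a:
        # Reuse stored results that already link Ei with an entry of List B.
        reused = {
            t for t in knowledge_base
            if (t[0] == e_i and t[2] in list_b) or (t[2] == e_i and t[0] in list_b)
        }
        # Candidate relations for Ei; drop those whose other end is not in List B.
        kept = {
            t for t in linked_data.get(e_i, [])
            if (t[0] == e_i and t[2] in list_b) or (t[2] == e_i and t[0] in list_b)
        }
        found |= reused | kept
    knowledge_base |= found  # store the matched results
    return found


# Hypothetical sample data for illustration only.
index = {0: ["Germany"], 1: ["Europe", "Berlin"]}
linked_data = {
    "Germany": [("Germany", "holonym", "Europe"), ("Germany", "borders", "France")]
}
kb: set[Triple] = set()
result = inter_sentential(index, 0, linked_data, kb)
```

Here the "borders" candidate is discarded because "France" does not occur in any other paragraph, while the holonym triple survives and is written back to the knowledge base.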

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a system (100) for discovering relations between texts in sentences of a machine-readable document. The system comprises a text preprocessor (101) and a relation discovery module (102). The text preprocessor (101) processes the documents to identify and extract entities, noun phrases and verbs therefrom. The relation discovery module (102) discovers the relations through generic and semantic relation extraction for unstructured and structured texts to resolve intra-sentential and inter-sentential contexts.

Description

A Method and System for Automated Relation Discovery from Texts
Field of the Invention
[0001] The present invention relates to information extraction. More specifically, the present invention relates to a system and method for automated relation discovery from texts.
Background
[0002] Given typed entities and relations, it would be desirable to be able to infer implicit contexts from both structured and unstructured collections of texts. To achieve that, one needs to be able to extract the entities and their relations with each other from natural language texts. There are three main challenges: 1) extraction of entities from structured/unstructured texts; 2) extraction of typed relations between entities from structured/unstructured texts; and 3) intra-sentential relation extraction.
[0003] The conventional approach to relation discovery from natural language texts requires a substantial amount of training examples or tagged datasets, i.e. a supervised approach. Such a supervised approach offers poor extraction quality of entities from texts, as it either depends heavily on the availability of annotated data or relies on predefined gazetteer lists that do not accommodate the diversity of multiple domains.
[0004] US patent publication no. US2009/0019032 A1, in FIG. 2, shows the table below as an example of an annotated training corpus used by a conventional extraction method known in the art.
Training Corpus
"We found that TP53 is a lung cancer gene"
"Smoking is bad for your lungs"
[Figure: annotated training corpus table; image not reproduced]
[0005] Details of the illustrations are provided in the US publication and are therefore not elaborated herein.
[0006] When it comes to the extraction of typed relations between entities from structured/unstructured texts, the state-of-the-art approaches frequently involve parsing that depends on syntactic structures (such as part-of-speech 'POS' tags, dependency parse trees, etc.).
[0007] Further, relation extraction techniques thus far focus on the intra-sentential context (i.e. within a sentence). The challenge, which is also of keen interest, is to extract relations between entities across sentences, in the inter-sentential context. Overall, there are three main problems: granularity (e.g. intra-sentential level vs. inter-sentential level), heterogeneity of the texts (structured vs. unstructured documents) and the use of domain-specific sources versus open domains.
Summary
[0008] In accordance with one aspect of the present invention, there is provided a system for carrying out a relation extraction from a machine-readable document having sentences. The system comprises a text preprocessing module for lemmatizing the sentences into tokenized text and identifying paragraphs from the document; a coreference resolution module configured to resolve all possible anaphors; an entity recognition module for extracting entities from the sentence; an entity resolution and disambiguation module adapted for resolving all ambiguous entities/noun phrases, acronyms and abbreviations; and a relation extraction module configured to carry out a generic relation extraction and a semantic relations extraction for extracting triples for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple for a sentence for selecting and storing the most appropriate triple.
[0009] In one embodiment, each triple is computed with a basic ranking score by matching the identified concepts with the respective verb schemas, and weighted with their respective recorded popularity scores to obtain the weighted ranking score for each triple, the highest ranking score of which is selected as the most appropriate triple.
[0010] In another embodiment, the generic relation extraction among entities in an intra-sentential context extracts noun phrases, identifies the verb with the aid of linguistic resources, and matches the sentence with predefined patterns; the linguistic resources and a Linked Data are then searched for possible properties to generate the corresponding relation the sentence refers to.
[0011] In yet another embodiment, the generic relation extraction among entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources and to identify relations from a knowledge base; the linguistic resources are searched for possible triples based on the properties found.
[0012] Further, the semantic relations extraction for entities in an intra-sentential context matches predefined patterns to extract and match schemas based on the identified verb, and searches entities or noun phrases through a linked database.
[0013] In yet a further embodiment, the semantic relations extraction for entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository and another list of entities, noun phrases that are not within the same paragraph and the connecting verbs in order to extract and match schemas based on the verb identified from the linguistic resources.
[0014] In another aspect of the present invention, there is provided a method of carrying out a relation extraction from a machine-readable document having sentences. The method comprises lemmatizing texts of the sentences into tokenized text and identifying paragraphs from the document; resolving all possible anaphors through a coreference resolution module; extracting entities from the sentences through an entity recognition module; resolving all ambiguous entities/noun phrases, acronyms and abbreviations; and carrying out a generic relation extraction and a semantic relations extraction for extracting triples for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple for a sentence for selecting and storing the most appropriate triple.
[0015] In one embodiment, the relation extraction further comprises computing a basic ranking score by matching the identified concepts to the respective verb schemas; retrieving a popularity score for each of the schemas, computing a popularity weighted score for each schema to obtain weighted ranking score for each schema; selecting a schema with the highest weighted ranking score; generating the relation triples based on the selected schema.
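The ranking steps of this embodiment can be sketched as follows. The specification does not give the scoring formulas, so this sketch assumes that the basic score is the number of schema roles filled by an identified concept and that the weighted score scales it by one plus the schema's share of the total popularity. The `Schema` class, the role names, the popularity values and the way triples are built from the first filled role are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Schema:
    name: str
    roles: list[str]     # semantic roles the schema expects, in order
    popularity: int = 0  # recorded popularity score (times selected so far)


def basic_score(schema: Schema, concepts: dict[str, str]) -> int:
    """Assumed basic score: number of schema roles matched by a concept."""
    return sum(1 for role in schema.roles if role in concepts)


def weighted_score(schema: Schema, concepts: dict[str, str], total: int) -> float:
    """Assumed weighting: basic score scaled by (1 + relative popularity)."""
    share = schema.popularity / total if total else 0.0
    return basic_score(schema, concepts) * (1.0 + share)


def select_schema(schemas: list[Schema], concepts: dict[str, str]) -> Schema:
    total = sum(s.popularity for s in schemas)
    best = max(schemas, key=lambda s: weighted_score(s, concepts, total))
    best.popularity += 1  # update the popularity table after selection
    return best


def generate_triples(schema: Schema, verb: str, concepts: dict[str, str]):
    """Link the first filled role to every other filled role via the verb."""
    filled = [r for r in schema.roles if r in concepts]
    if not filled:
        return []
    subject = concepts[filled[0]]
    return [(subject, verb, concepts[r]) for r in filled[1:]]


# Hypothetical roles and popularity values for "John bought Mary a Ferrari".
concepts = {"Agent": "John", "Recipient": "Mary", "Theme": "Ferrari"}
schemas = [
    Schema("Schema 1", ["Agent", "Theme"], popularity=5),
    Schema("Schema 2", ["Agent", "Theme", "Source"], popularity=1),
    Schema("Schema 3", ["Agent", "Recipient", "Theme"], popularity=2),
]
best = select_schema(schemas, concepts)
triples = generate_triples(best, "buy", concepts)
```

With these sample values Schema 3 fills three roles and wins despite Schema 1 being more popular, and its popularity count is incremented after selection.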
[0016] In another embodiment, the generic relation extraction among entities in an intra-sentential context may extract noun phrases, identify the verb with the aid of linguistic resources, and match the sentence with predefined patterns, after which the linguistic resources and a Linked Data are searched for possible properties to generate the corresponding relation the sentence refers to; and the semantic relations extraction for entities in an intra-sentential context matches predefined patterns to extract and match schemas based on the identified verb, and searches entities or noun phrases through a linked database.
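As an illustration of matching a sentence against a predefined pattern, the sketch below applies the pattern NP1+VP1+(NP2,[NP3+PP3]) from this document's Samsung example to a sentence that has already been chunked. It assumes the chunk tags are supplied by an earlier step; the `Chunk` alias, the function name and the use of a regular expression over the tag sequence are illustrative choices, not taken from the specification.

```python
import re

Chunk = tuple[str, str]  # (tag, text), e.g. ("NP", "Samsung")


def match_np_vp_nps(chunks: list[Chunk]) -> list[tuple[str, str, str]]:
    """Apply the rule for NP1 + VP1 + (NP2, [NP3 + PP3]).

    Returns (NP1, VP1, NP2) and, when the optional group is present,
    (NP1, VP1, NP3). Returns an empty list if the pattern does not match.
    """
    tags = " ".join(tag for tag, _ in chunks)
    if not re.fullmatch(r"NP VP NP( NP PP)?", tags):
        return []
    subject, verb = chunks[0][1], chunks[1][1]
    objects = [chunks[2][1]]
    if len(chunks) == 5:  # optional NP3 + PP3 group matched
        objects.append(chunks[3][1])
    return [(subject, verb, obj) for obj in objects]


# "Samsung release Galaxy Mini, another smartphone for Android fans."
chunks = [
    ("NP", "Samsung"),
    ("VP", "release"),
    ("NP", "Galaxy Mini"),
    ("NP", "smartphone"),
    ("PP", "for Android fans"),
]
samsung_triples = match_np_vp_nps(chunks)
```

A pattern database in this style would hold many such tag-sequence patterns, each paired with the rule that builds triples from the matched chunks.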
[0017] In a further embodiment, the generic relation extraction among entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources and to identify relations from a knowledge base, the linguistic resources being searched for possible triples based on the properties found; and the semantic relations extraction for entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository, another list of entities and noun phrases that are not within the same paragraph, and the connecting verbs, in order to extract and match schemas based on the verb identified from the linguistic resources.
Brief Description of the Drawings
[0018] Preferred embodiments according to the present invention will now be described with reference to the figures accompanied herein, in which like reference numerals denote like elements;
[0019] FIG. 1 illustrates a relation discovery system in accordance with one embodiment of the present invention;
[0020] FIG. 2 illustrates a relation extraction process in accordance with one embodiment of the present invention;
[0021] FIGs. 3A and 3B show an example of a sentence that is being processed to extract a generic relation;
[0022] FIG. 4 exemplifies a further example of generic relation extractions;
[0023] FIG. 5 exemplifies an example of semantic relation extractions;
[0024] FIG. 6 exemplifies another example of generic relation extractions;
[0025] FIG. 7 illustrates the ranking process with a popularity weight in accordance with one embodiment of the present invention; and
[0026] FIG. 8 illustrates a process of inter-sentential discovery in accordance with one embodiment of the present invention.
Detailed Description
[0027] Embodiments of the present invention shall now be described in detail, with reference to the attached drawings. It is to be understood that no limitation of the scope of the invention is thereby intended, such alterations and further modifications in the illustrated device, and such further applications of the principles of the invention as illustrated therein being contemplated as would normally occur to one skilled in the art to which the invention relates.
[0028] FIG. 1 illustrates a relation discovery system 100 in accordance with one embodiment of the present invention. The system 100 is adapted for automatically discovering relations within and across sentences from natural language texts. The system is able to process both structured and unstructured texts. It is further capable of resolving intra-sentential and inter-sentential contexts, and of discovering generic and semantic relations in both contexts. The system comprises a text preprocessor 101, a relation discovery module 102, a Linked Data 103, a pattern database 104, linguistic resources 105 and triples 106. The system 100 receives machine-readable texts 190, or simple texts, for processing. The texts are first processed by the text preprocessor 101 and subsequently by the relation discovery module 102 to discover relations within the texts. The text preprocessor 101 includes a tokenizer 112, a lemmatization module 114 and a paragraph identifier 116. The text preprocessor 101 is adapted for transforming sentences into tokenized lemma form.
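By way of illustration only, the three preprocessing functions named above (paragraph identification, tokenization and lemmatization) might be sketched as follows. The lemma table and all function names are assumptions introduced for this sketch; a practical embodiment would rely on a full lemmatizer rather than a hand-written lookup.

```python
import re

# Hypothetical, hand-written lemma table; not part of the described system.
LEMMAS = {"countries": "country", "bought": "buy", "was": "be"}

def identify_paragraphs(text):
    """Split raw text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def tokenize(sentence):
    """Split a sentence into word tokens and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def lemmatize(tokens):
    """Map each token to its lemma, falling back to the token itself."""
    return [LEMMAS.get(token.lower(), token) for token in tokens]

def preprocess(text):
    """Return the tokenized lemma form of each paragraph in the text."""
    return [lemmatize(tokenize(p)) for p in identify_paragraphs(text)]
```

Keeping the three functions separate mirrors the tokenizer 112, lemmatization module 114 and paragraph identifier 116, so that any one of them could be swapped for a stronger component.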
[0029] The relation discovery module 102 comprises a coreference resolution module 122, an entity recognition module 124, an entity resolution and disambiguation module 126 and a relation extractor 128. The coreference resolution module 122 is configured to resolve all possible anaphors into the corresponding noun antecedents. The entity recognition module 124 is configured to extract all possible entities from a sentence. The texts are then processed by the entity resolution and disambiguation module 126 to resolve all ambiguous entities/noun phrases, acronyms and abbreviations. Once the entities are identified and resolved, the relation extractor 128 determines and extracts the relationship that exists between entities.
[0030] During the operations of the relation discovery, the relation discovery module 102 is assisted by the Linked Data 103, the pattern database 104 and the linguistic resources 105. Specifically, the coreference resolution 122, the entity recognition 124, and the entity resolution and disambiguation 126 are processed with the use of the Linked Data 103; the coreference resolution 122 and the relation extraction 128 are processed with the use of the linguistic resources 105; and the relation extraction 128 is further processed with the use of the predefined pattern database 104.
[0031] Through the paragraph identification 116, coreference resolution 122, entity recognition, and entity resolution and disambiguation, inter-sentential contexts can be resolved. On the other hand, generic relations among entities in the intra-sentential context can be resolved by applying noun phrase extraction, then identifying the verb with the aid of the linguistic resources 105, then matching the sentence with the pattern database 104, and searching the linguistic resources 105 and the Linked Data 103 to determine the corresponding relation it refers to. It is to be noted that the Linked Data 103 can be a proprietary database or a public database.
[0032] FIG. 2 illustrates a relation extraction process in accordance with one embodiment of the present invention. At step 201, the system 100 performs a coreference resolution to resolve all anaphors to the corresponding noun antecedents. For example, given the sentences "I saw Scott yesterday. He was fishing by the lake.", it can be identified that "Scott" and "He" are coreferent, and therefore the anaphor "He" will be resolved to refer to "Scott".
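As a deliberately naive illustration of the resolution at step 201, the sketch below replaces a pronoun with the most recently seen known entity. The pronoun set, the function name and the "most recent entity" heuristic are assumptions made for this sketch only; the description does not state how the coreference resolution module 122 selects an antecedent.

```python
import re

# Hypothetical and deliberately tiny pronoun list.
PRONOUNS = {"he", "she"}

def resolve_pronouns(sentences, known_entities):
    """Replace each pronoun with the most recently mentioned known entity."""
    last_entity = None
    resolved = []
    for sentence in sentences:
        tokens = []
        for token in re.findall(r"\w+|[^\w\s]", sentence):
            if token.lower() in PRONOUNS and last_entity is not None:
                tokens.append(last_entity)  # anaphor resolved to its antecedent
            else:
                if token in known_entities:
                    last_entity = token  # remember the candidate antecedent
                tokens.append(token)
        resolved.append(tokens)
    return resolved
```

Applied to the two example sentences with "Scott" as the only known entity, the second sentence becomes the token sequence Scott, was, fishing, by, the, lake.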
[0033] At step 202, the texts are tokenized into sentences. For each sentence, the entities/noun phrases are identified at step 203, and the verbs are identified at step 204. The identified entities/noun phrases and verbs are then stored in the index repository 205. To identify and extract the verb(s), the pattern of the sentence being processed is matched against the pattern database 104 to trigger a specified rule for extracting an appropriate relation. The schemas of the verb are also extracted from the integrated linguistic resources to trigger the specified rule for extracting the most appropriate relation.
[0034] At step 210, the system 100 matches intra-sentential patterns to identify relation triples 212 for storing in a knowledge base 215. To do that, it operates at dual granularity, discovering and extracting both generic and semantic relations from texts. Once the relation extractions are done at step 210, the system 100 determines at step 211 whether more sentences are to be processed. If there are, the system 100 returns to step 202 to identify the next sentence to be processed. If no further sentence is to be processed, the extraction type is specified at step 220. Subsequently, the system 100 determines at step 222 whether there is any intra-sentential context to be processed. If an intra-sentential pattern is matched, a ranking/filtering is carried out at step 230. If no further intra-sentential context is to be processed, the system 100 retrieves the entity and verb indexes 205 at step 224. The index is utilised for matching inter-sentential patterns at step 225. Similarly, the inter-sentential pattern matching also includes a generic relations extraction and a semantic relations extraction. Through the inter-sentential pattern matching at step 225, relation triples are generated at step 228 and stored in the knowledge base 215.
[0035] The candidate triples in the intra-sentential and/or inter-sentential context are ranked and filtered at step 230. During the ranking and filtering process, the system computes a basic ranking score by matching the identified entities with the verb schemas. It then retrieves a popularity weight table that provides the popularity scores for the respective schemas. A popularity-weighted score is then computed for each schema. The schema with the highest weighted score is selected, and the relation triples are generated based on the selected schema.
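The ranking just described can be sketched as follows. The description does not give the weighting formula, so this sketch assumes, purely for illustration, that the weighted score is the basic score plus the schema's share of the total popularity. The function name and the numbers in the usage note are hypothetical.

```python
def rank_schemas(basic_scores, popularity):
    """Return the best schema and the weighted score of every schema.

    Assumed formula (not taken from the description):
    weighted = basic score + popularity / total popularity.
    """
    total = sum(popularity.values()) or 1  # avoid division by zero
    weighted = {
        name: score + popularity.get(name, 0) / total
        for name, score in basic_scores.items()
    }
    best = max(weighted, key=weighted.get)
    return best, weighted
```

With basic scores of 2, 2 and 3 and popularity counts of 1, 3 and 1 for three schemas, the popularity share separates the first two schemas, while the third is still selected because of its higher basic score.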
[0036] In a generic relations extraction, all the possible entities and noun phrases are identified through the entity recognition module 124. Generic relation extraction comprises the steps of: retrieving from the index a list of entities/noun phrases from the paragraph to be processed to form a List A, and iterating over each of the entities of the List A; selecting a next entry Ei from the List A and retrieving a list of entities/noun phrases which are not within the same paragraph (i.e. from other paragraphs) to generate a List B; searching the intra-sentential results stored in the knowledge base to determine whether a relation exists between Ei and the List B, wherein the List B can be reused if it exists; searching the linguistic resources for Ei and obtaining all possible candidate relations; removing a candidate triple if its subject/object does not match the List B; and repeating the above steps until the last entry Ei in the List A has been processed.
[0037] In the example shown in FIG. 3A, a partial sentence is provided: "... most European countries especially England, Germany and France ...". "England", "Germany" and "France" are identified as possible entities, and "European countries" as a noun phrase. These identified nouns are then lemmatized. In this case, "European countries" is lemmatized as "European country" and the rest remain unchanged. Following that, the linguistic resources 105, such as VerbNet and/or FrameNet, are used for identifying verb(s). In this sentence, no verb can be identified. Thereafter, matching patterns are retrieved to trigger the corresponding rule to generate relation triples. In this example, three hyponym relation triples are generated: HYPONYM(England, European country); HYPONYM(Germany, European country); and HYPONYM(France, European country).
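The pattern-triggered rule in the FIG. 3A example can be illustrated with a single lexical pattern of the form "NP especially NP, NP and NP". The regular expression, the two-word limit on the hypernym phrase and the lemma table are simplifying assumptions for this sketch and are not taken from the pattern database 104.

```python
import re

# Hypothetical lemma table covering only the noun phrase in this example.
LEMMAS = {"European countries": "European country"}

# Assumed pattern: a two-word noun phrase, "especially", then a list of names.
PATTERN = re.compile(r"(?P<hypernym>\w+ \w+),? especially (?P<hyponyms>[\w ,]+)")

def extract_hyponyms(fragment):
    """Return HYPONYM triples for an 'NP especially A, B and C' fragment."""
    match = PATTERN.search(fragment)
    if match is None:
        return []
    hypernym = LEMMAS.get(match.group("hypernym"), match.group("hypernym"))
    # Split the enumeration on commas and on the conjunction "and".
    names = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group("hyponyms").strip())
    return [("HYPONYM", name, hypernym) for name in names if name]
```

For the fragment "most European countries especially England, Germany and France" the sketch yields one HYPONYM triple for each of England, Germany and France, each with "European country" as the second argument.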
[0038] Following the above, as shown in FIG. 3B, the system 100 goes on to search the linguistic resources 105, such as WordNet, and retrieves all possible properties related to the entities/noun phrases to generate holonym relation triples: Holonym(Germany, Europe); Holonym(Germany, European Union); and Holonym(Germany, European Economic Community). A similarity measure between the additional property labels and all the entities/noun phrases identified in the sentence is computed, and a respective score is assigned to each holonym. The candidate relation triples are filtered by comparing the computed similarity scores against a specified threshold. In this case, "Europe" and "European country" obtain a similarity match score of 73.33%, whereas "European Union" and "European country" obtain a match score of 54.55%.
[0039] FIG. 4 exemplifies a further example of generic relation extraction, based on the sentence "Samsung release Galaxy Mini, another smartphone for Android fans.", in another embodiment of the present invention. Similarly, the system 100 first identifies the possible entities/noun phrases from the example: the entities Samsung, Galaxy Mini, etc., and the noun phrase "smartphone". The identified entities are then lemmatized accordingly. The linguistic resources 105 are then used to identify a verb, if any. In this example, "release" is identified as a verb. The identified verb(s), entities and nouns are marked up accordingly for matching with the pattern database 104 according to the structure of the sentence.
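By way of illustration only, matching a marked-up sentence against a stored pattern might be sketched as follows. The chunk tags, the `PATTERN` tuple and the `match_pattern` function are hypothetical names introduced for this sketch, and the chunking of the sentence into tagged phrases is assumed to have been done already.

```python
# Hypothetical stored pattern: noun phrase, verb phrase, two noun phrases,
# and a trailing prepositional phrase attached to the last noun phrase.
PATTERN = ("NP", "VP", "NP", "NP", "PP")

def match_pattern(chunks):
    """Return (subject, verb, object) triples if the chunk tags fit PATTERN."""
    tags = tuple(tag for tag, _ in chunks)
    if tags != PATTERN:
        return []  # no rule is triggered for this sentence structure
    subject, verb, first_object, second_object = (text for _, text in chunks[:4])
    return [(subject, verb, first_object), (subject, verb, second_object)]

# Marked-up form of the FIG. 4 sentence (markup assumed for this sketch).
chunks = [
    ("NP", "Samsung"),
    ("VP", "release"),
    ("NP", "Galaxy Mini"),
    ("NP", "smartphone"),
    ("PP", "for Android fans"),
]
triples = match_pattern(chunks)
```

Representing the pattern as a tuple of tags keeps the rule lookup to a simple equality test; a fuller pattern database would need optional and repeated elements.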
[0040] The matching patterns, if any can be found, trigger the corresponding rule to generate the relation triples. Relation triples are well known in the field of the Resource Description Framework (RDF). For example, hyponymy or is-a relations, meronymy or part-of relations, synonymy, etc. may be utilized for determining relations between the phrases. The rules and matching patterns are used to extract generic relations, which are preferably of a taxonomic type derivable from a knowledge base, e.g. hyponyms, meronyms, synonyms, hypernyms, holonyms, antonyms, etc. By way of illustration, not limitation, in a sentence like "... such exotic fruit as kiwis, mangoes, pineapples or coconuts ...", it is possible to extract that kiwi is-a exotic fruit, mango is-a exotic fruit, etc. In another sentence like "... basement of a building ...", it is possible to extract that basement part-of building. In yet another sentence, "... United Kingdom or Great Britain ...", it is possible to extract that United Kingdom same-as Great Britain. This can be achieved by identifying certain patterns, which may consist of specific words or terms defining the patterns.
[0041] Still referring to FIG. 4, two relation triples can be matched: (Samsung, release, Galaxy Mini) and (Samsung, release, smartphone), based on a predefined pattern: NP1+VP1+(NP2,[NP3+PP3]), where NP1, NP2 and NP3 refer to noun phrases, VP1 refers to a verb phrase and PP3 refers to a prepositional phrase.
[0042] Following that, the linguistic resources 105 are searched again to retrieve the possible properties related to the entities/noun phrases. A similarity measure between the additional property labels and all the entities/noun phrases identified in the sentence is computed. The candidate relation triples are then filtered by comparing the computed similarity scores against a specified threshold.
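The similarity-based filtering can be sketched as below. The description does not name the similarity measure, so this sketch substitutes the ratio computed by Python's standard `difflib.SequenceMatcher` as a stand-in. The function names and the threshold value used in the usage note are assumptions.

```python
from difflib import SequenceMatcher

def similarity(label, phrase):
    """Case-insensitive string similarity in the range 0.0 to 1.0 (stand-in measure)."""
    return SequenceMatcher(None, label.lower(), phrase.lower()).ratio()

def filter_triples(candidates, phrases, threshold):
    """Keep candidate triples whose property label is close to some phrase."""
    kept = []
    for relation, subject, label in candidates:
        best = max((similarity(label, p) for p in phrases), default=0.0)
        if best >= threshold:
            kept.append((relation, subject, label, round(best, 4)))
    return kept
```

Under this stand-in measure, "Europe" and "European Union" score about 0.5455 and 0.7333 respectively against "European country", so a hypothetical threshold of 0.6 retains only the "European Union" candidate. Whether the percentages quoted in the description were produced by the same measure is not stated.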
[0043] Returning to step 210, the semantic relation extraction comprises the steps of: retrieving from the index a list of entities/noun phrases from the paragraph/sentence to be processed to form a List A, and iterating over each of the entities thereof; selecting a next entry Ei from the List A and retrieving a list of entities/noun phrases which are not within the same paragraph (i.e. from other paragraphs) to generate a List B; searching the intra-sentential results stored in the knowledge base to determine whether a relation exists between Ei and the List B, wherein the List B can be reused if it exists; searching the Linked Data 103 for Ei and obtaining all possible candidate relations; removing a candidate triple if its subject/object does not match the List B; and repeating the above steps until the last entry Ei in the List A has been processed. In general, the generic relations extraction differs from the semantic relations extraction in that the latter searches the Linked Data 103, rather than the linguistic resources 105, for candidate relations.
[0044] FIG. 5 exemplifies a sentence, "The Bugatti Veyron EB 16.4 is a mid-engined grand touring car by the Volkswagen Group in France", which is to be processed by the system 100. In this sentence, the entities/noun phrases "Bugatti", "Veyron", etc. can be identified. No verb, on the other hand, can be identified, and no patterns can be matched either. Through the Linked Data 103, a class property label "Sports car" can be retrieved for the entity "Bugatti", and through a list of predefined property rules that denote the specific relation to be triggered, an isType relation can be applied to generate a relation triple: isType(Bugatti, Veyron, Sports Car).
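The property-rule lookup in this example might be sketched as follows. The in-memory dictionary standing in for the Linked Data 103, the rule table and the function name are all assumptions for this sketch. It produces a simplified two-argument triple and does not attempt to reproduce the exact form of the triple given above.

```python
# Hypothetical in-memory stand-in for the Linked Data 103.
LINKED_DATA = {"Bugatti": {"class": ["Sports car"]}}

# Hypothetical property rules: property name -> relation to be triggered.
PROPERTY_RULES = {"class": "isType"}

def semantic_triples(entities):
    """Generate relation triples from property labels found for each entity."""
    triples = []
    for entity in entities:
        for prop, labels in LINKED_DATA.get(entity, {}).items():
            relation = PROPERTY_RULES.get(prop)
            if relation is None:
                continue  # no rule is defined for this property
            triples.extend((relation, entity, label) for label in labels)
    return triples
```

Calling `semantic_triples(["Bugatti", "Veyron"])` on this toy data yields a single isType triple linking "Bugatti" to "Sports car"; entities with no entry in the stand-in dictionary contribute nothing.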
[0045] FIG. 6 exemplifies another sentence, "John bought Mary a Ferrari", which is to be processed by the system 100. In this example, when determining semantic relations in the intra-sentential context, similar steps are followed as described in the previous examples. The only difference from the previous scenarios is that the verb is identified but no patterns are available for it. In that case, all possible schemas related to the target verb are extracted from the linguistic resources 105.
[0046] VerbNet, for example, offers a comprehensive resource for verbs. Through this linguistic resource, the use of syntax parsing of the sentences to identify verbs can be eliminated. Subsequently, FrameNet, for example, can be used to identify all schemas that match the identified concepts. To match the concepts with the schemas, the selectional constraints for each of the semantic roles must be met.
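The selectional-constraint check can be illustrated with a small concept hierarchy. The hierarchy, the entity types and the rule that a role is satisfied when the entity falls under at least one of its constraints are assumptions made for this sketch; the entity names are taken from the FIG. 6 sentence.

```python
# Hypothetical concept hierarchy, child -> parent.
HIERARCHY = {
    "person": "animate",
    "animate": "entity",
    "car": "vehicle",
    "vehicle": "entity",
}

# Hypothetical types for the entities of the FIG. 6 sentence.
ENTITY_TYPES = {"John": "person", "Mary": "person", "Ferrari": "car"}

def is_a(concept, target):
    """Walk up the hierarchy from concept and report whether target is reached."""
    while concept is not None:
        if concept == target:
            return True
        concept = HIERARCHY.get(concept)
    return False

def satisfies(entity, constraints):
    """True if the entity's type falls under at least one constraint."""
    return any(is_a(ENTITY_TYPES.get(entity), c) for c in constraints)
```

With these tables, "John" satisfies a role constrained to animate or action concepts because a person is animate, whereas "Ferrari" does not.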
[0047] In this given example, the identified entities are "John", "Mary", "Ferrari", etc. The identified verb is "buy", being the lemmatized form of "bought". Through the linguistic resources, the schema "Agent v Beneficiary Theme" can be identified. A semantic role agnt has the selectional constraint {animate, action}. In this case, John, who is a person, matches the first constraint "animate" (i.e. a living person); a concept hierarchy is used as the reference to determine this.
[0048] FIG. 7 shows the process of ranking the schemas with a popularity weight to select the schema used to generate the corresponding relation triples, using the example of FIG. 6. The sentence 701 is processed to identify an appropriate schema from among Schema 1, Schema 2 and Schema 3 in a schemas list 702. As explained above, a basic ranking score is computed for each of the schemas based on matches with the identified concepts. The appropriate schema is selected based on the highest weighted score of those assigned to the schemas. Based on its current popularity as shown in table 704, each schema is further weighted to obtain a weighted ranking score as shown in table 703. When two schemas have a similar basic score, the popularity weight is added to differentiate them. In this case, even though Schema 2 has a higher weighted ranking score than Schema 1, Schema 3 already has a higher basic score; the popularity score for Schema 3 is therefore incremented by 1 and the table 704 is updated accordingly.
[0049] FIG. 8 shows the steps involved in relation discovery in the inter-sentential context. At step 802, a list of entities/NPs, List A, is retrieved from the index 222 and each of the items on the list is iterated over, starting from the first entry. At step 804, a next entry Ei is selected from List A and a list of entities/NPs which are not within the same paragraph is retrieved as List B. At step 806, the intra-sentential results stored in the knowledge base 215 are searched to determine whether a relation exists between Ei and List B (an existing relation is reused). The Linked Data 103 is then searched for Ei to obtain all possible candidate relations. At step 808, a candidate triple is removed if its subject/object does not match List A or if it falls below a similarity threshold. The matched results are stored in the knowledge base. The process is repeated until the last entry in List A has been processed.
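A minimal sketch of the loop of FIG. 8 is given below. The data structures (an index keyed by paragraph, a knowledge base held as a list of triples, and a dictionary standing in for the Linked Data 103) and all entity and relation names are hypothetical. The sketch keeps a candidate only when its object appears in List B and omits the similarity threshold for brevity.

```python
def discover_inter_sentential(index, knowledge_base, linked_data):
    """Return (subject, relation, object) triples that span paragraphs."""
    triples = []
    for paragraph_id, list_a in index.items():
        # List B: entities/noun phrases from every other paragraph.
        list_b = {e for pid, ents in index.items() if pid != paragraph_id for e in ents}
        for entity in list_a:
            # Reuse stored results if a relation to List B already exists.
            known = [t for t in knowledge_base if t[0] == entity and t[2] in list_b]
            if known:
                triples.extend(known)
                continue
            # Otherwise look up candidate relations and filter them.
            for relation, obj in linked_data.get(entity, []):
                if obj in list_b:
                    triples.append((entity, relation, obj))
    return triples

# Placeholder data, invented for this sketch.
index = {0: ["Widget X"], 1: ["Acme Corp"]}
linked_data = {"Widget X": [("madeBy", "Acme Corp"), ("madeIn", "Nowhere")]}
found = discover_inter_sentential(index, [], linked_data)
```

In this toy run only the candidate whose object occurs in another paragraph survives, so `found` holds the single triple relating "Widget X" to "Acme Corp".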
[0050] While specific embodiments have been described and illustrated, it is understood that many changes, modifications, variations, and combinations thereof could be made to the present invention without departing from the scope of the invention.

Claims

1. A system (100) for carrying out a relation extraction from a machine-readable document (190) having sentences, the system (100) comprising:
a text preprocessing module (101) for lemmatizing the sentences into tokenized text and identifying paragraphs from the document (190);
a coreference resolution module (122) configured to resolve all possible anaphors;
an entity recognition module (124) for extracting entities from the sentence;
an entity resolution and disambiguation module (126) adapted for resolving all ambiguous entities / noun phrases, acronyms and abbreviations; and
a relation extraction module (128) configured to operably carry out a generic relation extraction and a semantic relations extraction for extracting triples (106) for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple for a sentence for selecting and storing a most appropriate triple.
2. The system (100) according to claim 1, wherein each triple (106) is computed with a basic ranking score by matching the identified concepts with the respective verb schemas, and weighted with their respective recorded popularity scores to obtain the weighted ranking score for each triple (106), the highest ranking score of which is selected as the most appropriate triple (106).
3. The system (100) according to claim 1, wherein the relation extraction module (128) carries out the generic relation extraction for resolving entities in an intra-sentential context, the generic relation extraction operationally extracts noun phrases and verb from the sentence, matches predefined patterns (104) from the linguistic resources (105), and searches the linguistic resources again with a Linked Data (103) for possible properties to generate the corresponding relation that the noun phrases and verb refer to in that sentence.
4. The system (100) according to claim 1, wherein the relation extraction module (128) carries out the generic relation extraction for resolving entities from an inter-sentential context, the relation extraction module (128) operationally renders a list of entities of noun phrases from a repository and another list of entities of noun phrases that are not within the same paragraph and the connecting verbs, extracts and matches both lists for schemas based on the verb identified from the linguistic resources (105), identifies possible properties to generate corresponding relations from a knowledge base (215), and searches the linguistic resources (105) for possible triples based on the properties found.
5. The system (100) according to claim 1, wherein the semantic relations extraction for entities in an intra-sentential context matches predefined patterns to extract and match schemas based on the identified verb, and searches entities or noun phrases through a linked database (103).
6. The system (100) according to claim 1, wherein the semantic relations extraction for entities in an inter-sentential context retrieves a list of entities and noun phrases from a repository and another list of entities, noun phrases that are not within the same paragraph and the connecting verbs in order to extract and match schemas based on the verb identified from the linguistic resources (105).
7. A method of carrying out a relation extraction from a machine-readable document (190) having sentences, the method comprising:
lemmatizing texts of the sentences into tokenized text and identifying paragraphs from the document (190);
resolving all possible anaphors through a coreference resolution module (122);
extracting (201) entities from the sentences through an entity recognition module (124) for extracting entities from the sentence;
resolving (203) all ambiguous entities / noun phrases, acronyms and abbreviations;
extracting a generic relation and a semantic relations (210, 215) for extracting triples (106) for a sentence based on the extracted entities, wherein a weighted ranking score is computed for each triple (106) for a sentence for selecting and storing a most appropriate triple (106).
8. The method according to claim 7, wherein the relation extraction further comprises:
computing a basic ranking score by matching the identified concepts to the respective verb schemas;
retrieving a popularity score for each of the schemas;
computing a popularity weighted score for each schema to obtain a weighted ranking score for each schema;
selecting a schema with the highest weighted ranking score; and
generating the relation triples based on the selected schema.
9. The method according to claim 7, wherein
the extracting of the generic relation among entities in an intra-sentential context further comprises:
extracting noun phrases and verb from linguistic resources to match the sentence with predefined patterns (104); and
searching the linguistic resources and a Linked Data for possible properties to generate the corresponding relation the sentence refers to; and
the extracting of the semantic relations for entities in an intra-sentential context further comprises:
matching predefined patterns to extract and match schemas based on the identified verb, and searching entities or noun phrases through a linked database (103).
10. The method according to claim 7, wherein
the generic relation extraction among entities in an inter-sentential context further comprises:
retrieving a list of entities and noun phrases from a repository and another list of entities for noun phrases that are not within the same paragraph and the connecting verbs;
extracting and matching schemas based on the verb identified from the linguistic resources (105) to identify relations from a knowledge base (215); and
searching the linguistic resources (105) for possible triples based on the properties found; and
the semantic relations extraction for entities in an inter-sentential context further comprises:
retrieving a list of entities and noun phrases from a repository and another list of entities for noun phrases that are not within the same paragraph and the connecting verbs; and
extracting and matching schemas based on the verb identified from the linguistic resources.
PCT/MY2014/000175 2013-11-27 2014-06-02 A method and system for automated relation discovery from texts WO2015080561A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
MYPI2013004282 2013-11-27
MYPI2013004282A MY186402A (en) 2013-11-27 2013-11-27 A method and system for automated relation discovery from texts

Publications (1)

Publication Number Publication Date
WO2015080561A1 true WO2015080561A1 (en) 2015-06-04

Family

ID=51690420

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/MY2014/000175 WO2015080561A1 (en) 2013-11-27 2014-06-02 A method and system for automated relation discovery from texts

Country Status (2)

Country Link
MY (1) MY186402A (en)
WO (1) WO2015080561A1 (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
CN106445911A (en) * 2016-03-18 2017-02-22 苏州大学 Anaphora resolution method and system based on microscopic topic structure
CN107783957A (en) * 2016-08-30 2018-03-09 中国电信股份有限公司 Ontology method and apparatus
US10002129B1 (en) 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
CN108595421A (en) * 2018-04-13 2018-09-28 北京神州泰岳软件股份有限公司 A kind of abstracting method, the apparatus and system of Chinese entity associated relationship
CN109359184A (en) * 2018-10-16 2019-02-19 苏州大学 English event synchronous anomalies method and system
US10394955B2 (en) 2017-12-21 2019-08-27 International Business Machines Corporation Relation extraction from a corpus using an information retrieval based procedure
CN110880142A (en) * 2019-11-22 2020-03-13 深圳前海微众银行股份有限公司 Risk entity acquisition method and device
CN110990525A (en) * 2019-11-15 2020-04-10 华融融通(北京)科技有限公司 Natural language processing-based public opinion information extraction and knowledge base generation method
CN111104790A (en) * 2018-10-10 2020-05-05 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting key relation and computer readable medium
EP3654227A1 (en) * 2018-11-16 2020-05-20 Babylon Partners Limited System for extracting semantic triples for building a knowledge base
CN111209348A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN113094469A (en) * 2021-04-02 2021-07-09 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113312922A (en) * 2021-04-14 2021-08-27 中国电子科技集团公司第二十八研究所 Improved chapter-level triple information extraction method
WO2021197905A1 (en) * 2020-03-31 2021-10-07 Crypteia Networks S.A. Horizontal learning methods and apparatus for extracting association rules
CN113535936A (en) * 2021-06-21 2021-10-22 杭州初灵数据科技有限公司 Deep learning-based regulation and regulation retrieval method and system
CN113779995A (en) * 2021-08-26 2021-12-10 北京科技大学 Scientific and technical literature data automatic extraction method and system based on text mining
US11210324B2 (en) 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN118171648A (en) * 2024-05-11 2024-06-11 中移(苏州)软件技术有限公司 Text extraction method and device, electronic equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090019032A1 (en) 2007-07-13 2009-01-15 Siemens Aktiengesellschaft Method and a system for semantic relation extraction
WO2009052277A1 (en) * 2007-10-17 2009-04-23 Evri, Inc. Nlp-based entity recognition and disambiguation
WO2010132790A1 (en) * 2009-05-14 2010-11-18 Collexis Holdings, Inc. Methods and systems for knowledge discovery

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ALEXANDER SCHUTZ ET AL: "RelExt: A Tool for Relation Extraction from Text in Ontology Extension", 1 January 2005, THE SEMANTIC WEB - ISWC 2005 LECTURE NOTES IN COMPUTER SCIENCE;;LNCS, SPRINGER, BERLIN, DE, PAGE(S) 593 - 606, ISBN: 978-3-540-29754-3, XP019022779 *
HEARST M A: "Automatic Acquisition of Hyponyms from Large Text Corpora", PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON COMPUTATIONALLINGUISTICS, XX, XX, 1 July 1992 (1992-07-01), pages 1 - 8, XP002269625 *
KAMEL NEBHI: "A Rule-Based Relation Extraction System using DBpedia and Syntactic Parsing", PROCEEDINGS OF THE NLP-DBPEDIA-2013 WORKSHOP CO-LOCATED WITH THE 12TH INTERNATIONAL SEMANTIC WEB CONFERENCE (ISWC 2013), 22 October 2013 (2013-10-22), Sydney, Australia, XP055172351 *

Cited By (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445911A (en) * 2016-03-18 2017-02-22 苏州大学 Anaphora resolution method and system based on microscopic topic structure
CN106445911B (en) * 2016-03-18 2022-02-22 苏州大学 Reference resolution method and system based on micro topic structure
US11210324B2 (en) 2016-06-03 2021-12-28 Microsoft Technology Licensing, Llc Relation extraction across sentence boundaries
CN107783957B (en) * 2016-08-30 2021-05-18 中国电信股份有限公司 Ontology creating method and device
CN107783957A (en) * 2016-08-30 2018-03-09 中国电信股份有限公司 Ontology method and apparatus
CN106407180A (en) * 2016-08-30 2017-02-15 北京奇艺世纪科技有限公司 Entity disambiguation method and apparatus
US10002129B1 (en) 2017-02-15 2018-06-19 Wipro Limited System and method for extracting information from unstructured text
US10394955B2 (en) 2017-12-21 2019-08-27 International Business Machines Corporation Relation extraction from a corpus using an information retrieval based procedure
CN108595421A (en) * 2018-04-13 2018-09-28 北京神州泰岳软件股份有限公司 A kind of abstracting method, the apparatus and system of Chinese entity associated relationship
CN111104790B (en) * 2018-10-10 2024-03-22 百度在线网络技术(北京)有限公司 Method, apparatus, device and computer readable medium for extracting key relation
CN111104790A (en) * 2018-10-10 2020-05-05 百度在线网络技术(北京)有限公司 Method, device and equipment for extracting key relation and computer readable medium
CN109359184A (en) * 2018-10-16 2019-02-19 苏州大学 English event synchronous anomalies method and system
CN109359184B (en) * 2018-10-16 2020-08-18 苏州大学 English event co-fingering resolution method and system
EP3654227A1 (en) * 2018-11-16 2020-05-20 Babylon Partners Limited System for extracting semantic triples for building a knowledge base
CN111209348A (en) * 2018-11-21 2020-05-29 百度在线网络技术(北京)有限公司 Method and apparatus for outputting information
CN111209348B (en) * 2018-11-21 2023-09-29 百度在线网络技术(北京)有限公司 Method and device for outputting information
CN110990525A (en) * 2019-11-15 2020-04-10 华融融通(北京)科技有限公司 Natural language processing-based public opinion information extraction and knowledge base generation method
CN110880142B (en) * 2019-11-22 2024-01-19 深圳前海微众银行股份有限公司 Risk entity acquisition method and device
CN110880142A (en) * 2019-11-22 2020-03-13 深圳前海微众银行股份有限公司 Risk entity acquisition method and device
WO2021197905A1 (en) * 2020-03-31 2021-10-07 Crypteia Networks S.A. Horizontal learning methods and apparatus for extracting association rules
CN113094469A (en) * 2021-04-02 2021-07-09 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113094469B (en) * 2021-04-02 2022-07-05 清华大学 Text data analysis method and device, electronic equipment and storage medium
CN113312922B (en) * 2021-04-14 2023-10-24 中国电子科技集团公司第二十八研究所 Improved chapter-level triple information extraction method
CN113312922A (en) * 2021-04-14 2021-08-27 中国电子科技集团公司第二十八研究所 Improved chapter-level triple information extraction method
CN113535936A (en) * 2021-06-21 2021-10-22 杭州初灵数据科技有限公司 Deep learning-based regulation and regulation retrieval method and system
CN113535936B (en) * 2021-06-21 2024-02-13 杭州初灵数据科技有限公司 Deep learning-based regulation system retrieval method and system
CN113779995B (en) * 2021-08-26 2023-07-18 北京科技大学 Automatic extraction method and system for scientific and technological literature data based on text mining
CN113779995A (en) * 2021-08-26 2021-12-10 北京科技大学 Scientific and technical literature data automatic extraction method and system based on text mining
CN118171648A (en) * 2024-05-11 2024-06-11 中移(苏州)软件技术有限公司 Text extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
MY186402A (en) 2021-07-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14783676

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14783676

Country of ref document: EP

Kind code of ref document: A1