CN111797237B - Text entity relationship recognition method, system and medium - Google Patents

Info

Publication number
CN111797237B
Authority
CN
China
Prior art keywords
entity
relationship
pairs
pair
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010665101.6A
Other languages
Chinese (zh)
Other versions
CN111797237A (en)
Inventor
何柳
谢佳雨
陈伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202010665101.6A
Publication of CN111797237A
Application granted
Publication of CN111797237B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text entity relationship recognition method, system and medium are provided. The method includes: generating a support data set based on a reference text of a target field, wherein the support data set comprises positive instance entity pairs that belong to X preset designated relationship types and negative instance entity pairs that do not belong to the X preset designated relationship types; generating a query data set comprising at least one entity pair in a text to be tested in the target field; and determining, based on the support data set and the query data set, the relationship type of each entity pair in the query data set by using a machine learning model, wherein the machine learning model is trained in advance on data of fields other than the target field.

Description

Text entity relationship recognition method, system and medium
Technical Field
The present invention relates generally to the field of natural language processing, and more particularly, to a text entity relationship recognition method, system, and storage medium.
Background
Entity relationship recognition in text refers to the task of extracting the implicit relationships among entities in text. The existing mainstream relationship recognition methods fall into the following two categories:
(1) First, the relationships among the entities in texts of the target field are labeled, an entity recognition model is trained on the labeled corpus, and the trained entity recognition model is then used to recognize entity relationships in the text to be tested in the target field. On the one hand, this approach requires a large amount of labeled data; on the other hand, it is only suitable for texts in the target field, and the entity recognition model must be retrained with a labeled corpus of a new field before it can be used to recognize entity relationships in texts to be tested in that new field.
(2) Rule-based pattern matching, in which extraction is performed according to syntactic rules. This approach is rigid and extends poorly.
Disclosure of Invention
The invention aims to solve the problems that existing entity relationship recognition requires a large amount of labeled data and/or that an entity recognition model cannot be quickly applied to a new field.
According to a first aspect of the present invention, a text entity relationship recognition method is provided, including:
Generating a support data set based on a reference text of the target field, wherein the support data set comprises positive instance entity pairs belonging to preset X designated relation types and negative instance entity pairs not belonging to the preset X designated relation types; generating a query data set comprising at least one entity pair in the text to be tested in the target field; based on the support data set and the query data set, determining a relationship type of each entity pair in the query data set by using a machine learning model, wherein the machine learning model is trained in advance based on data of other fields except the target field.
Optionally, generating the support data set based on the reference text of the target field includes: acquiring a reference entity list of the reference text, wherein the reference entity list stores data of the positive instance entity pairs in the reference text; combining all the entities in the reference entity list into a plurality of entity pairs, and determining the type of each combined entity pair according to whether it belongs to the preset X designated relationship types; and extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merging all the extracted entity pairs into the support data set.
Optionally, combining all entities in the reference entity list into a plurality of entity pairs includes:
taking each entity in the reference entity list as a head entity and as a tail entity in turn, and forming entity pairs with the other entities in the reference entity list.
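The pairing rule above can be sketched in Python as follows. This is only an illustrative sketch: the function name `combine_entity_pairs` is hypothetical, and it assumes that the entity list contains no duplicates and that an entity is never paired with itself.

```python
from itertools import permutations


def combine_entity_pairs(entities):
    """Return every ordered (head, tail) pair of distinct entities.

    Each entity appears once as head and once as tail with every other
    entity, so n entities yield n * (n - 1) pairs.
    """
    return list(permutations(entities, 2))


# Entities taken from the example text used later in the description.
pairs = combine_entity_pairs(
    ["Zhou Xingchi", "actor self-maintenance", "comedy king"]
)
for head, tail in pairs:
    print(f"{head} -> {tail}")
```

With 3 entities this produces 3 × 2 = 6 ordered pairs, which matches the count given in the worked example of the description.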
Optionally, extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pair includes: extracting all positive instance entity pairs; the same number of negative instance entity pairs as positive instance entity pairs are extracted from all negative instance entity pairs.
Optionally, extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pair includes: randomly extracting at least one negative instance entity pair from all the negative instance entity pairs; or determining entity pairs of which the included entities meet the preset entity type conditions in all negative entity pairs, and randomly extracting at least one negative entity pair from the entity pairs meeting the preset entity type conditions.
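A minimal sketch of the two negative-sampling options above, assuming entity pairs are plain `(head, tail)` tuples. The helper name `sample_negative_pairs`, the `type_condition` callback and the entity type labels are hypothetical; matching the number of negatives to the number of positives follows the balanced option described above, capped by the number of available candidates.

```python
import random


def sample_negative_pairs(positive_pairs, negative_pairs,
                          type_condition=None, seed=None):
    """Randomly draw as many negative pairs as there are positive pairs.

    If type_condition is given, only negative pairs for which it returns
    True are eligible (the "preset entity type condition" option).
    """
    rng = random.Random(seed)
    candidates = [p for p in negative_pairs
                  if type_condition is None or type_condition(p)]
    count = min(len(positive_pairs), len(candidates))
    return rng.sample(candidates, count)


# Hypothetical toy data: one positive pair and three negative pairs.
positives = [("comedy king", "Zhou Xingchi")]
negatives = [
    ("Zhou Xingchi", "comedy king"),
    ("actor self-maintenance", "Zhou Xingchi"),
    ("comedy king", "actor self-maintenance"),
]
entity_types = {
    "Zhou Xingchi": "person",
    "comedy king": "work",
    "actor self-maintenance": "work",
}


def head_is_work(pair):
    # Assumed type condition: the head entity must be a work.
    return entity_types[pair[0]] == "work"


chosen = sample_negative_pairs(positives, negatives, head_is_work, seed=0)
unfiltered = sample_negative_pairs(positives, negatives, seed=0)
```

Restricting candidates by entity type tends to yield harder negatives, since the sampled pairs look more like plausible positives.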
Optionally, generating the query dataset including at least one entity pair in the text under test of the target field includes: acquiring a to-be-detected entity list of a to-be-detected text, wherein the to-be-detected entity list stores data of entities in the to-be-detected text; and combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into a query data set.
Optionally, combining the entities in the entity list to be tested into a plurality of entity pairs includes: taking each entity in the entity list to be tested as a head entity and as a tail entity in turn, and forming entity pairs with the other entities in the entity list to be tested; or determining, among the entities in the entity list to be tested, the entities that meet a preset entity type condition, and taking each of those entities as a head entity and as a tail entity in turn to form entity pairs with the other entities that meet the preset entity type condition.
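The query data set construction above, including the optional entity type filter, might look like the following sketch. The function name `build_query_set`, the `(text, entity_type)` tuple layout, and the example entities and types are assumptions made for illustration only.

```python
from itertools import permutations


def build_query_set(entities, allowed_types=None):
    """Build ordered head/tail pairs from (text, entity_type) tuples.

    If allowed_types is given, entities of other types are dropped
    before pairing (the "preset entity type condition" option).
    """
    if allowed_types is not None:
        entities = [e for e in entities if e[1] in allowed_types]
    return [{"h": head, "t": tail}
            for head, tail in permutations(entities, 2)]


# Hypothetical entity list of a text to be tested.
entities = [
    ("Company A", "company"),
    ("Company B", "company"),
    ("Mr. Li", "person"),
]
all_pairs = build_query_set(entities)
company_pairs = build_query_set(entities, allowed_types={"company"})
print(len(all_pairs), len(company_pairs))
```

Filtering first keeps the query data set small, because the number of pairs grows quadratically with the number of entities.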
Optionally, determining, based on the support data set and the query data set, the relationship type to which each entity pair in the query data set belongs by using a machine learning model includes: calculating, by using the machine learning model, the relationship scores of each entity pair in the query data set for the preset X designated relationship types in the support data set; and determining the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition.
Optionally, determining the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition includes any one of the following three options:
(1) If the relationship score of the entity pair is higher than a score threshold, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types.
(2) If the relationship score of the entity pair ranks among the top Y relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if it does not rank among the top Y, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types, wherein Y is a positive integer.
(3) If the relationship score of the entity pair is higher than the score threshold and also ranks among the top Y relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, or does not rank among the top Y, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types, wherein Y is a positive integer.
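The three score conditions (threshold only, top-Y only, or both) can be expressed with one small function. This is a sketch under stated assumptions: the name `select_positive` and its parameters are hypothetical, each query entity pair is assumed to have a single relationship score, and ties in the ranking are broken by input order, which the source does not specify.

```python
def select_positive(scores, threshold=None, top_y=None):
    """Return one boolean per entity pair: True if the pair is assigned to
    the designated relationship types, False if it falls into "other".

    threshold only -> score must be strictly higher than the threshold
    top_y only     -> score must rank among the top Y scores
    both           -> both conditions must hold
    """
    keep = [True] * len(scores)
    if threshold is not None:
        keep = [k and s > threshold for k, s in zip(keep, scores)]
    if top_y is not None:
        ranked = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)
        top = set(ranked[:top_y])
        keep = [k and i in top for i, k in enumerate(keep)]
    return keep


scores = [0.9, 0.4, 0.7, 0.2]
by_threshold = select_positive(scores, threshold=0.5)
by_rank = select_positive(scores, top_y=1)
by_both = select_positive(scores, threshold=0.5, top_y=3)
```

Combining both conditions guards against the case where fewer than Y pairs actually express a designated relationship: the ranking alone would still accept Y pairs, while the threshold rejects the low-scoring ones.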
Optionally, the text entity relationship recognition method further includes: outputting data information on the relationship type to which each entity pair in the query data set belongs, wherein the data information at least comprises the content of the head entity, the content of the tail entity, and the relationship type to which the entity pair belongs.
Optionally, the machine learning model comprises a small sample learning model.
According to a second aspect of the present invention, there is provided a text entity relationship recognition system, comprising: the support data set generation module is configured to generate a support data set based on a reference text of the target field, wherein the support data set comprises positive instance entity pairs belonging to preset X designated relation types and negative instance entity pairs not belonging to the preset X designated relation types; a query data set generation module configured to generate a query data set including at least one entity pair in a text to be tested of a target field; and a relationship type prediction module configured to determine a relationship type to which each entity pair in the query data set belongs using a machine learning model based on the support data set and the query data set, wherein the machine learning model is pre-trained based on data of other fields than the target field.
Optionally, the support dataset generation module is configured to: acquire a reference entity list of the reference text, wherein the reference entity list stores data of the positive instance entity pairs in the reference text; combine all the entities in the reference entity list into a plurality of entity pairs, and determine the type of each combined entity pair according to whether it belongs to the preset X designated relationship types; and extract at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merge all the extracted entity pairs into the support data set.
Optionally, the support dataset generation module is configured to: take each entity in the reference entity list as a head entity and as a tail entity in turn, and form entity pairs with the other entities in the reference entity list.
Optionally, the support dataset generation module is configured to: extracting all positive instance entity pairs; the same number of negative instance entity pairs as positive instance entity pairs are extracted from all negative instance entity pairs.
Optionally, the support dataset generation module is configured to: randomly extracting at least one negative instance entity pair from all the negative instance entity pairs; or determining entity pairs of which the included entities meet the preset entity type conditions in all negative entity pairs, and randomly extracting at least one negative entity pair from the entity pairs meeting the preset entity type conditions.
Optionally, the query data set generation module is configured to: acquiring a to-be-detected entity list of a to-be-detected text, wherein the to-be-detected entity list stores data of entities in the to-be-detected text; and combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into a query data set.
Optionally, the query data set generation module is configured to: take each entity in the entity list to be tested as a head entity and as a tail entity in turn, and form entity pairs with the other entities in the entity list to be tested; or determine, among the entities in the entity list to be tested, the entities that meet a preset entity type condition, and take each of those entities as a head entity and as a tail entity in turn to form entity pairs with the other entities that meet the preset entity type condition.
Optionally, the relationship type prediction module is configured to: calculate, by using the machine learning model, the relationship scores of each entity pair in the query data set for the preset X designated relationship types in the support data set; and determine the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition.
Optionally, the relationship type prediction module is configured to perform any one of the following three options:
(1) If the relationship score of the entity pair is higher than a score threshold, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types.
(2) If the relationship score of the entity pair ranks among the top Y relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if it does not rank among the top Y, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types, wherein Y is a positive integer.
(3) If the relationship score of the entity pair is higher than the score threshold and also ranks among the top Y relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, or does not rank among the top Y, determining that the entity pair belongs to a relationship category other than the preset X designated relationship types, wherein Y is a positive integer.
Optionally, the text entity relationship recognition system further comprises a result output module configured to: output data information on the relationship type to which each entity pair in the query data set belongs, wherein the data information at least comprises the content of the head entity, the content of the tail entity, and the relationship type to which the entity pair belongs.
Optionally, the machine learning model comprises a small sample learning model.
According to a third aspect of the present invention, a computer-readable storage medium storing instructions is presented, wherein the instructions, when executed by at least one computing device, cause the at least one computing device to perform the above-described text entity relationship identification method.
According to a fourth aspect of the present invention, a system is presented comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the above-described text entity relationship identification method.
According to the text entity relation recognition method, system and medium provided by the embodiment of the invention, the entity relation recognition in the text to be detected in the target field can be completed by utilizing the machine learning model trained in advance by data in other fields except the target field, so that the machine learning model can be applied to different fields and scenes without carrying out large-scale training again.
Specifically, a support data set is generated based on a reference text in the target field. The support data set contains two kinds of entity pairs (positive instance entity pairs and negative instance entity pairs) whose relationship types are entirely distinct. Because the negative instance entity pairs do not belong to the preset X designated relationship types, the relationship between the two entities in a negative instance entity pair does not need to be labeled in the data labeling stage, which greatly reduces the amount of data labeling. When determining the relationship type of each entity pair in the query data set, it is only necessary to choose among the preset X designated relationship types, that is, to determine whether the entity pair is a positive instance entity pair and, if so, which designated relationship type it belongs to. When an entity pair does not belong to the preset X designated relationship types, it can be directly classified into a relationship category other than the preset X designated relationship types. Adding negative instance entity pairs to the support data set can improve the accuracy and authenticity of relationship classification and entity relationship recognition, and can also improve the efficiency of entity relationship recognition.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 illustrates an architecture diagram of a machine learning model according to an exemplary embodiment of the present invention;
FIG. 2 illustrates a flowchart of a text entity relationship recognition method according to an exemplary embodiment of the present invention;
FIG. 3 illustrates a schematic diagram of the relationship between the number of entity pairs in the support data set and the F1 score of a machine learning model according to an exemplary embodiment of the invention;
FIG. 4 illustrates a block diagram of a text entity relationship recognition system according to an exemplary embodiment of the present invention;
Reference numerals illustrate:
200-a text entity relationship recognition system;
201-a support dataset generation module;
202-a query data set generation module;
203-a relationship type prediction module;
204-a result output module.
Detailed Description
The following description with reference to the accompanying drawings is provided to assist in a comprehensive understanding of embodiments of the invention defined by the claims and their equivalents. Various specific details are included to aid understanding, but are merely to be considered exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
The text entity relationship recognition method, system and medium of the exemplary embodiments of the present invention use a machine learning model to determine the relationship type to which each entity pair in a query data set (query-set) belongs, based on a support data set (support-set) and the query data set. The support data set may be seen as annotation data comprising N different relationship types, with K samples for each type. Determining the relationship type to which each entity pair in the query data set belongs amounts to determining which of the N relationship types in the support data set each entity pair in the query data set belongs to; this is also called solving the N-way-K-shot problem.
The invention does not limit the specific type of the machine learning model; the machine learning model can be any model capable of solving the N-way-K-shot problem. The training process of the machine learning model is described below by taking a small sample learning (few-shot learning) model as an example.
1. Model construction
The architecture of the machine learning model may employ an Encoder-Induction-Relation three-level framework. Fig. 1 shows a schematic architecture diagram of a machine learning model according to an exemplary embodiment of the invention, the architecture comprising an encoding layer (Encoder), an induction layer (Induction), and a relationship layer (Relation).
As an example, the encoding layer may use a typical structure such as a CNN, an LSTM or a Transformer, or may use the BERT structure for encoding, and is used to obtain a semantic representation of each sample. The induction layer may use a Siamese Network, a Prototypical Network, Meta Networks and the like, or may be the BERT-PAIR text classification model; it is used for inducing category features from the sample semantics of the support data set: each query instance is paired with every support instance, each concatenated instance pair is input, and a score is obtained indicating whether the two instances express the same relationship. The relationship layer is used for measuring the semantic relationship between the query data set and each category so as to complete classification; specifically, the relationship layer trains a CNN network with cross-entropy as the loss function. Over the support-query pairs, the model averages the scores of the same query instance over all the support instances of each category, and takes the average as the score of that query instance for that category.
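The final averaging step, in which the pairwise scores of one query instance are averaged over the support instances of each category, can be illustrated as follows. This is a sketch only: the function name `class_scores` and the numeric scores are made up for illustration, and the pairwise scores are assumed to have been produced already by a model such as BERT-PAIR.

```python
from collections import defaultdict


def class_scores(pair_scores, support_labels):
    """Average the pairwise scores of one query per support category.

    pair_scores[i]    -- score that the query and support instance i
                         express the same relationship
    support_labels[i] -- category of support instance i
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for score, label in zip(pair_scores, support_labels):
        sums[label] += score
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


# Hypothetical scores of one query against four support instances.
pair_scores = [0.8, 0.6, 0.1, 0.3]
support_labels = ["lead actor", "lead actor", "negative", "negative"]
scores = class_scores(pair_scores, support_labels)
predicted = max(scores, key=scores.get)
```

Here the query is assigned to the category with the highest average; the description later refines this with a score threshold and a top-Y ranking.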
Alternatively, the encoding layer may be replaced by LSTM, ALBERT, or the like.
Alternatively, the BERT-PAIR used by the induction layer may be replaced with a Prototypical Network or a Meta Network.
2. Training machine learning model using large-scale corpus
Before model training, the text needs to be converted into training data (train set) and verification data (val set) according to a preset format conversion rule, the formats of the training data and the verification data are the same, and the training data needs to include content of at least one entity pair in the text.
The training data (train set) has the following format:
{class_0: [{"tokens": [], "h": [entity, entity_type, entity_index_list], "t": [entity, entity_type, entity_index_list]}, …], class_1: [{"tokens": [], "h": [entity, entity_type, entity_index_list], "t": [entity, entity_type, entity_index_list]}, …], …}. Here, different classes correspond to the contents of different entity pairs; the class itself may refer to the relationship type to which the entity pairs belong, and the number of entity pairs depends on the number of entities included in the text.
tokens is the list of characters of the text; h is the head entity and contains the content of the head entity, the type of the head entity (which may be null), and the position of the head entity in the text; t is the tail entity and contains the content of the tail entity, the type of the tail entity (which may be null), and the position of the tail entity in the text.
Take the following text as an example: How to play your own role well? Please read "actor self-maintenance", the unique secret manual with which Zhou Xingchi rises from being down and out in "comedy king". The entities in this text include "Zhou Xingchi", "actor self-maintenance" and "comedy king", and these 3 entities may constitute 6 entity pairs, namely "actor self-maintenance-Zhou Xingchi", "Zhou Xingchi-actor self-maintenance", "comedy king-Zhou Xingchi", "Zhou Xingchi-comedy king", "comedy king-actor self-maintenance" and "actor self-maintenance-comedy king", each entity pair consisting of a head entity and a tail entity.
Assuming that class_0 corresponds to the content of "comedy king-Zhou Xingchi", class_0 is specifically:
{"lead actor": [{"tokens": ["how", "show", "good", "self", "known", "corner", "color", ",", "please", "read", "《", "show", "member", "self", "me", "repair", "nourish", "》", "《", "happy", "play", "king", "》", "in", "week", "star", "run", "grow", "start", "in", "bad", "reverse", "medium", "unique", "secret", "play"], "h": ["comedy king", "film and television work", [[21, 22, 23, 24]]], "t": ["Zhou Xingchi", "person", [[26, 27, 28]]]}]}. The tokens are the individual characters of the original Chinese sentence, rendered here as one English word per character.
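The record layout described above (a relationship type mapping to a list of samples, each with `tokens`, `h` and `t`) can be built and checked with a few lines of Python. The helper names `make_sample` and `is_valid_sample`, the English example sentence and the "founder" relationship type are hypothetical; only the key names and the `[entity, entity_type, position_list]` ordering come from the format above.

```python
def make_sample(tokens, head, head_type, head_pos, tail, tail_type, tail_pos):
    """Build one training sample in the {tokens, h, t} layout."""
    return {
        "tokens": list(tokens),
        # [entity content, entity type (may be None), list of position lists]
        "h": [head, head_type, [list(head_pos)]],
        "t": [tail, tail_type, [list(tail_pos)]],
    }


def is_valid_sample(sample):
    """Check that entity positions point inside the token list."""
    n = len(sample["tokens"])
    for key in ("h", "t"):
        _, _, position_lists = sample[key]
        for positions in position_lists:
            if not all(0 <= i < n for i in positions):
                return False
    return True


# Hypothetical English example with word-level tokens.
tokens = ["Alice", "founded", "Acme", "in", "2010"]
sample = make_sample(tokens, "Acme", "company", [2], "Alice", "person", [0])
train_set = {"founder": [sample]}
print(is_valid_sample(sample))
```

A validity check of this kind is useful after truncating long texts, since an entity position may then fall outside the retained tokens.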
The user may specify the maximum sentence length of the text as max_length. As an example, when max_length is set to 128, the machine learning model supports texts of at most 128 characters, and a text longer than 128 characters is truncated.
The entity pairs contained in the training data cover M relationship types, with S entity pairs as samples under each relationship type. The user specifies N and K of the N-way-K-shot problem, as well as the number Q of entity pairs in the query data set during training. In each batch, N relationship types are randomly extracted from the M relationship types, and K entity pairs are randomly extracted from each of the N extracted relationship types as training samples (N×K samples in total); these N×K samples serve as the support data set.
Q entity pairs are randomly drawn from the N extracted relationship types (without repeating the entity pairs in the support data set), and another Q entity pairs are drawn from the relationship types other than the N extracted relationship types; these 2Q entity pairs serve as the query data set for learning, that is, for judging which of the N relationship types in the support data set each entity pair in the query data set belongs to.
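The sampling of one training batch can be sketched as below. This is a simplified illustration under stated assumptions: the function name `sample_episode` is hypothetical, Q query entity pairs are drawn per sampled relationship type, and the additional draw from relationship types outside the sampled N is omitted for brevity.

```python
import random


def sample_episode(data, n_way, k_shot, q_query, seed=None):
    """Sample one N-way-K-shot episode.

    data maps each relationship type to its list of entity-pair samples.
    Support and query samples of a type are drawn together and then
    split, so they never overlap.
    """
    rng = random.Random(seed)
    classes = rng.sample(sorted(data), n_way)
    support, query = [], []
    for label in classes:
        picked = rng.sample(data[label], k_shot + q_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


# Hypothetical corpus: M = 5 relationship types, S = 10 samples each.
data = {f"rel_{i}": [f"pair_{i}_{j}" for j in range(10)] for i in range(5)}
support, query = sample_episode(data, n_way=2, k_shot=3, q_query=2, seed=0)
print(len(support), len(query))
```

Each type needs at least K + Q samples (S ≥ K + Q); otherwise `random.sample` raises a `ValueError`.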
During training, the accuracy of the judgment results is evaluated using the verification data. If the currently trained model is not the first trained model, it is compared with the best historical model, and if it performs better than the historical model, the currently trained model is saved. If the accuracy of the currently trained model reaches the target requirement, training ends; if it does not, the next round of training continues.
In the small sample learning model, meta-tasks are obtained by sampling the training data and are used to train the model, so that the model learns what the meta-tasks have in common and forgets the task-specific parts of each meta-task, and can therefore classify better when facing a new meta-task.
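The validate, compare and save logic described above can be sketched as a generic loop. The function `train_with_checkpointing` and its callback parameters are hypothetical placeholders for a real training round and a real validation run; the cap `max_rounds` is an added assumption so that the loop always terminates.

```python
def train_with_checkpointing(train_one_round, evaluate,
                             target_accuracy, max_rounds):
    """Train round by round, keep the best model, stop at the target.

    train_one_round() returns the model after one more round of training;
    evaluate(model) returns its accuracy on the verification data.
    """
    best_model, best_accuracy = None, float("-inf")
    for _ in range(max_rounds):
        model = train_one_round()
        accuracy = evaluate(model)
        if accuracy > best_accuracy:      # better than every earlier model
            best_model, best_accuracy = model, accuracy
        if accuracy >= target_accuracy:   # target requirement reached
            break
    return best_model, best_accuracy


# Toy stand-ins: each "model" is a dict holding its round number.
history = []
fake_accuracy = {0: 0.60, 1: 0.55, 2: 0.82, 3: 0.90}


def fake_round():
    model = {"round": len(history)}
    history.append(model)
    return model


best, acc = train_with_checkpointing(
    fake_round, lambda m: fake_accuracy[m["round"]],
    target_accuracy=0.8, max_rounds=10,
)
```

In the toy run, the second round performs worse than the first and is not kept, and training stops after the third round because its accuracy exceeds the target.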
Fig. 2 illustrates a flowchart of a text entity relationship recognition method according to an exemplary embodiment of the present invention.
The method shown in fig. 2 may be implemented entirely in software by a computer program, and the method shown in fig. 2 may also be performed by a specifically configured computing device.
In step S101, a support data set is generated based on the reference text of the target field. The support data set comprises positive example entity pairs belonging to the preset X specified relationship types and negative example entity pairs not belonging to the preset X specified relationship types.
The reference text of the target field includes a plurality of entities, and step S101 needs to combine the entities in the reference text into entity pairs to generate a support data set including the plurality of entity pairs. The invention classifies entity pairs according to relation types, takes entity pairs belonging to preset X appointed relation types as positive example entity pairs, and takes entity pairs not belonging to preset X appointed relation types as negative example entity pairs.
As an example, all negative example entity pairs can be attributed to a single negative example relationship type, in which case there are (X+1) relationship types in the support data set, and N in the foregoing N-way-K-shot problem is (X+1). For example, when only one specified relationship type is set, N in the N-way-K-shot problem is 2.
The specific content of a given relationship type may depend on the actual design requirements, such as the "is_a" relationship and the "investment" relationship described later.
As an example, generating a support dataset based on reference text for a target domain, comprising: acquiring a reference entity list of a reference text, wherein the reference entity list stores data of a positive example entity pair in the reference text; combining all the entities in the reference entity list into a plurality of entity pairs, and determining the type of each combined entity pair according to whether each combined entity pair belongs to the preset X specified relationship types; and extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merging all the extracted entity pairs into a supporting data set.
The user needs to mark the reference text of the target field in advance to form a reference entity list corresponding to the reference text. The reference entity list can only store the data of the positive instance entity pairs in the reference text, the data of the negative instance entity pairs are not required to be stored, and the negative instance entity pairs can be generated according to the entities in the reference entity list in the subsequent steps, so that the data quantity required to be marked by a user is obviously reduced, the workload of the user can be lightened, the manual participation degree in the text entity relation recognition work is reduced, and the text entity relation recognition efficiency is improved.
Take as an example the following reference text in the financial field (hereinafter referred to as the "first reference text"): "A network loan information intermediary institution is a financial information intermediary company that is established in accordance with the law and specializes in network loan information intermediary business activities." The entities in the first reference text include "network loan information intermediary institution" and "financial information intermediary company".
The format of the reference entity list of the first reference text is: {"text":"A network loan information intermediary institution is a financial information intermediary company that is established in accordance with the law and specializes in network loan information intermediary business activities","spo_list":[{"object_index":{"begin":34,"end":42},"subject_index":{"begin":0,"end":10},"predicate":"is_a","object_type":"","subject_type":"","object":"financial information intermediary company","subject":"network loan information intermediary institution"}]}.
Here, text is the original text content and spo_list is the relationship list, which stores all labeled relationships. Each item in the relationship list is the relationship labeling data of one entity pair, which records the head entity content, the tail entity content, and the relationship type of the entity pair. Specifically, predicate is the relationship type; subject is the head entity content, subject_index is the position of the head entity in the first reference text, and subject_type is the specific type of the head entity (may be null); object is the tail entity content, object_index is the position of the tail entity in the first reference text, and object_type is the specific type of the tail entity (may be null).
The first reference text includes two entities, "financial information intermediary company" and "network loan information intermediary institution", which can form two entity pairs, "financial information intermediary company-network loan information intermediary institution" and "network loan information intermediary institution-financial information intermediary company". The specified relationship type is "is_a". The two entities are in a positive example relationship only when "network loan information intermediary institution" is the head entity and "financial information intermediary company" is the tail entity; that is, only the entity pair "network loan information intermediary institution-financial information intermediary company" belongs to "is_a", so the reference entity list contains only the content of this entity pair. Here, "is_a" can be interpreted as the head entity and the tail entity of the entity pair having an "is a" relationship.
According to the text entity relationship recognition method of the exemplary embodiment of the invention, all entities in the reference entity list are combined into a plurality of entity pairs, and the type of each combined entity pair is determined according to whether it belongs to the preset X specified relationship types. This process recombines the entities of the labeled positive example entity pairs in the reference text to form negative example entity pairs, so that the support data set is more consistent with the distribution of real data. Moreover, the negative example entity pairs are generated automatically from the positive example entity pairs and need not be labeled manually beforehand, which reduces the amount of data labeling. Adding negative example entity pairs to the support data set can improve the accuracy and realism of relationship classification and entity relationship recognition as well as the efficiency of entity relationship recognition, making the method more valuable in actual business.
As an example, combining all entities in the reference entity list into a plurality of entity pairs includes: taking each entity in the reference entity list as a head entity and as a tail entity respectively, and forming entity pairs with the other entities in the reference entity list.
By way of example, the specific procedure of composing the entity pair may be as follows:
Traverse all relationship lists in the reference entity list, and uniformly store each head entity (subject, subject_index, subject_type) and each tail entity (object, object_index, object_type) in an entity set as (entity, entity_index, entity_type), where entity is the entity content, entity_index is the position of the entity in the first reference text, and entity_type is the specific type of the entity (may be null).
Combining the entities in the entity set into a plurality of entity pairs, and indicating that the entity pair is a positive entity pair when the relationship type to which the entity pair belongs is a designated relationship type in a relationship list; when the relationship type to which the entity pair belongs does not belong to the relationship type specified in the relationship list, the entity pair is indicated to be a negative example entity pair.
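The two steps above can be sketched as follows, assuming a simplified relationship list that keeps only the subject, object and predicate fields. The function name `build_labeled_pairs`, the `"negative"` label and the simplified output dictionaries are illustrative assumptions, not the storage format of the source.

```python
from itertools import permutations

NEGATIVE = "negative"  # hypothetical label for the negative example relationship type


def build_labeled_pairs(spo_list, specified_types):
    """Combine the annotated entities into ordered pairs and label each pair."""
    positives = {}
    entities = []
    for spo in spo_list:
        if spo["predicate"] in specified_types:
            positives[(spo["subject"], spo["object"])] = spo["predicate"]
        # Store head and tail entities uniformly in one entity set.
        for name in (spo["subject"], spo["object"]):
            if name not in entities:
                entities.append(name)
    # Every ordered (head, tail) combination of two distinct entities.
    return [
        {"h": head, "t": tail, "relation": positives.get((head, tail), NEGATIVE)}
        for head, tail in permutations(entities, 2)
    ]


spo_list = [{
    "subject": "network loan information intermediary institution",
    "object": "financial information intermediary company",
    "predicate": "is_a",
}]
pairs = build_labeled_pairs(spo_list, specified_types={"is_a"})
for pair in pairs:
    print(pair["h"], "-", pair["t"], ":", pair["relation"])
```

With the single labeled "is_a" pair of the first reference text, the sketch yields one positive example entity pair and, by reversing head and tail, one negative example entity pair.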
After determining the relationship type of the entity pair, storing each entity pair according to the following format:
{"tokens":[],"h":[entity,entity_type,entity_index_list],"t":[entity,entity_type,entity_index_list]}. Here, tokens is the list of text characters; h is the head entity, including the head entity content, the type of the head entity (may be null), and the position of the head entity in the reference text; and t is the tail entity, including the tail entity content, the type of the tail entity (may be null), and the position of the tail entity in the reference text.
It should be noted that two entities can be combined into two entity pairs. For example, entity A and entity B can form the entity pair "A-B" and the entity pair "B-A", where the first entity in each entity pair is the head entity and the second is the tail entity. Moreover, the two entity pairs formed from two entities may or may not belong to the same relationship type.
Take as an example the following reference text in the financial field (hereinafter referred to as the "second reference text"): "On April 26, 2017, Company A, Company B and Company C invested RMB 1.26 billion in Company D. Company D is located in Xiamen City, Fujian Province, China. It was established on August 27, 1988. Corporate legal person: Meng Keliang." The entities in the second reference text include "Company A", "Company B", "Company C", "Company D", "April 26, 2017", "August 27, 1988", "Meng Keliang", and the like.
When "investment" is taken as the specified relationship type, only the entity pairs formed with "Company A", "Company B" or "Company C" as the head entity and "Company D" as the tail entity belong to the specified relationship type.
The positive example entity pairs formed by combining the entities in the second reference text include the following 3 entity pairs: "Company A-Company D", "Company B-Company D" and "Company C-Company D".
The negative example entity pairs formed by combining the entities in the second reference text include at least the following 39 entity pairs: Company A-Company B; Company A-Company C; Company A-April 26, 2017; Company A-August 27, 1988; Company A-Meng Keliang; Company B-Company A; Company B-Company C; Company B-April 26, 2017; Company B-August 27, 1988; Company B-Meng Keliang; Company C-Company A; Company C-Company B; Company C-April 26, 2017; Company C-August 27, 1988; Company C-Meng Keliang; Company D-Company A; Company D-Company B; Company D-Company C; Company D-April 26, 2017; Company D-August 27, 1988; Company D-Meng Keliang; April 26, 2017-Company A; April 26, 2017-Company B; April 26, 2017-Company C; April 26, 2017-Company D; April 26, 2017-August 27, 1988; April 26, 2017-Meng Keliang; August 27, 1988-Company A; August 27, 1988-Company B; August 27, 1988-Company C; August 27, 1988-Company D; August 27, 1988-April 26, 2017; August 27, 1988-Meng Keliang; Meng Keliang-Company A; Meng Keliang-Company B; Meng Keliang-Company C; Meng Keliang-Company D; Meng Keliang-April 26, 2017; Meng Keliang-August 27, 1988.
And then at least one positive instance entity pair and at least one negative instance entity pair are extracted from the combined entity pairs, and all the extracted entity pairs are combined into a supporting data set.
As an example, extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pair includes: extracting all positive instance entity pairs; the same number of negative instance entity pairs as positive instance entity pairs are extracted from all negative instance entity pairs.
Continuing with the second reference text as an example, the entities in the second reference text are combined into 3 positive example entity pairs, and all 3 positive example entity pairs are extracted. From the total of 39 negative example entity pairs formed by combining the entities in the second reference text, 3 negative example entity pairs are extracted, for example "Company B-Company A", "August 27, 1988-Company D", and a pair with "Meng Keliang" as the head entity and a company as the tail entity.
And combining the extracted 3 positive instance entity pairs and 3 negative instance entity pairs into a supporting data set, wherein the supporting data set comprises data of 6 entity pairs.
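The balanced extraction for the second reference text can be sketched as follows. The entity names follow the example (the fourth company is written as "Company D" here), while the function name `split_and_sample` and the use of plain (head, tail) tuples are assumptions made for illustration.

```python
import random
from itertools import permutations

ENTITIES = [
    "Company A", "Company B", "Company C", "Company D",
    "April 26, 2017", "August 27, 1988", "Meng Keliang",
]
# Pairs belonging to the specified relationship type "investment".
POSITIVE = {(head, "Company D") for head in ("Company A", "Company B", "Company C")}


def split_and_sample(entities, positive, rng):
    """Return all positives, all negatives, and a balanced negative sample."""
    all_pairs = list(permutations(entities, 2))  # 7 * 6 = 42 ordered pairs
    positives = [pair for pair in all_pairs if pair in positive]
    negatives = [pair for pair in all_pairs if pair not in positive]
    # Extract as many negative pairs as there are positive pairs.
    sampled = rng.sample(negatives, len(positives))
    return positives, negatives, sampled


positives, negatives, sampled = split_and_sample(ENTITIES, POSITIVE, random.Random(0))
support_set = positives + sampled
print(len(positives), len(negatives), len(support_set))
```

With the seven listed entities this gives 42 ordered pairs, of which 3 are positive and 39 are negative, so the resulting support data set holds 6 entity pairs.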
As an example, extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pair includes: randomly extracting at least one negative instance entity pair from all the negative instance entity pairs; or determining entity pairs of which the included entities meet the preset entity type conditions in all negative entity pairs, and randomly extracting at least one negative entity pair from the entity pairs meeting the preset entity type conditions.
Continuing to take the second reference text as an example, based on at least 39 negative example entity pairs formed by combining the entities in the second reference text, 3 negative example entity pairs can be randomly extracted from the 39 negative example entity pairs. Or determining that the included entity accords with the entity pair of the preset entity type condition in 39 negative entity pairs, and randomly extracting 3 negative entity pairs from the entity pairs which accord with the preset entity type condition.
Optionally, the preset entity type condition may be determined according to actual design requirements. As an example, the preset entity type condition is that the entity type is company: among the 39 negative example entity pairs, a pair meets the preset entity type condition only if both of its entities are of type company, and does not meet it if it contains an entity of type date or legal person. Then 3 negative example entity pairs are randomly extracted from the entity pairs that meet the preset entity type condition.
According to the text entity relation recognition method provided by the exemplary embodiment of the application, the noise data can be remarkably reduced by setting the preset entity type condition to select the entity pair.
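Selecting negative example entity pairs under a preset entity type condition can be sketched as below. The type labels ("company", "date", "legal person"), the entity name "Company D" and the function name `typed_negative_candidates` are assumptions chosen to mirror the example; the source does not fix a data layout.

```python
import random
from itertools import permutations

ENTITY_TYPES = {
    "Company A": "company", "Company B": "company",
    "Company C": "company", "Company D": "company",
    "April 26, 2017": "date", "August 27, 1988": "date",
    "Meng Keliang": "legal person",
}
POSITIVE = {(head, "Company D") for head in ("Company A", "Company B", "Company C")}


def typed_negative_candidates(entity_types, positive, required_type):
    """Negative pairs whose head and tail are both of the required type."""
    return [
        (head, tail)
        for head, tail in permutations(entity_types, 2)
        if (head, tail) not in positive
        and entity_types[head] == required_type
        and entity_types[tail] == required_type
    ]


candidates = typed_negative_candidates(ENTITY_TYPES, POSITIVE, "company")
sampled = random.Random(0).sample(candidates, 3)
print(len(candidates), sampled)
```

Under these assumptions the filter keeps 9 of the 39 negative pairs (the 12 ordered company-company pairs minus the 3 positive ones), which removes pairs such as a date paired with a person that can never express an investment.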
In step S102, a query data set comprising at least one entity pair in the text to be tested of the target field is generated.
The text to be tested in the target field includes a plurality of entities. Step S102 combines the entities in the text to be tested into entity pairs and generates a query data set including the plurality of entity pairs; the relationship type to which each entity pair in the query data set belongs is then determined in step S103.
As an example, generating a query dataset comprising at least one entity pair in a text under test of a target domain, comprising: acquiring a to-be-detected entity list of a to-be-detected text, wherein the to-be-detected entity list stores data of entities in the to-be-detected text; and combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into a query data set.
The user needs to label the text to be tested in the target field in advance to form the entity list to be tested corresponding to the text to be tested. The entity list to be tested can store data of a plurality of entities in the text to be tested.
Take as an example the following text to be tested (hereinafter referred to as the "first text to be tested"): "How to play your own role well: please read 'actor self-maintenance', the exclusive secret manual with which Zhou Xingchi of 'king of comedy' rose from poverty and hardship." The first text to be tested includes the entities "Zhou Xingchi", "actor self-maintenance" and "king of comedy", and the entity list to be tested may store data of at least two of these 3 entities.
The format of the entity list to be tested of the text to be tested is as follows: {"text":"How to play your own role well: please read 'actor self-maintenance', the exclusive secret manual with which Zhou Xingchi of 'king of comedy' rose from poverty and hardship","entity_list":[{"entity_index":{"begin":22,"end":26},"entity_type":"literary work","entity":"actor self-maintenance"},{"entity_index":{"begin":22,"end":26},"entity_type":"film and television work","entity":"king of comedy"},{"entity_index":{"begin":27,"end":29},"entity_type":"E","entity":"Zhou Xingchi"}]}.
The text is the content of the text to be tested, the entity_list is a list used for storing all marked entities, each content is the data of one entity, and the data of one entity comprises the position of the entity in the text to be tested, the type of the entity and the content of the entity. Specifically, the entity_index represents the position of the entity in the text to be tested, the entity_type represents the type of the entity, and the entity represents the content of the entity.
And combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into a query data set.
As an example, combining entities in the entity list to be tested into a plurality of entity pairs includes: each entity in the entity list to be tested is respectively used as a head entity and a tail entity, and forms an entity pair with other entities in the entity list to be tested; or determining the entity meeting the preset entity type condition from the entities in the entity list to be detected, and forming an entity pair by taking each entity in the entity meeting the preset entity type condition as a head entity and a tail entity respectively and other entities in the entity meeting the preset entity type condition.
Continuing with the first text to be tested as an example, taking each entity in the entity list to be tested as a head entity and a tail entity respectively, forming entity pairs with other entities in the entity list to be tested, combining three entities based on 'actor self-maintenance', 'comedy king' and 'Zhou Xingchi' into 6 entity pairs, and combining the combined 6 entity pairs into a query data set.
Continuing with the first text to be tested as an example, when entity pairs are combined under the constraint of a preset entity type condition, the number of combined entity pairs depends on the specific content of the preset entity type condition. That content may be determined according to actual design requirements; for example, the type of the head entity and/or the tail entity in an entity pair may be specified.
As an example, the entity type condition is: the type of at least one entity in the entity pair is "persona". Under this entity type condition, the three entities "actor self-maintenance", "king of comedy" and "Zhou Xingchi" can be combined into 4 entity pairs, namely "actor self-maintenance-Zhou Xingchi", "Zhou Xingchi-actor self-maintenance", "king of comedy-Zhou Xingchi" and "Zhou Xingchi-king of comedy", and the 4 combined entity pairs are then combined into a query data set.
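Both ways of building the query data set for the first text to be tested can be sketched in a few lines. The function name `build_query_pairs` is hypothetical, and the entity type of "Zhou Xingchi" is assumed to be "persona", as the entity type condition in this example implies.

```python
from itertools import permutations

ENTITY_LIST = [
    {"entity": "actor self-maintenance", "entity_type": "literary work"},
    {"entity": "king of comedy", "entity_type": "film and television work"},
    {"entity": "Zhou Xingchi", "entity_type": "persona"},  # assumed type
]


def build_query_pairs(entity_list, required_type=None):
    """Ordered (head, tail) pairs; optionally require one entity of a given type."""
    pairs = []
    for head, tail in permutations(entity_list, 2):
        types = (head["entity_type"], tail["entity_type"])
        if required_type is None or required_type in types:
            pairs.append((head["entity"], tail["entity"]))
    return pairs


all_pairs = build_query_pairs(ENTITY_LIST)
persona_pairs = build_query_pairs(ENTITY_LIST, required_type="persona")
print(len(all_pairs), len(persona_pairs))
```

Without a condition the three entities give 6 ordered pairs; requiring at least one "persona" entity leaves the 4 pairs that involve "Zhou Xingchi".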
In step S103, based on the support data set and the query data set, a relationship type to which each entity pair in the query data set belongs is determined using a machine learning model.
It should be noted that, the machine learning model in step S103 is trained in advance based on the data of the other fields except the target field, and specific training steps can be referred to the training process of the small sample learning model described above. The invention does not limit the specific type of the machine learning model, and the machine learning model can be any model capable of solving the problem of N-way-K-shot, such as a small sample learning model.
Alternatively, K entity pairs in the support data set and Q entity pairs in the query data set may be randomly selected each time, where the K entity pairs include positive example entity pairs and negative example entity pairs. The machine learning model is used to determine whether each of the Q entity pairs in the query data set is a positive example entity pair or a negative example entity pair. When an entity pair in the query data set is a positive example entity pair, it is determined that the entity pair belongs to a specified relationship type; when an entity pair in the query data set is a negative example entity pair, it is determined that the entity pair belongs to the negative example relationship type.
As an example, determining, based on the support dataset and the query dataset, a relationship type to which each entity pair in the query dataset belongs using a machine learning model, includes: calculating relation scores of each entity pair in the query data set on X preset appointed relation types in the support data set by using a machine learning model; and determining the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition.
Take as an example the following text to be tested in the SIM card field (hereinafter referred to as the "second text to be tested"): "The Micro SIM card is a small plastic card with an Integrated Circuit (IC) chip on one side." The second text to be tested includes 3 entities, "Micro SIM card", "plastic card" and "Integrated Circuit (IC) chip", which can be combined into 6 entity pairs; these 6 entity pairs are combined into the query data set.
It should be noted that, before the relationship types of the 6 entity pairs of the second text to be tested are determined, the support data set has already been generated from the reference text in the SIM card field by the method of step S101.
As an example, the preset X specified relationship types include the relationship type "is_a", and the machine learning model calculates a relationship score for each of the 6 entity pairs in the query data set on the relationship type "is_a" in the support data set, and the relationship scores for the 6 entity pairs are shown in table 1.
Relationship score | Entity pair (head entity-tail entity)
0.53 | Micro SIM card-plastic card
0.52 | Integrated Circuit (IC) chip-plastic card
0.52 | Plastic card-Integrated Circuit (IC) chip
0.51 | Micro SIM card-Integrated Circuit (IC) chip
0.50 | Plastic card-Micro SIM card
0.49 | Integrated Circuit (IC) chip-Micro SIM card
Table 1
As an example, according to whether the relationship score of each entity pair in the query data set meets a preset score condition, determining the relationship type of each entity pair in the query data set at least includes the following 3 specific methods, one of which can be selected according to design requirements, and the 3 specific methods are respectively:
Method (1): if the relation score of the entity pair is higher than the score threshold, determining that the entity pair belongs to preset X designated relation types; if the relation score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to other relation types except the preset X designated relation types.
Taking the second text to be tested as an example, the score threshold is set to 0.52. The relationship score of the entity pair "Micro SIM card-plastic card" is higher than the score threshold of 0.52, while the relationship scores of the other 5 entity pairs are not higher than the score threshold of 0.52. It is therefore determined that "Micro SIM card-plastic card" belongs to the "is_a" relationship type and that the other 5 entity pairs belong to relationship types other than "is_a".
Method (2): if the relationship scores of the entity pairs are Y names before in the ranking of the relationship scores of all the entity pairs in the query data set, determining that the entity pairs belong to preset X designated relationship types; if the relationship scores of the entity pairs are not the first Y in the ranking of the relationship scores of all the entity pairs in the query data set, determining that the entity pairs belong to other relationship types except the preset X designated relationship types, wherein Y is a positive integer.
Taking the second text to be tested as an example, Y is set to 1. The relationship score of the entity pair "Micro SIM card-plastic card" ranks first among the relationship scores of all entity pairs in the query data set, while the relationship scores of the other 5 entity pairs rank after first place. It is therefore determined that "Micro SIM card-plastic card" belongs to the "is_a" relationship type and that the other 5 entity pairs belong to relationship types other than "is_a".
Method (3): if the relationship score of the entity pair is higher than the score threshold value and the relationship score of the entity pair is the first Y in the ranking of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold value, or the relationship score of the entity pair is not the first Y in the ranking of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship types except the preset X designated relationship types, wherein Y is a positive integer.
Taking the second text to be tested as an example, the score threshold is set to 0.52 and Y is set to 1. The relationship score of the entity pair "Micro SIM card-plastic card" is higher than the score threshold of 0.52 and ranks first among the relationship scores of all entity pairs in the query data set, while the relationship scores of the other 5 entity pairs are not higher than the score threshold of 0.52. It is therefore determined that "Micro SIM card-plastic card" belongs to the "is_a" relationship type and that the other 5 entity pairs belong to relationship types other than "is_a".
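The three methods can be expressed with one small function applied to the scores of Table 1. This is a sketch under stated assumptions: the function name `classify_pairs` and the label "other" are hypothetical, and ties in the ranking are broken by input order, which the source does not specify.

```python
TABLE_1 = [
    (0.53, "Micro SIM card", "plastic card"),
    (0.52, "Integrated Circuit (IC) chip", "plastic card"),
    (0.52, "plastic card", "Integrated Circuit (IC) chip"),
    (0.51, "Micro SIM card", "Integrated Circuit (IC) chip"),
    (0.50, "plastic card", "Micro SIM card"),
    (0.49, "Integrated Circuit (IC) chip", "Micro SIM card"),
]


def classify_pairs(scored_pairs, threshold=None, top_y=None):
    """Label each (score, head, tail) item as "is_a" or "other".

    threshold only -> method (1); top_y only -> method (2); both -> method (3).
    """
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    labels = {}
    for rank, (score, head, tail) in enumerate(ranked, start=1):
        above_threshold = threshold is None or score > threshold
        within_top_y = top_y is None or rank <= top_y
        labels[(head, tail)] = "is_a" if above_threshold and within_top_y else "other"
    return labels


method_1 = classify_pairs(TABLE_1, threshold=0.52)
method_2 = classify_pairs(TABLE_1, top_y=1)
method_3 = classify_pairs(TABLE_1, threshold=0.52, top_y=1)
for name, labels in (("(1)", method_1), ("(2)", method_2), ("(3)", method_3)):
    print(name, [pair for pair, label in labels.items() if label == "is_a"])
```

Because the comparison is strict, the two pairs scoring exactly 0.52 are not accepted under methods (1) and (3), which matches the "not higher than the score threshold" branch of the description.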
According to the text entity relation recognition method provided by the embodiment of the invention, the machine learning model which is trained in advance by using data of other fields except the target field can be used for completing the entity relation recognition in the text to be detected in the target field, so that the machine learning model can be applied to different fields and scenes without carrying out large-scale training again.
Specifically, a support data set is generated based on the reference text of the target field. The support data set contains two kinds of entity pairs with entirely different relationship types (positive example entity pairs and negative example entity pairs). Because negative example entity pairs do not belong to the preset X specified relationship types, the relationship between the two entities of a negative example entity pair need not be labeled in the data labeling stage, which greatly reduces the amount of data labeling. When determining the relationship type of each entity pair in the query data set, it is only necessary to choose among the preset X specified relationship types, determining whether the entity pair is a positive example entity pair and, if so, which specified relationship type it belongs to. When an entity pair does not belong to the preset X specified relationship types, it can be directly classified into the relationship types other than the preset X specified relationship types. Adding negative example entity pairs to the support data set can improve the accuracy and realism of relationship classification and entity relationship recognition as well as the efficiency of entity relationship recognition.
The applicant uses 'investment' as a designated relationship type based on a query data set containing 500 entity pairs and a support data set containing a plurality of entity pairs in the financing field, and determines the relationship type of each entity pair in the query data set by using a machine learning model under the condition that the support data set contains different numbers of entity pairs. As shown in fig. 3, the F1 score (F1-score) of the machine learning model generally increases with increasing number of entity pairs included in the support data set, and the F1 score of the machine learning model exceeds 0.8 when the number of entity pairs included in the support data set exceeds 100.
It should be noted that the F1 score is a metric for classification problems. It is the harmonic mean of precision and recall, with a maximum of 1 and a minimum of 0. The greater the F1 score, the higher the precision and recall of the machine learning model.
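For reference, the F1 score can be computed from counts of true positives, false positives and false negatives as in the sketch below. The function name `f1_score` and the convention of returning 0 when a denominator is zero are assumptions; the counts in the example are made up for illustration and are not results from the source.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only: 8 correct positives, 2 false alarms, 2 misses.
print(round(f1_score(8, 2, 2), 3))
```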
As can be seen from fig. 3, when the text entity relationship recognition method according to the exemplary embodiment of the present invention is applied, the machine learning model can obtain a high F1 score even when it predicts the relationship types of the entity pairs in the query data set of the target field based on a small number (e.g., 150) of entity pairs in the support data set of the target field.
As an example, step S104 is further included after step S103. In step S104, data information of the relationship type to which each entity pair in the query data set belongs is output. The data information of the relationship type of each entity pair at least comprises the content of the head entity, the content of the tail entity and the relationship type of the entity pair.
Alternatively, the data information of the relationship type to which each entity pair in the query data set belongs may be output in the following format:
{"text":"","spo_list":[{"object_index":{"begin":,"end":},"subject_index":{"begin":,"end":},"predicate":"","object_type":"","subject_type":"","object":"","subject":""}]}.
Here, text is the content of the text to be tested and spo_list is the relationship list, which stores all labeled relationships. Each item is the relationship labeling data of one entity pair, which records the head entity content, the tail entity content, and the relationship type of the entity pair. Specifically, predicate is the relationship type; subject is the head entity content, subject_index is the position of the head entity in the text to be tested, and subject_type is the specific type of the head entity (may be null); object is the tail entity content, object_index is the position of the tail entity in the text to be tested, and object_type is the specific type of the tail entity (may be null).
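Producing output in this format can be sketched with the standard json module. The function name `format_predictions`, the layout of the input predictions and the use of `str.find` to locate each entity in the text are assumptions made for illustration.

```python
import json


def span_of(text, entity):
    """Begin/end character offsets of the first occurrence of entity in text."""
    begin = text.find(entity)
    return {"begin": begin, "end": begin + len(entity)}


def format_predictions(text, predictions):
    """Serialize (head, tail, relationship type) predictions as text plus spo_list."""
    spo_list = [
        {
            "object_index": span_of(text, tail),
            "subject_index": span_of(text, head),
            "predicate": relation,
            "object_type": "",
            "subject_type": "",
            "object": tail,
            "subject": head,
        }
        for head, tail, relation in predictions
    ]
    return json.dumps({"text": text, "spo_list": spo_list}, ensure_ascii=False)


text = "The Micro SIM card is a small plastic card"
output = format_predictions(text, [("Micro SIM card", "plastic card", "is_a")])
print(output)
```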
Based on the same inventive concept, the exemplary embodiment of the present invention further proposes a text entity relationship recognition system 200, and a brief description is made below about functional units that may be provided in the text entity relationship recognition system 200 and operations that may be performed by the functional units, and details related thereto may be referred to the above related description, which is not repeated herein.
As shown in fig. 4, the text entity relationship recognition system 200 includes a support dataset generation module 201, a query dataset generation module 202, and a relationship type prediction module 203.
The support data set generating module 201 is configured to generate a support data set based on the reference text of the target field, where the support data set includes positive case entity pairs belonging to the preset X specified relationship types and negative case entity pairs not belonging to the preset X specified relationship types.
The query data set generation module 202 is configured to generate a query data set including at least one entity pair in the text under test of the target domain.
The relationship type prediction module 203 is configured to determine, based on the support data set and the query data set, a relationship type to which each entity pair in the query data set belongs using a machine learning model, wherein the machine learning model is pre-trained based on data from fields other than the target field.
As an example, the support data set generation module 201 is configured to: acquire a reference entity list of the reference text, wherein the reference entity list stores data of the positive instance entity pairs in the reference text; combine all the entities in the reference entity list into a plurality of entity pairs, and determine the type of each combined entity pair according to whether it belongs to the preset X specified relationship types; and extract at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merge all the extracted entity pairs into the support data set.
As an example, the support data set generation module 201 is configured to: take each entity in the reference entity list as a head entity and as a tail entity respectively, and form entity pairs with the other entities in the reference entity list.
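The pairing and labeling step can be sketched as follows. This is only an illustration under stated assumptions: entities are represented as plain strings, the labeled positive relations are given as a hypothetical mapping from (head, tail) to a relation type, and the label "other" for negative instances is an assumed placeholder name.

```python
from itertools import permutations


def build_labeled_pairs(entities, positive_relations):
    """Combine entities into ordered (head, tail) pairs and label them.

    entities: list of distinct entity strings from the reference entity list.
    positive_relations: dict mapping (head, tail) -> relation type, covering
        the pairs labeled with one of the X specified relation types.
    Returns (positives, negatives), each a list of (head, tail, label).
    """
    positives, negatives = [], []
    # permutations(..., 2) uses every entity both as head and as tail,
    # paired with each of the other entities.
    for head, tail in permutations(entities, 2):
        relation = positive_relations.get((head, tail))
        if relation is not None:
            positives.append((head, tail, relation))
        else:
            negatives.append((head, tail, "other"))  # assumed label name
    return positives, negatives
```

For n distinct entities this yields n(n-1) ordered pairs, so a reference text with only a few labeled relations typically produces far more negative instances than positive ones.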
As an example, the support data set generation module 201 is configured to: extract all positive instance entity pairs; and extract, from all negative instance entity pairs, the same number of negative instance entity pairs as positive instance entity pairs.
As an example, the support data set generation module 201 is configured to: randomly extract at least one negative instance entity pair from all the negative instance entity pairs; or determine, among all the negative instance entity pairs, the entity pairs whose entities meet a preset entity type condition, and randomly extract at least one negative instance entity pair from the entity pairs meeting the preset entity type condition.
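The two negative-sampling options above can be sketched in one helper. The function name, the parameters, and the representation of the entity type condition as a set of allowed (head type, tail type) combinations are all assumptions made for illustration; the source does not specify how the condition is expressed.

```python
import random


def sample_negatives(negatives, n_positive, entity_types=None, allowed=None, seed=None):
    """Randomly draw as many negative pairs as there are positive pairs.

    negatives: list of (head, tail) pairs not in any specified relation type.
    n_positive: number of positive pairs to balance against.
    entity_types: optional dict mapping entity -> type (hypothetical).
    allowed: optional set of (head_type, tail_type) tuples; when given with
        entity_types, only pairs whose types match are eligible.
    """
    rng = random.Random(seed)
    pool = negatives
    if entity_types is not None and allowed is not None:
        pool = [
            (h, t) for h, t in negatives
            if (entity_types.get(h), entity_types.get(t)) in allowed
        ]
    # Never request more samples than the pool contains.
    k = min(n_positive, len(pool))
    return rng.sample(pool, k)
```

Filtering by entity type before sampling tends to keep harder negatives, that is, pairs whose types could plausibly carry one of the specified relations.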
As an example, the query data set generation module 202 is configured to: acquire a to-be-tested entity list of the text to be tested, wherein the to-be-tested entity list stores data of the entities in the text to be tested; and combine the entities in the to-be-tested entity list into a plurality of entity pairs, and merge the combined entity pairs into the query data set.
As an example, the query data set generation module 202 is configured to: take each entity in the to-be-tested entity list as a head entity and as a tail entity respectively, and form entity pairs with the other entities in the to-be-tested entity list; or determine, among the entities in the to-be-tested entity list, the entities meeting a preset entity type condition, take each of those entities as a head entity and as a tail entity respectively, and form entity pairs with the other entities meeting the preset entity type condition.
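A short sketch of the query-side pairing follows, assuming the preset entity type condition is simply membership of the entity's type in an allowed set. The names build_query_pairs, entity_types, and allowed_types are hypothetical.

```python
from itertools import permutations


def build_query_pairs(entities, entity_types=None, allowed_types=None):
    """Form ordered (head, tail) pairs from the to-be-tested entity list.

    When entity_types and allowed_types are both given, entities whose type
    is not allowed are dropped before pairing (the second option above).
    """
    if entity_types is not None and allowed_types is not None:
        entities = [e for e in entities if entity_types.get(e) in allowed_types]
    return list(permutations(entities, 2))
```

Filtering entities first reduces the query data set from n(n-1) pairs to m(m-1) pairs, where m is the number of entities that satisfy the condition.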
As an example, the relationship type prediction module 203 is configured to: calculate, by using the machine learning model, relationship scores of each entity pair in the query data set for the preset X specified relationship types in the support data set; and determine the relationship type to which each entity pair in the query data set belongs according to whether the relationship score of that entity pair meets a preset score condition.
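The source does not specify how the machine learning model computes a relationship score. Purely as an assumed stand-in, the sketch below scores a query pair against each relation type by cosine similarity between the query embedding and the mean of the support embeddings for that type, in the style of a prototype-based few-shot classifier. The embeddings themselves are taken as given inputs; how they are produced is outside this sketch.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def relation_scores(query_vec, support):
    """Score one query pair against every relation type in the support set.

    support: dict mapping relation type -> list of embedding vectors of the
        support entity pairs labeled with that type (assumed representation).
    """
    return {
        relation: cosine(query_vec, mean_vector(vectors))
        for relation, vectors in support.items()
    }
```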
As an example, the relationship type prediction module 203 is configured to perform any one of the following three steps:
If the relation score of the entity pair is higher than the score threshold, determining that the entity pair belongs to preset X designated relation types; if the relation score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to other relation categories except the preset X designated relation types;
If the relationship score of the entity pair ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer;
If the relationship score of the entity pair is higher than the score threshold and ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, or does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer.
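The three decision rules above (threshold only, top Y only, or both) can be sketched with one function in which each condition is optional. The function name and the representation of scores as a dictionary keyed by entity pair are assumptions; ties in the ranking are broken by insertion order here, which the source does not address.

```python
def classify(scores, threshold=None, top_y=None):
    """Decide, per entity pair, whether it belongs to the specified types.

    scores: dict mapping entity pair -> relationship score.
    threshold: if given, the score must be strictly higher than it.
    top_y: if given, the score must rank within the top Y of all pairs.
    Returns a dict mapping entity pair -> True (specified relation types)
    or False (other relation categories).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    in_top = set(ranked[:top_y]) if top_y is not None else set(scores)
    result = {}
    for pair, score in scores.items():
        accepted = pair in in_top
        if threshold is not None:
            accepted = accepted and score > threshold
        result[pair] = accepted
    return result
```

Passing only threshold gives the first rule, only top_y the second, and both together the third.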
As an example, the text entity relationship recognition system 200 further includes a result output module 204, the result output module 204 being configured to output data information of the relationship type to which each entity pair in the query data set belongs, wherein the data information of the relationship type to which each entity pair belongs includes at least the content of the head entity, the content of the tail entity, and the relationship type to which the entity pair belongs.
As an example, in the text entity relationship recognition system 200, the machine learning model includes a small sample learning model.
It should be appreciated that the text entity relationship recognition system 200 according to the exemplary embodiment of the present invention may be implemented with reference to the related implementations described in connection with fig. 2, which are not repeated here.
The text entity relationship recognition method and the text entity relationship recognition system 200 according to exemplary embodiments of the present invention have been described above with reference to figs. 1 to 4. It should be appreciated that the above-described method may be implemented by a program recorded on a computer-readable medium. For example, according to an exemplary embodiment of the present invention, a computer-readable storage medium storing instructions may be provided, on which a computer program for executing the text entity relationship recognition method shown in fig. 2 is recorded.
The computer program on the above computer-readable medium may run in an environment deployed on computer equipment such as a client, a host, a proxy device, or a server. It should be noted that the computer program may also be used to perform steps in addition to those shown in fig. 2, or to perform more specific processing when performing those steps; these additional steps and further processing have been described with reference to fig. 2 and are not repeated here.
It should be noted that, according to the implementation system of the machine learning modeling process of the exemplary embodiment of the present invention, the corresponding functions may be implemented entirely depending on the execution of the computer program, that is, each device corresponds to each step in the functional architecture of the computer program, so that the entire system is called through a specific software package (for example, lib library) to implement the corresponding functions.
On the other hand, each of the devices shown in fig. 4 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that the processor can perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present invention may also be implemented as a computing device including a storage element having a set of computer-executable instructions stored therein that, when executed by a processor, perform a method of text entity relationship identification.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the above set of instructions.
Here, the computing device is not necessarily a single computing device, but may be any device or aggregate of circuits capable of executing the above-described instructions (or instruction set) alone or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In a computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the text entity relationship recognition method according to the exemplary embodiment of the present invention may be implemented in software, some of the operations may be implemented in hardware, and furthermore, the operations may be implemented in a combination of software and hardware.
The processor may execute instructions or code stored in one of the memory components, where the memory component may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integrated with the processor, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, network connection, etc., such that the processor is able to read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via buses and/or networks.
Operations involved in a text entity relationship recognition method according to exemplary embodiments of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may be equally integrated into a single logic device or operate at non-exact boundaries.
For example, as described above, an implementation system of a machine learning modeling process according to an exemplary embodiment of the present invention may include a storage unit and a processor, wherein the storage unit stores a set of computer-executable instructions that, when executed by the processor, perform the above-mentioned text entity relationship recognition method.
The foregoing description of exemplary embodiments of the invention is illustrative rather than exhaustive, and the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.

Claims (20)

1. A text entity relationship recognition method, comprising:
generating a support data set based on a reference text of a target field, wherein the support data set comprises positive instance entity pairs belonging to preset X designated relationship types and negative instance entity pairs not belonging to the preset X designated relationship types;
generating a query data set comprising at least one entity pair in a text to be tested of the target field;
determining, based on the support data set and the query data set, a relationship type to which each entity pair in the query data set belongs by using a machine learning model,
wherein the machine learning model is pre-trained based on data of fields other than the target field;
wherein generating the support data set based on the reference text of the target field comprises:
acquiring a reference entity list of the reference text, wherein the reference entity list stores data of the positive instance entity pairs in the reference text;
combining all the entities in the reference entity list into a plurality of entity pairs, and determining the type of each combined entity pair according to whether each combined entity pair belongs to the preset X designated relationship types;
extracting at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merging all extracted entity pairs into the support data set;
wherein combining all the entities in the reference entity list into a plurality of entity pairs comprises:
taking each entity in the reference entity list as a head entity and as a tail entity respectively, and forming entity pairs with the other entities in the reference entity list.
2. The text entity relationship recognition method of claim 1, wherein extracting at least one of the positive instance entity pairs and at least one of the negative instance entity pairs in the combined entity pairs comprises:
extracting all the positive instance entity pairs;
and extracting, from all the negative instance entity pairs, the same number of negative instance entity pairs as the positive instance entity pairs.
3. The text entity relationship recognition method of claim 1, wherein extracting at least one of the positive instance entity pairs and at least one of the negative instance entity pairs in the combined entity pairs comprises:
randomly extracting at least one negative instance entity pair from all the negative instance entity pairs;
or determining, among all the negative instance entity pairs, the entity pairs whose entities meet a preset entity type condition, and randomly extracting at least one negative instance entity pair from the entity pairs meeting the preset entity type condition.
4. The text entity relationship recognition method of claim 1, wherein generating a query dataset comprising at least one entity pair in the text under test of the target field comprises:
acquiring a to-be-detected entity list of the to-be-detected text, wherein the to-be-detected entity list stores data of entities in the to-be-detected text;
combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into the query data set.
5. The text entity relationship recognition method of claim 4, wherein combining the entities in the entity-under-test list into a plurality of entity pairs comprises:
each entity in the entity list to be tested is respectively used as a head entity and a tail entity, and forms an entity pair with other entities in the entity list to be tested;
or determining, among the entities in the entity list to be tested, the entities meeting a preset entity type condition, taking each of those entities as a head entity and as a tail entity respectively, and forming entity pairs with the other entities meeting the preset entity type condition.
6. The text entity relationship identification method of claim 1, wherein determining, based on the support dataset and the query dataset, a relationship type to which each entity pair in the query dataset belongs using a machine learning model comprises:
calculating, by using the machine learning model, relationship scores of each entity pair in the query data set for the preset X designated relationship types in the support data set;
and determining the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition.
7. The text entity relationship identification method of claim 6, wherein determining the relationship type to which each entity pair in the query data set belongs according to whether the relationship score of each entity pair in the query data set meets a preset score condition comprises any one of the following three steps:
if the relation score of the entity pair is higher than a score threshold, determining that the entity pair belongs to the preset X designated relation types; if the relation score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to other relation categories except the preset X designated relation types;
if the relationship score of the entity pair ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer;
if the relationship score of the entity pair is higher than the score threshold and ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, or does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer.
8. The text entity relationship recognition method of claim 1, further comprising:
outputting data information of the relationship type to which each entity pair in the query data set belongs,
wherein the data information of the relationship type to which each entity pair belongs comprises at least the content of the head entity, the content of the tail entity, and the relationship type to which the entity pair belongs.
9. The text entity relationship recognition method of claim 1, wherein the machine learning model includes a small sample learning model.
10. A text entity relationship identification system, comprising:
a support data set generation module configured to generate a support data set based on a reference text of a target field, the support data set including positive instance entity pairs belonging to preset X designated relationship types and negative instance entity pairs not belonging to the preset X designated relationship types;
a query data set generation module configured to generate a query data set including at least one entity pair in a text to be tested of the target field;
a relationship type prediction module configured to determine, based on the support data set and the query data set, a relationship type to which each entity pair in the query data set belongs by using a machine learning model,
wherein the machine learning model is pre-trained based on data of fields other than the target field;
wherein the support data set generation module is configured to:
acquire a reference entity list of the reference text, wherein the reference entity list stores data of the positive instance entity pairs in the reference text;
combine all the entities in the reference entity list into a plurality of entity pairs, and determine the type of each combined entity pair according to whether each combined entity pair belongs to the preset X designated relationship types;
extract at least one positive instance entity pair and at least one negative instance entity pair from the combined entity pairs, and merge all extracted entity pairs into the support data set;
wherein the support data set generation module is configured to:
take each entity in the reference entity list as a head entity and as a tail entity respectively, and form entity pairs with the other entities in the reference entity list.
11. The text entity relationship identification system of claim 10, wherein the support dataset generation module is configured to:
extract all the positive instance entity pairs;
and extract, from all the negative instance entity pairs, the same number of negative instance entity pairs as the positive instance entity pairs.
12. The text entity relationship identification system of claim 10, wherein the support dataset generation module is configured to:
randomly extract at least one negative instance entity pair from all the negative instance entity pairs;
or determine, among all the negative instance entity pairs, the entity pairs whose entities meet a preset entity type condition, and randomly extract at least one negative instance entity pair from the entity pairs meeting the preset entity type condition.
13. The text entity relationship identification system of claim 10, wherein the query dataset generation module is configured to:
acquiring a to-be-detected entity list of the to-be-detected text, wherein the to-be-detected entity list stores data of entities in the to-be-detected text;
combining the entities in the entity list to be tested into a plurality of entity pairs, and combining the combined entity pairs into the query data set.
14. The text entity relationship identification system of claim 13, wherein the query dataset generation module is configured to:
each entity in the entity list to be tested is respectively used as a head entity and a tail entity, and forms an entity pair with other entities in the entity list to be tested;
or determining, among the entities in the entity list to be tested, the entities meeting a preset entity type condition, taking each of those entities as a head entity and as a tail entity respectively, and forming entity pairs with the other entities meeting the preset entity type condition.
15. The text entity relationship identification system of claim 10, wherein the relationship type prediction module is configured to:
calculating, by using the machine learning model, relationship scores of each entity pair in the query data set for the preset X designated relationship types in the support data set;
and determining the relationship type of each entity pair in the query data set according to whether the relationship score of each entity pair in the query data set meets a preset score condition.
16. A text entity relationship identification system as claimed in claim 15 wherein said relationship type prediction module is configured to perform any one of the following three steps:
if the relation score of the entity pair is higher than a score threshold, determining that the entity pair belongs to the preset X designated relation types; if the relation score of the entity pair is not higher than the score threshold, determining that the entity pair belongs to other relation categories except the preset X designated relation types;
if the relationship score of the entity pair ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer;
if the relationship score of the entity pair is higher than the score threshold and ranks among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to the preset X designated relationship types; if the relationship score of the entity pair is not higher than the score threshold, or does not rank among the top Y of the relationship scores of all entity pairs in the query data set, determining that the entity pair belongs to other relationship categories except the preset X designated relationship types, wherein Y is a positive integer.
17. The text entity relationship identification system of claim 10, further comprising a result output module configured to:
outputting data information of the relationship type to which each entity pair in the query data set belongs,
wherein the data information of the relationship type to which each entity pair belongs comprises at least the content of the head entity, the content of the tail entity, and the relationship type to which the entity pair belongs.
18. The text entity relationship identification system of claim 10, wherein the machine learning model includes a small sample learning model.
19. A computer readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the text entity relationship identification method of any one of claims 1 to 9.
20. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the text entity relationship identification method of any of claims 1 to 9.
CN202010665101.6A 2020-07-10 2020-07-10 Text entity relationship recognition method, system and medium Active CN111797237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010665101.6A CN111797237B (en) 2020-07-10 2020-07-10 Text entity relationship recognition method, system and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010665101.6A CN111797237B (en) 2020-07-10 2020-07-10 Text entity relationship recognition method, system and medium

Publications (2)

Publication Number Publication Date
CN111797237A CN111797237A (en) 2020-10-20
CN111797237B true CN111797237B (en) 2024-05-07

Family

ID=72808188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010665101.6A Active CN111797237B (en) 2020-07-10 2020-07-10 Text entity relationship recognition method, system and medium

Country Status (1)

Country Link
CN (1) CN111797237B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2015088168A (en) * 2013-09-25 2015-05-07 国際航業株式会社 Learning sample creation device, learning sample creation program, and automatic recognition device
CN108446769A (en) * 2018-01-23 2018-08-24 深圳市阿西莫夫科技有限公司 Knowledge mapping relation inference method, apparatus, computer equipment and storage medium
CN111062424A (en) * 2019-12-05 2020-04-24 中国科学院计算技术研究所 Small sample food image recognition model training method and food image recognition method
CN111078847A (en) * 2019-11-27 2020-04-28 中国南方电网有限责任公司 Power consumer intention identification method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190197176A1 (en) * 2017-12-21 2019-06-27 Microsoft Technology Licensing, Llc Identifying relationships between entities using machine learning

Also Published As

Publication number Publication date
CN111797237A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant