CN113704392A - Method, device and equipment for extracting entity relationship in text and storage medium - Google Patents
Method, device and equipment for extracting entity relationship in text and storage medium
- Publication number
- CN113704392A CN113704392A CN202110393735.5A CN202110393735A CN113704392A CN 113704392 A CN113704392 A CN 113704392A CN 202110393735 A CN202110393735 A CN 202110393735A CN 113704392 A CN113704392 A CN 113704392A
- Authority
- CN
- China
- Prior art keywords
- target
- entity
- subject
- text
- potential
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3346—Query execution using probabilistic model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application discloses a method, a device, equipment and a storage medium for extracting entity relationships in text, relating to the field of artificial intelligence. The method comprises the following steps: encoding a target text to obtain a word vector corresponding to each word in the target text; determining a potential entity relationship corresponding to the target text based on the word vectors, wherein the probability that the potential entity relationship exists in the target text is higher than that of the other candidate entity relationships; determining a target subject and a target object in the target text based on the potential entity relationship and the word vectors, wherein the target subject and the target object are both entities; and extracting an entity relationship triple from the target text based on the target subject, the potential entity relationship and the target object. Because the potential entity relationships are obtained by screening the candidate entity relationships, false recalls caused by redundant entity relationships are reduced, and both the accuracy and the efficiency of entity relationship extraction in text are improved.
Description
Technical Field
The embodiment of the application relates to the field of artificial intelligence, in particular to a method, a device, equipment and a storage medium for extracting entity relations in texts.
Background
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
When a knowledge graph is constructed, a large amount of text needs to be structured: unstructured data in the text is converted into structured data and recalled. For example, when entity relationships are extracted from a text, the recalled structured data is a triple comprising a subject, a relation, and an object.
However, because the relationships between entities in the text are complex, when the entity relationship extraction is performed on the text, a large number of false recall results exist, so that the accuracy of the entity relationship extraction in the text is poor.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for extracting entity relations in texts, which can reduce the error recall result in the extraction of the entity relations and improve the accuracy of the extraction of the entity relations in the texts. The technical scheme is as follows:
in one aspect, an embodiment of the present application provides a method for extracting an entity relationship in a text, where the method includes:
coding a target text to obtain a word vector corresponding to each word in the target text;
determining a potential entity relationship corresponding to the target text based on the word vector corresponding to each word, wherein the probability of the potential entity relationship existing in the target text is higher than the probability of other candidate entity relationships except the potential entity relationship existing in the target text;
determining a target subject and a target object in the target text based on the potential entity relationship and the word vector corresponding to each word, wherein the target subject and the target object belong to an entity;
and extracting entity relation triples from the target text based on the target subject, the potential entity relations and the target object.
On the other hand, an embodiment of the present application provides an apparatus for extracting entity relationships in a text, where the apparatus includes:
the encoding module is used for encoding a target text to obtain a word vector corresponding to each word in the target text;
a relationship determining module, configured to determine a potential entity relationship corresponding to the target text based on the word vector corresponding to each word, where a probability that the potential entity relationship exists in the target text is higher than a probability that other candidate entity relationships other than the potential entity relationship exist;
a subject-object determining module, configured to determine, based on the potential entity relationship and the word vector corresponding to each word, a target subject and a target object in the target text, where the target subject and the target object belong to an entity;
and the extraction module is used for extracting entity relation triples from the target text based on the target subject, the potential entity relation and the target object.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the method for extracting entity relationships in text according to the above aspect.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for extracting entity relationships in text according to the above aspect.
In another aspect, embodiments of the present application provide a computer program product or a computer program, which includes computer instructions stored in a computer-readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to enable the computer device to execute the method for extracting entity relations in texts provided by the above aspects.
In the process of extracting the entity relationship, firstly, determining a high-probability potential entity relationship in a target text based on a word vector obtained by encoding the target text, filtering a low-probability candidate entity relationship, then determining a target subject and a target object from the target text based on the determined potential entity relationship and the word vector, and finally extracting an entity relationship triple comprising the target subject, the potential entity relationship and the target object from the target text; by adopting the scheme provided by the embodiment of the application, the potential entity relationship is obtained by screening the candidate entity relationship before the subject and object extraction is carried out, the false recall result caused by the redundant entity relationship irrelevant to the target text can be reduced, the accuracy of the entity relationship extraction in the text is improved, and the extraction efficiency of the entity relationship is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
FIG. 1 is a schematic diagram illustrating a method for extracting entity relationships in text according to an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for extracting entity relationships in text according to an exemplary embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for extracting entity relationships in text according to another exemplary embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an implementation of an entity relationship extraction process, according to an exemplary embodiment of the present application;
FIG. 6 is a flowchart illustrating an entity relationship triplet generation process, shown in an exemplary embodiment of the present application;
FIG. 7 is a schematic diagram illustrating an implementation of an entity relationship extraction process according to another exemplary embodiment of the present application;
FIG. 8 illustrates a flow diagram of an entity relationship extraction model training process provided by an exemplary embodiment of the present application;
FIG. 9 illustrates a flow diagram of an entity relationship extraction model training process provided by another exemplary embodiment of the present application;
FIG. 10 is a block diagram illustrating an apparatus for extracting entity relationships in text according to an exemplary embodiment of the present application;
fig. 11 shows a schematic structural diagram of a computer device provided in an exemplary embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
For convenience of understanding, terms referred to in the embodiments of the present application will be first described below.
Entity relationship triplets: the three-element structure comprises a subject, an object and a subject-object relationship. Wherein, the subject and the object are both specific domain entities. Taking the medical field as an example, the subject may be a medical symptom, the subject-object relationship may be attributes of the medical symptom, such as a property, a part, and time, and the object is an entity corresponding to the attribute indicated by the subject-object relationship. In one illustrative example, the entity relationship triplets are (dermatitis, region, leg), or (dermatitis, nature, intermediate). The method provided by the embodiment of the application is used for extracting the entity relationship triples containing the relationships among the entities in the specific field from the text in the specific field.
Sequence Tagging: a basic task in NLP used to solve character-level classification problems such as word segmentation, part-of-speech tagging, named entity recognition, and relation extraction. Sequence labeling in the embodiments of the present application is used to label the subject and the object in a text, and BIO (Begin, Inside, Outside) labels are used in the sequence labeling process, where a B label indicates that a word is at the beginning of an entity, an I label indicates that a word is inside an entity, and an O label indicates that the word does not belong to any entity.
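A minimal sketch of BIO labeling (the helper `spans_to_bio` is hypothetical, for illustration only):

```python
def spans_to_bio(tokens, spans):
    """Convert entity token spans to BIO labels.
    spans: list of (start, end) token indices, end exclusive."""
    labels = ["O"] * len(tokens)  # default: not part of any entity
    for start, end in spans:
        labels[start] = "B"                  # entity beginning
        for i in range(start + 1, end):
            labels[i] = "I"                  # inside the entity
    return labels

tokens = list("腹部疼痛明显")  # character-level tokens for Chinese text
labels = spans_to_bio(tokens, [(0, 2), (2, 4)])  # "腹部" and "疼痛" are entities
print(labels)  # ['B', 'I', 'B', 'I', 'O', 'O']
```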
Manual labeling: the method refers to a process of performing real-value (ground-route) labeling on training samples in a training data set by a labeling person before training a neural network model. And the labeled labels obtained by manual labeling are used for monitoring the output result of the model in the model training process, and correspondingly, the model training process leads the output result of the model to tend to the labeled process by adjusting the parameters of the model. The manual labeling process related in the embodiment of the application comprises the step of labeling the relations of a subject, an object and an object in a sample text.
Loss function: also called cost function, a function used to evaluate the degree of difference between the predicted value and the true value of a neural network model. The smaller the loss, the better the model performs; the training process of a model is the process of minimizing the loss function by adjusting the model parameters. Different neural network models use different loss functions; common loss functions include the 0-1 loss, absolute value loss, logarithmic loss, exponential loss, perceptron loss, cross-entropy loss, and the like.
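As an illustration of the cross-entropy loss mentioned above (a toy implementation for binary labels, not the one used by the patent's model):

```python
import math

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between ground-truth labels and predicted probabilities;
    smaller values mean the predictions are closer to the labels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

good = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
bad = binary_cross_entropy([1, 0, 1], [0.3, 0.7, 0.4])
assert good < bad  # better predictions yield a smaller loss
```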
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science and mathematics. Research in this field therefore involves natural language, i.e., the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like. The method for extracting entity relationships in text provided by the embodiments of the present application is applied to the field of knowledge graphs.
A knowledge graph is a graph-based data structure comprising nodes and edges, where each node represents an entity and each edge represents a relationship between entities. Therefore, to construct a knowledge graph in a specific field, relationships between entities need to be extracted from the massive text data of that field. In the related art, to improve extraction efficiency, entity relationships are extracted using a neural network model. Because the relationships between entities are complex and varied, each candidate entity relationship needs to be checked one by one during extraction. In practical applications, however, it is found that for a given piece of text, the entity relationships actually present account for only a very small fraction of the total number of candidate entity relationships; the other redundant entity relationships degrade both the accuracy and the efficiency of entity relationship extraction.
The embodiment of the application provides an extraction method of entity relations in a text, which is characterized in that potential entity relations with high probability in the text are screened from candidate entity relations, so that the influence of redundant entity relations on entity relation extraction is reduced, the accuracy rate of entity relation extraction is improved, and the extraction efficiency of the entity relations is improved. Fig. 1 shows a schematic diagram of a principle of an extraction method of entity relationships in text provided by an embodiment of the present application.
As shown in fig. 1, a computer device first encodes a text 11 to obtain word vectors 12 corresponding to n words in the text 11, and thus screens out potential entity relationships 14 from m candidate entity relationships 13 based on the word vectors 12, where a probability of existence of a potential entity relationship 14 in the text 11 is higher than a probability of existence of other candidate entity relationships. Further, the computer device determines that the text 11 includes the subject 15 and the object 16 based on the potential entity relationship 14 and the word vector 12, and further generates an entity relationship triple 17 based on the subject 15, the object 16 and the potential entity relationship 14, thereby completing the entity relationship extraction of the text 11. Because the potential entity relationship is screened out before the subject and the object are determined, when the subject and the object are determined, various candidate entity relationships do not need to be traversed one by one, and the calculation amount in the process of determining the subject and the object is reduced; meanwhile, only the potential entity relationship is contained in the entity relationship triple finally extracted, so that false recall caused by redundant entity relationships is avoided, and the accuracy of entity relationship extraction is improved.
The method for extracting the entity relationship in the text can be used for the construction process of the knowledge graph in the specific field. Taking the construction process of the knowledge graph in the medical field as an example, a developer firstly carries out manual labeling on text corpora in part of the medical field, so that an entity relationship extraction model is trained by utilizing the manually labeled text corpora, and the entity relationship extraction model is used for outputting entity relationship triples based on input texts. After the trained entity relationship extraction model is deployed on computer equipment, the computer equipment inputs the unlabeled text corpus in the medical field text corpus into the entity relationship extraction model to obtain entity relationship triples output by the entity relationship extraction model. Based on the extracted massive entity relationship triples, the computer equipment can further construct a knowledge graph in the medical field.
Furthermore, the constructed medical field knowledge graph can be used in various downstream services. For example, an automatic question-answering system in the medical field can be designed based on the knowledge graph in the medical field, and the medical question of the user is automatically replied by the automatic question-answering system; or, the knowledge graph in the medical field is used as background information of the relation between the entity and the entity in natural language understanding, so that the accuracy of the natural language understanding is improved; or, the medical field knowledge graph is integrated into a recommendation system as auxiliary information to improve the accuracy of a recommendation result.
Of course, the method for extracting entity relationships in texts provided in the embodiment of the present application may also be applied to the construction process of knowledge maps in other fields, such as the customer service field, the financial field, and the like.
FIG. 2 illustrates a schematic diagram of an implementation environment provided by an exemplary embodiment of the present application. The implementation environment includes a terminal 210 and a server 220. The data communication between the terminal 210 and the server 220 is performed through a communication network, optionally, the communication network may be a wired network or a wireless network, and the communication network may be at least one of a local area network, a metropolitan area network, and a wide area network.
The terminal 210 is an electronic device for providing text corpus, and the electronic device may be a smart phone, a tablet computer, a personal computer, or the like, which is not limited in this embodiment. In fig. 2, a computer used by the medical staff as the terminal 210 is described as an example.
After the terminal 210 obtains the text corpus, the text corpus is sent to the server 220, and the server 220 extracts the entity relationship for constructing the knowledge graph from the text corpus. As shown in fig. 2, after the medical staff enters the symptom description of the patient through the terminal 210, the terminal 210 sends the symptom description to the server 220 as a text corpus in the medical field.
The server 220 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
In this embodiment, the server 220 is provided with an entity relationship extraction model for extracting entity relationship triplets including entities and entity relationships from the specific field text. Optionally, the entity relationship extraction model is obtained by training in advance according to manually labeled texts. In addition, the server 220 is further configured to construct a domain-specific knowledge graph based on the extracted entity relationship triples. In one possible implementation, the corpus of text and entity-relationship triples extracted from the corpus of text may be stored on a blockchain.
Schematically, as shown in fig. 2, after receiving a text corpus sent by the terminal 210, the server 220 inputs a medical text 221 in the text corpus into the entity relationship extraction model 222, so as to obtain an entity relationship triple 223 extracted by the entity relationship extraction model 222. When the data volume of the entity relationship triplets 223 meets the knowledge-graph construction (or update) requirements, the server 220 constructs (or updates) the medical domain knowledge-graph 224 based on the entity relationship triplets 223.
Of course, the server 220 may capture the corpus from the network or obtain the corpus from the corpus, in addition to obtaining the corpus from the terminal 210, which is not limited in this embodiment.
In other possible embodiments, the entity relationship extraction model 222 may also be deployed on the terminal side: the terminal extracts entity relationship triples from the text and reports them to the server (so that the server does not directly acquire the original text corpus), and the server then performs knowledge graph construction. This embodiment is not limited in this respect. For convenience of description, the following embodiments take the case in which the method for extracting entity relationships in text is executed by a computer device.
Fig. 3 is a flowchart illustrating a method for extracting entity relationships in text according to an exemplary embodiment of the present application. The embodiment is described by taking the method as an example for a computer device, and the method comprises the following steps.
The target text is a text corpus in a specific field and is composed of a plurality of words (tokens). When the target text is encoded, it is encoded word by word to obtain a word vector corresponding to each word. When the target text is English, encoding the target text yields a word vector corresponding to each English word; when the target text is Chinese, encoding the target text yields a word vector corresponding to each Chinese character. For convenience of description, the embodiments of the present application take a Chinese target text as an example.
Regarding the manner of text encoding, in one possible implementation, the computer device inputs the target text into a pre-trained BERT (Bidirectional Encoder Representations from Transformers) model, which encodes the target text and outputs the word vector corresponding to each word. In other possible implementations, the computer device may also use an encoder such as Word2Vec, GloVe, or RoBERTa to encode the target text; the embodiment of the present application does not limit the specific encoding manner.
Illustratively, when the input target text is represented as S = {x1, x2, …, xn}, the word vectors obtained by encoding can be represented as H = {h1, h2, …, hn}, where n is the length of the target text (i.e., the number of words it contains) and each word vector hi has dimension d.
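The encoding step can be sketched with a toy embedding table (the real implementation would use BERT; the vocabulary, dimension, and random weights here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {ch: i for i, ch in enumerate("阵发性腹痛明显")}  # toy character vocabulary
d = 8                                          # word-vector dimension (BERT uses 768)
embedding = rng.normal(size=(len(vocab), d))   # toy embedding table

def encode(text):
    """Map a text S = {x1, ..., xn} to word vectors of shape (n, d)."""
    ids = [vocab[ch] for ch in text]
    return embedding[ids]

H = encode("阵发性腹痛")
print(H.shape)  # (5, 8): n = 5 words, each vector is d = 8 dimensional
```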
Optionally, before entity relationship extraction, developers first set the candidate entity relationships between entities in the specific field. Although there are many types of candidate entity relationships, the entity relationships present in a single text account for only a very small fraction of the total number of candidate entity relationships (generally, 5 or fewer entity relationships exist in a single text). Therefore, to avoid invalid redundant entity relationships affecting the speed and accuracy of entity relationship extraction, in this embodiment the computer device first screens out at least one potential entity relationship from the candidate entity relationships, where a potential entity relationship is an entity relationship that exists in the target text with high probability.
The number of potential entity relationships corresponding to different texts may be different, and the types of the potential entity relationships may be different.
In an illustrative example, when there are 100 candidate entity relationships between entities in a particular domain, the computer device determines 2 candidate entity relationships of the 100 candidate entity relationships as potential entity relationships corresponding to the target text based on the word vector.
Because the high-probability potential entity relationships have been screened out, the subject and object in the target text need only be determined based on those potential entity relationships rather than the remaining low-probability candidate entity relationships. This reduces the amount of computation when determining subjects and objects and avoids false recalls caused by low-probability candidate entity relationships. For example, when there are 100 candidate entity relationships and only 2 potential entity relationships correspond to the target text, the computer device only needs to determine subjects and objects based on those 2 potential entity relationships, not the other 98 candidate entity relationships.
In one possible implementation, the computer device combines the potential entity relationship with the word vectors corresponding to the words to obtain a word vector assigned with a specific entity relationship, and thereby determines the target subject and the target object based on the word vector assigned with the specific entity relationship. The target subject and the target object are both entities in a specific field, the target subject is a subject with a potential entity relationship, and the target object is an object with a potential entity relationship.
In one possible embodiment, when the target text is a medical field text, the target subject and the target object are medical text entities, and the entity relationship between the target subject and the target object includes at least one of a location, a time, and a property.
For example, when the target subject is a symptom, the target object may include a site where the symptom appears, the nature (severity) of the symptom, the duration of the symptom, and the like.
Step 304: extract entity relationship triples from the target text based on the target subject, the potential entity relationship, and the target object.
After the target subject, the target object, and the potential entity relationship are determined, the computer device combines them to generate an entity relationship triple, in which the target subject and the target object possess the potential entity relationship.
In a possible implementation manner, the computer device may combine subjects and objects using a heuristic nearest-neighbor method, that is, combine the target subject with the object closest to it in the target text, together with the potential entity relationship, to obtain the entity relationship triples.
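The heuristic nearest-neighbor combination can be sketched as follows; the span representation (entity text plus a start position) and all example positions are illustrative assumptions, not part of the patent.

```python
def nearest_neighbor_pairs(subjects, objects):
    """Heuristic nearest-neighbor alignment: pair each subject span with the
    object span whose position in the text is closest to it.  Spans are
    (entity_text, position) tuples; the representation is illustrative."""
    pairs = []
    for subj, s_pos in subjects:
        obj, _ = min(objects, key=lambda o: abs(o[1] - s_pos))
        pairs.append((subj, obj))
    return pairs

# Subject at position 5; candidate objects at positions 0 and 20.
pairs = nearest_neighbor_pairs(
    subjects=[("abdominal pain", 5)],
    objects=[("paroxysmal", 0), ("right lower abdomen", 20)],
)
print(pairs)  # [('abdominal pain', 'paroxysmal')]
```

Note that this heuristic always picks exactly one object per subject by distance alone, which is precisely the weakness the confidence-based alignment described later is meant to address.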
In another possible implementation manner, the computer device determines a confidence for each combination of target subject and target object and obtains the entity relationship triples based on those confidences, improving the accuracy of the entity relationship between subject and object in the triples.
In an illustrative example, the target text is "the patient developed paroxysmal abdominal pain two days ago, located in the right lower abdomen and presenting as dull pain". The computer device determines that the potential entity relationships corresponding to the target text include "nature" and "location", the determined target subject includes "abdominal pain", the target objects include "paroxysmal" and "right lower abdomen", and the extracted entity relationship triples include: (abdominal pain, nature, paroxysmal), (abdominal pain, location, right lower abdomen).
To sum up, in the embodiment of the application, during entity relationship extraction the computer device first determines, based on the word vectors obtained by encoding the target text, the potential entity relationships that exist in the target text with high probability and filters out the candidate entity relationships that exist with low probability; it then determines the target subject and target object from the target text based on the determined potential entity relationships and the word vectors; finally, it extracts from the target text entity relationship triples comprising the target subject, the potential entity relationship, and the target object. By screening the candidate entity relationships to obtain the potential entity relationships before subject-object extraction, the scheme provided by this embodiment reduces false recalls caused by redundant entity relationships irrelevant to the target text, and improves both the accuracy and the efficiency of entity relationship extraction.
In one possible implementation, the computer device performs entity relationship triplet extraction through a pre-trained entity relationship extraction model, which is composed of an encoding layer, a potential relationship decision layer, and a relationship-specific sequence labeling layer. The encoding layer is used for encoding an input text to obtain a word vector, the potential relation judging layer is used for determining a potential entity relation existing in the text based on the word vector, and the sequence marking layer with a specific relation is used for fusing the potential entity relation and the word vector, so that sequence marking is carried out based on the word vector with the potential entity relation, and a target subject and an object are determined. The following description will be made using exemplary embodiments.
Fig. 4 shows a flowchart of a method for extracting entity relationships in text according to another exemplary embodiment of the present application. The embodiment is described by taking the method as an example for a computer device, and the method comprises the following steps.
For the implementation of this step, refer to step 301 above; details are not repeated in this embodiment.
Schematically, as shown in fig. 5, after the target text 51 is input into the encoding layer 52, a word vector 53(h) corresponding to each word is obtained.
In the embodiment of the application, the determination of potential entity relationships in a text is modeled as a multi-label binary classification problem over a global representation of the text. The computer device therefore obtains a global representation of the target text, namely a text vector, based on the word vectors corresponding to the individual words.
In a possible implementation manner, the computer device performs average pooling on the word vectors of the individual words to obtain the text vector corresponding to the target text. The text vector has the same dimension as the word vectors, so the output of the encoding layer is reduced in dimension. Illustratively, the process of determining the text vector from the word vectors can be represented by the following formula:

h_avg = Avgpool(h_1, h_2, …, h_n)

where Avgpool denotes average pooling, that is, the word vectors are averaged in each dimension, and h_i is the word vector of the i-th word in the target text.
Of course, the global representation of the target text may be obtained by the computer device in other ways besides by performing average pooling on the word vectors, which is not limited in this embodiment.
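The average-pooling step can be sketched numerically as follows; the word-vector values are made up purely for illustration.

```python
import numpy as np

# Word vectors h for a 4-token text with d = 3 (illustrative values).
h = np.array([[1.0, 2.0, 3.0],
              [3.0, 2.0, 1.0],
              [0.0, 0.0, 0.0],
              [4.0, 4.0, 4.0]])

# h_avg = Avgpool(h_1, ..., h_n): average each dimension over the tokens,
# giving a text vector with the same dimensionality d as the word vectors.
h_avg = h.mean(axis=0)
print(h_avg)  # [2. 2. 2.]
```

The resulting text vector keeps the word-vector dimension d while collapsing the token dimension n, which is what allows a single fully connected layer to classify the whole text in the next step.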
In one possible implementation, a Fully Connected (FC) layer is used as the potential relationship judgment layer in the entity relationship extraction model. Correspondingly, the process of determining the potential entity relationships in the target text consists of performing multi-label binary classification on the text vector through the fully connected layer.
Optionally, the computer device inputs the text vector into the fully connected layer, which maps it (for example, implemented as a 1 × 1 convolution) to the existence probabilities corresponding to the respective candidate entity relationships. The higher the existence probability corresponding to a candidate entity relationship, the more likely it is that entities in the target text possess that candidate entity relationship. The existence probability corresponding to each candidate entity relationship can be expressed as:
P_rel = σ(W_r h_avg + b_r)
where h_avg is the text vector, W_r is a trainable weight (i.e., the weight of the fully connected layer), σ is the sigmoid function, and b_r is the bias term.
Illustratively, as shown in fig. 5, the computer device inputs the text vector (obtained by average-pooling the word vectors 53) into the potential relationship judgment layer 54 to obtain the existence probabilities 541 corresponding to the respective candidate entity relationships.
Step 404: determine potential entity relationships from the candidate entity relationships based on the existence probabilities.
In one possible implementation, if the existence probability corresponding to a candidate entity relationship is higher than a probability threshold, the computer device determines that candidate entity relationship to be a potential entity relationship corresponding to the target text; if the existence probability is lower than the probability threshold, the computer device filters out the candidate entity relationship.
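The judgment-and-threshold step can be sketched as below; the weight matrix, bias, relation names, and threshold value 0.5 are all illustrative assumptions rather than learned parameters from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def screen_relations(h_avg, W_r, b_r, names, threshold=0.5):
    """P_rel = sigma(W_r . h_avg + b_r): one existence probability per
    candidate relation; keep only the relations above the threshold."""
    p = sigmoid(W_r @ h_avg + b_r)
    return [name for name, prob in zip(names, p) if prob > threshold]

h_avg = np.array([1.0, -1.0])
W_r = np.array([[ 2.0,  0.0],   # "nature":   logit  2.0 -> p ~ 0.88
                [ 0.0, -2.0],   # "location": logit  2.0 -> p ~ 0.88
                [-3.0,  0.0]])  # "time":     logit -3.0 -> p ~ 0.05
kept = screen_relations(h_avg, W_r, b_r=np.zeros(3),
                        names=["nature", "location", "time"])
print(kept)  # ['nature', 'location']
```

Only the kept relations flow into the sequence labeling stage, which is how the low-probability candidates (here "time") are filtered out before any subject-object work is done.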
Illustratively, as shown in fig. 5, the computer device determines "nature" and "location" as potential entity relationships based on the existence probabilities 541 corresponding to the respective candidate entity relationships.
In a possible implementation manner, the computer device obtains a relationship vector corresponding to each potential entity relationship, and thus fuses the relationship vector and each word vector to obtain a word vector having a potential entity relationship. And the relation vector corresponding to the potential entity relation and the word vector have the same dimensionality.
Illustratively, a word vector with a potential entity relationship may be denoted as h_i + u_j, where h_i is the word vector of the i-th word in the target text, u_j is the relation vector of the j-th potential entity relationship, and h_i, u_j ∈ R^d.
Step 406: perform sequence labeling based on the word vectors with the potential entity relationship, and determine the target subject and target object in the target text.
In the embodiment of the application, the extraction of subjects and objects from the text is modeled as a sequence tagging task. When performing sequence tagging, the computer device uses the BIO tagging scheme to assign each word in the text a tag representing its entity position and category, where the entity position indicates the position of the word within an entity and the category distinguishes subjects from objects.
In one possible implementation, the computer device performs sequence labeling through a relationship-specific sequence labeling layer based on the word vectors with the potential entity relationships to obtain a target subject and a target object in the target text. The sequence labeling layer specific to the relationship may use a Recurrent Neural Network (RNN) or a Long Short-Term Memory (LSTM) Network to realize sequence labeling, and the embodiment of the present application does not limit a specific Network structure used in sequence labeling.
In order to ensure the accuracy of the subject-object extraction, in the embodiment of the present application, the sequence labeling of the subject and the object is performed separately, and in a possible implementation, this step may include the following sub-steps.
First, perform subject sequence labeling based on the word vectors with the potential entity relationship to obtain a subject labeling result, where the subject labeling result represents the first entity position of each word vector with the potential entity relationship, and the first entity position is one of: the beginning of a subject, the interior of a subject, or outside any subject.
Optionally, the computer device performs subject sequence tagging on the word vectors with the potential entity relationship through the relationship-specific sequence labeling layer, obtaining a subject labeling result that indicates the first entity position of each word in the target text. The subject labeling result includes: B-SUB (the word belongs to a subject and is at its beginning), I-SUB (the word belongs to a subject and is inside it), and O (the word is outside any subject).
The subject sequence labeling process can be expressed as:

P_sub(i, j) = softmax(W_sub (h_i + u_j) + b_sub)

where W_sub is a trainable weight, b_sub is the bias term, h_i is the word vector of the i-th word in the target text, u_j is the relation vector of the j-th potential entity relationship, and h_i, u_j ∈ R^d.
Illustratively, as shown in fig. 5, the computer device inputs the word vector h output by the encoding layer 52 and the potential entity relationship r output by the potential relationship judgment layer 54 into the relationship-specific sequence labeling layer 55. The relationship-specific sequence labeling layer 55 concatenates the word vector h with the relation vector corresponding to the entity relationship r and performs subject labeling on the concatenated vector to obtain the subject labeling results: the result for "abdomen" is "B-SUB", the result for "pain" is "I-SUB", and the results for the remaining words are all "O".
Secondly, perform object sequence labeling based on the word vectors with the potential entity relationship to obtain an object labeling result, where the object labeling result represents the second entity position of each word vector with the potential entity relationship, and the second entity position is one of: the beginning of an object, the interior of an object, or outside any object.
Optionally, the computer device performs object sequence tagging on the word vectors with the potential entity relationship through the relationship-specific sequence labeling layer, obtaining an object labeling result that indicates the second entity position of each word in the target text. The object labeling result includes: B-OBJ (the word belongs to an object and is at its beginning), I-OBJ (the word belongs to an object and is inside it), and O (the word is outside any object).
The object sequence labeling process can be expressed as:

P_obj(i, j) = softmax(W_obj (h_i + u_j) + b_obj)

where W_obj is a trainable weight, b_obj is the bias term, h_i is the word vector of the i-th word in the target text, u_j is the relation vector of the j-th potential entity relationship, and h_i, u_j ∈ R^d.
Schematically, as shown in fig. 5, the relationship-specific sequence labeling layer 55 concatenates the word vector h with the relation vector corresponding to the entity relationship r and performs object labeling on the concatenated vector to obtain the object labeling results: the three characters of "paroxysmal" are labeled "B-OBJ", "I-OBJ", "I-OBJ" in sequence, the characters "right", "lower", and "abdomen" are labeled "B-OBJ", "I-OBJ", "I-OBJ" in sequence, and the object labeling results of the remaining words are all "O".
And thirdly, determining a target subject and a target object in the target text based on the subject labeling result and the object labeling result.
After subject and object sequence labeling is completed, the computer device determines the target subject and target object based on the subject and object labeling results of all words in the target text. In one possible implementation, the computer device takes the word labeled as the beginning of a subject, together with the consecutive words after it labeled as subject interior, as a target subject; likewise, it takes the word labeled as the beginning of an object, together with the consecutive words after it labeled as object interior, as a target object.
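The BIO-decoding rule just described (a B-tag opens a span, the run of following I-tags extends it) can be sketched as below; the tokenization and the joining with spaces are illustrative choices for an English demo, whereas Chinese characters would be concatenated directly.

```python
def decode_bio(tokens, labels, prefix):
    """Collect entity spans: a token tagged B-<prefix> starts a span and
    each following I-<prefix> token extends it; anything else closes it."""
    spans, cur = [], None
    for tok, lab in zip(tokens, labels):
        if lab == f"B-{prefix}":            # a new entity starts here
            if cur:
                spans.append(" ".join(cur))
            cur = [tok]
        elif lab == f"I-{prefix}" and cur:  # continue the current entity
            cur.append(tok)
        else:                               # outside: close any open span
            if cur:
                spans.append(" ".join(cur))
            cur = None
    if cur:
        spans.append(" ".join(cur))
    return spans

tokens = ["paroxysmal", "abdominal", "pain", ",", "right", "lower", "abdomen"]
subj   = ["O", "B-SUB", "I-SUB", "O", "O", "O", "O"]
obj    = ["B-OBJ", "O", "O", "O", "B-OBJ", "I-OBJ", "I-OBJ"]
print(decode_bio(tokens, subj, "SUB"))  # ['abdominal pain']
print(decode_bio(tokens, obj, "OBJ"))   # ['paroxysmal', 'right lower abdomen']
```

Decoding the subject and object tag sequences separately, as here, reproduces the two-pass design: the same token positions can participate in a subject span under one relation and an object span under another.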
Illustratively, as shown in FIG. 5, the computer device determines that the target subject is "abdominal pain" based on the words corresponding to "B-SUB" and "I-SUB", and determines the target objects to be "paroxysmal" and "right lower abdomen" based on the words corresponding to "B-OBJ" and "I-OBJ".
Illustratively, as shown in fig. 5, the computer device generates an entity relationship triple 56 based on the determined target subject, target object, and potential entity relationship.
In this embodiment, the computer device performs average pooling on the word vectors of each word in the target text to obtain the text vector corresponding to the target text, so that a potential entity relationship is screened out from the candidate entity relationships based on the text vector, the efficiency of subsequently performing subject and object extraction is improved, and the influence of the redundant entity relationship on the accuracy of the subject and object extraction is avoided.
In addition, in the embodiment, the word vectors and the relationship vectors corresponding to the potential entity relationships are fused to obtain a plurality of text representations with specific relationships, and the text representations are respectively subjected to subject sequence labeling and object sequence labeling, so that the accuracy of the subject-object labeling is improved, and the accuracy of the generated entity relationship triples is improved.
In an illustrative example, as shown in Table 1, when entity relationship extraction is performed on the text using the related-art scheme, the finally extracted entity relationship triple includes the incorrect entity relationship "nature" because no potential entity relationship screening is performed. When the scheme provided by this embodiment of the application is used, the entity relationship "nature" is filtered out in the potential entity relationship screening stage, so the finally extracted entity relationship triples do not contain it, improving the accuracy of entity relationship extraction.
Because the heuristic nearest-neighbor method is idealized while the word order of real-world text is diverse, generating entity relationship triples with the heuristic nearest-neighbor method introduces large errors. To further improve the accuracy of entity relationship extraction, in a possible implementation manner the entity relationship extraction model further includes a subject-object alignment layer, which aligns the subjects and objects output by the relationship-specific sequence labeling layer and outputs entity relationship triples in combination with the potential entity relationships. Optionally, as shown in fig. 6, step 407 may include the following steps:
In one possible implementation, the computer device combines the target subjects and target objects extracted under the same potential entity relationship (i.e., those having the same potential entity relationship) to obtain subject-object pairs. In an illustrative example, when the target subjects having the same potential entity relationship include subject A and subject B, and the target objects include object A and object B, the resulting subject-object pairs include: (subject A, object A), (subject A, object B), (subject B, object A), (subject B, object B).
In a possible implementation manner, the computer device learns a global correspondence matrix in advance during the training stage. When determining the confidence of a subject-object pair, the computer device looks up in the global correspondence matrix the confidence between the target subject and the target object of the pair and uses it as the confidence of the pair. The global correspondence matrix is a matrix formed by the confidences between different entities, that is, different positions in the matrix hold the confidences between different entities, and a higher confidence indicates a higher probability that an entity relationship exists between those entities.
Illustratively, the confidence corresponding to each position in the global correspondence matrix is calculated by the following formula:

P_global(i, j) = σ(W_g [h_i_sub; h_j_obj] + b_g)

where h_i_sub and h_j_obj are the subject vector representation of the i-th word and the object vector representation of the j-th word respectively, W_g is a trainable weight, σ is the sigmoid function, and b_g is the bias term.
It should be noted that the global correspondence matrix is learned before the relation-specific word vectors are constructed; that is, the global correspondence matrix is independent of entity relationships and focuses only on the entities themselves.
Illustratively, on the basis of fig. 5, as shown in fig. 7, the computer device determines the position of the target subject "abdominal pain" in the global correspondence matrix 571 and the positions of the target objects "paroxysmal" and "right lower abdomen" in the global correspondence matrix 571, thereby determining the confidence of the subject-object pair (abdominal pain, paroxysmal) and the confidence of the subject-object pair (abdominal pain, right lower abdomen).
Optionally, the computer device detects whether the confidence of the subject-object pair is higher than a confidence threshold, and if so, executes step 407C to generate an entity relationship triple based on the subject-object pair and the potential entity relationship; and if the confidence coefficient of the subject-object pair is lower than the confidence coefficient threshold value, the computer equipment filters the subject-object pair.
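The lookup-and-threshold filtering can be sketched as below; the matrix values, token positions, and the 0.5 threshold are made-up illustrations, not learned quantities.

```python
import numpy as np

def align_pairs(subjects, objects, M, threshold=0.5):
    """Subject-object alignment sketch: look up the confidence of each
    (subject, object) pair at the positions of their starting tokens in
    the global correspondence matrix M; keep pairs above the threshold."""
    kept = []
    for subj, i in subjects:
        for obj, j in objects:
            if M[i, j] > threshold:
                kept.append((subj, obj))
    return kept

M = np.zeros((7, 7))                 # n x n matrix for a 7-token text
M[1, 0] = 0.9                        # (abdominal pain, paroxysmal)
M[1, 4] = 0.8                        # (abdominal pain, right lower abdomen)
pairs = align_pairs([("abdominal pain", 1)],
                    [("paroxysmal", 0), ("right lower abdomen", 4)], M)
print(pairs)
```

Unlike the nearest-neighbor heuristic, this filter can keep several objects for one subject (as here) or reject an adjacent but unrelated pair, because the decision depends on the learned pairwise confidence rather than on distance in the text.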
Illustratively, as shown in fig. 7, the confidences of both the subject-object pair (abdominal pain, paroxysmal) and the subject-object pair (abdominal pain, right lower abdomen) are higher than the confidence threshold, so the computer device retains both pairs and generates the entity relationship triples 56 based on the pairs and the potential entity relationships "nature" and "location".
In this embodiment, the computer device determines the confidence of the subject-object pair by using the global correspondence matrix obtained through pre-learning, so that unreasonable subject-object pairs are filtered based on the confidence, subject-object alignment is realized, and the accuracy of the finally extracted entity relationship triples is improved.
In an illustrative example, as shown in Table 2, when entity relationship extraction is performed on the text using the related-art scheme, a large number of subject-object alignment errors occur (for example, adjacent but unrelated entities such as a rash and a leg are wrongly paired) because the heuristic nearest-neighbor method lacks constraints. When the scheme provided by this embodiment of the application is used, subject-object alignment is constrained by the confidence of each subject-object pair and wrongly aligned pairs are filtered out, improving the accuracy of the extraction results.
Table 2
On real case data, entity relationship extraction was performed with both the related-art scheme and the scheme provided by this embodiment of the application, using the F1-score of triple matching (a triple is considered correct only when the subject, the object, and the entity relationship between them are all correct) as the evaluation index; the evaluation results are shown in Table 3.
Table 3
| Scheme | Accuracy | Recall | F1-score |
| Related-art scheme | 80.0% | 88.2% | 83.9% |
| Scheme of this embodiment | 91.8% | 89.5% | 90.6% |
The evaluation results obtained using the F1-score of the entity relationship in the triple (ignoring subject-object accuracy) as the evaluation index are shown in Table 4.
Table 4
| Scheme | Accuracy | Recall | F1-score |
| Related-art scheme | 83.0% | 93.1% | 87.7% |
| Scheme of this embodiment | 92.8% | 96.2% | 94.5% |
The evaluation results obtained using the F1-score of the subject and object in the triple (ignoring the accuracy of the entity relationship) as the evaluation index are shown in Table 5.
Table 5
| Scheme | Accuracy | Recall | F1-score |
| Related-art scheme | 83.0% | 91.8% | 87.1% |
| Scheme of this embodiment | 94.0% | 92.3% | 93.1% |
Compared with the related-art scheme, the scheme provided by this embodiment of the application markedly improves entity relationship extraction performance through potential relationship judgment and subject-object alignment, leading the related-art scheme by a large margin in accuracy in particular.
The above embodiments describe the extraction process of entity relationships. Before the entity relationship extraction is implemented, a developer first needs to complete the entity relationship extraction model training by using a sample text, and the following describes a training process of the entity relationship extraction model by using an exemplary embodiment.
FIG. 8 is a flowchart illustrating an entity relationship extraction model training process provided by an exemplary embodiment of the present application. The embodiment is described by taking the method as an example for a computer device, and the method comprises the following steps.
Optionally, the computer device inputs the sample text into the coding layer of the entity relationship extraction model to obtain a sample word vector corresponding to each word in the sample text. The process of encoding the sample text by the encoding layer may refer to the above embodiments, and this embodiment is not described herein again.
Similar to the application process, optionally, the computer device inputs the sample word vector corresponding to each word into the potential relationship decision layer of the entity relationship extraction model to obtain the sample potential entity relationship corresponding to the sample text. The process of determining the sample potential entity relationship based on the sample word vector in the potential relationship judgment layer may refer to the above embodiments, which are not described herein again.
Similar to the application process, optionally, the computer device inputs the sample potential entity relationships and the sample word vector of each word into the relationship-specific sequence labeling layer of the entity relationship extraction model to obtain the sample subjects and sample objects in the sample text. For the subject-object labeling process of the relationship-specific sequence labeling layer, refer to the foregoing embodiments; details are not repeated here.
Step 804: determine the potential relationship judgment loss based on the entity relationship label corresponding to the sample text and the sample potential entity relationships.
In order to determine the judgment loss of the potential relationship judgment layer, the computer device uses the entity relationship label corresponding to the sample text as supervision and determines the difference between the entity relationship label and the sample potential entity relationships as the potential relationship judgment loss. The entity relationship label indicates the entity relationships contained in the sample text.
In one possible implementation, the computer device determines the multi-label binary cross-entropy loss between the entity relationship label and the sample potential entity relationships as the potential relationship judgment loss, which may be expressed as:

L_rel = -(1/n_r) Σ_{i=1..n_r} [ y_i log P_rel_i + (1 - y_i) log(1 - P_rel_i) ]

where n_r is the number of candidate entity relationships, y_i is the entity relationship label corresponding to the sample text, and P_rel_i is the predicted existence probability of the i-th candidate entity relationship in the sample text.
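A numeric sketch of this multi-label binary cross-entropy, with made-up labels and predicted probabilities (n_r = 3 candidate relations, of which "nature" and "location" are present):

```python
import numpy as np

def relation_judgment_loss(y, p, eps=1e-12):
    """Multi-label binary cross-entropy over the n_r candidate relations:
    L_rel = -(1/n_r) * sum_i [ y_i*log(p_i) + (1-y_i)*log(1-p_i) ]."""
    p = np.clip(p, eps, 1 - eps)  # guard the logs against 0 and 1
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1.0, 1.0, 0.0])   # gold labels: relations 1 and 2 present
p = np.array([0.9, 0.8, 0.1])   # predicted existence probabilities
loss = relation_judgment_loss(y, p)
print(round(loss, 4))  # 0.1446
```

Each candidate relation contributes an independent binary term, which is what makes the judgment a multi-label (rather than single-label) classification: any number of relations may be present in one text.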
In order to determine the judgment loss of the relationship-specific subject-object labeling layer, the computer device uses the word label of each word in the sample text as supervision and determines the difference between the word labels and the predicted labeling results as the subject-object judgment loss. The word labels are manually annotated BIO tags, each being at least one of: a B-SUB tag, an I-SUB tag, a B-OBJ tag, an I-OBJ tag, and an O tag.
In one possible implementation, the computer device determines a multi-class cross-entropy loss as the subject-object judgment loss, which can be expressed as:

L_seq = -(1/(n_pot × n)) Σ_{j=1..n_pot} Σ_{i=1..n} y_{i,j} log P_{i,j}

where n_pot is the number of potential entity relationships, n is the length of the sample text, y_{i,j} is the word label, and P_{i,j} is the probability that the i-th word vector belongs to the subject or the object under the j-th potential entity relationship.
Step 806: train the entity relationship extraction model based on the potential relationship judgment loss and the subject-object judgment loss, where the entity relationship extraction model is used to output entity relationship triples based on input text.
In a possible implementation manner, the computer device adopts a joint training strategy, optimizing the losses of the potential relationship judgment layer and the relationship-specific sequence labeling layer jointly by gradient descent; that is, the entity relationship extraction model is trained using the total of the potential relationship judgment loss and the subject-object judgment loss. The total loss of the entity relationship extraction model can be expressed as:

L = α·L_rel + β·L_seq

where α and β are loss weights.
In summary, in this embodiment of the application the entity relationship extraction model is trained based on the potential relationship judgment loss of the potential relationship judgment layer and the subject-object judgment loss of the relationship-specific subject-object labeling layer, so that during training the model learns both how to screen potential entity relationships and how to label subjects and objects. When the trained entity relationship extraction model is subsequently used to extract entity relationships, screening the potential entity relationships reduces false recalls caused by redundant entity relationships, improving both the accuracy and the efficiency of entity relationship extraction.
In another possible implementation, when the entity relationship extraction model includes a host-object alignment layer, on the basis of fig. 8, as shown in fig. 9, steps 807 to 809 may be further included after step 803, and step 806 may be replaced with step 8061.
Step 807: combine the sample subjects and sample objects into sample subject-object pairs.
Similar to the application process, the computer device combines the extracted sample subject and sample object (having the same sample potential entity relationship) to obtain a sample subject-object pair.
In step 808, a sample confidence of the sample subject-object pair is determined. In one possible implementation, the computer device determines the sample confidence of the sample subject-object pair according to the subject vector characterization corresponding to the sample subject and the object vector characterization corresponding to the sample object. For the manner of calculating the sample confidence, reference may be made to step 407B, which is not described herein again in this embodiment.
And step 809, determining the global correspondence loss based on the confidence label corresponding to the sample text and the sample confidence.
In order to determine the judgment loss of the subject-object alignment layer, the computer device uses the confidence labels of the subject-object pairs in the sample text (i.e., the manually labeled subject-object pairs) as supervision, and determines the difference between the sample confidence and the confidence label as the global correspondence loss. In one possible implementation, the computer device determines a multi-label binary cross-entropy loss between the confidence label and the sample confidence as the global correspondence loss, which may be expressed as:

L_global = -(1/n^2) * Σ_{i=1..n} Σ_{j=1..n} [ y_{i,j} * log(p_{i,j}) + (1 - y_{i,j}) * log(1 - p_{i,j}) ]

where n is the length of the sample text, y_{i,j} is the confidence label of the subject-object pair (word i, word j), and p_{i,j} is the sample confidence of (word i, word j).
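A minimal sketch of this multi-label binary cross-entropy over the n x n confidence matrix, assuming averaging over all word pairs (the exact normalization is an assumption):

```python
import numpy as np

def global_correspondence_loss(labels, confidences, eps=1e-9):
    """Binary cross-entropy over all (word i, word j) subject-object pairs.

    labels:      (n, n) 0/1 matrix, 1 where (i, j) is a labeled pair.
    confidences: (n, n) predicted sample confidences in (0, 1).
    Returns the loss averaged over the n * n entries of the matrix.
    """
    y = np.asarray(labels, dtype=float)
    p = np.clip(np.asarray(confidences, dtype=float), eps, 1 - eps)
    n = y.shape[0]
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).sum() / (n * n))
```

For a perfectly confident and correct prediction the loss approaches zero; confident wrong predictions are penalized heavily, which is the standard behavior of cross-entropy supervision.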
Step 8061, training an entity relationship extraction model based on the potential relationship judgment loss, the subject-object judgment loss, and the global correspondence loss.
In a possible implementation manner, the computer device adopts a joint training strategy, jointly optimizing the losses of the potential relationship judgment layer, the relationship-specific sequence labeling layer, and the subject-object alignment layer by gradient descent; that is, the entity relationship extraction model is trained using the total of the potential relationship judgment loss, the subject-object judgment loss, and the global correspondence loss. Optionally, the loss weights corresponding to the potential relationship judgment loss, the subject-object judgment loss, and the global correspondence loss are the same. The total loss of the entity relationship extraction model can be expressed as:

L_total = α * L_rel + β * L_seq + γ * L_global

where α, β, and γ are loss weights.
In this embodiment, the global correspondence loss is used as a part of the total loss of the entity relationship extraction model, so that the global correspondence matrix can be learned during training; unreasonable subject-object pairs can then be filtered with the global correspondence matrix during subsequent extraction, realizing subject-object alignment and improving the entity relationship extraction accuracy of the model.
Fig. 10 is a block diagram of a structure of an apparatus for extracting entity relationships in text according to an exemplary embodiment of the present application, where the apparatus includes:
the encoding module 1001 is configured to encode a target text to obtain a word vector corresponding to each word in the target text;
a relationship determining module 1002, configured to determine, based on the word vector corresponding to each word, a potential entity relationship corresponding to the target text, where a probability that the potential entity relationship exists in the target text is higher than a probability that other candidate entity relationships other than the potential entity relationship exist;
a subject-object determining module 1003, configured to determine, based on the potential entity relationship and the word vector corresponding to each word, a target subject and a target object in the target text, where the target subject and the target object belong to an entity;
an extraction module 1004, configured to extract entity relationship triplets from the target text based on the target subject, the potential entity relationships, and the target object.
Optionally, the relationship determining module 1002 includes:
the text vector determining unit is used for determining a text vector corresponding to the target text based on the word vector corresponding to each word;
the classification unit is used for classifying the text vectors through a full connection layer to obtain existence probabilities corresponding to various candidate entity relationships, wherein the existence probabilities refer to the probabilities of the candidate entity relationships existing in the target text;
a relationship determination unit to determine the potential entity relationship from the candidate entity relationships based on the probability of existence.
Optionally, the text vector determining unit is configured to:
and carrying out average pooling on the word vectors corresponding to the words to obtain the text vectors corresponding to the target text, wherein the text vectors and the word vectors have the same dimensionality.
Optionally, the subject-object determining module 1003 includes:
the fusion unit is used for fusing the word vector and the relation vector corresponding to the potential entity relation to obtain a word vector with the potential entity relation;
and the labeling unit is used for performing sequence labeling on the word vectors with the potential entity relationship and determining the target subject and the target object in the target text.
Optionally, the labeling unit is configured to:
performing main body sequence labeling on the word vectors with the potential entity relationship to obtain a main body labeling result, wherein the main body labeling result is used for representing a first entity position of the word vectors with the potential entity relationship, and the first entity position comprises a main body start, a main body interior or a main body exterior;
performing object sequence labeling based on the word vector with the potential entity relationship to obtain an object labeling result, wherein the object labeling result is used for representing a second entity position of the word vector with the potential entity relationship, and the second entity position comprises an object start, an object interior or an object exterior;
and determining the target subject and the target object in the target text based on the subject labeling result and the object labeling result.
Optionally, the extracting module 1004 includes:
the combination unit is used for combining the target subject and the target object with the same potential entity relationship to obtain at least one subject-object pair;
a confidence determining unit, configured to determine a confidence of each of the subject-object pairs;
and the generating unit is used for generating the entity relationship triple based on the target subject, the target object and the potential entity relationship in the subject-object pair if the confidence of the subject-object pair is higher than a confidence threshold.
Optionally, the confidence determining unit is configured to:
and determining the confidence coefficient between the target subject and the target object in the subject-object pair in a global corresponding relation matrix as the confidence coefficient of the subject-object pair, wherein the global corresponding relation matrix is a matrix formed by the confidence coefficients of different entities.
Optionally, the apparatus further comprises:
and the filtering module is used for filtering the subject-object pair if the confidence of the subject-object pair is lower than the confidence threshold.
Optionally, the apparatus further comprises: a training module to:
coding a sample text to obtain a sample word vector corresponding to each word in the sample text;
determining a sample potential entity relationship corresponding to the sample text based on the sample word vector corresponding to each word;
determining a sample subject and a sample object in the sample text based on the sample potential entity relationship and the sample word vector corresponding to each word;
determining a potential relationship judgment loss based on the entity relationship label corresponding to the sample text and the sample potential entity relationship;
determining the judgment loss of the subject and the object based on the sample subject, the sample object and the word labels corresponding to all words in the sample text;
training an entity relationship extraction model based on the potential relationship judgment loss and the subject-object judgment loss, wherein the entity relationship extraction model is used for outputting entity relationship triples based on input texts.
Optionally, the training module is further configured to:
combining the sample subject and the sample object into a sample subject-object pair;
determining a sample confidence of the sample subject-object pair;
determining a global correspondence loss based on the confidence label corresponding to the sample text and the sample confidence;
and training the entity relationship extraction model based on the potential relationship judgment loss, the subject-object judgment loss and the global corresponding relationship loss.
Optionally, the loss weights corresponding to the potential relationship judgment loss, the subject-object judgment loss, and the global correspondence loss are the same.
Optionally, the target subject and the target object are medical text entities, and the entity relationship between the target subject and the target object includes at least one of a location, a time, and a property.
To sum up, in the embodiment of the application, during entity relationship extraction, a potential entity relationship that exists in the target text with high probability is first determined based on the word vectors obtained by encoding the target text, and candidate entity relationships that exist with low probability are filtered out; a target subject and a target object are then determined from the target text based on the determined potential entity relationship and the word vectors; finally, an entity relationship triple including the target subject, the potential entity relationship, and the target object is extracted from the target text. By screening the candidate entity relationships to obtain the potential entity relationships before subject-object extraction, the scheme provided by the embodiment of the application can reduce false recall results caused by redundant entity relationships irrelevant to the target text, improving both the accuracy and the efficiency of entity relationship extraction in text.
It should be noted that the device provided in the above embodiment is illustrated only by the division into the functional modules described; in practical applications, the functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for details of the implementation process, refer to the method embodiments, which are not described herein again.
Referring to fig. 11, a schematic structural diagram of a computer device according to an exemplary embodiment of the present application is shown. The computer device 1300 includes a Central Processing Unit (CPU) 1301, a system memory 1304 including a random access memory 1302 and a read-only memory 1303, and a system bus 1305 connecting the system memory 1304 and the CPU 1301. The computer device 1300 also includes a basic Input/Output (I/O) system 1306, which facilitates information transfer between devices within the computer, and a mass storage device 1307 for storing an operating system 1313, application programs 1314, and other program modules 1315.
The basic input/output system 1306 includes a display 1308 for displaying information and an input device 1309, such as a mouse or a keyboard, for a user to input information. The display 1308 and the input device 1309 are both connected to the central processing unit 1301 through an input-output controller 1310 connected to the system bus 1305. The basic input/output system 1306 may also include the input-output controller 1310 for receiving and processing input from a number of other devices, such as a keyboard, a mouse, or an electronic stylus. Similarly, the input-output controller 1310 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1307 is connected to the central processing unit 1301 through a mass storage controller (not shown) connected to the system bus 1305. The mass storage device 1307 and its associated computer-readable media provide non-volatile storage for the computer device 1300. That is, the mass storage device 1307 may include a computer-readable medium (not shown), such as a hard disk or drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for the storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include Random Access Memory (RAM), Read-Only Memory (ROM), flash memory or other solid-state memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1304 and the mass storage device 1307 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1301, the one or more programs containing instructions for implementing the methods described above, and the central processing unit 1301 executes the one or more programs to implement the methods provided by the various method embodiments described above.
According to various embodiments of the present application, the computer device 1300 may also be operated through a remote computer connected to a network, such as the Internet. That is, the computer device 1300 may be connected to the network 1312 through the network interface unit 1311 connected to the system bus 1305, or the network interface unit 1311 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs, stored in the memory, that include instructions for performing the steps performed by the computer device in the methods provided by the embodiments of the present application.
The embodiment of the present application further provides a computer-readable storage medium, where at least one instruction is stored in the computer-readable storage medium, and the at least one instruction is loaded and executed by a processor to implement the method for extracting an entity relationship in a text according to any of the above embodiments.
Optionally, the computer-readable storage medium may include: ROM, RAM, Solid State Drives (SSD), or optical disks, etc. The RAM may include a resistive Random Access Memory (ReRAM) and a Dynamic Random Access Memory (DRAM), among others.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instruction from the computer-readable storage medium, and executes the computer instruction, so that the computer device executes the method for extracting the entity relationship in the text described in the above embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is intended to be exemplary only, and not to limit the present application, and any modifications, equivalents, improvements, etc. made within the spirit and scope of the present application are intended to be included therein.
Claims (15)
1. A method for extracting entity relationships in texts is characterized by comprising the following steps:
coding a target text to obtain a word vector corresponding to each word in the target text;
determining a potential entity relationship corresponding to the target text based on the word vector corresponding to each word, wherein the probability of the potential entity relationship existing in the target text is higher than the probability of other candidate entity relationships except the potential entity relationship existing in the target text;
determining a target subject and a target object in the target text based on the potential entity relationship and the word vector corresponding to each word, wherein the target subject and the target object belong to an entity;
and extracting entity relation triples from the target text based on the target subject, the potential entity relations and the target object.
2. The method of claim 1, wherein the determining potential entity relationships corresponding to the target text based on the word vectors corresponding to the respective words comprises:
determining a text vector corresponding to the target text based on the word vector corresponding to each word;
classifying the text vectors through a full-connection layer to obtain existence probabilities corresponding to various candidate entity relationships, wherein the existence probabilities refer to the probabilities of the candidate entity relationships existing in the target text;
determining the potential entity relationship from the candidate entity relationships based on the existence probabilities.
3. The method of claim 2, wherein determining a text vector corresponding to the target text based on the word vector corresponding to each word comprises:
and carrying out average pooling on the word vectors corresponding to the words to obtain the text vectors corresponding to the target text, wherein the text vectors and the word vectors have the same dimensionality.
4. The method according to any one of claims 1 to 3, wherein the determining the target subject and the target object in the target text based on the potential entity relationship and the word vector corresponding to each word comprises:
fusing the word vector and the relation vector corresponding to the potential entity relation to obtain a word vector with the potential entity relation;
performing sequence labeling based on the word vectors with the potential entity relationship, and determining the target subject and the target object in the target text.
5. The method of claim 4, wherein the determining the target subject and the target object in the target text based on the sequence labeling of the word vectors with the potential entity relationships comprises:
performing main body sequence labeling on the word vectors with the potential entity relationship to obtain a main body labeling result, wherein the main body labeling result is used for representing a first entity position of the word vectors with the potential entity relationship, and the first entity position comprises a main body start, a main body interior or a main body exterior;
performing object sequence labeling based on the word vector with the potential entity relationship to obtain an object labeling result, wherein the object labeling result is used for representing a second entity position of the word vector with the potential entity relationship, and the second entity position comprises an object start, an object interior or an object exterior;
and determining the target subject and the target object in the target text based on the subject labeling result and the object labeling result.
6. The method of any one of claims 1 to 3, wherein the extracting entity relationship triples from the target text based on the target subject, the potential entity relationships, and the target object comprises:
combining the target subject and the target object with the same potential entity relationship to obtain at least one subject-object pair;
determining the confidence of each host-object pair;
and if the confidence of the subject-object pair is higher than a confidence threshold, generating the entity relationship triple based on the target subject, the target object and the potential entity relationship in the subject-object pair.
7. The method of claim 6, wherein said determining a confidence level for each of said subject-object pairs comprises:
and determining the confidence coefficient between the target subject and the target object in the subject-object pair in a global corresponding relation matrix as the confidence coefficient of the subject-object pair, wherein the global corresponding relation matrix is a matrix formed by the confidence coefficients of different entities.
8. The method of claim 6, wherein after determining the confidence level of each of the subject-object pairs, the method further comprises:
and if the confidence coefficient of the subject-object pair is lower than the confidence coefficient threshold value, filtering the subject-object pair.
9. The method of any of claims 1 to 3, further comprising:
coding a sample text to obtain a sample word vector corresponding to each word in the sample text;
determining a sample potential entity relationship corresponding to the sample text based on the sample word vector corresponding to each word;
determining a sample subject and a sample object in the sample text based on the sample potential entity relationship and the sample word vector corresponding to each word;
determining a potential relationship judgment loss based on the entity relationship label corresponding to the sample text and the sample potential entity relationship;
determining the judgment loss of the subject and the object based on the sample subject, the sample object and the word labels corresponding to all words in the sample text;
training an entity relationship extraction model based on the potential relationship judgment loss and the subject-object judgment loss, wherein the entity relationship extraction model is used for outputting entity relationship triples based on input texts.
10. The method of claim 9, wherein after determining the sample subject and the sample object in the sample text, the method further comprises:
combining the sample subject and the sample object into a sample subject-object pair;
determining a sample confidence of the sample subject-object pair;
determining a global correspondence loss based on the confidence label corresponding to the sample text and the sample confidence;
the training entity relationship extraction model based on the potential relationship judgment loss and the subject-object judgment loss comprises the following steps:
and training the entity relationship extraction model based on the potential relationship judgment loss, the subject-object judgment loss and the global corresponding relationship loss.
11. The method of claim 10, wherein the potential relationship determination loss, the subject-object determination loss, and the global correspondence loss have the same loss weight.
12. The method of any one of claims 1 to 3, wherein the target subject and the target object are medical text entities, and the entity relationship between the target subject and the target object includes at least one of a location, a time, and a property.
13. An apparatus for extracting entity relationships in text, the apparatus comprising:
the encoding module is used for encoding a target text to obtain a word vector corresponding to each word in the target text;
a relationship determining module, configured to determine a potential entity relationship corresponding to the target text based on the word vector corresponding to each word, where a probability that the potential entity relationship exists in the target text is higher than a probability that other candidate entity relationships other than the potential entity relationship exist;
a subject-object determining module, configured to determine, based on the potential entity relationship and the word vector corresponding to each word, a target subject and a target object in the target text, where the target subject and the target object belong to an entity;
and the extraction module is used for extracting entity relation triples from the target text based on the target subject, the potential entity relation and the target object.
14. A computer device comprising a processor and a memory, wherein at least one instruction is stored in the memory, and is loaded and executed by the processor to implement the method for extracting entity relationships in text as claimed in any one of claims 1 to 12.
15. A computer-readable storage medium having stored thereon at least one instruction, which is loaded and executed by a processor, to implement the method for extracting entity relationships in text according to any one of claims 1 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110393735.5A CN113704392A (en) | 2021-04-13 | 2021-04-13 | Method, device and equipment for extracting entity relationship in text and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113704392A true CN113704392A (en) | 2021-11-26 |
Family
ID=78647981
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110393735.5A Pending CN113704392A (en) | 2021-04-13 | 2021-04-13 | Method, device and equipment for extracting entity relationship in text and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113704392A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114220505A (en) * | 2021-11-29 | 2022-03-22 | 中国科学院深圳先进技术研究院 | Information extraction method of medical record data, terminal equipment and readable storage medium |
CN114385787A (en) * | 2021-12-28 | 2022-04-22 | 北京惠及智医科技有限公司 | Medical text detection method, model training method and related device |
CN114528394A (en) * | 2022-04-22 | 2022-05-24 | 杭州费尔斯通科技有限公司 | Text triple extraction method and device based on mask language model |
CN114817562A (en) * | 2022-04-26 | 2022-07-29 | 马上消费金融股份有限公司 | Knowledge graph construction method, knowledge graph training method, information recommendation method and information recommendation device |
CN115309915A (en) * | 2022-09-29 | 2022-11-08 | 北京如炬科技有限公司 | Knowledge graph construction method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111444709B (en) | Text classification method, device, storage medium and equipment | |
CN113822494B (en) | Risk prediction method, device, equipment and storage medium | |
Logeswaran et al. | Sentence ordering and coherence modeling using recurrent neural networks | |
CN113704392A (en) | Method, device and equipment for extracting entity relationship in text and storage medium | |
CN109214006B (en) | Natural language reasoning method for image enhanced hierarchical semantic representation | |
CN114330354B (en) | Event extraction method and device based on vocabulary enhancement and storage medium | |
CN111680484B (en) | Answer model generation method and system for visual general knowledge reasoning question and answer | |
CN109344404A (en) | The dual attention natural language inference method of context aware | |
CN112131883B (en) | Language model training method, device, computer equipment and storage medium | |
CN113392209B (en) | Text clustering method based on artificial intelligence, related equipment and storage medium | |
CN113707307A (en) | Disease analysis method and device, electronic equipment and storage medium | |
CN111881292B (en) | Text classification method and device | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN114065848A (en) | Chinese aspect level emotion classification method based on pre-training emotion embedding | |
KR20220076419A (en) | Method for utilizing deep learning based semantic role analysis | |
CN111859979A (en) | Ironic text collaborative recognition method, ironic text collaborative recognition device, ironic text collaborative recognition equipment and computer readable medium | |
CN114757183B (en) | Cross-domain emotion classification method based on comparison alignment network | |
CN115935991A (en) | Multitask model generation method and device, computer equipment and storage medium | |
CN114357167B (en) | Bi-LSTM-GCN-based multi-label text classification method and system | |
CN115017879A (en) | Text comparison method, computer device and computer storage medium | |
US20220253630A1 (en) | Optimized policy-based active learning for content detection | |
Shen et al. | Student public opinion management in campus commentary based on deep learning | |
Tüselmann et al. | Recognition-free question answering on handwritten document collections | |
CN113869068A (en) | Scene service recommendation method, device, equipment and storage medium | |
CN117216617A (en) | Text classification model training method, device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||