CN112926332A - Entity relationship joint extraction method and device

Info

Publication number: CN112926332A
Application number: CN202110340031.1A
Authority: CN
Inventors: 陈培华
Original assignee: Good Diagnosis Shanghai Information Technology Co ltd
Current assignee: Good Diagnosis Shanghai Information Technology Co ltd
Priority date: 2021-03-30
Filing date: 2021-03-30
Publication date: 2021-06-08

Abstract

Provided herein are an entity relationship joint extraction method and device. The method comprises: acquiring text data to be predicted; and performing extraction on the text data to be predicted by using a pre-established entity relationship joint extraction model to predict the type of each word case interval and the relationship type of each entity phrase, wherein the type of a word case interval is either an entity type or a non-entity type, an entity word is a word case interval of an entity type, and the relationship type of an entity phrase is either a relationship or a non-relationship. The entity relationship joint extraction model preprocesses the text data, predicts the type of each word case interval according to the information obtained by preprocessing, and predicts the relationship type of each entity phrase according to the entity phrase and the character vector between the entity words in the entity phrase. Because the entity phrases and the character vectors between their entity words are both considered, text semantic information is enriched and all entity phrase relationship types in complex text data can be accurately extracted.

Description

Entity relationship joint extraction method and device

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a method and an apparatus for extracting entity relationships jointly.

Background

With the continuous development of medical informatization technology, extraction and structuring of useful information are urgently needed for a large amount of unstructured text data in health medical data such as health examination reports and electronic medical records, so that the data have greater value in practical application and production.

Entity relationship extraction from medical data is a core task in extracting information from unstructured text in the medical field and in constructing a knowledge graph for the health and medical field. In the prior art, there are two main methods for extracting entity relationships. One method extracts entity relationships in a serial (pipeline) manner: named entity recognition is performed first to recognize the relevant medical entities in a text, and a classification method is then used to obtain the relationship between every two entities. The other method is entity relationship joint extraction, in which a single model simultaneously identifies the medical entities in a text and determines the relationship type between every two entities.

The existing serial method for extracting entity relationships causes errors to propagate and accumulate and generates redundant information, so its effect is not ideal. The existing entity relationship joint extraction method performs clearly better than the serial method, but it does not consider the contextual semantic information of the characters between two entities. For text data with a complex structure (such as parallel entity words, overlapping entities, and overlapping relationships) and with many entities and relationships (up to hundreds), the effect of entity relationship extraction is still not ideal even with the assistance of expert experience.

Disclosure of Invention

The present disclosure is intended to solve the following problems in the prior art: the influence of the characters between entities on the entity relationship is not considered, text semantic information is not fully recognized, recognition precision is poor, and the existing methods are not suitable for scenes with complex entity relationships (such as parallel entity words, overlapping entity words, and overlapping relationships) or with many entity relationships.

In order to solve the above technical problem, a first aspect of the present disclosure provides an entity relationship joint extraction method, including:

acquiring text data to be predicted;

extracting the text data to be predicted by using a pre-established entity relationship joint extraction model, predicting to obtain the type of a word case interval and the relationship type of an entity phrase, wherein the type of the word case interval comprises an entity type and a non-entity type, an entity word is the word case interval of the entity type, and the relationship type of the entity phrase comprises a relationship and a non-relationship;

the entity relation joint extraction model is used for preprocessing text data to obtain word case intervals, word case interval vectors, word case interval length vectors and text vectors; predicting the type of the word case interval according to the information obtained by preprocessing; and predicting to obtain the relation type of the entity phrase according to the entity phrase and the character vector between the entity words in the entity phrase.

In a further embodiment herein, the entity relationship joint extraction method further includes:

and filtering the relationship type of the entity phrase obtained by prediction according to the allowable relationship constraint dictionary in the field to which the text data to be predicted belongs.
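The filtering step can be illustrated with a minimal Python sketch. The layout of the constraint dictionary is not specified herein; the mapping from an (entity type, entity type) pair to a set of allowable relationship types, the name `ALLOWED_RELATIONS`, and the example type names are all assumptions made for illustration only.

```python
# Hypothetical constraint dictionary (assumed layout):
# (head entity type, tail entity type) -> allowable relationship types.
ALLOWED_RELATIONS = {
    ("organ", "finding"): {"has_finding"},
    ("drug", "disease"): {"treats"},
}


def filter_relations(predictions, allowed=ALLOWED_RELATIONS):
    """Keep only predicted triples whose relationship type is allowed
    for the given pair of entity types.

    predictions: list of (head_type, relation_type, tail_type) tuples.
    """
    kept = []
    for head_type, relation, tail_type in predictions:
        if relation in allowed.get((head_type, tail_type), set()):
            kept.append((head_type, relation, tail_type))
    return kept


predicted = [
    ("organ", "has_finding", "finding"),
    ("organ", "treats", "finding"),  # not allowed for this type pair
    ("drug", "treats", "disease"),
]
filtered = filter_relations(predicted)
```

In this sketch a predicted relationship is simply discarded when its type pair is absent from the dictionary; other policies are possible.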

In a further embodiment herein, the entity-relationship joint extraction model comprises: the device comprises a preprocessing module and a classification module, wherein the classification module comprises an embedding layer, a first classifier, a transition layer and a second classifier;

the preprocessing module is used for preprocessing the text data to obtain word case intervals, word case interval vectors, word case interval length vectors and text vectors;

the embedded layer is connected with the preprocessing module and used for constructing a first vector according to information obtained by preprocessing;

the first classifier is connected with the embedded layer, and the type of the word case interval is obtained through prediction according to the first vector;

the transition layer is connected with the first classifier and the second classifier and used for screening out word case intervals of entity types to obtain entity words; splicing an entity phrase formed by every two entity words and character vectors between the entity words in the entity phrase into a second vector;

and the second classifier is used for predicting the relationship type of the entity phrase according to the second vector.

In a further embodiment of this document, the preprocessing module processes the text data to obtain a word case interval, a word case interval vector, a word case interval length vector, and a text vector, and includes:

performing word segmentation or character segmentation on the text data to obtain a word case list;

processing the word case list by using a BERT pre-training model to obtain a text vector and word case vectors corresponding to all word cases;

acquiring word case intervals according to the word case list and a preset sliding window;

applying a fusion function to the word case vectors contained in each word case interval to obtain a word case interval vector;

and acquiring a word case interval length vector according to the length of the word case interval.

In a further embodiment herein, constructing a first vector based on the preprocessed information comprises:

and splicing the word case interval vector, or the word case interval vector and the text vector, or the word case interval vector and the word case interval length vector, or the word case interval vector, the word case interval length vector and the text vector into a first vector.

In further embodiments herein, the first classifier comprises: the first classification function unit is used for outputting a probability vector of a word case interval type, and the first judgment unit is used for determining the type of the word case interval according to the probability vector of the word case interval type;

the second classifier includes: a second classification function unit and a second judgment unit; the second classification function unit is used for outputting the probability vector of the relationship type of the entity phrase, and the second judgment unit is used for determining the relationship type of the entity phrase according to the probability vector of the relationship type of the entity phrase.

In a further embodiment herein, the entity-relationship joint extraction model is trained by:

preprocessing the training text data by utilizing the preprocessing module to obtain word case intervals, word case interval vectors, word case interval length vectors and text vectors;

acquiring entity types of the word case intervals obtained by labeling and the association relation of entity phrases;

constructing a first vector according to the information obtained by preprocessing;

inputting the first vector into the first classifier, and predicting the type of a word case interval;

screening out word case intervals of entity types to obtain entity words, and splicing each entity phrase formed by every two entity words, together with the vector formed by the characters between the entity words in the entity phrase, into a second vector;

inputting the second vector into the second classifier, and predicting to obtain a relation type of the entity phrase;

and training parameters in the entity relation joint extraction model according to the entity type of the word case interval and the relation type of the entity phrase obtained by prediction, and the entity type of the word case interval and the relation type of the entity phrase obtained by labeling.

In a further embodiment herein, before constructing the first vector according to the preprocessed information, the method further includes:

comparing the word case interval with the entity words marked in advance, if one word case interval is the same as one of the entity words marked in advance, the word case interval is an entity positive sample case, otherwise, the word case interval is an entity negative sample case;

and sampling the entity negative sample cases according to a first preset value.

In a further embodiment herein, before determining the second vector, the method further comprises:

judging whether each entity phrase accords with the entity relationship labeled in advance; if an entity phrase accords with the entity relationship labeled in advance, the entity phrase is a relationship positive sample, otherwise, the entity phrase is a relationship negative sample;

and sampling the relation negative sample according to a second preset value.
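The two sampling steps (for entity negative sample cases and for relationship negative samples) follow the same pattern, which may be sketched as below. This is only an illustration: it assumes that the preset value is an upper bound on the number of negative samples kept, which is not stated herein, and the function name `sample_negatives` is hypothetical.

```python
import random


def sample_negatives(candidates, positives, max_negatives, seed=0):
    """Split candidates into positive and negative samples, then keep at
    most max_negatives randomly chosen negatives (assumed meaning of the
    preset value)."""
    positive_set = set(positives)
    pos = [c for c in candidates if c in positive_set]
    neg = [c for c in candidates if c not in positive_set]
    rng = random.Random(seed)  # fixed seed only for reproducibility
    if len(neg) > max_negatives:
        neg = rng.sample(neg, max_negatives)
    return pos, neg


# Candidate word case intervals as (start, end) index pairs.
candidates = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
labeled_entities = [(0, 2)]
pos, neg = sample_negatives(candidates, labeled_entities, max_negatives=2)
```

The same function can be applied to entity phrases by passing candidate pairs and the pre-labeled related pairs.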

The second aspect of the present disclosure also provides an entity relationship joint extraction apparatus, including:

the receiving module is used for acquiring text data to be predicted;

the extraction module is used for extracting the text data to be predicted by utilizing a pre-established entity relationship joint extraction model, predicting to obtain the type of a word case interval and the relationship type of an entity phrase, wherein the type of the word case interval comprises an entity type and a non-entity type, an entity word is the word case interval of the entity type, and the relationship type of the entity phrase comprises a relationship and a non-relationship;

According to the entity relationship joint extraction method and device, the recognition of the non-entity data and the non-relationship data is added, so that the entity relationship joint extraction model extracts the data of the non-entity type and the non-relationship type, and the recognition accuracy of different data types can be improved. When the entity phrase relationship type is identified, text semantic information is enriched by considering the character vectors between entity words, all entity phrase relationship types of complex text data can be accurately extracted, and further structuralization is carried out according to the obtained entity words and the obtained entity phrase relationship types, so that useful information is extracted.

In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments or technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram illustrating a structure of an entity-relationship joint extraction model according to an embodiment of the present disclosure;

FIG. 2 is a first flowchart illustrating a process of training a joint extraction model of entity relationships according to an embodiment herein;

FIG. 3 is a second flowchart illustrating the training process of the entity-relationship joint extraction model according to the embodiment herein;

FIG. 4 is a third flowchart of the entity-relationship joint extraction model training process according to the embodiment of the present disclosure;

FIG. 5 is a first flowchart of an entity relationship joint extraction method according to an embodiment herein;

FIG. 6 is a second flowchart of an entity relationship joint extraction method according to an embodiment herein;

FIG. 7 is a first block diagram of an entity relationship joint extraction apparatus according to an embodiment herein;

FIG. 8 is a second block diagram of an entity relationship joint extraction apparatus according to an embodiment of the present disclosure;

FIG. 9 is a block diagram illustrating a computer device according to an embodiment herein;

FIG. 10 is a diagram illustrating a labeling result of a word example interval according to an embodiment of the present disclosure.

Description of the symbols of the drawings:

110. a preprocessing module;

120. a classification module;

121. an embedding layer;

122. a first classifier;

123. a transition layer;

124. a second classifier;

710. a receiving module;

720. an extraction module;

730. a filtering module;

902. a computer device;

904. a processor;

906. a memory;

908. a drive mechanism;

910. an input/output module;

912. an input device;

914. an output device;

916. a presentation device;

918. a graphical user interface;

920. a network interface;

922. a communication link;

924. a communication bus.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments herein without making any creative effort, shall fall within the scope of protection.

The present specification provides method steps as described in the examples or flowcharts, but may include more or fewer steps based on routine or non-inventive labor. The order of steps recited in the embodiments is merely one manner of performing the steps in a multitude of orders and does not represent the only order of execution. When an actual system or apparatus product executes, it can execute sequentially or in parallel according to the method shown in the embodiment or the figures.

The entity relationship joint extraction method provided by the invention is suitable for identification of health medical text data (including but not limited to health examination reports, electronic medical records and the like), and can also be applied to identification of text data in other professional fields (such as legal documents and the like), and the cited field is not particularly limited.

The term "entity" as used herein refers to a word or phrase having a descriptive meaning, typically the name of a person, a place, an organization, or a product, or content that is meaningful within a certain field, such as a disease, a drug, the name of an organism, or specialized vocabulary related to law.

The term "entity relationship" as used herein refers to the relationship of different entities to each other. Entities are not independent of one another and often have certain associations.

Although the entity relationship joint extraction methods in the prior art can identify entity words and entity phrase relationship types to a certain extent, they cannot identify non-entity words or entity phrases of the non-relationship type, do not consider the influence of the characters between entities on entity relationships, do not fully identify text semantic information, have poor identification precision, and are not suitable for scenes with complex entity relationships (such as parallel entity words, overlapping entity words, and overlapping relationships).

In order to solve the technical problems, a novel entity relationship joint extraction model is established in advance, and is used for preprocessing text data to obtain word case intervals, word case interval vectors, word case interval length vectors and text vectors; predicting the type of the word case interval according to the information obtained by preprocessing; and predicting to obtain the relation type of the entity phrase according to the entity phrase and the character vector between the entity words in the entity phrase. The type of the word case interval comprises an entity type and a non-entity type, the entity word is the word case interval of the entity type, the word case interval of the non-entity type is a non-entity word, and the relationship type of the entity phrase comprises a relationship type and a non-relationship type.

Specifically, as shown in fig. 1, the entity-relationship joint extraction model includes: a preprocessing module 110 and a classification module 120, wherein the classification module 120 includes: an embedding layer 121, a first classifier 122, a transition layer 123, and a second classifier 124.

The preprocessing module 110 is configured to preprocess the text data to obtain a word case interval, a word case interval vector, a word case interval length vector, and a text vector;

the embedded layer 121 is connected to the preprocessing module, and is configured to construct a first vector according to information obtained by preprocessing;

the first classifier 122 is connected with the embedded layer 121, and predicts the type of the word case interval according to the first vector;

the transition layer 123 is connected to the first classifier 122 and the second classifier 124, and is configured to screen out word case intervals of the entity type to obtain entity words; splicing an entity phrase formed by every two entity words and character vectors among the entity words in the entity phrase into a second vector;

the second classifier 124 is configured to predict a relationship type of the entity phrase according to the second vector, where the relationship type of the entity phrase includes a relationship and a non-relationship.

The entity relation extraction model provided by the text comprises learning of non-entities and non-relation types, and can improve identification precision. In addition, when the entity relation extraction model identifies the entity phrase relation types, text semantic information is enriched by considering the character vectors between entity words, so that the semantic information expression is richer, and all entity phrase relation types of complex text data can be accurately extracted.

In detail, the word case intervals described herein are determined as follows: word segmentation or character segmentation is performed on the text data to obtain a word case (token) list, and the word case intervals are acquired according to the word case list and a preset sliding window. The word case list is the list of words (or characters) corresponding to the text data; for example, for the text data "liver size is normal", the corresponding word case list contains the words (or characters) of that text in their original order. The word case intervals comprise entity positive sample cases (intervals that match a pre-labeled entity word, namely word case intervals of an entity type) and entity negative sample cases (intervals other than the pre-labeled entity words, namely word case intervals of the non-entity type), so that the words in the word case list are divided comprehensively. The preset sliding window can be determined according to the maximum number of words in an entity word in the text, and its specific value is not limited herein. Assuming that the preset sliding window size is 3, the word case intervals corresponding to the word case list of "liver size is normal" are the contiguous runs of one, two, or three adjacent word cases, starting from each position in the list in turn.
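The acquisition of word case intervals with a preset sliding window may be sketched as follows. The sketch assumes that every contiguous run of at most `max_window` word cases is a candidate interval; the function name and the placeholder word cases `t1` to `t6` are illustrative only.

```python
def enumerate_intervals(tokens, max_window):
    """Return every contiguous interval (start, end), end exclusive,
    that covers between 1 and max_window word cases."""
    intervals = []
    for start in range(len(tokens)):
        for length in range(1, max_window + 1):
            end = start + length
            if end > len(tokens):
                break  # interval would run past the end of the list
            intervals.append((start, end))
    return intervals


# Placeholder word case list with six word cases.
tokens = ["t1", "t2", "t3", "t4", "t5", "t6"]
intervals = enumerate_intervals(tokens, max_window=3)
first_texts = ["".join(tokens[s:e]) for s, e in intervals[:3]]
```

With six word cases and a window of 3, this enumeration yields 15 candidate intervals, most of which are entity negative sample cases in practice.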

The word case interval vector described herein is a vector obtained from all the words or characters (word cases) included in the word case interval. For example, if the word case interval is "liver large", then under word segmentation the corresponding word case vectors are those of the word cases {"liver", "large"}, and under character segmentation they are those of the word cases {"liver", "viscera", "large"}, one per character of the original text. In specific implementation, a fusion function (such as maximum pooling) is applied to the word case vectors included in the word case interval to obtain the word case interval vector. The word case interval length vector described herein is mainly used to represent the text length information corresponding to a word case interval. In specific implementation, the word case interval length vector is obtained according to the length of the word case interval; it can be a fixed-length vector whose parameters are randomly initialized, and each parameter in the vector can be learned through model training.
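A minimal sketch of the fusion function and of the length vector lookup is given below in plain Python. Maximum pooling is the fusion function named above; the table of randomly initialized length vectors, its dimension, and the helper names are assumptions, and no training is performed here.

```python
import random


def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def make_length_embeddings(max_length, dim, seed=0):
    """One randomly initialized vector per interval length 1..max_length.
    In a real model these parameters would be learned during training."""
    rng = random.Random(seed)
    return {
        length: [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        for length in range(1, max_length + 1)
    }


# Toy word case vectors (dimension 3), one per word case.
token_vectors = [[0.1, 0.9, -0.3], [0.4, 0.2, 0.0], [-0.5, 0.7, 0.8]]
interval = (0, 2)  # covers the first two word cases
interval_vector = max_pool(token_vectors[interval[0]:interval[1]])

length_embeddings = make_length_embeddings(max_length=3, dim=2)
length_vector = length_embeddings[interval[1] - interval[0]]
```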

The text vector described herein is a vector representation corresponding to the text data, and may be obtained by processing the word case list with a pre-training model, including but not limited to BERT (Bidirectional Encoder Representations from Transformers). For the implementation of pre-training models such as BERT, reference may be made to the prior art, and details are not described herein.

The entity phrase described herein refers to a phrase consisting of any two entity words among all the entity words. In specific implementation, the second vector is formed in the following order: the first entity word, the character vector between the first entity word and the second entity word, and the second entity word.

The number of first vectors constructed by the embedding layer 121 is the same as the number of word case intervals; for example, if there are 100 word case intervals, there are 100 corresponding first vectors.

If the processing capability of the first classifier 122 is large, the first vector constructed by the embedding layer 121 may be input into the first classifier 122 at a time, and if the processing capability of the first classifier 122 is small, the first vector constructed by the embedding layer 121 may be input into the first classifier 122 in batches.

The number of second vectors constructed by the transition layer 123 is the same as the number of entity phrases, specifically … (n-1), where n is the number of entity words.

If the processing capacity of the second classifier 124 is large, the second vector constructed by the transition layer 123 may be input into the second classifier 124 once, and if the processing capacity of the second classifier 124 is small, the second vector constructed by the transition layer 123 may be input into the second classifier 124 in batches.
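Feeding vectors to a classifier in batches may be sketched as below. The helper name `batches` and the batch size are illustrative assumptions; the same helper applies to the first vectors and to the second vectors.

```python
def batches(vectors, batch_size):
    """Yield consecutive slices of at most batch_size vectors."""
    for start in range(0, len(vectors), batch_size):
        yield vectors[start:start + batch_size]


# Five toy vectors processed two at a time; the last batch is smaller.
vectors = [[float(i)] for i in range(5)]
batch_sizes = [len(batch) for batch in batches(vectors, batch_size=2)]
```

Passing a batch size equal to the number of vectors corresponds to inputting all vectors at once.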

In specific implementation, for different application scenes, an entity relationship joint extraction model can be established in advance according to the training text data in the corresponding application scene, and then the established entity relationship joint extraction model is utilized to predict the entity relationship.

In an embodiment herein, constructing the first vector according to the preprocessed information may be performed in one of the following manners:

(1) splicing the word case interval vectors into a first vector;

(2) splicing the word case interval vector and the text vector into a first vector;

(3) splicing the word case interval vector and the word case interval length vector into a first vector, wherein the word case interval length vector is output by the preprocessing module and is used for limiting the length of an entity word (for example, to 8 characters or 10 characters); the word case interval length vector can be a fixed-length vector whose parameters can be learned through model training;

(4) and splicing the word case interval vector, the word case interval length vector and the text vector into a first vector.

Preferably, in order to enrich text semantic information and improve the recognition accuracy of the word case interval type, the first vector is determined in manner (4).
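The four splicing manners differ only in which optional vectors are appended, as the following sketch shows. The function name `build_first_vector` and the toy values are hypothetical; the splicing order follows the order in which the vectors are listed in manner (4).

```python
def build_first_vector(interval_vector, length_vector=None, text_vector=None):
    """Splice (concatenate) the available vectors into a first vector.

    Manner (1): interval vector only.
    Manner (2): interval vector + text vector.
    Manner (3): interval vector + length vector.
    Manner (4): interval vector + length vector + text vector.
    """
    first = list(interval_vector)
    if length_vector is not None:
        first += list(length_vector)
    if text_vector is not None:
        first += list(text_vector)
    return first


manner_1 = build_first_vector([1.0, 2.0])
manner_4 = build_first_vector([1.0, 2.0], length_vector=[3.0], text_vector=[4.0, 5.0])
```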

In an embodiment herein, splicing an entity phrase composed of every two entity words and the character vector between the entity words in the entity phrase into a second vector further includes:

splicing, in the order in which the two entity words appear, the first entity word vector (namely its word case interval vector) and the first entity word length vector (namely its word case interval length vector), the character vector composed of the characters between the two entity words, and the second entity word vector (namely its word case interval vector) and the second entity word length vector (namely its word case interval length vector).

In practical implementation, if there are no characters between the first entity word and the second entity word, the character vector between the first entity word and the second entity word is represented by a zero vector.

In specific implementation, in order to further enrich text semantic information, a text vector can be added to the second vector.
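The construction of the second vector may be sketched as follows. How the characters between the two entity words are reduced to a single vector is not specified herein; maximum pooling over their word case vectors is assumed here, and the function names and dictionary keys are hypothetical.

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    return [max(column) for column in zip(*vectors)]


def build_second_vector(token_vectors, first, second, text_vector=None):
    """first/second: dicts with 'span' (start, end), 'vector' and
    'length_vector'; 'first' appears before 'second' in the text."""
    between = token_vectors[first["span"][1]:second["span"][0]]
    dim = len(token_vectors[0])
    # Zero vector when nothing lies between the two entity words.
    between_vector = max_pool(between) if between else [0.0] * dim
    spliced = (first["vector"] + first["length_vector"] + between_vector
               + second["vector"] + second["length_vector"])
    if text_vector is not None:
        spliced = spliced + list(text_vector)  # optional enrichment
    return spliced


token_vectors = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0], [0.5, 0.5]]
e1 = {"span": (0, 1), "vector": [1.0, 0.0], "length_vector": [0.1]}
e2 = {"span": (3, 4), "vector": [0.5, 0.5], "length_vector": [0.1]}
e3 = {"span": (1, 2), "vector": [0.0, 2.0], "length_vector": [0.1]}

separated = build_second_vector(token_vectors, e1, e2)
adjacent = build_second_vector(token_vectors, e1, e3)
```

For the adjacent pair, the middle segment of the second vector is the zero vector, as described above.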

In one embodiment, to identify the entity types and the non-entity type of word case intervals, the first classifier 122 includes a first classification function unit and a first judgment unit. The first classification function unit is configured to output a probability vector of the word case interval type and may adopt a softmax function. The word case interval types include the entity types and the non-entity type; for example, if there are 5 entity types, adding the 1 non-entity type gives 6 types, and the softmax function unit outputs a probability vector of 6 elements, for example {entity type 1 probability, entity type 2 probability, entity type 3 probability, entity type 4 probability, entity type 5 probability, non-entity type probability}, where the sum of all probabilities in the probability vector is 1.

The first judgment unit is configured to determine the type of the word case interval according to the probability vector of the word case interval type. In specific implementation, the type corresponding to the maximum probability value in the probability vector output by the softmax function unit may be taken as the type of the word case interval. For example, if the probability vector output by the softmax function unit is {entity type 1 probability 0.5, entity type 2 probability 0.25, entity type 3 probability 0.1, entity type 4 probability 0.1, entity type 5 probability 0.05, non-entity type probability 0}, the type of the word case interval determined by the first judgment unit is entity type 1.
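The first classification function unit and the first judgment unit may be sketched as a softmax followed by selection of the maximum probability. The label names and the example scores are illustrative assumptions.

```python
import math

LABELS = [
    "entity type 1", "entity type 2", "entity type 3",
    "entity type 4", "entity type 5", "non-entity",
]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decide_type(probabilities, labels=LABELS):
    """Return the label with the maximum probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return labels[best]


# Probability vector from the example above.
probabilities = [0.5, 0.25, 0.1, 0.1, 0.05, 0.0]
decided = decide_type(probabilities)

# Softmax over arbitrary raw scores for six types.
soft = softmax([2.0, 1.0, 0.1, 0.0, 0.0, -1.0])
```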

In one embodiment, to identify the relationship type between entity words, the second classifier includes a second classification function unit and a second judgment unit. The second classification function unit is used for outputting the probability vector of the relationship types of an entity phrase. A plurality of relationships may exist between the same two entity words; for example, in "Zhou Jielun composed and sang 'Qilixiang'", 2 relationships exist: (Qilixiang, singer, Zhou Jielun) and (Qilixiang, composer, Zhou Jielun). To handle this, a sigmoid function may be adopted, for example. The relationship types of an entity phrase include the relationship types and the non-relationship type; the relationship type probabilities and the non-relationship type probability are independent of one another, and their sum need not be 1. The second judgment unit is used for determining the relationship type of the entity phrase according to the probability vector of the relationship types of the entity phrase. In specific implementation, the relationship type of the entity phrase can be determined according to a preset probability upper limit.
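The second classification function unit and the second judgment unit may be sketched as independent sigmoid outputs compared against a preset value. The sketch assumes that every relationship type whose probability reaches the preset value is kept; the value 0.5, the raw scores, and the function names are illustrative assumptions.

```python
import math


def sigmoid(x):
    """Map a raw score to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def decide_relations(scores, relation_types, preset_value=0.5):
    """Return all relationship types whose probability reaches the
    preset value, together with the full probability vector."""
    probs = [sigmoid(s) for s in scores]
    chosen = [r for r, p in zip(relation_types, probs) if p >= preset_value]
    return chosen, probs


relation_types = ["singer", "composer", "non-relationship"]
chosen, probs = decide_relations([2.0, 1.5, -3.0], relation_types)
```

Because each probability is computed independently, two relationship types can be selected for the same entity phrase, and the probabilities do not sum to 1.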

In an embodiment of the present disclosure, as shown in fig. 2, the entity-relationship joint extraction model shown in fig. 1 is trained as follows:

step 210, preprocessing the training text data by using a preprocessing module to obtain a word case interval, a word case interval vector, a word case interval length vector and a text vector;

step 220, acquiring entity types of the marked word case intervals and the association relation of entity phrases;

step 230, constructing a first vector according to the information obtained by preprocessing;

step 240, inputting the first vector into the first classifier, and predicting the type of the word case interval;

step 250, screening out word case intervals of entity types to obtain entity words;

step 260, splicing an entity phrase consisting of every two entity words and character vectors among the entity words in the entity phrase into a second vector;

step 270, inputting the second vector into the second classifier, and predicting to obtain a relationship type of the entity phrase;

step 280, training the parameters of the entity relationship joint extraction model according to the predicted entity types of the word case intervals and relationship types of the entity phrases, and the labeled entity types of the word case intervals and relationship types of the entity phrases. The finally obtained model parameters are substituted back into the preprocessing module and the classification module of the entity relationship joint extraction model, and the preprocessing module and the classification module are combined to form the entity relationship joint extraction model.

Specifically, the training text data described herein is historically generated data. Taking the analysis of health medical text data as an example, the training text data is historically generated health medical text data, which may be acquired from a hospital, a physical examination institution, or a patient; the manner of acquiring the training text data is not limited herein. In specific implementation, the training text data can be divided into a training set, a verification set, and a test set according to a certain proportion. The training set is used for training the entity relationship joint extraction model, the verification set is used for evaluating and adjusting the parameters of the entity relationship joint extraction model, and the test set is used for testing the generalization capability of the entity relationship joint extraction model.
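Dividing the training text data into three sets may be sketched as below. The proportion 8:1:1, the random shuffling, and the function name `split_dataset` are assumptions; no particular proportion is prescribed herein.

```python
import random


def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Shuffle the samples and split them into training, verification
    and test sets according to the given ratios."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n_train = int(len(items) * ratios[0])
    n_val = int(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


texts = ["report %d" % i for i in range(10)]
train, val, test = split_dataset(texts)
```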

Step 210 includes: performing word segmentation or character segmentation on the training text data to obtain a word case list; processing the word case list with a BERT pre-training model to obtain a text vector and the word case vectors corresponding to all word cases; acquiring word case intervals according to the word case list and a preset sliding window; and passing the word case vectors contained in each word case interval through a fusion function to obtain the word case interval vector.
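The sliding-window and fusion steps can be sketched as below. This is a sketch under stated assumptions: the fusion function is taken to be an element-wise maximum (the text does not name one), the toy two-dimensional vectors stand in for BERT outputs, and the names fuse and span_vectors are hypothetical.

```python
def fuse(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors (assumed fusion)."""
    return [max(column) for column in zip(*vectors)]

def span_vectors(token_vectors, max_width):
    """Map each (start, end) span, end exclusive, of 1..max_width tokens to its fused vector."""
    n = len(token_vectors)
    return {
        (start, start + width): fuse(token_vectors[start:start + width])
        for start in range(n)
        for width in range(1, max_width + 1)
        if start + width <= n
    }

# Toy word case vectors; a real system would take these from the BERT model.
token_vectors = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5]]
vectors = span_vectors(token_vectors, max_width=2)
```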

The entity types of the word case intervals and the association relationships of the entity phrases in step 220 may be labeled manually. In specific implementation, an annotator may label the word case intervals generated in step 210, or may directly analyze and label the training text data.

In step 230, the first vector may be constructed in one of the following ways: using the word case interval vector alone as the first vector; splicing the word case interval vector and the text vector into the first vector; splicing the word case interval vector and the word case interval length vector into the first vector; or splicing the word case interval vector, the word case interval length vector and the text vector into the first vector. The word case interval vector and the word case interval length vector contain parameters to be adjusted in the training stage and are not fixed values.
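The four construction options amount to concatenating the word case interval vector with whichever optional parts are used. A minimal sketch, in which the helper name build_first_vector, the splicing order, and all numeric values are illustrative assumptions:

```python
def build_first_vector(span_vector, length_vector=None, text_vector=None):
    """Concatenate the span vector with whichever optional parts are supplied."""
    first_vector = list(span_vector)
    if length_vector is not None:
        first_vector += list(length_vector)
    if text_vector is not None:
        first_vector += list(text_vector)
    return first_vector

# Toy values; in the described model these are learned or produced by BERT.
length_embeddings = {1: [0.0, 1.0], 2: [1.0, 0.0]}  # one vector per interval length
span_vector = [0.4, 0.9]
text_vector = [0.3, 0.3]

first_vector = build_first_vector(span_vector, length_embeddings[2], text_vector)
span_only = build_first_vector(span_vector)
```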

The parameters in the first classifier used in step 240 are parameters to be adjusted in the training stage, and are not fixed values.

When the above step 260 is implemented, the text vector and the word case interval length vector corresponding to each entity word in the entity word group are also added to the second vector.
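The core of the second vector can be sketched as follows. The splicing order (first entity word, context between the entity words, second entity word), the use of an element-wise maximum to fuse the character vectors between the entity words, and the zero vector for adjacent entity words are all assumptions; the text vector and length vectors mentioned above could be appended in the same way.

```python
def max_pool(vectors, dim):
    """Element-wise maximum; returns a zero vector when there is nothing to pool."""
    if not vectors:
        return [0.0] * dim
    return [max(column) for column in zip(*vectors)]

def build_second_vector(token_vectors, head_span, tail_span, span_vectors):
    """Concatenate head span vector, pooled context between the spans, and tail span vector."""
    dim = len(token_vectors[0])
    left, right = sorted([head_span, tail_span])
    context = token_vectors[left[1]:right[0]]  # tokens strictly between the two spans
    return span_vectors[head_span] + max_pool(context, dim) + span_vectors[tail_span]

# Toy vectors for a four-token text with entity words at spans (0, 1) and (3, 4).
token_vectors = [[0.1, 0.9], [0.4, 0.2], [0.7, 0.5], [0.3, 0.3]]
span_vectors = {(0, 1): [0.1, 0.9], (3, 4): [0.3, 0.3]}
second_vector = build_second_vector(token_vectors, (3, 4), (0, 1), span_vectors)
```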

The parameters in the second classifier used in step 270 are the parameters to be adjusted in the training phase.

Step 280 may be implemented as follows: a loss function L is established according to the predicted entity types of the word case intervals and relationship types of the entity phrases, together with the labeled entity types of the word case intervals and relationship types of the entity phrases. The loss function L comprises two parts: the first classifier loss L₁ (cross entropy is used here) and the second classifier loss L₂ (binary cross entropy is used here). The smaller the loss function, the higher the accuracy of the model and the better the model extracts entity relationship combinations from the text. The loss function is defined as follows:

wherein y₁ is the entity type of the labeled word case interval, ŷ₁ is the entity type of the word case interval obtained by prediction, y₂ is the relationship type of the labeled entity phrase, ŷ₂ is the relationship type of the entity phrase obtained by prediction, and λ is a parameter.
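The formula itself is not reproduced in this text. As a hedged illustration only, one form consistent with the description is L = L₁ + λ·L₂, with cross entropy for the first classifier and binary cross entropy for the second; the way λ enters the sum, and the function names below, are assumptions rather than the definition given in the original.

```python
import math

def cross_entropy(true_index, probs):
    """Multi-class cross entropy for one sample given a probability vector."""
    return -math.log(probs[true_index])

def binary_cross_entropy(targets, probs):
    """Mean binary cross entropy over independent 0/1 targets."""
    terms = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
             for t, p in zip(targets, probs)]
    return sum(terms) / len(terms)

def joint_loss(span_true, span_probs, rel_true, rel_probs, lam=1.0):
    """Assumed combination: L = L1 + lam * L2."""
    l1 = cross_entropy(span_true, span_probs)
    l2 = binary_cross_entropy(rel_true, rel_probs)
    return l1 + lam * l2

# Toy predictions: one word case interval (3 types) and two relationship decisions.
loss = joint_loss(span_true=1, span_probs=[0.2, 0.7, 0.1],
                  rel_true=[1, 0], rel_probs=[0.9, 0.2], lam=0.5)
better = joint_loss(span_true=1, span_probs=[0.05, 0.9, 0.05],
                    rel_true=[1, 0], rel_probs=[0.99, 0.01], lam=0.5)
```

In this sketch, predictions closer to the labels yield a smaller loss, which matches the statement that a smaller loss corresponds to a more accurate model.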

In specific implementation, the loss function is used to judge whether the predicted entity types of the word case intervals and relationship types of the entity phrases are sufficiently close to the labeled entity types of the word case intervals and relationship types of the entity phrases. If not, parameter adjustment continues, and different parameter adjustment step lengths can be set for the preprocessing module, the first classifier and the second classifier through their respective step length adjustment parameters.

In an embodiment of this document, as shown in fig. 3, before the step 230 constructs the first vector according to the preprocessed information, the method further includes:

step 221, comparing the word case interval with the entity words labeled in advance, if a word case interval is the same as one of the entity words labeled in advance, the word case interval is an entity positive example, otherwise, the word case interval is an entity negative example.

For example, the training text data is "肝脏大小正常" ("liver size is normal"), and the pre-labeled entity words are "肝脏" (liver), "大小" (size), and "正常" (normal). When the sliding window size is 3, the word case intervals are all runs of one to three consecutive characters: "肝", "肝脏", "肝脏大", "脏", "脏大", "脏大小", "大", "大小", "大小正", "小", "小正", "小正常", "正", "正常", and "常". By comparing the word case intervals with the pre-labeled entity words, it can be determined that the entity positive examples are "肝脏", "大小", and "正常", and the entity negative examples are the remaining twelve intervals: "肝", "肝脏大", "脏", "脏大", "脏大小", "大", "大小正", "小", "小正", "小正常", "正", and "常".
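The comparison in step 221 can be sketched as below, assuming the example sentence is the six-character Chinese text 肝脏大小正常 ("liver size is normal") processed character by character. The function names are hypothetical, and matching by text alone is a simplification: a fuller implementation would probably compare positions as well, since the same string may occur more than once.

```python
def enumerate_spans(tokens, max_width):
    """Return every span of 1..max_width consecutive tokens as (start, end, text)."""
    spans = []
    for start in range(len(tokens)):
        for width in range(1, max_width + 1):
            end = start + width
            if end <= len(tokens):
                spans.append((start, end, "".join(tokens[start:end])))
    return spans

def label_spans(spans, gold_entities):
    """Split spans into positives (text matches a labeled entity word) and negatives."""
    positives = [s for s in spans if s[2] in gold_entities]
    negatives = [s for s in spans if s[2] not in gold_entities]
    return positives, negatives

tokens = list("肝脏大小正常")        # character-level word cases
gold = {"肝脏", "大小", "正常"}      # liver, size, normal
spans = enumerate_spans(tokens, max_width=3)
positives, negatives = label_spans(spans, gold)
print(len(spans), len(positives), len(negatives))  # 15 3 12
```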

step 222, sampling the entity negative examples according to a first preset value. After the sampling of the entity negative examples is completed, the subsequent step 230 is continued. The first preset value may be determined according to the computing power of the computing device, which is not specifically limited herein.

As can be seen from the above step, the entity negative examples far outnumber the entity positive examples. To reduce the calculation amount of the model, step 222 is therefore added: by randomly sampling the entity negative examples, a number of entity negative examples not exceeding the first preset value is obtained, and the entity type corresponding to these entity negative examples is the non-entity type.

It should be noted here that the sampling process for the entity negative examples only exists in the model training process, and all the word example intervals need to be reserved in the model prediction process to determine all possible entity words.
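A minimal sketch of this training-only sampling, assuming uniform random sampling without replacement; the function name sample_negatives, the training flag, and the seed parameter are illustrative.

```python
import random

def sample_negatives(negatives, max_count, training=True, seed=None):
    """Randomly keep at most max_count negatives in training; keep all at prediction."""
    if not training or len(negatives) <= max_count:
        return list(negatives)
    return random.Random(seed).sample(negatives, max_count)

# Hypothetical negative examples; max_count plays the role of the preset value.
negatives = [f"span_{i}" for i in range(12)]
train_negatives = sample_negatives(negatives, max_count=5, seed=0)
predict_candidates = sample_negatives(negatives, max_count=5, training=False)
```

The same helper could be applied to relationship negative examples with the second preset value.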

In this embodiment, considering that the number of entity negative examples is large, the calculation speed of the entity relationship joint extraction model can be improved by randomly sampling the entity negative examples.

In a further embodiment herein, as shown in fig. 4, before the step 260 of determining the second vector, the method further includes:

step 251, judging whether each entity phrase accords with the entity relationship labeled in advance; if an entity phrase conforms to the pre-labeled entity relationship, the entity phrase is a relationship positive sample, otherwise, it is a relationship negative sample.

For example, the training text data is "liver size is normal", and the pre-labeled entity words are "liver" (type: part), "size" (type: attribute), and "normal" (type: non-numeric result). The pre-labeled relationships of the entity phrases are (size, modified, liver) and (normal, modified, size), and the entity phrases, formed from every ordered pair of entity words, are (liver, size), (liver, normal), (size, liver), (size, normal), (normal, liver), and (normal, size). The relationship positive examples obtained by analysis are (size, modified, liver) and (normal, modified, size). The relationship negative examples are (liver, irrelevant, size), (liver, irrelevant, normal), (size, irrelevant, normal), and (normal, irrelevant, liver).
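The pairing and comparison in step 251 can be sketched as follows, using English stand-ins for the entity words. Treating entity phrases as ordered pairs and storing the labeled relationships in a dictionary keyed by (first word, second word) are assumptions made for illustration.

```python
from itertools import permutations

entities = ["liver", "size", "normal"]
# Labeled relationships keyed by (first entity word, second entity word).
gold_relations = {("size", "liver"): "modified", ("normal", "size"): "modified"}

# Every ordered pair of distinct entity words is a candidate entity phrase.
candidate_pairs = list(permutations(entities, 2))

positives = [(h, gold_relations[(h, t)], t)
             for h, t in candidate_pairs if (h, t) in gold_relations]
negatives = [(h, "irrelevant", t)
             for h, t in candidate_pairs if (h, t) not in gold_relations]
```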

step 252, sampling the relationship negative examples according to a second preset value.

The second preset value may be the same as or different from the first preset value, and may be determined according to the computing power of the computing device in specific implementation, which is not specifically limited herein.

In the embodiment, the relationship negative examples not exceeding the second preset value are obtained by randomly sampling the relationship negative examples, and the relationship types corresponding to the relationship negative examples are defined as "non-relationship" types, so that the calculation amount of the entity-relationship joint extraction model can be reduced.

It should be noted here that the sampling of relationship negative examples only exists in the model training process; in the model prediction process, all entity phrases need to be retained in order to determine every entity phrase in which a relationship may exist.

After the entity relationship joint extraction model is established and obtained through the foregoing embodiment, the entity relationship joint extraction model can be used for identifying an entity relationship, and specifically, as shown in fig. 5, the entity relationship joint extraction method includes:

step 510, acquiring text data to be predicted;

step 520, extracting the text data to be predicted by using a pre-established entity relationship joint extraction model, and predicting the types of the word case intervals and the relationship types of the entity phrases. The types of the word case intervals comprise entity types and a non-entity type, the entity words are the word case intervals of the entity types, and the relationship types of the entity phrases comprise relationships and non-relationship.

In detail, the text data to be predicted is text data generated in the field to which the entity relationship joint extraction model is adapted. For the structure, training process, etc. of the entity-relationship joint extraction model, refer to the foregoing embodiments; they are not described in detail here. The types of the word case intervals and the relationship types of the entity phrases can be output in list form.

For example, the text data to be predicted is: "The liver has normal shape and size, the parenchymal echoes are evenly distributed, the vascular texture is clear, an anechoic area is seen in the right lobe of the liver with a size of about 24 x 18 mm, and the boundary is clear." The manual labeling result is shown in fig. 10, and the labeled entity words include: liver, shape, size, normal, parenchyma, echo, distribution, uniform, blood vessel, texture, clear, liver, right lobe, seen, anechoic, size, 24 x 18 mm, boundary, and clear. In the figure, boxes drawn with different line styles correspond to entity words of different entity types, and a relationship exists between entity words that are connected to each other.

The prediction result of the entity relationship joint extraction model provided by the text is shown as the following output result, wherein entites represents entity words, end represents the end positions of the entity words in the text, id represents the number of the entity words, start represents the starting positions of the entity words in the text, type is the entity word type and the relationship type of the entity word groups, and word is the entity word characters. Relationships represent the type of relationships between entity words, and head and tail represent id numbers corresponding to a first entity word and a second entity word in the entity phrase.

The following output results show that the entity relation joint extraction model prediction result is the same as the manual labeling result, so that the prediction of the type of the word case interval and the relation type of the entity phrase can be accurately realized.

In the embodiment, the identification of the non-entity type data and the non-relationship type data is added, so that the entity-relationship joint extraction model extracts the data of the non-entity type and the non-relationship type, and the accuracy of identification of different data types can be improved. When the entity phrase relationship type is identified, text semantic information is enriched by considering entity words and character vectors between the entity words, all entity phrase relationship types of complex text data can be accurately extracted, structuring is carried out according to the obtained entity words and the obtained entity phrase relationship types, and useful information extraction is completed.

In a further embodiment of this document, as shown in fig. 6, the entity relationship joint extraction method further includes, in addition to the above steps 510 and 520:

step 530, filtering the relationship types of the entity phrases obtained by prediction according to the allowable relationship constraint dictionary of the field to which the text data to be predicted belongs.

In detail, the allowable relation constraint dictionary in the field to which the text data to be predicted belongs is determined by a person skilled in the art, for example, an entity of an attribute type cannot modify an entity of a numerical type, and the like, which is not specifically limited herein.

In this embodiment, the recognition result of the entity relationship joint extraction model is combined with the allowable relationship constraint dictionary of the field to which the text data to be predicted belongs. Entity phrases that do not conform to the allowable relationship constraints are filtered out through the dictionary, so that the entity relationship extraction effect can be further improved.
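One possible shape for such a filter is sketched below. The representation of the constraint dictionary as a set of allowed (first entity type, relationship, second entity type) triples, the entries in it, and the function name filter_relations are all hypothetical; the text leaves the dictionary's contents to a person skilled in the art.

```python
# Hypothetical constraint dictionary: allowed (head type, relationship, tail type) triples.
ALLOWED = {
    ("attribute", "modified", "part"),
    ("non-numeric result", "modified", "attribute"),
}

def filter_relations(predicted, entity_types, allowed=ALLOWED):
    """Keep predicted (head, relationship, tail) triples whose type pattern is allowed."""
    kept = []
    for head, relation, tail in predicted:
        pattern = (entity_types[head], relation, entity_types[tail])
        if pattern in allowed:
            kept.append((head, relation, tail))
    return kept

entity_types = {"liver": "part", "size": "attribute", "normal": "non-numeric result"}
predicted = [("size", "modified", "liver"), ("liver", "modified", "normal")]
filtered = filter_relations(predicted, entity_types)
```

Here the second predicted triple is dropped because a part-type entity modifying a non-numeric result is not in the assumed dictionary.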

Based on the same inventive concept, an entity-relationship joint extraction device is also provided, as described in the following embodiments. Because the principle of solving the problem of the entity relationship joint extraction device is similar to that of the entity relationship joint extraction method, the entity relationship joint extraction device can be implemented by referring to the entity relationship joint extraction method, and repeated parts are not described again.

The entity relationship joint extraction device provided in this embodiment includes a plurality of functional modules, which may be implemented by dedicated or general chips, and may also be implemented by software programs, which are not limited herein.

Specifically, as shown in fig. 7, the entity relationship joint extraction device includes:

a receiving module 710, configured to obtain text data to be predicted;

an extraction module 720, configured to extract the text data to be predicted by using a pre-established entity-relationship joint extraction model, and predict to obtain a type of a word-case interval and a relationship type of an entity phrase, where the type of the word-case interval includes an entity type and a non-entity type, an entity word is a word-case interval of the entity type, and the relationship type of the entity phrase includes a relationship and a non-relationship;

Further, as shown in fig. 8, the entity-relationship joint extraction apparatus further includes:

a filtering module 730, configured to filter the relationship types of the entity phrases obtained by prediction according to the allowable relationship constraint dictionary of the field to which the text data to be predicted belongs.

Compared with the prior art, the entity relationship joint extraction method and device herein have the following advantages. When learning from a plurality of training text data, entities and the relationships between entities are learned by joint classification; learning of the "non-entity" and "non-relationship" categories is added to entity classification and relationship classification; and entity negative examples and relationship negative examples are randomly sampled. Both calculation efficiency and model effect are thereby taken into account, and the problem of entity relationship extraction from text containing many entities and relationships can be handled well.

In addition, the text vector and the context vector between entity words are added in the calculation of the first vector and the second vector, respectively. This enriches the text semantic information, so that entity relationship extraction under complex text structures (such as parallel entity words, overlapping entities, overlapping relationships and the like) can be handled well.

Finally, the allowable relationship constraint dictionary of the field is applied after model prediction to filter out entity relationship combinations that do not conform to the allowable relationship constraints, which further improves the entity relationship extraction effect.

In an embodiment herein, the entity relationship joint extraction model training process and the entity relationship prediction process described above may be implemented by a computer device. Specifically, as shown in fig. 9, the computer device 902 may include one or more processors 904, such as one or more Central Processing Units (CPUs), each of which may implement one or more hardware threads. The computer device 902 may further comprise any memory 906 for storing any kind of information, such as code, settings, data, etc. In some embodiments, a computer program is stored in the memory 906, and the computer program, when executed by the processor 904 of the computer device, performs the entity-relationship joint extraction method or the training method of the entity-relationship joint extraction model described in the previous embodiments. For example, and without limitation, memory 906 may include any one or more of the following in combination: any type of RAM, any type of ROM, flash memory devices, hard disks, optical disks, etc. More generally, any memory may use any technology to store information. Further, any memory may provide volatile or non-volatile retention of information. Further, any memory may represent fixed or removable components of computer device 902. In one case, when the processor 904 executes the associated instructions, which are stored in any memory or combination of memories, the computer device 902 can perform any of the operations of the associated instructions. The computer device 902 also includes one or more drive mechanisms 908, such as a hard disk drive mechanism, an optical disk drive mechanism, etc., for interacting with any memory.

Computer device 902 may also include an input/output module 910 (I/O) for receiving various inputs (via input device 912) and for providing various outputs (via output device 914). One particular output mechanism may include a presentation device 916 and an associated graphical user interface 918 (GUI). In other embodiments, the input/output module 910 (I/O), input device 912, and output device 914 may be omitted, in which case the computer device serves only as one computer device in a network. Computer device 902 may also include one or more network interfaces 920 for exchanging data with other devices via one or more communication links 922. One or more communication buses 924 couple the above-described components together.

Communication link 922 may be implemented in any manner, such as over a local area network, a wide area network (e.g., the Internet), a point-to-point connection, etc., or any combination thereof. Communication link 922 may include any combination of hardwired links, wireless links, routers, gateway functions, name servers, etc., governed by any protocol or combination of protocols.

Corresponding to the methods in fig. 2-6, the embodiments herein also provide a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, performs the steps of the above-described method.

Embodiments herein also provide computer readable instructions which, when executed by a processor, cause the processor to perform the method shown in fig. 2-6.

It should be understood that, in various embodiments herein, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments herein.

It should also be understood that, in the embodiments herein, the term "and/or" merely describes an association relation between associated objects and means that three kinds of relations may exist. For example, "A and/or B" may represent: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.

Those of ordinary skill in the art will appreciate that the elements and algorithm steps of the examples described in connection with the embodiments disclosed herein may be embodied in electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above generally in terms of their functions. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided herein, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purposes of the embodiments herein.

In addition, functional units in the embodiments herein may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present invention may be implemented in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other various media capable of storing program code.

The principles and embodiments of this document are explained herein using specific examples, which are presented only to aid in understanding the methods and their core concepts. Meanwhile, a person of ordinary skill in the art may, following the idea of this document, make changes to the specific implementation and the scope of application. In summary, this description should not be understood as a limitation of this document.

Claims

1. An entity relationship joint extraction method is characterized by comprising the following steps:

acquiring text data to be predicted;

2. The entity relationship joint extraction method of claim 1, further comprising:

3. The entity-relationship joint extraction method according to claim 1, wherein the entity-relationship joint extraction model comprises: a preprocessing module and a classification module, wherein the classification module comprises an embedding layer, a first classifier, a transition layer and a second classifier;

4. The entity relationship joint extraction method as claimed in claim 3, wherein the preprocessing module processes the text data to obtain a word case interval, a word case interval vector, a word case interval length vector and a text vector, and comprises:

5. The entity-relationship joint extraction method as claimed in claim 3, wherein constructing the first vector according to the preprocessed information comprises:

6. The entity relationship joint extraction method of claim 3, wherein the first classifier comprises: the first classification function unit is used for outputting a probability vector of a word case interval type, and the first judgment unit is used for determining the type of the word case interval according to the probability vector of the word case interval type;

7. The entity-relationship joint extraction method according to claim 3, wherein the entity-relationship joint extraction model is trained by:

screening out word case intervals of entity types to obtain entity words, and splicing an entity phrase formed by every two entity words and character vectors among the entity words in the entity phrase into a second vector;

8. The entity-relationship joint extraction method as claimed in claim 7, wherein before constructing the first vector according to the preprocessed information, the method further comprises:

and sampling the entity negative examples according to a first preset value.

9. The entity-relationship joint extraction method of claim 7, wherein before determining the second vector, further comprising:

and sampling the relationship negative examples according to a second preset value.

10. An entity-relationship joint extraction device, comprising:

the receiving module is used for acquiring text data to be predicted;