CN111798987A

CN111798987A - Entity relationship extraction method and device

Info

Publication number: CN111798987A
Application number: CN202010648089.8A
Authority: CN
Inventors: 陆晓静
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2020-07-07
Filing date: 2020-07-07
Publication date: 2020-10-20

Abstract

The invention provides an entity relationship extraction method and device. The method comprises: acquiring an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs; extracting sentences containing the entity pairs from professional data in the medical field; screening, from the sentences and based on an initial BERT model, sentence templates that characterize the relationship; and adjusting the initial BERT model based on the sentence templates and the entity pair data set, so that relationship extraction can be performed, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted. In this scheme, data are extracted from professional data in the medical field and the sentence templates characterizing the relationship are screened from the extracted sentences, which improves the efficiency of data labeling and feature matching, adapts to various kinds of data in the medical field, avoids a large amount of manual work, and saves cost.

Description

Entity relationship extraction method and device

Technical Field

The present invention relates to the field of data relationship extraction technologies, and in particular, to a method and an apparatus for extracting an entity relationship.

Background

Currently, relationships need to be extracted between medical entities, such as between diseases and symptoms or between diseases and operations. Existing extraction methods fall into two types: one is based on rules and the other is based on supervised learning. The rule-based method uses preset rules to extract the corresponding entities from a text or to judge whether the entities conform to the corresponding relationship. The supervised learning method labels a large amount of data and then trains a classifier to judge whether the entities have the corresponding relationship. Both of these existing solutions have problems:

The rule-based method depends on the quality of the established rules and requires a large amount of manual effort at the early stage; manually established rules cannot necessarily cover all relationship types, so the recall rate is poor. The supervised learning method requires a large amount of labeled data, which is costly, time-consuming, and labor-intensive, and it is inflexible because data must be labeled again whenever a new relationship needs to be extracted.

Thus, there is a need for a better approach to solving the problems encountered in entity relationship extraction.

Disclosure of Invention

In the scheme of the invention, data are extracted from professional data in the medical field, and sentence templates characterizing the relationship are screened from the extracted sentences. This improves the efficiency of data labeling and feature matching, adapts to various kinds of data in the medical field, avoids a large amount of manual work, and saves cost.

Specifically, the present invention proposes the following embodiments:

An embodiment of the invention provides an entity relationship extraction method, comprising the following steps:

acquiring an entity pair data set containing a preset relation; wherein the entity pair dataset comprises a plurality of entity pairs;

extracting sentences containing the entity pairs from professional data in the medical field;

screening sentence templates for representing the relation from the sentences based on an initial BERT model;

and adjusting the initial BERT model based on the sentence templates and the entity pair data set, so as to perform relationship extraction, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted.

In a specific embodiment, the medical domain professional data comprises medical record data.

In a specific embodiment, a preset relationship exists between entities in each of the entity pairs;

the extracting of sentences containing the entity pairs from professional data in the medical field comprises:

extracting, from the professional data in the medical field, sentences that have a preset length and in which the interval between the entities of the entity pair is a preset interval value.

In a specific embodiment, the "screening a sentence template for characterizing the relationship from the sentences based on the initial BERT model" includes:

generating initial sentence templates based on the sentences, wherein more than a preset proportion of the sentences constructed from an initial sentence template satisfy the relationship;

carrying out usability scoring on each initial sentence template through a BERT model;

and screening the initial sentence template according to the usability scores to select a sentence template for representing the relationship.

In a particular embodiment, the "scoring usability of each of the initial sentence templates by a BERT model" includes:

for each initial sentence template, constructing sentences with a blank based on the initial sentence template and each entity in the entity pair data set;

predicting the blank in the sentence based on a BERT model to obtain a prediction result;

determining a score for the initial sentence template based on the prediction result.

In a particular embodiment, the availability score is determined based on a formula in which:

score(φ_i) is the availability score of sentence template φ_i;

the indicator for s_j is 1 when s_j ∈ S_ij and 0 when s_j ∉ S_ij;

the indicator for t_j is 1 when t_j ∈ T_ij and 0 when t_j ∉ T_ij;

T_ij and S_ij are the top-k predictions for the sentences φ_i(s_j, _) and φ_i(_, t_j), respectively;

s_j and t_j are the entities of an entity pair.

In a specific embodiment, the "adjusting the initial BERT model based on the sentence template and the entity pair dataset" includes:

summarizing the sentence templates obtained after screening into a sentence template set;

constructing positive example sentences on the sentence template set based on the entity pair data set;

constructing negative example sentences on the sentence template set based on a portion of the entity pairs in the entity pair data set and on a reversed entity pair data set, wherein the reversed entity pair data set contains the same entities as the entity pair data set, with the entities in each entity pair in reverse order;

and adjusting the BERT model based on the positive example sentences and the negative example sentences, so as to obtain the adjusted BERT model.

In a specific embodiment, the method further comprises the following steps:

when it needs to be judged whether a designated entity pair satisfies the given relationship, predicting, through the adjusted BERT model, the sentences to be predicted, wherein the sentences to be predicted are constructed based on the designated entity pair and the sentence template set;

and if the average value of the obtained prediction results is greater than a set threshold, determining that the designated entity pair satisfies the given relationship.

The embodiment of the present invention further provides an entity relationship extraction device, including:

the acquisition module is used for acquiring an entity pair data set containing a preset relation; wherein the entity pair dataset comprises a plurality of entity pairs;

the extraction module is used for extracting sentences containing the entity pairs from professional data in the medical field;

the screening module is used for screening sentence templates for representing the relation from the sentences based on an initial BERT model;

and the processing module is used for adjusting the initial BERT model based on the sentence templates and the entity pair data set, so as to perform relationship extraction, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted.

In a specific embodiment, the medical domain professional data comprises medical record data.

Thus, the embodiments of the invention provide an entity relationship extraction method and device. The method comprises: acquiring an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs; extracting sentences containing the entity pairs from professional data in the medical field; screening, from the sentences and based on an initial BERT model, sentence templates that characterize the relationship; and adjusting the initial BERT model based on the sentence templates and the entity pair data set, so that relationship extraction can be performed, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted. In this scheme, data are extracted from professional data in the medical field and the sentence templates characterizing the relationship are screened from the extracted sentences, which improves the efficiency of data labeling and feature matching, adapts to various kinds of data in the medical field, avoids a large amount of manual work, and saves cost.

Drawings

To illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the embodiments are briefly described below. It should be understood that the following drawings illustrate only some embodiments of the present invention and therefore should not be considered as limiting its scope; those skilled in the art can obtain other related drawings from these drawings without inventive effort.

Fig. 1 is a schematic flow chart of an entity relationship extraction method according to an embodiment of the present invention;

Fig. 2 is a schematic view of a flow framework of an entity relationship extraction method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an entity relationship extraction device according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an entity relationship extraction device according to an embodiment of the present invention.

Detailed Description

Various embodiments of the present disclosure will be described more fully hereinafter. The present disclosure is capable of various embodiments and of modifications and variations therein. However, it should be understood that: there is no intention to limit the various embodiments of the disclosure to the specific embodiments disclosed herein, but rather, the disclosure is to cover all modifications, equivalents, and/or alternatives falling within the spirit and scope of the various embodiments of the disclosure.

The terminology used in the various embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the present disclosure. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the various embodiments of the present disclosure belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined in various embodiments of the present disclosure.

Example 1

Embodiment 1 of the invention discloses an entity relationship extraction method which, as shown in Fig. 1, comprises the following steps:

Step 101, acquiring an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs.

Specifically, a data set of entity pairs having the relationship is given, R = {(s₁, t₁), …, (s_n, t_n)}, for example the entity pair data set for diseases and symptoms {(hepatitis, hepatomegaly), (rubella, headache)}. A specific entity pair may be, for example, (rubella, headache), where "rubella" and "headache" are the entities of the entity pair.

Step 102, extracting sentences containing the entity pairs from professional data in the medical field.

Specifically, the professional data in the medical field includes medical record data.

Thus, all sentences containing an entity pair in R are extracted from the medical records, where only sentences of length L (for example, 15 characters or some other number) in which the interval between s_i and t_i is W are extracted (the interval may also be measured in characters or in other ways, for example 8 characters; the specific interval can be set flexibly according to actual conditions). φ₁(x₁, y₁), …, φ_m(x_m, y_m) denote all the extracted sentences.
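The extraction in Step 102 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that L and W act as upper bounds measured in characters, that sentences are split on common terminators, and that only the first mention of each entity is considered. The function names `split_sentences`, `entity_gap`, and `extract_sentences` are hypothetical.

```python
import re


def split_sentences(text):
    """Split text on common Chinese/English sentence terminators (an assumption)."""
    parts = re.split(r"[。！？!?.;；\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def entity_gap(sentence, s, t):
    """Characters between the first mentions of s and t, or None if either is absent."""
    i, j = sentence.find(s), sentence.find(t)
    if i < 0 or j < 0:
        return None
    if i <= j:
        return max(j - (i + len(s)), 0)
    return max(i - (j + len(t)), 0)


def extract_sentences(text, pairs, max_len=15, max_gap=8):
    """Return (sentence, s, t) triples within the length limit L and gap limit W.

    Treating L and W as upper bounds is an assumption of this sketch.
    """
    hits = []
    for sent in split_sentences(text):
        if len(sent) > max_len:
            continue
        for s, t in pairs:
            gap = entity_gap(sent, s, t)
            if gap is not None and gap <= max_gap:
                hits.append((sent, s, t))
    return hits
```

The defaults of 15 and 8 characters mirror the example values in the text, which suit Chinese medical records; English text would need larger limits.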

Step 103, screening a sentence template for characterizing the relationship from the sentences based on the initial BERT model.

Specifically, an initial sentence template φ_j(x_j, y_j) is screened from the extracted sentences such that, in the sentence set φ_j(s₁, t₁), …, φ_j(s_n, t_n) constructed with it, most sentences (for example, a proportion of 8, or a higher or lower proportion) satisfy the relationship to be extracted. For example, the initial sentence template "patient _ would show _" can be used to characterize the relationship between most diseases and their symptoms.
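One way to picture the initial template screening is sketched below. It is an illustration under stated assumptions: the slot markers `[S]` and `[T]`, the helper names, and the `satisfies` callback (a stand-in for whatever judgment decides that a constructed sentence expresses the relationship) are all hypothetical, and the default ratio of 0.8 is only a placeholder for the preset proportion.

```python
def make_template(sentence, s, t, slot_s="[S]", slot_t="[T]"):
    """Turn an extracted sentence into a template by replacing the first
    mention of each entity with a slot marker."""
    if s not in sentence or t not in sentence:
        raise ValueError("sentence must mention both entities")
    return sentence.replace(s, slot_s, 1).replace(t, slot_t, 1)


def fill_template(template, s, t, slot_s="[S]", slot_t="[T]"):
    """Construct a sentence from a template and one entity pair."""
    return template.replace(slot_s, s).replace(slot_t, t)


def keep_template(template, pairs, satisfies, min_ratio=0.8):
    """Keep the template if at least min_ratio of the sentences built from it
    are judged by the hypothetical satisfies() callback to express the relation."""
    sentences = [fill_template(template, s, t) for s, t in pairs]
    if not sentences:
        return False
    good = sum(1 for sent in sentences if satisfies(sent))
    return good / len(sentences) >= min_ratio
```

Entities that contain one another as substrings would need more careful slot replacement than this sketch performs.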

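In a particular embodiment, in order to screen out better sentence templates, the "screening out a sentence template for characterizing the relationship from the sentences based on the initial BERT model" includes:

generating initial sentence templates based on the sentences, wherein more than a preset proportion of the sentences constructed from an initial sentence template satisfy the relationship;

scoring the usability of each initial sentence template through a BERT model;

and screening the initial sentence templates according to the usability scores to select the sentence templates that characterize the relationship.

In addition, the "scoring usability of each of the initial sentence templates by the BERT model" includes:

for each initial sentence template, constructing sentences with a blank based on the initial sentence template and each entity in the entity pair data set;

predicting the blank in the sentences based on a BERT model to obtain a prediction result;

and determining a score for the initial sentence template based on the prediction result.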

The usability of a template φ_i is evaluated using the original BERT model. Using the data set R, the sentences φ_i(s₁, _), …, φ_i(s_n, _) and φ_i(_, t₁), …, φ_i(_, t_n) are constructed, where the underscore represents a blank, for example the sentence "patient hepatitis would show _". The blank is predicted using the BERT model, and it is counted whether the corresponding t₁, …, t_n and s₁, …, s_n are among the first k (top-k) predictions. With T_ij and S_ij denoting the top-k predictions for the sentences φ_i(s_j, _) and φ_i(_, t_j) respectively, the usability score is determined based on the following equation:

wherein:

score(φ_i) is the availability score of sentence template φ_i;

the indicator for s_j is 1 when s_j ∈ S_ij and 0 when s_j ∉ S_ij;

the indicator for t_j is 1 when t_j ∈ T_ij and 0 when t_j ∉ T_ij;

T_ij and S_ij are the top-k predictions for the sentences φ_i(s_j, _) and φ_i(_, t_j), respectively, where the first k (top-k) prediction results for the blank (for example, predicted words) are obtained using BERT;

s_j and t_j are the entities of an entity pair.
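The scoring can be sketched as below. The formula itself is not reproduced in this text, so the sketch assumes one plausible form: the number of top-k hits for s_j and t_j over all pairs, normalised by 2n. The normalisation, the `[S]`/`[T]` slot markers, and the `predict_top_k` callback (which in practice might wrap a masked-language-model fill-in query) are assumptions, and `toy_predictor` is a hard-coded stand-in for BERT.

```python
def availability_score(template, pairs, predict_top_k, k=5, blank="_"):
    """Fraction of blanked entities recovered within the top-k predictions.

    predict_top_k(sentence, k) is a hypothetical callback returning candidate
    fillers for the blank. Dividing by 2n is an assumption of this sketch.
    """
    if not pairs:
        return 0.0
    hits = 0
    for s, t in pairs:
        # phi_i(s_j, _): tail entity blanked, predictions form T_ij
        t_preds = predict_top_k(template.replace("[S]", s).replace("[T]", blank), k)
        # phi_i(_, t_j): head entity blanked, predictions form S_ij
        s_preds = predict_top_k(template.replace("[S]", blank).replace("[T]", t), k)
        hits += int(t in t_preds) + int(s in s_preds)
    return hits / (2 * len(pairs))


def toy_predictor(sentence, k):
    """Hard-coded stand-in for a masked language model, for illustration only."""
    table = {
        "patient with hepatitis will show _": ["hepatomegaly", "fever"],
        "patient with _ will show hepatomegaly": ["cirrhosis", "hepatitis"],
        "patient with rubella will show _": ["rash", "fever"],
        "patient with _ will show headache": ["rubella", "migraine"],
    }
    return table.get(sentence, [])[:k]
```

Multi-token entities would need more than one mask position in a real masked language model; the text does not say how that case is handled.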

Specifically, as shown in Fig. 2, Ψ = {ψ₁, …, ψ_c} denotes the c template sentences remaining after screening. To improve the prediction accuracy, the data set R = {(s₁, t₁), …, (s_n, t_n)} may be used to construct positive example sentences on the template set Ψ.

In addition, the reversed data set {(t₁, s₁), …, (t_n, s_n)}, together with a batch of samples (s_i, t_j), i ≠ j, i, j ∈ [1, n], drawn from the data set R, is used to construct negative example sentences on the template set Ψ. The original BERT model is then fine-tuned (that is, adjusted) on the constructed positive and negative example sentences as a binary classification task.
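The construction of the fine-tuning data can be sketched as follows. This is an illustration only: the `[S]`/`[T]` slot markers, the function names, the number of mismatched samples, and the exclusion of mismatched pairs that happen to appear in R are assumptions not stated in the text. The fine-tuning step itself is omitted.

```python
import random


def fill(template, s, t):
    """Construct a sentence from a template and an ordered entity pair."""
    return template.replace("[S]", s).replace("[T]", t)


def build_examples(templates, pairs, num_mismatched=None, seed=0):
    """Return (sentence, label) items: label 1 for pairs in R, label 0 for
    reversed pairs (t, s) and for sampled mismatched pairs (s_i, t_j), i != j."""
    rng = random.Random(seed)
    known = set(pairs)
    reversed_pairs = [(t, s) for s, t in pairs]
    # Candidate mismatches; skipping ones that are genuine pairs in R is an
    # extra safeguard added by this sketch.
    candidates = [
        (pairs[i][0], pairs[j][1])
        for i in range(len(pairs))
        for j in range(len(pairs))
        if i != j and (pairs[i][0], pairs[j][1]) not in known
    ]
    if num_mismatched is None:
        num_mismatched = len(pairs)
    mismatched = rng.sample(candidates, min(num_mismatched, len(candidates)))

    examples = []
    for tpl in templates:
        for s, t in pairs:
            examples.append((fill(tpl, s, t), 1))
        for s, t in reversed_pairs + mismatched:
            examples.append((fill(tpl, s, t), 0))
    return examples
```

The resulting labelled sentences would then feed a binary sentence classifier built on BERT.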

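Further, as shown in Fig. 2, the method further includes:

when it needs to be judged whether a designated entity pair satisfies the given relationship, predicting, through the adjusted BERT model, the sentences to be predicted, wherein the sentences to be predicted are constructed based on the designated entity pair and the sentence template set;

and if the average value of the obtained prediction results is greater than a set threshold, determining that the designated entity pair satisfies the given relationship.

Specifically, the fine-tuned BERT model is used to predict whether a pair (s, t) satisfies the given relationship. Because there are c sentence templates, c sentences can be constructed, so the prediction results are p₁(s, t), …, p_c(s, t). A threshold λ is set, and when the average of the prediction results satisfies (1/c) · Σ_{i=1..c} p_i(s, t) > λ, the input (s, t) is considered to satisfy the given relationship.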
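The decision rule amounts to averaging the per-template predictions and comparing the mean with the threshold. A minimal sketch follows, assuming a hypothetical `score_sentence` callback that returns the fine-tuned classifier's probability for one constructed sentence; the `[S]`/`[T]` slot markers and the default threshold of 0.5 are likewise assumptions.

```python
def satisfies_relation(s, t, templates, score_sentence, threshold=0.5):
    """Average the per-template predictions p_1..p_c for (s, t) and compare
    the mean with the threshold. Returns (decision, mean)."""
    if not templates:
        raise ValueError("at least one sentence template is required")
    probs = [
        score_sentence(tpl.replace("[S]", s).replace("[T]", t))
        for tpl in templates
    ]
    mean = sum(probs) / len(probs)
    return mean > threshold, mean
```

Averaging over all c templates means that no single poorly fitting template decides the outcome on its own.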

Example 2

Embodiment 2 of the present invention further discloses an entity relationship extraction device which, as shown in Fig. 3, includes:

an obtaining module 201, configured to obtain an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs;

an extraction module 202, configured to extract sentences containing the entity pairs from professional data in the medical field;

a screening module 203, configured to screen a sentence template for characterizing the relationship from the sentences based on an initial BERT model;

and a processing module 204, configured to adjust the initial BERT model based on the sentence template and the entity pair data set, so as to perform relationship extraction, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted.
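The way the four modules hand data to one another can be sketched as a simple composition. This is only an illustrative wiring under assumed interfaces: the class name, the field names, and the callable signatures are hypothetical, and each module is represented by a stub supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]


@dataclass
class RelationExtractionDevice:
    """Hypothetical wiring of modules 201 to 204 as plain callables."""
    acquire: Callable[[], List[Pair]]                                   # module 201
    extract: Callable[[List[Pair]], List[str]]                          # module 202
    screen: Callable[[List[str], List[Pair]], List[str]]                # module 203
    process: Callable[[List[str], List[Pair]], Callable[[Pair], bool]]  # module 204

    def build(self) -> Callable[[Pair], bool]:
        """Run the modules in order and return the resulting pair classifier."""
        pairs = self.acquire()
        sentences = self.extract(pairs)
        templates = self.screen(sentences, pairs)
        return self.process(templates, pairs)
```

As the description notes, the modules may equally be merged into one or split into sub-modules; the composition above is just one arrangement.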

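In a specific embodiment, the extraction module 202 is configured to:

extract, from the professional data in the medical field, sentences that have a preset length and in which the interval between the entities of the entity pair is a preset interval value.

In a specific embodiment, the screening module 203 is configured to:

generate initial sentence templates based on the sentences, wherein more than a preset proportion of the sentences constructed from an initial sentence template satisfy the relationship;

score the usability of each initial sentence template through a BERT model;

and screen the initial sentence templates according to the usability scores to select the sentence templates that characterize the relationship.

In a specific embodiment, the screening module 203 scoring the usability of each of the initial sentence templates through a BERT model includes:

for each initial sentence template, constructing sentences with a blank based on the initial sentence template and each entity in the entity pair data set;

predicting the blank in the sentences based on a BERT model to obtain a prediction result;

and determining a score for the initial sentence template based on the prediction result.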

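In a specific embodiment, the availability score is determined based on a formula in which:

score(φ_i) is the availability score of sentence template φ_i;

the indicator for s_j is 1 when s_j ∈ S_ij and 0 when s_j ∉ S_ij;

the indicator for t_j is 1 when t_j ∈ T_ij and 0 when t_j ∉ T_ij;

T_ij and S_ij are the top-k predictions for the sentences φ_i(s_j, _) and φ_i(_, t_j), respectively;

s_j and t_j are the entities of an entity pair.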

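In a specific embodiment, the processing module 204 is configured to:

summarize the sentence templates obtained after screening into a sentence template set;

construct positive example sentences on the sentence template set based on the entity pair data set;

construct negative example sentences on the sentence template set based on a portion of the entity pairs in the entity pair data set and on a reversed entity pair data set, wherein the reversed entity pair data set contains the same entities as the entity pair data set, with the entities in each entity pair in reverse order;

and adjust the BERT model based on the positive example sentences and the negative example sentences, so as to obtain the adjusted BERT model.

In a specific embodiment, as shown in Fig. 4, the device further includes a judging module 205, configured to:

when it needs to be judged whether a designated entity pair satisfies the given relationship, predict, through the adjusted BERT model, the sentences to be predicted, wherein the sentences to be predicted are constructed based on the designated entity pair and the sentence template set;

and if the average value of the obtained prediction results is greater than a set threshold, determine that the designated entity pair satisfies the given relationship.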

Thus, the embodiments of the invention provide an entity relationship extraction method and device. The method comprises: acquiring an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs; extracting sentences containing the entity pairs from professional data in the medical field; screening, from the sentences and based on an initial BERT model, sentence templates that characterize the relationship; and adjusting the initial BERT model based on the sentence templates and the entity pair data set, so that relationship extraction can be performed, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted. In this scheme, data are extracted from professional data in the medical field and the sentence templates characterizing the relationship are screened from the extracted sentences, which improves the efficiency of data labeling and feature matching, adapts to various kinds of data in the medical field, avoids a large amount of manual work, and saves cost.

Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.

Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.

The serial numbers used above are merely for description and do not represent the merits of the implementation scenarios.

The above discloses only a few specific implementation scenarios of the present invention; however, the present invention is not limited thereto, and any variation that can be conceived by those skilled in the art shall fall within the scope of protection of the present invention.

Claims

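1. A method of entity relationship extraction, comprising:

acquiring an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs;

extracting sentences containing the entity pairs from professional data in the medical field;

screening sentence templates for characterizing the relationship from the sentences based on an initial BERT model;

and adjusting the initial BERT model based on the sentence templates and the entity pair data set, so as to perform relationship extraction, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted.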

2. The method of claim 1, wherein the medical domain professional data comprises medical record data.

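3. The method of claim 1, wherein there is a preset relationship between the entities in each of the entity pairs;

the extracting of sentences containing the entity pairs from professional data in the medical field comprises:

extracting, from the professional data in the medical field, sentences that have a preset length and in which the interval between the entities of the entity pair is a preset interval value.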

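4. The method of claim 1, wherein the screening of sentence templates for characterizing the relationship from the sentences based on the initial BERT model comprises:

generating initial sentence templates based on the sentences, wherein more than a preset proportion of the sentences constructed from an initial sentence template satisfy the relationship;

scoring the usability of each initial sentence template through a BERT model;

and screening the initial sentence templates according to the usability scores to select the sentence templates that characterize the relationship.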

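5. The method of claim 4, wherein said scoring usability of each of said initial sentence templates by a BERT model comprises:

for each initial sentence template, constructing sentences with a blank based on the initial sentence template and each entity in the entity pair data set;

predicting the blank in the sentences based on a BERT model to obtain a prediction result;

and determining a score for the initial sentence template based on the prediction result.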

6. A method of entity relationship extraction as claimed in claim 4 or 5, wherein said availability score is determined based on the following formula:

wherein score(φ_i) is the availability score of sentence template φ_i;

the indicator for s_j is 1 when s_j ∈ S_ij and 0 when s_j ∉ S_ij;

the indicator for t_j is 1 when t_j ∈ T_ij and 0 when t_j ∉ T_ij;

T_ij and S_ij are the top-k predictions for the sentences φ_i(s_j, _) and φ_i(_, t_j), respectively;

s_j and t_j are the entities of an entity pair.

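7. The method of claim 1, wherein the "adjusting the initial BERT model based on the sentence templates and the entity pair data set" comprises:

summarizing the sentence templates obtained after screening into a sentence template set;

constructing positive example sentences on the sentence template set based on the entity pair data set;

constructing negative example sentences on the sentence template set based on a portion of the entity pairs in the entity pair data set and on a reversed entity pair data set, wherein the reversed entity pair data set contains the same entities as the entity pair data set, with the entities in each entity pair in reverse order;

and adjusting the BERT model based on the positive example sentences and the negative example sentences, so as to obtain the adjusted BERT model.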

8. The method of entity relationship extraction as claimed in claim 7, further comprising:

when it needs to be judged whether a designated entity pair satisfies the given relationship, predicting, through the adjusted BERT model, the sentences to be predicted, wherein the sentences to be predicted are constructed based on the designated entity pair and the sentence template set;

and if the average value of the obtained prediction results is greater than a set threshold, determining that the designated entity pair satisfies the given relationship.

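9. An apparatus for entity relationship extraction, comprising:

an acquisition module, configured to acquire an entity pair data set containing a preset relationship, wherein the entity pair data set comprises a plurality of entity pairs;

an extraction module, configured to extract sentences containing the entity pairs from professional data in the medical field;

a screening module, configured to screen sentence templates for characterizing the relationship from the sentences based on an initial BERT model;

and a processing module, configured to adjust the initial BERT model based on the sentence templates and the entity pair data set, so as to perform relationship extraction, through the adjusted BERT model, on an entity pair data set whose relationships are to be extracted.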

10. The apparatus for entity relationship extraction as claimed in claim 9, wherein the professional data of the medical field comprises medical record data.