CN112347771A

CN112347771A - Method and equipment for extracting entity relationship

Info

Publication number: CN112347771A
Application number: CN202011402086.2A
Authority: CN
Inventors: 史亚飞
Original assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Current assignee: Unisound Intelligent Technology Co Ltd; Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date: 2020-12-03
Filing date: 2020-12-03
Publication date: 2021-02-09

Abstract

The invention relates to a method and a device for extracting entity relationships. The method comprises the following steps: identifying the category of each medical entity in a medical text, and splitting the medical text into single sentences; for each category, selecting one medical entity and combining it with the single sentence to form a sentence that is input into a pre-trained BERT model, and obtaining a vector of the sentence and a vector of each medical entity name from the output layer of the BERT model; inputting the vector of the sentence and the averaged vector of each medical entity name into a feedforward neural network, respectively, to obtain a plurality of intermediate vectors; splicing the plurality of intermediate vectors, feeding the spliced vector into a fully-connected neural network, and classifying it with softmax in the fully-connected neural network to obtain classification probabilities; and selecting the relationship class with the highest classification probability as the final relationship between the medical entities. In this scheme, the pre-trained BERT model extracts the contextual semantic features of the entities, and the entity types are added to the relationship prediction, which improves recognition accuracy.

Description

Method and equipment for extracting entity relationship

Technical Field

The invention relates to the technical field of data processing, in particular to a method and equipment for extracting entity relationships.

Background

In many fields, such as the medical field, a large number of entities are involved, and the relationships between these entities need to be known to support subsequent applications. At present, entity relationships are generally obtained by using various CNN (Convolutional Neural Network) and LSTM (Long Short-Term Memory) deep learning networks to extract features of different dimensions, and then combining these networks to select the entity relationship.

However, the current methods take into account neither the contextual semantic information around the different entities nor the type information of the entities, which leads to inaccurate identification.

For this reason, there is a need for a better solution to the problems of the prior art.

Disclosure of Invention

The invention provides an entity relationship extraction method and device, which can solve the technical problem of inaccurate identification in the prior art.

The technical scheme for solving the technical problems is as follows:

the embodiment of the invention provides an entity relationship extraction method, which comprises the following steps:

identifying the category of a medical entity in a medical text, and splitting the medical text into single sentences;

for each category, selecting one medical entity and combining it with the single sentence to form a sentence that is input into a pre-trained BERT model, and obtaining a vector of the sentence and a vector of each medical entity name from the output layer of the BERT model;

inputting the vector of the sentence and the averaged vector of each medical entity name into a feedforward neural network, respectively, to obtain a plurality of intermediate vectors;

splicing the intermediate vectors, feeding the spliced vector into a fully-connected neural network, and classifying it with softmax in the fully-connected neural network to obtain classification probabilities;

selecting the relationship class with the highest classification probability as the final relationship between the medical entities.

In a specific embodiment, before identifying the category of the medical entity in the medical text, the method further comprises: acquiring the medical text.

In a particular embodiment, the medical text includes any combination of one or more of the following: medical teaching materials, clinical guidelines, and medical records.

In a specific embodiment, the "identifying the category of the medical entity in the medical text" includes:

and identifying the category of the medical entity in the medical text by adopting a combination of BERT and CRF.

In a specific embodiment, the categories include: diseases, examinations, symptoms, treatments, and drugs.

In a specific embodiment, in the sentence, an identifier of the corresponding category is set in front of each medical entity, and a sentence identifier is set before the sentence.

In a specific embodiment, the vector of sentences is determined by the following formula:

H'_CLS＝W_CLS(tanh(H_CLS))+b_CLS；

wherein H'_CLS is the vector of the sentence; W_CLS is the weight parameter of the sentence; H_CLS is the sentence representation; and b_CLS is the bias parameter of the sentence.

In a specific embodiment, when the number of the medical entities is 2, the vectors of the medical entity names are determined by the following formulas:

H'_e1＝W_e1(tanh((1/(j_e1−i_e1+1))·Σ_{t=i_e1}^{j_e1} H_t))+b_e1；

H'_e2＝W_e2(tanh((1/(j_e2−i_e2+1))·Σ_{t=i_e2}^{j_e2} H_t))+b_e2；

wherein i_e1, j_e1, i_e2, j_e2 are the first and last character positions of entity e1 and entity e2, respectively; H_t is the output-layer vector of the character at position t; H'_e1 is the vector of entity e1; H'_e2 is the vector of entity e2; W_e1 is the weight parameter of entity e1; W_e2 is the weight parameter of entity e2; b_e1 is the bias parameter of entity e1; and b_e2 is the bias parameter of entity e2.

In a specific embodiment, when the number of the medical entities is 2, the classification probability is determined by the following formula:

p＝softmax(W[concat(H'_CLS,H'_e1,H'_e2)]+b)；

wherein p is the classification probability; W is the weight parameter of the hidden layer; b is the bias parameter of the hidden layer; H'_CLS is the vector of the sentence; H'_e1 is the vector of entity e1; and H'_e2 is the vector of entity e2.

The embodiment of the present invention further provides an extraction device for entity relationships, including:

the identification module is used for identifying the category of the medical entity in the medical text and splitting the medical text into single sentences;

an obtaining module, configured to, for each category, select one medical entity and combine it with the single sentence to form a sentence, input the sentence into a pre-trained BERT model, and obtain a vector of the sentence and a vector of each medical entity name from the output layer of the BERT model;

the intermediate module is used for inputting the vector of the sentence and the averaged vector of each medical entity name into a feedforward neural network, respectively, to obtain a plurality of intermediate vectors;

the input module is used for splicing the intermediate vectors, feeding the spliced vector into a fully-connected neural network, and classifying it with softmax in the fully-connected neural network to obtain classification probabilities;

a determining module for selecting the relationship category with the highest classification probability as the final relationship between the medical entities.

The invention has the beneficial effects that:

According to the scheme, a pre-trained BERT model is adopted, and the vector of the sentence and the vector of each medical entity name are obtained from the output layer of the BERT model; these vectors are then passed through a feedforward neural network to obtain a plurality of intermediate vectors, which are fed into the fully-connected neural network to obtain classification probabilities; and the final relationship between the medical entities is determined based on the classification probabilities. The pre-trained BERT model thus extracts the contextual semantic features of the entities, and the entity types are added to the relationship prediction, which improves recognition accuracy.

Drawings

Fig. 1 is a schematic flowchart of an entity relationship extraction method according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of an entity relationship extraction method according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of an extraction device for entity relationships according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an extraction device for entity relationships according to an embodiment of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth by way of illustration only and are not intended to limit the scope of the invention.

The method for extracting entity relationships provided by the embodiment of the invention, as shown in fig. 1 or 2, includes the following steps:

Step 101, identifying the category of a medical entity in a medical text, and splitting the medical text into single sentences;

Specifically, the medical text comprises any combination of one or more of the following: medical teaching materials, clinical guidelines, and medical records.
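The sentence-splitting part of step 101 could be sketched as follows. This is a minimal illustration, not the method the patent prescribes: the function name `split_sentences` and the set of sentence terminators are assumptions, since the text does not say how sentence boundaries are detected.

```python
import re
from typing import List

# Assumed terminators: Chinese and Western full stops, exclamation marks and
# question marks. The source text does not specify which characters are used.
# Caveat: a plain "." also splits decimal numbers such as "3.5 mg".
_SENTENCE_RE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text: str) -> List[str]:
    """Split a medical text into single sentences, keeping each terminator."""
    sentences = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:  # skip fragments that are only whitespace
            sentences.append(sentence)
    return sentences
```

Each returned sentence would then be processed separately in the following steps.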

Specifically, the "identifying the category of the medical entity in the medical text" in step 101 includes:

the category of the medical entity in the medical text is identified by using a combination of BERT (Bidirectional Encoder Representations from Transformers) and CRF (Conditional Random Field).

The categories include: diseases, examinations, symptoms, treatments, and drugs.

Specifically, other categories may be provided according to needs, and are not limited to the above specific categories.
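A BERT + CRF tagger typically emits one label per token, which then has to be grouped into entities. The sketch below shows one way to do that grouping, assuming a BIO labelling scheme (`B-disease`, `I-disease`, `O`, ...). The BIO scheme, the tag spellings, and the function name `decode_bio` are assumptions for illustration; the text only states that BERT and CRF are combined.

```python
from typing import List, Optional, Tuple

# The five categories named in the text; the exact tag strings are assumed.
CATEGORIES = {"disease", "examination", "symptom", "treatment", "drug"}


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Group BIO tags into (entity_text, category, first, last) tuples.

    `first` and `last` are inclusive token positions. Tokens are joined with
    spaces here; for Chinese characters an empty separator would be used.
    """
    entities = []
    start: Optional[int] = None
    category: Optional[str] = None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last entity
        continues = tag.startswith("I-") and tag[2:] == category
        if start is not None and not continues:
            entities.append((" ".join(tokens[start:i]), category, start, i - 1))
            start, category = None, None
        if tag.startswith("B-") and tag[2:] in CATEGORIES:
            start, category = i, tag[2:]
    return entities
```

The inclusive positions returned here play the role of the first and last character positions of each entity that are used when the entity name vectors are averaged.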

Step 102, for each category, selecting one medical entity and combining it with the single sentence to form a sentence that is input into a pre-trained BERT model, and obtaining a vector of the sentence and a vector of each medical entity name from the output layer of the BERT model;

In the sentence, an identifier of the corresponding category is placed in front of each medical entity, and a sentence identifier is placed before the sentence.

For example, the identifier of the disease category is "& disease", the identifier of the examination category is "# examination", and the identifier of the entire sentence is "[CLS]".

Specifically, the sentence identifiers are different from the category identifiers, and the different category identifiers are also different.
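The pairing and marking in step 102 can be illustrated with a short sketch. It assumes two categories are chosen, pairs every entity of the first with every entity of the second, and wraps each entity name with its category marker. The function name `build_inputs`, the dictionary layout, and the closing marker after each entity name are assumptions for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def build_inputs(
    sentence: str,
    entities_by_category: Dict[str, List[str]],
    cat1: str,
    cat2: str,
    sym1: str = "&",
    sym2: str = "#",
) -> List[Tuple[str, str, str]]:
    """Return (e1, e2, marked_sentence) for every cross-category entity pair.

    Caveat: plain string replacement marks only the first occurrence and can
    misfire if one entity name is contained in the other or in a category name.
    """
    inputs = []
    for e1, e2 in product(entities_by_category[cat1], entities_by_category[cat2]):
        marked = sentence.replace(e1, f"{sym1} {cat1} {sym1} {e1} {sym1}", 1)
        marked = marked.replace(e2, f"{sym2} {cat2} {sym2} {e2} {sym2}", 1)
        inputs.append((e1, e2, "[CLS] " + marked))  # sentence identifier first
    return inputs
```

Each marked sentence is one input to the BERT model, so a sentence with several diseases and several examinations yields one input per (disease, examination) pair.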

Step 103, inputting the vector of the sentence and the averaged vector of each medical entity name into a feedforward neural network, respectively, to obtain a plurality of intermediate vectors;

Specifically, the vector of the sentence is determined by the following formula:

H'_CLS＝W_CLS(tanh(H_CLS))+b_CLS；

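Taking two medical entities that belong to different categories as an example, when the number of the medical entities is 2, the vectors of the medical entity names are determined by the following formulas:

H'_e1＝W_e1(tanh((1/(j_e1−i_e1+1))·Σ_{t=i_e1}^{j_e1} H_t))+b_e1；

H'_e2＝W_e2(tanh((1/(j_e2−i_e2+1))·Σ_{t=i_e2}^{j_e2} H_t))+b_e2；

wherein i_e1, j_e1, i_e2, j_e2 are the first and last character positions of entity e1 and entity e2, respectively, and H_t is the output-layer vector of the character at position t.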
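Step 103 can be sketched in plain Python as follows. The sketch assumes `hidden` is the list of BERT output-layer vectors, one per position, with position 0 holding [CLS]. Applying tanh to the averaged entity vector mirrors the sentence formula and is an assumption here, as are all function names and the toy weights one would pass in.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def feed_forward(weights: Matrix, bias: Vector, h: Vector) -> Vector:
    """Compute W * tanh(h) + b, the form of the sentence-vector formula."""
    activated = [math.tanh(x) for x in h]
    return [
        sum(w * a for w, a in zip(row, activated)) + b
        for row, b in zip(weights, bias)
    ]


def mean_span(hidden: List[Vector], first: int, last: int) -> Vector:
    """Element-wise average of hidden[first..last], both ends inclusive."""
    span = hidden[first:last + 1]
    return [sum(column) / len(span) for column in zip(*span)]


def sentence_vector(hidden: List[Vector], w_cls: Matrix, b_cls: Vector) -> Vector:
    """H'_CLS: feedforward layer applied to the [CLS] position (assumed index 0)."""
    return feed_forward(w_cls, b_cls, hidden[0])


def entity_vector(hidden: List[Vector], first: int, last: int,
                  w_e: Matrix, b_e: Vector) -> Vector:
    """Entity vector: average the entity's positions, then apply the layer."""
    return feed_forward(w_e, b_e, mean_span(hidden, first, last))
```

Each entity has its own weight and bias parameters, so `entity_vector` would be called once per entity with separate `w_e` and `b_e`.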

Step 104, splicing the intermediate vectors, feeding the spliced vector into a fully-connected neural network, and classifying it with softmax in the fully-connected neural network to obtain classification probabilities;

Continuing the above example, when the number of the medical entities is 2, the classification probability is determined by the following formula:

p＝softmax(W[concat(H'_CLS,H'_e1,H'_e2)]+b)；

Step 105, selecting the relationship category with the highest classification probability as the final relationship between the medical entities.
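Steps 104 and 105 amount to a concatenation, one linear layer, a softmax, and an argmax. The sketch below illustrates this with plain lists; the function names and the relation labels passed in are hypothetical, and a trained model would supply the weights and bias.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    h_cls: Sequence[float],
    h_e1: Sequence[float],
    h_e2: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Compute p = softmax(W [concat(H'_CLS, H'_e1, H'_e2)] + b) and pick the top class."""
    features = list(h_cls) + list(h_e1) + list(h_e2)  # splice the three vectors
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

The weight matrix has one row per relationship class and one column per element of the spliced vector, so its width is the sum of the three intermediate vector lengths.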

According to the scheme, medical entities in a medical text are input into a pre-trained BERT model, and the vector of the sentence and the vector of each medical entity name are obtained from the output layer of the BERT model; these vectors are then passed through a feedforward neural network to obtain a plurality of intermediate vectors, which are fed into the fully-connected neural network to obtain classification probabilities; and the final relationship between the medical entities is determined based on the classification probabilities. The pre-trained BERT model thus extracts the contextual semantic features of the entities, and the entity types are added to the relationship prediction, which improves recognition accuracy.

In a specific example, as shown in fig. 2, the present solution further includes the following steps:

1. Collecting medical documents, e.g., medical textbooks, clinical guidelines, and medical records.

2. Performing medical entity recognition on the medical text from step 1 using the pre-trained model BERT + CRF, splitting the text into single sentences, and selecting two categories of entities from each single sentence for relationship recognition, such as disease and examination.

3. Selecting one entity from each of the two categories in step 2 and combining them pairwise to obtain a plurality of entity pairs (e1, e2), each entity pair serving as one input. In the input sentence, the two entities are distinguished by special symbols (e.g., "&" and "#"), and the entity type is added in front of each entity. An example of the processing is as follows:

[CLS] & disease & chronic obstructive pulmonary disease & acute exacerbation is often induced by microbial infection; when combined with bacterial infection, # examination # blood leukocyte count # increases, and the neutrophil nuclei shift to the left

4. Feeding the single sentence processed in step 3 into the pre-trained BERT model, extracting the [CLS] vector and the vectors of the two entity names from the output layer of BERT, passing each of them through a feedforward neural network, splicing the three updated vectors, feeding the result into a fully-connected neural network, and classifying it through softmax, as follows:

(1) The [CLS] vector is fed into a feedforward neural network:

H'_CLS＝W_CLS(tanh(H_CLS))+b_CLS

(2) The vectors of the two entity names are each averaged and fed into a feedforward neural network:

H'_e1＝W_e1(tanh((1/(j_e1−i_e1+1))·Σ_{t=i_e1}^{j_e1} H_t))+b_e1

H'_e2＝W_e2(tanh((1/(j_e2−i_e2+1))·Σ_{t=i_e2}^{j_e2} H_t))+b_e2

wherein averaging the vector of an entity name means averaging the vectors of the characters of the entity name; i_e1, j_e1, i_e2, j_e2 are the first and last character positions of entity e1 and entity e2, respectively; and H_t is the output-layer vector of the character at position t.

(3) Splicing the three vectors obtained in (1) and (2), feeding them into a fully-connected neural network, and obtaining the classification probability through softmax, according to the formula:

p＝softmax(W[concat(H'_CLS,H'_e1,H'_e2)]+b)

5. Based on the probability p from step 4, the relationship class with the highest probability is selected as the final relationship between entities e1 and e2.

In this scheme, the pre-trained BERT model extracts the contextual semantic features of the entities, and the entity types are added to the relationship prediction, which effectively improves recognition accuracy.

Example 2

The embodiment 2 of the present invention further discloses an extraction device for entity relationships, as shown in fig. 3, including:

the identification module 201 is configured to identify categories of medical entities in a medical text, and split the medical text into single sentences;

an obtaining module 202, configured to, for each category, select one medical entity and combine it with the single sentence to form a sentence, input the sentence into a pre-trained BERT model, and obtain, from the output layer of the BERT model, a vector of the sentence and a vector of each medical entity name;

the intermediate module 203 is configured to input the vector of the sentence and the averaged vector of each medical entity name into a feedforward neural network, respectively, so as to obtain a plurality of intermediate vectors;

the input module 204 is used for splicing the intermediate vectors, feeding the spliced vector into a fully-connected neural network, and classifying it with softmax in the fully-connected neural network to obtain classification probabilities;

a determining module 205, configured to select the relationship class with the highest classification probability as the final relationship between the medical entities.

In a specific embodiment, as shown in Fig. 4, the device further includes:

a text module 206 for obtaining the medical text before identifying the category of the medical entity in the medical text.
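One way to picture how modules 201 to 206 fit together is as a chain of callables, each consuming the previous module's output. The sketch below is only an illustration of that data flow: the class name, field names, and the loosely typed interfaces are assumptions, and real modules would wrap the tagger, the BERT encoder, and the trained layers.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class RelationExtractionDevice:
    text_module: Callable[[], str]                      # 206: acquire the medical text
    identification_module: Callable[[str], List[Any]]   # 201: tag entities, split sentences
    obtaining_module: Callable[[Any], List[Any]]        # 202: build marked inputs, run BERT
    intermediate_module: Callable[[Any], Any]           # 203: feedforward layers
    input_module: Callable[[Any], List[float]]          # 204: splice + softmax
    determining_module: Callable[[List[float]], str]    # 205: pick the top class

    def run(self) -> List[str]:
        """Return one predicted relationship per encoded entity pair."""
        relations = []
        text = self.text_module()
        for sentence in self.identification_module(text):
            for encoded in self.obtaining_module(sentence):
                probs = self.input_module(self.intermediate_module(encoded))
                relations.append(self.determining_module(probs))
        return relations
```

Keeping each module behind a plain callable makes it possible to swap a stub in for any stage when testing the others.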

In a specific embodiment, the identification module 201 is configured to identify the category of the medical entity in the medical text by using a combination of BERT and CRF.

The categories include: diseases, examinations, symptoms, treatments, and drugs.

The identification module 201 is configured to set, in the sentence, an identifier of a category corresponding to each medical entity in front of each medical entity; and setting a sentence mark before the sentence.

In the identification module 201, the vector of the sentence is determined by the following formula:

H'_CLS＝W_CLS(tanh(H_CLS))+b_CLS；

In the identification module 201, when the number of the medical entities is 2, the vectors of the medical entity names are determined by the following formulas:

H'_e1＝W_e1(tanh((1/(j_e1−i_e1+1))·Σ_{t=i_e1}^{j_e1} H_t))+b_e1；

H'_e2＝W_e2(tanh((1/(j_e2−i_e2+1))·Σ_{t=i_e2}^{j_e2} H_t))+b_e2；

In the identification module 201, when the number of the medical entities is 2, the classification probability is determined by the following formula:

p＝softmax(W[concat(H'_CLS,H'_e1,H'_e2)]+b)；

While the invention has been described with reference to specific embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. An extraction method of entity relationships, comprising:

2. The method of claim 1, further comprising, prior to identifying the category of the medical entity in the medical text: acquiring the medical text.

3. The method of claim 1 or 2, wherein the medical text comprises any combination of one or more of: medical teaching materials, clinical guidelines, and medical records.

4. The method of claim 1, wherein the identifying the category of the medical entity in the medical text comprises:

5. The method of claim 1 or 4, wherein the categories include: diseases, examinations, symptoms, treatments, and drugs.

6. The method according to claim 1, wherein in the sentence, each medical entity is preceded by an identification of the corresponding category of the medical entity; and setting a sentence mark before the sentence.

7. The method of claim 1, wherein the vector of sentences is determined by the following formula:

H′_CLS＝W_CLS(tanh(H_CLS))+b_CLS；

8. The method of claim 1, wherein when the number of medical entities is 2, the vectors of the medical entity names are determined by the following formulas:

H′_e1＝W_e1(tanh((1/(j_e1−i_e1+1))·Σ_{t=i_e1}^{j_e1} H_t))+b_e1；

H′_e2＝W_e2(tanh((1/(j_e2−i_e2+1))·Σ_{t=i_e2}^{j_e2} H_t))+b_e2；

wherein i_e1, j_e1, i_e2, j_e2 are the first and last character positions of entity e1 and entity e2, respectively; H_t is the output-layer vector of the character at position t; H′_e1 is the vector of entity e1; H′_e2 is the vector of entity e2; W_e1 is the weight parameter of entity e1; W_e2 is the weight parameter of entity e2; b_e1 is the bias parameter of entity e1; and b_e2 is the bias parameter of entity e2.

9. The method of any one of claims 1, 7, 8, wherein when the number of medical entities is 2, the classification probability is determined by the following formula:

p＝softmax(W[concat(H′_CLS,H′_e1,H′_e2)]+b)；

wherein p is the classification probability; W is the weight parameter of the hidden layer; b is the bias parameter of the hidden layer; H′_CLS is the vector of the sentence; H′_e1 is the vector of entity e1; and H′_e2 is the vector of entity e2.

10. An entity relationship extraction device, comprising: