CN114756679A - Chinese medical text entity relation combined extraction method based on conversation attention mechanism - Google Patents

Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Info

Publication number
CN114756679A
CN114756679A (application CN202210315494.7A)
Authority
CN
China
Prior art keywords: entity, sentence, layer, input, entities
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210315494.7A
Other languages
Chinese (zh)
Inventor
黄杰
罗之宇
张蕾
万健
史斌彬
张丽娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lover Health Science and Technology Development Co Ltd
Original Assignee
Zhejiang Lover Health Science and Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lover Health Science and Technology Development Co Ltd filed Critical Zhejiang Lover Health Science and Technology Development Co Ltd
Priority to CN202210315494.7A priority Critical patent/CN114756679A/en
Publication of CN114756679A publication Critical patent/CN114756679A/en
Pending legal-status Critical Current

Classifications

    • G06F 16/35: Information retrieval of unstructured textual data; Clustering; Classification
    • G06F 16/3347: Query execution using vector based model
    • G06F 16/367: Creation of semantic tools; Ontology
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F 18/253: Fusion techniques of extracted features
    • G06F 40/295: Named entity recognition
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06N 5/02: Knowledge representation; Symbolic representation

Abstract

The invention discloses a Chinese medical text entity-relation joint extraction method based on a conversation attention mechanism. By proposing feature fusion between a CLN layer and position information and introducing a Talking-Heads Attention mechanism, the invention lets the relations interact ("talk") with one another. The connection between entity types and relation types is thereby strengthened, and the accuracy of the model is greatly improved.

Description

Chinese medical text entity relation combined extraction method based on conversation attention mechanism
Technical Field
The invention belongs to the technical field of computer application, and relates to a Chinese medical text entity relationship joint extraction method based on a conversation attention mechanism.
Background
The medical knowledge graph is constructed from medical domain knowledge; it aims to systematically organize the knowledge contained in medical texts by establishing association relations among medical entities, which facilitates downstream data searching, mining and analysis. The medical field contains a large amount of textual information, and how to extract the required medical knowledge from medical texts to construct a knowledge graph has become a research focus.
Knowledge graph construction is inseparable from Information Extraction (IE), and the two difficult research tasks within IE are Named Entity Recognition (NER) and Relation Extraction (RE). With the rapid development of the Natural Language Processing (NLP) field, both pipeline approaches and joint extraction approaches have been proposed to address these two problems.
At present the traditional pipeline approach is widely used: entities are extracted first, and then the relations between them are identified. A traditional pipeline model is trained with gold entity labels, but at the relation extraction stage it consumes the output of the entity recognition model, so the distribution gap between the gold labels and the predicted entities degrades the relation extraction model. In fact, implicit connections exist between entity types and relation types, and the pipeline approach cannot exploit them. Furthermore, the pipeline approach extracts a relation for every entity pair, which wastes a great deal of information, and for the entity-relation overlap problem the traditional model offers no good solution. Joint extraction methods have therefore come into view, and they can effectively overcome the difficulties encountered by the traditional approach.
On the basis of joint entity-relation extraction, the method proposes feature fusion between a CLN layer and position information and introduces a Talking-Heads Attention mechanism, so that all relations interact ("talk") with one another. The connection between entity types and relation types is strengthened, and the accuracy of the model is greatly improved.
Disclosure of Invention
In view of the deficiencies of the prior art, the invention aims to provide a joint extraction model that can be effectively applied to the medical field.
In order to achieve this purpose, the invention provides a Chinese medical text entity relation joint extraction method based on a conversation attention mechanism, which comprises the following steps:
step 1, inputting a sentence into the RoBERTa layer, fully extracting sentence features and mining the associations between words:
the sentence is input into the RoBERTa layer to fully extract sentence features and mine the associations between words; head entities and tail entities are extracted in the same step, and the relation types between the entities are predicted; the start and end of every input span are marked by pointers, converting the multi-span problem into N² classifications, where N is the sequence length; the sequence matrix produced by entity extraction is then processed by the CLN layer and the THA layer to complete the extraction of the triples;
step 2, extracting entities of the input sentence: following a cascaded pointer network, two cascaded modules are used to extract the triples, the two modules corresponding to the two contents of entity extraction and corresponding-relation extraction; entity extraction is performed on each input sentence and covers both head entities and tail entities; the extracted entity, namely the head entity, is input into the next module, all relations are traversed, and whether there exists a relation that can match the head entity to a tail entity is calculated;
and step 3, traversing all the different objects, inputting them into the subsequent module, and extracting the triples.
Further, in step 1, the RoBERTa layer performs feature extraction and sentence modeling based on a Transformer-based bidirectional encoding representation algorithm;
the input sentence is segmented and annotated, and a distributed representation of the sentence is produced:
X = {X_1, X_2, …, X_t, …, X_n}    #(1)
X_t = E_T + E_S + E_P    #(2)
each token representation comprises a word vector, a text vector and a position vector; in the formula, E_T denotes the word vector (token embedding), E_S denotes the text vector (E_seg embedding), and E_P denotes the position vector (E_pos embedding).
Further, in step 2 each input sentence is passed through a 12-layer RoBERTa encoder to obtain the encoding vector h, from which all entities in the input sentence, including head entities and tail entities, are extracted; a pointer network is initialized to assign a 0/1 binary tag to every tagging position; the 0/1 binary tags mark the start and end positions of the recognized entities, and the tagged entities are input as objects to the next-level module;
s_start = σ(W_start · h + b_start),   s_end = σ(W_end · h + b_end)
in the formula, s_start and s_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position; if a position probability exceeds the set threshold it is marked 1, otherwise 0; W_start and W_end denote the weights of the fully connected layer, with new weights updated through each input; b_start and b_end denote bias vectors, and σ is the sigmoid function used as the activation function;
the representation of all objects in the input sentence x is optimized by the following likelihood function;
[likelihood of the entity start/end tag sequences over the L positions of sentence x]
where L is the length of the sentence; in the output start and end sequences, the start position of an entity is marked 1, with R_1 = 1 and R_2 = 0; the end position of an entity is marked 1, with R_1 = 0 and R_2 = 1; the parameters are the weights and biases of the start/end taggers.
Further, in step 3, the text-generation scenario conditioned on a fixed-length vector fuses the condition into the β and γ of the normalization layer; the concrete implementation formula is:
CLN(h) = γ · (h - avg(h)) / std(h) + β
where avg is the mean of h and std is the standard deviation of h; β and γ are two dynamic matrices that are continually updated as the objects in the input sentence change;
before entering the THA layer, the output of the CLN layer is concatenated with the E_pos-Embedding used earlier during entity extraction;
H = Concat(CLN output, E_pos)
the newly derived mixed attention formulas are shown below:
[formulas defining J_i and O_i from the per-head attentions using the two parameter matrices λ_L and λ_W]
in the formulas, different Query, Key and Value weight matrices are used, each generated by random initialization; the word embeddings are then projected into different spaces through training; head_i denotes the computation result of the i-th feature (head), and J denotes the concatenation of the results J_i of all heads; J_i denotes each feature (head) associated with all the other features through two talking steps; O_i denotes the output talking-head attention result;
r_start = σ(W_start^r · o + b_start^r),   r_end = σ(W_end^r · o + b_end^r)
where r_start and r_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position, and o is the feature output by the THA layer; W_start^r and W_end^r denote the weights of the fully connected layer, with new weights updated through each input; b_start^r and b_end^r denote bias vectors, and σ is the sigmoid function used as the activation function;
the corresponding-relation representation of all objects in the input sentence x is optimized by the following likelihood function:
[likelihood of the relation-specific start/end tag sequences over the L positions of sentence x]
where L is the length of the sentence; in the output start or end sequence, the start position of the tail entity of the corresponding relation is marked 1, with I_1 = 1 and I_2 = 0; the end position of the tail entity is marked 1, with I_1 = 0 and I_2 = 1; the parameters are the weights and biases of the relation taggers;
for the training set D, the likelihood functions of the entities and relations of each sentence x_i are summed; the Adam optimizer is adopted and the value K is maximized to train the model; the learning rate of the optimizer is initially set to a larger value and then dynamically reduced as the number of steps increases, so that both efficiency and effectiveness are achieved; in the formula, T_i denotes all objects in the input sentence and T_r denotes all relations corresponding to the head entity;
[training objective K: the sum of the entity and relation log-likelihoods over the training set D]
the invention has the following beneficial effects: the invention makes a conversational interaction between the relations by proposing the idea of feature fusion between the CLN layer and the position information and introducing a Talking head association mechanism. The relation between the entity type and the relation type is strengthened, and the accuracy of the model is greatly improved.
Drawings
FIG. 1 is an overall architecture diagram of the C-THA model.
FIG. 2 shows RoBERTa sentence modeling.
FIG. 3 is a schematic diagram of the operation of the Conditional Layer Normalization module.
Detailed Description
The present invention is further described with reference to the following specific examples.
The technology provided by the invention is a Chinese medical text entity relation joint extraction method based on a conversation attention mechanism, which comprises the following steps:
Step 1, inputting a sentence into the RoBERTa layer, fully extracting sentence features and mining the associations between words:
The overall architecture of the model is roughly divided into two parts: the first part is the RoBERTa layer; the second part mainly consists of the CLN layer and the THA layer, which predict the corresponding object (tail) entity for each relation associated with a subject (head) entity. The overall architecture of the model is shown in FIG. 1. The sentence is input into the RoBERTa layer to fully extract sentence features and mine the associations between words. Head-entity and tail-entity extraction are placed in the same step, and the relation types between entities are predicted while the tail entities are extracted. The start and end of every input span are marked by pointer tagging, converting the multi-span problem into N² classifications (N is the sequence length). The sequence matrix produced by entity extraction is then processed by the CLN layer and the THA layer to complete the extraction of the triples.
Relative to BERT, RoBERTa introduces no major changes to the model but improves the details: longer training time, a larger batch size, more training data, and a dynamically adjusted masking mechanism. An additional output layer can be fine-tuned on top of the RoBERTa model, giving it excellent performance in downstream tasks. RoBERTa is, overall, a bidirectional encoding representation algorithm based on the Transformer, used here for feature extraction and sentence modeling. As an auto-encoding model it can reconstruct the original data from noisy data: part of the words in the corpus are masked out and annotated with [MASK] symbols, and a 12-layer Transformer encoder predicts them according to the language model. The training flow is shown in FIG. 2.
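For illustration only, a minimal Python sketch of how a 12-layer Chinese RoBERTa encoder turns a sentence into the token representations h used by the later layers is given below; the HuggingFace transformers library and the checkpoint name "hfl/chinese-roberta-wwm-ext" are assumptions of the sketch and are not specified in the patent (whose reference implementation is Keras-based).

```python
# Illustrative sketch only: the library (HuggingFace transformers) and the checkpoint
# name below are assumptions, not taken from the patent.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
encoder = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext")  # 12-layer encoder

sentence = "胰腺癌的临床表现为胰腺肿块"  # "the clinical manifestation of pancreatic cancer is a pancreatic mass"
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    h = encoder(**inputs).last_hidden_state  # (1, sequence length, hidden size) token vectors
print(h.shape)
```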
The input sentence is segmented and annotated, and a distributed representation of the sentence is produced:
X = {X_1, X_2, …, X_t, …, X_n}    #(1)
X_t = E_T + E_S + E_P    #(2)
Each token representation contains a word vector, a text vector and a position vector. In the formula, E_T denotes the word vector (token embedding), E_S denotes the text vector (E_seg embedding), and E_P denotes the position vector (E_pos embedding).
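A minimal numpy sketch of formula (2) follows; the vocabulary size, segment count, sequence length and the random lookup tables are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Minimal sketch of formula (2): X_t = E_T + E_S + E_P. All sizes are illustrative.
vocab_size, num_segments, max_len, dim = 1000, 2, 128, 768
rng = np.random.default_rng(0)
token_emb = rng.normal(size=(vocab_size, dim))      # E_T lookup table (word vectors)
segment_emb = rng.normal(size=(num_segments, dim))  # E_S lookup table (text vectors)
position_emb = rng.normal(size=(max_len, dim))      # E_P lookup table (position vectors)

def embed(token_ids, segment_ids):
    """Return X = {X_1, ..., X_n} with X_t = E_T + E_S + E_P (formula (2))."""
    positions = np.arange(len(token_ids))
    return token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]

x = embed(np.array([101, 57, 214, 102]), np.array([0, 0, 0, 0]))
print(x.shape)  # (4, 768)
```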
Step 2, extracting entities of the input sentence:
The method is based on a cascaded pointer network: two cascaded modules are used to extract the triples, the two modules corresponding to entity extraction and corresponding-relation extraction. Entity extraction is performed on each input sentence and covers both head entities and tail entities. The extracted entity, i.e. the head entity, is input into the next module, all relations are traversed, and it is calculated whether there exists a relation that can match the head entity to a tail entity.
Each input sentence is passed through a 12-layer RoBERTa encoder to obtain the encoding vector h, from which all entities in the input sentence, including head entities and tail entities, are extracted. At a deeper level this is essentially a binary classification problem: a pointer network is initialized to assign a 0/1 binary tag to every tagging position. The 0/1 binary tags indicate the start and end positions of the recognized entities, which are input as objects to the next-level module.
s_start = σ(W_start · h + b_start),   s_end = σ(W_end · h + b_end)
In the formula, s_start and s_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position. If a position probability exceeds the set threshold it is marked 1, otherwise 0. W_start and W_end denote the weights of the fully connected layer, with new weights updated through each input. b_start and b_end denote bias vectors, and σ is the sigmoid function used as the activation function.
The representation of all objects in the input sentence x is optimized by the following likelihood function.
[likelihood of the entity start/end tag sequences over the L positions of sentence x]
Where L is the length of the sentence. In the output start and end sequences, the start position of an entity is marked 1, with R_1 = 1 and R_2 = 0; the end position of an entity is marked 1, with R_1 = 0 and R_2 = 1. The parameters are the weights and biases of the start/end taggers.
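The 0/1 pointer tagging described above can be sketched as follows; the weights are random stand-ins for the trained fully connected layer, and the 0.5 threshold is an assumption (the patent only speaks of a set threshold).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimal sketch of the 0/1 start/end pointer tagging over the encoder output h.
L, dim = 12, 768                      # sentence length and encoder width (illustrative)
h = np.random.randn(L, dim)           # RoBERTa output for one sentence
W_start, b_start = np.random.randn(dim) * 0.01, 0.0
W_end,   b_end   = np.random.randn(dim) * 0.01, 0.0

s_start = sigmoid(h @ W_start + b_start)   # probability of each position being a start
s_end   = sigmoid(h @ W_end   + b_end)     # probability of each position being an end

threshold = 0.5                            # assumed value
start_tags = (s_start > threshold).astype(int)
end_tags   = (s_end   > threshold).astype(int)

# Pair each start with the nearest following end to recover entity spans.
entities = []
for i in np.where(start_tags == 1)[0]:
    ends = np.where(end_tags[i:] == 1)[0]
    if len(ends):
        entities.append((int(i), int(i + ends[0])))
print(start_tags, end_tags, entities)
```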
Step 3, traversing all the different objects and inputting them into the subsequent module to extract the triples:
The CLN layer stands for Conditional Layer Normalization. The method takes a fixed-length vector as the condition of a text-generation scenario and fuses the condition into the β and γ of the normalization layer (Layer Normalization). Its working principle is shown in FIG. 3, and the concrete implementation formula is:
CLN(h) = γ · (h - avg(h)) / std(h) + β
Where avg is the mean of h and std is the standard deviation of h. β and γ are two dynamic matrices that are continually updated as the objects in the input sentence change. For the model of this method, fixed-length vectors are used as the initial, unconditional β and γ.
That is, the method transforms the input condition object to the same dimensionality as β and γ through two different all-zero-initialized transformation matrices, and then adds the two transformation results to β and γ respectively. In this state the model remains identical to the original pre-trained model.
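A minimal sketch of Conditional Layer Normalization as described above: the condition vector is projected by two zero-initialized matrices and added to the unconditional β and γ, so that the untrained model behaves exactly like plain Layer Normalization. The dimensions and the base β/γ values are assumptions of the sketch.

```python
import numpy as np

def conditional_layer_norm(h, condition, W_gamma, W_beta, eps=1e-6):
    """CLN sketch: gamma/beta of layer normalisation are shifted by projections of
    the fixed-length condition vector (e.g. the extracted head entity)."""
    gamma = 1.0 + condition @ W_gamma      # base gamma assumed to be all ones
    beta = 0.0 + condition @ W_beta        # base beta assumed to be all zeros
    avg = h.mean(axis=-1, keepdims=True)
    std = h.std(axis=-1, keepdims=True)
    return gamma * (h - avg) / (std + eps) + beta

dim = 768
h = np.random.randn(10, dim)               # sentence representation (L x dim)
subject_vec = np.random.randn(dim)          # fixed-length condition vector
W_gamma = np.zeros((dim, dim))              # zero init keeps the model equal to plain LayerNorm at start
W_beta = np.zeros((dim, dim))
out = conditional_layer_norm(h, subject_vec, W_gamma, W_beta)
print(out.shape)  # (10, 768)
```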
Before entering the THA layer, the method concatenates the output of the CLN layer with the E_pos-Embedding used earlier during entity extraction. This idea not only reuses the relevant parameters from entity recognition, but also improves the accuracy of relation extraction.
H = Concat(CLN output, E_pos)
The THA layer stands for Talking-Head Attention. The original multi-head attention focuses only on the expression of each individual feature (head); the operations of the heads are isolated from one another, and stronger attention can be obtained by letting the heads talk to each other. The original formulas are as follows:
Attention(Q, K, V) = softmax(Q·K^T / √d_k)·V    #(7)
head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)    #(8)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)·W^O    #(9)
On this basis, the Talking-Head Attention mechanism uses two parameter matrices λ_L and λ_W to re-fuse the multi-head attentions into multiple mixed attentions, each newly obtained mixed attention fusing the attentions of the original heads. The formulas are as follows:
[formulas defining J_i and O_i from the per-head attentions using the two parameter matrices λ_L and λ_W]
In the formulas, different Query, Key and Value weight matrices are used, each generated by random initialization; the word embeddings are then projected into different spaces through training. head_i denotes the computation result of the i-th feature (head), and J denotes the concatenation of the results J_i of all heads; J_i denotes each feature (head) associated with all the other features through two talking steps; O_i denotes the output Talking-Head Attention result.
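The sketch below follows the published Talking-Heads Attention formulation (attention logits mixed across heads by λ_L before the softmax, attention weights mixed by λ_W after it), which matches the two "talking" steps described above; the exact formulas of the patent are filed as images, so this is an illustration rather than the patent's own equations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Talking-Heads Attention sketch: every head can "talk" to all the others twice.
num_heads, L, d_k, d_v = 4, 10, 16, 16
rng = np.random.default_rng(1)
Q = rng.normal(size=(num_heads, L, d_k))
K = rng.normal(size=(num_heads, L, d_k))
V = rng.normal(size=(num_heads, L, d_v))
lambda_L = rng.normal(size=(num_heads, num_heads))  # mixes attention logits across heads
lambda_W = rng.normal(size=(num_heads, num_heads))  # mixes attention weights across heads

logits = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)               # (heads, L, L)
mixed_logits = np.einsum('hij,hg->gij', logits, lambda_L)       # first "talking" step
J = softmax(mixed_logits, axis=-1)                              # per-head attention weights J_i
mixed_weights = np.einsum('hij,hg->gij', J, lambda_W)           # second "talking" step
O = mixed_weights @ V                                           # O_i, one output per head
out = O.transpose(1, 0, 2).reshape(L, num_heads * d_v)          # concatenate the heads
print(out.shape)  # (10, 64)
```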
r_start = σ(W_start^r · o + b_start^r),   r_end = σ(W_end^r · o + b_end^r)
In the formulas, r_start and r_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position, and o is the feature output by the THA layer. If a position probability exceeds the set threshold it is marked 1, otherwise 0. W_start^r and W_end^r denote the weights of the fully connected layer, with new weights updated through each input. b_start^r and b_end^r denote bias vectors, and σ is the sigmoid function used as the activation function.
The corresponding-relation representation of all objects in the input sentence x is optimized by the following likelihood function:
[likelihood of the relation-specific start/end tag sequences over the L positions of sentence x]
Where L is the length of the sentence. In the output start or end sequence, the start position of the tail entity of the corresponding relation is marked 1, with I_1 = 1 and I_2 = 0; the end position of the tail entity is marked 1, with I_1 = 0 and I_2 = 1. The parameters are the weights and biases of the relation taggers.
In FIG. 1 we can see that, for each output sentence, all entities are matched one by one against the schema, and a start/end matrix over all relations is constructed. As shown in the figure, the entity "pancreatic cancer" is compared with the relations "imaging examination", "age of onset", "clinical manifestation", etc., to find the tail entity with the highest probability of forming a triple. Finally, the two triples "pancreatic cancer - imaging examination - ultrasound examination" and "pancreatic cancer - clinical manifestation - pancreatic mass" are found.
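A toy decoding sketch of this matching step is given below; the relation names, the character-level tokens and the probability values simply mirror the pancreatic-cancer example of FIG. 1 and are illustrative assumptions.

```python
import numpy as np

# For one head entity, the relation module outputs a start matrix and an end matrix
# of shape (number of relations, sentence length); thresholding them and pairing
# starts with ends yields (head, relation, tail) triples.
relations = ["影像学检查", "发病年龄", "临床表现"]   # imaging exam, onset age, clinical manifestation
tokens = list("胰腺癌的临床表现为胰腺肿块")
L, R = len(tokens), len(relations)
r_start = np.zeros((R, L)); r_end = np.zeros((R, L))
r_start[2, 9] = 0.9; r_end[2, 12] = 0.8                 # toy probabilities for "临床表现"

def decode(head, r_start, r_end, threshold=0.5):
    triples = []
    for r in range(R):
        for i in np.where(r_start[r] > threshold)[0]:
            ends = np.where(r_end[r, i:] > threshold)[0]
            if len(ends):
                tail = "".join(tokens[i:i + ends[0] + 1])
                triples.append((head, relations[r], tail))
    return triples

print(decode("胰腺癌", r_start, r_end))  # [('胰腺癌', '临床表现', '胰腺肿块')]
```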
For the training set D, the likelihood functions of the entities and relations of each sentence x_i are summed. The model is trained with the Adam optimizer, maximizing the value K. The learning rate of the optimizer is initially set to a larger value and then dynamically reduced as the number of steps increases, so that both efficiency and effectiveness are achieved. In the formula, T_i denotes all objects in the input sentence and T_r denotes all relations corresponding to the head entities.
[training objective K: the sum of the entity and relation log-likelihoods over the training set D]
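Since the start/end tags are 0/1 labels, maximizing the likelihood K is equivalent to minimizing a summed binary cross-entropy; the Keras sketch below illustrates this together with an Adam optimizer whose learning rate starts larger and decays over time. The concrete learning rate and decay schedule are assumptions, not values from the patent.

```python
import tensorflow as tf
from tensorflow import keras

# Sketch of the training objective: entity start/end tags and relation-specific tail
# tags are all 0/1 labels, so the summed likelihood can be trained as a summed
# binary cross-entropy.
bce = keras.losses.BinaryCrossentropy()

def joint_loss(y_true_entity, y_pred_entity, y_true_relation, y_pred_relation):
    # sum of the entity-module loss and the relation-module loss
    return bce(y_true_entity, y_pred_entity) + bce(y_true_relation, y_pred_relation)

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,   # "set to a larger value" (assumed figure)
    decay_steps=1000,             # dynamically reduced as the step count grows
    decay_rate=0.9)
optimizer = keras.optimizers.Adam(learning_rate=lr_schedule)
```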
The experimental process is as follows:
The experiments use the Baidu2019 and CHIP2020 data sets. Baidu2019 is the largest industry-scale Chinese information extraction data set based on triple schemas, containing 50 predefined schemas, 210,000 Chinese sentences and 430,000 triples. The sentences in the data set come from Baidu Baike and Baidu news-feed texts. The data set is divided into 170,000 training sentences, 20,000 validation sentences and 20,000 test sentences. CHIP-2020 is a schema-based Chinese medical information extraction data set jointly constructed by the Natural Language Processing Laboratory of Zhengzhou University and the Key Laboratory of Computational Linguistics (Ministry of Education) of Peking University. CHIP-2020 is currently a Chinese medical data set with high annotation quality and high full-text coverage, containing nearly 2 million disease statements about 109 common diseases.
The statistics of the two data sets are shown in Table 1. Both Baidu2019 and CHIP-2020 exhibit the entity-overlap problem; in CHIP-2020 in particular, overlapping cases account for over 60% of the data set. In terms of the entities that need to be identified, the Baidu2019 data set is easier than CHIP2020: CHIP2020 uses more medical terminology and has a higher overlap rate. In terms of relation overlap, the sentences of the Baidu2019 data set are all single sentences, whereas most of CHIP2020 consists of spliced sentences that emphasize contextual relations, making relation identification more difficult. The baseline in this work is established with the Casrel model, and the differences from other models are compared in the general domain on the Baidu2019 data set. For the medical text data set CHIP-2020, the effectiveness of each module of the model presented here is compared against the baseline model.
Table 1 data set statistics
In the experiments, precision, recall and the F1 score are used as the comprehensive evaluation indices of the extraction results. The concrete formulas are:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 = 2 · Precision · Recall / (Precision + Recall)
Where TP is the number of correctly predicted triples, FP is the number of predicted triples that are not in the gold set, and FN is the number of gold triples that were not predicted.
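A minimal sketch of how these three indices can be computed from predicted and gold triple sets (exact-match comparison of triples is an assumption of the sketch):

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over two sets of (head, relation, tail) triples."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)          # correctly predicted triples
    fp = len(predicted - gold)          # predicted but not in the gold set
    fn = len(gold - predicted)          # gold triples that were missed
    precision = tp / (tp + fp) if predicted else 0.0
    recall = tp / (tp + fn) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(prf1([("胰腺癌", "临床表现", "胰腺肿块")],
           [("胰腺癌", "临床表现", "胰腺肿块"), ("胰腺癌", "影像学检查", "超声检查")]))
# (1.0, 0.5, 0.666...)
```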
In this experiment, the joint extraction model is based on the Keras framework. The hardware and software environments are shown in Table 2.
Table 2 experimental environment configuration
In the experiments, the Baidu2019 data set is used to verify the joint extraction capability of the model and to compare it with other joint models, proving the effectiveness of the model. In addition, ablation experiments are performed on CHIP2020, focusing on the contribution of the RoBERTa encoder, the CLN layer and the THA layer to the improvement of model performance and evaluating how well the two modules cooperate.
TABLE 3 comparison of different models on the Baidu2019 dataset
Table 4 comparison of different models on CHIP2020 dataset
The results comparing the model presented here with other models on the Baidu2019 data set are shown in Table 3. MultiR combines a sentence-level extraction model with a simple corpus-level module to aggregate single entities. CoType uses a text segmentation algorithm when extracting entities and, for subsequent entity and relation extraction, embeds text features and type labels into two low-dimensional spaces belonging to entities and relations respectively. Multi-head selection proposes a joint neural model that performs entity recognition and relation extraction simultaneously without any manually extracted features or external tools; the entity recognition task and the relation extraction task are modeled as a multi-head selection problem, i.e. each entity may participate in multiple relations. Casrel models relations as functions that map head entities to tail entities, rather than treating them as labels on entity pairs. ETL-Span is a span-based tagging scheme that decomposes the two subtasks of entity recognition and relation extraction into several sequence labeling problems and solves them with a hierarchical boundary tagger and a multi-span decoding algorithm.
From the experimental results on the Baidu2019 data set, MultiR performs poorly, mainly because the algorithm does not handle the overlap problem well. CoType, Multi-head selection, Casrel and ETL-Span improve on it mainly by targeting the entity-relation overlap problem, and with the global semantic information encoded by BERT their overall effect is much better than that of MultiR. The model presented here, however, not only far surpasses the baseline model in precision and recall, but also reaches an F1 value of 0.819, above each of the other models. Compared with the other models, it exploits the connection between entities and relations more fully and is better suited to the overlap problem. Therefore, to compare the effect of the individual modules of the model, validation was performed on the CHIP-2020 data set, as shown in Table 4.
From the perspective of the ablation experiments, the results on CHIP2020 are less favorable because of the sentence-splicing and entity-overlap problems in the data set; directly applying a conventional joint extraction model to this data set does not work well, and Table 4 shows that the F1 value of the baseline model is relatively low. Using RoBERTa embeddings raises the F1 value of the model by 0.04. The CLN-layer and THA-layer modules bring a larger improvement over the baseline model, effectively exploiting the interrelation between entity recognition and relation extraction. The F1 value of our model on CHIP2020 finally rises to 0.64, which is 0.16 higher than the baseline model.

Claims (4)

1. A Chinese medical text entity relation combined extraction method based on a conversation attention mechanism is characterized by comprising the following steps:
step 1, inputting a sentence into the RoBERTa layer, fully extracting sentence features and mining the associations between words:
the sentence is input into the RoBERTa layer to fully extract sentence features and mine the associations between words; head entities and tail entities are extracted in the same step, and the relation types between the entities are predicted; the start and end of every input span are marked by pointers, converting the multi-span problem into N² classifications, where N is the sequence length; the sequence matrix produced by entity extraction is then processed by the CLN layer and the THA layer to complete the extraction of the triples;
step 2, extracting entities of the input sentence: following a cascaded pointer network, two cascaded modules are used to extract the triples, the two modules corresponding to the two contents of entity extraction and corresponding-relation extraction; entity extraction is performed on each input sentence and covers both head entities and tail entities; the extracted entity, namely the head entity, is input into the next module, all relations are traversed, and whether there exists a relation that can match the head entity to a tail entity is calculated;
and step 3, traversing all the different objects, inputting them into the subsequent module, and extracting the triples.
2. The method for jointly extracting Chinese medical text entity relations based on the conversation attention mechanism as claimed in claim 1, wherein: in step 1, the RoBERTa layer performs feature extraction and sentence modeling based on a Transformer-based bidirectional encoding representation algorithm;
the input sentence is segmented and annotated, and a distributed representation of the sentence is produced:
X = {X_1, X_2, …, X_t, …, X_n}    #(1)
X_t = E_T + E_S + E_P    #(2)
each token representation comprises a word vector, a text vector and a position vector; in the formula, E_T denotes the word vector (token embedding), E_S denotes the text vector (E_seg embedding), and E_P denotes the position vector (E_pos embedding).
3. The method for jointly extracting Chinese medical text entity relations based on the conversation attention mechanism as claimed in claim 1 or 2, wherein: in step 2, each input sentence is passed through a 12-layer RoBERTa encoder to obtain the encoding vector h, from which all entities in the input sentence, including head entities and tail entities, are extracted; a pointer network is initialized to assign a 0/1 binary tag to every tagging position; the 0/1 binary tags mark the start and end positions of the recognized entities, and the tagged entities are input as objects to the next-level module;
s_start = σ(W_start · h + b_start),   s_end = σ(W_end · h + b_end)
in the formula, s_start and s_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position; if a position probability exceeds the set threshold it is marked 1, otherwise 0; W_start and W_end denote the weights of the fully connected layer, with new weights updated through each input; b_start and b_end denote bias vectors, and σ is the sigmoid function used as the activation function;
the representation of all objects in the input sentence x is optimized by the following likelihood function;
[likelihood of the entity start/end tag sequences over the L positions of sentence x]
where L is the length of the sentence; in the output start and end sequences, the start position of an entity is marked 1, with R_1 = 1 and R_2 = 0; the end position of an entity is marked 1, with R_1 = 0 and R_2 = 1; the parameters are the weights and biases of the start/end taggers.
4. The method for jointly extracting Chinese medical text entity relations based on the conversation attention mechanism as claimed in claim 1 or 2, wherein in step 3 the text-generation scenario conditioned on a fixed-length vector fuses the condition into the β and γ of the normalization layer; the concrete implementation formula is:
CLN(h) = γ · (h - avg(h)) / std(h) + β
where avg is the mean of h and std is the standard deviation of h; β and γ are two dynamic matrices that are continually updated as the objects in the input sentence change;
before entering the THA layer, the output of the CLN layer is concatenated with the E_pos-Embedding used earlier during entity extraction;
H = Concat(CLN output, E_pos)
the newly derived mixed attention formulas are shown below:
[formulas defining J_i and O_i from the per-head attentions using the two parameter matrices λ_L and λ_W]
in the formulas, different Query, Key and Value weight matrices are used, each generated by random initialization; the word embeddings are then projected into different spaces through training; head_i denotes the computation result of the i-th feature (head), and J denotes the concatenation of the results J_i of all heads; J_i denotes each feature (head) associated with all the other features through two talking steps; O_i denotes the output talking-head attention result;
r_start = σ(W_start^r · o + b_start^r),   r_end = σ(W_end^r · o + b_end^r)
where r_start and r_end denote the outputs, i.e. the sets of probabilities of every position being a start or an end position, and o is the feature output by the THA layer; W_start^r and W_end^r denote the weights of the fully connected layer, with new weights updated through each input; b_start^r and b_end^r denote bias vectors, and σ is the sigmoid function used as the activation function;
the corresponding-relation representation of all objects in the input sentence x is optimized by the following likelihood function:
[likelihood of the relation-specific start/end tag sequences over the L positions of sentence x]
where L is the length of the sentence; in the output start or end sequence, the start position of the tail entity of the corresponding relation is marked 1, with I_1 = 1 and I_2 = 0; the end position of the tail entity is marked 1, with I_1 = 0 and I_2 = 1; the parameters are the weights and biases of the relation taggers;
for the training set D, the likelihood functions of the entities and relations of each sentence x_i are summed; the Adam optimizer is adopted and the value K is maximized to train the model; the learning rate of the optimizer is initially set to a larger value and then dynamically reduced as the number of steps increases, so that both efficiency and effectiveness are achieved; in the formula, T_i denotes all objects in the input sentence and T_r denotes all relations corresponding to the head entity;
[training objective K: the sum of the entity and relation log-likelihoods over the training set D]
CN202210315494.7A 2022-03-28 2022-03-28 Chinese medical text entity relation combined extraction method based on conversation attention mechanism Pending CN114756679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210315494.7A CN114756679A (en) 2022-03-28 2022-03-28 Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210315494.7A CN114756679A (en) 2022-03-28 2022-03-28 Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Publications (1)

Publication Number Publication Date
CN114756679A true CN114756679A (en) 2022-07-15

Family

ID=82327471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210315494.7A Pending CN114756679A (en) 2022-03-28 2022-03-28 Chinese medical text entity relation combined extraction method based on conversation attention mechanism

Country Status (1)

Country Link
CN (1) CN114756679A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116894436A (en) * 2023-09-06 2023-10-17 神州医疗科技股份有限公司 Data enhancement method and system based on medical named entity recognition
CN116894436B (en) * 2023-09-06 2023-12-15 神州医疗科技股份有限公司 Data enhancement method and system based on medical named entity recognition


Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination