CN110188193A - Electronic health record entity relation extraction method based on shortest dependency subtree - Google Patents

Electronic health record entity relation extraction method based on shortest dependency subtree

Info

Publication number
CN110188193A
CN110188193A
Authority
CN
China
Prior art keywords
sentence
entity
shortest
dependency
subtree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910318692.7A
Other languages
Chinese (zh)
Inventor
李智
冯苗
李健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201910318692.7A priority Critical patent/CN110188193A/en
Publication of CN110188193A publication Critical patent/CN110188193A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Machine Translation (AREA)

Abstract

The present invention proposes an electronic health record entity relation extraction method based on the shortest dependency subtree. The method first applies dependency parsing to extract the entity-based shortest subtree from the original sentence, compressing sentence length; the sentence is then encoded with a bidirectional long short-term memory (BLSTM) neural network, the final semantic representation of the sentence is learned through a max-pooling layer (Max Pooling), and the entity relation is finally obtained by classification with a softmax classifier. The method not only removes noise words and compresses sentence length, but also fully retains the key words that characterize the relation between the entities, so that the semantic relation of the compressed sentence is clearer. It overcomes the problem that existing electronic health record entity relation extraction models cannot represent the semantic information of a sentence well when the sentence is too long, and improves the performance of the relation extraction model.

Description

Electronic health record entity relation extraction method based on shortest dependency subtree
Technical field
The invention belongs to the field of natural language processing and relates to an entity relation extraction method, in particular to an electronic health record entity relation extraction method based on the shortest dependency subtree.
Background technique
With the arrival of the big data era, data in every field is growing rapidly. In the medical field in particular, a large number of electronic health records are produced during clinical diagnosis and treatment, containing a large amount of unstructured text and medical knowledge. Effectively mining and utilizing this knowledge is significant for the development of the healthcare profession. The effective way to mine the knowledge in electronic health records is information extraction technology, of which relation extraction between concept entities is an important component.
At present, entity relation extraction for electronic health records mainly relies on machine learning and on deep learning. Machine-learning methods first perform feature selection on the candidate entities, add medical knowledge as auxiliary analysis, convert the extracted features into feature vectors, and carry out supervised discriminative classification in a vector space model to obtain the relation of the entity pair. Their disadvantage is that they generally require a large amount of manually annotated training data from which the corresponding extraction patterns are learned automatically, and they rely heavily on natural language processing annotations such as part-of-speech tagging and syntactic parsing to provide classification features. Natural language processing annotation tools often make many errors, and these errors propagate and amplify through the relation extraction system, ultimately affecting extraction quality. In recent years, with the continuous breakthroughs of deep learning in different fields, many studies have gradually applied deep learning to natural language processing. Deep-learning methods integrate basic feature vectors and encode them into high-level feature vectors so as to make full use of contextual features; the resulting feature vectors are finally fed into a classifier to extract the relation between the entity pair in the sentence. This approach can reduce the dependence of existing electronic health record entity relation extraction models on the quality of manual feature engineering and improve model recognition performance, but research applying deep learning to electronic health record entity relation extraction is still in its infancy, and the task still faces many difficulties.
Unlike text from other fields, electronic health records contain large numbers of juxtaposed drug names, symptom descriptions and examination item names, so their sentences are much longer than open-domain sentences; moreover, electronic health records lack a unified standard across medical institutions and across physicians, which increases the difficulty of downstream tasks such as named entity recognition and entity relation extraction. These characteristics lead to the problems currently faced by electronic health record entity relation extraction research: sentences are too long and contain many noise words (irrelevant to the relation extraction task, or even interfering with it), which prevents existing models from representing the semantic information of overly long sentences well.
Summary of the invention
In view of the status of existing electronic health record entity relation extraction models and the problems described above, the present invention proposes an electronic health record entity relation extraction method based on the shortest dependency subtree, which overcomes the problem that existing models cannot represent the semantic information of a sentence well when the sentence is too long.
The technical solution of the present invention is as follows: first, the entity-based shortest subtree is extracted from the original sentence by dependency parsing to compress sentence length; the sentence is then encoded by a bidirectional long short-term memory (BLSTM) neural network; the final semantic representation of the sentence is then learned through a max-pooling layer (Max Pooling); and the entity relation is finally obtained by classification with a softmax classifier.
The present invention not only removes noise words and compresses sentence length, but also fully retains the key words that characterize the relation between the entities, so that the semantic relation of the compressed sentence is clearer. It therefore overcomes the problem that existing electronic health record entity relation extraction models cannot represent the semantic information of a sentence well when the sentence is too long, and improves model performance.
Description of the drawings
Fig. 1 is the system framework diagram of the electronic health record entity relation extraction model based on the shortest dependency subtree.
Fig. 2 is an example diagram of labeling word entity types with the BIO tagging scheme.
Fig. 3 is the basic structure diagram of the LSTM neural unit.
Fig. 4 is the structure diagram of the sentence-level semantic representation layer.
Fig. 5 is the sentence length distribution after shortest dependency subtree compression.
Specific embodiment
The present invention is described in further detail below with reference to specific embodiments:
As shown in Fig. 1, the model consists of 7 layers in total: the original input layer, the shortest subtree layer (Sub-Tree Parse Layer, STP Layer), the feature extraction layer, the embedding layer, the BLSTM encoding layer, the sentence-level semantic representation layer and the output layer. The detailed role of each layer is described below.
1. Original input layer: input the original electronic health record sentence.
2. Shortest subtree layer: extract the entity-based shortest subtree using dependency parsing to compress sentence length, as follows:
1) Obtain the dependency parse tree of the original input sentence using transition-based dependency parsing.
The main goal of transition-based dependency parsing is to predict a transition sequence from the features of the current state; the transition sequence uses dependency arcs to describe the dependency relation of each word in the sentence, and the target dependency parse tree is finally obtained from the dependency arcs. Specifically, the model adopts the stack-based dependency parsing method proposed by Nivre et al., in which a parser state (configuration) is represented by a triple consisting of a stack, a buffer and an arc set. For a given sentence:
the stack stores the parse-tree nodes already processed by the system; its initial state contains only the root node;
the buffer is a queue storing the input sequence; its initial state is the entire sentence;
the arc set is the set of dependency arcs, each arc carrying an action type and a dependency label; its initial state is the empty set.
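The state triple described above can be illustrated with a minimal arc-standard sketch (the transition names, the toy oracle and the unlabeled arcs are illustrative assumptions; the patent uses the Nivre-style stack-based parser without spelling out its exact transition system):

```python
# Minimal arc-standard parser state: (stack, buffer, arcs).
# Nodes are word indices; index 0 is the artificial ROOT node.

def parse(words, oracle):
    stack = [0]                               # initially holds only ROOT
    buffer = list(range(1, len(words) + 1))   # initially the whole sentence
    arcs = set()                              # initially the empty set
    while buffer or len(stack) > 1:
        action = oracle(stack, buffer, arcs)
        if action == "SHIFT":                 # move next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":            # second-from-top depends on top
            dependent = stack.pop(-2)
            arcs.add((stack[-1], dependent))
        elif action == "RIGHT-ARC":           # top depends on second-from-top
            dependent = stack.pop()
            arcs.add((stack[-1], dependent))
    return arcs

def toy_oracle(stack, buffer, arcs):
    # Trivial oracle for illustration: shift everything, then attach rightward.
    return "SHIFT" if buffer else "RIGHT-ARC"
```

With the toy oracle, a two-word sentence yields the (head, dependent) arcs {(0, 1), (1, 2)}, with 0 denoting ROOT.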
2) Input the dependency parse tree of the sentence, the start and end position indices of the entity words, and the original sentence;
3) If the tree is empty, exit;
4) According to the start position of each entity, judge whether the entity exists in the dependency parse tree; if not, record the sentence and continue with the next one; if so, go to step (5);
5) According to the position indices of the entities, find and save the nodes corresponding to the entities in the parse tree, then go to step (6);
6) Taking each node obtained in step (5) as a starting point, search along the dependency arcs for its dependency path to the root node and save it, then go to step (7);
7) Taking one of the paths obtained in (6) as the base path, search for the first common node of all paths. Specifically, initialize the first common node to the last node on the base path (i.e., the root node), then query whether this node appears on all the other paths. If even one path does not contain the current node, the node one step above it on the base path is the first common node; go to step (8). If all paths contain it, update the current node to the previous node on the base path and continue judging whether the other paths contain it, until a node is found that is absent from some path;
8) Taking the first common node obtained in step (7) as the root, recursively traverse out the shortest dependency subtree. What the shortest dependency subtree stores are the position indices of its nodes in the original sentence; that is, the subtree rooted at the first common node (including the root node) is retained, and the shortest dependency subtree serves as the trunk of the compressed sentence. Add these subtrees to the dependency subtree set, then go to step (9);
9) Because an entity may contain multiple words, the obtained subtree may contain repeated words, which need to be deleted. Specifically, first sort the position indices obtained in step (8), then compare them one by one and delete repeated indices, then go to step (10);
10) Look up the corresponding words according to the indices from step (9), and add these words to the shortest dependency subtree set.
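Steps (3)-(10) can be sketched in Python as follows (a simplified illustration: the parent-index array representation of the dependency tree and the helper names are assumptions, and dependency labels and error handling are omitted):

```python
def path_to_root(parents, node):
    # Step (6): follow dependency arcs from a node up to the root (parent -1).
    path = [node]
    while parents[node] != -1:
        node = parents[node]
        path.append(node)
    return path

def first_common_node(paths):
    # Step (7): walk the base path from the root downward; the last node
    # shared by every path is the first common node.
    base = paths[0]
    common = base[-1]                      # the root is on every path
    for node in reversed(base[:-1]):
        if all(node in p for p in paths[1:]):
            common = node
        else:
            break
    return common

def shortest_dependency_subtree(parents, entity_indices):
    # Steps (8)-(10): keep the subtree rooted at the first common node and
    # return the sorted, de-duplicated word indices of the compressed sentence.
    paths = [path_to_root(parents, i) for i in entity_indices]
    root = first_common_node(paths)
    children = {}
    for child, parent in enumerate(parents):
        children.setdefault(parent, []).append(child)
    kept, stack = [], [root]
    while stack:                           # recursive traversal, done iteratively
        n = stack.pop()
        kept.append(n)
        stack.extend(children.get(n, []))
    return sorted(set(kept))
```

For example, with `parents = [-1, 0, 1, 1, 0, 4]` (word 0 is the root) and entities at indices 2 and 3, the first common node is 1 and the compressed sentence keeps indices `[1, 2, 3]`.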
3. Feature extraction layer: this layer is the input layer of the BLSTM, and its input is the output of the shortest subtree layer. Within this layer, feature extraction is performed on the text to obtain 3 basic features, namely all the words in the sentence, the entity type of each word, and the distance of each word relative to the entities, as follows:
1) All words in the sentence
To reduce interference, some preprocessing is applied to the raw data: all uppercase letters in the text are lowercased, and numbers in the text are all replaced with 'dg'. Since the BLSTM network requires all input sequences to be of equal length, every sentence is padded to the length of the longest sentence in the text;
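The preprocessing just described might look like this sketch (the `<pad>` token and the treatment of each digit run as a single 'dg' are assumptions beyond the text):

```python
import re

def preprocess(sentences, pad_token="<pad>"):
    # Lowercase every word, replace digit runs with 'dg', and pad each
    # sentence to the length of the longest one (BLSTM inputs must match).
    cleaned = [[re.sub(r"\d+", "dg", w.lower()) for w in s] for s in sentences]
    max_len = max(len(s) for s in cleaned)
    return [s + [pad_token] * (max_len - len(s)) for s in cleaned]
```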
2) Entity type of each word
All words are labeled with the same BIO tagging scheme as Sahu et al. Fig. 2 shows the labeling result for an example sentence [S1] (the head and tail entities in the sentence are workup, and her ... tumor, respectively). B-Tes indicates that the entity type of [workup] is Test and that it is the beginning of an entity; B-Pro indicates that the entity type of [her] is medical problem (Problem) and that it is the beginning of the entity; and I-Pro indicates that the entity type of [tumor] is also Problem, but that it is an interior part of the entity;
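A minimal illustration of BIO tagging over entity spans (the helper, the example sentence and the span format are illustrative; the tag names follow Fig. 2):

```python
def bio_tags(words, entities):
    # entities: list of (entity type, start, end) word spans, end exclusive.
    # B- marks the first word of an entity, I- any following word, O the rest.
    tags = ["O"] * len(words)
    for etype, start, end in entities:
        tags[start] = "B-" + etype
        tags[start + 1:end] = ["I-" + etype] * (end - start - 1)
    return tags
```

For a sentence containing the entities of Fig. 2, `bio_tags(["workup", "revealed", "her", "tumor"], [("Tes", 0, 1), ("Pro", 2, 4)])` yields `["B-Tes", "O", "B-Pro", "I-Pro"]`.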
3) Position of each word relative to the entities
The position of each word relative to the two entities is encoded: words contained in an entity itself have relative distance 0; words before the entity are represented by negative distances whose magnitude grows with distance, and words after the entity by positive distances that grow with distance.
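The relative-distance feature can be sketched as follows for a single entity span (sign convention as in the text: 0 inside the entity, negative before it, positive after it; in the model this is computed once for each entity of the pair):

```python
def relative_positions(n_words, ent_start, ent_end):
    # Distance of each word to the entity span [ent_start, ent_end):
    # 0 inside the entity, negative before it, positive after it.
    positions = []
    for i in range(n_words):
        if i < ent_start:
            positions.append(i - ent_start)    # ..., -2, -1 before the entity
        elif i >= ent_end:
            positions.append(i - ent_end + 1)  # 1, 2, ... after the entity
        else:
            positions.append(0)                # inside the entity
    return positions
```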
4. Embedding layer: map the features extracted by the feature extraction layer into low-dimensional raw feature vectors, specifically word embedding, position embedding and word-type embedding:
1) Word embedding
Word embedding maps the high-dimensional space whose dimension equals the number of all words into a much lower-dimensional vector space, so that each word or phrase is mapped to a vector over the real numbers. The word2vec model provided in the gensim toolkit is used, with word-vector dimension 100, window size 10, and minimum co-occurrence count 3. A given sentence containing several words is represented by the position number of each word in the embedding table; looking up these numbers yields the word-vector sequence of the sentence.
2) Position embedding
Position embedding is handled in a way similar to word embedding: a matrix represents the distance of each word to the entity pair, with one dimension being the size of the real vector to which each relative distance is mapped (a user-adjustable hyperparameter) and the other being the size of the fixed dictionary, i.e., the range of relative distances;
3) Word-type embedding
A matrix represents the entity-type feature of each word, with one dimension being the size of the vector to which each word category is mapped and the other being the number of word categories;
The finally obtained feature-vector sequence concatenates, for each word, its word embedding, position embedding and word-type embedding.
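The per-word concatenation of the three embeddings can be sketched with numpy (the dimensions 100, 10 and 15 follow the experiment section; the randomly initialized tables stand in for the trained word2vec vectors and the learned embeddings, and only one position feature is shown although the model uses distances to both entities):

```python
import numpy as np

rng = np.random.default_rng(0)
W_word = rng.normal(size=(5000, 100))  # word embedding table (word2vec)
W_pos = rng.normal(size=(200, 10))     # relative-distance embedding table
W_type = rng.normal(size=(7, 15))      # entity-type (BIO tag) embedding table

def embed(word_ids, pos_ids, type_ids):
    # One row per word: [word | position | entity type], dim 100 + 10 + 15.
    return np.concatenate(
        [W_word[word_ids], W_pos[pos_ids], W_type[type_ids]], axis=1)
```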
5. BLSTM encoding layer: use the BLSTM network to learn the semantic information of the raw feature vectors from the embedding layer and build high-level features;
On the basis of the LSTM, an additional hidden layer is added in which the network is trained on the reversed input sequence; the outputs of the two networks are then added, forming the bidirectional LSTM (BLSTM). The basic structure of the LSTM neural unit is shown in Fig. 3. The output of the BLSTM encoding layer is a matrix whose two dimensions are the dimension of each output word vector and the length of the output sequence.
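A pure-numpy sketch of the BLSTM encoding described above (a didactic implementation with gate weights packed into single matrices; in the patent's model the weights are learned rather than random):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_pass(X, W, U, b):
    # One directional pass over X of shape (T, d_in); the gate
    # pre-activations are packed as [input | forget | output | candidate].
    d = U.shape[1]
    h, c, outputs = np.zeros(d), np.zeros(d), []
    for x in X:
        z = W @ x + U @ h + b
        i, f, o = sigmoid(z[:d]), sigmoid(z[d:2*d]), sigmoid(z[2*d:3*d])
        g = np.tanh(z[3*d:])
        c = f * c + i * g          # forget old memory, write new candidate
        h = o * np.tanh(c)         # gated hidden output
        outputs.append(h)
    return np.stack(outputs)

def blstm_encode(X, fwd, bwd):
    # Run a forward pass and a reversed pass and add them, as in the text.
    return lstm_pass(X, *fwd) + lstm_pass(X[::-1], *bwd)[::-1]
```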
6. Sentence-level semantic representation layer: learn sentence-level semantic information through max pooling (Max Pooling);
The output of the BLSTM encoding layer is a sequence of word-level feature vectors of the sentence, which needs to be converted into a sentence-level feature vector. Using a fully connected layer here would make this layer's input dimension very large and the network very complex. Therefore, inspired by convolutional neural networks (CNN), a convolution operation is first applied to the output of the BLSTM encoding layer, and a fixed-length vector is then obtained by max pooling; this vector is the final sentence-level semantic representation. The pooling process is shown in Fig. 4;
The convolution operation slides a filter of fixed length over the hidden-layer output to extract local features. Specifically, the operation in Fig. 4 is the dot product between the convolution matrix and the hidden-layer vectors within the window, plus a bias vector, where the matrix being convolved is formed by sliding the filter over the hidden-layer output. The final sentence-level semantic vector is then obtained by taking, for each dimension, the maximum of the convolution outputs over all window positions (max pooling).
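The convolution-plus-max-pooling step can be sketched as follows (the window length, filter count and the flattened-window formulation are illustrative choices):

```python
import numpy as np

def sentence_vector(H, F, b, k=3):
    # H: (T, d) BLSTM outputs; F: (m, k*d) filter matrix; b: (m,) bias.
    # Slide a length-k window over H, take dot products with the filters,
    # then max-pool over all positions to get a fixed-length vector.
    T, d = H.shape
    windows = np.stack([H[t:t + k].ravel() for t in range(T - k + 1)])
    M = windows @ F.T + b          # (T-k+1, m) local features
    return M.max(axis=0)           # (m,) sentence-level representation
```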
7. Output layer: taking the output of step (6) as input, classify the entity relation in the sentence using a softmax classifier;
1) Take the sentence-level semantic representation as the input of a fully connected network and obtain its output;
2) Take the obtained output as the input of the softmax classifier and compute the conditional probability of each class to predict the class label of the entity pair; the number of output classes is the total number of relation types. To prevent model overfitting, a dropout strategy is used in the BLSTM layer of all models.
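The output layer can be sketched as follows (the tanh activation of the fully connected layer and the weight shapes are assumptions; dropout is omitted since it applies only at training time):

```python
import numpy as np

def classify(s, W_fc, b_fc, W_out, b_out):
    # s: sentence-level vector; a fully connected layer, then softmax
    # over the relation classes.
    h = np.tanh(W_fc @ s + b_fc)       # fully connected layer
    z = W_out @ h + b_out
    p = np.exp(z - z.max())
    p /= p.sum()                       # conditional class probabilities
    return p, int(np.argmax(p))        # distribution and predicted label
```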
This patent illustrates the advantage of the present invention in electronic health record entity relation extraction with the data used in the electronic health record information extraction challenge organized in the United States in 2010 (i2b2-2010 shared task challenge). The data define 3 types of entities: medical problem (Problem), treatment (Treatment) and test (Test). After data cleaning, the experimental data contain 6 relation types in total: medical problem indicates medical problem (PIP), test conducted to investigate medical problem (TeCP), test reveals medical problem (TeRP), treatment administered for medical problem (TrAP), treatment caused medical problem (TrCP), and no relation between the entities (Other).
In this part, grid search is used to find the best parameters. Specifically, the learning-rate range is {0.1, 0.01, 0.001, 0.0001}, the dropout probability range is {0.2, 0.3, ..., 0.7}, the batch-size range is {30, 40, ..., 200}, and the iteration-count range is {5, 6, ..., 25}; the search ranges for the BLSTM hidden size and the regularization hyperparameter are given in Table 1. For the remaining parameters, the word-embedding dimension, position-embedding dimension, entity-type-embedding dimension and number of BLSTM layers are empirically set to 100, 10, 15 and 1, respectively. The specific parameters are shown in Table 1.
Table 1: Experiment parameter settings
The dependency parse trees of the test data are extracted with the Stanford Parser, the open-source syntactic analysis tool of the Stanford NLP group, and the entity-based shortest dependency subtree (Shortest Dependency Subtree, SDST) is then obtained. The sentence-length distribution of the final experimental data is shown in Fig. 5. It can be seen that after shortest dependency subtree compression, the number of sentences longer than 24 words drops from 13093 to 5746, while the number of sentences of length at most 24 rises from 10954 to 18301, so that sentence length is effectively reduced.
The electronic health record entity relation extraction model based on the shortest subtree proposed by this patent (BLSTM-SDST) is compared with some related deep learning models; the results are shown in Table 2. Compared with 3 other deep learning models (BLSTM-SDP (Shortest Dependency Path, SDP), BLSTM and CNN), the proposed model achieves the best F1 score, improving by 3.55%, 1.39% and 4.71%, respectively.
Table 2: Comparison of results of different models
Detailed description of the drawings:
Fig. 1 is the system framework diagram of the electronic health record entity relation extraction model based on the shortest dependency subtree. The model consists of 7 layers: the original input layer, the shortest subtree layer, the feature extraction layer, the embedding layer, the BLSTM encoding layer, the sentence-level semantic representation layer and the output layer.
Fig. 2 is an example diagram of labeling word entity types with the BIO tagging scheme. The head and tail entities in the sentence are workup, and her ... tumor; B-Tes indicates that the entity type of [workup] is Test and that it is the beginning of an entity; B-Pro indicates that the entity type of [her] is medical problem (Problem) and that it is the beginning of the entity; and I-Pro indicates that the entity type of [tumor] is also Problem, but that it is an interior part of the entity.
Fig. 3 is the basic structure diagram of the LSTM neural unit, which mainly consists of 3 gates: the forget gate, the update gate and the output gate.
Fig. 4 is the structure diagram of the sentence-level semantic representation layer: convolution and max-pooling operations are applied to the output of the BLSTM encoding layer to obtain the final sentence-level semantic representation.
Fig. 5 is the sentence length distribution after shortest dependency subtree compression: the number of sentences longer than 24 words drops from 13093 to 5746, while the number of sentences of length at most 24 rises from 10954 to 18301, so that sentence length is effectively reduced.

Claims (7)

1. An electronic health record entity relation extraction method based on the shortest dependency subtree, the steps comprising:
Step 1: input the original electronic health record sentence;
Step 2: perform feature extraction on the text obtained in step 1;
Step 3: map the features extracted in step 2 into low-dimensional raw feature vectors;
Step 4: learn the semantic information of the raw feature vectors with a deep learning network to build high-level features;
Step 5: taking the high-level features from step 4 as input, classify the entity relation in the sentence with a classifier,
characterized in that: the entity-based shortest subtree is extracted from the original sentence by dependency parsing to compress sentence length; the sentence is then encoded by a bidirectional long short-term memory (BLSTM) neural network; the final semantic representation of the sentence is then learned through a max-pooling layer (Max Pooling); and the entity relation is finally obtained by classification with a softmax classifier.
2. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1, characterized in that in step 1, transition-based dependency parsing predicts a transition sequence from the current state features; the transition sequence uses dependency arcs to describe the dependency relation of each word in the sentence, and the target dependency parse tree is finally obtained from the dependency arcs; the entity-based shortest subtree is extracted from it to compress sentence length, removing noise words while fully retaining the key words that characterize the relation between the entities.
3. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1, characterized in that in step 2, feature extraction is performed on the compressed text obtained in step 1 to obtain 3 basic features, namely all words in the sentence, the entity type of each word, and the distance of each word relative to the entities.
4. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1, characterized in that in step 3, the basic features extracted in step 2 are mapped into low-dimensional raw feature vectors, specifically including word embedding, position embedding and word-type embedding.
5. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1, characterized in that in step 4, the BLSTM network first learns the semantic information of the raw feature vectors obtained in step 3 to build high-level features.
6. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1 or 5, characterized in that in step 4, after the BLSTM network learns the high-level features, a convolution operation is applied to them, and the sentence-level semantic representation is then learned by max pooling (Max Pooling).
7. The electronic health record entity relation extraction method based on the shortest dependency subtree according to claim 1, characterized in that in step 5, the sentence-level semantic representation output in step 4 is taken as input, and the entity relation in the sentence is classified with a softmax classifier to predict the final entity relation category.
CN201910318692.7A 2019-04-19 2019-04-19 Electronic health record entity relation extraction method based on shortest dependency subtree Pending CN110188193A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910318692.7A CN110188193A (en) 2019-04-19 2019-04-19 Electronic health record entity relation extraction method based on shortest dependency subtree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910318692.7A CN110188193A (en) 2019-04-19 2019-04-19 Electronic health record entity relation extraction method based on shortest dependency subtree

Publications (1)

Publication Number Publication Date
CN110188193A true CN110188193A (en) 2019-08-30

Family

ID=67714782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910318692.7A Pending CN110188193A (en) 2019-04-19 2019-04-19 Electronic health record entity relation extraction method based on shortest dependency subtree

Country Status (1)

Country Link
CN (1) CN110188193A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399433A (en) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 A kind of data entity Relation extraction method based on deep learning
CN110688486A (en) * 2019-09-26 2020-01-14 北京明略软件系统有限公司 Relation classification method and model
CN111027309A (en) * 2019-12-05 2020-04-17 电子科技大学广东电子信息工程研究院 Method for extracting entity attribute value based on bidirectional long-short term memory network
CN111126039A (en) * 2019-12-25 2020-05-08 贵州大学 Relation extraction-oriented sentence structure information acquisition method
CN111859938A (en) * 2020-07-22 2020-10-30 大连理工大学 Electronic medical record entity relation extraction method based on position vector noise reduction and rich semantics
CN111881687A (en) * 2020-08-03 2020-11-03 浪潮云信息技术股份公司 Relation extraction method and device based on context coding and multilayer perceptron
CN112667940A (en) * 2020-10-15 2021-04-16 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN113130025A (en) * 2020-01-16 2021-07-16 中南大学 Entity relationship extraction method, terminal equipment and computer readable storage medium
CN113360582A (en) * 2021-06-04 2021-09-07 中国人民解放军战略支援部队信息工程大学 Relation classification method and system based on BERT model fusion multi-element entity information
CN113609868A (en) * 2021-09-01 2021-11-05 首都医科大学宣武医院 Multi-task question-answer driven medical entity relationship extraction method
CN114154504A (en) * 2021-12-06 2022-03-08 重庆邮电大学 Chinese named entity recognition algorithm based on multi-information enhancement

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106446526A (en) * 2016-08-31 2017-02-22 北京千安哲信息技术有限公司 Electronic medical record entity relation extraction method and apparatus
CN106897568A (en) * 2017-02-28 2017-06-27 北京大数医达科技有限公司 The treating method and apparatus of case history structuring
CN107291687A (en) * 2017-04-27 2017-10-24 同济大学 It is a kind of based on interdependent semantic Chinese unsupervised open entity relation extraction method
CN107992476A (en) * 2017-11-28 2018-05-04 苏州大学 Towards the language material library generating method and system of Sentence-level biological contexts network abstraction
CN108536754A (en) * 2018-03-14 2018-09-14 四川大学 Electronic health record entity relation extraction method based on BLSTM and attention mechanism
CN109189862A (en) * 2018-07-12 2019-01-11 哈尔滨工程大学 Knowledge base construction method for science and technology information analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lü Yuanyuan, Deng Yongli et al.: "A classification method for short medical-record texts using entity and dependency-syntax structural features", Chinese Journal of Medical Instrumentation *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110399433A (en) * 2019-07-23 2019-11-01 福建奇点时空数字科技有限公司 Data entity relation extraction method based on deep learning
CN110688486A (en) * 2019-09-26 2020-01-14 北京明略软件系统有限公司 Relation classification method and model
CN111027309A (en) * 2019-12-05 2020-04-17 电子科技大学广东电子信息工程研究院 Method for extracting entity attribute values based on a bidirectional long short-term memory network
CN111126039A (en) * 2019-12-25 2020-05-08 贵州大学 Relation extraction-oriented sentence structure information acquisition method
CN111126039B (en) * 2019-12-25 2022-04-01 贵州大学 Relation extraction-oriented sentence structure information acquisition method
CN113130025A (en) * 2020-01-16 2021-07-16 中南大学 Entity relationship extraction method, terminal equipment and computer readable storage medium
CN113130025B (en) * 2020-01-16 2023-11-24 中南大学 Entity relation extraction method, terminal equipment and computer readable storage medium
CN111859938A (en) * 2020-07-22 2020-10-30 大连理工大学 Electronic medical record entity relation extraction method based on position vector noise reduction and rich semantics
CN111881687A (en) * 2020-08-03 2020-11-03 浪潮云信息技术股份公司 Relation extraction method and device based on context coding and multilayer perceptron
CN111881687B (en) * 2020-08-03 2024-02-20 浪潮云信息技术股份公司 Relation extraction method and device based on context coding and multi-layer perceptron
CN112667940B (en) * 2020-10-15 2022-02-18 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN112667940A (en) * 2020-10-15 2021-04-16 广东电子工业研究院有限公司 Webpage text extraction method based on deep learning
CN113360582A (en) * 2021-06-04 2021-09-07 中国人民解放军战略支援部队信息工程大学 Relation classification method and system based on a BERT model fused with multi-entity information
CN113360582B (en) * 2021-06-04 2023-04-25 中国人民解放军战略支援部队信息工程大学 Relation classification method and system based on a BERT model fused with multi-entity information
CN113609868A (en) * 2021-09-01 2021-11-05 首都医科大学宣武医院 Multi-task question-answer driven medical entity relationship extraction method
CN114154504A (en) * 2021-12-06 2022-03-08 重庆邮电大学 Chinese named entity recognition algorithm based on multi-information enhancement
CN114154504B (en) * 2021-12-06 2024-08-13 宜昌金辉大数据产业发展有限公司 Chinese named entity recognition algorithm based on multi-information enhancement

Similar Documents

Publication Publication Date Title
CN110188193A (en) Electronic health record entity relation extraction method based on the shortest dependency subtree
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
WO2022227207A1 (en) Text classification method, apparatus, computer device, and storage medium
CN110413986B (en) Text clustering multi-document automatic summarization method and system for improving word vector model
CN111177394B (en) Knowledge map relation data classification method based on syntactic attention neural network
CN111241294B (en) Relationship extraction method of graph convolution network based on dependency analysis and keywords
CN110210037B (en) Syndrome-oriented medical field category detection method
CN111738003B (en) Named entity recognition model training method, named entity recognition method and medium
CN110364234B (en) Intelligent storage, analysis and retrieval system and method for electronic medical records
CN109871538A (en) Chinese electronic health record named entity recognition method
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN106844741A (en) Question answering method for a specific domain
CN111274790B (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN107180026B (en) Event phrase learning method and device based on word embedding semantic mapping
CN111914556B (en) Emotion guiding method and system based on emotion semantic transfer pattern
CN108268449A (en) Text semantic label extraction method based on term clustering
CN110879831A (en) Chinese medicine sentence word segmentation method based on entity recognition technology
CN113204674B (en) Video-paragraph retrieval method and system based on local-overall graph inference network
CN109189862A (en) Knowledge base construction method for science and technology information analysis
CN111126040A (en) Biomedical named entity identification method based on depth boundary combination
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN117854715B (en) Intelligent diagnosis assisting system based on inquiry analysis
CN108846033A (en) Method and apparatus for discovering domain-specific vocabulary and training classifiers
CN117291265A (en) Knowledge graph construction method based on text big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 2019-08-30