CN109697285A

CN109697285A - Hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation

Info

Publication number: CN109697285A
Application number: CN201811523661.7A
Authority: CN
Inventors: 王建新; 余颖; 李敏
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2018-12-13
Filing date: 2018-12-13
Publication date: 2019-04-30
Anticipated expiration: 2038-12-13
Also published as: CN109697285B

Abstract

The invention discloses a hierarchical BiLSTM method for disease code annotation of Chinese electronic health records (EHRs) with enhanced semantic representation. After the input EHR text is preprocessed, the method takes into account that the individual characters making up a Chinese word carry specific semantics: a BiLSTM with an attention mechanism extracts a character-level feature vector for each word, capturing the semantics and word-formation characteristics of its individual characters. The character-level word vector is concatenated with the word-level vector obtained by word2vec training, giving a character-enhanced word vector. The text sequence represented by these feature-word vectors is then fed to a second BiLSTM, which learns the contextual features of the entire record, and an attention mechanism computes the contribution of each feature word to produce a context-weighted text vector, improving prediction performance. The method is suited to disease labeling tasks based on Chinese EHR text and effectively improves classification performance.

Description

Hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation

Technical field

The present invention relates to the field of medical informatics, and in particular to a hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation.

Background art

Electronic health records (EHRs) have become one of the important data resources for clinical medical research. They store the information generated during a patient's medical visits in digital form, which makes it convenient to analyze and process clinical data by computer. Each EHR needs a unified labeling standard that describes the patient's disease condition, so that patient information can be classified sensibly in support of clinical decisions. The International Classification of Diseases (ICD), published and continuously updated by the World Health Organization, is the internationally used disease coding scheme. ICD codes are commonly used as labels on clinical records to identify symptoms, signs, diseases, abnormalities, operations, and so on. Currently, the tenth revision, ICD-10, is widely used in hospital information systems in China.

Annotating EHRs with ICD codes is an important and basic task in making use of EHRs. Missing diagnosis names and ICD codes in an EHR hinder the analysis and study of clinical data. ICD coding is usually performed manually by the staff of a hospital's medical records department, who judge the codes from the clinical diagnosis descriptions given by doctors. Manual coding not only requires coders to master certain medical knowledge, coding rules, and medical terminology, but is also time-consuming and laborious. Automatic coding by computer can therefore provide effective assistance for the coding work and improve the efficiency of ICD annotation.

Most automatic disease coding work to date is based on clinical text data such as radiology reports, death certificates, and discharge summaries. However, most of this research concentrates on English corpora. Work on disease code prediction for Chinese clinical text is scarce, and the main approach is string-level semantic comparison based on the diagnosis name. Semantic similarity comparison places high demands on the quality of the diagnosis name description, and automatic coding is impossible when the diagnosis name is missing. At present there is no related research that applies neural network models to the disease code annotation task for Chinese EHRs.

Processing Chinese EHR text has two distinguishing features. First, EHR texts are long, and contextual information is harder to capture in long text. Second, unlike English, an individual Chinese character also carries meaning; in medical terminology in particular, concepts such as orientation and body part are each described by a single character. A semantic representation that includes character features can therefore express the meaning of a word better.

Summary of the invention

The technical problem to be solved by the present invention is, in view of the shortcomings of the prior art, to provide a hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation, which completes automatic annotation in an end-to-end manner and improves prediction performance.

To solve the above technical problem, the technical solution adopted by the invention is as follows:

A hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation, comprising the following steps:

1) using a Chinese word segmentation tool with a user-defined clinical medicine dictionary to segment the text, removing stop words, and filtering out feature words according to word frequency;

2) representing each feature word as a vector at both the character level and the word level, and concatenating the character-level vector and the word-level vector to construct the character-enhanced feature vector of the word;

3) obtaining the contextual features of the entire text from the concatenated feature words and, using an attention mechanism, computing the contribution of each feature word to obtain the context-weighted feature vector of the entire text.

In step 1), the feature words are selected according to a rule based on word frequency, in which $S_{fw}$ denotes the set of feature words, the frequency of each word $w_i$ is counted, and $N_d$ denotes the total number of EHR samples.

In step 2), a bidirectional LSTM with an attention mechanism is trained to obtain the character-level feature vector of each feature word, and word2vec, a word vector method based on distributed word representations, is used to obtain the word-level vector of each feature word.

The output of the bidirectional long short-term memory network is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, where $\overrightarrow{h_t}$ denotes the hidden-layer output of the forward LSTM at the $t$-th unit (time step $t$) and $\overleftarrow{h_t}$ denotes the hidden-layer output of the backward LSTM at the $t$-th unit.

The attention mechanism is computed as follows:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)};$$

Here $h_{ij}$ is the hidden-layer output for the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the $j$-th character within the $i$-th word, computed with the softmax function. The context-weighted feature vector of the $i$-th word is the weighted sum $\sum_{j} \alpha_{ij} h_{ij}$.

In step 3), the context-weighted feature vector of the entire text is computed as follows: the text, represented by the concatenated feature-word vectors, is fed into a second-layer bidirectional long short-term memory network, which learns the contextual features of the entire text; an attention mechanism then computes the weight of each feature word, giving a text feature vector weighted by contextual information.

The attention mechanism is computed as follows:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)};$$

$$v = \sum_{i} \alpha_i h_i;$$

Here $h_i$ is the hidden-layer output obtained after BiLSTM training for the character-enhanced feature vector of the $i$-th word in the text sequence, $W$ is a weight matrix, and $b_w$ is a bias vector. When the attention mechanism is applied, a randomly initialized word-level document context feature vector $u_w$ is introduced to compute the weights. $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector of the entire text. This vector is fed into a fully connected layer, and the probability of occurrence of each disease code is computed by the sigmoid function.

Compared with the prior art, the invention has the following beneficial effects. Addressing the characteristics of the Chinese language, the invention incorporates the semantic features of individual Chinese characters into the feature vector of each word and, combined with an attention mechanism, weights the feature words that actually contribute in the input sequence, improving the prediction of disease codes. The method is suited to Chinese clinical text data, uses a neural network model to extract text features automatically, and completes automatic annotation in an end-to-end manner.

Brief description of the drawings

Fig. 1 is the flow chart of the invention;

Fig. 2 shows the hierarchical BiLSTM feature learning model with the attention mechanism;

Fig. 3 shows the computation of the attention mechanism: (a) $h_{ij}$ is transformed into $u_{ij}$; (b) the weight of each $u_{ij}$ is computed using the context feature vector; (c) the weighted sum of the $h_{ij}$ gives the feature vector after the attention mechanism is applied;

Fig. 4 shows the experimental results of an implementation of the invention.

Detailed description of the embodiments

I. Preprocessing of clinical text data

Using the Chinese word segmentation tool Jieba ("结巴") together with a user-defined medical dictionary, the input discharge summary text is segmented, stop words are removed, and the frequency of each remaining word is counted. After the words are sorted by frequency in descending order, feature words are selected according to a rule based on word frequency, in which $S_{fw}$ denotes the set of feature words, the frequency of each word $w_i$ is counted, and $N_d$ denotes the total number of EHRs.
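The preprocessing step can be sketched in a few lines. This is a minimal illustration, not the patent's exact rule: the selection formula is not reproduced in this text, so the sketch assumes a hypothetical threshold in which a word is kept when its total occurrence count reaches `min_ratio` times the number of records. The function name, the `min_ratio` parameter, and the toy records are assumptions, and the input is taken to be already segmented (for example by Jieba with a custom dictionary).

```python
from collections import Counter


def select_feature_words(tokenized_docs, stop_words, min_ratio=0.01):
    """Pick feature words by frequency.

    ASSUMPTION: a word is kept when its total count is at least
    min_ratio * N_d, where N_d is the number of records. The patent's
    actual threshold rule is not available in this text.
    """
    n_docs = len(tokenized_docs)
    # Count every non-stop-word occurrence across all records.
    freq = Counter(
        word
        for doc in tokenized_docs
        for word in doc
        if word not in stop_words
    )
    # Sort by descending frequency (ties broken by the word itself).
    ranked = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, count in ranked if count >= min_ratio * n_docs]


# Toy, already-segmented records (illustrative only).
docs = [
    ["患者", "的", "左", "肺", "感染"],
    ["患者", "高血压", "的", "病史"],
    ["左", "肺", "结节"],
]
features = select_feature_words(docs, stop_words={"的"}, min_ratio=0.5)
print(features)
```

With the toy data, only words that occur at least 1.5 times (0.5 × 3 records) survive, so the stop word and the words seen once are dropped.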

II. Word vector representation of feature words

1) Character-based word vector representation

First, a vector representation is initialized for each character and fed into the BiLSTM with the attention mechanism; training yields the character-level word vector of each feature word. The cell state $c_t$ and output $h_t$ of each neural unit in the BiLSTM are computed as follows ($t = 1, 2, \ldots, n$, where $t$ indexes the $t$-th neural unit in the network, that is, the unit at time step $t$):

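$$i_t = \mathrm{sigmoid}(W_i [x_t; h_{t-1}] + b_i) \tag{1}$$

$$f_t = \mathrm{sigmoid}(W_f [x_t; h_{t-1}] + b_f) \tag{2}$$

$$g_t = \tanh(W_g [x_t; h_{t-1}] + b_g) \tag{3}$$

$$o_t = \mathrm{sigmoid}(W_o [x_t; h_{t-1}] + b_o) \tag{4}$$

$$c_t = f_t * c_{t-1} + i_t * g_t \tag{5}$$

$$h_t = o_t * \tanh(c_t) \tag{6}$$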

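Each neural unit contains an input gate $i$, an output gate $o$, a forget gate $f$, a memory unit $g$, a cell state $c$, and a hidden state $h$, all of which are vectors. $W_i, W_f, W_g, W_o$ are weight matrices and $b_i, b_f, b_g, b_o$ are bias vectors; ";" denotes concatenation and "*" denotes the element-wise product. The sigmoid function is computed as $\mathrm{sigmoid}(x) = 1/(1 + e^{-x})$ and the tanh function as $\tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x})$. The output of the BiLSTM is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, the concatenation of the forward and backward hidden-layer outputs at the $t$-th unit.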
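Equations (1) to (6) can be traced step by step with a small pure-Python sketch. It is illustrative only: the helper names (`affine`, `lstm_step`, `run_lstm`, `bilstm`, `init_params`), the dimensions, and the random initialization are assumptions, and the BiLSTM output is taken to be the concatenation of the forward and backward hidden states. A real implementation would use a deep learning framework and learn the weights by training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Compute W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM unit, following equations (1)-(6)."""
    z = x_t + h_prev  # list concatenation, i.e. [x_t; h_{t-1}]
    i = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]    # (1) input gate
    f = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]    # (2) forget gate
    g = [math.tanh(a) for a in affine(p["W_g"], z, p["b_g"])]  # (3) memory unit
    o = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]    # (4) output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]  # (5)
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                    # (6)
    return h, c


def run_lstm(xs, p, hidden):
    """Run one direction over the sequence and collect hidden states."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, p)
        outputs.append(h)
    return outputs


def bilstm(xs, fwd, bwd, hidden):
    """ASSUMPTION: output is the concatenation of both directions."""
    forward = run_lstm(xs, fwd, hidden)
    backward = run_lstm(xs[::-1], bwd, hidden)[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def init_params(input_dim, hidden, seed):
    """Random, untrained weights (illustrative only)."""
    rng = random.Random(seed)
    p = {}
    for gate in "ifgo":
        p["W_" + gate] = [
            [rng.uniform(-0.5, 0.5) for _ in range(input_dim + hidden)]
            for _ in range(hidden)
        ]
        p["b_" + gate] = [0.0] * hidden
    return p


# Three character vectors of dimension 3, hidden size 2 per direction.
xs = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.4], [0.5, 0.5, -0.2]]
fwd_params = init_params(3, 2, seed=0)
bwd_params = init_params(3, 2, seed=1)
out = bilstm(xs, fwd_params, bwd_params, hidden=2)
print(len(out), len(out[0]))
```

Each output vector has twice the hidden size because the two directions are concatenated, and every component lies strictly between -1 and 1 since it is a product of a sigmoid gate and a tanh value.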

2) Application of the attention mechanism

The attention mechanism is computed as follows:

$$u_{ij} = \tanh(W_c h_{ij} + b_c) \tag{7}$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)} \tag{8}$$

Here $h_{ij}$ is the hidden-layer output for the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, and $\alpha_{ij}$ is the weight of the $j$-th character within the $i$-th word, computed with the softmax function. The context-weighted feature vector of the $i$-th word is the weighted sum $\sum_{j} \alpha_{ij} h_{ij}$ (9).
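The character-level attention can be sketched as a small pooling function: a tanh projection of each hidden state, a softmax over the dot products with the context vector, and a weighted sum of the hidden states. The function name `attention_pool` and the toy values are assumptions, and the softmax-over-dot-products form is the usual reading of "weights computed with softmax using the context vector" rather than a formula quoted from the patent.

```python
import math


def attention_pool(hs, W, b, u_ctx):
    """Attention pooling over a list of hidden states.

    hs    : hidden states h_ij, one list of floats per character
    W, b  : projection weights and bias (W as a list of rows)
    u_ctx : context feature vector (u_c), same size as each projected u_ij
    Returns (alphas, pooled), where pooled = sum_j alpha_j * h_j.
    """
    # u_ij = tanh(W h_ij + b)
    us = [
        [math.tanh(sum(w * x for w, x in zip(row, h)) + bi)
         for row, bi in zip(W, b)]
        for h in hs
    ]
    # Softmax over the scores u_ij . u_ctx (shifted by the max for stability).
    scores = [sum(a * c for a, c in zip(u, u_ctx)) for u in us]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of the original hidden states.
    dim = len(hs[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hs)) for d in range(dim)]
    return alphas, pooled


# Toy example: three characters with 2-dimensional hidden states.
hs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
u_ctx = [1.0, 0.0]
alphas, pooled = attention_pool(hs, W, b, u_ctx)
print(alphas, pooled)
```

In this toy case the context vector only looks at the first dimension, so the first and third characters receive equal weight and the second receives less. The same function applies unchanged at the word level, with word hidden states and a word-level context vector.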

3) The character-level word vector obtained by training is concatenated with the word vector generated by word2vec, giving a word feature vector enhanced with character-level contextual features.

III. Contextual feature extraction

The sequence of character-enhanced feature vectors is fed into the second-layer BiLSTM with the attention mechanism to extract the contextual features of the text. The computation of the BiLSTM neural units and of the context weighting is the same as for the character-level word vector representation. The specific formulas are as follows:

$$u_i = \tanh(W h_i + b_w) \tag{10}$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)} \tag{11}$$

$$v = \sum_{i} \alpha_i h_i \tag{12}$$
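The summary of the invention states that the text vector $v$ is fed into a fully connected layer and that a sigmoid function gives the probability of each disease code. A minimal sketch of that output layer follows. The function name `predict_codes`, the weights, the three example labels, and the 0.5 decision threshold are all assumptions made for illustration; the patent text does not state how probabilities are turned into final labels.

```python
import math


def predict_codes(v, W_out, b_out, labels, threshold=0.5):
    """Fully connected layer followed by an independent sigmoid per label.

    ASSUMPTION: a code is predicted when its probability reaches
    `threshold`; the source does not specify the decision rule.
    """
    probs = {}
    for label, row, bias in zip(labels, W_out, b_out):
        logit = sum(w * x for w, x in zip(row, v)) + bias
        probs[label] = 1.0 / (1.0 + math.exp(-logit))
    predicted = [code for code, p in probs.items() if p >= threshold]
    return probs, predicted


# Toy text vector and hand-picked weights (illustrative only).
v = [0.2, -0.4, 0.9]
labels = ["I10", "E11.9", "J18.9"]
W_out = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0], [-1.0, 0.0, -1.0]]
b_out = [0.0, 0.0, 0.0]
probs, predicted = predict_codes(v, W_out, b_out, labels)
print(probs, predicted)
```

Because each label has its own sigmoid, the probabilities do not need to sum to one, which matches the multi-label setting in which one record carries several disease codes.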

IV. Experimental verification

1) Experimental procedure

To verify the effectiveness of the method, we validated it on a real clinical dataset of Chinese EHRs. The dataset contains 7732 discharge records covering 1177 ICD-10 disease code labels in total. An ICD-10 code consists of letters and digits, is up to six characters long, and begins with a letter; its first three characters form the category-level code, which indicates the disease category. The average length of a discharge summary is 610 words, and each discharge summary corresponds to 3.6 disease codes on average.
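Mapping full ICD-10 codes to their three-character category-level codes is a simple string operation, sketched below. The function names and the example codes are assumptions for illustration; the dataset's actual code formatting is not described in the text.

```python
def category_code(icd10_code):
    """Return the three-character category-level code of an ICD-10 code."""
    code = icd10_code.strip().upper()
    if len(code) < 3 or not code[0].isalpha():
        raise ValueError(f"not a valid ICD-10 code: {icd10_code!r}")
    return code[:3]


def to_category_labels(codes):
    """Collapse a record's full codes into its set of category-level labels."""
    return sorted({category_code(c) for c in codes})


print(category_code("J18.901"))
print(to_category_labels(["I10.x00", "I10.x01", "E11.9"]))
```

Collapsing labels this way reduces the size of the label space, since several full codes share one category.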

The experiments were run on a server with 256 GB of memory and an NVIDIA GeForce Titan X Pascal CUDA GPU. We split the dataset into a training set and a test set at a ratio of 9:1 and validated over ten random shuffles of the data. The evaluation metrics are micro-averaged precision (P), recall (R), and their combined F1 score, together with the Hamming loss, which evaluates misclassification from the per-sample perspective. The higher the F1 score and the lower the Hamming loss, the better the model performs.
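The four metrics can be computed directly from the true and predicted label sets. The sketch below uses the standard definitions of micro-averaged precision, recall, and F1 and of the Hamming loss; the function name `micro_metrics` and the toy labels are assumptions, and the patent does not describe its own evaluation code.

```python
def micro_metrics(y_true, y_pred, n_labels):
    """Micro-averaged P, R, F1 and Hamming loss for multi-label output.

    y_true, y_pred : lists of label sets, one set per record
    n_labels       : size of the full label space
    """
    tp = fp = fn = 0
    for true, pred in zip(y_true, y_pred):
        tp += len(true & pred)   # correctly predicted codes
        fp += len(pred - true)   # predicted but not present
        fn += len(true - pred)   # present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # Fraction of (record, label) pairs that are labeled wrongly.
    hamming = (fp + fn) / (len(y_true) * n_labels)
    return precision, recall, f1, hamming


# Toy example with two records and a label space of four codes.
y_true = [{"I10", "E11"}, {"J18"}]
y_pred = [{"I10"}, {"J18", "I10"}]
p, r, f1, hl = micro_metrics(y_true, y_pred, n_labels=4)
print(p, r, f1, hl)
```

With a large label space and only a few codes per record, almost all (record, label) pairs are true negatives, which is why Hamming loss values for such data are very small.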

2) Experimental results

Because related work has already shown that deep learning methods outperform traditional machine learning methods, we mainly ran comparative experiments against other commonly used neural network models. The results are shown in Table 1, where MA-BiLSTM denotes our model and D2V+CNN is a method from related work that currently achieves good results on the public English dataset MIMIC-III. The results show that MA-BiLSTM outperforms the other neural network models on every evaluation metric, indicating that a BiLSTM combined with an attention mechanism can effectively capture the contextual features of long text and improve prediction performance.

Table 1. Comparative experimental results

Model	Micro-P (95% CI)	Micro-R (95% CI)	Micro-F1 (95% CI)	Hamming loss (95% CI)
CBOW	0.614(±6.43e-03)	0.522(±5.30e-03)	0.564(±4.52e-03)	0.00248(±3.14e-05)
CNN	0.647(±6.67e-03)	0.509(±6.51e-03)	0.569(±4.71e-03)	0.00237(±3.52e-05)
D2V+CNN	0.661(±9.57e-03)	0.514(±8.74e-03)	0.579(±7.14e-03)	0.00231(±3.70e-05)
MA-BiLSTM	0.704(±1.13e-02)	0.586(±5.84e-03)	0.639(±4.45e-03)	0.00204(±3.47e-05)

To analyze the contribution of each module of the model, we designed ablation experiments; the results are shown in Table 2. When only word vectors or only character vectors are used to represent the words in the text, prediction performance declines in both cases, so the character-enhanced word vector representation does yield a better text feature representation. The attention mechanism plays an important role in the model: removing it causes a clear drop in model performance.

Predictions were made on both the full ICD-10 codes and the category-level codes; for the 7732 samples there are 488 corresponding category-level codes. The results are shown in Fig. 4. Prediction on the category-level codes reaches 80.5% accuracy, which can well assist the disease code annotation work of medical records department staff.

Table 2. Model ablation experiment results

Claims

1. A hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation, characterized by comprising the following steps:

1) using a Chinese word segmentation tool with a user-defined clinical medicine dictionary to segment the text, removing stop words, and filtering out feature words according to word frequency;

2) representing each feature word as a vector at both the character level and the word level, and concatenating the character-level vector and the word-level vector to construct the character-enhanced feature vector of the word;

3) obtaining the word vector sequence of the entire text from the concatenated feature words and, using an attention mechanism, computing the contribution of each feature word to obtain the context-weighted feature vector of the entire text.

2. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 1, characterized in that, in step 1), the feature words are selected according to a rule based on word frequency, in which $S_{fw}$ denotes the set of feature words, the frequency of each word $w_i$ is counted, and $N_d$ denotes the total number of EHR samples.

3. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 1, characterized in that, in step 2), a BiLSTM with an attention mechanism is trained to obtain the character-level feature vector of each feature word, and word2vec, a word vector method based on distributed word representations, is used to obtain the word-level vector of each feature word.

4. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 3, characterized in that the output of the BiLSTM is $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, where $\overrightarrow{h_t}$ denotes the hidden-layer output of the forward LSTM at the $t$-th unit (time step $t$) and $\overleftarrow{h_t}$ denotes the hidden-layer output of the backward LSTM at the $t$-th unit.

5. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 3, characterized in that the attention mechanism is computed as follows:

$$u_{ij} = \tanh(W_c h_{ij} + b_c);$$

$$\alpha_{ij} = \frac{\exp(u_{ij}^{\top} u_c)}{\sum_{k} \exp(u_{ik}^{\top} u_c)};$$

where $h_{ij}$ is the hidden-layer output for the $j$-th character of the $i$-th word after BiLSTM training, $W_c$ is a weight matrix, $b_c$ is a bias vector, $u_c$ is a randomly initialized character-level context feature vector, $\alpha_{ij}$ is the weight of the $j$-th character within the $i$-th word, computed with the softmax function, and the context-weighted feature vector of the $i$-th word is the weighted sum $\sum_{j} \alpha_{ij} h_{ij}$.

6. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 1, characterized in that, in step 3), the context-weighted feature vector of the entire text is computed as follows: the text, represented by the concatenated feature-word vectors, is fed into a second-layer bidirectional long short-term memory network, which learns the contextual features of the entire text; an attention mechanism then computes the weight of each feature word, giving a text feature vector weighted by contextual information.

7. The hierarchical BiLSTM method for disease code annotation of Chinese electronic health records with enhanced semantic representation according to claim 6, characterized in that the attention mechanism is computed as follows:

$$u_i = \tanh(W h_i + b_w);$$

$$\alpha_i = \frac{\exp(u_i^{\top} u_w)}{\sum_{k} \exp(u_k^{\top} u_w)};$$

$$v = \sum_{i} \alpha_i h_i;$$

where $h_i$ is the hidden-layer output obtained after BiLSTM training for the character-enhanced feature vector of the $i$-th word in the text sequence, $W$ is a weight matrix, and $b_w$ is a bias vector; when the attention mechanism is applied, a randomly initialized word-level document context feature vector $u_w$ is introduced to compute the weights; $\alpha_i$ is the weight of each word, and $v$ is the context-weighted feature vector of the entire text; this vector is fed into a fully connected layer, and the probability of occurrence of each disease code is computed by the sigmoid function.