CN111767723A - Chinese electronic medical record entity labeling method based on BIC - Google Patents

Chinese electronic medical record entity labeling method based on BIC

Info

Publication number
CN111767723A
Authority
CN
China
Prior art keywords
layer
data
model
training
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010405161.4A
Other languages
Chinese (zh)
Inventor
滕国伟
王逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010405161.4A priority Critical patent/CN111767723A/en
Publication of CN111767723A publication Critical patent/CN111767723A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates (G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities)
    • G06F 40/295 Named entity recognition (G06F 40/279 Recognition of textual entities; G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods (G06N 3/02 Neural networks)
    • G16H 50/70 ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention discloses a method for labeling entities in Chinese electronic medical records based on BIC, belonging to the technical field of natural language processing and addressing the problem of identifying and labeling entities in Chinese electronic medical records. The method comprises the following steps: first, a medical entity annotation specification is drawn up according to actual requirements; a small amount of data is then annotated manually, and the manually annotated data is processed into the data format required by the model to form training data. Model parameters are trained to generate a sequence labeling model, where the model comprises a bidirectional long short-term memory network, an iterated dilated convolutional neural network, and a conditional random field, the last serving as the decoding end of the model. Data to be labeled is input into the sequence labeling model, and its output yields machine-labeled data. Finally, labeling errors are checked and corrected manually, the data is processed to obtain the training data required by the model, and the model is trained again. The method achieves automatic annotation of Chinese electronic medical record data with high accuracy.

Description

Chinese electronic medical record entity labeling method based on BIC
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese electronic medical record entity labeling method based on BIC.
Background
In the biomedical field, large amounts of data, such as electronic medical records, are generated daily. An electronic medical record is the digital information, such as characters, symbols, charts, graphs, data, and images, generated by medical staff using a medical institution's information system in the course of medical activities; it supports the storage, management, transmission, and reproduction of medical records and contains a large number of entities of many types. At present, most research on electronic medical record information extraction targets English records; research on Chinese electronic medical records started late, has not yet formed a clear and systematic research agenda, and lacks a public annotated corpus. This shortage of training corpora greatly restricts research on Chinese electronic medical record information extraction. With the wide adoption of Chinese electronic medical record systems, the number of records is growing rapidly, but effectively extracting the useful information in these massive records remains a research hotspot and challenge, and constructing an annotated corpus of Chinese electronic medical records is the foundation of that research. Publicly available annotated corpora in the biomedical field are very limited, and manual annotation consumes substantial manpower and material resources, so reducing the manual annotation workload while preserving entity recognition accuracy in the biomedical field has long been a research difficulty.
Disclosure of Invention
Aiming at the problem of constructing an annotated corpus of Chinese electronic medical records, the invention provides a Chinese electronic medical record entity labeling method based on BIC, where the BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); the BiLSTM and IDCNN serve as the encoding end of the model, and the CRF serves as the decoding end. The method is semi-supervised: it automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation, and after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a new method for entity tagging of Chinese electronic medical record based on BIC comprises the following specific operation steps:
1) firstly, providing a corresponding medical entity marking standard according to actual requirements, then manually marking a small amount of data, carrying out data processing on the manually marked data, and processing the data into a data format required by a model to form training data;
2) training model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long-time memory network (BilSTM), an iterative hole convolutional neural network (IDCNN) and a Conditional Random Field (CRF), the BilSTM and the IDCNN are used as encoding ends of the model, and the CRF is used as a decoding end of the model;
3) inputting data to be labeled into a sequence labeling model, and outputting a result to obtain data labeled by a machine;
4) and then, manually checking and correcting part marking errors, performing data processing operation to obtain training data required by the model, and performing model training again.
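The four steps above can be sketched as a minimal loop. Every function name and the toy BIO tag scheme below are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch of the four-step semi-supervised annotation loop.
# Function names and the toy character-lookup "model" are hypothetical.

def preprocess(labeled):
    """Step 1: turn manually annotated sentences into (chars, tags) pairs."""
    return [(list(sent), tags) for sent, tags in labeled]

def train(pairs):
    """Step 2: stand-in for training the BIC model; here a char-to-tag lookup."""
    return {c: t for chars, tags in pairs for c, t in zip(chars, tags)}

def annotate(model, text):
    """Step 3: machine-label unannotated text with the trained model."""
    return [model.get(c, "O") for c in text]

def correct(machine_tags, fixes):
    """Step 4: a human reviewer overrides wrong machine labels by position."""
    return [fixes.get(i, t) for i, t in enumerate(machine_tags)]

seed = [("头痛", ["B-SYM", "I-SYM"])]          # small manually labeled seed set
model = train(preprocess(seed))
tags = annotate(model, "头痛发热")              # machine labels the new text
tags = correct(tags, {2: "B-SYM", 3: "I-SYM"})  # reviewer fixes 发热
```

The corrected output would then be fed back as additional training data, closing the loop described in step 4).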
Preferably, in step 2), the sequence labeling model is generated as follows:
a. the model input is Chinese text, divided into training batches by text length, with 20 sentences per batch; each batch is converted into a tensor by an embedding layer, and sentences within a batch are padded with spaces to a uniform length;
b. the tensor produced by the embedding layer is processed by the encoding end, which combines the BiLSTM and the IDCNN; the number of neurons in the BiLSTM hidden layer is set, and the BiLSTM layer outputs the corresponding tensor;
c. the output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, with the convolution kernel size of each set; each dilated convolution has three layers with a set dilation per layer, and the outputs of the four iterated dilated convolutions are concatenated to form the output tensor of the encoding end;
d. the decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network with a set weight matrix, yielding a tensor of logistic regression scores, and thus the sequence labeling model is generated.
Preferably, in step c, within the three dilated convolution layers the output of each layer is the input of the next, and parameters are shared between corresponding dilated convolution layers of the four iterated dilated convolutions.
Preferably, in step 2), the sequence labeling model is generated as follows:
firstly, Chinese electronic medical record text is input and divided into training batches by sentence length, with 20 sentences per batch padded with spaces to a uniform length; the batch is then converted into a tensor by the embedding layer. For sentences of length 21 (punctuation included), the embedding layer outputs a tensor of dimensions [20, 21, 120];
secondly, the [20, 21, 120] tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor;
thirdly, the output of the BiLSTM layer is input to the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, each with a [1, 3, 200, 100] convolution kernel and three dilated layers with dilations [1, 1, 2], producing a [20, 21, 100] tensor per branch; the four branch outputs are concatenated to form the encoder output tensor of dimensions [20, 21, 400];
fourthly, the decoding end computes the tag for the input data with the CRF; the encoder output first passes through a neural network whose weight is a [400, 33] tensor, yielding a [20, 21, 33] tensor of logistic regression scores.
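The tensor dimensions in the four steps above can be checked with a small shape walk-through. The random arrays below only mimic the layer shapes, not trained BiLSTM/IDCNN computations:

```python
import numpy as np

# Shape walk-through of the tensors described above; random data stands
# in for the real embedding, BiLSTM, and IDCNN layers.
batch, seq, emb = 20, 21, 120          # 20 sentences, length 21, 120-dim embedding
x = np.random.randn(batch, seq, emb)   # embedding output: [20, 21, 120]

hidden = 100
fwd = np.random.randn(batch, seq, hidden)    # forward LSTM states
bwd = np.random.randn(batch, seq, hidden)    # backward LSTM states
bilstm_out = np.concatenate([fwd, bwd], -1)  # BiLSTM output: [20, 21, 200]

# four IDCNN branches, each ending in 100 filters, concatenated
branches = [np.random.randn(batch, seq, 100) for _ in range(4)]
encoder_out = np.concatenate(branches, -1)   # encoder output: [20, 21, 400]

# projection to 33 tag scores before CRF decoding
w = np.random.randn(400, 33)
logits = encoder_out @ w                     # [20, 21, 33]
```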
Further preferably, the sequence labeling model is built as follows:
the model input is Chinese text, divided into training batches by text length with 20 sentences per batch; each batch is converted into a tensor by the embedding layer, and sentences in a batch are padded with spaces to a uniform length. The embedding layer combines character embedding and word embedding: because word segmentation boundaries may be unreliable, labeling uses character-based input without discarding word features. Taking sentences of length 21 (punctuation included) as an example, character embedding converts the 21 characters of each sentence in a batch into a [20, 21, 100] tensor using word2vec, and word embedding converts the same 21 characters, after word segmentation, into a [20, 21, 20] tensor. The final embedding layer output combines characters and words into a [20, 21, 120] tensor;
the encoding end processes the embedding tensor and combines the BiLSTM and the IDCNN; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor. The IDCNN layer combines four iterated dilated convolutional networks, each with convolution kernel size [1, 3, 200, 100] and three dilated layers with dilations [1, 1, 2]; each branch outputs a [20, 21, 100] tensor, and the four branch outputs are concatenated into the encoder output tensor of dimensions [20, 21, 400]. The output of each layer is the input of the next, and parameters are shared between corresponding dilated convolution layers of the four branches;
the decoding end computes the tag for the input data with a conditional random field: the encoder output passes through a neural network with a [400, 33] weight matrix, yielding a [20, 21, 33] tensor of logistic regression scores.
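The character-plus-word embedding described above (100-dim word2vec character vectors concatenated with 20-dim segmentation-based word features per character) can be sketched as follows; the vectors here are random stand-ins for the learned embeddings:

```python
import numpy as np

# Character embedding plus word embedding, concatenated per character:
# 100 dims (char) + 20 dims (word feature) = 120 dims, matching [20, 21, 120].
batch, seq = 20, 21
char_emb = np.random.randn(batch, seq, 100)  # word2vec character vectors
word_emb = np.random.randn(batch, seq, 20)   # segmentation-based word features
embedded = np.concatenate([char_emb, word_emb], axis=-1)
```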
Preferably, in step 2), the encoding layer extracts global text features with the BiLSTM layer and local detail features with the IDCNN layer, and the decoding layer computes label probabilities with the CRF to obtain the optimal labels, as follows:
the bidirectional long and short memory network (BilSTM) layer comprises a forward LSTM layer and a backward LSTM layer, the input of sentences is input into the LSTM network in a one-hot coding mode through a word embedding layer, and a sentence with the length of n can be expressed as W ═ W [ (W [) ]1,...,wt-1,wt,wt+1,...,wn},wtdThe tth word of the sentence is a d-dimensional vector; the LSTM includes a series of circularly connected sub-networks, called memory blocks, and the information storage and control are realized by three gate structures, which are: input door itForgetting door ftAnd an output gate otThe expression is as follows:
it=(Wwiwt+Whiht-1+Wcict-1+bi) (1)
ft=(Wwfwt+Whfht-1+Wcfct-1+bf) (2)
zt=tanh(Wwcwt+Whcht-1+bc) (3)
Figure BDA0002490984920000041
ot=(Wwowt+Whoht-1+Wcoct+bo) (5)
Figure BDA0002490984920000042
w in the formulae (1), (2), (3) and (5)(.)Is the weight value, b is the bias value, c in equations (1), (2), (4), (5), (6) is the cell state vector, for each word wtContext information is included, and the forward LSTM layer consists of w1To wtCoded representation, as
Figure BDA0002490984920000043
Backward LSTM layer consisting of wnTo wtCoded identification, noted
Figure BDA0002490984920000044
Finally, the product is processedCombining context information to form a vector representation of the t-th word
Figure BDA0002490984920000045
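The gate equations can be transcribed literally into a single numpy timestep. The weights below are random stand-ins, and treating the cell-state terms as diagonal (elementwise) peephole connections is an assumption about the intended formulation:

```python
import numpy as np

# One LSTM timestep following the gate equations above.
# All parameters are random stand-ins; peephole terms are elementwise.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, h = 4, 3                      # input dim d, hidden dim h
rng = np.random.default_rng(0)
Wwi, Wwf, Wwc, Wwo = (rng.standard_normal((h, d)) for _ in range(4))
Whi, Whf, Whc, Who = (rng.standard_normal((h, h)) for _ in range(4))
Wci, Wcf, Wco = (rng.standard_normal(h) for _ in range(3))  # peephole (diagonal)
bi, bf, bc, bo = (np.zeros(h) for _ in range(4))

def lstm_step(w_t, h_prev, c_prev):
    i = sigmoid(Wwi @ w_t + Whi @ h_prev + Wci * c_prev + bi)   # input gate
    f = sigmoid(Wwf @ w_t + Whf @ h_prev + Wcf * c_prev + bf)   # forget gate
    z = np.tanh(Wwc @ w_t + Whc @ h_prev + bc)                  # candidate
    c = f * c_prev + i * z                                      # cell update
    o = sigmoid(Wwo @ w_t + Who @ h_prev + Wco * c + bo)        # output gate
    return o * np.tanh(c), c                                    # hidden, cell

h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h))
```

A BiLSTM runs this recurrence once left-to-right and once right-to-left, then concatenates the two hidden states per position.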
The iterated dilated convolutional neural network (IDCNN) layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as input. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input data. The model stacks four dilated convolution blocks of the same size, each a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is input into the IDCNN layer, and features are extracted by the convolution layers.
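Under the standard receptive-field rule for dilated convolutions (an assumption; the patent does not state the formula), one three-layer stack with kernel width 3 and dilations 1, 1, 2 can be checked as follows:

```python
# Receptive field of a stack of 1-D dilated convolutions:
# each layer adds (kernel - 1) * dilation positions of context.
def receptive_field(kernel, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rf = receptive_field(3, [1, 1, 2])   # 1 + 2 + 2 + 4 = 9 positions
```

With exponentially growing dilations, e.g. 1, 2, 4, 8, the field covers 31 positions with the same per-layer parameter count, which is the linear-parameters / fast-coverage trade-off described above.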
conditional Random Field (CRF), given a training dataset D { (x)1,y2),...,(xN,yN) H, observed sequence of N data xiThe corresponding mark sequence is yiCRF performs parameter estimation with a log-likelihood function that maximizes the conditional probability, i.e.:
Figure BDA0002490984920000046
wherein W is the model parameter-weight, b is the model parameter-bias, and when the conditional probability likelihood function is maximized, the obtained optimal label is y*
Figure BDA0002490984920000047
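The argmax over label sequences is typically computed with Viterbi decoding. The sketch below is a generic implementation with illustrative scores, not the patent's trained transition matrix:

```python
import numpy as np

# Viterbi decoding: find the tag sequence maximizing the sum of
# per-position emission scores and pairwise transition scores.
def viterbi(emissions, transitions):
    """emissions: [T, K] scores; transitions: [K, K], score of tag k -> k'."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]  # [K, K]
        back[t] = total.argmax(axis=0)    # best predecessor for each tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):         # backtrack to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy tag set 0=O, 1=B, 2=I; transitions forbid O -> I, so an "I" tag
# is only chosen after a "B". Scores are illustrative.
trans = np.array([[0.0, 0.0, -100.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
emis = np.array([[0.0, 2.0, 1.9],   # step 0 slightly prefers B over I
                 [0.0, 0.0, 1.0]])  # step 1 prefers I
best = viterbi(emis, trans)          # -> B, I
```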
Compared with the prior art, the invention has the following advantages:
1) The invention provides a new Chinese electronic medical record entity labeling model that uses BiLSTM-IDCNN for feature selection and construction and then decodes the resulting features with a CRF. Combining deep learning and machine learning in this way lets each compensate for the other's weaknesses, and the resulting model performs well;
2) the invention provides a method for labeling Chinese electronic medical record data based on the deep learning model BIC. The method is semi-supervised and automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation; after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload;
3) comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model achieves better entity recognition both in the general domain and in the specific biomedical domain.
Drawings
FIG. 1 is a flow chart of the method for labeling the entity of the Chinese electronic medical record based on BIC according to the present invention.
FIG. 2 is a diagram of a sequence annotation model according to an embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the invention comprehensible, embodiments are described in detail below with reference to the figures. The above scheme is further illustrated with specific embodiments, detailed as follows:
example one
In this embodiment, referring to FIG. 1, a BIC-based method for labeling entities of Chinese electronic medical records comprises the following steps:
1) first, draw up a medical entity annotation specification according to actual requirements, manually annotate a small amount of data, and process the manually annotated data into the data format required by the model to form training data;
2) train the model parameters to generate a sequence labeling model, where the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); the BiLSTM and IDCNN serve as the encoding end and the CRF as the decoding end;
3) input the data to be labeled into the sequence labeling model and obtain machine-labeled data from its output;
4) manually check and correct labeling errors, process the data to obtain the training data required by the model, and train the model again.
In the method of this embodiment, BiLSTM-IDCNN performs feature selection and construction, and the CRF then decodes the resulting features. The method combines deep learning and machine learning so that each compensates for the other's weaknesses, and it can greatly reduce the manual annotation workload.
Example two
This embodiment is substantially the same as the first embodiment, with the following features:
in this embodiment, referring to fig. 1, in step 2), the method for generating the sequence annotation model is as follows:
a. the input of the model is Chinese text, the Chinese text is divided into different training batches according to the text with different lengths, each training batch is provided with 20 sentences of text, a batch of training texts are converted into tensors through an embedding layer, and the lengths of each batch of training texts are consistent through space filling;
b. the embedding layer obtains the tensor of the input data, the tensor is processed by a coding end, the coding end is formed by combining the BilSTM and the IDCNN, the number of neurons of a BilSTM hidden layer is set, and the output of the BilSTM layer corresponds to the tensor;
c. inputting the output of the BilSTM layer into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterative hole convolution neural networks, the convolution kernel size of each hole convolution neural network is set, the hole convolution has three layers, the hole size of each layer is set, the output is carried out after convolution operation, and finally the output data corresponding tensor of the encoding end is formed by splicing the convolution results of the four iterative hole convolutions; the output of each layer is the input of the next layer, and parameters are shared among the same cavity convolution layers of the four iterative cavity convolutions;
d. the decoding end calculates a Tag corresponding to input data by a conditional random field; firstly, the output of the encoding end passes through a neural network, network weight is set, and a tensor corresponding to a logistic regression value is obtained through the network, so that a sequence labeling model is generated.
The BIC model comprises a bidirectional long-time and short-time memory network (BilSTM), an iterative void convolutional neural network (IDCNN) and a Conditional Random Field (CRF), wherein the BilSTM and the IDCNN are used as encoding ends of the model, and the CRF is used as a decoding end of the model. The method combines deep learning and machine learning, makes up for deficiencies of each other, and has the potential of greatly reducing the workload of manual labeling.
EXAMPLE III
This embodiment is substantially the same as the first embodiment, with the following features:
in this embodiment, referring to fig. 1, a method for labeling an entity of a chinese electronic medical record based on BIC, the flow of the method is shown in fig. 1, and the method includes first providing a corresponding medical entity labeling specification according to actual requirements, then manually labeling a small amount of data, performing data processing on the manually labeled data, and processing the data into a data format required by a model to form training data. And training model parameters to generate a sequence labeling model. Inputting the data to be labeled into a sequence labeling model, outputting a result to obtain machine labeled data, manually checking and correcting part labeling errors, performing data processing operation to obtain training data required by the model, and performing model training again. As the adopted deep learning model has better and better effect along with the increase of the data quantity, the entity marked by the method is more and more accurate.
As shown in FIG. 2, a new method for labeling electronic medical record data based on the deep learning model BIC comprises the following steps:
1) Chinese electronic medical record text is input and divided into training batches by sentence length, with 20 sentences per batch padded with spaces to a uniform length; the batch is then converted into a tensor by the embedding layer. Taking sentences of length 21 (punctuation included) as an example, the embedding layer outputs a tensor of dimensions [20, 21, 120];
2) the [20, 21, 120] tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor;
3) the output of the BiLSTM layer is input to the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, each with a [1, 3, 200, 100] convolution kernel and three dilated layers with dilations [1, 1, 2], producing a [20, 21, 100] tensor per branch; the four branch outputs are concatenated to form the encoder output tensor of dimensions [20, 21, 400];
4) the decoding end computes the tag for the input data with the CRF; the encoder output first passes through a neural network whose weight is a [400, 33] tensor, yielding a [20, 21, 33] tensor of logistic regression scores.
In this embodiment, the bidirectional long short-term memory (BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are fed into the LSTM network in one-hot form through the word embedding layer; a sentence of length n is written $W = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_n\}$, where the t-th word $w_t \in \mathbb{R}^d$ is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, with information storage and control realized by three gate structures: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$, expressed as:

$$i_t = \sigma(W_{wi} w_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$
$$f_t = \sigma(W_{wf} w_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2}$$
$$z_t = \tanh(W_{wc} w_t + W_{hc} h_{t-1} + b_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot z_t \tag{4}$$
$$o_t = \sigma(W_{wo} w_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

In equations (1), (2), (3), and (5), $W_{(\cdot)}$ is a weight and $b$ a bias; in equations (1), (2), (4), (5), and (6), $c$ is the cell state vector. Context information is included for each word $w_t$: the forward LSTM layer encodes $w_1$ to $w_t$, written $\overrightarrow{h_t}$, and the backward LSTM layer encodes $w_n$ to $w_t$, written $\overleftarrow{h_t}$. Finally, the context information is combined to form the vector representation of the t-th word, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
The iterated dilated convolutional neural network (IDCNN) layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as input. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers the entire input. The model stacks four dilated convolution blocks of the same size, each a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is input into the IDCNN layer, and features are extracted by the convolution layers.
The conditional random field (CRF): given a training dataset $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where the observed sequence $x_i$ of each of the N samples has the corresponding label sequence $y_i$, the CRF estimates its parameters by maximizing the conditional log-likelihood:

$$L(W, b) = \sum_{i=1}^{N} \log p(y_i \mid x_i; W, b) \tag{7}$$

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal label sequence obtained is

$$y^* = \arg\max_{y} p(y \mid x; W, b) \tag{8}$$
By comparing models such as BiLSTM-CRF, IDCNN-CRF, and BiLSTM-IDCNN-CRF, it can be seen that the BiLSTM-IDCNN-CRF model achieves better entity recognition both in the general domain and in the specific biomedical domain. The method for labeling Chinese electronic medical record data based on the deep learning model BIC is semi-supervised: it automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation, and after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload.
The embodiments of the invention have been described with reference to the drawings, but the invention is not limited to the above embodiments. Various changes and modifications may be made in accordance with the purpose of the invention, and all changes, modifications, substitutions, combinations, or simplifications made in accordance with the spirit and principle of the technical solution are equivalent substitutions and fall within the protection scope of the invention, provided they do not depart from the technical principle and inventive concept of the BIC-based Chinese electronic medical record entity labeling method.

Claims (5)

1. A Chinese electronic medical record entity labeling method based on BIC is characterized by comprising the following specific operation steps:
1) firstly, providing a corresponding medical entity marking standard according to actual requirements, then manually marking a small amount of data, carrying out data processing on the manually marked data, and processing the data into a data format required by a model to form training data;
2) training model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN) and a conditional random field (CRF), with the BiLSTM and the IDCNN serving as the encoding end of the model and the CRF as the decoding end;
3) inputting data to be labeled into a sequence labeling model, and outputting a result to obtain data labeled by a machine;
4) manually checking the machine-labeled data and correcting any labeling errors, performing the data processing operation to obtain the training data required by the model, and training the model again.
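The semi-supervised loop of steps 1) to 4) can be sketched as follows; `train_model`, `predict` and `human_review` are hypothetical placeholders standing in for the BIC model's training, inference and manual-correction stages, and the toy tag strings are illustrative only:

```python
def bic_labeling_loop(seed_data, unlabeled, rounds=2):
    """Semi-supervised loop from the claim: train on a small manually
    labeled seed set, machine-label the rest, have a human correct the
    errors, then retrain on the corrected data."""
    training_data = list(seed_data)
    model = train_model(training_data)           # step 2: fit the model
    for _ in range(rounds):
        machine = [(x, predict(model, x)) for x in unlabeled]   # step 3
        corrected = [human_review(x, y) for x, y in machine]    # step 4
        training_data.extend(corrected)
        model = train_model(training_data)       # retrain on corrected data
    return model, training_data

# Toy stand-ins so the sketch runs: "training" memorises pairs,
# "prediction" tags every character O, and review changes nothing here.
def train_model(data):
    return dict(data)

def predict(model, x):
    return model.get(x, " ".join("O" for _ in x))

def human_review(x, y):
    return (x, y)

model, data = bic_labeling_loop([("头痛", "B-SYM I-SYM")], ["发热"])
print(len(data))   # seed + one corrected sentence per round = 3
```

The point of the loop is that each round of machine labeling plus light human correction grows the training set far more cheaply than labeling from scratch.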
2. The method for entity annotation of Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the method for generating the sequence annotation model is as follows:
a. the input of the model is Chinese text, which is divided into training batches by text length, each batch containing 20 sentences; a batch of training text is converted into a tensor by the embedding layer, with space padding making the lengths within each batch consistent;
b. the embedding layer produces the tensor of the input data, which is then processed by the encoding end; the encoding end combines the BiLSTM and the IDCNN, the number of neurons of the BiLSTM hidden layer is set, and the BiLSTM layer outputs the corresponding tensor;
c. the output of the BiLSTM layer is input into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional neural networks, the convolution kernel size of each network is set, each dilated convolution has three layers whose dilation widths are set, the output is produced after the convolution operations, and finally the convolution results of the four iterated dilated convolutions are concatenated to form the output tensor of the encoding end;
d. the decoding end calculates the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weight is set, and the tensor of logistic regression values is obtained through this network, thereby generating the sequence labeling model.
3. The method for entity labeling of a Chinese electronic medical record based on BIC as claimed in claim 2, wherein in step c, the output of each layer is the input of the next layer, and parameters are shared among the corresponding dilated convolution layers of the four iterated dilated convolutions.
4. The method for entity annotation of Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the method for generating the sequence annotation model is as follows:
firstly, Chinese electronic medical record texts are input and divided into training batches by sentence length, each batch containing 20 texts padded with spaces to a consistent length; the batch of training texts is then converted into a tensor by the embedding layer; assuming sentences of length 21 (punctuation included), the embedding layer finally outputs a tensor of dimensions [20, 21, 120];
secondly, the [20, 21, 120]-dimensional tensor output by the embedding layer is input into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200]-dimensional tensor;
thirdly, the output of the BiLSTM layer is input into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional neural networks, the convolution kernel of each network is a [1, 3, 200, 100]-dimensional tensor, each dilated convolution has three layers with dilation widths [1, 1, 2], and each network outputs a [20, 21, 100]-dimensional tensor; finally the four iterated dilated convolution results are concatenated to form the encoding-end output tensor of dimensions [20, 21, 400];
fourthly, the decoding end calculates the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33]-dimensional tensor, and through this network a [20, 21, 33]-dimensional tensor of logistic regression values is obtained.
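The shape bookkeeping in the four steps above can be checked with simple arithmetic; this sketch only traces tensor dimensions and assumes nothing beyond the sizes stated in the claim:

```python
# Tracing the tensor shapes stated in the claim, step by step
# (batch of 20 sentences, padded length 21, embedding size 120).
batch, seq_len, emb = 20, 21, 120
embedded = (batch, seq_len, emb)                    # embedding layer output

hidden = 100                                        # per-direction BiLSTM size
bilstm_out = (batch, seq_len, 2 * hidden)           # forward + backward concat
assert bilstm_out == (20, 21, 200)

# Four iterated dilated-convolution branches, each with a
# [1, 3, 200, 100] kernel mapping 200 channels down to 100.
branch_out = (batch, seq_len, 100)
encoder_out = (batch, seq_len, 4 * branch_out[2])   # concatenate 4 branches
assert encoder_out == (20, 21, 400)

# Projection to the tag space: a [400, 33] weight gives per-token logits.
num_tags = 33
logits = (batch, seq_len, num_tags)
assert logits == (20, 21, 33)
print(embedded, bilstm_out, encoder_out, logits)
```

Note how the BiLSTM doubles the channel dimension (two directions) and the IDCNN quadruples its per-branch output (four branches), which is why the projection weight is [400, 33].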
5. The method for entity labeling of a Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the encoding layer extracts global text features with the BiLSTM layer and local text detail features with the IDCNN layer, and the decoding layer calculates label probabilities with the CRF to obtain the best labels, comprising the following steps:
the bidirectional long-short time memory network layer comprises a forward LSTM layer and a backward LSTM layer, the input of sentences is input into the LSTM network in a one-hot single-code mode through a word embedding layer, and a sentence with the length of n can be expressed as W ═ W { (W) }1,...,wt-1,wt,wt+1,...,wn},wtdThe tth word of the sentence is a d-dimensional vector; the LSTM includes a series of circularly connected sub-networks, called memory blocks, and the information storage and control are realized by three gate structures, which are: input door itForgetting door ftAnd an output gate otThe expression is as follows:
i_t = σ(W_wi·w_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i)  (1)
f_t = σ(W_wf·w_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f)  (2)
z_t = tanh(W_wc·w_t + W_hc·h_{t-1} + b_c)  (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t  (4)
o_t = σ(W_wo·w_t + W_ho·h_{t-1} + W_co·c_t + b_o)  (5)
h_t = o_t ⊙ tanh(c_t)  (6)
In formulas (1), (2), (3) and (5), W_(·) denotes a weight and b_(·) a bias; in formulas (1), (2), (4), (5) and (6), c_t is the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 to w_t, denoted h_t^→, and the backward LSTM layer encodes w_n to w_t, denoted h_t^←. Finally, the two are combined to form the context-aware vector representation of the t-th word, h_t = [h_t^→ ; h_t^←].
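Formulas (1)-(6) and the forward/backward combination can be sketched in numpy. The random parameters, toy dimensions, and diagonal peephole weights (W_ci, W_cf, W_co applied elementwise to the cell state) are illustrative assumptions, not the patent's trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(w_t, h_prev, c_prev, P):
    """One memory-block step implementing formulas (1)-(6)."""
    i = sigmoid(P['Wwi'] @ w_t + P['Whi'] @ h_prev + P['Wci'] * c_prev + P['bi'])  # (1)
    f = sigmoid(P['Wwf'] @ w_t + P['Whf'] @ h_prev + P['Wcf'] * c_prev + P['bf'])  # (2)
    z = np.tanh(P['Wwc'] @ w_t + P['Whc'] @ h_prev + P['bc'])                      # (3)
    c = f * c_prev + i * z                                                         # (4)
    o = sigmoid(P['Wwo'] @ w_t + P['Who'] @ h_prev + P['Wco'] * c + P['bo'])       # (5)
    h = o * np.tanh(c)                                                             # (6)
    return h, c

def run_lstm(words, P, H):
    h, c = np.zeros(H), np.zeros(H)
    out = []
    for w in words:
        h, c = lstm_step(w, h, c, P)
        out.append(h)
    return out

rng = np.random.default_rng(0)
d, H, n = 4, 3, 5                       # word dim, hidden size, sentence length
def params():                           # input weights (H,d), recurrent (H,H),
    return {k: rng.normal(size=(H, d)) if k.startswith('Ww') else  # peepholes
               rng.normal(size=(H, H)) if k.startswith('Wh') else  # and biases (H,)
               rng.normal(size=H)
            for k in ['Wwi', 'Whi', 'Wci', 'bi', 'Wwf', 'Whf', 'Wcf', 'bf',
                      'Wwc', 'Whc', 'bc', 'Wwo', 'Who', 'Wco', 'bo']}

sent = [rng.normal(size=d) for _ in range(n)]
fwd = run_lstm(sent, params(), H)               # encodes w_1 .. w_t
bwd = run_lstm(sent[::-1], params(), H)[::-1]   # encodes w_n .. w_t
h_t = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(h_t), h_t[0].shape)                   # n vectors of size 2H
```

Concatenating the two directions is what gives each position both left and right context, matching h_t = [h_t^→ ; h_t^←].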
The iterated dilated convolutional neural network layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as its input; the dilation width grows exponentially with the number of layers while the number of parameters grows only linearly, so the receptive field quickly covers the entire input; in this model, 4 dilated convolution blocks of the same size are stacked together, each a three-layer dilated convolution with dilation widths of 1, 1 and 2 respectively; the text is input into the IDCNN layer and features are extracted through the convolution layers;
conditional random field: given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)} of N samples, where x_i is an observed sequence and y_i its corresponding label sequence, the CRF performs parameter estimation by maximizing the conditional log-likelihood:
L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)
where W is the model weight parameter and b is the model bias parameter. When the conditional log-likelihood is maximized, the optimal label sequence y* is obtained as:
y* = argmax_y p(y | x; W, b)
CN202010405161.4A 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC Pending CN111767723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405161.4A CN111767723A (en) 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC


Publications (1)

Publication Number Publication Date
CN111767723A true CN111767723A (en) 2020-10-13

Family

ID=72719106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010405161.4A Pending CN111767723A (en) 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC

Country Status (1)

Country Link
CN (1) CN111767723A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784997A (en) * 2021-01-22 2021-05-11 北京百度网讯科技有限公司 Annotation rechecking method, device, equipment, storage medium and program product
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
CN113553840A (en) * 2021-08-12 2021-10-26 卫宁健康科技集团股份有限公司 Text information processing method, device, equipment and storage medium
CN115081451A (en) * 2022-06-30 2022-09-20 中国电信股份有限公司 Entity identification method and device, electronic equipment and storage medium
CN116469503A (en) * 2023-04-06 2023-07-21 海军军医大学第三附属医院东方肝胆外科医院 Health data processing method and server based on big data

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2018081089A1 (en) * 2016-10-26 2018-05-03 Deepmind Technologies Limited Processing text sequences using neural networks
CN109522546A (en) * 2018-10-12 2019-03-26 浙江大学 Entity recognition method is named based on context-sensitive medicine
CN110442840A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network update method, electronic health record processing method and relevant apparatus
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN110837736A (en) * 2019-11-01 2020-02-25 浙江大学 Character structure-based named entity recognition method for Chinese medical record of iterative expansion convolutional neural network-conditional random field
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network


Non-Patent Citations (1)

Title
WANG Yifan; LI Guoping: "Automatic scoring method for subjective questions based on semantic similarity and named entity recognition", Electronic Measurement Technology, no. 002, 31 December 2019 (2019-12-31) *


Similar Documents

Publication Publication Date Title
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN111767723A (en) Chinese electronic medical record entity labeling method based on BIC
CN109582789B (en) Text multi-label classification method based on semantic unit information
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
US20240177047A1 (en) Knowledge graph pre-training method based on structural context information
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN111753024A (en) Public safety field-oriented multi-source heterogeneous data entity alignment method
CN111274829B (en) Sequence labeling method utilizing cross-language information
CN110222338B (en) Organization name entity identification method
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN116932661A (en) Event knowledge graph construction method oriented to network security
CN110674642B (en) Semantic relation extraction method for noisy sparse text
CN111581392A (en) Automatic composition scoring calculation method based on statement communication degree
CN113609857B (en) Legal named entity recognition method and system based on cascade model and data enhancement
CN115099244A (en) Voice translation method, and method and device for training voice translation model
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN117034948A (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
CN113591493B (en) Translation model training method and translation model device
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination