CN111767723B - BIC-based Chinese electronic medical record entity labeling method - Google Patents


Info

Publication number
CN111767723B
CN111767723B (application CN202010405161.4A)
Authority
CN
China
Prior art keywords
model
layer
data
labeling
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010405161.4A
Other languages
Chinese (zh)
Other versions
CN111767723A (en)
Inventor
滕国伟
王逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202010405161.4A priority Critical patent/CN111767723B/en
Publication of CN111767723A publication Critical patent/CN111767723A/en
Application granted granted Critical
Publication of CN111767723B publication Critical patent/CN111767723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a BIC-based method for labeling entities in Chinese electronic medical records, belonging to the technical field of natural language processing and addressing the problem of identifying and labeling entities in Chinese electronic medical records. The method comprises the following steps: first, a medical entity labeling specification is drawn up according to actual needs, a small amount of data is labeled manually, and the manually labeled data is processed into the data format required by the model to form training data. Model parameters are then trained to generate a sequence labeling model comprising a bidirectional long short-term memory network, an iterated dilated convolutional neural network, and a conditional random field, the conditional random field serving as the decoding end of the model. The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data. Finally, labeling errors are audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again. The method enables automatic labeling of Chinese electronic medical record data with high accuracy.

Description

BIC-based Chinese electronic medical record entity labeling method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a BIC-based method for labeling entities in Chinese electronic medical records.
Background
In the biomedical field, a large amount of data, such as electronic medical records, is produced every day. An electronic medical record is the digitized information (text, symbols, charts, graphs, data, images, and so on) generated by medical staff through a medical institution's information system in the course of medical activities; it enables the storage, management, transmission, and reproduction of medical records, and it contains a large number of entities of many types. At present, most research on information extraction from electronic medical records targets English records; research on Chinese electronic medical records started later, has not yet formed a clear and systematic set of research tasks, and lacks a public labeled corpus. This lack of training corpora greatly restricts research on information extraction from Chinese electronic medical records. With the wide deployment of Chinese electronic medical record systems, the number of records is growing rapidly, but effectively extracting the useful information they contain remains a research hotspot and difficulty, and constructing a labeled corpus of Chinese electronic medical records is the foundation of that research. Publicly available labeled corpora in the biomedical field are very limited, and manual labeling consumes a great deal of manpower and material resources, so reducing the manual labeling workload while ensuring the accuracy of entity recognition in the biomedical field has long been a research challenge.
Disclosure of Invention
To address the construction of a labeled corpus of Chinese electronic medical records, the invention provides a BIC-based method for labeling entities in Chinese electronic medical records. The BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); BiLSTM and IDCNN serve as the encoding end of the model, and the CRF serves as its decoding end. The method is semi-supervised: it can automatically label Chinese electronic medical record data with an accuracy reaching more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A new BIC-based method for labeling Chinese electronic medical record entities comprises the following specific operation steps:
1) First, a medical entity labeling specification is drawn up according to actual needs; a small amount of data is then labeled manually, and the manually labeled data is processed into the data format required by the model to form training data;
2) Model parameters are trained to generate a sequence labeling model comprising a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN as the encoding end of the model and the CRF as its decoding end;
3) The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data;
4) Labeling errors are then audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again.
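The four-step loop above can be sketched as a bootstrap procedure. In this illustrative sketch a toy character-frequency "model" stands in for the real BIC network, the manual audit is shown as a hard-coded correction, and all names and the BIO-style `SYM` tag are assumptions, not part of the patent:

```python
def train(labeled):
    """Toy stand-in 'model': memorize the majority tag seen per character."""
    from collections import Counter, defaultdict
    counts = defaultdict(Counter)
    for chars, tags in labeled:
        for ch, tag in zip(chars, tags):
            counts[ch][tag] += 1
    return {ch: cnt.most_common(1)[0][0] for ch, cnt in counts.items()}

def predict(model, chars):
    # Step 3: machine labeling; unseen characters default to "O"
    return [model.get(ch, "O") for ch in chars]

# Step 1: a small manually labeled seed set (BIO-style symptom tags)
seed = [("头痛", ["B-SYM", "I-SYM"]), ("发热", ["B-SYM", "I-SYM"])]
model = train(seed)                       # step 2: train the labeling model
machine = predict(model, "头痛发热")       # step 3: machine-label new text
corrected = ["B-SYM", "I-SYM", "B-SYM", "I-SYM"]  # step 4: manual audit
seed.append(("头痛发热", corrected))       # corrected data re-enters training
model = train(seed)                       # and the model is trained again
```

Each pass enlarges the training set with audited machine output, which is why the method's accuracy improves as the loop repeats.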
Preferably, in step 2), the sequence labeling model is generated as follows:
a. The input of the model is Chinese text, divided into training batches by text length, with 20 texts per batch; each batch of training texts is padded to a uniform length and converted into a tensor by the embedding layer;
b. The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, determining the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each consisting of three dilated convolution layers whose kernel sizes and dilation widths are set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end;
d. The decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weights are set, yielding the tensor of logistic-regression values through the network, and the sequence labeling model is thereby generated.
Preferably, in step c, there are three dilated convolution layers, the output of each layer is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers.
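Step a pads every text in a batch to a common length before embedding. A minimal sketch, where the `"<PAD>"` filler token is an assumed convention:

```python
def pad_batch(texts, pad="<PAD>"):
    """Fill every text in a batch to the batch's maximum length."""
    max_len = max(len(t) for t in texts)
    return [list(t) + [pad] * (max_len - len(t)) for t in texts]

batch = ["患者头痛", "咳嗽"]   # two texts of unequal length
padded = pad_batch(batch)      # both sequences now have length 4
```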
Preferably, in step 2), the sequence labeling model is generated as follows:
① Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch is padded to a uniform length and converted into the corresponding tensor by the embedding layer. Taking texts of length 21 (punctuation included) as an example, the embedding layer finally outputs a tensor of dimension [20, 21, 120];
② The [20, 21, 120] tensor output by the embedding layer is fed into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor;
③ The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each convolution kernel being a [1, 3, 200, 100] tensor; each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor; finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor;
④ The decoding end computes the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
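The dimensions in steps ① through ④ can be traced with placeholder tensors; the actual BiLSTM/IDCNN computations are elided here, and only the shapes stated in the text are checked:

```python
import numpy as np

embedded = np.zeros((20, 21, 120))             # ① embedding layer output
bilstm_out = np.zeros((20, 21, 200))           # ② 2 x 100 hidden units
branches = [np.zeros((20, 21, 100)) for _ in range(4)]  # ③ four IDCNN blocks
encoder_out = np.concatenate(branches, axis=-1)         # -> (20, 21, 400)
W = np.zeros((400, 33))                        # ④ decoder projection weight
logits = encoder_out @ W                       # -> (20, 21, 33) tag scores
```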
More preferably, the sequence labeling model proceeds through the following specific steps:
The input of the model is Chinese text, divided into training batches by text length with 20 texts per batch; each batch is padded to a uniform length and converted into a tensor by the embedding layer. The embedding layer combines character embeddings and word embeddings: since word segmentation may cut word boundaries incorrectly, the input is labeled at the character level, while word features are retained through the word embedding. Taking texts of length 21 (punctuation included) as an example: the character embedding converts the 21 characters of each text in a batch into a [20, 21, 100] tensor via word2vec; the word embedding segments the 21 characters of each sentence into words and converts them into a [20, 21, 20] tensor. The final output of the embedding layer is the combination of the character and word embeddings, a [20, 21, 120] tensor;
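The character and word embeddings described above are combined by concatenation along the feature axis; the zero tensors below are placeholders for the real lookup tables:

```python
import numpy as np

char_emb = np.zeros((20, 21, 100))   # 21 characters per text, 100-dim (word2vec)
word_emb = np.zeros((20, 21, 20))    # word-level features aligned per character
embedded = np.concatenate([char_emb, word_emb], axis=-1)  # -> (20, 21, 120)
```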
The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN. The BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor. The IDCNN layer combines four iterated dilated convolution blocks; each convolution kernel is of size [1, 3, 200, 100], each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor after the convolution operations. Finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor. Within each block the output of each of the three dilated convolution layers is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers;
The decoding end computes the tag corresponding to the input data with a conditional random field: the output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
Preferably, in step 2), the encoding end extracts the global features of the text with the BiLSTM layer and its local detail features with the IDCNN layer, and the decoding end computes tag probabilities with the CRF to obtain the best tag sequence. The specific steps are as follows:
The bidirectional long short-term memory (Bi-directional Long Short-Term Memory, BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are input to the LSTM network through the word embedding layer in one-hot encoded form; a sentence of length n can be expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, where w_t ∈ R^d, the t-th word of the sentence, is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:

i_t = σ(W_wi w_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (1)
f_t = σ(W_wf w_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (2)
z_t = tanh(W_wc w_t + W_hc h_{t-1} + b_c)    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t    (4)
o_t = σ(W_wo w_t + W_ho h_{t-1} + W_co c_t + b_o)    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias vector; in formulas (1), (2), (4), (5), (6), c denotes the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 through w_t, yielding the forward hidden state h→_t, and the backward LSTM layer encodes w_n through w_t, yielding the backward hidden state h←_t; the two are concatenated to form the vector representation of the t-th word, h_t = [h→_t; h←_t].
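As a concrete check on the gate structure above, the following is a minimal NumPy sketch of a single LSTM memory-block step. The diagonal (element-wise) form of the peephole weights W_ci, W_cf, W_co, the random parameters, and all dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, p):
    """One memory-block step: input/forget/output gates, then the
    standard cell-state and hidden-state updates."""
    i = sigmoid(p["Wwi"] @ w_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wwf"] @ w_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    z = np.tanh(p["Wwc"] @ w_t + p["Whc"] @ h_prev + p["bc"])
    c = f * c_prev + i * z                    # new cell state
    o = sigmoid(p["Wwo"] @ w_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)                        # new hidden state
    return h, c

d, hid = 4, 3   # toy input and hidden sizes
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((hid, d)) for k in ("Wwi", "Wwf", "Wwc", "Wwo")}
p.update({k: rng.standard_normal((hid, hid)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.standard_normal(hid)
          for k in ("Wci", "Wcf", "Wco", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(hid), np.zeros(hid), p)
```

Running the step forward over w_1..w_t and backward over w_n..w_t and concatenating the two hidden states gives the BiLSTM representation of the t-th word.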
The iterated dilated convolutional neural network (Iterated Dilated Convolution, IDCNN) layer repeatedly applies the same small stack of dilated convolutions, taking the result of the previous application as input in each iteration. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input. The model stacks four dilated convolution blocks of identical size, each block comprising three dilated convolution layers with dilation widths 1, 1, and 2. The text is fed into the IDCNN layer and features are extracted through the convolution layers.
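The receptive-field growth can be made explicit with simple arithmetic: each dilated convolution layer of kernel width 3 and dilation d widens the receptive field by 2·d, and the same three-layer block (dilations 1, 1, 2) is iterated four times. The kernel width of 3 follows the [1, 3, 200, 100] kernel given elsewhere in the text:

```python
def receptive_field(dilations, kernel=3, repeats=1):
    """Receptive field of `repeats` applications of a dilated-conv stack."""
    rf = 1
    for _ in range(repeats):
        for d in dilations:
            rf += (kernel - 1) * d  # each layer widens the field by 2*d
    return rf

one_block = receptive_field([1, 1, 2])               # 1 + 2 + 2 + 4 = 9
four_blocks = receptive_field([1, 1, 2], repeats=4)  # grows to 33 tokens
```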
The conditional random field (CRF): given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a tag sequence y_i, the CRF estimates its parameters by maximizing the conditional log-likelihood, i.e.:

L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal tag sequence obtained for an input x is y*:

y* = argmax_y p(y | x; W, b)
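Finding y* for a linear-chain CRF is typically done with the Viterbi algorithm; the sketch below is an illustrative NumPy implementation over assumed emission scores (the decoder's logistic-regression values) and transition scores, not the patent's exact decoder:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path for one sentence.

    emissions: (T, K) per-token tag scores; transitions: (K, K) scores
    for moving from tag i to tag j. Both are illustrative stand-ins."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)   # best previous tag per current tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

em = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]])
tr = np.zeros((2, 2))        # zero transitions: reduces to per-token argmax
best = viterbi(em, tr)
```

With non-zero transition scores the CRF can, for example, penalize an I- tag that does not follow a matching B- tag, which is exactly what per-token classification cannot do.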
Compared with the prior art, the invention has the following advantages:
1) The invention provides a new entity labeling model for Chinese electronic medical records that uses BiLSTM-IDCNN for feature construction and selection and then decodes the resulting features with a CRF. Deep learning and machine learning thereby complement each other, and the resulting model is theoretically expected to perform well;
2) The invention provides a method for labeling Chinese electronic medical record data based on the deep learning model BIC. The method is semi-supervised and can automatically label Chinese electronic medical record data with an accuracy reaching more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload;
3) Comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model achieves good entity recognition both in the general domain and in the specific biomedical domain.
Drawings
FIG. 1 is a flow chart of the BIC-based Chinese electronic medical record entity labeling method of the invention.
FIG. 2 is a sequence annotation model according to an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The foregoing aspects are further described in conjunction with specific embodiments, and the following detailed description of preferred embodiments of the present invention is provided:
Example 1
In this embodiment, referring to FIG. 1, a BIC-based method for labeling Chinese electronic medical record entities comprises the following steps:
1) First, a medical entity labeling specification is drawn up according to actual needs; a small amount of data is then labeled manually, and the manually labeled data is processed into the data format required by the model to form training data;
2) Model parameters are trained to generate a sequence labeling model comprising a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN as the encoding end of the model and the CRF as its decoding end;
3) The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data;
4) Labeling errors are then audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again.
The method of this embodiment uses BiLSTM-IDCNN for feature construction and selection and then decodes the resulting features with a CRF. Deep learning and machine learning thereby complement each other, with the potential to greatly reduce the manual labeling workload.
Example two
This embodiment is substantially the same as the first; its particular features are as follows:
In this embodiment, referring to FIG. 1, in step 2) the sequence labeling model is generated as follows:
a. The input of the model is Chinese text, divided into training batches by text length with 20 texts per batch; each batch of training texts is padded to a uniform length and converted into a tensor by the embedding layer;
b. The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, determining the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each consisting of three dilated convolution layers whose kernel sizes and dilation widths are set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end; within each block the output of each dilated convolution layer is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers;
d. The decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weights are set, yielding the tensor of logistic-regression values, and the sequence labeling model is thereby generated.
The BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); BiLSTM and IDCNN serve as the encoding end of the model and the CRF as its decoding end. Deep learning and machine learning thereby complement each other, with the potential to greatly reduce the manual labeling workload.
Example III
This embodiment is substantially the same as the preceding embodiments; its particular features are as follows:
In this embodiment, the flow of the BIC-based Chinese electronic medical record entity labeling method is shown in FIG. 1: a medical entity labeling specification is drawn up according to actual needs, a small amount of data is labeled manually, and the manually labeled data is processed into the data format required by the model to form training data. Model parameters are then trained to generate a sequence labeling model. The data to be labeled is fed into the sequence labeling model, and its output yields machine-labeled data; labeling errors are audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again. Since a deep learning model improves as the amount of data grows, the entities labeled by the method become increasingly accurate.
As shown in FIG. 2, the new method for labeling electronic medical record data based on the deep learning model BIC proceeds through the following specific steps:
1) Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch is padded to a uniform length and converted into a tensor by the embedding layer. Taking texts of length 21 (punctuation included) as an example, the embedding layer finally outputs a tensor of dimension [20, 21, 120];
2) The [20, 21, 120] tensor output by the embedding layer is fed into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor.
3) The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features. The IDCNN layer combines four iterated dilated convolution blocks; each convolution kernel is a [1, 3, 200, 100] tensor, each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor; finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor.
4) The decoding end computes the tag corresponding to the input data with the CRF. The output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
In this embodiment, the bidirectional long short-term memory (Bi-directional Long Short-Term Memory, BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are input to the LSTM network through the word embedding layer in one-hot encoded form; a sentence of length n can be expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, where w_t ∈ R^d, the t-th word of the sentence, is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:

i_t = σ(W_wi w_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (1)
f_t = σ(W_wf w_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (2)
z_t = tanh(W_wc w_t + W_hc h_{t-1} + b_c)    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t    (4)
o_t = σ(W_wo w_t + W_ho h_{t-1} + W_co c_t + b_o)    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias vector; in formulas (1), (2), (4), (5), (6), c denotes the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 through w_t, yielding the forward hidden state h→_t, and the backward LSTM layer encodes w_n through w_t, yielding the backward hidden state h←_t; the two are concatenated to form the vector representation of the t-th word, h_t = [h→_t; h←_t].
The iterated dilated convolutional neural network (Iterated Dilated Convolution, IDCNN) layer repeatedly applies the same small stack of dilated convolutions, taking the result of the previous application as input in each iteration. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input. The model stacks four dilated convolution blocks of identical size, each block being a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is fed into the IDCNN layer, passes through the convolution layers, and features are extracted.
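The linear parameter growth claimed above follows from parameter sharing: reapplying one dilated block four times quadruples the effective depth without adding new weights. The count below follows the [width=3, in=200, out=100] kernel given in the text; the 100→100 channel sizes of the second and third layers are an assumption, since the text specifies only one kernel shape:

```python
def conv_params(width, c_in, c_out):
    """Parameter count of one 1-D conv layer: weights plus biases."""
    return width * c_in * c_out + c_out

# one block = three dilated conv layers (200->100, 100->100, 100->100)
block = conv_params(3, 200, 100) + 2 * conv_params(3, 100, 100)
shared_total = block          # iterated 4x with shared weights
unshared_total = 4 * block    # what four independent blocks would cost
```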
The conditional random field (CRF): given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a tag sequence y_i, the CRF estimates its parameters by maximizing the conditional log-likelihood, i.e.:

L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal tag sequence obtained for an input x is y*:

y* = argmax_y p(y | x; W, b)
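The conditional log-likelihood above can be worked through on a tiny example: a linear-chain sequence score is the sum of emission and transition scores, and the probability normalizes over all tag sequences. The scores below are illustrative stand-ins; enumeration of the partition function is feasible only because T = 2 and K = 2:

```python
import numpy as np
from itertools import product

def seq_score(em, tr, y):
    """Linear-chain score: emissions along y plus tag-to-tag transitions."""
    s = em[0, y[0]]
    for t in range(1, len(y)):
        s += tr[y[t - 1], y[t]] + em[t, y[t]]
    return s

em = np.array([[1.0, 0.0], [0.0, 1.0]])   # (T=2, K=2) emission scores
tr = np.array([[0.5, 0.0], [0.0, 0.5]])   # transition scores
scores = {y: seq_score(em, tr, y) for y in product(range(2), repeat=2)}
logZ = np.log(sum(np.exp(s) for s in scores.values()))  # partition function
loglik = scores[(0, 1)] - logZ            # log p(y = (0,1) | x)
```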
Comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model of this embodiment achieves good entity recognition both in the general domain and in the specific biomedical domain. The method for labeling Chinese electronic medical record data based on the deep learning model BIC is semi-supervised, can automatically label Chinese electronic medical record data, and reaches an accuracy of more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload.
The embodiments of the present invention are described above with reference to the accompanying drawings, but the invention is not limited to the above embodiments. Various changes, modifications, substitutions, combinations, or simplifications made within the spirit and principles of the technical scheme of the invention, so long as they meet the purpose of the invention and do not depart from the technical principle and inventive concept of the BIC-based Chinese electronic medical record entity labeling method, all fall within the scope of the invention.

Claims (4)

1. A Chinese electronic medical record entity labeling method based on BIC is characterized by comprising the following specific operation steps:
1) Firstly, establishing a medical entity labeling specification according to actual needs, manually labeling a small amount of data, and processing the manually labeled data into the data format required by the model to form training data;
2) Then, training the model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN serving as the encoding end of the model and CRF as its decoding end;
3) Inputting the data to be labeled into the sequence labeling model, whose output is the machine-labeled data;
4) Then, manually reviewing the output and correcting any labeling errors, obtaining the training data required by the model through the data processing operation, and training the model again;
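Purely as an illustration of the annotate-review-retrain loop in steps 1)-4), the sketch below wires the stages together; `train_model`, `machine_label`, and `human_review` are hypothetical toy stand-ins for the actual BiLSTM-IDCNN-CRF training, inference, and manual-review stages described in the claim:

```python
def train_model(data):
    # hypothetical stand-in: the "model" is just the (text, tag) pairs seen so far
    return dict(data)

def machine_label(model, pool):
    # hypothetical stand-in: tag each text with the tag seen in training, else "O"
    return [(text, model.get(text, "O")) for text in pool]

def human_review(drafts):
    # hypothetical stand-in: the reviewer accepts all machine labels unchanged
    return drafts

def bootstrap_labeling(seed_data, unlabeled_pool, rounds=2):
    training_data = list(seed_data)                    # step 1: small hand-labeled set
    for _ in range(rounds):
        model = train_model(training_data)             # step 2: fit the sequence labeler
        drafts = machine_label(model, unlabeled_pool)  # step 3: auto-label new data
        training_data.extend(human_review(drafts))     # step 4: review, grow, retrain
    return train_model(training_data)
```

The point of the loop is that each round enlarges the training set with reviewed machine output, so the manual effort per round shrinks as the model improves.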
In the step 2), the method for generating the sequence labeling model is as follows:
a. The input of the model is Chinese text; the texts are divided into training batches by length, with 20 texts per batch; each batch of training texts is padded to a consistent length and converted into a tensor by an embedding layer;
b. The tensor that the embedding layer produces for the input data is processed by the encoding end, which is formed by combining BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, which determines the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is input into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterated dilated (hole) convolution blocks; the convolution kernel size of each dilated convolution is set, each block comprises three layers, and the dilation width of each layer is set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end;
d. The decoding end calculates the tag corresponding to the input data with the conditional random field; the output of the encoding end first passes through a neural network whose weights are set, and the tensor of logit values obtained through this network is used to generate the sequence labeling model.
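The batching-and-padding of step a can be sketched as follows (the `<PAD>` symbol and per-character tokenization are assumptions for illustration; the claim itself only fixes the batch size of 20):

```python
def pad_batch(texts, pad="<PAD>"):
    """Pad every text in a batch to the length of the longest one."""
    max_len = max(len(t) for t in texts)
    return [list(t) + [pad] * (max_len - len(t)) for t in texts]

# two texts of unequal length come out with a consistent length
batch = pad_batch(["咳嗽三天", "头痛"])
```

Padding to a uniform length is what lets the embedding layer emit one rectangular tensor per batch in the next step.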
2. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein: in the step c, the dilated convolution comprises three layers, the output of each layer is the input of the next layer, and parameters are shared among the identical dilated convolution layers of the four iterations.
3. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein in the step 2), the method for generating the sequence labeling model is as follows:
① Firstly, the Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch of training texts is padded to a consistent length and then converted into tensors by the embedding layer; for texts of length 21 (including punctuation marks), the embedding layer finally outputs a tensor of dimension [20, 21, 120];
② The [20, 21, 120]-dimensional tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2×100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200]-dimensional tensor;
③ The output of the BiLSTM layer is input into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterated dilated convolution blocks; each dilated convolution kernel is a [1, 3, 200, 100]-dimensional tensor, each block comprises three dilated convolution layers with dilation widths of 1, 1, and 2, and each block outputs a [20, 21, 100]-dimensional tensor; finally the four iterated dilated convolution results are concatenated to form the [20, 21, 400]-dimensional output tensor of the encoding end;
④ The decoding end calculates the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33]-dimensional tensor, and the [20, 21, 33]-dimensional tensor of logit values is obtained through this network.
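The dimension bookkeeping in steps ①-④ can be checked with a small numpy sketch; random arrays stand in for the real embedding, BiLSTM, and IDCNN outputs, so only the shapes are meaningful:

```python
import numpy as np

batch, seq_len = 20, 21
x = np.random.rand(batch, seq_len, 120)      # ① embedding output, [20, 21, 120]

# ② BiLSTM: 100 units per direction, concatenated -> 200 features
h = np.random.rand(batch, seq_len, 200)      # stand-in for the BiLSTM output

# ③ IDCNN: four iterated dilated-convolution blocks, each yielding 100
#    features, concatenated along the feature axis -> 400 features
blocks = [np.random.rand(batch, seq_len, 100) for _ in range(4)]
enc = np.concatenate(blocks, axis=-1)        # [20, 21, 400]

# ④ projection to the 33 tag logits consumed by the CRF
W = np.random.rand(400, 33)
logits = enc @ W                             # [20, 21, 33]
```

The final matmul broadcasts over the batch and sequence axes, which is exactly the [400, 33] projection named in step ④.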
4. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein in the step 2), the encoding layer extracts the global features of the text with the BiLSTM layer and the local detail features of the text with the IDCNN layer, and the decoding layer calculates the label probabilities with the CRF to obtain the best labels; the specific steps are as follows:
The bidirectional long short-term memory network layer comprises a forward LSTM layer and a backward LSTM layer; a sentence is input into the LSTM network in one-hot encoding through a word embedding layer, and a sentence of length n is expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, w_t ∈ R^d, that is, the t-th word of the sentence is a d-dimensional vector; the LSTM comprises a series of sub-networks connected in a loop, called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:
i_t = δ(W_wi·w_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i) (1)
f_t = δ(W_wf·w_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (2)
z_t = tanh(W_wc·w_t + W_hc·h_{t-1} + b_c) (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t (4)
o_t = δ(W_wo·w_t + W_ho·h_{t-1} + W_co·c_t + b_o) (5)
h_t = o_t ⊙ tanh(c_t) (6)
In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias value; in formulas (1), (2), (4), (5), (6), c is the cell state vector. For each word w_t the context information is captured as follows: the forward LSTM layer encodes w_1 to w_t, producing the forward hidden state h_t^f; the backward LSTM layer encodes w_n to w_t, producing the backward hidden state h_t^b; finally the context information is combined to form the vector representation of the t-th word, h_t = [h_t^f; h_t^b].
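Gates (1)-(6) can be written out directly in numpy. The following is an illustrative sketch with tiny, randomly initialised dimensions (not the claimed 100-unit configuration); the peephole terms W_ci, W_cf, W_co follow the formulas above:

```python
import numpy as np

def lstm_step(w_t, h_prev, c_prev, P):
    """One LSTM step implementing gates (1)-(6); P holds the weight
    matrices W_* and bias vectors b_*."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(P["Wwi"] @ w_t + P["Whi"] @ h_prev + P["Wci"] @ c_prev + P["bi"])  # (1)
    f = sigmoid(P["Wwf"] @ w_t + P["Whf"] @ h_prev + P["Wcf"] @ c_prev + P["bf"])  # (2)
    z = np.tanh(P["Wwc"] @ w_t + P["Whc"] @ h_prev + P["bc"])                      # (3)
    c = f * c_prev + i * z                                                         # (4)
    o = sigmoid(P["Wwo"] @ w_t + P["Who"] @ h_prev + P["Wco"] @ c + P["bo"])       # (5)
    h = o * np.tanh(c)                                                             # (6)
    return h, c

d, u = 4, 3  # word-vector size and hidden size, chosen small for illustration
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((u, d if k.startswith("Ww") else u))
     for k in ["Wwi", "Whi", "Wci", "Wwf", "Whf", "Wcf",
               "Wwc", "Whc", "Wwo", "Who", "Wco"]}
P.update({k: rng.standard_normal(u) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.standard_normal(d), np.zeros(u), np.zeros(u), P)
```

Running the step over w_1..w_t and over w_n..w_t with a second parameter set gives h_t^f and h_t^b, whose concatenation is the bidirectional representation described above.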
The iterated dilated convolution neural network layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as its input. The dilation width grows exponentially with the number of layers while the number of parameters grows only linearly, so the receptive field quickly covers all of the input data. The model stacks 4 dilated convolution blocks of the same size, each block consisting of three dilated convolution layers with dilation widths of 1, 1, and 2 respectively. The text is input into the IDCNN layer and features are extracted through the convolution layers.
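The receptive-field growth claimed above can be verified with simple arithmetic: a width-k convolution with dilation d widens the receptive field by (k-1)·d, so for the width-3 kernels used here:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of 1-D convolutions with the given dilations."""
    field = 1
    for d in dilations:
        field += (kernel - 1) * d   # each layer widens the field by (k-1)*d
    return field

block = [1, 1, 2]                   # one dilated-convolution block
one_block = receptive_field(block)        # a single block spans 9 positions
four_blocks = receptive_field(block * 4)  # four iterations span 33 positions
```

Four iterations already cover 33 positions, wider than the 21-token sentences in the claims, which is the sense in which the receptive field "quickly covers all input data" while the four iterations share one block's parameters.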
The conditional random field: given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a label sequence y_i, the CRF performs parameter estimation by maximizing the conditional log-likelihood, i.e.:
L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)
wherein W is the model parameter weight and b is the model parameter bias; when the conditional probability likelihood function is maximized, the obtained optimal label sequence for an input x is y*:
y* = argmax_y p(y | x; W, b)
CN202010405161.4A 2020-05-14 2020-05-14 BIC-based Chinese electronic medical record entity labeling method Active CN111767723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405161.4A CN111767723B (en) 2020-05-14 2020-05-14 BIC-based Chinese electronic medical record entity labeling method

Publications (2)

Publication Number Publication Date
CN111767723A CN111767723A (en) 2020-10-13
CN111767723B true CN111767723B (en) 2024-07-19

Family

ID=72719106

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784997B (en) * 2021-01-22 2023-11-10 北京百度网讯科技有限公司 Annotation rechecking method, device, equipment, storage medium and program product
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
CN115081451A (en) * 2022-06-30 2022-09-20 中国电信股份有限公司 Entity identification method and device, electronic equipment and storage medium
CN116469503B (en) * 2023-04-06 2024-06-21 海军军医大学第三附属医院东方肝胆外科医院 Health data processing method and server based on big data

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109522546A (en) * 2018-10-12 2019-03-26 浙江大学 Context-sensitive medical named entity recognition method
CN110442840A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network update method, electronic health record processing method and relevant apparatus

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
KR102359216B1 (en) * 2016-10-26 2022-02-07 딥마인드 테크놀로지스 리미티드 Text Sequence Processing Using Neural Networks
US11574122B2 (en) * 2018-08-23 2023-02-07 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN110837736B (en) * 2019-11-01 2021-08-10 浙江大学 Named entity recognition method of Chinese medical record based on word structure


Similar Documents

Publication Publication Date Title
CN111767723B (en) BIC-based Chinese electronic medical record entity labeling method
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN110059185B (en) Medical document professional vocabulary automatic labeling method
CN111310471B (en) Travel named entity identification method based on BBLC model
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN113128229B (en) Chinese entity relation joint extraction method
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN105938485A (en) Image description method based on convolution cyclic hybrid model
CN111241816A (en) Automatic news headline generation method
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN110196980A (en) A kind of field migration based on convolutional network in Chinese word segmentation task
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN112749253B (en) Multi-text abstract generation method based on text relation graph
CN112417854A (en) Chinese document abstraction type abstract method
CN111274804A (en) Case information extraction method based on named entity recognition
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN110222338B (en) Organization name entity identification method
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN113609840B (en) Chinese law judgment abstract generation method and system
CN113158659B (en) Case-related property calculation method based on judicial text
CN113901813A (en) Event extraction method based on topic features and implicit sentence structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant