CN111767723A - Chinese electronic medical record entity labeling method based on BIC - Google Patents

Chinese electronic medical record entity labeling method based on BIC

Info

Publication number
CN111767723A
Authority
CN
China
Prior art keywords
layer
data
model
training
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010405161.4A
Other languages
Chinese (zh)
Inventor
滕国伟
王逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology filed Critical University of Shanghai for Science and Technology
Priority to CN202010405161.4A priority Critical patent/CN111767723A/en
Publication of CN111767723A publication Critical patent/CN111767723A/en
Pending legal-status Critical Current

Classifications

    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates (G06F 40/20 Natural language analysis; G06F 40/279 Recognition of textual entities)
    • G06F 40/295 Named entity recognition (G06F 40/279 Recognition of textual entities; G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08 Learning methods (G06N 3/02 Neural networks)
    • G16H 50/70 ICT specially adapted for mining of medical data, e.g. analysing previous cases of other patients

Abstract

The invention discloses a method for labeling entities in Chinese electronic medical records based on BIC, belonging to the technical field of natural language processing and addressing the problem of identifying and labeling entities in Chinese electronic medical records. The method comprises the following steps: first, a medical entity annotation specification is drawn up according to actual requirements; a small amount of data is then annotated manually, and the manually annotated data is processed into the data format required by the model to form training data. Model parameters are trained to generate a sequence labeling model, where the model comprises a bidirectional long short-term memory network, an iterated dilated convolutional neural network, and a conditional random field, the last serving as the decoding end of the model. Data to be labeled is input into the sequence labeling model, and its output yields machine-labeled data. Finally, labeling errors are checked and corrected manually, the data is processed to obtain the training data required by the model, and the model is trained again. The method achieves automatic annotation of Chinese electronic medical record data with high accuracy.

Description

Chinese electronic medical record entity labeling method based on BIC
Technical Field
The invention relates to the technical field of natural language processing, in particular to a Chinese electronic medical record entity labeling method based on BIC.
Background
In the biomedical field, large amounts of data, such as electronic medical records, are generated daily. An electronic medical record is the digital information, such as characters, symbols, charts, graphs, data, and images, generated by medical staff using a medical institution's information system in the course of medical activities; it supports the storage, management, transmission, and reproduction of medical records and contains a large number of entities of many types. At present, most research on electronic medical record information extraction targets English records; research on Chinese electronic medical records started late, has not yet formed a clear and systematic research agenda, and lacks a public annotated corpus. This shortage of training corpora greatly restricts research on Chinese electronic medical record information extraction. With the wide adoption of Chinese electronic medical record systems, the number of records is growing rapidly, but effectively extracting the useful information in these massive records remains a research hotspot and challenge, and constructing an annotated corpus of Chinese electronic medical records is the foundation of that research. Publicly available annotated corpora in the biomedical field are very limited, and manual annotation consumes substantial manpower and material resources, so reducing the manual annotation workload while preserving entity recognition accuracy in the biomedical field has long been a research difficulty.
Disclosure of Invention
Aiming at the problem of constructing an annotated corpus of Chinese electronic medical records, the invention provides a Chinese electronic medical record entity labeling method based on BIC, where the BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); the BiLSTM and IDCNN serve as the encoding end of the model, and the CRF serves as the decoding end. The method is semi-supervised: it automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation, and after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload.
In order to achieve the above purpose, the invention adopts the following technical scheme:
a new method for entity tagging of Chinese electronic medical record based on BIC comprises the following specific operation steps:
1) firstly, providing a corresponding medical entity marking standard according to actual requirements, then manually marking a small amount of data, carrying out data processing on the manually marked data, and processing the data into a data format required by a model to form training data;
2) training model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long-time memory network (BilSTM), an iterative hole convolutional neural network (IDCNN) and a Conditional Random Field (CRF), the BilSTM and the IDCNN are used as encoding ends of the model, and the CRF is used as a decoding end of the model;
3) inputting data to be labeled into a sequence labeling model, and outputting a result to obtain data labeled by a machine;
4) and then, manually checking and correcting part marking errors, performing data processing operation to obtain training data required by the model, and performing model training again.
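The four steps above can be sketched as a minimal loop. Every function name and the toy BIO tag scheme below are illustrative assumptions, not the patent's actual implementation:

```python
# Minimal sketch of the four-step semi-supervised annotation loop.
# Function names and the toy character-lookup "model" are hypothetical.

def preprocess(labeled):
    """Step 1: turn manually annotated sentences into (chars, tags) pairs."""
    return [(list(sent), tags) for sent, tags in labeled]

def train(pairs):
    """Step 2: stand-in for training the BIC model; here a char-to-tag lookup."""
    return {c: t for chars, tags in pairs for c, t in zip(chars, tags)}

def annotate(model, text):
    """Step 3: machine-label unannotated text with the trained model."""
    return [model.get(c, "O") for c in text]

def correct(machine_tags, fixes):
    """Step 4: a human reviewer overrides wrong machine labels by position."""
    return [fixes.get(i, t) for i, t in enumerate(machine_tags)]

seed = [("头痛", ["B-SYM", "I-SYM"])]          # small manually labeled seed set
model = train(preprocess(seed))
tags = annotate(model, "头痛发热")              # machine labels the new text
tags = correct(tags, {2: "B-SYM", 3: "I-SYM"})  # reviewer fixes 发热
```

The corrected output would then be fed back as additional training data, closing the loop described in step 4).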
Preferably, in step 2), the sequence labeling model is generated as follows:
a. the model input is Chinese text, divided into training batches by text length, with 20 sentences per batch; each batch is converted into a tensor by an embedding layer, and sentences within a batch are padded with spaces to a uniform length;
b. the tensor produced by the embedding layer is processed by the encoding end, which combines the BiLSTM and the IDCNN; the number of neurons in the BiLSTM hidden layer is set, and the BiLSTM layer outputs the corresponding tensor;
c. the output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, with the convolution kernel size of each set; each dilated convolution has three layers with a set dilation per layer, and the outputs of the four iterated dilated convolutions are concatenated to form the output tensor of the encoding end;
d. the decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network with a set weight matrix, yielding a tensor of logistic regression scores, and thus the sequence labeling model is generated.
Preferably, in step c, within the three dilated convolution layers the output of each layer is the input of the next, and parameters are shared between corresponding dilated convolution layers of the four iterated dilated convolutions.
Preferably, in step 2), the sequence labeling model is generated as follows:
firstly, Chinese electronic medical record text is input and divided into training batches by sentence length, with 20 sentences per batch padded with spaces to a uniform length; the batch is then converted into a tensor by the embedding layer. For sentences of length 21 (punctuation included), the embedding layer outputs a tensor of dimensions [20, 21, 120];
secondly, the [20, 21, 120] tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor;
thirdly, the output of the BiLSTM layer is input to the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, each with a [1, 3, 200, 100] convolution kernel and three dilated layers with dilations [1, 1, 2], producing a [20, 21, 100] tensor per branch; the four branch outputs are concatenated to form the encoder output tensor of dimensions [20, 21, 400];
fourthly, the decoding end computes the tag for the input data with the CRF; the encoder output first passes through a neural network whose weight is a [400, 33] tensor, yielding a [20, 21, 33] tensor of logistic regression scores.
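The tensor dimensions in the four steps above can be checked with a small shape walk-through. The random arrays below only mimic the layer shapes, not trained BiLSTM/IDCNN computations:

```python
import numpy as np

# Shape walk-through of the tensors described above; random data stands
# in for the real embedding, BiLSTM, and IDCNN layers.
batch, seq, emb = 20, 21, 120          # 20 sentences, length 21, 120-dim embedding
x = np.random.randn(batch, seq, emb)   # embedding output: [20, 21, 120]

hidden = 100
fwd = np.random.randn(batch, seq, hidden)    # forward LSTM states
bwd = np.random.randn(batch, seq, hidden)    # backward LSTM states
bilstm_out = np.concatenate([fwd, bwd], -1)  # BiLSTM output: [20, 21, 200]

# four IDCNN branches, each ending in 100 filters, concatenated
branches = [np.random.randn(batch, seq, 100) for _ in range(4)]
encoder_out = np.concatenate(branches, -1)   # encoder output: [20, 21, 400]

# projection to 33 tag scores before CRF decoding
w = np.random.randn(400, 33)
logits = encoder_out @ w                     # [20, 21, 33]
```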
Further preferably, the sequence labeling model is built as follows:
the model input is Chinese text, divided into training batches by text length with 20 sentences per batch; each batch is converted into a tensor by the embedding layer, and sentences in a batch are padded with spaces to a uniform length. The embedding layer combines character embedding and word embedding: because word segmentation boundaries may be unreliable, labeling uses character-based input without discarding word features. Taking sentences of length 21 (punctuation included) as an example, character embedding converts the 21 characters of each sentence in a batch into a [20, 21, 100] tensor using word2vec, and word embedding converts the same 21 characters, after word segmentation, into a [20, 21, 20] tensor. The final embedding layer output combines characters and words into a [20, 21, 120] tensor;
the encoding end processes the embedding tensor and combines the BiLSTM and the IDCNN; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor. The IDCNN layer combines four iterated dilated convolutional networks, each with convolution kernel size [1, 3, 200, 100] and three dilated layers with dilations [1, 1, 2]; each branch outputs a [20, 21, 100] tensor, and the four branch outputs are concatenated into the encoder output tensor of dimensions [20, 21, 400]. The output of each layer is the input of the next, and parameters are shared between corresponding dilated convolution layers of the four branches;
the decoding end computes the tag for the input data with a conditional random field: the encoder output passes through a neural network with a [400, 33] weight matrix, yielding a [20, 21, 33] tensor of logistic regression scores.
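The character-plus-word embedding described above (100-dim word2vec character vectors concatenated with 20-dim segmentation-based word features per character) can be sketched as follows; the vectors here are random stand-ins for the learned embeddings:

```python
import numpy as np

# Character embedding plus word embedding, concatenated per character:
# 100 dims (char) + 20 dims (word feature) = 120 dims, matching [20, 21, 120].
batch, seq = 20, 21
char_emb = np.random.randn(batch, seq, 100)  # word2vec character vectors
word_emb = np.random.randn(batch, seq, 20)   # segmentation-based word features
embedded = np.concatenate([char_emb, word_emb], axis=-1)
```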
Preferably, in step 2), the encoding layer extracts global text features with the BiLSTM layer and local detail features with the IDCNN layer, and the decoding layer computes label probabilities with the CRF to obtain the optimal labels, as follows:
the bidirectional long and short memory network (BilSTM) layer comprises a forward LSTM layer and a backward LSTM layer, the input of sentences is input into the LSTM network in a one-hot coding mode through a word embedding layer, and a sentence with the length of n can be expressed as W ═ W [ (W [) ]1,...,wt-1,wt,wt+1,...,wn},wtdThe tth word of the sentence is a d-dimensional vector; the LSTM includes a series of circularly connected sub-networks, called memory blocks, and the information storage and control are realized by three gate structures, which are: input door itForgetting door ftAnd an output gate otThe expression is as follows:
it=(Wwiwt+Whiht-1+Wcict-1+bi) (1)
ft=(Wwfwt+Whfht-1+Wcfct-1+bf) (2)
zt=tanh(Wwcwt+Whcht-1+bc) (3)
Figure BDA0002490984920000041
ot=(Wwowt+Whoht-1+Wcoct+bo) (5)
Figure BDA0002490984920000042
w in the formulae (1), (2), (3) and (5)(.)Is the weight value, b is the bias value, c in equations (1), (2), (4), (5), (6) is the cell state vector, for each word wtContext information is included, and the forward LSTM layer consists of w1To wtCoded representation, as
Figure BDA0002490984920000043
Backward LSTM layer consisting of wnTo wtCoded identification, noted
Figure BDA0002490984920000044
Finally, the product is processedCombining context information to form a vector representation of the t-th word
Figure BDA0002490984920000045
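The gate equations can be transcribed literally into a single numpy timestep. The weights below are random stand-ins, and treating the cell-state terms as diagonal (elementwise) peephole connections is an assumption about the intended formulation:

```python
import numpy as np

# One LSTM timestep following the gate equations above.
# All parameters are random stand-ins; peephole terms are elementwise.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d, h = 4, 3                      # input dim d, hidden dim h
rng = np.random.default_rng(0)
Wwi, Wwf, Wwc, Wwo = (rng.standard_normal((h, d)) for _ in range(4))
Whi, Whf, Whc, Who = (rng.standard_normal((h, h)) for _ in range(4))
Wci, Wcf, Wco = (rng.standard_normal(h) for _ in range(3))  # peephole (diagonal)
bi, bf, bc, bo = (np.zeros(h) for _ in range(4))

def lstm_step(w_t, h_prev, c_prev):
    i = sigmoid(Wwi @ w_t + Whi @ h_prev + Wci * c_prev + bi)   # input gate
    f = sigmoid(Wwf @ w_t + Whf @ h_prev + Wcf * c_prev + bf)   # forget gate
    z = np.tanh(Wwc @ w_t + Whc @ h_prev + bc)                  # candidate
    c = f * c_prev + i * z                                      # cell update
    o = sigmoid(Wwo @ w_t + Who @ h_prev + Wco * c + bo)        # output gate
    return o * np.tanh(c), c                                    # hidden, cell

h_t, c_t = lstm_step(rng.standard_normal(d), np.zeros(h), np.zeros(h))
```

A BiLSTM runs this recurrence once left-to-right and once right-to-left, then concatenates the two hidden states per position.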
The iterated dilated convolutional neural network (IDCNN) layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as input. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input data. The model stacks four dilated convolution blocks of the same size, each a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is input into the IDCNN layer, and features are extracted by the convolution layers.
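Under the standard receptive-field rule for dilated convolutions (an assumption; the patent does not state the formula), one three-layer stack with kernel width 3 and dilations 1, 1, 2 can be checked as follows:

```python
# Receptive field of a stack of 1-D dilated convolutions:
# each layer adds (kernel - 1) * dilation positions of context.
def receptive_field(kernel, dilations):
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

rf = receptive_field(3, [1, 1, 2])   # 1 + 2 + 2 + 4 = 9 positions
```

With exponentially growing dilations, e.g. 1, 2, 4, 8, the field covers 31 positions with the same per-layer parameter count, which is the linear-parameters / fast-coverage trade-off described above.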
conditional Random Field (CRF), given a training dataset D { (x)1,y2),...,(xN,yN) H, observed sequence of N data xiThe corresponding mark sequence is yiCRF performs parameter estimation with a log-likelihood function that maximizes the conditional probability, i.e.:
Figure BDA0002490984920000046
wherein W is the model parameter-weight, b is the model parameter-bias, and when the conditional probability likelihood function is maximized, the obtained optimal label is y*
Figure BDA0002490984920000047
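The argmax over label sequences is typically computed with Viterbi decoding. The sketch below is a generic implementation with illustrative scores, not the patent's trained transition matrix:

```python
import numpy as np

# Viterbi decoding: find the tag sequence maximizing the sum of
# per-position emission scores and pairwise transition scores.
def viterbi(emissions, transitions):
    """emissions: [T, K] scores; transitions: [K, K], score of tag k -> k'."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]  # [K, K]
        back[t] = total.argmax(axis=0)    # best predecessor for each tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):         # backtrack to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy tag set 0=O, 1=B, 2=I; transitions forbid O -> I, so an "I" tag
# is only chosen after a "B". Scores are illustrative.
trans = np.array([[0.0, 0.0, -100.0],
                  [0.0, 0.0, 1.0],
                  [0.0, 0.0, 1.0]])
emis = np.array([[0.0, 2.0, 1.9],   # step 0 slightly prefers B over I
                 [0.0, 0.0, 1.0]])  # step 1 prefers I
best = viterbi(emis, trans)          # -> B, I
```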
Compared with the prior art, the invention has the following advantages:
1) The invention provides a new Chinese electronic medical record entity labeling model that uses BiLSTM-IDCNN for feature selection and construction and then decodes the resulting features with a CRF. Combining deep learning and machine learning in this way lets each compensate for the other's weaknesses, and the resulting model performs well;
2) the invention provides a method for labeling Chinese electronic medical record data based on the deep learning model BIC. The method is semi-supervised and automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation; after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload;
3) comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model achieves better entity recognition both in the general domain and in the specific biomedical domain.
Drawings
FIG. 1 is a flow chart of the method for labeling the entity of the Chinese electronic medical record based on BIC according to the present invention.
FIG. 2 is a diagram of a sequence annotation model according to an embodiment of the present invention.
Detailed Description
To make the above objects, features, and advantages of the invention comprehensible, embodiments are described in detail below with reference to the figures. The above scheme is further illustrated with specific embodiments, detailed as follows:
example one
In this embodiment, referring to FIG. 1, a BIC-based method for labeling entities of Chinese electronic medical records comprises the following steps:
1) first, draw up a medical entity annotation specification according to actual requirements, manually annotate a small amount of data, and process the manually annotated data into the data format required by the model to form training data;
2) train the model parameters to generate a sequence labeling model, where the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); the BiLSTM and IDCNN serve as the encoding end and the CRF as the decoding end;
3) input the data to be labeled into the sequence labeling model and obtain machine-labeled data from its output;
4) manually check and correct labeling errors, process the data to obtain the training data required by the model, and train the model again.
In the method of this embodiment, BiLSTM-IDCNN performs feature selection and construction, and the CRF then decodes the resulting features. The method combines deep learning and machine learning so that each compensates for the other's weaknesses, and it can greatly reduce the manual annotation workload.
Example two
This embodiment is substantially the same as the first embodiment, with the following features:
in this embodiment, referring to fig. 1, in step 2), the method for generating the sequence annotation model is as follows:
a. the input of the model is Chinese text, the Chinese text is divided into different training batches according to the text with different lengths, each training batch is provided with 20 sentences of text, a batch of training texts are converted into tensors through an embedding layer, and the lengths of each batch of training texts are consistent through space filling;
b. the embedding layer obtains the tensor of the input data, the tensor is processed by a coding end, the coding end is formed by combining the BilSTM and the IDCNN, the number of neurons of a BilSTM hidden layer is set, and the output of the BilSTM layer corresponds to the tensor;
c. inputting the output of the BilSTM layer into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterative hole convolution neural networks, the convolution kernel size of each hole convolution neural network is set, the hole convolution has three layers, the hole size of each layer is set, the output is carried out after convolution operation, and finally the output data corresponding tensor of the encoding end is formed by splicing the convolution results of the four iterative hole convolutions; the output of each layer is the input of the next layer, and parameters are shared among the same cavity convolution layers of the four iterative cavity convolutions;
d. the decoding end calculates a Tag corresponding to input data by a conditional random field; firstly, the output of the encoding end passes through a neural network, network weight is set, and a tensor corresponding to a logistic regression value is obtained through the network, so that a sequence labeling model is generated.
The BIC model comprises a bidirectional long-time and short-time memory network (BilSTM), an iterative void convolutional neural network (IDCNN) and a Conditional Random Field (CRF), wherein the BilSTM and the IDCNN are used as encoding ends of the model, and the CRF is used as a decoding end of the model. The method combines deep learning and machine learning, makes up for deficiencies of each other, and has the potential of greatly reducing the workload of manual labeling.
EXAMPLE III
This embodiment is substantially the same as the first embodiment, with the following features:
in this embodiment, referring to fig. 1, a method for labeling an entity of a chinese electronic medical record based on BIC, the flow of the method is shown in fig. 1, and the method includes first providing a corresponding medical entity labeling specification according to actual requirements, then manually labeling a small amount of data, performing data processing on the manually labeled data, and processing the data into a data format required by a model to form training data. And training model parameters to generate a sequence labeling model. Inputting the data to be labeled into a sequence labeling model, outputting a result to obtain machine labeled data, manually checking and correcting part labeling errors, performing data processing operation to obtain training data required by the model, and performing model training again. As the adopted deep learning model has better and better effect along with the increase of the data quantity, the entity marked by the method is more and more accurate.
As shown in FIG. 2, a new method for labeling electronic medical record data based on the deep learning model BIC comprises the following steps:
1) Chinese electronic medical record text is input and divided into training batches by sentence length, with 20 sentences per batch padded with spaces to a uniform length; the batch is then converted into a tensor by the embedding layer. Taking sentences of length 21 (punctuation included) as an example, the embedding layer outputs a tensor of dimensions [20, 21, 120];
2) the [20, 21, 120] tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200] tensor;
3) the output of the BiLSTM layer is input to the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional networks, each with a [1, 3, 200, 100] convolution kernel and three dilated layers with dilations [1, 1, 2], producing a [20, 21, 100] tensor per branch; the four branch outputs are concatenated to form the encoder output tensor of dimensions [20, 21, 400];
4) the decoding end computes the tag for the input data with the CRF; the encoder output first passes through a neural network whose weight is a [400, 33] tensor, yielding a [20, 21, 33] tensor of logistic regression scores.
In this embodiment, the bidirectional long short-term memory (BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are fed into the LSTM network in one-hot form through the word embedding layer; a sentence of length n is written $W = \{w_1, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_n\}$, where the t-th word $w_t \in \mathbb{R}^d$ is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, with information storage and control realized by three gate structures: the input gate $i_t$, the forget gate $f_t$, and the output gate $o_t$, expressed as:

$$i_t = \sigma(W_{wi} w_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i) \tag{1}$$
$$f_t = \sigma(W_{wf} w_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f) \tag{2}$$
$$z_t = \tanh(W_{wc} w_t + W_{hc} h_{t-1} + b_c) \tag{3}$$
$$c_t = f_t \odot c_{t-1} + i_t \odot z_t \tag{4}$$
$$o_t = \sigma(W_{wo} w_t + W_{ho} h_{t-1} + W_{co} c_t + b_o) \tag{5}$$
$$h_t = o_t \odot \tanh(c_t) \tag{6}$$

In equations (1), (2), (3), and (5), $W_{(\cdot)}$ is a weight and $b$ a bias; in equations (1), (2), (4), (5), and (6), $c$ is the cell state vector. Context information is included for each word $w_t$: the forward LSTM layer encodes $w_1$ to $w_t$, written $\overrightarrow{h_t}$, and the backward LSTM layer encodes $w_n$ to $w_t$, written $\overleftarrow{h_t}$. Finally, the context information is combined to form the vector representation of the t-th word, $h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$.
The iterated dilated convolutional neural network (IDCNN) layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as input. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers the entire input. The model stacks four dilated convolution blocks of the same size, each a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is input into the IDCNN layer, and features are extracted by the convolution layers.
The conditional random field (CRF): given a training dataset $D = \{(x_1, y_1), \ldots, (x_N, y_N)\}$, where the observed sequence $x_i$ of each of the N samples has the corresponding label sequence $y_i$, the CRF estimates its parameters by maximizing the conditional log-likelihood:

$$L(W, b) = \sum_{i=1}^{N} \log p(y_i \mid x_i; W, b) \tag{7}$$

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal label sequence obtained is

$$y^* = \arg\max_{y} p(y \mid x; W, b) \tag{8}$$
By comparing models such as BiLSTM-CRF, IDCNN-CRF, and BiLSTM-IDCNN-CRF, it can be seen that the BiLSTM-IDCNN-CRF model achieves better entity recognition both in the general domain and in the specific biomedical domain. The method for labeling Chinese electronic medical record data based on the deep learning model BIC is semi-supervised: it automatically annotates Chinese electronic medical record data with an accuracy exceeding 73.5% of that of manual annotation, and after manual review, data of more than 95% manual-annotation quality can be obtained, greatly reducing the manual annotation workload.
The embodiments of the invention have been described with reference to the drawings, but the invention is not limited to the above embodiments. Various changes and modifications may be made in accordance with the purpose of the invention, and all changes, modifications, substitutions, combinations, or simplifications made in accordance with the spirit and principle of the technical solution are equivalent substitutions and fall within the protection scope of the invention, provided they do not depart from the technical principle and inventive concept of the BIC-based Chinese electronic medical record entity labeling method.

Claims (5)

1. A Chinese electronic medical record entity labeling method based on BIC is characterized by comprising the following specific operation steps:
1) firstly, providing a corresponding medical entity marking standard according to actual requirements, then manually marking a small amount of data, carrying out data processing on the manually marked data, and processing the data into a data format required by a model to form training data;
2) training model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN) and a conditional random field (CRF), with the BiLSTM and the IDCNN serving as the encoding end of the model and the CRF as the decoding end;
3) inputting data to be labeled into a sequence labeling model, and outputting a result to obtain data labeled by a machine;
4) manually checking the machine-labeled data and correcting any labeling errors, performing the data processing operation to obtain the training data required by the model, and training the model again.
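The semi-supervised loop of steps 1) to 4) can be sketched as follows; `train_model`, `predict` and `human_review` are hypothetical placeholders standing in for the BIC model's training, inference and manual-correction stages, and the toy tag strings are illustrative only:

```python
def bic_labeling_loop(seed_data, unlabeled, rounds=2):
    """Semi-supervised loop from the claim: train on a small manually
    labeled seed set, machine-label the rest, have a human correct the
    errors, then retrain on the corrected data."""
    training_data = list(seed_data)
    model = train_model(training_data)           # step 2: fit the model
    for _ in range(rounds):
        machine = [(x, predict(model, x)) for x in unlabeled]   # step 3
        corrected = [human_review(x, y) for x, y in machine]    # step 4
        training_data.extend(corrected)
        model = train_model(training_data)       # retrain on corrected data
    return model, training_data

# Toy stand-ins so the sketch runs: "training" memorises pairs,
# "prediction" tags every character O, and review changes nothing here.
def train_model(data):
    return dict(data)

def predict(model, x):
    return model.get(x, " ".join("O" for _ in x))

def human_review(x, y):
    return (x, y)

model, data = bic_labeling_loop([("头痛", "B-SYM I-SYM")], ["发热"])
print(len(data))   # seed + one corrected sentence per round = 3
```

The point of the loop is that each round of machine labeling plus light human correction grows the training set far more cheaply than labeling from scratch.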
2. The method for entity annotation of Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the method for generating the sequence annotation model is as follows:
a. the input of the model is Chinese text, which is divided into training batches by text length, each batch containing 20 sentences; a batch of training text is converted into a tensor by the embedding layer, with space padding making the lengths within each batch consistent;
b. the embedding layer produces the tensor of the input data, which is then processed by the encoding end; the encoding end combines the BiLSTM and the IDCNN, the number of neurons of the BiLSTM hidden layer is set, and the BiLSTM layer outputs the corresponding tensor;
c. the output of the BiLSTM layer is input into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional neural networks, the convolution kernel size of each network is set, each dilated convolution has three layers whose dilation widths are set, the output is produced after the convolution operations, and finally the convolution results of the four iterated dilated convolutions are concatenated to form the output tensor of the encoding end;
d. the decoding end calculates the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weight is set, and the tensor of logistic regression values is obtained through this network, thereby generating the sequence labeling model.
3. The method for entity labeling of a Chinese electronic medical record based on BIC as claimed in claim 2, wherein in step c, the output of each layer is the input of the next layer, and parameters are shared among the corresponding dilated convolution layers of the four iterated dilated convolutions.
4. The method for entity annotation of Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the method for generating the sequence annotation model is as follows:
firstly, Chinese electronic medical record texts are input and divided into training batches by sentence length, each batch containing 20 texts padded with spaces to a consistent length; the batch of training texts is then converted into a tensor by the embedding layer; assuming sentences of length 21 (punctuation included), the embedding layer finally outputs a tensor of dimensions [20, 21, 120];
secondly, the [20, 21, 120]-dimensional tensor output by the embedding layer is input into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the BiLSTM layer outputs a [20, 21, 200]-dimensional tensor;
thirdly, the output of the BiLSTM layer is input into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolutional neural networks, the convolution kernel of each network is a [1, 3, 200, 100]-dimensional tensor, each dilated convolution has three layers with dilation widths [1, 1, 2], and each network outputs a [20, 21, 100]-dimensional tensor; finally the four iterated dilated convolution results are concatenated to form the encoding-end output tensor of dimensions [20, 21, 400];
fourthly, the decoding end calculates the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33]-dimensional tensor, and through this network a [20, 21, 33]-dimensional tensor of logistic regression values is obtained.
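The shape bookkeeping in the four steps above can be checked with simple arithmetic; this sketch only traces tensor dimensions and assumes nothing beyond the sizes stated in the claim:

```python
# Tracing the tensor shapes stated in the claim, step by step
# (batch of 20 sentences, padded length 21, embedding size 120).
batch, seq_len, emb = 20, 21, 120
embedded = (batch, seq_len, emb)                    # embedding layer output

hidden = 100                                        # per-direction BiLSTM size
bilstm_out = (batch, seq_len, 2 * hidden)           # forward + backward concat
assert bilstm_out == (20, 21, 200)

# Four iterated dilated-convolution branches, each with a
# [1, 3, 200, 100] kernel mapping 200 channels down to 100.
branch_out = (batch, seq_len, 100)
encoder_out = (batch, seq_len, 4 * branch_out[2])   # concatenate 4 branches
assert encoder_out == (20, 21, 400)

# Projection to the tag space: a [400, 33] weight gives per-token logits.
num_tags = 33
logits = (batch, seq_len, num_tags)
assert logits == (20, 21, 33)
print(embedded, bilstm_out, encoder_out, logits)
```

Note how the BiLSTM doubles the channel dimension (two directions) and the IDCNN quadruples its per-branch output (four branches), which is why the projection weight is [400, 33].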
5. The method for entity labeling of a Chinese electronic medical record based on BIC as claimed in claim 1, wherein in step 2), the encoding layer extracts global text features with the BiLSTM layer and local text detail features with the IDCNN layer, and the decoding layer calculates label probabilities with the CRF to obtain the best labels, comprising the following steps:
the bidirectional long-short time memory network layer comprises a forward LSTM layer and a backward LSTM layer, the input of sentences is input into the LSTM network in a one-hot single-code mode through a word embedding layer, and a sentence with the length of n can be expressed as W ═ W { (W) }1,...,wt-1,wt,wt+1,...,wn},wtdThe tth word of the sentence is a d-dimensional vector; the LSTM includes a series of circularly connected sub-networks, called memory blocks, and the information storage and control are realized by three gate structures, which are: input door itForgetting door ftAnd an output gate otThe expression is as follows:
i_t = σ(W_wi·w_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i)  (1)
f_t = σ(W_wf·w_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f)  (2)
z_t = tanh(W_wc·w_t + W_hc·h_{t-1} + b_c)  (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t  (4)
o_t = σ(W_wo·w_t + W_ho·h_{t-1} + W_co·c_t + b_o)  (5)
h_t = o_t ⊙ tanh(c_t)  (6)
In formulas (1), (2), (3) and (5), W_(·) denotes a weight and b_(·) a bias; in formulas (1), (2), (4), (5) and (6), c_t is the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 to w_t, denoted h_t^→, and the backward LSTM layer encodes w_n to w_t, denoted h_t^←. Finally, the two are combined to form the context-aware vector representation of the t-th word, h_t = [h_t^→ ; h_t^←].
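Formulas (1)-(6) and the forward/backward combination can be sketched in numpy. The random parameters, toy dimensions, and diagonal peephole weights (W_ci, W_cf, W_co applied elementwise to the cell state) are illustrative assumptions, not the patent's trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(w_t, h_prev, c_prev, P):
    """One memory-block step implementing formulas (1)-(6)."""
    i = sigmoid(P['Wwi'] @ w_t + P['Whi'] @ h_prev + P['Wci'] * c_prev + P['bi'])  # (1)
    f = sigmoid(P['Wwf'] @ w_t + P['Whf'] @ h_prev + P['Wcf'] * c_prev + P['bf'])  # (2)
    z = np.tanh(P['Wwc'] @ w_t + P['Whc'] @ h_prev + P['bc'])                      # (3)
    c = f * c_prev + i * z                                                         # (4)
    o = sigmoid(P['Wwo'] @ w_t + P['Who'] @ h_prev + P['Wco'] * c + P['bo'])       # (5)
    h = o * np.tanh(c)                                                             # (6)
    return h, c

def run_lstm(words, P, H):
    h, c = np.zeros(H), np.zeros(H)
    out = []
    for w in words:
        h, c = lstm_step(w, h, c, P)
        out.append(h)
    return out

rng = np.random.default_rng(0)
d, H, n = 4, 3, 5                       # word dim, hidden size, sentence length
def params():                           # input weights (H,d), recurrent (H,H),
    return {k: rng.normal(size=(H, d)) if k.startswith('Ww') else  # peepholes
               rng.normal(size=(H, H)) if k.startswith('Wh') else  # and biases (H,)
               rng.normal(size=H)
            for k in ['Wwi', 'Whi', 'Wci', 'bi', 'Wwf', 'Whf', 'Wcf', 'bf',
                      'Wwc', 'Whc', 'bc', 'Wwo', 'Who', 'Wco', 'bo']}

sent = [rng.normal(size=d) for _ in range(n)]
fwd = run_lstm(sent, params(), H)               # encodes w_1 .. w_t
bwd = run_lstm(sent[::-1], params(), H)[::-1]   # encodes w_n .. w_t
h_t = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(len(h_t), h_t[0].shape)                   # n vectors of size 2H
```

Concatenating the two directions is what gives each position both left and right context, matching h_t = [h_t^→ ; h_t^←].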
The iterated dilated convolutional neural network layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as its input; the dilation width grows exponentially with the number of layers while the number of parameters grows only linearly, so the receptive field quickly covers the entire input; in this model, 4 dilated convolution blocks of the same size are stacked together, each a three-layer dilated convolution with dilation widths of 1, 1 and 2 respectively; the text is input into the IDCNN layer and features are extracted through the convolution layers;
conditional random field: given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)} of N samples, where x_i is an observed sequence and y_i its corresponding label sequence, the CRF performs parameter estimation by maximizing the conditional log-likelihood:
L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)
where W is the model weight parameter and b is the model bias parameter. When the conditional log-likelihood is maximized, the optimal label sequence y* is obtained as:
y* = argmax_y p(y | x; W, b)
CN202010405161.4A 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC Pending CN111767723A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405161.4A CN111767723A (en) 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC


Publications (1)

Publication Number Publication Date
CN111767723A true CN111767723A (en) 2020-10-13

Family

ID=72719106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010405161.4A Pending CN111767723A (en) 2020-05-14 2020-05-14 Chinese electronic medical record entity labeling method based on BIC

Country Status (1)

Country Link
CN (1) CN111767723A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784997A (en) * 2021-01-22 2021-05-11 北京百度网讯科技有限公司 Annotation rechecking method, device, equipment, storage medium and program product
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
CN113553840A (en) * 2021-08-12 2021-10-26 卫宁健康科技集团股份有限公司 Text information processing method, device, equipment and storage medium
CN115081451A (en) * 2022-06-30 2022-09-20 中国电信股份有限公司 Entity identification method and device, electronic equipment and storage medium
CN116469503A (en) * 2023-04-06 2023-07-21 海军军医大学第三附属医院东方肝胆外科医院 Health data processing method and server based on big data

Citations (6)

Publication number Priority date Publication date Assignee Title
WO2018081089A1 (en) * 2016-10-26 2018-05-03 Deepmind Technologies Limited Processing text sequences using neural networks
CN109522546A (en) * 2018-10-12 2019-03-26 浙江大学 Entity recognition method is named based on context-sensitive medicine
CN110442840A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network update method, electronic health record processing method and relevant apparatus
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN110837736A (en) * 2019-11-01 2020-02-25 浙江大学 Character structure-based named entity recognition method for Chinese medical record of iterative expansion convolutional neural network-conditional random field
US20200065374A1 (en) * 2018-08-23 2020-02-27 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network


Non-Patent Citations (1)

Title
WANG Yifan; LI Guoping: "Automatic scoring method for subjective questions based on semantic similarity and named entity recognition", Electronic Measurement Technology, no. 002, 31 December 2019 (2019-12-31) *


Similar Documents

Publication Publication Date Title
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN111767723A (en) Chinese electronic medical record entity labeling method based on BIC
CN109582789B (en) Text multi-label classification method based on semantic unit information
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN108460013B (en) Sequence labeling model and method based on fine-grained word representation model
US20240177047A1 (en) Knowledge graph pre-training method based on structural context information
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110196980B (en) Domain migration on Chinese word segmentation task based on convolutional network
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN109960728B (en) Method and system for identifying named entities of open domain conference information
CN110795556A (en) Abstract generation method based on fine-grained plug-in decoding
CN111753024A (en) Public safety field-oriented multi-source heterogeneous data entity alignment method
CN111274829B (en) Sequence labeling method utilizing cross-language information
CN110222338B (en) Organization name entity identification method
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN114154504A (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN116932661A (en) Event knowledge graph construction method oriented to network security
CN110674642B (en) Semantic relation extraction method for noisy sparse text
CN111581392A (en) Automatic composition scoring calculation method based on statement communication degree
CN113609857B (en) Legal named entity recognition method and system based on cascade model and data enhancement
CN115099244A (en) Voice translation method, and method and device for training voice translation model
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN117034948A (en) Paragraph identification method, system and storage medium based on multi-feature self-adaptive fusion
CN113591493B (en) Translation model training method and translation model device
CN115358227A (en) Open domain relation joint extraction method and system based on phrase enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination