CN111767723B - BIC-based Chinese electronic medical record entity labeling method - Google Patents


Info

Publication number
CN111767723B
CN111767723B (application CN202010405161.4A)
Authority
CN
China
Prior art keywords
model
layer
data
labeling
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010405161.4A
Other languages
Chinese (zh)
Other versions
CN111767723A (en)
Inventor
滕国伟
王逸凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202010405161.4A priority Critical patent/CN111767723B/en
Publication of CN111767723A publication Critical patent/CN111767723A/en
Application granted granted Critical
Publication of CN111767723B publication Critical patent/CN111767723B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F 40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H 50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H 50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a BIC-based method for labeling entities in Chinese electronic medical records, belonging to the technical field of natural language processing and addressing the problem of identifying and labeling entities in Chinese electronic medical records. The method comprises the following steps: first, a medical entity labeling specification is drawn up according to actual needs, a small amount of data is labeled manually, and the manually labeled data is processed into the data format required by the model to form training data. Model parameters are then trained to generate a sequence labeling model comprising a bidirectional long short-term memory network, an iterated dilated convolutional neural network, and a conditional random field, the conditional random field serving as the decoding end of the model. The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data. Finally, labeling errors are audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again. The method enables automatic labeling of Chinese electronic medical record data with high accuracy.

Description

BIC-based Chinese electronic medical record entity labeling method
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a BIC-based method for labeling entities in Chinese electronic medical records.
Background
In the biomedical field, a large amount of data, such as electronic medical records, is produced every day. An electronic medical record is the digitized information (text, symbols, charts, graphs, data, images, and so on) generated by medical staff through a medical institution's information system in the course of medical activities; it enables the storage, management, transmission, and reproduction of medical records, and it contains a large number of entities of many types. At present, most research on information extraction from electronic medical records targets English records; research on Chinese electronic medical records started later, has not yet formed a clear and systematic set of research tasks, and lacks a public labeled corpus. This lack of training corpora greatly restricts research on information extraction from Chinese electronic medical records. With the wide deployment of Chinese electronic medical record systems, the number of records is growing rapidly, but effectively extracting the useful information they contain remains a research hotspot and difficulty, and constructing a labeled corpus of Chinese electronic medical records is the foundation of that research. Publicly available labeled corpora in the biomedical field are very limited, and manual labeling consumes a great deal of manpower and material resources, so reducing the manual labeling workload while ensuring the accuracy of entity recognition in the biomedical field has long been a research challenge.
Disclosure of Invention
To address the construction of a labeled corpus of Chinese electronic medical records, the invention provides a BIC-based method for labeling entities in Chinese electronic medical records. The BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); BiLSTM and IDCNN serve as the encoding end of the model, and the CRF serves as its decoding end. The method is semi-supervised: it can automatically label Chinese electronic medical record data with an accuracy reaching more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload.
In order to achieve the above purpose, the invention adopts the following technical scheme:
A new BIC-based method for labeling Chinese electronic medical record entities comprises the following specific operation steps:
1) First, a medical entity labeling specification is drawn up according to actual needs; a small amount of data is then labeled manually, and the manually labeled data is processed into the data format required by the model to form training data;
2) Model parameters are trained to generate a sequence labeling model comprising a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN as the encoding end of the model and the CRF as its decoding end;
3) The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data;
4) Labeling errors are then audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again.
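The four-step loop above can be sketched as a bootstrap procedure. In this illustrative sketch a toy character-frequency "model" stands in for the real BIC network, the manual audit is shown as a hard-coded correction, and all names and the BIO-style `SYM` tag are assumptions, not part of the patent:

```python
def train(labeled):
    """Toy stand-in 'model': memorize the majority tag seen per character."""
    from collections import Counter, defaultdict
    counts = defaultdict(Counter)
    for chars, tags in labeled:
        for ch, tag in zip(chars, tags):
            counts[ch][tag] += 1
    return {ch: cnt.most_common(1)[0][0] for ch, cnt in counts.items()}

def predict(model, chars):
    # Step 3: machine labeling; unseen characters default to "O"
    return [model.get(ch, "O") for ch in chars]

# Step 1: a small manually labeled seed set (BIO-style symptom tags)
seed = [("头痛", ["B-SYM", "I-SYM"]), ("发热", ["B-SYM", "I-SYM"])]
model = train(seed)                       # step 2: train the labeling model
machine = predict(model, "头痛发热")       # step 3: machine-label new text
corrected = ["B-SYM", "I-SYM", "B-SYM", "I-SYM"]  # step 4: manual audit
seed.append(("头痛发热", corrected))       # corrected data re-enters training
model = train(seed)                       # and the model is trained again
```

Each pass enlarges the training set with audited machine output, which is why the method's accuracy improves as the loop repeats.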
Preferably, in step 2), the sequence labeling model is generated as follows:
a. The input of the model is Chinese text, divided into training batches by text length, with 20 texts per batch; each batch of training texts is padded to a uniform length and converted into a tensor by the embedding layer;
b. The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, determining the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each consisting of three dilated convolution layers whose kernel sizes and dilation widths are set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end;
d. The decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weights are set, yielding the tensor of logistic-regression values through the network, and the sequence labeling model is thereby generated.
Preferably, in step c, there are three dilated convolution layers, the output of each layer is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers.
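Step a pads every text in a batch to a common length before embedding. A minimal sketch, where the `"<PAD>"` filler token is an assumed convention:

```python
def pad_batch(texts, pad="<PAD>"):
    """Fill every text in a batch to the batch's maximum length."""
    max_len = max(len(t) for t in texts)
    return [list(t) + [pad] * (max_len - len(t)) for t in texts]

batch = ["患者头痛", "咳嗽"]   # two texts of unequal length
padded = pad_batch(batch)      # both sequences now have length 4
```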
Preferably, in step 2), the sequence labeling model is generated as follows:
① Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch is padded to a uniform length and converted into the corresponding tensor by the embedding layer. Taking texts of length 21 (punctuation included) as an example, the embedding layer finally outputs a tensor of dimension [20, 21, 120];
② The [20, 21, 120] tensor output by the embedding layer is fed into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor;
③ The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each convolution kernel being a [1, 3, 200, 100] tensor; each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor; finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor;
④ The decoding end computes the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
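The dimensions in steps ① through ④ can be traced with placeholder tensors; the actual BiLSTM/IDCNN computations are elided here, and only the shapes stated in the text are checked:

```python
import numpy as np

embedded = np.zeros((20, 21, 120))             # ① embedding layer output
bilstm_out = np.zeros((20, 21, 200))           # ② 2 x 100 hidden units
branches = [np.zeros((20, 21, 100)) for _ in range(4)]  # ③ four IDCNN blocks
encoder_out = np.concatenate(branches, axis=-1)         # -> (20, 21, 400)
W = np.zeros((400, 33))                        # ④ decoder projection weight
logits = encoder_out @ W                       # -> (20, 21, 33) tag scores
```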
More preferably, the sequence labeling model proceeds through the following specific steps:
The input of the model is Chinese text, divided into training batches by text length with 20 texts per batch; each batch is padded to a uniform length and converted into a tensor by the embedding layer. The embedding layer combines character embeddings and word embeddings: since word segmentation may cut word boundaries incorrectly, the input is labeled at the character level, while word features are retained through the word embedding. Taking texts of length 21 (punctuation included) as an example: the character embedding converts the 21 characters of each text in a batch into a [20, 21, 100] tensor via word2vec; the word embedding segments the 21 characters of each sentence into words and converts them into a [20, 21, 20] tensor. The final output of the embedding layer is the combination of the character and word embeddings, a [20, 21, 120] tensor;
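The character and word embeddings described above are combined by concatenation along the feature axis; the zero tensors below are placeholders for the real lookup tables:

```python
import numpy as np

char_emb = np.zeros((20, 21, 100))   # 21 characters per text, 100-dim (word2vec)
word_emb = np.zeros((20, 21, 20))    # word-level features aligned per character
embedded = np.concatenate([char_emb, word_emb], axis=-1)  # -> (20, 21, 120)
```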
The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN. The BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor. The IDCNN layer combines four iterated dilated convolution blocks; each convolution kernel is of size [1, 3, 200, 100], each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor after the convolution operations. Finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor. Within each block the output of each of the three dilated convolution layers is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers;
The decoding end computes the tag corresponding to the input data with a conditional random field: the output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
Preferably, in step 2), the encoding end extracts the global features of the text with the BiLSTM layer and its local detail features with the IDCNN layer, and the decoding end computes tag probabilities with the CRF to obtain the best tag sequence. The specific steps are as follows:
The bidirectional long short-term memory (Bi-directional Long Short-Term Memory, BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are input to the LSTM network through the word embedding layer in one-hot encoded form; a sentence of length n can be expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, where w_t ∈ R^d, the t-th word of the sentence, is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:

i_t = σ(W_wi w_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (1)
f_t = σ(W_wf w_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (2)
z_t = tanh(W_wc w_t + W_hc h_{t-1} + b_c)    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t    (4)
o_t = σ(W_wo w_t + W_ho h_{t-1} + W_co c_t + b_o)    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias vector; in formulas (1), (2), (4), (5), (6), c denotes the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 through w_t, yielding the forward hidden state h→_t, and the backward LSTM layer encodes w_n through w_t, yielding the backward hidden state h←_t; the two are concatenated to form the vector representation of the t-th word, h_t = [h→_t; h←_t].
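As a concrete check on the gate structure above, the following is a minimal NumPy sketch of a single LSTM memory-block step. The diagonal (element-wise) form of the peephole weights W_ci, W_cf, W_co, the random parameters, and all dimensions are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(w_t, h_prev, c_prev, p):
    """One memory-block step: input/forget/output gates, then the
    standard cell-state and hidden-state updates."""
    i = sigmoid(p["Wwi"] @ w_t + p["Whi"] @ h_prev + p["Wci"] * c_prev + p["bi"])
    f = sigmoid(p["Wwf"] @ w_t + p["Whf"] @ h_prev + p["Wcf"] * c_prev + p["bf"])
    z = np.tanh(p["Wwc"] @ w_t + p["Whc"] @ h_prev + p["bc"])
    c = f * c_prev + i * z                    # new cell state
    o = sigmoid(p["Wwo"] @ w_t + p["Who"] @ h_prev + p["Wco"] * c + p["bo"])
    h = o * np.tanh(c)                        # new hidden state
    return h, c

d, hid = 4, 3   # toy input and hidden sizes
rng = np.random.default_rng(0)
p = {k: rng.standard_normal((hid, d)) for k in ("Wwi", "Wwf", "Wwc", "Wwo")}
p.update({k: rng.standard_normal((hid, hid)) for k in ("Whi", "Whf", "Whc", "Who")})
p.update({k: rng.standard_normal(hid)
          for k in ("Wci", "Wcf", "Wco", "bi", "bf", "bc", "bo")})
h, c = lstm_step(rng.standard_normal(d), np.zeros(hid), np.zeros(hid), p)
```

Running the step forward over w_1..w_t and backward over w_n..w_t and concatenating the two hidden states gives the BiLSTM representation of the t-th word.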
The iterated dilated convolutional neural network (Iterated Dilated Convolution, IDCNN) layer repeatedly applies the same small stack of dilated convolutions, taking the result of the previous application as input in each iteration. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input. The model stacks four dilated convolution blocks of identical size, each block comprising three dilated convolution layers with dilation widths 1, 1, and 2. The text is fed into the IDCNN layer and features are extracted through the convolution layers.
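The receptive-field growth can be made explicit with simple arithmetic: each dilated convolution layer of kernel width 3 and dilation d widens the receptive field by 2·d, and the same three-layer block (dilations 1, 1, 2) is iterated four times. The kernel width of 3 follows the [1, 3, 200, 100] kernel given elsewhere in the text:

```python
def receptive_field(dilations, kernel=3, repeats=1):
    """Receptive field of `repeats` applications of a dilated-conv stack."""
    rf = 1
    for _ in range(repeats):
        for d in dilations:
            rf += (kernel - 1) * d  # each layer widens the field by 2*d
    return rf

one_block = receptive_field([1, 1, 2])               # 1 + 2 + 2 + 4 = 9
four_blocks = receptive_field([1, 1, 2], repeats=4)  # grows to 33 tokens
```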
The conditional random field (CRF): given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a tag sequence y_i, the CRF estimates its parameters by maximizing the conditional log-likelihood, i.e.:

L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal tag sequence obtained for an input x is y*:

y* = argmax_y p(y | x; W, b)
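Finding y* for a linear-chain CRF is typically done with the Viterbi algorithm; the sketch below is an illustrative NumPy implementation over assumed emission scores (the decoder's logistic-regression values) and transition scores, not the patent's exact decoder:

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path for one sentence.

    emissions: (T, K) per-token tag scores; transitions: (K, K) scores
    for moving from tag i to tag j. Both are illustrative stand-ins."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)   # best previous tag per current tag
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow back-pointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]

em = np.array([[2.0, 0.0], [0.0, 1.0], [1.5, 0.2]])
tr = np.zeros((2, 2))        # zero transitions: reduces to per-token argmax
best = viterbi(em, tr)
```

With non-zero transition scores the CRF can, for example, penalize an I- tag that does not follow a matching B- tag, which is exactly what per-token classification cannot do.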
Compared with the prior art, the invention has the following advantages:
1) The invention provides a new entity labeling model for Chinese electronic medical records that uses BiLSTM-IDCNN for feature construction and selection and then decodes the resulting features with a CRF. Deep learning and machine learning thereby complement each other, and the resulting model is theoretically expected to perform well;
2) The invention provides a method for labeling Chinese electronic medical record data based on the deep learning model BIC. The method is semi-supervised and can automatically label Chinese electronic medical record data with an accuracy reaching more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload;
3) Comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model achieves good entity recognition both in the general domain and in the specific biomedical domain.
Drawings
FIG. 1 is a flow chart of the BIC-based Chinese electronic medical record entity labeling method of the invention.
FIG. 2 is a sequence annotation model according to an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
The foregoing aspects are further described in conjunction with specific embodiments, and the following detailed description of preferred embodiments of the present invention is provided:
Example 1
In this embodiment, referring to FIG. 1, a BIC-based method for labeling Chinese electronic medical record entities comprises the following steps:
1) First, a medical entity labeling specification is drawn up according to actual needs; a small amount of data is then labeled manually, and the manually labeled data is processed into the data format required by the model to form training data;
2) Model parameters are trained to generate a sequence labeling model comprising a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN as the encoding end of the model and the CRF as its decoding end;
3) The data to be labeled is fed into the sequence labeling model, whose output yields machine-labeled data;
4) Labeling errors are then audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again.
The method of this embodiment uses BiLSTM-IDCNN for feature construction and selection and then decodes the resulting features with a CRF. Deep learning and machine learning thereby complement each other, with the potential to greatly reduce the manual labeling workload.
Example two
This embodiment is substantially the same as the first; its particular features are as follows:
In this embodiment, referring to FIG. 1, in step 2) the sequence labeling model is generated as follows:
a. The input of the model is Chinese text, divided into training batches by text length with 20 texts per batch; each batch of training texts is padded to a uniform length and converted into a tensor by the embedding layer;
b. The tensor produced by the embedding layer is processed by the encoding end, which combines BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, determining the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features of the text; the IDCNN layer combines four iterated dilated convolution blocks, each consisting of three dilated convolution layers whose kernel sizes and dilation widths are set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end; within each block the output of each dilated convolution layer is the input of the next, and the four iterated blocks share parameters between corresponding dilated convolution layers;
d. The decoding end computes the tag corresponding to the input data with a conditional random field; the output of the encoding end first passes through a neural network whose weights are set, yielding the tensor of logistic-regression values, and the sequence labeling model is thereby generated.
The BIC model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF); BiLSTM and IDCNN serve as the encoding end of the model and the CRF as its decoding end. Deep learning and machine learning thereby complement each other, with the potential to greatly reduce the manual labeling workload.
Example III
This embodiment is substantially the same as the preceding embodiments; its particular features are as follows:
In this embodiment, the flow of the BIC-based Chinese electronic medical record entity labeling method is shown in FIG. 1: a medical entity labeling specification is drawn up according to actual needs, a small amount of data is labeled manually, and the manually labeled data is processed into the data format required by the model to form training data. Model parameters are then trained to generate a sequence labeling model. The data to be labeled is fed into the sequence labeling model, and its output yields machine-labeled data; labeling errors are audited and corrected manually, the corrected data is processed into the training data required by the model, and the model is trained again. Since a deep learning model improves as the amount of data grows, the entities labeled by the method become increasingly accurate.
As shown in FIG. 2, the new method for labeling electronic medical record data based on the deep learning model BIC proceeds through the following specific steps:
1) Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch is padded to a uniform length and converted into a tensor by the embedding layer. Taking texts of length 21 (punctuation included) as an example, the embedding layer finally outputs a tensor of dimension [20, 21, 120];
2) The [20, 21, 120] tensor output by the embedding layer is fed into the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2 × 100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200] tensor.
3) The output of the BiLSTM layer is fed into the IDCNN layer to extract local detail features. The IDCNN layer combines four iterated dilated convolution blocks; each convolution kernel is a [1, 3, 200, 100] tensor, each block consists of three dilated convolution layers with dilation widths 1, 1, and 2, and each block outputs a [20, 21, 100] tensor; finally, the four iterated dilated convolution results are concatenated to form the encoding end's output, a [20, 21, 400] tensor.
4) The decoding end computes the tag corresponding to the input data with the CRF. The output of the encoding end first passes through a neural network whose weight is a [400, 33] tensor, yielding logistic-regression values as a [20, 21, 33] tensor.
In this embodiment, the bidirectional long short-term memory (Bi-directional Long Short-Term Memory, BiLSTM) layer comprises a forward LSTM layer and a backward LSTM layer. Sentences are input to the LSTM network through the word embedding layer in one-hot encoded form; a sentence of length n can be expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, where w_t ∈ R^d, the t-th word of the sentence, is a d-dimensional vector. The LSTM consists of a series of recurrently connected sub-networks called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:

i_t = σ(W_wi w_t + W_hi h_{t-1} + W_ci c_{t-1} + b_i)    (1)
f_t = σ(W_wf w_t + W_hf h_{t-1} + W_cf c_{t-1} + b_f)    (2)
z_t = tanh(W_wc w_t + W_hc h_{t-1} + b_c)    (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t    (4)
o_t = σ(W_wo w_t + W_ho h_{t-1} + W_co c_t + b_o)    (5)
h_t = o_t ⊙ tanh(c_t)    (6)

In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias vector; in formulas (1), (2), (4), (5), (6), c denotes the cell state vector. For each word w_t, context information is captured from both directions: the forward LSTM layer encodes w_1 through w_t, yielding the forward hidden state h→_t, and the backward LSTM layer encodes w_n through w_t, yielding the backward hidden state h←_t; the two are concatenated to form the vector representation of the t-th word, h_t = [h→_t; h←_t].
The iterated dilated convolutional neural network (Iterated Dilated Convolution, IDCNN) layer repeatedly applies the same small stack of dilated convolutions, taking the result of the previous application as input in each iteration. The dilation width grows exponentially with depth while the number of parameters grows only linearly, so the receptive field quickly covers all of the input. The model stacks four dilated convolution blocks of identical size, each block being a three-layer dilated convolution with dilation widths 1, 1, and 2. The text is fed into the IDCNN layer, passes through the convolution layers, and features are extracted.
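The linear parameter growth claimed above follows from parameter sharing: reapplying one dilated block four times quadruples the effective depth without adding new weights. The count below follows the [width=3, in=200, out=100] kernel given in the text; the 100→100 channel sizes of the second and third layers are an assumption, since the text specifies only one kernel shape:

```python
def conv_params(width, c_in, c_out):
    """Parameter count of one 1-D conv layer: weights plus biases."""
    return width * c_in * c_out + c_out

# one block = three dilated conv layers (200->100, 100->100, 100->100)
block = conv_params(3, 200, 100) + 2 * conv_params(3, 100, 100)
shared_total = block          # iterated 4x with shared weights
unshared_total = 4 * block    # what four independent blocks would cost
```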
The conditional random field (CRF): given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a tag sequence y_i, the CRF estimates its parameters by maximizing the conditional log-likelihood, i.e.:

L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)

where W is the model weight parameter and b the model bias parameter. When the conditional likelihood is maximized, the optimal tag sequence obtained for an input x is y*:

y* = argmax_y p(y | x; W, b)
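The conditional log-likelihood above can be worked through on a tiny example: a linear-chain sequence score is the sum of emission and transition scores, and the probability normalizes over all tag sequences. The scores below are illustrative stand-ins; enumeration of the partition function is feasible only because T = 2 and K = 2:

```python
import numpy as np
from itertools import product

def seq_score(em, tr, y):
    """Linear-chain score: emissions along y plus tag-to-tag transitions."""
    s = em[0, y[0]]
    for t in range(1, len(y)):
        s += tr[y[t - 1], y[t]] + em[t, y[t]]
    return s

em = np.array([[1.0, 0.0], [0.0, 1.0]])   # (T=2, K=2) emission scores
tr = np.array([[0.5, 0.0], [0.0, 0.5]])   # transition scores
scores = {y: seq_score(em, tr, y) for y in product(range(2), repeat=2)}
logZ = np.log(sum(np.exp(s) for s in scores.values()))  # partition function
loglik = scores[(0, 1)] - logZ            # log p(y = (0,1) | x)
```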
Comparison against models such as BiLSTM-CRF and IDCNN-CRF shows that the BiLSTM-IDCNN-CRF model of this embodiment achieves good entity recognition both in the general domain and in the specific biomedical domain. The method for labeling Chinese electronic medical record data based on the deep learning model BIC is semi-supervised, can automatically label Chinese electronic medical record data, and reaches an accuracy of more than 73.5% of that of manual labeling. After manual review, labeled data of more than 95% manual-labeling quality can be obtained, greatly reducing the manual labeling workload.
The embodiments of the present invention are described above with reference to the accompanying drawings, but the invention is not limited to the above embodiments. Various changes, modifications, substitutions, combinations, or simplifications made within the spirit and principles of the technical scheme of the invention, so long as they meet the purpose of the invention and do not depart from the technical principle and inventive concept of the BIC-based Chinese electronic medical record entity labeling method, all fall within the scope of the invention.

Claims (4)

1. A Chinese electronic medical record entity labeling method based on BIC is characterized by comprising the following specific operation steps:
1) Firstly, establishing a medical entity labeling specification according to actual needs, manually labeling a small amount of data, and processing the manually labeled data into the data format required by the model to form training data;
2) Then, training the model parameters to generate a sequence labeling model, wherein the model comprises a bidirectional long short-term memory network (BiLSTM), an iterated dilated convolutional neural network (IDCNN), and a conditional random field (CRF), with BiLSTM and IDCNN serving as the encoding end of the model and CRF as its decoding end;
3) Inputting the data to be labeled into the sequence labeling model, whose output is the machine-labeled data;
4) Then, manually reviewing the output and correcting any labeling errors, obtaining the training data required by the model through the data processing operation, and training the model again;
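Purely as an illustration of the annotate-review-retrain loop in steps 1)-4), the sketch below wires the stages together; `train_model`, `machine_label`, and `human_review` are hypothetical toy stand-ins for the actual BiLSTM-IDCNN-CRF training, inference, and manual-review stages described in the claim:

```python
def train_model(data):
    # hypothetical stand-in: the "model" is just the (text, tag) pairs seen so far
    return dict(data)

def machine_label(model, pool):
    # hypothetical stand-in: tag each text with the tag seen in training, else "O"
    return [(text, model.get(text, "O")) for text in pool]

def human_review(drafts):
    # hypothetical stand-in: the reviewer accepts all machine labels unchanged
    return drafts

def bootstrap_labeling(seed_data, unlabeled_pool, rounds=2):
    training_data = list(seed_data)                    # step 1: small hand-labeled set
    for _ in range(rounds):
        model = train_model(training_data)             # step 2: fit the sequence labeler
        drafts = machine_label(model, unlabeled_pool)  # step 3: auto-label new data
        training_data.extend(human_review(drafts))     # step 4: review, grow, retrain
    return train_model(training_data)
```

The point of the loop is that each round enlarges the training set with reviewed machine output, so the manual effort per round shrinks as the model improves.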
In the step 2), the method for generating the sequence labeling model is as follows:
a. The input of the model is Chinese text; the texts are divided into training batches by length, with 20 texts per batch; each batch of training texts is padded to a consistent length and converted into a tensor by an embedding layer;
b. The tensor that the embedding layer produces for the input data is processed by the encoding end, which is formed by combining BiLSTM and IDCNN; the number of neurons in the BiLSTM hidden layer is set, which determines the tensor output by the BiLSTM layer;
c. The output of the BiLSTM layer is input into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterated dilated (hole) convolution blocks; the convolution kernel size of each dilated convolution is set, each block comprises three layers, and the dilation width of each layer is set; after the convolution operations, the four iterated dilated convolution results are concatenated to form the output tensor of the encoding end;
d. The decoding end calculates the tag corresponding to the input data with the conditional random field; the output of the encoding end first passes through a neural network whose weights are set, and the tensor of logit values obtained through this network is used to generate the sequence labeling model.
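The batching-and-padding of step a can be sketched as follows (the `<PAD>` symbol and per-character tokenization are assumptions for illustration; the claim itself only fixes the batch size of 20):

```python
def pad_batch(texts, pad="<PAD>"):
    """Pad every text in a batch to the length of the longest one."""
    max_len = max(len(t) for t in texts)
    return [list(t) + [pad] * (max_len - len(t)) for t in texts]

# two texts of unequal length come out with a consistent length
batch = pad_batch(["咳嗽三天", "头痛"])
```

Padding to a uniform length is what lets the embedding layer emit one rectangular tensor per batch in the next step.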
2. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein: in the step c, the dilated convolution comprises three layers, the output of each layer is the input of the next layer, and parameters are shared among the identical dilated convolution layers of the four iterations.
3. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein in the step 2), the method for generating the sequence labeling model is as follows:
① Firstly, the Chinese electronic medical record texts are input and divided into training batches by sentence length, with 20 texts per batch; each batch of training texts is padded to a consistent length and then converted into tensors by the embedding layer; for texts of length 21 (including punctuation marks), the embedding layer finally outputs a tensor of dimension [20, 21, 120];
② The [20, 21, 120]-dimensional tensor output by the embedding layer is input to the BiLSTM layer to extract global features; the BiLSTM hidden layer has 2×100 = 200 neurons, so the output of the BiLSTM layer is a [20, 21, 200]-dimensional tensor;
③ The output of the BiLSTM layer is input into the IDCNN layer to extract the local detail features of the text; the IDCNN layer is formed by combining four iterated dilated convolution blocks; each dilated convolution kernel is a [1, 3, 200, 100]-dimensional tensor, each block comprises three dilated convolution layers with dilation widths of 1, 1, and 2, and each block outputs a [20, 21, 100]-dimensional tensor; finally the four iterated dilated convolution results are concatenated to form the [20, 21, 400]-dimensional output tensor of the encoding end;
④ The decoding end calculates the tag corresponding to the input data with the CRF; the output of the encoding end first passes through a neural network whose weight is a [400, 33]-dimensional tensor, and the [20, 21, 33]-dimensional tensor of logit values is obtained through this network.
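The dimension bookkeeping in steps ①-④ can be checked with a small numpy sketch; random arrays stand in for the real embedding, BiLSTM, and IDCNN outputs, so only the shapes are meaningful:

```python
import numpy as np

batch, seq_len = 20, 21
x = np.random.rand(batch, seq_len, 120)      # ① embedding output, [20, 21, 120]

# ② BiLSTM: 100 units per direction, concatenated -> 200 features
h = np.random.rand(batch, seq_len, 200)      # stand-in for the BiLSTM output

# ③ IDCNN: four iterated dilated-convolution blocks, each yielding 100
#    features, concatenated along the feature axis -> 400 features
blocks = [np.random.rand(batch, seq_len, 100) for _ in range(4)]
enc = np.concatenate(blocks, axis=-1)        # [20, 21, 400]

# ④ projection to the 33 tag logits consumed by the CRF
W = np.random.rand(400, 33)
logits = enc @ W                             # [20, 21, 33]
```

The final matmul broadcasts over the batch and sequence axes, which is exactly the [400, 33] projection named in step ④.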
4. The BIC-based Chinese electronic medical record entity labeling method according to claim 1, wherein in the step 2), the encoding layer extracts the global features of the text with the BiLSTM layer and the local detail features of the text with the IDCNN layer, and the decoding layer calculates the label probabilities with the CRF to obtain the best labels; the specific steps are as follows:
The bidirectional long short-term memory network layer comprises a forward LSTM layer and a backward LSTM layer; a sentence is input into the LSTM network in one-hot encoding through a word embedding layer, and a sentence of length n is expressed as W = {w_1, ..., w_{t-1}, w_t, w_{t+1}, ..., w_n}, w_t ∈ R^d, that is, the t-th word of the sentence is a d-dimensional vector; the LSTM comprises a series of sub-networks connected in a loop, called memory blocks, which store and control information through three gate structures, namely the input gate i_t, the forget gate f_t, and the output gate o_t, whose expressions are as follows:
i_t = δ(W_wi·w_t + W_hi·h_{t-1} + W_ci·c_{t-1} + b_i) (1)
f_t = δ(W_wf·w_t + W_hf·h_{t-1} + W_cf·c_{t-1} + b_f) (2)
z_t = tanh(W_wc·w_t + W_hc·h_{t-1} + b_c) (3)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ z_t (4)
o_t = δ(W_wo·w_t + W_ho·h_{t-1} + W_co·c_t + b_o) (5)
h_t = o_t ⊙ tanh(c_t) (6)
In formulas (1), (2), (3), (5), W_(·) denotes a weight matrix and b a bias value; in formulas (1), (2), (4), (5), (6), c is the cell state vector. For each word w_t the context information is captured as follows: the forward LSTM layer encodes w_1 to w_t, producing the forward hidden state h_t^f; the backward LSTM layer encodes w_n to w_t, producing the backward hidden state h_t^b; finally the context information is combined to form the vector representation of the t-th word, h_t = [h_t^f; h_t^b].
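Gates (1)-(6) can be written out directly in numpy. The following is an illustrative sketch with tiny, randomly initialised dimensions (not the claimed 100-unit configuration); the peephole terms W_ci, W_cf, W_co follow the formulas above:

```python
import numpy as np

def lstm_step(w_t, h_prev, c_prev, P):
    """One LSTM step implementing gates (1)-(6); P holds the weight
    matrices W_* and bias vectors b_*."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    i = sigmoid(P["Wwi"] @ w_t + P["Whi"] @ h_prev + P["Wci"] @ c_prev + P["bi"])  # (1)
    f = sigmoid(P["Wwf"] @ w_t + P["Whf"] @ h_prev + P["Wcf"] @ c_prev + P["bf"])  # (2)
    z = np.tanh(P["Wwc"] @ w_t + P["Whc"] @ h_prev + P["bc"])                      # (3)
    c = f * c_prev + i * z                                                         # (4)
    o = sigmoid(P["Wwo"] @ w_t + P["Who"] @ h_prev + P["Wco"] @ c + P["bo"])       # (5)
    h = o * np.tanh(c)                                                             # (6)
    return h, c

d, u = 4, 3  # word-vector size and hidden size, chosen small for illustration
rng = np.random.default_rng(0)
P = {k: rng.standard_normal((u, d if k.startswith("Ww") else u))
     for k in ["Wwi", "Whi", "Wci", "Wwf", "Whf", "Wcf",
               "Wwc", "Whc", "Wwo", "Who", "Wco"]}
P.update({k: rng.standard_normal(u) for k in ["bi", "bf", "bc", "bo"]})
h, c = lstm_step(rng.standard_normal(d), np.zeros(u), np.zeros(u), P)
```

Running the step over w_1..w_t and over w_n..w_t with a second parameter set gives h_t^f and h_t^b, whose concatenation is the bidirectional representation described above.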
The iterated dilated convolution neural network layer repeatedly applies the same small stack of dilated convolution blocks, each iteration taking the result of the previous dilated convolution as its input. The dilation width grows exponentially with the number of layers while the number of parameters grows only linearly, so the receptive field quickly covers all of the input data. The model stacks 4 dilated convolution blocks of the same size, each block consisting of three dilated convolution layers with dilation widths of 1, 1, and 2 respectively. The text is input into the IDCNN layer and features are extracted through the convolution layers.
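The receptive-field growth claimed above can be verified with simple arithmetic: a width-k convolution with dilation d widens the receptive field by (k-1)·d, so for the width-3 kernels used here:

```python
def receptive_field(dilations, kernel=3):
    """Receptive field of a stack of 1-D convolutions with the given dilations."""
    field = 1
    for d in dilations:
        field += (kernel - 1) * d   # each layer widens the field by (k-1)*d
    return field

block = [1, 1, 2]                   # one dilated-convolution block
one_block = receptive_field(block)        # a single block spans 9 positions
four_blocks = receptive_field(block * 4)  # four iterations span 33 positions
```

Four iterations already cover 33 positions, wider than the 21-token sentences in the claims, which is the sense in which the receptive field "quickly covers all input data" while the four iterations share one block's parameters.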
The conditional random field: given a training dataset D = {(x_1, y_1), ..., (x_N, y_N)}, where each observed sequence x_i of the N samples corresponds to a label sequence y_i, the CRF performs parameter estimation by maximizing the conditional log-likelihood, i.e.:
L(W, b) = Σ_{i=1}^{N} log p(y_i | x_i; W, b)
wherein W is the model parameter weight and b is the model parameter bias; when the conditional probability likelihood function is maximized, the obtained optimal label sequence for an input x is y*:
y* = argmax_y p(y | x; W, b)
CN202010405161.4A 2020-05-14 2020-05-14 BIC-based Chinese electronic medical record entity labeling method Active CN111767723B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010405161.4A CN111767723B (en) 2020-05-14 2020-05-14 BIC-based Chinese electronic medical record entity labeling method

Publications (2)

Publication Number Publication Date
CN111767723A CN111767723A (en) 2020-10-13
CN111767723B true CN111767723B (en) 2024-07-19

Family

ID=72719106

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112784997B (en) * 2021-01-22 2023-11-10 北京百度网讯科技有限公司 Annotation rechecking method, device, equipment, storage medium and program product
CN112802570A (en) * 2021-02-07 2021-05-14 成都延华西部健康医疗信息产业研究院有限公司 Named entity recognition system and method for electronic medical record
CN115081451A (en) * 2022-06-30 2022-09-20 中国电信股份有限公司 Entity identification method and device, electronic equipment and storage medium
CN116469503B (en) * 2023-04-06 2024-06-21 海军军医大学第三附属医院东方肝胆外科医院 Health data processing method and server based on big data

Citations (2)

Publication number Priority date Publication date Assignee Title
CN109522546A (en) * 2018-10-12 2019-03-26 浙江大学 Context-sensitive medical named entity recognition method
CN110442840A (en) * 2019-07-11 2019-11-12 新华三大数据技术有限公司 Sequence labelling network update method, electronic health record processing method and relevant apparatus

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
KR102359216B1 (en) * 2016-10-26 2022-02-07 딥마인드 테크놀로지스 리미티드 Text Sequence Processing Using Neural Networks
US11574122B2 (en) * 2018-08-23 2023-02-07 Shenzhen Keya Medical Technology Corporation Method and system for joint named entity recognition and relation extraction using convolutional neural network
CN110705293A (en) * 2019-08-23 2020-01-17 中国科学院苏州生物医学工程技术研究所 Electronic medical record text named entity recognition method based on pre-training language model
CN110837736B (en) * 2019-11-01 2021-08-10 浙江大学 Named entity recognition method of Chinese medical record based on word structure


Similar Documents

Publication Publication Date Title
CN111767723B (en) BIC-based Chinese electronic medical record entity labeling method
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN110019839B (en) Medical knowledge graph construction method and system based on neural network and remote supervision
CN110059185B (en) Medical document professional vocabulary automatic labeling method
CN111310471B (en) Travel named entity identification method based on BBLC model
CN107330032B (en) Implicit discourse relation analysis method based on recurrent neural network
CN113128229B (en) Chinese entity relation joint extraction method
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
CN110597997B (en) Military scenario text event extraction corpus iterative construction method and device
CN105938485A (en) Image description method based on convolution cyclic hybrid model
CN111241816A (en) Automatic news headline generation method
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN110196980A (en) A kind of field migration based on convolutional network in Chinese word segmentation task
CN112507190B (en) Method and system for extracting keywords of financial and economic news
CN112749253B (en) Multi-text abstract generation method based on text relation graph
CN112417854A (en) Chinese document abstraction type abstract method
CN111274804A (en) Case information extraction method based on named entity recognition
CN114154504B (en) Chinese named entity recognition algorithm based on multi-information enhancement
CN110222338B (en) Organization name entity identification method
CN112364623A (en) Bi-LSTM-CRF-based three-in-one word notation Chinese lexical analysis method
CN110826298B (en) Statement coding method used in intelligent auxiliary password-fixing system
CN115658898A (en) Chinese and English book entity relation extraction method, system and equipment
CN113609840B (en) Chinese law judgment abstract generation method and system
CN113158659B (en) Case-related property calculation method based on judicial text
CN113901813A (en) Event extraction method based on topic features and implicit sentence structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant