CN109670179B

CN109670179B - Medical record text named entity identification method based on iterative expansion convolutional neural network

Info

Publication number: CN109670179B
Application number: CN201811563980.0A
Authority: CN
Inventors: 田珂珂; 印鉴; 高静
Original assignee: Guangdong Hengdian Information Technology Co ltd; Sun Yat Sen University
Current assignee: Guangdong Hengdian Information Technology Co ltd; Sun Yat Sen University
Priority date: 2018-12-20
Filing date: 2018-12-20
Publication date: 2022-11-11
Anticipated expiration: 2038-12-20
Also published as: CN109670179A

Abstract

The invention provides a medical record text named entity recognition method based on an iterative expansion (iterated dilated) convolutional neural network. Named entity recognition is performed on the CCKS2017 medical electronic medical record data set: a passage of Chinese electronic medical record text is input, an iterative expansion convolutional neural network combined with a conditional random field serves as the model structure, Chinese character components (radicals) are used as features, and named entities such as disease names and examination methods are extracted from the text.

Description

Medical record text named entity identification method based on iterative expansion convolutional neural network

Technical Field

The invention relates to the fields related to natural language processing and medical treatment, in particular to a medical history text named entity identification method based on an iterative expansion convolutional neural network.

Background

In recent years, with the development of big data and computer technology, more and more medical institutions start to adopt electronic medical record systems. The electronic medical record system is special medical software. The hospital records the information of the patient's visit in an electronic way through the electronic medical record, including: medical history, history of the disease, examination results, medical orders, surgical records, nursing records, and the like, wherein the medical records include structured information, unstructured free text, and graphical image information.

With the development of artificial intelligence technology, many groups have tried to apply it in the medical field as an auxiliary medical tool. Electronic medical records are an important kind of medical data and contain a large amount of unstructured text. Analyzing this unstructured text is the basis for computers to understand and make use of medical records. Once medical records are structured, the relations and probabilities among knowledge points such as symptoms, diseases, medicines and examinations can be calculated, a knowledge graph of the medical field can be constructed, and the work of doctors can be further optimized.

An important means of structuring medical record text is named entity recognition: given a piece of medical text, medical entities of specified types are extracted and classified into predefined categories, including symptoms, body parts, treatments, diseases, examination items, and the like. For example, in "the patient's neck and shoulder pain symptoms were significantly reduced by combination therapy", the medical entities include "neck and shoulder" (body part) and "pain" (symptom).

Named entity recognition in the medical field differs from the general field mainly in two respects: (1) The medical field has many technical terms and rare characters, such as "loratadine tablets", which current Chinese word segmentation tools cannot segment well, and this degrades subsequent recognition. (2) Some entities have long names, such as "cerebroprotein hydrolysate to nourish brain cells" (treatment), and some models have difficulty capturing such long-range context dependence.

To address the first problem, since existing word segmentation tools perform poorly on medical text, word segmentation is omitted and the model operates directly on Chinese characters. This prevents segmentation errors from propagating to other parts of the model, and it also shrinks the model vocabulary, which reduces the number of parameters and helps avoid overfitting. In addition, medical record text contains many characters that share specific components. For example, characters for human organs such as chest (胸), liver (肝), spleen (脾) and lung (肺) all contain the radical "月", and characters with the radical "疒", such as those for cancer, treatment, hemorrhoid and phlegm, are all related to diseases or symptoms. The radicals are therefore input into the model as features to alleviate problems such as rare characters. To address the second problem, a dilated convolutional neural network is used so that the model can read long-distance context without making the convolution kernel too large. In summary, a medical record named entity recognition method based on a dilated convolutional neural network and Chinese character component features is provided.
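As an illustration of pairing each character with its radical, the following minimal Python sketch uses a small hand-written radical table. The table `RADICALS`, the helper names and the `<unk>` placeholder are assumptions made for this example and are not taken from the patent; a real system would need a complete radical dictionary.

```python
# Hypothetical, hand-written radical table covering only a few characters
# for organs and diseases; a real system would use a full radical dictionary.
RADICALS = {
    "胸": "月", "肝": "月", "脾": "月", "肺": "月",   # body organs
    "癌": "疒", "痔": "疒", "痰": "疒",              # diseases / symptoms
}
UNKNOWN = "<unk>"


def char_radical_pairs(text):
    """Split text into single characters and pair each with its radical."""
    return [(ch, RADICALS.get(ch, UNKNOWN)) for ch in text]


def build_vocab(symbols):
    """Map each distinct symbol to an integer id, reserving 0 for unknown."""
    vocab = {UNKNOWN: 0}
    for s in symbols:
        vocab.setdefault(s, len(vocab))
    return vocab


pairs = char_radical_pairs("肝癌")
char_vocab = build_vocab(ch for ch, _ in pairs)
radical_vocab = build_vocab(r for _, r in pairs)

# Each character becomes (character id, radical id), ready for two
# separate embedding lookups whose vectors can then be combined.
ids = [(char_vocab[ch], radical_vocab[r]) for ch, r in pairs]
print(pairs)
print(ids)
```

Keeping separate vocabularies for characters and radicals lets a rare character still share a radical id with more frequent characters, which is the motivation given above for the radical feature.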

Disclosure of Invention

The invention provides a medical record text named entity identification method for extracting named entities in texts based on an iterative expansion convolutional neural network.

In order to achieve the technical effects, the technical scheme of the invention is as follows:

A medical record text named entity recognition method based on an iterative expansion convolutional neural network comprises the following steps:

S1: establishing a model consisting of an iterative expansion convolutional neural network and a conditional random field for named entity recognition;

S2: establishing a loss function of the model;

S3: training the model and testing it on the test set.

Further, the specific process of step S1 is:

S11: constructing the Embedding layer. The model needs to process text, but characters cannot be processed by the model directly, so they are first converted into vector representations by an Embedding layer; these include a vector representation of each character and a vector representation of its component (radical);

S12: constructing the iterative expansion convolutional neural network for feature extraction. The network comprises four expansion (dilated) convolution layers with expansion radii (dilation rates) of 1, 2, 3 and 3 respectively; each layer has 100 convolution kernels of width 3. The output of the last layer is fed back into the first layer, which is what "iteration" refers to, and this is repeated for 4 iterations;

S13: constructing the conditional random field model. The features extracted in the previous step are taken as the input of the conditional random field, which outputs a sequence label for each character, marking whether the character is part of an entity and, if so, whether it is the beginning, middle or end of the entity and what the entity type is.
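The effect of the layer configuration in S12 can be illustrated with a small Python sketch. It is a deliberately simplified scalar version: it uses one shared kernel instead of 100 kernels per layer, zero padding, and no non-linearity, all of which are assumptions of this example rather than details from the patent. The function names are hypothetical.

```python
def dilated_conv1d(seq, kernel, dilation):
    """1-D dilated convolution over a list of floats with zero padding.

    The output has the same length as the input; the kernel is centred on
    each position and its taps are `dilation` steps apart.
    """
    half = len(kernel) // 2
    out = []
    for t in range(len(seq)):
        total = 0.0
        for k, w in enumerate(kernel):
            pos = t + (k - half) * dilation
            if 0 <= pos < len(seq):  # positions outside the text count as zero
                total += w * seq[pos]
        out.append(total)
    return out


def receptive_field(dilations, width, iterations):
    """Number of input positions that can influence one output position."""
    per_block = sum((width - 1) * d for d in dilations)
    return 1 + iterations * per_block


DILATIONS = [1, 2, 3, 3]  # expansion radii from S12
WIDTH = 3                 # kernel width from S12
ITERATIONS = 4            # number of iterations from S12


def idcnn_scalar(seq, kernel):
    """Apply the same block of dilated layers repeatedly, feeding the
    output of the last layer back into the first."""
    for _ in range(ITERATIONS):
        for d in DILATIONS:
            seq = dilated_conv1d(seq, kernel, d)
    return seq


print(receptive_field(DILATIONS, WIDTH, ITERATIONS))  # 73
```

With kernels of width 3 only, the stacked and iterated layers let each output position see 73 input characters, which is how the dilated design reads long-distance context without a large convolution kernel.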

Further, the specific process of step S2 is:

S21: the loss function is the negative log-likelihood function, which compares the score of the target label sequence with the scores of all possible label sequences:

wherein s(x, y) is a scoring function;

S22: the score is calculated from two parts: (1) the transition score Ai,j between labels, i.e. the score of moving from label i to label j; (2) the label score Pm,n of a character, i.e. the score of assigning label n to a given character m, that is:
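The equations referred to in S21 and S22 are not reproduced in this text. A standard linear-chain conditional random field formulation that is consistent with the description is given below as an assumption, not as the patent's own wording. Here x is an input of n characters, y is a label sequence, and y' ranges over all possible label sequences; start and end transitions, which some formulations add, are omitted.

```latex
% Assumed standard linear-chain CRF form; the patent's own equations
% are not reproduced in this text.
s(x, y) = \sum_{i=1}^{n-1} A_{y_i,\, y_{i+1}} + \sum_{i=1}^{n} P_{i,\, y_i}

\mathcal{L}(x, y) = -\log \frac{\exp\bigl(s(x, y)\bigr)}{\sum_{y'} \exp\bigl(s(x, y')\bigr)}
                  = -s(x, y) + \log \sum_{y'} \exp\bigl(s(x, y')\bigr)
```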

Further, the specific process of step S3 is as follows:

S31: splitting the input text into single characters, obtaining the radical of each character, obtaining the vector representations of the characters and radicals through the Embedding layer, inputting the vector representations into the iterative expansion convolutional neural network to extract features, and inputting the extracted features into the conditional random field to obtain the final labels;

S32: comparing the predicted labels with the gold-standard labels, calculating the loss function as in step S2, optimizing the loss function with the Adam optimizer, and updating the model parameters;

S33: dividing the data set into a training set and a test set according to the proportion of 9. At test time, a corresponding label is output for each character of the test sample;

S34: repeating S31-S33 to perform 5 rounds of cross-validation, evaluating on the test set each time, measuring the effect of the model with metrics such as precision, recall and F1 value, and taking the average over the 5 rounds as the final result.
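The loss evaluated in step S32 and the label prediction of step S31 can be sketched in plain Python as follows. This is a minimal sketch of a standard linear-chain conditional random field, assuming no special start or end transitions; the function names and the toy numbers are hypothetical, and the patent's own implementation is not shown in this text.

```python
import math
from itertools import product


def sequence_score(emissions, transitions, labels):
    """s(x, y): per-character label scores plus label-to-label transition scores."""
    score = sum(emissions[i][y] for i, y in enumerate(labels))
    score += sum(transitions[a][b] for a, b in zip(labels, labels[1:]))
    return score


def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emissions, transitions):
    """Log of the summed exp-scores of all label sequences (forward algorithm)."""
    n = len(transitions)
    alpha = list(emissions[0])
    for row in emissions[1:]:
        alpha = [
            row[j] + log_sum_exp([alpha[i] + transitions[i][j] for i in range(n)])
            for j in range(n)
        ]
    return log_sum_exp(alpha)


def negative_log_likelihood(emissions, transitions, gold_labels):
    return log_partition(emissions, transitions) - sequence_score(
        emissions, transitions, gold_labels
    )


def viterbi(emissions, transitions):
    """Highest-scoring label sequence, as needed at prediction time."""
    n = len(transitions)
    best = list(emissions[0])
    back = []
    for row in emissions[1:]:
        pointers, new_best = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: best[i] + transitions[i][j])
            pointers.append(i_best)
            new_best.append(best[i_best] + transitions[i_best][j] + row[j])
        back.append(pointers)
        best = new_best
    path = [max(range(n), key=lambda j: best[j])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy example: 3 characters, 2 labels; all numbers are made up.
P = [[1.0, 0.2], [0.3, 1.5], [0.8, 0.1]]  # P[m][n]: score of label n for character m
A = [[0.5, -0.2], [-1.0, 0.4]]            # A[i][j]: score of moving from label i to j
loss = negative_log_likelihood(P, A, [0, 1, 0])

# Sanity check: enumerate every label sequence and compare with the forward algorithm.
brute = math.log(sum(math.exp(sequence_score(P, A, y)) for y in product(range(2), repeat=3)))
print(loss, viterbi(P, A), abs(brute - log_partition(P, A)) < 1e-9)
```

In training, a value such as `loss` would be what the Adam optimizer minimizes; the forward algorithm avoids enumerating every label sequence, which the brute-force check does only because the toy example is tiny.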

Compared with the prior art, the technical scheme of the invention has the beneficial effects that:

The invention provides a medical record text named entity recognition method based on an iterative expansion convolutional neural network. Named entity recognition is performed on the CCKS2017 medical electronic medical record data set: a passage of Chinese electronic medical record text is input, an iterative expansion convolutional neural network combined with a conditional random field serves as the model architecture, Chinese character components (radicals) are used as features, and named entities such as disease names and examination methods are extracted from the text.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of the algorithm structure in embodiment 1.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent;

for the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product;

it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

As shown in FIG. 1, a medical record text named entity recognition method based on an iterative expansion convolutional neural network includes steps S1 to S3 and their sub-steps S11 to S13, S21 to S22 and S31 to S34, as set out in the Disclosure of Invention above.

The invention addresses named entity recognition in medical record text. The data set used is the CCKS2017 Chinese electronic medical record entity recognition data set, released by the China Conference on Knowledge Graph and Semantic Computing (CCKS). The data set consists mainly of text from the medical field. The entity categories are body organ (body), symptom (symptom), examination (check), disease (disease) and treatment (treatment); the distribution of each entity type in the data set is shown in Table 1. The data set is labeled with the BIOES scheme: B marks the beginning of an entity, I the inside of an entity, E the end of an entity, S a single-character entity, and O a character outside any entity. For example, "abdominal pain sensation disappears" (seven Chinese characters) is labeled "B-body, E-body, B-symptom, I-symptom, E-symptom, O, O": "abdominal" is an entity of type body organ, "pain sensation" is an entity of type symptom, and "disappears" is not an entity.
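To make the BIOES scheme concrete, the following Python sketch converts a tag sequence into entity spans. The function name and the choice to silently drop ill-formed tag runs are assumptions of this example, not details from the patent.

```python
def decode_bioes(tags):
    """Turn a BIOES tag sequence into (start, end, type) spans, end exclusive."""
    entities = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "S":
            entities.append((i, i + 1, etype))
            start, current_type = None, None
        elif prefix == "B":
            start, current_type = i, etype
        elif prefix == "I" and start is not None and etype == current_type:
            continue  # still inside the entity opened by the last B tag
        elif prefix == "E" and start is not None and etype == current_type:
            entities.append((start, i + 1, etype))
            start, current_type = None, None
        else:  # "O" or an ill-formed sequence: drop any partial entity
            start, current_type = None, None
    return entities


tags = ["B-body", "E-body", "B-symptom", "I-symptom", "E-symptom", "O", "O"]
print(decode_bioes(tags))  # [(0, 2, 'body'), (2, 5, 'symptom')]
```

Applied to the seven labels of the example sentence, the first two characters form a body entity and the next three a symptom entity, matching the description above.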

TABLE 1 distribution of training set entities

In the prior art, the best approach combines word vectors with a bidirectional long short-term memory network (LSTM) and a conditional random field (CRF). The long short-term memory network is used to understand the input sentence and extract features, and the conditional random field is used to generate labels. However, the language of medical record text differs from everyday expression: it is generally terse and rigorous, with weak dependence on context, so a long short-term memory network is not well suited to it. Word vectors are also unsuitable here, because medical record text contains many technical terms that are hard to segment, and segmentation errors propagate through the rest of the model. We therefore propose a model that combines character vectors, an iterative expansion (dilated) convolutional neural network and a conditional random field.

The specific steps are as follows. A vector representation of each character is first obtained. The vector representations are then input into the expansion convolutional neural network; after four expansion convolution layers, a feature for each character, again a vector, is obtained. These feature vectors are then input into the conditional random field, which outputs the label of each character. The details are as follows:

1. The CCKS2017 data set is read in first. In the data set, each line of text has two parts: a character and its corresponding tag. Blank lines separate the training samples. After the samples are randomly shuffled, the data set is divided into a training set and a test set in a ratio of 8.

2. The model is constructed. It has three parts: character vectors, the expansion convolutional neural network and the conditional random field. The character vectors provide a distributed representation of each character, the expansion convolutional neural network extracts the features of each character, and the conditional random field scores label sequences during training and outputs the highest-scoring label sequence at test time. The character vector module uses vectors pre-trained on an external corpus.

3. The training set is processed in batches of 32 samples. The vector representation of each batch is obtained through the character vector module and input into the model for training. To avoid overfitting, a dropout layer is added after this module, which randomly deactivates the character vectors with a fixed probability (set here to 0.5). The training objective is to minimize the negative log-likelihood function, i.e.:

wherein:

where s (x, y) is a scoring function.

4. Step 3 is repeated for a total of 100 epochs. After training, the model parameters are saved to a local file. The test set is then read, named entities are predicted on it, and the predictions are compared with the annotated labels to measure model performance. The evaluation metric is the F1 value, defined as follows:

F1 = (2 × Precision × Recall) / (Precision + Recall)

Precision = (number of correctly predicted entities) / (number of predicted entities)

Recall = (number of correctly predicted entities) / (number of entities in the annotated data)
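The three definitions above can be computed over sets of entities, as in the following Python sketch. It assumes that an entity counts as correct only when its sample, boundaries and type all match exactly; the patent does not state the matching criterion, and the function name and toy data are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Entity-level scores; an entity is correct only on an exact match."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Entities as (sample index, start, end, type); toy data for illustration only.
gold = {(0, 0, 2, "body"), (0, 2, 5, "symptom"), (1, 3, 4, "disease")}
pred = {(0, 0, 2, "body"), (0, 2, 4, "symptom"), (1, 3, 4, "disease")}
p, r, f1 = precision_recall_f1(pred, gold)
print(p, r, f1)
```

In the toy data one predicted symptom has the wrong end boundary, so two of three predictions are correct and two of three annotated entities are found.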

To show the effect of our model, we chose two other models for comparison. One is the bidirectional long short-term memory network + conditional random field (BiLSTM+CRF), a classic model in named entity recognition that performs well on many data sets. The other is HITSZ_CNER, the winning model of the CCKS2017 evaluation, i.e. the best-performing model in that competition.

The test results are shown in Table 2, which compares our model (IDCNN+CRF) with the previous models. Overall, our model reaches an F1 value of 92.53%, compared with 91.08% for HITSZ_CNER and 90.82% for BiLSTM+CRF, and it obtains the best result on the symptom, disease and examination entity types. In addition, the model with the radical feature is compared with the model without it (IDCNN+CRF w/o feat.); the comparison shows that the radical feature improves performance, raising the overall F1 value from 91.62% to 92.53%. By making appropriate use of character vectors, radical features and the expansion convolutional neural network, the invention exploits the characteristics of medical record text to identify medical entities more accurately.

The specific structure of the invention is shown in figure 2.

The positional relationships depicted in the drawings are for illustrative purposes only and should not be construed as limiting the present patent;

TABLE 2. Comparison of the effects of the different models (F1 value, %)

Method                BiLSTM+CRF   HITSZ_CNER   IDCNN+CRF w/o feat.   IDCNN+CRF
Body organ            88.10        87.42        87.10                 87.56
Symptoms and signs    95.73        96.34        95.16                 96.94
Disease               77.45        78.60        79.57                 80.14
Examination           95.69        94.36        96.02                 96.11
Treatment             72.71        78.92        75.10                 75.74
Total                 90.82        91.08        91.62                 92.53

The same or similar reference numerals correspond to the same or similar parts;

It should be understood that the above embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A medical record text named entity identification method based on an iterative expansion convolutional neural network is characterized by comprising the following steps:

s2: establishing a loss function of the model;

s3: training the model and testing on the test set;

the specific process of the step S1 is as follows:

2. The medical record text named entity recognition method based on the iterative dilation convolutional neural network as claimed in claim 1, wherein the specific process of step S2 is:

wherein s (x, y) is a scoring function;

s22: the calculation of the score is divided into two parts: (1) the transition score Ai,j between tags, i.e. the score for transitioning from tag i to tag j; (2) the character label score Pm,n, i.e. the score of assigning label n to a given character m, that is:

3. the medical record text named entity recognition method based on the iterative dilation convolutional neural network as claimed in claim 2, wherein the specific process of step S3 is as follows:

s31: splitting an input text, processing a single character, obtaining a component of each character, obtaining vector representations of the characters and the component through an Embedding layer, inputting the vector representations into an iterative expansion convolution neural network to extract characteristics, and inputting the extracted characteristics into a conditional random field to obtain a final label;

s33: dividing a data set into a training set and a testing set according to the proportion of 9;