CN110569511A

CN110569511A - Electronic medical record feature extraction method based on hybrid neural network

Info

Publication number: CN110569511A
Application number: CN201910896154.6A
Authority: CN
Inventors: 姜明伟; 吴小雪; 张庆辉
Original assignee: Henan University of Technology
Current assignee: Henan University of Technology
Priority date: 2019-09-22
Filing date: 2019-09-22
Publication date: 2019-12-13

Abstract

The electronic medical record text feature extraction method based on a hybrid neural network comprises the following steps in sequence: (1) acquiring a data set; (2) preprocessing the data; (3) obtaining word vector representations of the words; (4) constructing a TextCNN model to obtain the associations between adjacent words and capture the local features of the text; (5) constructing a Bi-LSTM model to memorize the acquired semantic information, capture contextual information, and understand the semantics of the words as fully as possible; and (6) designing fully connected layers for feature fusion. The method improves the processing of electronic medical record text and the extraction of its text features, thereby obtaining the semantic information carried by those features.

Description

Electronic medical record feature extraction method based on hybrid neural network

Technical Field

The invention belongs to the technical field of deep learning and natural language processing.

Background

Electronic medical records contain a large amount of numerical and textual information and document the treatment that medical personnel provide to patients. Extracting this information yields useful medical data that can support clinical decisions, provide personalized diagnosis plans for patients, and enable precision medicine. Because deep learning models can extract features automatically from data, they have become a research hotspot in feature representation, and convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have proved to be effective semantic composition models. Lipton et al. were the first to evaluate the ability of long short-term memory (LSTM) networks to recognize patterns in multivariate sequences of clinical medical records, and their model outperformed earlier approaches based on multilayer perceptrons. Miotto et al. proposed an unsupervised deep feature learning method that derives patient pathological features from electronic medical records, making targeted clinical predictive modeling more convenient. They trained a three-layer stack of denoising autoencoders to capture hierarchical regularities and dependencies in the electronic medical records of 700,000 patients. They called the resulting model "Deep Patient"; it performed well in predicting severe diabetes, schizophrenia and various cancers. Nguyen et al. proposed Deepr, an end-to-end deep learning system that extracts pathological features from medical records and makes predictions automatically, which can improve the accuracy of clinical diagnosis. Wu et al. constructed a deep neural network for named entity recognition in Chinese electronic medical records. Word embeddings learned without supervision from an unlabeled corpus serve as the input layer, and experimental results show that the model outperforms CRF models.
Wu Jiawei proposed a feature learning method for entity relation extraction from English electronic medical records; to cope with the sparse text structure of such records, it abstracts limited context features and further derives combined relational features between words. Yang et al. used a multilayer convolutional neural network for high-level semantic understanding of electronic medical record text and applied it to disease diagnosis with good results.

In general, existing methods extract pathological features from medical records with either a CNN or an RNN. The CNN emphasizes extracting current local information, whereas the RNN emphasizes retaining the historical information of a sentence. Used alone, neither can combine the temporal and spatial aspects in its feature representation when explaining the interactions between data features. A model that fuses a CNN and an RNN lets the two complement each other.

Disclosure of Invention

The invention aims to provide a hybrid neural network method for extracting text features from electronic medical records. The model extracts features automatically from electronic medical record text: words are represented by word vectors, and the neural network learns vector representations of variable-length sentences, so that the semantic information of the sentences is captured well. The CNN model focuses on extracting current local information, while the RNN focuses on preserving the historical information of sentences; a model that fuses them takes both local information and contextual history into account. In actual training, however, the RNN is limited by gradient explosion. The LSTM, which is derived from the RNN, largely solves this problem, so the invention extracts the feature information in electronic medical record text with a fusion model of TextCNN and Bi-LSTM.

The electronic medical record texts provided by CCKS2017 are selected and preprocessed. Discretized electronic medical record data are obtained through text word segmentation, stop word removal, word frequency statistics, feature representation and similar means. The electronic medical record corpus is then trained with the CBOW model of Word2vec to obtain word vector representations of the text. A TextCNN model is constructed to obtain the associations between adjacent words and capture the local features of the text. A Bi-LSTM model is constructed to memorize the acquired semantic information, capture contextual information, and understand the semantics of the words as fully as possible. Finally, two fully connected layers are designed to fuse the features extracted by the TextCNN model and the Bi-LSTM model and generate the deep word vector features ultimately used for chronic disease classification.
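
The final fusion stage can be sketched in plain Python without any framework. This is only an illustration under stated assumptions, not the implementation of the invention: the helper names (dense, dropout, fuse_and_classify), the ReLU activation, the toy layer sizes and the random weights are all hypothetical. The source specifies only that the two feature vectors are concatenated, passed through two fully connected layers with dropout in between, and classified with softmax in the experiments.

```python
import math
import random

def dense(x, weights, bias):
    """Fully connected layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    # Activation choice is an assumption; the source does not name one.
    return [max(0.0, v) for v in x]

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def softmax(x):
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def fuse_and_classify(cnn_feat, lstm_feat, params, rng, rate=0.5, training=True):
    fused = cnn_feat + lstm_feat                        # concatenate the two feature vectors
    h = relu(dense(fused, params["w1"], params["b1"]))  # first fully connected layer
    h = dropout(h, rate, rng, training)                 # dropout between the two layers
    logits = dense(h, params["w2"], params["b2"])       # second fully connected layer
    return softmax(logits)

rng = random.Random(0)
cnn_feat = [rng.random() for _ in range(6)]    # stand-in for pooled TextCNN features
lstm_feat = [rng.random() for _ in range(4)]   # stand-in for Bi-LSTM features
params = {
    "w1": random_matrix(8, 10, rng), "b1": [0.0] * 8,
    "w2": random_matrix(3, 8, rng), "b2": [0.0] * 3,
}
probs = fuse_and_classify(cnn_feat, lstm_feat, params, rng, training=False)
```

With training=False the dropout step is a no-op, which is the usual behavior at evaluation time; probs is then a probability distribution over the three toy classes.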

Drawings

FIG. 1 is a flow chart of feature extraction. FIG. 2 is a structural diagram of the Bi-LSTM model. FIG. 3 is a structural diagram of the TextCNN model.

Detailed Description

To verify the validity of the proposed model, experiments were performed on the corpus selected herein. The method specifically comprises the following steps:

First, the electronic medical record text is preprocessed. Discretized electronic medical record data are obtained through text word segmentation, stop word removal, word frequency statistics and feature representation. Each text is segmented with the jieba word segmentation tool; stop words are then removed according to the provided stop word list, followed by denoising. During denoising, character strings such as abbreviations, URLs and punctuation marks that appear in the text are processed. The preprocessed electronic medical record data set is divided into training samples and test data: 2/3 of the data samples are randomly selected for model training, and the remaining 1/3 are used to evaluate the effectiveness of the model.
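
The cleaning and splitting steps might look roughly like the sketch below. It assumes the text has already been segmented (the token list is hand-written rather than produced by jieba), and the function names, regular expressions, sample stop words and random seed are illustrative assumptions. Abbreviation handling is left out because the source does not say how abbreviations are treated.

```python
import random
import re

URL_RE = re.compile(r"https?://\S+")
PUNCT_RE = re.compile(r"^[\W_]+$")  # tokens made only of punctuation or symbols

def clean_tokens(tokens, stop_words):
    """Drop stop words, URLs and punctuation-only tokens from a segmented text."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if not tok or tok in stop_words:
            continue
        if URL_RE.match(tok) or PUNCT_RE.match(tok):
            continue
        kept.append(tok)
    return kept

def split_train_test(samples, train_fraction=2 / 3, seed=0):
    """Shuffle the samples and split them; the first part is used for training."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Tokens as a word segmenter might return them (hand-written for this example).
tokens = ["患者", "，", "因", "咳嗽", "的", "入院", "http://example.com", "。"]
stop_words = {"的", "因"}
print(clean_tokens(tokens, stop_words))  # ['患者', '咳嗽', '入院']

train, test = split_train_test(range(9))
print(len(train), len(test))  # 6 3
```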

Second, the corresponding Word2vec model is constructed for word vector representation. Word2vec, an open-source tool developed by Google, trains word vector models quickly and effectively and offers two modes, CBOW and Skip-gram; the CBOW model is adopted here to train the corpus and obtain word vector representations of the text. The input of the CBOW model is the word vectors of the words in the context of a given target word, and the output is the word vector of that target word. After the word vector of each word has been looked up, the word vectors are stacked to form a word vector feature matrix.
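
To make the CBOW input/output relationship concrete, the sketch below builds (context, target) training pairs and then stacks looked-up vectors into a sentence matrix. It does not train anything: the window size, the toy 2-dimensional vectors, the zero vector for unknown words and the function names are assumptions made for illustration.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) training pairs as used by CBOW."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            pairs.append((context, target))
    return pairs

def sentence_matrix(tokens, vectors, dim):
    """Stack one vector per token; unknown words map to a zero vector (an assumption)."""
    zero = [0.0] * dim
    return [vectors.get(tok, zero) for tok in tokens]

tokens = ["patient", "reports", "chest", "pain"]
pairs = cbow_pairs(tokens, window=1)
# [(['reports'], 'patient'), (['patient', 'chest'], 'reports'),
#  (['reports', 'pain'], 'chest'), (['chest'], 'pain')]

vectors = {"patient": [0.1, 0.2], "chest": [0.3, 0.4], "pain": [0.5, 0.6]}  # toy 2-d vectors
matrix = sentence_matrix(tokens, vectors, dim=2)  # 4 rows, one per word
```

In practice the vectors would come from a trained Word2vec model rather than a hand-written dictionary; the source does not name the library used for training.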

Third, a TextCNN model is constructed to obtain the associations between adjacent words and capture the local features of the text. The TextCNN convolutional layer uses 128 convolution kernels of each of the sizes 3×100, 4×100 and 5×100 with a stride of 1 and convolves them with the output of the Word2vec model; kernels with different window sizes let the network automatically extract different features of the sentences. Max pooling is then applied: three pooling layers extract the key features from the convolved features and eliminate redundant computation, and the pooled features are concatenated to produce a feature vector of fixed dimension.
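
The convolution and max pooling described above can be illustrated with a small framework-free sketch. It reads the text as 128 kernels per window size and 100-dimensional word vectors; the ReLU activation, the random weights, the 10-word sentence and the function names are assumptions, and a real model would learn the kernels during training.

```python
import random

def conv1d_over_words(matrix, kernel, bias=0.0):
    """Slide an h x d kernel over an n x d sentence matrix with stride 1.

    Returns n - h + 1 ReLU-activated values (a 'valid' convolution).
    """
    h = len(kernel)
    out = []
    for start in range(len(matrix) - h + 1):
        total = bias
        for row, krow in zip(matrix[start:start + h], kernel):
            total += sum(x * w for x, w in zip(row, krow))
        out.append(max(0.0, total))
    return out

def textcnn_features(matrix, kernels_by_height):
    """Max-pool each feature map over all positions and concatenate the maxima."""
    features = []
    for height in sorted(kernels_by_height):
        for kernel in kernels_by_height[height]:
            feature_map = conv1d_over_words(matrix, kernel)
            features.append(max(feature_map))
    return features

rng = random.Random(0)
n_words, dim, n_filters = 10, 100, 128  # sentence length is arbitrary; dim and count follow the text
matrix = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_words)]
kernels = {
    h: [[[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(h)]
        for _ in range(n_filters)]
    for h in (3, 4, 5)
}
features = textcnn_features(matrix, kernels)
print(len(features))  # 384 = 3 kernel sizes x 128 kernels
```

Because each feature map is reduced to a single maximum, the output length depends only on the number of kernels and not on the sentence length, which is what yields the fixed-dimension feature vector. The sketch requires sentences of at least five words; shorter inputs would need padding.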

Fourth, a Bi-LSTM model is constructed. The word vectors generated by word2vec are taken as the input of the model, with the word vector dimension set to 128, and are fed to the forward and backward directions of the model respectively. The model memorizes the acquired semantic information, captures contextual information, and understands the semantics of the words as fully as possible. Finally, the outputs of the two directions are concatenated to produce the model output.
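
A bidirectional LSTM of this kind can be sketched with the standard LSTM cell equations in plain Python. The sketch uses toy sizes instead of the 128 dimensions in the text, random untrained weights, and hypothetical helper names; it is meant only to show how the two directions are run and concatenated.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step with input (i), forget (f), output (o) gates and candidate (g)."""
    z = x + h_prev  # list concatenation: [x; h_prev]

    def affine(name):
        w, b = params[name]
        return [sum(wi * zi for wi, zi in zip(row, z)) + bi for row, bi in zip(w, b)]

    i = [sigmoid(v) for v in affine("i")]
    f = [sigmoid(v) for v in affine("f")]
    o = [sigmoid(v) for v in affine("o")]
    g = [math.tanh(v) for v in affine("g")]
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c

def run_lstm(sequence, params, hidden):
    """Run one direction and return the hidden state at every position."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in sequence:
        h, c = lstm_step(x, h, c, params)
        outputs.append(h)
    return outputs

def init_params(input_dim, hidden, rng):
    def gate():
        w = [[rng.uniform(-0.1, 0.1) for _ in range(input_dim + hidden)]
             for _ in range(hidden)]
        return w, [0.0] * hidden
    return {name: gate() for name in "ifog"}

def bilstm(sequence, fwd_params, bwd_params, hidden):
    """Run both directions over the same input and concatenate per-word outputs."""
    fwd = run_lstm(sequence, fwd_params, hidden)
    bwd = run_lstm(sequence[::-1], bwd_params, hidden)[::-1]  # re-align to word order
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
input_dim, hidden, n_words = 8, 4, 5  # toy sizes, not the 128 used in the text
sequence = [[rng.uniform(-1, 1) for _ in range(input_dim)] for _ in range(n_words)]
outputs = bilstm(sequence, init_params(input_dim, hidden, rng),
                 init_params(input_dim, hidden, rng), hidden)
print(len(outputs), len(outputs[0]))  # 5 8
```

Each output row has twice the hidden size because the forward and backward states for the same word are placed side by side.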

Fifth, with the other parameters fixed, word vector dimensions of 128 and 256 are compared, sliding window sizes of 3, 5 and 7 are compared for the convolutional network, and dropout ratios of 0.2, 0.4, 0.5 and 0.7 are compared. The results show that the best performance is obtained when the word vector dimension is 128, the sliding window size is 3, 5, max pooling is used and the dropout value is 0.5. In the same way, the feature extraction accuracy of the model is found to be highest when the word vector dimension is 128 and the hidden layer size of the model is 128.
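
Varying one setting at a time while holding the others fixed can be enumerated as in the sketch below. The candidate values come from the text; the baseline configuration, the key names and the function name are assumptions, since the source does not state which values were held fixed during each comparison.

```python
baseline = {"embedding_dim": 128, "window": 3, "dropout": 0.5}  # assumed starting point
grid = {
    "embedding_dim": [128, 256],
    "window": [3, 5, 7],
    "dropout": [0.2, 0.4, 0.5, 0.7],
}

def one_factor_at_a_time(baseline, grid):
    """Return the configurations that differ from the baseline in at most one setting."""
    configs = []
    for name, values in grid.items():
        for value in values:
            config = dict(baseline, **{name: value})
            if config not in configs:  # the baseline itself appears only once
                configs.append(config)
    return configs

configs = one_factor_at_a_time(baseline, grid)
print(len(configs))  # 7
```

Each configuration would then be trained and scored on the held-out test data, and the best value of each setting kept.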

Sixth, to verify the effect of the proposed hybrid neural network model on electronic medical record feature extraction, a single TextCNN model and a single Bi-LSTM model are used as comparison experiments under the same conditions. The extracted features are classified with softmax, and the resulting classification accuracy is shown in Table 1.

Model	Accuracy (%)
TextCNN model	90.10
Bi-LSTM model	92.15
TextCNN-Bi-LSTM model	94.36

The table shows that, under the same conditions, the hybrid neural network model improves accuracy by more than 2 percentage points over either single model. The results indicate that using the hybrid neural network structure to extract text features from electronic medical records improves the experimental results to a certain extent.
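
The size of the improvement follows directly from the reported accuracies. The short calculation below uses only the three figures given in the table; the dictionary keys are shorthand labels chosen for this example.

```python
accuracy = {"TextCNN": 90.10, "Bi-LSTM": 92.15, "TextCNN-Bi-LSTM": 94.36}  # reported values, in %
hybrid = accuracy["TextCNN-Bi-LSTM"]
gains = {
    name: round(hybrid - acc, 2)
    for name, acc in accuracy.items()
    if name != "TextCNN-Bi-LSTM"
}
print(gains)  # {'TextCNN': 4.26, 'Bi-LSTM': 2.21}
```

The gain is 2.21 percentage points over the single Bi-LSTM model and 4.26 over the single TextCNN model.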

Claims

1. An electronic medical record feature extraction method based on a hybrid neural network, characterized in that the method comprises the following steps in sequence:

(1) Acquiring a data set: selecting an electronic medical record text data set provided by CCKS 2017;

(2) Data preprocessing: performing text word segmentation, stop word removal, word frequency statistics, feature representation and similar means on the data set to obtain discretized electronic medical record data;

(3) Obtaining word vector representations of words: using the Google word2vec model to map words from a high-dimensional space to a low-dimensional space in a distributed manner while preserving the positional relationships between word vectors, thereby solving the two problems of vector sparseness and semantic relation;

(4) Constructing a TextCNN model to obtain the associations between adjacent words and capture the local features of the text; the TextCNN model comprises a convolutional layer and a max pooling layer; the convolutional layer is used to learn different context-dependent features from input matrices of different sizes; following the idea of the local receptive field, each hidden layer node is connected only to a sufficiently small local region of the input rather than to every input point, and the connection weights of some neurons in the same layer are shared, which greatly reduces the number of weight parameters to be trained; a plurality of convolution kernels in the convolutional layer are convolved with the output of the Word2vec model, and convolution kernels with different window sizes enable the network to automatically extract different features of the sentences; the results obtained by convolving each convolution kernel with the sentence are concatenated to obtain the output of the convolutional layer; the pooling layer is used to down-sample the features, enhancing the robustness of the model and effectively improving its performance; after each convolution, the size of the feature map is reduced by pooling, which simplifies the information output by the convolutional layer; the method adopts max pooling, in which the maximum value of each vector output by the convolutional layer is taken to extract the most important feature information, and these maxima are concatenated into a vector to obtain the output of the pooling layer; max pooling enables the network to automatically extract the most useful features in the sentence;

(5) Constructing a Bi-LSTM model to memorize the acquired semantic information, capture contextual information, and understand the semantics of the words as fully as possible; the constructed Bi-LSTM model comprises a bidirectional LSTM layer, an aggregation layer and a max pooling layer; the bidirectional LSTM layer serves as the feature extraction part and acquires information from two opposite directions by constructing two LSTM neural networks, which is more conducive to capturing the long-range dependencies of sentences and the deep semantic expression of the text as a whole, the inputs of the two neural networks being identical; the advantage of the LSTM is that it has three special gate functions, namely the input gate, the forget gate and the output gate, which control the memory of the neural network; the aggregation layer concatenates the forward and backward output vectors obtained by the bidirectional LSTM layer; and, because the number of words in each input text differs, a fixed-length feature vector is obtained through the pooling operation;

(6) Designing fully connected layers for feature fusion: two fully connected layers are designed to fuse the features extracted by the TextCNN model and the Bi-LSTM model and generate the deep word vector features ultimately used for chronic disease classification; before the first fully connected layer, the method uses the concat() method of the TensorFlow framework to fuse the output features of TextCNN and Bi-LSTM; the fused features are taken as the input of the first fully connected layer, and a Dropout mechanism is introduced between the first and second fully connected layers, discarding part of the trained parameters at each iteration so that weight updates do not depend on a fixed subset of features and overfitting is prevented.