CN110569511A - Electronic medical record feature extraction method based on hybrid neural network - Google Patents

Info

Publication number
CN110569511A
Authority
CN
China
Prior art keywords
layer
model
text
electronic medical
medical record
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910896154.6A
Other languages
Chinese (zh)
Inventor
姜明伟
吴小雪
张庆辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Henan University of Technology
Original Assignee
Henan University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan University of Technology
Priority to CN201910896154.6A
Publication of CN110569511A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The electronic medical record text feature extraction method based on a hybrid neural network comprises the following steps in order: (1) acquiring a data set; (2) preprocessing the data; (3) obtaining word vector representations of the words; (4) constructing a TextCNN model to learn the associations between adjacent words and capture the local features of the text; (5) constructing a Bi-LSTM model to memorize the acquired semantic information, capture contextual associations, and understand the semantics of the words as fully as possible; and (6) designing fully connected layers for feature fusion. The method improves how electronic medical record text is processed and how its text features are obtained, thereby capturing the semantic information carried by those features.

Description

Electronic medical record feature extraction method based on hybrid neural network
Technical Field
The invention belongs to the technical field of deep learning for natural language processing.
Background
Electronic medical records contain a large amount of numerical and textual information and document the treatment that medical personnel provide to patients. Extracting this information yields useful medical data that can support clinical decisions, enable personalized diagnosis plans for patients, and help achieve precision medicine. Because deep learning models can extract features from data automatically, they have become a research focus in feature representation, and convolutional neural networks (CNNs) and recurrent neural networks (RNNs) have proved to be effective semantic composition models. Lipton et al. were the first to evaluate the ability of long short-term memory (LSTM) networks to recognize patterns in multivariate sequences of clinical records, and their model outperformed earlier approaches based on multilayer perceptrons. R. Miotto et al. proposed an unsupervised deep feature learning method that derives a patient's pathological features from electronic medical records, making targeted clinical predictive modeling more convenient. They trained a three-layer stack of denoising autoencoders to capture the hierarchical regularities and dependencies in the electronic medical records of about 700,000 patients. They called the resulting representation "Deep Patient"; it performed well in predicting severe diabetes, schizophrenia, and various cancers. P. Nguyen et al. proposed Deepr (short for "Deep record"), an end-to-end deep learning system that extracts pathological features from medical records and makes predictions automatically, which can improve the accuracy of clinical diagnosis. Y. Wu et al. built a deep neural network for named entity recognition in Chinese electronic medical records. Word representations generated from an unlabeled corpus by unsupervised learning serve as the input layer, and the experimental results show that the model outperforms CRF models.
Wujiawei proposed a feature learning method for entity relation extraction from English electronic medical records. To address the sparse text structure of electronic medical records, it builds abstract representations of limited context features and derives combined relational features between words. Yang et al. used a multilayer convolutional neural network for high-level semantic understanding of electronic medical record text and applied it to disease diagnosis with good results.
In general, existing methods extract pathological features from medical records with either a CNN or an RNN. A CNN emphasizes the extraction of current local information, whereas an RNN emphasizes the retention of a sentence's historical information. Used alone, neither can combine temporal and spatial information in its feature representation when modeling the interactions between data features; a fused CNN-RNN model lets the two complement each other.
Disclosure of Invention
The invention aims to provide a hybrid neural network method for extracting text features from electronic medical records. The model extracts features from electronic medical record text automatically: it represents words as word vectors and learns vector representations of variable-length sentences through the neural network, so that the semantic information of the sentences is captured well. The CNN model focuses on extracting current local information, while the RNN focuses on preserving the historical information of sentences; a model that fuses the two takes both local information and contextual history into account. In practice, however, training a plain RNN is hampered by exploding (and, more commonly, vanishing) gradients; the LSTM, a variant of the RNN, largely alleviates this problem. The invention therefore uses a fusion model of TextCNN and Bi-LSTM to extract the feature information in electronic medical record text.
The electronic medical record texts provided by CCKS2017 are selected and preprocessed. Discretized electronic medical record data are obtained through text word segmentation, stop-word removal, word frequency statistics, and feature representation. The corpus is then trained with the CBOW model of Word2vec to obtain word vector representations of the text. A TextCNN model is constructed to learn the associations between adjacent words and capture the local features of the text. A Bi-LSTM model is constructed to memorize the acquired semantic information, capture contextual associations, and understand the semantics of the words as fully as possible. Finally, two fully connected layers are designed to fuse the features extracted by the TextCNN and Bi-LSTM models and generate the deep word vector features ultimately used for chronic disease classification.
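The final fusion step can be illustrated with a minimal plain-Python sketch. It assumes that the TextCNN and Bi-LSTM features are already available as flat lists, that the first fully connected layer uses a ReLU activation (the text does not name one), and that dropout sits between the two layers with a rate of 0.5. The helper names `dense`, `relu`, `dropout`, and `fuse` are hypothetical and stand in for the corresponding TensorFlow operations; this is not the applicants' implementation.

```python
import random


def dense(x, weights, bias):
    """Fully connected layer: `weights` holds one row of len(x) values per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v):
    """ReLU activation (an assumption; the source does not specify the activation)."""
    return [max(0.0, a) for a in v]


def dropout(v, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the survivors."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in v]


def fuse(cnn_feats, lstm_feats, fc1, fc2, rate=0.5, training=False, rng=None):
    """Concatenate both feature vectors, then apply FC1 -> (dropout) -> FC2."""
    x = cnn_feats + lstm_feats                 # list concatenation plays the role of concat()
    hidden = relu(dense(x, *fc1))
    if training:                               # dropout is only active during training
        hidden = dropout(hidden, rate, rng or random.Random(0))
    return dense(hidden, *fc2)


# Toy parameters, chosen by hand purely for illustration.
fc1 = ([[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]], [0.0, -10.0])
fc2 = ([[2.0, 5.0]], [0.5])
fused = fuse([1.0, 2.0], [3.0], fc1, fc2)
```

At inference time (`training=False`) the sketch is deterministic; during training the dropout mask changes on every call unless a seeded `rng` is passed in.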
Drawings
FIG. 1 is a flow chart of the feature extraction. FIG. 2 is a structural diagram of the Bi-LSTM model. FIG. 3 is a structural diagram of the TextCNN model.
Detailed Description
To verify the validity of the proposed model, experiments were performed on the corpus selected herein. The method specifically comprises the following steps:
First, the electronic medical record text is preprocessed. Discretized electronic medical record data are obtained through text word segmentation, stop-word removal, word frequency statistics, and feature representation. Each text is segmented with the jieba word segmentation tool; stop words are removed using the provided stop-word list, and the text is then denoised. Denoising handles specific strings that appear in the text, such as abbreviations, URLs, and punctuation marks. The preprocessed data set is divided into training samples and test data: 2/3 of the samples are drawn at random for model training, and the remaining 1/3 are used to evaluate the effectiveness of the model.
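The preprocessing step can be sketched in plain Python as follows. The sketch assumes the text has already been segmented into tokens (in practice a call such as `jieba.lcut(text)` would produce them) and treats denoising as dropping URLs and punctuation-only tokens; the handling of abbreviations is not specified in the text and is left out. The function names, the sample tokens, and the fixed random seed are illustrative assumptions.

```python
import random
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")


def clean_tokens(tokens, stop_words):
    """Drop URLs, punctuation-only tokens, and stop words from a segmented text."""
    kept = []
    for tok in tokens:
        if URL_RE.fullmatch(tok):
            continue                                  # noise: URL
        if not any(ch.isalnum() for ch in tok):
            continue                                  # noise: punctuation or whitespace only
        if tok in stop_words:
            continue
        kept.append(tok)
    return kept


def split_train_test(samples, train_fraction=2 / 3, seed=0):
    """Randomly split the samples into a training part and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


tokens = ["患者", "，", "的", "血糖", "http://example.com/a", "升高"]
cleaned = clean_tokens(tokens, stop_words={"的"})
word_freq = Counter(cleaned)                          # word frequency statistics
train, test = split_train_test(list(range(9)))
```

Fixing the seed makes the 2/3 to 1/3 split reproducible, which helps when the hybrid model and the single-model baselines have to be compared on identical data.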
Second, the corresponding Word2vec model is constructed for word vector representation. Word2vec, an open-source tool from Google, trains word vector models quickly and effectively and offers two modes, CBOW and Skip-gram; the CBOW model is used here to train the corpus and obtain the word vector representation of the text. The input of the CBOW model is the word vectors of the words in the context of a given target word, and the output is the word vector of that target word. Once the word vector of each word has been looked up, the vectors are stacked to form a word vector feature matrix.
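To make the CBOW input/output relationship concrete, the following plain-Python sketch builds the (context words, target word) training pairs and then stacks looked-up vectors into a sentence matrix. It does not train embeddings; the window size of 2 and the tiny two-dimensional vector table are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs as consumed by CBOW training."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:                      # a one-word sentence has no context
            pairs.append((context, target))
    return pairs


def sentence_matrix(tokens, vectors):
    """Stack the vector of every known word into a (num_words x dim) matrix."""
    return [vectors[t] for t in tokens if t in vectors]


pairs = cbow_pairs(["a", "b", "c"], window=1)

# Hypothetical 2-dimensional embeddings; the patent uses far larger vectors.
toy_vectors = {"a": [0.1, 0.2], "b": [0.3, 0.4]}
matrix = sentence_matrix(["a", "x", "b"], toy_vectors)   # "x" is out of vocabulary
```

Skipping out-of-vocabulary words is one possible design choice; mapping them to a shared unknown-word vector would keep the sentence length unchanged instead.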
Third, a TextCNN model is constructed to learn the associations between adjacent words and capture the local features of the text. The TextCNN convolutional layer uses 128 convolution kernels of each of the sizes 3×100, 4×100, and 5×100, with a stride of 1, and convolves them with the output of the Word2vec model; kernels with different window sizes let the network extract different features of the sentence automatically. Max pooling is then applied: three pooling layers extract the key features from the convolved features and discard redundant information, and the pooled features are concatenated into a feature vector of fixed dimensionality.
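The convolution and max-pooling mechanics can be shown with a small plain-Python sketch. Each kernel spans `h` consecutive words and the full embedding width, slides with stride 1, and is reduced to a single value by max pooling over all positions. The ReLU activation is an assumption (the text does not name one), the function names are hypothetical, and the toy matrix uses 2-dimensional vectors instead of the kernel width of 100 given above.

```python
def conv_over_words(matrix, kernel, bias=0.0):
    """Slide one (h x d) kernel over an (n x d) sentence matrix with stride 1.

    Returns a feature map of length n - h + 1 (requires n >= h).
    """
    h, n = len(kernel), len(matrix)
    feature_map = []
    for start in range(n - h + 1):
        total = bias
        for r in range(h):
            total += sum(x * w for x, w in zip(matrix[start + r], kernel[r]))
        feature_map.append(max(0.0, total))      # ReLU, an assumed activation
    return feature_map


def textcnn_features(matrix, kernels):
    """Max-pool each kernel's feature map and concatenate the maxima."""
    return [max(conv_over_words(matrix, k)) for k in kernels]


sentence = [[1, 0], [0, 1], [2, 2], [0, 0]]       # 4 words, 2-dimensional vectors
k2 = [[1, 1], [1, 1]]                             # window of 2 words
k3 = [[1, 1], [1, 1], [1, 1]]                     # window of 3 words
fmap = conv_over_words(sentence, k2)
features = textcnn_features(sentence, [k2, k3])
```

Because every kernel contributes exactly one pooled value, the output length depends only on the number of kernels; with 128 kernels for each of three window sizes it would be 384, regardless of sentence length.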
Fourth, a Bi-LSTM model is constructed. The word vectors generated by Word2vec, with their dimensionality set to 128, are taken as the model input and fed to the forward and backward directions of the model respectively. The model memorizes the acquired semantic information, captures contextual associations, and understands the semantics of the words as fully as possible. Finally, the outputs of the two directions are concatenated to produce the model output.
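The bidirectional processing can be sketched in plain Python as below: one LSTM reads the sequence left to right, a second reads it right to left, and the two hidden states at each position are concatenated. The gate equations are the standard LSTM formulation, which the text does not spell out, so they are an assumption here. `make_params` is a hypothetical helper that fills the weights with constants purely so the example is reproducible; real weights would be learned.

```python
import math

GATES = ("i", "f", "o", "g")    # input gate, forget gate, output gate, candidate


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_params(input_dim, hidden, bias=None):
    """Constant toy parameters: zero weights and an optional per-gate bias."""
    bias = bias or {}
    W = {k: [[0.0] * input_dim for _ in range(hidden)] for k in GATES}
    U = {k: [[0.0] * hidden for _ in range(hidden)] for k in GATES}
    b = {k: [bias.get(k, 0.0)] * hidden for k in GATES}
    return W, U, b


def lstm_step(x, h, c, W, U, b):
    """One LSTM time step; returns the new hidden state and cell state."""
    def gate(name, act):
        return [act(sum(w * xi for w, xi in zip(W[name][j], x))
                    + sum(u * hi for u, hi in zip(U[name][j], h))
                    + b[name][j])
                for j in range(len(h))]
    i, f, o = gate("i", sigmoid), gate("f", sigmoid), gate("o", sigmoid)
    g = gate("g", math.tanh)
    c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
    h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
    return h_new, c_new


def run_lstm(xs, params, hidden):
    """Run one direction over the sequence and collect every hidden state."""
    h, c = [0.0] * hidden, [0.0] * hidden
    outputs = []
    for x in xs:
        h, c = lstm_step(x, h, c, *params)
        outputs.append(h)
    return outputs


def bilstm(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward hidden states position by position."""
    fwd = run_lstm(xs, fwd_params, hidden)
    bwd = run_lstm(xs[::-1], bwd_params, hidden)[::-1]   # re-align to input order
    return [f + b for f, b in zip(fwd, bwd)]


def max_pool_over_time(outputs):
    """Reduce a variable-length sequence of vectors to one fixed-length vector."""
    return [max(column) for column in zip(*outputs)]


params = make_params(input_dim=2, hidden=1, bias={"g": 1.0})
xs = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
out = bilstm(xs, params, params, hidden=1)
pooled = max_pool_over_time(out)
```

Reversing the backward outputs before concatenation keeps both halves of each output vector aligned with the same word, which is what allows a later pooling or fusion step to treat them as one feature per position.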
Fifth, with the other parameters held fixed, word vector dimensions of 128 and 256 were compared, convolution sliding window sizes of 3, 5, and 7 were compared, and dropout ratios of 0.2, 0.4, 0.5, and 0.7 were compared. The results show that the best performance is obtained when the word vector dimension is 128, the sliding window size is 3, 5, max pooling is used, and the dropout value is 0.5. By the same procedure, the feature extraction accuracy of the model is highest when the word vector dimension is 128 and the hidden layer size of the model is 128.
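The comparison procedure, in which one parameter is varied while the others stay fixed, can be sketched as a small configuration generator. The candidate values come from the paragraph above; the choice of baseline configuration and the name `one_at_a_time` are assumptions made for illustration, and the training and evaluation of each configuration are deliberately left out.

```python
BASELINE = {"embedding_dim": 128, "window_size": 3, "dropout": 0.5}   # assumed baseline

CANDIDATES = {
    "embedding_dim": [128, 256],
    "window_size": [3, 5, 7],
    "dropout": [0.2, 0.4, 0.5, 0.7],
}


def one_at_a_time(baseline, candidates):
    """List configurations that differ from the baseline in at most one parameter."""
    configs = []
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline, **{name: value})
            if config not in configs:        # the baseline itself appears only once
                configs.append(config)
    return configs


configs = one_at_a_time(BASELINE, CANDIDATES)
```

This yields 7 configurations instead of the 24 a full grid would need, at the cost of not exploring interactions between parameters.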
Sixth, to verify the effect of the proposed hybrid neural network model on electronic medical record feature extraction, a standalone TextCNN model and a standalone BiLSTM model were used as comparison experiments under the same conditions. The extracted features were classified with softmax, and the resulting classification accuracies are shown in Table 1.
Table 1. Classification accuracy of each model
Model                   Accuracy (%)
TextCNN                 90.10
BiLSTM                  92.15
TextCNN-BiLSTM          94.36
The table shows that, under the same conditions, the hybrid neural network model improves accuracy by about 2.2 percentage points over the BiLSTM model alone and by about 4.3 percentage points over the TextCNN model alone. The results indicate that the hybrid neural network structure improves, to a certain extent, the extraction of text features from electronic medical records.
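The evaluation can be illustrated with a short plain-Python sketch of softmax classification and accuracy, followed by the differences between the accuracies reported in Table 1. The logits and labels are made-up toy values, and the function names are hypothetical; only the three accuracy figures are taken from the table.

```python
import math


def softmax(logits):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Index of the most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def accuracy(all_logits, labels):
    """Percentage of samples whose predicted class matches the label."""
    correct = sum(predict(l) == y for l, y in zip(all_logits, labels))
    return 100.0 * correct / len(labels)


toy_acc = accuracy([[2.0, 1.0], [0.0, 3.0], [5.0, 1.0]], [0, 1, 1])

# Accuracies as reported in Table 1 (percent).
TABLE_1 = {"TextCNN": 90.10, "BiLSTM": 92.15, "TextCNN-BiLSTM": 94.36}
gains = {name: round(TABLE_1["TextCNN-BiLSTM"] - acc, 2)
         for name, acc in TABLE_1.items() if name != "TextCNN-BiLSTM"}
```

The `gains` dictionary expresses the improvement in percentage points, which is the unit in which differences between accuracy values are normally compared.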

Claims (1)

1. An electronic medical record feature extraction method based on a hybrid neural network, characterized in that the method comprises the following steps in order:
(1) Acquiring a data set: selecting an electronic medical record text data set provided by CCKS 2017;
(2) Data preprocessing: processing the data set by means such as text word segmentation, stop-word removal, word frequency statistics, and feature representation to obtain discretized electronic medical record data;
(3) Obtaining word vector representations of the words: using Google's word2vec model to map words from a high-dimensional space to a low-dimensional space as distributed representations while preserving the positional relationships between word vectors, thereby addressing both vector sparsity and the representation of semantic relations;
(4) Constructing a TextCNN model to learn the associations between adjacent words and capture the local features of the text; the TextCNN model comprises a convolutional layer and a max pooling layer; the convolutional layer learns different context-dependent features from input matrices of different sizes; following the idea of the local receptive field, each hidden-layer node is connected only to a sufficiently small local region of the input rather than to every input point, and the connection weights are shared among some neurons in the same layer, which greatly reduces the number of weight parameters to be trained; a plurality of convolution kernels in the convolutional layer are convolved with the output of the Word2vec model, and kernels with different window sizes let the network extract different features of the sentence automatically; the results of convolving each kernel with the sentence are concatenated to give the output of the convolutional layer; the pooling layer down-samples the features, which enhances the robustness of the model and effectively improves its performance; after each convolution, pooling reduces the size of the feature map and simplifies the information output by the convolutional layer; the method uses max pooling, taking the maximum value of each vector output by the convolutional layer to extract the most important feature information and concatenating these maxima into a vector that forms the output of the pooling layer; max pooling lets the network automatically extract the most useful features in the sentence;
(5) Constructing a Bi-LSTM model to memorize the acquired semantic information, capture contextual associations, and understand the semantics of the words as fully as possible; the constructed Bi-LSTM model comprises a bidirectional LSTM layer, an aggregation layer, and a max pooling layer; the bidirectional LSTM layer serves as the feature extraction part: two LSTM networks are built to acquire information from two opposite directions, which helps capture the long-range dependencies of sentences and the deep semantic expression of the text as a whole, and the two networks receive the same input; the strength of the LSTM lies in its three special gates, the input gate, the forget gate, and the output gate, which control the memory of the neural network; the aggregation layer concatenates the forward and backward output vectors produced by the bidirectional LSTM layer; and because the number of words differs from one input text to another, the pooling operation is used to obtain a fixed-length feature vector;
(6) Designing fully connected layers for feature fusion: two fully connected layers are designed to fuse the features extracted by the TextCNN and Bi-LSTM models and generate the deep word vector features ultimately used for chronic disease classification; before the first fully connected layer, the method uses the concat() method of the TensorFlow framework to fuse the output features of TextCNN and Bi-LSTM; the fused features are taken as the input of the first fully connected layer, and a Dropout mechanism is introduced between the first and second fully connected layers, which randomly deactivates part of the units at each iteration so that weight updates do not depend on a fixed subset of features and overfitting is prevented.
CN201910896154.6A 2019-09-22 2019-09-22 Electronic medical record feature extraction method based on hybrid neural network Pending CN110569511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910896154.6A CN110569511A (en) 2019-09-22 2019-09-22 Electronic medical record feature extraction method based on hybrid neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910896154.6A CN110569511A (en) 2019-09-22 2019-09-22 Electronic medical record feature extraction method based on hybrid neural network

Publications (1)

Publication Number Publication Date
CN110569511A (en) 2019-12-13

Family

ID=68781673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910896154.6A Pending CN110569511A (en) 2019-09-22 2019-09-22 Electronic medical record feature extraction method based on hybrid neural network

Country Status (1)

Country Link
CN (1) CN110569511A (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111429985B (en) * 2020-03-02 2023-10-27 北京嘉和海森健康科技有限公司 Electronic medical record data processing method and system
CN111429985A (en) * 2020-03-02 2020-07-17 北京嘉和海森健康科技有限公司 Electronic medical record data processing method and system
CN112287665A (en) * 2020-10-19 2021-01-29 南京南邮信息产业技术研究院有限公司 Chronic disease data analysis method and system based on natural language processing and integrated training
CN112287665B (en) * 2020-10-19 2024-05-03 南京南邮信息产业技术研究院有限公司 Chronic disease data analysis method and system based on natural language processing and integrated training
CN112309519A (en) * 2020-10-26 2021-02-02 浙江大学 Electronic medical record medication structured processing system based on multiple models
CN112465075A (en) * 2020-12-31 2021-03-09 杭银消费金融股份有限公司 Metadata management method and system
CN112465075B (en) * 2020-12-31 2021-05-25 杭银消费金融股份有限公司 Metadata management method and system
CN113239192A (en) * 2021-04-29 2021-08-10 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113239192B (en) * 2021-04-29 2024-04-16 湘潭大学 Text structuring technology based on sliding window and random discrete sampling
CN113761201A (en) * 2021-08-27 2021-12-07 河北工程大学 Pre-hospital emergency information processing device
CN113761201B (en) * 2021-08-27 2023-12-22 河北工程大学 Pre-hospital first-aid information processing device
CN113571199A (en) * 2021-09-26 2021-10-29 成都健康医联信息产业有限公司 Medical data classification and classification method, computer equipment and storage medium
CN115563286A (en) * 2022-11-10 2023-01-03 东北农业大学 Knowledge-driven milk cow disease text classification method
CN115563286B (en) * 2022-11-10 2023-12-01 东北农业大学 Knowledge-driven dairy cow disease text classification method

Similar Documents

Publication Publication Date Title
CN110569511A (en) Electronic medical record feature extraction method based on hybrid neural network
CN110866117B (en) Short text classification method based on semantic enhancement and multi-level label embedding
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN107992597B (en) Text structuring method for power grid fault case
CN109992783B (en) Chinese word vector modeling method
CN106650813B (en) A kind of image understanding method based on depth residual error network and LSTM
EP3338280B1 (en) Spoken language understanding system
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
US11941366B2 (en) Context-based multi-turn dialogue method and storage medium
KR102008845B1 (en) Automatic classification method of unstructured data
WO2020211720A1 (en) Data processing method and pronoun resolution neural network training method
CN109003601A (en) A kind of across language end-to-end speech recognition methods for low-resource Tujia language
CN106446526A (en) Electronic medical record entity relation extraction method and apparatus
CN106652999A (en) System and method for voice recognition
CN107316654A (en) Emotion identification method based on DIS NV features
CN110298036B (en) Online medical text symptom identification method based on part-of-speech incremental iteration
CN111368088A (en) Text emotion classification method based on deep learning
CN112151183A (en) Entity identification method of Chinese electronic medical record based on Lattice LSTM model
CN111078833A (en) Text classification method based on neural network
CN114756681B (en) Evaluation and education text fine granularity suggestion mining method based on multi-attention fusion
CN114860930A (en) Text classification method and device and storage medium
CN110633467A (en) Semantic relation extraction method based on improved feature fusion
CN110688834A (en) Method and equipment for rewriting intelligent manuscript style based on deep learning model
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN110134950A (en) A kind of text auto-collation that words combines

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20191213