CN114188022A

CN114188022A - Clinical children cough intelligent pre-diagnosis system based on textCNN model

Info

Publication number: CN114188022A
Application number: CN202111521359.XA
Authority: CN
Inventors: 俞刚; 朱珠; 李竞; 张洪健; 陈思宇; 钟千惠; 王颖硕; 王玉琪
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-03-15

Abstract

The invention discloses an intelligent pre-diagnosis system for clinical children's cough based on a TextCNN model. The system comprises a computer memory, a computer processor and a computer program that is stored in the computer memory and executable on the computer processor. A trained language representation model, a TextCNN-based disease pre-diagnosis model and a multi-label test and examination recommendation model are stored in the computer memory. When executing the computer program, the computer processor performs the following steps: inputting the inquiry information of a clinical child patient into the language representation model to obtain a language feature representation vector; inputting the language feature representation vector into the disease pre-diagnosis model to obtain a disease diagnosis result; and inputting the language feature representation vector and the disease diagnosis result into the test and examination recommendation model to obtain recommended test and examination items. The system provides disease pre-diagnosis and test and examination recommendations for children with cough and improves diagnostic accuracy.

Description

Clinical children cough intelligent pre-diagnosis system based on textCNN model

Technical Field

The invention belongs to the field of medical artificial intelligence, and particularly relates to a clinical children cough intelligent pre-diagnosis system based on a TextCNN model.

Background

Cough is the most common symptom that brings children to see a doctor, and long-term cough can cause a number of complications. Because cough has many possible causes and doctors differ in experience, clinical diagnosis of cough suffers from problems such as misdiagnosis and insufficient or redundant examinations.

Chinese patent publication No. CN105339486A discloses a system and method for collecting samples from patients for diagnosis. The sample collection and analysis system concentrates particles from a patient's coughing, sneezing or breathing in a sample for diagnosis of a patient's respiratory tract infection or other ailment. The sample collection and analysis system has a pre-collection assembly, a collector in fluid communication with the sample reservoir, the pre-collection assembly and collector performing functions in combination: effectively capturing a volume of air expelled by the body, directing the expelled air toward the sample reservoir, and separating a desired particle size from the expelled air into the sample reservoir.

Chinese patent publication No. CN107242857A discloses an intelligent comprehensive diagnosis and treatment system for traditional Chinese medicine based on deep learning, which includes an inspection acquisition subsystem, an auscultation acquisition subsystem, an inquiry acquisition subsystem, a pulse diagnosis acquisition subsystem and a comprehensive analysis subsystem. The inspection acquisition subsystem acquires local image information of the patient, such as the face and tongue; the auscultation acquisition subsystem acquires sound information such as the patient's voice, breathing and cough; the inquiry acquisition subsystem acquires the patient's symptom information through interactive questions and answers; the pulse diagnosis acquisition subsystem acquires the patient's pulse signals; and the comprehensive analysis subsystem applies deep learning theory and techniques to analyze the data obtained by the other subsystems, produce a diagnosis result and suggest a prescription. That invention combines the four diagnostic methods of traditional Chinese medicine (inspection, listening and smelling, inquiry, and pulse-taking), obtains comprehensive and detailed diagnosis results by means of deep learning, and provides convenience for patients seeking medical care.

However, the prior art focuses only on the diagnosis stage and relies on intensive feature engineering. For the pre-diagnosis stage, in which tests and examinations have not yet been carried out, it offers no targeted method for helping primary-care doctors pre-judge the disease and recommend tests and examinations from the inquiry information. In actual diagnosis and treatment, collecting particles from a patient's cough, sneeze or breath for auxiliary diagnosis requires dedicated equipment and is not convenient in clinical practice, and most children with cough are not necessarily diagnosed by traditional Chinese medicine, so the prior art lacks generality for wide adoption. For most patients, by contrast, the inquiry is a necessary first step in the treatment process and directly affects the doctor's subsequent judgments and decisions. Intelligent pre-diagnosis assistance based on patient inquiry information and the diagnosis and treatment experience of senior doctors can positively help primary-care doctors at the initial stage of a visit. Therefore, there is a need for a technical solution for the pre-diagnosis of cough in children.

Disclosure of Invention

The invention provides an intelligent pre-diagnosis system for clinical children's cough based on a TextCNN model, which can provide disease pre-diagnosis and test and examination recommendations for children with cough and improve diagnostic accuracy.

An intelligent pre-diagnosis system for clinical children's cough based on a TextCNN model, comprising a computer memory, a computer processor and a computer program stored in the computer memory and executable on the computer processor, the computer memory having stored therein a trained language representation model, a TextCNN-based disease pre-diagnosis model and a multi-label test and examination recommendation model;

the computer processor, when executing the computer program, performs the steps of:

inputting inquiry information of clinical children into a language representation model to obtain a language feature representation vector;

inputting the language feature representation vector into the disease pre-diagnosis model to obtain a disease diagnosis result;

and inputting the language feature representation vector and the disease diagnosis result into the test and examination recommendation model to obtain recommended test and examination items.
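The three steps above can be sketched as a simple chain of function calls. This is only an illustrative sketch: the function `pre_diagnose`, the toy stand-in models and the 0.5 decision threshold are assumptions made for the example and are not specified in the patent.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def pre_diagnose(
    inquiry_text: str,
    embed: Callable[[str], Vector],
    diagnose: Callable[[Vector], Sequence[float]],
    recommend: Callable[[Vector, int], Sequence[float]],
    threshold: float = 0.5,  # assumed cut-off for recommending an item
) -> Tuple[int, List[int]]:
    """Chain the three stored models: embedding -> disease -> test items."""
    features = embed(inquiry_text)        # language feature representation vector
    disease_probs = diagnose(features)    # one probability per disease class
    top1 = max(range(len(disease_probs)), key=lambda i: disease_probs[i])
    item_probs = recommend(features, top1)  # one probability per test/examination item
    items = [i for i, p in enumerate(item_probs) if p >= threshold]
    return top1, items


# Toy stand-ins for the trained models (hypothetical, for illustration only).
def toy_embed(text: str) -> Vector:
    return [float(len(text)), float(text.count("cough"))]


def toy_diagnose(features: Vector) -> List[float]:
    return [0.2, 0.7, 0.1]


def toy_recommend(features: Vector, disease: int) -> List[float]:
    return [0.9, 0.3, 0.6] if disease == 1 else [0.1, 0.1, 0.1]


result = pre_diagnose("age 5, cough for 3 days", toy_embed, toy_diagnose, toy_recommend)
```

In the sketch the recommendation step receives both the feature vector and the top-ranked disease, mirroring the third step above.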

Further, the language representation model is based on the Skip-gram model in Word2vec.

Further, a large amount of medical literature data is used to train the language representation model and obtain semantic representation vectors of medical vocabulary in a feature space. The training objective of the Skip-gram model is as follows: for a text sequence $[w_1, w_2, w_3, \ldots, w_T]$ in the training set, take a given word $w_t$ as the center word and maximize the probability $P(w_{t+j} \mid w_t)$ of each context word $w_{t+j}$ within a fixed-size window. The objective function of the Skip-gram model is expressed as:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$

where c is the context window size.
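To make the role of the window size c concrete, the following minimal sketch enumerates the (center word, context word) pairs over which the Skip-gram objective is summed. The function name `skipgram_pairs` and the example tokens are illustrative assumptions.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], c: int) -> List[Tuple[str, str]]:
    """Return every (center, context) pair with offset j in [-c, c], j != 0."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-c, c + 1):
            # Skip the center word itself and positions outside the sequence.
            if j == 0 or not 0 <= t + j < len(tokens):
                continue
            pairs.append((center, tokens[t + j]))
    return pairs


pairs = skipgram_pairs(["child", "cough", "fever", "night"], c=1)
```

Each pair contributes one term $\log P(w_{t+j} \mid w_t)$ to the objective, so a larger c produces more training pairs per word.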

Further, a negative sampling algorithm is adopted so that each training sample updates only a small part of the weights, which accelerates the gradient descent process;

in negative sampling, for a given word w with context c, the word w is a positive example and the other words are negative examples. The negative-sampling-based Skip-gram algorithm selects negative sample words using a unigram distribution in order to reduce computational overhead; the probability of a word being selected as a negative sample depends on its frequency of occurrence, and words that occur more frequently are more likely to be selected as negative sample words. The probability of selecting each word is calculated as follows:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$$

wherein $f(w_i)$ represents the frequency of occurrence of the word $w_i$, and the denominator represents the weighted sum over all words;
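The frequency-based selection probability can be illustrated with a short sketch. It assumes the 3/4 exponent used in the original word2vec implementation as the frequency weighting; the function name `negative_sampling_probs` and the toy corpus are illustrative.

```python
from collections import Counter
from typing import Dict, List


def negative_sampling_probs(tokens: List[str], power: float = 0.75) -> Dict[str, float]:
    """Weight each word by frequency ** power and normalize to a distribution."""
    counts = Counter(tokens)
    weights = {word: n ** power for word, n in counts.items()}
    total = sum(weights.values())  # the weighted sum over all words
    return {word: weight / total for word, weight in weights.items()}


# "cough" occurs 16 times and "fever" once in this toy corpus.
probs = negative_sampling_probs(["cough"] * 16 + ["fever"])
```

With raw frequencies "cough" would be drawn with probability 16/17; the exponent flattens this to 8/9, so rare words are sampled somewhat more often while frequent words remain the most likely negatives.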

the objective of the negative-sampling-based Skip-gram model is to find the parameters θ that maximize the probability that all observations come from the data:

$$\arg\max_{\theta} \prod_{(w,c) \in D} P_{pos} \prod_{(w,c) \in D'} P_{neg}$$

$$P_{pos} = p(D = 1 \mid c, w; \theta)$$

$$P_{neg} = p(D = 0 \mid c, w; \theta)$$

wherein D represents the set of context word pairs, D' represents the set of non-context word pairs, w and c represent a word and its candidate context word, $P_{pos}$ represents the probability that w and c occur together as context words, and $P_{neg}$ represents the probability that w and c do not occur together as context words. The meaning of the objective function is to maximize the probability of the context words within the window.
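As a rough illustration of this objective, the sketch below scores one positive pair together with a few sampled negative words. It assumes the usual word2vec parameterization, in which p(D=1|c,w;θ) is the sigmoid of the dot product of the two word vectors; that parameterization, the function names and the toy vectors are assumptions, not details given in the patent.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def pair_log_likelihood(
    center: Sequence[float],
    context: Sequence[float],
    negatives: Sequence[Sequence[float]],
) -> float:
    """log P_pos for the true pair plus log P_neg for each sampled negative."""
    log_likelihood = math.log(sigmoid(dot(center, context)))
    for negative in negatives:
        # P_neg = 1 - sigmoid(x) = sigmoid(-x)
        log_likelihood += math.log(sigmoid(-dot(center, negative)))
    return log_likelihood


# Vectors that agree with the data score higher than vectors that contradict it.
good = pair_log_likelihood([1.0, 0.0], [2.0, 0.0], [[-2.0, 0.0]])
bad = pair_log_likelihood([1.0, 0.0], [-2.0, 0.0], [[2.0, 0.0]])
```

Only the vectors of the center word, the true context word and the sampled negatives appear in the expression, which is why each training step updates a small part of the weights.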

Furthermore, the disease pre-diagnosis model adopts a TextCNN model. The text of the case record is one-hot encoded and input into the trained language representation model, the language feature representation vector of the sentence is obtained through word2vec word embedding, and this vector is used as the input of the convolutional layer of the TextCNN model. The network structure of the TextCNN model is as follows:

In the convolutional layer, kernel_size is set to (3, 4, 5). To avoid losing word vector information, the width of each convolution kernel is set equal to the dimension of the word vector, and each kernel_size has 128 output channels. In the network initialization stage, the Glorot normal initialization method is adopted. A Batch Normalization layer is added to the TextCNN network to readjust the data distribution before pooling, which improves the stability of the model during training. Pooling generates 3 feature vectors of length 128, which are concatenated into a 384-dimensional vector and then passed through dropout.
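The shape arithmetic of this structure (three kernel sizes, 128 channels each, 384 pooled features) can be checked with a small pure-Python sketch. It is deliberately simplified: the kernels are random Gaussian stand-ins rather than Glorot-initialized trained weights, and Batch Normalization, the activation function and dropout are omitted. All names and the toy input are illustrative assumptions.

```python
import random
from typing import List, Sequence

Matrix = List[List[float]]


def conv_1max(sentence: Matrix, kernel: Matrix) -> float:
    """Slide one full-width kernel over the sentence and keep the largest response."""
    k = len(kernel)
    responses = []
    for start in range(len(sentence) - k + 1):
        window = sentence[start:start + k]
        responses.append(
            sum(x * w for row, krow in zip(window, kernel) for x, w in zip(row, krow))
        )
    return max(responses)


def textcnn_features(
    sentence: Matrix,
    kernel_sizes: Sequence[int] = (3, 4, 5),
    channels: int = 128,
    seed: int = 0,
) -> List[float]:
    """Return one 1-max pooled feature per (kernel size, channel)."""
    rng = random.Random(seed)
    dim = len(sentence[0])  # kernel width equals the word-vector dimension
    features = []
    for k in kernel_sizes:
        for _ in range(channels):
            kernel = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(k)]
            features.append(conv_1max(sentence, kernel))
    return features


# Toy sentence: 10 words, each an 8-dimensional embedding.
rng = random.Random(1)
sentence = [[rng.uniform(-1.0, 1.0) for _ in range(8)] for _ in range(10)]
features = textcnn_features(sentence)
```

Because each kernel spans the full embedding width, it only slides along the word axis, and 1-max pooling reduces every channel to a single number regardless of sentence length.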

In the disease pre-diagnosis model, the Softmax function is calculated as follows:

$$y_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

where $z_i$ is the output of the fully connected layer for class i;

the loss function adopts cross entropy, calculated as follows:

$$E = -\sum_{k}\sum_{i} t_{ki} \log y_{ki}$$

wherein $t_{ki}$ is the probability that sample k belongs to class i, and $y_{ki}$ is the probability predicted by the model that sample k belongs to class i.
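A minimal sketch of the Softmax output and the cross-entropy loss follows. The max-subtraction in `softmax` and the small `eps` inside the logarithm are numerical-stability choices added for the example; the function names and the toy logits are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert output-layer scores into a probability distribution."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(targets: List[List[float]], preds: List[List[float]]) -> float:
    """Sum of -t_ki * log(y_ki) over samples k and classes i."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        t * math.log(y + eps)
        for t_row, y_row in zip(targets, preds)
        for t, y in zip(t_row, y_row)
    )


probs = softmax([2.0, 1.0, 0.0])
loss = cross_entropy([[1.0, 0.0, 0.0]], [probs])
```

With a one-hot target the loss reduces to the negative log-probability that the model assigns to the true disease class.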

When the disease pre-diagnosis model is trained, the patient's medical records are extracted, which comprise 7 attributes: age, chief complaint symptoms, present illness history, past medical history, family history, allergy history and medication status;

stop words and special symbols are removed from the texts of the different attributes, and the texts are then spliced together in natural-language form: for each case record, each attribute name is spliced with its attribute value, and the different attributes are spliced into one short text that serves as a comprehensive text description of the patient information. The comprehensive text is input into the language representation model to obtain the language feature representation vector, which is used to train the disease pre-diagnosis model.
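The splicing step might look like the following sketch. The English attribute names, the whitespace tokenization and the sample record are assumptions made for illustration; the original records are Chinese text, which would need a word segmenter rather than `str.split`.

```python
import re
from typing import Dict, Iterable

# The 7 attributes named above, in an assumed fixed order.
ATTRIBUTES = (
    "age",
    "chief complaint",
    "present illness history",
    "past medical history",
    "family history",
    "allergy history",
    "medication",
)


def splice_record(record: Dict[str, str], stop_words: Iterable[str] = ()) -> str:
    """Join 'attribute name + cleaned attribute value' pairs into one short text."""
    stop = set(stop_words)
    parts = []
    for name in ATTRIBUTES:
        value = re.sub(r"[^\w\s]", " ", record.get(name, ""))  # drop special symbols
        words = [w for w in value.split() if w.lower() not in stop]
        if words:
            parts.append(f"{name} {' '.join(words)}")
    return " ".join(parts)


text = splice_record(
    {"age": "5 years", "chief complaint": "cough, for 3 days!", "allergy history": "none"},
    stop_words={"for"},
)
```

Keeping the attribute name next to its value lets the downstream model tell, for example, a symptom in the chief complaint from the same word in the family history.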

Furthermore, the test and examination recommendation model also adopts a TextCNN model. Its loss function is the binary cross-entropy loss: the average prediction error over all test and examination recommendation categories is used as the error of the whole model, and the parameters are updated through the backpropagation (BP) algorithm.
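The averaged binary cross-entropy over recommendation categories can be sketched as below. The function name, the `eps` guard and the three-item example are illustrative assumptions.

```python
import math
from typing import Sequence


def binary_cross_entropy(targets: Sequence[int], probs: Sequence[float]) -> float:
    """Mean binary cross-entropy over the recommendation categories of one sample."""
    eps = 1e-12  # guards against log(0)
    losses = [
        -(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps))
        for t, p in zip(targets, probs)
    ]
    return sum(losses) / len(losses)


# Three hypothetical items: the first and third were actually ordered by the doctor.
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
```

Each category is scored as an independent yes/no decision, which is what allows several items to be recommended at once.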

Compared with the prior art, the invention has the following beneficial effects:

1. The auxiliary diagnosis service is moved to the front of the care process, which improves diagnostic accuracy at the source and reduces medical costs; pre-diagnosis of children's cough diseases and test and examination recommendations are provided based on inquiry information and clinical electronic medical record data.

2. Semantic understanding of medical text data is enhanced by building a pre-trained language representation model that can be quickly applied to downstream AI tasks through transfer learning.

3. The system is easy to apply, highly general and operable, requires no additional equipment, can be rapidly deployed in primary hospitals, and is of significant value in assisting the diagnosis and treatment work of primary-care doctors.

Drawings

FIG. 1 is a block diagram of an intelligent pre-diagnosis system for cough in children according to the present invention;

FIG. 2 is a block diagram of a Skip-gram model in an embodiment of the present invention;

FIG. 3 is a diagram illustrating a structure of a TextCNN model according to an embodiment of the present invention;

FIG. 4 is a graph comparing the Precision of the Top1 disease pre-diagnosis results of the method of the present invention and existing algorithms;

FIG. 5 is a graph comparing the Recall of the Top1 disease pre-diagnosis results of the method of the present invention and existing algorithms;

FIG. 6 is a graph comparing the F1-Score of the Top1 disease pre-diagnosis results of the method of the present invention and existing algorithms.

Detailed Description

The invention will be described in further detail below with reference to the drawings and examples, which are intended to facilitate the understanding of the invention without limiting it in any way.

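An intelligent pre-diagnosis system for clinical children's cough based on a TextCNN model comprises a computer memory, a computer processor and a computer program that is stored in the computer memory and executable on the computer processor, wherein a trained language representation model, a TextCNN-based disease pre-diagnosis model and a multi-label test and examination recommendation model are stored in the computer memory. As shown in FIG. 1, the computer program, when executed by the computer processor, implements the following steps: inputting the inquiry information of a clinical child patient into the language representation model to obtain a language feature representation vector; inputting the language feature representation vector into the disease pre-diagnosis model to obtain a disease diagnosis result; and inputting the language feature representation vector and the disease diagnosis result into the test and examination recommendation model to obtain recommended test and examination items.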

In the invention, the language representation model is based on word2vec. Its purpose is to generate language feature representation vectors that capture medical semantics more accurately, thereby improving the accuracy of text prediction. In addition, the pre-trained language representation model is transferred to other neural network models by extracting the parameter weights of the model trained on a large corpus, which achieves knowledge transfer: the downstream task model can be fine-tuned from these parameters, and its convergence is greatly accelerated. The invention pre-trains on a large amount of medical literature data to produce a medical language representation model, which is used for word embedding of the text data in the downstream tasks (disease pre-diagnosis and test and examination recommendation) and generates language feature representations that are sensitive to medical language context.

Specifically, the language representation model is based on the Skip-gram model in Word2vec, which learns information from context more effectively by predicting the surrounding context words from the center word.

As shown in FIG. 2, the training objective of the Skip-gram model is as follows: for a text sequence $[w_1, w_2, w_3, \ldots, w_T]$ in the training set, take a given word $w_t$ as the center word and maximize the probability $P(w_{t+j} \mid w_t)$ of each context word $w_{t+j}$ within a fixed-size window. Thus, the objective function of the Skip-gram model is expressed as:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$

where c is the context window size.

The invention adopts a negative sampling algorithm so that each training sample updates only a small part of the weights, which accelerates the gradient descent process. In negative sampling, for a given word w with context c, the word w is a positive example and the other words are negative examples. The negative-sampling-based Skip-gram algorithm selects negative sample words using a unigram distribution in order to reduce computational overhead; the probability of a word being selected as a negative sample depends on its frequency of occurrence, and words that occur more frequently are more likely to be selected as negative sample words. The probability of selecting each word is calculated as follows:

$$P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=0}^{n} f(w_j)^{3/4}}$$

wherein $f(w_i)$ represents the frequency of occurrence of the word $w_i$, and the denominator represents the weighted sum over all words.

The probabilities of a pair (w, c) being a positive or a negative example are:

$$P_{pos} = p(D = 1 \mid c, w; \theta)$$

$$P_{neg} = p(D = 0 \mid c, w; \theta)$$

The negative-sampling-based Skip-gram algorithm reduces the computational cost by sampling the pairs (w, c) ∈ D' that make up the negative sample set.

The TextCNN model is first applied to the disease type prediction task of the pre-diagnosis stage, and the model trained on that task is then transferred to the test and examination recommendation task for fine-tuning.

Specifically, the disease pre-diagnosis model adopts a TextCNN model, whose network structure is shown in FIG. 3.

The network adopts 1-max pooling: the maximum feature is selected from the feature vector generated by each sliding window, and the selected features are then concatenated to form a vector representation. The output layer uses a fully connected structure followed by Softmax, and the Softmax function is calculated as follows:

$$y_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

where $z_i$ is the output of the fully connected layer for class i.

Test and examination recommendation is similar to disease pre-diagnosis and is in essence a classification problem. It is, however, a multi-label classification task: the prediction is a vector in which each dimension corresponds to one test or examination item and takes a binary value (0 means not recommended, 1 means recommended), whereas disease pre-diagnosis requires the model to produce a probability distribution over the diseases.

In the training stage, the corresponding test and examination items are recommended according to the patient's basic information, such as the chief complaint, present illness history, past history and allergy history, together with the disease type diagnosed by the doctor. For the test data, because no definite disease diagnosis exists yet at the pre-diagnosis stage, the disease with the highest probability predicted by the disease pre-diagnosis model is used as the disease type, and tests and examinations are recommended by combining it with the patient's inquiry information.

To enable the model to learn disease characteristics more accurately, the invention adds a text description of the disease prediction result to the features extracted from the EHR data, so that the model can learn the correspondence between diseases and test and examination categories.
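One way to picture this augmentation at test time is to append the Top1 predicted disease to the spliced inquiry text before it is embedded. The sketch below is hypothetical: the label order in `DISEASES`, the "pre-diagnosis" prefix and the function name are assumptions; only the 12 disease labels themselves come from the evaluation described in this document.

```python
from typing import List, Sequence

# The 12 disease labels used in the evaluation; the index order here is assumed.
DISEASES: List[str] = [
    "AURI", "Bronchitis", "Asthma", "Pharyngitis", "Pneumonia", "Rhinitis",
    "Tonsillitis", "Laryngitis", "Nasosinusitis", "FLU", "FBAO", "Others",
]


def recommendation_text(inquiry_text: str, disease_probs: Sequence[float]) -> str:
    """Append the Top1 predicted disease to the inquiry text."""
    if len(disease_probs) != len(DISEASES):
        raise ValueError("expected one probability per disease class")
    top1 = max(range(len(disease_probs)), key=lambda i: disease_probs[i])
    return f"{inquiry_text} pre-diagnosis {DISEASES[top1]}"


probs = [0.02] * 12
probs[4] = 0.78
text = recommendation_text("age 5 chief complaint cough 3 days", probs)
```

During training the doctor's recorded diagnosis would take the place of the predicted label, as described above.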

During model training, a TextCNN model is again adopted. The difference is that the loss function is the binary cross-entropy loss: the average prediction error over all test and examination recommendation categories is used as the error of the whole model, and the parameters are updated through the backpropagation (BP) algorithm.

In order to verify the effect of the invention, the clinical children cough intelligent pre-diagnosis system is tested.

Real EHR data were extracted from the respiratory outpatient department of the Children's Hospital, Zhejiang University School of Medicine. After filtering by ICD-10 disease diagnosis codes, a total of 107840 patients were diagnosed with cough-related diseases at the hospital between August 2019 and November 2020, with 181229 visit records in total. 2936 records were excluded because missing information made them unsuitable for the training task. The remaining 178293 records were split at a ratio of 7:3: 133719 samples were used to train the disease pre-diagnosis model and 44574 samples were used as the test set.

The common diseases underlying children's cough were counted, respiratory medicine experts at the hospital were consulted, and the diseases were regrouped according to the similarity of their symptoms, with some fine-grained categories merged. This finally yielded 12 disease types as prediction targets, which were integer-encoded with the labels 0 to 11.

For better disease type prediction, the patients' medical records were extracted, which comprise 7 attributes such as age, chief complaint symptoms, medical history and allergy history. Stop words and special symbols were removed from the texts of the different attributes, and the texts were then spliced together in natural-language form: for each case record, each attribute name was spliced with its attribute value (text), and the different attributes were spliced into one short text that serves as a comprehensive text description of the patient information, on which learning was performed.

The evaluation indexes of the disease pre-diagnosis task and the test and examination recommendation task are based on the confusion matrix. Because disease pre-diagnosis is a multi-class task, the indexes are calculated separately for each category. Let a denote the number of instances that belong to a given class and are correctly classified into it, b the number of instances that do not belong to the class but are wrongly classified into it, c the number of instances that belong to the class but are wrongly classified into other classes, and d the number of instances that do not belong to the class and are not classified into it, as shown in Table 1.

TABLE 1

| | Classified into the class | Classified into other classes |
| Belongs to the class | a | c |
| Does not belong to the class | b | d |

The precision, recall and F1 value are calculated as follows:

Precision:

$$P = \frac{a}{a + b}$$

Recall:

$$R = \frac{a}{a + c}$$

F1 value:

$$F1 = \frac{2 \cdot P \cdot R}{P + R}$$
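Using the counts a, b and c defined above, the per-class indexes can be computed as in the following sketch, which uses the standard definitions of precision, recall and F1. The function name, the zero-division guards and the example counts are illustrative assumptions.

```python
from typing import Tuple


def precision_recall_f1(a: int, b: int, c: int) -> Tuple[float, float, float]:
    """a: correctly assigned to the class, b: wrongly assigned to it, c: missed."""
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for one disease class.
p, r, f1 = precision_recall_f1(a=80, b=20, c=10)
```

The count d does not enter any of the three indexes, which is why they remain informative for rare disease classes where d dominates the confusion matrix.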

For each disease category, the Precision, Recall and F1-Score values were computed separately under the Logistic Regression (LR), gradient boosting decision tree (GBDT), HAN and TextCNN models. The accuracy, macro average and weighted average were then used to evaluate the overall effect of each algorithm on the test data set. They are calculated as follows:

Accuracy:

$$\text{Accuracy} = \frac{C_{TP} + C_{TN}}{C_{Total}}$$

Macro Average:

$$\text{Macro Average} = \frac{1}{n}\sum_{i=1}^{n} FS_i$$

Weighted Average:

$$\text{Weighted Average} = \sum_{i=1}^{n} w_i \cdot FS_i$$

$$w_i = \frac{C_{support}}{C_{Total}}$$

wherein $C_{TP}$ represents the number of instances that truly belong to a certain disease class and are also assigned to that class by the pre-diagnosis, and $C_{TN}$ represents the number of instances that truly do not belong to any of the 12 disease classes and are also judged by the pre-diagnosis to be other diseases. $C_{Total}$ represents the total number of instances in the test set, n denotes the total number of categories, and $FS_i$ denotes the F1-Score value of category i. $w_i$ represents the weight of category i, obtained by dividing the number of instances of that category, $C_{support}$, by the total number of instances in the test set, $C_{Total}$.
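The three overall indexes can be sketched as follows, with the weights taken as C_support divided by C_Total. The function names and the two-class example are illustrative assumptions rather than the patent's data.

```python
from typing import Sequence


def accuracy(c_tp: int, c_tn: int, c_total: int) -> float:
    """Share of test instances whose pre-diagnosis matches the truth."""
    return (c_tp + c_tn) / c_total


def macro_average(f1_scores: Sequence[float]) -> float:
    """Unweighted mean of the per-class F1-Scores."""
    return sum(f1_scores) / len(f1_scores)


def weighted_average(f1_scores: Sequence[float], supports: Sequence[int]) -> float:
    """Mean of the per-class F1-Scores weighted by w_i = C_support / C_Total."""
    total = sum(supports)
    return sum(fs * s / total for fs, s in zip(f1_scores, supports))


# Two hypothetical classes: a common one scored 0.9 and a rare one scored 0.5.
macro = macro_average([0.9, 0.5])
weighted = weighted_average([0.9, 0.5], [90, 10])
```

In this example the macro average (0.70) is lower than the weighted average (0.86), because the macro average counts the rare, poorly predicted class as much as the common one.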

LR and GBDT are traditional machine learning methods, while HAN and TextCNN are deep learning algorithms. The models were trained on children's electronic medical record data to predict the probability of each of the 12 respiratory disease classes from the patient's chief complaint information. The data were divided into 10 folds for cross-validation, and the measured prediction results for the 12 disease classes are shown in FIGS. 4-6. In the figures, the disease types on the abscissa are AURI (acute upper respiratory infection), Bronchitis, Asthma, Pharyngitis, Pneumonia, Rhinitis, Tonsillitis, Laryngitis, Nasosinusitis (sinusitis), FLU (influenza), FBAO (foreign body airway obstruction) and Others (other diseases). The disease pre-diagnosis outputs a 12-dimensional disease probability vector, each dimension of which represents the probability of pre-diagnosing a certain disease from the patient's chief complaint information; the component with the highest probability is the Top1 result. FIGS. 4-6 show the Precision, Recall and F1-Score of the Top1 results predicted by the four algorithms over the 12 disease classes. As can be seen from the figures, the performance of the models trained by the four algorithms broadly tracks their feature extraction capability. The two machine learning models, logistic regression and GBDT, have weak prediction ability, with values even below 0.1 in some disease categories such as Pharyngitis, Tonsillitis and FLU. In contrast, the deep learning methods predict noticeably better, and the model of the invention (TextCNN) has stronger prediction ability than HAN.

The embodiments described above are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only specific embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions and equivalents made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

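1. A TextCNN model-based intelligent pre-diagnosis system for clinical children's cough, comprising a computer memory, a computer processor, and a computer program stored in the computer memory and executable on the computer processor, characterized in that:

the computer memory stores a trained language representation model, a TextCNN-based disease pre-diagnosis model and a multi-label test and examination recommendation model;

the computer processor, when executing the computer program, performs the steps of: inputting the inquiry information of a clinical child patient into the language representation model to obtain a language feature representation vector; inputting the language feature representation vector into the disease pre-diagnosis model to obtain a disease diagnosis result; and inputting the language feature representation vector and the disease diagnosis result into the test and examination recommendation model to obtain recommended test and examination items.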

2. The TextCNN model-based clinical children cough intelligent pre-diagnosis system according to claim 1, wherein the language representation model is based on the Skip-gram model in Word2vec.

3. The clinical children cough intelligent pre-diagnosis system based on the TextCNN model according to claim 2, wherein the language representation model is trained on a large amount of medical literature data to obtain semantic representation vectors of medical vocabulary in a feature space; the training objective of the Skip-gram model is as follows: for a text sequence $[w_1, w_2, w_3, \ldots, w_T]$ in the training set, take a given word $w_t$ as the center word and maximize the probability $P(w_{t+j} \mid w_t)$ of each context word $w_{t+j}$ within a fixed-size window; the objective function of the Skip-gram model is expressed as:

$$\frac{1}{T}\sum_{t=1}^{T}\ \sum_{-c \le j \le c,\ j \ne 0} \log P(w_{t+j} \mid w_t)$$

where c is the context window size.

4. The TextCNN model-based clinical children cough intelligent pre-diagnosis system according to claim 3, wherein a negative sampling algorithm is adopted so that each training sample updates only a small part of the weights, which accelerates the gradient descent process; the probabilities of a pair (w, c) being a positive or a negative example are:

$$P_{pos} = p(D = 1 \mid c, w; \theta)$$

$$P_{neg} = p(D = 0 \mid c, w; \theta)$$

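5. The clinical children cough intelligent pre-diagnosis system based on the TextCNN model according to claim 1, wherein the disease pre-diagnosis model adopts a TextCNN model whose network structure is as follows: in the convolutional layer, kernel_size is set to (3, 4, 5), the width of each convolution kernel is set equal to the dimension of the word vector, and each kernel_size has 128 output channels; the Glorot normal initialization method is adopted in the network initialization stage; a Batch Normalization layer readjusts the data distribution before pooling; and pooling generates 3 feature vectors of length 128, which are concatenated into a 384-dimensional vector and then passed through dropout.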

6. The TextCNN model-based clinical children cough intelligent pre-diagnosis system according to claim 5, wherein in the disease pre-diagnosis model, the Softmax function is calculated as follows:

$$y_i = \frac{e^{z_i}}{\sum_{j} e^{z_j}}$$

7. The clinical children cough intelligent pre-diagnosis system based on the TextCNN model according to claim 1, wherein when the disease pre-diagnosis model is trained, the patient's medical records are extracted, which comprise 7 attributes: age, chief complaint symptoms, present illness history, past medical history, family history, allergy history and medication status;

8. The clinical children cough intelligent pre-diagnosis system based on the TextCNN model according to claim 1, wherein the test and examination recommendation model adopts a TextCNN model, its loss function is the binary cross-entropy loss, the average prediction error over all test and examination recommendation categories is used as the error of the whole model, and the parameters are updated through the backpropagation (BP) algorithm.