CN110299194B

CN110299194B - Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Info

Publication number: CN110299194B
Application number: CN201910490881.2A
Authority: CN
Inventors: 黄青松; 杨承启; 王艺平; 刘利军; 冯旭鹏
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2022-11-08
Anticipated expiration: 2039-06-06
Also published as: CN110299194A

Abstract

The invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, and belongs to the technical field of computer natural language processing. First, a comprehensive feature representation model is used to obtain the comprehensive features of the medical record description and to screen diseases. Second, discrete features are processed with a cross-feature method and input into the linear part; the comprehensive features of the medical record description are fused with the shallow model features and input into a recommendation ranking part built around a gated recurrent unit (GRU). Finally, dozens of case recommendation items are output from hundreds of candidate cases. The invention realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.

Description

Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Technical Field

The invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, and belongs to the technical field of computer natural language processing.

Background

With the rapid development of intelligent medical care, artificial intelligence technology is gradually being integrated into the medical industry, and health information platforms through which patients interact with medical staff and medical institutions have emerged. Current research on clinical auxiliary diagnosis and treatment mainly analyzes and processes massive medical and diagnostic data that exist in multimedia form. Feature extraction from such mass data is therefore indispensable, and how to apply it to personalized diagnosis and treatment is of considerable research significance.

For existing medical record recommendation methods, the text of traditional Chinese medicine electronic medical records contains both semi-structured features and continuous features, so traditional text feature representation methods do not generalize and often have poor accuracy. These methods also depend heavily on existing medical data and therefore cannot learn hidden features well. Moreover, an inquiry platform usually has multiple target users, and the inquiry habits of medical staff differ from one another. Clinical diagnosis also requires the platform to be fast and accurate. As a result, conventional rule-based diagnosis and case recommendation methods fall short of expectations. To address these two problems, the invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, which realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.

Disclosure of Invention

The invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model. The method targets the text of traditional Chinese medicine electronic medical records, achieves a better overall recommendation effect, and improves recommendation efficiency to a certain extent.

The technical scheme of the invention is as follows: a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model comprises the following specific steps:

Step1, first desensitizing the medical text and segmenting it into words; terms specific to traditional Chinese medicine, terms of basic traditional Chinese medicine theory, disease and symptom names unique to traditional Chinese medicine, terms for therapeutic principles and treatments, and prescription names are added to the THULAC lexicon, and THULAC is used for word segmentation to obtain a corpus representation in units of words;

Step2, partitioning the features and mapping the discrete features into real-valued vectors; segmenting the continuous features as in Step1 and obtaining a word vector representation of the corpus with Word2Vec;

Step3, constructing a comprehensive feature representation model based on a threshold convolution variational autoencoder; the two parts of features from the feature partition in Step2 are fused along the feature dimension, with the continuous features represented by the threshold convolution variational autoencoder algorithm; finally, a high-level semantic representation of the traditional Chinese medicine electronic medical record is obtained;

Step4, constructing a similar case recommendation model based on an improved wide-depth model; a separate similar case recommendation model is constructed for each doctor; based on the comprehensive feature representation of the medical records obtained in Step3, dozens of case recommendation items are ranked and output to the doctor.

Further, the Step1 comprises the following specific steps:

Step1.1, removing private information and screening features in the source data of the traditional Chinese medicine electronic medical records: personal information about the patient, such as 'name', 'hospitalization number' and 'home address', is removed from the medical record text; with expert input, items that do not contribute to the extraction of disease features, such as 'occupation', 'marital status', 'nationality' and 'physical examination', are screened out;

Step1.2, adding a dictionary of disease, pathology and medical terms related to traditional Chinese medicine, and segmenting the electronic medical records with the THULAC Chinese word segmentation tool;

Step1.3, because electronic medical record texts with many missing values cannot accurately describe disease features, electronic medical records with fewer than 150 words are removed.
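The three sub-steps of Step1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names are English stand-ins for the items listed above, `preprocess` and `whitespace_segment` are hypothetical helpers, and the whitespace segmenter only stands in for THULAC with a traditional Chinese medicine user dictionary.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical field names; the text lists these items only in translation.
PRIVATE_FIELDS = {"name", "hospitalization number", "home address"}
UNINFORMATIVE_FIELDS = {"occupation", "marital status", "nationality",
                        "physical examination"}
MIN_WORDS = 150  # threshold stated in Step1.3


def preprocess(records: Iterable[Dict[str, str]],
               segment: Callable[[str], List[str]],
               min_words: int = MIN_WORDS) -> List[List[str]]:
    """Drop private/uninformative fields, segment the rest, filter short records."""
    dropped = PRIVATE_FIELDS | UNINFORMATIVE_FIELDS
    corpus = []
    for record in records:
        kept = [text for field, text in record.items() if field not in dropped]
        words = [w for text in kept for w in segment(text)]
        if len(words) >= min_words:  # records with fewer words are removed
            corpus.append(words)
    return corpus


def whitespace_segment(text: str) -> List[str]:
    """Stand-in segmenter for illustration only (assumption, not THULAC)."""
    return text.split()
```

Passing the segmenter as a parameter keeps the filtering logic independent of the segmentation tool, so a real THULAC-based function could be substituted without touching the rest.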

Further, the Step2 comprises the following specific steps:

Step2.1, extracting information such as 'vital signs', 'patient sex' and 'patient age' from the electronic medical record data, and mapping the values into one-dimensional vectors that serve as the discrete features of the electronic medical record;

Step2.2, mapping the segmented words obtained in Step1 into word vectors with the Word2Vec method, and arranging the list of word vectors into a matrix that serves as the continuous feature of the electronic medical record.
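The feature partition in Step2 might look like the sketch below. The sex encoding, the layout of the 'vital signs' entry and the zero vector for unknown words are assumptions made for illustration; `embeddings` stands in for a lookup table produced by a trained Word2Vec model.

```python
from typing import Dict, List, Sequence

SEX_CODES = {"male": 0.0, "female": 1.0}  # hypothetical encoding


def discrete_vector(record: Dict[str, object]) -> List[float]:
    """Map sex, age and vital signs to a flat real-valued vector."""
    vitals = record["vital signs"]  # assumed: dict of measurement name -> number
    return [SEX_CODES[record["patient sex"]],
            float(record["patient age"]),
            *(float(vitals[k]) for k in sorted(vitals))]


def continuous_matrix(words: Sequence[str],
                      embeddings: Dict[str, List[float]],
                      dim: int) -> List[List[float]]:
    """Stack per-word vectors into a len(words) x dim matrix.

    Words missing from the lookup table get a zero vector (assumption).
    """
    return [embeddings.get(w, [0.0] * dim) for w in words]
```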

Further, in Step3, the specific steps of representing the continuous features with the threshold convolution variational autoencoder algorithm are as follows:

Step3.1, encoding with a threshold convolution network, and sending the continuous features obtained in Step2 into a pooling layer to obtain an encoding result; a mean and a variance are computed from the encoding result, a Gaussian distribution is generated, and resampling is performed;

Step3.2, constructing a two-layer stacked CNN model in which the output of a convolution layer without a nonlinear activation function is multiplied element-wise by the output of a convolution layer activated by the sigmoid function:

$$h(X) = (X * W + b) \otimes \sigma(X * V + c)$$

where W and V are the weights of the convolution layers, b and c are the bias terms of the convolution layers, $*$ denotes the convolution operation, $\otimes$ denotes element-wise multiplication, and σ is the sigmoid function that acts as the threshold;
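A one-dimensional version of the threshold convolution can be sketched as below. It assumes the common gated form, a linear convolution multiplied element-wise by a sigmoid-activated convolution, which matches the symbols W, V, b and c listed above; the function names and the single-channel setting are illustrative choices, not taken from the source.

```python
import math
from typing import List


def conv1d(x: List[float], kernel: List[float], bias: float) -> List[float]:
    """Valid 1-D cross-correlation, the operation CNN layers usually apply."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gated_conv1d(x: List[float], w: List[float], b: float,
                 v: List[float], c: float) -> List[float]:
    """(x * w + b) multiplied element-wise by sigmoid(x * v + c)."""
    linear = conv1d(x, w, b)                       # un-activated branch
    gate = [sigmoid(g) for g in conv1d(x, v, c)]   # threshold (gate) branch
    return [l * g for l, g in zip(linear, gate)]
```

With an all-zero gate kernel the gate is 0.5 everywhere, so the output is half the linear convolution; the gate weights therefore decide how much of each position passes through.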

Step3.3, the loss function is based on $\log p(x) = D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) + L(x; \phi, \theta)$, where $L(x; \phi, \theta)$ is the variational lower bound, and the final loss function is minimized; here θ is the parameter being optimized, $\log p(x)$ is the log-likelihood that the model needs to maximize, $D_{KL}$ is the KL divergence, $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ is the decoder, z is the hidden variable, and x is the input variable; $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = 0$ if and only if $q_\phi(z \mid x) = p_\theta(z \mid x)$;

Step3.4, updating the parameters and training the network model with stochastic gradient ascent; a group of samples of the hidden variable z is first drawn at random from the prior distribution $p(z)$ and input into the decoder, which finally outputs a random sample of the data point x; the number of hidden units n may be varied and can be smaller or larger than the number of original features.

Further, the Step4 includes the specific steps of:

Step4.1, defining a logistic regression model

$$y = w^{T} x + b$$

where $x = [x_1, x_2, \ldots, x_d]$ is a vector of d features, the feature set comprising the original input features and the combined features, $w = [w_1, w_2, \ldots, w_d]$ are the parameters of the model, and b is the bias;

Step4.2, defining cross features

$$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$$

where $c_{ki}$ is a Boolean value that is 1 if the i-th feature is part of the k-th transformation $\phi_k$ and 0 otherwise; for binary features, a cross feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;
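For binary inputs, the cross feature described here reduces to a logical AND over the selected features. The sketch below assumes the usual cross-product form (the product of the features whose selector $c_{ki}$ is 1); `cross_feature` is a hypothetical helper name.

```python
from typing import Sequence


def cross_feature(x: Sequence[int], c_k: Sequence[int]) -> int:
    """Product of x_i over the features selected by c_ki (binary x_i, c_ki)."""
    result = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki:            # only constituent features take part in the product
            result *= x_i   # features with c_ki == 0 contribute a factor of 1
    return result


# Three binary features; the first cross combines features 0 and 2,
# the second combines features 0 and 1.
x = [1, 0, 1]
both_present = cross_feature(x, [1, 0, 1])   # all constituents are 1
one_missing = cross_feature(x, [1, 1, 0])    # feature 1 is 0
```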

Step4.3, defining the core GRU layer of the deep module, and adding an additional feed-forward layer between the last layer and the output, with the tanh function as the activation function of the output layer; connections are added between the nodes of the hidden layer, and a gated recurrent unit controls the output of the hidden nodes, so that changes in the modeled features over time are captured effectively;
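A single update of a gated recurrent unit can be sketched as follows. The source does not give the gate equations, so this uses one common GRU formulation with biases omitted; the parameter names (`Wz`, `Uz` and so on) and the interpolation convention `(1 - z) * h + z * candidate` are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x: Vector, h: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU update of hidden state h given input x (biases omitted)."""
    # Update gate: how much of the candidate state replaces the old state.
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    # Reset gate: how much of the old state feeds the candidate.
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]
```

Applying `gru_step` once per element of a sequence, carrying `h` forward, is what lets the deep part follow how features change over time.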

Step4.4, taking as input the features of the shallow part, including cross features formed from region features, weather time-base, search keywords and the like, optimizing the model parameters with mini-batch stochastic optimization, and back-propagating to both the shallow part and the deep gated recurrent part of the model;

step4.5, defining a prediction function of a joint model

The joint output takes the weighted sum of the log odds of the two parts as the predicted value, which is fed to a common loss function for joint training and optimization; a probability value is finally output; the prediction function contains the parameters of the wide-depth model and the parameters of the GRU model (the formula and its symbols are not reproduced in this text); finally, the probabilities are sorted from low to high, and the first 5 are taken as case recommendations.
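Because the prediction formula itself is missing from this text, the following is only a sketch of the usual joint wide-and-deep form: a sigmoid applied to a weighted sum of the two parts' outputs. All names and parameters are hypothetical. The sort direction is left as an explicit argument because the text states a low-to-high order.

```python
import math
from typing import Dict, List


def joint_probability(wide_logit: float, deep_logit: float,
                      w_wide: float = 1.0, w_deep: float = 1.0,
                      bias: float = 0.0) -> float:
    """Sigmoid of the weighted sum of the shallow and deep outputs."""
    s = w_wide * wide_logit + w_deep * deep_logit + bias
    return 1.0 / (1.0 + math.exp(-s))


def rank_cases(probabilities: Dict[str, float], k: int,
               descending: bool) -> List[str]:
    """Return the first k case ids after sorting by probability."""
    ordered = sorted(probabilities, key=probabilities.get, reverse=descending)
    return ordered[:k]
```

Training both parts against one loss on this combined output is what distinguishes joint training from an ensemble of separately trained models.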

To let the comprehensive feature representation model represent the input data set adequately, each data point x in the data set is associated with one or more groups of hidden variables. $q_\phi(z \mid x)$ is the encoder and $p_\theta(x \mid z)$ is the decoder. A sample z is drawn from the probability density function $p(z)$ over the high-dimensional space Z, and a function $f(z; \theta)$ maps the latent variable z to the original data space X; the parameter θ is optimized so that $f(z; \theta)$ is as similar as possible to the real data in TO-EMR. Replacing $f(z; \theta)$ with $p_\theta(x \mid z)$ makes the dependency between x and z explicit through the law of total probability, so the goal is to maximize the probability:

$$p(x) = \int p_\theta(x \mid z)\, p(z)\, dz$$

When a data set with an imbalanced distribution is processed, the traditional mean squared error (MSE) can measure the error between the network output and the desired target value simply, but it does not give good results, so MSE is no longer an effective error measure. The invention constructs another neural network to compute the conditional probability distribution $q_\phi(z \mid x)$, which approximates the true posterior probability $p_\theta(z \mid x)$, and uses the KL divergence (Kullback-Leibler divergence) to measure the difference between the two distributions:

$$D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\right]$$

Rearranging gives:

$$\log p(x) = D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) + L(x; \phi, \theta) \ge L(x; \phi, \theta)$$
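The rearrangement can be spelled out with Bayes' rule. This is the standard variational autoencoder derivation written with the symbols used here; the closed form given for $L(x; \phi, \theta)$ is the usual evidence lower bound and is an assumption about what the source intends by that symbol.

```latex
\begin{aligned}
D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(z \mid x)\big] \\
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log q_\phi(z \mid x) - \log p_\theta(x \mid z) - \log p(z)\big] + \log p(x), \\
\log p(x)
  &= D_{KL}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big) + L(x; \phi, \theta), \\
L(x; \phi, \theta)
  &= \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
     - D_{KL}\big(q_\phi(z \mid x) \,\|\, p(z)\big).
\end{aligned}
```

Since the first KL term is non-negative, maximizing $L(x; \phi, \theta)$ raises a lower bound on $\log p(x)$ without requiring the intractable posterior $p_\theta(z \mid x)$.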

$\log p(x)$ is the log-likelihood that the model needs to maximize. The KL divergence is non-negative, and $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = 0$ if and only if $q_\phi(z \mid x) = p_\theta(z \mid x)$.

… $\sigma_\theta(z))$. Finally, maximizing the objective function is transformed into solving a convex optimization problem.

The invention updates the parameters and trains the network model with stochastic gradient descent, and the loss function is Loss = MSE + $D_{KL}$. In the experiment, a group of samples of the hidden variable z is first drawn at random from the prior distribution $p(z)$ and input into the decoder, which finally outputs a random sample of the data point x. The number of hidden units n may be varied and can be smaller or larger than the number of original features. That is, the data can be converted not only to a lower dimension (an undercomplete representation) but also to a higher dimension (an overcomplete representation).
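The loss Loss = MSE + $D_{KL}$ and the resampling step can be sketched as below. The prior formula is missing from this text, so the sketch assumes the common choice of a standard normal prior and a diagonal Gaussian encoder output; under that assumption the KL term has the closed form used in `kl_to_standard_normal`. All function names are illustrative.

```python
import math
import random
from typing import List


def kl_to_standard_normal(mu: List[float], log_var: List[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def reparameterize(mu: List[float], log_var: List[float],
                   rng: random.Random) -> List[float]:
    """Resample z = mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def mse(x: List[float], x_hat: List[float]) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x: List[float], x_hat: List[float],
             mu: List[float], log_var: List[float]) -> float:
    """Reconstruction error plus KL regularizer, as stated in the text."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

The length of `mu` is the number of hidden units n, which is independent of the length of `x`, so both undercomplete and overcomplete representations fit this interface.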

The invention has the beneficial effects that:

1. Different feature representation algorithms are used for data of different structural types in the text of the traditional Chinese medicine electronic medical record. For discrete feature items, a structured mapping method turns the semi-structured data into numerical vectors. For continuous feature items, the threshold convolution variational autoencoder algorithm performs feature representation and automatically extracts high-level semantic information from the traditional Chinese medicine electronic medical record. After the case descriptions of the different structural types have been vectorized, the continuous feature vectors and the discrete feature vectors are fused along the feature dimension. The fused features can best represent the original input.

2. The method copes with the imbalanced distribution of medical data. To handle this imbalance, avoid differences in data volume, prevent excessive bias in the training set and ensure sufficient training data, the convolutional variational autoencoder is used because it learns the data distribution better.

3. A personalized similar case recommendation model is constructed. A separate similar case recommendation model is built for each doctor, incorporating the inquiry preferences of the doctor user and thereby addressing individual differences. The method retrieves hundreds of candidate cases of a given disease from a large, high-quality case library, then ranks them and outputs dozens of case recommendation items for the doctor's reference.

In conclusion, the similar case recommendation method based on comprehensive feature representation and an improved wide-depth model realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is a diagram of the comprehensive feature representation model of the present invention;

FIG. 3 is a diagram of the threshold convolution variational autoencoder model in the comprehensive feature representation of the present invention;

FIG. 4 is a diagram of the improved wide-depth model of the present invention.

Detailed Description

Example 1: as shown in fig. 1-4, a method for recommending similar cases based on comprehensive feature representation and improved wide-depth model includes the following steps:

Further, the Step1 includes the specific steps of:

Step1.1, removing private information and screening features in the source data of the traditional Chinese medicine electronic medical records: personal information about the patient, such as 'name', 'hospitalization number' and 'home address', is removed from the medical record text; with expert input, items that do not contribute to the extraction of disease features, such as 'occupation', 'marital status', 'nationality' and 'physical examination', are screened out;

Further, the Step2 comprises the following specific steps:

Where: W and V are the weights of the convolution layers, b and c are the bias terms of the convolution layers, $*$ denotes the convolution operation, and σ is the sigmoid function that acts as the threshold;

Step3.3, the loss function is based on $\log p(x) = D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) + L(x; \phi, \theta)$, where $L(x; \phi, \theta)$ is the variational lower bound, and the final loss function is minimized; here θ is the parameter being optimized, $\log p(x)$ is the log-likelihood that the model needs to maximize, $D_{KL}$ is the KL divergence, $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ is the decoder, z is the hidden variable, and x is the input variable; $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = 0$ if and only if $q_\phi(z \mid x) = p_\theta(z \mid x)$;

Step3.4, updating the parameters and training the network model with stochastic gradient ascent, first sampling from the prior distribution;

Further, the Step4 specifically comprises the following steps:

step4.1, defining a logistic regression model

step4.2, defining Cross features

Where $c_{ki} \in \{0, 1\}$ is a Boolean value that is 1 if the i-th feature is part of the k-th transformation $\phi_k$ and 0 otherwise; for binary features, a cross feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;

Step4.3, defining the core GRU layer of the deep module, and adding an additional feed-forward layer between the last layer and the output, with the tanh function as the activation function of the output layer; connections are added between the nodes of the hidden layer, and a gated recurrent unit controls the output of the hidden nodes, so that changes in the modeled features over time are captured effectively;

Step4.4, taking as input the features of the shallow part, including cross features formed from region features, weather time-base, search keywords and the like, optimizing the model parameters with mini-batch stochastic optimization, and back-propagating to both the shallow part and the deep gated recurrent part of the model;

step4.5, defining a prediction function of a joint model

for the parameters of the width-depth model,

Step5 is as follows: recommendation quality is measured by precision, recall and the F1 value (F-measure). Recommendation efficiency is measured by the training and prediction speed of the model when personalized recommendations are made for doctor users.

The invention measures the recommendation efficiency of the joint model by the time cost per unit generated during recommendation. That is, the average time to recommend a single case list is measured on the training set and the test set respectively and compared with the recommendation speed of other models. Because its deep part is simpler, the proposed wide-depth recommendation model overcomes the vanishing gradient problem, can handle the uneven distribution of medical text data, and is more efficient.

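For doctor user u, let $R_u$ be the set of cases recommended by the model and $L_u$ the set of cases approved by user u; the recommendation precision, recall and F1-score are:

$$Precision = \frac{|R_u \cap L_u|}{|R_u|}, \quad Recall = \frac{|R_u \cap L_u|}{|L_u|}, \quad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$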
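The three measures can be computed per doctor from the two sets. The sketch assumes the standard set-based definitions of precision, recall and F1; returning 0.0 for empty sets is a convention chosen here, not something the source specifies.

```python
from typing import Set, Tuple


def precision_recall_f1(recommended: Set[str],
                        approved: Set[str]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for one doctor's recommended vs approved cases."""
    hits = len(recommended & approved)  # cases both recommended and approved
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(approved) if approved else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For example, with four recommended cases of which two are among three approved cases, precision is 2/4, recall is 2/3 and F1 is 4/7.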

The data used in this example come from a real electronic medical record data set collected by a hospital in Yunnan province. In total, 2090 real and valid electronic medical records were selected, covering 8 diseases (fracture, arthralgia, tendon injury, lumbago, dislocation, gangrene, pyretic arthralgia and osteomyelitis) that are the most common in the hospital's traditional Chinese medicine departments, and the electronic medical record data set was constructed from them. After the data samples were organized, 70% were used as the training set, 10% as the cross-validation set and 20% as the test set.
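The 70/10/20 partition could be produced as in the sketch below. The source does not say whether the split was random or stratified by disease, so the seeded random shuffle is an assumption, and `split_70_10_20` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_70_10_20(records: Sequence[T],
                   seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle and split into training, cross-validation and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)   # seeded for reproducibility
    n_train = len(shuffled) * 7 // 10       # integer arithmetic avoids rounding drift
    n_val = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```

For the 2090 records described here this yields 1463 training, 209 cross-validation and 418 test records.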

To verify the quality of auxiliary diagnosis, the improved wide-depth model based on threshold convolution (CFI) is compared with the traditional classification methods logistic regression and SVM, and with a feature representation and automatic diagnosis method based on a two-step DBN + SVM model. All comparison experiments use the same partition of the TO-EMR data set. The results are shown in Table 1.

Table 1. Comparison of comprehensive index values for different models

As Table 1 shows, the CFI model outperforms the other methods on every index and improves the comprehensive index value by 2.63 percentage points over the existing methods. The results indicate that the CFI model extracts information from electronic medical record text better than the traditional models: its high precision means a low probability of misdiagnosis, its higher recall means a lower probability of missed diagnosis, and its comprehensive F value shows that its overall auxiliary diagnosis performance is strong. Compared with the existing deep model, the CFI model also performs better: feature partitioning improves the accuracy of feature representation, and the variational autoencoder learns the distribution of electronic medical record data better, which strengthens deep feature representation. Taken together, the comparison experiments show that the CFI model has practical value and that comprehensive feature representation and clinical auxiliary diagnosis are feasible and effective.

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, characterized by comprising the following specific steps:

Step1, first desensitizing the medical text and segmenting it into words; terms specific to traditional Chinese medicine, terms of basic traditional Chinese medicine theory, disease and symptom names unique to traditional Chinese medicine, terms for therapeutic principles and treatments, and prescription names are added to the THULAC lexicon, and THULAC is used for word segmentation to obtain a corpus representation in units of words;

Step2, partitioning the features of the electronic medical record text and mapping the discrete features into real-valued vectors; segmenting the continuous features as in Step1 and obtaining a word vector representation of the corpus with Word2Vec;

Step3, constructing a comprehensive feature representation model based on a threshold convolution variational autoencoder; the two parts of features from the feature partition in Step2 are fused along the feature dimension, with the continuous features represented by the threshold convolution variational autoencoder algorithm; finally, a high-level semantic representation of the traditional Chinese medicine electronic medical record is obtained;

Step4, constructing a similar case recommendation model based on an improved wide-depth model; a separate similar case recommendation model is constructed for each doctor; based on the comprehensive feature representation of the medical records obtained in Step3, dozens of case recommendation items are ranked and output to the doctor;

in Step3, the specific steps of representing the continuous features with the threshold convolution variational autoencoder algorithm are as follows:

Step3.1, encoding with a threshold convolution network, and sending the continuous features obtained in Step2 into a pooling layer to obtain an encoding result; a mean and a variance are computed from the encoding result, a Gaussian distribution is generated, and resampling is performed;

Step3.3, the loss function is based on $\log p(x) = D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) + L(x; \phi, \theta)$, where $L(x; \phi, \theta)$ is the variational lower bound, and the final loss function is minimized; here θ is the parameter being optimized, $\log p(x)$ is the log-likelihood that the model needs to maximize, $D_{KL}$ is the KL divergence, $q_\phi(z \mid x)$ is the encoder, $p_\theta(x \mid z)$ is the decoder, z is the hidden variable, and x is the input variable; $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)) = 0$ if and only if $q_\phi(z \mid x) = p_\theta(z \mid x)$;

a group of samples of the hidden variable z is drawn at random and input into the decoder, which finally outputs a random sample of the data point x; the number of hidden units n may be varied and can be smaller or larger than the number of original features;

the specific steps of Step4 are as follows:

step4.1, defining a logistic regression model

step4.2, defining Cross features

Where $c_{ki} \in \{0, 1\}$ is a Boolean value that is 1 when the i-th feature is part of the k-th transformation $\phi_k$ and 0 otherwise; for binary features, a cross feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;

Step4.4, taking as input the features of the shallow part, including cross features formed from region features, weather time-base and search keywords, optimizing the model parameters with mini-batch stochastic optimization, and back-propagating to both the shallow part and the deep gated recurrent part of the model;

Step4.5, defining the prediction function of the joint model

The joint output takes the weighted sum of the log odds of the two parts as the predicted value, which is fed to a common loss function for joint training and optimization; a probability value is finally output; the prediction function contains the parameters of the wide-depth model and the parameters of the GRU model (the formula and its symbols are not reproduced in this text); finally, the probabilities are sorted from low to high.

2. The method of claim 1, wherein the specific steps of Step1 are as follows:

Step1.1, removing private information and screening features in the source data of the traditional Chinese medicine electronic medical records: personal information about the patient, comprising 'name', 'hospitalization number' and 'home address', is removed from the medical record text; with expert input, items that do not contribute to the extraction of disease features, comprising 'occupation', 'marital status', 'ethnic group' and 'physical examination', are screened out;

3. The method of claim 1, wherein the specific steps of Step2 are as follows:

Step2.1, extracting the 'vital signs', 'patient sex' and 'patient age' information from the electronic medical record data, and mapping the values into one-dimensional vectors that serve as the discrete features of the electronic medical record;