CN110299194B - Similar case recommendation method based on comprehensive feature representation and improved wide-depth model - Google Patents

Info

Publication number
CN110299194B
Authority
CN
China
Prior art keywords
model
features
medical record
representation
comprehensive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910490881.2A
Other languages
Chinese (zh)
Other versions
CN110299194A (en)
Inventor
黄青松
杨承启
王艺平
刘利军
冯旭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, and belongs to the technical field of computer natural language processing. First, a comprehensive feature representation model is used to obtain the comprehensive features of the medical record description and to screen diseases. Second, the discrete features are processed with a cross-feature method and fed into the linear part; the comprehensive features of the medical record description are fused with the shallow-model features and fed into a recommendation ranking part whose core is a gated recurrent unit (GRU). Finally, dozens of recommended cases are output from hundreds of candidate cases. The invention realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.

Description

Similar case recommendation method based on comprehensive feature representation and improved wide-depth model
Technical Field
The invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, and belongs to the technical field of computer natural language processing.
Background
With the rapid development of intelligent healthcare, artificial intelligence technology is gradually being integrated into the medical industry, and health information platforms have emerged through which patients interact with medical staff and medical institutions. Current research on clinical auxiliary diagnosis and treatment mainly analyzes and processes massive medical data and diagnostic data that exist in multimedia form. Feature extraction from such massive data is therefore indispensable, and how to apply it to personalized diagnosis and treatment is of profound research significance.
Existing medical record recommendation methods face two problems. First, the text of traditional Chinese medicine electronic medical records contains both semi-structured features and continuous features, so traditional text feature representation methods do not generalize and are often inaccurate; they also depend excessively on existing medical data and therefore cannot learn hidden features well. Second, a consultation platform usually has multiple target users whose consultation habits differ from one another, while clinical diagnosis requires the platform to be fast and accurate, so conventional rule-based diagnosis and case recommendation methods fall short of expectations. To address these two problems, the invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, which realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.
Disclosure of Invention
The invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model. The method targets the text of traditional Chinese medicine electronic medical records, achieves a better overall recommendation effect, and improves recommendation efficiency to a certain extent.
The technical scheme of the invention is as follows: a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model comprises the following specific steps:
Step1, first desensitize the medical text and segment it into words: add distinctive traditional Chinese medicine terms, basic-theory terms, disease and symptom names unique to traditional Chinese medicine, terms for therapeutic principles and treatments, and prescription names to the THULAC lexicon, then segment the text with THULAC to obtain a corpus represented in units of words;
Step2, partition the features: map the discrete features to real-valued vectors; segment the continuous features as in Step1 and obtain word-vector representations of the corpus with Word2Vec;
Step3, construct a comprehensive feature representation model based on a threshold (gated) convolution variational autoencoder: the two parts of the features obtained from the feature partition in Step2 are fused along the feature dimension, the continuous features being represented with the threshold convolution variational autoencoder algorithm; this finally yields a representation of the high-level semantic information in the traditional Chinese medicine electronic medical record;
Step4, construct a similar case recommendation model based on the improved wide-depth model: build a separate similar case recommendation model for each doctor and, using the comprehensive feature representation of the medical records obtained in Step3, rank the candidate cases and output dozens of recommended cases to the doctor.
Further, the specific steps of Step1 are as follows:
Step1.1, remove private information and screen features in the source data of the traditional Chinese medicine electronic medical records: delete the patient's personal information from the medical record text, such as 'name', 'hospital number' and 'home address', and, in consultation with experts, screen out items that do not contribute to extracting disease features, such as 'occupation', 'marital status', 'nationality' and 'physical examination';
Step1.2, add a dictionary of traditional Chinese medicine disease, pathology and medical terms, and segment the electronic medical records with the THULAC Chinese word segmentation tool;
Step1.3, remove electronic medical records of fewer than 150 words, because such texts have many missing values and cannot accurately describe the disease features.
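The preprocessing in Step1.1 to Step1.3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the field names, the `description` key and the `segment` callable (a stand-in for a THULAC-based segmenter with the added dictionary) are hypothetical, and the 150-word threshold is assumed to be counted on the segmented tokens.

```python
# Hypothetical field names; the patent only lists the items in prose.
PRIVACY_FIELDS = {"name", "hospital_number", "home_address"}
NON_CONTRIBUTING_FIELDS = {"occupation", "marital_status",
                           "nationality", "physical_examination"}
MIN_WORDS = 150  # records with fewer words are discarded (Step1.3)


def desensitize(record):
    """Drop private fields and fields screened out as non-contributing."""
    dropped = PRIVACY_FIELDS | NON_CONTRIBUTING_FIELDS
    return {key: value for key, value in record.items() if key not in dropped}


def preprocess(records, segment, min_words=MIN_WORDS):
    """Desensitize each record, segment its free text, filter short records.

    `segment` is any callable mapping a string to a list of words; in the
    patent this role is played by THULAC with a custom dictionary.
    """
    cleaned = []
    for record in records:
        kept = desensitize(record)
        tokens = segment(kept.get("description", ""))
        if len(tokens) >= min_words:
            cleaned.append({**kept, "tokens": tokens})
    return cleaned
```

For a quick check, `str.split` can be passed as `segment` on whitespace-separated toy text; a real pipeline would pass a proper Chinese word segmenter.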
Further, the specific steps of Step2 are as follows:
Step2.1, extract information such as 'vital signs', 'patient sex' and 'patient age' from the electronic medical record data and map the values to a one-dimensional vector that serves as the discrete features of the electronic medical record;
Step2.2, map the segmented words obtained in Step1 to word vectors with Word2Vec, and arrange the list of word vectors into a matrix that serves as the continuous features of the electronic medical record.
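A small Python sketch of the feature partition in Step2 follows. The record keys (`sex`, `age`, `temperature`, `pulse`), the sex encoding and the toy embedding table are illustrative assumptions; in the patent the word vectors come from a trained Word2Vec model, for which a plain dictionary stands in here.

```python
def discrete_vector(record):
    """Map structured items to a one-dimensional numeric vector (Step2.1)."""
    sex_code = {"male": 0.0, "female": 1.0}[record["sex"]]  # assumed encoding
    return [sex_code,
            float(record["age"]),
            float(record["temperature"]),
            float(record["pulse"])]


def continuous_matrix(tokens, embeddings, dim):
    """Stack one word vector per token into a matrix (Step2.2).

    `embeddings` maps a word to a list of `dim` floats; unknown words are
    mapped to a zero vector, which is a simplifying assumption.
    """
    zero = [0.0] * dim
    return [list(embeddings.get(token, zero)) for token in tokens]
```

The resulting matrix has one row per word, so records of different lengths give matrices of different heights; padding or truncation to a fixed length would be needed before convolution and is left out of this sketch.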
Further, in Step3, the continuous features are represented with the threshold convolution variational autoencoder algorithm as follows:
Step3.1, encode with a threshold (gated) convolution network: feed the continuous features obtained in Step2 through the network into a pooling layer to obtain the encoding result; compute a mean and a variance from the encoding result, generate a Gaussian distribution, and resample from it;
Step3.2, construct a two-layer stacked CNN model in which the output of one convolution layer is multiplied element-wise by the output of a second convolution layer activated by the sigmoid nonlinear activation function:
$h(X) = (X * W + b) \otimes \sigma(X * V + c)$
where $X$ is the layer input, $W$ and $V$ are the convolution-layer weights, $b$ and $c$ are the convolution-layer bias terms, $*$ denotes the convolution operation, $\otimes$ denotes element-wise multiplication, and $\sigma$ is the sigmoid threshold (gate) function;
Step3.3, define the loss from the decomposition $\log p(x) = D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) + L(x;\phi,\theta)$ so that the final loss function is minimized; here $\theta$ is a parameter to be optimized, $\log p(x)$ is the log-likelihood that the model needs to maximize, $D_{KL}$ is the KL divergence, $L(x;\phi,\theta)$ is the variational lower bound, $q_\phi(z|x)$ is the encoder, $p_\theta(x|z)$ is the decoder, $z$ is a hidden variable, and $x$ is an input variable; $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = 0$ if and only if $q_\phi(z|x) = p_\theta(z|x)$;
Step3.4, update the parameters and train the network model with stochastic gradient ascent; first, the prior distribution
[formula image BDA0002086967410000031]
is used to randomly sample a group of values of the hidden variable z, which are fed into the decoder, and a random sample of the data point x is finally output; the number of hidden units n is varied and may be smaller or larger than the number of original features.
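The gating in Step3.2 and the resampling in Step3.1 can be sketched in plain Python for a one-dimensional input. This is an illustration under stated assumptions, not the patent's network: the convolution is a valid, unflipped sliding dot product (the usual deep-learning convention), the encoder is assumed to output a mean and a log-variance, and all function names are hypothetical.

```python
import math
import random


def conv1d(x, kernel, bias):
    """Valid 1-D convolution (sliding dot product, no kernel flip)."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_conv1d(x, w, b, v, c):
    """Linear convolution branch multiplied element-wise by a sigmoid gate."""
    linear = conv1d(x, w, b)
    gate = [sigmoid(g) for g in conv1d(x, v, c)]
    return [lin * g for lin, g in zip(linear, gate)]


def reparameterize(mu, log_var, rng):
    """Resample z = mu + sigma * eps with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]
```

With a zero gate kernel the gate is 0.5 everywhere, so `gated_conv1d([1.0, 2.0, 3.0, 4.0], [1.0, 0.0], 0.0, [0.0, 0.0], 0.0)` halves the linear branch. Passing a seeded `random.Random` to `reparameterize` keeps the sampling reproducible.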
Further, the specific steps of Step4 are as follows:
Step4.1, define the logistic regression model
$y = w^{T} x + b$
where $x = [x_1, x_2, \ldots, x_d]$ is a vector of $d$ features, the feature set comprising the original input features and the combined features, $w = [w_1, w_2, \ldots, w_d]$ are the parameters of the model, and $b$ is the bias term;
Step4.2, define the cross features
$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}, \quad c_{ki} \in \{0, 1\}$
where $c_{ki}$ is a Boolean value that is 1 if the $i$-th feature is part of the $k$-th transformation $\phi_k$ and 0 otherwise; for binary features, the cross feature is 1 if and only if all of the combined features are true, and 0 otherwise;
Step4.3, define the GRU layer that forms the core of the deep module, and add an additional feed-forward layer between the last layer and the output, with the tanh function as the activation function of the output layer; add connections between the hidden-layer nodes and control the output of the hidden nodes with gated recurrent units, so that changes in the features over time are modeled effectively;
Step4.4, take the input features of the shallow part, including cross features formed from region features, weather time-base, search keywords and the like, as input, optimize the model parameters with mini-batch stochastic optimization, and back-propagate to both the shallow part and the deep gated recurrent part of the model;
Step4.5, define the prediction function of the joint model
[formula image BDA0002086967410000034]
The weighted sum of the log odds of the joint outputs is used as the predicted value and fed into a common loss function for joint training and optimization, and a probability value is finally output; where
[formula image BDA0002086967410000035]
are the parameters of the wide-depth model and
[formula image BDA0002086967410000036]
are the GRU model parameters; finally, the probabilities are sorted from high to low and the top 5 are taken as the case recommendations.
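The cross feature of Step4.2 and the final ranking of Step4.5 can be sketched as follows. The joint prediction formula itself is not recoverable from this text, so `joint_probability` is only an assumption borrowed from the common wide-and-deep formulation (a sigmoid over the summed wide and deep logits); the ranking assumes that the five most probable cases are the ones recommended. All names are hypothetical.

```python
import math


def cross_feature(x, c_k):
    """Product of the features selected by the 0/1 mask c_k.

    For binary features this is 1 only when every selected feature is 1.
    """
    value = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki:
            value *= x_i
    return value


def joint_probability(wide_logit, deep_logit, bias=0.0):
    """Assumed joint output: sigmoid of the summed wide and deep logits."""
    return 1.0 / (1.0 + math.exp(-(wide_logit + deep_logit + bias)))


def top_k(case_probabilities, k=5):
    """Return the ids of the k cases with the highest probability."""
    return sorted(case_probabilities,
                  key=case_probabilities.get, reverse=True)[:k]
```

For example, `cross_feature([1, 0, 1], [1, 0, 1])` crosses the first and third binary features, and `top_k(scores)` turns a mapping from case id to probability into the recommended list.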
For the comprehensive feature representation model to represent the input data set adequately, each data point x in the data set has one or more groups of hidden variables corresponding to it. $q_\phi(z|x)$ is the encoder and $p_\theta(x|z)$ is the decoder. A sample z is drawn from the probability density function p(z) over the high-dimensional space Z, and the latent variable z is mapped to the original data space X by a function f(z; θ); the parameter θ is optimized to make f(z; θ) as similar as possible to the real data in TO-EMR. Replacing f(z; θ) with $p_\theta(x|z)$ makes the dependency between x and z explicit through the law of total probability, so the probability to be maximized is:
$p(x) = \int p_\theta(x|z)\, p(z)\, dz$
When processing a data set with an unbalanced data distribution, the traditional mean square error (MSE) can simply measure the error between the network output and the desired target value but does not give good results, so MSE is no longer an effective error measure. The invention constructs another neural network to compute the conditional probability distribution $q_\phi(z|x)$, which approximates the true posterior probability $p_\theta(z|x)$, and measures the difference between the two distributions, that is, the similarity between $q_\phi(z|x)$ and $p_\theta(z|x)$, with the Kullback-Leibler (KL) divergence:
$D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = \mathbb{E}_{q_\phi(z|x)}[\log q_\phi(z|x) - \log p_\theta(z|x)]$
Rearranging gives:
$\log p(x) = D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) + L(x;\phi,\theta) \ge L(x;\phi,\theta)$
Here $\log p(x)$ is the log-likelihood that the model needs to maximize. The KL divergence is non-negative, and $D_{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x)) = 0$ if and only if $q_\phi(z|x) = p_\theta(z|x)$.
[formula images BDA0002086967410000041 and BDA0002086967410000042; the only surviving fragment is the closing term "σ_θ(z))"]
Finally, maximizing the objective function is transformed into solving a convex optimization problem.
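The decomposition of log p(x) into a KL term and L(x; φ, θ) follows from a standard variational argument. The sketch below uses only the symbols already defined and assumes that L(x; φ, θ) is the usual evidence lower bound with prior p(z); the patent text does not spell out this last line.

```latex
\begin{aligned}
\log p(x)
  &= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{p_\theta(z|x)}\right] \\
  &= \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{q_\phi(z|x)}{p_\theta(z|x)}\right]
   + \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z|x)}\right] \\
  &= D_{KL}\big(q_\phi(z|x)\,\|\,p_\theta(z|x)\big) + L(x;\phi,\theta), \\
L(x;\phi,\theta)
  &= \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big]
   - D_{KL}\big(q_\phi(z|x)\,\|\,p(z)\big).
\end{aligned}
```

Because the KL divergence is non-negative, raising L(x; φ, θ) raises a lower bound on log p(x), which is why the intractable posterior term never has to be evaluated directly.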
The invention updates the parameters and trains the network model with stochastic gradient descent, using the loss function $Loss = MSE + D_{KL}$. In the experiment, the prior distribution
[formula image BDA0002086967410000043]
is first used to randomly sample a group of values of the hidden variable z, which are then fed into the decoder, and a random sample of the data point x is finally output. The number of hidden units n is varied and may be smaller or larger than the number of original features. That is, the data can be converted not only to a lower dimension, giving an undercomplete representation, but also to a higher dimension, giving an overcomplete representation.
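A minimal Python sketch of a loss of the form MSE + KL is given below. It assumes a diagonal Gaussian encoder that outputs a mean and a log-variance, and a standard normal prior; the prior formula is not recoverable from this text, so that choice is an assumption taken from the standard variational autoencoder. Function names are hypothetical.

```python
import math


def mse(x, x_hat):
    """Mean squared reconstruction error between input and decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def vae_loss(x, x_hat, mu, log_var):
    """Loss = MSE + D_KL, minimized by stochastic gradient descent."""
    return mse(x, x_hat) + kl_to_standard_normal(mu, log_var)
```

The KL term is zero exactly when the encoder outputs zero mean and unit variance, so it acts as a regularizer that pulls the latent code towards the prior while the MSE term rewards faithful reconstruction.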
The invention has the following beneficial effects:
1. Different feature representation algorithms are used for data of different structural types in the text of traditional Chinese medicine electronic medical records. For discrete feature items, a structured mapping method converts the semi-structured data into numerical vectors. For continuous feature items, the threshold convolution variational autoencoder algorithm performs feature representation and automatically extracts the high-level semantic information in the electronic medical record. After the case descriptions of the different structural types have been vectorized, the continuous feature vectors and the discrete feature vectors are fused along the feature dimension. The fused features give a better representation of the original input.
2. The method is suited to the unbalanced distribution of medical data. To cope with this imbalance, avoid differences in data volume, prevent excessive bias in the training set and ensure sufficient training data, the convolution variational autoencoder is used because it learns the data distribution better, which alleviates the imbalance problem.
3. A personalized similar case recommendation model is constructed. A separate similar case recommendation model is built for each doctor, incorporating the doctor's consultation preferences and accounting for individual differences. The method retrieves hundreds of candidate cases of a given disease from a large, high-quality case library, then ranks them and outputs dozens of recommended cases to the doctor for reference.
In conclusion, the similar case recommendation method based on comprehensive feature representation and the improved wide-depth model realizes personalized case recommendation, provides a recommendation ranking model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the comprehensive feature representation model of the present invention;
FIG. 3 is a diagram of the threshold convolution variational autoencoder model in the comprehensive feature representation of the present invention;
FIG. 4 is a diagram of the improved wide-depth model of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a similar case recommendation method based on comprehensive feature representation and an improved wide-depth model comprises Step1 to Step4, with sub-steps Step1.1 to Step4.5, as set out in the Disclosure of Invention above.
Step5 evaluates the method: recommendation quality is measured by precision, recall and F1 value (F-measure), and recommendation efficiency is measured by the training and prediction speed of the model when making personalized recommendations for a doctor user.
The invention measures the recommendation efficiency of the joint model by the time overhead per unit generated in the recommendation process: the average recommendation time of a single case list is compared on the training set and the test set respectively, and against the recommendation speed of other models. Because its deep part is relatively simple, the proposed wide-depth recommendation model avoids vanishing gradients, copes with the uneven distribution of medical text data, and is more efficient.
For doctor user u, let $R_u$ be the set of cases recommended by the model and $L_u$ the set of cases approved by user u; the recommendation precision, recall and F1 score are:
$Precision = \frac{|R_u \cap L_u|}{|R_u|}$
$Recall = \frac{|R_u \cap L_u|}{|L_u|}$
$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$
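The three measures can be computed per doctor with a few lines of Python. This sketch assumes the usual set-based definitions over the recommended set R_u and the approved set L_u, and returns 0.0 where a denominator would be zero; the function name is hypothetical.

```python
def precision_recall_f1(recommended, approved):
    """Set-based precision, recall and F1 for one doctor user."""
    r_u, l_u = set(recommended), set(approved)
    hits = len(r_u & l_u)  # recommended cases the doctor approved
    precision = hits / len(r_u) if r_u else 0.0
    recall = hits / len(l_u) if l_u else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For instance, recommending cases 1 to 4 when the doctor approves cases 2, 4 and 5 gives two hits, so precision is 2/4 and recall is 2/3.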
The data used in this example come from a real electronic medical record data set collected at a hospital in Yunnan province. In total, 2090 real and valid electronic medical records were selected to construct the data set, covering 8 diseases (fracture, arthralgia, tendon injury, lumbago, dislocation, gangrene, pyretic arthralgia and osteomyelitis) that are the most common in the hospital's traditional Chinese medicine departments. After the data samples were organized, 70% were used as the training set, 10% as the cross-validation set and 20% as the test set.
To verify the quality of auxiliary diagnosis, the improved wide-depth model based on threshold convolution (CFI) was compared with the traditional classification methods Logistic Regression and SVM and with a feature representation and automatic diagnosis method based on a two-step DBN + SVM model. All comparative experiments used the same partition of the TO-EMR data set. The results are shown in Table 1.
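The 70% / 10% / 20% partition can be reproduced with a short Python sketch. The shuffle and its seed are assumptions, since the patent does not say how records were assigned to each subset; integer arithmetic is used so that 2090 records split into exactly 1463, 209 and 418.

```python
import random


def split_records(records, seed=0):
    """Shuffle and split into 70% train, 10% validation, 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

With eight disease classes of uneven size, a stratified split per disease may be preferable in practice; that refinement is not described in the source and is omitted here.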
TABLE 1 comprehensive index value comparison table for different models
Figure BDA0002086967410000081
As Table 1 shows, the CFI model outperforms the other models and methods on every index, improving the comprehensive index value by 2.63 percentage points over the existing methods. The results indicate that the CFI model extracts information from electronic medical record text better than the traditional models: the high precision indicates a very low probability of misdiagnosis, the higher recall indicates a lower probability of missed diagnosis, and the comprehensive evaluation index, the F value, indicates that the overall auxiliary diagnosis effect of the model is outstanding. Compared with the existing depth model, the CFI model also performs better: the feature partition representation method effectively improves the accuracy of feature representation, and the variational autoencoder learns the distribution of electronic medical record data better, which improves the capability of deep feature representation. Taken together, the comparison experiments show that the CFI model has practical value and that comprehensive feature representation and clinical auxiliary diagnosis are feasible and effective.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to these embodiments; those skilled in the art can make various changes within the scope of their knowledge without departing from the spirit of the present invention.

Claims (3)

1. A similar case recommendation method based on comprehensive feature representation and an improved wide-depth model, characterized in that the method comprises the following specific steps:
Step1, first desensitizing the medical text and segmenting it into words; adding terms specific to traditional Chinese medicine, basic theoretical terms of traditional Chinese medicine, disease and symptom names unique to traditional Chinese medicine, terms for therapeutic principles and treatments of traditional Chinese medicine, and traditional Chinese medicine prescription terms to the THULAC word bank, and performing word segmentation with THULAC to obtain a corpus representation in units of words;
Step2, partitioning the features of the electronic medical record text, and mapping the discrete features into real-valued vectors; segmenting the continuous features into words as in Step1, and obtaining the word vector representation of the corpus using Word2Vec;
Step3, constructing a comprehensive feature representation model based on a threshold convolution variational autoencoder; first, performing dimension fusion on the two parts of features according to the feature partition in Step2, wherein the continuous features are represented using a variational autoencoder algorithm based on threshold convolution; finally, obtaining a representation of the high-level semantic information in the traditional Chinese medicine electronic medical record;
Step4, constructing a similar case recommendation model based on the improved wide-depth model; constructing a separate similar case recommendation model for each doctor; according to the comprehensive feature representation of the medical records obtained in Step3, outputting tens of recommended cases to the doctor in ranked order;
in Step3, the specific steps of representing the continuous features using the variational autoencoder algorithm based on threshold convolution are as follows:
Step3.1, encoding with a threshold convolution network, and feeding the continuous features obtained in Step2 into a pooling layer to obtain the encoding result; computing a mean and a variance from the encoding result, generating a Gaussian distribution, and resampling;
Step3.2, constructing a two-layer stacked CNN model, and multiplying the output of the convolution layer of the nonlinear activation function by the output of the convolution layer activated by the sigmoid nonlinear activation function
Figure FDA0003851307480000011
Figure FDA0003851307480000012
wherein w and V represent the weights of the convolution layers, b and c represent the bias terms of the convolution layers, x represents the convolution operation, and σ is the threshold convolution function;
Step3.3, the loss function is $\log p(x) = D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x))$, and the final loss function is minimized; where θ is the optimization parameter, $\log p(x)$ is the log-likelihood function that the model needs to maximize, $D_{KL}$ is the KL divergence, $q_\phi(z|x)$ is the encoder, $p_\theta(x|z)$ is the decoder, z is the hidden variable, and x is the input variable; $D_{KL}(q_\phi(z|x)\,\|\,p_\theta(z|x)) = 0$ if and only if $q_\phi(z|x) = p_\theta(z|x)$;
Step3.4, updating the parameters and training the network model by stochastic gradient ascent; first using the prior distribution
Figure FDA0003851307480000021
to randomly sample a group of hidden variable samples z, inputting the samples into the decoder, and finally outputting a random sample of a data point x; different numbers of hidden units n are considered, which can be smaller or larger than the number of original features;
the specific steps of Step4 are as follows:
step4.1, defining a logistic regression model
Figure FDA0003851307480000022
where $x = [x_1, x_2, \ldots, x_d]$ is a vector of d features, the feature set comprising the original input features and the combined features, and $w = [w_1, w_2, \ldots, w_d]$ are the parameters of the model;
Step4.2, defining the cross features
$$\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$$
where $c_{ki} \in \{0, 1\}$ is a Boolean value that is 1 when the i-th feature is part of the k-th transformation $\phi_k$ and 0 otherwise; for binary features, a cross feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;
Step4.3, defining the GRU layer as the core of the depth module, and adding an additional feed-forward layer between the last layer and the output, wherein the tanh function is used as the activation function of the output layer; adding connections between the nodes of the hidden layer, and controlling the output of the hidden nodes with gated recurrent units, so that the temporal dynamics of the features are modeled effectively;
Step4.4, taking as input the features of the shallow part, including the cross features formed from region features, weather time-base and search keywords, optimizing the model parameters stochastically in mini-batches, and back-propagating to the shallow part and the deep gated recurrent part of the model;
Step4.5, defining the prediction function of the joint model
Figure FDA0003851307480000024
in which the weighted sum of logarithms of the joint output is taken as the predicted value and fed to a common loss function for joint training and optimization, and the probability value is finally output; wherein
Figure FDA0003851307480000025
denotes the parameters of the wide-depth model, and
Figure FDA0003851307480000026
denotes the GRU model parameters; and finally, the use probabilities are sorted from low to high.
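The following sketch illustrates three of the building blocks named in the steps above in plain Python, under stated assumptions. It reads Step3.2 as a gated unit in which a linear convolution output is multiplied elementwise by a sigmoid-activated convolution output, reads Step4.2 as a product of selected binary features, and reads Step4.5 as a sigmoid applied to the sum of a wide logit and a deep logit. All function names, the 1-D scalar convolution, and the use of a plain vector in place of the GRU output are hypothetical simplifications, not the claimed implementation.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def conv1d(xs, kernel, bias):
    """Valid 1-D convolution (cross-correlation form) over a scalar sequence."""
    k = len(kernel)
    return [sum(xs[i + j] * kernel[j] for j in range(k)) + bias
            for i in range(len(xs) - k + 1)]


def gated_conv(xs, w, b, v, c):
    """Assumed reading of Step3.2: linear conv output times sigmoid-gated conv output."""
    linear = conv1d(xs, w, b)
    gate = [sigmoid(t) for t in conv1d(xs, v, c)]
    return [a * g for a, g in zip(linear, gate)]


def cross_feature(x, members):
    """Step4.2 idea: product of the binary features whose indices are in `members`."""
    result = 1
    for i in members:
        result *= x[i]
    return result


def joint_probability(wide_inputs, wide_weights, deep_output, deep_weights, bias):
    """Assumed reading of Step4.5: sigmoid of the summed wide and deep logits."""
    logit = bias
    logit += sum(w * f for w, f in zip(wide_weights, wide_inputs))
    logit += sum(w * a for w, a in zip(deep_weights, deep_output))
    return sigmoid(logit)


# Gate weights of zero give a gate of 0.5, so the linear output is halved.
gated = gated_conv([1.0, 2.0, 3.0], w=[1.0, 0.0], b=0.0, v=[0.0, 0.0], c=0.0)

x = [1, 0, 1]
both_on = cross_feature(x, [0, 2])   # features 0 and 2 are both 1
one_off = cross_feature(x, [0, 1])   # feature 1 is 0

# With all weights zero the joint model outputs probability 0.5.
prob = joint_probability(x + [both_on], [0.0] * 4, [0.3], [0.0], 0.0)
```

In a full model the `deep_output` vector would come from the GRU-based depth module and all weights would be learned jointly; here they are fixed only to keep the example checkable by hand.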
2. The method of claim 1, wherein the specific steps of Step1 are as follows:
Step1.1, performing privacy removal and feature screening on the source data of the traditional Chinese medicine electronic medical records: removing the personal privacy information of patients from the medical record text, including names, hospitalization numbers and home addresses; and, in combination with expert opinions, screening out items that do not contribute to the extraction of disease features, including 'occupation', 'marital status', 'ethnic group' and 'physical examination';
Step1.2, adding a dictionary of traditional Chinese medicine disease, pathology and medical terms, and segmenting the electronic medical records with the THULAC Chinese text word segmentation tool;
Step1.3, because electronic medical record texts with many missing values cannot accurately describe the disease features, removing the electronic medical records with fewer than 150 words.
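The screening in Step1.1 and the length filter in Step1.3 can be sketched as follows, assuming each record is a dictionary of fields and has already been segmented into a list of words. The field names and helper functions are hypothetical; the segmentation itself (THULAC with a custom dictionary) is not shown.

```python
# Hypothetical field names for the items listed in Step1.1.
PRIVATE_FIELDS = {"name", "hospitalization_number", "home_address"}
NON_CONTRIBUTING_FIELDS = {
    "occupation", "marital_status", "ethnic_group", "physical_examination",
}
MIN_WORDS = 150  # records with fewer words are removed (Step1.3)


def clean_record(record):
    """Drop privacy fields and fields screened out as non-contributing."""
    dropped = PRIVATE_FIELDS | NON_CONTRIBUTING_FIELDS
    return {key: value for key, value in record.items() if key not in dropped}


def keep_record(tokens):
    """Keep a segmented record only if it has at least MIN_WORDS words."""
    return len(tokens) >= MIN_WORDS


raw = {
    "name": "patient-placeholder",
    "home_address": "address-placeholder",
    "occupation": "teacher",
    "chief_complaint": "lumbago for two weeks",
}
cleaned = clean_record(raw)
```

Only the `chief_complaint` field survives in this example, since the other three fields fall into one of the two dropped groups.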
3. The method of claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, extracting the 'vital sign', 'patient sex' and 'patient age' information from the electronic medical record data, and mapping the values into one-dimensional vectors as the discrete features of the electronic medical record;
Step2.2, mapping the word segmentation content obtained in Step1 into word vectors using the Word2vec method, and arranging the list of word vectors into a matrix as the continuous features of the electronic medical record.
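A rough sketch of the feature partition in Step2.1 and Step2.2 is given below. It assumes a pre-trained embedding table is already available as a dictionary, which stands in for a trained Word2vec model; the vital-sign field names, the sex encoding and the zero vector for unknown words are illustrative assumptions only.

```python
# Hypothetical vital-sign fields; the source does not enumerate them.
VITAL_SIGN_FIELDS = ["temperature", "pulse"]
SEX_CODE = {"male": 0.0, "female": 1.0}  # assumed encoding


def discrete_features(record):
    """Step2.1 idea: map vital signs, sex and age to a one-dimensional vector."""
    vitals = [float(record[field]) for field in VITAL_SIGN_FIELDS]
    return vitals + [SEX_CODE[record["patient_sex"]], float(record["patient_age"])]


def continuous_features(tokens, embeddings, dim):
    """Step2.2 idea: stack the word vectors of a segmented text into a matrix.

    Words missing from the table get a zero vector (an assumption).
    """
    zero = [0.0] * dim
    return [list(embeddings.get(token, zero)) for token in tokens]


record = {"temperature": 36.8, "pulse": 72, "patient_sex": "female", "patient_age": 54}
vector = discrete_features(record)

table = {"骨折": [0.1, 0.2], "疼痛": [0.3, 0.4]}  # toy 2-dimensional embeddings
matrix = continuous_features(["骨折", "疼痛", "未知词"], table, dim=2)
```

The discrete vector and the word-vector matrix are the two parts that Step3 of claim 1 then fuses into the comprehensive feature representation.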
CN201910490881.2A 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model Active CN110299194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910490881.2A CN110299194B (en) 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Publications (2)

Publication Number Publication Date
CN110299194A CN110299194A (en) 2019-10-01
CN110299194B true CN110299194B (en) 2022-11-08

Family

ID=68027589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910490881.2A Active CN110299194B (en) 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Country Status (1)

Country Link
CN (1) CN110299194B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613339B (en) * 2020-05-15 2021-07-09 山东大学 Similar medical record searching method and system based on deep learning
CN112699408A (en) * 2020-12-31 2021-04-23 重庆大学 Wearable device data privacy protection method based on self-encoder
CN116189843B (en) * 2023-04-23 2023-07-07 索思(苏州)医疗科技有限公司 Treatment scheme recommendation method, device, system and storage medium

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
CN105653840B (en) * 2015-12-21 2019-01-04 青岛中科慧康科技有限公司 The similar case recommender system and corresponding method shown based on words and phrases distribution table
CN105718744B (en) * 2016-01-25 2018-05-29 深圳大学 A kind of metabolism mass spectrum screening method and system based on deep learning
WO2017180208A1 (en) * 2016-04-13 2017-10-19 Google Inc. Wide and deep machine learning models
CN111797928A (en) * 2017-09-08 2020-10-20 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN108647251B (en) * 2018-04-20 2021-06-18 昆明理工大学 Recommendation sorting method based on wide-depth gate cycle combination model
CN108897834A (en) * 2018-06-22 2018-11-27 招商信诺人寿保险有限公司 Data processing and method for digging
CN109447244A (en) * 2018-10-11 2019-03-08 中山大学 A kind of advertisement recommended method of combination gating cycle unit neural network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant