CN110299194A

CN110299194A - Similar case recommendation method based on comprehensive feature representation and an improved wide and deep model

Info

Publication number: CN110299194A
Application number: CN201910490881.2A
Authority: CN
Inventors: 黄青松; 杨承启; 王艺平; 刘利军; 冯旭鹏
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2019-10-01
Anticipated expiration: 2039-06-06
Also published as: CN110299194B

Abstract

The present invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, and belongs to the technical field of computer natural language processing. First, a comprehensive feature representation model is used to obtain the comprehensive features of the medical record description and to screen conditions. Second, discrete features are processed with a cross-feature method and fed into the linear part, while the comprehensive features of the medical record description are fused with the shallow model features and fed into a recommendation ranking part built around a gated recurrent unit (GRU). Finally, dozens of recommended cases are output from a pool of several hundred candidate cases. The invention achieves personalized case recommendation and proposes a recommendation ranking model that combines a traditional shallow linear model with a deep network model, improving the accuracy of similar case recommendation.

Description

Similar case recommendation method based on comprehensive feature representation and an improved wide and deep model

Technical field

The present invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, and belongs to the technical field of computer natural language processing.

Background technique

With the rapid development of intelligent healthcare, artificial intelligence is gradually being integrated into the medical industry, and health information platforms on which patients, medical staff, and medical institutions interact have emerged. Current research on clinically assisted diagnosis and treatment relies mainly on analyzing and processing the massive medical and diagnostic data that exist in multimedia form. Feature extraction from such large-scale data is therefore essential, and how to apply it to personalized diagnosis and treatment is likewise of far-reaching research significance.

Compared with the data handled by existing medical history recommendation methods, traditional Chinese medicine (TCM) electronic medical record text has both semi-structured features and continuous features, so traditional text feature representation methods do not generalize and often have poor accuracy. Over-reliance on existing medical data also prevents hidden features from being learned well. In addition, a consultation platform usually serves many kinds of target users, and medical staff differ in their consultation habits. Clinical diagnosis further requires the platform to be fast and accurate. As a result, traditional rule-based diagnosis and case recommendation methods perform poorly. To address these two problems, the invention proposes a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model. It achieves personalized case recommendation and proposes a recommendation ranking model that combines a traditional shallow linear model with a deep network model, improving the accuracy of similar case recommendation.

Summary of the invention

The present invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model. For TCM electronic medical record text, it achieves good recommendation results overall and also improves recommendation efficiency to some extent.

The technical solution of the invention is a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, with the following specific steps:

Step1: First de-identify the medical text, then segment it. TCM-specific terms, basic TCM theory terms, TCM-specific disease names and disease names, liver disease terms, and TCM formula names are added to the THULAC lexicon, and THULAC is used for word segmentation to obtain a word-level representation of the corpus;

Step2: Partition the features. Discrete features are mapped to real-valued vectors; continuous features are segmented as in Step1, and Word2Vec is used to obtain the word vector representation of the corpus;

Step3: Build a comprehensive feature representation model based on a gated convolutional variational autoencoder. First, following the feature partition of Step2, the two groups of features are fused along the feature dimension, with the continuous features represented by the gated convolutional variational autoencoder algorithm; finally, a high-level semantic representation of the TCM electronic medical record is obtained;

Step4: Build a similar case recommendation model based on the improved wide and deep model. A separate similar case recommendation model is built for each doctor; based on the comprehensive medical record feature representation obtained in Step3, the model ranks the candidates and outputs dozens of recommended cases to the doctor.

Further, the specific steps of Step1 are as follows:

Step1.1: Perform privacy removal and feature screening on the TCM electronic medical record source data. Personal privacy information in the record text, such as "name", "admission number", and "home address", is removed; items that, in the experts' opinion, contribute nothing to extracting condition features, such as "occupation", "marital status", "nationality", and "physical examination", are screened out;

Step1.2: Add a dictionary of TCM-related disease, pathology, and medical terminology, and segment the electronic medical records with the THULAC Chinese word segmentation tool;

Step1.3: Because electronic medical record text has many missing values, and such records cannot give an accurate description of condition features, electronic medical records shorter than 150 words are removed.
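
As a minimal illustration of Step1.1 and Step1.3 (not the patented implementation), the sketch below drops private and non-contributing fields from a record and discards records that are too short. The field keys, the `MIN_LENGTH` name, the `description` text field, and the use of a character count as the length measure are assumptions made for the example; THULAC segmentation (Step1.2) is not shown.

```python
# Hypothetical field keys; the source names these fields only in prose.
PRIVATE_FIELDS = {"name", "admission_number", "home_address"}
NON_CONTRIBUTING_FIELDS = {
    "occupation", "marital_status", "nationality", "physical_examination",
}
MIN_LENGTH = 150  # threshold from Step1.3; counted in characters here (assumption)


def clean_record(record):
    """Return a copy of the record without private or non-contributing fields."""
    dropped = PRIVATE_FIELDS | NON_CONTRIBUTING_FIELDS
    return {key: value for key, value in record.items() if key not in dropped}


def filter_short_records(records, text_key="description", min_length=MIN_LENGTH):
    """Keep only records whose free-text field has at least min_length characters."""
    return [r for r in records if len(r.get(text_key, "")) >= min_length]
```

Keeping the field lists as named constants makes the expert-driven screening of Step1.1 easy to revise without touching the filtering logic.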

Further, the specific steps of Step2 are as follows:

Step2.1: Extract information such as "vital signs", "gender", and "patient age" from the electronic medical record data and map the values to a one-dimensional vector, which serves as the discrete features of the electronic medical record;

Step2.2: Map the segmented content obtained in Step1 to word vectors using the Word2Vec method, and arrange the list of word vectors into a matrix, which serves as the continuous features of the electronic medical record.
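
The two feature groups of Step2 can be sketched as follows. This is an illustration under stated assumptions: the field names, the gender coding, and the plain dictionary standing in for a trained Word2Vec lookup are all hypothetical.

```python
GENDER_CODE = {"male": 0.0, "female": 1.0}  # hypothetical coding


def encode_discrete(record):
    """Map discrete fields (Step2.1) to a one-dimensional list of floats."""
    return [
        float(record["temperature"]),
        float(record["pulse"]),
        GENDER_CODE[record["gender"]],
        float(record["age"]),
    ]


def tokens_to_matrix(tokens, embeddings, dim):
    """Stack word vectors (Step2.2) into a len(tokens) x dim matrix.

    `embeddings` stands in for a trained Word2Vec lookup; tokens missing
    from it are mapped to a zero vector (an assumption of this sketch).
    """
    zero = [0.0] * dim
    return [list(embeddings.get(token, zero)) for token in tokens]
```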

Further, in Step3, the specific steps for representing the continuous features with the gated convolutional variational autoencoder algorithm are as follows:

Step3.1: Encode with a gated convolutional network: the continuous features obtained in Step2 are fed into a pooling layer to obtain the encoding result; the mean and variance are computed from this encoding result, and a Gaussian distribution is generated and resampled;

Step3.2: Build a two-layer stacked CNN model in which the output of a convolutional layer without a nonlinear activation function is multiplied by the output of a convolutional layer activated by the sigmoid nonlinear activation function: $h(X) = (X * W + b) \otimes \sigma(X * V + c)$, where W and V are the weights of the convolutional layers, b and c are the bias terms of the convolutional layers, * denotes the convolution operation, and σ is the gating function;

Step3.3: The loss function is based on $\log p(x) = D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) + L(x;\phi,\theta)$, and training drives the final loss function to its minimum. Here θ denotes the parameters to be optimized, log p(x) is the log-likelihood function that the model needs to maximize, D_KL is the KL divergence, L(x; φ, θ) is the variational lower bound, q_φ(z|x) is the encoder, p_θ(x|z) is the decoder, z is the latent variable, and x is the input variable; D_KL(q_φ(z|x) || p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x);

Step3.4: Train the network model and update its parameters with stochastic gradient ascent. First, a group of samples of the latent variable z is drawn at random from the prior distribution p(z) and fed to the decoder, which finally outputs a random sample of a data point x. Different numbers n of hidden units are considered; n may be smaller or larger than the number of original features.
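
The gating of Step3.2 and the resampling of Step3.1 can be illustrated with a small sketch. It assumes a single-channel 1-D input, no padding, and a diagonal Gaussian parameterized by a mean and a log-variance; the function names are hypothetical, and the sliding product is computed as a cross-correlation, as deep learning libraries commonly do for "convolution".

```python
import math
import random


def conv1d(x, kernel, bias):
    """Valid (no padding) 1-D sliding dot product of x with kernel, plus bias."""
    k = len(kernel)
    return [
        sum(x[i + j] * kernel[j] for j in range(k)) + bias
        for i in range(len(x) - k + 1)
    ]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gated_conv1d(x, w, b, v, c):
    """Multiply a linear convolution output by a sigmoid-gated convolution output."""
    linear = conv1d(x, w, b)
    gate = [sigmoid(g) for g in conv1d(x, v, c)]
    return [a * g for a, g in zip(linear, gate)]


def reparameterize(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mean, log_var)]


z = reparameterize([0.0, 0.0], [0.0, 0.0], random.Random(0))
```

Because the gate lies in (0, 1), it scales each position of the linear path, which lets the layer pass or suppress individual features.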

Further, the specific steps of Step4 are as follows:

Step4.1: Define the logistic regression model $y = w^{T}x + b$, where x = [x₁, x₂, …, x_d] is a vector of d features, the feature set contains both the raw input features and the combined features, w = [w₁, w₂, …, w_d] are the model parameters, and b is the bias;

Step4.2: Define the cross feature $\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$, where c_ki ∈ {0, 1} is a Boolean value: c_ki is 1 if the i-th feature is part of the k-th transformation φ_k, and 0 otherwise. For binary features, for example, a combined feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;

Step4.3: Define a GRU layer as the core of the deep module, and add an additional feed-forward layer between the last layer and the output, with the tanh function as the activation function of the output layer. Connections are added between hidden-layer nodes, and a gated recurrent unit controls the output of the hidden nodes, effectively modeling how the features change over time;

Step4.4: The input features of the shallow part include cross features built from regional features, weather and time, search keywords, and the like. The model parameters are optimized with mini-batch stochastic optimization, and the gradients are back-propagated to both the shallow part and the deep gated recurrent part of the model;

Step4.5: Define the joint model prediction function. The weighted sum of the logarithms of the joint outputs is used as the predicted value; this weighted sum is then fed into a common loss function for joint training and optimization. The final output is a probability value. The function is parameterized by all parameters of the wide and deep model and by the parameters of the GRU model. Finally, the cases are sorted by probability from low to high, and the top 5 are taken as the case recommendations.
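
The formulas of Step4 did not survive in this text, so the sketch below follows the commonly published wide and deep formulation as an assumption: a cross-product feature for the wide part and a sigmoid over the summed wide and deep contributions for the joint prediction. The deep activations would come from the GRU module; here they are passed in as a plain list. All function and parameter names are hypothetical, and the sort order is exposed as a parameter because the text states "from low to high".

```python
import math


def cross_feature(x, c_k):
    """Product of the features selected by the 0/1 mask c_k (Step4.2)."""
    value = 1
    for x_i, c_ki in zip(x, c_k):
        if c_ki:
            value *= x_i
    return value


def joint_probability(wide_inputs, w_wide, deep_activations, w_deep, bias=0.0):
    """Sigmoid of the summed wide and deep contributions (assumed joint form)."""
    logit = bias
    logit += sum(w * v for w, v in zip(w_wide, wide_inputs))
    logit += sum(w * a for w, a in zip(w_deep, deep_activations))
    return 1.0 / (1.0 + math.exp(-logit))


def top_k_cases(probabilities, k=5, highest_first=False):
    """Rank case ids by probability and return the first k."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=highest_first)
    return ranked[:k]
```

For binary inputs, `cross_feature` returns 1 only when every selected feature is 1, which matches the behavior described for combined features.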

So that the comprehensive feature representation model can fully represent the input data set, the invention associates each data point x in the data set with one or more groups of latent variables. q_φ(z|x) is the encoder and p_θ(x|z) is the decoder. A sample z is drawn in the high-dimensional space Z according to the probability density function p(z), and a function f(z; θ) maps the latent variable z to the original data space X. The parameters θ are optimized so that f(z; θ) is as close as possible to the real data in TO-EMR. Replacing f(z; θ) with p_θ(x|z) makes the dependence between x and z explicit through the law of total probability. The following probability is maximized:

$$p(x) = \int p_\theta(x|z)\,p(z)\,dz$$

When processing data sets with an imbalanced data distribution, the traditional mean squared error (MSE) approach can measure the error between the network output and the desired target value simply, but it does not give good results, so MSE is no longer an effective error measure. The invention builds a second neural network to compute the conditional probability distribution q_φ(z|x), which approximates the true posterior probability p_θ(z|x), and uses the KL divergence (Kullback-Leibler divergence), which measures the difference between two distributions, to compute the similarity between q_φ(z|x) and p_θ(z|x):

$$D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[\log q_\phi(z|x) - \log p_\theta(z|x)\right]$$

Rearranging gives:

$$\log p(x) = D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) + L(x;\phi,\theta) \ge L(x;\phi,\theta)$$

log p(x) is the log-likelihood function that the model needs to maximize. The KL divergence is non-negative, and it equals zero if and only if q_φ(z|x) = p_θ(z|x). Finally, maximizing the objective function is converted into solving a convex optimization problem.

The invention trains the network model and updates its parameters with stochastic gradient descent; the loss function is Loss = MSE + D_KL. In the experiments, a group of samples of the latent variable z is first drawn at random from the prior distribution p(z) and fed to the decoder, which finally outputs a random sample of a data point x. Different numbers n of hidden units are considered, which may be smaller or larger than the number of original features. The model can therefore not only map the data to a lower dimension, that is, an undercomplete representation, but also represent it in a higher dimension, that is, an overcomplete representation.
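
A minimal sketch of the loss Loss = MSE + D_KL follows. The prior is not legible in this text, so the sketch assumes a standard normal prior and a diagonal Gaussian encoder output (mean and log-variance), for which the KL term has a well-known closed form; the function names are hypothetical.

```python
import math


def mse(x, x_hat):
    """Mean squared reconstruction error between input and decoder output."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ) for a diagonal Gaussian (assumed prior)."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mean, log_var)
    )


def vae_loss(x, x_hat, mean, log_var):
    """Reconstruction error plus KL regularizer, as in Loss = MSE + D_KL."""
    return mse(x, x_hat) + kl_to_standard_normal(mean, log_var)
```

The KL term vanishes when the encoder output equals the assumed prior, so in that case the loss reduces to the reconstruction error alone.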

The beneficial effects of the invention are as follows:

1. Different feature representation algorithms are used for data of different structure types in medical text. The differently structured data in TCM electronic medical record text are represented with different methods. For discrete feature items, a structured mapping is applied to the semi-structured data to obtain numeric vectors. For continuous feature items, the gated convolutional variational autoencoder algorithm is used for feature representation, automatically extracting the high-level semantic information in the TCM electronic medical record. Once the case descriptions of the different structure types have been vectorized, the continuous feature vectors and the discrete feature vectors are fused along the feature dimension. The fused features can represent the original input optimally.

2. The imbalanced distribution of medical data is addressed. To mitigate the imbalance, avoid differences in data volume, prevent excessive bias in the training set, and ensure sufficient training data, a convolutional variational autoencoder is used; it learns the data distribution better and thereby alleviates the imbalance in medical data distribution.

3. Personalized similar case recommendation models are built. A separate similar case recommendation model is built for each doctor, incorporating the clinician user's consultation preferences and resolving individual differences. After obtaining several hundred candidate cases for a given disease from a large, high-quality case library, the method ranks them and outputs dozens of recommended cases to the doctor for reference.

In summary, the similar case recommendation method based on comprehensive feature representation and an improved wide and deep model proposed by the invention achieves personalized case recommendation and proposes a recommendation ranking model that combines a traditional shallow linear model with a deep network model, improving the accuracy of similar case recommendation.

Brief description of the drawings

Fig. 1 is a flow chart of the invention;

Fig. 2 is a diagram of the comprehensive feature representation model of the invention;

Fig. 3 is a diagram of the gated convolutional variational autoencoder model within the comprehensive feature representation of the invention;

Fig. 4 is a diagram of the improved wide and deep model of the invention.

Specific embodiment

Embodiment 1: As shown in Figs. 1-4, a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model is carried out according to Step1 to Step4, including sub-steps Step1.1 to Step4.5, as set out in the Summary of the invention above.

Step5 is as follows: recommendation quality is measured by precision, recall, and the F1 value (F-measure). Recommendation efficiency is measured by the training and prediction speed of the model when it makes personalized recommendations for clinician users.

The invention measures the recommendation efficiency of the joint model by the time cost per unit of output during recommendation. That is, on the training set and the test set respectively, the average time to produce a single recommended case list is compared, and the recommendation speed is compared experimentally with that of other models. Because its deep part is simpler, the wide and deep recommendation model proposed by the invention overcomes the vanishing gradient problem, mitigates the uneven distribution of medical text data, and is more efficient.

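For a clinician user u, let R_u be the set of cases recommended by the model and L_u the set of cases approved by user u. The recommendation precision, recall, and F1-score are:

$$Precision = \frac{|R_u \cap L_u|}{|R_u|}, \qquad Recall = \frac{|R_u \cap L_u|}{|L_u|}, \qquad F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$$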
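
The set-based metrics for a single user can be sketched as below, using the standard definitions of precision, recall, and F1 over the recommended set R_u and the approved set L_u. The function name and the convention of returning 0.0 for empty sets are assumptions of this example.

```python
def precision_recall_f1(recommended, approved):
    """Precision, recall, and F1 for one user's recommended and approved case sets."""
    recommended, approved = set(recommended), set(approved)
    hits = len(recommended & approved)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(approved) if approved else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Two of the four recommended cases are among the three approved ones.
p, r, f1 = precision_recall_f1({1, 2, 3, 4}, {2, 4, 5})
```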

The data used in this case study come from a real electronic medical record data set collected from the Traditional Chinese Medicine Hospital of Yunnan Province. A total of 2090 authentic and valid electronic medical records were selected, covering the most common conditions in the hospital's departments: eight diseases, namely fracture, numbness syndrome, tissue injury, lumbago, dislocation, gangrene of the fingers or toes, heat numbness, and suppurative osteomyelitis. These records make up the electronic medical record data set. After the data samples were organized, 70% were used as the training set, 10% as the cross-validation set, and 20% as the test set.
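
A 70/10/20 partition of the 2090 records can be sketched as follows. The text does not say whether or how the records were shuffled, so the seeded shuffle and the function name are assumptions; integer arithmetic is used so that the split sizes are exact.

```python
import random


def split_records(records, seed=0):
    """Shuffle and split records into 70% train, 10% validation, 20% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded shuffle (assumption)
    n = len(shuffled)
    n_train = n * 70 // 100
    n_val = n * 10 // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


train, val, test = split_records(range(2090))
```

With 2090 records this yields 1463 training, 209 validation, and 418 test records.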

Experiment: To verify the quality of assisted diagnosis, the improved wide and deep model based on gated convolution (CFI) is compared experimentally with the traditional classification methods Logistic Regression and SVM and with a feature representation and automatic diagnosis method based on the two-step DBN+SVM model. The comparison experiments use the same pre-partitioned TO-EMR data set. The results are shown in Table 1.

Table 1: Comparison of the comprehensive metric values of the different models

Table 1 compares the different models and methods. The CFI model outperforms the other methods on every metric, improving the comprehensive metric value by 2.63 percentage points over existing methods. The results show that the CFI model extracts information from electronic medical record text better than conventional models: high precision indicates a very low probability of misdiagnosis, higher recall indicates a lower probability of missed diagnosis, and the comprehensive evaluation metric, the F value, indicates that the model's overall assisted diagnosis performance is strong. Compared with existing deep models, the CFI model also performs better: the feature partitioning approach effectively improves the accuracy of feature representation, and the variational autoencoder learns the distribution of electronic medical record data better, improving the capability of deep feature representation. Taken together, the comparison results validate the CFI model, show that it has significant practical value, and show that using comprehensive feature representation for clinically assisted diagnosis is feasible and effective.

The embodiments of the invention have been described in detail above with reference to the drawings, but the invention is not limited to these embodiments; within the knowledge of a person of ordinary skill in the art, various changes may also be made without departing from the spirit of the invention.

Claims

1. A similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, characterized in that the specific steps of the method are as follows:

Step1: First de-identify the medical text, then segment it. TCM-specific terms, basic TCM theory terms, TCM-specific disease names and disease names, liver disease terms, and TCM formula names are added to the THULAC lexicon, and THULAC is used for word segmentation to obtain a word-level representation of the corpus;

Step2: Partition the features. Discrete features are mapped to real-valued vectors; continuous features are segmented as in Step1, and Word2Vec is used to obtain the word vector representation of the corpus;

Step3: Build a comprehensive feature representation model based on a gated convolutional variational autoencoder. First, following the feature partition of Step2, the two groups of features are fused along the feature dimension, with the continuous features represented by the gated convolutional variational autoencoder algorithm; finally, a high-level semantic representation of the TCM electronic medical record is obtained;

Step4: Build a similar case recommendation model based on the improved wide and deep model. A separate similar case recommendation model is built for each doctor; based on the comprehensive medical record feature representation obtained in Step3, the model ranks the candidates and outputs dozens of recommended cases to the doctor.

2. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of Step1 are as follows:

Step1.1: Perform privacy removal and feature screening on the TCM electronic medical record source data. Personal privacy information in the record text, such as "name", "admission number", and "home address", is removed; items that, in the experts' opinion, contribute nothing to extracting condition features, such as "occupation", "marital status", "nationality", and "physical examination", are screened out;

Step1.2: Add a dictionary of TCM-related disease, pathology, and medical terminology, and segment the electronic medical records with the THULAC Chinese word segmentation tool;

Step1.3: Because electronic medical record text has many missing values, and such records cannot give an accurate description of condition features, electronic medical records shorter than 150 words are removed.

3. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of Step2 are as follows:

Step2.1: Extract information such as "vital signs", "gender", and "patient age" from the electronic medical record data and map the values to a one-dimensional vector, which serves as the discrete features of the electronic medical record;

Step2.2: Map the segmented content obtained in Step1 to word vectors using the Word2Vec method, and arrange the list of word vectors into a matrix, which serves as the continuous features of the electronic medical record.

4. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that, in Step3, the specific steps for representing the continuous features with the gated convolutional variational autoencoder algorithm are as follows:

Step3.1: Encode with a gated convolutional network: the continuous features obtained in Step2 are fed into a pooling layer to obtain the encoding result; the mean and variance are computed from this encoding result, and a Gaussian distribution is generated and resampled;

Step3.2: Build a two-layer stacked CNN model in which the output of a convolutional layer without a nonlinear activation function is multiplied by the output of a convolutional layer activated by the sigmoid nonlinear activation function: $h(X) = (X * W + b) \otimes \sigma(X * V + c)$, where W and V are the weights of the convolutional layers, b and c are the bias terms of the convolutional layers, * denotes the convolution operation, and σ is the gating function;

Step3.3: The loss function is based on $\log p(x) = D_{KL}\left(q_\phi(z|x)\,\|\,p_\theta(z|x)\right) + L(x;\phi,\theta)$, and training drives the final loss function to its minimum. Here θ denotes the parameters to be optimized, log p(x) is the log-likelihood function that the model needs to maximize, D_KL is the KL divergence, L(x; φ, θ) is the variational lower bound, q_φ(z|x) is the encoder, p_θ(x|z) is the decoder, z is the latent variable, and x is the input variable; D_KL(q_φ(z|x) || p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x);

5. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of Step4 are as follows:

Step4.1: Define the logistic regression model $y = w^{T}x + b$, where x = [x₁, x₂, …, x_d] is a vector of d features, the feature set contains both the raw input features and the combined features, w = [w₁, w₂, …, w_d] are the model parameters, and b is the bias;

Step4.2: Define the cross feature $\phi_k(x) = \prod_{i=1}^{d} x_i^{c_{ki}}$, where c_ki ∈ {0, 1} is a Boolean value: c_ki is 1 if the i-th feature is part of the k-th transformation φ_k, and 0 otherwise. For binary features, for example, a combined feature is 1 if and only if all of its constituent features are 1, and 0 otherwise;

Step4.3: Define a GRU layer as the core of the deep module, and add an additional feed-forward layer between the last layer and the output, with the tanh function as the activation function of the output layer. Connections are added between hidden-layer nodes, and a gated recurrent unit controls the output of the hidden nodes, effectively modeling how the features change over time;

Step4.4: The input features of the shallow part include cross features built from regional features, weather and time, search keywords, and the like. The model parameters are optimized with mini-batch stochastic optimization, and the gradients are back-propagated to both the shallow part and the deep gated recurrent part of the model;

Step4.5: Define the joint model prediction function. The weighted sum of the logarithms of the joint outputs is used as the predicted value; this weighted sum is then fed into a common loss function for joint training and optimization. The final output is a probability value. The function is parameterized by all parameters of the wide and deep model and by the parameters of the GRU model. Finally, the cases are sorted by probability from low to high, and the top 5 are taken as the case recommendations.