CN110299194A - Similar case recommendation method based on comprehensive feature representation and an improved wide and deep model - Google Patents

Similar case recommendation method based on comprehensive feature representation and an improved wide and deep model

Info

Publication number
CN110299194A
CN110299194A CN201910490881.2A CN201910490881A CN110299194A CN 110299194 A CN110299194 A CN 110299194A CN 201910490881 A CN201910490881 A CN 201910490881A CN 110299194 A CN110299194 A CN 110299194A
Authority
CN
China
Prior art keywords
model
feature
comprehensive characteristics
similar case
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910490881.2A
Other languages
Chinese (zh)
Other versions
CN110299194B (en)
Inventor
黄青松
杨承启
王艺平
刘利军
冯旭鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN201910490881.2A priority Critical patent/CN110299194B/en
Publication of CN110299194A publication Critical patent/CN110299194A/en
Application granted granted Critical
Publication of CN110299194B publication Critical patent/CN110299194B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/83 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Public Health (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • Biomedical Technology (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Security & Cryptography (AREA)
  • Pathology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The present invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, and belongs to the technical field of computer natural language processing. The invention first obtains a comprehensive feature representation of the medical record description through a comprehensive feature representation model and performs disease screening. Secondly, discrete features are processed with a cross-feature method and fed into the linear part, while the comprehensive features of the medical record description are fused with the shallow-model features and fed into a recommendation ranking part whose core is a gated recurrent unit. Finally, dozens of case recommendations are output on the basis of hundreds of candidate cases. The invention realizes personalized case recommendation, proposes a recommendation ranking algorithm model that combines a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.

Description

Similar case recommendation method based on comprehensive feature representation and an improved wide and deep model
Technical field
The present invention relates to a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, and belongs to the technical field of computer natural language processing.
Background technique
With the rapid development of intelligent medicine, artificial intelligence technology is gradually being integrated into the medical industry, and health care information platforms through which patients, medical staff and medical institutions interact have emerged. Research on clinical auxiliary diagnosis and treatment at this stage mainly relies on analyzing and processing the massive medical data and diagnostic data that exist in multimedia form. Feature extraction from massive data is therefore indispensable, and how to carry out personalized diagnosis and treatment applications is likewise of far-reaching research significance.
Compared with existing medical record recommendation methods, the medical text in traditional Chinese medicine (TCM) electronic health records is semi-structured and contains continuous features, so traditional text representation methods lack generality and their accuracy is often poor. At the same time, relying too heavily on the existing medical data makes it impossible to learn hidden features well. Moreover, a consultation platform usually serves many kinds of target users, and the consultation habits of medical staff differ, while clinical diagnosis requires the platform to be fast and accurate. As a result, traditional rule-based diagnosis and case recommendation methods are ineffective. For the two problems above, the invention proposes a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, realizing personalized case recommendation and proposing a recommendation ranking algorithm model that combines a traditional shallow linear model with a deep network model, which improves the accuracy of similar case recommendation.
Summary of the invention
The present invention provides a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model. For Chinese TCM electronic health record text, it achieves good recommendation quality overall and also improves recommendation efficiency to a certain extent.
The technical scheme of the invention is a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model; the specific steps of the method are as follows:
Step1, first perform medical text de-identification and word segmentation on the medical text; a TCM terminology vocabulary, basic TCM theory nouns, names of diseases unique to TCM together with their disease names, liver-disease nouns and Chinese herbal formula nouns are added to the THULAC dictionary, and segmentation is carried out with THULAC to obtain a word-level representation of the corpus;
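For illustration only, a minimal sketch of this segmentation step, assuming the thulac Python package; the dictionary file name and the example sentence are placeholders, not material from the patent:

```python
# Segment a de-identified TCM record with THULAC after loading a domain
# dictionary (TCM terms, disease names, herbal formula names).
import thulac

segmenter = thulac.thulac(user_dict="tcm_terms.txt", seg_only=True)  # hypothetical dictionary file

record = "患者因腰痛三月余入院，诊断为腰痹病。"          # toy sentence, not real patient data
tokens = segmenter.cut(record, text=True).split()      # word-level corpus representation
print(tokens)
```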
Step2, perform feature partitioning and map the discrete features to real-valued vectors; the continuous features are segmented according to the procedure of Step1, and Word2Vec is used to obtain the word-vector representation of the corpus;
Step3, build a comprehensive feature representation model based on a gated convolutional variational autoencoder; first, according to the feature partitioning in Step2, the two groups of features are fused along the feature dimension, where the continuous features are represented with a gated convolutional variational autoencoder algorithm; finally, a high-level semantic representation of the TCM electronic health record is obtained;
Step4, build a similar case recommendation model based on the improved wide and deep model; a separate similar case recommendation model is built for each doctor; based on the comprehensive feature representation of the medical record obtained in Step3, dozens of case recommendations are ranked and output to the doctor.
Further, the specific steps of the step Step1 are as follows:
Step1.1, perform privacy removal and feature screening on the TCM electronic health record source data, removing personal privacy information of patients from the record text, such as "name", "admission number" and "home address"; combined with expert opinion, items that contribute nothing to the extraction of illness features, such as "occupation", "marital status", "ethnicity" and "physical examination", are screened out;
Step1.2, add a dictionary of TCM-related disease pathology and medical terminology, and segment the electronic health record with the THULAC Chinese word segmentation tool;
Step1.3, because records with many missing values in the electronic health record text cannot present an accurate description of the illness features, electronic health records shorter than 150 characters are removed.
Further, the specific steps of the step Step2 are as follows:
Step2.1, extract information such as "vital signs", "gender" and "patient age" from the electronic health record data, and map their values to a one-dimensional vector, which serves as the discrete features of the electronic health record;
Step2.2, map the segmented content obtained in Step1 to word vectors with the Word2vec method, and arrange the list of word vectors into a matrix, which serves as the continuous features of the electronic health record.
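As a sketch of how Step2.1 and Step2.2 could be realized (field names, scaling constants and dimensions are illustrative assumptions; the Word2Vec implementation shown is gensim's):

```python
# Discrete fields mapped to a one-dimensional numeric vector; segmented text
# mapped to a word-vector matrix with Word2Vec.
import numpy as np
from gensim.models import Word2Vec

segmented_records = [["腰痛", "三月", "入院"], ["骨折", "术后", "复查"]]   # toy corpus

w2v = Word2Vec(sentences=segmented_records, vector_size=100, window=5,
               min_count=1, sg=1)                      # skip-gram word vectors

def discrete_features(sex, age, temperature):
    """Map 'gender', 'patient age' and a vital sign to a one-dimensional vector."""
    return np.array([1.0 if sex == "male" else 0.0, age / 100.0, temperature / 42.0])

def continuous_features(tokens, max_len=64):
    """Stack the word vectors of one record into a fixed-size, zero-padded matrix."""
    mat = np.zeros((max_len, w2v.vector_size))
    for i, tok in enumerate(tokens[:max_len]):
        if tok in w2v.wv:
            mat[i] = w2v.wv[tok]
    return mat
```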
Further, in the step Step3, the specific steps of representing the continuous features with the gated convolutional variational autoencoder algorithm are as follows:
Step3.1, encode with a gated convolutional network, feeding the continuous features obtained in Step2 into a pooling layer to obtain the encoding result; compute a mean and a variance from this encoding result, generate a Gaussian distribution and resample from it;
Step3.2, build a double-layer stacked CNN model in which the output of a convolutional layer with a nonlinear activation function is multiplied element-wise by the output of a convolutional layer activated by a sigmoid nonlinearity, h(X) = (X∗W + b) ⊗ σ(X∗V + c), where W and V denote the weights of the convolutional layers, b and c denote their bias terms, ∗ denotes the convolution operation and σ is the gating function of the gated convolution;
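A minimal PyTorch sketch of the gating mechanism of Step3.2, h(X) = (X∗W + b) ⊗ σ(X∗V + c); kernel size, channel counts and input shape are assumptions:

```python
import torch
import torch.nn as nn

class GatedConv1d(nn.Module):
    """One gated convolution block: a content path multiplied by a sigmoid gate path."""
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.content = nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)  # X*W + b
        self.gate = nn.Conv1d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)     # X*V + c

    def forward(self, x):                                  # x: (batch, channels, length)
        return self.content(x) * torch.sigmoid(self.gate(x))

x = torch.randn(2, 100, 64)                                # word-vector matrix per record
print(GatedConv1d(100, 128)(x).shape)                      # torch.Size([2, 128, 64])
```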
Step3.3, the loss function follows log p(x) = D_KL(q_φ(z|x) ‖ p_θ(z|x)) + L(x; φ, θ), and training drives the final loss to its minimum, where θ denotes the optimized parameters, log p(x) is the log-likelihood the model seeks to maximize, D_KL is the KL divergence, q_φ(z|x) is the encoder, p_θ(x|z) is the decoder, z is the latent variable and x is the input variable; D_KL(q_φ(z|x) ‖ p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x);
Step3.4, train the network model and update its parameters with stochastic gradient ascent; a group of samples of the latent variable z is first drawn at random from the prior distribution p(z), then fed to the decoder, which finally outputs a random sample of a data point x; different numbers n of hidden units are considered, which may be smaller or larger than the number of original features.
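A sketch of the latent step of Step3.1/Step3.3 under the usual VAE formulation (diagonal Gaussian posterior, standard normal prior); the dimensions are assumptions, not values given in the patent:

```python
import torch
import torch.nn as nn

class LatentHead(nn.Module):
    """Produce mean/log-variance from the pooled code, resample z, and return the KL term."""
    def __init__(self, code_dim=128, latent_dim=32):
        super().__init__()
        self.mu = nn.Linear(code_dim, latent_dim)
        self.logvar = nn.Linear(code_dim, latent_dim)

    def forward(self, pooled):                             # pooled: (batch, code_dim)
        mu, logvar = self.mu(pooled), self.logvar(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)            # resampling
        # closed-form KL( q_phi(z|x) || N(0, I) ) for diagonal Gaussians
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return z, kl.mean()
```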
Further, the specific steps of the step Step4 are as follows:
Step4.1, define the logistic regression model P(y = 1 | x) = σ(wᵀx + b), where x = [x1, x2, …, xd] is a vector of d features, the feature set contains both the original input features and combined features, and w = [w1, w2, …, wd] denotes the model parameters;
Step4.2, define the cross feature φ_k(x) = ∏_{i=1}^{d} x_i^{c_{ki}}, where c_{ki} ∈ {0, 1} is a Boolean value: if the i-th feature is part of the k-th transformation φ_k, c_{ki} is 1, otherwise 0; for binary features, for example, the cross feature is 1 if and only if all of the combined features hold, and 0 otherwise;
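A small sketch of Step4.1/Step4.2: cross-product transformations over binary features feeding a logistic-regression wide part; the feature values and weights are toy assumptions:

```python
import numpy as np

def cross_feature(x, mask):
    """phi_k(x) = prod_i x_i^{c_ki}: 1 only if every feature selected by the mask is 1."""
    return int(all(x[i] == 1 for i in np.flatnonzero(mask)))

def wide_part(x_raw, crosses, w, b=0.0):
    """Logistic regression over raw binary features plus their crossed combinations."""
    feats = np.concatenate([x_raw, [cross_feature(x_raw, m) for m in crosses]])
    return 1.0 / (1.0 + np.exp(-(w @ feats + b)))

x = np.array([1, 0, 1])                  # toy binary feature vector
crosses = [np.array([1, 0, 1])]          # cross of features 0 and 2
print(wide_part(x, crosses, w=np.full(4, 0.5)))
```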
Step4.3, define the GRU layer as the core of the deep module, and add an additional feed-forward layer between the last layer and the output, using the tanh function as the activation function of the output layer; connections are added between the hidden-layer nodes, and a gated recurrent unit controls the output of the hidden nodes, modelling the effective variation of the features in the temporal dynamics;
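A PyTorch sketch of the deep module of Step4.3 (a GRU core, an extra feed-forward layer and a tanh output); layer sizes are assumptions:

```python
import torch
import torch.nn as nn

class DeepGRUModule(nn.Module):
    """GRU core whose last hidden state passes through an added feed-forward layer with tanh."""
    def __init__(self, input_dim=132, hidden_dim=64, out_dim=32):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.ff = nn.Linear(hidden_dim, out_dim)           # additional feed-forward layer

    def forward(self, seq):                                # seq: (batch, time, input_dim)
        _, h_last = self.gru(seq)                          # gated unit controls the hidden output
        return torch.tanh(self.ff(h_last.squeeze(0)))
```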
Step4.4, take as the input features of the shallow part the cross features composed of items such as regional features, weather/time and search keywords, optimize the model parameters with mini-batch stochastic optimization, and back-propagate simultaneously to the shallow part and the deep gated-recurrent part of the model;
Step4.5, define the joint model prediction function P(Y = 1 | x) = σ(w_wideᵀ[x, φ(x)] + w_gruᵀ a^(lf) + b); the joint output takes the weighted sum of the log-odds as the predicted value, and this weighted sum is fed to a common loss function for joint training and optimization; the final output is a probability value, where w_wide denotes all the parameters of the wide part and w_gru the parameters of the GRU model; the results are finally sorted by probability from low to high and the first 5 are taken as case recommendations.
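A sketch of how the joint prediction of Step4.5 could be assembled from the two previous sketches (the wide logistic part and the GRU deep part); the combination below follows the usual wide-and-deep joint form and is an assumption, not the patent's exact implementation:

```python
import torch
import torch.nn as nn

class WideDeepGRU(nn.Module):
    """Sum the wide log-odds and the deep (GRU) log-odds, then apply a sigmoid."""
    def __init__(self, wide_dim, deep_module, deep_dim=32):
        super().__init__()
        self.wide = nn.Linear(wide_dim, 1)                 # weights over raw + crossed features
        self.deep = deep_module                            # e.g. DeepGRUModule from the sketch above
        self.deep_out = nn.Linear(deep_dim, 1)             # weights over the GRU representation

    def forward(self, x_wide, x_seq):
        logit = self.wide(x_wide) + self.deep_out(self.deep(x_seq))
        return torch.sigmoid(logit).squeeze(-1)            # joint probability used for ranking

# usage: score each candidate case, sort the probabilities, keep five as recommendations
```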
In order for the comprehensive feature representation model to fully represent the input data set, the present invention associates each data point x in the data set with one or more groups of latent variables. q_φ(z|x) is the encoder and p_θ(x|z) is the decoder. The probability density function p(z) is sampled to obtain samples z in the high-dimensional space Z, and the function f(z; θ) maps the latent variable z to the original data space X; the parameters θ are optimized so that f(z; θ) is as similar as possible to the real data in TO-EMR. Replacing f(z; θ) with p_θ(x|z), the dependence between x and z can be seen clearly from the law of total probability. The following probability is maximized:
p(x) = ∫ p_θ(x|z) p(z) dz
When handling data sets with an unbalanced data distribution, the traditional mean squared error (MSE) approach, although it can simply measure the error between the network output and the desired target value, cannot achieve good results, so mean squared error is no longer an effective error metric. The present invention builds another neural network to compute the conditional probability distribution q_φ(z|x) as an approximation of the true posterior probability p_θ(z|x), and uses the KL divergence (Kullback-Leibler divergence) to measure the difference between the two distributions, i.e. the similarity between q_φ(z|x) and p_θ(z|x). Rearranging gives:
log p(x) = D_KL(q_φ(z|x) ‖ p_θ(z|x)) + L(x; φ, θ) ≥ L(x; φ, θ)
where log p(x) is the log-likelihood the model seeks to maximize. The KL divergence is non-negative, and D_KL(q_φ(z|x) ‖ p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x); maximizing the objective function is thereby finally converted into solving a convex optimization problem.
The present invention trains the network model and updates its parameters with stochastic gradient descent, with loss function Loss = MSE + D_KL. In the experiments, a group of samples of the latent variable z is first drawn at random from the prior distribution p(z), then fed to the decoder, which finally outputs a random sample of a data point x. Different numbers n of hidden units are considered, which may be smaller or larger than the number of original features; the data can thus be transformed not only into a lower-dimensional, i.e. under-complete, representation, but also into a higher-dimensional, i.e. over-complete, representation.
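A compressed sketch of the training step described above (Loss = MSE + D_KL, stochastic gradient updates); the decoder is a placeholder linear layer and the module names refer to the earlier sketches:

```python
import torch
import torch.nn as nn

decoder = nn.Linear(32, 100 * 64)                            # placeholder for p_theta(x|z)
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)  # encoder/latent-head params would be added too

def train_step(x_flat, z, kl):
    """One update: reconstruction error plus the KL term of the variational bound."""
    recon = decoder(z)
    loss = nn.functional.mse_loss(recon, x_flat) + kl        # Loss = MSE + D_KL
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```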
The beneficial effects of the present invention are:
1. Different feature representation algorithms are used for data of different structural types in the medical text. For the different structural types of data in TCM electronic health record text, different methods are used for feature representation. For discrete feature items, a structured mapping method is applied to the semi-structured data to obtain numeric vectors. For continuous feature items, a gated convolutional variational autoencoder algorithm is used for feature representation, automatically extracting the high-level semantic information in the TCM electronic health record. After the vectorized representations of the case descriptions of different structural types are completed, the continuous feature vectors are fused with the discrete feature vectors along the feature dimension. The fused features can represent the original input optimally.
2. Coping with the unbalanced distribution of medical data. To improve the unbalanced distribution of medical data, avoid large differences in data volume, prevent the training set from being excessively biased and ensure that there is enough training data, the convolutional variational autoencoder can learn the data distribution better and alleviates the medical data distribution imbalance problem.
3. Personalized similar case recommendation models are constructed. A similar case recommendation model is built separately for each doctor, incorporating the consultation preferences of the clinician user and addressing the problem of individual differences. After obtaining several hundred candidate cases of a certain disease from a large, high-quality case library, the method ranks and outputs dozens of case recommendations to the doctor for reference.
In summary, the similar case recommendation method based on comprehensive feature representation and an improved wide and deep model proposed by the present invention realizes personalized case recommendation, proposes a recommendation ranking algorithm model combining a traditional shallow linear model with a deep network model, and improves the accuracy of similar case recommendation.
Detailed description of the invention
Fig. 1 is the flow chart of the present invention;
Fig. 2 is a diagram of the comprehensive feature representation model of the present invention;
Fig. 3 is a diagram of the gated convolutional variational autoencoder model within the comprehensive feature representation of the present invention;
Fig. 4 is a diagram of the improved wide and deep model of the present invention.
Specific embodiment
Embodiment 1: as shown in Figs. 1-4, a similar case recommendation method based on comprehensive feature representation and an improved wide and deep model; the specific steps of the method are as follows:
Step1, first perform medical text de-identification and word segmentation on the medical text; a TCM terminology vocabulary, basic TCM theory nouns, names of diseases unique to TCM together with their disease names, liver-disease nouns and Chinese herbal formula nouns are added to the THULAC dictionary, and segmentation is carried out with THULAC to obtain a word-level representation of the corpus;
Step2, perform feature partitioning and map the discrete features to real-valued vectors; the continuous features are segmented according to the procedure of Step1, and Word2Vec is used to obtain the word-vector representation of the corpus;
Step3, build a comprehensive feature representation model based on a gated convolutional variational autoencoder; first, according to the feature partitioning in Step2, the two groups of features are fused along the feature dimension, where the continuous features are represented with a gated convolutional variational autoencoder algorithm; finally, a high-level semantic representation of the TCM electronic health record is obtained;
Step4, build a similar case recommendation model based on the improved wide and deep model; a separate similar case recommendation model is built for each doctor; based on the comprehensive feature representation of the medical record obtained in Step3, dozens of case recommendations are ranked and output to the doctor.
Further, the specific steps of the step Step1 are as follows:
Step1.1, perform privacy removal and feature screening on the TCM electronic health record source data, removing personal privacy information of patients from the record text, such as "name", "admission number" and "home address"; combined with expert opinion, items that contribute nothing to the extraction of illness features, such as "occupation", "marital status", "ethnicity" and "physical examination", are screened out;
Step1.2, add a dictionary of TCM-related disease pathology and medical terminology, and segment the electronic health record with the THULAC Chinese word segmentation tool;
Step1.3, because records with many missing values in the electronic health record text cannot present an accurate description of the illness features, electronic health records shorter than 150 characters are removed.
Further, the specific steps of the step Step2 are as follows:
Step2.1, extract information such as "vital signs", "gender" and "patient age" from the electronic health record data, and map their values to a one-dimensional vector, which serves as the discrete features of the electronic health record;
Step2.2, map the segmented content obtained in Step1 to word vectors with the Word2vec method, and arrange the list of word vectors into a matrix, which serves as the continuous features of the electronic health record.
Further, in the step Step3, the specific steps of representing the continuous features with the gated convolutional variational autoencoder algorithm are as follows:
Step3.1, encode with a gated convolutional network, feeding the continuous features obtained in Step2 into a pooling layer to obtain the encoding result; compute a mean and a variance from this encoding result, generate a Gaussian distribution and resample from it;
Step3.2, build a double-layer stacked CNN model in which the output of a convolutional layer with a nonlinear activation function is multiplied element-wise by the output of a convolutional layer activated by a sigmoid nonlinearity, h(X) = (X∗W + b) ⊗ σ(X∗V + c), where W and V denote the weights of the convolutional layers, b and c denote their bias terms, ∗ denotes the convolution operation and σ is the gating function of the gated convolution;
Step3.3, the loss function follows log p(x) = D_KL(q_φ(z|x) ‖ p_θ(z|x)) + L(x; φ, θ), and training drives the final loss to its minimum, where θ denotes the optimized parameters, log p(x) is the log-likelihood the model seeks to maximize, D_KL is the KL divergence, q_φ(z|x) is the encoder, p_θ(x|z) is the decoder, z is the latent variable and x is the input variable; D_KL(q_φ(z|x) ‖ p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x);
Step3.4, train the network model and update its parameters with stochastic gradient ascent; a group of samples of the latent variable z is first drawn at random from the prior distribution p(z), then fed to the decoder, which finally outputs a random sample of a data point x; different numbers n of hidden units are considered, which may be smaller or larger than the number of original features.
Further, the specific steps of the step Step4 are as follows:
Step4.1, define the logistic regression model P(y = 1 | x) = σ(wᵀx + b), where x = [x1, x2, …, xd] is a vector of d features, the feature set contains both the original input features and combined features, and w = [w1, w2, …, wd] denotes the model parameters;
Step4.2, define the cross feature φ_k(x) = ∏_{i=1}^{d} x_i^{c_{ki}}, where c_{ki} ∈ {0, 1} is a Boolean value: if the i-th feature is part of the k-th transformation φ_k, c_{ki} is 1, otherwise 0; for binary features, for example, the cross feature is 1 if and only if all of the combined features hold, and 0 otherwise;
Step4.3, define the GRU layer as the core of the deep module, and add an additional feed-forward layer between the last layer and the output, using the tanh function as the activation function of the output layer; connections are added between the hidden-layer nodes, and a gated recurrent unit controls the output of the hidden nodes, modelling the effective variation of the features in the temporal dynamics;
Step4.4, take as the input features of the shallow part the cross features composed of items such as regional features, weather/time and search keywords, optimize the model parameters with mini-batch stochastic optimization, and back-propagate simultaneously to the shallow part and the deep gated-recurrent part of the model;
Step4.5, define the joint model prediction function P(Y = 1 | x) = σ(w_wideᵀ[x, φ(x)] + w_gruᵀ a^(lf) + b); the joint output takes the weighted sum of the log-odds as the predicted value, and this weighted sum is fed to a common loss function for joint training and optimization; the final output is a probability value, where w_wide denotes all the parameters of the wide part and w_gru the parameters of the GRU model; the results are finally sorted by probability from low to high and the first 5 are taken as case recommendations.
Step Step5 is as follows: recommendation quality is measured using precision (Precision), recall (Recall) and the F1 value (F-Measure) as evaluation metrics; recommendation efficiency is measured by the training and prediction speed of the model when performing personalized recommendation for clinician users.
The recommendation efficiency of the joint model is designed to be measured by the time cost incurred per unit during recommendation; that is, on the training set and the test set respectively, the average recommendation time for a single case list is compared, and comparison experiments on recommendation speed are carried out against other models. The wide and deep recommendation model proposed by the present invention has a simpler deep part, overcomes the vanishing-gradient problem and can improve the problem of unevenly distributed medical text data, and therefore has higher efficiency.
For a clinician user u, let R_u be the set of cases recommended by the model and L_u the set of cases approved by user u; the recommendation precision, recall and F1-Score are then as follows:
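The formulas themselves are not reproduced in the text above; the standard definitions in terms of R_u and L_u, which the description appears to use, are:

```latex
% Standard precision / recall / F1 definitions over recommended set R_u and approved set L_u
\mathrm{Precision} = \frac{\sum_{u}\lvert R_u \cap L_u\rvert}{\sum_{u}\lvert R_u\rvert}, \qquad
\mathrm{Recall} = \frac{\sum_{u}\lvert R_u \cap L_u\rvert}{\sum_{u}\lvert L_u\rvert}, \qquad
F_1 = \frac{2\cdot\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}
```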
The data used in this study come from a real electronic health record data set collected from a traditional Chinese medicine hospital in Yunnan Province. In total, 2090 authentic and valid electronic health records were selected, covering the eight most common conditions treated in the relevant department of the TCM hospital: bone fracture, numbness (bi) disease, soft-tissue injury, lumbago, dislocation, gangrene of the fingers or toes, heat numbness and suppurative osteomyelitis; these form the electronic health record data set. After the data samples were organized, 70% were taken as the training set, 10% as the cross-validation set and 20% as the test set.
Experiment: to verify the quality of auxiliary diagnosis, the improved wide and deep model based on gated convolution (CFI) is compared experimentally with the traditional classification methods Logistic Regression and SVM and with the two-stage feature representation and automatic diagnosis method DBN+SVM. The comparison experiments use the same pre-partitioned TO-EMR data set. The results are shown in Table 1.
Table 1. Comparison of the overall index values of the different models
In Table 1, different models and methods are compared; the CFI model is superior to the other methods on every index, improving the comprehensive index value by 2.63 percentage points over existing methods. The results show that the CFI model has a stronger ability than conventional models to extract information from electronic health record text: the high precision indicates that the probability of misdiagnosis by the model is very low, the higher recall indicates that the probability of missed diagnosis is lower, and the comprehensive evaluation index F value indicates that the auxiliary diagnosis effect of the combined model is outstanding. Compared with existing deep models, the CFI model improves performance; the method of representing partitioned features effectively raises the accuracy of feature representation, and the variational autoencoder can better learn the distribution of the electronic health record data, improving the ability of deep feature representation. The above comparison results effectively validate the CFI model, which has significant practical value; realizing comprehensive feature representation and performing clinical auxiliary diagnosis with it is feasible and effective.
The embodiments of the present invention are explained in detail above in conjunction with the accompanying drawings, but the present invention is not limited to the above embodiments; within the scope of knowledge possessed by a person of ordinary skill in the art, various changes can also be made without departing from the concept of the present invention.

Claims (5)

1. A similar case recommendation method based on comprehensive feature representation and an improved wide and deep model, characterized in that the specific steps of the method are as follows:
Step1, first perform medical text de-identification and word segmentation on the medical text; add a TCM terminology vocabulary, basic TCM theory nouns, names of diseases unique to TCM together with their disease names, liver-disease nouns and Chinese herbal formula nouns to the THULAC dictionary, and segment with THULAC to obtain a word-level representation of the corpus;
Step2, perform feature partitioning and map the discrete features to real-valued vectors; segment the continuous features according to the procedure in Step1 and use Word2Vec to obtain the word-vector representation of the corpus;
Step3, build a comprehensive feature representation model based on a gated convolutional variational autoencoder; first, according to the feature partitioning in Step2, fuse the two groups of features along the feature dimension, where the continuous features are represented with a gated convolutional variational autoencoder algorithm; finally, obtain a high-level semantic representation of the TCM electronic health record;
Step4, build a similar case recommendation model based on the improved wide and deep model; build a separate similar case recommendation model for each doctor; according to the comprehensive feature representation of the medical record obtained in Step3, rank and output dozens of case recommendations to the doctor.
2. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of the step Step1 are as follows:
Step1.1, perform privacy removal and feature screening on the TCM electronic health record source data, removing personal privacy information of patients from the record text, such as "name", "admission number" and "home address"; combined with expert opinion, screen out items that contribute nothing to extracting illness features, such as "occupation", "marital status", "ethnicity" and "physical examination";
Step1.2, add a dictionary of TCM-related disease pathology and medical terminology, and segment the electronic health record with the THULAC Chinese word segmentation tool;
Step1.3, because records with many missing values in the electronic health record text cannot present an accurate description of the illness features, remove electronic health records shorter than 150 characters.
3. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of the step Step2 are as follows:
Step2.1, extract information such as "vital signs", "gender" and "patient age" from the electronic health record data, and map their values to a one-dimensional vector as the discrete features of the electronic health record;
Step2.2, map the segmented content obtained in Step1 to word vectors with the Word2vec method, and arrange the list of word vectors into a matrix as the continuous features of the electronic health record.
4. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that, in the step Step3, the specific steps of representing the continuous features with the gated convolutional variational autoencoder algorithm are as follows:
Step3.1, encode with a gated convolutional network, feeding the continuous features obtained in Step2 into a pooling layer to obtain the encoding result; compute a mean and a variance from this encoding result, generate a Gaussian distribution and resample from it;
Step3.2, build a double-layer stacked CNN model in which the output of a convolutional layer with a nonlinear activation function is multiplied element-wise by the output of a convolutional layer activated by a sigmoid nonlinearity, h(X) = (X∗W + b) ⊗ σ(X∗V + c), where W and V denote the weights of the convolutional layers, b and c denote their bias terms, ∗ denotes the convolution operation and σ is the gating function of the gated convolution;
Step3.3, the loss function follows log p(x) = D_KL(q_φ(z|x) ‖ p_θ(z|x)) + L(x; φ, θ), and training drives the final loss to its minimum, where θ denotes the optimized parameters, log p(x) is the log-likelihood the model seeks to maximize, D_KL is the KL divergence, q_φ(z|x) is the encoder, p_θ(x|z) is the decoder, z is the latent variable and x is the input variable; D_KL(q_φ(z|x) ‖ p_θ(z|x)) = 0 if and only if q_φ(z|x) = p_θ(z|x);
Step3.4, train the network model and update its parameters with stochastic gradient ascent; first draw a group of samples of the latent variable z at random from the prior distribution p(z), then feed them to the decoder, which finally outputs a random sample of a data point x; consider different numbers n of hidden units, which may be smaller or larger than the number of original features.
5. The similar case recommendation method based on comprehensive feature representation and an improved wide and deep model according to claim 1, characterized in that the specific steps of the step Step4 are as follows:
Step4.1, define the logistic regression model P(y = 1 | x) = σ(wᵀx + b), where x = [x1, x2, …, xd] is a vector of d features, the feature set contains both the original input features and combined features, and w = [w1, w2, …, wd] denotes the model parameters;
Step4.2, define the cross feature φ_k(x) = ∏_{i=1}^{d} x_i^{c_{ki}}, where c_{ki} ∈ {0, 1} is a Boolean value: if the i-th feature is part of the k-th transformation φ_k, c_{ki} is 1, otherwise 0; for binary features, for example, the cross feature is 1 if and only if all of the combined features hold, and 0 otherwise;
Step4.3, define the GRU layer as the core of the deep module, and add an additional feed-forward layer between the last layer and the output, using the tanh function as the activation function of the output layer; add connections between the hidden-layer nodes, and control the output of the hidden nodes with a gated recurrent unit, modelling the effective variation of the features in the temporal dynamics;
Step4.4, take as the input features of the shallow part the cross features composed of items such as regional features, weather/time and search keywords, optimize the model parameters with mini-batch stochastic optimization, and back-propagate simultaneously to the shallow part and the deep gated-recurrent part of the model;
Step4.5, define the joint model prediction function P(Y = 1 | x) = σ(w_wideᵀ[x, φ(x)] + w_gruᵀ a^(lf) + b); the joint output takes the weighted sum of the log-odds as the predicted value, and this weighted sum is fed to a common loss function for joint training and optimization; the final output is a probability value, where w_wide denotes all the parameters of the wide part and w_gru the parameters of the GRU model; the results are finally sorted by probability from low to high and the first 5 are taken as case recommendations.
CN201910490881.2A 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model Active CN110299194B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910490881.2A CN110299194B (en) 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910490881.2A CN110299194B (en) 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Publications (2)

Publication Number Publication Date
CN110299194A true CN110299194A (en) 2019-10-01
CN110299194B CN110299194B (en) 2022-11-08

Family

ID=68027589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910490881.2A Active CN110299194B (en) 2019-06-06 2019-06-06 Similar case recommendation method based on comprehensive feature representation and improved wide-depth model

Country Status (1)

Country Link
CN (1) CN110299194B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613339A (en) * 2020-05-15 2020-09-01 山东大学 Similar medical record searching method and system based on deep learning
CN112699408A (en) * 2020-12-31 2021-04-23 重庆大学 Wearable device data privacy protection method based on self-encoder
CN116189843A (en) * 2023-04-23 2023-05-30 索思(苏州)医疗科技有限公司 Treatment scheme recommendation method, device, system and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
CN105653840A (en) * 2015-12-21 2016-06-08 青岛中科慧康科技有限公司 Similar case recommendation system based on word and phrase distributed representation, and corresponding method
US20170213000A1 (en) * 2016-01-25 2017-07-27 Shenzhen University Metabolic mass spectrometry screening method for diseases based on deep learning and the system thereof
US20170300814A1 (en) * 2016-04-13 2017-10-19 Google Inc. Wide and deep machine learning models
CN108647251A (en) * 2018-04-20 2018-10-12 昆明理工大学 The recommendation sort method of conjunctive model is recycled based on wide depth door
CN108897834A (en) * 2018-06-22 2018-11-27 招商信诺人寿保险有限公司 Data processing and method for digging
CN109447244A (en) * 2018-10-11 2019-03-08 中山大学 A kind of advertisement recommended method of combination gating cycle unit neural network
WO2019047790A1 (en) * 2017-09-08 2019-03-14 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140201126A1 (en) * 2012-09-15 2014-07-17 Lotfi A. Zadeh Methods and Systems for Applications for Z-numbers
CN105653840A (en) * 2015-12-21 2016-06-08 青岛中科慧康科技有限公司 Similar case recommendation system based on word and phrase distributed representation, and corresponding method
US20170213000A1 (en) * 2016-01-25 2017-07-27 Shenzhen University Metabolic mass spectrometry screening method for diseases based on deep learning and the system thereof
US20170300814A1 (en) * 2016-04-13 2017-10-19 Google Inc. Wide and deep machine learning models
WO2019047790A1 (en) * 2017-09-08 2019-03-14 第四范式(北京)技术有限公司 Method and system for generating combined features of machine learning samples
CN108647251A (en) * 2018-04-20 2018-10-12 昆明理工大学 The recommendation sort method of conjunctive model is recycled based on wide depth door
CN108897834A (en) * 2018-06-22 2018-11-27 招商信诺人寿保险有限公司 Data processing and method for digging
CN109447244A (en) * 2018-10-11 2019-03-08 中山大学 A kind of advertisement recommended method of combination gating cycle unit neural network

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
倪壮: "Research and Application of Medical Data Mining Based on Distance Metric Learning", China Masters' Theses Full-text Database, Information Science and Technology Series *
刘利军: "Research on Recommendation Methods Based on an Improved Wide and Deep Model", Computer Applications and Software *
王艺平: "Research on Similar Case Recommendation Methods for TCM Orthopedic Consultation", China Masters' Theses Full-text Database, Medicine and Health Sciences Series *
王静: "Similar Case Recommendation for Online Consultation Platforms", China Masters' Theses Full-text Database, Information Science and Technology Series *
陈培新: "Research on Vector Representation and Modeling Methods of Text Semantics", China Masters' Theses Full-text Database, Information Science and Technology Series *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111613339A (en) * 2020-05-15 2020-09-01 山东大学 Similar medical record searching method and system based on deep learning
CN111613339B (en) * 2020-05-15 2021-07-09 山东大学 Similar medical record searching method and system based on deep learning
CN112699408A (en) * 2020-12-31 2021-04-23 重庆大学 Wearable device data privacy protection method based on self-encoder
CN116189843A (en) * 2023-04-23 2023-05-30 索思(苏州)医疗科技有限公司 Treatment scheme recommendation method, device, system and storage medium

Also Published As

Publication number Publication date
CN110299194B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
Sha et al. Interpretable predictions of clinical outcomes with an attention-based recurrent neural network
Chattopadhyay et al. A comparative study of fuzzy c-means algorithm and entropy-based fuzzy clustering algorithms
Hernández-Julio et al. Framework for the development of data-driven Mamdani-type fuzzy clinical decision support systems
CN110299194A (en) The similar case recommended method with the wide depth model of improvement is indicated based on comprehensive characteristics
Woodman et al. A comprehensive review of machine learning algorithms and their application in geriatric medicine: present and future
Klüver Steering clustering of medical data in a Self-Enforcing Network (SEN) with a cue validity factor
Kumar et al. Gene expression data clustering using variance-based harmony search algorithm
Coban A new modification and application of item response theory‐based feature selection for different machine learning tasks
Chattopadhyay et al. Some studies on fuzzy clustering of psychosis data
Dadgar et al. A hybrid method of feature selection and neural network with genetic algorithm to predict diabetes
Gulhane et al. A Machine Learning based Model for Disease Prediction
Wibowo et al. A K-Nearest Algorithm Based Application To Predict Snmptn Acceptance For High School Students In Indonesia
Nugraha et al. Classification of Depression Expressions on Twitter Using Ensemble Learning with Word2Vec
Mandala et al. A Study on the Development of Machine Learning in Health Analysis.
Azizan et al. Hybridised Network of Fuzzy Logic and a Genetic Algorithm in Solving 3-Satisfiability Hopfield Neural Networks
Rendeiro et al. Taxonomical associative memory
Reddy et al. Diabetes Prediction using Extreme Learning Machine: Application of Health Systems
Kuang et al. LSTM based classification model and its application for doctor-patient relationship evaluation
Lu et al. Combining transformer-based model and GCN to predict ICD codes from clinical records
Mu et al. Diagnosis prediction via recurrent neural networks
Chen et al. Clinical knowledge graph embeddings with hierarchical structure for thyroid treatment recommendation
Akhila et al. A review on sentiment analysis of Twitter data for diabetes classification and prediction
Szeląg Application of the dominance-based rough set approach to ranking and similarity-based classification problems
Tebbe et al. Is natural language processing the cheap charlie of analyzing cheap talk? A horse race between classifiers on experimental communication data
Behpour et al. Understanding Machine Learning Through Data-Oriented and Human Learning Approaches

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant