CN109065174B - Medical record theme acquisition method and device considering similarity constraint - Google Patents


Info

Publication number
CN109065174B
CN109065174B CN201810843072.0A CN201810843072A CN109065174B CN 109065174 B CN109065174 B CN 109065174B CN 201810843072 A CN201810843072 A CN 201810843072A CN 109065174 B CN109065174 B CN 109065174B
Authority
CN
China
Prior art keywords
medical record
similarity
topic
distribution
documents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810843072.0A
Other languages
Chinese (zh)
Other versions
CN109065174A (en)
Inventor
丁帅
蔡琼
潘金鑫
金行
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology
Priority to CN201810843072.0A
Publication of CN109065174A
Application granted
Publication of CN109065174B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Pathology (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a medical record topic acquisition method and device considering similarity constraints. The method comprises the following steps: calculating the similarity between any two medical record documents in the initial medical records to obtain a similarity constraint medical record set formed by the medical record documents whose similarity is greater than or equal to a similarity threshold; and sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model. In this way, the reasoning process by which a doctor writes medical record text during diagnosis and treatment can be simulated well, which helps improve the accuracy of the acquired topics.

Description

Medical record theme acquisition method and device considering similarity constraint
Technical Field
The invention relates to the technical field of data mining, in particular to a medical record theme acquisition method and device considering similar constraints.
Background
At present, topic models are mostly applied to the evolution analysis of public opinion topics in online social media, which helps to monitor changes in online public opinion effectively according to the topic distribution in different time periods, and even to actively guide the direction in which public opinion develops. Topic models have also been applied, to a lesser extent, in clinical diagnosis and treatment, with the aim of analyzing the diagnosis and treatment rules relating diseases to medication and diseases to symptoms in medical record documents. The analysis process is as follows: each medical record document is input into the model as an independent sample, and the final topic analysis result is obtained through extensive training.
However, in the process of implementing the scheme of the invention, the inventors found the following. On the one hand, because disease progression is similar between two patients with the same disease, the diagnosis and treatment plan a physician makes is influenced by earlier plans for similar patients. On the other hand, there are individual differences between two patients, such as constitution, sex, age and disease stage, so doctors give different diagnosis and treatment plans for different patients. In actual diagnosis and treatment, two patients may have similar physical conditions and diseases, and their diagnosis and treatment plans then also share similar parts. For example, diabetic patients may have multiple diabetic complications at the same time, but the diagnosis and treatment plans and disease progression for the same complication should be similar.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a medical record theme acquisition method and device considering similar constraints, which are used for solving the technical problems in the related art.
In a first aspect, an embodiment of the present invention provides a medical record topic acquisition method considering similarity constraints, where the method includes:
calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-subject distribution and subject-word distribution of the medical record documents through the preset LDA model.
Optionally, calculating the similarity between any two medical record documents in the initial medical record comprises:
acquiring a plurality of similarity calculation factors of the medical record and weight values of the similarity calculation factors;
respectively calculating the numerical values of any two medical record documents about each similarity calculation factor;
and calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
Optionally, deriving the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model includes:
randomly assigning a theme number z to each word in each medical record document in the similarity constraint medical record set;
rescanning the similarity constraint medical record set and, for each word, resampling its topic according to
Figure BDA0001746019110000021
so that the new topics satisfy Gibbs sampling convergence;
and counting the topic-word co-occurrence frequency matrix in the corpus to obtain document-topic distribution and topic-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint of any two medical record documents is expressed by the topic distribution distance dis(θr_m, θr_n), whose formula is:
Figure BDA0001746019110000031
where θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}} indicates that medical record document m includes L_m course records; θ_{m,L_m} denotes the topic of the L_m-th course record; and d(θ_{m,L_m}, θ_{n,L_n}) denotes the Euclidean distance between the topic vectors of two course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
Figure BDA0001746019110000032
where
Figure BDA0001746019110000033
represents the number of words i with topic k in the similarity constraint medical record set.
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, where the apparatus includes:
the medical record set acquisition module is used for calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and the theme distribution derivation module is used for sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model and deriving the document-theme distribution and the theme-word distribution of the medical record documents through the preset LDA model.
Optionally, the medical record collection acquiring module includes:
the weight value acquisition unit is used for acquiring a plurality of similarity calculation factors of the medical records and the weight values of the similarity calculation factors;
the factor data calculation unit is used for calculating the numerical values of any two medical record documents about each similarity calculation factor;
and the similarity calculation unit is used for calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
Optionally, the topic distribution derivation module includes:
a topic numbering unit, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit, configured to rescan the similarity constraint medical record set and, for each word, resample its topic according to
Figure BDA0001746019110000041
so that the new topics satisfy Gibbs sampling convergence;
and the theme distribution calculating unit is used for counting the theme-word co-occurrence frequency matrix in the corpus to obtain document-theme distribution and theme-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint of any two medical record documents is expressed by the topic distribution distance dis(θr_m, θr_n), whose formula is:
Figure BDA0001746019110000042
where θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}} indicates that medical record document m includes L_m course records; θ_{m,L_m} denotes the topic of the L_m-th course record; and d(θ_{m,L_m}, θ_{n,L_n}) denotes the Euclidean distance between the topic vectors of two course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
Figure BDA0001746019110000051
where
Figure BDA0001746019110000052
represents the number of words i with topic k in the similarity constraint medical record set.
According to the technical scheme, the similarity of the two medical record documents is calculated, so that a plurality of medical record documents which are larger than or equal to the similarity threshold value can be screened from the initial medical record, and the similarity constraint medical record set formed by the plurality of medical record documents is used as the topic analysis document in the subsequent process. Therefore, the thinking process of determining the medical record text in the doctor diagnosis and treatment process can be well simulated, and the accuracy of obtaining the theme is facilitated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention;
FIG. 2 is a record of the course of disease in a medical record document;
FIG. 3 is a graph of the number of diabetic complications in a male patient;
FIG. 4 is a graph of the number of diabetic complications in a female patient;
FIG. 5 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.5 and 0.6, respectively;
FIG. 6 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.7 and 0.8, respectively;
FIG. 7 is a diagram of the relationship between the number of topics and the mutual information (PMI-Score);
FIG. 8 to FIG. 10 are block diagrams of medical record topic acquisition apparatuses considering similarity constraints according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Fig. 1 is a schematic flow chart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention. Referring to fig. 1, a medical record topic acquisition method considering similarity constraints includes:
101, calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and 102, sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-theme distribution and theme-word distribution of the medical record documents through the preset LDA model.
The following describes in detail steps of a medical record topic acquisition method considering similar constraints with reference to the accompanying drawings and embodiments.
First, step 101 is described: calculating the similarity between any two medical record documents in the initial medical records to obtain a similarity constraint medical record set formed by the medical record documents whose similarity is greater than or equal to a similarity threshold.
During the hospitalization, the patient will generate various detection records, such as admission record, discharge record, course record, consultation record, etc. If the similarity between the detection records is directly calculated, the calculation amount is greatly increased. For convenience of explanation, in this embodiment, the detection record before processing is referred to as an initial medical record.
To reduce the amount of calculation, only the similarity of the hospitalized diagnosis portion in the initial medical record is considered in this embodiment. In one embodiment, the similarity is calculated as the distance between any two initial medical records, and the medical record similarity constraint construction can be understood as collecting a medical record set with the distance between every two medical records being smaller than a certain threshold.
In practical applications, the initial medical record may also include various complications of a certain disease, for example, diabetes may cause various complications, as shown in table 1.
Table 1 diabetic complications example
Figure BDA0001746019110000071
As can be seen from Table 1, patients of different ages differ in the characteristics of diabetes and its complications; in addition, patients of different ages tolerate drugs differently, so the clinical diagnosis and treatment process differs in aspects such as symptom presentation and medication. Therefore, the basic information of the patient needs to be considered when calculating the similarity of medical record documents, and in this embodiment the patient's gender and age are included among the similarity calculation factors.
In one embodiment, the distance of the gender attribute between the same gender is set to 1, and the distance of the gender attribute between different genders is set to 0, as shown in the following equation:
d(sex_i, sex_j) = 1 if sex_i = sex_j, and 0 otherwise
where sex_i and sex_j denote the genders of two different individuals.
In one embodiment, ages are divided into 4 age groups according to the international population age structure: juvenile, 0-17 years old, denoted 1; young, 18-45 years old, denoted 2; middle-aged, 46-59 years old, denoted 3; and older, over 59 years old, denoted 4. The distance between the age groups of two patients can then be calculated as follows:
Figure BDA0001746019110000082
where age_i and age_j denote the ages of two different persons, and flag_i and flag_j denote the age segments they belong to. The closer the segments of the two ages, the smaller the distance; the farther apart the segments, the larger the distance.
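The two demographic factors can be sketched in Python as follows. This is a minimal illustration, not the patented formula: the function names are hypothetical, and the middle-aged band is taken to be 46-59 years (the range lying between the young and older bands). The age-distance formula itself survives only as a figure reference, so the sketch stops at the raw gap between segment flags, |flag_i - flag_j|, and does not claim to reproduce any normalization the figure may apply.

```python
def gender_factor(sex_i: str, sex_j: str) -> int:
    """Return 1 when both patients have the same gender, otherwise 0."""
    return 1 if sex_i == sex_j else 0


def age_segment(age: int) -> int:
    """Map an age in years to the segment flag 1-4 described in the text."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return 1  # juvenile
    if age <= 45:
        return 2  # young
    if age <= 59:
        return 3  # middle-aged (assumed band: 46-59)
    return 4      # older than 59


def segment_gap(age_i: int, age_j: int) -> int:
    """Number of segments separating two ages (0 means the same segment)."""
    return abs(age_segment(age_i) - age_segment(age_j))
```

For instance, `segment_gap(10, 70)` compares a juvenile with an older patient and yields the largest possible gap of 3 segments.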
Considering the discrete textual description in the initial medical record, the distance between the diagnosis results in different initial medical records is calculated by using the Jaccard distance in this embodiment, as shown in the following formula:
d(dia_i, dia_j) = |dia_i ∩ dia_j| / |dia_i ∪ dia_j|
where dia_i and dia_j represent the discharge diagnosis Boolean vectors of medical record i and medical record j; this document mainly considers diabetic complications.
For example: dia_i = {1, 2, 3} and dia_j = {2, 3, 4} give dia_i ∩ dia_j = {2, 3} and dia_i ∪ dia_j = {1, 2, 3, 4}, so d(dia_i, dia_j) = 2/4 = 0.5.
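The worked example can be checked with a few lines of Python. This sketch follows the example's arithmetic, dividing the size of the intersection by the size of the union; that ratio is usually called the Jaccard index, and the Jaccard distance is commonly defined as one minus it. The function name and the handling of two empty diagnosis sets are assumptions made for illustration.

```python
def diagnosis_overlap(dia_i: set, dia_j: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = dia_i | dia_j
    if not union:
        return 0.0  # assumption: two empty diagnosis sets share nothing
    return len(dia_i & dia_j) / len(union)


# Diagnosis codes from the worked example in the text.
print(diagnosis_overlap({1, 2, 3}, {2, 3, 4}))  # prints 0.5
```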
It should be noted that this embodiment considers only three similarity calculation factors: the distance of the gender attribute, the distance of the age segment, and the distance of the diagnosis result. When the application scenario of the topic acquisition method changes, the composition of the similarity calculation factors can be adjusted accordingly, and the adjusted scheme also falls within the protection scope of the application.
After the similarity calculation factors are determined, weight adjustment parameters μ_1, μ_2, μ_3 are set, and the similarity between any two initial medical records is calculated as follows:
sim(T_i, T_j) = μ_1 * d(sex_i, sex_j) + μ_2 * d(age_i, age_j) + μ_3 * d(dia_i, dia_j)    (3)
μ_1 + μ_2 + μ_3 = 1    (4)
0 ≤ μ_1, μ_2, μ_3 ≤ 1    (5)
Finally, the similarity is compared with a similarity threshold τ, the initial medical records whose similarity is greater than or equal to the threshold are screened out, and the similarity constraint medical record set formed by them is recorded as D = {(T_i, T_j) | i, j ∈ [1, M]}.
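Formulas (3) to (5) and the threshold screening can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names, the record layout and the `factor_fn` callback (which returns the three factor values for a pair of records) are hypothetical, and the toy factor function uses a simple same-band indicator as a stand-in for the age factor, whose exact formula is not recoverable from the text.

```python
from itertools import combinations


def weighted_similarity(factors, weights):
    """Weighted sum of factor values, as in formula (3)."""
    if len(factors) != len(weights):
        raise ValueError("need one weight per factor")
    # Constraints (4) and (5): weights lie in [0, 1] and sum to 1.
    if any(w < 0 or w > 1 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    return sum(w * f for w, f in zip(weights, factors))


def build_constraint_set(records, factor_fn, weights, tau):
    """Return index pairs (i, j) whose similarity is at least tau."""
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        sim = weighted_similarity(factor_fn(records[i], records[j]), weights)
        if sim >= tau:
            pairs.append((i, j))
    return pairs


def toy_factors(a, b):
    """Hypothetical factor values for two records (illustration only)."""
    same_sex = 1.0 if a["sex"] == b["sex"] else 0.0
    same_band = 1.0 if a["band"] == b["band"] else 0.0  # stand-in age factor
    union = a["dia"] | b["dia"]
    overlap = len(a["dia"] & b["dia"]) / len(union) if union else 0.0
    return (same_sex, same_band, overlap)


records = [
    {"sex": "M", "band": 3, "dia": {1, 2, 3}},
    {"sex": "M", "band": 3, "dia": {2, 3, 4}},
    {"sex": "F", "band": 1, "dia": {5}},
]
print(build_constraint_set(records, toy_factors, (0.2, 0.3, 0.5), tau=0.6))
```

With these toy records only the first two patients pass the threshold, since they share gender, age band and half of their diagnoses.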
Next, step 102 is described: sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
In this embodiment, the preset LDA model is obtained by improving the existing LDA model. To help those skilled in the art better understand the preset LDA model, the basic principle of the LDA model is described first.
Latent Dirichlet Allocation (LDA) is a topic model that aims to find the topics of a document. It has three layers: document, topic and word. Each document has a probability distribution over topics, and the words in the document are sampled from the word distributions of the different topics, as shown in formula (6):
p(word | document) = Σ_topic p(word | topic) × p(topic | document)    (6)
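Formula (6) expresses the word distribution of a document as a mixture over topics. The following sketch evaluates it for one document; the function name and the toy probabilities are made up for illustration.

```python
def word_given_document(topic_given_doc, word_given_topic):
    """p(word | document) = sum over k of p(word | topic k) * p(topic k | document)."""
    n_topics = len(topic_given_doc)
    n_words = len(word_given_topic[0])
    return [
        sum(topic_given_doc[k] * word_given_topic[k][w] for k in range(n_topics))
        for w in range(n_words)
    ]


# Toy example: 2 topics over a 3-word vocabulary.
theta = [0.5, 0.5]           # p(topic | document)
phi = [[0.5, 0.5, 0.0],      # p(word | topic 0)
       [0.0, 0.5, 0.5]]      # p(word | topic 1)
print(word_given_document(theta, phi))  # prints [0.25, 0.5, 0.25]
```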
The medical record documents are modeled with the LDA model. Let the total number of medical record documents be M, let the m-th medical record document contain N_m clinical description words, and let each word be denoted ω_{m,n}. Under the existing bag-of-words model, documents and words are represented as a document-topic distribution and a topic-word distribution. The topics in medical record text can be understood as general categories of clinical care measures such as medication, observation, symptoms and operations. Each medical record text is a multinomial distribution over several topics; that is, each medical record text is composed of several steps of the clinical care process.
In the related art, the steps of the LDA model generating the medical record text are shown in table 2.
Figure BDA0001746019110000101
It can be understood that each topic is a multinomial distribution over words, just as each clinical care step comprises several practical clinical operations, and that the document-topic distribution and the topic-word distribution follow Dirichlet prior distributions with parameters α and β respectively. The LDA model can therefore simulate well the reasoning process by which a doctor writes medical record text during diagnosis and treatment.
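The standard LDA generative process, which Table 2 is assumed to summarize, can be sketched with the Python standard library alone. The function names, the symmetric scalar priors `alpha` and `beta`, and the toy vocabulary are assumptions for illustration; the Dirichlet draw uses the usual construction of normalizing independent Gamma samples.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha, beta, seed=0):
    """Generate toy documents following the standard LDA generative story."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    phi = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic distribution per document, drawn from Dirichlet(alpha).
        theta = sample_dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            w = rng.choices(vocab, weights=phi[z])[0]           # pick a word
            doc.append(w)
        corpus.append(doc)
    return corpus


toy_vocab = ["insulin", "glucose", "neuropathy", "retinopathy"]  # illustrative
corpus = generate_corpus(3, 5, 2, toy_vocab, alpha=0.5, beta=0.5)
```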
Based on the above analysis, the goal of LDA model inference is to calculate the unknown parameters of the LDA model
Figure BDA0001746019110000102
from the current test document set, and then to calculate the topic-word distribution and the document-topic distribution from
Figure BDA0001746019110000103
In fact, the topic-word distribution and the document-topic distribution can be derived directly during the calculation, without calculating
Figure BDA0001746019110000111
In practical applications, parameter inference algorithms for the LDA model include Gibbs sampling and variational EM. The two methods are described below.
First, Gibbs sampling is based on the Markov chain Monte Carlo (MCMC) method: in each iteration only the value of one dimension of the parameters is changed, until convergence, and the parameter values to be estimated are then output. According to Dirichlet parameter estimation, the following can be derived:
Figure BDA0001746019110000112
Figure BDA0001746019110000113
Figure BDA0001746019110000114
where
Figure BDA0001746019110000115
represents the document-topic distribution,
Figure BDA0001746019110000116
represents the topic-word distribution, and
Figure BDA0001746019110000117
represents the probability that the topic of word
Figure BDA0001746019110000119
is k; i is a data pair (m, n) representing the n-th word in the m-th document.
Since there are a total of K topics, K iterations are required, and the training steps are shown in table 3:
Figure BDA0001746019110000118
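For orientation, a textbook collapsed Gibbs sampler for plain LDA is sketched below. It is not a transcription of Table 3 or of the patent's formulas, which are not recoverable from the text, and it does not include the similarity constraint. The function name, symmetric scalar priors and fixed iteration count (instead of a convergence test) are assumptions; words are integer ids.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for standard LDA; returns (theta, phi)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]                # doc-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]   # topic-word counts
    n_k = [0] * n_topics                                 # words per topic
    z = []
    # Initialization: assign a random topic number to every word.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_m.append(k)
            n_dk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)
    # Rescan the corpus and resample the topic of each word in turn.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = z[m][n]
                n_dk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[m][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][n] = k
                n_dk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    # Estimate both distributions from the final count matrices.
    theta = [[(n_dk[m][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for m, doc in enumerate(docs)]
    phi = [[(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta)
            for w in range(vocab_size)] for k in range(n_topics)]
    return theta, phi


toy_docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=4,
                       alpha=0.1, beta=0.01, n_iter=50)
```

Each row of `theta` is a document-topic distribution and each row of `phi` is a topic-word distribution, so every row sums to 1.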
Second, the variational EM algorithm seeks suitable parameters that maximize the probability of the observed topic-word distribution in the text set, similar to a maximum likelihood estimation problem. The variational EM algorithm is divided into two iteration steps:
The variational E-step: since the posterior probability formula involving p(w | α, β) is difficult to derive directly, variational parameters (γ,
Figure BDA0001746019110000121
) are introduced to give an approximate posterior probability distribution q(θ, z | γ,
Figure BDA0001746019110000122
).
The variational M-step maximizes the approximation function L(γ,
Figure BDA0001746019110000123
β), where the prior Dirichlet distribution parameters (α, β) determine the topic-word distribution and the document-topic distribution θ, w represents words and z represents topics.
Because the iteration goal of the LDA model is to maximize the occurrence probability p(Z, W | α, β) of the words, it cannot effectively match the data characteristics of diabetes course records: the topic distributions of similar medical records may differ greatly, so the medical records cannot be statistically analyzed effectively according to their topic distributions.
In order to establish a topic model satisfying the medical record similarity constraint, this embodiment achieves this goal by changing the Gibbs sampling convergence condition policy.
Considering that each medical record contains several time-ordered course records, the similarity calculation for medical record documents should take into account the similarity between the course record sets of the medical record documents. That is, the similarity constraint requires the document-topic distributions of the course record sets of the medical record documents in set D to be as similar as possible.
Let T_m be the medical record numbered m, containing L_m course records, whose topic set is expressed as θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}}. For the course record topic sets θr_m and θr_n of two medical record documents, the medical record similarity constraint can be calculated as the mean of the pairwise topic distribution distances, as follows:
Figure BDA0001746019110000131
wherein d (θ)m,Lmn,Ln) Expressed as the Euclidean distance, dis (θ r), between two diseases and the vectorm,θrn) Larger indicates lower similarity.
The objective function to be maximized can then be modified to:
Figure BDA0001746019110000137
In this embodiment, the Gibbs-EM iteration method is adopted for LDA model inference, and the document-topic distribution α_m is replaced by a normally distributed μ_m to obtain the preset LDA model:
Figure BDA0001746019110000132
where μ_{mk} represents the probability that medical record document m belongs to topic k. Since μ_m is taken to follow a standard normal distribution, the improved objective function to be maximized is expressed as follows:
Figure BDA0001746019110000133
In addition, in this embodiment the document-topic distribution α_m is fixed in advance during sampling, so the Gibbs-EM iterative function is:
Figure BDA0001746019110000134
where
Figure BDA0001746019110000135
represents the number of words i with topic k in the similarity constraint medical record set. Since the original α is replaced by a normal distribution, formula (14) can be solved by stochastic gradient descent, and the model training process is shown in Table 4:
Figure BDA0001746019110000136
Figure BDA0001746019110000141
Then, the medical record documents in the similarity constraint medical record set are sequentially input into the preset LDA model, and the document-topic distribution and topic-word distribution of each medical record document are derived through the preset LDA model.
Thus, on the basis of analyzing the influence of text mining on medical diagnosis and the modeling process and inference method of the latent Dirichlet allocation topic model, the embodiment of the invention designs a preset LDA model based on medical record similarity constraints. The preset LDA model not only considers the similarity constraints among different medical record documents, but also determines the modeling target, inference process and evaluation metrics for medical text topics, so that the preset LDA model can clearly reflect the focus of each diagnosis and treatment stage and the course of disease evolution, which helps improve the scientific soundness, effectiveness and accuracy of medical record topic mining.
A comparative experiment between the LDA model and the preset LDA model of the present application (hereinafter referred to as MRS-LDA) is used to illustrate the effectiveness and superiority of the medical record topic acquisition method considering similarity constraints provided by the embodiment of the invention.
The initial medical records are patient medical records from the endocrinology department of the First Affiliated Hospital of Anhui Medical University, covering the admissions of 1294 patients in total from 2015 to 2017. Each medical record document mainly comprises admission records, course records (shown in FIG. 2), consultation records, discharge records and the like. The numbers of medical record documents for male and female patients are approximately equal, at a ratio of 648:646.
Referring to FIG. 3 and FIG. 4, among the diabetic patients admitted to the First Affiliated Hospital of Anhui Medical University, the hospitalization diagnoses show that patients of different ages and sexes differ significantly in the number of complications they have at the same time. Older patients have far more diabetic complications than patients in other age groups; many middle-aged patients have 3 to 5 complications at the same time; young patients have diabetes but few complications; and few children have diabetes.
In the embodiment, the sex, age and admission diagnosis of the patient in the admission record are selected as the basis of medical record similarity constraint calculation data, and the disease course record of the doctor during the patient hospitalization period is utilized to perform relevant topic analysis. In the experimental process, the following treatment can be carried out, including:
(1) Using a Python crawler method, the text records of each stage, such as admission records, discharge records and disease course records, are split out of the 1294 patient medical record documents in HTML format, and the required patient information, diagnosis results and disease course record texts are separated.
(2) Constructing a dictionary and a stop-word bank. The medical record texts studied in the invention contain a large number of words irrelevant to the content; after counting the frequency of each word in the medical records, 12599 words are manually extracted as stop words and added to the stop-word bank. Meanwhile, the disease names of the Chinese edition of ICD-10 are added to the dictionary as supplementary features.
(3) Performing word segmentation and stop-word removal with the dictionary and the stop-word bank, using the jieba word segmentation package in Python as the segmentation tool.
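Steps (2) and (3) can be illustrated with a minimal sketch. The function names, the toy English documents and the automatic stop-word rule are assumptions made for illustration only; in the embodiment the stop words are selected manually, and the segmenter would be jieba with the ICD-10 disease names loaded as a user dictionary.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def word_frequencies(docs: Iterable[str],
                     segment: Callable[[str], List[str]]) -> Counter:
    """Count how often each token appears across all documents."""
    counts: Counter = Counter()
    for text in docs:
        counts.update(tok for tok in segment(text) if tok.strip())
    return counts


def tokenize_and_filter(text: str,
                        segment: Callable[[str], List[str]],
                        stop_words: Set[str]) -> List[str]:
    """Segment one course record and drop stop words and empty tokens."""
    return [tok for tok in segment(text)
            if tok.strip() and tok not in stop_words]


# Toy English stand-in. With jieba installed, `segment` could be jieba.lcut
# after jieba.load_userdict(...) has loaded the supplementary disease names.
docs = ["patient reports thirst and fatigue", "patient reports blurred vision"]
segment = str.split

freq = word_frequencies(docs, segment)
# Hypothetical rule: treat words that occur in every document as stop words.
# The embodiment instead picks stop words by hand from the frequency counts.
stop_words = {w for w, c in freq.items() if c == len(docs)}
tokens = [tokenize_and_filter(d, segment, stop_words) for d in docs]
```

Passing the segmenter as a parameter keeps the sketch independent of any particular tokenizer, so the same two functions work for Chinese text once a real segmenter is supplied.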
Considering that, in medical record topic mining, the number of topics influences text topic modeling and different similarity thresholds yield different numbers of similar medical records, this embodiment treats the similarity threshold and the number of topics as tuning parameters. The medical record similarity threshold τ ranges from 0.5 to 0.8, the number of topics K takes the values 7, 10, 13, 15, 20 and 30, and the PMI-Score and the medical record similarity constraint index of the model are calculated under each parameter setting.
Referring to fig. 5 and fig. 6, the similarity constraint results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the similarity constraint index SIM. The comparison shows that the MRS-LDA model has a clear advantage in medical record similarity constraint. For a fixed similarity threshold, the medical record similarity constraint index decreases slightly as the number of topics increases, but the MRS-LDA model still holds a larger advantage over the LDA model on this index.
Referring to fig. 7, the pointwise mutual information (PMI-Score) results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the PMI-Score. When the number of topics K is 15, the MRS-LDA model outperforms the LDA model on the PMI-Score; it also outperforms the LDA model when the medical record similarity threshold is 0.5.
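The text does not spell out how the PMI-Score is computed. A common definition, assumed here, averages the pointwise mutual information over all pairs of a topic's top words, with probabilities estimated from document co-occurrence. The function name `pmi_score` and the smoothing constant `eps` are illustrative choices, not taken from the patent.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set


def pmi_score(top_words: Sequence[str], docs: List[Set[str]],
              eps: float = 1e-12) -> float:
    """Mean pairwise PMI of a topic's top words (assumed definition)."""
    n_docs = len(docs)
    pairs = list(combinations(top_words, 2))
    if not pairs or n_docs == 0:
        return 0.0

    def p(*words: str) -> float:
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in docs) / n_docs

    total = 0.0
    for wi, wj in pairs:
        # eps keeps the logarithm finite when a pair never co-occurs.
        total += math.log((p(wi, wj) + eps) / (p(wi) * p(wj) + eps))
    return total / len(pairs)


# Toy corpus: "a" and "b" always co-occur, "a" and "c" never do.
corpus = [{"a", "b"}, {"a", "b"}, {"c"}, {"d"}]
coherent = pmi_score(["a", "b"], corpus)
incoherent = pmi_score(["a", "c"], corpus)
```

A higher score means the top words of a topic tend to appear in the same documents, which is the sense in which one model can be said to outperform another on this metric.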
The comparison experiments show that the MRS-LDA model performs well on the similarity constraint index: for the same medical record similarity threshold and the same number of topics, the distance between the topic distributions of similar medical records obtained by the MRS-LDA model is smaller, so the existing association between similar medical records is described better. In other words, because the medical record similarity constraint is added when the objective function is constructed, the topic distributions of similar medical records are relatively close; the method is therefore suited to medical record topic mining and achieves relatively high accuracy.
In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, and referring to fig. 8, the apparatus includes:
a medical record set obtaining module 801, configured to calculate similarity between any two medical record documents in an initial medical record, and obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
and a topic distribution derivation module 802, configured to sequentially input each medical record document in the similarity-constrained medical record set into a preset LDA model, and derive document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
Optionally, referring to fig. 9, the medical record set obtaining module 801 includes:
a weight value obtaining unit 901, configured to obtain a plurality of similarity calculation factors of a medical record and a weight value of each similarity calculation factor;
a factor data calculating unit 902, configured to calculate a numerical value of each similarity calculation factor of any two medical record documents respectively;
and the similarity calculation unit 903 is configured to calculate the similarity between any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
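As a rough illustration of units 901 to 903, the sketch below combines the three distances into one similarity value and keeps the pairs that reach the threshold. The weights, the number of age segments, the Jaccard distance for diagnoses and the form "1 minus the weighted distance" are all assumptions; the patent text only names the factors and states that they are weighted.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


@dataclass(frozen=True)
class Record:
    gender: str                 # e.g. "M" or "F"
    age_segment: int            # index of the age group, 0 .. N_SEGMENTS - 1
    diagnoses: FrozenSet[str]   # admission diagnosis codes


N_SEGMENTS = 4                  # hypothetical: child, young, middle-aged, elderly
WEIGHTS = (0.2, 0.3, 0.5)       # hypothetical weights: gender, age, diagnosis


def similarity(a: Record, b: Record) -> float:
    """Similarity as 1 minus a weighted sum of three distances in [0, 1]."""
    d_gender = 0.0 if a.gender == b.gender else 1.0
    d_age = abs(a.age_segment - b.age_segment) / (N_SEGMENTS - 1)
    union = a.diagnoses | b.diagnoses
    # Jaccard distance between the two diagnosis sets (assumed choice).
    d_diag = (1.0 - len(a.diagnoses & b.diagnoses) / len(union)) if union else 0.0
    w_g, w_a, w_d = WEIGHTS
    return 1.0 - (w_g * d_gender + w_a * d_age + w_d * d_diag)


def similar_pairs(records: List[Record], tau: float) -> Set[Tuple[int, int]]:
    """Index pairs whose similarity is greater than or equal to tau."""
    return {(i, j)
            for (i, a), (j, b) in combinations(enumerate(records), 2)
            if similarity(a, b) >= tau}


records = [
    Record("M", 3, frozenset({"E11", "I10"})),
    Record("M", 3, frozenset({"E11", "I10"})),
    Record("F", 0, frozenset({"E10"})),
]
pairs = similar_pairs(records, tau=0.5)
```

With weights that sum to 1 and distances in [0, 1], the similarity also stays in [0, 1], which makes a threshold such as τ = 0.5 directly comparable across record pairs.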
Optionally, referring to fig. 10, the topic distribution derivation module 802 includes:
a topic numbering unit 1001, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit 1002, configured to rescan the similarity constraint medical record set and, according to the similarity constraint set
Figure BDA0001746019110000181
resample the topics so that the new topics meet Gibbs sampling convergence;
and a topic distribution calculation unit 1003, configured to count a topic-word co-occurrence frequency matrix in the corpus to obtain a document-topic distribution and a topic-word distribution.
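Units 1001 to 1003 follow the usual shape of collapsed Gibbs sampling. The sketch below is for plain LDA only: it omits the similarity constraint term of the embodiment, runs a fixed number of sweeps instead of testing convergence, and uses hypothetical names and default hyperparameters (`lda_gibbs`, `alpha`, `beta`, `n_iter`).

```python
import random
from typing import List, Tuple


def lda_gibbs(docs: List[List[int]], n_topics: int, vocab_size: int,
              alpha: float = 0.1, beta: float = 0.01, n_iter: int = 200,
              seed: int = 0) -> Tuple[List[List[float]], List[List[float]]]:
    """Collapsed Gibbs sampling for plain LDA (no similarity constraint)."""
    rng = random.Random(seed)
    n_dk = [[0] * n_topics for _ in docs]               # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                                # words per topic
    z: List[List[int]] = []                             # topic number per word

    # Step 1: randomly assign a topic number z to every word.
    for m, doc in enumerate(docs):
        z_m = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_m.append(k)
            n_dk[m][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_m)

    # Step 2: rescan the corpus and resample the topic of each word.
    for _ in range(n_iter):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]
                n_dk[m][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                weights = [(n_dk[m][t] + alpha) * (n_kw[t][w] + beta)
                           / (n_k[t] + vocab_size * beta)
                           for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights)[0]
                z[m][i] = k
                n_dk[m][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    # Step 3: turn the co-occurrence counts into the two distributions.
    theta = [[(c + alpha) / (len(doc) + n_topics * alpha) for c in row]
             for row, doc in zip(n_dk, docs)]
    phi = [[(c + beta) / (n_k[k] + vocab_size * beta) for c in n_kw[k]]
           for k in range(n_topics)]
    return theta, phi


# Two toy documents over a vocabulary of four word ids.
theta, phi = lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3]], n_topics=2,
                       vocab_size=4, n_iter=50)
```

Here `theta` plays the role of the document-topic distribution and `phi` the topic-word distribution; a constrained variant would change only the sampling weights in step 2.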
Optionally, the preset LDA model includes:
the similarity constraint between any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, given by:
Figure BDA0001746019110000182
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \dots, \theta_{m,L_m}\}$ indicates that medical record document $m$ includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th disease course record; and $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
Figure BDA0001746019110000183
Figure BDA0001746019110000184
denotes the number of times word i is assigned topic k in the similarity constraint medical record set.
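The exact expression for the topic distribution distance is given only as a figure. One plausible reading, assumed here purely for illustration, averages the Euclidean distance d over every pair of disease course records drawn from the two medical record documents. The function names are hypothetical.

```python
import math
from typing import List, Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance d between two topic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def topic_distance(theta_rm: List[Sequence[float]],
                   theta_rn: List[Sequence[float]]) -> float:
    """Assumed aggregate: mean of d over all pairs of course records."""
    total = sum(euclidean(p, q) for p in theta_rm for q in theta_rn)
    return total / (len(theta_rm) * len(theta_rn))


# Document m has two course records, document n has one (K = 2 topics).
theta_rm = [[1.0, 0.0], [0.0, 1.0]]
theta_rn = [[1.0, 0.0]]
dist = topic_distance(theta_rm, theta_rn)
```

Under this reading a smaller value means the course records of two similar medical record documents share closer topic distributions, which is the property the similarity constraint is meant to encourage.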
It should be noted that the medical record topic acquisition device considering similarity constraint provided in the embodiment of the present invention corresponds one-to-one with the above method; the implementation details of the method also apply to the device, and the device is therefore not described in further detail here.
In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims (6)

1. A medical record topic acquisition method considering similarity constraint is characterized by comprising the following steps:
calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;
deriving document-topic distribution and topic-word distribution of each medical record document through the preset LDA model comprises:
randomly assigning a topic number z to each word in each medical record document in the similarity constraint medical record set;
rescanning the similarity constraint medical record set and, according to each word
Figure FDA0003460890300000011
resampling the topics so that the new topics meet Gibbs sampling convergence; wherein
Figure FDA0003460890300000012
represents the probability that word
Figure FDA0003460890300000013
is assigned topic k;
counting a topic-word co-occurrence frequency matrix in a corpus to obtain document-topic distribution and topic-word distribution;
the preset LDA model comprises:
the similarity constraint between any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, given by:
Figure FDA0003460890300000014
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \dots, \theta_{m,L_m}\}$ indicates that medical record document $m$ includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th disease course record; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records; $\theta r_n = \{\theta_{n,1}, \theta_{n,2}, \dots, \theta_{n,L_n}\}$ indicates that medical record document $n$ includes $L_n$ disease course records; $\theta_{n,L_n}$ denotes the topic of the $L_n$-th disease course record;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
Figure FDA0003460890300000021
Figure FDA0003460890300000022
denotes the number of times word i is assigned topic k in the similarity constraint medical record set; (α, β) are the prior Dirichlet distribution parameters; V is the total number of words in the medical record set; in the preset LDA model
Figure FDA0003460890300000023
Figure FDA0003460890300000024
$\mu_{mk}$ represents the probability that medical record document $m$ belongs to topic $k$.
2. The medical record topic acquisition method of claim 1, wherein calculating the similarity between any two medical record documents in the initial medical record comprises:
acquiring a plurality of similarity calculation factors of the medical record and weight values of the similarity calculation factors;
respectively calculating the numerical values of any two medical record documents about each similarity calculation factor;
and calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
3. The medical record topic acquisition method as recited in claim 2, wherein the similarity calculation factors comprise: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
4. A medical record topic acquisition device considering similarity constraint, characterized in that the device comprises:
the medical record set acquisition module is used for calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;
the topic distribution derivation module is used for sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model and deriving document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;
the topic distribution derivation module comprises:
a topic numbering unit, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;
a topic iteration unit, configured to rescan the similarity constraint medical record set and, according to each word
Figure FDA0003460890300000031
resample the topics so that the new topics meet Gibbs sampling convergence; wherein
Figure FDA0003460890300000032
represents the probability that word
Figure FDA0003460890300000033
is assigned topic k;
the topic distribution calculating unit is used for counting topic-word co-occurrence frequency matrixes in the corpus to obtain document-topic distribution and topic-word distribution;
the preset LDA model comprises:
the similarity constraint between any two medical record documents is expressed by the topic distribution distance $dis(\theta r_m, \theta r_n)$, given by:
Figure FDA0003460890300000034
where $\theta r_m = \{\theta_{m,1}, \theta_{m,2}, \dots, \theta_{m,L_m}\}$ indicates that medical record document $m$ includes $L_m$ disease course records; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th disease course record; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two disease course records; $\theta r_n = \{\theta_{n,1}, \theta_{n,2}, \dots, \theta_{n,L_n}\}$ indicates that medical record document $n$ includes $L_n$ disease course records; $\theta_{n,L_n}$ denotes the topic of the $L_n$-th disease course record;
the preset LDA model further comprises a Gibbs-EM iterative function which is as follows:
Figure FDA0003460890300000035
Figure FDA0003460890300000036
denotes the number of times word i is assigned topic k in the similarity constraint medical record set; (α, β) are the prior Dirichlet distribution parameters; V is the total number of words in the medical record set; in the preset LDA model
Figure FDA0003460890300000037
Figure FDA0003460890300000041
$\mu_{mk}$ represents the probability that medical record document $m$ belongs to topic $k$.
5. The medical record topic acquisition device of claim 4, wherein the medical record collection acquisition module comprises:
a weight value acquisition unit, used for acquiring a plurality of similarity calculation factors of the medical record and the weight value of each similarity calculation factor;
the factor data calculation unit is used for calculating the numerical values of any two medical record documents about each similarity calculation factor;
and the similarity calculation unit is used for calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
6. The medical record topic acquisition device as claimed in claim 5, wherein the similarity calculation factors comprise: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.
CN201810843072.0A 2018-07-27 2018-07-27 Medical record theme acquisition method and device considering similarity constraint Active CN109065174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810843072.0A CN109065174B (en) 2018-07-27 2018-07-27 Medical record theme acquisition method and device considering similarity constraint

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810843072.0A CN109065174B (en) 2018-07-27 2018-07-27 Medical record theme acquisition method and device considering similarity constraint

Publications (2)

Publication Number Publication Date
CN109065174A CN109065174A (en) 2018-12-21
CN109065174B true CN109065174B (en) 2022-02-18

Family

ID=64836831

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810843072.0A Active CN109065174B (en) 2018-07-27 2018-07-27 Medical record theme acquisition method and device considering similarity constraint

Country Status (1)

Country Link
CN (1) CN109065174B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant