CN109065174B

CN109065174B - Medical record theme acquisition method and device considering similarity constraint

Info

Publication number: CN109065174B
Application number: CN201810843072.0A
Authority: CN
Inventors: 丁帅; 蔡琼; 潘金鑫; 金行
Original assignee: Hefei University of Technology
Current assignee: Hefei University of Technology
Priority date: 2018-07-27
Filing date: 2018-07-27
Publication date: 2022-02-18
Anticipated expiration: 2038-07-27
Also published as: CN109065174A

Abstract

The invention provides a medical record topic acquisition method and device considering similarity constraints. The method comprises the following steps: calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by the medical record documents whose similarity is greater than or equal to a similarity threshold; and sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model. In this way, the reasoning process by which a doctor composes a medical record text during diagnosis and treatment can be simulated well, which helps improve the accuracy of the acquired topics.

Description

Medical record theme acquisition method and device considering similarity constraint

Technical Field

The invention relates to the technical field of data mining, in particular to a medical record topic acquisition method and device considering similarity constraints.

Background

At present, topic models are mostly applied to the evolution analysis of online public opinion topics in social media, where the topic distributions of different time periods help to monitor changes in public opinion effectively and even to actively guide their direction. Topic models have also been applied, to a lesser extent, in clinical diagnosis and treatment, with the aim of analyzing the disease-medication and disease-symptom patterns in medical record documents. The analysis process is as follows: each medical record document is input into the model as an independent sample, and the final topic analysis result is obtained through extensive training.

However, in the process of implementing the scheme of the invention, the inventors found the following. On the one hand, because disease progression is similar between two patients with the same disease, the diagnosis and treatment plan a physician makes is affected by previous plans for similar patients. On the other hand, there are individual differences between two patients, such as constitution, sex, age and disease stage, so doctors give different diagnosis and treatment plans to different patients. In actual diagnosis and treatment, two patients may have similar physical conditions and diseases, in which case their diagnosis and treatment plans also share similar parts. For example, diabetic patients may have multiple diabetic complications at the same time, but the diagnosis and treatment plans and disease progression for the same complication should be similar.

Disclosure of Invention

In view of the defects in the prior art, the invention provides a medical record topic acquisition method and device considering similarity constraints, which are used for solving the technical problems in the related art.

In a first aspect, an embodiment of the present invention provides a medical record topic acquisition method considering similarity constraints, where the method includes:

calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;

and sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.

Optionally, calculating the similarity between any two medical record documents in the initial medical record comprises:

acquiring a plurality of similarity calculation factors of the medical record and weight values of the similarity calculation factors;

respectively calculating the numerical values of any two medical record documents about each similarity calculation factor;

and calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.

Optionally, the similarity calculation factor comprises: distance of gender attribute, distance of segment to which age belongs, distance of diagnosis result.

Optionally, deriving the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model includes:

randomly assigning a topic number z to each word in each medical record document in the similarity constraint medical record set;

rescanning the similarity constraint medical record set and, for each word, resampling its topic until the new topics satisfy Gibbs sampling convergence;

and counting the topic-word co-occurrence frequency matrix in the corpus to obtain the document-topic distribution and the topic-word distribution.

Optionally, the preset LDA model includes:

the similarity constraint between any two medical record documents is expressed by the topic distribution distance dis(θ_r^m, θ_r^n), given by the formula:

where θ_r^m = {θ_m,1, θ_m,2, …, θ_m,Lm} indicates that each medical record document includes L_m course records; θ_m,Lm denotes the topic of the L_m-th course record; and d(θ_m,Lm, θ_n,Ln) is the Euclidean distance between the topic vectors of two course records;

the preset LDA model further comprises a Gibbs-EM iterative function, which is as follows:

where the count term represents the number of words i with topic k in the similarity constraint medical record set.

In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, where the apparatus includes:

the medical record set acquisition module is used for calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;

and the theme distribution derivation module is used for sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model and deriving the document-theme distribution and the theme-word distribution of the medical record documents through the preset LDA model.

Optionally, the medical record collection acquiring module includes:

a weight value acquisition unit, used for acquiring a plurality of similarity calculation factors of the medical record and the weight value of each similarity calculation factor;

the factor data calculation unit is used for calculating the numerical values of any two medical record documents about each similarity calculation factor;

and the similarity calculation unit is used for calculating the similarity of any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.

Optionally, the topic distribution derivation module includes:

a topic numbering unit, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;

a topic iteration unit, used for rescanning the similarity constraint medical record set and, for each word, resampling its topic until the new topics satisfy Gibbs sampling convergence;

and the theme distribution calculating unit is used for counting the theme-word co-occurrence frequency matrix in the corpus to obtain document-theme distribution and theme-word distribution.

Optionally, the preset LDA model includes:

where θ_r^m = {θ_m,1, θ_m,2, …, θ_m,Lm} indicates that each medical record document includes L_m course records; θ_m,Lm denotes the topic of the L_m-th course record; and d(θ_m,Lm, θ_n,Ln) is the Euclidean distance between the topic vectors of two course records.

According to the technical scheme, the similarity of the two medical record documents is calculated, so that a plurality of medical record documents which are larger than or equal to the similarity threshold value can be screened from the initial medical record, and the similarity constraint medical record set formed by the plurality of medical record documents is used as the topic analysis document in the subsequent process. Therefore, the thinking process of determining the medical record text in the doctor diagnosis and treatment process can be well simulated, and the accuracy of obtaining the theme is facilitated.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

Fig. 1 is a schematic flowchart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention;

FIG. 2 is a record of the course of disease in a medical record document;

FIG. 3 is a graph of the number of diabetic complications in a male patient;

FIG. 4 is a graph of the number of diabetic complications in a female patient;

FIG. 5 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.5 and 0.6, respectively;

FIG. 6 is a diagram of the relationship between the number of topics and the similarity constraint index SIM with similarity thresholds of 0.7 and 0.8, respectively;

FIG. 7 is a diagram of the relationship between the number of topics and the mutual information (PMI-Score);

FIG. 8 to FIG. 10 are block diagrams of a medical record topic acquisition apparatus considering similarity constraints according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Fig. 1 is a schematic flow chart of a medical record topic acquisition method considering similarity constraints according to an embodiment of the present invention. Referring to fig. 1, a medical record topic acquisition method considering similarity constraints includes:

101, calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;

and 102, sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-theme distribution and theme-word distribution of the medical record documents through the preset LDA model.

The steps of the medical record topic acquisition method considering similarity constraints are described in detail below with reference to the accompanying drawings and embodiments.

First, step 101 is described: calculating the similarity between any two medical record documents in the initial medical record to obtain a similarity constraint medical record set formed by a plurality of medical record documents whose similarity is greater than or equal to a similarity threshold.

During the hospitalization, the patient will generate various detection records, such as admission record, discharge record, course record, consultation record, etc. If the similarity between the detection records is directly calculated, the calculation amount is greatly increased. For convenience of explanation, in this embodiment, the detection record before processing is referred to as an initial medical record.

To reduce the amount of calculation, only the similarity of the hospitalized diagnosis portion in the initial medical record is considered in this embodiment. In one embodiment, the similarity is calculated as the distance between any two initial medical records, and the medical record similarity constraint construction can be understood as collecting a medical record set with the distance between every two medical records being smaller than a certain threshold.

In practical applications, the initial medical record may also include various complications of a certain disease, for example, diabetes may cause various complications, as shown in table 1.

Table 1 diabetic complications example

As can be seen from Table 1, patients of different ages differ in the characteristics of diabetes and its complications; in addition, patients of different ages tolerate drugs differently, so the clinical diagnosis and treatment process differs in aspects such as clinical manifestation and medication. Therefore, the basic information of the patient needs to be considered when calculating the similarity of medical record documents, and in this embodiment the patient's gender and age are included among the similarity calculation factors.

In one embodiment, the distance of the gender attribute is set to 1 between patients of the same gender and to 0 between patients of different genders, as shown in the following equation:

d(sex_i, sex_j) = 1 if sex_i = sex_j, and 0 otherwise

where sex_i and sex_j denote the genders of the two individuals.

In one embodiment, ages are divided into 4 age groups according to the international population age structure: juvenile, 0-17 years old, denoted 1; young, 18-45 years old, denoted 2; middle-aged, 46-59 years old, denoted 3; and older, over 59 years old, denoted 4. The distance between the age groups of two patients can then be calculated, as represented by the following formula:

where age_i and age_j denote the ages of the two individuals, and flag_i and flag_j denote the age segments to which they belong. The closer the segments to which the two ages belong, the smaller the distance; the farther apart the segments, the larger the distance.
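As a non-limiting illustration, the age segmentation can be sketched in Python as follows. The segment boundaries follow the four groups listed above, taking the middle-aged group as 46-59 (the range between the young and older groups). The distance formula itself is not reproduced in this text, so `age_segment_distance` uses a normalized segment gap purely as an assumption; the function names are hypothetical.

```python
def age_segment(age: int) -> int:
    """Map an age in years to its segment flag (1..4)."""
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return 1  # juvenile
    if age <= 45:
        return 2  # young
    if age <= 59:
        return 3  # middle-aged (assumed to cover 46-59)
    return 4      # older


def age_segment_distance(age_i: int, age_j: int) -> float:
    """ASSUMPTION: distance = gap between segment flags, scaled to [0, 1]."""
    flag_i, flag_j = age_segment(age_i), age_segment(age_j)
    return abs(flag_i - flag_j) / 3.0


print(age_segment(30), age_segment(62))
print(age_segment_distance(30, 62))
```

Under this assumed form, two patients in the same segment are at distance 0 and a juvenile and an older patient are at distance 1, which matches the stated behaviour that nearer segments give a smaller distance.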

Considering that the diagnoses in the initial medical record are discrete textual descriptions, this embodiment uses the Jaccard distance to calculate the distance between the diagnosis results of different initial medical records, as shown in the following formula:

d(dia_i, dia_j) = |dia_i ∩ dia_j| / |dia_i ∪ dia_j|

where dia_i and dia_j represent the discharge diagnosis Boolean vectors of medical record i and medical record j; this embodiment mainly considers diabetic complications.

For example: dia_i = {1, 2, 3}, dia_j = {2, 3, 4}; then dia_i ∩ dia_j = {2, 3}, dia_i ∪ dia_j = {1, 2, 3, 4}, and d(dia_i, dia_j) = 2/4 = 0.5.
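The worked example can be checked with a few lines of Python. This is a minimal sketch: the function name `diagnosis_overlap` is hypothetical, and diagnoses are represented as sets of integer codes as in the example. Note that the quantity computed in the example, intersection size over union size, is what is usually called the Jaccard index; the Jaccard distance in the usual sense is 1 minus this value.

```python
def diagnosis_overlap(dia_i: set, dia_j: set) -> float:
    """Return |dia_i ∩ dia_j| / |dia_i ∪ dia_j| for two diagnosis sets."""
    union = dia_i | dia_j
    if not union:
        # ASSUMPTION: two empty diagnosis sets are treated as sharing nothing.
        return 0.0
    return len(dia_i & dia_j) / len(union)


value = diagnosis_overlap({1, 2, 3}, {2, 3, 4})
print(value)  # 0.5, matching the worked example
```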

It should be noted that this embodiment considers only the following similarity calculation factors: the distance of the gender attribute, the distance of the segment to which the age belongs, and the distance of the diagnosis result. When the application scenario of the topic acquisition method changes, the composition of the similarity calculation factors can be adjusted accordingly, and the adjusted scheme also falls within the protection scope of the present application.

After the similarity calculation factors are determined, the weight adjustment parameters μ₁, μ₂, μ₃ are set respectively, and the similarity between any two initial medical records is calculated, as shown in the following formulas:

sim(T_i, T_j) = μ₁*d(sex_i, sex_j) + μ₂*d(age_i, age_j) + μ₃*d(dia_i, dia_j) (3)

μ₁ + μ₂ + μ₃ = 1 (4)

0 ≤ μ₁, μ₂, μ₃ ≤ 1 (5)

Finally, the similarity is compared with a similarity threshold τ, the initial medical records whose similarity is greater than or equal to the similarity threshold are screened out, and the similarity constraint medical record set composed of these initial medical records is obtained and denoted D = {(T_i, T_j) | i, j ∈ [1, M]}.
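As a non-limiting illustration, the weighted similarity and the threshold screening of step 101 might be sketched in Python as follows. The record layout (dictionaries with `sex`, `segment` and `dia` keys), the default weights and the function names are all hypothetical. The gender and diagnosis factors follow the text; the age factor is an assumption (closeness of age segments scaled to [0, 1]), chosen only so that all three factors grow as two records become more alike.

```python
from itertools import combinations


def d_sex(a, b):
    # 1 for the same gender, 0 otherwise, as stated in the text.
    return 1.0 if a["sex"] == b["sex"] else 0.0


def d_age(a, b):
    # ASSUMPTION: closeness of age segments (flags 1..4), scaled to [0, 1].
    return 1.0 - abs(a["segment"] - b["segment"]) / 3.0


def d_dia(a, b):
    # Intersection over union of the diagnosis sets, as in the worked example.
    union = a["dia"] | b["dia"]
    return len(a["dia"] & b["dia"]) / len(union) if union else 0.0


def similarity(a, b, mu=(0.2, 0.3, 0.5)):
    """Weighted sum of the three factors; mu must satisfy the weight constraints."""
    if min(mu) < 0 or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return mu[0] * d_sex(a, b) + mu[1] * d_age(a, b) + mu[2] * d_dia(a, b)


def constrained_pairs(records, tau, mu=(0.2, 0.3, 0.5)):
    """Index pairs (i, j) whose similarity is at least the threshold tau."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b, mu) >= tau
    ]


records = [
    {"sex": "M", "segment": 3, "dia": {1, 2, 3}},
    {"sex": "M", "segment": 3, "dia": {2, 3, 4}},
    {"sex": "F", "segment": 1, "dia": {5}},
]
pairs = constrained_pairs(records, tau=0.6)
print(pairs)
```

With these made-up records, only the first two share gender, age segment and part of their diagnoses, so only that pair passes the threshold.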

Next, step 102 is described: sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.

In this embodiment, the preset LDA model is obtained by improving the existing LDA model. To help those skilled in the art better understand the preset LDA model, the basic principle of the LDA model is described first:

Latent Dirichlet Allocation (LDA) is a topic model that aims to find the topics of documents. It has three layers: document, topic and word. Each document has a probability distribution over topics, and the words in the document are sampled from the distributions of different topics, as shown in the following formula (6):

p(word | document) = Σ_topic p(word | topic) × p(topic | document) (6)
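Formula (6) is a mixture over topics: the probability of a word in a document is the topic-word probability weighted by the document-topic probability and summed over all topics. A minimal numerical sketch, with made-up probabilities for two topics and three words and a hypothetical function name:

```python
def word_given_document(p_word_topic, p_topic_doc):
    """p(word | document) = sum over topics of p(word | topic) * p(topic | document)."""
    n_topics = len(p_topic_doc)
    n_words = len(p_word_topic[0])
    return [
        sum(p_topic_doc[k] * p_word_topic[k][w] for k in range(n_topics))
        for w in range(n_words)
    ]


# Illustrative values only: rows are topics, columns are words.
phi = [[0.7, 0.2, 0.1],
       [0.1, 0.3, 0.6]]
theta = [0.25, 0.75]  # topic proportions of one document

p_words = word_given_document(phi, theta)
print([round(p, 3) for p in p_words])
```

Because each row of `phi` and the vector `theta` sum to 1, the resulting word probabilities also sum to 1.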

Medical record documents are modeled with the LDA model. Let the total number of medical record documents be M; the m-th medical record document contains N_m clinical description words, each denoted ω_m,n. Following the existing bag-of-words model, documents and words are represented as a document-topic distribution and a topic-word distribution. The topics in medical record texts can be understood as general categories of clinical care such as medication, observation, symptoms and operation, and each medical record text is a multinomial distribution over several topics; that is, each medical record text is a combination of several steps in the clinical care process.

In the related art, the steps of the LDA model generating the medical record text are shown in table 2.

It can be understood that each topic is a multinomial distribution over words, just as each clinical care step comprises several concrete clinical operations, and that the document-topic distribution and the topic-word distribution follow Dirichlet priors with parameters α and β. The LDA model can therefore simulate well the reasoning process by which a doctor composes a medical record text during diagnosis and treatment.
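For orientation, the standard LDA generative process can be sketched in Python as follows. This is the textbook procedure, not a reproduction of Table 2; the vocabulary, hyperparameter values and function names are hypothetical, and symmetric Dirichlet priors are assumed.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha=0.5, beta=0.5, seed=0):
    rng = random.Random(seed)
    # One word distribution per topic, drawn from the beta prior.
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # One topic distribution per document, drawn from the alpha prior.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]  # pick a topic
            doc.append(rng.choices(vocab, weights=phi[z])[0])   # pick a word
        corpus.append(doc)
    return corpus, phi


vocab = ["insulin", "glucose", "metformin", "neuropathy", "retinopathy"]
corpus, phi = generate_corpus(n_docs=3, doc_len=6, n_topics=2, vocab=vocab)
print(corpus[0])
```

Inference runs this process in reverse: given only the words, it recovers plausible values for the per-document topic proportions and the per-topic word distributions.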

Based on the above analysis, the goal of LDA model inference is to estimate the unknown parameters of the LDA model from the current document set and, from them, to calculate the topic-word distribution and the document-topic distribution. In fact, the topic-word distribution and the document-topic distribution can be inferred directly during the calculation without explicitly computing those parameters.

In practical applications, the parameter inference algorithms for the LDA model include Gibbs sampling and variational EM. The two methods are described below.

First, Gibbs sampling is a Markov chain Monte Carlo (MCMC) method: in each iteration the value of only one dimension of the parameters is changed, until convergence, and the parameter values to be estimated are then output. According to Dirichlet parameter estimation, the following can be derived:

where the terms of the formula represent, respectively, the document-topic distribution, the topic-word distribution, and the probability that the topic assigned to word i is k; i is a data pair (m, n) denoting the n-th word in the m-th document.

Since there are a total of K topics, K iterations are required, and the training steps are shown in table 3:
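Table 3 is not reproduced in this text. As a stand-in, the following is a minimal Python sketch of collapsed Gibbs sampling for plain LDA (not the modified model of this embodiment). It assumes symmetric priors `alpha` and `beta`, documents given as lists of integer word ids, and a fixed number of sweeps in place of a convergence test; all names are hypothetical.

```python
import random


def gibbs_lda(docs, n_topics, n_vocab, alpha=0.5, beta=0.1, n_sweeps=50, seed=0):
    """Collapsed Gibbs sampling for plain LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    n_mk = [[0] * n_topics for _ in docs]            # document-topic counts
    n_kt = [[0] * n_vocab for _ in range(n_topics)]  # topic-word counts
    n_k = [0] * n_topics                             # words assigned to each topic

    def update(m, w, k, delta):
        n_mk[m][k] += delta
        n_kt[k][w] += delta
        n_k[k] += delta

    # Step 1: give every word a random topic number z.
    z = []
    for m, doc in enumerate(docs):
        z.append([rng.randrange(n_topics) for _ in doc])
        for w, k in zip(doc, z[m]):
            update(m, w, k, +1)

    # Step 2: rescan the corpus and resample the topic of each word.
    for _ in range(n_sweeps):
        for m, doc in enumerate(docs):
            for n, w in enumerate(doc):
                update(m, w, z[m][n], -1)  # leave the current word out
                weights = [
                    (n_mk[m][k] + alpha) * (n_kt[k][w] + beta) / (n_k[k] + n_vocab * beta)
                    for k in range(n_topics)
                ]
                z[m][n] = rng.choices(range(n_topics), weights=weights)[0]
                update(m, w, z[m][n], +1)

    # Step 3: turn the co-occurrence counts into the two distributions.
    theta = [[(n_mk[m][k] + alpha) / (len(doc) + n_topics * alpha)
              for k in range(n_topics)] for m, doc in enumerate(docs)]
    phi = [[(n_kt[k][t] + beta) / (n_k[k] + n_vocab * beta)
            for t in range(n_vocab)] for k in range(n_topics)]
    return theta, phi


docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]
theta, phi = gibbs_lda(docs, n_topics=2, n_vocab=4)
print([round(p, 3) for p in theta[0]])
```

The three commented steps mirror the random topic assignment, rescanning and counting steps described for the method; the sampling weight for topic k is the smoothed document-topic count times the smoothed topic-word proportion.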

Second, the variational EM algorithm seeks parameters that maximize the observed topic-word distribution probability in the text set, similar to a maximum likelihood estimation problem. It is divided into two iteration steps:

The variational E-step: because the posterior probability p(w | α, β) in the original step is difficult to derive, variational parameters (γ, φ) are introduced to construct an approximate posterior distribution q(θ, z | γ, φ).

The variational M-step: the approximation function L(γ, φ; α, β) is maximized. Here the prior Dirichlet parameters (α, β) determine the topic-word distribution and the document-topic distribution θ, w denotes words and z denotes topics.

Because the iteration goal of the LDA model is to maximize the occurrence probability p(Z, W | α, β) of the words, it cannot effectively accommodate the data characteristics of diabetes course records, and the topic distributions of similar medical records may differ greatly, so that medical records cannot be statistically analyzed effectively according to their topic distributions.

In order to establish a topic model satisfying the medical record similarity constraint, this embodiment achieves this goal by changing the Gibbs sampling convergence condition policy.

Considering that each medical record contains several time-ordered course records, the similarity calculation for medical record documents should take into account the similarity between the course record sets of different medical record documents; that is, the similarity constraint requires the document-topic distributions of the course record sets of the medical record documents in the set D to be as similar as possible.

Let T_m be the medical record numbered m, which includes L_m course records whose topic set is denoted θ_r^m = {θ_m,1, θ_m,2, …, θ_m,Lm}. For the course record topic sets θ_r^m and θ_r^n of two medical record documents, the medical record similarity constraint can be calculated as the mean of the pairwise topic distribution distances, as follows:

where d(θ_m,Lm, θ_n,Ln) is the Euclidean distance between the topic vectors of two course records; a larger dis(θ_r^m, θ_r^n) indicates lower similarity.
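One reading of this constraint, offered as an assumption since the formula is not reproduced here, is the average Euclidean distance over every pair consisting of one course record topic vector from each medical record. A minimal Python sketch with hypothetical function names and made-up topic vectors:

```python
import math


def euclidean(u, v):
    """Euclidean distance between two topic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def topic_set_distance(theta_m, theta_n):
    """ASSUMED form: mean of d(u, v) over all pairs u in theta_m, v in theta_n."""
    total = sum(euclidean(u, v) for u in theta_m for v in theta_n)
    return total / (len(theta_m) * len(theta_n))


# Illustrative values: record m has two course records, record n has one.
theta_m = [[1.0, 0.0], [0.0, 1.0]]
theta_n = [[1.0, 0.0]]
dis = topic_set_distance(theta_m, theta_n)
print(round(dis, 4))
```

Here the two pairwise distances are 0 and √2, so the mean is √2/2; identical topic sets would give 0, consistent with a larger value indicating lower similarity.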

The objective function to be maximized may be modified as:

In this embodiment, a Gibbs-EM iteration method is adopted for LDA model inference, in which the document-topic distribution α_m is replaced by a normally distributed μ_m, yielding the preset LDA model:

where μ_mk represents the probability that medical record document m belongs to topic k. Since μ_m is assumed to follow a standard normal distribution, the improved objective function to be maximized is expressed as follows:

In addition, in this embodiment the document-topic distribution α_m is fixed in advance during sampling, so the Gibbs-EM iterative function is expressed as:

where the count term represents the number of words i whose topic is k in the similarity constraint medical record set. Since the original α is replaced by a normal distribution, formula (14) can be solved by a stochastic gradient descent method, and the model training process is shown in Table 4:

and then, sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deducing document-theme distribution and theme-word distribution of the medical record documents through the preset LDA model.

Therefore, on the basis of analyzing the influence of text mining on medical diagnosis and the modeling process and inference methods of the latent Dirichlet topic model, the embodiment of the invention designs a preset LDA model based on the medical record similarity constraint. The preset LDA model not only considers the similarity constraint among different medical record documents, but also defines the modeling target, the inference process and the evaluation metrics for medical text topics, so that the preset LDA model can clearly reflect the focus of each diagnosis and treatment stage and the course of disease evolution, which helps improve the scientific soundness, effectiveness and accuracy of medical record topic mining.

A comparative experiment between the LDA model and the preset LDA model of the present application (hereinafter referred to as Medical Record Similarity Latent Dirichlet Allocation, MRS-LDA) is carried out to illustrate the effectiveness and superiority of the medical record topic acquisition method considering similarity constraints provided by the embodiment of the present invention.

The initial medical records are the medical records of patients in the endocrinology department of the First Affiliated Hospital of Anhui Medical University, comprising the admission records of 1294 patients in total from 2015 to 2017. Each medical record document mainly comprises admission records, course records (shown in FIG. 2), consultation records, discharge records and the like. The numbers of medical record documents of male and female patients are approximately equal, at a ratio of 648:646.

Referring to FIG. 3 and FIG. 4, among the diabetic patients admitted to the First Affiliated Hospital of Anhui Medical University, the hospitalization diagnoses show that patients of different ages and different sexes differ significantly in the number of complications they have at the same time. The number of diabetic complications in the older group is much higher than in other age groups; many middle-aged patients have 3 to 5 complications at the same time; young patients have diabetes but few complications; and the number of juvenile diabetic patients is small.

In the embodiment, the sex, age and admission diagnosis of the patient in the admission record are selected as the basis of medical record similarity constraint calculation data, and the disease course record of the doctor during the patient hospitalization period is utilized to perform relevant topic analysis. In the experimental process, the following treatment can be carried out, including:

(1) Using a Python crawler method, the text records of each stage, such as admission records, discharge records and course records, are split out of the 1294 patient medical record documents in HTML format, and the required patient information, diagnosis results and course record texts are separated.

(2) Constructing a dictionary and a stop word bank. The medical record text contains a large number of words that are irrelevant to the research content of the invention; after counting the frequency of each word appearing in the medical records, 12599 words are manually extracted as stop words and added to the stop word bank. Meanwhile, the disease names of the Chinese edition of ICD-10 are added to the dictionary as supplementary features.

(3) Using the jieba word segmentation tool in Python, word segmentation and stop word removal are performed with the dictionary and the stop word bank.
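The segmentation and stop word removal of step (3) can be sketched as follows. To keep the example runnable without third-party packages, the tokenizer is passed in as a callable and defaults to whitespace splitting; in a setting like the one described, `jieba.lcut` could be passed as `segment` after loading the custom dictionary with `jieba.load_userdict`. The function name, sample sentence and stop words are hypothetical.

```python
def preprocess(text, stop_words, segment=str.split):
    """Segment a course record text and drop stop words and empty tokens."""
    return [tok for tok in segment(text) if tok.strip() and tok not in stop_words]


# Illustrative, pre-spaced sample; real course records would be Chinese text
# segmented by a tokenizer such as jieba.lcut.
stop_words = {"the", "and", "reports"}
tokens = preprocess("the patient reports thirst and blurred vision", stop_words)
print(tokens)
```

Keeping the tokenizer as a parameter separates the filtering logic from the choice of segmentation tool, so the same function works for any language or dictionary.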

Considering that, in medical record topic mining, the number of topics influences text topic modeling and different similarity thresholds yield different numbers of similar medical records, this embodiment treats the similarity threshold and the number of topics as tunable parameters: the medical record similarity threshold τ ranges from 0.5 to 0.8, the number of topics K takes the values 7, 10, 13, 15, 20 and 30, and the PMI-Score and the medical record similarity constraint of the model are calculated under each of these settings.

Referring to FIG. 5 and FIG. 6, the similarity constraint results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the similarity constraint index SIM. The comparison shows that the MRS-LDA model has an obvious advantage in the medical record similarity constraint. For a fixed similarity threshold, the medical record similarity constraint decreases slightly as the number of topics increases, but the MRS-LDA model still has a clear advantage over the LDA model on the medical record similarity constraint index.

Referring to FIG. 7, the mutual information (PMI-Score) results of the MRS-LDA model and the LDA model are compared under different numbers of topics and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the PMI-Score. When the number of topics K is 15, the MRS-LDA model is superior to the LDA model on the PMI-Score, and it is also better than the LDA model when the medical record similarity threshold is 0.5.
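The text does not define how the PMI-Score is computed. One common definition, assumed here purely for illustration, averages the pointwise mutual information of every pair of a topic's top words, with probabilities estimated from document-level co-occurrence. The function name, the smoothing constant `eps` and the toy corpus are hypothetical.

```python
import math
from itertools import combinations


def pmi_score(top_words, docs, eps=1e-12):
    """Mean PMI over all pairs of top words, using document co-occurrence."""
    doc_sets = [set(doc) for doc in docs]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = [
        math.log((prob(a, b) + eps) / (prob(a) * prob(b) + eps))
        for a, b in combinations(top_words, 2)
    ]
    return sum(scores) / len(scores) if scores else 0.0


docs = [["insulin", "glucose"], ["insulin", "glucose"],
        ["retina", "laser"], ["retina", "laser"]]
score = pmi_score(["insulin", "glucose"], docs)
print(round(score, 4))
```

In this toy corpus the two words always appear together, so their PMI is log 2; words that never co-occur would receive a strongly negative score, which is why a higher average indicates a more coherent topic.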

Through the comparison experiments, the MRS-LDA model performs well on the similarity constraint metric: under the same medical record similarity threshold and the same number of topics, the distance between the topic distributions of similar medical records obtained by the MRS-LDA model is smaller, so the associations between similar medical records are better described. That is, adding the medical record similarity constraint when constructing the objective function makes the topic distributions of similar medical records relatively close, so the method is suitable for medical record topic mining scenarios and has relatively high accuracy.

In a second aspect, an embodiment of the present invention provides an apparatus for acquiring a medical record topic in consideration of similarity constraints, and referring to fig. 8, the apparatus includes:

a medical record set obtaining module 801, configured to calculate similarity between any two medical record documents in an initial medical record, and obtain a similarity constraint medical record set formed by a plurality of medical record documents of which the similarity is greater than or equal to a similarity threshold;

and a topic distribution derivation module 802, configured to sequentially input each medical record document in the similarity-constrained medical record set into a preset LDA model, and derive document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.

Optionally, referring to fig. 9, the medical record collection acquiring module 801 includes:

a weight value obtaining unit 901, configured to obtain a plurality of similarity calculation factors of a medical record and a weight value of each similarity calculation factor;

a factor data calculating unit 902, configured to calculate a numerical value of each similarity calculation factor of any two medical record documents respectively;

and a similarity calculation unit 903, configured to calculate the similarity between any two medical record documents according to the numerical value of each similarity calculation factor and the weight value of each similarity calculation factor.
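The three units above can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed formula: the weight values, the 0/1 encoding of each factor distance, the ten-year age groups, and the combination rule (one minus the weighted sum of distances) are all hypothetical, as are the function names. Only the factor names (gender, age group, diagnosis result) come from the document.

```python
from itertools import combinations

# Hypothetical weights (unit 901); the source does not give their values.
WEIGHTS = {"gender": 0.2, "age_group": 0.3, "diagnosis": 0.5}

def factor_distances(a, b):
    """Unit 902: a 0/1 distance per factor (0 = same, 1 = different)."""
    return {
        "gender": 0.0 if a["gender"] == b["gender"] else 1.0,
        # Assumed ten-year age groups.
        "age_group": 0.0 if a["age"] // 10 == b["age"] // 10 else 1.0,
        "diagnosis": 0.0 if a["diagnosis"] == b["diagnosis"] else 1.0,
    }

def similarity(a, b, weights=WEIGHTS):
    """Unit 903: similarity as one minus the weighted sum of factor distances."""
    d = factor_distances(a, b)
    return 1.0 - sum(weights[name] * d[name] for name in weights)

def similarity_constrained_pairs(records, threshold):
    """Index pairs whose similarity is greater than or equal to the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]

# Hypothetical records used only to exercise the sketch.
records = [
    {"gender": "F", "age": 63, "diagnosis": "hypertension"},
    {"gender": "F", "age": 68, "diagnosis": "hypertension"},
    {"gender": "M", "age": 35, "diagnosis": "asthma"},
]
print(similarity_constrained_pairs(records, threshold=0.5))
```

Keeping the weights in a single mapping makes it straightforward to add further similarity calculation factors without changing the thresholding step.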

Optionally, referring to fig. 10, the topic distribution derivation module 802 includes:

a topic numbering unit 1001, configured to randomly assign a topic number z to each word in each medical record document in the similarity constraint medical record set;

a topic iteration unit 1002, configured to rescan the similarity-constrained medical record set according to the similarity constraint set

resampling the topics so that the new topics satisfy Gibbs sampling convergence;

and a topic distribution calculation unit 1003, configured to count a topic-word co-occurrence frequency matrix in the corpus to obtain a document-topic distribution and a topic-word distribution.
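The three units can be pictured with the following Python sketch of collapsed Gibbs sampling. Note the substitution: because the resampling formula of the preset LDA model is not reproduced in this text, the sketch uses the conditional of standard LDA, without the similarity constraint term. It also runs a fixed number of sweeps instead of testing for convergence. The function name `gibbs_lda` and all parameter defaults are hypothetical.

```python
import random

def gibbs_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=50, seed=0):
    """Standard collapsed Gibbs sampling for LDA (no similarity constraint).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # document-topic counts
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # topic-word co-occurrence counts
    n_k = [0] * num_topics                                 # words per topic
    z = []

    def update(d, w, k, delta):
        n_dk[d][k] += delta
        n_kw[k][w] += delta
        n_k[k] += delta

    # Unit 1001: randomly assign a topic number z to each word.
    for d, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    # Unit 1002: rescan the corpus and resample the topic of each word.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                z[d][i] = rng.choices(range(num_topics), weights=weights)[0]
                update(d, w, z[d][i], +1)

    # Unit 1003: derive both distributions from the count matrices.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + num_topics * alpha) for k in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + vocab_size * beta) for w in range(vocab_size)]
        for k in range(num_topics)
    ]
    return theta, phi

# Toy corpus with hypothetical word ids.
theta, phi = gibbs_lda([[0, 1, 0, 1], [2, 3, 2, 3]], num_topics=2, vocab_size=4)
```

A similarity-constrained variant would change only the `weights` expression in the resampling step; the initialization and the final counting would stay the same.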

Optionally, the preset LDA model includes:

It should be noted that the medical record topic acquisition device considering similarity constraints provided in the embodiment of the present invention corresponds one-to-one with the above method. The implementation details of the method also apply to the device, so the device is not described in further detail here.

In the description of the present invention, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as being included in the following claims and description.

Claims

1. A medical record topic acquisition method considering similarity constraint is characterized by comprising the following steps:

sequentially inputting the medical record documents in the similarity constraint medical record set into a preset LDA model, and deriving document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;

deriving document-topic distribution and topic-word distribution of each medical record document through the preset LDA model comprises:

rescanning the similarity-constrained medical record set according to each word

resampling the topics so that the new topics satisfy Gibbs sampling convergence; wherein

represents the probability that the word

is assigned to topic k;

counting a topic-word co-occurrence frequency matrix in a corpus to obtain document-topic distribution and topic-word distribution;

the preset LDA model comprises:

where $\theta_r^m = \{\theta_{m,1}, \theta_{m,2}, \ldots, \theta_{m,L_m}\}$ indicates that each medical record document includes $L_m$ course records; $\theta_{m,L_m}$ denotes the topic of the $L_m$-th course record; $d(\theta_{m,L_m}, \theta_{n,L_n})$ denotes the Euclidean distance between the topic vectors of two course records; $\theta_r^n = \{\theta_{n,1}, \theta_{n,2}, \ldots, \theta_{n,L_n}\}$ indicates that each medical record document includes $L_n$ course records; $\theta_{n,L_n}$ denotes the topic of the $L_n$-th course record;

representing the number of times word i is assigned topic k in the similarity constraint medical record set; (α, β) being the prior Dirichlet distribution parameters; V being the total number of words in the medical record set; in the preset LDA model,

$\mu_{mk}$ represents the probability that the medical record document m belongs to the topic k.

2. The medical record topic acquisition method of claim 1, wherein calculating the similarity between any two medical record documents in the initial medical record comprises:

3. The medical record topic acquisition method as recited in claim 2, wherein the similarity calculation factors comprise: the gender attribute distance, the age group distance, and the diagnosis result distance.

4. A medical record topic acquisition device considering similarity constraint, the device comprising:

a topic distribution derivation module, configured to sequentially input the medical record documents in the similarity constraint medical record set into a preset LDA model and derive document-topic distribution and topic-word distribution of the medical record documents through the preset LDA model;

the topic distribution derivation module comprises:

resampling the topics so that the new topics satisfy Gibbs sampling convergence; wherein

represents the probability that the word

is assigned to topic k;

the topic distribution calculating unit is used for counting topic-word co-occurrence frequency matrixes in the corpus to obtain document-topic distribution and topic-word distribution;

the preset LDA model comprises:

5. The medical record topic acquisition device of claim 4, wherein the medical record collection acquisition module comprises:

6. The medical record topic acquisition device as claimed in claim 5, wherein the similarity calculation factors comprise: the gender attribute distance, the age group distance, and the diagnosis result distance.