CN113161001A - Process path mining method based on improved LDA - Google Patents
- Publication number
- CN113161001A (application CN202110515351.6A)
- Authority
- CN
- China
- Prior art keywords
- diagnosis
- topic
- treatment
- word
- distribution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/20—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a process path mining method based on improved LDA and relates to the technical field of clinical path mining. The method analyzes the medical advice logs in electronic medical records, constructs a medical dictionary to filter useless medical advice items from those logs, models the medical data with the LDA topic model so that the logs are mapped to a low-dimensional topic space, and then discovers the temporal relations among topic features through process mining. The mined medical process model is therefore easier to understand, and the medical interpretability of the result is improved. The results obtained by the invention were compared with the national standard clinical pathway and are basically consistent with it.
Description
Technical Field
The invention relates to the technical field of clinical path mining, in particular to a process path mining method based on improved LDA.
Background
With the progress of society, medical expenses keep rising. To curb this trend and improve the utilization of health resources, the state has set a series of clinical treatment standards. These standards have brought clear improvements in reducing medical expenses, shortening treatment days, and reasonably regulating the behavior of medical staff, while still achieving the expected treatment effect. This standardized treatment pattern is called the clinical pathway.
A clinical pathway is a programmed, standardized diagnosis and treatment plan for a specific disease or operation, with a strict working sequence and accurate time requirements, aimed at achieving the desired treatment effect and controlling cost. Adopting the correct treatment means at the correct time is the core of a clinical pathway, which generally divides the diagnosis and treatment of a disease into several stages and specifies the diagnosis and treatment items required in each stage. The clinical pathway is defined in contrast to the traditional pathway, that is, the individual pathway of each doctor: different regions, hospitals, treatment groups, or individual doctors may adopt different treatment regimens for the same disease. Adopting a clinical pathway avoids this variation, reduces randomness, and makes cost, prognosis, and similar outcomes easier to evaluate. Extensive clinical practice has shown that clinical pathways can standardize clinical diagnosis and treatment activities, control cost, strengthen medical process management, and improve medical quality and efficiency.
At present, the National Health Commission is gradually implementing a clinical pathway management mode, but promotion has not been smooth: few hospitals implement clinical pathways, and practical application often runs into problems such as lack of reliability and a small number of covered disease types. Specifically:
(1) Lack of reliability. Most clinical pathways implemented in hospitals are based on national standards and are drawn up by the relevant personnel through discussion based on past experience. Pathways formulated from experience seriously lack data support and experimental simulation, which increases the variation rate of the clinical pathway, lowers its utilization rate, and makes it unsuitable for developing personalized clinical pathways;
(2) Most hospitals pay insufficient attention, the scope of adoption is small, and few disease types are covered. Patients are mainly entered into clinical pathways for surgical diseases treated by surgery; the disease types are few and relatively uniform, and reports of clinical pathway application in chronic diseases are rare and limited to single disease types;
(3) Existing clinical pathways are slow to update, are not updated promptly as the patient's condition changes, and extend poorly. Because developing a clinical pathway manually is time-consuming and labor-intensive, a developed pathway stays static for a long time. Most hospitals design a clinical pathway as a single start-to-finish treatment scheme based on the patient's condition, and such a pathway is hard to update in real time as the condition changes during implementation. Furthermore, tens of thousands of diseases are currently known, and managing them through clinical pathways, with complications and the like taken into account, would require a large investment;
(4) Difficult to put into practice. The diagnosis and treatment item categories specified by a clinical pathway form are generally implemented and deployed differently in different places and hospitals, so considerable local effort is needed for local mapping work. In addition, because the individual characteristics of different patients place different demands on the clinical pathway, manually formulated pathways show an extremely high variation rate in practice (the required diagnosis and treatment items do not match the established pathway), and it is difficult for them to provide suitable guidance for diagnosis and treatment planning.
Problems such as slow updating, poor extensibility, and lack of reliability can be addressed by introducing an automatic method for formulating clinical pathways. For the difficulty of practice, diagnosis and treatment schemes that are practical and better match the current patient can be found in historical data and used as reference and guidance. Starting from these two points, and given the rapid accumulation of medical data brought by the development of medical informatization in recent years, data-driven clinical path mining is receiving more and more attention.
Clinical pathways derive from the practice of clinical diagnosis and treatment activities and are the common treatment patterns of disease types hidden in the mass data of hospital information systems. With the continuous improvement of medical informatization, a large amount of historical patient diagnosis and treatment data is recorded in various medical information systems. Analyzing this data with data mining techniques makes it possible to formulate scientific, reasonable clinical pathways that meet diagnosis and treatment standards and to give doctors sound decision support and recommendations when they formulate clinical pathways, which is significant for formulating clinical pathways scientifically.
Clinical path mining aims to discover, from diagnosis and treatment data, a diagnosis and treatment process model that is general across many patients and carries temporal order. It focuses on finding the diagnosis and treatment paths actually executed in historical data. These more objective and specific execution paths can effectively assist the design and redesign of clinical pathways and provide a reference for those who formulate them. The method can also be used to examine how clinical pathway management is actually implemented in particular regions and hospitals, helping clinical pathway managers identify differences.
The LDA model establishes a three-layer document-topic-word Bayesian network and is a generative model of document topics. In a generative model, each word of an article is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability".
Current clinical path mining research applies a classical process mining algorithm directly to medical data. Because the event granularity is too fine, the resulting medical process is spaghetti-like and is hard to understand and use. To obtain a more understandable and compact medical process model, the dimensionality of the medical data needs to be reduced, and the medical process needs to be abstracted and generalized. Some work has applied topic models to medical data, but the results lose the temporal order between the stages of the clinical pathway.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a process path mining method based on improved LDA.
The technical scheme of the invention is as follows:
a process path mining method based on improved LDA comprises the following steps:
step 1: filtering abnormal medical record samples in the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical advice items by using the stop word list, and uniformly mapping diagnosis and treatment items with the same meaning by using the medical dictionary;
the data in the data set are medical advice data which specifically comprise patient IDs, medical advice activity names, medical advice types and occurrence time;
the meaningless medical advice item is a medical advice item irrelevant to treatment;
step 1.1: denoising the medical advice data; setting a noise threshold, filtering out abnormal data samples, adding meaningless diagnosis and treatment items to the stop word list, and filtering the data with the stop word list;
step 1.2: carrying out unified mapping of diagnosis and treatment items on the text data; constructing a medical dictionary, uniformly mapping diagnosis and treatment items with the same meaning, and unifying all written variants during processing;
the diagnosis and treatment items are uniformly mapped by combining similarity calculation with regular-expression matching; before the similarity calculation, regular-expression matching is performed to remove interfering suffixes from the medical advice data, and the cosine similarity is then calculated;
the texts are converted into the corresponding word frequency vectors a and b, and the cosine of the angle between the two vectors is calculated; the cosine similarity formula is $s(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$, where $s(a, b)$ denotes the cosine similarity between a and b; if the calculation result is 1, the result is consistent with reality;
the LDA topic model comprises two core model parameters: a topic distribution for each document and a word distribution for each topic; the LDA topic model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so each word in a document is generated by first selecting a topic according to the document's topic probabilities and then selecting a word according to that topic's word probabilities; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the diagnosis and treatment day-topic distribution and the topic-diagnosis and treatment item distribution are calculated respectively: the topic distribution $\theta_i$ of diagnosis and treatment day $i$ is sampled from the Dirichlet distribution with parameter $\alpha$, and the diagnosis and treatment item distribution $\varphi_z$ of topic $z$ is sampled from the Dirichlet distribution with parameter $\beta$;
The distribution of terms in the document collection is quantified using the inverse document frequency (IDF), which is calculated as follows:
$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$
where $idf_i$ is the IDF value of the word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$; if the word does not occur in the corpus the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used as the divisor; the fewer documents contain the word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, indicating that the word $t_i$ has better category discrimination ability;
taking the diagnosis and treatment items as words in the topic model and the diagnosis and treatment days as documents, LDA topic modeling is carried out to obtain the topic-word distribution $\varphi$ in the output (that is, the probability, or weight, with which a keyword belongs to a given topic); the topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$; the word distribution $\varphi$ values under each topic are sorted from small to large, the twenty top-ranked words under each topic are taken, and their weights are recalculated from the topic probability and the IDF value,
where $\varphi_{z,w}$ denotes the probability of the word $w$ occurring in topic $z$, $idf_w$ denotes the IDF value of the keyword $w$ in the data set, and the result is the final weight of the word $w$ in topic $z$;
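The weight recalculation can be illustrated with a short Python sketch. The recalculation formula itself is not reproduced in this text, so the sketch assumes the final weight is the product of the topic probability and the IDF value, and it assumes the top-ranked words are those with the highest probability. The function name `reweight_topic_words` and the sample items and values are hypothetical.

```python
def reweight_topic_words(phi_z, idf, top_n=20):
    """Re-rank the top words of one topic by an IDF-adjusted weight.

    phi_z: {word: probability of the word under topic z}
    idf:   {word: IDF value of the word in the data set}
    Assumption: final weight = phi * idf (the source formula is not available).
    """
    # Keep the top_n words of the topic (highest probability first; assumed).
    top = sorted(phi_z.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    # Down-weight words that occur in most documents (low IDF).
    weighted = {word: prob * idf.get(word, 0.0) for word, prob in top}
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical topic: a general order dominates the raw probabilities.
phi_z = {"common food": 0.30, "mastectomy": 0.20, "drainage": 0.10}
idf = {"common food": 0.05, "mastectomy": 1.2, "drainage": 1.5}
ranked = reweight_topic_words(phi_z, idf)
ranked_words = [word for word, _ in ranked]
```

Under these assumptions a frequent but uninformative item drops to the bottom of the topic, which is the effect the recalculation is meant to achieve.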
step 3.1: generating the topic label of each diagnosis and treatment day;
for a diagnosis and treatment day $d$, related topics are extracted from its topic vector $\theta_d$ according to the selected topic label probability threshold and serve as the topic label representing that day; a topic $k$ that is part of the topic label must satisfy the following constraint:
where $r(k, d)$ denotes the value of topic $k$ in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic label probability threshold, and $K$ is the selected optimal number of topics; the topics of the diagnosis and treatment day that satisfy the constraint are arranged in descending order of probability, and the resulting topic label of diagnosis and treatment day $d$ is recorded as $tl_d$, where $k(j)$ denotes the topic with the $j$-th highest probability, and $TL$ is defined as the set of distinct topic labels.
Each diagnosis and treatment day of one hospitalization of a patient is replaced with its topic label, giving the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ corresponding to that hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization.
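Step 3.1 can be sketched in Python as follows. The constraint formula is not reproduced in this text, so the sketch assumes the simplest reading, namely that topic k is kept when its value r(k, d) is at least the threshold; how the topic number K enters the original constraint is not modeled. The names `topic_label` and `topic_sequence` and the sample vectors are hypothetical.

```python
def topic_label(theta_d, threshold):
    """Return the topic label of one treatment day as a tuple of topic indices.

    Assumption: topic k is kept when theta_d[k] >= threshold.
    Kept topics are ordered by descending probability, so position j holds k(j).
    """
    kept = [(k, p) for k, p in enumerate(theta_d) if p >= threshold]
    kept.sort(key=lambda kp: kp[1], reverse=True)
    return tuple(k for k, _ in kept)


def topic_sequence(day_vectors, threshold):
    """Replace every treatment day of one hospitalization with its topic label."""
    return [topic_label(theta, threshold) for theta in day_vectors]


# Hypothetical hospitalization with three treatment days and K = 3 topics.
stay = [[0.7, 0.2, 0.1], [0.45, 0.05, 0.5], [0.1, 0.1, 0.8]]
sigma = topic_sequence(stay, threshold=0.15)
```

Representing each label as a tuple keeps the probability order inside the label, which the pruning step relies on.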
Step 3.2: pruning low-frequency topic labels;
each diagnosis and treatment day is replaced with a topic label to obtain the topic sequence of each hospitalization, and low-frequency topic labels are then pruned; within a topic label, a topic placed later has a lower probability than the topics placed before it, so the trailing topics of a low-frequency label are deleted one at a time, and after each deletion it is checked whether the pruned label is still low-frequency; concretely, a prefix tree is built over the topic labels in $TL$, a low-frequency threshold is set, each low-frequency label node is merged into its parent node in the prefix tree, and the parent's frequency is updated, until no low-frequency label node remains in the tree.
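The pruning procedure can be sketched in Python without building an explicit tree, because the parent of a label in the prefix tree is simply the label with its last topic removed. This is a minimal sketch under stated assumptions: labels are tuples of topic indices, single-topic labels are left untouched because their only parent is the root, and the names `prune_labels` and `min_freq` are hypothetical.

```python
from collections import Counter


def prune_labels(label_counts, min_freq):
    """Merge low-frequency topic labels into their prefix-tree parents.

    label_counts: {label tuple: frequency}
    Returns (pruned counts, mapping from original label to pruned label).
    Assumption: labels of length 1 are never merged into the root.
    """
    counts = Counter(label_counts)
    mapping = {label: label for label in counts}
    while True:
        low = [l for l, c in counts.items() if c < min_freq and len(l) > 1]
        if not low:
            break
        label = max(low, key=len)      # deepest node first
        parent = label[:-1]            # drop the least probable topic
        counts[parent] += counts.pop(label)
        for source, target in mapping.items():
            if target == label:
                mapping[source] = parent
    return counts, mapping


# Hypothetical label frequencies.
freq = {(1, 2): 10, (1, 2, 3): 2, (1,): 8, (4, 5): 1, (4,): 1}
pruned, mapping = prune_labels(freq, min_freq=5)
```

The returned mapping can then be used to rewrite every topic sequence with the pruned labels.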
Step 3.3: clustering the topic sequences;
the topic sequences are clustered with the K-means algorithm, and the distance between topic sequences is measured by the edit distance (ED); the edit distance is the minimum number of operations required to convert one string into another, where the allowed edit operations are insertion, deletion, and replacement.
Step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
Step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to each of the k cluster centers and assigning it to the class of the nearest cluster center;
step 3.3.3: for each class $a_j$, recalculating its cluster center;
step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached;
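Steps 3.3.1 to 3.3.4 can be sketched in Python as follows. The text does not say how a cluster center is recalculated for sequences, for which an arithmetic mean is undefined, so this sketch assumes the center is the member with the smallest total edit distance to the other members (a medoid). The stopping condition is assumed to be an unchanged assignment or an iteration limit. All function and variable names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with insertion, deletion, and replacement."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # replacement
        prev = curr
    return prev[-1]


def cluster_sequences(sequences, initial_centers, max_iter=20):
    """K-means-style clustering of topic sequences under the edit distance."""
    centers = list(initial_centers)
    assignment = None
    for _ in range(max_iter):
        # Step 3.3.2: assign each sequence to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: edit_distance(seq, centers[c]))
            for seq in sequences
        ]
        if new_assignment == assignment:   # Step 3.3.4: assumed stop condition
            break
        assignment = new_assignment
        # Step 3.3.3: recompute each center as the medoid of its class (assumed).
        for c in range(len(centers)):
            members = [s for s, a in zip(sequences, assignment) if a == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda m: sum(edit_distance(m, o) for o in members),
                )
    return assignment, centers


# Hypothetical topic sequences; letters stand for topic labels.
seqs = [("A", "B", "C"), ("A", "B"), ("A", "B", "C", "D"),
        ("X", "Y"), ("X", "Y", "Z"), ("X",)]
assignment, centers = cluster_sequences(seqs, [seqs[0], seqs[3]])
```

Because the distance works on any sequence of hashable items, the same code applies whether the elements are characters or topic-label tuples.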
step 4: carrying out process mining on each constructed set of topic sequences with a mining algorithm based on an inter-activity dependency graph, where the topic labels serve as the nodes of the graph model and the temporal relations between topic labels serve as its directed edges, finally obtaining the diagnosis and treatment process model of each topic sequence set.
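The text does not name the specific dependency-graph mining algorithm, so the Python sketch below only illustrates the underlying idea with a directly-follows graph: every pair of consecutive topic labels in a sequence contributes to a directed edge, and infrequent edges are dropped. The function name `directly_follows_graph`, the `min_count` parameter, and the sample sequences are hypothetical.

```python
from collections import Counter


def directly_follows_graph(sequences, min_count=1):
    """Count how often one topic label is directly followed by another.

    Returns {(source label, target label): count} for edges seen at least
    min_count times. Nodes are topic labels; edges are temporal relations.
    """
    edges = Counter()
    for seq in sequences:
        for source, target in zip(seq, seq[1:]):
            edges[(source, target)] += 1
    return {edge: n for edge, n in edges.items() if n >= min_count}


# Hypothetical cluster of topic sequences (labels are tuples of topic indices).
cluster = [
    [(0,), (1,), (2,)],
    [(0,), (1,), (2,)],
    [(0,), (2,)],
]
graph = directly_follows_graph(cluster, min_count=2)
```

Applying the function to each topic sequence set separately yields one graph per cluster, matching the per-set process models described in step 4.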
The beneficial effects produced by adopting the technical method are as follows:
the invention provides a process path mining method based on improved LDA, which mines clinical pathways from high-dimensional, sparse medical data by combining an LDA topic model with process mining.
Drawings
FIG. 1 is an overall flow chart in an embodiment of the present invention;
FIG. 2 is a mapping relationship diagram of LDA and the medical field in an embodiment of the present invention;
FIG. 3 is a schematic diagram of a subject label pruning in an embodiment of the present invention;
FIG. 4 is a graph of breast cancer surgery data in accordance with an embodiment of the present invention;
FIG. 5 is a schematic diagram of a first set of subject sequences according to an embodiment of the present invention;
FIG. 6 is a diagram illustrating a second set of subject sequences according to an embodiment of the present invention;
FIG. 7 is a sequence diagram of a third group of topics in the embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
A method for mining a process path based on improved LDA, as shown in fig. 1, comprising the steps of:
step 1: filtering abnormal medical record samples in the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical advice items by using the stop word list, and uniformly mapping diagnosis and treatment items with the same meaning by using the medical dictionary;
the data in the data set are medical advice data which specifically comprise patient IDs, medical advice activity names, medical advice types and occurrence time;
the meaningless medical advice item is a medical advice item irrelevant to treatment;
step 1.1: denoising the medical advice data;
medical data is highly noisy; for example, sample data with only 2 diagnosis and treatment days contributes little to clinical path mining, and the data contains some useless diagnosis and treatment items that may interfere with the experiments. The method sets a noise threshold, filters out abnormal data samples, adds meaningless diagnosis and treatment items to a stop word list, and filters the data with the stop word list;
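The denoising step can be sketched in Python as follows. This is a minimal illustration, assuming each medical advice record is a tuple of patient ID, activity name, type, and occurrence date as listed in step 1, and assuming one hospitalization per patient ID. The names `denoise`, `min_days`, and `stop_words`, and all sample records, are hypothetical.

```python
from collections import defaultdict
from datetime import date


def denoise(records, min_days, stop_words):
    """Drop short stays, then drop stop-listed items.

    records: iterable of (patient_id, activity_name, order_type, day)
    min_days: noise threshold on the number of distinct treatment days
    stop_words: set of activity names that are irrelevant to treatment
    """
    days_by_patient = defaultdict(set)
    for patient_id, _name, _kind, day in records:
        days_by_patient[patient_id].add(day)
    return [
        rec for rec in records
        if len(days_by_patient[rec[0]]) >= min_days and rec[1] not in stop_words
    ]


# Hypothetical records: p2 has only 2 treatment days and is filtered out.
records = [
    ("p1", "blood routine", "examination", date(2020, 1, 1)),
    ("p1", "bed fee", "other", date(2020, 1, 1)),
    ("p1", "0.9% sodium chloride", "drug", date(2020, 1, 2)),
    ("p1", "dressing change", "treatment", date(2020, 1, 3)),
    ("p2", "blood routine", "examination", date(2020, 2, 1)),
    ("p2", "chest X-ray", "examination", date(2020, 2, 2)),
]
cleaned = denoise(records, min_days=3, stop_words={"bed fee"})
```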
step 1.2: carrying out unified mapping of diagnosis and treatment items on the text data;
Medical advice is filled in by doctors, and different people have different habits, so the content of the text data is not uniform and it is common for several wordings to share one meaning. For example, "0.9% sodium chloride injection" may be written as "0.9% NACL", "sodium chloride", "0.9% sodium chloride", and so on, and for some injections some doctors also record the dosage and the injection mode in detail. A medical dictionary can be constructed during data preprocessing so that diagnosis and treatment items with the same meaning are uniformly mapped and all written variants are unified during processing;
to address these problems in the medical advice data, the method combines similarity calculation with regular-expression matching to uniformly map diagnosis and treatment items. Because the medical advice data used by the invention consists mostly of phrases and nouns, cosine similarity is chosen as the similarity algorithm. If only similarity calculation is used, the resulting medical dictionary may be incomplete because of interfering suffixes (such as injection doses) in the medical advice items. For example, the cosine similarity between "0.9% sodium chloride" and "0.9% sodium chloride 1000 ml" is calculated to be 0.7559; with a similarity threshold of 0.8, the two items would not be added to the dictionary for unified mapping, which does not match reality. Therefore, regular-expression matching is performed before the similarity calculation to remove interfering suffixes from the medical advice data; for "0.9% sodium chloride" and "0.9% sodium chloride 1000ml" the matching rule is set to \d+ml, and the cosine similarity is calculated after the regular-expression matching;
the texts are converted into the corresponding word frequency vectors a and b, and the cosine of the angle between the two vectors is calculated; the cosine similarity formula is $s(a, b) = \frac{a \cdot b}{\|a\|\,\|b\|}$, where $s(a, b)$ denotes the cosine similarity between a and b; for the two items above, the result after regular-expression matching is 1, which is consistent with reality;
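The combination of regular-expression matching and cosine similarity can be sketched in Python as follows. The sketch assumes whitespace tokenization of the English rendering, so the raw similarity it computes differs from the 0.7559 reported above, which presumably depends on how the original text was segmented. The rule `\d+\s*ml` extends the `\d+ml` rule with optional whitespace for the same reason. The function names are hypothetical.

```python
import math
import re
from collections import Counter

# Assumed variant of the \d+ml rule that also tolerates a space before "ml".
SUFFIX_RULE = re.compile(r"\d+\s*ml")


def strip_suffix(item):
    """Remove dose suffixes such as '1000 ml' before comparing items."""
    return SUFFIX_RULE.sub("", item).strip()


def cosine_similarity(text_a, text_b):
    """Cosine similarity of word frequency vectors (whitespace tokens; assumed)."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


raw = cosine_similarity("0.9% sodium chloride", "0.9% sodium chloride 1000 ml")
cleaned = cosine_similarity(
    strip_suffix("0.9% sodium chloride"),
    strip_suffix("0.9% sodium chloride 1000 ml"),
)
```

With a threshold of 0.8, the raw pair would be rejected while the cleaned pair is accepted, so the two written variants can be mapped to one dictionary entry.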
LDA is one of the most popular statistical topic modeling techniques. It models the generation process of each word in each document of a text data set. The LDA topic model comprises two core model parameters: a topic distribution for each document and a word distribution for each topic; the LDA topic model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so each word in a document is generated by first selecting a topic according to the document's topic probabilities and then selecting a word according to that topic's word probabilities; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the diagnosis and treatment day-topic distribution and the topic-diagnosis and treatment item distribution are calculated respectively: the topic distribution $\theta_i$ of diagnosis and treatment day $i$ is sampled from the Dirichlet distribution with parameter $\alpha$, and the diagnosis and treatment item distribution $\varphi_z$ of topic $z$ is sampled from the Dirichlet distribution with parameter $\beta$. The mapping relationship between LDA and the medical field is shown in fig. 2.
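The generative process described above can be illustrated with a small Python simulation that uses only the standard library. It is a sketch of the forward (generating) direction only, not of the inference used to fit the model; the symmetric Dirichlet parameters, the item names, and all function names are assumptions made for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_days(items, num_topics, num_days, items_per_day, alpha, beta, rng):
    """Generate treatment days (documents) made of treatment items (words)."""
    # phi[z]: topic -> treatment item distribution, drawn from Dirichlet(beta).
    phi = [sample_dirichlet([beta] * len(items), rng) for _ in range(num_topics)]
    thetas, days = [], []
    for _ in range(num_days):
        # theta: treatment day -> topic distribution, drawn from Dirichlet(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        day = []
        for _ in range(items_per_day):
            z = rng.choices(range(num_topics), weights=theta)[0]  # pick a topic
            day.append(rng.choices(items, weights=phi[z])[0])     # pick an item
        thetas.append(theta)
        days.append(day)
    return thetas, phi, days


items = ["blood routine", "chest X-ray", "mastectomy", "drainage", "dressing change"]
rng = random.Random(0)
thetas, phi, days = generate_days(items, num_topics=3, num_days=4,
                                  items_per_day=6, alpha=0.5, beta=0.5, rng=rng)
```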
The traditional LDA model uses a bag-of-words model for text modeling, which has a serious problem: common words tend to have very high frequencies while specialized terms have very low frequencies, so topics are dominated by high-frequency words. Table 1 shows the result of modeling clinical data of breast cancer surgery with the traditional LDA model (only the top 4 keywords under each topic are shown). As the table shows, "common food" is a general medical order and a high-frequency word in the whole corpus, so it ranks near the top of every topic-word distribution for breast cancer surgery and appears strongly associated with every topic, which does not reflect reality.
TABLE 1 LDA modeling of surgical data
The distribution of terms across a document collection is quantified using the Inverse Document Frequency (IDF), a measure of the general importance of a term. The IDF of a term is calculated as follows:
$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
where $idf_i$ is the IDF value of the word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$. If the word does not occur in the corpus, this count is zero and the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used as the denominator. The fewer documents contain the word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, which indicates that the word $t_i$ has better category discrimination ability;
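As an illustration of the smoothed IDF just described, the following sketch treats each treatment day as a document and each order item as a word. The natural logarithm, the function name `inverse_document_frequency`, and the sample items are assumptions made for the example.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: idf} with idf = log(|D| / (1 + document frequency))."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it repeats.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


# Hypothetical treatment days, each a list of order items.
days = [
    ["common food", "skin preparation"],
    ["common food", "dressing change"],
    ["common food", "suture removal"],
    ["common food", "dressing change"],
]
idf = inverse_document_frequency(days)
```

Here "common food" occurs on every day and receives the lowest IDF, while an item that occurs on a single day receives the highest. With the added 1 in the denominator, an item present in every document gets a slightly negative value.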
LDA topic modeling is carried out with diagnosis and treatment items as the words and diagnosis and treatment days as the documents of the topic model, which yields the topic-word distribution φ in the output (i.e., the probability, or weight, with which a keyword belongs to a topic). The topic-word distribution φ is generated from the Dirichlet distribution with parameter β. The words under each topic are ranked by their φ values, and the top twenty words under each topic are taken for weight recalculation. The calculation formula is as follows
where the formula combines the probability of the word w occurring in topic z with $idf_w$, the IDF value of the keyword w in the data set, to give the final weight of the word w in topic z;
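The reweighting step can be sketched as follows. The formula itself is not reproduced in the text above, so the sketch assumes the simplest reading, a product of the topic-word probability and the IDF value; the patent's actual formula may differ. The function name `reweight_topic_words` and all numbers are hypothetical.

```python
def reweight_topic_words(phi_z, idf, top_n=20):
    """Rescale the top_n words of one topic by their IDF values.

    ASSUMPTION: final weight = phi * idf. The source's formula is not
    available here, so this product is only an illustrative stand-in.
    """
    # Keep the top_n words with the highest topic-word probability.
    top = sorted(phi_z.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    weighted = {word: prob * idf.get(word, 0.0) for word, prob in top}
    # Return the words re-ranked by their final weight.
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical topic-word probabilities and IDF values.
phi_topic0 = {
    "common food": 0.30,
    "skin preparation": 0.25,
    "fasting before surgery": 0.20,
}
idf_values = {
    "common food": 0.05,
    "skin preparation": 1.20,
    "fasting before surgery": 1.60,
}
ranking = reweight_topic_words(phi_topic0, idf_values)
```

Under this assumption, the high-frequency item "common food" drops from first to last place in the topic, which is the effect the improved model is meant to achieve.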
With the improved LDA algorithm, the complex and varied medical orders can be aggregated into a number of topics, and each diagnosis and treatment day can be represented as a topic distribution that gives the probability of the day belonging to each topic. The diagnosis and treatment log of a patient (one hospitalization record) is thereby converted into a sequence of topic vectors. To obtain a clearer and more easily understood clinical path model, the patent further constructs topic sequences to replace each diagnosis and treatment day. The construction of topic sequences consists mainly of the following parts:
Step 3.1: generating diagnosis and treatment day topic labels;
For a diagnosis and treatment day d, the relevant topics are extracted from its topic vector $\theta_d$ according to the selected topic-label probability threshold and used as the topic label representing that day. To be one of the topics in the label, a topic k must satisfy the following constraint:
where r(k, d) denotes the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic-label probability threshold, and K is the selected optimal number of topics. The topics of the day that satisfy the formula are arranged in descending order of probability, and the resulting topic label of diagnosis and treatment day d is recorded as $tl_d = (k(1), k(2), \ldots, k(p))$, where k(j) denotes the topic with the j-th highest probability. TL is defined as the set of distinct topic labels.
Replacing each diagnosis and treatment day of one hospitalization of a patient with its topic label gives the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ of that hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization.
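Step 3.1 can be sketched as below. The exact constraint is not reproduced in the text, so the sketch assumes the plainest form, that topic k is kept when its probability in the day's topic vector is at least the threshold; the source's constraint also involves K and may be different. The names `topic_label` and `topic_sequence` and the sample vectors are hypothetical.

```python
def topic_label(theta_d, threshold):
    """Topic label of one treatment day as a tuple of topic indices.

    ASSUMPTION: a topic is kept when its probability >= threshold.
    Topics are ordered from most to least probable. A day with no topic
    above the threshold yields an empty tuple in this sketch.
    """
    selected = [(k, p) for k, p in enumerate(theta_d) if p >= threshold]
    selected.sort(key=lambda kp: kp[1], reverse=True)
    return tuple(k for k, _ in selected)


def topic_sequence(day_vectors, threshold):
    """Replace every treatment day of one hospitalization by its label."""
    return [topic_label(theta, threshold) for theta in day_vectors]


# Hypothetical topic vectors for a three-day stay with K = 5 topics.
stay = [
    [0.70, 0.20, 0.05, 0.03, 0.02],
    [0.10, 0.45, 0.40, 0.03, 0.02],
    [0.05, 0.05, 0.10, 0.75, 0.05],
]
sigma = topic_sequence(stay, threshold=0.3)
```

With a threshold of 0.3, the second day keeps two topics, ordered by probability, so its label is `(1, 2)`.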
Step 3.2: pruning low-frequency topic labels;
After each diagnosis and treatment day has been replaced with a topic label, the topic sequence of each hospitalization is obtained, and in the subsequent work process mining is applied to these sequences to obtain the final clinical path model. However, some of the generated topic labels are low-frequency labels that represent the characteristics of only a few treatment days; if they are left unprocessed, the resulting clinical path model becomes complicated. Since the aim of clinical path mining in this patent is to mine the treatment process followed by most cases, low-frequency topic labels need to be pruned so that they do not affect the final mining result. From the way topic labels are generated, a topic placed later in a label has a lower probability than the topics before it and is comparatively less important for that treatment day. The trailing topics of a low-frequency label can therefore be deleted step by step, checking after each deletion whether the pruned label is still low-frequency. The method borrows the concept of the prefix tree: a prefix tree is built over the topic labels in TL, a threshold for low-frequency labels is set, and each low-frequency label node is merged into its parent node, whose frequency is updated accordingly, until no low-frequency label node remains in the tree. A pruning example for the topic labels {"0": 3, "0,1": 2, "0,1,2,3": 1, "0,1,2,4": 1} is shown in fig. 3.
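The pruning idea can be sketched without an explicit tree structure by treating each label as a tuple and its parent as the same tuple minus the last topic. This is a simplified reading: labels are processed from the deepest level upward, single-topic labels are left untouched because they have no shorter prefix, and the threshold value is hypothetical. How the figure in the source counts node frequencies is not known here.

```python
def prune_low_frequency_labels(label_counts, min_count):
    """Merge labels rarer than min_count into their parent prefix.

    Labels are tuples of topic indices. Levels are processed from the
    deepest upward, so counts merged into a parent are already in place
    when that parent is checked.
    """
    counts = dict(label_counts)
    max_depth = max((len(label) for label in counts), default=0)
    for depth in range(max_depth, 1, -1):
        # Snapshot the labels at this depth before modifying the dict.
        for label in [l for l in counts if len(l) == depth]:
            if counts[label] < min_count:
                parent = label[:-1]  # drop the least probable (last) topic
                counts[parent] = counts.get(parent, 0) + counts.pop(label)
    return counts


# The label frequencies from the example above, written as tuples.
labels = {(0,): 3, (0, 1): 2, (0, 1, 2, 3): 1, (0, 1, 2, 4): 1}
pruned = prune_low_frequency_labels(labels, min_count=2)
```

With an assumed threshold of 2, the two labels that occur once are both shortened to `(0, 1, 2)`, which then has a combined frequency of 2 and is kept.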
Step 3.3: clustering the topic sequences;
The construction of the topic sequence of each hospitalization case is completed in the work above. To show the characteristics of different diagnosis and treatment patterns more clearly, the topic sequences are clustered, and a process mining method is then used to mine the clinical path of each sequence class. The method clusters the topic sequences with the K-means algorithm and measures the distance between topic sequences by the edit distance (ED), i.e., the minimum number of operations required to convert one string into another, where the allowed edit operations are insertion, deletion, and replacement.
Step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
Step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to the k cluster centers and assigning it to the class of the cluster center with the smallest distance;
Step 3.3.3: for each class $a_j$, recalculating its cluster center;
Step 3.3.4: repeating steps 3.3.2 and 3.3.3 until a set stopping condition (number of iterations, minimum error change, etc.) is reached;
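Steps 3.3.1 to 3.3.4 can be sketched as follows. The text does not say how a cluster center is recalculated for sequences, so the sketch assumes the medoid, i.e., the member with the smallest total edit distance to the other members; this choice, the random initialization, and the function names are assumptions.

```python
import random


def edit_distance(s, t):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cluster_sequences(sequences, k, max_iter=20, seed=0):
    """K-means-style clustering of topic sequences under edit distance."""
    rng = random.Random(seed)
    centers = rng.sample(sequences, k)  # step 3.3.1
    assignment = None
    for _ in range(max_iter):  # step 3.3.4: iteration limit
        # Step 3.3.2: assign every sequence to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: edit_distance(seq, centers[c]))
            for seq in sequences
        ]
        if new_assignment == assignment:  # step 3.3.4: no change, stop
            break
        assignment = new_assignment
        # Step 3.3.3: ASSUMPTION, the new center is the cluster medoid.
        for c in range(k):
            members = [s for s, a in zip(sequences, assignment) if a == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda m: sum(edit_distance(m, o) for o in members),
                )
    return assignment, centers


# Hypothetical topic sequences; each element is one day's topic label.
sequences = [
    ((0,), (1,), (2,)),
    ((0,), (1,), (2,), (2,)),
    ((3,), (4,)),
    ((3,), (4,), (4,)),
]
assignment, centers = cluster_sequences(sequences, k=2)
```

Each day's topic label is treated as one symbol, so two stays that differ by one extra day are at distance 1. In this toy data set the first two and the last two sequences end up in separate clusters.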
Step 4: process mining is carried out on the constructed topic sequence sets with a mining algorithm based on an inter-activity dependency graph, in which the topic labels serve as the nodes of the graph model and the temporal order between topic labels serves as its directed edges, finally yielding the diagnosis and treatment process model of each topic sequence set.
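A minimal sketch of the graph construction is shown below. It only counts how often one topic label directly follows another and adds the admission and discharge nodes, numbered -1 and -2 as in the source's figures. The source does not detail its mining algorithm, so the frequency filter `min_support` and the function name are hypothetical.

```python
from collections import Counter

ADMISSION, DISCHARGE = -1, -2  # node ids used for admission and discharge


def dependency_graph(sequences, min_support=1):
    """Count directly-follows relations between consecutive topic labels.

    Returns {(source, target): count} for edges seen at least
    min_support times. Nodes are topic labels plus the two end nodes.
    """
    edges = Counter()
    for seq in sequences:
        path = [ADMISSION, *seq, DISCHARGE]
        for source, target in zip(path, path[1:]):
            edges[(source, target)] += 1
    return {edge: n for edge, n in edges.items() if n >= min_support}


# Hypothetical topic sequences of one cluster.
cluster = [
    [(0,), (1,), (2,)],
    [(0,), (1,)],
    [(0,), (2,)],
]
graph = dependency_graph(cluster, min_support=2)
```

Edges that occur only once are dropped here, which leaves the dominant path segments of the cluster.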
Hospital information systems are becoming increasingly mature and have accumulated many types of medical data, including charge item data, medical order data, and so on. Medical order data are chosen here because they are more detailed, contain more information, and can be compared with the order information of the NCP.
Because the data in a hospital information system are varied and complex, breast cancer medical records are selected as the experimental data; complications and secondary conditions in these records are not considered, and the final experimental data are obtained through screening, cleaning, and preprocessing.
After preprocessing, each record of the breast cancer hospitalization data contains 4 main attributes: patient ID, order activity name, order type, and time of occurrence, as shown in Table 2. Order activities with the same patient ID and the same time of occurrence constitute one diagnosis and treatment day of the patient, and the diagnosis and treatment days with the same patient ID constitute one hospitalization of the patient.
TABLE 2 sample clinical data
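The grouping of order records into treatment days and hospitalizations can be sketched as below. The sketch assumes that the time of occurrence is a timestamp string and that records are grouped by calendar date; the timestamp format, the order type values, and the sample records are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def build_treatment_days(records):
    """Group order records into {patient_id: {date: [order names]}}.

    Each record is (patient_id, order_name, order_type, occurred_at).
    ASSUMPTION: occurred_at looks like '2020-03-01 08:30' and a treatment
    day is a calendar date. The order type is not used in this sketch.
    """
    stays = defaultdict(lambda: defaultdict(list))
    for patient_id, order_name, _order_type, occurred_at in records:
        day = datetime.strptime(occurred_at, "%Y-%m-%d %H:%M").date()
        stays[patient_id][day].append(order_name)
    # Sort each patient's days chronologically.
    return {pid: dict(sorted(days.items())) for pid, days in stays.items()}


records = [
    ("P001", "skin preparation before surgery", "long-term", "2020-03-01 08:30"),
    ("P001", "fasting before surgery", "temporary", "2020-03-01 20:00"),
    ("P001", "dressing change", "temporary", "2020-03-02 09:15"),
    ("P002", "dressing change", "temporary", "2020-03-01 10:00"),
]
stays = build_treatment_days(records)
```

Each inner list is one treatment day, which plays the role of a document in the topic model, and each order name plays the role of a word.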
The patent uses breast cancer medical data as experimental data and mines the clinical path of the surgery within them, providing a reference for doctors when making treatment plans. Perplexity is calculated on the breast cancer surgery data to select the optimal number of topics. In text analysis, perplexity refers to the uncertainty of a trained model about which topics a document contains; the lower the perplexity, the smaller the uncertainty and the better the final clustering result. The curve of perplexity against the number of topics is shown in fig. 4: perplexity decreases gradually as the number of topics K increases, and its change levels off at K = 5, so the optimal number of topics selected for the surgery data is 5.
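The text does not state how the measure is computed. For reference, the definition commonly used for topic models is given below, where M is the number of documents (treatment days), $N_d$ the number of words in document d, and $p(\mathbf{w}_d)$ the likelihood of document d under the trained model; these symbols are introduced here and are not taken from the source.

```latex
\mathrm{perplexity}(D) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)
```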
The breast cancer surgery data were modeled with the improved LDA topic model, and the results are shown in Table 3. The name of each topic is defined manually according to the keywords under it. For example, the keywords under topic 0 include "food water prohibited before operation", "skin prepared before operation", "chest band prepared before operation", etc., so topic 0 may be defined as "preparation before operation"; topic 1 includes keywords such as "stitches removed (extra large)" and "dressing change (6 pieces or less)", so topic 1 may be labeled "post-operative care".
TABLE 3 Modeling results of the improved LDA on breast cancer surgery data
For the breast cancer surgery data set, 3 topic sequence sets are obtained, containing 379, 136, and 172 medical records respectively. The corresponding clinical pathway model graphs are shown in fig. 5, fig. 6, and fig. 7, where -1 denotes the admission node, -2 denotes the discharge node, and the numbers on the other nodes denote the corresponding topics.
The first set of topic sequences follows the process of admission examination, pre-operative preparation, post-operative care, daily care, and medication, which is roughly the same as the national standard clinical pathway. The second set, by comparison, lacks the topics related to pre-operative preparation and medication, and the third set follows roughly the same process as the first but lacks the final medication.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.
Claims (5)
1. A process path mining method based on improved LDA is characterized by comprising the following steps:
step 1: filtering abnormal medical record samples in the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical advice items by using the stop word list, and uniformly mapping diagnosis and treatment items with the same meaning by using the medical dictionary;
the data in the data set are medical advice data which specifically comprise patient IDs, medical advice activity names, medical advice types and occurrence time;
the meaningless medical advice item is a medical advice item irrelevant to treatment;
step 2, modeling the medical data by utilizing the improved LDA topic model;
step 3, after topic modeling, representing each diagnosis and treatment day as a topic distribution, wherein the distribution represents the probability that the diagnosis and treatment day belongs to each topic, and then converting a diagnosis and treatment log of a patient, namely a hospital admission record, into a topic vector sequence; processing the theme vector sequence and constructing the theme sequence;
step 4, carrying out process mining on the constructed topic sequence sets by adopting a mining algorithm based on an inter-activity dependency graph, wherein the topic labels are used as nodes of the graph model and the temporal order between the topic labels is used as directed edges of the graph model, so that the diagnosis and treatment process model of each topic sequence set is finally obtained.
2. The method for process path mining based on improved LDA as claimed in claim 1, wherein said step 1 specifically comprises the steps of:
step 1.1: denoising the medical order data; setting a noise threshold, filtering abnormal data samples, adding meaningless diagnosis and treatment items to a stop word list, and filtering them out by means of the stop word list;
step 1.2: carrying out unified mapping on diagnosis and treatment items on the text data; constructing a medical dictionary, uniformly mapping diagnosis and treatment items with the same meaning, and unifying all writing situations during processing;
carrying out unified mapping on diagnosis and treatment items by combining similarity calculation with regular-expression matching, wherein regular-expression matching is carried out before the similarity calculation to remove suffix interference items in the medical order data, and cosine similarity calculation is carried out after the regular-expression matching;
the texts are converted into the corresponding word-frequency vectors a and b, and the cosine of the angle between the two vectors is calculated, the cosine similarity being $s(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, wherein s(a, b) denotes the cosine similarity between a and b, and a calculated result of 1 agrees with reality.
3. The method of claim 1, wherein the LDA topic model in step 2 comprises two core model parameters: a topic distribution for each document and a word distribution for each topic; in the LDA topic model, a document is assumed to be composed of different topics with different probabilities, and each topic corresponds to a probability distribution over words, so that each word in the document is generated by first selecting a topic according to the document's topic distribution and then selecting a word according to that topic's word distribution; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the treatment day-treatment topic distribution and the treatment topic-treatment item distribution are calculated respectively, wherein the treatment day-treatment topic distribution $\theta_i$ of treatment day i is sampled from the Dirichlet distribution with parameter α, and the treatment topic-treatment item distribution $\varphi_z$ of treatment topic z is sampled from the Dirichlet distribution with parameter β;
The distribution of terms across the document collection is quantified using the Inverse Document Frequency (IDF), which is calculated as follows:
$$idf_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
wherein $idf_i$ is the IDF value of the word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$; if the word does not occur in the corpus, this count is zero and the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is used as the denominator; the fewer documents contain the word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, which indicates that the word $t_i$ has better category discrimination ability;
taking the diagnosis and treatment items as the words and the diagnosis and treatment days as the documents of the topic model, LDA topic modeling is carried out, thereby obtaining the topic-word distribution φ in the output (i.e., the probability, or weight, with which a keyword belongs to a topic), the topic-word distribution φ being generated from the Dirichlet distribution with parameter β; the words under each topic are ranked by their φ values, the top twenty words under each topic are taken for weight recalculation, and the calculation formula is as follows
4. The method of claim 1, wherein the step 3 specifically comprises the following steps:
step 3.1: generating diagnosis and treatment day topic labels;
for a diagnosis and treatment day d, extracting the relevant topics from its topic vector $\theta_d$ according to the selected topic-label probability threshold and using them as the topic label representing that day; to be one of the topics in the label, a topic k needs to satisfy the following constraint:
where r(k, d) denotes the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic-label probability threshold, and K is the selected optimal number of topics; the topics of the day that satisfy the formula are arranged in descending order of probability, and the resulting topic label of diagnosis and treatment day d is recorded as $tl_d = (k(1), k(2), \ldots, k(p))$, where k(j) denotes the topic with the j-th highest probability, and TL is defined as the set of distinct topic labels;
replacing each diagnosis and treatment day of one hospitalization of a patient with its topic label to obtain the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ of that hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization;
step 3.2: pruning low-frequency topic labels;
after each diagnosis and treatment day has been replaced with a topic label to obtain the topic sequence of each hospitalization, low-frequency topic labels are pruned, wherein a topic placed later in a topic label has a lower probability than the topics before it, the trailing topics of a low-frequency label are deleted step by step, and after each deletion it is judged whether the pruned topic label is still low-frequency; a prefix tree is constructed over the topic labels in TL, a threshold for low-frequency labels is set, each low-frequency label node is merged into its parent node, and the frequency of the parent node is updated, until no low-frequency label node remains in the whole tree;
step 3.3: clustering the topic sequences;
the topic sequences are clustered with the K-means algorithm, and the distance between topic sequences is measured by the edit distance (ED), wherein the edit distance is the minimum number of operations required to convert one string into another, and the allowed edit operations are insertion, deletion, and replacement.
5. The method of claim 4, wherein the step 3.3 specifically comprises the steps of:
step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to the k cluster centers and assigning it to the class of the cluster center with the smallest distance;
step 3.3.3: for each class $a_j$, recalculating its cluster center;
step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110515351.6A CN113161001B (en) | 2021-05-12 | 2021-05-12 | Improved LDA-based process path mining method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110515351.6A CN113161001B (en) | 2021-05-12 | 2021-05-12 | Improved LDA-based process path mining method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113161001A true CN113161001A (en) | 2021-07-23 |
CN113161001B CN113161001B (en) | 2023-11-17 |
Family
ID=76874697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110515351.6A Active CN113161001B (en) | 2021-05-12 | 2021-05-12 | Improved LDA-based process path mining method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113161001B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106228023A (en) * | 2016-08-01 | 2016-12-14 | 清华大学 | A kind of clinical path method for digging based on body and topic model |
CN112700878A (en) * | 2020-12-22 | 2021-04-23 | 云南大学 | Clinical path optimization method based on process mining |
Non-Patent Citations (2)
Title |
---|
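Xu Xiao; Jin Tao; Wang Jianmin: "Clinical pathway mining based on an optimized topic model" (基于优化主题模型的临床路径挖掘), Journal of Software (软件学报), no. 11, pages 231 - 239 * |
Li Ruiyi; Lu Faming; Bao Yunxia; Zeng Qingtian; Zhu Guanye: "Clinical pathway mining method based on drug efficacy logs" (基于药物疗效日志的临床路径挖掘方法), Computer Integrated Manufacturing Systems (计算机集成制造系统), no. 04, pages 61 - 71 * |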
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115083616A (en) * | 2022-08-16 | 2022-09-20 | 之江实验室 | Chronic nephropathy subtype mining system based on self-supervision graph clustering |
CN115083616B (en) * | 2022-08-16 | 2022-11-08 | 之江实验室 | Chronic nephropathy subtype mining system based on self-supervision graph clustering |
JP7404581B1 (en) | 2022-08-16 | 2023-12-25 | 之江実験室 | Chronic nephropathy subtype mining system based on self-supervised graph clustering |
CN116303893A (en) * | 2023-02-23 | 2023-06-23 | 哈尔滨工业大学 | Method for classifying anchor image and analyzing key characteristics based on LDA topic model |
CN116303893B (en) * | 2023-02-23 | 2024-01-30 | 哈尔滨工业大学 | Method for classifying anchor image and analyzing key characteristics based on LDA topic model |
CN115879179A (en) * | 2023-02-24 | 2023-03-31 | 忻州师范学院 | Abnormal medical record detection device |
CN116719926A (en) * | 2023-08-10 | 2023-09-08 | 四川大学 | Congenital heart disease report data screening method and system based on intelligent medical treatment |
CN116719926B (en) * | 2023-08-10 | 2023-10-20 | 四川大学 | Congenital heart disease report data screening method and system based on intelligent medical treatment |
Also Published As
Publication number | Publication date |
---|---|
CN113161001B (en) | 2023-11-17 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||