CN113161001A - Process path mining method based on improved LDA - Google Patents

Info

Publication number
CN113161001A
Authority
CN
China
Prior art keywords
diagnosis
topic
treatment
word
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110515351.6A
Other languages
Chinese (zh)
Other versions
CN113161001B (en
Inventor
栗伟
闵新�
叶盼盼
韩瑞奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China filed Critical Northeastern University China
Priority to CN202110515351.6A priority Critical patent/CN113161001B/en
Publication of CN113161001A publication Critical patent/CN113161001A/en
Application granted granted Critical
Publication of CN113161001B publication Critical patent/CN113161001B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N7/00 Computing arrangements based on specific mathematical models
    • G06N7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/20 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for electronic clinical trials or questionnaires
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Public Health (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Primary Health Care (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Physics (AREA)
  • Algebra (AREA)
  • Computational Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Analysis (AREA)
  • Pathology (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention provides a process path mining method based on improved LDA and relates to the technical field of clinical pathway mining. The method analyzes the medical order logs in electronic medical records, constructs a medical dictionary to filter useless medical order items from the logs, models the medical data with an LDA topic model so that the medical logs are mapped into a low-dimensional topic space, and then discovers the temporal relations among the topic features through process mining. The mined medical process model is therefore easier to understand, and the medical interpretability of the results is improved. The results obtained by the invention were compared with the national standard clinical pathway and found to be basically consistent.

Description

Process path mining method based on improved LDA
Technical Field
The invention relates to the technical field of clinical path mining, in particular to a process path mining method based on improved LDA.
Background
As society develops, medical expenses continue to rise. To curb this trend and improve the utilization of health resources, the state has issued a series of clinical treatment standards. These standards have brought considerable improvements in reducing medical expenses, shortening treatment duration, and regulating the conduct of medical staff, while still achieving the expected treatment outcome. This standardized treatment pattern is called the clinical pathway.
A clinical pathway is a programmed, standardized diagnosis and treatment plan with a strict working sequence and precise timing requirements, designed for a specific disease or operation with the goals of achieving the expected treatment outcome and controlling cost. Its core is applying the right treatment at the right time: it generally divides the diagnosis and treatment of a disease into several stages and specifies the diagnosis and treatment items required in each stage. The clinical pathway is defined in contrast to the traditional pathway, that is, each doctor's individual pathway, under which different regions, hospitals, treatment groups, or individual doctors may adopt different treatment plans for the same disease. Adopting a clinical pathway avoids this variation and arbitrariness and makes cost, prognosis, and similar factors easier to evaluate. Extensive clinical practice has shown that clinical pathways can standardize clinical diagnosis and treatment activities, control cost, strengthen management of the medical process, and improve medical quality and efficiency.
At present, the National Health Commission is gradually rolling out a clinical pathway management mode, but the rollout has not been smooth. Few hospitals implement clinical pathways, and practical application often runs into problems such as lack of reliability and a small number of covered diseases, specifically:
(1) Lack of reliability. Most clinical pathways implemented in hospitals are based on national standards and are drawn up by the relevant personnel through discussion based on past experience. Pathways formulated from experience, however, seriously lack data support and experimental simulation, which raises the variation rate of the pathway, lowers its utilization rate, and makes it unsuitable for developing personalized clinical pathways;
(2) Most hospitals pay insufficient attention, the scope of adoption is small, and few diseases are covered. Diseases entered into clinical pathways are mainly surgical diseases treated by operation; the number of diseases is small and the types are relatively uniform. Reports of clinical pathway application in chronic diseases are rare and are limited to a few disease types;
(3) Existing clinical pathways are updated slowly, are not updated in time as the patient's condition changes, and extend poorly. Because developing a clinical pathway manually is time-consuming and labor-intensive, a developed pathway stays static for a long time. When designing clinical pathways, most hospitals directly design a complete start-to-finish treatment plan according to the patient's condition, which is difficult to update in real time as the condition changes during implementation. Furthermore, tens of thousands of diseases are currently known, and managing them all through clinical pathways, with complications and the like taken into account, would require a very large investment;
(4) Difficulty in practice. The categories of diagnosis and treatment items specified in a clinical pathway form are generally implemented and deployed differently in different regions and hospitals, so considerable local effort is needed for local mapping. Meanwhile, because patients' individual characteristics place different demands on the clinical pathway, manually formulated pathways have an extremely high variation rate in practice (the required diagnosis and treatment items do not match the established pathway) and can hardly provide suitable guidance for diagnosis and treatment planning.
The problems of slow updating, poor extensibility, and lack of reliability can be addressed by introducing automatic methods for formulating clinical pathways. The problem of difficulty in practice can be addressed by finding, in historical data, treatment plans that are practical and better matched to the current patient, and using them as reference and guidance. With these two starting points, together with the rapid accumulation of medical data brought by the development of medical informatization in recent years, data-driven clinical pathway mining is receiving increasing attention.
Clinical pathways originate from the practice of clinical diagnosis and treatment; they are the common treatment patterns for a disease hidden in the mass data of hospital information systems. As the level of medical informatization improves, a large amount of historical patient diagnosis and treatment data is recorded in various medical information systems. Analyzing this data with data mining techniques makes it possible to formulate scientific and reasonable clinical pathways that conform to diagnosis and treatment standards and to provide doctors with decision support and recommendations when they formulate clinical pathways.
Clinical pathway mining aims to discover, from diagnosis and treatment data, a diagnosis and treatment process model that is general across many patients and preserves temporal order. It focuses on discovering the pathways actually executed in historical data. These more objective and concrete execution paths can effectively assist the design or redesign of clinical pathways and provide a reference for pathway designers. They can also be used to examine how clinical pathway management is actually carried out in the regions and hospitals that have adopted it, helping pathway managers identify deviations.
The LDA model builds a three-layer document-topic-word Bayesian network and is a generative model of document topics. "Generative" means that each word of a document is regarded as obtained by "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability".
Current clinical pathway mining research applies classical process mining algorithms directly to medical data. Because the event granularity is too fine, the resulting medical process model is spaghetti-like and hard to understand and use. To obtain a more compact and understandable medical process model, the medical data must be reduced in dimension and the medical process abstracted and generalized. Some researchers have applied topic models to medical data, but the results lose the temporal order between the stages of the clinical pathway.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a process path mining method based on improved LDA.
The technical scheme of the invention is as follows:
a process path mining method based on improved LDA comprises the following steps:
step 1: filtering abnormal medical record samples from the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical order items with the stop word list, and uniformly mapping diagnosis and treatment items that have the same meaning with the medical dictionary;
the data in the data set are medical order data, which specifically comprise the patient ID, the medical order activity name, the medical order type, and the occurrence time;
a meaningless medical order item is a medical order item irrelevant to treatment;
step 1.1: denoising the medical order data; setting a noise threshold and filtering abnormal data samples, and adding meaningless diagnosis and treatment items to the stop word list and filtering them out with it;
step 1.2: uniformly mapping the diagnosis and treatment items in the text data; constructing a medical dictionary, uniformly mapping diagnosis and treatment items that have the same meaning, and unifying all written variants during processing;
the diagnosis and treatment items are uniformly mapped by combining similarity calculation with regular expression matching; regular expression matching is performed before the similarity calculation to remove interfering suffixes from the medical order data, and the cosine similarity is calculated afterwards;
the texts are converted into the corresponding word frequency vectors a and b and the cosine of the angle between the two vectors is calculated, the cosine similarity formula being $s(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, where $s(a, b)$ denotes the cosine similarity between a and b; a result of 1 after regular expression matching means the two items are written identically, which is consistent with reality;
step 2: modeling the medical data with the improved LDA topic model;
the LDA topic model has two core model parameters: the topic distribution of each document and the word distribution of each topic; the LDA topic model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so that each word in a document is generated by first selecting a topic according to the document's topic distribution and then selecting a word according to that topic's word distribution; taking diagnosis and treatment items as words and treatment days as documents, the treatment day-treatment topic distribution and the treatment topic-treatment item distribution are calculated respectively: the treatment day-treatment topic distribution $\theta_i$ of treatment day i is sampled from a Dirichlet distribution with parameter $\alpha$, and the treatment topic-treatment item distribution $\varphi_z$ corresponding to treatment topic z is sampled from a Dirichlet distribution with parameter $\beta$;
The distribution of terms in the document collection is quantified with the inverse document frequency (IDF), which is calculated as follows:
$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
where $\mathrm{idf}_i$ is the IDF value of word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $t_i$; if the word does not occur in the corpus the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is used instead; the fewer documents contain word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, indicating that word $t_i$ discriminates better between categories;
taking diagnosis and treatment items as the words and treatment days as the documents of the topic model, LDA topic modeling is carried out, and the values of the topic-word distribution variable $\varphi$ (that is, the probability or weight with which a keyword belongs to a topic) are obtained from the output; the topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$; the words under each topic are ranked by their $\varphi$ values, and the top twenty words under each topic are taken for weight recalculation, with the calculation formula as follows:
[Equation image BDA0003061707540000041 not reproduced: recalculated weight of word w in topic z]
where $\varphi_{z,w}$ denotes the probability of word w occurring under topic z, $\mathrm{idf}_w$ denotes the IDF value of keyword w in the data set, and the result of the formula is the final weight of word w in topic z;
step 3: after topic modeling, each treatment day is represented as a topic distribution giving the probability that the treatment day belongs to each topic, so that a patient's diagnosis and treatment log, that is, one hospitalization record, is converted into a sequence of topic vectors; the topic vector sequence is then processed to construct a topic sequence;
step 3.1: generating topic labels for treatment days;
for a treatment day d, relevant topics are extracted from the corresponding topic vector $\theta_d$ as the topic label representing that day, according to the selected topic-label probability threshold; to be part of the topic label, a topic k must satisfy the following constraint:
[Equation image BDA0003061707540000044 not reproduced: constraint on $r(k, d)$ in terms of $\delta_{tl}$ and K]
where $r(k, d)$ is the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic-label probability threshold, and K is the selected optimal number of topics; the topics of the treatment day that satisfy the constraint are arranged in descending order of probability, and the resulting topic label of treatment day d is denoted $tl_d$, where k(j) denotes the topic with the j-th highest probability, and TL is defined as the set of distinct topic labels.
Replacing each treatment day of one hospitalization of a patient with its topic label gives the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ corresponding to that hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of treatment days of the hospitalization.
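Step 3.1 can be illustrated with a minimal Python sketch. The exact constraint is given only as an equation image, so the simple test `p >= threshold` used here is an assumption (the original constraint also involves the topic number K), and the function names and sample vectors are hypothetical.

```python
def topic_label(theta_d, threshold):
    """Build the topic label of one treatment day from its topic vector.

    Assumption: a topic qualifies when its probability is >= threshold;
    the patent's exact constraint is not reproduced in the text.
    """
    selected = [(k, p) for k, p in enumerate(theta_d) if p >= threshold]
    # Arrange qualifying topics in descending order of probability.
    selected.sort(key=lambda kp: kp[1], reverse=True)
    return tuple(k for k, _ in selected)


def topic_sequence(day_vectors, threshold):
    """Replace every treatment day of one hospitalization with its topic label."""
    return [topic_label(theta, threshold) for theta in day_vectors]


# Hypothetical topic vectors for a three-day hospitalization with K = 3 topics.
stay = [[0.7, 0.2, 0.1], [0.45, 0.05, 0.5], [0.1, 0.1, 0.8]]
sigma = topic_sequence(stay, threshold=0.3)
print(sigma)
```

A day on which no topic reaches the threshold gets an empty label in this sketch; how such days should be handled is not stated in the text.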
Step 3.2: pruning low-frequency topic labels;
after each treatment day is replaced with a topic label to obtain the topic sequence of each hospitalization, low-frequency topic labels are pruned; within a topic label, later topics have lower probability than earlier ones, so the trailing topics of a low-frequency topic label are deleted step by step, checking after each deletion whether the pruned label is still low-frequency; concretely, a prefix tree is built over the topic labels in TL, a low-frequency threshold is set, each low-frequency label node is merged into its parent node in the prefix tree and the parent's frequency is updated, until no low-frequency label node remains in the tree.
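The pruning rule can be sketched in Python as follows. This is an illustrative reading, not the patented implementation: labels are tuples of topic indices, the prefix tree is represented implicitly (the parent of a label is the label without its last topic), and single-topic labels are never pruned. The names `prune_labels` and `min_count` are hypothetical.

```python
from collections import Counter


def prune_labels(labels, min_count):
    """Merge low-frequency topic labels into their prefix-tree parents.

    Returns a dict mapping each original label to its pruned label.
    Assumption: labels of length 1 have no parent to merge into and are kept.
    """
    counts = Counter(labels)
    mapping = {label: label for label in counts}
    while True:
        low = [l for l, c in counts.items() if c < min_count and len(l) > 1]
        if not low:
            break
        label = max(low, key=len)        # handle the deepest node first
        parent = label[:-1]              # drop the least probable topic
        counts[parent] += counts.pop(label)
        for original, current in mapping.items():
            if current == label:
                mapping[original] = parent
    return mapping


# Hypothetical label occurrences collected from all topic sequences.
labels = [(0,)] * 5 + [(0, 1)] * 4 + [(0, 1, 2)] + [(2, 0)] + [(2,)] * 3
mapping = prune_labels(labels, min_count=2)
print(mapping[(0, 1, 2)], mapping[(2, 0)])
```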
Step 3.3: clustering the topic sequences;
the topic sequences are clustered with the K-means algorithm, and the distance between topic sequences is measured by the edit distance (ED), which is the minimum number of operations needed to convert one string into another, the permitted edit operations being insertion, deletion, and substitution.
Step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
Step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to each of the k cluster centers and assigning it to the class of the nearest cluster center;
Step 3.3.3: for each class $a_j$, recalculating its cluster center;
Step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached;
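Steps 3.3.1 to 3.3.4 can be sketched in Python as below. The text does not say how a cluster center is recomputed for sequences, where an arithmetic mean is undefined; this sketch assumes the center is the member with the smallest total edit distance to the other members (a medoid), which makes it a K-medoids-style variant. The function names, the stopping rule, and the sample data are hypothetical.

```python
import random


def edit_distance(s, t):
    """Levenshtein distance with insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        curr = [i]
        for j, b in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (a != b)))  # substitution
        prev = curr
    return prev[-1]


def cluster_sequences(sequences, k, max_iter=20, seed=0):
    """Cluster topic sequences; centers are medoids (an assumption)."""
    rng = random.Random(seed)
    centers = rng.sample(sequences, k)                # step 3.3.1
    assignment = None
    for _ in range(max_iter):                         # step 3.3.4
        new_assignment = [
            min(range(k), key=lambda c: edit_distance(x, centers[c]))
            for x in sequences
        ]                                             # step 3.3.2
        if new_assignment == assignment:
            break                                     # assignments are stable
        assignment = new_assignment
        for c in range(k):                            # step 3.3.3
            members = [x for x, a in zip(sequences, assignment) if a == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda m: sum(edit_distance(m, x) for x in members),
                )
    return assignment, centers


# Hypothetical topic sequences; each element stands for one topic label.
sequences = [("A", "B", "C"), ("A", "B", "C", "C"), ("A", "B"),
             ("X", "Y", "Z"), ("X", "Y"), ("X", "Y", "Z", "Z")]
assignment, centers = cluster_sequences(sequences, k=2)
```

The result depends on the initial centers, so in practice several seeds would typically be tried.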
step 4: performing process mining on each constructed set of topic sequences with a mining algorithm based on the dependency graph between activities, where the topic labels serve as the nodes of the graph model and the temporal relations between topic labels serve as its directed edges, finally obtaining the diagnosis and treatment process model of each set of topic sequences.
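The text does not name the specific dependency-graph algorithm. As one possible illustration, the Python sketch below counts directly-follows pairs of topic labels and keeps an edge when a Heuristics-Miner-style dependency measure reaches a threshold. The choice of that measure, the threshold value, and the label names T1 to T3 are assumptions made for the example.

```python
from collections import Counter


def directly_follows(sequences):
    """Count how often label a is immediately followed by label b."""
    counts = Counter()
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts


def dependency_graph(sequences, min_dependency=0.6):
    """Return directed edges (a, b) -> dependency value.

    Assumption: Heuristics-Miner-style measure
    (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), with |a>a| / (|a>a| + 1) for loops.
    """
    df = directly_follows(sequences)
    edges = {}
    for (a, b), ab in df.items():
        if a == b:
            dep = ab / (ab + 1)
        else:
            ba = df.get((b, a), 0)
            dep = (ab - ba) / (ab + ba + 1)
        if dep >= min_dependency:
            edges[(a, b)] = dep
    return edges


# Hypothetical set of topic sequences from one cluster.
cluster = [("T1", "T2", "T3")] * 3 + [("T1", "T3")]
edges = dependency_graph(cluster)
print(edges)
```

Here the rare shortcut from T1 to T3 falls below the threshold and is dropped, which is the kind of simplification that keeps the mined model readable.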
The beneficial effects of the above technical scheme are as follows:
the invention provides a process path mining method based on improved LDA that mines clinical pathways from high-dimensional, sparse medical data by combining an LDA topic model with process mining.
Drawings
FIG. 1 is an overall flow chart in an embodiment of the present invention;
FIG. 2 is a mapping relationship diagram of LDA and the medical field in an embodiment of the present invention;
FIG. 3 is a schematic diagram of topic label pruning in an embodiment of the present invention;
FIG. 4 is a graph of breast cancer surgery data in an embodiment of the present invention;
FIG. 5 is a schematic diagram of the first set of topic sequences in an embodiment of the present invention;
FIG. 6 is a schematic diagram of the second set of topic sequences in an embodiment of the present invention;
FIG. 7 is a schematic diagram of the third set of topic sequences in an embodiment of the present invention.
Detailed Description
The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.
A method for mining a process path based on improved LDA, as shown in fig. 1, comprising the steps of:
step 1: filtering abnormal medical record samples from the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical order items with the stop word list, and uniformly mapping diagnosis and treatment items that have the same meaning with the medical dictionary;
the data in the data set are medical order data, which specifically comprise the patient ID, the medical order activity name, the medical order type, and the occurrence time;
a meaningless medical order item is a medical order item irrelevant to treatment;
step 1.1: denoising the medical order data;
medical data are highly noisy: for example, samples with only 2 treatment days contribute little to clinical pathway mining, and the data contain useless diagnosis and treatment items that may interfere with the experiments. A noise threshold is therefore set to filter abnormal data samples, and meaningless diagnosis and treatment items are added to the stop word list and filtered out with it;
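A minimal Python sketch of this denoising step is given below. It assumes that each medical order is a tuple of patient ID, activity name, order type, and treatment day, and that the noise threshold is a minimum number of distinct treatment days per patient; the field layout, the value `min_days=3`, and the stop word "bed fee" are hypothetical.

```python
from collections import defaultdict


def denoise_orders(orders, stop_words, min_days=3):
    """Drop abnormal samples and stop-word items from medical order rows.

    Assumption: a patient with fewer than min_days distinct treatment days
    is an abnormal sample; the patent only says a noise threshold is set.
    """
    days_per_patient = defaultdict(set)
    for patient_id, _name, _type, day in orders:
        days_per_patient[patient_id].add(day)

    kept = []
    for patient_id, name, order_type, day in orders:
        if len(days_per_patient[patient_id]) < min_days:
            continue  # abnormal sample: too few treatment days
        if name in stop_words:
            continue  # item unrelated to treatment
        kept.append((patient_id, name, order_type, day))
    return kept


orders = [
    ("p1", "blood test", "exam", 1),
    ("p1", "bed fee", "other", 1),
    ("p1", "mastectomy", "surgery", 2),
    ("p1", "wound dressing", "treatment", 3),
    ("p2", "blood test", "exam", 1),
    ("p2", "bed fee", "other", 2),
]
cleaned = denoise_orders(orders, stop_words={"bed fee"})
```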
step 1.2: uniformly mapping the diagnosis and treatment items in the text data;
Medical orders are filled in by doctors, and different doctors have different writing habits, so the text is not uniform and several different terms commonly refer to the same item. For example, "0.9% sodium chloride injection" may be written as "0.9% NACL", "sodium chloride", "0.9% sodium chloride", and so on, and for some injections some doctors also record the dose and the injection method in detail. A medical dictionary can be constructed during data preprocessing so that diagnosis and treatment items with the same meaning are uniformly mapped and all written variants are unified during processing;
to address these problems in the medical order data, the method maps diagnosis and treatment items uniformly by combining similarity calculation with regular expression matching. Because the medical order data used by the invention consist mostly of phrases and nouns, cosine similarity is chosen as the similarity measure. If only similarity calculation were used, the resulting medical dictionary could be incomplete because of interfering suffixes (such as the injected dose) in the order items. For example, the cosine similarity between "0.9% sodium chloride" and "0.9% sodium chloride 1000 ml" is calculated as 0.7559; with a similarity threshold of 0.8, the two items would not be added to the dictionary for uniform mapping, which is inconsistent with reality. Therefore, regular expression matching is performed before the similarity calculation to remove interfering suffixes from the medical order data; for "0.9% sodium chloride" and "0.9% sodium chloride 1000 ml" the matching rule is set to \d+ml, and the cosine similarity is calculated after matching;
the texts are converted into the corresponding word frequency vectors a and b and the cosine of the angle between the two vectors is calculated, the cosine similarity formula being $s(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}$, where $s(a, b)$ denotes the cosine similarity between a and b; after regular expression matching the calculated result is 1, which is consistent with reality;
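The combination of regular expression matching and cosine similarity can be sketched in Python as follows. The whitespace tokenization is an assumption made for the English example, so the similarity before suffix removal differs from the 0.7559 reported for the original text; the pattern is a whitespace-tolerant variant of the \d+ml rule, and the function names are hypothetical.

```python
import math
import re
from collections import Counter

# Whitespace-tolerant variant of the \d+ml rule quoted in the text.
SUFFIX_RULE = re.compile(r"\s*\d+\s*ml\b", re.IGNORECASE)


def strip_suffix(text):
    """Remove a trailing dose such as '1000 ml' from a medical order item."""
    return SUFFIX_RULE.sub("", text).strip()


def cosine_similarity(text_a, text_b):
    """Cosine similarity of the word frequency vectors of two texts."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


item_a = "0.9% sodium chloride"
item_b = "0.9% sodium chloride 1000 ml"
raw = cosine_similarity(item_a, item_b)
clean = cosine_similarity(strip_suffix(item_a), strip_suffix(item_b))
print(raw < 0.8, math.isclose(clean, 1.0))
```

With a threshold of 0.8, the raw pair would be rejected while the cleaned pair is accepted, which mirrors the behavior described above.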
step 2: modeling the medical data with the improved LDA topic model;
LDA is one of the most popular statistical topic modeling techniques. It models the process that generates each word of each document in a text data set. The LDA topic model has two core model parameters: the topic distribution of each document and the word distribution of each topic; the LDA topic model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so that each word in a document is generated by first selecting a topic according to the document's topic distribution and then selecting a word according to that topic's word distribution; taking diagnosis and treatment items as words and treatment days as documents, the treatment day-treatment topic distribution and the treatment topic-treatment item distribution are calculated respectively: the treatment day-treatment topic distribution $\theta_i$ of treatment day i is sampled from a Dirichlet distribution with parameter $\alpha$, and the treatment topic-treatment item distribution $\varphi_z$ corresponding to treatment topic z is sampled from a Dirichlet distribution with parameter $\beta$.
The mapping between LDA and the medical field is shown in fig. 2.
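The generative process described above can be illustrated with a short Python sketch that uses only the standard library. It shows the forward (generative) direction only, not the inference that fits the model to data; the vocabulary, the number of topics, and the symmetric parameter values 0.5 are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_treatment_day(vocab, phi, alpha, n_items, rng):
    """Generate the items of one treatment day (one LDA 'document')."""
    theta = sample_dirichlet(alpha, rng)          # day-topic distribution
    topics = range(len(phi))
    items = []
    for _ in range(n_items):
        z = rng.choices(topics, weights=theta)[0]         # pick a topic
        items.append(rng.choices(vocab, weights=phi[z])[0])  # pick an item
    return theta, items


rng = random.Random(0)
vocab = ["blood test", "chest X-ray", "mastectomy",
         "anesthesia", "wound dressing", "analgesic"]
K = 3
# Topic-item distributions, one per topic, drawn from Dirichlet(beta).
phi = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(K)]
theta, items = generate_treatment_day(vocab, phi, [0.5] * K, 5, rng)
```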
The traditional LDA model uses a bag-of-words model for text modeling, which has a serious problem: common words tend to have very high frequency and specialized terms very low frequency, so topics are dominated by high-frequency words. Table 1 shows the result of modeling clinical data of breast cancer surgery with the traditional LDA model (only the top 4 keywords under each topic are shown). As the table shows, "regular diet" is a general medical order and a high-frequency word across the whole corpus, so it ranks near the top of every topic-word distribution for breast cancer surgery and appears strongly associated with every topic, which is inconsistent with reality.
TABLE 1 LDA modeling of surgical data
[Table image BDA0003061707540000071 not reproduced]
The distribution of terms in a document collection is quantified with the inverse document frequency (IDF), which measures the general importance of a term. The IDF of a term is calculated as follows:
$$\mathrm{idf}_i = \log \frac{|D|}{1 + |\{j : t_i \in d_j\}|}$$
where $\mathrm{idf}_i$ is the IDF value of word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing word $t_i$; if the word does not occur in the corpus the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is used instead; the fewer documents contain word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, indicating that word $t_i$ discriminates better between categories;
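As a small illustration of the IDF calculation with the 1 + count divisor described above, the following Python sketch treats each treatment day as a document. The natural logarithm, the function name, and the sample days are assumptions made for the example.

```python
import math


def inverse_document_frequency(documents):
    """IDF per term: log(|D| / (1 + number of documents containing the term))."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        for term in set(doc):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


# Hypothetical treatment days, each a list of diagnosis and treatment items.
days = [
    ["regular diet", "blood test"],
    ["regular diet", "mastectomy"],
    ["regular diet", "wound dressing"],
    ["regular diet", "blood test"],
]
idf = inverse_document_frequency(days)
print(sorted(idf, key=idf.get))
```

An item that appears on every day, such as the general order here, receives the lowest IDF, while an item that appears on a single day receives the highest.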
The diagnosis and treatment items are taken as the words of the topic model and the diagnosis and treatment days as its documents, and LDA topic modeling is carried out to obtain the value of the topic-word distribution variable $\varphi$ in the output (i.e. the probability, or weight, with which a keyword belongs to a given topic). The topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$. The word distribution $\varphi$ values under each topic are sorted from small to large, and the weights of the twenty top-ranked words under each topic are recalculated according to the following formula:
Figure BDA0003061707540000075
where
Figure BDA0003061707540000076
denotes the probability that the word w occurs in topic z, $\mathrm{idf}_w$ is the IDF value of the keyword w in the data set, and
Figure BDA0003061707540000077
is the final weight of the word w in topic z;
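The re-weighting step can be illustrated as follows. The exact formula is given only in the figure and is not reproduced in this text, so the sketch assumes, purely for illustration, that the final weight is the product of the topic-word probability and the IDF value; it also assumes that "top twenty" means the twenty most probable words. The function name and sample numbers are hypothetical.

```python
def reweight_top_words(topic_word_probs, idf, top_n=20):
    """Re-rank the top_n most probable words of each topic.

    topic_word_probs: {topic: {word: probability}}
    idf:              {word: idf value}
    Assumption: final weight = probability * idf (the patent's formula
    is in a figure and may differ).
    """
    reweighted = {}
    for topic, probs in topic_word_probs.items():
        top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
        scored = [(word, p * idf.get(word, 0.0)) for word, p in top]
        reweighted[topic] = sorted(scored, key=lambda kv: kv[1], reverse=True)
    return reweighted

# Hypothetical values: "common food" is frequent everywhere, so its IDF is low.
phi = {0: {"common food": 0.30, "skin preparation": 0.20,
           "fasting before operation": 0.15}}
idf_values = {"common food": 0.05, "skin preparation": 1.2,
              "fasting before operation": 1.5}
ranking = reweight_top_words(phi, idf_values)
```

With these numbers the generic order drops from first to last place within the topic, while the two surgery-specific items move to the top.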
Step 3: After topic modeling, each diagnosis and treatment day is represented as a topic distribution giving the probability that the day belongs to each topic; the diagnosis and treatment log of a patient, i.e. one hospitalization record, is then converted into a topic vector sequence, which is processed to construct the topic sequence;
By using the improved LDA algorithm, complex and varied medical orders can be aggregated into several topics, and each diagnosis and treatment day can be represented as a topic distribution giving the probability that the day belongs to each topic, so that the diagnosis and treatment log of a patient (one hospitalization record) is converted into a topic vector sequence. In order to obtain a clearer and more easily understood clinical path model, this patent also constructs topic sequences to replace each diagnosis and treatment day. Topic sequence construction is divided into the following parts:
Step 3.1: generating the topic label of a diagnosis and treatment day;
For a diagnosis and treatment day d, related topics are extracted from its topic vector $\theta_d$ according to the selected topic-label probability threshold and used as the topic label representing that day. A topic k that is part of the topic label must satisfy the following constraint:
Figure BDA0003061707540000081
where r(k, d) denotes the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic-label probability threshold, and K is the selected optimal number of topics. The topics of the diagnosis and treatment day that satisfy the formula are arranged in descending order of probability, and the resulting topic label of day d is recorded as $tl_d = (k(1), k(2), \ldots, k(p))$, where k(j) denotes the topic with the j-th highest probability; TL is defined as the set of distinct topic labels.
Replacing each diagnosis and treatment day of one hospitalization of a patient with its topic label gives the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ of that hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization.
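Label generation can be sketched as below. The constraint itself is given only in the figure, so the sketch assumes the simplest reading, namely that a topic is kept when its probability in the day's topic vector is at least the threshold; the actual constraint may also involve K. Function names and the sample vectors are hypothetical.

```python
def topic_label(theta_d, threshold):
    """Topics whose probability is >= threshold, most probable first.

    Assumption: the constraint is r(k, d) >= threshold. The label is
    returned as a tuple of topic indices and may be empty if no topic
    passes the threshold.
    """
    selected = [(k, p) for k, p in enumerate(theta_d) if p >= threshold]
    selected.sort(key=lambda kp: kp[1], reverse=True)
    return tuple(k for k, _ in selected)

def topic_sequence(day_vectors, threshold):
    """Replace every treatment day of one hospitalization by its label."""
    return [topic_label(theta, threshold) for theta in day_vectors]

# Hypothetical topic vectors for a three-day stay with K = 5 topics.
stay = [
    [0.70, 0.05, 0.05, 0.15, 0.05],
    [0.10, 0.35, 0.45, 0.05, 0.05],
    [0.05, 0.80, 0.05, 0.05, 0.05],
]
sigma = topic_sequence(stay, threshold=0.3)
```

Here the second day keeps two topics, ordered by probability, so its label is (2, 1) rather than (1, 2); the ordering matters for the pruning step that follows.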
Step 3.2: pruning the low-frequency topic labels;
After each diagnosis and treatment day has been replaced with its topic label, a topic sequence is obtained for each hospitalization, and process mining is applied to these sequences in the subsequent work to obtain the final clinical path model. However, some low-frequency topic labels represent the characteristics of only a few diagnosis and treatment days, and the resulting clinical path model becomes complicated if they are left unprocessed. The aim of clinical path mining in this patent is to mine the treatment process followed by most cases, so low-frequency topic labels need to be pruned to avoid affecting the final mining result. From the way topic labels are generated, a topic placed later in a label has a lower probability than one placed earlier and is therefore less important for that diagnosis and treatment day. The trailing topics of a low-frequency label can thus be deleted step by step, checking after each deletion whether the pruned label is still low-frequency. To this end the method borrows the concept of a prefix tree: a prefix tree is built over the topic labels in TL, a low-frequency threshold is set, and each low-frequency label node is merged into its parent node, whose frequency is updated accordingly, until no low-frequency label node remains in the tree. A pruning example for the topic labels {"0": 3, "0,1": 2, "0,1,2,3": 1, "0,1,2,4": 1} is shown in Fig. 3.
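The pruning procedure can be sketched without an explicit tree structure, because the parent of a label in the prefix tree is simply the label with its last topic removed. The sketch below is an illustration under assumptions: labels are tuples of topic ids, single-topic labels are never pruned (they have no parent label), and the threshold of 2 is chosen for the example rather than taken from Fig. 3. The function name is hypothetical.

```python
from collections import Counter

def prune_low_frequency_labels(label_counts, min_count):
    """Merge low-frequency labels into their prefix (parent) labels.

    label_counts maps a topic label (tuple of topic ids, most probable
    first) to the number of treatment days carrying that label.
    """
    counts = Counter(label_counts)
    while True:
        # Single-topic labels have no parent label, so they are left alone.
        low = [lab for lab, c in counts.items()
               if c < min_count and len(lab) > 1]
        if not low:
            return dict(counts)
        deepest = max(low, key=len)       # prune from the leaves upward
        freq = counts.pop(deepest)
        counts[deepest[:-1]] += freq      # drop the least probable topic

# The label set from the text, written as tuples.
labels = {(0,): 3, (0, 1): 2, (0, 1, 2, 3): 1, (0, 1, 2, 4): 1}
pruned = prune_low_frequency_labels(labels, min_count=2)
```

Under these assumptions the two four-topic labels collapse into (0, 1, 2) with frequency 2, leaving {(0,): 3, (0, 1): 2, (0, 1, 2): 2}.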
Step 3.3: clustering the topic sequences;
The topic sequence of each case has been constructed in the work above. In order to show the characteristics of different diagnosis and treatment modes more clearly, the topic sequences are clustered, and a process mining method is then used to mine the clinical path of each sequence class. The method clusters the topic sequences with the K-means algorithm and measures the distance between topic sequences by the edit distance (ED). The edit distance is the minimum number of operations required to convert one string into another, where the allowed edit operations are insertion, deletion and replacement.
Step 3.3.1: select k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
Step 3.3.2: for each sample $x_i$ in the data set, calculate its edit distance (ED) to each of the k cluster centers and assign it to the class of the cluster center with the smallest distance;
Step 3.3.3: for each class $a_j$, recalculate its cluster center;
Step 3.3.4: repeat steps 3.3.2 and 3.3.3 until the set stopping condition (number of iterations, minimum error change, etc.) is reached;
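Steps 3.3.1 to 3.3.4 can be sketched as follows. The text does not say how a cluster center is recalculated for sequences, for which an arithmetic mean is not defined, so the sketch assumes a medoid: the member with the smallest total edit distance to the other members. That choice, the function names and the seed parameter are assumptions, not the patent's stated method.

```python
import random

def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]

def cluster_sequences(sequences, k, max_iter=100, seed=0):
    """K-means-style clustering with edit distance and medoid centers."""
    rng = random.Random(seed)
    centers = rng.sample(sequences, k)            # step 3.3.1
    assignment = None
    for _ in range(max_iter):
        # Step 3.3.2: assign every sequence to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: edit_distance(s, centers[c]))
            for s in sequences
        ]
        if new_assignment == assignment:          # step 3.3.4: converged
            break
        assignment = new_assignment
        # Step 3.3.3: the new center is the medoid of each class (assumption).
        for c in range(k):
            members = [s for s, a in zip(sequences, assignment) if a == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda m: sum(edit_distance(m, o) for o in members),
                )
    return assignment, centers
```

A call such as `cluster_sequences(topic_sequences, k=3)` returns one cluster index per hospitalization together with the representative sequence of each cluster; because the initial centers are sampled at random, different seeds can give different partitions.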
Step 4: Process mining is carried out on the constructed topic sequence sets using a mining algorithm based on an inter-activity dependency graph, in which the topic labels serve as the nodes of the graph model and the temporal relations between topic labels serve as its directed edges, finally yielding the diagnosis and treatment process model of each topic sequence set.
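The graph construction in step 4 can be illustrated with a plain directly-follows count. This is a simplified stand-in, not the dependency-graph mining algorithm itself, whose dependency measure the text does not specify. The node ids -1 and -2 for admission and discharge follow the convention used later in the description for the clinical pathway graphs; the function name and the `min_count` filter are assumptions.

```python
from collections import Counter

ADMISSION, DISCHARGE = -1, -2   # node ids used in the clinical pathway graphs

def directly_follows_graph(sequences, min_count=1):
    """Count how often one topic label directly follows another.

    Returns {(source, target): count} for edges seen at least min_count
    times. Each sequence is framed by admission and discharge nodes.
    """
    edges = Counter()
    for seq in sequences:
        path = [ADMISSION, *seq, DISCHARGE]
        for src, dst in zip(path, path[1:]):
            edges[(src, dst)] += 1
    return {edge: n for edge, n in edges.items() if n >= min_count}

# Hypothetical topic sequences of one cluster (labels as tuples of topic ids).
cluster = [
    [(0,), (1,), (2,)],
    [(0,), (1,), (1,), (2,)],
    [(0,), (2,)],
]
graph = directly_follows_graph(cluster, min_count=2)
```

Raising `min_count` removes rarely observed transitions, which keeps the resulting process model focused on the path followed by most cases.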
Hospital information systems are becoming increasingly sophisticated and have accumulated many types of medical data, including charge item data, order data, and so on. Order data are chosen here because they are more detailed, contain more information, and can be compared with the order information of the NCP.
Because the data in a hospital information system are varied and complex, breast cancer medical records are selected as the experimental data; complications and secondary symptoms in these records are not considered, and the final experimental data are obtained through screening, cleaning and preprocessing.
The preprocessed breast cancer hospitalization data include 4 main attributes: patient ID, order activity name, order type, and time of occurrence, as shown in Table 2. The order activities with the same patient ID and the same time of occurrence constitute one diagnosis and treatment day of the patient, and the diagnosis and treatment days with the same patient ID constitute one hospitalization of the patient.
TABLE 2 sample clinical data
Figure BDA0003061707540000091
Figure BDA0003061707540000101
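The grouping of order records into diagnosis and treatment days and hospitalizations can be sketched as below. The sketch assumes that each record is a tuple of the four attributes named above, that the time of occurrence is an ISO date string (so that sorting the strings sorts the days chronologically), and that each patient ID corresponds to a single hospitalization. The function name and sample records are hypothetical.

```python
from collections import defaultdict

def build_treatment_days(records):
    """Group order records into per-patient lists of treatment days.

    records: iterable of (patient_id, order_name, order_type, date).
    Returns {patient_id: [[orders of day 1], [orders of day 2], ...]}
    with the days in chronological order.
    """
    stays = defaultdict(lambda: defaultdict(list))
    for patient_id, order_name, _order_type, day in records:
        stays[patient_id][day].append(order_name)
    return {pid: [days[d] for d in sorted(days)] for pid, days in stays.items()}

# Hypothetical order records.
records = [
    ("P1", "blood test", "examination", "2021-01-02"),
    ("P1", "common food", "diet", "2021-01-01"),
    ("P1", "skin preparation", "nursing", "2021-01-02"),
    ("P2", "common food", "diet", "2021-01-05"),
]
stays = build_treatment_days(records)
```

Each inner list then plays the role of one document for the topic model, and each patient's list of days is the log that is later converted into a topic sequence.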
This patent uses breast cancer medical data as experimental data, mines the clinical path of the surgery they contain, and provides a reference for doctors when making treatment plans. Perplexity is calculated on the breast cancer surgery data to select the optimal number of topics. In text analysis, perplexity measures how uncertain a trained model is about which topics a document contains; the lower the value, the smaller the uncertainty and the better the final clustering result. The curve of perplexity against the number of topics is shown in Fig. 4: perplexity gradually decreases as the number of topics K increases and levels off at K = 5, so the optimal number of topics selected for the surgery data is 5.
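The measure described above (perplexity, in the usual topic-model terminology) can be computed from a fitted model as the exponential of the negative average per-word log-likelihood. The sketch below uses that standard definition; the text does not state the exact variant used in the experiments, and the function name and argument layout are assumptions.

```python
import math

def perplexity(docs, theta, phi):
    """Perplexity of documents under fitted topic distributions.

    docs:  list of documents, each a list of word ids
    theta: theta[d][k], probability of topic k in document d
    phi:   phi[k][w], probability of word w under topic k
    """
    log_likelihood = 0.0
    n_words = 0
    for d, words in enumerate(docs):
        for w in words:
            # Word probability is the mixture over topics.
            p = sum(theta[d][k] * phi[k][w] for k in range(len(phi)))
            log_likelihood += math.log(p)
            n_words += 1
    return math.exp(-log_likelihood / n_words)
```

To choose K in the way described, the model would be fitted for a range of topic numbers and this value plotted against K, taking the point where the curve flattens.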
The breast cancer surgery data are modeled with the improved LDA topic model, and the results are shown in Table 3. The name of each topic is defined manually according to the keywords under it. For example, the keywords under topic 0 include "food water prohibited before operation", "skin prepared before operation", "chest band prepared before operation", etc., so topic 0 may be defined as "preparation before operation"; topic 1 includes keywords such as "stitches removed (extra large)" and "dressing change (6 pieces or less)", so topic 1 may be labeled "post-operative care".
TABLE 3 modeling of modified LDA on Breast cancer surgical data
Figure BDA0003061707540000102
Figure BDA0003061707540000111
For the breast cancer surgery data set, 3 topic sequence sets are obtained, containing 379, 136 and 172 medical records respectively; the corresponding clinical pathway model graphs are shown in Fig. 5, Fig. 6 and Fig. 7, where -1 represents the admission node, -2 represents the discharge node, and the numbers on the other nodes represent the corresponding topics.
The first group of topic sequences follows the procedure of admission checks, pre-operative preparation, post-operative care, daily care and medication, which is roughly the same as the national standard clinical pathway. The second group, by comparison, lacks the topics related to pre-operative preparation and medication, and the third group follows roughly the same procedure as the first group but lacks the final medication.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims (5)

1. A process path mining method based on improved LDA is characterized by comprising the following steps:
step 1: filtering abnormal medical record samples in the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical advice items by using the stop word list, and uniformly mapping diagnosis and treatment items with the same meaning by using the medical dictionary;
the data in the data set are medical advice data which specifically comprise patient IDs, medical advice activity names, medical advice types and occurrence time;
the meaningless medical advice item is a medical advice item irrelevant to treatment;
step 2: modeling the medical data by utilizing the improved LDA topic model;
step 3: after topic modeling, representing each diagnosis and treatment day as a topic distribution, wherein the distribution represents the probability that the diagnosis and treatment day belongs to each topic, and then converting the diagnosis and treatment log of a patient, namely one hospitalization record, into a topic vector sequence; processing the topic vector sequence and constructing the topic sequence;
step 4: carrying out process mining on the constructed topic sequence sets by adopting a mining algorithm based on an inter-activity dependency graph, wherein the topic labels are used as nodes of the graph model and the temporal relations between the topic labels are used as directed edges of the graph model, so that the diagnosis and treatment process model of each topic sequence set is finally obtained.
2. The method for process path mining based on improved LDA as claimed in claim 1, wherein said step 1 specifically comprises the steps of:
step 1.1: denoising the medical advice data; setting a noise threshold value, filtering abnormal data samples, adding meaningless diagnosis and treatment items into a stop word list, and filtering the stop word list;
step 1.2: carrying out unified mapping on diagnosis and treatment items on the text data; constructing a medical dictionary, uniformly mapping diagnosis and treatment items with the same meaning, and unifying all writing situations during processing;
carrying out unified mapping on diagnosis and treatment items by combining similarity calculation with regular-expression matching, wherein regular-expression matching is carried out before the similarity calculation to remove suffix interference items in the medical advice data, and cosine similarity calculation is carried out after the regular-expression matching;
the texts are converted into the corresponding word-frequency vectors a and b and the cosine of the angle between the two vectors is calculated; the cosine similarity formula is $s(a, b) = \frac{a \cdot b}{|a|\,|b|}$, wherein s(a, b) represents the cosine similarity between a and b, and if the calculation result is 1, the result is in accordance with reality.
3. The method of claim 1, wherein the LDA topic model in step 2 comprises two core model parameters: the topic distribution of each document and the word distribution of each topic; in the LDA topic model, a document is assumed to be composed of different topics with different probabilities, and each topic corresponds to a probability distribution over words, so that each word in a document is generated by first selecting a topic according to the corresponding probability and then selecting a word from that topic according to its probability; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the diagnosis and treatment day-diagnosis and treatment topic distribution and the diagnosis and treatment topic-diagnosis and treatment item distribution are calculated respectively: the diagnosis and treatment day-diagnosis and treatment topic distribution $\theta_i$ of diagnosis and treatment day i is generated by sampling from the Dirichlet distribution $\alpha$, and the diagnosis and treatment topic-diagnosis and treatment item distribution corresponding to diagnosis and treatment topic z is generated by sampling from the Dirichlet distribution $\beta$
Figure FDA0003061707530000011
The distribution of terms in the Document collection is quantified using Inverse Document Frequency (IDF), which is calculated as follows:
Figure FDA0003061707530000021
wherein $\mathrm{idf}_i$ is the IDF value of the word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$; if the word does not appear in the corpus, the divisor would be zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead; the fewer documents contain the word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, which indicates that the word $t_i$ has better category discrimination ability;
taking the diagnosis and treatment items as the words of the topic model and the diagnosis and treatment days as its documents, LDA topic modeling is carried out to obtain the value of the topic-word distribution variable $\varphi$ in the output (i.e. the probability, or weight, with which a keyword belongs to a given topic); the topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$; the word distribution $\varphi$ values under each topic are sorted from small to large, and the weights of the twenty top-ranked words under each topic are recalculated according to the following formula:
Figure FDA0003061707530000022
wherein
Figure FDA0003061707530000023
denotes the probability that the word w occurs in topic z, $\mathrm{idf}_w$ is the IDF value of the keyword w in the data set, and
Figure FDA0003061707530000024
is the final weight of the word w in topic z.
4. The method of claim 1, wherein the step 3 specifically comprises the following steps:
step 3.1: generating the topic label of a diagnosis and treatment day;
for a diagnosis and treatment day d, related topics are extracted from its topic vector $\theta_d$ according to the selected topic-label probability threshold and used as the topic label representing that day; a topic k that is part of the topic label must satisfy the following constraint:
Figure FDA0003061707530000025
wherein r(k, d) denotes the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the selected topic-label probability threshold, and K is the selected optimal number of topics; the topics of the diagnosis and treatment day that satisfy the formula are arranged in descending order of probability, and the resulting topic label of day d is recorded as $tl_d = (k(1), k(2), \ldots, k(p))$, wherein k(j) denotes the topic with the j-th highest probability, and TL is defined as the set of distinct topic labels;
replacing each diagnosis and treatment day of one hospitalization of a patient with its topic label gives the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ of that hospitalization, wherein $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization;
step 3.2: pruning the low-frequency topic labels;
after each diagnosis and treatment day is replaced with its topic label, a topic sequence is obtained for each hospitalization and the low-frequency topic labels are pruned; a topic placed later in a topic label has a lower probability than one placed earlier, so the trailing topics of a low-frequency label are deleted step by step, and after each deletion it is judged whether the pruned topic label is still low-frequency; a prefix tree is constructed for the topic labels in TL, a low-frequency label threshold is set, each low-frequency label node is merged into its parent node in the prefix tree, and the frequency of the parent node is updated, until no low-frequency label node remains in the whole tree;
step 3.3: clustering the topic sequences;
the topic sequences are clustered with the K-means algorithm, and the distance between topic sequences is measured by the edit distance (ED), wherein the edit distance is the minimum number of operations required to convert one string into another, and the allowed edit operations are insertion, deletion and replacement.
5. The method of claim 4, wherein the step 3.3 specifically comprises the steps of:
step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;
step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to each of the k cluster centers and assigning it to the class of the cluster center with the smallest distance;
step 3.3.3: for each class $a_j$, recalculating its cluster center;
step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached.
CN202110515351.6A 2021-05-12 2021-05-12 Improved LDA-based process path mining method Active CN113161001B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110515351.6A CN113161001B (en) 2021-05-12 2021-05-12 Improved LDA-based process path mining method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110515351.6A CN113161001B (en) 2021-05-12 2021-05-12 Improved LDA-based process path mining method

Publications (2)

Publication Number Publication Date
CN113161001A (en) 2021-07-23
CN113161001B CN113161001B (en) 2023-11-17

Family

ID=76874697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110515351.6A Active CN113161001B (en) 2021-05-12 2021-05-12 Improved LDA-based process path mining method

Country Status (1)

Country Link
CN (1) CN113161001B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083616A (en) * 2022-08-16 2022-09-20 之江实验室 Chronic nephropathy subtype mining system based on self-supervision graph clustering
CN115879179A (en) * 2023-02-24 2023-03-31 忻州师范学院 Abnormal medical record detection device
CN116303893A (en) * 2023-02-23 2023-06-23 哈尔滨工业大学 Method for classifying anchor image and analyzing key characteristics based on LDA topic model
CN116719926A (en) * 2023-08-10 2023-09-08 四川大学 Congenital heart disease report data screening method and system based on intelligent medical treatment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106228023A (en) * 2016-08-01 2016-12-14 清华大学 A kind of clinical path method for digging based on body and topic model
CN112700878A (en) * 2020-12-22 2021-04-23 云南大学 Clinical path optimization method based on process mining

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
徐啸;金涛;王建民;: "基于优化主题模型的临床路径挖掘", 软件学报, no. 11, pages 231 - 239 *
李睿易;鲁法明;包云霞;曾庆田;朱冠烨;: "基于药物疗效日志的临床路径挖掘方法", 计算机集成制造系统, no. 04, pages 61 - 71 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115083616A (en) * 2022-08-16 2022-09-20 之江实验室 Chronic nephropathy subtype mining system based on self-supervision graph clustering
CN115083616B (en) * 2022-08-16 2022-11-08 之江实验室 Chronic nephropathy subtype mining system based on self-supervision graph clustering
JP7404581B1 (en) 2022-08-16 2023-12-25 之江実験室 Chronic nephropathy subtype mining system based on self-supervised graph clustering
CN116303893A (en) * 2023-02-23 2023-06-23 哈尔滨工业大学 Method for classifying anchor image and analyzing key characteristics based on LDA topic model
CN116303893B (en) * 2023-02-23 2024-01-30 哈尔滨工业大学 Method for classifying anchor image and analyzing key characteristics based on LDA topic model
CN115879179A (en) * 2023-02-24 2023-03-31 忻州师范学院 Abnormal medical record detection device
CN116719926A (en) * 2023-08-10 2023-09-08 四川大学 Congenital heart disease report data screening method and system based on intelligent medical treatment
CN116719926B (en) * 2023-08-10 2023-10-20 四川大学 Congenital heart disease report data screening method and system based on intelligent medical treatment

Also Published As

Publication number Publication date
CN113161001B (en) 2023-11-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant