CN113161001A

CN113161001A - Process path mining method based on improved LDA

Info

Publication number: CN113161001A
Application number: CN202110515351.6A
Authority: CN
Inventors: 栗伟; 闵新�; 叶盼盼; 韩瑞奇
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-07-23
Anticipated expiration: 2041-05-12
Also published as: CN113161001B

Abstract

The invention provides a process path mining method based on improved LDA and relates to the technical field of clinical path mining. The method analyzes the medical advice logs in electronic medical records, constructs a medical dictionary to filter useless medical advice items from those logs, models the medical data with an LDA topic model so that the logs are mapped to a low-dimensional topic space, and then uses process mining to discover the temporal relationships among the topic features. The mined medical process model is therefore easier to understand, and the medical interpretability of the result is improved. The results obtained by the invention were compared with the national standard clinical pathway and are basically consistent with it.

Description

Process path mining method based on improved LDA

Technical Field

The invention relates to the technical field of clinical path mining, in particular to a process path mining method based on improved LDA.

Background

With the progress of society, medical expenses keep rising. To curb this trend and improve the utilization of health resources, the state has set a series of clinical treatment standards. These standards reduce medical expenses, shorten treatment duration, and regulate the behavior of medical staff while still achieving the expected treatment effect. This standardized treatment pattern is called the clinical pathway.

A clinical pathway is a programmed, standardized diagnosis and treatment plan for a specific disease or operation, with a strict working sequence and precise time requirements, aimed at achieving the desired treatment effect while controlling cost. Applying the correct treatment at the correct time is the core of a clinical pathway, which generally divides the diagnosis and treatment of a disease into several stages and specifies the items required at each stage. The clinical pathway is defined in contrast to the traditional pathway, that is, the individual pathway of each doctor: different regions, hospitals, treatment groups, or individual doctors may adopt different treatment regimens for the same disease. Adopting a clinical pathway avoids this variation and arbitrariness, and makes cost, prognosis, and similar outcomes easier to evaluate. Extensive clinical practice has shown that clinical pathways can standardize clinical diagnosis and treatment activities, control cost, strengthen medical process management, and improve medical quality and efficiency.

At present, the national health council is gradually implementing a clinical pathway management mode, but promotion has not been smooth: few hospitals implement clinical pathways, and practical application often runs into problems such as lack of reliability and a small number of covered disease types. Specifically:

(1) Reliability is lacking. Most clinical pathways implemented in hospitals are based on national standards and are established by the relevant personnel through discussion of past experience. Pathways formulated from experience seriously lack data support and experimental simulation, which raises the variation rate of the clinical pathway, lowers its utilization rate, and makes it unsuitable for developing personalized clinical pathways;

(2) Most hospitals pay insufficient attention, the scope of adoption is small, and few disease types are covered. Cases entered into clinical pathways are mainly surgical diseases treated by operation; the disease types are few and relatively uniform, and reports of clinical pathway application in chronic diseases are rare and limited to relatively simple disease types;

(3) Existing clinical pathways are slow to update, are not updated in time as the patient's condition changes, and scale poorly. Because developing a clinical pathway manually is time-consuming and labor-intensive, a developed pathway remains static for a long time. When designing clinical pathways, most hospitals design a complete treatment scheme from beginning to end according to the patient's condition, and the pathway is difficult to update in real time as that condition changes during implementation. Furthermore, tens of thousands of diseases are currently known, and managing them all through clinical pathways, with complications and the like taken into account, would require a large investment;

(4) Pathways are difficult to put into practice. The diagnosis and treatment item categories specified by a clinical pathway form are implemented and deployed differently in different places and hospitals, so a great deal of local effort is needed for local mapping work. Meanwhile, because patients' individual characteristics place different demands on the pathway, the variation rate of manually made clinical pathways in practice (cases where the required diagnosis and treatment items do not meet the established pathway) is extremely high, and it is difficult to provide suitable diagnosis and treatment planning guidance.

The problems of slow updating, poor scalability, and lack of reliability can be addressed with an automatic clinical pathway formulation method, and the problem of difficult practice can be addressed by finding, in historical data, diagnosis and treatment schemes that are practical and better matched to the current patient, to serve as reference and guidance. With these two starting points, together with the rapid accumulation of medical data brought by the development of medical informatization in recent years, data-driven clinical path mining is receiving more and more attention.

The clinical pathway derives from the practice of clinical diagnosis and treatment activities and is a common treatment pattern for a disease type hidden in the mass data of the hospital information system. As the level of medical informatization improves, a large amount of historical patient diagnosis and treatment data is recorded in various medical information systems. Analyzing this data with data mining techniques makes it possible to formulate scientific and reasonable clinical pathways that meet diagnosis and treatment standards, and to give doctors sound decision support and recommendations when they formulate clinical pathways.

The aim of clinical path mining is to find, from diagnosis and treatment data, a process model that is general across many patients and preserves temporal order. It focuses on discovering the diagnosis and treatment paths actually executed in historical data. These more objective and specific execution paths can effectively assist the design and redesign of clinical pathways and provide a reference for those who formulate them. They can also be used to examine how clinical pathway management is actually carried out in the regions and hospitals that have adopted it, helping clinical pathway managers identify differences.

The LDA model is a document topic generation model that establishes a three-layer document-topic-word Bayesian network. "Generative model" means that each word of an article is considered to be obtained through a process of "selecting a topic with a certain probability and then selecting a word from that topic with a certain probability".

Current clinical path mining research applies classical process mining algorithms directly to medical data. Because the event granularity is too fine, the resulting medical process model is spaghetti-like and is hard to understand and use. To obtain a more understandable and compact medical process model, the dimensionality of the medical data needs to be reduced and the medical process needs to be abstracted and generalized. Some researchers have applied topic models to medical data, but the results lose the temporal order between the stages of the clinical pathway.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a process path mining method based on improved LDA.

The technical scheme of the invention is as follows:

a process path mining method based on improved LDA comprises the following steps:

step 1: filtering abnormal medical record samples in the data set, constructing a stop word list and a medical dictionary, filtering meaningless medical advice items by using the stop word list, and uniformly mapping diagnosis and treatment items with the same meaning by using the medical dictionary;

the data in the data set are medical advice data which specifically comprise patient IDs, medical advice activity names, medical advice types and occurrence time;

the meaningless medical advice item is a medical advice item irrelevant to treatment;

step 1.1: denoising the medical advice data; setting a noise threshold, filtering abnormal data samples, adding meaningless diagnosis and treatment items to a stop word list, and filtering the data with the stop word list;

step 1.2: carrying out unified mapping of diagnosis and treatment items on the text data; constructing a medical dictionary, uniformly mapping diagnosis and treatment items with the same meaning, and unifying all written variants during processing;

carrying out the unified mapping of diagnosis and treatment items by combining similarity calculation with regular-expression matching; regular-expression matching is carried out before similarity calculation to remove interfering suffixes from the medical advice data, and cosine similarity is calculated afterwards;

converting the texts into corresponding word frequency vectors a and b and calculating the cosine of the angle between the two vectors, wherein the cosine similarity formula is $s(a, b) = \dfrac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$, wherein s(a, b) represents the cosine similarity between a and b; a result of 1 means the two texts are treated as the same item, in accordance with reality;

step 2, modeling the medical data by utilizing the improved LDA topic model;

the LDA topic model comprises two core model parameters: a topic distribution for each document and a word distribution for each topic; the LDA topic model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so that each word in the document is generated by first selecting a topic according to the document's topic probabilities and then selecting a word according to that topic's word probabilities; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the diagnosis and treatment day-topic distribution and the topic-diagnosis and treatment item distribution are calculated respectively: the day-topic distribution $\theta_i$ of diagnosis and treatment day i is sampled from the Dirichlet distribution with parameter $\alpha$, and the topic-item distribution $\varphi_z$ corresponding to topic z is sampled from the Dirichlet distribution with parameter $\beta$;

The distribution of terms in the document collection is quantified using the Inverse Document Frequency (IDF), which is calculated as follows:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ is the IDF value of the word $t_i$, $|D|$ is the total number of documents in the corpus, and $|\{j : t_i \in d_j\}|$ is the number of documents containing the word $t_i$; if the word does not appear in the corpus the divisor is zero, so $1 + |\{j : t_i \in d_j\}|$ is generally used instead; the fewer documents contain the word $t_i$, the smaller $|\{j : t_i \in d_j\}|$ and the larger the IDF, indicating that the word $t_i$ has better category discrimination ability;
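As a minimal sketch of the IDF calculation described above, the following Python function treats each diagnosis and treatment day as a document (a list of item names) and uses the smoothed divisor $1 + |\{j : t_i \in d_j\}|$. The function name and the natural logarithm are assumptions for illustration; the source does not state the logarithm base.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: idf} for a corpus given as a list of term lists.

    Uses log(|D| / (1 + df)), the smoothed divisor mentioned in the text.
    The natural logarithm is an assumption.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document, however often it repeats.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}
```

An item that appears on almost every diagnosis and treatment day receives a small (here even negative) IDF, while an item confined to a few days receives a larger one.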

taking the diagnosis and treatment items as words in the topic model and the diagnosis and treatment days as documents in the topic model, LDA topic modeling is carried out, thereby obtaining in the output the value of the topic-word distribution variable $\varphi$ (i.e. the probability, or weight, with which a keyword belongs to a certain topic); the topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$; the word distribution $\varphi$ values under each topic are sorted, the twenty top-ranked words under each topic are taken for weight recalculation, and the calculation formula is as follows:

wherein $\varphi_{z,w}$ indicates the probability of the word w occurring in the topic z, $idf_w$ represents the IDF value of the keyword w in the data set, and the result of the formula is the final weight of the word w in the topic z;

step 3: after topic modeling, each diagnosis and treatment day is represented as a topic distribution, which gives the probability that the day belongs to each topic, so that a patient's diagnosis and treatment log, namely one hospitalization record, is converted into a sequence of topic vectors; the topic vector sequence is then processed to construct the topic sequence;

step 3.1: generating the topic label of each diagnosis and treatment day;

for a diagnosis and treatment day d, relevant topics are extracted from its topic vector $\theta_d$ as the topic label representing that day, according to the selected topic label probability threshold; a topic k, to be one of the topics in the label, needs to satisfy the following constraint:

where r(k, d) represents the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the topic label selection probability threshold, and K is the selected optimal number of topics; the topics of the diagnosis and treatment day that satisfy the constraint are arranged in descending order of probability, and the resulting topic label of diagnosis and treatment day d is recorded as $tl_d = \langle k(1), k(2), \ldots \rangle$, where k(j) represents the topic with the j-th highest probability; TL is defined as the set of distinct topic labels.

Replacing each diagnosis and treatment day of one hospitalization of a patient with its topic label gives the topic sequence corresponding to that hospitalization, $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization.
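The label generation in step 3.1 can be sketched as follows. Because the constraint formula is not reproduced in the text, this sketch assumes the simplest reading, namely that topic k is kept when $r(k, d) \ge \delta_{tl}$; the function names are hypothetical.

```python
def topic_label(theta_d, delta_tl):
    """Build the topic label of one diagnosis and treatment day.

    theta_d  : list of topic probabilities r(k, d), indexed by topic k.
    delta_tl : topic label probability threshold.
    Assumption: a topic is kept when r(k, d) >= delta_tl.
    """
    kept = [(k, p) for k, p in enumerate(theta_d) if p >= delta_tl]
    # Descending probability, so position j holds the j-th most probable topic.
    kept.sort(key=lambda kp: kp[1], reverse=True)
    return tuple(k for k, _ in kept)


def topic_sequence(day_vectors, delta_tl):
    """Replace every day of one hospitalization with its topic label."""
    return [topic_label(theta_d, delta_tl) for theta_d in day_vectors]
```

Representing a label as a tuple keeps the topic order and makes labels usable as dictionary keys when their frequencies are counted later.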

Step 3.2: pruning the low-frequency topic labels;

each diagnosis and treatment day is replaced with its topic label to obtain the topic sequence corresponding to each hospitalization, and low-frequency topic labels are pruned; because topics placed later in a topic label have lower probability than those placed earlier, the trailing topics of a low-frequency label are deleted step by step, and after each deletion the pruned label is checked again to see whether it is still low-frequency; a prefix tree is constructed over the topic labels in TL, a threshold for low-frequency labels is set, and each low-frequency label node is merged into its parent node in the prefix tree, updating the parent's frequency, until no low-frequency label node remains in the whole tree.

Step 3.3: clustering the topic sequences;

the topic sequences are clustered with the K-means algorithm, and the distance between topic sequences is measured by the edit distance (ED), wherein the edit distance is the minimum number of edit operations required to convert one string into another, and the allowed edit operations are insertion, deletion and replacement.

Step 3.3.1: selecting k samples as the initial cluster centers $a_1, a_2, \ldots, a_k$;

Step 3.3.2: for each sample $x_i$ in the data set, calculating its edit distance (ED) to each of the k cluster centers and assigning it to the class of the cluster center with the minimum distance;

Step 3.3.3: for each class $a_j$, recalculating its cluster center;

Step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached;
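Steps 3.3.1 to 3.3.4 can be sketched in Python as below. The text does not say how a cluster center is recalculated for sequences, so this sketch assumes the center is the member sequence with the smallest total edit distance to the other members (a medoid). The initial centers are passed in by the caller, and all names are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between two sequences (insert, delete, replace)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # replace a with b
        prev = cur
    return prev[-1]


def cluster_sequences(seqs, centers, max_iter=20):
    """K-means-style clustering of topic sequences under edit distance."""
    centers = list(centers)
    assignment = None
    for _ in range(max_iter):
        # Step 3.3.2: assign each sequence to its nearest center.
        new_assignment = [
            min(range(len(centers)), key=lambda c: edit_distance(x, centers[c]))
            for x in seqs
        ]
        if new_assignment == assignment:  # stopping condition: no change
            break
        assignment = new_assignment
        # Step 3.3.3 (assumed): the new center is the medoid of each class.
        for c in range(len(centers)):
            members = [x for x, a in zip(seqs, assignment) if a == c]
            if members:
                centers[c] = min(
                    members,
                    key=lambda m: sum(edit_distance(m, o) for o in members),
                )
    return assignment, centers
```

Using a medoid keeps every center a real topic sequence, which avoids having to define an average of sequences.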

step 4: carrying out process mining on each constructed topic sequence set with a mining algorithm based on an inter-activity dependency graph, wherein the topic labels serve as the nodes of the graph model and the temporal relationships between topic labels serve as its directed edges, finally obtaining the diagnosis and treatment process model of each topic sequence set.
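The text does not name the specific dependency-graph algorithm. As a simplified stand-in, the sketch below builds a directly-follows graph: each topic label is a node, and an edge (a, b) counts how often label b immediately follows label a within a hospitalization. The admission and discharge nodes use the values -1 and -2 that appear in the model graphs of the embodiment; the `min_count` filter and the function name are assumptions.

```python
from collections import Counter

ADMISSION, DISCHARGE = -1, -2


def directly_follows_graph(sequences, min_count=1):
    """Count directed edges between consecutive topic labels.

    sequences : list of topic sequences, each a list of hashable labels.
    min_count : edges seen fewer times than this are dropped (assumed filter).
    """
    edges = Counter()
    for seq in sequences:
        path = [ADMISSION, *seq, DISCHARGE]
        for a, b in zip(path, path[1:]):
            edges[(a, b)] += 1
    return {edge: n for edge, n in edges.items() if n >= min_count}
```

Raising `min_count` removes rarely observed transitions, which keeps the resulting graph readable at the cost of hiding uncommon paths.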

The beneficial effects produced by adopting the technical method are as follows:

the invention provides a process path mining method based on improved LDA, which is used for mining clinical paths from high-dimensional sparse medical data and combining an LDA topic model with process mining.

Drawings

FIG. 1 is an overall flow chart in an embodiment of the present invention;

FIG. 2 is a mapping relationship diagram of LDA and the medical field in an embodiment of the present invention;

FIG. 3 is a schematic diagram of topic label pruning in an embodiment of the present invention;

FIG. 4 is a graph of perplexity against the number of topics for the breast cancer surgery data in an embodiment of the present invention;

FIG. 5 is a schematic diagram of the first set of topic sequences in an embodiment of the present invention;

FIG. 6 is a schematic diagram of the second set of topic sequences in an embodiment of the present invention;

FIG. 7 is a schematic diagram of the third set of topic sequences in an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

A method for mining a process path based on improved LDA, as shown in fig. 1, comprising the steps of:

step 1.1: denoising the medical advice data;

Medical data is very noisy. For example, samples with only 2 diagnosis and treatment days contribute little to clinical path mining, and the data contains some useless diagnosis and treatment items that may interfere with the experiments. A noise threshold is set, abnormal data samples are filtered out, meaningless diagnosis and treatment items are added to a stop word list, and the data is filtered with the stop word list;
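A minimal sketch of this denoising step is given below, assuming each record is a tuple of the four attributes used in the data set (patient ID, order name, order type, date). The default `min_days=3` merely reflects the remark that 2-day samples contribute little; the actual noise threshold is not stated in the text, and the function name is hypothetical.

```python
from collections import defaultdict


def denoise(records, stop_items, min_days=3):
    """Drop abnormal samples and stop-listed diagnosis and treatment items.

    records    : iterable of (patient_id, order_name, order_type, date).
    stop_items : set of order names considered irrelevant to treatment.
    min_days   : assumed noise threshold on the number of distinct days.
    """
    records = list(records)
    days = defaultdict(set)
    for patient_id, _name, _type, date in records:
        days[patient_id].add(date)
    return [
        rec for rec in records
        if len(days[rec[0]]) >= min_days and rec[1] not in stop_items
    ]
```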

step 1.2: unified mapping of medical items to textual data

Medical advice is filled in by doctors, and different people have different writing habits, so the content of the text data is not uniform and it is common for several different expressions to have the same meaning. For example, "0.9% sodium chloride injection" may be written as "0.9% NACL", "sodium chloride", "0.9% sodium chloride", and so on, and for some injections some doctors also record the dose and the injection method in detail. A medical dictionary can be constructed during data preprocessing so that diagnosis and treatment items with the same meaning are uniformly mapped and all written variants are unified during processing;

To address these problems in the medical advice data, the method combines similarity calculation with regular-expression matching to carry out unified mapping of diagnosis and treatment items. Because the medical advice data used by the invention consists mostly of phrases and nouns, cosine similarity is chosen as the similarity algorithm. If only similarity calculation is used, the resulting medical dictionary may be incomplete because of interfering suffixes (e.g., injection doses) in the order items. For example, the cosine similarity between "0.9% sodium chloride" and "0.9% sodium chloride 1000 ml" is calculated to be 0.7559; if the similarity threshold is set to 0.8, the two items cannot be added to the dictionary for unified mapping, which does not match reality. Therefore regular-expression matching is carried out before similarity calculation to remove interfering suffixes from the medical advice data; for "0.9% sodium chloride" and "0.9% sodium chloride 1000 ml" the regular-expression rule is set to \d+ml, and cosine similarity is calculated after the match;
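The combination of suffix removal and cosine similarity can be sketched as follows. Two points are assumptions: the pattern extends the rule \d+ml with optional whitespace so that "1000 ml" also matches, and the word frequency vectors are built from whitespace-separated tokens. With this tokenization the similarity before stripping will not reproduce the 0.7559 quoted above, which depends on the original tokenization.

```python
import math
import re
from collections import Counter

# Assumed variant of the rule \d+ml that also tolerates spaces around the number.
DOSE_SUFFIX = re.compile(r"\s*\d+\s*ml\b", re.IGNORECASE)


def strip_suffix(item):
    """Remove a volume suffix such as '1000 ml' from an order item."""
    return DOSE_SUFFIX.sub("", item).strip()


def cosine_similarity(text_a, text_b):
    """Cosine similarity of whitespace-token frequency vectors."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0
```

Stripping the suffix first means the two spellings of the same item produce identical vectors, so their similarity reaches 1 and they can be mapped to one dictionary entry.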

step 2, modeling the medical data by utilizing the improved LDA topic model;

LDA is one of the most popular statistical topic modeling techniques. It models the generation process of each word in each document of a text data set. The LDA topic model comprises two core model parameters: a topic distribution for each document and a word distribution for each topic. The model assumes that a document is composed of different topics with different probabilities and that each topic corresponds to a probability distribution over words, so that each word in the document is generated by first selecting a topic according to the document's topic probabilities and then selecting a word according to that topic's word probabilities. Taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the diagnosis and treatment day-topic distribution and the topic-diagnosis and treatment item distribution are calculated respectively: the day-topic distribution $\theta_i$ of diagnosis and treatment day i is sampled from the Dirichlet distribution with parameter $\alpha$, and the topic-item distribution $\varphi_z$ corresponding to topic z is sampled from the Dirichlet distribution with parameter $\beta$.

The mapping relationship between LDA and medical field is shown in fig. 2.
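To make the generative view concrete, the sketch below simulates how one diagnosis and treatment day would be produced under the LDA assumptions: draw the day's topic distribution from a Dirichlet distribution, then draw a topic and an item for every order. It illustrates only the generative process, not model fitting; the Dirichlet sampler built from gamma draws and all names are illustrative assumptions.

```python
import random


def sample_dirichlet(params, rng):
    """Draw one probability vector from Dirichlet(params) via gamma variates."""
    draws = [rng.gammavariate(p, 1.0) for p in params]
    total = sum(draws)
    return [d / total for d in draws]


def generate_treatment_day(n_items, alpha, phi, vocab, rng):
    """Simulate one diagnosis and treatment day (one LDA document).

    alpha : Dirichlet parameters for the day-topic distribution theta.
    phi   : list of topic-item distributions, one per topic, over vocab.
    """
    theta = sample_dirichlet(alpha, rng)
    day = []
    for _ in range(n_items):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # pick a topic
        item = rng.choices(vocab, weights=phi[z])[0]          # pick an item
        day.append(item)
    return theta, day
```

Fitting LDA runs this process in reverse: given the observed items of each day, it estimates the $\theta$ and $\varphi$ values that most plausibly generated them.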

The traditional LDA model uses a bag-of-words model for text modeling, which has a serious problem: common words often have very high word frequency while proper nouns have very low word frequency, so topics are dominated by high-frequency words. Table 1 shows the result of modeling clinical data of breast cancer surgery with the traditional LDA model (only the top 4 keywords under each topic are shown). As the table shows, "common food" is a general medical advice item and a high-frequency word in the whole corpus, so it ranks near the top of every topic-word distribution for breast cancer surgery and appears strongly associated with every topic, which does not match reality.

TABLE 1 LDA modeling of surgical data

The distribution of terms in a document collection is quantified using the Inverse Document Frequency (IDF), which is a measure of the general importance of a term. The IDF of a term is calculated as follows:

$$idf_i = \log \frac{|D|}{|\{j : t_i \in d_j\}|}$$

where $|D|$ is the total number of documents in the corpus and $|\{j : t_i \in d_j\}|$ is the number of documents containing the term $t_i$.

The topic-word distribution $\varphi$ is generated by a Dirichlet distribution with parameter $\beta$; the word distribution $\varphi$ values under each topic are sorted, the twenty top-ranked words under each topic are taken for weight recalculation, and the calculation formula is as follows:

wherein the result of the formula is the final weight of the word w in the topic z;
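The re-weighting formula itself is not reproduced in the text. The sketch below therefore assumes the simplest combination of the two quantities involved, the product of a word's topic probability and its IDF value, and should be read as an illustration of the idea rather than the patented formula. Taking the twenty words with the largest probabilities, the function name, and the `top_n` parameter are likewise assumptions.

```python
def reweight_topic(phi_z, idf, top_n=20):
    """Re-rank the top words of one topic by an IDF-adjusted weight.

    phi_z : {word: probability of the word under topic z}.
    idf   : {word: IDF value of the word in the data set}.
    Assumption: final weight = probability * idf (formula not given in the text).
    """
    top = sorted(phi_z.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    weighted = {word: prob * idf.get(word, 0.0) for word, prob in top}
    return sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)
```

Under this assumption a general order that is frequent on every day keeps a high probability but a low IDF, so it drops below the topic-specific items after re-weighting.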

With the improved LDA algorithm, the complex and varied medical orders are aggregated into a number of topics, and each diagnosis and treatment day is represented as a topic distribution giving the probability that the day belongs to each topic; a patient's diagnosis and treatment log (one hospitalization record) is thereby converted into a sequence of topic vectors. To obtain a clearer and more understandable clinical path model, the patent also constructs topic sequences in which each diagnosis and treatment day is replaced by a topic label. Topic sequence construction is divided into the following parts:

step 3.1: generating the topic label of each diagnosis and treatment day;

Step 3.2: pruning the low-frequency topic labels;

Each diagnosis and treatment day is replaced with its topic label to obtain the topic sequence corresponding to each hospitalization, and in the subsequent work process mining is applied to the topic sequences to obtain the final clinical path model. However, some of the generated topic labels are low-frequency labels that represent the characteristics of only a few diagnosis and treatment days, and the final clinical path model becomes complicated if they are left unprocessed. The aim of clinical path mining in this patent is to mine the treatment process followed by most cases, so the low-frequency topic labels need to be pruned to keep them from affecting the final mining result. As follows from the way topic labels are generated, topics placed later in a label have lower probability than those placed earlier and are comparatively less important for the diagnosis and treatment day, so the trailing topics of a low-frequency label can be deleted step by step, checking after each deletion whether the pruned label is still low-frequency. The method therefore borrows the concept of the prefix tree: a prefix tree is constructed over the topic labels in TL, a threshold for low-frequency labels is set, and each low-frequency label node is merged into its parent node, updating the parent's frequency, until no low-frequency label node remains in the whole tree. A pruning example for the topic labels {"0": 3, "0,1": 2, "0,1,2,3": 1, "0,1,2,4": 1} is shown in fig. 3.
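The prefix-tree pruning can be sketched without building an explicit tree, because the parent of a label in the prefix tree is simply the label with its last topic removed. The sketch below represents labels as tuples and merges low-frequency labels upward, deepest first. Two choices are assumptions not stated in the text: a label is low-frequency when its count is below `min_freq`, and single-topic labels are kept even when rare, since their parent would be the empty root.

```python
def prune_labels(label_counts, min_freq):
    """Merge low-frequency topic labels into their prefix-tree parents.

    label_counts : {label tuple: frequency}, e.g. {(0, 1): 2}.
    Returns (pruned counts, mapping from each original label to its final label).
    """
    counts = dict(label_counts)
    mapping = {label: label for label in counts}
    while True:
        low = [l for l, n in counts.items() if n < min_freq and len(l) > 1]
        if not low:
            break
        label = max(low, key=len)  # handle the deepest node first
        parent = label[:-1]        # drop the least probable (last) topic
        counts[parent] = counts.get(parent, 0) + counts.pop(label)
        for source, target in mapping.items():
            if target == label:
                mapping[source] = parent
    return counts, mapping
```

Applied to the label set quoted above with `min_freq=2`, the two labels with frequency 1 are both merged into the shared prefix (0, 1, 2), which then has frequency 2 and is no longer low-frequency.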

Step 3.3: clustering the topic sequences;

The work above completes the construction of the topic sequence for each case. To show the characteristics of different diagnosis and treatment patterns more clearly, the topic sequences are clustered, and a process mining method is then used to mine the clinical path of each sequence class. The method clusters the topic sequences with the K-means algorithm and measures the distance between topic sequences by the edit distance (ED), wherein the edit distance is the minimum number of edit operations required to convert one string into another, and the allowed edit operations are insertion, deletion and replacement.

step 3.3.3: for each class $a_j$, recalculating its cluster center;

step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set stopping condition is reached (number of iterations, minimum change in error, etc.);

Hospital information systems are becoming increasingly sophisticated and accumulate many types of medical data, including charge item data, medical order data, and so on. Medical order data is chosen here because it is more detailed, contains more information, and can be compared with the order information of the NCP.

Because the data in a hospital information system is varied and complex, breast cancer medical records are selected as experimental data, without considering complications and secondary symptoms, and the final experimental data is obtained through screening, cleaning and preprocessing.

The preprocessed breast cancer hospitalization data includes 4 main attributes: patient ID, order activity name, order type, and time of occurrence, as shown in Table 2. Order activities with the same patient ID and the same time of occurrence constitute one diagnosis and treatment day of the patient, and the diagnosis and treatment days with the same patient ID constitute one hospitalization of the patient.
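The grouping of order records into diagnosis and treatment days and hospitalizations can be sketched as follows. It assumes each record is a (patient ID, order activity name, order type, date) tuple and that dates are ISO-formatted strings, so sorting them as text yields chronological order; the function name is hypothetical.

```python
from collections import defaultdict


def build_treatment_days(records):
    """Group order records into per-patient, date-ordered treatment days.

    Returns {patient_id: [items_of_day_1, items_of_day_2, ...]}, where each
    day is the list of order activity names recorded on that date.
    """
    by_patient = defaultdict(lambda: defaultdict(list))
    for patient_id, name, _order_type, date in records:
        by_patient[patient_id][date].append(name)
    return {
        patient_id: [by_date[d] for d in sorted(by_date)]
        for patient_id, by_date in by_patient.items()
    }
```

Each inner list is one LDA document, and the outer list for a patient preserves the day order that the later topic sequence relies on.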

TABLE 2 sample clinical data

The patent uses breast cancer medical data as experimental data and mines the surgical clinical path from it to provide a reference for doctors when making treatment plans. Perplexity is calculated on the breast cancer surgery data to select the optimal number of topics. In text analysis, perplexity measures how uncertain a trained model is about which topics a document contains; the lower the value, the smaller the uncertainty and the better the final clustering result. The curve of perplexity against the number of topics is shown in fig. 4: perplexity gradually decreases as the number of topics K increases, and the change flattens out at K = 5, so the optimal number of topics selected for the surgery data is 5.
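For reference, the sketch below computes perplexity with the usual definition for topic models, $\exp\!\left(-\sum_d \log p(w_d) / \sum_d N_d\right)$, where the probability of a word in document d is $\sum_z \theta_{d,z}\,\varphi_{z,w}$. The text does not state which implementation was used, so the function and its argument layout (documents as lists of word indices) are assumptions.

```python
import math


def perplexity(docs, theta, phi):
    """Perplexity of a fitted topic model on a set of documents.

    docs  : list of documents, each a list of word indices.
    theta : theta[d][z], topic distribution of document d.
    phi   : phi[z][w], word distribution of topic z.
    """
    log_likelihood = 0.0
    n_words = 0
    for d, doc in enumerate(docs):
        for w in doc:
            # Probability of word w in document d, mixing over all topics.
            p = sum(theta[d][z] * phi[z][w] for z in range(len(phi)))
            log_likelihood += math.log(p)
            n_words += 1
    return math.exp(-log_likelihood / n_words)
```

Evaluating this for models trained with different numbers of topics and plotting the values gives a curve of the kind described for fig. 4.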

The breast cancer surgery data was modeled with the improved LDA topic model, and the results are shown in Table 3. The name of each topic is defined manually according to the keywords under it. For example, the keywords under topic 0 include "food water prohibited before operation", "skin prepared before operation", "chest band prepared before operation", etc., so topic 0 can be defined as "preparation before operation"; topic 1 includes keywords such as "stitches removed (extra large)" and "dressing change (6 pieces or less)", so topic 1 can be labeled "post-operative care".

TABLE 3 modeling of modified LDA on Breast cancer surgical data

For the breast cancer surgery data set, 3 topic sequence sets are obtained, containing 379, 136 and 172 medical records respectively; the corresponding clinical pathway model graphs are shown in fig. 5, fig. 6 and fig. 7, where -1 represents the admission node, -2 represents the discharge node, and the numbers on the other nodes represent the corresponding topics.

The first set of topic sequences follows the procedure of admission checks, preoperative preparation, post-operative care, daily care and medication, roughly the same as the national standard clinical pathway. The second set lacks the topics related to preoperative preparation and medication, and the third set follows roughly the same procedure as the first but lacks the final medication.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A process path mining method based on improved LDA is characterized by comprising the following steps:

step 2, modeling the medical data by utilizing the improved LDA topic model;

2. The method for process path mining based on improved LDA as claimed in claim 1, wherein said step 1 specifically comprises the steps of:

carrying out unified mapping of the diagnosis and treatment items by combining similarity calculation with regular-expression matching, wherein the regular-expression matching is performed before the similarity calculation to remove suffix interference items from the medical advice data, and the cosine similarity is calculated after the regular-expression matching;

the text is converted into corresponding word frequency vectors a and b, and the cosine of the angle between the two vectors is calculated; the formula of the cosine similarity calculation is $s(a, b) = \frac{a \cdot b}{|a|\,|b|}$, wherein s(a, b) represents the cosine similarity between a and b, and if the calculation result is 1, the result is in accordance with reality.
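By way of illustration only, the mapping described in claim 2 can be sketched in Python as follows. The sketch reads the matching step as regular-expression matching. The suffix pattern (a trailing parenthesised qualifier), the whitespace tokenisation, and the function names strip_suffix, term_vector, cosine_similarity and item_similarity are assumptions introduced for this example; they are not specified in the claims.

```python
import math
import re
from collections import Counter

# Assumed suffix pattern: a trailing parenthesised qualifier such as "(extra large)".
SUFFIX_PATTERN = re.compile(r"\s*[\(（][^()（）]*[\)）]\s*$")


def strip_suffix(item: str) -> str:
    """Remove one trailing parenthesised qualifier from a medical advice item."""
    return SUFFIX_PATTERN.sub("", item).strip()


def term_vector(text: str) -> Counter:
    """Build a word frequency vector; whitespace tokenisation is an assumption."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Compute s(a, b) = a . b / (|a| |b|) for two word frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty item is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def item_similarity(x: str, y: str) -> float:
    """Strip suffixes first, then compare the remaining text by cosine similarity."""
    return cosine_similarity(term_vector(strip_suffix(x)),
                             term_vector(strip_suffix(y)))
```

With this sketch, item_similarity("stitches removed (extra large)", "stitches removed") evaluates to approximately 1, so the two items would be mapped to the same diagnosis and treatment item.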

3. The method of claim 1, wherein the LDA topic model in step 2 comprises two core model parameters: a topic distribution for each document and a word distribution for each topic; in the LDA topic model, a document is assumed to be composed of different topics with different probabilities, and each topic corresponds to a probability distribution over words, so that each word in the document is generated by first selecting a topic according to the corresponding probability and then selecting a word from that topic according to its probability; taking the diagnosis and treatment items as words and the diagnosis and treatment days as documents, the distribution of diagnosis and treatment days over diagnosis and treatment topics and the distribution of diagnosis and treatment topics over diagnosis and treatment items are calculated respectively; the day-topic distribution $\theta_i$ of diagnosis and treatment day i is generated by sampling from the Dirichlet distribution with parameter $\alpha$, and the topic-item distribution corresponding to diagnosis and treatment topic z is generated by sampling from the Dirichlet distribution with parameter $\beta$;

The values of the word distribution under each topic, generated from the Dirichlet distribution with parameter $\beta$, are sorted from small to large, the top twenty words under each topic are taken for weight recalculation, and the calculation formula is as follows

wherein the quantity given by the formula is the final weight of the word w in the topic z.
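As a minimal sketch of the generative process recited in claim 3 (not of the weight recalculation, whose formula is not reproduced here), one diagnosis and treatment day can be simulated in Python as follows. The function names, the example vocabulary, the parameter values and the fixed number of items per day are all assumptions made for illustration.

```python
import random


def sample_dirichlet(concentration, rng):
    """Draw one sample from a Dirichlet distribution by normalising gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


def generate_treatment_day(vocabulary, topic_item_dists, alpha, n_items, rng):
    """Generate the items of one treatment day (a 'document' in LDA terms)."""
    # Day-topic distribution theta_i, sampled from Dirichlet(alpha).
    theta = sample_dirichlet(alpha, rng)
    topics = range(len(alpha))
    items = []
    for _ in range(n_items):
        # Select a topic according to theta, then an item according to that topic.
        z = rng.choices(topics, weights=theta)[0]
        w = rng.choices(vocabulary, weights=topic_item_dists[z])[0]
        items.append(w)
    return theta, items


rng = random.Random(0)  # fixed seed so the sketch is repeatable
vocabulary = ["skin preparation", "fasting", "dressing change", "stitches removed"]
alpha = [0.5, 0.5]               # assumed hyperparameter, two topics
beta = [0.5] * len(vocabulary)   # assumed hyperparameter, one entry per item
# Topic-item distribution for each topic z, sampled from Dirichlet(beta).
topic_item_dists = [sample_dirichlet(beta, rng) for _ in alpha]
theta, items = generate_treatment_day(vocabulary, topic_item_dists, alpha, 5, rng)
```

Fitting the model to real medical advice logs runs this process in reverse: the two distributions are inferred from the observed items rather than sampled.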

4. The method of claim 1, wherein the step 3 specifically comprises the following steps:

step 3.1: generating topic labels for the diagnosis and treatment days;

for a diagnosis and treatment day d, related topics are extracted from its corresponding topic vector $\theta_d$ according to the topic label selection probability threshold and used as the topic label representing the diagnosis and treatment day; a topic k, to be one of the topics in the topic label, needs to satisfy the following constraint:

where $r(k, d)$ represents the value of topic k in the topic vector $\theta_d$, $\delta_{tl}$ is the topic label selection probability threshold, and K is the selected optimal number of topics; the topics of the diagnosis and treatment day that satisfy the formula are arranged in descending order of probability, and the topic label of diagnosis and treatment day d is finally recorded as $tl_d = (k(1), k(2), \ldots, k(p))$, where $k(j)$ represents the topic with the j-th highest probability, and TL is defined as the set of distinct topic labels;

replacing each diagnosis and treatment day of one hospitalization of one patient with its topic label to obtain the topic sequence $\sigma = \langle tl_1, tl_2, \ldots, tl_{|\sigma|} \rangle$ corresponding to the hospitalization, where $tl_i \in TL$ and $|\sigma|$ is the number of diagnosis and treatment days of the hospitalization;

step 3.2: pruning the low-frequency topic labels;

replacing each diagnosis and treatment day with its topic label to obtain the topic sequence corresponding to each hospitalization, and pruning the low-frequency topic labels; since a topic placed later in a topic label has a lower probability than a topic placed earlier, the trailing topics of a low-frequency topic label are deleted step by step, and after each deletion it is judged whether the pruned topic label is still low-frequency; a prefix tree is constructed for the topic labels in TL, a threshold for low-frequency labels is set, each low-frequency label node is merged into its parent node in the prefix tree and the frequency of the parent node is updated, until no low-frequency label node remains in the whole tree;

step 3.3: clustering the topic sequences;
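Steps 3.1 and 3.2 can be illustrated with the following Python sketch, offered under stated assumptions rather than as the claimed implementation. The constraint formula of step 3.1 is not reproduced in this text, so the sketch simply assumes that a topic is selected when its probability is at least a caller-supplied threshold. The prefix tree is represented implicitly by tuples (the parent of a label is the label without its last topic), and single-topic labels are assumed to be kept even when rare. The names topic_label, prune_labels, threshold and min_count are hypothetical.

```python
from collections import Counter


def topic_label(theta_d, threshold):
    """Step 3.1 (assumed rule): keep topics with probability >= threshold,
    ordered by descending probability."""
    selected = [(p, k) for k, p in enumerate(theta_d) if p >= threshold]
    selected.sort(key=lambda pk: (-pk[0], pk[1]))
    return tuple(k for _, k in selected)


def prune_labels(labels, min_count):
    """Step 3.2: merge low-frequency labels into their prefix (parent node).

    Returns a mapping from each original label to its pruned form, together
    with the label frequencies after pruning.
    """
    counts = Counter(labels)
    mapping = {label: label for label in counts}
    while True:
        # Low-frequency labels that still have a trailing topic to delete.
        low = [l for l, c in counts.items() if c < min_count and len(l) > 1]
        if not low:
            break
        label = max(low, key=len)   # handle the deepest node first
        parent = label[:-1]         # delete the lowest-probability topic
        moved = counts.pop(label)
        counts[parent] += moved     # update the frequency of the parent node
        for source, target in list(mapping.items()):
            if target == label:
                mapping[source] = parent
    return mapping, counts


# Example: a rare three-topic label collapses into its more frequent prefix.
labels = [(0, 1)] * 5 + [(0, 1, 2)] + [(0,)] * 3 + [(3, 4)]
mapping, counts = prune_labels(labels, min_count=2)
hospitalization = [(0,), (0, 1, 2), (0, 1), (3, 4)]
pruned_sequence = [mapping[tl] for tl in hospitalization]
```

In this example the label (0, 1, 2) occurs only once and is merged into (0, 1), while (3, 4) is reduced to (3,); the pruned topic sequences are then the input to the clustering of step 3.3.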

5. The method of claim 4, wherein the step 3.3 specifically comprises the steps of:

step 3.3.3: for each class a_jRecalculating its cluster center;

step 3.3.4: repeating steps 3.3.2 and 3.3.3 until the set termination condition is reached.