WO2021204038A1

WO2021204038A1 - Multi-scale clinical pathway mining method and apparatus, computer device, and storage medium

Info

Publication number: WO2021204038A1
Application number: PCT/CN2021/084255
Authority: WO
Inventors: 蒋雪涵; 唐蕊; 孙行智
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-11-12
Filing date: 2021-03-31
Publication date: 2021-10-14
Also published as: CN112382398A; CN112382398B

Abstract

A multi-scale clinical pathway mining method and apparatus, a computer device and a storage medium. The method comprises: converting usage data of items used by a plurality of users each day into an item usage matrix, and denoting the item usage matrix as m*n, wherein m represents the sum of all hospital stays of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day (S101); taking each row in the item usage matrix as a user·day, and clustering similar users·days according to the similarities between the various users·days (S102); and using cores of clusters to represent doctor-visiting pathways of the various users, performing serialization representation on the doctor-visiting pathways of the various users, then mining frequent sequences from doctor-visiting pathway sequences, and taking the frequent sequences as the main clinical pathways (S103). By means of the present method, the rationality and variability of an actual clinical operation can be better reflected.

Description

Multi-scale clinical path mining method, device, computer equipment and storage medium

This application claims priority to Chinese patent application No. 202011260888.4, filed with the Chinese Patent Office on November 12, 2020 and entitled "Multi-scale clinical pathway mining method, device, computer equipment and storage medium", the entire content of which is incorporated in this application by reference.

Technical field

This application relates to the field of data mining, and in particular to multi-scale clinical path mining methods, devices, computer equipment and storage media.

Background technique

With the improvement of medical informatization, electronic medical records have gradually replaced paper medical records, and it has become a trend to use data analysis and artificial intelligence methods to mine latent medical information. Understanding patients' medical treatment behavior from their sequential treatment data is essential for summarizing the main clinical paths, extracting sequential clinical rules, and performing quality control.

One approach to standardizing patients' medical treatment and conducting quality control is the clinical pathway. The clinical pathway is a mode of medical service management: by formulating a procedural, standardized diagnosis and treatment plan for a certain disease or major operation, it aims to standardize medical behavior and reduce the waste of medical resources. Thousands of clinical pathways have been formulated. However, the inventors realized that controlling the quality of medical behavior according to established clinical pathways raises many problems. For example, clinical pathways are formulated for general conditions and do not take the circumstances of each individual patient into account, so quality control based on them can be too strict to be meaningful. In other words, existing clinical path mining methods lack flexibility and variability.

Application content

The purpose of this application is to provide a multi-scale clinical path mining method, device, computer equipment and storage medium, aiming to solve the problem that the existing clinical path mining method does not have flexibility and variability.

In the first aspect, an embodiment of the present application provides a multi-scale clinical path mining method, which includes:

Converting the item usage data of multiple users for each day into an item usage matrix, and denoting the item usage matrix as m*n, where m represents the sum of the hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

Taking each row in the item usage matrix as a user·day, and clustering similar user·days according to the similarity between the user·days;

Using the cluster cores to represent the medical treatment path of each user, serializing the medical treatment path of each user, then mining frequent sequences from the resulting sequences, and taking the frequent sequences as the main clinical paths.

In the second aspect, an embodiment of the present application provides a multi-scale clinical path mining device, which includes:

A conversion unit, configured to convert the item usage data of multiple users for each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the sum of the hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

A clustering unit, configured to take each row in the item usage matrix as a user·day, and cluster similar user·days according to the similarity between the user·days;

A mining unit, configured to use the cluster cores to represent the medical treatment path of each user, serialize the medical treatment path of each user, then mine frequent sequences from the resulting sequences, and take the frequent sequences as the main clinical paths.

In a third aspect, embodiments of the present application provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the multi-scale clinical path mining method described in the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the multi-scale clinical path mining method described in the first aspect.

The embodiments of the application can mine patterns from time-series clinical data and obtain the real clinical paths from the data, which better reflects the rationality and variability of actual clinical practice. Through the serialized representation, they also solve the problem of excessive time and space complexity caused by unordered item sets.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings used in the description of the embodiments. Obviously, the drawings in the following description are some embodiments of the present application. Ordinary technicians can obtain other drawings based on these drawings without creative work.

FIG. 1 is a schematic flowchart of a multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 2 is a schematic diagram of a sub-process of a multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 3 is a schematic diagram of another sub-process of the multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 4 is a schematic diagram of another sub-process of the multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 5 is a schematic diagram of another sub-process of the multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 6 is a schematic diagram of another sub-process of the multi-scale clinical path mining method provided by an embodiment of the application;

FIG. 7 is a schematic block diagram of a multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 8 is a schematic block diagram of subunits of the multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 9 is a schematic block diagram of another subunit of the multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 10 is a schematic block diagram of another subunit of the multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 11 is a schematic block diagram of another subunit of the multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 12 is a schematic block diagram of another subunit of the multi-scale clinical path mining device provided by an embodiment of the application;

FIG. 13 is a schematic block diagram of a computer device provided by an embodiment of the application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

It should be understood that when used in this specification and the appended claims, the terms "comprising" and "including" indicate the existence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the existence or addition of one or more other features, wholes, steps, operations, elements, components, and/or collections thereof.

It should also be understood that the terms used in the specification of this application are only for the purpose of describing specific embodiments and are not intended to limit the application. As used in the specification of this application and the appended claims, unless the context clearly indicates other circumstances, the singular forms "a", "an" and "the" are intended to include plural forms.

It should be further understood that the term "and/or" used in the specification and appended claims of this application refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.

Please refer to FIG. 1. FIG. 1 is a schematic flowchart of a multi-scale clinical path mining method provided by an embodiment of the application, including steps S101 to S103:

S101. Convert the item usage data of multiple users for each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the sum of the hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

In this step, the items used by a user on different days (that is, on each hospitalization day) may or may not repeat. To put each user's item usage data into a uniform format, the data needs to be converted into the item usage matrix.

In an embodiment, as shown in FIG. 2, the S101 includes steps S201 to S203:

S201. Construct an item usage matrix in advance, where the number of rows of the item usage matrix is m and the number of columns is n;

Here m is the sum of the hospitalization days of all the users. For example, if the hospital stay of user a is m1 days, that of user b is m2 days, ..., and that of user s is ms days, then m = m1 + m2 + ... + ms. n represents the number of all items. Note that n does not count duplicates; each of the n items is unique. For example, suppose there are three users: user a, user b, and user c. Over all hospitalization days, user a uses items n1, n2, n3, n4, and n5; user b uses items n1, n3, n5, n7, and n8; and user c uses items n4, n6, n7, n9, and n10. Then n is 10, covering the 10 items n1, n2, n3, n4, n5, n6, n7, n8, n9, and n10. Alternatively, the number of all items offered by the hospital can be obtained in advance and used as n. With this approach, some of the n items may not be used by any of the users.

S202. Obtain the items used by each user every day;

The items used by each user each day are the items charged to that user on each day of the hospitalization period. They can be represented as an ordered sequence: <{item a, item b, item c, ...}, {item b, item d, ...}, ...>, where the elements inside "<...>" are ordered and the length of "<...>" is the number of days the user is hospitalized, while the elements inside "{...}" are unordered. If the set of all chargeable items in the hospital is S, then each "{...}" is a subset of S;

S203: Fill each row element of the item usage matrix according to the items used by each user every day.

Each row of the item usage matrix represents the usage of the hospital's items by one user on one day, and each column represents the usage of one item across different users and days. The elements of the item usage matrix can be 0 or 1: 0 means the item is not used in the corresponding user·day, and 1 means it is used.

A "user·day" means "a certain user on a certain day" in the item usage matrix. The rows can be arranged by user first and, for each user, by day. That is, the first row is the item usage of the first user on the first day, the second row is the item usage of the first user on the second day, and so on. For example, if the first user is hospitalized for 10 days, then the tenth row represents the item usage of the first user on the tenth day, the eleventh row represents the item usage of the second user on the first day, and the twelfth row represents the item usage of the second user on the second day. The number of hospitalization days may differ from user to user.
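Steps S201 to S203 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_item_usage_matrix`, the input layout (one list of daily item sets per user) and the sample item names are assumptions made for the example. Columns are taken as the sorted set of items that actually occur, which corresponds to the first way of choosing n described in S201.

```python
def build_item_usage_matrix(stays):
    """stays: one entry per user; each entry is a list of daily item sets.

    Returns (matrix, items, row_index) where matrix is m x n with 0/1 entries,
    items gives the column order, and row_index[r] = (user, day) for row r.
    """
    # n unique items, in a fixed (sorted) column order
    items = sorted({item for stay in stays for day in stay for item in day})
    column = {item: j for j, item in enumerate(items)}

    matrix, row_index = [], []
    for user, stay in enumerate(stays):      # users first ...
        for day_no, day in enumerate(stay):  # ... then days within each user
            row = [0] * len(items)
            for item in day:
                row[column[item]] = 1        # 1 = item used in this user-day
            matrix.append(row)
            row_index.append((user, day_no))
    return matrix, items, row_index


# Hypothetical data: user 0 stays 2 days, user 1 stays 3 days, so m = 5.
stays = [
    [{"blood routine", "urine test"}, {"anesthesia fee", "surgery fee"}],
    [{"blood routine"}, {"surgery fee"}, {"antibiotics"}],
]
matrix, items, row_index = build_item_usage_matrix(stays)
```

Keeping `row_index` alongside the matrix makes it possible to map clustered rows back to individual users and days in later steps.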

S102. Use each row in the item usage matrix as a user·day, and cluster similar users·days according to the similarity between the users·days;

This step clusters the user·days in the item usage matrix. The embodiment of the present application provides two ways to calculate the similarity between user·days and cluster them. The first is described below.

In an embodiment, as shown in FIG. 3, the S102 includes steps S301 to S302:

S301. Calculate the similarity between the user·days according to the item usage matrix, construct the distance matrix of the user·days according to the similarities, and denote the distance matrix as m*m;

S302. Cluster similar user·days according to the distance matrix.

In this embodiment, the similarity between user·days is calculated from the data of each user·day in the item usage matrix, the distance matrix is constructed from the similarities, and clustering is then performed according to the distance matrix.

In an embodiment, as shown in FIG. 4, the S301 includes steps S401 to S403:

S401. Extract the data of each row from the item usage matrix;

This step extracts the data of each row from the item usage matrix. Each row represents one user's item usage on a certain day, for example {1, 0, 0, 1, 0, ..., 0}, where 1 means that the user used the corresponding item on that day and 0 means that the user did not.

S402: Calculate the similarity between the data of each row and the data of all rows in order;

For example, if the data in one row is {1, 0, 0, 1, 0, ..., 0} and the data in another row is {0, 0, 1, 0, 0, ..., 1}, the similarity between the two rows can be calculated. In this way, the similarity between each row and all rows can be calculated. To make the subsequent distance matrix regular, "all rows" includes the row itself; that is, for each row, the similarity to every row, including itself, is calculated.

In addition, this step preferably calculates the similarities in order: first the similarity between the first row and all rows, then the similarity between the second row and all rows, and so on, until the similarity between the last row and all rows has been calculated.

Likewise, the similarities between a given row and all rows are calculated in order. For example, for the third row, the similarity to the first row is calculated first, then to the second row, then to the third row itself, then to the fourth row, and so on, until the similarity between the third row and the last row has been calculated.

In the embodiment of this application, the Jaccard distance (based on the Jaccard coefficient) can be used to calculate the similarity. In its standard form, the Jaccard distance is:

$$d_{ij} = 1 - \frac{|S_i \cap S_j|}{|S_i \cup S_j|}$$

where |·| denotes the length (number of elements) of ·, S_i denotes the set of items used in the i-th user·day (i refers to the i-th row of data, not the i-th user), and S_j denotes the set of items used in the j-th user·day (j refers to the j-th row of data, not the j-th user).

S403. Arrange the calculated similarities in order to construct the distance matrix, and denote the distance matrix as m*m, where the element d_{ij} in the i-th row and j-th column of the distance matrix represents the distance between the i-th user·day and the j-th user·day.

In this step, the previously calculated similarities are filled into the matrix in order to construct the distance matrix. The distance matrix may be arranged as follows. The elements in the first row represent, in turn, the similarity between the first row of the item usage matrix and every row: the element in the first row and first column represents the similarity between the first row and itself, the element in the first row and second column represents the similarity between the first row and the second row, and the element in the first row and m-th column represents the similarity between the first row and the last row. The elements in the second row represent, in turn, the similarity between the second row of the item usage matrix and every row, and so on; the last row represents, in turn, the similarity between the last row of the item usage matrix and every row.
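Steps S401 to S403 can be illustrated with a short sketch. It assumes the standard Jaccard distance, 1 - |S_i ∩ S_j| / |S_i ∪ S_j|, computed directly on the 0/1 rows; the function names and the convention that two empty rows have distance 0 are assumptions made for this example.

```python
def jaccard_distance(row_a, row_b):
    """Jaccard distance between two 0/1 rows of the item usage matrix."""
    intersection = sum(1 for a, b in zip(row_a, row_b) if a and b)
    union = sum(1 for a, b in zip(row_a, row_b) if a or b)
    if union == 0:
        # Assumed convention: two user-days with no items are identical.
        return 0.0
    return 1.0 - intersection / union


def distance_matrix(matrix):
    """m x m matrix; entry [i][j] is the distance between rows i and j.

    Rows are compared in order, and each row is also compared with itself,
    so the result is a full square matrix with zeros on the diagonal.
    """
    m = len(matrix)
    return [[jaccard_distance(matrix[i], matrix[j]) for j in range(m)]
            for i in range(m)]


rows = [
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [0, 1, 0, 0],
]
dist = distance_matrix(rows)
```

Because the distance is symmetric, an implementation concerned with speed could compute only the upper triangle and mirror it; the full double loop is kept here to match the row-by-row order described in the text.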

In an embodiment, the S302 includes:

A hierarchical clustering method is used to merge the two closest elements in the distance matrix into one category, and all elements are traversed in this way to achieve global clustering.

This step clusters similar user·days, that is, it clusters different user·days according to the similarity of the items used. Hierarchical clustering can be used. Clustering identifies which elements in the distance matrix are more similar so that they can be grouped into one category.

Since the elements in the distance matrix represent the similarity between different rows of the item usage matrix, that is, between different user·days, clustering the elements in the distance matrix in effect clusters the user·days in the item usage matrix, that is, the rows of the item usage matrix. The original item usage matrix has m user·days in total. If clustering yields x categories, then x categories of user·days are obtained, where m is greater than x; in practice, m may be much greater than x.
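The merge-the-closest-pair idea can be sketched with a small agglomerative procedure. The text does not specify the linkage criterion or when merging stops, so the average linkage and the distance `threshold` used here are assumptions chosen only to make the example concrete; the function name is likewise hypothetical.

```python
def agglomerative_clusters(dist, threshold):
    """Repeatedly merge the two closest clusters of user-days.

    dist: m x m distance matrix. Merging stops when the closest pair of
    clusters is farther apart than `threshold` (assumed stopping rule).
    Returns one category number (1..x) per row.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage (assumed): mean pairwise distance between members.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        if best[0] > threshold:
            break
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]

    labels = [0] * len(dist)
    for number, members in enumerate(clusters, start=1):
        for i in members:
            labels[i] = number
    return labels


dist = [
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.1],
    [1.0, 0.9, 0.1, 0.0],
]
labels = agglomerative_clusters(dist, threshold=0.5)
```

In this toy matrix rows 0 and 1 end up in one category and rows 2 and 3 in another, so m = 4 user·days are reduced to x = 2 categories. The naive pair search is cubic in m and is meant for illustration only.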

In addition to calculating similarity with the distance matrix and clustering as described above, the embodiment of the present application provides a second clustering method: applying a language model to learn representations of the items of each day. The advantage is that the high-dimensional sparse matrix is reduced to a low-dimensional dense matrix, which not only improves the performance of the method but also represents each user·day more accurately, yielding a better clustering effect.

In an embodiment, as shown in FIG. 5, the S102 includes steps S501 to S504:

S501. Obtain the items used in each user·day, and treat the obtained items as words;

S502: Represent all words in each user·day as vectors through word-vector-based representation learning to obtain the corresponding word vectors;

S503. Weight the word vectors of all words in each user·day by word frequency weighting to obtain the sentence vector of each user·day, where the word frequency weighting is calculated as v_day = dot(V_I, TFIDF), in which v_day represents the sentence vector of the user·day, V_I denotes the matrix of item representations for the user·day, I is the set of items of the user·day, each row of V_I is the word vector of one item, dot represents the inner product operation, and TFIDF represents the term frequency-inverse document frequency (TF-IDF) weight matrix. The TFIDF value for item i is calculated from D_i, D, A_i and A, where D_i represents the total number of user·days that include item i, D represents the total number of all user·days, A_i represents the total number of users that include item i, and A represents the total number of users;

S504. Cluster similar user·days according to the distance between the sentence vectors of the user·days.

When a language model is actually used for representation learning, each user·day can be treated, by analogy, as a sentence and each item in that day as a word, and sentence-based representation learning is performed. For example, a user·day {item a, item b, item c} means that the three items "item a", "item b" and "item c" occurred during that day; converted into a sentence, it becomes "item a item b item c", which consists of the 3 words "item a", "item b" and "item c". Word-vector-based representation learning then gives the vector representation of each word, that is, the corresponding word vector, and the word vectors of all items in a day are weighted to obtain the sentence vector of the corresponding user·day. Finally, similar user·days are clustered by the distance between sentence vectors. The distance between sentence vectors can be calculated with the Euclidean distance, cosine distance, Manhattan distance, Chebyshev distance, etc. Clustering is based on the calculated distance, that is, user·days with small distances are clustered together. In the embodiment of the present application, the language model may be word2vec, a method that learns the representation of each word from the co-occurrence probability of the other words within a neighboring window of that word.

In addition, when the language model is applied in the embodiments of this application, the items of each day are unordered in the item usage data, so all items in the same day should be regarded as neighbors of one another. Therefore, in practice, the sliding window size is set to the maximum number of items occurring in a single day, and the representation of each item is learned accordingly; for example, the learned representations of "item a", "item b" and "item c" are denoted V_a, V_b and V_c, respectively.

In addition, the embodiment of the present application obtains the sentence vector by word frequency weighting. Compared with directly averaging all word vectors to obtain the final representation, this improves the accuracy of the sentence vector representation and better suits the application scenario of the embodiment.
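The weighting step v_day = dot(V_I, TFIDF) can be sketched as a weighted sum of word vectors. Two things here are assumptions and not taken from the text: the word vectors are hand-written placeholders (in practice they would come from a model such as word2vec), and the weight formula (D_i / D) * log(A / A_i) is only one conventional way to combine the four quantities D_i, D, A_i and A, since the exact formula is not reproduced in this text.

```python
import math


def tfidf_weights(stays):
    """stays: one entry per user; each entry is a list of daily item sets.

    ASSUMPTION: weight_i = (D_i / D) * log(A / A_i). The text defines
    D_i, D, A_i and A but this particular combination is illustrative.
    """
    days = [day for stay in stays for day in stay]
    D, A = len(days), len(stays)
    weights = {}
    for item in {item for day in days for item in day}:
        D_i = sum(1 for day in days if item in day)    # user-days with item
        A_i = sum(1 for stay in stays
                  if any(item in day for day in stay))  # users with item
        weights[item] = (D_i / D) * math.log(A / A_i)
    return weights


def sentence_vector(day_items, word_vectors, weights):
    """Weighted sum of the word vectors of the items in one user-day."""
    dim = len(next(iter(word_vectors.values())))
    v_day = [0.0] * dim
    for item in sorted(day_items):
        for k in range(dim):
            v_day[k] += weights[item] * word_vectors[item][k]
    return v_day


# Hypothetical data: 2 users, 3 user-days, 2-dimensional word vectors.
stays = [[{"a", "b"}, {"b"}], [{"b", "c"}]]
word_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
weights = tfidf_weights(stays)
v_day = sentence_vector({"a", "b"}, word_vectors, weights)
```

Under the assumed formula, an item used by every user (item "b" here) receives weight 0, so it contributes nothing to the sentence vector; this is the usual effect of an inverse-document-frequency term and is the reason a weighted sum can separate user·days better than a plain average.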

S103. Use the cluster cores to represent the medical treatment path of each user, serialize the medical treatment path of each user, then mine frequent sequences from the resulting sequences, and take the frequent sequences as the main clinical paths.

In an embodiment, as shown in FIG. 6, the step S103 includes S601 to S605:

S601. Use different numbers to represent the core of each cluster separately;

For example, use the number 1 to represent the first category, the number 2 to represent the second category,..., and the number x to represent the xth category.

S602. Use the number of the core of the corresponding cluster to represent the user·day in each cluster;

Since the user·days in the item usage matrix have already been clustered in the preceding steps, the corresponding number can be used directly to represent each user·day. For example, if the first user·day belongs to the first category, only the number 1 is needed to represent it; if the second user·day belongs to the third category, only the number 3 is needed.

S603. Perform serialized representation for each user·day after the number is represented to obtain a medical treatment path sequence;

Through the foregoing steps, each user·day is represented by a number, so this step can serialize the numbers in the order of the user·days, that is, in the order of the rows in the item usage matrix, to obtain the medical treatment path sequence.

For example, the medical treatment path sequence is <c1,c2,...,ci,...>, where ci represents the cluster core of the i-th user·day. That is, after clustering, a user's medical treatment path can be represented using the x categories: each user·day is represented by a single number instead of the item set S_i used before clustering.
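Steps S601 to S603 amount to cutting the list of category numbers, which follows the row order of the item usage matrix, into one sequence per user. The sketch below assumes that the number of hospitalization days of each user is known; the function name `paths_from_labels` and the sample values are hypothetical.

```python
def paths_from_labels(labels, stay_lengths):
    """Split row-ordered category numbers into one path sequence per user.

    labels: category number of each user-day, in item usage matrix row order.
    stay_lengths: hospitalization days of each user, in the same user order.
    """
    paths, start = [], 0
    for length in stay_lengths:
        paths.append(labels[start:start + length])
        start += length
    if start != len(labels):
        raise ValueError("stay lengths must sum to the number of user-day rows")
    return paths


# Hypothetical example: user 0 stays 3 days, user 1 stays 4 days (m = 7).
labels = [1, 1, 3, 2, 2, 3, 3]
paths = paths_from_labels(labels, stay_lengths=[3, 4])
```

Each resulting list is one user's medical treatment path sequence, with one category number per hospitalization day.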

S604. Delete consecutive identical elements in the medical treatment path sequence, keeping only one of each run, to obtain a simplified medical treatment path sequence;

For example, if the medical treatment path sequence obtained in step S603 is <1,1,1,3,3,3,3,3,6,8,8,9>, the redundant repeated elements can be deleted so that only one of each run is kept, giving the simplified medical treatment path sequence <1,3,6,8,9>.
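The simplification in S604 is a run-length collapse, which can be written compactly with `itertools.groupby` from the Python standard library. The function name `simplify_path` is an assumption made for this sketch.

```python
from itertools import groupby


def simplify_path(path):
    """Keep one element from every run of consecutive identical elements."""
    return [label for label, _run in groupby(path)]


simplified = simplify_path([1, 1, 1, 3, 3, 3, 3, 3, 6, 8, 8, 9])
```

Only consecutive repeats are removed: a category that reappears later in the stay, as in <1,3,1>, is kept, so the order of phases is preserved.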

Alternatively, the medical treatment path sequence can be left unsimplified, in which case statistics on the duration of each type of user·day can be obtained. In the example "<1,1,1,3,3,3,3,3,6,8,8,9>", the first type of user·day lasts 3 days and the third type lasts 5 days. Such statistics show how long each type of user·day generally lasts. For example, taking the duration that covers 95% of cases as the threshold gives rules such as: the duration of type-e user·days should be less than y days. In actual data, if the duration of type-e user·days exceeds y days, the likelihood of over-treatment or even insurance fraud is considered higher.
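The duration statistics can be sketched in two parts: counting how long each run lasts, and picking a threshold that covers 95% of observed durations. The text does not say how the 95% point is computed, so the nearest-rank percentile used here, the function names and the sample durations are assumptions.

```python
import math
from itertools import groupby


def run_lengths(path):
    """(category, consecutive days) for each run in an unsimplified path."""
    return [(label, sum(1 for _ in run)) for label, run in groupby(path)]


def duration_threshold(durations, coverage=0.95):
    """Smallest observed duration that covers `coverage` of the cases.

    Uses the nearest-rank percentile (an assumed choice).
    """
    ordered = sorted(durations)
    rank = max(1, math.ceil(coverage * len(ordered)))
    return ordered[rank - 1]


runs = run_lengths([1, 1, 1, 3, 3, 3, 3, 3, 6, 8, 8, 9])

# Hypothetical durations of one category collected over 21 hospital stays.
observed = [5] * 19 + [6, 15]
y = duration_threshold(observed)
flagged = [d for d in observed if d > y]
```

With these sample values the 15-day run is the only one exceeding the threshold, which is the kind of case the text suggests reviewing for possible over-treatment.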

S605. Use a sequence mining algorithm to mine frequent sequences from the medical treatment path sequences, and take the frequent sequences as the main clinical paths.

In this step, sequence mining algorithms such as PrefixSpan (prefix-projected pattern mining) can be used to mine frequent sequences.
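A compact PrefixSpan-style miner for sequences of single category labels might look like the following. It is a simplified sketch: each position holds one label (which is what the clustering step produces), gaps between matched elements are allowed, and the function name, the `min_support` parameter and the sample paths are assumptions made for the example.

```python
def prefixspan(sequences, min_support):
    """Return {pattern tuple: support} for all frequent subsequences.

    Support is the number of sequences that contain the pattern in order.
    """
    results = {}

    def mine(prefix, projected):
        # projected: (sequence index, start offset) of each remaining suffix
        counts = {}
        for seq_id, start in projected:
            for item in set(sequences[seq_id][start:]):
                counts[item] = counts.get(item, 0) + 1
        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support
            # Project each suffix past the first occurrence of `item`.
            next_projected = []
            for seq_id, start in projected:
                seq = sequences[seq_id]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        next_projected.append((seq_id, pos + 1))
                        break
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Hypothetical simplified path sequences of four users.
paths = [["a", "b", "d", "e"], ["a", "b"], ["a", "e"], ["b", "d", "e"]]
frequent = prefixspan(paths, min_support=2)
```

Because every user·day has already been reduced to one label, the miner works on plain sequences and does not need to handle itemsets inside each element, which keeps the search space small.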

Through clustering, this application classifies the m user·days into x categories. Mining each category of user·days separately reveals the specific behavior of that category; mapping this understanding back to the frequent sequences makes the frequent sequences interpretable. These frequent sequences are the main clinical paths mined from the data.

Take a major operation as an example. First, the daily data of all users who underwent the operation are obtained. Clustering by user·day yields 5 cluster cores, denoted a/b/c/d/e. These five cluster cores are mapped back to each user's medical treatment path sequence, the sequences are simplified, and frequent sequences are mined from them; the frequent sequences that meet the requirements are ab, bde, and ae. Frequent itemset mining is then performed on the user·days belonging to each of the 5 cluster cores. For example, the items that frequently appear in type-a user·days are: hospitalization examination fee, blood routine, urine test, and blood coagulation function test; the items that frequently appear in type-b user·days are: ventilator, anesthesia fee, gauze, surgery fee, blood transfusion fee, etc.; the items that frequently appear in type-d user·days are: nutrition infusion, blood routine, urine test, C-reactive protein determination, etc.; and the items that frequently appear in type-e user·days are: rehabilitation training, antibiotics, etc. a/b/d/e can thus be understood as preoperative preparation/surgery/postoperative examination/postoperative rehabilitation, respectively. Mapping this back to the mined frequent sequences, ab means preoperative preparation followed by surgery, and ae means preoperative preparation followed by postoperative rehabilitation. Through this analysis, the main clinical paths can be obtained from the data, and a basic interpretation of each user's visit trajectory can be made. The embodiment of this application analyzes frequent sequences at the scale of user·days, then performs frequent item mining within each type of user·day to understand each type, and finally maps that understanding back to the users' frequent sequences to gain a comprehensive understanding.
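The per-category analysis can be approximated by counting, for each category, the share of its user·days in which each item appears. This is a deliberately simplified stand-in for frequent itemset mining (it reports single items only); the function name, the `min_ratio` parameter and the sample data are assumptions.

```python
from collections import Counter, defaultdict


def frequent_items_by_cluster(day_items, labels, min_ratio):
    """Items appearing in at least `min_ratio` of the user-days of a category.

    day_items: item set of each user-day; labels: its category label.
    """
    totals = Counter(labels)
    counts = defaultdict(Counter)
    for items, label in zip(day_items, labels):
        counts[label].update(set(items))
    return {
        label: sorted(item for item, c in counts[label].items()
                      if c / totals[label] >= min_ratio)
        for label in totals
    }


# Hypothetical user-days: two of category "a", two of category "b".
day_items = [
    {"blood routine", "urine test"},
    {"blood routine", "coagulation test"},
    {"surgery fee", "anesthesia fee"},
    {"surgery fee", "blood transfusion fee"},
]
labels = ["a", "a", "b", "b"]
frequent_items = frequent_items_by_cluster(day_items, labels, min_ratio=1.0)
```

Reading the result next to a mined sequence such as ab gives the interpretation described in the text: category a looks like preoperative testing and category b like the day of surgery.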

Based on the above analysis, temporal association rules can also be obtained. From the main clinical paths, the temporal relationship of user visits is obtained, that is, a certain type of user·day must occur before another type of user·day. In addition, frequent itemset mining on each type of user·day gives the items that frequently appear in that type. Combining the two yields a temporal association rule such as "item 1, which frequently appears in type-a user·days, must occur before item 2, which frequently appears in type-b user·days". For example, for a major operation, the rule "a preoperative coagulation function test must occur before an intraoperative blood transfusion" can be obtained and applied to actual quality control: if a patient has an intraoperative blood transfusion, the above test must have been done before the operation.

In actual scenarios, the data of a single user·day is very complicated: the user·days are ordered, while the items within each user·day are unordered. With existing frequent sequence mining methods, the amount of computation required is very large. The embodiment of this application solves the problem of mining sequential patterns that contain multiple item sets. All user·days of all users are clustered according to the item set appearing in each row (in the business scenario, one row of data corresponds to the items used by one user in one day). After clustering, the category number of each category represents the user·day, so that each row is replaced by a number and one hospitalization can be represented by a sequence of category numbers, which allows frequent sequences to be mined quickly.

Therefore, the core of the multi-scale clinical path mining scheme proposed here is to classify hospitalization days, so that not only frequent paths can be mined from hospitalization data, but also the frequent itemsets of each type of day; this cannot be achieved by directly applying existing pattern mining techniques.

Please refer to FIG. 7. FIG. 7 is a schematic block diagram of a multi-scale clinical path mining device according to an embodiment of the application. The multi-scale clinical path mining device 700 includes:

The conversion unit 701 is configured to convert the data on items used by multiple users each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the total number of hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

The clustering unit 702 is configured to use each row in the item usage matrix as a user·day, and cluster similar users·days according to the similarity between the users·days;

The mining unit 703 is configured to use the cores of the clusters to represent the medical treatment path of each user, serialize the medical treatment path of each user, mine frequent sequences from the resulting medical treatment path sequences, and use the frequent sequences as the main clinical paths.

In an embodiment, as shown in FIG. 8, the conversion unit 701 includes:

The item usage matrix construction unit 801 is used to construct an item usage matrix in advance, wherein the item usage matrix has m rows and n columns;

The obtaining unit 802 is used to obtain the items used by each user every day;

The filling unit 803 is used to fill each row element of the item usage matrix according to the items used by each user every day.
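The three sub-units above (construct, obtain, fill) can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be a list of hypothetical `(user_id, day, item)` records, and the matrix entries are taken to be 0/1 indicators, which the description does not specify.

```python
def build_item_usage_matrix(records):
    """records: iterable of (user_id, day, item) tuples.
    Returns (row_keys, items, matrix) where matrix has m rows (user-days)
    and n columns (items), with 1 where the item was used on that user-day."""
    records = list(records)
    row_keys = sorted({(user, day) for user, day, _ in records})
    items = sorted({item for _, _, item in records})
    row_index = {key: i for i, key in enumerate(row_keys)}
    col_index = {item: j for j, item in enumerate(items)}
    # Pre-build the m*n matrix, then fill it row by row.
    matrix = [[0] * len(items) for _ in row_keys]
    for user, day, item in records:
        matrix[row_index[(user, day)]][col_index[item]] = 1
    return row_keys, items, matrix

# Hypothetical usage records.
records = [
    ("u1", 1, "blood routine"),
    ("u1", 2, "surgery fee"),
    ("u2", 1, "blood routine"),
    ("u2", 1, "urine test"),
]
row_keys, items, matrix = build_item_usage_matrix(records)
```

Here m is 3 (two days for u1, one for u2) and n is 3, matching the definition of m as the total number of hospitalization days over all users.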

In an embodiment, as shown in FIG. 9, the clustering unit 702 includes:

The distance matrix construction unit 901 is configured to calculate the similarity between the user·days according to the item usage matrix, construct a distance matrix of the user·days according to those similarities, and denote the distance matrix as m*m;

The distance matrix clustering unit 902 is configured to cluster similar users·days according to the distance matrix.

In an embodiment, as shown in FIG. 10, the distance matrix construction unit 901 includes:

The extraction unit 1001 is used to extract data of each row from the item usage matrix;

The similarity calculation unit 1002 is configured to sequentially calculate the similarity between each row of data and all rows of data;

The arranging unit 1003 is configured to arrange the calculated similarities in order to construct the distance matrix, and denote the distance matrix as m*m, where the element d_ij in the i-th row and j-th column of the distance matrix represents the distance between the i-th user·day and the j-th user·day.
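A minimal sketch of the distance matrix construction is given below. The choice of similarity measure is an assumption: Jaccard similarity on binary rows is used here purely for illustration, with distance defined as one minus similarity, and the function names are hypothetical.

```python
def jaccard_distance(row_a, row_b):
    """1 - |A and B| / |A or B| for two binary item-usage rows."""
    inter = sum(1 for a, b in zip(row_a, row_b) if a and b)
    union = sum(1 for a, b in zip(row_a, row_b) if a or b)
    return 0.0 if union == 0 else 1.0 - inter / union

def distance_matrix(matrix):
    """Return the m*m matrix whose element d_ij is the distance between
    the i-th and the j-th user-day (rows of the item usage matrix)."""
    m = len(matrix)
    return [[jaccard_distance(matrix[i], matrix[j]) for j in range(m)]
            for i in range(m)]

# Hypothetical 3*3 item usage matrix.
usage = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
d = distance_matrix(usage)
```

Computing every pair costs on the order of m*m*n operations, so for large m a practical implementation would likely exploit the symmetry d_ij = d_ji.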

In an embodiment, the distance matrix clustering unit 902 includes:

The hierarchical clustering unit is used to cluster the two closest elements in the distance matrix into one class by using hierarchical clustering, and traverse all the elements to realize global clustering.
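The merge-the-closest-pair procedure can be sketched as a small agglomerative routine. The linkage criterion (single linkage) and the stopping condition (a target number of clusters, `n_clusters`) are assumptions, since the description only says that the two closest elements are merged and all elements are traversed.

```python
def agglomerative_clusters(dist, n_clusters):
    """dist: symmetric m*m distance matrix. Repeatedly merges the two closest
    clusters (single linkage) until n_clusters remain; returns one label per row."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                gap = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or gap < best[0]:
                    best = (gap, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Hypothetical distances: rows 0 and 1 are close, rows 2 and 3 are close.
dist = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.85, 0.9],
    [0.9, 0.85, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
cluster_labels = agglomerative_clusters(dist, n_clusters=2)
```

This naive version rescans all pairs at every merge; it is meant to show the idea, not to scale to a full m*m matrix of hospital data.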

In an embodiment, as shown in FIG. 11, the clustering unit 702 includes:

The word extracting unit 1101 is configured to obtain the items used by each user·day, and use the obtained items as words;

The word vector representation unit 1102 is configured to perform vector representation on all words in each user·day through word vector-based representation learning to obtain corresponding word vectors;

The word frequency weighting unit 1103 is used to weight the word vectors of all words in each user·day by word frequency weighting to obtain the sentence vector of each user·day, wherein the calculation formula for the word frequency weighting is: v_day = dot(V_I, TFIDF), where v_day represents the sentence vector of the user·day, V_I represents the matrix representation of the items in that user·day, I is the set of items in the user·day, each row V_i of V_I is the word vector of one item, dot represents the inner product operation, and TFIDF represents the word frequency-article specificity matrix; the TFIDF of item i is calculated as:

The distance clustering unit 1104 is configured to cluster similar users·days according to the distance between the sentence vectors of each user·day.
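The weighting v_day = dot(V_I, TFIDF) can be sketched as a weighted sum of the item word vectors. In this illustration the per-item TFIDF weights are passed in as given numbers, because the formula that produces them is not reproduced in this text; the function name, the two-dimensional vectors, and the weight values are all hypothetical.

```python
def sentence_vector(word_vectors, weights, items):
    """Compute v_day as the sum over items i in the user-day of weights[i] * V_i.
    word_vectors: {item: list of floats}; weights: {item: float}; items: items of one user-day."""
    items = list(items)
    dim = len(word_vectors[items[0]])
    v_day = [0.0] * dim
    for item in items:
        w = weights[item]
        for k, x in enumerate(word_vectors[item]):
            v_day[k] += w * x
    return v_day

# Hypothetical word vectors and precomputed weights.
word_vectors = {"blood routine": [1.0, 0.0], "urine test": [0.0, 2.0]}
weights = {"blood routine": 0.5, "urine test": 0.25}
v = sentence_vector(word_vectors, weights, ["blood routine", "urine test"])
```

Because every user·day is mapped to a vector of the same dimension regardless of how many items it contains, ordinary vector distances can then be used by the distance clustering unit.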

In an embodiment, as shown in FIG. 12, the mining unit 703 includes:

The core representation unit 1201 is used to separately represent the cores of each cluster using different numbers;

The number representation unit 1202 is used to represent the user·day in each cluster using the number of the core of the corresponding cluster;

The serialized representation unit 1203 is used to serialize and represent each user·day after the number is represented to obtain a medical treatment path sequence;

The simplification unit 1204 is configured to delete consecutive identical elements in the medical treatment path sequence and retain only one of them to obtain a simplified medical treatment path sequence;

The sequence mining unit 1205 is configured to use a sequence mining algorithm to dig out frequent sequences from the medical treatment path sequence, and use the frequent sequence as the main clinical path.
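The simplification and mining steps can be sketched as follows. The description does not name a specific sequence mining algorithm, so the brute-force enumeration of ordered (not necessarily contiguous) subsequences below is an assumption chosen for brevity; `min_support` (an absolute count of paths) and `max_len` are hypothetical parameters.

```python
from collections import Counter
from itertools import combinations, groupby

def simplify(sequence):
    """Collapse runs of consecutive identical elements, keeping one of each run."""
    return [key for key, _ in groupby(sequence)]

def frequent_subsequences(paths, min_support, max_len=3):
    """Count, for every ordered subsequence of length 2..max_len, the number of
    paths containing it, and keep those found in at least min_support paths."""
    counts = Counter()
    for path in paths:
        seen = set()
        for length in range(2, max_len + 1):
            for idx in combinations(range(len(path)), length):
                seen.add(tuple(path[i] for i in idx))
        counts.update(seen)  # each path contributes at most once per subsequence
    return {seq: c for seq, c in counts.items() if c >= min_support}

# Hypothetical medical treatment path sequences using the a/b/d/e labels.
raw_paths = [list("aabbde"), list("abde"), list("aae")]
paths = [simplify(p) for p in raw_paths]
frequent_paths = frequent_subsequences(paths, min_support=3)
```

The enumeration grows quickly with path length, which is one reason simplifying the sequences first matters; a production system would more plausibly use an established sequential pattern mining algorithm.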

The specific content of the foregoing device embodiment corresponds to the specific content of the foregoing method embodiment one-to-one. For the specific implementation details of the foregoing device embodiment, reference may be made to the description of the method embodiment, which will not be repeated here.

The device provided by the embodiment of the application can realize pattern mining of time-series clinical data and obtain the real clinical paths from the data, which better reflects the rationality and variability of actual clinical operation; through the serialized representation, it also solves the problem of high time and space complexity caused by too many unordered itemsets.

The above-mentioned multi-scale clinical path mining device 700 can be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 13.

Please refer to FIG. 13, which is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 1300 is a server, and the server may be an independent server or a server cluster composed of multiple servers.

Referring to FIG. 13, the computer device 1300 includes a processor 1302, a memory, and a network interface 1305 connected through a system bus 1301, where the memory may include a non-volatile storage medium 1303 and an internal memory 1304.

The non-volatile storage medium 1303 can store an operating system 13031 and a computer program 13032. When the computer program 13032 is executed, the processor 1302 can execute the multi-scale clinical path mining method.

The processor 1302 is used to provide computing and control capabilities, and support the operation of the entire computer device 1300.

The internal memory 1304 provides an environment for the operation of the computer program 13032 in the non-volatile storage medium 1303. When the computer program 13032 is executed by the processor 1302, the processor 1302 can execute the multi-scale clinical path mining method.

The network interface 1305 is used for network communication, such as providing data information transmission. Those skilled in the art can understand that the structure shown in FIG. 13 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device 1300 to which the solution of the present application is applied. The specific computer device 1300 may include more or fewer components than shown in the figure, combine certain components, or have a different component arrangement.

Wherein, the processor 1302 is used to run the computer program 13032 stored in the memory to realize the following functions: convert the data on items used by multiple users each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the total number of hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day; take each row in the item usage matrix as a user·day, and cluster similar user·days according to the similarity between the user·days; use the cores of the clusters to represent the medical treatment path of each user, serialize the medical treatment path of each user, mine frequent sequences from the resulting medical treatment path sequences, and use the frequent sequences as the main clinical paths.

Those skilled in the art can understand that the embodiment of the computer device shown in FIG. 13 does not constitute a limitation on the specific structure of the computer device. In other embodiments, the computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different component arrangement. For example, in some embodiments, the computer device may only include a memory and a processor. In such an embodiment, the structures and functions of the memory and the processor are consistent with the embodiment shown in FIG. 13 and will not be repeated here.

It should be understood that in this embodiment of the application, the processor 1302 may be a central processing unit (Central Processing Unit, CPU), and the processor 1302 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general-purpose processor may be a microprocessor or any conventional processor.

In another embodiment of the present application, a computer-readable storage medium is provided. The computer-readable storage medium may be a non-volatile computer-readable storage medium, or may be a volatile computer-readable storage medium. The computer-readable storage medium stores a computer program, where the computer program, when executed by a processor, implements the following steps: convert the data on items used by multiple users each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the total number of hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day; take each row in the item usage matrix as a user·day, and cluster similar user·days according to the similarity between the user·days; use the cores of the clusters to represent the medical treatment path of each user, serialize the medical treatment path of each user, mine frequent sequences from the resulting medical treatment path sequences, and use the frequent sequences as the main clinical paths.

The various embodiments in the specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments can be referred to each other. It should be pointed out that for those of ordinary skill in the art, without departing from the principles of this application, several improvements and modifications can be made to this application, and these improvements and modifications also fall within the protection scope of the claims of this application.

It should also be noted that in this specification, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or sequence between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device including a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to the process, method, article or device. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, method, article or device that includes the element.

Claims

A multi-scale clinical path mining method, which includes:

Convert the data on items used by multiple users each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the total number of hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

Taking each row in the item usage matrix as a user·day, and clustering similar users·days according to the similarity between each user·day;

Use the cores of the clusters to represent the medical treatment path of each user, serialize the medical treatment path of each user, mine frequent sequences from the resulting medical treatment path sequences, and use the frequent sequences as the main clinical paths.
The multi-scale clinical path mining method according to claim 1, wherein said converting the item usage data used by a plurality of users every day into an item usage matrix, and denoting the item usage matrix as m*n, comprises:

Pre-build an item usage matrix, where the number of rows of the item usage matrix is m and the number of columns is n;

Get the items used by each user every day;

Fill in each row element of the item usage matrix according to the items used by each user every day.
The multi-scale clinical path mining method according to claim 1, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Calculate the similarity between the user·days according to the item usage matrix, construct the distance matrix of the user·days according to the similarities between the user·days, and denote the distance matrix as m*m;

Clustering similar users·days according to the distance matrix.
The multi-scale clinical path mining method according to claim 3, wherein calculating the similarity between the user·days according to the item usage matrix, constructing the distance matrix of the user·days according to the similarities between the user·days, and denoting the distance matrix as m*m comprises:

Extract the data of each row from the item usage matrix;

Calculate the similarity between each row of data and all rows of data in order;

Arrange the calculated similarities in order to construct the distance matrix, and denote the distance matrix as m*m, where the element d_ij in the i-th row and j-th column of the distance matrix represents the distance between the i-th user·day and the j-th user·day.
The multi-scale clinical path mining method according to claim 3, wherein the clustering of similar users·days according to the distance matrix comprises:

A hierarchical clustering method is used to cluster the two closest elements in the distance matrix into one category, and all the elements are traversed to achieve global clustering.
The multi-scale clinical path mining method according to claim 1, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Obtain the items used in each user·day, and use the obtained items as words;

Perform vector representation of all words in each user·day through word vector-based representation learning to obtain corresponding word vectors;

Weight the word vectors of all words in each user·day by word frequency weighting to obtain the sentence vector of each user·day, where the calculation formula for the word frequency weighting is: v_day = dot(V_I, TFIDF), where v_day represents the sentence vector of the user·day, V_I represents the matrix representation of the items in that user·day, I is the set of items in the user·day, each row V_i of V_I is the word vector of one item, dot represents the inner product operation, and TFIDF represents the word frequency-article specificity matrix; the TFIDF of item i is calculated as:
where D_i represents the total number of user·days that include item i, D represents the total number of all user·days, A_i represents the total number of users whose records include item i, and A represents the total number of users;

Clustering similar users·days based on the distance between the sentence vectors of each user·day.
The multi-scale clinical path mining method according to claim 1, wherein using the cores of the clusters to represent the medical treatment path of each user, serializing the medical treatment path of each user, mining frequent sequences therefrom, and using the frequent sequences as the main clinical paths comprises:

Use different numbers to represent the core of each cluster separately;

Use the number of the core of the corresponding cluster to represent the user·day in each cluster;

Serialize each user·day after the digital representation to obtain the medical treatment path sequence;

Delete consecutive identical elements in the medical treatment path sequence and keep only one of them to obtain a simplified medical treatment path sequence;

A sequence mining algorithm is used to dig out frequent sequences from the medical path sequence, and use the frequent sequences as the main clinical path.
A multi-scale clinical path mining device, which includes:

The conversion unit is used to convert the data on items used by multiple users each day into an item usage matrix, and denote the item usage matrix as m*n, where m represents the total number of hospitalization days of all the users, n represents the number of all items, and each row in the item usage matrix represents the items used by one user in one day;

A clustering unit, configured to use each row in the item usage matrix as a user·day, and cluster similar users·days according to the similarity between each user·day;

The mining unit is used to use the cores of the clusters to represent the medical treatment path of each user, serialize the medical treatment path of each user, mine frequent sequences from the resulting medical treatment path sequences, and use the frequent sequences as the main clinical paths.
A computer device, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the multi-scale clinical path mining method according to claim 1.
The computer device according to claim 9, wherein said converting the item usage data used by a plurality of users every day into an item usage matrix, and denoting the item usage matrix as m*n, comprises:

Pre-build an item usage matrix, where the number of rows of the item usage matrix is m and the number of columns is n;

Get the items used by each user every day;

Fill in each row element of the item usage matrix according to the items used by each user every day.
The computer device according to claim 9, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Calculate the similarity between the user·days according to the item usage matrix, construct the distance matrix of the user·days according to the similarities between the user·days, and denote the distance matrix as m*m;

Clustering similar users·days according to the distance matrix.
The computer device according to claim 11, wherein calculating the similarity between the user·days according to the item usage matrix, constructing the distance matrix of the user·days according to the similarities between the user·days, and denoting the distance matrix as m*m comprises:

Extract the data of each row from the item usage matrix;

Calculate the similarity between each row of data and all rows of data in order;

Arrange the calculated similarities in order to construct the distance matrix, and denote the distance matrix as m*m, where the element d_ij in the i-th row and j-th column of the distance matrix represents the distance between the i-th user·day and the j-th user·day.
The computer device according to claim 11, wherein the clustering of similar users·days according to the distance matrix comprises:

A hierarchical clustering method is used to cluster the two closest elements in the distance matrix into one category, and all the elements are traversed to achieve global clustering.
The computer device according to claim 9, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Obtain the items used in each user·day, and use the obtained items as words;

Perform vector representation of all words in each user·day through word vector-based representation learning to obtain corresponding word vectors;

Weight the word vectors of all words in each user·day by word frequency weighting to obtain the sentence vector of each user·day, where the calculation formula for the word frequency weighting is: v_day = dot(V_I, TFIDF), where v_day represents the sentence vector of the user·day, V_I represents the matrix representation of the items in that user·day, I is the set of items in the user·day, each row V_i of V_I is the word vector of one item, dot represents the inner product operation, and TFIDF represents the word frequency-article specificity matrix; the TFIDF of item i is calculated as:
where D_i represents the total number of user·days that include item i, D represents the total number of all user·days, A_i represents the total number of users whose records include item i, and A represents the total number of users;

Cluster similar user·days based on the distance between the sentence vectors of the user·days.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to execute the multi-scale clinical path mining method according to claim 1.
The computer-readable storage medium according to claim 15, wherein said converting the item usage data used by a plurality of users every day into an item usage matrix, and denoting the item usage matrix as m*n, comprises:

Pre-build an item usage matrix, where the number of rows of the item usage matrix is m and the number of columns is n;

Get the items used by each user every day;

Fill in each row element of the item usage matrix according to the items used by each user every day.
The computer-readable storage medium according to claim 15, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Calculate the similarity between the user·days according to the item usage matrix, construct the distance matrix of the user·days according to the similarities between the user·days, and denote the distance matrix as m*m;

Clustering similar users·days according to the distance matrix.
The computer-readable storage medium according to claim 17, wherein calculating the similarity between the user·days according to the item usage matrix, constructing the distance matrix of the user·days according to the similarities between the user·days, and denoting the distance matrix as m*m comprises:

Extract the data of each row from the item usage matrix;

Calculate the similarity between each row of data and all rows of data in order;

Arrange the calculated similarities in order to construct the distance matrix, and denote the distance matrix as m*m, where the element d_ij in the i-th row and j-th column of the distance matrix represents the distance between the i-th user·day and the j-th user·day.
The computer-readable storage medium according to claim 17, wherein the clustering of similar users·days according to the distance matrix comprises:

A hierarchical clustering method is used to cluster the two closest elements in the distance matrix into one category, and all the elements are traversed to achieve global clustering.
The computer-readable storage medium according to claim 15, wherein taking each row in the item usage matrix as a user·day and clustering similar user·days according to the similarity between the user·days comprises:

Obtain the items used in each user·day, and use the obtained items as words;

Perform vector representation of all words in each user·day through word vector-based representation learning to obtain corresponding word vectors;

Weight the word vectors of all words in each user·day by word frequency weighting to obtain the sentence vector of each user·day, where the calculation formula for the word frequency weighting is: v_day = dot(V_I, TFIDF), where v_day represents the sentence vector of the user·day, V_I represents the matrix representation of the items in that user·day, I is the set of items in the user·day, each row V_i of V_I is the word vector of one item, dot represents the inner product operation, and TFIDF represents the word frequency-article specificity matrix; the TFIDF of item i is calculated as:
where D_i represents the total number of user·days that include item i, D represents the total number of all user·days, A_i represents the total number of users whose records include item i, and A represents the total number of users;

Clustering similar users·days based on the distance between the sentence vectors of each user·day.