WO2021139116A1 - Method, apparatus and device for intelligently grouping similar patients, and storage medium - Google Patents



Publication number
WO2021139116A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
sample
disease
new patient
mahalanobis distance
Prior art date
Application number
PCT/CN2020/099566
Other languages
French (fr)
Chinese (zh)
Other versions
WO2021139116A9 (en
Inventor
廖希洋
马凯宁
欧秋雨
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021139116A1
Publication of WO2021139116A9

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/232: Non-hierarchical techniques
    • G06F18/2321: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213: Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks

Definitions

  • This application relates to the field of database technology, and in particular to a method, apparatus, device and storage medium for intelligently grouping similar patients.
  • The inventors recognized that most medical decision-making for new patients based on sample (historical patient) data relies only on continuous data, such as test indicators and age, to obtain subgroups whose clinical outcomes differ widely. Such approaches cannot make full use of the information doctors consider when making decisions, and therefore cannot support fast and accurate medical decisions.
  • The main purpose of this application is to solve the technical problem of how to intelligently group similar patients.
  • A first aspect of the present application provides a method for intelligently grouping similar patients, which includes: acquiring new patient data to be matched, the new patient data including multiple items of disease feature data; vectorizing each item of disease feature data to obtain the disease feature word vector corresponding to the new patient; based on the disease feature word vector, calculating the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and, based on the sorting result, determining the disease information group that matches the new patient data, where the disease information groups each contain different clinical outcome information.
  • A second aspect of the present application provides a device for intelligently grouping similar patients, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring new patient data to be matched, the new patient data containing multiple items of disease feature data; vectorizing each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient; based on the disease feature word vector, calculating the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and, based on the sorting result, determining the disease information group corresponding to the new patient data, where the disease information groups each contain different clinical outcome information.
  • A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: acquiring new patient data to be matched, the new patient data containing multiple items of disease feature data; vectorizing each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient; based on the disease feature word vector, calculating the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and, based on the sorting result, determining the disease information group corresponding to the new patient data, where the disease information groups each contain different clinical outcome information.
  • A fourth aspect of the present application provides an apparatus for intelligently grouping similar patients, including: a first acquisition module, used to acquire new patient data to be matched, the new patient data including multiple items of disease feature data; a first processing module, used to vectorize each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient; a first calculation module, used to calculate, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; a sorting module, used to sort the Mahalanobis distances to obtain a sorting result; and a determining module, used to determine, based on the sorting result, the disease information group corresponding to the new patient data, where the disease information groups each contain different clinical outcome information.
  • In this application, new patient data to be matched is acquired, the new patient data containing multiple items of disease feature data; each item of disease feature data of the new patient is vectorized to obtain the corresponding disease feature word vector; the Mahalanobis distance between the new patient data and each item of historical patient data in the preset disease feature database is calculated, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; the Mahalanobis distances are sorted to obtain a sorting result; and, based on the sorting result, the disease information group that matches the new patient data is determined. The disease information groups each contain different clinical outcome information.
  • FIG. 1 is a schematic diagram of a first embodiment of a method for intelligent grouping of similar patients in an embodiment of this application;
  • FIG. 2 is a schematic diagram of a second embodiment of a method for intelligent grouping of similar patients in an embodiment of this application;
  • FIG. 3 is a schematic diagram of a third embodiment of a method for intelligent grouping of similar patients in an embodiment of this application;
  • FIG. 4 is a schematic diagram of a first embodiment of an intelligent grouping device for similar patients in an embodiment of this application;
  • FIG. 5 is a schematic diagram of a second embodiment of an intelligent grouping device for similar patients in an embodiment of this application;
  • FIG. 6 is a schematic diagram of an embodiment of an intelligent grouping device for similar patients in an embodiment of this application.
  • The embodiments of the present application provide a method, apparatus, device and storage medium for intelligently grouping similar patients. To determine the disease group to which a new patient belongs, the new patient data is acquired, the Mahalanobis distance between the new patient data and each sample (patient) in each preset disease group is calculated, and the disease group to which the new patient data belongs is determined from the values of the Mahalanobis distances.
  • This solution belongs to the field of smart medical care and can contribute to the construction of smart cities. The application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions.
  • Once the disease group to which the new patient belongs has been determined, the characteristics of that group and other information help the doctor make decisions, which improves both the efficiency of assigning patients to groups and the accuracy of doctors' decisions.
  • An embodiment of the method for intelligent grouping of similar patients in the embodiments of the present application includes:
  • The new patient data to be matched is the data of a patient currently being treated, for whom the doctor needs to draw on information from previous patients to make medical decisions. It contains both the new patient's personal information and information on the symptoms and characteristics of the disease, including gender, age, name, physical examination indicators, examination results, past medical history and other data. For example: Zhang San, male, Han nationality, age 25, a 10-year history of hepatitis B; chief complaint: frequent fatigue, lack of physical strength, lower extremity edema, insomnia with frequent dreams, upper abdominal discomfort, abdominal distension, yellowing of the skin and urine, dark urine, etc.
  • Matching in this embodiment refers to matching the disease and symptoms of the new patient against the symptoms of previous patients.
  • Because the collected new patient data includes not only continuous data such as test indicators and age, but also discrete or text data such as gender and examination results, the type of the collected data needs to be checked.
  • The data is then vectorized to obtain the corresponding vectorized new patient data. For example, if the new patient data is a mixture of text data, discrete data and continuous data, the word vector method from natural language processing is used to apply one-hot encoding to the text data and discrete data, yielding vectorized data.
  • Continuous data requires no standardization or normalization preprocessing; feature data of this type can be used directly.
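The preprocessing described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the field names, the category lists and the `alt` test indicator are hypothetical, and text fields are assumed to have already been reduced to a fixed set of categories.

```python
# Hypothetical field names and category lists, chosen only for illustration.
CATEGORIES = {
    "gender": ["female", "male"],
    "hbv_history": ["no", "yes"],
}
CONTINUOUS = ["age", "alt"]  # continuous fields are used as-is, with no scaling


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if value == c else 0.0 for c in categories]


def vectorize(record):
    """Concatenate one-hot codes for discrete fields with raw continuous values."""
    vector = []
    for field, categories in CATEGORIES.items():
        vector.extend(one_hot(record[field], categories))
    for field in CONTINUOUS:
        vector.append(float(record[field]))
    return vector


new_patient = {"gender": "male", "hbv_history": "yes", "age": 25, "alt": 88.0}
patient_vector = vectorize(new_patient)
print(patient_vector)
```

Keeping the category lists fixed ensures that every patient vector has the same length and ordering, which the distance calculation in later steps depends on.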
  • Based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database is calculated, where the disease feature database includes multiple disease information groups and similar disease features belong to the same disease information group;
  • The Mahalanobis distance between the new patient data and each item of historical patient data in the preset disease feature database is calculated from the patient feature word vector generated by the vectorization.
  • For example, suppose the preset disease feature database contains 7 disease information groups, A, B, C, D, E, F and G, each with n samples (patients): A(a1,a2,a3...an), B(b1,b2,b3...bn), C(c1,c2,c3...cn), D(d1,d2,d3...dn), E(e1,e2,e3...en), F(f1,f2,f3...fn), G(g1,g2,g3...gn). The Mahalanobis distance between the new patient data and each sample (patient) in each of the 7 disease information groups is then calculated.
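A minimal NumPy sketch of this calculation is given below. It assumes that the covariance matrix is estimated from the historical patient vectors and that a pseudo-inverse is acceptable when that matrix is singular; neither choice is stated in the application, and the function name and the numbers are made up for illustration.

```python
import numpy as np


def mahalanobis_to_each(new_vec, history):
    """Mahalanobis distance from new_vec to every row of history.

    Assumption of this sketch: the covariance is estimated from `history`,
    and a pseudo-inverse is used so a singular covariance does not raise.
    """
    history = np.asarray(history, dtype=float)
    cov = np.cov(history, rowvar=False)
    cov_inv = np.linalg.pinv(np.atleast_2d(cov))
    diff = history - np.asarray(new_vec, dtype=float)
    # Row-wise quadratic form: diff_i^T * cov_inv * diff_i
    sq = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return np.sqrt(np.maximum(sq, 0.0))


# Made-up vectors with two continuous features per patient
history = np.array([[50.0, 7.1], [62.0, 8.4], [45.0, 6.5], [70.0, 9.0], [55.0, 7.0]])
new_patient = np.array([50.0, 7.1])
distances = mahalanobis_to_each(new_patient, history)
print(distances.round(3))
```

The first historical row equals the new patient vector, so its distance is zero; all other rows get a positive distance.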
  • The disease feature database can be understood as a database holding a large amount of patient data, covering multiple groups of one disease that differ in outcome, for example diabetes with nephropathy, diabetes with hypertension, or diabetes with HbA1c meeting the target.
  • Each disease information group contains data information of a certain number of patients with this type of clinical outcome.
  • The calculated Mahalanobis distances between the new patient data and each item of historical patient data in the preset disease feature database are sorted by value to obtain the sorting result.
  • The ranking can run from largest to smallest or from smallest to largest. The Mahalanobis distance between a pair of patients with similar outcomes is much smaller than that between a pair of patients with dissimilar outcomes.
  • the disease information group refers to a specific disease group, which contains a certain number of samples (patients) of this type of disease.
  • Each sample (patient) in a disease information group carries personal information, disease characteristics, disease development and outcome for the whole course of the disease, for example current medical history, past medical history, recent medications, family history, physical examination and outcome.
  • The smaller the Mahalanobis distance between the new patient data and a sample (patient), the more similar the outcomes of the two patients and the more likely they belong to the same disease information group. The disease information group corresponding to the new patient data can therefore be determined from the sorting result of the Mahalanobis distances.
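The text does not spell out how the sorting result is turned into a group. One possible reading, sketched here as an assumption, is to sort in ascending order and take a majority vote among the few closest historical patients; `match_group`, `top_n` and the sample values are hypothetical.

```python
from collections import Counter


def match_group(distances, groups, top_n=3):
    """Sort (distance, group) pairs ascending and vote among the top_n closest."""
    ranked = sorted(zip(distances, groups), key=lambda pair: pair[0])
    votes = Counter(group for _, group in ranked[:top_n])
    # Ties are resolved in favour of the group encountered first,
    # which here means the group with the smaller distance.
    return votes.most_common(1)[0][0], ranked


# Made-up distances and the group label of each historical patient
distances = [2.4, 0.3, 1.9, 0.5, 3.1, 0.9]
groups = ["A", "B", "A", "B", "C", "C"]
best_group, ranked = match_group(distances, groups)
print(best_group)
```

With `top_n=1` this reduces to assigning the new patient to the group of the single nearest historical patient.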
  • Mahalanobis distance is used to measure the similarity between two data samples while taking the covariance of the data into account. The smaller the Mahalanobis distance between two samples, the higher their similarity.
  • the execution subject of this application may be an intelligent grouping device for similar patients, and may also be a terminal or a server, which is not specifically limited here.
  • the embodiment of the present application takes the server as the execution subject as an example for description.
  • The Mahalanobis distance between the new patient data and each sample (patient) in each preset disease group is calculated, and the disease group to which the new patient data belongs is determined from the values of these distances.
  • This solution belongs to the field of smart medical care and can contribute to the construction of smart cities.
  • The application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. Once the disease group of the new patient has been determined from the Mahalanobis distance, the characteristics of that group and other information help the doctor make decisions, which improves both the efficiency of assigning patients to groups and the accuracy of doctors' decisions.
  • Referring to FIG. 2, another embodiment of the method for intelligent grouping of similar patients in the embodiments of the present application includes:
  • The outcome variable is the outcome of interest for a given disease. For a cold, the outcome of interest is whether it is cured; for type 2 diabetes, it is whether glycated hemoglobin meets the target.
  • Sample data containing outcome variables refers to the data of patients who have received treatment and whose treatment has ended.
  • A large amount of historical patient data containing outcome variables is obtained as sample data from the hospital's electronic medical records and other channels, and the type of the sample data is determined. Examples include basic information such as the patient's name, age and blood type, the chief complaint, past medical history, family history, physical examination, medication information, and outcome (whether cured).
  • the sample is preprocessed according to the type of sample data.
  • Discrete data or text data can be vectorized to obtain data in the form of discrete word vectors.
  • The new patient data includes not only continuous data such as test indicators and age, but also discrete or text data such as gender and examination results.
  • Because discrete data and text data must be vectorized into discrete word vector form before they can be used, the type of the new patient data must be determined first.
  • the vectorization processing method includes:
  • vectorization processing is performed on the modified data.
  • Text data refers to characters that cannot take part in arithmetic operations, also called character data, such as gender and examination results.
  • Vectorization converts words into a distributed representation, also known as word vectors, so that a notion of "distance" exists between words and more information is carried.
  • Discrete data is vectorized in the same manner to form discrete word vectors.
  • If the new patient data is continuous data, no standardization or normalization preprocessing is needed and the data can be used directly.
  • Continuous data, also known as continuous variables, is a statistical concept. It refers to data that can take any value within an interval: the values are continuous, and the gap between two adjacent values can be divided without limit (that is, there are infinitely many possible values). For example, the dimensions of manufactured parts and the height, weight and chest circumference of a person are continuous data, and their values can only be obtained by measurement.
  • Different data types are processed differently: continuous data can be used directly without processing, while text data or discrete data must be vectorized first. The vectorization processing that corresponds to the sample data must therefore be determined.
  • Patient feature word vector refers to data in the form of a word vector containing patient information and characteristics.
  • the Mahalanobis distance between each sample (patient) data in the sample data is calculated according to the discretized word vector.
  • The Mahalanobis distance is an effective way to compute the distance between a sample and the "center of gravity" of a sample set, or the similarity between two unknown sample sets. It takes the relationships between features into account, removes the interference of correlation between variables, and is scale-independent, that is, independent of the measurement scale. When the covariance matrix Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. In summary, the Mahalanobis distance conveniently measures the distance between an observed sample and a known sample set, which also makes it well suited to fault diagnosis.
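For reference, the usual definition of the Mahalanobis distance between two feature vectors is shown below, writing Σ for the covariance matrix of the data. The symbols x, y and d_M are notation chosen here, not taken from the application.

```latex
d_M(x, y) = \sqrt{(x - y)^{\top} \, \Sigma^{-1} \, (x - y)}
\qquad\text{and, for } \Sigma = I:\qquad
d_M(x, y) = \sqrt{(x - y)^{\top} (x - y)} = \lVert x - y \rVert_2
```

The second expression is the Euclidean special case mentioned in the text.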
  • Clustering is a special classification process that divides sample data, for which prior knowledge is insufficient and uncertain, into several classes. Data records with a greater degree of similarity are placed in the same group, and the dissimilarity between data records in different groups is maximized. It is a statistical analysis method for studying (sample or index) classification problems.
  • the cluster generated by clustering is a collection of a set of data objects. These objects are similar to objects in the same cluster and different from objects in other clusters.
  • The sample data is clustered according to the Mahalanobis distance between the samples in the sample data to determine the clustering result.
  • For example, if the sample data contains n samples (patients) M1, M2, M3, ..., Mn, the Mahalanobis distance between each pair of samples is calculated, and the sample data is clustered according to these distances to obtain the clustering result, that is, multiple sample groups.
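The pairwise distances between the n samples can be collected in an n × n matrix. The sketch below is one way to compute it with NumPy; it assumes the covariance is estimated from the sample data itself, and the function name and random data are illustrative only.

```python
import numpy as np


def pairwise_mahalanobis(samples):
    """Return the n x n matrix of Mahalanobis distances between all sample pairs."""
    samples = np.asarray(samples, dtype=float)
    # Assumption: covariance estimated from the samples; pinv tolerates singularity
    cov_inv = np.linalg.pinv(np.atleast_2d(np.cov(samples, rowvar=False)))
    diff = samples[:, None, :] - samples[None, :, :]          # shape (n, n, d)
    sq = np.einsum("abi,ij,abj->ab", diff, cov_inv, diff)      # squared distances
    return np.sqrt(np.maximum(sq, 0.0))


samples = np.random.default_rng(1).normal(size=(6, 3))  # made-up patient vectors
dist_matrix = pairwise_mahalanobis(samples)
print(dist_matrix.shape)
```

The matrix is symmetric with a zero diagonal, so a clustering routine only needs its upper triangle.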
  • each disease information group contains different clinical outcome information corresponding to a certain disease, for example, according to 500 samples in the sample data (Patient)
  • Extract the characteristics of samples (patients) in each disease information group such as demographic characteristics, inspection and detection characteristics, etc., and describe these characteristics.
  • The feature distribution of a disease information group in this embodiment describes how the sample data in the group is distributed. For example, in a given group the average age of the samples (patients) is 50 years and males account for 70%.
  • Multiple disease information groups are obtained from the grouping result, and the features of each group are extracted. These features include, but are not limited to, the gender ratio of the group, age distribution, inspection data, disease features, disease progression, current medical history and past medical history.
  • As an analogy, consider extracting the features of an iris flower data set: the data set contains 4 features, namely sepal length, sepal width, petal length and petal width, in centimeters.
  • the characteristic refers to the characteristic information specific to a certain disease, such as the gender distribution of the population, the distribution characteristic of the inspection data, the characteristic of the disease, and the characteristic of the disease development process.
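A small sketch of how such a feature distribution might be summarized for one group is shown below. Only mean age and male share are computed, matching the example in the text; the record layout and the function name `describe_group` are assumptions.

```python
from statistics import mean


def describe_group(patients):
    """Return mean age and male share for one disease information group."""
    ages = [p["age"] for p in patients]
    males = sum(1 for p in patients if p["gender"] == "male")
    return {"mean_age": mean(ages), "male_share": males / len(patients)}


# Made-up members of a single group
group = [
    {"age": 48, "gender": "male"},
    {"age": 55, "gender": "male"},
    {"age": 47, "gender": "female"},
    {"age": 50, "gender": "male"},
]
summary = describe_group(group)
print(summary)
```

Further descriptors, such as the distribution of inspection data, could be added to the returned dictionary in the same way.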
  • Based on the feature distribution information of the disease information group, the preset disease description database is queried to determine the data of the corresponding disease, which helps doctors make more accurate medical decisions.
  • the disease description database is obtained based on a large number of disease medical records in the hospital, including a large number of disease characteristics of patients of different ages corresponding to the type of disease, disease development, disease medication treatment process, and the final development trend of the disease .
  • judge the disease characteristics of the new patient based on the complaint of the new patient and the diagnosed disease, use the disease characteristics as a key, and query from the preset disease description database to determine the disease type of the new patient .
  • For example, the patient's condition features polyuria, polydipsia and polyphagia, together with severe weight loss over a short period and edema of the lower limbs.
  • Using these features, the disease that best matches the new patient can be queried from the preset disease description database, so as to determine the best-matching disease type and help doctors make more accurate medical diagnoses.
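The lookup is described only at a high level. The sketch below substitutes a simple Jaccard overlap score between feature sets, which is not a technique named in the application. The database entries are placeholders (`type_X`, `type_Y`) with made-up feature sets and carry no medical meaning.

```python
# Placeholder entries; a real description database would be built from medical records.
DESCRIPTION_DB = {
    "type_X": {"polyuria", "polydipsia", "polyphagia", "weight loss", "lower limb edema"},
    "type_Y": {"feature_1", "feature_2"},
}


def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(features, database):
    """Return the entry whose feature set overlaps most with `features`."""
    return max(database, key=lambda name: jaccard(features, database[name]))


patient_features = {"polyuria", "polydipsia", "polyphagia", "weight loss", "lower limb edema"}
matched_type = best_match(patient_features, DESCRIPTION_DB)
print(matched_type)
```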
  • The Mahalanobis distance between the new patient data and each sample (patient) in each preset disease group is calculated, and the disease group to which the new patient data belongs is determined from the values of these distances.
  • This solution belongs to the field of smart medical care and can contribute to the construction of smart cities.
  • The application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. Once the disease group of the new patient has been determined from the Mahalanobis distance, the characteristics of that group and other information help the doctor make decisions, which improves both the efficiency of assigning patients to groups and the accuracy of doctors' decisions.
  • the third embodiment of the method for intelligent grouping of similar patients in the embodiment of the present application includes:
  • In the neural network, dividing the input sample data into different parts according to their characteristics is called clustering, and the clustering center is the center of a cluster.
  • clustering refers to the process of dividing a collection of physical or abstract objects into multiple classes composed of similar objects, and is a statistical analysis method for studying (sample or index) classification problems.
  • the cluster generated by clustering is a collection of a set of data objects. These objects are similar to objects in the same cluster and different from objects in other clusters.
  • The samples (diabetic patients) in the sample data are divided into k groups containing different clinical outcome information, and k samples are randomly selected from the sample data as cluster centers.
  • all the data in the sample data can be divided into A, B, C, D, E, F, G, 7 disease information groups, representing 7 different outcome information of diabetes, of which A, B, C , D, E, F, G are the cluster centers of these 7 disease information groups.
  • The determination of the cluster centers is divided into an initial case and a non-initial case.
  • In the initial case, k samples are randomly selected from the sample data as the initial cluster centers.
  • The Mahalanobis distance from each sample in the sample data to each cluster center is calculated. For example, if the sample data contains N samples, the Mahalanobis distances between each of N1, N2, N3, ..., NN and the 7 initial cluster centers A, B, C, D, E, F and G are calculated, where the distances between N1 and the 7 initial cluster centers are a1, b1, c1, d1, e1, f1 and g1 respectively.
  • the minimum Mahalanobis distance corresponding to each sample is selected, and each sample is divided into the cluster corresponding to the minimum Mahalanobis distance.
  • The first grouping result is then generated. For example, suppose all the data in the sample data can be divided into 7 groups with different clinical outcome information for a given disease, with cluster centers A, B, C, D, E, F and G, and that the sample data contains N samples. The Mahalanobis distances between each of N1, N2, N3, ..., NN and the 7 initial cluster centers are calculated. If the distances between N1 and the 7 centers are a1, b1, c1, d1, e1, f1 and g1 respectively, and a1 is the smallest, N1 is assigned to the clinical outcome information group of cluster center A. The remaining samples are assigned in the same way until all N samples have been divided, which yields the first clustering result.
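The initial selection of centers and the first assignment can be sketched as follows. The data is random and illustrative, the covariance is assumed to be estimated once from all samples, and `assign_to_centers` is a hypothetical helper name.

```python
import numpy as np


def assign_to_centers(samples, centers, cov_inv):
    """Index of the closest center for every sample under Mahalanobis distance."""
    diff = samples[:, None, :] - centers[None, :, :]            # shape (n, k, d)
    sq = np.einsum("nki,ij,nkj->nk", diff, cov_inv, diff)        # squared distances
    return np.argmin(sq, axis=1), sq


rng = np.random.default_rng(0)
samples = rng.normal(size=(30, 3))                   # made-up patient vectors
cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))
k = 7
# Initial case: k samples drawn at random (without replacement) as cluster centers
centers = samples[rng.choice(len(samples), size=k, replace=False)]
labels, sq = assign_to_centers(samples, centers, cov_inv)
print(labels)
```

Comparing squared distances is enough for the assignment, because taking the square root does not change which center is nearest.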
  • the sum of squared errors of the clusters is calculated according to the Mahalanobis distance.
  • the density of the samples (patients) in the sample data and the similarity difference of the outcomes between the samples (patients) have an impact on the clustering effect. For example, when the concentration of samples (patients) is high, and the disease characteristics between the disease information group and the disease information group are quite different, the clustering effect is better.
  • the sum of square errors refers to the sum of square errors of all samples in the sample data (which needs to be clustered). The smaller the sum of square errors, the higher the similarity of the samples in the disease information group.
  • the average value of the sample values contained in each grouping is calculated according to the clustering result generated in the previous (clustering) time to obtain k non-initial clustering centers.
  • The Mahalanobis distance between each sample (patient) in the sample data and each non-initial cluster center is calculated. The minimum Mahalanobis distance for each sample is then selected, and each sample is assigned to the cluster of the non-initial cluster center with that minimum distance, which generates a new clustering result. For example, with 7 non-initial cluster centers S, F, H, B, P, R and K, if the Mahalanobis distance between sample (patient) m and center F is the smallest, m is assigned to the group of center F, and so on until all samples in the sample data have been divided and a new grouping result is generated. Each round of clustering of the sample data yields a different clustering result.
  • The density of the clustered data samples and the differences between clusters strongly affect the clustering result: when the data is dense and the classes differ widely, clustering works well; otherwise it works poorly.
  • the square error criterion is commonly used, and the function formula is as follows:
  • Jc(m) represents the sum of squared errors of all samples (patients) in the sample data; the smaller Jc(m), the higher the similarity within each group.
  • Xi represents a point in the multidimensional space (a given sample (patient)), and Zj represents the mean of cluster Cj.
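The formula itself does not appear in this text. A standard form of the squared-error criterion that is consistent with the symbols defined above is shown below, under the assumption that the norm is the distance used for clustering (here, the Mahalanobis distance) and that k is the number of clusters:

```latex
J_c(m) = \sum_{j=1}^{k} \; \sum_{X_i \in C_j} \lVert X_i - Z_j \rVert^{2}
```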
  • Update the clusters (step S305): in a non-initial case, k non-initial cluster centers are calculated from the clustering result generated in the previous round by updating the mean of each cluster. The calculation formula is as follows:
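The update formula is likewise missing from this text. Assuming the usual mean update, and writing N_j (a symbol introduced here) for the number of samples currently in cluster C_j, it would read:

```latex
Z_j = \frac{1}{N_j} \sum_{X_i \in C_j} X_i , \qquad j = 1, \dots, k
```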
  • the clustering results are very unstable, so it must be based on the initial clustering
  • the above iterative calculation process is performed cyclically, comparing the sums of squared errors of two adjacent clustering rounds.
  • iterative calculation is a typical method in numerical calculation, which is applied to finding roots of equations, solving equations, finding eigenvalues of matrices, and so on.
  • the basic idea is to approximate successively, first take a rough approximation, and then use the same recurrence formula to repeatedly correct this initial value until the predetermined accuracy requirement is reached.
  • the smallest value of the sum of squared errors corresponding to a clustering means that the similarity within the groups of that grouping result is the highest.
  • the clustering result of the corresponding cluster with the smallest sum of square errors is the final clustering result.
  • the Mahalanobis distance between the new patient data and each sample (patient) data in each preset disease group is calculated.
  • according to the value of the Mahalanobis distance, the disease group to which the new patient data belongs is determined.
  • This solution belongs to the field of smart medical care. Through this solution, the construction of smart cities can be promoted.
  • This application can make maximum use of the information in the sample (patient) data that doctors consider when making medical decisions. At the same time, the disease group to which the new patient belongs can be judged based on the Mahalanobis distance, and the doctor can be assisted in making decisions based on the characteristics of the corresponding group and other information. This improves the efficiency of judging the group to which a patient belongs and the accuracy of doctors' decision-making.
  • an embodiment of the intelligent grouping device for similar patients in the embodiments of the present application includes:
  • the first obtaining module 401 is used to obtain new patient data to be matched
  • the first processing module 402 is configured to perform vectorization processing on each disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient;
  • the first calculation module 403 is configured to calculate the Mahalanobis distance between the new patient data and each historical patient data in the preset disease characteristic database based on the disease feature word vector;
  • the sorting module 404 is configured to sort the Mahalanobis distances to obtain a sorting result
  • the determining module 405 is configured to determine the matched disease information group corresponding to the new patient data based on the sorting result, wherein the disease information group respectively contains different clinical outcome information.
  • the first processing module 402 may also be specifically configured to:
  • the preprocessing method includes:
  • the method for intelligent grouping of similar patients obtains new patient data and, when judging the disease group to which the new patient belongs, calculates the Mahalanobis distance between the new patient data and each sample (patient) data in each preset disease group. According to the value of the Mahalanobis distance, the disease group to which the new patient data belongs is determined.
  • This solution belongs to the field of smart medical care and can promote the construction of smart cities. This application can make maximum use of the information in the sample (patient) data that doctors consider when making medical decisions.
  • The disease group to which the new patient belongs can be judged based on the Mahalanobis distance, and the doctor can be assisted in making decisions based on the characteristics of the corresponding group and other information. This improves the efficiency of judging the group to which a patient belongs and the accuracy of doctors' decision-making.
  • the second embodiment of the device for intelligent grouping of similar patients in the embodiment of the present application includes:
  • the first obtaining module 501 is used to obtain new patient data to be matched
  • the first processing module 502 is configured to perform vectorization processing on each disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient;
  • the first calculation module 503 is configured to calculate the Mahalanobis distance between the new patient data and each historical patient data in the preset disease characteristic database based on the disease feature word vector;
  • the sorting module 504 is used to sort the Mahalanobis distances to obtain a sorting result
  • the determining module 505 is configured to determine the matched disease information group corresponding to the new patient data based on the ranking result;
  • the second obtaining module 506 is used to obtain sample data including outcome variables
  • the second processing module 507 is configured to preprocess the sample data based on the type of the sample data to obtain a discretized word vector
  • the second calculation module 508 is configured to calculate the Mahalanobis distance between each sample in the sample data based on the discretized word vector;
  • the clustering module 509 is configured to cluster the sample data based on the Mahalanobis distance between each sample in the sample data to obtain a clustering result;
  • the extraction module 510 is configured to obtain multiple disease information groups contained in the sample data based on the grouping result, and extract the characteristics of the disease information groups;
  • the query module 511 is configured to query a preset disease condition description database based on the characteristics of the disease information group, and output the disease condition description corresponding to the characteristics of the disease information group.
  • the first processing module 502 may also be specifically configured to:
  • the preprocessing method includes:
  • the clustering module 509 may be specifically used for:
  • Set the number of clusters to k, randomly select k samples as the initial cluster centers, and calculate the Mahalanobis distance from each sample in the sample data to each cluster center; based on these distances, select the minimum Mahalanobis distance corresponding to each sample, and divide each sample into the group where the cluster center corresponding to the minimum Mahalanobis distance is located, until all the samples in the sample data are divided and the first clustering result is obtained.
  • the clustering module 509 may also be specifically used for:
  • based on the Mahalanobis distance, calculate the sum of squared errors of the clusters corresponding to the first clustering result.
  • k non-initial clustering centers are calculated according to the clustering result generated last time, and the Mahalanobis distance from each sample in the sample data to each non-initial cluster center is calculated;
  • the minimum Mahalanobis distance corresponding to each sample is selected, and each sample is divided into the group where the non-initial cluster center corresponding to the minimum Mahalanobis distance is located, generating a new clustering result;
  • the clustering module 509 may also be specifically used for:
  • the total square error of the cluster corresponding to the new clustering result is calculated, and the total square error of the cluster corresponding to the first clustering result is compared with the total square error of the cluster corresponding to the new clustering result, The comparison result is obtained, and based on the comparison result, the grouping result corresponding to the cluster with the smallest sum of square errors corresponding to the two grouping results is selected as the final grouping result.
  • the method for intelligent grouping of similar patients obtains new patient data and, when judging the disease group to which the new patient belongs, calculates the Mahalanobis distance between the new patient data and each sample (patient) data in each preset disease group. According to the value of the Mahalanobis distance, the disease group to which the new patient data belongs is determined.
  • This solution belongs to the field of smart medical care and can promote the construction of smart cities. This application can make maximum use of the information in the sample (patient) data that doctors consider when making medical decisions.
  • The disease group to which the new patient belongs can be judged based on the Mahalanobis distance, and the doctor can be assisted in making decisions based on the characteristics of the corresponding group and other information. This improves the efficiency of judging the group to which a patient belongs and the accuracy of doctors' decision-making.
  • the intelligent grouping device 600 for similar patients may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPU) 610, memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632.
  • the memory 620 and the storage medium 630 may be short-term storage or persistent storage.
  • the program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the intelligent grouping device 600 for similar patients.
  • the processor 610 may be configured to communicate with the storage medium 630, and execute a series of instruction operations in the storage medium 630 on the intelligent grouping device 600 for similar patients.
  • the intelligent grouping device 600 for similar patients may also include one or more power sources 640, one or more wired or wireless network interfaces 650, one or more input and output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • the present application also provides an intelligent grouping device for similar patients, including a memory and at least one processor, wherein the memory stores instructions, and the memory and the at least one processor are interconnected by wires; the at least one processor invokes the instructions in the memory, so that the intelligent grouping device for similar patients executes the steps of the intelligent grouping method for similar patients.
  • the present application also provides a computer-readable storage medium.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium.
  • the computer-readable storage medium stores computer instructions, and when the computer instructions are executed on the computer, the computer executes the following steps:
  • based on the disease feature word vector, calculate the Mahalanobis distance between the new patient data and each historical patient data in the preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group;
  • the matched disease information group corresponding to the new patient data is determined, wherein the disease information groups each contain different clinical outcome information.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
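
The clustering procedure described in the list above (random initial centers, assignment by minimum Mahalanobis distance, mean update, and comparison of the sum of squared errors between adjacent rounds) can be sketched as follows. This is a minimal illustration, not the applicant's implementation. The function names, the `max_iter` and `seed` parameters, and the choice to pass a fixed inverse covariance matrix `inv_cov` are assumptions made for the example; the source does not say how the covariance is estimated or when iteration stops.

```python
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def sq_mahalanobis(u: Vector, v: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance (u - v)^T * inv_cov * (u - v)."""
    diff = [a - b for a, b in zip(u, v)]
    n = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(n) for j in range(n))


def assign(samples: Sequence[Vector], centers: Sequence[Vector],
           inv_cov: Matrix) -> List[int]:
    """Put each sample in the cluster whose center is nearest."""
    return [min(range(len(centers)),
                key=lambda c: sq_mahalanobis(s, centers[c], inv_cov))
            for s in samples]


def update_centers(samples: Sequence[Vector], labels: Sequence[int],
                   centers: Sequence[Vector]) -> List[List[float]]:
    """Replace each center with the mean of the samples assigned to it."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [s for s, lab in zip(samples, labels) if lab == c]
        if not members:              # assumption: an empty cluster keeps its center
            new_centers.append(list(old))
            continue
        new_centers.append([sum(col) / len(members) for col in zip(*members)])
    return new_centers


def sse(samples: Sequence[Vector], labels: Sequence[int],
        centers: Sequence[Vector], inv_cov: Matrix) -> float:
    """Sum of squared errors of every sample to its own cluster center."""
    return sum(sq_mahalanobis(s, centers[lab], inv_cov)
               for s, lab in zip(samples, labels))


def cluster(samples: Sequence[Vector], k: int, inv_cov: Matrix,
            max_iter: int = 100, seed: int = 0
            ) -> Tuple[float, List[int], List[List[float]]]:
    """Return (sse, labels, centers) of the round with the smallest SSE."""
    rng = random.Random(seed)
    centers = [list(s) for s in rng.sample(list(samples), k)]  # k random samples
    labels = assign(samples, centers, inv_cov)                 # first clustering
    best = (sse(samples, labels, centers, inv_cov), labels, centers)
    for _ in range(max_iter):
        centers = update_centers(samples, labels, centers)     # non-initial centers
        labels = assign(samples, centers, inv_cov)             # new clustering
        current = sse(samples, labels, centers, inv_cov)
        if current >= best[0]:       # adjacent rounds compared: no improvement
            break
        best = (current, labels, centers)
    return best
```

Passing the identity matrix as `inv_cov` reduces the sketch to ordinary k-means with Euclidean distance, which is a convenient way to check it on toy data.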

Abstract

Provided are a method, apparatus and device for intelligently grouping similar patients, and a storage medium. The method for intelligently grouping similar patients comprises: acquiring new patient data to be matched, wherein the new patient data contains multiple pieces of disease feature data; performing vectorization processing on the disease feature data to obtain a disease feature word vector corresponding to a new patient; calculating the Mahalanobis distance between the new patient data and each piece of historical patient data in a preset disease feature database; sorting various Mahalanobis distances to obtain a sorting result; and determining disease information groups correspondingly matching the new patient data, wherein the disease information groups respectively include different clinical outcome information. By using the method, data information of historical patients can be used to the greatest possible extent, the disease information group to which the new patient belongs can be quickly determined according to the Mahalanobis distance, and a doctor can be assisted, according to the features of the corresponding disease information group, in making a decision, thereby improving the accuracy of the doctor when making a medical decision.

Description

Method, device, equipment and storage medium for intelligent grouping of similar patients
This application claims priority to the Chinese patent application No. 202010405737.7, filed with the Chinese Patent Office on May 14, 2020 and entitled "Method, device, equipment and storage medium for intelligent grouping of similar patients", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of database technology, and in particular to a method, device, equipment and storage medium for intelligent grouping of similar patients.
Background
With the development of technology, artificial intelligence is becoming increasingly common. In the medical field, when making medical decisions, doctors usually combine the disease characteristics and treatment courses of previously treated patients with the actual condition of the patient currently being treated, so as to make a more appropriate decision. However, when doctors make medical decisions for new patients, they do not make full use of the data of existing patients.
The inventors realized that medical decisions for new patients based on sample (historical patient) data mostly rely on the continuous data therein, such as test indicators and age. This approach cannot obtain distinct subgroups with widely differing clinical outcomes, cannot make the fullest possible use of the information doctors consider when making decisions, and cannot produce accurate medical decisions quickly.
Summary of the invention
The main purpose of this application is to solve the technical problem of how to intelligently group similar patients.
To achieve the above purpose, a first aspect of this application provides a method for intelligent grouping of similar patients, including: acquiring new patient data to be matched, the new patient data containing multiple items of disease feature data; vectorizing each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient; calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and determining, based on the sorting result, the disease information group matched to the new patient data, wherein the disease information groups each contain different clinical outcome information.
A second aspect of this application provides equipment for intelligent grouping of similar patients, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions: acquiring new patient data to be matched, the new patient data containing multiple items of disease feature data; vectorizing each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient; calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and determining, based on the sorting result, the disease information group matched to the new patient data, wherein the disease information groups each contain different clinical outcome information.
A third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: acquiring new patient data to be matched, the new patient data containing multiple items of disease feature data; vectorizing each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient; calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; sorting the Mahalanobis distances to obtain a sorting result; and determining, based on the sorting result, the disease information group matched to the new patient data, wherein the disease information groups each contain different clinical outcome information.
A fourth aspect of this application provides a device for intelligent grouping of similar patients, including: a first acquisition module, configured to acquire new patient data to be matched, the new patient data containing multiple items of disease feature data; a first processing module, configured to vectorize each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient; a first calculation module, configured to calculate, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; a sorting module, configured to sort the Mahalanobis distances to obtain a sorting result; and a determining module, configured to determine, based on the sorting result, the disease information group matched to the new patient data, wherein the disease information groups each contain different clinical outcome information.
In the technical solution provided by this application, new patient data to be matched is acquired, the new patient data containing multiple items of disease feature data; each item of disease feature data of the new patient is vectorized to obtain a disease feature word vector corresponding to the new patient; based on the disease feature word vector, the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database is calculated, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group; the Mahalanobis distances are sorted to obtain a sorting result; and, based on the sorting result, the disease information group matched to the new patient data is determined, wherein the disease information groups each contain different clinical outcome information. This solution can be applied in the field of smart medical care and thereby promote the construction of smart cities. It makes maximum use of the information in the sample (patient) data that doctors consider when making medical decisions, judges the disease group to which the new patient belongs according to the Mahalanobis distance, and assists doctors in making decisions based on the characteristics of the corresponding group, which improves the efficiency of judging the group to which a patient belongs and the accuracy of medical decisions.
Description of the drawings
FIG. 1 is a schematic diagram of a first embodiment of the method for intelligent grouping of similar patients in an embodiment of this application;
FIG. 2 is a schematic diagram of a second embodiment of the method for intelligent grouping of similar patients in an embodiment of this application;
FIG. 3 is a schematic diagram of a third embodiment of the method for intelligent grouping of similar patients in an embodiment of this application;
FIG. 4 is a schematic diagram of a first embodiment of the device for intelligent grouping of similar patients in an embodiment of this application;
FIG. 5 is a schematic diagram of a second embodiment of the device for intelligent grouping of similar patients in an embodiment of this application;
FIG. 6 is a schematic diagram of an embodiment of the equipment for intelligent grouping of similar patients in an embodiment of this application.
Detailed description
The embodiments of this application provide a method, device, equipment and storage medium for intelligent grouping of similar patients. New patient data is acquired and, when judging the disease group to which the new patient belongs, the pairwise Mahalanobis distance between the new patient data and each sample (patient) data in each preset disease group is calculated; the disease group to which the new patient data belongs is then determined according to the value of the Mahalanobis distance. This solution belongs to the field of smart medical care and can promote the construction of smart cities. This application can make maximum use of the information in the sample (patient) data that doctors consider when making medical decisions; at the same time, it can judge the disease group to which the new patient belongs based on the Mahalanobis distance and assist the doctor in making decisions based on the characteristics of the corresponding group and other information. This improves the efficiency of judging the group to which a patient belongs and the accuracy of doctors' decision-making.
To enable those skilled in the art to better understand the solution of this application, the embodiments of this application are described below with reference to the accompanying drawings.
The terms "first", "second", "third", "fourth" and so on (if any) in the description, the claims and the above drawings of this application are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in an order other than that illustrated or described here. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units clearly listed, but may include other steps or units that are not clearly listed or that are inherent to the process, method, product or device.
For ease of understanding, the specific process of the embodiments of this application is described below. Referring to FIG. 1, an embodiment of the method for intelligent grouping of similar patients in the embodiments of this application includes:
In an embodiment, the method for intelligent grouping of similar patients includes:
101. Acquire new patient data to be matched, the new patient data containing multiple items of disease feature data;
In this embodiment, the new patient data to be matched refers to the data of a patient who is currently being treated and for whom the doctor needs to draw on information from previous patients to make medical decisions. It contains both the new patient's personal information and information such as the symptoms of the patient's disease and the features those symptoms present, mainly including gender, age, name, physical test indicators, examination results and past medical history. For example: Zhang San, male, Han ethnicity, age 25, 10-year history of hepatitis B; chief complaint: frequent fatigue, lack of physical strength, lower-limb edema, insomnia with vivid dreams, upper abdominal discomfort, abdominal distension, yellowing of the skin and urine, and dark, tea-colored urine.
"Matching" in this embodiment means matching the new patient's disease and symptom information against the disease features of previous patients.
102. Vectorize each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient;
In this embodiment, the collected data containing new patient information includes not only continuous data such as test indicators and age, but also discrete or text data such as gender and examination results. The collected new patient data therefore needs to be vectorized according to the data type to which it belongs, to obtain the corresponding vectorized new patient data. For example, if the new patient data is mixed data containing text data, discrete data and continuous data, the word vector method from natural language processing is used to preprocess the text data and the discrete data with one-hot (One-Hot) encoding, yielding vectorized data.
Continuous data does not require any standardization or normalization preprocessing; feature data of this type can be used directly.
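
The vectorization in step 102 can be sketched as below: discrete fields are one-hot encoded and continuous fields are copied through unchanged. This is only an illustration under stated assumptions. The field names (`age`, `sex`, `edema`), the helper names `build_vocab` and `vectorize`, and the idea of building the vocabulary from the historical records are hypothetical; the source does not specify how free text such as a chief complaint is reduced to discrete values before encoding.

```python
from typing import Dict, List, Mapping, Sequence, Union

Value = Union[str, int, float]
Record = Mapping[str, Value]


def build_vocab(records: Sequence[Record],
                categorical_fields: Sequence[str]) -> Dict[str, List[str]]:
    """Collect the sorted set of observed values for each discrete field."""
    vocab: Dict[str, List[str]] = {}
    for field in categorical_fields:
        vocab[field] = sorted({str(r[field]) for r in records if field in r})
    return vocab


def vectorize(record: Record, vocab: Mapping[str, Sequence[str]],
              continuous_fields: Sequence[str]) -> List[float]:
    """Turn one patient record into a numeric feature vector."""
    vec: List[float] = []
    for field in continuous_fields:
        vec.append(float(record[field]))       # continuous: used directly
    for field, values in vocab.items():
        current = str(record.get(field))       # unseen or missing -> all zeros
        vec.extend(1.0 if current == v else 0.0 for v in values)
    return vec
```

For instance, with records `{"age": 25, "sex": "male", "edema": "yes"}` and `{"age": 61, "sex": "female", "edema": "no"}`, the first record becomes `[25.0, 0.0, 1.0, 0.0, 1.0]`: the age, then the one-hot slots for `sex` (female, male) and `edema` (no, yes).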
103. Based on the disease feature word vector, calculate the Mahalanobis distance between the new patient data and each item of historical patient data in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group;
In this embodiment, the Mahalanobis distance between the new patient data and each item of historical patient data in the preset disease feature database is calculated according to the patient feature word vector generated by the vectorization. For example, suppose the preset disease feature database contains 7 disease information groups A, B, C, D, E, F and G, each with n samples (patients): A(a1, a2, a3...an), B(b1, b2, b3...bn), C(c1, c2, c3...cn), D(d1, d2, d3...dn), E(e1, e2, e3...en), F(f1, f2, f3...fn), G(g1, g2, g3...gn). The Mahalanobis distance between the new patient data and each sample (patient) data in the 7 disease information groups A, B, C, D, E, F and G is then calculated.
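
As a reference for the distance used in step 103, the Mahalanobis distance between vectors u and v under covariance matrix S is sqrt((u - v)^T S^-1 (u - v)). The sketch below computes it with the standard library only, solving the linear system instead of forming the inverse. It assumes the covariance is estimated from the historical sample vectors, which the source does not state; the function names are hypothetical. One practical caveat: one-hot columns are often linearly dependent, which makes the covariance singular, so a real implementation would likely need regularization or a pseudo-inverse (again an assumption, not something the source describes).

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def covariance(samples: Sequence[Vector]) -> List[List[float]]:
    """Sample covariance matrix (divides by n - 1) of row vectors."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for s in samples:
        for a in range(d):
            for b in range(d):
                cov[a][b] += (s[a] - means[a]) * (s[b] - means[b])
    return [[c / (n - 1) for c in row] for row in cov]


def solve(matrix: Sequence[Vector], rhs: Vector) -> List[float]:
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    d = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(d)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * d
    for i in reversed(range(d)):               # back substitution
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (aug[i][d] - tail) / aug[i][i]
    return x


def mahalanobis(u: Vector, v: Vector, cov: Sequence[Vector]) -> float:
    """sqrt((u - v)^T * cov^-1 * (u - v))."""
    diff = [a - b for a, b in zip(u, v)]
    weighted = solve(cov, diff)                # cov^-1 * (u - v)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))
```

With the identity matrix as covariance the result equals the Euclidean distance; with a diagonal covariance each feature is divided by its standard deviation, which is why differently scaled continuous features can be compared without prior normalization.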
本实施例中,疾病特征信息库,我们可以把它理解成是一个包含了大量病人数据的数据库,其中包括一种疾病的多个不同群组,比如结局是糖尿病并发肾病,糖尿病伴高血压,或者糖尿病HbA1c达标的群组等。每一个疾病信息组群中包含一定数量个该种临床结局类型的病人的数据信息。在本实施例中,我们也把这些病人的数据信息叫作样本数据。In this embodiment, the disease characteristic information database, we can understand it as a database containing a large number of patient data, including multiple different groups of a disease, for example, the outcome is diabetes with nephropathy, diabetes with hypertension, Or groups with diabetes HbA1c standards, etc. Each disease information group contains data information of a certain number of patients with this type of clinical outcome. In this embodiment, we also call the data information of these patients as sample data.
104. Sort the Mahalanobis distances to obtain a sorting result;
In this embodiment, the Mahalanobis distances calculated between the new patient data and each historical patient record in the preset disease feature information database are sorted by value to obtain a sorting result. The order may be descending or ascending. The Mahalanobis distance between two patients with similar outcomes is much smaller than that between two patients with dissimilar outcomes.
105. Based on the sorting result, determine the disease information group that matches the new patient data, wherein the disease information groups each contain different clinical outcome information.
In this embodiment, a disease information group is the group for one specific disease and contains a certain number of samples (patients) with that type of disease. Taking a diabetic clinical outcome of HbA1c on target (less than 7) as an example, the group for that clinical outcome holds, for each sample (patient), information covering the whole course of the disease: personal information, disease features, disease progression, and outcome, together with data such as history of present illness, past medical history, recent medication, family history, physical examination, and outcome.
In this embodiment, the smaller the Mahalanobis distance between the new patient data and a sample (patient), the more similar the outcomes of the two patients and the more likely they belong to the same disease information group. The disease information group that corresponds to the new patient data can therefore be determined from the sorted Mahalanobis distances.
In this embodiment, the Mahalanobis distance is used to measure the similarity between two data samples. With the two samples represented as feature vectors x and y, and with Σ denoting the covariance matrix of the sample data, the distance is $d(x, y) = \sqrt{(x - y)^{T} \Sigma^{-1} (x - y)}$. The smaller the calculated distance, the higher the similarity of the two samples.
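Steps 103 to 105 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: it assumes NumPy, a covariance matrix estimated once from all historical samples, and a nearest-sample rule for choosing the group (the text only says the group is determined from the sorted distances). The `database` contents, the two features (age, HbA1c), and the function name `mahalanobis` are hypothetical.

```python
import numpy as np

def mahalanobis(x, y, cov_inv):
    """Mahalanobis distance between vectors x and y for a given inverse covariance."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    # max(..., 0.0) guards against tiny negative values caused by rounding.
    return float(np.sqrt(max(diff @ cov_inv @ diff, 0.0)))

# Hypothetical database: group label -> historical patient vectors (age, HbA1c).
database = {
    "A": np.array([[50.0, 6.4], [55.0, 6.8], [48.0, 6.1]]),
    "B": np.array([[63.0, 8.9], [67.0, 9.4], [60.0, 8.5]]),
}

# Assumption: one covariance matrix estimated from all historical samples.
all_samples = np.vstack(list(database.values()))
cov_inv = np.linalg.pinv(np.cov(all_samples, rowvar=False))

new_patient = np.array([52.0, 6.5])

# Step 103: distance to every historical patient; step 104: sort ascending.
ranked = sorted(
    (mahalanobis(new_patient, sample, cov_inv), label)
    for label, samples in database.items()
    for sample in samples
)

# Step 105 (assumed rule): take the group of the closest historical patient.
matched_group = ranked[0][1]
```

Other decision rules, such as a majority vote over the few closest samples, fit the same ranking; the source does not say which one is used.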
It can be understood that the execution subject of this application may be a device for intelligently grouping similar patients, or a terminal or a server, which is not specifically limited here. The embodiments of this application are described with a server as the execution subject.
In this embodiment of the application, new patient data is acquired and, to determine which disease group the new patient belongs to, the Mahalanobis distance is calculated between the new patient data and each sample (patient) in each preset disease group; the disease group of the new patient data is then determined from the values of these distances. This solution belongs to the field of smart healthcare and can thereby support the construction of smart cities. The application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. It also determines the new patient's disease group from the Mahalanobis distance and uses the features of that group to assist doctors in decision-making. This improves the efficiency of determining which group a patient belongs to and the accuracy of doctors' decisions.
Referring to Fig. 2, another embodiment of the method for intelligently grouping similar patients in the embodiments of this application includes:
201. Obtain sample data containing an outcome variable;
In this embodiment, the outcome variable is the outcome of interest for a given disease. For a cold, the outcome of interest is whether the patient is cured. For type 2 diabetes, it is whether glycated hemoglobin (HbA1c) is on target.
In this embodiment, sample data containing an outcome variable means the data of patients who have received treatment and completed it. A large amount of historical patient data containing outcome variables is obtained as sample data through channels such as hospital electronic medical records, and the type of the sample data is determined. Examples include basic information such as the patient's name, age, and blood type, the chief complaint, past medical history, family history, physical examination, medication information, and outcome (whether cured).
202. Preprocess the sample data based on its type to obtain discretized word vectors;
In this embodiment, the samples are preprocessed according to the type of the sample data. For example, discrete data or text data can be vectorized to obtain data in the form of discrete word vectors.
In an optional embodiment, the type of the new patient data is acquired;
In this embodiment, in the medical field, new patient data includes not only continuous data such as test indicators and age, but also discrete or text data such as sex and examination results. Because discrete and text data can only be used after being discretized into discrete word vectors, the type of the new patient data must be determined first.
In another optional embodiment, the vectorization processing that corresponds to the data is determined based on the type of the new patient data, and that processing is performed;
The vectorization processing includes:
A. When the new patient data is text data, vectorize the text data;
In this embodiment, if the new patient data is text data, that data is vectorized.
In this embodiment, text data, also called character data, means any characters that cannot take part in arithmetic operations, for example sex or examination results.
In this embodiment, vectorization means converting words into a distributed representation, also called a word vector, so that a notion of "distance" exists between words and more information is carried.
B. When the new patient data is discrete data, vectorize the discrete data;
In this embodiment, as with text data, if the new patient data is discrete data, it is likewise vectorized into discrete word vector form.
C. When the new patient data is continuous data, do not vectorize the data.
In this embodiment, if the new patient data is continuous data, no standardization or normalization preprocessing is needed and the data can be used directly.
In this embodiment, continuous data, a statistical concept also called a continuous variable, is data that can take any value within an interval: the values are continuous, and the range between two adjacent values can be divided without limit (so infinitely many values are possible). For example, the dimensions of manufactured parts and body measurements such as height, weight, and chest circumference are continuous data, whose values can only be obtained by measurement.
In this embodiment, data of different types is processed differently. Continuous data can be used directly without processing, whereas text data and discrete data must be vectorized before use, so the vectorization processing that corresponds to the sample data has to be determined.
In this embodiment, the new patient data is vectorized to obtain the patient feature word vector, that is, data in word vector form that contains the patient's information and features.
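The type-dependent preprocessing in cases A to C can be illustrated with a short sketch. The source does not name a specific vectorization scheme, so one-hot encoding is used here purely as an assumption; the field names, the `categories` table, and the function `vectorize_record` are hypothetical.

```python
def vectorize_record(record, categories):
    """Turn one patient record into a flat numeric vector.

    Fields listed in `categories` are treated as discrete/text data and
    one-hot encoded (assumed scheme). All other fields are treated as
    continuous data and passed through unchanged, as in case C.
    """
    vector = []
    for field, value in record.items():
        if field in categories:
            vector.extend(1.0 if value == c else 0.0 for c in categories[field])
        else:
            vector.append(float(value))
    return vector

# Hypothetical category table and patient record.
categories = {"sex": ["female", "male"], "exam_result": ["normal", "abnormal"]}
record = {"age": 52, "hba1c": 6.5, "sex": "male", "exam_result": "normal"}

vec = vectorize_record(record, categories)
# vec == [52.0, 6.5, 0.0, 1.0, 1.0, 0.0]
```

One design point worth noting: a full one-hot block makes its columns sum to one, so the covariance matrix of such vectors is singular. A later Mahalanobis computation would then need a pseudo-inverse, or one indicator column per field would have to be dropped.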
203. Based on the discretized word vectors, calculate the pairwise Mahalanobis distances between the samples in the sample data;
In this embodiment, the Mahalanobis distance between every pair of samples (patients) in the sample data is calculated from the discretized word vectors.
In this embodiment, the Mahalanobis distance is an effective way to calculate the distance between a sample and the "center of gravity" of a sample set, or the similarity between two unknown sample sets. It takes the relationships among features into account, removes the interference of correlations between variables, and is scale-invariant, that is, independent of the measurement scale. When the covariance matrix Σ is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. In summary, the Mahalanobis distance conveniently measures the distance between an observed sample and a known sample set, which is why it is also well suited to fault diagnosis.
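The two properties named above, scale invariance and the reduction to the Euclidean distance for an identity covariance, can be checked numerically. The following is a small sketch assuming NumPy; the synthetic data and the helper name `mahalanobis` are illustrative only.

```python
import numpy as np

def mahalanobis(x, y, cov):
    """Mahalanobis distance between x and y for covariance matrix cov."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# Synthetic, correlated two-feature data (hypothetical).
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
cov = np.cov(data, rowvar=False)
x, y = data[0], data[1]

# Rescaling one feature (for example metres to centimetres) rescales the
# covariance accordingly and leaves the Mahalanobis distance unchanged.
scale = np.diag([100.0, 1.0])
d_original = mahalanobis(x, y, cov)
d_rescaled = mahalanobis(scale @ x, scale @ y, scale @ cov @ scale.T)

# With the identity matrix as covariance, the distance is Euclidean.
d_identity = mahalanobis(x, y, np.eye(2))
d_euclid = float(np.linalg.norm(x - y))
```

This is also the reason continuous features can be used without prior standardization: any per-feature rescaling is absorbed by the covariance matrix.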
204. Cluster the sample data based on the pairwise Mahalanobis distances between the samples to obtain a grouping result;
In this embodiment, clustering is a special classification process that divides sample data, for which prior knowledge is insufficient and uncertain, into several classes. Data records that are highly similar are placed in the same group, so that the dissimilarity between data records in different groups is maximized. It is a statistical analysis method for studying classification problems (of samples or indicators). A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.
In this embodiment, the sample data is clustered according to the pairwise Mahalanobis distances between its samples, and the clustering result is determined. For example, if the sample data contains n samples (patients) M1, M2, M3, ..., Mn, the Mahalanobis distance between every pair of samples is calculated, the sample data is clustered according to these distances, and the clustering result yields multiple sample groups.
205. Based on the grouping result, obtain the multiple disease information groups contained in the sample data, and extract the features of the disease information groups;
In this embodiment, the multiple disease information groups contained in the sample data are obtained from the grouping result, each group containing different clinical outcome information for a given disease. For example, clustering the sample data by the pairwise Mahalanobis distances among 500 samples (patients) yields seven disease information groups A, B, C, D, E, F, and G with different clinical outcomes of diabetes. The features of the samples (patients) in each group are then extracted, such as demographic features and laboratory test features, and described, for example the age distribution and the male-to-female ratio within a group; the feature distributions are used to assist doctors in decision-making. In this embodiment, the feature distribution of a disease information group refers to distributional properties of the sample data in that group, for example that the mean age of the samples (patients) is 50 and that males account for 70%.
In this embodiment, multiple disease information groups are obtained from the grouping result and the features of each group are extracted. These features include, but are not limited to, the male-to-female ratio, age distribution, laboratory test data, disease features, disease progression, history of present illness, and past medical history. As a further illustration of feature extraction, the iris data set contains four features: sepal length, sepal width, petal length, and petal width, all in centimeters. Feature extraction gives the features of each disease group, which helps doctors make more accurate medical decisions.
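The per-group feature description in step 205 amounts to computing summary statistics within each group. The sketch below uses only the Python standard library; the patient records, the group labels, and the function name `summarize` are hypothetical, and the chosen statistics (count, mean age, male share) simply mirror the examples in the text.

```python
from statistics import mean

# Hypothetical grouped patients (group label assigned by a previous clustering).
patients = [
    {"group": "A", "age": 48, "sex": "male"},
    {"group": "A", "age": 52, "sex": "male"},
    {"group": "A", "age": 50, "sex": "female"},
    {"group": "B", "age": 66, "sex": "female"},
    {"group": "B", "age": 70, "sex": "male"},
]

def summarize(patients):
    """Return count, mean age, and share of male patients for each group."""
    groups = {}
    for p in patients:
        groups.setdefault(p["group"], []).append(p)
    return {
        label: {
            "n": len(members),
            "mean_age": mean(m["age"] for m in members),
            "male_share": sum(m["sex"] == "male" for m in members) / len(members),
        }
        for label, members in groups.items()
    }

summary = summarize(patients)
```

In this toy data, group A has a mean age of 50 and two of its three patients are male; real feature sets would add laboratory values and history fields in the same way.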
206. Based on the features of the disease information group, query a preset disease condition description library and output the disease condition description that corresponds to those features;
In this embodiment, a feature is information characteristic of a particular disease, such as the sex distribution of the population, the distribution of laboratory test data, disease features, and features of disease progression. The preset disease condition description library is queried according to the feature distribution information of the disease information group to determine the data of the corresponding disease, which helps doctors make more accurate medical decisions.
In this embodiment of the application, the disease condition description library is built from a large number of medical records in hospitals. It covers, for each disease type, the disease features of patients in different age ranges, how the condition developed, the course of medication and treatment, and the eventual progression of the disease. When a new patient is diagnosed, the patient's disease features are determined from the chief complaint and the diagnosed symptoms, and these features are used as keywords to query the preset disease condition description library and determine the patient's disease type. For example, a patient may present with polyuria, polydipsia, and polyphagia, but with severe weight loss over a short period and edema of both lower limbs. Based on these features, the disease information that best matches the new patient is retrieved from the preset library, which determines the best-matching disease type and helps the doctor make a more accurate diagnosis.
207. Obtain the new patient data to be matched;
208. Vectorize each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient;
209. Based on the disease feature word vector, calculate the Mahalanobis distance between the new patient data and each historical patient record in the preset disease feature database;
210. Sort the Mahalanobis distances to obtain a sorting result;
211. Based on the sorting result, determine the disease information group that matches the new patient data.
In this embodiment of the application, new patient data is acquired and, to determine which disease group the new patient belongs to, the Mahalanobis distance is calculated between the new patient data and each sample (patient) in each preset disease group; the disease group of the new patient data is then determined from the values of these distances. This solution belongs to the field of smart healthcare and can thereby support the construction of smart cities. The application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. It also determines the new patient's disease group from the Mahalanobis distance and uses the features of that group to assist doctors in decision-making. This improves the efficiency of determining which group a patient belongs to and the accuracy of doctors' decisions.
Referring to Fig. 3, a third embodiment of the method for intelligently grouping similar patients in the embodiments of this application includes:
301. Obtain sample data containing an outcome variable;
302. Preprocess the sample data based on its type to obtain discretized word vectors;
303. Based on the discretized word vectors, calculate the pairwise Mahalanobis distances between the samples in the sample data;
304. Set the number of groups to k, and randomly select k samples as initial cluster centers;
In this embodiment, clustering means dividing the input sample data into several parts according to their features, as is done in neural networks; a cluster center is the center of such a cluster.
In this embodiment, clustering is the process of dividing a set of physical or abstract objects into multiple classes of similar objects, and is a statistical analysis method for studying classification problems (of samples or indicators). A cluster produced by clustering is a set of data objects that are similar to one another within the same cluster and dissimilar to objects in other clusters.
In this embodiment, it is assumed that the samples (diabetic patients) in the sample data will be divided into k groups containing different clinical outcome information, and k samples are randomly selected from the sample data as cluster centers. For example, suppose all the data in the sample data can be divided into seven disease information groups A, B, C, D, E, F, and G, representing seven different outcomes of diabetes; A, B, C, D, E, F, and G are then the cluster centers of these seven groups.
In this embodiment, the determination of the cluster centers distinguishes an initial case and a non-initial case. In the initial case, k samples are randomly selected from the sample data as the initial cluster centers. An initial cluster center is expressed as $m_p(1) = (V_{i1}, V_{i2}, \ldots, V_{ij})$, where p = 1, 2, ..., k and k is the number of groups.
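The random choice of initial centers can be sketched in a few lines. This is an illustration assuming NumPy; the sample matrix is synthetic, and the fixed seed is used only so that the example is reproducible.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed only for reproducibility
samples = rng.normal(size=(20, 3))    # hypothetical vectorized samples, one row per patient
k = 4                                 # assumed number of groups

# Draw k distinct row indices; the selected rows serve as the initial centers.
center_idx = rng.choice(len(samples), size=k, replace=False)
initial_centers = samples[center_idx]
```

Sampling without replacement ensures that no sample is selected as a center twice.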
305. Calculate the Mahalanobis distance from each sample in the sample data to each cluster center;
In this embodiment, the Mahalanobis distance from each sample in the sample data to each cluster center is calculated. For example, if the sample data contains N samples, the Mahalanobis distances between N1, N2, N3, ..., NN and the seven initial cluster centers A, B, C, D, E, F, and G are calculated; the distances between N1 and the seven initial cluster centers are a1, b1, c1, d1, e1, f1, and g1 respectively.
306. Based on the Mahalanobis distance from each sample to each cluster center, select the minimum Mahalanobis distance for each sample and assign the sample to the group of the cluster center with that minimum distance, until all samples in the sample data have been assigned, to obtain the first grouping result;
In this embodiment, the minimum Mahalanobis distance of each sample is selected from its distances to the cluster centers, and the sample is assigned to the group of the corresponding cluster center, until all samples in the sample data have been assigned and the first grouping result is produced. For example, suppose all the data in the sample data can be divided into seven groups with cluster centers A, B, C, D, E, F, and G, each representing a different clinical outcome of a given disease, and the sample data contains N samples. The Mahalanobis distances between N1, N2, N3, ..., NN and the seven initial cluster centers are calculated. Taking N1 as an example, its distances to the seven initial cluster centers are a1, b1, c1, d1, e1, f1, and g1. If a1 is the smallest, N1 is assigned to the clinical outcome group of cluster center A. The remaining samples are handled in the same way until all N samples have been assigned, producing the first grouping result.
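Steps 305 and 306 can be sketched as one vectorized assignment. The code below is an illustration assuming NumPy and a single covariance matrix estimated from all samples (the source does not state which covariance is used); the four sample vectors and the function name `assign_to_centers` are hypothetical.

```python
import numpy as np

def assign_to_centers(samples, centers, cov_inv):
    """Assign each sample to its nearest center under the Mahalanobis distance.

    Returns the index of the nearest center for each sample and the matrix of
    squared distances with shape (n_samples, n_centers).
    """
    diffs = samples[:, None, :] - centers[None, :, :]                # (n, k, d)
    sq_dist = np.einsum("nkd,de,nke->nk", diffs, cov_inv, diffs)     # (n, k)
    return sq_dist.argmin(axis=1), sq_dist

# Hypothetical vectorized samples with two features.
samples = np.array([[1.0, 2.0], [1.5, 1.0], [6.0, 2.5], [6.5, 1.5]])
centers = samples[[0, 2]]   # pretend samples 0 and 2 were drawn as initial centers
cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))

labels, sq_dist = assign_to_centers(samples, centers, cov_inv)
# labels -> [0, 0, 1, 1]
```

Squared distances are sufficient for picking the minimum, so the square root is skipped.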
307. Based on the Mahalanobis distances, calculate the sum of squared errors of the clustering that corresponds to the first grouping result;
In this embodiment, the sum of squared errors of the clustering is calculated from the Mahalanobis distances.
In this embodiment, when the data is clustered, the density of the samples (patients) in the sample data and the differences in outcome similarity between samples (patients) affect the quality of the clustering. For example, when the samples (patients) are dense and the disease features differ considerably from one disease information group to another, the clustering works well.
In this embodiment, the sum of squared errors is the sum of the squared errors of all samples in the sample data being clustered. The smaller the sum of squared errors, the higher the similarity of the samples within each disease information group.
308. In the non-initial case, calculate k non-initial cluster centers from the most recently generated grouping result;
In this embodiment, in the non-initial case, the mean of the sample values in each group of the previous grouping result is calculated, which gives k non-initial cluster centers.
309. Calculate the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, select the minimum Mahalanobis distance for each sample, and assign the sample to the group of the non-initial cluster center with that minimum distance, to generate a new grouping result;
In this embodiment, the Mahalanobis distance between each sample (patient) in the sample data and each non-initial cluster center is calculated; the minimum Mahalanobis distance of each sample is then selected, and the sample is assigned to the group of the corresponding non-initial cluster center, which generates a new grouping result. For example, given seven non-initial cluster centers S, F, H, B, P, R, and K, the Mahalanobis distances between sample (patient) m and each of the seven centers are calculated, with values m1, m2, ..., m7. If m2 is the smallest, sample (patient) m is assigned to the group of non-initial cluster center F. This continues until all samples in the sample data have been assigned and a new grouping result is generated. Each clustering pass over the sample data generally yields a different grouping result.
310. Based on the Mahalanobis distances, calculate the sum of squared errors of the clustering that corresponds to the new grouping result;
In this embodiment, the density of the clustered data samples and the differences between classes strongly affect the clustering result: when the data is dense and the classes differ considerably, the clustering works well; otherwise it works poorly. Clustering algorithms commonly use the squared error criterion, whose function is as follows:
$$J_c(m) = \sum_{j=1}^{k} \sum_{X_i \in C_j} \lVert X_i - Z_j \rVert^{2}$$
Here, $J_c(m)$ is the sum of the squared errors of all samples (patients) in the sample data; the smaller $J_c(m)$, the higher the similarity within the groups. $X_i$ is a point in the multidimensional space (a given sample (patient)), and $Z_j$ is the mean of cluster $C_j$. To update the clusters (step S305), in the non-initial case, k non-initial cluster centers are calculated from the most recently generated grouping result. The cluster means are updated with the following formula:
$$Z_j = \frac{1}{\lvert C_j \rvert} \sum_{X_i \in C_j} X_i$$
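The squared error criterion and the update of the cluster means can be sketched as two small functions. This is an illustration assuming NumPy. Because step 307 states that the error is calculated from Mahalanobis distances, the squared error here is taken as the squared Mahalanobis distance to the assigned center, which is an assumption about the exact form; passing an identity matrix gives the Euclidean form. The function names and data are hypothetical.

```python
import numpy as np

def update_centers(samples, labels, k):
    """Mean of the samples currently assigned to each cluster.

    The sketch assumes every cluster has at least one member.
    """
    return np.array([samples[labels == j].mean(axis=0) for j in range(k)])

def sum_squared_error(samples, labels, centers, cov_inv):
    """Sum over all samples of the squared distance to the assigned center."""
    diffs = samples - centers[labels]
    return float(np.einsum("nd,de,ne->", diffs, cov_inv, diffs))

# Hypothetical one-dimensional layout embedded in two features.
samples = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
labels = np.array([0, 0, 1, 1])

centers = update_centers(samples, labels, k=2)                  # [[1, 0], [11, 0]]
sse = sum_squared_error(samples, labels, centers, np.eye(2))    # 4.0
```

Each sample lies one unit from its cluster mean in this toy layout, so the four squared errors sum to 4.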
311. Compare the sum of squared errors of the clustering for the first grouping result with that of the clustering for the new grouping result, and obtain a comparison result;
In this embodiment, because the k initial cluster centers are selected at random, it is hard to pick representative data records as initial centers, so the clustering result is unstable. Therefore, starting from the grouping result of the first clustering, the Mahalanobis distance between each sample in the sample data and the new cluster centers of that grouping result is recalculated, the sum of squared errors of the clustering for the new grouping result is calculated from these distances, and the sums of squared errors of the two clustering passes are compared. The smaller the value, the more accurate the grouping result.
In this embodiment, the above iterative calculation is repeated and the sums of squared errors of two successive clustering passes are compared. When the sum of squared errors no longer changes noticeably, that is, when E - E' < ε, the iteration stops, where E and E' are the sums of squared errors of two successive clustering passes (E being the larger and E' the smaller) and ε is a small positive number.
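Putting steps 304 to 311 together, the iteration with the stopping rule E - E' < ε can be sketched as below. This is a minimal sketch, not the applicant's implementation. It assumes NumPy, a single covariance matrix estimated once from all samples, an iteration cap `max_iter`, and that an empty cluster keeps its previous center; none of these details are given in the source. The function name `kmeans_mahalanobis` and the synthetic data are hypothetical.

```python
import numpy as np

def kmeans_mahalanobis(samples, k, eps=1e-6, max_iter=100, seed=0):
    """k-means style grouping using squared Mahalanobis distances."""
    rng = np.random.default_rng(seed)
    cov_inv = np.linalg.pinv(np.cov(samples, rowvar=False))
    # Initial case: k randomly chosen samples act as cluster centers.
    centers = samples[rng.choice(len(samples), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # Assign every sample to its nearest center.
        diffs = samples[:, None, :] - centers[None, :, :]
        sq_dist = np.einsum("nkd,de,nke->nk", diffs, cov_inv, diffs)
        labels = sq_dist.argmin(axis=1)
        # Sum of squared errors of this grouping.
        error = float(sq_dist[np.arange(len(samples)), labels].sum())
        # Stop when the improvement E - E' falls below eps.
        if prev_error - error < eps:
            break
        prev_error = error
        # Non-initial case: new centers are the means of the current groups.
        centers = np.array([
            samples[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers, error

# Hypothetical data: two synthetic patient populations with two features.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
group_b = rng.normal(loc=[10.0, 10.0], scale=0.5, size=(30, 2))
samples = np.vstack([group_a, group_b])

labels, centers, error = kmeans_mahalanobis(samples, k=2)
```

Because the result depends on the random initial centers, a common practice is to run the procedure with several seeds and keep the grouping with the smallest sum of squared errors.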
In this step, the sum of squared errors of the clustering serves as the measure of the error of a calculation result. Grouping is itself an iterative process, and what this solution seeks is a stable grouping result to use as the final result. Therefore, when the values obtained in successive iterations differ by a sufficiently small amount (that is, they are similar), the grouping result can be considered sufficiently stable.
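The loop and stopping rule described for step 311 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function names are hypothetical, the inverse covariance matrix `inv_cov` is assumed to be supplied by the caller and held fixed, new centers are assumed to be group means, and a `max_iter` safeguard is added that the text does not mention.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def sq_mahalanobis(x: Vector, y: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance (x - y)^T S^-1 (x - y), with inv_cov = S^-1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))


def assign(samples: Sequence[Vector], centers: Sequence[Vector], inv_cov: Matrix) -> List[int]:
    """Put each sample in the group of its nearest center."""
    return [
        min(range(len(centers)), key=lambda g: sq_mahalanobis(s, centers[g], inv_cov))
        for s in samples
    ]


def sum_squared_error(samples, centers, labels, inv_cov) -> float:
    """Sum over all samples of the squared distance to the center of their group."""
    return sum(sq_mahalanobis(s, centers[g], inv_cov) for s, g in zip(samples, labels))


def update_centers(samples, labels, centers) -> List[List[float]]:
    """Assumed update rule: group mean; an empty group keeps its previous center."""
    updated = []
    for g, old in enumerate(centers):
        members = [s for s, lab in zip(samples, labels) if lab == g]
        if members:
            updated.append([sum(col) / len(members) for col in zip(*members)])
        else:
            updated.append(list(old))
    return updated


def iterate_until_stable(samples, centers, inv_cov, eps=1e-6, max_iter=100) -> Tuple[list, list, float]:
    """Repeat regrouping until E - E' < eps for two adjacent clusterings."""
    labels = assign(samples, centers, inv_cov)
    error = sum_squared_error(samples, centers, labels, inv_cov)  # E
    for _ in range(max_iter):  # max_iter is a safeguard, not part of the source text
        new_centers = update_centers(samples, labels, centers)
        new_labels = assign(samples, new_centers, inv_cov)
        new_error = sum_squared_error(samples, new_centers, new_labels, inv_cov)  # E'
        stable = error - new_error < eps
        centers, labels, error = new_centers, new_labels, new_error
        if stable:
            break
    return labels, centers, error
```

With a fixed `inv_cov` and mean-based centers, the sum of squared errors cannot increase from one pass to the next, so the result returned when the loop stops is also the one with the smallest sum of squared errors seen so far.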
In this embodiment, iterative calculation is a typical class of methods in numerical computation, applied to finding roots of equations, solving systems of equations, computing eigenvalues of matrices, and so on. Its basic idea is successive approximation: start from a rough approximate value, then apply the same recurrence formula repeatedly to correct this initial value until the predetermined accuracy requirement is met.
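As a generic illustration of successive approximation (unrelated to patient data, and not taken from the application), the following sketch finds a square root by repeatedly applying one recurrence formula until the change falls below a tolerance. The function name, starting value, and tolerance are assumptions chosen for the example.

```python
def newton_sqrt(a: float, tol: float = 1e-12, max_iter: int = 100) -> float:
    """Approximate sqrt(a) with the recurrence x_next = (x + a / x) / 2."""
    if a < 0:
        raise ValueError("a must be non-negative")
    if a == 0:
        return 0.0
    x = a if a >= 1 else 1.0  # rough initial approximation
    for _ in range(max_iter):
        x_next = 0.5 * (x + a / x)  # same recurrence formula on every pass
        if abs(x_next - x) < tol:  # predetermined accuracy reached
            return x_next
        x = x_next
    return x
```

The clustering loop follows the same pattern: a rough initial grouping is corrected by the same regrouping rule until the sum of squared errors stops changing appreciably.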
312. Based on the comparison result, select, from the two grouping results, the grouping result whose clustering has the smallest sum of squared errors as the final grouping result;
In this embodiment, a smaller sum of squared errors indicates higher similarity within each group. Therefore, among the grouping results obtained from all clusterings, the one whose clustering has the smallest sum of squared errors is the most accurate, and that grouping result is taken as the final clustering result.
313. Based on the grouping result, obtain the multiple disease information groups contained in the sample data, and extract the features of the disease information groups;
314. Based on the features of the disease information groups, query a preset disease condition description database, and output the disease condition descriptions corresponding to the features of the disease information groups;
315. Acquire new patient data to be matched, where the new patient data contains multiple kinds of disease feature data;
316. Perform vectorization processing on each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient;
317. Based on the disease feature word vector, calculate the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database;
318. Sort the Mahalanobis distances to obtain a sorting result;
319. Based on the sorting result, determine the disease information group that matches the new patient data, where the disease information groups each contain different clinical outcome information.
In this embodiment of the present application, new patient data is acquired and, when determining the disease group to which the new patient belongs, the Mahalanobis distance between the new patient data and the data of each sample (patient) in each preset disease group is calculated pairwise; the disease group to which the new patient data belongs is then determined from the values of these Mahalanobis distances. This solution belongs to the field of smart healthcare and can promote the construction of smart cities. The present application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. In addition, the disease group to which a new patient belongs can be determined from the Mahalanobis distance, and information such as the characteristics of the corresponding group can assist the doctor in decision-making. This improves the efficiency of determining the group to which a patient belongs and the accuracy of doctors' decisions.
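Steps 317 to 319 might be sketched as below. The text does not say how the group is derived from the sorted distances, so this example assumes a majority vote among the `top_n` nearest historical patients; `match_group`, `top_n`, the `(vector, group)` record layout, and the caller-supplied `inv_cov` are all hypothetical.

```python
from collections import Counter
from typing import Hashable, List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def mahalanobis(x: Vector, y: Vector, inv_cov: Matrix) -> float:
    """Mahalanobis distance sqrt((x - y)^T S^-1 (x - y)), with inv_cov = S^-1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    q = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return max(q, 0.0) ** 0.5  # guard against tiny negative rounding errors


def match_group(
    new_patient: Vector,
    history: List[Tuple[Vector, Hashable]],
    inv_cov: Matrix,
    top_n: int = 3,
) -> Hashable:
    """Rank historical patients by distance and vote among the nearest top_n."""
    if not history:
        raise ValueError("history must contain at least one patient record")
    # Step 317 and 318: distance to every historical record, sorted ascending.
    ranked = sorted(history, key=lambda rec: mahalanobis(new_patient, rec[0], inv_cov))
    # Step 319 (assumed rule): the most common group among the nearest records.
    votes = Counter(group for _, group in ranked[:top_n])
    return votes.most_common(1)[0][0]
```

Setting `top_n=1` reduces the rule to picking the group of the single nearest historical patient.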
The method for intelligent grouping of similar patients in the embodiments of the present application has been described above. The apparatus for intelligent grouping of similar patients in the embodiments of the present application is described below. Referring to FIG. 4, one embodiment of the apparatus for intelligent grouping of similar patients includes:
a first acquisition module 401, configured to acquire new patient data to be matched;
a first processing module 402, configured to perform vectorization processing on each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient;
a first calculation module 403, configured to calculate, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database;
a sorting module 404, configured to sort the Mahalanobis distances to obtain a sorting result;
a determining module 405, configured to determine, based on the sorting result, the disease information group that matches the new patient data, where the disease information groups each contain different clinical outcome information.
Optionally, the first processing module 402 may be further specifically configured to:
acquire the type of the new patient data and, based on that type, determine the vectorization processing corresponding to the data and perform it, where the vectorization processing includes:
A. when the type of the new patient data is text data, performing vectorization processing on the text data;
B. when the type of the new patient data is discrete data, performing vectorization processing on the discrete data;
C. when the type of the new patient data is continuous data, not performing vectorization processing on the data.
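The type-dependent handling in cases A to C could look like the sketch below. The passage does not name the encoding used for each type, so the choices here are assumptions made only for illustration: word counts over a fixed vocabulary for text data, one-hot encoding for discrete data, and the raw numeric value for continuous data. The function and parameter names are hypothetical.

```python
from typing import List, Optional, Sequence


def vectorize_field(
    value,
    kind: str,
    vocabulary: Optional[Sequence[str]] = None,
    categories: Optional[Sequence] = None,
) -> List[float]:
    """Turn one disease feature into a list of numbers according to its type."""
    if kind == "text":
        # Case A (assumed encoding): count each vocabulary word in the text.
        if vocabulary is None:
            raise ValueError("text data needs a vocabulary")
        tokens = str(value).lower().split()
        return [float(tokens.count(word)) for word in vocabulary]
    if kind == "discrete":
        # Case B (assumed encoding): one-hot over the known categories.
        if categories is None:
            raise ValueError("discrete data needs a category list")
        return [1.0 if value == c else 0.0 for c in categories]
    if kind == "continuous":
        # Case C: continuous data is used as is, without vectorization.
        return [float(value)]
    raise ValueError(f"unknown data type: {kind}")
```

Concatenating the per-field outputs in a fixed field order would give one vector per patient that can be passed to the distance calculation.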
This embodiment of the present application provides a method for intelligent grouping of similar patients. New patient data is acquired and, when determining the disease group to which the new patient belongs, the Mahalanobis distance between the new patient data and the data of each sample (patient) in each preset disease group is calculated pairwise; the disease group to which the new patient data belongs is then determined from the values of these Mahalanobis distances. This solution belongs to the field of smart healthcare and can promote the construction of smart cities. The present application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. In addition, the disease group to which a new patient belongs can be determined from the Mahalanobis distance, and information such as the characteristics of the corresponding group can assist the doctor in decision-making. This improves the efficiency of determining the group to which a patient belongs and the accuracy of doctors' decisions.
Referring to FIG. 5, a second embodiment of the apparatus for intelligent grouping of similar patients in the embodiments of the present application includes:
a first acquisition module 501, configured to acquire new patient data to be matched;
a first processing module 502, configured to perform vectorization processing on each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient;
a first calculation module 503, configured to calculate, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database;
a sorting module 504, configured to sort the Mahalanobis distances to obtain a sorting result;
a determining module 505, configured to determine, based on the sorting result, the disease information group that matches the new patient data;
a second acquisition module 506, configured to acquire sample data containing outcome variables;
a second processing module 507, configured to preprocess the sample data based on the type of the sample data to obtain discretized word vectors;
a second calculation module 508, configured to calculate, based on the discretized word vectors, the pairwise Mahalanobis distances between the samples in the sample data;
a clustering module 509, configured to cluster the sample data based on the pairwise Mahalanobis distances between the samples in the sample data to obtain a grouping result;
an extraction module 510, configured to obtain, based on the grouping result, the multiple disease information groups contained in the sample data, and to extract the features of the disease information groups;
a query module 511, configured to query a preset disease condition description database based on the features of the disease information groups, and to output the disease condition descriptions corresponding to the features of the disease information groups.
Optionally, the first processing module 502 may be further specifically configured to:
acquire the type of the new patient data and, based on that type, determine the vectorization processing corresponding to the data and perform it, where the vectorization processing includes:
A. when the type of the new patient data is text data, performing vectorization processing on the text data;
B. when the type of the new patient data is discrete data, performing vectorization processing on the discrete data;
C. when the type of the new patient data is continuous data, not performing vectorization processing on the data.
Optionally, the clustering module 509 may be specifically configured to:
set the number of groups to k and randomly select k samples as initial cluster centers; calculate the Mahalanobis distance from each sample in the sample data to every cluster center; based on these distances, select the minimum Mahalanobis distance for each sample and assign each sample to the group of the cluster center corresponding to that minimum distance, until all samples in the sample data have been assigned, thereby obtaining the first grouping result;
Optionally, the clustering module 509 may be further specifically configured to:
calculate, according to the Mahalanobis distances, the sum of squared errors of the clustering for the first grouping result; in the non-initial case, calculate k non-initial cluster centers from the most recently generated grouping result; calculate the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, select the minimum Mahalanobis distance for each sample, and assign each sample to the group of the non-initial cluster center corresponding to that minimum distance, generating a new grouping result;
Optionally, the clustering module 509 may be further specifically configured to:
calculate, based on the Mahalanobis distances, the sum of squared errors of the clustering for the new grouping result; compare the sum of squared errors of the clustering for the first grouping result with that for the new grouping result to obtain a comparison result; and, based on the comparison result, select from the two grouping results the one whose clustering has the smallest sum of squared errors as the final grouping result.
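The first of the optional behaviours of clustering module 509, random selection of k samples as initial centers followed by the first assignment, can be sketched as follows. This is an illustrative sketch only: `first_grouping`, the `seed` parameter, and the caller-supplied inverse covariance matrix `inv_cov` are assumptions not found in the application.

```python
import random
from typing import List, Optional, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def sq_mahalanobis(x: Vector, y: Vector, inv_cov: Matrix) -> float:
    """Squared Mahalanobis distance (x - y)^T S^-1 (x - y), with inv_cov = S^-1."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))


def first_grouping(
    samples: Sequence[Vector],
    k: int,
    inv_cov: Matrix,
    seed: Optional[int] = None,
) -> Tuple[List[int], List[int]]:
    """Pick k distinct samples as centers, then assign every sample to its nearest center.

    Returns the indices of the chosen samples and one group label per sample.
    """
    if not 1 <= k <= len(samples):
        raise ValueError("k must be between 1 and the number of samples")
    rng = random.Random(seed)  # seed only makes the random choice repeatable
    center_ids = rng.sample(range(len(samples)), k)
    centers = [samples[i] for i in center_ids]
    labels = [
        min(range(k), key=lambda g: sq_mahalanobis(s, centers[g], inv_cov))
        for s in samples
    ]
    return center_ids, labels
```

Because the starting centers are random, different seeds can give different first groupings, which is why the later regrouping and comparison of the sums of squared errors are needed.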
This embodiment of the present application provides a method for intelligent grouping of similar patients. New patient data is acquired and, when determining the disease group to which the new patient belongs, the Mahalanobis distance between the new patient data and the data of each sample (patient) in each preset disease group is calculated pairwise; the disease group to which the new patient data belongs is then determined from the values of these Mahalanobis distances. This solution belongs to the field of smart healthcare and can promote the construction of smart cities. The present application makes the fullest possible use of the information in the sample (patient) data that doctors consider when making medical decisions. In addition, the disease group to which a new patient belongs can be determined from the Mahalanobis distance, and information such as the characteristics of the corresponding group can assist the doctor in decision-making. This improves the efficiency of determining the group to which a patient belongs and the accuracy of doctors' decisions.
FIG. 4 and FIG. 5 above describe the apparatus for intelligent grouping of similar patients in the embodiments of the present application in detail from the perspective of modular functional entities. The device for intelligent grouping of similar patients in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 6 is a schematic structural diagram of a device for intelligent grouping of similar patients provided by an embodiment of the present application. The device 600 for intelligent grouping of similar patients may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 610 (for example, one or more processors), a memory 620, and one or more storage media 630 (for example, one or more mass storage devices) storing application programs 633 or data 632. The memory 620 and the storage medium 630 may provide transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the device 600. Further, the processor 610 may be configured to communicate with the storage medium 630 and to execute, on the device 600, the series of instruction operations in the storage medium 630.
The device 600 for intelligent grouping of similar patients may also include one or more power supplies 640, one or more wired or wireless network interfaces 650, one or more input/output interfaces 660, and/or one or more operating systems 631, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 6 does not limit the device for intelligent grouping of similar patients, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The present application also provides a device for intelligent grouping of similar patients, including a memory and at least one processor, where the memory stores instructions and the memory and the at least one processor are interconnected by a line; the at least one processor invokes the instructions in the memory so that the device for intelligent grouping of similar patients executes the steps of the above method for intelligent grouping of similar patients.
The present application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to execute the following steps:
acquiring new patient data to be matched, where the new patient data contains multiple kinds of disease feature data;
performing vectorization processing on each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient;
calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database, where the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group;
sorting the Mahalanobis distances to obtain a sorting result;
determining, based on the sorting result, the disease information group that matches the new patient data, where the disease information groups each contain different clinical outcome information.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, and some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connections shown or discussed may be indirect coupling or communication connections through interfaces, apparatuses, or units, and may be electrical, mechanical, or of other forms.
As stated above, the foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A method for intelligent grouping of similar patients, comprising:
    acquiring new patient data to be matched, wherein the new patient data contains multiple kinds of disease feature data;
    performing vectorization processing on each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient;
    calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group;
    sorting the Mahalanobis distances to obtain a sorting result;
    determining, based on the sorting result, the disease information group that matches the new patient data, wherein the disease information groups each contain different clinical outcome information.
  2. The method for intelligent grouping of similar patients according to claim 1, wherein, before the step of acquiring new patient data to be matched, the method further comprises:
    acquiring sample data containing outcome variables;
    preprocessing the sample data based on the type of the sample data to obtain discretized word vectors;
    calculating, based on the discretized word vectors, the pairwise Mahalanobis distances between the samples in the sample data;
    clustering the sample data based on the pairwise Mahalanobis distances between the samples in the sample data to obtain a grouping result;
    obtaining, based on the grouping result, the multiple disease information groups contained in the sample data, and extracting the features of the disease information groups;
    querying a preset disease condition description database based on the features of the disease information groups, and outputting the disease condition descriptions corresponding to the features of the disease information groups.
  3. The method for intelligent grouping of similar patients according to claim 1, wherein performing vectorization processing on each item of disease feature data of the new patient to obtain the disease feature word vector corresponding to the new patient comprises:
    acquiring the type of the new patient data;
    determining, based on the type of the new patient data, the vectorization processing corresponding to the data and performing the vectorization processing;
    wherein the vectorization processing includes:
    A. when the type of the new patient data is text data, performing vectorization processing on the text data;
    B. when the type of the new patient data is discrete data, performing vectorization processing on the discrete data;
    C. when the type of the new patient data is continuous data, not performing vectorization processing on the data.
  4. The method for intelligent grouping of similar patients according to claim 2, wherein clustering the sample data based on the pairwise Mahalanobis distances between the samples in the sample data to obtain a grouping result comprises:
    setting the number of groups to k, and randomly selecting k samples as initial cluster centers;
    calculating the Mahalanobis distance from each sample in the sample data to every cluster center;
    selecting, based on the Mahalanobis distances from each sample to every cluster center, the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the cluster center corresponding to that minimum Mahalanobis distance, until all samples in the sample data have been assigned, thereby obtaining a first grouping result.
  5. The method for intelligent grouping of similar patients according to claim 4, wherein, after the step of assigning each sample to the group of the cluster center corresponding to the minimum Mahalanobis distance until all samples in the sample data have been assigned to obtain the first grouping result, the method further comprises:
    calculating, according to the Mahalanobis distances, the sum of squared errors of the clustering for the first grouping result;
    in the non-initial case, calculating k non-initial cluster centers from the most recently generated grouping result;
    calculating the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, selecting the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the non-initial cluster center corresponding to that minimum Mahalanobis distance, generating a new grouping result.
  6. The method for intelligent grouping of similar patients according to claim 5, wherein, after the step of calculating the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, selecting the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the non-initial cluster center corresponding to that minimum Mahalanobis distance to generate a new grouping result, the method further comprises:
    calculating, based on the Mahalanobis distances, the sum of squared errors of the clustering for the new grouping result;
    comparing the sum of squared errors of the clustering for the first grouping result with the sum of squared errors of the clustering for the new grouping result to obtain a comparison result;
    selecting, based on the comparison result, from the two grouping results the one whose clustering has the smallest sum of squared errors as the final grouping result.
  7. A device for intelligent grouping of similar patients, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring new patient data to be matched, wherein the new patient data contains multiple kinds of disease feature data;
    performing vectorization processing on each item of disease feature data of the new patient to obtain a disease feature word vector corresponding to the new patient;
    calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database, wherein the disease feature database contains multiple disease information groups and similar disease features belong to the same disease information group;
    sorting the Mahalanobis distances to obtain a sorting result;
    determining, based on the sorting result, the disease information group that matches the new patient data, wherein the disease information groups each contain different clinical outcome information.
  8. The device for intelligently grouping similar patients according to claim 7, wherein the processor, when executing the computer program, further implements the following steps:
    acquiring sample data containing outcome variables;
    preprocessing the sample data based on the type of the sample data to obtain discretized word vectors;
    calculating, based on the discretized word vectors, the Mahalanobis distance between every pair of samples in the sample data;
    clustering the sample data based on the pairwise Mahalanobis distances between the samples to obtain a grouping result;
    obtaining, based on the grouping result, the multiple disease information groups contained in the sample data, and extracting the features of the disease information groups;
    querying a preset disease symptom description library based on the features of the disease information groups, and outputting the disease symptom description corresponding to those features.
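The pairwise distance step of claim 8 can be illustrated with a short NumPy sketch. The function name `pairwise_mahalanobis` is hypothetical, and the covariance matrix is assumed to be estimated from the sample data itself, which the claim does not specify.

```python
import numpy as np

def pairwise_mahalanobis(samples):
    """Return the (n, n) matrix of Mahalanobis distances between all samples."""
    x = np.asarray(samples, dtype=float)
    # Assumption: one covariance matrix estimated from all samples.
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    inv_cov = np.linalg.pinv(cov)
    # diff[a, b] is the difference vector between sample a and sample b.
    diff = x[:, None, :] - x[None, :, :]
    d2 = np.einsum("abi,ij,abj->ab", diff, inv_cov, diff)
    return np.sqrt(np.maximum(d2, 0.0))
```

The resulting matrix is symmetric with a zero diagonal, so a clustering routine that accepts precomputed distances could consume it directly.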
  9. The device for intelligently grouping similar patients according to claim 7, wherein the processor, when executing the computer program, further implements the following steps:
    acquiring the type of the new patient data;
    determining, based on the type of the new patient data, the vectorization processing corresponding to the data, and performing that vectorization processing;
    wherein the vectorization processing includes:
    A. when the type of the new patient data is text data, vectorizing the text data;
    B. when the type of the new patient data is discrete data, vectorizing the discrete data;
    C. when the type of the new patient data is continuous data, not vectorizing the data.
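The type-dependent handling in items A to C can be sketched as a small dispatcher. Everything below is illustrative: the claim does not say how text or discrete values are vectorized, so the bag-of-words count over a fixed vocabulary, the one-hot encoding, and the names `vectorize`, `vectorize_patient`, and `schema` are assumptions.

```python
def vectorize(value, kind, vocabulary=None, categories=None):
    """Turn one field into a list of floats according to its data type."""
    if kind == "text":
        # Assumed scheme: count occurrences of each vocabulary term.
        tokens = value.lower().split()
        return [float(tokens.count(term)) for term in vocabulary]
    if kind == "discrete":
        # Assumed scheme: one-hot encoding over the known categories.
        return [1.0 if value == c else 0.0 for c in categories]
    if kind == "continuous":
        # Continuous values are passed through without vectorization.
        return [float(value)]
    raise ValueError(f"unknown data type: {kind}")

def vectorize_patient(record, schema):
    """Concatenate the per-field vectors in schema order."""
    vec = []
    for field, spec in schema.items():
        vec.extend(vectorize(record[field], **spec))
    return vec

# Hypothetical schema and record, for illustration only.
schema = {
    "complaint": {"kind": "text", "vocabulary": ["cough", "fever"]},
    "sex": {"kind": "discrete", "categories": ["F", "M"]},
    "age": {"kind": "continuous"},
}
record = {"complaint": "Fever and dry cough", "sex": "M", "age": 63}
print(vectorize_patient(record, schema))  # [1.0, 1.0, 0.0, 1.0, 63.0]
```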
  10. The device for intelligently grouping similar patients according to claim 8, wherein the processor, when executing the computer program, further implements the following steps:
    setting the number of groups to k, and randomly selecting k samples as the initial cluster centers;
    calculating the Mahalanobis distance from each sample in the sample data to each cluster center;
    selecting, based on the Mahalanobis distances from each sample to each cluster center, the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the cluster center corresponding to that minimum distance, until all samples in the sample data have been assigned, to obtain a first grouping result.
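The initialization and first assignment described in claim 10 resemble the first pass of k-means with the Euclidean distance swapped for the Mahalanobis distance. A minimal sketch, assuming a single covariance matrix estimated from all samples and a hypothetical function name `initial_assignment`:

```python
import numpy as np

def initial_assignment(samples, k, seed=0):
    """Pick k random samples as centers and assign every sample to the nearest."""
    x = np.asarray(samples, dtype=float)
    rng = np.random.default_rng(seed)
    # Randomly select k distinct samples as the initial cluster centers.
    centers = x[rng.choice(len(x), size=k, replace=False)]
    # Assumption: one shared covariance matrix for all distances.
    inv_cov = np.linalg.pinv(np.atleast_2d(np.cov(x, rowvar=False)))
    diff = x[:, None, :] - centers[None, :, :]  # shape (n, k, d)
    d2 = np.einsum("nki,ij,nkj->nk", diff, inv_cov, diff)
    # Each sample joins the group whose center is at minimum distance.
    labels = np.argmin(d2, axis=1)
    return labels, centers
```

Comparing squared distances is sufficient for the assignment, because the square root does not change which center is nearest.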
  11. The device for intelligently grouping similar patients according to claim 10, wherein the processor, when executing the computer program, further implements the following steps:
    calculating, according to the Mahalanobis distances, the sum of squared errors of the clustering corresponding to the first grouping result;
    in the non-initial case, calculating K non-initial cluster centers from the most recently generated grouping result;
    calculating the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, selecting the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the non-initial cluster center corresponding to that minimum distance, to generate a new grouping result.
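One refinement pass of claim 11 can be sketched as below. The claim does not define how the non-initial centers are computed or how the error is measured, so this sketch assumes the center of each group is the mean of its members and that the sum of squared errors is the sum of squared Mahalanobis distances from each sample to its assigned center. The helper names are hypothetical, and every group is assumed to be non-empty.

```python
import numpy as np

def sq_mahalanobis(x, centers, inv_cov):
    """Squared Mahalanobis distance from every sample to every center."""
    diff = x[:, None, :] - centers[None, :, :]
    return np.einsum("nki,ij,nkj->nk", diff, inv_cov, diff)

def sse(x, labels, centers, inv_cov):
    """Sum of squared distances from each sample to its assigned center."""
    d2 = sq_mahalanobis(x, centers, inv_cov)
    return float(d2[np.arange(len(x)), labels].sum())

def refine_once(x, labels, k, inv_cov):
    """Recompute centers from the previous grouping, then reassign samples."""
    # Assumption: a center is the mean of its group (groups must be non-empty).
    centers = np.stack([x[labels == j].mean(axis=0) for j in range(k)])
    new_labels = np.argmin(sq_mahalanobis(x, centers, inv_cov), axis=1)
    return new_labels, centers
```

Passing an identity matrix as `inv_cov` reduces the pass to ordinary Euclidean k-means, which is a convenient way to check the logic on toy data.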
  12. The device for intelligently grouping similar patients according to claim 11, wherein the processor, when executing the computer program, further implements the following steps:
    calculating, based on the Mahalanobis distances, the sum of squared errors of the clustering corresponding to the new grouping result;
    comparing the sum of squared errors of the clustering corresponding to the first grouping result with that of the clustering corresponding to the new grouping result, to obtain a comparison result;
    selecting, based on the comparison result, the grouping result whose clustering has the smaller sum of squared errors of the two as the final grouping result.
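The selection rule of claim 12 amounts to keeping whichever grouping has the smaller sum of squared errors. A minimal sketch, with the hypothetical names `Grouping` and `select_final`, written so that it also works if more than two candidate groupings are compared:

```python
from typing import NamedTuple, Sequence, Tuple

class Grouping(NamedTuple):
    labels: Tuple[int, ...]  # group index of each sample
    sse: float               # sum of squared errors of this grouping

def select_final(candidates: Sequence[Grouping]) -> Grouping:
    """Return the candidate grouping with the smallest sum of squared errors."""
    if not candidates:
        raise ValueError("at least one grouping result is required")
    # On a tie, min() keeps the earliest candidate in the sequence.
    return min(candidates, key=lambda g: g.sse)

# Illustrative values only, not results from the patent.
first = Grouping(labels=(0, 0, 1, 1, 1, 1), sse=147.25)
new = Grouping(labels=(0, 0, 0, 1, 1, 1), sse=39.4375)
final = select_final([first, new])
```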
  13. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    acquiring new patient data to be matched, the new patient data including multiple kinds of disease feature data;
    vectorizing each item of the new patient's disease feature data to obtain a disease feature word vector corresponding to the new patient;
    calculating, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database, wherein the disease feature database contains multiple disease information groups, and similar disease features belong to the same disease information group;
    sorting the Mahalanobis distances to obtain a sorting result;
    determining, based on the sorting result, the disease information group that matches the new patient data, wherein the disease information groups each contain different clinical outcome information.
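For reference, the standard definition of the Mahalanobis distance used throughout these claims is given below. The symbols are not taken from the claims: $x$ denotes the new patient's feature vector, $y_i$ the $i$-th historical patient's feature vector, and $S$ a covariance matrix of the feature vectors, assumed here to be estimated from the historical data.

```latex
d_M(x, y_i) = \sqrt{(x - y_i)^{\top} S^{-1} (x - y_i)}
```

When $S$ is the identity matrix this reduces to the Euclidean distance; otherwise differences along each feature are scaled by that feature's variance and adjusted for correlations between features.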
  14. The computer-readable storage medium according to claim 13, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    acquiring sample data containing outcome variables;
    preprocessing the sample data based on the type of the sample data to obtain discretized word vectors;
    calculating, based on the discretized word vectors, the Mahalanobis distance between every pair of samples in the sample data;
    clustering the sample data based on the pairwise Mahalanobis distances between the samples to obtain a grouping result;
    obtaining, based on the grouping result, the multiple disease information groups contained in the sample data, and extracting the features of the disease information groups;
    querying a preset disease symptom description library based on the features of the disease information groups, and outputting the disease symptom description corresponding to those features.
  15. The computer-readable storage medium according to claim 13, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    acquiring the type of the new patient data;
    determining, based on the type of the new patient data, the vectorization processing corresponding to the data, and performing that vectorization processing;
    wherein the vectorization processing includes:
    A. when the type of the new patient data is text data, vectorizing the text data;
    B. when the type of the new patient data is discrete data, vectorizing the discrete data;
    C. when the type of the new patient data is continuous data, not vectorizing the data.
  16. The computer-readable storage medium according to claim 12, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    setting the number of groups to k, and randomly selecting k samples as the initial cluster centers;
    calculating the Mahalanobis distance from each sample in the sample data to each cluster center;
    selecting, based on the Mahalanobis distances from each sample to each cluster center, the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the cluster center corresponding to that minimum distance, until all samples in the sample data have been assigned, to obtain a first grouping result.
  17. The computer-readable storage medium according to claim 16, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    calculating, according to the Mahalanobis distances, the sum of squared errors of the clustering corresponding to the first grouping result;
    in the non-initial case, calculating K non-initial cluster centers from the most recently generated grouping result;
    calculating the Mahalanobis distance from each sample in the sample data to each non-initial cluster center, selecting the minimum Mahalanobis distance for each sample, and assigning each sample to the group of the non-initial cluster center corresponding to that minimum distance, to generate a new grouping result.
  18. The computer-readable storage medium according to claim 17, wherein the computer instructions, when run on the computer, further cause the computer to perform the following steps:
    calculating, based on the Mahalanobis distances, the sum of squared errors of the clustering corresponding to the new grouping result;
    comparing the sum of squared errors of the clustering corresponding to the first grouping result with that of the clustering corresponding to the new grouping result, to obtain a comparison result;
    selecting, based on the comparison result, the grouping result whose clustering has the smaller sum of squared errors of the two as the final grouping result.
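The comparison in claims 17 and 18 can be written compactly under one assumed reading, in which the sum of squared errors of a grouping is the total squared Mahalanobis distance from each sample to the center of its group. The notation is introduced here for illustration only: $C_j^{(t)}$ is the $j$-th group of grouping result $t$, $c_j^{(t)}$ is its center, $S$ is the covariance matrix, $t = 1$ is the first grouping result, and $t = 2$ is the new one.

```latex
\mathrm{SSE}^{(t)} = \sum_{j=1}^{k} \sum_{x_i \in C_j^{(t)}}
  \bigl(x_i - c_j^{(t)}\bigr)^{\top} S^{-1} \bigl(x_i - c_j^{(t)}\bigr),
\qquad
t^{*} = \operatorname*{arg\,min}_{t \in \{1, 2\}} \mathrm{SSE}^{(t)}
```

The grouping result indexed by $t^{*}$ is then taken as the final grouping result.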
  19. An apparatus for intelligently grouping similar patients, wherein the apparatus comprises:
    a first acquisition module, configured to acquire new patient data to be matched, the new patient data including multiple kinds of disease feature data;
    a first processing module, configured to vectorize each item of the new patient's disease feature data to obtain a disease feature word vector corresponding to the new patient;
    a first calculation module, configured to calculate, based on the disease feature word vector, the Mahalanobis distance between the new patient data and each historical patient record in a preset disease feature database, wherein the disease feature database contains multiple disease information groups, and similar disease features belong to the same disease information group;
    a sorting module, configured to sort the Mahalanobis distances to obtain a sorting result;
    a determining module, configured to determine, based on the sorting result, the disease information group that matches the new patient data, wherein the disease information groups each contain different clinical outcome information.
  20. The apparatus for intelligently grouping similar patients according to claim 19, wherein
    the apparatus further comprises:
    a sample data acquisition module, configured to acquire sample data containing outcome variables;
    a second processing module, configured to preprocess the sample data based on the type of the sample data to obtain discretized word vectors;
    a second calculation module, configured to calculate, based on the discretized word vectors, the Mahalanobis distance between every pair of samples in the sample data;
    a clustering module, configured to cluster the sample data based on the pairwise Mahalanobis distances between the samples to obtain a grouping result;
    an extraction module, configured to obtain, based on the grouping result, the multiple disease information groups contained in the sample data, and to extract the features of the disease information groups;
    a query module, configured to query a preset disease symptom description library based on the features of the disease information groups, and to output the disease symptom description corresponding to those features.
PCT/CN2020/099566 2020-05-14 2020-06-30 Method, apparatus and device for intelligently grouping similar patients, and storage medium WO2021139116A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010405737.7A CN111739634A (en) 2020-05-14 2020-05-14 Method, device and equipment for intelligently grouping similar patients and storage medium
CN202010405737.7 2020-05-14

Publications (2)

Publication Number Publication Date
WO2021139116A1 (en) 2021-07-15
WO2021139116A9 (en) 2021-09-23

Family

ID=72647169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/099566 WO2021139116A1 (en) 2020-05-14 2020-06-30 Method, apparatus and device for intelligently grouping similar patients, and storage medium

Country Status (2)

Country Link
CN (1) CN111739634A (en)
WO (1) WO2021139116A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862897A (en) * 2023-02-21 2023-03-28 江西曼荼罗软件有限公司 Syndrome monitoring method and system based on clinical data
CN116344011A (en) * 2023-05-29 2023-06-27 肇庆市高要区人民医院 Medical record file establishment management method and system
CN116434975A (en) * 2023-06-14 2023-07-14 西安交通大学医学院第一附属医院 Breathe internal medicine device intelligent management system that clears away lung-heat and discharges phlegm

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN112735596A (en) * 2020-12-31 2021-04-30 神州医疗科技股份有限公司 Similar patient determination method and device, electronic equipment and storage medium
CN113470823A (en) * 2021-06-28 2021-10-01 康键信息技术(深圳)有限公司 User physiological period prediction method, device, equipment and storage medium
CN115500829A (en) * 2022-11-24 2022-12-23 广东美赛尔细胞生物科技有限公司 Depression detection and analysis system applied to neurology

Citations (6)

Publication number Priority date Publication date Assignee Title
US20050209785A1 (en) * 2004-02-27 2005-09-22 Wells Martin D Systems and methods for disease diagnosis
US20140095184A1 (en) * 2012-10-01 2014-04-03 International Business Machines Corporation Identifying group and individual-level risk factors via risk-driven patient stratification
CN108648827A (en) * 2018-05-11 2018-10-12 北京邮电大学 Cardiovascular and cerebrovascular disease Risk Forecast Method and device
CN108665975A (en) * 2017-03-30 2018-10-16 深圳欧德蒙科技有限公司 Clinical path matching process and system
CN109119134A (en) * 2018-08-09 2019-01-01 脉景(杭州)健康管理有限公司 Medical history data processing method, medical data recommender system, equipment and medium
CN109817339A (en) * 2018-12-14 2019-05-28 平安医疗健康管理股份有限公司 Patient's group technology and device based on big data

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN105184683A (en) * 2015-10-10 2015-12-23 华北电力科学研究院有限责任公司 Probability clustering method based on wind electric field operation data

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN115862897A (en) * 2023-02-21 2023-03-28 江西曼荼罗软件有限公司 Syndrome monitoring method and system based on clinical data
CN116344011A (en) * 2023-05-29 2023-06-27 肇庆市高要区人民医院 Medical record file establishment management method and system
CN116344011B (en) * 2023-05-29 2023-08-15 肇庆市高要区人民医院 Medical record file establishment management method and system
CN116434975A (en) * 2023-06-14 2023-07-14 西安交通大学医学院第一附属医院 Breathe internal medicine device intelligent management system that clears away lung-heat and discharges phlegm
CN116434975B (en) * 2023-06-14 2023-09-01 西安交通大学医学院第一附属医院 Breathe internal medicine device intelligent management system that clears away lung-heat and discharges phlegm

Also Published As

Publication number Publication date
CN111739634A (en) 2020-10-02
WO2021139116A9 (en) 2021-09-23

Similar Documents

Publication Publication Date Title
WO2021139116A1 (en) Method, apparatus and device for intelligently grouping similar patients, and storage medium
Wei et al. A comprehensive exploration to the machine learning techniques for diabetes identification
CN111292853B (en) Multi-parameter-based cardiovascular disease risk prediction network model and construction method thereof
Kandhasamy et al. Performance analysis of classifier models to predict diabetes mellitus
Vijayan et al. Study of data mining algorithms for prediction and diagnosis of diabetes mellitus
US7809660B2 (en) System and method to optimize control cohorts using clustering algorithms
Devi et al. A novel hybrid approach for diagnosing diabetes mellitus using farthest first and support vector machine algorithms
Hunt et al. Theory & Methods: Mixture model clustering using the MULTIMIX program
Duggal et al. Prediction of thyroid disorders using advanced machine learning techniques
US20090287503A1 (en) Analysis of individual and group healthcare data in order to provide real time healthcare recommendations
Karthikeyani et al. Comparative of data mining classification algorithm (CDMCA) in diabetes disease prediction
Ramani et al. MapReduce-based big data framework using modified artificial neural network classifier for diabetic chronic disease prediction
Pillai et al. Prediction of heart disease using rnn algorithm
Kangra et al. Comparative analysis of predictive machine learning algorithms for diabetes mellitus
CN111986814A (en) Modeling method of lupus nephritis prediction model of lupus erythematosus patient
Nabi et al. Machine learning approach: Detecting polycystic ovary syndrome & it's impact on bangladeshi women
Anasanti et al. The Exploring feature selection techniques on Classification Algorithms for Predicting Type 2 Diabetes at Early Stage
Sakib et al. Performance analysis of machine learning approaches in diabetes prediction
Theodoraki et al. Innovative data mining approaches for outcome prediction of trauma patients
CN114242178A (en) Method for quantitatively predicting biological activity of ER alpha antagonist based on gradient lifting decision tree
Faris et al. An intelligence model for detection of PCOS based on K‐means coupled with LS‐SVM
Yadav et al. Discovery of thyroid Disease using different ensemble methods with reduced error pruning technique
Jabbar et al. Risks of chronic kidney disease prediction using various data mining algorithms
Ganesh et al. Diabetes Prediction using Logistic Regression and Feature Normalization
Shruthi et al. Diabetes prediction using machine learning technique

Legal Events

Date Code Title Description
121 EP: The EPO has been informed by WIPO that EP was designated in this application. Ref document number: 20912309; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 20912309; Country of ref document: EP; Kind code of ref document: A1