CN111243753A - Medical data-oriented multi-factor correlation interactive analysis method - Google Patents


Info

Publication number
CN111243753A
Authority
CN
China
Prior art keywords
medical data
data
features
feature
patient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010125946.6A
Other languages
Chinese (zh)
Other versions
CN111243753B (en
Inventor
钱步月
刘涛
郑莹倩
刘璇
吕欣
许靖琴
侯梦薇
吴风浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University filed Critical Xian Jiaotong University
Priority to CN202010125946.6A priority Critical patent/CN111243753B/en
Publication of CN111243753A publication Critical patent/CN111243753A/en
Application granted granted Critical
Publication of CN111243753B publication Critical patent/CN111243753B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211Selection of the most significant subset of features
    • G06F18/2113Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Quality & Reliability (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

The invention discloses a medical data-oriented multi-factor correlation interactive analysis method comprising the following steps: processing the acquired medical data and associating the processed data by patient case number to obtain a visit sequence for each patient; mapping the visit sequences onto a two-dimensional plane with the t-SNE algorithm to form distinct feature groups; selecting a feature group of interest from the feature groups; setting a disease characterization index; performing feature selection on the selected feature group to determine the features related to the disease characterization index; and measuring the correlation among the selected features with statistical metrics to obtain statistically significant results, which completes the multi-factor correlation interactive analysis. The invention enables interactive analysis of high-dimensional medical data and visually displays the key factors influencing disease development.

Description

Medical data-oriented multi-factor correlation interactive analysis method
Technical Field
The invention belongs to the technical field of multi-factor correlation analysis, and particularly relates to a medical data-oriented multi-factor correlation interactive analysis method.
Background
Medical statistics is the science of collecting, organizing, analyzing, presenting and interpreting data in medicine and related fields, chiefly by applying the principles and methods of statistics. In clinical medical research, multi-factor correlation analysis is carried out on existing clinical data, combined with existing medical knowledge, by computing statistics such as the Pearson correlation coefficient, in order to determine the key factors that strongly influence disease development. However, medical data is high-dimensional and complex; the traditional methods require heavy computation and produce abstract results that are hard to interpret, which hinders doctors in diagnosis and research. Moreover, disease development usually depends on many factors, whereas the traditional methods can only measure the correlation between two factors at a time, which limits the validity of the results.
In summary, a new multi-factor correlation interactive analysis method oriented to high-dimensional medical data is needed.
Disclosure of Invention
The present invention is directed to a method for interactive analysis of medical data based on multi-factor correlation, which solves one or more of the above problems. The invention can interactively analyze the high-dimensional medical data and visually display key factors influencing the disease development.
In order to achieve the purpose, the invention adopts the following technical scheme:
the invention discloses a medical data-oriented multi-factor correlation interactive analysis method, which comprises the following steps of:
step 1, processing the acquired medical data, and associating the processed medical data according to the patient case number to obtain a treatment sequence of each patient; wherein the processing comprises a normalization processing;
step 2, mapping the treatment sequence obtained in the step 1 to a two-dimensional plane by using a t-SNE algorithm to form different characteristic groups; selecting a characteristic group from the characteristic groups according to the requirement;
step 3, setting disease characterization indexes; performing feature selection on the features of the feature population selected in the step 2, and determining a feature sequence related to the disease characterization index;
step 4, measuring the correlation among the features selected in step 3 with statistical metrics to obtain statistically significant results, and completing the multi-factor correlation interactive analysis.
In a further improvement of the present invention, in step 1, the specific step of processing the acquired medical data includes:
(1.1) eliminating irrelevant features and private data in the medical data; wherein the irrelevant features include the patient name and patient serial number, and the private data include the patient identification number and patient mobile phone number;
(1.2) eliminating missing values and abnormal values in the medical data; wherein the missing values include null and "-", and the abnormal values (outliers) include values that violate medical knowledge and values that violate common sense;
(1.3) eliminating completely duplicated data in the medical data;
(1.4) normalizing the numerical data in the medical data, comprising: for each value $x_i$ of the same feature,
$$x_i' = \frac{x_i - \min(X)}{\max(X) - \min(X)},$$
where X is the set of all values of a numerical feature, $x_i$ denotes the i-th element in X, i = 1, 2, 3, ..., n, n denotes the total number of elements, min(X) denotes the minimum value in set X, and max(X) denotes the maximum value in set X;
(1.5) encoding the categorical data in the medical data to obtain an encoding vector Y; wherein the coding format is as follows:
Figure BDA0002394383090000022
wherein $y_k$ denotes the k-th value in the encoded vector, k = 1, 2, 3, ....
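As an illustration of step (1.4), the sketch below assumes the normalization is ordinary min-max scaling, which is what the definitions of min(X) and max(X) suggest. The function name `min_max_normalize`, the sample values, and the handling of a constant feature are assumptions made for this example, not part of the described method.

```python
def min_max_normalize(values):
    """Scale numbers to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature would divide by zero; returning zeros is an
        # assumed convention, not something the text specifies.
        return [0.0 for _ in values]
    return [(x - lo) / (hi - lo) for x in values]


ages = [20, 35, 50, 80]  # made-up values for illustration
normalized_ages = min_max_normalize(ages)
print(normalized_ages)  # [0.0, 0.25, 0.5, 1.0]
```

After scaling, every numerical feature lies in the same range, so features with large raw magnitudes do not dominate distance-based steps such as the t-SNE mapping.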
In a further improvement of the present invention, in step 1, a visit sequence T for each patient is obtained, the expression of which is:
$T = \{x_a, y_b, z_c, \ldots\}$,
in the formula, $x_a, y_b, z_c$ (a, b, c = 1, 2, 3, ..., l) each represent a different type of medical data belonging to the same patient; l represents the number of elements of each type of medical data;
in step 2, a feature group G to be studied is selected from the feature groups as required, the expression of which is:
$G = \{T_1, T_2, \ldots, T_p, \ldots, T_d\}$,
in the formula, $T_p$ represents the visit sequence of the p-th patient in the feature group to be studied, and d = 1, 2, 3, ....
The invention has the further improvement that the step 3 specifically comprises the following steps:
(3.1) when a disease characterization index is set, the disease characterization index is interactively specified;
(3.2) when performing feature selection on the selected feature group and determining the features related to the disease characterization index, removing the features whose variance is smaller than a threshold to obtain the remaining features; and ranking the remaining features by their relevance to the disease characterization index, determining the k features most critical to disease characterization, and thereby completing feature selection and feature ranking.
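The variance filter in (3.2) can be sketched as follows. This is a minimal example on made-up data; the function name `drop_low_variance` and the feature names are hypothetical. One assumption is flagged in the code: with a threshold of 0, "smaller than the threshold" would remove nothing, so the sketch drops features whose variance is less than or equal to the threshold.

```python
import numpy as np


def drop_low_variance(X, names, threshold=0.0):
    """Keep only the columns whose variance exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    variances = X.var(axis=0)
    # Assumption: use "<= threshold" for removal so that a threshold of 0
    # actually removes constant features.
    keep = variances > threshold
    kept_names = [n for n, k in zip(names, keep) if k]
    return X[:, keep], kept_names


# Three patients, three features; feature_b is constant across patients.
X = [[0.1, 1.0, 0.3],
     [0.5, 1.0, 0.3],
     [0.9, 1.0, 0.7]]
names = ["feature_a", "feature_b", "feature_c"]
X_kept, kept = drop_low_variance(X, names)
print(kept)  # ['feature_a', 'feature_c']
```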
A further improvement of the invention is that in step (3.2), the step of ranking the remaining features according to relevance to disease characterization indicators, the step of determining the k features most critical to disease progression specifically comprises:
(3.2.1) constructing a classifier that uses decision trees as base learners, denoted F;
(3.2.2) sending the data of the remaining features into the classifier F to predict the disease characterization index P and obtain a reference prediction result O, the expression of which is:
$O = F(t_1, t_2, \ldots, t_q, \ldots, t_e)$,
in the formula, $t_q$ (q = 1, 2, ..., e) denotes the data containing the q-th feature, and e denotes the number of features;
(3.2.3) sending the data from which the r-th feature has been removed to the classifier for prediction to obtain a prediction result $O_r$, the expression of which is:
$O_r = F(t_1, t_2, \ldots, t_{r-1}, t_{r+1}, \ldots, t_e)$;
(3.2.4) calculating the difference between the prediction result $O_r$ and the reference prediction result O as the degree of influence $\Delta O_r$ of the r-th feature on disease development, the expression of which is:
$\Delta O_r = |O_r - O|$,
in the formula, $\Delta O_r$ (r = 1, 2, 3, ..., e) denotes the degree of influence of the r-th feature on disease development; the larger $\Delta O_r$ is, the greater the influence of the r-th feature on disease development and the more critical the feature;
(3.2.5) repeating steps (3.2.3) and (3.2.4) until the degree of influence $\Delta O$ on disease development has been obtained for all features;
(3.2.6) sorting the features by the magnitude of this criticality measure to obtain the s most critical features, the expression of which is:
$\{t_1, t_2, \ldots, t_s\} = \mathrm{sort}(\Delta O_1, \Delta O_2, \ldots, \Delta O_e)$,
in the formula, sort() represents a sorting function.
In a further improvement of the present invention, in step 4, the statistical metrics include: the Pearson correlation coefficient, the u test, the t test, analysis of variance, and univariate or multiple regression analysis based on the central limit theorem.
In a further improvement of the present invention, the method also comprises the following step:
step 5, visualizing the correlation among the s most critical features obtained in step (3.2.6).
In a further improvement of the present invention, step 5 specifically comprises the following steps:
(5.1) drawing a parallel coordinate system of the features, taking each feature obtained by feature selection as a vertical axis and the visit sequence of each patient as the horizontal axis, so as to visually display how the different features vary in dependence on one another;
(5.2) selecting two features, mapping the data onto a two-dimensional plane with the two features as coordinate axes, and visually displaying the correlation between the two features.
Compared with the prior art, the invention has the following beneficial effects:
the multi-factor correlation interactive analysis method for the high-dimensional medical data, provided by the invention, designs a complete process from the original clinical medical data to the final correlation visualization result, and can directly display the dependence change rule among key features in the high-dimensional medical data.
The method first processes the acquired raw clinical medical data: invalid information, sensitive information, missing values and abnormal values are removed, numerical data is normalized and categorical data is encoded, and the records are spliced by case number to generate a visit sequence for each patient. The high-dimensional visit sequence data is then mapped onto a two-dimensional plane to generate feature groups, from which a doctor interactively selects the group to be studied. Feature selection is then performed on the data of that group of patients: a criticality measure of each feature with respect to the final prediction result is calculated, the most critical features are selected after sorting, and a statistical hypothesis test is applied to the selected features to verify the statistical significance of the correlations among them. Finally, a parallel coordinate system and a two-dimensional coordinate system are used to visualize, respectively, the dependency relationships among all the features and between each pair of features, so that the influence of different features on disease development can be analyzed. By visually displaying the dependency relationships among features in high-dimensional medical data, the invention reveals the key factors in disease development and provides statistically significant evidence.
The analysis method can overcome the defect that high-dimensional medical data is complex in abstraction and difficult to analyze; in the traditional method, the analysis process of high-dimensional medical data is calculated by means of complex statistics, the calculated amount is large, the calculation principle is difficult to understand, and great difficulty is caused to clinical diagnosis and medical research. The invention simplifies the whole analysis process, introduces medical knowledge of doctors through an interactive method, visually displays the whole analysis process, reduces the calculated amount and makes the dependency change relationship between variables easier to understand.
The invention considers the multi-factor dependence relationship among high-dimensional medical data; in the traditional method, the analysis of high-dimensional medical data only can analyze the relation between two variables by means of pairwise variation relation between the variables, and the multivariate relation cannot be modeled. The invention adopts the methods of dimension reduction mapping, feature selection, parallel coordinate system drawing and the like, fully considers the dependency change relationship among the multivariable, and leads the analysis result of the high-dimensional medical data to be more accurate.
The method is suitable for clinical medical data under various diseases, and has strong expandability; under the traditional method, a special analysis algorithm needs to be designed according to different diseases and different data types, and the method can hardly be expanded. The method of the invention is independent of specific data types, and all forms of clinical medical data can be analyzed and displayed by using the method of the invention, and the method can be adapted to the analysis requirements of different diseases.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below; it is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a schematic block diagram of a flow chart of a medical data-oriented multi-factor correlation interactive analysis method according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a flow of a feature selection method in a method according to an embodiment of the invention;
FIG. 3 is a diagram illustrating the visualization result of feature clusters in the method according to the embodiment of the present invention;
FIG. 4 is a diagram illustrating the visualization results of all key features in the method according to the embodiment of the present invention;
FIG. 5 is a schematic diagram of a partial feature correlation visualization result in the method according to the embodiment of the present invention.
Detailed Description
In order to make the purpose, technical effect and technical solution of the embodiments of the present invention clearer, the following clearly and completely describes the technical solution of the embodiments of the present invention with reference to the drawings in the embodiments of the present invention; it is to be understood that the described embodiments are only some of the embodiments of the present invention. Other embodiments, which can be derived by one of ordinary skill in the art from the disclosed embodiments without inventive faculty, are intended to be within the scope of the invention.
The embodiment of the invention provides a medical data-oriented multi-factor correlation interactive analysis method, which comprises the following steps:
step 1, normalizing the collected clinical medical data and associating it by patient case number to obtain a visit sequence for each patient;
step 2, mapping the high-dimensional visit data onto a two-dimensional plane with the t-SNE algorithm to form different feature groups; the doctor reviews the distribution of all feature groups and selects the feature group to be studied;
step 3, the doctor specifies a disease characterization index; feature selection is performed on the features in the visit sequences to determine the features most strongly related to the index;
step 4, measuring the correlation among the features with statistical methods to obtain statistically significant results;
step 5, visualizing the correlation among the features.
Preferably, step 1 specifically comprises:
step 1.1, analyzing the collected clinical medical data and eliminating the irrelevant features and private data in it. Irrelevant features include the patient name, patient number, etc., which have no effect on the severity of the patient's illness and should therefore be removed; private data includes the patient identification number, mobile phone number, etc., which can identify the individual patient and easily create ethical risks, and therefore also needs to be removed. In particular, features such as native place and home address, which are sensitive but do influence the severity of illness, should undergo fuzzy processing: only coarse information such as nationality and province is retained, and the specific sensitive information is removed;
step 1.2, eliminating missing values and abnormal values in the collected clinical medical data. Missing values are values that carry no meaning, including null, "-", etc.; they have no definite medical meaning and adversely affect the final result, so they should be handled. Abnormal values (outliers) are clearly incorrect values, including values that violate medical knowledge and values that violate common sense; they can strongly disturb the final result and should be handled as well. Both kinds of value are handled as follows: if the missing values and outliers make up less than 1/10 of the total data, the affected records are deleted; otherwise, the missing value or outlier is replaced with the mean of its data column;
step 1.3, eliminating duplicate data in the collected clinical medical data. Duplicate data is of two types: one is fully duplicated data, which is deduplicated so that only the last record is kept; the other is records that differ in part, which can be understood as examination records of a patient at different times, all of which should be retained;
step 1.4, normalizing the numerical data in the collected clinical medical data, i.e. for each value $x_i$ of the same feature:
$$x_i' = \frac{x_i - \min(X)}{\max(X) - \min(X)},$$
where X is the set of all values of a numerical feature, $x_i$ denotes the i-th element in X, i = 1, 2, 3, ..., n, min(X) denotes the minimum value in X, and max(X) denotes the maximum value in X;
step 1.5, encoding the categorical data in the collected clinical medical data, converting it into a format that the algorithm can use. The converted coding format is:
Figure BDA0002394383090000072
where Y is the converted code vector and $y_i$ denotes the i-th value in the vector, i = 1, 2, 3, ....
Step 1.6, splicing the processed data by patient case number to generate the visit sequence T of a single patient:
$T = \{x_i, y_j, z_k, \ldots\}$,
in the formula, $x_i, y_j, z_k$ (i, j, k = 1, 2, 3, ..., n) respectively represent different types of clinical medical data that belong to the same patient and share the same case number. T is a high-dimensional vector representing the visit sequence of a single patient.
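Steps 1.1 to 1.3 and 1.6 can be sketched with pandas as below. This is only an illustration under stated assumptions: the column names (`case_number`, `name`, `gender`, `age`, `wbc`), the valid ranges used to flag outliers, and the reading of the 1/10 rule as "fraction of affected records" are all hypothetical choices made for the example, since the text does not give the actual schema.

```python
import pandas as pd

PRIVATE_OR_IRRELEVANT = ["name", "id_number", "phone"]  # hypothetical column names


def clean(df, numeric_ranges, max_drop_fraction=0.1):
    """Steps 1.1-1.3 for one table; numeric_ranges maps column -> (low, high)."""
    # Step 1.1: drop irrelevant and private columns.
    df = df.drop(columns=[c for c in PRIVATE_OR_IRRELEVANT if c in df.columns])
    # Step 1.3: of fully duplicated records, keep only the last one.
    df = df.drop_duplicates(keep="last").copy()
    # Step 1.2: "-" and other non-numeric entries become NaN, as do values
    # outside the assumed plausible range.
    cols = list(numeric_ranges)
    for col, (low, high) in numeric_ranges.items():
        values = pd.to_numeric(df[col], errors="coerce")
        df[col] = values.where(values.between(low, high))
    bad = df[cols].isna().any(axis=1)
    if bad.mean() < max_drop_fraction:
        df = df[~bad]                      # few affected records: delete them
    else:
        df = df.fillna(df[cols].mean())    # otherwise impute the column mean
    return df.reset_index(drop=True)


basic = pd.DataFrame({
    "case_number": [1, 2, 2, 3],
    "name": ["A", "B", "B", "C"],
    "gender": ["male", "female", "female", "male"],
    "age": ["30", "-", "-", "50"],
})
labs = pd.DataFrame({"case_number": [1, 2, 3], "wbc": [5.0, 300.0, 7.0]})

basic = clean(basic, {"age": (0, 120)})
labs = clean(labs, {"wbc": (0, 100)})
basic["gender"] = basic["gender"].map({"male": 0, "female": 1})

# Step 1.6: splice the tables by case number into one row per patient.
visits = basic.merge(labs, on="case_number", how="inner")
print(visits)
```

In this toy data one record in three is affected in each table, so the missing age and the implausible `wbc` value are replaced by the column means (40 and 6.0) rather than deleted.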
Specifically, step 2 includes:
step 2.1, for the patient visit sequences T obtained in the previous step, mapping the high-dimensional visit sequence vectors onto a two-dimensional plane with the t-SNE algorithm to generate different feature groups.
The t-SNE algorithm can be expressed as follows:
for n high-dimensional data points $x_1, x_2, \ldots, x_n$, the Euclidean distances between data points are transformed into joint probabilities that characterize similarity, according to the formulas:
Figure BDA0002394383090000081
Figure BDA0002394383090000082
wherein:
Figure BDA0002394383090000083
the objective function is expressed as:
Figure BDA0002394383090000084
where P is the joint probability distribution of each point in the high-dimensional space and Q is the joint probability distribution of each point in the low-dimensional space.
The optimized gradient is as follows:
Figure BDA0002394383090000085
definition of the perplexity:
Figure BDA0002394383090000091
$H(P_i) = -\sum_j p_{ij} \log_2 p_{ij}$
the concrete solving steps are as follows:
Figure BDA0002394383090000092
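For reference, the standard t-SNE formulation, which is consistent with the description above, can be written as follows. That the formula figures contain exactly these expressions is an assumption; $y_i$ (the two-dimensional image of $x_i$) and $\sigma_i$ (the Gaussian bandwidth chosen for point $x_i$ so that the perplexity reaches a preset value) are symbols not defined in the text.

```latex
% Similarities in the high-dimensional space
p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarities in the low-dimensional space (Student t kernel)
q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}

% Objective: Kullback-Leibler divergence between P and Q
C = \mathrm{KL}(P \,\|\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

% Gradient used in the optimization
\frac{\partial C}{\partial y_i}
  = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}

% Perplexity
\mathrm{Perp}(P_i) = 2^{H(P_i)}
```

In the standard formulation the entropy $H(P_i)$ is taken over the conditional probabilities $p_{j|i}$.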
step 2.2, the doctor interactively selects the group G to be studied from the feature groups generated in the previous step:
$G = \{T_1, T_2, \ldots, T_i\}$,
in the formula, $T_i$ represents the visit sequence of the i-th patient in the selected feature group, i = 1, 2, 3, ....
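A minimal sketch of step 2 using scikit-learn's `TSNE` is shown below. The data is synthetic, and the rectangular selection in `select_group` is a hypothetical stand-in for the doctor's interactive choice, which the text does not specify; the perplexity value is likewise an arbitrary choice for this small sample.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for visit sequences: two groups of 29-dimensional records.
group_a = rng.normal(loc=0.0, scale=0.05, size=(30, 29))
group_b = rng.normal(loc=1.0, scale=0.05, size=(30, 29))
visits = np.vstack([group_a, group_b])

# Step 2.1: map the high-dimensional visit sequences onto a 2-D plane.
embedding = TSNE(n_components=2, perplexity=10, init="pca",
                 random_state=0).fit_transform(visits)


def select_group(embedding, x_range, y_range):
    """Return the indices of the points lying inside a rectangle."""
    x, y = embedding[:, 0], embedding[:, 1]
    mask = ((x >= x_range[0]) & (x <= x_range[1]) &
            (y >= y_range[0]) & (y <= y_range[1]))
    return np.flatnonzero(mask)


# Step 2.2: emulate "select the group on the right" with the right half-plane.
x_mid = (embedding[:, 0].min() + embedding[:, 0].max()) / 2
selected = select_group(embedding, (x_mid, np.inf), (-np.inf, np.inf))
G = visits[selected]  # visit sequences of the selected feature group
print(embedding.shape, G.shape)
```

In an interactive tool the rectangle would come from a brush or lasso selection on the scatter plot rather than from a fixed rule.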
Specifically, step 3 includes:
step 3.1, the doctor specifies a characterization index P of the degree of disease development. The indices used medically to characterize the degree of disease development differ from disease to disease, so the doctor needs to specify the characterization index interactively for the specific problem in order to measure the severity of the disease;
step 3.2, removing low-variance features. A feature whose variance is less than the threshold varies little across all patients, which indicates that the feature has little effect on the patient's disease development, so it should be removed. In this scheme the threshold is 0;
step 3.3, ranking all features by their correlation with the degree of disease development, determining the k features most critical to disease development, and completing the feature selection. The specific steps are as follows:
step 3.3.1, constructing a classifier that uses decision trees as base learners, denoted F;
step 3.3.2, sending the data containing all the features into the classifier to predict the disease development index P and obtain a reference prediction result O:
$O = F(t_1, t_2, \ldots, t_n)$,
in the formula, $t_i$ (i = 1, 2, ..., n) denotes the data containing the i-th feature, and n denotes the number of features.
Step 3.3.3, sending the data without the i-th feature to the classifier for prediction again to obtain a new prediction result $O_i$:
$O_i = F(t_1, t_2, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n)$,
in the formula, $t_i$ (i = 1, 2, ..., n) denotes the data containing the i-th feature, and n denotes the number of features.
Step 3.3.4, calculating the difference between the prediction result $O_i$ obtained after removing the i-th feature and the reference prediction result O as the degree of influence $\Delta O_i$ of the feature on disease development:
$\Delta O_i = |O_i - O|$,
in the formula, $\Delta O_i$ (i = 1, 2, 3, ..., n) denotes the degree of influence of the i-th feature on disease development; the larger the value, the greater the influence of the feature on disease development, i.e., the more critical the feature.
Step 3.3.5, repeating the above process until the criticality measure $\Delta O_i$ has been obtained for all n features.
Step 3.3.6, sorting all features by $\Delta O_i$ in descending order to obtain the k most important features, completing the feature selection:
$\{t_1, t_2, \ldots, t_k\} = \mathrm{sort}(\Delta O_1, \Delta O_2, \ldots, \Delta O_n)$,
where sort() represents the sorting function, of which the top k values are taken.
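Steps 3.3.1 to 3.3.6 can be sketched as a leave-one-feature-out loop. Two choices below are assumptions rather than statements of the method: a random forest is used as one possible "classifier with decision trees as base learners", and the prediction result O is summarized as cross-validated accuracy, since the text does not say how O is turned into a number. The data and the names `score` and `rank_features` are invented for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def score(X, y):
    """Prediction result summarized as mean 3-fold accuracy (an assumption)."""
    clf = RandomForestClassifier(n_estimators=50, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()


def rank_features(X, y, names, k):
    """Return the k features with the largest delta = |O_i - O|."""
    baseline = score(X, y)                        # reference result O
    deltas = []
    for i in range(X.shape[1]):
        reduced = np.delete(X, i, axis=1)         # data without the i-th feature
        deltas.append(abs(score(reduced, y) - baseline))
    order = np.argsort(deltas)[::-1]              # descending by delta
    return [(names[i], deltas[i]) for i in order[:k]]


rng = np.random.default_rng(0)
X = rng.random((150, 4))
y = (X[:, 0] > 0.5).astype(int)   # synthetic index P that depends only on f0
names = ["f0", "f1", "f2", "f3"]

top = rank_features(X, y, names, k=2)
print(top)
```

On this synthetic data removing `f0` collapses the accuracy to roughly chance level, so `f0` is ranked first; removing any other feature changes the score only slightly.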
Specifically, the statistical metrics in step 4 include the Pearson correlation coefficient, the u test, the t test, analysis of variance (one-way analysis of variance, multivariate analysis of variance, etc.), and univariate or multiple regression analysis based on the central limit theorem. The main purpose of this step is to perform statistical analysis on the k features selected in the feature selection of the previous step, so as to confirm that they satisfy statistical hypothesis tests.
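As one concrete instance of step 4, the sketch below computes the Pearson correlation coefficient between two features and the associated t statistic, t = r * sqrt((n - 2) / (1 - r^2)), which has n - 2 degrees of freedom under the hypothesis of no linear correlation. The sample values are invented, and the other listed tests are not shown.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: no linear correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


feature_a = [1, 2, 3, 4, 5]   # made-up values of one selected feature
feature_b = [2, 4, 5, 4, 5]   # made-up values of another selected feature

r = pearson_r(feature_a, feature_b)
t = t_statistic(r, len(feature_a))
print(round(r, 4), round(t, 4))  # 0.7746 2.1213
```

The t value is then compared with the critical value of the t distribution at the chosen significance level to decide whether the correlation is statistically significant.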
Specifically, step 5 includes:
step 5.1, drawing a parallel coordinate system of the features, taking each feature selected in the feature selection process as a vertical axis and the visit sequence of each patient as the horizontal axis, and visually displaying how the different features vary in dependence on one another;
step 5.2, selecting two strongly correlated features, mapping the data containing these features onto a two-dimensional plane with the two features as coordinate axes, and visually displaying the correlation between the two features.
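The two views in step 5 can be sketched with matplotlib as below. The feature names and data are placeholders, and drawing each patient as one polyline across evenly spaced vertical feature axes is one common way to realize a parallel coordinate plot, assumed here rather than taken from the text.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]  # placeholders
rng = np.random.default_rng(0)
data = rng.random((20, len(feature_names)))  # synthetic normalized values, 20 patients

fig, (ax_par, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Step 5.1: parallel coordinates, one vertical axis per feature,
# one polyline per patient.
positions = np.arange(len(feature_names))
for row in data:
    ax_par.plot(positions, row, color="tab:blue", alpha=0.4)
for pos in positions:
    ax_par.axvline(pos, color="black", linewidth=0.8)
ax_par.set_xticks(positions)
ax_par.set_xticklabels(feature_names)
ax_par.set_ylabel("normalized value")

# Step 5.2: scatter plot of two chosen features.
ax_scatter.scatter(data[:, 0], data[:, 1])
ax_scatter.set_xlabel(feature_names[0])
ax_scatter.set_ylabel(feature_names[1])

fig.tight_layout()
```

Calling `fig.savefig(...)` with a file name of your choice writes the figure to disk; in an interactive session the `Agg` line can be dropped and `plt.show()` used instead.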
In summary, the multi-factor correlation interactive analysis method for high-dimensional medical data provided by the embodiment of the invention covers the complete process from raw clinical medical data to the final correlation visualization result, and can directly show how key features in high-dimensional medical data vary in dependence on one another. The method first processes the acquired raw clinical medical data: invalid information, sensitive information, missing values and abnormal values are removed, numerical data is normalized and categorical data is encoded, and the records are spliced by case number to generate a visit sequence for each patient. The high-dimensional visit sequence data is then mapped onto a two-dimensional plane to generate feature groups, from which a doctor interactively selects the group to be studied. Feature selection is then performed on the data of that group of patients: a criticality measure of each feature with respect to the final prediction result is calculated, the most critical features are selected after sorting, and a statistical hypothesis test is applied to the selected features to verify the statistical significance of the correlations among them. Finally, a parallel coordinate system and a two-dimensional coordinate system are used to visualize, respectively, the dependency relationships among all the features and between each pair of features, so that the influence of different features on disease development can be analyzed. By visually displaying the dependency relationships among features in high-dimensional medical data, the invention reveals the key factors in disease development and provides statistically significant evidence.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
In this embodiment, diagnosis and treatment data from the oncology department of the Second Affiliated Hospital of Xi'an Jiaotong University are used to explain the implementation of the method. Note that, as an example, this embodiment lists only part of the data fields and visualization results to illustrate the implementation; actual clinical medical data far exceed what is listed.
In an embodiment of the invention, the clinical medical data comprises:
Figure BDA0002394383090000121
Figure BDA0002394383090000131
Figure BDA0002394383090000141
First, the collected data are processed: fields such as name, date of birth, identification card number and telephone number are removed, and categorical information such as gender is encoded; for example, male is coded as 0 and female as 1, and Han ethnicity is coded as 000000 and Hui ethnicity as 000001. The data are then normalized, and missing values, outliers and duplicates are removed. The patient data are then associated by case number to obtain the visit sequence T of each patient. In this example, T is a 3219 × 29 matrix representing 3219 patients, each with a 29-dimensional visit record. The record for patient 1008 is:
T_{1008} = {1008, 0, 0, 0, 0, 0, 0, 1, ..., 0.29, 1, 0.87, ..., 0.90, 0, 0, 1}.
Part of the visit sequences are then mapped onto a two-dimensional plane, giving the visualization in Fig. 3. As the figure shows, the data form two feature populations; combined with the specific data and medical knowledge, these correspond to two different diseases.
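The mapping step can be sketched as follows. The claims name the t-SNE algorithm; this sketch uses scikit-learn's `TSNE` as one possible implementation, which the source does not specify. The function name `embed_visit_sequences`, the perplexity and seed values, and the synthetic two-group data are all assumptions made for illustration. In practice the case-number column would presumably be dropped before embedding.

```python
import numpy as np
from sklearn.manifold import TSNE


def embed_visit_sequences(T, perplexity=10.0, seed=0):
    """Project an (n_patients, n_features) matrix to 2-D with t-SNE."""
    T = np.asarray(T, dtype=float)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(T)


# Synthetic stand-in for the visit-sequence matrix: two well-separated groups
# of 30 patients each, with 29 features per patient (values are made up).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=0.1, size=(30, 29))
group_b = rng.normal(loc=1.0, scale=0.1, size=(30, 29))
embedding = embed_visit_sequences(np.vstack([group_a, group_b]))
```

Each row of `embedding` is the two-dimensional position of one patient; plotting these points is what would let a physician pick out a feature population by eye.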
The physician then selects the feature population to be studied. In this embodiment, the physician selects the feature population on the right. Feature selection is performed on the data of this population; the specific procedure is shown in Fig. 2. The influence ΔO_i of each feature on disease progression is computed, and the features are sorted in descending order of influence. The features with the greatest influence on the disease are: basophil count, discharge diagnosis, red cell distribution width (CV), and plateletcrit. Computing statistical indicators for these features verifies that they are significantly correlated with the degree of disease progression.
The dependencies among these features are then visualized. In the parallel coordinate plot, each feature is drawn as a vertical axis and the visit sequence of each patient in the feature population runs along the horizontal direction, as shown in Fig. 4.
The correlation between each pair of features is then visualized. In this embodiment, the T value and the BMD value are selected, and the data for these two features are mapped onto a two-dimensional plane; the result is shown in Fig. 5. As the figure shows, the BMD value and the T value are highly correlated and follow a consistent distribution trend. This agrees with clinical experience and has a medical explanation.
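A parallel coordinate plot of this kind can be sketched as follows. This is a minimal illustration with matplotlib, not the embodiment's actual plotting code; the function name `plot_parallel_coordinates`, the placeholder feature names and the three synthetic rows are assumptions, and the values are assumed to be already normalized to the range 0 to 1.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def plot_parallel_coordinates(rows, feature_names):
    """Draw one polyline per patient across one vertical axis per feature.

    `rows` holds already-normalized values (0..1), one list per patient.
    """
    fig, ax = plt.subplots(figsize=(6, 3))
    positions = list(range(len(feature_names)))
    for row in rows:
        ax.plot(positions, row, color="tab:blue", alpha=0.4)
    for x in positions:
        ax.axvline(x, color="black", linewidth=0.8)  # the parallel axes
    ax.set_xticks(positions)
    ax.set_xticklabels(feature_names)
    ax.set_ylim(0.0, 1.0)
    return fig, ax


# Three made-up patients and three placeholder features.
fig, ax = plot_parallel_coordinates(
    [[0.2, 0.9, 0.4], [0.3, 0.8, 0.5], [0.9, 0.1, 0.7]],
    ["feature_1", "feature_2", "feature_3"],
)
```

Polylines that cross between two neighbouring axes suggest an inverse relationship between those features, while roughly parallel segments suggest that the features rise and fall together.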
Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art can make modifications and equivalents to the embodiments of the present invention without departing from the spirit and scope of the present invention, which is set forth in the claims of the present application.

Claims (8)

1. A medical data-oriented multi-factor correlation interactive analysis method is characterized by comprising the following steps:
step 1, processing the acquired medical data, and associating the processed medical data by patient case number to obtain a visit sequence for each patient; wherein the processing comprises normalization;
step 2, mapping the visit sequences obtained in step 1 onto a two-dimensional plane using the t-SNE algorithm to form different feature populations; selecting a feature population from them as required;
step 3, setting a disease characterization index; performing feature selection on the features of the feature population selected in step 2, and determining a feature sequence related to the disease characterization index;
step 4, measuring the correlation among the features selected in step 3 with a statistical metric to obtain a statistically significant result, completing the multi-factor correlation interactive analysis.
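The association by case number in step 1 can be sketched as follows. The source does not describe how the join is carried out, so this is only one plausible reading: the function name `build_visit_sequences`, the dictionary-based table layout, and the choice to drop patients that are missing from any table are all assumptions.

```python
def build_visit_sequences(tables):
    """Concatenate each patient's fields from several tables, keyed by case number.

    `tables` is a list of dicts mapping case number -> list of processed values.
    Patients missing from any table are dropped (an assumption of this sketch).
    """
    shared = set(tables[0])
    for table in tables[1:]:
        shared &= set(table)
    sequences = {}
    for case_no in sorted(shared):
        row = [case_no]  # the case number leads the record, as in the embodiment
        for table in tables:
            row.extend(table[case_no])
        sequences[case_no] = row
    return sequences


# Made-up example: two small tables that share only case number 1008.
demographics = {1008: [0, 0], 1009: [1, 0]}
labs = {1008: [0.29, 0.87], 1010: [0.5, 0.5]}
sequences = build_visit_sequences([demographics, labs])
```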
2. The medical data-oriented multi-factor correlation interactive analysis method according to claim 1, wherein in step 1, the specific step of processing the acquired medical data comprises:
(1.1) eliminating irrelevant features and private data in the medical data; wherein the extraneous features include: patient name, patient serial number, privacy data include: patient identification number, patient mobile phone number;
(1.2) eliminating missing values and abnormal values in the medical data; wherein the missing values include: null, "-", outliers include: a value violating medical knowledge, a value violating common sense;
(1.3) eliminating completely duplicated data in the medical data;
(1.4) normalizing the numerical data in the medical data, comprising: for each value x_i of the same feature,
Figure FDA0002394383080000011
where X is the set of all values of a numerical feature, x_i denotes the i-th element of X, i = 1, 2, 3, ..., n, n denotes the total number of elements, min(X) denotes the minimum value in set X, and max(X) denotes the maximum value in set X;
(1.5) encoding the categorical data in the medical data to obtain an encoding vector Y, wherein the encoding format is:
Figure FDA0002394383080000012
where y_k denotes the k-th value in the encoding vector, k = 1, 2, 3, ....
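Steps (1.4) and (1.5) can be sketched as follows. Both formulas survive only as image placeholders in this text, so the sketch rests on two assumptions: that normalization takes the standard min-max form x_i' = (x_i - min(X)) / (max(X) - min(X)), which matches the min(X) and max(X) definitions above, and that categories receive a fixed-width binary index code like the 000000 and 000001 example in the description. The function names `min_max_normalize` and `encode_categories` are hypothetical.

```python
def min_max_normalize(values):
    """Scale one numerical feature to [0, 1] (assumed min-max form)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: no spread to scale, so map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def encode_categories(labels, width=6):
    """Map each category to a zero-padded binary index string.

    Assumption: categories are numbered in order of first appearance.
    """
    codebook = {}
    for label in labels:
        if label not in codebook:
            codebook[label] = format(len(codebook), "b").zfill(width)
    return [codebook[label] for label in labels], codebook
```

The constant-feature guard is a design choice of this sketch, added to avoid division by zero; the source does not say how such features are handled.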
3. The medical data-oriented multi-factor correlation interactive analysis method as claimed in claim 1, wherein the visit sequence T of each patient obtained in step 1 is expressed as:
T = {x_a, y_b, z_c, ...},
where x_a, y_b, z_c (a, b, c = 1, 2, 3, ..., l) each denote a different type of medical data belonging to the same patient, and l denotes the number of elements of each type of medical data;
in step 2, a feature population G to be studied is selected from the feature populations as required, expressed as:
G = {T_1, T_2, ..., T_p, ..., T_d},
where T_p denotes the visit sequence of the p-th patient in the feature population to be studied, p = 1, 2, 3, ..., d.
4. The medical data-oriented multi-factor correlation interactive analysis method according to claim 1, wherein the step 3 specifically comprises:
(3.1) the disease characterization index is specified interactively;
(3.2) when performing feature selection on the selected feature population and determining the feature sequence related to the disease characterization index, removing the features whose variance is smaller than a threshold to obtain the remaining features; and ranking the remaining features by their correlation with the disease characterization index, determining the k features most critical to disease characterization, and completing feature selection and feature ranking.
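The variance filter in step (3.2) can be sketched as follows. The source gives neither the threshold value nor the variance estimator, so the use of population variance, the threshold argument, and the function name `drop_low_variance` are assumptions.

```python
from statistics import pvariance


def drop_low_variance(columns, threshold):
    """Keep only features whose population variance reaches the threshold.

    `columns` maps feature name -> list of values across the selected patients.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= threshold
    }


# Made-up columns: "b" is constant and "c" barely varies.
columns = {
    "a": [0.0, 1.0, 0.0, 1.0],
    "b": [0.5, 0.5, 0.5, 0.5],
    "c": [0.4, 0.6, 0.4, 0.6],
}
remaining = drop_low_variance(columns, threshold=0.05)
```

A feature that barely varies across the selected population carries little information for separating patients, which is presumably why it is discarded before ranking.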
5. The medical data-oriented multi-factor correlation interactive analysis method according to claim 4, wherein in step (3.2), the step of ranking the remaining features according to their correlation with the disease characterization index and determining the k features most critical to disease progression specifically comprises:
(3.2.1) constructing a classifier that uses decision trees as base learners, denoted F;
(3.2.2) feeding the data of the remaining features into the classifier F to predict the disease characterization index P, obtaining a reference prediction result O, expressed as:
O = F(t_1, t_2, ..., t_q, ..., t_e),
where t_q, q = 1, 2, ..., e, denotes the data of the q-th feature, and e denotes the number of features;
(3.2.3) feeding the data with the r-th feature removed into the classifier for prediction, obtaining a prediction result O_r, expressed as:
O_r = F(t_1, t_2, ..., t_{r-1}, t_{r+1}, ..., t_e);
(3.2.4) calculating the difference between the prediction result O_r and the reference prediction result O as the degree of influence ΔO_r of the r-th feature on disease progression, expressed as:
ΔO_r = |O_r - O|,
where ΔO_r, r = 1, 2, 3, ..., e, denotes the degree of influence of the r-th feature on disease progression; the larger ΔO_r is, the greater the influence of the r-th feature on disease progression and the more critical the feature;
(3.2.5) repeating steps (3.2.3) and (3.2.4) until the degree of influence ΔO on disease progression has been obtained for all features;
(3.2.6) sorting the features by the magnitude of this key metric to obtain the s most critical features, expressed as:
{t_1, t_2, ..., t_s} = sort(ΔO_1, ΔO_2, ..., ΔO_e),
where sort() denotes a sorting function.
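The leave-one-feature-out ranking in steps (3.2.1) to (3.2.6) can be sketched as follows. The claim only requires a classifier with decision trees as base learners; a scikit-learn random forest is used here as one such choice, not as the patent's stated model. Treating the prediction result O as mean cross-validated accuracy is likewise an assumption, as are the function names `feature_influence` and `top_features` and the synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def feature_influence(X, y, seed=0):
    """Return delta[r] = |O_r - O| for every feature r.

    O is the mean cross-validated accuracy with all features; O_r is the same
    score with feature r removed. Using accuracy as the scalar O is an
    assumption of this sketch.
    """
    def score(data):
        model = RandomForestClassifier(n_estimators=50, random_state=seed)
        return cross_val_score(model, data, y, cv=3).mean()

    baseline = score(X)
    deltas = []
    for r in range(X.shape[1]):
        reduced = np.delete(X, r, axis=1)  # drop the r-th feature column
        deltas.append(abs(score(reduced) - baseline))
    return np.array(deltas)


def top_features(deltas, s):
    """Indices of the s features with the largest influence, largest first."""
    return list(np.argsort(-deltas)[:s])


# Synthetic check: only feature 0 carries information about the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int)
deltas = feature_influence(X, y)
```

In this synthetic setup, removing feature 0 should collapse accuracy toward chance, so it receives by far the largest ΔO; the other three features are pure noise and should score close to zero.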
6. The medical data-oriented multi-factor correlation interactive analysis method according to claim 1, wherein in step 4, the statistical metric comprises: Pearson correlation coefficient, u test, t test, analysis of variance, univariate regression based on the central limit theorem, or multiple regression analysis.
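As one example of the listed metrics, the Pearson correlation coefficient and its significance can be sketched as follows using SciPy's `pearsonr`. The significance level of 0.05, the function name `correlation_with_significance`, and the sample values are assumptions; the source does not state which threshold is used.

```python
from scipy.stats import pearsonr


def correlation_with_significance(x, y, alpha=0.05):
    """Pearson r, two-sided p-value, and whether p falls below alpha."""
    r, p_value = pearsonr(x, y)
    return r, p_value, bool(p_value < alpha)


# Made-up, nearly linear data standing in for two selected features.
x = [float(v) for v in range(10)]
y = [2.0 * v + (0.1 if i % 2 else -0.1) for i, v in enumerate(x)]
r, p_value, significant = correlation_with_significance(x, y)
```

A coefficient near 1 with a small p-value is the kind of statistically significant evidence that step 4 is meant to provide for a pair of features.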
7. The medical data-oriented multi-factor correlation interactive analysis method according to claim 5, further comprising:
step 5, visualizing the correlations among the s most critical features obtained in step (3.2.6).
8. The medical data-oriented multi-factor correlation interactive analysis method according to claim 7, wherein the step 5 specifically comprises:
(5.1) taking each feature obtained by feature selection as a vertical axis and the visit sequence of each patient as the horizontal axis, and drawing a parallel coordinate system among the features to visualize how different features depend on one another;
(5.2) selecting two features, mapping the data onto a two-dimensional plane with the two features as coordinate axes, and visualizing the correlation between the two features.
CN202010125946.6A 2020-02-27 2020-02-27 Multi-factor correlation interactive analysis method for medical data Active CN111243753B (en)


Publications (2)

Publication Number Publication Date
CN111243753A true CN111243753A (en) 2020-06-05
CN111243753B CN111243753B (en) 2024-04-02

Family

ID=70864457


Country Status (1)

Country Link
CN (1) CN111243753B (en)



Also Published As

Publication number Publication date
CN111243753B (en) 2024-04-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant