CN111243753B - Multi-factor correlation interactive analysis method for medical data - Google Patents

Multi-factor correlation interactive analysis method for medical data

Info

Publication number
CN111243753B
Authority
CN
China
Prior art keywords
features
feature
data
medical data
patient
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010125946.6A
Other languages
Chinese (zh)
Other versions
CN111243753A (en)
Inventor
钱步月
刘涛
郑莹倩
刘璇
吕欣
许靖琴
侯梦薇
吴风浪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN202010125946.6A
Publication of CN111243753A
Application granted
Publication of CN111243753B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70: ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/211: Selection of the most significant subset of features
    • G06F18/2113: Selection of the most significant subset of features by ranking or filtering the set of features, e.g. using a measure of variance or of feature cross-correlation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Quality & Reliability (AREA)
  • Epidemiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)
  • Medical Treatment And Welfare Office Work (AREA)

Abstract

The invention discloses a multi-factor correlation interactive analysis method for medical data, comprising the following steps: processing the acquired medical data and associating the processed data by patient case number to obtain a visit sequence for each patient; mapping the obtained visit sequences onto a two-dimensional plane with the t-SNE algorithm to form different feature groups; selecting a feature group from these groups as required; setting a disease characterization index; performing feature selection on the features of the selected feature group to determine a feature sequence related to the disease characterization index; and measuring the correlation among the selected features with statistical measures to obtain statistically significant results, thereby completing the multi-factor correlation interactive analysis. The invention enables interactive analysis of high-dimensional medical data and visually displays the key factors that influence disease development.

Description

Multi-factor correlation interactive analysis method for medical data
Technical Field
The invention belongs to the technical field of multi-factor correlation analysis, and particularly relates to a multi-factor correlation interactive analysis method for medical data.
Background
Medical statistics applies the basic principles and methods of statistics to the collection, organization, analysis, presentation, and interpretation of data in medicine and related fields. In clinical research, multi-factor correlation analysis is carried out on existing clinical data, combined with existing medical knowledge, by computing statistics such as the Pearson correlation coefficient in order to identify the key factors that strongly influence disease development. However, medical data are high-dimensional and complex. The traditional approach is computationally heavy, and its results are abstract and hard to interpret, which hinders doctors in diagnosis, treatment, and research. Moreover, disease development usually depends on many factors, whereas the traditional approach can only compute the correlation between two factors at a time, which limits the validity of its results.
In summary, a new multi-factor correlation interactive analysis method for high-dimensional medical data is needed.
Disclosure of Invention
The invention aims to provide a multi-factor correlation interactive analysis method for medical data that solves one or more of the technical problems described above. The invention enables interactive analysis of high-dimensional medical data and visually displays the key factors that influence disease development.
To achieve this aim, the invention adopts the following technical scheme:
The multi-factor correlation interactive analysis method for medical data of the invention comprises the following steps:
step 1, processing the acquired medical data, and associating the processed medical data by patient case number to obtain a visit sequence for each patient; wherein the processing includes normalization;
step 2, mapping the visit sequences obtained in step 1 onto a two-dimensional plane using the t-SNE algorithm to form different feature groups; and selecting a feature group from these groups as required;
step 3, setting a disease characterization index; performing feature selection on the features of the feature group selected in step 2, and determining a feature sequence related to the disease characterization index;
step 4, measuring the correlation among the features selected in step 3 using statistical measures to obtain statistically significant results, thereby completing the multi-factor correlation interactive analysis.
In a further improvement of the invention, the processing of the acquired medical data in step 1 specifically comprises:
(1.1) removing irrelevant features and privacy data from the medical data; wherein the irrelevant features include the patient name and patient serial number, and the privacy data include the patient identity card number and patient mobile phone number;
(1.2) removing missing values and outliers from the medical data; wherein the missing values include null and "-", and the outliers include values that contradict medical knowledge and values that contradict common sense;
(1.3) removing completely duplicated data from the medical data;
(1.4) normalizing the numerical data in the medical data, including: for each value $x_i$ of the same feature, $x_i' = \frac{x_i - \min(X)}{\max(X) - \min(X)}$,
where $X$ is the set of all values of a given numerical feature, $x_i$ is the $i$-th element of $X$, $i = 1, 2, 3, \ldots, n$, $n$ is the total number of elements, $\min(X)$ is the minimum value in $X$, and $\max(X)$ is the maximum value in $X$;
(1.5) encoding the categorical data in the medical data to obtain an encoding vector $Y$; wherein the coding format is:
where $y_k$ is the $k$-th value of the encoding vector, $k = 1, 2, 3, \ldots, m$, $m$ is the number of elements in the encoding vector, and $j$ is the number of the category to which the data belongs.
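Sub-steps (1.4) and (1.5) can be illustrated with a minimal sketch. It assumes that the normalization is min-max scaling, as the definitions of min(X) and max(X) suggest, and that the coding is a one-hot vector with $y_k = 1$ when $k = j$; the coding formula itself is not reproduced in the text, so the one-hot form is an assumption. The function names and sample values are hypothetical.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale one numerical feature to [0, 1] using its own minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant feature: the formula would divide by zero, so return zeros.
        # This fallback is an assumption; the source does not cover the case.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(category_index: int, num_categories: int) -> List[int]:
    """Return a vector with y_k = 1 where k == category_index and 0 elsewhere.

    One-hot coding is an assumption; categories are numbered from 1.
    """
    if not 1 <= category_index <= num_categories:
        raise ValueError("category_index must lie in 1..num_categories")
    return [1 if k == category_index else 0 for k in range(1, num_categories + 1)]


# Hypothetical sample values, for illustration only.
ages = [20.0, 35.0, 50.0]
normalized_ages = min_max_normalize(ages)   # each value mapped into [0, 1]
category_code = one_hot(2, 4)               # category 2 out of 4
```

Scaling each feature by its own range keeps features measured in different units comparable before they are spliced into one visit sequence.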
In a further improvement of the invention, the visit sequence $T$ of each patient obtained in step 1 is expressed as:
$T = \{x_a, y_b, z_c, \ldots\}$,
where $x_a, y_b, z_c$ ($a, b, c = 1, 2, 3, \ldots, l$) denote different types of medical data belonging to the same patient, and $l$ is the number of elements of each type of medical data;
in step 2, the feature group $G$ to be studied is selected from the feature groups as required and is expressed as:
$G = \{T_1, T_2, \ldots, T_p, \ldots, T_d\}$,
where $T_p$ is the visit sequence of the $p$-th patient in the feature group to be studied, $p = 1, 2, 3, \ldots, d$.
In a further improvement of the invention, step 3 specifically comprises:
(3.1) specifying the disease characterization index interactively;
(3.2) performing feature selection on the features of the selected feature group: when determining the feature sequence related to the disease characterization index, removing the features whose variance is smaller than a threshold to obtain the remaining features; then ranking the remaining features by their correlation with the disease characterization index and determining the $s$ features most critical to disease characterization, which completes feature selection and feature ranking.
In a further improvement of the invention, the ranking of the remaining features by their correlation with the disease characterization index and the determination of the $s$ features most critical to disease development in step (3.2) specifically comprise:
(3.2.1) constructing a classifier with a decision tree as the base learner, denoted $F$;
(3.2.2) feeding the data of the remaining features into the classifier $F$ to predict the disease characterization index $P$, which gives a reference prediction result $O$:
$O = F(t_1, t_2, \ldots, t_q, \ldots, t_e)$,
where $t_q$ ($q = 1, 2, \ldots, e$) denotes the data containing the $q$-th feature and $e$ is the number of features;
(3.2.3) feeding the data with the $r$-th feature removed into the classifier for prediction, which gives a prediction result $O_r$:
$O_r = F(t_1, t_2, \ldots, t_{r-1}, t_{r+1}, \ldots, t_e)$;
(3.2.4) taking the difference between the prediction result $O_r$ and the reference prediction result $O$ as the degree of influence $\Delta O_r$ of the $r$-th feature on disease development:
$\Delta O_r = |O_r - O|$,
where $\Delta O_r$ ($r = 1, 2, 3, \ldots, e$) is the degree of influence of the $r$-th feature on disease development; the larger $\Delta O_r$, the more critical the $r$-th feature is to disease development;
(3.2.5) repeating steps (3.2.3) and (3.2.4) until the degree of influence $\Delta O$ on disease development has been obtained for every feature;
(3.2.6) ranking the features by this criticality measure to obtain the $s$ most critical features:
$\{t_1, t_2, \ldots, t_s\} = \mathrm{sort}(\Delta O_1, \Delta O_2, \ldots, \Delta O_e)$,
where $\mathrm{sort}(\cdot)$ denotes the sorting function.
In a further improvement of the invention, the statistical measures in step 4 include: the Pearson correlation coefficient, the u-test, the t-test, analysis of variance, and simple or multiple regression analysis based on the central limit theorem.
In a further improvement, the invention further comprises:
step 5, visualizing the correlation among the $s$ most critical features obtained in step (3.2.6).
In a further improvement of the invention, step 5 specifically comprises:
(5.1) drawing a parallel coordinate system over the features, with each feature obtained by feature selection as a vertical axis and the visit sequence of each patient along the horizontal axis, to visually display how the different features vary in dependence on one another;
(5.2) selecting two features and mapping the data onto a two-dimensional plane with these two features as coordinate axes, to visually display the correlation between the two features.
Compared with the prior art, the invention has the following beneficial effects:
The multi-factor correlation interactive analysis method for high-dimensional medical data provided by the invention defines a complete workflow from raw clinical medical data to the final correlation visualization, and can directly display how the key features in high-dimensional medical data vary in dependence on one another.
First, the acquired raw clinical medical data are processed: invalid information, sensitive information, missing values, and outliers are removed, numerical data are normalized and categorical data are encoded, and the records are spliced by medical record to generate a visit sequence for each patient. The high-dimensional visit sequences are then mapped onto a two-dimensional plane to generate feature groups, and a doctor interactively selects the group to be studied. Next, feature selection is performed on the data of the patients in that group: a criticality measure is computed for each feature with respect to the final prediction result, the features are ranked, and the top few most critical features are selected. Hypothesis tests are then applied to the selected features with statistical methods to verify the statistical significance of the correlations among them. Finally, a parallel coordinate system and a two-dimensional coordinate system are used to visualize, respectively, the dependence relationships among all the features and between pairs of features, so that the influence of different features on disease development can be analyzed. By visualizing the dependence relationships hidden among the features of high-dimensional medical data, the invention displays the key factors in disease development and provides statistically significant evidence.
The analysis method addresses the difficulty that high-dimensional medical data are abstract, complex, and hard to analyze. The traditional analysis of high-dimensional medical data relies on complex statistical calculations that are computationally heavy and whose principles are hard to understand, which creates great difficulty for clinical diagnosis and treatment and for medical research. The invention simplifies the whole analysis workflow, brings in the doctor's medical knowledge through interaction, and visualizes the whole analysis process, which reduces the amount of computation and makes the dependence relationships among variables easier to understand.
The invention takes the multi-factor dependence relationships in high-dimensional medical data into account. The traditional approach analyzes the variables pairwise, so it can only capture the relationship between two variables and cannot model multivariate relationships. The invention uses dimension-reduction mapping, feature selection, and a parallel coordinate system to account for the dependence relationships among multiple variables, which makes the analysis of high-dimensional medical data more accurate.
The method is applicable to clinical medical data for a variety of diseases and is highly extensible. The traditional approach requires a dedicated analysis algorithm to be designed for each disease and data type and is difficult to extend. The method of the invention does not depend on a specific data type; clinical medical data in any form can be analyzed and displayed with it, and it can adapt to the analysis needs of different diseases.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic block diagram of a multi-factor correlation interactive analysis method for medical data according to an embodiment of the invention;
FIG. 2 is a schematic block diagram of a feature selection method in a method of an embodiment of the invention;
FIG. 3 is a schematic diagram of a feature population visualization result in a method according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the visualization of all key features in the method of an embodiment of the present invention;
FIG. 5 is a schematic diagram of a part of the feature correlation visualization result in the method according to the embodiment of the invention.
Detailed Description
In order to make the purposes, technical effects and technical solutions of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention are clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention; it will be apparent that the described embodiments are some of the embodiments of the present invention. Other embodiments, which may be made by those of ordinary skill in the art based on the disclosed embodiments without undue burden, are within the scope of the present invention.
The multi-factor correlation interactive analysis method for medical data comprises the following steps:
step 1, normalizing the collected clinical medical data and associating it by patient case number to obtain a visit sequence for each patient;
step 2, mapping the high-dimensional visit sequences of the processed clinical medical data onto a two-dimensional plane using the t-SNE algorithm to form different feature groups; a doctor views the distribution of all feature groups and selects the feature group to be studied;
step 3, a doctor specifies a disease characterization index; feature selection is performed on the features in the visit sequences to determine the feature sequence most strongly correlated with that index;
step 4, measuring the correlation among the features using statistical methods to obtain statistically significant results;
step 5, visualizing the correlation among the features.
Preferably, step 1 specifically includes:
step 1.1, analyzing the acquired clinical medical data and removing irrelevant features and privacy data from it. Irrelevant features include the patient name, patient number, and so on; they have no bearing on the severity of the patient's illness and should therefore be removed. Privacy data include the patient's identity card number, mobile phone number, and so on; they can identify the individual patient and easily raise ethical risks, and should therefore also be removed. Features such as the patient's native place and home address, which are sensitive but may influence the severity of the illness, should be generalized: only coarse information such as nationality and province is kept, and the specific sensitive details are removed;
step 1.2, removing missing values and outliers from the acquired clinical medical data. Missing values are entries that carry no meaning, including null, "-", and so on; they have no clear medical meaning and adversely affect the final result, so they should be handled. Outliers are clearly incorrect values, including values that contradict medical knowledge and values that contradict common sense; they strongly disturb the final result, so they should also be handled. Both kinds of value are handled as follows: if the missing values and outliers make up less than 1/10 of the total data, the affected records are deleted; otherwise, each missing value or outlier is replaced by the mean of its column;
step 1.3, removing duplicate data from the acquired clinical medical data. Duplicate data fall into two classes: records that are completely identical, of which only the last is kept after de-duplication; and records that differ in part, which can be understood as examination records of the same patient at different times and should all be kept;
step 1.4, normalizing the numerical data in the acquired clinical medical data, that is, for each value $x_i$ of the same feature, $x_i' = \frac{x_i - \min(X)}{\max(X) - \min(X)}$,
where $X$ is the set of all values of a given numerical feature, $x_i$ is the $i$-th element of $X$, $i = 1, 2, 3, \ldots, n$, $\min(X)$ is the minimum value in $X$, and $\max(X)$ is the maximum value in $X$;
step 1.5, encoding the categorical data in the acquired clinical medical data to convert it into a format that the algorithm can use. The converted coding format is:
where $Y$ is the converted encoding vector, $y_i$ is the $i$-th value of the vector, $i = 1, 2, 3, \ldots, n$, and $j$ is the number of the category to which the data belongs;
step 1.6, splicing the processed data by patient case number to generate the visit sequence $T$ of a single patient:
$T = \{x_i, y_j, z_k, \ldots\}$,
where $x_i, y_j, z_k$ ($i, j, k = 1, 2, 3, \ldots, n$) denote different types of clinical medical data that belong to the same patient, that is, share the same patient number. $T$ is a high-dimensional vector representing the visit sequence of a single patient.
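The cleaning and splicing in steps 1.2, 1.3, and 1.6 can be sketched as follows. This is a minimal illustration, not the patented implementation: the field names (`case_no`, `wbc`, `sex`) and sample records are hypothetical, outlier detection is left out, and the 1/10 rule is applied per column, which is one possible reading of the text.

```python
from typing import Dict, List

MISSING = (None, "", "-")  # missing-value markers named in step 1.2


def drop_exact_duplicates(rows: List[dict]) -> List[dict]:
    """Keep only the last copy of completely identical records (step 1.3)."""
    seen, kept = set(), []
    for row in reversed(rows):
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            kept.append(row)
    kept.reverse()
    return kept


def clean_column(rows: List[dict], column: str) -> List[dict]:
    """Drop rows if under 1/10 are missing, else impute the column mean (step 1.2)."""
    present = [r[column] for r in rows if r[column] not in MISSING]
    missing_count = len(rows) - len(present)
    if missing_count == 0 or not present:
        return list(rows)  # nothing to fix, or no values to average
    if missing_count < len(rows) / 10:
        return [r for r in rows if r[column] not in MISSING]
    mean = sum(present) / len(present)
    return [dict(r, **{column: mean}) if r[column] in MISSING else r for r in rows]


def build_visit_sequences(tables: List[List[dict]], key: str = "case_no") -> Dict[str, list]:
    """Splice records from several tables by case number (step 1.6)."""
    sequences: Dict[str, list] = {}
    for table in tables:
        for row in table:
            sequences.setdefault(row[key], []).extend(
                value for name, value in row.items() if name != key
            )
    return sequences


# Hypothetical records, for illustration only.
labs = [
    {"case_no": "A1", "wbc": 4.0},
    {"case_no": "A2", "wbc": "-"},
    {"case_no": "A3", "wbc": 8.0},
    {"case_no": "A3", "wbc": 8.0},  # exact duplicate
]
demographics = [
    {"case_no": "A1", "sex": 0},
    {"case_no": "A2", "sex": 1},
    {"case_no": "A3", "sex": 0},
]
cleaned_labs = clean_column(drop_exact_duplicates(labs), "wbc")
visit_sequences = build_visit_sequences([cleaned_labs, demographics])
```

Records that differ in any field are treated as distinct examinations and are all kept, which follows the second class of duplicates described in step 1.3.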
Specifically, step 2 includes:
step 2.1, mapping the high-dimensional patient visit sequence vectors $T$ obtained in the previous step onto a two-dimensional plane using the t-SNE algorithm to generate different feature groups.
The t-SNE algorithm can be expressed as follows:
for $n$ high-dimensional data points $x_1, x_2, \ldots, x_n$, the Euclidean distances between the points are converted into joint probabilities that represent similarity, according to the formula:
where:
the objective function is expressed as:
where $P$ is the joint probability distribution of the points in the high-dimensional space and $Q$ is the joint probability distribution of the points in the low-dimensional space.
The gradient used for optimization is:
The perplexity is defined through:
$H(P_i) = -\sum_j p_{ij} \log_2 p_{ij}$
The specific solving steps are as follows:
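The formulas and solving steps referred to in this passage are not reproduced in the text. For reference, the standard t-SNE formulation, which matches the description above, is sketched below. The symbols $\sigma_i$ (per-point Gaussian bandwidth), $y_i$ (low-dimensional image of $x_i$), $q_{ij}$, and $C$ come from that standard formulation and are assumptions with respect to this document. The standard formulation also defines the entropy over the conditional probabilities $p_{j|i}$.

```latex
% High-dimensional similarities: conditional, then symmetrized joint probabilities
p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Low-dimensional similarities (Student t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective: Kullback-Leibler divergence between P and Q
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}

% Gradient with respect to each low-dimensional point
\frac{\partial C}{\partial y_i}
  = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}

% Perplexity, used to choose each \sigma_i
\mathrm{Perp}(P_i) = 2^{H(P_i)},
\qquad
H(P_i) = -\sum_j p_{j|i} \log_2 p_{j|i}
```

In the standard procedure, each $\sigma_i$ is found by binary search so that $\mathrm{Perp}(P_i)$ equals a user-chosen perplexity, the points $y_i$ are initialized at random near the origin, and $C$ is then minimized by gradient descent.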
step 2.2, the doctor interactively selects the group $G$ to be studied from the feature groups generated in the previous step:
$G = \{T_1, T_2, \ldots, T_n\}$,
where $T_i$ is the visit sequence of the $i$-th patient in the selected feature group, $i = 1, 2, 3, \ldots, n$.
Specifically, step 3 includes:
step 3.1, the doctor specifies a characterization index $P$ of the degree of disease development. Different diseases use different medical indices to characterize the degree of disease development, so the doctor needs to specify the characterization index interactively for the specific problem in order to measure the severity of the disease;
step 3.2, removing low-variance features. A feature whose variance is below the threshold varies little across all patients, which indicates that it has little influence on the development of the patient's disease, so it should be removed. In this scheme the threshold is set to 0;
step 3.3, ranking all features by their correlation with the degree of disease development and determining the $k$ features most critical to disease development, which completes feature selection. The specific steps are:
step 3.3.1, constructing a classifier with a decision tree as the base learner, denoted $F$;
step 3.3.2, feeding the data containing all features into the classifier to predict the disease development index $P$, which gives a reference prediction result $O$:
$O = F(t_1, t_2, \ldots, t_n)$,
where $t_i$ ($i = 1, 2, \ldots, n$) denotes the data containing the $i$-th feature and $n$ is the number of features;
step 3.3.3, feeding the data with the $i$-th feature removed into the classifier and predicting again, which gives a new prediction result $O_i$:
$O_i = F(t_1, t_2, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n)$,
where $t_i$ and $n$ are as defined above;
step 3.3.4, taking the difference between the prediction result $O_i$ obtained after removing the $i$-th feature and the reference prediction result $O$ as the degree of influence $\Delta O_i$ of that feature on disease development:
$\Delta O_i = |O_i - O|$,
where $\Delta O_i$ ($i = 1, 2, 3, \ldots, n$) is the degree of influence of the $i$-th feature on disease development; the larger the value, the greater the influence of the feature on disease development, that is, the more critical the feature;
step 3.3.5, repeating the above process until the criticality measure $\Delta O_i$ has been obtained for all $n$ features;
step 3.3.6, sorting all features by $\Delta O_i$ in descending order and taking the $k$ most important features, which completes feature selection:
$\{t_1, t_2, \ldots, t_k\} = \mathrm{sort}(\Delta O_1, \Delta O_2, \ldots, \Delta O_n)$,
where $\mathrm{sort}(\cdot)$ denotes the sorting function, of whose output the first $k$ values are taken.
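Steps 3.2 and 3.3 can be sketched as a leave-one-feature-out ranking. Several assumptions are made here: the prediction result $O$ is taken to be a scalar accuracy score, which the text does not specify; a depth-1 decision tree (a stump) scored on the training data stands in for the classifier $F$; labels are binary; and features with zero variance are the ones removed, since no variance can be strictly below the threshold 0. All names and sample values are hypothetical.

```python
from statistics import pvariance
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]


def drop_low_variance(X: List[Row], threshold: float = 0.0) -> List[int]:
    """Return the column indices whose variance exceeds the threshold (step 3.2)."""
    return [c for c in range(len(X[0]))
            if pvariance([row[c] for row in X]) > threshold]


def stump_accuracy(X: List[Row], y: Sequence[int], columns: Sequence[int]) -> float:
    """Training accuracy of the best depth-1 tree over the given columns.

    A stand-in for the classifier F; binary labels 0/1 are assumed.
    """
    best = 0.0
    for c in columns:
        for cut in sorted({row[c] for row in X}):
            left = [label for row, label in zip(X, y) if row[c] <= cut]
            right = [label for row, label in zip(X, y) if row[c] > cut]
            # Each side predicts its majority label.
            correct = sum(max(side.count(0), side.count(1))
                          for side in (left, right) if side)
            best = max(best, correct / len(y))
    return best


def rank_features(
    X: List[Row], y: Sequence[int], columns: Sequence[int], top_k: int,
    score: Callable[[List[Row], Sequence[int], Sequence[int]], float] = stump_accuracy,
) -> List[Tuple[int, float]]:
    """Rank features by |O_i - O| and return the top_k (steps 3.3.2 to 3.3.6)."""
    baseline = score(X, y, columns)                          # O
    impact = {}
    for removed in columns:
        reduced = [c for c in columns if c != removed]
        impact[removed] = abs(score(X, y, reduced) - baseline)  # Delta O_i
    ranked = sorted(columns, key=lambda c: impact[c], reverse=True)
    return [(c, impact[c]) for c in ranked[:top_k]]


# Hypothetical data: column 1 is constant, column 0 separates the labels.
X = [[0.1, 1.0, 0.3], [0.2, 1.0, 0.8], [0.8, 1.0, 0.2], [0.9, 1.0, 0.7]]
y = [0, 0, 1, 1]
kept = drop_low_variance(X)
top = rank_features(X, y, kept, top_k=1)
```

Passing the scorer as a parameter keeps the ranking independent of the classifier, so a full decision-tree ensemble could be swapped in for the stump.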
Specifically, the statistical measures in step 4 include the Pearson correlation coefficient, the u-test, the t-test, analysis of variance (one-way analysis of variance, multi-factor analysis of variance, and so on), and simple or multiple regression analysis based on the central limit theorem. The main purpose of this step is to analyze statistically the $k$ features selected in the previous step and confirm that they satisfy statistical hypothesis tests.
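As one example of such a measure, the sketch below computes the Pearson correlation coefficient $r$ between two features, together with the statistic $t = r\sqrt{(n-2)/(1-r^2)}$ that is commonly used to test whether $r$ differs from zero. The function name and sample values are hypothetical, and the comparison of $t$ with a Student t critical value for $n-2$ degrees of freedom is left out.

```python
import math
from typing import Sequence, Tuple


def pearson_with_t(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (r, t): the Pearson coefficient and its t statistic with n - 2 d.o.f."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length samples with at least 3 points")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    r = sxy / math.sqrt(sxx * syy)
    remainder = 1.0 - r * r
    if remainder <= 0:
        # Perfect correlation: the t statistic diverges.
        return r, math.copysign(math.inf, r)
    return r, r * math.sqrt((n - 2) / remainder)


# Hypothetical values of two selected features for three patients.
r, t = pearson_with_t([1.0, 2.0, 3.0], [1.0, 3.0, 2.0])
```

A large absolute value of $t$ relative to the critical value indicates that the correlation between the two features is statistically significant.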
Specifically, step 5 includes:
step 5.1, drawing a parallel coordinate system over the features, with each feature chosen during feature selection as a vertical axis and the visit sequence of each patient along the horizontal axis, to visually display how the different features vary in dependence on one another;
step 5.2, selecting two strongly correlated features and mapping the data containing them onto a two-dimensional plane with these two features as coordinate axes, to visually display the correlation between the two features.
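The geometry behind both views can be sketched without a plotting library. In the sketch below, each patient becomes one polyline that crosses one vertical axis per feature, and a pair of features becomes a list of scatter points. Rescaling every axis to [0, 1] is an assumption made so that the axes are comparable; the feature names and values are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def parallel_coordinate_lines(
    records: Sequence[Dict[str, float]], features: Sequence[str]
) -> List[List[Tuple[int, float]]]:
    """Turn each patient record into a polyline over one vertical axis per feature.

    A point (axis, height) lies on vertical axis number `axis` at `height`,
    the feature value rescaled to [0, 1] along that axis.
    """
    bounds = {f: (min(r[f] for r in records), max(r[f] for r in records))
              for f in features}
    lines = []
    for record in records:
        line = []
        for axis, feature in enumerate(features):
            lo, hi = bounds[feature]
            # A constant feature is drawn at mid-height.
            height = 0.5 if hi == lo else (record[feature] - lo) / (hi - lo)
            line.append((axis, height))
        lines.append(line)
    return lines


def scatter_points(
    records: Sequence[Dict[str, float]], feature_x: str, feature_y: str
) -> List[Tuple[float, float]]:
    """Map the records onto the plane spanned by two chosen features (step 5.2)."""
    return [(r[feature_x], r[feature_y]) for r in records]


# Hypothetical values of two selected features for three patients.
patients = [
    {"feature_a": 1.0, "feature_b": 10.0},
    {"feature_a": 3.0, "feature_b": 30.0},
    {"feature_a": 2.0, "feature_b": 20.0},
]
lines = parallel_coordinate_lines(patients, ["feature_a", "feature_b"])
points = scatter_points(patients, "feature_a", "feature_b")
```

Polylines that stay roughly parallel between two neighbouring axes suggest a positive dependence between those features, while lines that cross suggest a negative one.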
In summary, the multi-factor correlation interactive analysis method for high-dimensional medical data provided by the embodiment of the invention defines a complete workflow from raw clinical medical data to the final correlation visualization, and can directly display how the key features in high-dimensional medical data vary in dependence on one another. First, the acquired raw clinical medical data are processed: invalid information, sensitive information, missing values, and outliers are removed, numerical data are normalized and categorical data are encoded, and the records are spliced by medical record to generate a visit sequence for each patient. The high-dimensional visit sequences are then mapped onto a two-dimensional plane to generate feature groups, and a doctor interactively selects the group to be studied. Next, feature selection is performed on the data of the patients in that group: a criticality measure is computed for each feature with respect to the final prediction result, the features are ranked, and the top few most critical features are selected. Hypothesis tests are then applied to the selected features with statistical methods to verify the statistical significance of the correlations among them. Finally, a parallel coordinate system and a two-dimensional coordinate system are used to visualize, respectively, the dependence relationships among all the features and between pairs of features, so that the influence of different features on disease development can be analyzed. By visualizing the dependence relationships hidden among the features of high-dimensional medical data, the invention displays the key factors in disease development and provides statistically significant evidence.
The analysis method addresses the difficulty that high-dimensional medical data are abstract, complex, and hard to analyze. The traditional analysis of high-dimensional medical data relies on complex statistical calculations that are computationally heavy and whose principles are hard to understand, which creates great difficulty for clinical diagnosis and treatment and for medical research. The invention simplifies the whole analysis workflow, brings in the doctor's medical knowledge through interaction, and visualizes the whole analysis process, which reduces the amount of computation and makes the dependence relationships among variables easier to understand. The invention takes the multi-factor dependence relationships in high-dimensional medical data into account. The traditional approach analyzes the variables pairwise, so it can only capture the relationship between two variables and cannot model multivariate relationships. The invention uses dimension-reduction mapping, feature selection, and a parallel coordinate system to account for the dependence relationships among multiple variables, which makes the analysis of high-dimensional medical data more accurate. The method is applicable to clinical medical data for a variety of diseases and is highly extensible. The traditional approach requires a dedicated analysis algorithm to be designed for each disease and data type and is difficult to extend. The method of the invention does not depend on a specific data type; clinical medical data of any type can be analyzed and displayed with it, and it can adapt to the analysis needs of different diseases.
DETAILED DESCRIPTION OF EMBODIMENT(S) OF THE INVENTION
In this embodiment, diagnosis and treatment data from the oncology department of the Second Affiliated Hospital of Xi'an Jiaotong University are used to illustrate the implementation of the method. Note that, as an example, this embodiment lists only part of the data fields and visualization results; actual clinical medical data extend far beyond what is listed.
In an embodiment of the present invention, the clinical medical data includes:
First, the collected data are processed. Fields such as name, date of birth, ID card number and telephone number are removed, and gender and other categorical information are encoded; for example, male is coded as 0, female as 1, Han ethnicity as 000000, Hui ethnicity as 000001, and so on. The data are then normalized, and missing values, outliers and duplicate values are removed. The patient data are then joined by medical record number to obtain the visit sequence T of each patient. In this example, T is a 3219 × 29 matrix, representing 3219 patients, each with a 29-dimensional visit record. The record of the 1008th patient is:
$T_{1008} = \{1008, 0, 0, 0, 0, 0, 0, 1, \ldots, 0.29, 1, 0.87, \ldots, 0.90, 0, 0, 1\}$.
Part of the visit sequences are then mapped onto a two-dimensional plane, giving the visualization in Fig. 3. As the figure shows, the data form two feature groups which, in light of the specific data and medical knowledge, correspond to two different diseases.
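The preprocessing that produces the visit matrix T can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the field names (`record_no`, `sex`, `wbc` and the private fields) are hypothetical, discarding a whole record when any value is missing is an assumption, and min-max scaling is assumed as the normalization because the claims define min(X) and max(X) for it.

```python
# Hypothetical field names; the patent does not specify a schema.
PRIVATE_FIELDS = {"name", "birth_date", "id_card", "phone"}
SEX_CODES = {"male": 0, "female": 1}  # coding stated in the embodiment


def clean_records(records):
    """Drop private fields, records with missing values, and exact duplicates."""
    seen, cleaned = set(), []
    for rec in records:
        row = {k: v for k, v in rec.items() if k not in PRIVATE_FIELDS}
        if any(v in (None, "", "-") for v in row.values()):
            continue  # assumption: a missing value discards the whole record
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier record
        seen.add(key)
        cleaned.append(row)
    return cleaned


def min_max(values):
    """Scale values to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_visit_matrix(records, numeric_fields):
    """Return one row per patient: record number, sex code, scaled numerics."""
    rows = clean_records(records)
    if not rows:
        return []
    scaled = {f: min_max([r[f] for r in rows]) for f in numeric_fields}
    return [
        [r["record_no"], SEX_CODES[r["sex"]]] + [scaled[f][i] for f in numeric_fields]
        for i, r in enumerate(rows)
    ]
```

Each returned row plays the role of one patient's visit sequence; in the embodiment there would be 3219 such rows of length 29.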
The physician next selects the feature group to be studied. In this embodiment, the physician selects the right-hand feature group. Feature selection is performed on that group's data; the specific flow is shown in Fig. 2. After calculating the influence $\Delta O_i$ of each feature on disease progression and sorting in descending order, the features with the greatest influence on the disease are: basophil count, discharge diagnosis, red blood cell distribution width (CV), and plateletcrit. Statistical measures calculated for these features can be used to verify that they are significantly correlated with the degree of disease progression.
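The leave-one-feature-out ranking described here can be illustrated with a small sketch. Several things are assumptions rather than the patent's implementation: the prediction result O is taken to be a scalar accuracy score, a one-split decision stump stands in for the decision-tree classifier F, labels are binary (0 or 1), and all function names are hypothetical.

```python
def stump_accuracy(X, y):
    """Best training accuracy of a one-split decision stump (stand-in for F)."""
    n = len(y)
    best = max(sum(y), n - sum(y)) / n  # majority-class baseline
    n_features = len(X[0]) if X and X[0] else 0
    for j in range(n_features):
        for thr in {row[j] for row in X}:
            hits = sum((row[j] > thr) == bool(label) for row, label in zip(X, y))
            best = max(best, hits / n, (n - hits) / n)  # either polarity
    return best


def drop_column(X, r):
    """Copy of X with the r-th feature removed."""
    return [row[:r] + row[r + 1:] for row in X]


def feature_influence(X, y, score=stump_accuracy):
    """Return [dO_0, dO_1, ...] where dO_r = |O_r - O|."""
    base = score(X, y)  # reference result O using all features
    return [abs(score(drop_column(X, r), y) - base) for r in range(len(X[0]))]


def top_features(names, deltas, s):
    """Names of the s features with the largest influence, in descending order."""
    order = sorted(range(len(deltas)), key=lambda r: deltas[r], reverse=True)
    return [names[r] for r in order[:s]]
```

Passing a different `score` callable lets a full decision-tree classifier replace the stump without changing the ranking logic.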
The dependence among these features is then visualized. Taking each feature as a vertical axis and the visit sequence of each patient in the feature group as the horizontal axis, the parallel-coordinate plot shown in Fig. 4 is obtained.
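As a rough illustration of how data are prepared for a parallel-coordinate plot, the sketch below turns each patient's record into a polyline with one point per feature axis. The per-axis rescaling to [0, 1] and the function name are assumptions; the drawing itself would be left to a plotting library.

```python
def parallel_coordinates(rows, feature_names):
    """Build one polyline per patient for a parallel-coordinate plot.

    Each polyline is a list of (axis_index, height) pairs, where the height
    is the feature value rescaled to [0, 1] on that feature's own axis.
    """
    columns = list(zip(*rows))
    bounds = [(min(col), max(col)) for col in columns]
    lines = []
    for row in rows:
        line = []
        for axis, (value, (lo, hi)) in enumerate(zip(row, bounds)):
            # A constant feature is drawn at mid-height of its axis.
            height = 0.5 if hi == lo else (value - lo) / (hi - lo)
            line.append((axis, height))
        lines.append(line)
    return {"axes": list(feature_names), "lines": lines}
```

Lines that run roughly parallel between two neighbouring axes suggest that the two features vary together; lines that cross heavily suggest an inverse relationship.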
The correlation between features is then visualized. In this embodiment, two features, the T value and the BMD value, are selected, and the data for these two features are mapped onto a two-dimensional plane; the result is shown in Fig. 5. The figure shows that the BMD value and the T value are highly correlated and follow a consistent distribution trend. This agrees with experience in clinical practice and is medically interpretable.
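The strength of such a pairwise relationship can be quantified with the Pearson correlation coefficient, and its significance checked with the standard statistic t = r·sqrt((n − 2) / (1 − r²)), which has n − 2 degrees of freedom. The sketch below is illustrative only; the function names are hypothetical and any numbers fed to it are not the embodiment's BMD or T values.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


def correlation_t_statistic(r, n):
    """t statistic for testing r against zero correlation (n - 2 d.o.f.)."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)  # perfect correlation
    return r * math.sqrt((n - 2) / (1.0 - r * r))
```

The statistic is then compared with the critical value of the t distribution at the chosen significance level.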
The above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the above embodiments, those skilled in the art may still modify the specific embodiments or substitute equivalents, and any modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of its claims.

Claims (4)

1. A multi-factor correlation interactive analysis method for medical data, characterized by comprising the following steps:
step 1, processing the acquired medical data, and joining the processed medical data by patient medical record number to obtain a visit sequence for each patient; wherein the processing includes normalization;
step 2, mapping the visit sequences obtained in step 1 onto a two-dimensional plane using the t-SNE algorithm to form different feature groups, and selecting a feature group from them as required;
step 3, setting a disease characterization index; performing feature selection on the features of the feature group selected in step 2, and determining a feature sequence related to the disease characterization index;
step 4, measuring the correlation among the features selected in step 3 using statistical measures to obtain statistically significant results, thereby completing the multi-factor correlation interactive analysis;
in step 1, the specific steps of processing the acquired medical data include:
(1.1) removing extraneous features and private data from the medical data; wherein the extraneous features include patient name and patient serial number, and the private data include patient ID number and patient mobile phone number;
(1.2) removing missing values and outliers from the medical data; wherein missing values include null and "-", and outliers include values that contradict medical knowledge and values that contradict common sense;
(1.3) removing exact duplicates from the medical data;
(1.4) normalizing the numerical data in the medical data, including: for data $x_i$ of the same feature,
where X is the set of all values of a given numerical feature, $x_i$ denotes the i-th element of X, i = 1, 2, 3, ..., n, n is the total number of elements, min(X) is the minimum value in X, and max(X) is the maximum value in X;
(1.5) encoding the categorical data in the medical data to obtain an encoding vector Y; wherein the encoding format is:
where $y_k$ denotes the k-th value in the encoding vector, k = 1, 2, 3, ..., m, m is the number of elements in the encoding vector, and j is the class number to which the data belong;
in step 1, the obtained visit sequence T of each patient has the expression:
$T = \{x_a, y_b, z_c, \ldots\}$,
where $x_a$, $y_b$, $z_c$, with a, b, c = 1, 2, 3, ..., l, respectively denote different types of medical data belonging to the same patient, and l is the number of elements of each type of medical data;
in step 2, the feature group G to be studied is selected from the feature groups as required, with the expression:
$G = \{T_1, T_2, \ldots, T_p, \ldots, T_d\}$,
where $T_p$ denotes the visit sequence of the p-th patient in the feature group to be studied, p = 1, 2, 3, ..., d, and d is the number of patients in the feature group to be studied;
step 3 specifically comprises the following steps:
(3.1) the disease characterization index is specified interactively;
(3.2) when performing feature selection on the selected feature group and determining the feature sequence related to the disease characterization index, removing features whose variance is below a threshold to obtain the remaining features; ranking the remaining features by their correlation with the disease characterization index, and determining the k features most critical to disease characterization, thereby completing feature selection and feature ranking;
in step (3.2), ranking the remaining features by their correlation with the disease characterization index and determining the k features most critical to disease progression specifically includes:
(3.2.1) constructing a classifier with a decision tree as the base learner, denoted F;
(3.2.2) feeding the data of the remaining features into the classifier F to predict the disease characterization index P, obtaining a reference prediction result O, with the expression:
$O = F(t_1, t_2, \ldots, t_q, \ldots, t_e)$,
where $t_q$ denotes the data of the q-th feature, q = 1, 2, ..., e, and e is the number of features;
(3.2.3) feeding the data with the r-th feature removed into the classifier for prediction to obtain the prediction result $O_r$, with the expression:
$O_r = F(t_1, t_2, \ldots, t_{r-1}, t_{r+1}, \ldots, t_e)$;
(3.2.4) calculating the difference between the prediction result $O_r$ and the reference prediction result O as the degree of influence $\Delta O_r$ of the r-th feature on disease development, with the expression:
$\Delta O_r = |O_r - O|$,
where $\Delta O_r$, r = 1, 2, 3, ..., e, denotes the degree of influence of the r-th feature on disease progression; the larger $\Delta O_r$ is, the more critical the r-th feature is to disease progression;
(3.2.5) repeating steps (3.2.3) and (3.2.4) until the degree of influence $\Delta O$ on disease progression has been obtained for all features;
(3.2.6) sorting the features by the magnitude of this key measurement index to obtain the s most critical features, with the expression:
$\{t_1, t_2, \ldots, t_s\} = \mathrm{sort}(\Delta O_1, \Delta O_2, \ldots, \Delta O_e)$,
where sort() denotes the sorting function.
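The variance screening that opens step (3.2) can be sketched as below. This is a minimal illustration under assumptions: the claim does not say whether population or sample variance is meant (population variance is used here), and the function name and threshold value are hypothetical.

```python
from statistics import pvariance


def remove_low_variance(columns, threshold):
    """Keep only the features whose variance is at least `threshold`.

    `columns` maps a feature name to the list of that feature's values
    across all patients in the selected feature group.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) >= threshold
    }
```

Because numerical features are already scaled to a common range in step (1.4), a single threshold can reasonably be applied to all of them.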
2. The method of claim 1, wherein in step 4 the statistical measures include: Pearson correlation coefficient, u-test, t-test, analysis of variance, and simple or multiple regression analysis based on the central limit theorem.
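As one example of the listed measures, a two-sample t statistic can compare a feature between two patient groups. The claim does not state which t-test variant is intended; the sketch below assumes Welch's unequal-variance form, and the function name is hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's two-sample t statistic for the difference in group means."""
    if len(a) < 2 or len(b) < 2:
        raise ValueError("each group needs at least two observations")
    # Standard error of the difference, using each group's sample variance.
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        raise ValueError("both groups are constant; the statistic is undefined")
    return (mean(a) - mean(b)) / se
```

A large absolute value of the statistic indicates that the feature differs between the two groups by more than sampling noise would explain.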
3. The multi-factor correlation interactive analysis method for medical data according to claim 1, further comprising:
step 5, visualizing the correlation among the s most critical features obtained in step (3.2.6).
4. The multi-factor correlation interactive analysis method for medical data according to claim 3, wherein step 5 specifically comprises:
(5.1) drawing a parallel coordinate system of the features, with each feature obtained by feature selection as a vertical axis and the visit sequence of each patient as the horizontal axis, to visually display how the different features depend on and vary with one another;
(5.2) selecting two features and mapping the data onto a two-dimensional plane with the two features as coordinate axes, to visually display the correlation between the two features.
CN202010125946.6A 2020-02-27 2020-02-27 Multi-factor correlation interactive analysis method for medical data Active CN111243753B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010125946.6A CN111243753B (en) 2020-02-27 2020-02-27 Multi-factor correlation interactive analysis method for medical data


Publications (2)

Publication Number Publication Date
CN111243753A CN111243753A (en) 2020-06-05
CN111243753B true CN111243753B (en) 2024-04-02

Family

ID=70864457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010125946.6A Active CN111243753B (en) 2020-02-27 2020-02-27 Multi-factor correlation interactive analysis method for medical data

Country Status (1)

Country Link
CN (1) CN111243753B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant