CN116259415A  Patient medicine taking compliance prediction method based on machine learning  Google Patents
 Publication number
 CN116259415A (application number CN202211309805.5A)
 Authority
 CN
 China
 Prior art keywords
 compliance
 data
 patient
 correlation
 model
 Prior art date
 Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
 Pending
Classifications

 G—PHYSICS
 G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
 G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
 G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
 G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients

 G—PHYSICS
 G06—COMPUTING; CALCULATING OR COUNTING
 G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
 G06N20/00—Machine learning

 Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSSSECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSSREFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
 Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
 Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
 Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
 Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Landscapes
 Engineering & Computer Science (AREA)
 Data Mining & Analysis (AREA)
 Medical Informatics (AREA)
 Public Health (AREA)
 Health & Medical Sciences (AREA)
 Theoretical Computer Science (AREA)
 Software Systems (AREA)
 Primary Health Care (AREA)
 Databases & Information Systems (AREA)
 General Health & Medical Sciences (AREA)
 Pathology (AREA)
 Artificial Intelligence (AREA)
 Computer Vision & Pattern Recognition (AREA)
 Evolutionary Computation (AREA)
 Epidemiology (AREA)
 Physics & Mathematics (AREA)
 Computing Systems (AREA)
 General Engineering & Computer Science (AREA)
 General Physics & Mathematics (AREA)
 Mathematical Physics (AREA)
 Biomedical Technology (AREA)
 Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a machine-learning-based patient medication compliance prediction method comprising the following steps: first, a regular medication rate is calculated from the patient's medication data, and the patient's medication compliance is defined according to that rate; next, the correlation among the feature variables is calculated and features with low correlation are filtered out; the importance scores of a random forest are then used as the measure to obtain the medication compliance features $X_1, X_2, \ldots, X_c$; the hyperparameters of the random forest model are optimized with a grid search algorithm to obtain the optimal hyperparameters; finally, using a cyclic estimation algorithm, an optimal feature set is selected through repeated training iterations, a random forest model is adopted as the classifier, and the classification result is output. The method is significant for identifying and predicting the compliance risk factors of tuberculosis patients.
Description
Technical Field
The invention belongs to the field of medical data mining and relates to a machine-learning-based method for predicting patient medication compliance.
Background
In recent years, medical and health information systems have come into wide use, and the volume of data stored in hospital systems keeps growing. With the rise of mobile healthcare, medical information is increasingly digitized. These medical data are of great value for the diagnosis and treatment of diseases, and the medical industry has truly entered the era of big data. Tuberculosis (TB) is one of the ten leading causes of death worldwide. Tuberculosis resurged in the 1990s, and its incidence in many countries has increased at a rate of 1.1% per year, making it a public health problem that must be faced and solved globally. China carries a heavy tuberculosis burden, the second largest in the world. At present, China accounts for 14.3% of the world's annual tuberculosis cases, which means that about 1.3 million people are diagnosed with tuberculosis each year, a patient base second only to that of India. From 2011 to 2016, the number of reported tuberculosis patients in China was about 900,000. The tuberculosis epidemic in China is unevenly distributed: the prevalence is higher in the west than in the central and eastern regions, and higher in rural areas than in cities. The tuberculosis epidemic has the following characteristics: easy infection, easy onset of disease, and a high risk of deformity and death.
Medication compliance refers to whether a patient's medication behavior follows the doctor's instructions. Investigations have found that, because tuberculosis involves a long treatment period, slow recovery, high medical costs, and complex medication regimens, some patients interrupt their medication on their own or miss doses. This results in poor medication compliance and ultimately impairs the treatment effect. In particular, the course of treatment may be prolonged or the condition may worsen, placing an unnecessary economic burden on clinicians, the medical industry, and other stakeholders. According to statistics, medication non-compliance affects more than 50% of chronically ill patients. It is therefore increasingly recognized that predicting and improving patient compliance is highly desirable. In other words, preferentially allocating healthcare resources to the patients most likely to be non-compliant is important for improving the efficiency of current medical system interventions. The study of patient compliance is a classification problem. Many methods exist for predicting patient conditions in the medical field. Traditional prediction methods include machine learning algorithms such as logistic regression, K-nearest neighbors, random forest, and support vector machines. With the development of deep learning, methods such as the multilayer feedforward (BP) neural network and the convolutional neural network (CNN) have also appeared.
Traditional research methods mostly collect user data through questionnaires or self-reports, evaluate patient medication compliance with a scale-based scoring mechanism, and build single-factor and multi-factor analysis models based on statistical methods. However, questionnaires cannot guarantee the authenticity of the data, and medical data sets exhibit high correlation and high redundancy among features. When the data set is also nonlinear and imbalanced, it is difficult to find a suitable model structure quickly, and the interpretability of the model is poor.
Disclosure of Invention
The invention aims to provide a method capable of predicting the medication compliance of a patient with tuberculosis, and solves the problem that the prediction result is inaccurate due to nonlinearity and unbalance of a data set in the traditional method.
The technical scheme adopted by the invention is as follows:
a machine-learning-based patient medication compliance prediction method comprising the following steps:
step 1, collecting patient data information, calculating a regular medication rate from the patient's medication data, defining the patient's medication compliance according to the regular medication rate, and labeling high compliance and low compliance;
step 2, calculating the correlation among the feature variables, and filtering out the features with low correlation;
step 3, using the importance scores of a random forest as the measure to obtain the medication compliance features $X_1, X_2, \ldots, X_c$;
step 4, based on the medication compliance features $X_1, X_2, \ldots, X_c$ obtained in step 3, optimizing the hyperparameters of the random forest model with a grid search algorithm to obtain the optimal hyperparameters;
step 5, training the random forest model with a cyclic estimation algorithm to obtain a final classification model;
step 6, classifying the data with the final classification model trained in step 5, and outputting a classification result.
The invention is also characterized in that:
the patient data information comprises 15 feature attributes (age, sex, household registration, adverse reaction score, treatment month of the treatment regimen, drug type of the treatment regimen, medication management mode, smoking frequency, drinking frequency, individual room, ventilation condition, patient type, health education score, complications, and same-day sputum examination) and one target attribute, medication compliance; the regular medication rate is calculated from the patient's medication data, and the patient's medication compliance is defined according to that rate: a rate above 80% is defined as high compliance and labeled 1, and a rate below 80% is defined as low compliance and labeled 0.
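The labeling rule of step 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the text does not define how the regular medication rate is computed, so the ratio of doses taken on schedule to doses prescribed is an assumption, as are the function names and the patient records.

```python
# Hypothetical sketch of step 1; the rate definition and all names are assumptions.
def regular_medication_rate(doses_taken: int, doses_prescribed: int) -> float:
    """Fraction of prescribed doses that were taken on schedule."""
    if doses_prescribed <= 0:
        raise ValueError("doses_prescribed must be positive")
    return doses_taken / doses_prescribed


def compliance_label(rate: float, threshold: float = 0.80) -> int:
    """Return 1 (high compliance) above the threshold, otherwise 0 (low compliance)."""
    # The text does not say how a rate of exactly 80% is labeled; it is treated as low here.
    return 1 if rate > threshold else 0


# Made-up records: patient id -> (doses taken, doses prescribed).
patients = {"P001": (58, 60), "P002": (40, 60), "P003": (48, 60)}
compliance = {
    pid: compliance_label(regular_medication_rate(taken, prescribed))
    for pid, (taken, prescribed) in patients.items()
}
print(compliance)
```

Only the 80% threshold and the 1/0 labels come from the text; everything else is illustrative.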
The step 2 is specifically as follows: firstly, one-hot coding and standardization are applied to the whole sample data set, then the correlation coefficients among all features are calculated, and finally the results are visualized and a correlation matrix diagram is output; the Pearson correlation coefficient is a statistical measure generally used in machine learning to represent the linear relationship between two variables; the calculation formula is expressed as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein $\mathrm{Cov}(X, Y)$ represents the covariance of X and Y, and $\sigma_X$ and $\sigma_Y$ represent the standard deviations of X and Y; the correlation strength is determined by calculating the correlation coefficient between each variable and the target; the larger the absolute value of the correlation coefficient, the stronger the correlation; the feature variables are filtered according to the absolute value of the correlation coefficient, removing uncorrelated features and reducing interference.
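The correlation filter of step 2 can be sketched in plain Python. The text gives no cut-off value for the absolute correlation coefficient, so `min_abs_r` below is a hypothetical parameter, and the feature columns are made-up data.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def filter_by_correlation(features, target, min_abs_r):
    """Keep features whose |r| with the target is at least min_abs_r (assumed cut-off)."""
    scores = {name: pearson(col, target) for name, col in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= min_abs_r]
    return kept, scores


# Made-up columns standing in for encoded patient attributes.
features = {
    "health_education_score": [1, 2, 3, 4, 5, 6],
    "ventilation": [1, 0, 1, 0, 1, 0],
    "constant": [1, 1, 1, 1, 1, 1],
}
target = [0, 0, 0, 1, 1, 1]
kept, scores = filter_by_correlation(features, target, min_abs_r=0.5)
print(kept, scores)
```

The common factor 1/n cancels between the covariance and the standard deviations, so the sums are used directly.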
The step 3 comprises the following steps: step 3.1, the data are standardized with the Z-Score algorithm, wherein the standardization formula is:
$$x' = \frac{x - \mu}{\sigma}$$
wherein $x'$ represents the standardized data, $x$ represents the original data, and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding feature;
step 3.2, the standardized data are taken as input, the Gini index is used as the measure, and the importance ranking of the feature variables is output, wherein the importance scores of the features are calculated from the Gini index as follows:
let the m feature variables be denoted $X_1, X_2, \ldots, X_m$, and let the Gini-based importance score of each variable be denoted $VIM_j^{(Gini)}$; the Gini index of node m is expressed as:
$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^2$$
where K is the number of classes in the sample and $p_{mk}$ is the proportion of class k in node m; the change in the Gini index of $X_j$ before and after the branching of node m is then:
$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$
wherein $GI_l$ and $GI_r$ represent the Gini indexes of the left node and the right node generated by splitting node m; the importance score of $X_j$ in the i-th decision tree is then expressed as:
$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$
where M is the set of nodes of the i-th tree that split on $X_j$; assuming that the random forest has n base classifiers, the importance score is:
$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$
step 3.3, the importance scores are normalized over all m feature variables to obtain the importance score of feature $X_j$:
$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{m} VIM_{j'}^{(Gini)}}$$
with 0.05 set as the threshold, features with importance scores greater than 0.05 are regarded as factors affecting the patient's medication compliance; this step removes irrelevant, meaningless features and retains the features affecting medication compliance, and the remaining features are denoted $X_1, X_2, \ldots, X_c$.
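The Gini index, the normalization of step 3.3, and the 0.05 threshold can be illustrated as follows. The raw importance scores are made-up numbers standing in for the per-feature totals accumulated over the forest, and the function names are hypothetical; only the threshold value comes from the text.

```python
def gini_index(class_counts):
    """Gini index of a node: 1 - sum_k p_k^2, from the per-class sample counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


def normalize_importances(raw_scores):
    """Divide each raw score by the sum over all features so the scores add up to 1."""
    total = sum(raw_scores.values())
    return {name: score / total for name, score in raw_scores.items()}


def select_features(importances, threshold=0.05):
    """Rank features by importance and keep those strictly above the threshold."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Made-up raw scores; in the method they would come from the trained forest.
raw = {
    "treatment_month": 4.2,
    "medication_management": 3.1,
    "age": 1.5,
    "smoking_frequency": 0.3,
    "individual_room": 0.1,
}
normalized = normalize_importances(raw)
selected = select_features(normalized)
print(gini_index([5, 5]), selected)
```

A perfectly mixed two-class node has a Gini index of 0.5 and a pure node has 0, which is why splits that purify the child nodes raise a feature's score.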
The step 4 is specifically as follows: firstly, the parameter ranges to be optimized are determined; the main parameters to optimize are the number of decision trees n_estimators, the split criterion criterion, the minimum number of samples required to split a node min_samples_split, the maximum depth of the decision trees max_depth, and the maximum number of features max_features; then, in each iteration, any k-1 subsets of the data are selected as training samples, for ten iterations in total, and the parameters of the model with the highest average accuracy are selected as the optimal parameter combination and output; it is then judged whether the output optimal parameters lie on the boundary of the parameter range, and if so, the parameter range is readjusted and the above steps are repeated; finally, the optimal hyperparameters of the model are output.
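A grid search of this kind might look like the sketch below, which assumes scikit-learn and maps the five parameters listed above onto its `RandomForestClassifier` arguments `n_estimators`, `criterion`, `min_samples_split`, `max_depth`, and `max_features`. The data are synthetic, the candidate values are arbitrary, and ten-fold cross-validation is an interpretation of the ten iterations described in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the selected features X_1..X_c and the 0/1 compliance label.
X, y = make_classification(n_samples=150, n_features=8, n_informative=4, random_state=0)

# Candidate values are illustrative only; the patent does not list its ranges.
param_grid = {
    "n_estimators": [10, 30],
    "criterion": ["gini", "entropy"],
    "min_samples_split": [2, 5],
    "max_depth": [3, 6],
    "max_features": ["sqrt"],
}

# cv=10 trains on nine folds and validates on the tenth, ten times per combination;
# the combination with the highest mean accuracy is kept.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=10,
)
search.fit(X, y)
best_params = search.best_params_
print(best_params, round(search.best_score_, 3))
```

`best_params` can then be compared against the edges of `param_grid` to decide whether the ranges need widening before another pass.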
The step 5 is specifically as follows:
step 5.1, with the optimal hyperparameters applied, sample subsets are drawn with replacement by bootstrap sampling and a plurality of decision trees is constructed, wherein the samples not drawn in each round form the out-of-bag data, which are used as test samples;
step 5.2, training the random forest model to obtain a trained final classification model;
the step 5.2 specifically comprises the following steps:
step 5.2.1, creating decision trees by using the number m of the decision trees in the random forest obtained in the step 4;
step 5.2.2, dividing the sample set into n training sets of the same size by random sampling with replacement;
step 5.2.3, randomly selecting k characteristics from all attribute sets, and establishing a training model for the selected samples and the characteristic sets by using a decision tree algorithm;
step 5.2.4, repeating steps 5.2.2 and 5.2.3 m times to generate m decision trees;
step 5.2.5, for each prediction sample, each decision tree generates a predicted classification result; finally, the prediction of the random forest model is output through a voting or weighting mechanism, and the model parameters with the best prediction result are stored.
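The bootstrap sampling of step 5.1 and the voting of step 5.2.5 can be sketched without any library. The helper names are hypothetical, and simple majority voting is shown; the weighting mechanism mentioned in the text is not covered.

```python
import random
from collections import Counter


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in-bag, out-of-bag) indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, out_of_bag


def majority_vote(tree_predictions):
    """Combine the class predictions of all trees for one sample by simple voting."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(42)
in_bag, out_of_bag = bootstrap_indices(10, rng)
print("in-bag:", in_bag)
print("out-of-bag:", out_of_bag)

# Five hypothetical trees vote on one patient: three say high compliance (1).
print(majority_vote([1, 0, 1, 1, 0]))
```

Each tree is trained on its own in-bag indices, and the out-of-bag indices give it a test set it never saw during training.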
The beneficial effects of the invention are as follows:
1) Medical data often suffer from class imbalance, collinearity, and strong correlation among features, and contain both linear and nonlinear relationships. Traditional research methods predict patient compliance poorly, whereas the model of the invention handles linear data and also achieves good predictive performance on nonlinear problems, giving it good applicability.
2) Medical data mining often requires a model with a degree of interpretability. The method of the invention outputs the importance ranking of the feature variables together with the prediction results, and identifies the factors influencing patient medication compliance by means of a threshold, so that the experimental results are interpretable. The method is therefore of great significance for formulating corresponding treatment schemes, improving patient medication compliance, and curing tuberculosis patients.
3) The method is suitable for a tuberculosis patient management system: predicting patient compliance in advance with the model helps screen out people with low compliance, and focusing attention on them improves the cure rate.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Fig. 2 is a correlation matrix diagram of dataset feature variables.
FIG. 3 is a feature variable importance ranking output by the method of the present invention.
FIG. 4 is an experimental comparison of the feature selection process of the inventive method (RF-RFE) and the comparative method (SVM-RFE).
FIGS. 5-10 are ROC curves of the prediction models built by combining the feature set screened by the inventive method, the original feature set, and the feature set screened by the SVM-RFE method with each of six machine learning classifiers.
FIG. 11 compares the performance of prediction models built with a random forest as the classifier on features screened by the inventive method and by existing feature selection methods (RFFilter, CorSFS, SVM-RFE).
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
A machine-learning-based patient medication compliance prediction method, as shown in FIG. 1, is implemented according to the following steps:
Step 1: patient data information is collected. The data set comprises 15 feature attributes (age, sex, household registration, adverse reaction score, treatment month of the treatment regimen, drug type of the treatment regimen, medication management mode, smoking frequency, drinking frequency, individual room, ventilation condition, patient type, health education score, complications, and same-day sputum examination) and one target attribute, medication compliance. The regular medication rate is calculated from the patient's medication data, and medication compliance is defined according to that rate: a rate above 80% is defined as high compliance and labeled 1, and a rate below 80% is defined as low compliance and labeled 0.
Step 2: the correlation among the feature variables is calculated and a correlation matrix diagram is output, to give a basic understanding of which variables in the data set are correlated. The specific steps are as follows:
Firstly, one-hot coding and standardization are applied to the whole sample data set, then the correlation coefficients among all features are calculated, and finally the results are visualized and a correlation matrix diagram is output. The Pearson correlation coefficient is a statistical measure generally used in machine learning to represent the linear relationship between two variables. The calculation formula is expressed as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein $\mathrm{Cov}(X, Y)$ represents the covariance of X and Y, and $\sigma_X$ and $\sigma_Y$ represent the standard deviations of X and Y. The correlation strength is determined by calculating the correlation coefficient between each variable and the target. The larger the absolute value of the correlation coefficient, the stronger the correlation. The feature variables are filtered according to the absolute value of the correlation coefficient, removing uncorrelated features and reducing interference.
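The one-hot coding applied before the correlation analysis can be sketched as below. The column name and category values are hypothetical; the point is that each categorical attribute expands into one 0/1 column per category, so the encoded feature count can exceed the number of raw attributes.

```python
def one_hot(rows, column):
    """Replace a categorical column with one 0/1 column per observed category."""
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other field unchanged, then append the indicator columns.
        new_row = {key: value for key, value in row.items() if key != column}
        for category in categories:
            new_row[f"{column}_{category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


# Made-up records; "new" and "retreated" are illustrative category names.
rows = [
    {"age": 34, "patient_type": "new"},
    {"age": 61, "patient_type": "retreated"},
    {"age": 47, "patient_type": "new"},
]
encoded = one_hot(rows, "patient_type")
for row in encoded:
    print(row)
```

The numeric columns produced this way can be passed directly to the Pearson calculation.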
Step 3: taking the importance score of the random forest as a measurement mode to acquire the medication compliance characteristic. The method comprises the following specific steps:
Step 3.1: the data are standardized with the Z-Score algorithm, wherein the standardization formula is:
$$x' = \frac{x - \mu}{\sigma}$$
where $x'$ represents the standardized data, $x$ represents the original data, and $\mu$ and $\sigma$ are the mean and standard deviation of the corresponding feature.
Step 3.2: the standardized data are taken as input, the Gini index is used as the measure, and the importance ranking of the feature variables is output. The importance scores of the features are calculated from the Gini index as follows:
Let the m feature variables be denoted $X_1, X_2, \ldots, X_m$, and let the Gini-based importance score of each variable be denoted $VIM_j^{(Gini)}$. The Gini index of node m is expressed as:
$$GI_m = \sum_{k=1}^{K} \sum_{k' \neq k} p_{mk}\, p_{mk'} = 1 - \sum_{k=1}^{K} p_{mk}^2$$
where K is the number of classes in the sample and $p_{mk}$ is the proportion of class k in node m. The change in the Gini index of $X_j$ before and after the branching of node m is then:
$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$
wherein $GI_l$ and $GI_r$ represent the Gini indexes of the left node and the right node generated by splitting node m. The importance score of $X_j$ in the i-th decision tree is then expressed as:
$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$
where M is the set of nodes of the i-th tree that split on $X_j$. Assuming that the random forest has n base classifiers, the importance score is:
$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$
Step 3.3: the importance scores are normalized over all m feature variables to obtain the importance score of feature $X_j$:
$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{m} VIM_{j'}^{(Gini)}}$$
With 0.05 set as the threshold, features with importance scores greater than 0.05 are regarded as factors affecting the patient's medication compliance. This step removes irrelevant, meaningless features and retains the features affecting medication compliance. The remaining features are denoted $X_1, X_2, \ldots, X_c$.
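The Z-Score standardization of step 3.1 can be illustrated with a short worked example. The population standard deviation (dividing by the number of values) is used here; the text does not say which variant is intended, and the ages are made-up values.

```python
import math


def z_score(values):
    """Standardize values to zero mean and unit variance: x' = (x - mean) / std."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0 for _ in values]  # a constant column has nothing to scale
    return [(v - mean) / std for v in values]


ages = [20, 30, 40, 50, 60]
standardized = z_score(ages)
print([round(v, 3) for v in standardized])
```

For these ages the mean is 40 and the standard deviation is the square root of 200, so the youngest and oldest patients map to minus and plus the square root of 2.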
Step 4: based on the medication compliance features $X_1, X_2, \ldots, X_c$ obtained in step 3, the hyperparameters of the random forest model are optimized with a grid search algorithm.
Firstly, the parameter ranges to be optimized are determined. The main parameters to optimize are the number of decision trees n_estimators, the split criterion criterion, the minimum number of samples required to split a node min_samples_split, the maximum depth of the decision trees max_depth, and the maximum number of features max_features. Then, in each iteration, any k-1 subsets of the data are selected as training samples, for ten iterations in total, and the parameters of the model with the highest average accuracy are selected as the optimal parameter combination and output. It is then judged whether the output optimal parameters lie on the boundary of the parameter range; if so, the parameter range is readjusted and the above steps are repeated. Finally, the optimal hyperparameters of the model are output, taken as the initialization parameters of the method, and applied to the subsequent experiments.
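The boundary check at the end of step 4 can be sketched as follows. The text does not say how the range is readjusted, so extending it by one fixed step past the boundary that was hit is an assumption, and the helper names and values are hypothetical.

```python
def on_boundary(best_value, candidates):
    """True if the best value is the smallest or largest candidate that was tried."""
    return best_value in (min(candidates), max(candidates))


def widen_range(candidates, best_value, step):
    """Extend a numeric range by one step past the boundary that was hit."""
    ordered = sorted(candidates)
    if best_value == ordered[-1]:
        return ordered + [ordered[-1] + step]
    if best_value == ordered[0] and ordered[0] - step > 0:
        return [ordered[0] - step] + ordered
    return ordered  # interior optimum, or no room to extend downward


# Hypothetical outcome: the search preferred the largest number of trees tried.
n_estimators_range = [50, 100, 150]
best = 150
if on_boundary(best, n_estimators_range):
    n_estimators_range = widen_range(n_estimators_range, best, step=50)
print(n_estimators_range)
```

An optimum on the edge of the range suggests that a better value may lie just outside it, which is why the search is repeated with the widened range.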
Step 5: using the optimal hyperparameters of the random forest model obtained in step 4, an optimal feature set is selected through repeated training iterations based on a cyclic estimation algorithm, specifically as follows:
Step 5.1: the optimal hyperparameters obtained in step 4 are stored in the model. Sample subsets are then drawn with replacement by bootstrap sampling, and a plurality of decision trees is constructed. The samples not drawn in each round form the out-of-bag data, which are used as test samples.
Step 5.2: the random forest model is trained to obtain the random forest model parameters. The specific training process is as follows:
step 5.2.1, creating decision trees by using the number m of the decision trees in the random forest obtained in the step 4;
step 5.2.2, dividing the sample set into n training sets of the same size by random sampling with replacement;
step 5.2.3, randomly selecting k features from the full attribute set, and building a training model for the selected samples and feature set with a decision tree algorithm;
step 5.2.4, repeating steps 5.2.2 and 5.2.3 m times to generate m decision trees;
step 5.2.5, for each prediction sample, each decision tree generates a predicted classification result, and finally the prediction of the random forest model is output through a voting or weighting mechanism. With this training method, the model parameters with the best prediction result are stored.
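One way to sketch the iterative feature selection of step 5 is recursive feature elimination (RFE) wrapped around a random forest. This is an assumption: the text only speaks of a cyclic estimation algorithm, and the link to RFE is inferred from the RFRFE label used for the inventive method in the description of FIG. 4. The sketch relies on scikit-learn's `RFECV`; the data and all parameter values are synthetic and illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic stand-in for the encoded patient features and the 0/1 compliance label.
X, y = make_classification(
    n_samples=200, n_features=12, n_informative=4, n_redundant=2, random_state=0
)

# Each round refits the forest, drops the least important feature (step=1),
# and scores the remaining subset by cross-validated accuracy.
forest = RandomForestClassifier(n_estimators=30, random_state=0)
selector = RFECV(forest, step=1, cv=5, scoring="accuracy")
selector.fit(X, y)

# Retrain on the selected subset; oob_score=True evaluates on the out-of-bag samples.
X_selected = selector.transform(X)
final_model = RandomForestClassifier(n_estimators=30, oob_score=True, random_state=0)
final_model.fit(X_selected, y)
print(selector.n_features_, round(final_model.oob_score_, 3))
```

In practice the forest would be configured with the hyperparameters found in step 4 rather than the fixed values used here.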
Step 6: the optimal model obtained in step 5 is used as the final classification model to classify the data.
Example 1
In this embodiment, patient data from the national tuberculosis management and service system are used as the data set. To protect patient privacy, information such as identity card numbers and mobile phone numbers has been filtered out. The data set comprises 15 feature attributes, such as age, sex, household registration, and symptoms and signs, and one target attribute, the patient's level of medication compliance. These variables serve as samples for training and optimizing the algorithm. The experimental data are first preprocessed by data cleaning, encoding, and normalization, then divided into a training set and a test set at a ratio of 3:1, and the importance ranking of the feature variables is output. The practical effect of the invention is verified by comparing the prediction performance and evaluation indexes of the methods.
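The 3:1 division into training and test sets can be sketched as below. Shuffling before the split and the fixed seed are assumptions, since the text states only the ratio; the integer records stand in for patient rows.

```python
import random


def split_3_to_1(samples, seed=0):
    """Shuffle a copy of the samples and split it 3:1 into (training, test) lists."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]


records = list(range(100))  # placeholder patient records
train_set, test_set = split_3_to_1(records)
print(len(train_set), len(test_set))
```

With imbalanced labels, a stratified split that preserves the share of high-compliance and low-compliance patients in both sets may be preferable; the text does not say which was used.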
Assume that sample D has n features, denoted $X = (x_1, x_2, x_3, \ldots, x_n)$, where X represents the feature set of sample D.
Step 1: the regular medication rate is calculated from the patient's medication data, and medication compliance is defined according to that rate: a rate above 80% is defined as high compliance and labeled 1, and a rate below 80% is defined as low compliance and labeled 0.
Step 2: the correlation among the feature variables is calculated and a correlation matrix diagram is output, to give a basic understanding of which variables in the data set are correlated.
FIG. 2 is the feature correlation coefficient matrix diagram obtained by performing a Pearson correlation test on all the data. The shade of each color block in the figure represents the strength of the correlation: the stronger the positive correlation, the lighter the block, and the stronger the negative correlation, the darker the block. The figure shows that some features of the data set are highly correlated. This problem is common in medical data, and it is one that the method of the invention is intended to solve.
Step 3: and taking the standardized data as input, taking the importance score of the random forest as a measurement mode, and outputting the importance ranking of the characteristic variables.
FIG. 3 is a visualization of the importance ranking of the feature variables. With 0.05 as the feature selection threshold, the figure shows that the top seven features affecting medication compliance are: treatment month, medication management mode, adverse reaction, drug type, health education, age, and household registration. These seven features are taken as the risk factors influencing the medication compliance of tuberculosis patients, and identifying them effectively is important for improving patients' medication behavior.
Step 4: model parameters of the inventive method are initialized based on a grid search algorithm.
Step 5: training the random forest model by utilizing a cyclic estimation algorithm to obtain a final classification model;
Step 6: the data are classified with the final classification model trained in step 5, and the classification result is output.
FIG. 4 shows the feature selection process of the inventive algorithm (RF-RFE) and the comparison algorithm (SVM-RFE). The inventive algorithm reaches its optimum with 12 features, whereas the comparison algorithm reaches its optimum with 23 features. The data set has 27 features; after feature selection and filtering by the inventive algorithm, the optimal subset contains fewer than 50% of the original features. Throughout the feature selection process, the model accuracy of the inventive algorithm is always higher than that of the comparison algorithm. The inventive algorithm is therefore useful: it reduces the feature dimension, lowers the algorithm complexity, and improves the model accuracy.
FIGS. 5-10 are ROC curves of the models built by combining six classifiers (including the random forest model of the invention) with the feature set screened by the inventive method, the original feature set, and the feature set screened by the SVM-RFE method, respectively. The six classifiers are LR, SVM, RF, NB, DT, and KNN. The figures show that, with LR, SVM, RF, and DT as classifiers, combining them with the proposed algorithm improves the models markedly: the AUC values of the classifier models using the inventive algorithm are far greater than those using the original feature set or the comparison algorithm. FIGS. 10 and 11 show that, with NB and KNN as classifiers, the prediction models built on features extracted by the inventive method and by the comparison algorithm differ little in AUC, but the optimal subset of the inventive algorithm contains 12 features, far fewer than the 27 original features and the 23 features selected by the comparison algorithm. The method therefore reduces the feature dimension while preserving prediction performance, and can be applied as a feature selection algorithm in data mining.
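The AUC values compared above summarize each ROC curve in a single number. A minimal sketch of one standard way to compute it is shown below: the AUC equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The labels and scores are made-up values, not results from the patent.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predictions: 1 = high compliance, score = predicted probability of class 1.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(round(auc(labels, scores), 4))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two compliance classes.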
FIG. 11 compares the performance of prediction models that use a random forest as the classifier in combination with the inventive method and with more recent feature selection techniques. The figure shows that the model built by combining the inventive method with the random forest performs best. Both the CorSFS algorithm and the inventive method address the high correlation among medical data features, but the comparison shows that the inventive method performs better.
On the one hand, the method of the invention is based on real medical data of tuberculosis patients and introduces a random forest to study the factors influencing their medication compliance. This avoids errors in the experimental results caused by unreliable data. The random forest handles imbalanced and nonlinear data well, and collinearity among the data can be reduced by adjusting its parameters. Identifying the factors that influence patient medication compliance and formulating corresponding treatment schemes can improve compliance, which is of great significance for curing tuberculosis. On the other hand, to address the strong correlation, high redundancy, and imbalance among the features of medical data, the invention provides a medication compliance prediction model based on the inventive algorithm. The model overcomes the low prediction accuracy and inaccurate identification of traditional algorithms and achieves higher accuracy in both feature selection and prediction. With the prediction model in place, people with low compliance can be identified in advance and given focused attention through corresponding measures. This improves patient medication compliance, helps treatment progress in a favorable direction, and is of great significance for improving patient management strategies.
Claims (6)
1. A machine learning based patient medication compliance prediction method, comprising the steps of:
step 1, collecting patient data information, calculating a regular administration rate according to patient administration data, defining the administration compliance of a patient according to the regular administration rate, and labeling high compliance and low compliance;
step 2, calculating the correlation among the feature variables, and filtering out the features with low correlation;
step 3, using the importance scores of the random forest as the measure to obtain the medication compliance features $X_1, X_2, \ldots, X_c$;
step 4, based on the medication compliance features $X_1, X_2, \ldots, X_c$ obtained in step 3, optimizing the hyperparameters of the random forest model with a grid search algorithm to obtain the optimal hyperparameters;
step 5, training the random forest model by using a cyclic estimation algorithm to obtain a final classification model;
step 6, classifying the data with the final classification model trained in step 5, and outputting the classification result.
2. The machine learning based patient compliance prediction method of claim 1, wherein the patient data information includes 15 characteristic attributes, namely age, sex, household deposit, adverse reaction score, treatment regimen month, treatment regimen drug category, medication management mode, smoking frequency, drinking frequency, individual room, ventilation condition, patient type, health education score, complications and the present day of sputum examination, and one target attribute, medication compliance; a regular medication rate is calculated from the patient medication data, and the medication compliance of the patient is defined according to the regular medication rate: a rate of more than 80% is defined as high compliance, and a rate of less than 80% is defined as low compliance, with the label set to 0.
3. The machine learning based patient compliance prediction method of claim 1, wherein step 2 is specifically: first, one-hot encoding and standardization are applied to the whole sample data set; then the correlation coefficients among all features are calculated; finally, the results are visualized and a correlation matrix diagram is output; the Pearson correlation coefficient is a statistical measure generally used in machine learning to represent the linear relationship between two variables, and is calculated as:
$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$
wherein $\mathrm{Cov}(X, Y)$ represents the covariance of X and Y, and $\sigma_X$, $\sigma_Y$ represent the standard deviations of X and Y; the correlation strength is determined by calculating the correlation coefficient between each variable and the target; the larger the absolute value of the correlation coefficient, the stronger the correlation; the feature variables are filtered according to the absolute value of the correlation coefficient, screening out uncorrelated features and reducing interference.
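The correlation filter of claim 3 can be illustrated with a short sketch. It computes the Pearson coefficient between each feature and the target and keeps the features whose absolute coefficient reaches a cut-off. The cut-off value, the feature names and the toy data are hypothetical, since the claim gives no threshold, and one-hot encoding and plotting of the correlation matrix are left out.

```python
import math


def pearson(x, y):
    """Pearson coefficient: Cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    if std_x == 0 or std_y == 0:
        return 0.0  # constant column: coefficient undefined, treated as 0 here
    return cov / (std_x * std_y)


def filter_by_correlation(features, target, min_abs_corr):
    """Keep features whose |correlation with the target| >= min_abs_corr."""
    kept = {}
    for name, column in features.items():
        r = pearson(column, target)
        if abs(r) >= min_abs_corr:
            kept[name] = r
    return kept


# Hypothetical toy data: two numeric features and a binary compliance label.
features = {
    "health_education_score": [1, 2, 3, 4, 5],
    "ventilation_condition": [2, 1, 2, 1, 2],
}
target = [0, 0, 1, 1, 1]
kept = filter_by_correlation(features, target, min_abs_corr=0.3)
```

Because the Pearson coefficient is unchanged by shifting and positive rescaling of a variable, standardizing the columns beforehand does not alter which features pass the filter.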
4. A machine learning based patient compliance prediction method as claimed in claim 1, wherein said step 3 comprises:
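step 3.1, data normalization is performed using the Z-Score algorithm, wherein the data normalization formula is:
$$x' = \frac{x - \mu}{\sigma}$$
wherein $x'$ represents the normalized data, $x$ represents the original data, and $\mu$ and $\sigma$ are the mean and standard deviation of the original data;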
step 3.2, taking the standardized data as input and the Gini coefficient as the measure, outputting an importance ranking of the feature variables, wherein the principle for calculating feature importance scores from the Gini coefficient is as follows:
let the m feature variables be denoted $X_1, X_2, \ldots, X_m$, and let the Gini-based importance score of each variable $X_j$ be denoted $VIM_j^{(Gini)}$; the Gini index of node m is expressed as:
$$GI_m = \sum_{k=1}^{K} p_{mk}\,(1 - p_{mk}) = 1 - \sum_{k=1}^{K} p_{mk}^2$$
where K is the number of classes in the sample and $p_{mk}$ is the proportion of class k in node m; the change in the Gini index of $X_j$ before and after the branching of node m is then:
$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$
wherein $GI_l$ and $GI_r$ represent the Gini indices of the left node and the right node generated by splitting node m; the importance score of $X_j$ in the i-th tree is then expressed as:
$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$
where M is the set of nodes of the i-th tree at which $X_j$ is used for splitting; assuming that the random forest has n base classifiers, the importance score is:
$$VIM_j^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$
step 3.3, normalizing the importance scores to obtain the importance score of feature $X_j$:
$$VIM_j = \frac{VIM_j^{(Gini)}}{\sum_{j'=1}^{m} VIM_{j'}^{(Gini)}}$$
setting 0.05 as the threshold, features with importance scores greater than 0.05 are regarded as factors affecting the patient's medication compliance; this step removes irrelevant, meaningless features and retains those affecting the patient's medication compliance, and the remaining feature vectors are denoted $X_1, X_2, \ldots, X_c$.
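The building blocks of claim 4 can be sketched in a few lines: the Gini index of a node, the change in the Gini index at a split, and the normalization and 0.05 threshold of step 3.3. This is an illustrative sketch only. The unweighted difference follows the wording of the claim; libraries such as scikit-learn additionally weight each node by its share of samples. The raw scores and feature names below are hypothetical.

```python
def gini_index(class_counts):
    """Gini index of a node: 1 - sum_k p_k^2, from per-class sample counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in class_counts)


def gini_decrease(parent_counts, left_counts, right_counts):
    """Change in the Gini index at a split, GI_m - GI_l - GI_r (unweighted)."""
    return (
        gini_index(parent_counts)
        - gini_index(left_counts)
        - gini_index(right_counts)
    )


def normalize_scores(raw_scores):
    """Divide each raw importance by the sum over all features."""
    total = sum(raw_scores.values())
    return {name: score / total for name, score in raw_scores.items()}


def select_features(raw_scores, threshold=0.05):
    """Keep features whose normalized importance exceeds the threshold."""
    normalized = normalize_scores(raw_scores)
    return [name for name, score in normalized.items() if score > threshold]


# A perfectly separating split of a balanced two-class node.
perfect_split = gini_decrease([5, 5], [5, 0], [0, 5])

# Hypothetical raw importance scores summed over all trees of a forest.
raw = {"age": 4.5, "smoking_frequency": 0.4, "individual_room": 0.1}
selected = select_features(raw)
```

With these numbers the normalized scores are 0.9, 0.08 and 0.02, so only the last feature falls below the 0.05 threshold.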
5. The machine learning based patient compliance prediction method of claim 1, wherein step 4 is specifically: first, the range of the parameters to be optimized is determined, the main parameters considered being the number of decision trees n_identifiers, the split criterion, the number of split nodes min_sample_split, the maximum depth of the decision trees max_depth and the maximum number of features max_features; then, in each iteration, any k-1 parts of the data are selected as training samples, for ten iterations, and the parameters of the model with the highest average accuracy are selected as the optimal parameter combination and output; it is then judged whether the output optimal parameters reach the boundary of the parameter range, and if so, the parameter range is readjusted and the above steps are repeated; finally, the optimal hyperparameters of the model are output.
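The search procedure of claim 5 can be sketched as a grid search scored by k-fold cross-validation, followed by a boundary check. The sketch reads "iterating for ten times" as ten-fold cross-validation, which is an interpretation. To stay self-contained it uses a trivial threshold classifier with a single hypothetical hyperparameter `cutoff` as a stand-in for the random forest; in practice the grid would range over the forest's own parameters (the names in the claim resemble scikit-learn's `n_estimators`, `criterion`, `min_samples_split`, `max_depth` and `max_features`, though the patent does not name a library).

```python
import itertools


def k_fold_indices(n_samples, k):
    """Split sample indices into k interleaved folds."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def cross_val_accuracy(fit, X, y, params, k=10):
    """Average accuracy over k folds; each round trains on the other k-1."""
    if len(X) < k:
        raise ValueError("need at least k samples")
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train], **params)
        correct = sum(model(X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return sum(scores) / k


def grid_search(fit, X, y, grid, k=10):
    """Try every parameter combination; return the best one and its score."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cross_val_accuracy(fit, X, y, params, k)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def on_boundary(best_params, grid):
    """Names of parameters whose best value is the first or last grid value."""
    return [
        name
        for name, value in best_params.items()
        if len(grid[name]) > 1 and value in (grid[name][0], grid[name][-1])
    ]


def fit_threshold(X_train, y_train, cutoff):
    """Stand-in model (not a random forest): predicts 1 when x[0] > cutoff."""
    return lambda row: int(row[0] > cutoff)


# Hypothetical toy data: one feature, class 1 for values of 10 and above.
X = [[i] for i in range(20)]
y = [int(i >= 10) for i in range(20)]
grid = {"cutoff": [3.5, 9.5, 15.5]}
best_params, best_score = grid_search(fit_threshold, X, y, grid, k=10)
needs_wider_range = on_boundary(best_params, grid)
```

If `on_boundary` returns a non-empty list, the grid for those parameters would be extended past the boundary and the search repeated, mirroring the "readjust the parameter range" step.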
6. The machine learning based patient compliance prediction method of claim 1, wherein step 5 is specifically:
step 5.1, using the optimal hyperparameters, sampling by bootstrap and constructing a plurality of decision trees, wherein the samples not drawn in each round form the out-of-bag data, which are used as test samples;
step 5.2, training the random forest model to obtain the trained final classification model, specifically:
step 5.2.1, creating decision trees by using the number m of the decision trees in the random forest obtained in the step 4;
step 5.2.2, drawing n training sets of the same size from the sample set by random sampling with replacement;
step 5.2.3, randomly selecting k characteristics from all attribute sets, and establishing a training model for the selected samples and the characteristic sets by using a decision tree algorithm;
step 5.2.4, repeating steps 5.2.2 and 5.2.3 m times to generate m decision trees;
step 5.2.5, for each prediction sample, each decision tree generates a predicted classification result; finally, the prediction of the random forest model is output through a voting or weighting mechanism, and the model parameters with the best prediction result are stored.
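The training loop of claim 6 can be sketched with bootstrap sampling, out-of-bag bookkeeping, a random feature subset per tree and majority voting. This is a simplified illustration under stated assumptions: each "tree" is a one-split decision stump standing in for a full decision tree, the data are hypothetical, and the function names are not from the patent.

```python
import random
from collections import Counter


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement; also return unused indices."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    out_of_bag = set(range(n_samples)) - set(in_bag)
    return in_bag, out_of_bag


def train_stump(X, y, feature_ids):
    """One-split tree: pick the feature/threshold with fewest training errors."""
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            for left_label in (0, 1):
                preds = [left_label if row[f] <= t else 1 - left_label for row in X]
                errors = sum(p != label for p, label in zip(preds, y))
                if best is None or errors < best[0]:
                    best = (errors, f, t, left_label)
    _, f, t, left_label = best
    return lambda row: left_label if row[f] <= t else 1 - left_label


def train_forest(X, y, n_trees, n_features, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        in_bag, out_of_bag = bootstrap_sample(len(X), rng)
        feature_ids = rng.sample(range(len(X[0])), n_features)
        tree = train_stump([X[i] for i in in_bag], [y[i] for i in in_bag], feature_ids)
        forest.append((tree, out_of_bag))
    return forest


def predict(forest, row):
    """Majority vote over all trees."""
    votes = Counter(tree(row) for tree, _ in forest)
    return votes.most_common(1)[0][0]


def oob_accuracy(forest, X, y):
    """Accuracy when each sample is voted on only by trees that never saw it."""
    correct = total = 0
    for i, (row, label) in enumerate(zip(X, y)):
        votes = Counter(tree(row) for tree, oob in forest if i in oob)
        if votes:
            total += 1
            correct += votes.most_common(1)[0][0] == label
    return correct / total if total else float("nan")


# Hypothetical toy data: two features, class 1 for the upper half of the range.
X = [[i, 2 * i + 1] for i in range(20)]
y = [int(i >= 10) for i in range(20)]
forest = train_forest(X, y, n_trees=25, n_features=1, seed=0)
oob_score = oob_accuracy(forest, X, y)
```

The out-of-bag accuracy gives an estimate of generalization without a separate test split, which matches the role the claim assigns to the out-of-bag data.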
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN202211309805.5A CN116259415A (en)  2022-10-25  2022-10-25  Patient medicine taking compliance prediction method based on machine learning 
Publications (1)
Publication Number  Publication Date 

CN116259415A (en)  2023-06-13 
Family
ID=86685105
Family Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN202211309805.5A (Pending) CN116259415A (en)  2022-10-25  2022-10-25  Patient medicine taking compliance prediction method based on machine learning 
Country Status (1)
Country  Link 

CN (1)  CN116259415A (en) 
Cited By (4)
Publication number  Priority date  Publication date  Assignee  Title 

CN116798630A (en) *  2023-07-05  2023-09-22  广州视景医疗软件有限公司  Myopia physiotherapy compliance prediction method, device and medium based on machine learning 
CN116798630B (en) *  2023-07-05  2024-03-08  广州视景医疗软件有限公司  Myopia physiotherapy compliance prediction method, device and medium based on machine learning 
CN117133459A (en) *  2023-09-12  2023-11-28  江苏省人民医院（南京医科大学第一附属医院）  Machine learning-based postoperative intracranial infection prediction method and system 
CN117133459B (en) *  2023-09-12  2024-04-09  江苏省人民医院（南京医科大学第一附属医院）  Machine learning-based postoperative intracranial infection prediction method and system 
Legal Events
Date  Code  Title  Description 

PB01  Publication  
SE01  Entry into force of request for substantive examination  