CN108648827B - Cardiovascular and cerebrovascular disease risk prediction method and device - Google Patents
- Publication number
- CN108648827B CN201810449174.4A
- Authority
- CN
- China
- Prior art keywords
- patient
- sample
- data
- cardiovascular
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
Landscapes
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Pathology (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Investigating Or Analysing Biological Materials (AREA)
- Measuring And Recording Apparatus For Diagnosis (AREA)
Abstract
An embodiment of the invention provides a method and a device for predicting cardiovascular and cerebrovascular disease risk. The method comprises: obtaining a sample set; dividing the samples in the sample set into a preset number of local clusters; determining, according to a preset first K value and a first distance set, the first neighboring samples of an input sample (their number being equal to the first K value) so as to determine a target local cluster; calculating the distances between the input sample and the samples in the target local cluster so as to determine the second neighboring samples of the input sample (their number being equal to a second K value); determining the label of the input sample and, from the label, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally determining whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. The embodiment takes into account that the characteristic data of patients with cardiovascular and cerebrovascular diseases are highly similar to one another, so the influence of dissimilar sample data on the trained prediction model is avoided. The accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases can therefore be improved.
Description
Technical Field
The invention relates to the field of prediction analysis, in particular to a cardiovascular and cerebrovascular disease risk prediction method and device.
Background
As living pressure and mental pressure increase, the incidence of cardiovascular and cerebrovascular diseases rises year by year and seriously affects the health of residents. Medical practice shows that accurate early diagnosis of patients with cardiovascular and cerebrovascular diseases greatly benefits the intervention and treatment of these diseases.
In the prior art, data mining technology is used to mine the case data characteristics of cardiovascular and cerebrovascular diseases: the physical examination characteristic data and return visit data of all patients form a training set, and a prediction model is trained using decision tree, logistic regression and artificial neural network algorithms. The physical examination data of the patient to be predicted are then fed into the trained prediction model as an input sample, and the model outputs whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.
Take training a prediction model with an artificial neural network algorithm as an example. The input samples of the neural network include both samples of patients without cardiovascular and cerebrovascular diseases and samples of patients with them, and the characteristic data of these two kinds of samples differ greatly. When all samples in the training set are used as the input of the input layer, the error function of the output layer of the neural network is therefore large. Because the weights and thresholds of each layer of the neural network are adjusted according to this error function, which is affected by the dissimilar sample data, the trained prediction model is inaccurate. Consequently, a prediction model trained with an artificial neural network algorithm predicts with low accuracy whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.
Disclosure of Invention
The embodiment of the invention aims to provide a cardiovascular and cerebrovascular disease risk prediction method and device so as to improve the accuracy of predicting whether a patient is a cardiovascular and cerebrovascular disease patient. The specific technical scheme is as follows:
in a first aspect, an embodiment of the present invention provides a cardiovascular and cerebrovascular disease risk prediction method, including:
obtaining a sample set; the sample set is determined from a plurality of labeled samples of a patient medical database set; one sample includes: the patient's number, characteristics and characteristic data; the label includes: a first label and a second label; the first label identifies a sample of a patient with cardiovascular and cerebrovascular diseases; the second label identifies a sample of a patient without cardiovascular and cerebrovascular diseases;
obtaining an input sample; the input sample is formed by combining the medical health physical examination data and the medical treatment data of a patient to be predicted;
performing metric learning by using a cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;
dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
calculating the distances between the input sample and the samples in the sample set by using a cosine similarity algorithm according to the global metric matrix to form a first distance set;
determining, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the first neighboring samples of the input sample, the number of first neighboring samples being equal to the first K value;
determining the local clusters in which the first neighboring samples are located;
selecting, from the local clusters in which the first neighboring samples are located, a local cluster in which the number of first neighboring samples exceeds a first preset threshold value as the target local cluster;
assigning the input sample to the target local cluster;
calculating, by using a cosine similarity algorithm according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;
determining, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, the second neighboring samples of the input sample, the number of second neighboring samples being equal to the second K value;
counting the number of first labels and the number of second labels among the second neighboring samples;
if the ratio of the number of first labels to the number of second labels exceeds a preset label threshold value, taking the first label as the label of the input sample, otherwise taking the second label as the label of the input sample;
determining, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;
if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases;
and if the input sample is not a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases.
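The two-stage neighbor search described in the first aspect can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the metric matrices are assumed to be given (COS-LMNN learning and the clustering step are not shown), the distance is assumed to be one minus the cosine similarity of the transformed vectors, and when several local clusters exceed the first preset threshold the sketch simply picks the one holding the most first neighboring samples. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

def transform(A, x):
    """Apply metric matrix A (list of rows) to vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def cosine_distance(A, x, y):
    """Assumed distance: 1 - cosine similarity of A*x and A*y."""
    ax, ay = transform(A, x), transform(A, y)
    dot = sum(p * q for p, q in zip(ax, ay))
    norm = math.sqrt(sum(p * p for p in ax)) * math.sqrt(sum(q * q for q in ay))
    return 1.0 - dot / norm if norm else 1.0

def k_nearest(A, x, samples, indices, k):
    """Indices of the k samples (among `indices`) closest to x."""
    return sorted(indices, key=lambda j: cosine_distance(A, x, samples[j]))[:k]

def predict(x, samples, labels, clusters, A_global, A_local,
            k1, k2, count_threshold, label_threshold):
    """Return 1 (first label), 2 (second label), or None if no target cluster."""
    # Stage 1: first neighboring samples under the global metric matrix.
    first = k_nearest(A_global, x, samples, range(len(samples)), k1)
    cluster, count = Counter(clusters[j] for j in first).most_common(1)[0]
    if count <= count_threshold:
        return None  # no local cluster exceeds the first preset threshold
    # Stage 2: second neighboring samples inside the target local cluster.
    members = [j for j, c in enumerate(clusters) if c == cluster]
    second = k_nearest(A_local[cluster], x, samples, members, k2)
    n_first = sum(1 for j in second if labels[j] == 1)
    n_second = len(second) - n_first
    ratio = n_first / n_second if n_second else math.inf
    return 1 if ratio > label_threshold else 2

# Toy data: two directions in 2-D, identity metric matrices (assumption).
identity = [[1.0, 0.0], [0.0, 1.0]]
samples = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.3], [0.1, 1.0], [0.2, 0.9], [0.3, 1.0]]
labels = [1, 1, 2, 2, 2, 1]
clusters = [0, 0, 0, 1, 1, 1]
A_local = {0: identity, 1: identity}
result_a = predict([1.0, 0.2], samples, labels, clusters, identity, A_local, 3, 3, 1, 1.0)
result_b = predict([0.2, 1.0], samples, labels, clusters, identity, A_local, 3, 3, 1, 1.0)
```

In the toy data, the first query falls into cluster 0, where first labels outnumber second labels two to one, and the second query falls into cluster 1, where the ratio is one to two.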
Optionally, after the step of determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases, the method further comprises:
determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;
if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient, recommending hospitalization for the patient to be predicted;
if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient, recommending an increased physical examination frequency for the patient to be predicted;
after the step of determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases, the method further comprises:
determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;
if the patient to be predicted is a healthy user, recommending that the patient to be predicted keep the normal physical examination frequency;
if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed-diagnosis patient, and adding the characteristic data of the missed-diagnosis patient to the patient medical database set;
wherein the missed-diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
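The follow-up branches above amount to a small decision procedure. The sketch below is only an illustration under assumed data shapes: the return visit record is a dictionary with hypothetical keys `high_risk`, `healthy` and `features`, and the patient medical database set is modeled as a plain list.

```python
def follow_up(predicted_patient, return_visit, database):
    """Return a recommendation string for the patient to be predicted.

    predicted_patient: True if the method labeled the patient as a
        cardiovascular and cerebrovascular disease patient.
    return_visit: dict with hypothetical keys 'high_risk', 'healthy', 'features'.
    database: list standing in for the patient medical database set.
    """
    if predicted_patient:
        if return_visit["high_risk"]:
            return "hospitalization"
        return "increase physical examination frequency"
    if return_visit["healthy"]:
        return "keep normal physical examination frequency"
    # Predicted negative but not healthy: a missed-diagnosis patient whose
    # characteristic data is added back to the database set.
    database.append(return_visit["features"])
    return "missed diagnosis"

database = []
advice_high = follow_up(True, {"high_risk": True, "healthy": False, "features": {}}, database)
advice_missed = follow_up(False, {"high_risk": False, "healthy": False,
                                  "features": {"systolic_bp": 165}}, database)
```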
Optionally, the first label identifying a sample of a patient with cardiovascular and cerebrovascular diseases includes:
determining the identification information of patients with cardiovascular and cerebrovascular diseases according to the collected health return visit data of the patients;
the health return visit data of a patient comprises: the patient's number, characteristics, characteristic data and confirmed condition; the identification information includes: the confirmed symptoms, and the characteristics and characteristic data corresponding to the confirmed symptoms;
determining, in the patient medical database set, the samples of patients with cardiovascular and cerebrovascular diseases according to the identification information of the patients with cardiovascular and cerebrovascular diseases;
setting the first label on the samples of patients with cardiovascular and cerebrovascular diseases;
the second label identifying a sample of a patient without cardiovascular and cerebrovascular diseases includes:
setting the second label on all samples other than the samples of patients with cardiovascular and cerebrovascular diseases.
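As a rough illustration of this labeling step, the sketch below assigns the first label to samples whose patient number appears among the confirmed cases in the return visit data, and the second label to every other sample. Matching on the patient number alone is a simplifying assumption (the text matches on identification information that also covers confirmed symptoms and their characteristic data), and the field names are hypothetical.

```python
FIRST_LABEL = 1   # sample of a patient with cardiovascular and cerebrovascular diseases
SECOND_LABEL = 2  # every other sample

def label_samples(samples, return_visits):
    """Map each patient number to FIRST_LABEL or SECOND_LABEL."""
    confirmed = {v["patient_number"] for v in return_visits if v["confirmed"]}
    return {
        s["patient_number"]: FIRST_LABEL if s["patient_number"] in confirmed else SECOND_LABEL
        for s in samples
    }

samples = [{"patient_number": "P001"}, {"patient_number": "P002"}, {"patient_number": "P003"}]
return_visits = [
    {"patient_number": "P001", "confirmed": True},
    {"patient_number": "P002", "confirmed": False},
]
sample_labels = label_samples(samples, return_visits)
```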
Optionally, obtaining a sample set includes:
for the plurality of labeled samples of the patient medical database set, deleting the samples whose sample missing value is greater than a first threshold value;
the sample missing value is: the ratio of the number of missing features in a sample to the total number of features in the sample;
searching the plurality of samples remaining after sample deletion, and deleting the features whose feature missing value is greater than a second threshold value;
the feature missing value is: for one and the same feature across the plurality of samples, the ratio of the number of occurrences of that feature lacking feature data to the total number of occurrences of that feature;
searching the plurality of samples remaining after feature deletion for features with missing feature data, and taking them as first features;
filling in the feature data missing from the first features by using a multiple imputation method;
classifying the feature data of the plurality of samples after missing value filling according to data type to obtain a classification result;
wherein the classification result comprises: discrete feature data and continuous feature data;
processing the discrete feature data and the continuous feature data according to their data type, based on the classification result;
adding the feature data obtained by processing the discrete feature data and the continuous feature data to the patient medical database set to form a first database set;
wherein processing the discrete feature data and the continuous feature data according to their data type comprises: carrying out one-hot encoding on the discrete feature data; and standardizing the continuous feature data by using the Z-score method (normal standardization);
balancing the classes of the samples of the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;
calculating the variance of each feature's data in the second database set by using an analysis of variance method, and deleting the feature data whose variance value is smaller than a preset variance threshold value;
calculating, by using the Relief algorithm, the weight of each feature data remaining after the feature data with a variance value smaller than the preset variance threshold value has been deleted;
deleting, according to the weight of the feature data and the score value corresponding to the weight, the feature data whose score value is smaller than a preset score threshold value together with the corresponding features, to obtain a fourth database set;
determining the sample set from the fourth database set by using a forward selection method.
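Several of the preprocessing steps above (sample deletion, feature deletion, one-hot encoding and Z-score standardization) can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions: each sample is a dictionary whose missing values are `None`, the feature names and thresholds are invented for the example, and multiple imputation, SMOTE, Relief and forward selection are not shown.

```python
import math

def drop_sparse_samples(rows, first_threshold):
    """Delete samples whose share of missing features exceeds first_threshold."""
    return [r for r in rows
            if sum(v is None for v in r.values()) / len(r) <= first_threshold]

def drop_sparse_features(rows, second_threshold):
    """Delete features whose share of missing values exceeds second_threshold."""
    keep = [f for f in rows[0]
            if sum(r[f] is None for r in rows) / len(rows) <= second_threshold]
    return [{f: r[f] for f in keep} for r in rows]

def one_hot(values):
    """One-hot encode discrete feature data (categories in sorted order)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def z_score(values):
    """Standardize continuous feature data to zero mean and unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std else 0.0 for v in values]

rows = [
    {"systolic_bp": 120, "smoker": "yes", "hdl": None},
    {"systolic_bp": None, "smoker": None, "hdl": None},
    {"systolic_bp": 140, "smoker": "no", "hdl": None},
    {"systolic_bp": 160, "smoker": "no", "hdl": 1.2},
]
kept = drop_sparse_samples(rows, 0.5)       # removes the all-missing sample
kept = drop_sparse_features(kept, 0.5)      # removes 'hdl' (2 of 3 values missing)
smoker_encoded = one_hot([r["smoker"] for r in kept])
bp_standardized = z_score([r["systolic_bp"] for r in kept])
```

The order matters: samples are filtered before features, so a sample that is almost entirely empty does not inflate the per-feature missing ratios.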
Optionally, the calculating, according to the global metric matrix, distances between the input samples and the samples in the sample set by using a cosine similarity algorithm to form a first distance set includes:
calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula according to the global measurement matrix to form a first distance set;
wherein the cosine similarity algorithm formula is as follows:
the first distance set is $D_1 = \{D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), \ldots, D(x_i, x_j)\}$;
where $i$ represents the index of the input sample and $x_i$ represents the $i$-th input sample; the sample set is $X$; the global metric matrix is $A$; $M = A^{T}A$; $j$ represents the index of a sample in the sample set; $x_j$ represents the $j$-th sample in the sample set; $i$ and $j$ are positive integers; $D(x_i, x_j)$ represents the distance between the input sample $x_i$ and the $j$-th sample in the set $X$ under the global metric matrix; $A(x_i, x_j)$ represents the distance between $x_i$ and $x_j$ after transformation by the matrix $A$.
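The formula itself is not reproduced in this text. One form consistent with the definitions above, offered here only as an assumption and not as the patent's own expression, takes the cosine of the angle between the transformed vectors $Ax_i$ and $Ax_j$ and uses one minus that cosine as the distance:

```latex
\cos_M(x_i, x_j) = \frac{x_i^{T} M x_j}{\sqrt{x_i^{T} M x_i}\,\sqrt{x_j^{T} M x_j}},
\qquad M = A^{T} A,
\qquad D(x_i, x_j) = 1 - \cos_M(x_i, x_j)
```

Because $x_i^{T} M x_j = (A x_i)^{T} (A x_j)$, the first expression is exactly the ordinary cosine similarity computed after both samples have been transformed by $A$.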
Optionally, the calculating, by using a cosine similarity algorithm according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set includes:
calculating the distances between the input sample and the samples in the target local cluster by using a cosine similarity algorithm formula according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm to form the second distance set;
wherein the cosine similarity algorithm formula is as follows:
the second distance set is $D_2 = \{D(x_i, x_{s1}), D(x_i, x_{s2}), \ldots, D(x_i, x_{si})\}$;
where $i$ represents the index of the input sample and $x_i$ represents the $i$-th input sample; $x_{si}$ represents the samples of the same category as $i$; the local metric matrix is $A_S$; $M_S = A_S^{T} A_S$; $i$ is a positive integer; $D(x_i, x_{si})$ represents the distance between the input sample $x_i$ and the samples of the same category as $i$ in the target local cluster under the local metric matrix; $A_S(x_i, x_{si})$ represents the distance between $x_i$ and $x_{si}$ after transformation by the matrix $A_S$.
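A second distance set of this kind could be computed as in the sketch below. It assumes, as an illustration rather than the patent's own formula, that the distance is one minus the cosine similarity measured through $M_S = A_S^{T} A_S$; the function names are hypothetical and the local metric matrix is taken as given, not learned.

```python
import math

def gram(A):
    """Return M = A^T A for a matrix given as a list of rows."""
    n = len(A[0])
    return [[sum(row[p] * row[q] for row in A) for q in range(n)] for p in range(n)]

def bilinear(x, M, y):
    """Return x^T M y."""
    return sum(x[p] * M[p][q] * y[q] for p in range(len(x)) for q in range(len(y)))

def metric_cosine_distance(x, y, M):
    """Assumed distance: 1 - (x^T M y) / (sqrt(x^T M x) * sqrt(y^T M y))."""
    denom = math.sqrt(bilinear(x, M, x) * bilinear(y, M, y))
    return 1.0 - bilinear(x, M, y) / denom if denom else 1.0

def second_distance_set(x, cluster_samples, A_S):
    """Distances from input sample x to every sample in the target local cluster."""
    M_S = gram(A_S)
    return [metric_cosine_distance(x, s, M_S) for s in cluster_samples]

A_S = [[2.0, 0.0], [0.0, 1.0]]              # illustrative local metric matrix
D2 = second_distance_set([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0]], A_S)
```

Working with $M_S$ rather than $A_S$ directly avoids transforming every sample explicitly, at the cost of one matrix product per cluster.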
In a second aspect, the present embodiment provides a cardiovascular and cerebrovascular disease risk prediction apparatus, including:
the set acquisition module is used for acquiring a sample set;
the sample set is determined from a plurality of labeled samples of a patient medical database set; one sample includes: the patient's number, characteristics and characteristic data; the label includes: a first label and a second label; the first label identifies a sample of a patient with cardiovascular and cerebrovascular diseases; the second label identifies a sample of a patient without cardiovascular and cerebrovascular diseases;
the sample acquisition module is used for acquiring an input sample;
the input sample is formed by combining the medical health physical examination data and the medical treatment data of a patient to be predicted;
the matrix calculation module is used for performing metric learning by using a cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;
the first local cluster determining module is used for dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
the first distance determining module is used for calculating the distances between the input sample and the samples in the sample set by using a cosine similarity algorithm according to the global metric matrix to form a first distance set;
the first sample determining module is used for determining, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the first neighboring samples of the input sample, the number of first neighboring samples being equal to the first K value;
a second local cluster determining module, configured to determine a local cluster in which the first neighboring sample is located;
a target local cluster determining module, configured to select, as a target local cluster, a local cluster in which the number of first neighboring samples exceeds a first preset threshold from among local clusters in which the first neighboring samples are located;
a local cluster dividing module, configured to divide the input sample into the target local cluster;
the second distance determination module is used for calculating, by using a cosine similarity algorithm according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;
a second sample determining module, configured to determine, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, the second neighboring samples of the input sample, the number of second neighboring samples being equal to the second K value;
the counting module is used for counting the number of the first labels and the number of the second labels of the second adjacent samples;
the label determining module is used for taking the first label as the label of the input sample if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold value, and otherwise, taking the second label as the label of the input sample;
the patient sample determination module is used for determining whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases according to the label of the input sample;
the cardiovascular and cerebrovascular disease patient determination module is used for determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;
and the non-cardiovascular and cerebrovascular disease patient determination module is used for determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular diseases if the input sample is not a sample of a patient with cardiovascular and cerebrovascular diseases.
Optionally, the cardiovascular and cerebrovascular disease risk prediction apparatus provided in this embodiment further includes:
the high-risk determination module is used for determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;
the hospitalization suggestion module is used for making a suggestion of hospitalization treatment on the patient to be predicted if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases;
the physical examination increasing suggestion module is used for making a suggestion of increasing the physical examination frequency for the patient to be predicted if the patient to be predicted is not the high-risk cardiovascular and cerebrovascular disease patient;
the health determination module is used for determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;
the normal physical examination suggestion module is used for recommending that the patient to be predicted keep the normal physical examination frequency if the patient to be predicted is a healthy user;
the missed-diagnosis patient determination module is used for marking the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and adding the characteristic data of the missed-diagnosis patient into the patient medical database set;
wherein, the missed diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
Optionally, the set obtaining module includes:
the sample deleting submodule is used for deleting the samples with the sample missing values larger than the first threshold value according to the plurality of samples of the patient medical database set with the labels;
the sample missing values are: the ratio of the number of missing features in a sample to the total number of features in the sample;
the characteristic deleting submodule is used for searching in the plurality of deleted samples, and the characteristic deletion processing is carried out on the characteristic of which the characteristic missing value is greater than the second threshold value;
the feature missing values are: the ratio of the number of the features lacking the feature data to the total number of the same features in the same features of the plurality of samples;
the first characteristic submodule is used for searching the characteristic of the missing characteristic data of the plurality of samples after the characteristic deletion processing is carried out and taking the characteristic as a first characteristic;
a missing value filling submodule, configured to perform missing value filling on the feature data missing from the first feature by using a multiple filling method;
the data classification submodule is used for classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain a classification result;
wherein the classification result comprises: discrete feature data and continuous feature data;
the data processing submodule is used for processing the discrete characteristic data and the continuous characteristic data corresponding to the data type according to the classification result;
the set updating submodule is used for adding the characteristic data which is obtained by correspondingly processing the discrete characteristic data and the continuous characteristic data into the patient medical database set to be used as a first database set;
wherein processing the discrete feature data and the continuous feature data according to the data type comprises: performing one-hot encoding on the discrete feature data; and normalizing the continuous feature data by using the Z-score method of normal-distribution standardization;
the equalization processing submodule is used for carrying out equalization processing on the samples of the first database set by using an undersampling and SMOTE algorithm to obtain a second database set;
the variance deleting submodule is used for calculating the variance of the same feature data in the second database set by using an analysis of variance method, and deleting the feature data of which the variance value is smaller than a preset variance threshold value;
the weight calculation submodule is used for calculating, by using the Relief algorithm, the weight of each feature data remaining after the feature data with a variance value smaller than the preset variance threshold is deleted;
and the set determining submodule is used for determining the sample set by using a forward selection method according to the weight of each characteristic data and the second database set.
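As an illustration of the two data-type-specific treatments named above, the following minimal Python sketch one-hot encodes a discrete feature and applies Z-score normalization to a continuous feature. The function names and example values are hypothetical and are not part of the described device; the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode each discrete value as a 0/1 vector over the sorted categories."""
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return encoded, categories


def z_score(values):
    """Shift to zero mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no spread; map every value to 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical feature columns taken from three samples.
genders = ["male", "female", "male"]
ages = [50, 60, 70]

encoded, categories = one_hot(genders)
scaled = z_score(ages)
```

After Z-score normalization the continuous feature has zero mean, so features measured on different scales (age, blood pressure, blood glucose) contribute comparably to the later distance calculations.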
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
and the processor is used for realizing the cardiovascular and cerebrovascular disease risk prediction method when executing the program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when executed on a computer, cause the computer to execute a method for predicting risk of cardiovascular and cerebrovascular diseases according to any one of the above.
In yet another aspect of the present invention, the present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute a method for predicting risk of cardiovascular and cerebrovascular diseases as described in any one of the above.
According to the cardiovascular and cerebrovascular disease risk prediction method and device provided by the embodiments of the present invention, a sample set is obtained; the samples in the sample set are divided into a preset number of local clusters; the first adjacent samples of an input sample, equal in number to a preset first K value, are calculated by using a K-nearest neighbor algorithm according to the first K value and the first distance set, so as to determine a target local cluster; the input sample is divided into the target local cluster; the distances between the input sample and the samples in the target local cluster are calculated so as to determine the second adjacent samples of the input sample, equal in number to a second K value; the number of first labels and the number of second labels of the second adjacent samples are counted, thereby determining the label of the input sample and whether the input sample is a cardiovascular and cerebrovascular disease patient sample; and it is finally determined whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient. Because the feature data of cardiovascular and cerebrovascular disease patients are highly similar, this embodiment calculates the similarity distance between samples and determines the nearest samples of the input sample, thereby determining whether the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient, and avoids the influence of differing sample feature data on the training of a prediction model. Therefore, the accuracy of predicting whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient can be improved. Of course, it is not necessary for any product or method implementing the invention to achieve all of the above advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for predicting cardiovascular and cerebrovascular disease risk according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating the step S101 in FIG. 1 according to an embodiment of the present invention;
FIG. 3 is a structural diagram of a cardiovascular disease risk prediction device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The prior art trains a prediction model using decision tree, logistic regression, and artificial neural network algorithms, and its accuracy in predicting whether a patient to be predicted is a cardiovascular and cerebrovascular disease patient is not high. To solve this problem, this embodiment relies on the observation that the feature data of cardiovascular and cerebrovascular disease patients are similar to one another. Samples similar to the sample of the patient to be predicted can therefore be found by using the similarity distance between samples, so that it is known whether the sample of the patient to be predicted is a cardiovascular and cerebrovascular disease patient sample, and it is thereby determined whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient.
As shown in fig. 1, a method for predicting risk of cardiovascular and cerebrovascular diseases according to an embodiment of the present invention includes the following steps:
S101, obtaining a sample set; the sample set is determined according to a plurality of labeled samples of the patient medical database set; one sample includes: the patient's number, features and feature data; the labels include: a first label and a second label; the first label identifies a cardiovascular and cerebrovascular disease patient sample; the second label identifies a non-cardiovascular and cerebrovascular disease patient sample;
For example, a sample includes: patient number 0001; features including: age, gender, city, occupation, family genetic history, disease history, eating habits, smoking habits, drinking habits, blood pressure, pulse, blood lipid, blood glucose, etc.; and feature data including: age: 50; gender: male; city: Wuhan; occupation: teacher; family genetic history: none; disease history: hypertension; eating habits: baking the cooked wheaten food or rice or dinner when the breakfast is not eaten; smoking habits: two cigarettes a day; drinking habits: once in at least three days; blood pressure: 100-; pulse: 60-100 times/min; serum total cholesterol: 2.9-5.17 mmol/L; serum triglycerides: 0.56-1.7 mmol/L; high density lipoprotein cholesterol: 0.94-2.0 mmol/L; low density lipoprotein cholesterol: 2.07-3.12 mmol/L; blood glucose: 7.8-9.0 mmol/L fasting, etc.
It can be understood that, in this embodiment, each of the plurality of labeled samples in the patient medical database set may be formed in advance by combining the patient's medical examination data and medical treatment data into one sample, or the medical examination data and medical treatment data may be combined into one sample each time the sample set is acquired. Because the former saves time, this embodiment combines the patient's medical examination data and medical treatment data into a sample in advance, and then sets the label according to the patient's health return visit data. The patient's health return visit data contain: the patient label, the condition confirmed for the patient, and the feature data of the confirmed condition. For example, the health return visit data include: stroke, hypertension, coronary heart disease, hyperlipidemia, hyperglycemia, hemoptysis, dizziness, cerebral infarction, heart failure and other cardiovascular and cerebrovascular diseases. As in the prior art, the condition may be confirmed by a doctor, by the patient, or by a family member, which is not described in detail here.
The patient's medical examination data and medical treatment data are the same as in the prior art: the medical examination data comprise the patient label and the patient's feature data, and the medical treatment data comprise the patient label and the patient's basic information. The medical examination data and the medical treatment data are combined according to the patient label to form a sample; whether the sample is a cardiovascular and cerebrovascular disease patient sample is then determined according to the patient label and the confirmed condition contained in the patient's health return visit data, and a label is set for the sample.
S102, obtaining an input sample; the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;
S103, performing metric learning by using a cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;
The COS-LMNN algorithm in this embodiment combines a cosine (COS) algorithm with a large margin nearest neighbor (LMNN) algorithm to calculate the global metric matrix of the sample set; the manner of combination is the same as in the prior art and is not described here again.
S104, dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
The preset number is a number set according to industry experience, and the preset clustering algorithm may be a prior-art algorithm such as k-means clustering, hierarchical clustering, SOM clustering, or FCM clustering. The preset number can be adjusted to suit the clustering algorithm used.
It can be understood that this embodiment divides the samples in the sample set into different local clusters. For example, the sample set contains 7 samples, A, B, C, D, E, F and G, and the result of the division is: local cluster 1: A and B; local cluster 2: C, F and G; local cluster 3: D and E.
S105, calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm according to the global measurement matrix to form a first distance set;
S106, calculating, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the first adjacent samples of the input sample, the number of which equals the first K value;
In this embodiment, the first K value is preset according to practical experience, and it equals the number of first adjacent samples of the input sample calculated by the K-nearest neighbor algorithm.
For example, the sample set includes samples 1, 2, 3 and 4. Assuming samples 1, 2 and 3 are cardiovascular and cerebrovascular disease patient samples, the distance between any two of samples 1, 2 and 3 may be 0. If the input sample is a cardiovascular and cerebrovascular disease patient sample and the first K value is set to 2, the first adjacent samples of the input sample are samples 1 and 2, or 2 and 3, or 1 and 3.
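The distance calculation of S105 and the neighbor selection of S106 can be sketched as follows. The text does not give the exact formula by which the metric matrix enters the cosine similarity, so the form used here, cos_M(x, y) = xᵀMy / (√(xᵀMx) · √(yᵀMy)) with distance 1 − cos_M, is an assumption, as are the function names, the identity matrix standing in for a learned positive semi-definite metric matrix, and the toy feature vectors.

```python
import math


def metric_cosine_distance(x, y, M):
    """Return 1 - cosine similarity of x and y under metric matrix M.

    Assumed form; M is expected to be positive semi-definite.
    """
    def bilinear(a, b):
        return sum(a[i] * M[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))

    denom = math.sqrt(bilinear(x, x)) * math.sqrt(bilinear(y, y))
    if denom == 0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - bilinear(x, y) / denom


def k_nearest(input_sample, samples, M, k):
    """Return the ids of the k samples with the smallest distance."""
    ranked = sorted(
        (metric_cosine_distance(input_sample, features, M), sample_id)
        for sample_id, features in samples.items()
    )
    return [sample_id for _, sample_id in ranked[:k]]


# Hypothetical data: identity matrix in place of a learned global metric.
M = [[1.0, 0.0], [0.0, 1.0]]
samples = {1: [1.0, 0.0], 2: [1.0, 0.1], 3: [0.0, 1.0], 4: [-1.0, 0.0]}
first_neighbors = k_nearest([1.0, 0.0], samples, M, k=2)
```

With the first K value set to 2, the two samples whose feature vectors point in nearly the same direction as the input sample are returned as its first adjacent samples.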
S107, determining a local cluster where the first adjacent sample is located;
S108, selecting, from the local clusters where the first adjacent samples are located, the local cluster in which the number of first adjacent samples exceeds a first preset threshold as the target local cluster;
S109, dividing the input sample into the target local cluster;
S110, calculating, by using the cosine similarity algorithm and according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;
S111, determining, by using the K-nearest neighbor algorithm according to a preset second K value and the second distance set, the second adjacent samples of the input sample in the target local cluster, the number of which equals the second K value;
In this embodiment, the second adjacent samples of the input sample are determined within the target local cluster; their number equals the second K value, so one or more second adjacent samples may be obtained.
S112, counting the number of the first labels and the number of the second labels of the second adjacent samples;
S113, if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold, taking the first label as the label of the input sample; otherwise, taking the second label as the label of the input sample;
S114, determining, according to the label of the input sample, whether the input sample is a cardiovascular and cerebrovascular disease patient sample;
S115, if the input sample is a cardiovascular and cerebrovascular disease patient sample, determining that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient;
S116, if the input sample is not a cardiovascular and cerebrovascular disease patient sample, determining that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient.
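The target-cluster selection of S107 to S108 and the label decision of S112 to S113 can be illustrated with the short Python sketch below. It is a sketch under stated assumptions: the function names are hypothetical, labels are encoded as 1 (first label) and 0 (second label), the cluster holding the most first adjacent samples is chosen when it exceeds the threshold, and the handling of the case with no second-label neighbors is an assumption because the text does not address it.

```python
from collections import Counter


def target_cluster(first_neighbors, cluster_of, min_count):
    """Return the cluster containing the most first adjacent samples,
    provided that count exceeds min_count; otherwise return None."""
    counts = Counter(cluster_of[sample_id] for sample_id in first_neighbors)
    if not counts:
        return None
    cluster, count = counts.most_common(1)[0]
    return cluster if count > min_count else None


def vote_label(second_neighbors, label_of, label_threshold):
    """Return 1 if (#first labels / #second labels) exceeds the threshold,
    else 0."""
    n_first = sum(1 for sample_id in second_neighbors
                  if label_of[sample_id] == 1)
    n_second = len(second_neighbors) - n_first
    if n_second == 0:
        # Assumption: with no second-label neighbors, any first label wins.
        return 1 if n_first > 0 else 0
    return 1 if n_first / n_second > label_threshold else 0


# Cluster assignment taken from the seven-sample example in the text.
cluster_of = {"A": 1, "B": 1, "C": 2, "D": 3, "E": 3, "F": 2, "G": 2}
chosen = target_cluster(["C", "F", "A"], cluster_of, min_count=1)

# Hypothetical labels for the samples in the chosen cluster.
label_of = {"C": 1, "F": 1, "G": 0}
label = vote_label(["C", "F", "G"], label_of, label_threshold=1.0)
```

Here two of the three first adjacent samples fall in local cluster 2, so it becomes the target local cluster; two first labels against one second label give a ratio of 2, which exceeds the threshold of 1, so the input sample receives the first label.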
In the prior art, by contrast, data mining technology is used to mine the case data features of cardiovascular and cerebrovascular diseases: the physical examination feature data and return visit data of all patients form a training set, and a prediction model is trained using decision tree, logistic regression and artificial neural network algorithms. The physical examination data of the patient to be predicted are then used as an input sample and fed to the trained prediction model, which outputs whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient.
In training the prediction model, the decision tree analyzes the physical examination data of all patients to be predicted, takes the sample data with the largest information gain as the first node, takes the other physical examination data in turn as branches in order of information gain, and stops training when the data samples belong to only one class, thereby obtaining the prediction model. When the physical examination data with the largest information gain are those of a non-cardiovascular and cerebrovascular disease patient, the prediction model trained in this way is affected by that sample data, so the accuracy of the prediction result of a model trained by a decision tree is not high.
In training the prediction model with the logistic regression algorithm, the minimum of the loss function must be solved to determine the prediction model. Because the process of minimizing the loss function is easily affected by differing sample data, the probability output by the prediction model that the patient to be predicted is a cardiovascular and cerebrovascular disease patient is unreliable, and the resulting determination of whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient is inaccurate.
This embodiment obtains a sample set; divides the samples in the sample set into a preset number of local clusters; and calculates the first adjacent samples of the input sample, equal in number to the first K value, thereby determining the target local cluster. The second adjacent samples of the input sample, equal in number to the second K value, are determined by calculating the distances between the input sample and the samples in the target local cluster; the number of first labels and the number of second labels of the second adjacent samples are counted, thereby determining the label of the input sample and whether the input sample is a cardiovascular and cerebrovascular disease patient sample; and it is finally determined whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient. This embodiment does not need to train a prediction model. Because the feature data of cardiovascular and cerebrovascular disease patients are highly similar, it uses the sample similarity distance to determine whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient, and thus avoids the influence that the sample data with the largest physical examination information gain has when a prediction model is trained with a decision tree. Nor does this embodiment train a prediction model with a logistic regression algorithm, so no loss function has to be minimized, and the inaccuracy caused when the minimization of the loss function is affected by differing sample data is avoided. Therefore, the accuracy of predicting whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient can be improved.
Optionally, in an embodiment of the method for predicting a cardiovascular and cerebrovascular disease risk of the present invention, after the step of determining that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient if the input sample is a cardiovascular and cerebrovascular disease patient sample S115, the method further includes:
Step one: determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;
If the feature data in the health return visit data exceed the normal indexes, the sign information is abnormal, and the symptoms are abnormal, the patient to be predicted can be judged to be a high-risk cardiovascular and cerebrovascular disease patient. For example, the blood pressure is too high, the hemoglobin is too high, and there are abnormal symptoms such as fainting, coughing and hemoptysis.
Step two: if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases, the patient to be predicted is recommended to be treated in hospitalization;
When the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient, the number of hospitalization days and the treatment scheme corresponding to the patient's feature data are provided to the patient according to that feature data.
In this embodiment, a cardiovascular and cerebrovascular patient treatment database is pre-established, and the cardiovascular and cerebrovascular patient treatment database includes: characteristic data of the patient, the number of hospitalization days corresponding to the characteristic data and a treatment scheme. The treatment regimen comprises: the dosage and frequency of insulin injection and the dosage and frequency of taking hypotensive drugs, physical exercise or whether surgical treatment is needed, and the like.
For example: blood pressure of high-risk cardiovascular and cerebrovascular disease patients: 120-; blood sugar: the number of hospitalization days for fasting from 7.8 to 9.0mmoL/L was 20 days, and the treatment regimen for the profile data was 1U per day of insulin injection.
By judging that the patient is a high-risk cardiovascular and cerebrovascular disease patient and recommending hospitalization for the patient to be predicted, this embodiment saves the time doctors spend on diagnostic advice and saves medical resources.
Step three: and if the patient to be predicted is not the patient with the high-risk cardiovascular and cerebrovascular diseases, advising that the physical examination frequency of the patient to be predicted is increased.
It can be understood that, in this embodiment, after the patient to be predicted is determined to be a cardiovascular and cerebrovascular disease patient, if the patient's health return visit data show that the patient's feature data fall within the feature data range of a certain number of cardiovascular and cerebrovascular disease patients, and that the patient's sign information and symptoms are not abnormal, the patient is not a high-risk cardiovascular and cerebrovascular disease patient. The patient may then be advised of the physical examination frequency corresponding to that feature data range.
For example: the blood pressure range of 100 patients with cardiovascular and cerebrovascular diseases is as follows: 100-: 102-140mmHg; the patient does not have life-threatening symptoms such as fainting, coughing and hemoptysis, so the patient is not a high-risk cardiovascular and cerebrovascular disease patient. Suppose the patient previously had a physical examination once a month and the blood pressure in the patient's feature data ranges from 102 to 140 mmHg. If the physical examination frequency corresponding to this feature data is twice a month, the patient is recommended to have a physical examination twice a month. This embodiment saves the time doctors spend on diagnostic advice and saves medical resources.
Optionally, in an embodiment of the method for predicting a cardiovascular and cerebrovascular disease risk of the present invention, after the step S116 of determining that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient, the method further includes:
Step one: determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;
wherein a healthy user is a patient to be predicted all of whose feature data are within the standard ranges specified for each item of medical feature data. For example: the medically specified normal blood pressure is 80-90/120-140 mmHg; if the health return visit data show that the blood pressure of the patient to be predicted is 82/125 mmHg and the patient's other feature data are all within the standard ranges specified for each item of medical feature data, the patient is a healthy user.
Step two: if the patient to be predicted is a healthy user, recommending that the patient keep the normal physical examination frequency;
It can be understood that, if the patient to be predicted is a healthy user, the physical examination frequency of a healthy user corresponds to the patient's feature data, and the patient is advised to keep the same physical examination frequency as before. For example: if the patient previously had a physical examination once a month, a frequency of once a month is recommended. This embodiment identifies healthy users and gives them appropriate suggestions, which saves the time doctors spend on diagnostic advice and reduces the patient's medical expenditure.
Step three: if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed patient, and adding the characteristic data of the missed patient into a patient medical database set;
wherein, the missed diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
It is understood that if the patient to be predicted is not a healthy user, it may be determined from the health revisit data of the patient that the patient to be predicted is a cardiovascular and cerebrovascular disease patient or not a cardiovascular and cerebrovascular disease patient. If the patient to be predicted is the patient with the cardiovascular and cerebrovascular diseases, the patient is marked as a patient with missed diagnosis, and the characteristic data of the patient is added into a patient medical database set, so that the wrong prediction is prevented when the same type of patient to be predicted is predicted whether to be the patient with the cardiovascular and cerebrovascular diseases, and the accuracy of predicting whether to be predicted is the patient with the cardiovascular and cerebrovascular diseases is improved.
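The follow-up of a patient predicted not to have a cardiovascular or cerebrovascular disease can be sketched as below. The function and field names are hypothetical; the standard ranges reuse the 80-90/120-140 mmHg example from the text, read here as diastolic/systolic, which is an assumption; and whether the return visit data confirm a cardiovascular or cerebrovascular disease is passed in as a plain boolean.

```python
def is_healthy(feature_data, standard_ranges):
    """True if every listed feature lies inside its standard range."""
    return all(
        low <= feature_data[name] <= high
        for name, (low, high) in standard_ranges.items()
    )


def follow_up(patient, standard_ranges, confirmed_cvd, database):
    """Return advice for a patient predicted negative, recording missed cases."""
    if is_healthy(patient["features"], standard_ranges):
        return "keep normal physical examination frequency"
    if confirmed_cvd:
        # Missed diagnosis: add the sample with the first label (1) so that
        # similar patients are less likely to be mispredicted later.
        database.append({**patient, "label": 1})
        return "missed diagnosis"
    return "not healthy, no cardiovascular or cerebrovascular disease confirmed"


# Standard ranges from the blood pressure example (interpretation assumed).
STANDARD_RANGES = {"diastolic": (80, 90), "systolic": (120, 140)}
database = []

healthy = {"number": "0001", "features": {"diastolic": 82, "systolic": 125}}
missed = {"number": "0002", "features": {"diastolic": 95, "systolic": 160}}

result_healthy = follow_up(healthy, STANDARD_RANGES, False, database)
result_missed = follow_up(missed, STANDARD_RANGES, True, database)
```

Only the missed-diagnosis patient is written back to the database set, which matches the stated purpose of improving later predictions for the same type of patient.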
Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, the identifying, by the first tag, a sample of a patient with a cardiovascular and cerebrovascular disease includes:
Step one: determining the identification information of the cardiovascular and cerebrovascular disease patients according to the collected health return visit data of the patients;
The health return visit data of a patient include: the patient's number, features, feature data and confirmed condition; the identification information includes: the confirmed condition, and the features and feature data corresponding to the confirmed condition;
Step two: determining, in the medical database set, the samples of the cardiovascular and cerebrovascular disease patients according to the identification information of the cardiovascular and cerebrovascular disease patients;
Step three: setting a first label for the samples of the cardiovascular and cerebrovascular disease patients.
In this embodiment, the samples in the medical database set are distinguished by means of the patients' health return visit data, the samples of the cardiovascular and cerebrovascular disease patients are identified in the set, and the first label is set for them, which saves time when the label of the input sample is determined.
Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, the second label identifies a sample of a patient with a non-cardiovascular and cerebrovascular disease, and includes:
setting a second label for the samples other than the samples of the cardiovascular and cerebrovascular disease patients.
This embodiment may use the health return visit data collected one month after the user's physical examination, and sets labels for the cardiovascular and cerebrovascular disease samples and the non-cardiovascular and cerebrovascular disease samples according to the health return visit data. A label may be a letter, a number, a symbol, or the like. For example: samples whose health return visit data include a description of a cardiovascular or cerebrovascular disease such as stroke, hypertension, coronary heart disease, hyperlipidemia, hyperglycemia, cerebral infarction or heart failure are taken as positive samples; a category label field is added and the label "1" is set. All samples that are not positive samples are taken as negative samples and given the label "0", and the samples are added to the medical database set.
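The labeling rule described above can be sketched in a few lines of Python. The condition list is the one given in the example; the function name, the dictionary layout of a sample, and the mapping from patient number to confirmed conditions are hypothetical.

```python
# Conditions named in the example as cardiovascular and cerebrovascular diseases.
CVD_CONDITIONS = {
    "stroke", "hypertension", "coronary heart disease", "hyperlipidemia",
    "hyperglycemia", "cerebral infarction", "heart failure",
}


def set_labels(samples, revisit_conditions):
    """Add a 'label' field: 1 if the return visit data list a CVD condition,
    otherwise 0."""
    labeled = []
    for sample in samples:
        conditions = set(revisit_conditions.get(sample["number"], []))
        is_positive = bool(conditions & CVD_CONDITIONS)
        labeled.append({**sample, "label": 1 if is_positive else 0})
    return labeled


# Hypothetical samples and return visit records.
samples = [{"number": "0001", "age": 50}, {"number": "0002", "age": 35}]
revisit_conditions = {"0001": ["hypertension"], "0002": []}

labeled = set_labels(samples, revisit_conditions)
```

A patient with no return visit record is treated as a negative sample here, in line with the rule that every sample that is not positive receives the label "0".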
In this embodiment, the samples in the medical database set are distinguished by means of the patients' health return visit data, the samples of the non-cardiovascular and cerebrovascular disease patients are identified in the set, and the second label is set for them, which saves time when the label of the input sample is determined.
Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, in the present invention, S101 obtains a sample set, including:
S201, according to a plurality of samples of the patient medical database set with the labels, deleting the samples with the sample missing values larger than a first threshold value;
the sample missing values are: the ratio of the number of missing features in a sample to the total number of features in the sample;
The first threshold is a value manually specified according to industry experience. A sample missing value is illustrated as follows: if a sample includes 10 features and 7 of them are missing, the ratio of the number of missing features in the sample to the total number of features in the sample is 7/10. If this ratio is greater than the prescribed first threshold, the sample is subjected to sample deletion processing.
In this embodiment, the purpose of performing the sample deleting process on the sample with the missing value greater than the first threshold is: the method reduces samples with less characteristic data in the patient medical database, improves the quality of the samples in the patient medical database, and saves time for subsequent processing.
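The sample deletion of S201 can be sketched as follows. Missing feature data are represented by None, and the threshold value of 0.5 is purely illustrative, since the text leaves the first threshold to industry experience; the function names are hypothetical.

```python
def sample_missing_value(sample):
    """Ratio of missing features (None) to the total number of features."""
    missing = sum(1 for value in sample.values() if value is None)
    return missing / len(sample)


def delete_sparse_samples(samples, first_threshold):
    """Keep only samples whose missing value does not exceed the threshold."""
    return [s for s in samples if sample_missing_value(s) <= first_threshold]


# A sample with 10 features of which 7 are missing, as in the example.
sparse = {f"feature_{i}": None for i in range(7)}
sparse.update({"age": 50, "gender": "male", "pulse": 72})

# A complete sample with the same 10 features.
complete = {name: 1 for name in sparse}

ratio = sample_missing_value(sparse)
kept = delete_sparse_samples([sparse, complete], first_threshold=0.5)
```

The sparse sample has a sample missing value of 7/10, which exceeds the illustrative threshold, so only the complete sample is kept.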
S202, searching in the deleted multiple samples, and performing feature deletion on the features of which the feature missing values are greater than a second threshold value;
The feature missing value is: for one and the same feature across the plurality of samples, the ratio of the number of samples lacking feature data for that feature to the total number of that feature;
wherein the second threshold is a value manually specified according to industry experience. A feature missing value is illustrated as follows: take the same feature, pulse, in 10 samples. If 7 of the pulse features lack feature data and the total number of pulse features is 10, the feature missing value is 7/10. If this ratio is greater than the prescribed second threshold, the pulse feature is subjected to feature deletion processing.
In this embodiment, the purpose of performing feature deletion processing on the feature missing value greater than the second threshold is as follows: fewer characteristic data in the samples of the patient medical data bank are reduced, the quality of the characteristic data in the samples of the patient medical data bank is improved, and time is saved for subsequent processing.
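Steps S201 and S202 can be illustrated with a minimal sketch. The data layout (one dictionary per sample, `None` for a missing value), the function names and the threshold of 0.5 are assumptions made for illustration; the text leaves both thresholds to industry experience.

```python
from typing import Dict, List, Optional

Sample = Dict[str, Optional[float]]  # None marks a missing feature value


def sample_missing_ratio(sample: Sample) -> float:
    """Share of features in one sample that have no value."""
    return sum(v is None for v in sample.values()) / len(sample)


def drop_sparse_samples(samples: List[Sample], first_threshold: float) -> List[Sample]:
    """S201: keep samples whose missing ratio does not exceed the threshold."""
    return [s for s in samples if sample_missing_ratio(s) <= first_threshold]


def drop_sparse_features(samples: List[Sample], second_threshold: float) -> List[Sample]:
    """S202: remove features whose missing ratio across samples exceeds the threshold."""
    if not samples:
        return []
    keep = [
        f for f in samples[0]
        if sum(s[f] is None for s in samples) / len(samples) <= second_threshold
    ]
    return [{f: s[f] for f in keep} for s in samples]


# Illustrative data; the 0.5 thresholds are hypothetical.
samples = [
    {"pulse": None, "age": 50.0, "glucose": 8.1},
    {"pulse": None, "age": None, "glucose": None},  # every feature missing
    {"pulse": None, "age": 61.0, "glucose": 7.9},
    {"pulse": 72.0, "age": 47.0, "glucose": 8.4},
]
kept = drop_sparse_samples(samples, first_threshold=0.5)
cleaned = drop_sparse_features(kept, second_threshold=0.5)
```

Here the second sample is removed in S201, and the pulse feature, missing in 2 of the 3 remaining samples, is removed in S202.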
S203, searching the plurality of samples subjected to the feature deletion processing for features that lack feature data, and taking them as first features;
S204, filling the missing feature data of the first features by using a multiple filling (multiple imputation) method;
wherein the missing values are filled by a module built with the multiple filling method in IBM SPSS Statistics 23. For example, suppose the blood pressure feature data are missing in 2 samples. The module fills the missing values according to the patient's other feature data: age: 50; blood lipids: (1) serum total cholesterol 2.9-5.17 mmol/L; (2) serum triglyceride 0.56-1.7 mmol/L; (3) high-density lipoprotein cholesterol 0.94-2.0 mmol/L; (4) low-density lipoprotein cholesterol 2.07-3.12 mmol/L; blood sugar: fasting 7.8-9.0 mmol/L. The filled value of the blood pressure feature data in the 2 samples is 100-145 mmHg. The specific filling procedure is the same as in the prior art and is not detailed here.
This embodiment fills the missing feature data of the plurality of samples, which improves the quality of the samples and therefore the quality of the resulting sample set.
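The idea behind multiple filling can be sketched as follows. This is a deliberately simplified stand-in and not the SPSS module: it creates `m` completed copies of a single column by drawing each missing value from the observed values and then averages the copies, whereas the module described above also conditions on the patient's other features. The function name, `m` and the seed are assumptions.

```python
import random
import statistics
from typing import List, Optional


def multiple_impute(values: List[Optional[float]], m: int = 5, seed: int = 0) -> List[float]:
    """Fill None entries by averaging m random draws from the observed values.

    Each of the m rounds yields one completed copy of the column; the final
    fill for a position is the mean of that position across all copies.
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to impute from")
    completed = [
        [v if v is not None else rng.choice(observed) for v in values]
        for _ in range(m)
    ]
    return [statistics.mean(column) for column in zip(*completed)]


systolic = [120.0, None, 135.0, 110.0, None, 142.0]  # illustrative values in mmHg
filled = multiple_impute(systolic)
```

Observed entries are returned unchanged, and every filled entry lies within the range of the observed values.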
S205, classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain classification results;
wherein, the classification result includes: discrete feature data and continuous feature data;
S206, processing the discrete feature data and the continuous feature data according to the classification result, each in the manner corresponding to its data type;
S207, adding the feature data obtained by processing the discrete feature data and the continuous feature data to the patient medical database set to form a first database set;
wherein processing the discrete feature data and the continuous feature data in the manner corresponding to the data type comprises: performing one-hot encoding on the discrete feature data; and standardizing the continuous feature data by using the Z-score method of normal standardization;
the data types of the patient's feature data include: discrete feature data and continuous feature data. For example, blood pressure data and heartbeat data are of the continuous type, and age data are of the discrete type.
For example, for discrete feature data, one-hot encoding code applicable to the feature data is written. Taking the encoding of "age" as an example: first, according to the number of samples, the age feature is segmented into 7 intervals, "76 and above", "66-75", "55-65", "46-55", "36-45", "26-35" and "below 25"; if a person is 30 years old, the age value after one-hot encoding is 0000010. Other discrete features similar to the age feature, such as gender, city, occupation, family genetic history, disease history, eating habits, smoking habits, drinking habits and weekly exercise habits, are likewise converted by one-hot encoding.
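The age example can be reproduced with a short sketch. The interval labels follow the text; the numeric lower bounds are an assumption, since the listed intervals overlap at age 55 and leave age 25 unassigned (here 55 falls into "46-55" and 25 into "below 25").

```python
from typing import List, Tuple

# Interval labels in the order listed in the text; the lower bounds are assumed.
AGE_BINS: List[Tuple[str, int]] = [
    ("76 and above", 76),
    ("66-75", 66),
    ("55-65", 56),
    ("46-55", 46),
    ("36-45", 36),
    ("26-35", 26),
    ("below 25", 0),
]


def one_hot_age(age: int) -> str:
    """Return a 7-character one-hot string marking the interval containing age."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for index, (_, lower_bound) in enumerate(AGE_BINS):
        if age >= lower_bound:
            return "".join("1" if i == index else "0" for i in range(len(AGE_BINS)))
    raise AssertionError("unreachable: the last bin has lower bound 0")


print(one_hot_age(30))  # 0000010
```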
It is understood that the standardization of the continuous feature data by the Z-score method of normal standardization in this embodiment is the same as in the prior art and is not described here. Processing the discrete feature data and the continuous feature data separately, each according to its data type, avoids the uniform results that would follow from processing all data by one and the same method, and can therefore improve the accuracy of data processing.
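For reference, Z-score standardization maps each value v of a continuous feature to (v - mean) / standard deviation. A minimal sketch follows; the use of the population standard deviation and the handling of a constant feature are assumptions.

```python
import statistics
from typing import List


def z_score(values: List[float]) -> List[float]:
    """Standardize values to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant feature: no spread to scale by
    return [(v - mean) / std for v in values]


heart_rates = [60.0, 70.0, 80.0]  # illustrative values in beats/min
standardized = z_score(heart_rates)
```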
S208, performing imbalance processing on the samples of the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;
in this embodiment, the imbalance processing is performed on the samples of the first database set so that the samples are distributed more evenly across the classes, yielding an accurate second database set.
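The combination of undersampling and SMOTE in S208 can be sketched as follows: the majority class is randomly reduced, and the minority class is enlarged with synthetic samples interpolated between a minority sample and one of its nearest minority neighbours. The target size of 10, `k = 2` and the toy data are assumptions.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def undersample(majority: Sequence[Vector], target: int, rng: random.Random) -> List[Vector]:
    """Randomly keep `target` majority-class samples (without replacement)."""
    return rng.sample(list(majority), min(target, len(majority)))


def smote(minority: Sequence[Vector], target: int, k: int, rng: random.Random) -> List[Vector]:
    """Grow the minority class to `target` samples by interpolating toward neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    result = [list(v) for v in minority]
    while len(result) < target:
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding base itself
        neighbours = sorted(
            (v for v in minority if v is not base),
            key=lambda v: math.dist(base, v),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        result.append([b + gap * (o - b) for b, o in zip(base, other)])
    return result


rng = random.Random(42)
healthy = [[float(i), float(i % 3)] for i in range(20)]  # majority class
diseased = [[1.0, 1.0], [2.0, 1.5], [3.0, 1.0]]          # minority class
balanced_healthy = undersample(healthy, target=10, rng=rng)
balanced_diseased = smote(diseased, target=10, k=2, rng=rng)
```

After both steps each class holds 10 samples, and every synthetic minority sample lies on a segment between two original minority samples.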
S209, calculating the variance of the same characteristic data in the second database set by using an analysis of variance method, deleting the characteristic data of which the variance value is smaller than a preset variance threshold value, and obtaining a third database set;
in this embodiment, deleting the feature data whose variance is smaller than the preset variance threshold removes features that differ little between samples; the second database set after this deletion is used as the third database set. It can be understood that the larger the variance, the more the samples differ on that feature, and the higher the accuracy of distinguishing cardiovascular and cerebrovascular disease samples from non-cardiovascular and cerebrovascular disease samples.
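As described, S209 amounts to a per-feature variance filter. A minimal sketch, assuming a column-wise layout, the population variance and a hypothetical threshold of 0.01:

```python
import statistics
from typing import Dict, List


def drop_low_variance(columns: Dict[str, List[float]], variance_threshold: float) -> Dict[str, List[float]]:
    """Keep only features whose population variance reaches the threshold."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) >= variance_threshold
    }


second_set = {
    "pulse": [60.0, 75.0, 90.0, 100.0],
    "ward_code": [1.0, 1.0, 1.0, 1.0],  # constant, so it cannot separate the classes
}
third_set = drop_low_variance(second_set, variance_threshold=0.01)
```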
S210, calculating, by using the Relief algorithm, the weight of each feature data in the third database set, that is, of the feature data remaining after the feature data whose variance is smaller than the preset variance threshold have been deleted;
S211, according to the weights of the feature data and the score values corresponding to those weights, deleting the feature data whose score value is smaller than a preset score threshold, together with the corresponding features, to obtain a fourth database set;
in this embodiment, a weight database is established in advance, and the weight database includes: the weights of feature data and the score values corresponding to those weights. According to the weight of each feature data, the corresponding score value is looked up in the weight database, and each feature data is scored accordingly. The feature data in the third database set whose score value does not exceed the score threshold, together with the corresponding features, are deleted, wherein the score threshold is a value set according to industry experience.
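The weight calculation of S210 can be sketched with the basic two-class Relief update: a feature gains weight when it differs on the nearest sample of the other class and loses weight when it differs on the nearest sample of the same class. Several choices here are assumptions: features are taken to be scaled to [0, 1], every sample serves once as the reference instance instead of being drawn at random, and the weight itself is used as the score because the contents of the weight database are not given.

```python
from typing import List, Sequence


def relief_weights(samples: Sequence[Sequence[float]], labels: Sequence[int]) -> List[float]:
    """Relief feature weights for a two-class data set with features in [0, 1]."""
    n_features = len(samples[0])
    weights = [0.0] * n_features

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        return sum(abs(x - y) for x, y in zip(a, b))

    for i, (ref, label) in enumerate(zip(samples, labels)):
        others = [(s, l) for j, (s, l) in enumerate(zip(samples, labels)) if j != i]
        hits = [s for s, l in others if l == label]
        misses = [s for s, l in others if l != label]
        if not hits or not misses:
            continue
        near_hit = min(hits, key=lambda s: distance(ref, s))
        near_miss = min(misses, key=lambda s: distance(ref, s))
        for f in range(n_features):
            # Reward a difference on the nearest miss, penalize one on the nearest hit.
            weights[f] += (abs(ref[f] - near_miss[f]) - abs(ref[f] - near_hit[f])) / len(samples)
    return weights


# Feature 0 separates the classes; feature 1 is noise.
X = [[0.1, 0.5], [0.2, 0.1], [0.15, 0.9], [0.9, 0.4], [0.8, 0.8], [0.85, 0.2]]
y = [0, 0, 0, 1, 1, 1]
weights = relief_weights(X, y)

score_threshold = 0.0  # hypothetical; the text maps weights to scores via a lookup
kept_features = [f for f, w in enumerate(weights) if w >= score_threshold]
```

On this toy data the separating feature receives a positive weight and the noise feature a negative one, so only feature 0 is kept.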
And S212, determining a sample set by using a forward selection method according to the fourth database set.
It is to be understood that, in the process of determining the sample set by using the forward selection method, the evaluation function may be used to evaluate the feature data corresponding to each feature in the fourth database set, and determine the evaluation function value of the feature data corresponding to each feature.
In some examples, for the feature data corresponding to the same feature, the samples whose feature data have the same evaluation function value may be taken as a model sample set, where the model sample set includes a plurality of samples that share at least one feature and the feature data corresponding to that feature. Each model sample set is then evaluated, and the model sample set with the highest evaluation function value is finally selected as the sample set. Evaluating a model sample set may include: calculating the average of the evaluation function values of all feature data in the model sample set, or taking the average over the feature data related to cardiovascular and cerebrovascular diseases in the model sample set. A prior-art set evaluation method may also be used, which is not described here.
The following example illustrates this. Suppose there are 3 feature data: blood pressure: 100-145 mmHg; pulse: 60-100 beats/min; blood sugar: fasting 7.8-9.0 mmol/L. The evaluation function values of the three feature data are 64, 78 and 12, respectively. The samples with pulse 60-100 beats/min form model sample set 1; the samples with blood pressure 100-145 mmHg form model sample set 2; the samples with fasting blood sugar 7.8-9.0 mmol/L form model sample set 3. The average of the feature data evaluation function values is 50 points in model sample set 1, 45 points in model sample set 2 and 65 points in model sample set 3, so model sample set 3 is used as the sample set.
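For comparison, forward selection in its generic form can be sketched as a greedy loop: starting from an empty set, the feature that most improves an evaluation function is added until no feature improves it further. This is the textbook procedure, not necessarily the exact variant of this embodiment; the evaluation function below, which sums the per-feature values 64, 78 and 12 and subtracts a penalty of 20 per feature, is purely hypothetical.

```python
from typing import Callable, FrozenSet, List, Sequence


def forward_select(
    features: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], float],
) -> List[str]:
    """Greedy sequential forward selection.

    Starting from the empty set, repeatedly add the feature that raises the
    evaluation function the most; stop when no remaining feature improves it.
    """
    selected: List[str] = []
    best_score = float("-inf")
    remaining = list(features)
    while remaining:
        scored = [(evaluate(frozenset(selected + [f])), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected


# Hypothetical evaluation function built from the example values.
SCORES = {"blood pressure": 64.0, "pulse": 78.0, "blood sugar": 12.0}


def toy_evaluate(subset: FrozenSet[str]) -> float:
    return sum(SCORES[f] for f in subset) - 20.0 * len(subset)


chosen = forward_select(list(SCORES), toy_evaluate)
```

With this evaluation function, pulse is added first and blood pressure second; adding blood sugar would lower the value, so the loop stops.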
In this embodiment, preprocessing the samples of the patient medical database set and the feature data in those samples improves the quality of the samples and hence the quality of the sample set.
Optionally, in step S105, calculating, according to the global metric matrix, the distances between the input sample and the samples in the sample set by using a cosine similarity algorithm to form a first distance set includes:
calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula according to the global measurement matrix to form a first distance set;
wherein the cosine similarity algorithm formula is as follows:
the first distance set D1 includes: $\{D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), \dots, D(x_i, x_j)\}$;
Where $i$ represents the index of the input sample, and $x_i$ represents the $i$-th input sample; the sample set is $X$; the global metric matrix is $A$; $M = A^{T}A$; $j$ represents the sample index in the sample set; $x_j$ represents the $j$-th sample in the sample set; $i$ and $j$ are positive integers; $D(x_i, x_j)$ represents the distance between the input sample $x_i$ and the $j$-th sample in $X$ under the global metric matrix; $A(x_i, x_j)$ represents the distance between $x_i$ and $x_j$ after transformation by the matrix $A$.
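The formula itself does not appear in this text. One reading consistent with the definitions above, and assumed here for illustration only, is a cosine distance taken after both samples are transformed by $A$: $D(x_i, x_j) = 1 - \frac{x_i^{T} M x_j}{\sqrt{x_i^{T} M x_i}\,\sqrt{x_j^{T} M x_j}}$ with $M = A^{T}A$. The sketch below computes the first distance set under that assumption; the function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def transform(A: Matrix, x: Sequence[float]) -> List[float]:
    """Apply the metric matrix A to a sample vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def cosine_distance(A: Matrix, xi: Sequence[float], xj: Sequence[float]) -> float:
    """1 - cos(A xi, A xj): 0 for parallel vectors, 1 for orthogonal ones."""
    u, v = transform(A, xi), transform(A, xj)
    norm = math.hypot(*u) * math.hypot(*v)
    if norm == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - sum(a * b for a, b in zip(u, v)) / norm


def first_distance_set(A: Matrix, x_input: Sequence[float], sample_set: Sequence[Sequence[float]]) -> List[float]:
    """D1: distances from the input sample to every sample in the sample set."""
    return [cosine_distance(A, x_input, xj) for xj in sample_set]


identity = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a learned global metric matrix
d1 = first_distance_set(identity, [1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
```

The second distance set of step S110 would follow the same pattern, with the local metric matrix of the target local cluster in place of `A` and the samples of that cluster in place of the sample set.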
Optionally, in step S110, calculating, by using a cosine similarity algorithm and according to the local metric matrix of the target local cluster obtained through COS-LMNN algorithm learning, the distances between the input sample and the samples in the target local cluster to form a second distance set includes:
calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula;
the cosine similarity algorithm formula is as follows:
the second distance set D2 includes: $\{D(x_i, x_{s1}), D(x_i, x_{s2}), \dots, D(x_i, x_{si})\}$;
Where $i$ represents the index of the input sample, and $x_i$ represents the $i$-th input sample; $x_{si}$ represents a sample of the same category as $i$; the local metric matrix is $A_S$; $M_S = A_S^{T}A_S$; $i$ is a positive integer; $D(x_i, x_{si})$ represents the distance, under the local metric matrix, between the input sample $x_i$ and a sample of the same category as $i$ in the target local cluster; $A_S(x_i, x_{si})$ represents the distance between $x_i$ and $x_{si}$ after transformation by the matrix $A_S$.
As shown in fig. 3, a cardiovascular and cerebrovascular disease risk prediction apparatus provided in an embodiment of the present invention includes:
a set obtaining module 301, configured to obtain a sample set;
the sample set is determined according to a plurality of labeled samples of the patient medical database set; one sample includes: a patient number, features and feature data; the labels include: a first label and a second label; the first label identifies a cardiovascular and cerebrovascular disease patient sample; the second label identifies a non-cardiovascular and cerebrovascular disease patient sample;
a sample obtaining module 302, configured to obtain an input sample;
the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;
the matrix calculation module 303 is configured to perform metric learning by using a cosine-large-interval nearest neighbor COS-LMNN algorithm to obtain a global metric matrix of the sample set;
a first local cluster determining module 304, configured to divide the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
a first distance determining module 305, configured to calculate, according to the global metric matrix, distances between the input samples and the samples in the sample set by using a cosine similarity algorithm, so as to form a first distance set;
a first sample determining module 306, configured to obtain, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, first neighboring samples of the input sample, the number of which equals the first K value;
a second local cluster determining module 307, configured to determine a local cluster where the first neighboring sample is located;
a target local cluster determining module 308, configured to select, as a target local cluster, a local cluster in which the number of first neighboring samples exceeds a first preset threshold from among the local clusters in which the first neighboring samples are located;
a local cluster dividing module 309, configured to divide an input sample into the target local cluster;
the second distance determining module 310 is configured to calculate, by using a cosine similarity algorithm and according to the local metric matrix of the target local cluster obtained through COS-LMNN algorithm learning, the distances between the input sample and the samples in the target local cluster, so as to form a second distance set;
a second sample determining module 311, configured to determine, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, second neighboring samples of the input sample, the number of which equals the second K value;
a counting module 312, configured to count the number of the first labels and the number of the second labels of the second neighboring samples;
the label determining module 313 is configured to use the first label as a label of the input sample if a ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold, and otherwise use the second label as a label of the input sample;
a patient sample determination module 314, configured to determine whether the input sample is a sample of a patient with a cardiovascular disease according to the label of the input sample;
a cardiovascular and cerebrovascular disease patient determination module 315, configured to determine that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient if the input sample is a cardiovascular and cerebrovascular disease patient sample;
a non-cardiovascular and cerebrovascular disease patient determination module 316, configured to determine that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient if the input sample is not a cardiovascular and cerebrovascular disease patient sample.
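The flow through modules 305 to 313 can be summarized in a minimal two-stage nearest-neighbor sketch. Several points are assumptions rather than details from the text: the distance is taken as one minus the cosine similarity after transformation by the metric matrix, the metric matrices and cluster assignments are supplied rather than learned, the most frequent cluster among the first neighbors is taken as the target local cluster, and all names and parameter values are illustrative.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def cosine_distance(A: Matrix, xi: Vector, xj: Vector) -> float:
    """1 - cosine similarity of xi and xj after transforming both by A."""
    u = [sum(a * v for a, v in zip(row, xi)) for row in A]
    w = [sum(a * v for a, v in zip(row, xj)) for row in A]
    return 1.0 - sum(a * b for a, b in zip(u, w)) / (math.hypot(*u) * math.hypot(*w))


def nearest(A: Matrix, x: Vector, samples: Sequence[Vector], candidates: Sequence[int], k: int) -> List[int]:
    """Indices of the k candidates closest to x under metric matrix A."""
    return sorted(candidates, key=lambda j: cosine_distance(A, x, samples[j]))[:k]


def predict(
    x: Vector,
    samples: Sequence[Vector],
    labels: Sequence[int],       # 1 = first label (patient), 0 = second label
    clusters: Sequence[int],     # local-cluster id of each sample
    global_A: Matrix,
    local_A: Dict[int, Matrix],  # one local metric matrix per cluster
    k1: int,
    k2: int,
    cluster_threshold: int,
    label_threshold: float,
) -> int:
    # Stage 1: first neighbours under the global metric, then pick the target cluster.
    first = nearest(global_A, x, samples, range(len(samples)), k1)
    cluster_id, count = Counter(clusters[j] for j in first).most_common(1)[0]
    if count <= cluster_threshold:
        raise ValueError("no local cluster exceeds the first preset threshold")
    # Stage 2: second neighbours inside the target cluster under its local metric.
    members = [j for j in range(len(samples)) if clusters[j] == cluster_id]
    second = nearest(local_A[cluster_id], x, samples, members, k2)
    n_first = sum(labels[j] == 1 for j in second)
    n_second = len(second) - n_first
    # Ratio test written as a product so that n_second == 0 needs no special case.
    return 1 if n_first > label_threshold * n_second else 0


I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity matrices stand in for learned metrics
train = [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1], [0.1, 1.0], [0.2, 1.0], [0.1, 0.9]]
train_labels = [1, 1, 0, 0, 0, 0]
train_clusters = [0, 0, 0, 1, 1, 1]
result = predict([1.0, 0.15], train, train_labels, train_clusters, I2, {0: I2, 1: I2},
                 k1=3, k2=3, cluster_threshold=1, label_threshold=1.0)
```

In this toy run the three first neighbors all fall in cluster 0, two of the three second neighbors carry the first label, and the ratio 2:1 exceeds the label threshold of 1.0, so the input sample receives the first label.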
Optionally, the device for predicting risk of cardiovascular and cerebrovascular diseases provided in the embodiment of the present invention further includes:
the high-risk determining module is used for determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;
the hospitalization suggestion module is used for making a suggestion of hospitalization on the patient to be predicted if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases;
the physical examination increasing suggestion module is used for making a suggestion of increasing the physical examination frequency for the patient to be predicted if the patient to be predicted is not the high-risk cardiovascular and cerebrovascular disease patient;
the health determination module is used for determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;
the normal physical examination suggestion module is used for making a suggestion of keeping the normal physical examination frequency for the patient to be predicted if the patient to be predicted is a healthy user;
the missed-diagnosis patient determination module is used for marking the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and adding the characteristic data of the missed-diagnosis patient into the patient medical database set;
wherein, the missed diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
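The decision logic of these optional modules can be sketched as a single function. The sketch assumes that the high-risk check applies to inputs labeled as cardiovascular and cerebrovascular disease patients and the health check to inputs labeled otherwise, which this passage does not state explicitly; the function name and the action strings are illustrative.

```python
from typing import List


def follow_up(predicted_patient: bool, high_risk: bool, healthy: bool) -> List[str]:
    """Map a prediction plus health return-visit findings to follow-up actions.

    Assumption: the high-risk check applies to inputs labelled as patients and
    the health check to inputs labelled as non-patients.
    """
    if predicted_patient:
        if high_risk:
            return ["recommend hospitalization"]
        return ["increase physical examination frequency"]
    if healthy:
        return ["keep normal physical examination frequency"]
    # Predicted non-patient who turns out not to be healthy: a missed diagnosis.
    return ["mark as missed diagnosis", "add feature data to patient medical database set"]
```

Feeding the feature data of missed-diagnosis patients back into the patient medical database set lets later sample sets include cases that the predictor initially got wrong.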
Optionally, the set obtaining module 301 includes:
the sample deleting submodule is used for deleting the samples with the sample missing values larger than the first threshold value according to the plurality of samples of the patient medical database set with the labels;
the sample missing values are: the ratio of the number of missing features in a sample to the total number of features in the sample;
the feature deleting submodule is used for searching the plurality of samples remaining after the sample deletion and performing feature deletion processing on the features whose feature missing value is greater than the second threshold;
the feature missing values are: the ratio of the number of the features lacking the feature data to the total number of the same features in the same features of the plurality of samples;
the first feature submodule is used for searching the plurality of samples subjected to the feature deletion processing for features that lack feature data and taking them as first features;
a missing value filling submodule, configured to perform missing value filling on the feature data missing from the first feature by using a multiple filling method;
the data classification submodule is used for classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain a classification result;
wherein the classification result comprises: discrete feature data and continuous feature data;
the data processing submodule is used for processing the discrete feature data and the continuous feature data corresponding to the data type according to the classification result;
the set updating submodule is used for adding the characteristic data which is obtained by correspondingly processing the discrete characteristic data and the continuous characteristic data into the patient medical database set to be used as a first database set;
wherein processing the discrete feature data and the continuous feature data in the manner corresponding to the data type comprises: performing one-hot encoding on the discrete feature data; and standardizing the continuous feature data by using the Z-score method of normal standardization;
the equalization processing submodule is used for carrying out equalization processing on the samples of the first database set by using an undersampling and SMOTE algorithm to obtain a second database set;
the variance deleting submodule is used for calculating the variance of the same characteristic data in the second database set by using an analysis of variance method, deleting the characteristic data of which the variance value is smaller than a preset variance threshold value, and obtaining a third database set;
the weight calculation submodule is used for calculating the weight of each feature data in the third database set by using a relief algorithm;
the score deletion submodule is used for deleting the feature data with the score value smaller than a preset score threshold value and the corresponding features in the third database set according to the weight of the feature data and the score value corresponding to the weight of the feature data to obtain a fourth database set;
and the set determining submodule is used for determining a sample set by using a forward selection method according to the fourth database set.
Optionally, the first distance determining module is specifically configured to: calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula according to the global measurement matrix to form a first distance set;
the cosine similarity algorithm formula is as follows:
the first distance set D1 includes: $\{D(x_i, x_1), D(x_i, x_2), D(x_i, x_3), \dots, D(x_i, x_j)\}$;
Where $i$ represents the index of the input sample, and $x_i$ represents the $i$-th input sample; the sample set is $X$; the global metric matrix is $A$; $M = A^{T}A$; $j$ represents the sample index in the sample set; $x_j$ represents the $j$-th sample in the sample set; $i$ and $j$ are positive integers; $D(x_i, x_j)$ represents the distance between the input sample $x_i$ and the $j$-th sample in $X$ under the global metric matrix; $A(x_i, x_j)$ represents the distance between $x_i$ and $x_j$ after transformation by the matrix $A$.
Optionally, the second distance determining module is specifically configured to:
calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula according to a local measurement matrix of the target local cluster obtained by learning of a COS-LMNN algorithm to form a second distance set;
the cosine similarity algorithm formula is as follows:
the second distance set D2 includes: $\{D(x_i, x_{s1}), D(x_i, x_{s2}), \dots, D(x_i, x_{si})\}$;
Where $i$ represents the index of the input sample, and $x_i$ represents the $i$-th input sample; $x_{si}$ represents a sample of the same category as $i$; the local metric matrix is $A_S$; $M_S = A_S^{T}A_S$; $i$ is a positive integer; $D(x_i, x_{si})$ represents the distance, under the local metric matrix, between the input sample $x_i$ and a sample of the same category as $i$ in the target local cluster; $A_S(x_i, x_{si})$ represents the distance between $x_i$ and $x_{si}$ after transformation by the matrix $A_S$.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,
a memory 403 for storing a computer program;
the processor 401, when executing the program stored in the memory 403, implements the following steps:
obtaining a sample set; the sample set is determined according to a plurality of labeled samples of the patient medical database set; one sample includes: a patient number, features and feature data; the labels include: a first label and a second label; the first label identifies a cardiovascular and cerebrovascular disease patient sample; the second label identifies a non-cardiovascular and cerebrovascular disease patient sample;
obtaining an input sample; the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;
performing metric learning by using a cosine-large-interval nearest neighbor COS-LMNN algorithm to obtain a global metric matrix of the sample set;
dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
calculating the distance between an input sample and a sample in a sample set by using a cosine similarity algorithm according to the global measurement matrix to form a first distance set;
obtaining, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, first neighboring samples of the input sample, the number of which equals the first K value;
determining a local cluster in which the first neighboring sample is located;
selecting local clusters with the number of first adjacent samples exceeding a first preset threshold value from local clusters where the adjacent samples are located as target local clusters;
dividing the input sample into the target local cluster;
calculating, by using a cosine similarity algorithm and according to the local metric matrix of the target local cluster obtained through COS-LMNN algorithm learning, the distances between the input sample and the samples in the target local cluster to form a second distance set;
determining, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, second neighboring samples of the input sample, the number of which equals the second K value;
counting the number of first labels and the number of second labels of the second adjacent samples;
if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold value, taking the first labels as labels of the input samples, otherwise, taking the second labels as labels of the input samples;
determining whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases according to the label of the input sample;
if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is the patient with cardiovascular and cerebrovascular diseases;
and if the input sample is not the sample of the patient with the cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is not the patient with the cardiovascular and cerebrovascular diseases.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the computer is caused to execute a cardiovascular disease risk prediction method as described in any one of the above embodiments.
In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform a method for cardiovascular disease risk prediction as described in any of the above embodiments.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.
Claims (2)
1. A cardiovascular disease risk prediction device, the device comprising:
the set acquisition module is used for acquiring a sample set;
the sample set is determined according to a plurality of labeled samples of the patient medical database set; one sample includes: a patient number, features and feature data; the labels include: a first label and a second label; the first label identifies a cardiovascular and cerebrovascular disease patient sample; the second label identifies a non-cardiovascular and cerebrovascular disease patient sample;
the sample acquisition module is used for acquiring an input sample;
the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;
a matrix calculation module, configured to perform metric learning by using a cosine large margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;
a first local cluster determining module, configured to divide the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;
a first distance determining module, configured to calculate, by using a cosine similarity algorithm according to the global metric matrix, the distances between the input sample and the samples in the sample set to form a first distance set;
a first sample determining module, configured to determine, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, first neighboring samples of the input sample, the number of the first neighboring samples being equal to the first K value;
a second local cluster determining module, configured to determine a local cluster in which the first neighboring sample is located;
a target local cluster determining module, configured to select, as a target local cluster, a local cluster in which the number of first neighboring samples exceeds a first preset threshold from among local clusters in which the first neighboring samples are located;
a local cluster dividing module, configured to divide the input sample into the target local cluster;
a second distance determining module, configured to calculate, by using a cosine similarity algorithm according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;
a second sample determining module, configured to determine, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, second neighboring samples of the input sample, the number of the second neighboring samples being equal to the second K value;
a counting module, configured to count the number of first labels and the number of second labels among the second neighboring samples;
a label determining module, configured to take the first label as the label of the input sample if the ratio of the number of first labels to the number of second labels among the second neighboring samples exceeds a preset label threshold, and otherwise to take the second label as the label of the input sample;
a patient sample determining module, configured to determine, according to the label of the input sample, whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;
a cardiovascular and cerebrovascular disease patient determining module, configured to determine that the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases;
a non-cardiovascular and cerebrovascular disease patient determining module, configured to determine that the patient to be predicted is not a patient with cardiovascular and cerebrovascular diseases if the input sample is not a sample of a patient with cardiovascular and cerebrovascular diseases;
wherein the set acquisition module comprises:
a sample deleting submodule, configured to delete, from the plurality of samples of the labeled patient medical database set, the samples whose sample missing value is greater than a first threshold;
wherein the sample missing value is the ratio of the number of features with missing feature data in a sample to the total number of features in the sample;
a feature deleting submodule, configured to search the plurality of samples remaining after the deletion and to delete the features whose feature missing value is greater than a second threshold;
wherein the feature missing value is, for the same feature across the plurality of samples, the ratio of the number of instances of that feature lacking feature data to the total number of instances of that feature;
a first feature submodule, configured to search the plurality of samples after the feature deletion for features with missing feature data and to take those features as first features;
a missing value filling submodule, configured to fill in the feature data missing from the first features by using a multiple imputation method;
a data classification submodule, configured to classify the feature data of the plurality of samples, after the missing values have been filled in, according to data type to obtain a classification result;
wherein the classification result comprises: discrete feature data and continuous feature data;
a data processing submodule, configured to process the discrete feature data and the continuous feature data in the manner corresponding to their data type according to the classification result;
a set updating submodule, configured to add the feature data obtained by processing the discrete feature data and the continuous feature data to the patient medical database set to form a first database set;
wherein processing the discrete feature data and the continuous feature data in the manner corresponding to their data type comprises: one-hot encoding the discrete feature data; and standardizing the continuous feature data by using the Z-score (normal standardization) method;
an equalization processing submodule, configured to handle the class imbalance of the samples of the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;
a variance deleting submodule, configured to calculate the variance of the same feature data in the second database set by using an analysis of variance method and to delete the feature data whose variance is smaller than a preset variance threshold to obtain a third database set;
a weight calculation submodule, configured to calculate the weight of each feature data in the third database set by using the Relief algorithm;
a score deleting submodule, configured to delete from the third database set, according to the weight of each feature data and the score value corresponding to that weight, the feature data whose score value is smaller than a preset score threshold together with the corresponding features, to obtain a fourth database set;
and a set determining submodule, configured to determine the sample set from the fourth database set by using a forward selection method.
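The two-stage neighbor search recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation and rests on several assumptions: the global and local metric matrices are taken as already learned (the COS-LMNN learning step is not shown) and supplied as linear maps `L` with M = LᵀL; the cosine distance is computed after applying `L`; labels 1 and 0 stand for the first and second label; when several clusters exceed the first preset threshold, the one holding the most first neighboring samples is chosen; and `None` is returned when no cluster qualifies. All function and variable names are hypothetical.

```python
import numpy as np

def cosine_distance(x, Y, L):
    """Cosine distance between x and every row of Y after the linear map L.

    Assumption: the metric matrix M is supplied through L, with M = L.T @ L.
    """
    xt = L @ x
    Yt = Y @ L.T
    sim = (Yt @ xt) / (np.linalg.norm(Yt, axis=1) * np.linalg.norm(xt) + 1e-12)
    return 1.0 - sim

def predict_label(x, X, y, clusters, L_global, L_local,
                  k1, k2, count_threshold, label_threshold):
    """Two-stage cosine K-nearest-neighbor vote (1 = first label, 0 = second label)."""
    # Stage 1: k1 nearest samples in the whole sample set under the global metric.
    d1 = cosine_distance(x, X, L_global)
    nn1 = np.argsort(d1)[:k1]

    # Clusters holding more first neighbors than the threshold are candidates.
    ids, counts = np.unique(clusters[nn1], return_counts=True)
    mask = counts > count_threshold
    if not mask.any():
        return None  # assumption: the claim does not cover this case
    target = int(ids[mask][np.argmax(counts[mask])])

    # Stage 2: k2 nearest samples inside the target cluster under its local metric.
    members = np.flatnonzero(clusters == target)
    d2 = cosine_distance(x, X[members], L_local[target])
    nn2 = members[np.argsort(d2)[:k2]]

    # Vote: ratio of first labels to second labels against the label threshold.
    n_first = int(np.sum(y[nn2] == 1))
    n_second = int(np.sum(y[nn2] == 0))
    ratio = n_first / n_second if n_second else float("inf")
    return 1 if ratio > label_threshold else 0

# Toy data (purely illustrative): two clusters pointing in different directions.
X = np.array([[1.0, 0.0], [1.0, 0.1], [1.0, 0.2],
              [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]])
y = np.array([1, 1, 1, 0, 0, 0])
clusters = np.array([0, 0, 0, 1, 1, 1])
identity = np.eye(2)
L_local = {0: identity, 1: identity}

near_first = predict_label(np.array([1.0, 0.05]), X, y, clusters, identity,
                           L_local, k1=3, k2=3, count_threshold=1,
                           label_threshold=1.0)
near_second = predict_label(np.array([0.05, 1.0]), X, y, clusters, identity,
                            L_local, k1=3, k2=3, count_threshold=1,
                            label_threshold=1.0)
```

In this toy setup the identity matrix stands in for every learned metric matrix; with real data each cluster would carry its own learned map.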
2. The device of claim 1, further comprising:
a high-risk determining module, configured to determine, according to the health return visit data of the patient, whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;
a hospitalization suggestion module, configured to suggest hospitalization treatment for the patient to be predicted if the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient;
a physical examination increase suggestion module, configured to suggest an increased physical examination frequency for the patient to be predicted if the patient to be predicted is not a high-risk cardiovascular and cerebrovascular disease patient;
a health determining module, configured to determine, according to the health return visit data of the patient, whether the patient to be predicted is a healthy user;
a normal physical examination suggestion module, configured to suggest that the patient keep the normal physical examination frequency if the patient to be predicted is a healthy user;
a missed-diagnosis patient determining module, configured to mark the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and to add the feature data of the missed-diagnosis patient to the patient medical database set;
wherein the missed-diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
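The follow-up logic of claim 2 can be sketched as a small decision function. This is only an illustration under stated assumptions: the claim does not say which branch applies to which prediction, so the sketch assumes the high-risk check is applied to patients predicted positive and the health check to patients predicted negative. The names `Advice` and `follow_up`, and the use of a plain list as the patient medical database set, are hypothetical.

```python
from enum import Enum

class Advice(Enum):
    HOSPITALIZE = "suggest hospitalization treatment"
    MORE_CHECKUPS = "suggest an increased physical examination frequency"
    NORMAL_CHECKUPS = "suggest keeping the normal physical examination frequency"
    MISSED_DIAGNOSIS = "mark as missed-diagnosis patient"

def follow_up(predicted_positive, high_risk, healthy, features, database):
    """Map the prediction and the health return visit result to an action.

    Assumption: `high_risk` and `healthy` are booleans already derived from
    the health return visit data; how they are derived is not shown here.
    """
    if predicted_positive:
        # Predicted patient: hospitalize if high risk, otherwise more checkups.
        return Advice.HOSPITALIZE if high_risk else Advice.MORE_CHECKUPS
    if healthy:
        return Advice.NORMAL_CHECKUPS
    # Predicted negative but not healthy: a missed diagnosis whose feature
    # data is added back to the patient medical database set.
    database.append(features)
    return Advice.MISSED_DIAGNOSIS

# Illustrative call with a hypothetical feature record.
db = []
result = follow_up(False, False, False, {"patient_number": 1}, db)
```

Feeding missed-diagnosis records back into the database set is what lets the sample set be rebuilt later with the corrected case included.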
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810449174.4A CN108648827B (en) | 2018-05-11 | 2018-05-11 | Cardiovascular and cerebrovascular disease risk prediction method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108648827A (en) | 2018-10-12 |
CN108648827B (en) | 2022-04-08 |
Family
ID=63754823
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810449174.4A (Expired - Fee Related) CN108648827B (en) | Cardiovascular and cerebrovascular disease risk prediction method and device | 2018-05-11 | 2018-05-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108648827B (en) |
Families Citing this family (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109360658B (en) * | 2018-11-01 | 2021-06-08 | 北京航空航天大学 | Disease pattern mining method and device based on word vector model |
CN109492102A (en) * | 2018-11-08 | 2019-03-19 | 中国联合网络通信集团有限公司 | User data processing method, device, equipment and readable storage medium storing program for executing |
CN109685135B (en) * | 2018-12-21 | 2022-03-25 | 电子科技大学 | Few-sample image classification method based on improved metric learning |
CN111403024A (en) * | 2019-01-02 | 2020-07-10 | 天津幸福生命科技有限公司 | Method and device for obtaining disease judgment model based on medical data |
CN110232403B (en) * | 2019-05-15 | 2024-02-27 | 腾讯科技(深圳)有限公司 | Label prediction method and device, electronic equipment and medium |
CN111739634A (en) * | 2020-05-14 | 2020-10-02 | 平安科技(深圳)有限公司 | Method, device and equipment for intelligently grouping similar patients and storage medium |
CN112435757B (en) * | 2020-10-27 | 2024-07-16 | 深圳市利来山科技有限公司 | Prediction device and system for acute hepatitis |
CN113488174A (en) * | 2021-08-05 | 2021-10-08 | 新乡医学院第一附属医院 | Method for predicting the risk of acute cerebrovascular disease |
CN113707255B (en) * | 2021-08-31 | 2023-09-26 | 平安科技(深圳)有限公司 | Health guidance method, device, computer equipment and medium based on similar patients |
CN113782216B (en) * | 2021-09-15 | 2023-10-24 | 平安科技(深圳)有限公司 | Disabling weight determining method and device, electronic equipment and storage medium |
CN113628709B (en) * | 2021-10-09 | 2022-02-11 | 腾讯科技(深圳)有限公司 | Similar object determination method, device, equipment and storage medium |
CN113948207B (en) * | 2021-10-18 | 2024-08-16 | 东北大学 | Blood sugar data processing method for hypoglycemia early warning |
CN113988205B (en) * | 2021-11-08 | 2022-09-20 | 福建龙净环保股份有限公司 | Method and system for judging electric precipitation working condition |
CN117174330A (en) * | 2023-08-21 | 2023-12-05 | 肾泰网健康科技(南京)有限公司 | IgA nephropathy patient treatment scheme recommendation method based on machine learning |
CN117831771B (en) * | 2024-03-05 | 2024-05-17 | 凯斯艾生物科技(苏州)有限公司 | Disease risk prediction model construction method and system based on deep learning |
CN118194023B (en) * | 2024-05-13 | 2024-08-20 | 湖南格尔智慧科技有限公司 | Patient disease data commonality extraction system based on collaborative deep learning |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007033132A (en) * | 2005-07-25 | 2007-02-08 | Sysmex Corp | Analyzing system, inspection data processor, computer program and analyzer |
CN107895596A (en) * | 2016-12-19 | 2018-04-10 | 平安科技(深圳)有限公司 | Risk Forecast Method and system |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050032066A1 (en) * | 2003-08-04 | 2005-02-10 | Heng Chew Kiat | Method for assessing risk of diseases with multiple contributing factors |
EP3223180A1 (en) * | 2016-03-24 | 2017-09-27 | Fujitsu Limited | A system and a method for assessing patient risk using open data and clinician input |
CN106874663A (en) * | 2017-01-26 | 2017-06-20 | 中电科软件信息服务有限公司 | Cardiovascular and cerebrovascular disease Risk Forecast Method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220408 |