CN108648827B

CN108648827B - Cardiovascular and cerebrovascular disease risk prediction method and device

Info

Publication number: CN108648827B
Application number: CN201810449174.4A
Authority: CN
Inventors: 刘奎; 倪壮; 康桂霞; 杨波; 张宁波
Original assignee: Chinese PLA General Hospital; Beijing University of Posts and Telecommunications
Current assignee: Chinese PLA General Hospital; Beijing University of Posts and Telecommunications
Priority date: 2018-05-11
Filing date: 2018-05-11
Publication date: 2022-04-08
Anticipated expiration: 2038-05-11
Also published as: CN108648827A

Abstract

The embodiments of the invention provide a method and a device for predicting cardiovascular and cerebrovascular disease risk. The method comprises: obtaining a sample set; dividing the samples in the sample set into a preset number of local clusters; determining, according to a preset first K value and a first distance set, the first neighboring samples of an input sample (their number equal to the first K value) so as to determine a target local cluster; calculating the distances between the input sample and the samples in the target local cluster so as to determine the second neighboring samples of the input sample (their number equal to the second K value); determining the label of the input sample, and thereby whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally determining whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. Because the characteristic data of patients with cardiovascular and cerebrovascular diseases are highly similar, the embodiments classify the input sample by its nearest samples, which avoids the influence of dissimilar sample data on a trained prediction model. The accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases can therefore be improved.

Description

Cardiovascular and cerebrovascular disease risk prediction method and device

Technical Field

The invention relates to the field of prediction analysis, in particular to a cardiovascular and cerebrovascular disease risk prediction method and device.

Background

As the pressures of daily life and mental stress increase, the incidence of cardiovascular and cerebrovascular diseases rises year by year and seriously affects residents' health. Medical practice shows that accurate diagnosis at an early stage greatly benefits the intervention and treatment of cardiovascular and cerebrovascular diseases.

In the prior art, data mining is used to extract the characteristics of cardiovascular and cerebrovascular disease case data: the physical examination characteristic data and return visit data of all patients form a training set, and a prediction model is trained using a decision tree, logistic regression, or an artificial neural network algorithm. The physical examination data of the patient to be predicted is then fed into the trained prediction model as an input sample, and the model outputs whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

Take a prediction model trained with an artificial neural network algorithm as an example. The input samples of the neural network include both samples of patients without cardiovascular and cerebrovascular diseases and samples of patients with them, and the characteristic data of the two groups differ greatly. When all samples in the training set are fed to the input layer, the error function at the output layer of the neural network is large. Because the weights and thresholds of each layer are adjusted according to this error function, which is affected by the dissimilar sample data, the trained prediction model is inaccurate. Consequently, a prediction model trained with an artificial neural network algorithm predicts with low accuracy whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

Disclosure of Invention

The embodiment of the invention aims to provide a cardiovascular and cerebrovascular disease risk prediction method and device so as to improve the accuracy of predicting whether a patient is a cardiovascular and cerebrovascular disease patient. The specific technical scheme is as follows:

In a first aspect, an embodiment of the present invention provides a cardiovascular and cerebrovascular disease risk prediction method, including:

obtaining a sample set; the sample set is determined from a plurality of labeled samples in a patient medical database set; each sample includes: the patient's number, characteristics and characteristic data; the labels include: a first label and a second label; the first label identifies a sample of a patient with cardiovascular and cerebrovascular diseases; the second label identifies a sample of a patient without cardiovascular and cerebrovascular diseases;

obtaining an input sample; the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;

performing metric learning by using a cosine large-margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;

calculating, by using a cosine similarity algorithm according to the global metric matrix, the distances between the input sample and the samples in the sample set to form a first distance set;

determining, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the first neighboring samples of the input sample, the number of first neighboring samples being equal to the first K value;

determining the local clusters in which the first neighboring samples are located;

selecting, from the local clusters in which the first neighboring samples are located, each local cluster in which the number of first neighboring samples exceeds a first preset threshold as a target local cluster;

dividing the input sample into the target local cluster;

calculating, by using a cosine similarity algorithm according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;

in the target local cluster, determining, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, the second neighboring samples of the input sample, the number of second neighboring samples being equal to the second K value;

counting the number of first labels and the number of second labels among the second neighboring samples;

if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold value, taking the first labels as labels of the input samples, otherwise, taking the second labels as labels of the input samples;

determining whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases according to the label of the input sample;

if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is the patient with cardiovascular and cerebrovascular diseases;

and if the input sample is not the sample of the patient with the cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is not the patient with the cardiovascular and cerebrovascular diseases.
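The final labeling step above (count the first and second labels among the second neighboring samples, then compare their ratio with the preset label threshold) can be sketched as follows. This is a minimal illustration, not the patented implementation: the label encoding, the function name `decide_label`, the default threshold, and the handling of the case with zero second labels are all assumptions, since the text does not specify them.

```python
from collections import Counter

# Assumed encoding of the two labels (not given in the text).
FIRST_LABEL = 1   # sample of a patient with cardiovascular and cerebrovascular diseases
SECOND_LABEL = 0  # sample of a patient without them


def decide_label(neighbor_labels, label_threshold=1.0):
    """Return the label of the input sample from the labels of its second neighboring samples.

    The first label is chosen when (number of first labels) / (number of second labels)
    exceeds label_threshold; otherwise the second label is chosen.
    """
    counts = Counter(neighbor_labels)
    n_first = counts[FIRST_LABEL]
    n_second = counts[SECOND_LABEL]
    if n_second == 0:
        # The ratio is undefined here. Assumption: any first label among the
        # neighbors counts as exceeding the threshold.
        return FIRST_LABEL if n_first > 0 else SECOND_LABEL
    return FIRST_LABEL if n_first / n_second > label_threshold else SECOND_LABEL
```

With a threshold of 1.0, three first labels against two second labels give a ratio of 1.5 and yield the first label, while an even split gives a ratio of exactly 1.0, which does not exceed the threshold and yields the second label.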

Optionally, after the step of determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases, the method further comprises:

determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;

if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases, making a recommendation of hospitalization for the patient to be predicted;

if the patient to be predicted is not the patient with the high-risk cardiovascular and cerebrovascular diseases, making a suggestion for increasing the physical examination frequency of the patient to be predicted;

after the step of determining that the patient to be predicted in the input sample is not a patient with cardiovascular and cerebrovascular disease, the method further comprises:

determining whether the patient to be predicted is a healthy user according to health return visit data of the patient;

if the patient to be predicted is a healthy user, recommending that the patient to be predicted keep a normal physical examination frequency;

if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed-diagnosis patient, and adding the characteristic data of the missed-diagnosis patient to the patient medical database set;

wherein the missed-diagnosis patient is a patient with cardiovascular and cerebrovascular diseases.
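The optional follow-up logic above is a small decision tree over the prediction and the health return visit data. A hedged sketch is given below; the function name `follow_up`, its boolean parameters, and the returned strings are illustrative assumptions, and how "high risk" or "healthy" is derived from the return visit data is not specified in the text.

```python
def follow_up(predicted_patient, high_risk=False, healthy=True):
    """Return (recommendation, add_to_database) for a patient to be predicted.

    predicted_patient: result of the prediction method (True = predicted to be a
    patient with cardiovascular and cerebrovascular diseases).
    high_risk / healthy: assumed to be derived from the health return visit data.
    """
    if predicted_patient:
        if high_risk:
            return ("recommend hospitalization", False)
        return ("recommend increasing physical examination frequency", False)
    if healthy:
        return ("recommend keeping normal physical examination frequency", False)
    # Predicted negative but not healthy on return visit: a missed diagnosis.
    # The characteristic data is then added to the patient medical database set.
    return ("mark as missed-diagnosis patient", True)
```

The second element of the returned pair signals that the characteristic data of a missed-diagnosis patient should be added back to the patient medical database set, so that later sample sets include it.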

Optionally, the first label identifying a sample of a patient with cardiovascular and cerebrovascular diseases includes:

determining the identification information of the cardiovascular and cerebrovascular disease patient according to the collected health return visit data of the patient;

the health return visit data of the patient comprises: the patient's number, characteristics, characteristic data and confirmed condition; the identification information includes: the confirmed condition, and the characteristics and characteristic data corresponding to the confirmed condition;

determining, in the patient medical database set, the samples of patients with cardiovascular and cerebrovascular diseases according to the identification information of the patients with cardiovascular and cerebrovascular diseases;

setting the first label on the samples of the patients with cardiovascular and cerebrovascular diseases;

the second label identifying a sample of a patient without cardiovascular and cerebrovascular diseases includes:

setting the second label on all samples other than the samples of the patients with cardiovascular and cerebrovascular diseases.

Optionally, obtaining a sample set includes:

performing sample deletion on the labeled samples of the patient medical database set, deleting each sample whose sample missing value is greater than a first threshold;

the sample missing value is the ratio of the number of missing features in a sample to the total number of features in the sample;

among the samples remaining after sample deletion, performing feature deletion on each feature whose feature missing value is greater than a second threshold;

the feature missing value is, for a given feature, the ratio of the number of samples in which that feature lacks feature data to the total number of occurrences of that feature across the plurality of samples;

finding, in the samples remaining after feature deletion, the features that still lack feature data, as first features;

filling in the missing feature data of the first features by using a multiple imputation method;

classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain classification results;

wherein the classification result comprises: discrete feature data and continuous feature data;

according to the classification result, the discrete feature data and the continuous feature data are processed corresponding to the data type;

adding the characteristic data which is obtained by correspondingly processing the discrete characteristic data and the continuous characteristic data into the patient medical database set as a first database set;

wherein processing the discrete feature data and the continuous feature data according to their data types comprises: one-hot encoding the discrete feature data; and standardizing the continuous feature data by using the Z-score (standard normal) method;

balancing the classes of the samples in the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;

calculating the variance of each feature's data in the second database set by using an analysis of variance method, and deleting the feature data whose variance is smaller than a preset variance threshold;

calculating, by using the Relief algorithm, the weight of each remaining feature after the feature data whose variance is smaller than the preset variance threshold has been deleted;

deleting, according to the weights of the feature data and the scores corresponding to the weights, the feature data whose score is smaller than a preset score threshold together with the corresponding features, to obtain a fourth database set;

determining the sample set from the fourth database set by using a forward selection method.
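The first preprocessing steps above (deleting samples and features by missing ratio, one-hot encoding discrete data, and Z-score standardization of continuous data) can be sketched in plain Python. This is an illustration under stated assumptions: samples are represented as dictionaries with `None` marking missing data, the function names are hypothetical, and the population standard deviation is used for the Z-score. Multiple imputation, SMOTE, Relief weighting and forward selection are not shown.

```python
import math


def drop_sparse_samples(samples, features, first_threshold):
    """Delete samples whose missing ratio (missing features / all features) exceeds the threshold."""
    kept = []
    for sample in samples:
        missing = sum(1 for f in features if sample.get(f) is None)
        if missing / len(features) <= first_threshold:
            kept.append(sample)
    return kept


def drop_sparse_features(samples, features, second_threshold):
    """Delete features whose missing ratio across the samples exceeds the threshold."""
    kept = []
    for f in features:
        missing = sum(1 for sample in samples if sample.get(f) is None)
        if missing / len(samples) <= second_threshold:
            kept.append(f)
    return kept


def one_hot(values):
    """One-hot encode discrete values; columns follow the sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


def z_score(values):
    """Standardize continuous values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [0.0] * len(values)  # constant feature: no spread to scale by
    return [(v - mean) / std for v in values]
```

Samples are filtered before features, in the order the steps are listed, so that a few nearly empty samples do not inflate the missing ratio of every feature.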

Optionally, the calculating, according to the global metric matrix, distances between the input samples and the samples in the sample set by using a cosine similarity algorithm to form a first distance set includes:

calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula according to the global measurement matrix to form a first distance set;

wherein the cosine similarity algorithm formula is as follows:

the first distance set $D_1$ includes: $\{D(x_i, x_1),\ D(x_i, x_2),\ D(x_i, x_3),\ \dots,\ D(x_i, x_j)\}$;

where $i$ is the index of the input sample and $x_i$ is the $i$-th input sample; the sample set is $X$; the global metric matrix is $A$, and $M = A^{T}A$; $j$ is the index of a sample in the sample set and $x_j$ is the $j$-th sample in the sample set; $i$ and $j$ are positive integers; $D(x_i, x_j)$ is the distance, under the global metric matrix, between the input sample $x_i$ and the $j$-th sample in the set $X$; $A(x_i, x_j)$ is the distance between $x_i$ and $x_j$ after transformation by the matrix $A$.
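The formula itself does not appear in this text. As a hedged sketch only, one form that is consistent with the definitions given ($M = A^{T}A$, and a distance computed after transformation by $A$) is a cosine distance in the transformed space. The specific form below, including the choice of $1 - \cos$ as the distance, is an assumption and may differ from the formula in the original publication:

```latex
\cos_A(x_i, x_j)
  = \frac{(A x_i)^{T}(A x_j)}{\lVert A x_i \rVert \, \lVert A x_j \rVert}
  = \frac{x_i^{T} M x_j}{\sqrt{x_i^{T} M x_i}\,\sqrt{x_j^{T} M x_j}},
\qquad
D(x_i, x_j) = 1 - \cos_A(x_i, x_j)
```

Under this reading, the second equality follows directly from $M = A^{T}A$, since $(A x_i)^{T}(A x_j) = x_i^{T} A^{T} A\, x_j$.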

Optionally, the calculating, according to the local measurement matrix of the target local cluster obtained by the COS-LMNN algorithm learning, the distance between the input sample and the sample in the target local cluster by using a cosine similarity algorithm to form a second distance set, includes:

calculating, by using a cosine similarity algorithm formula according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;

wherein the cosine similarity algorithm formula is as follows:

the second distance set $D_2$ includes: $\{D(x_i, x_{s1}),\ D(x_i, x_{s2}),\ \dots,\ D(x_i, x_{si})\}$;

where $i$ is the index of the input sample and $x_i$ is the $i$-th input sample; $x_{si}$ denotes a sample of the same category as $i$; the local metric matrix is $A_S$, and $M_S = A_S^{T}A_S$; $i$ is a positive integer; $D(x_i, x_{si})$ is the distance, under the local metric matrix, between the input sample $x_i$ and a sample of the same category as $i$ in the target local cluster; $A_S(x_i, x_{si})$ is the distance between $x_i$ and $x_{si}$ after transformation by the matrix $A_S$.
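Both distance sets follow the same pattern: transform the samples with a metric matrix (the global matrix A or the local matrix A_S), compute a cosine-based distance, and keep the K nearest samples. The sketch below illustrates that pattern under assumptions: the metric matrix is taken as given (COS-LMNN learning is not shown), the distance is taken to be one minus the cosine similarity of the transformed vectors, and the function names are hypothetical.

```python
import math


def transform(A, x):
    """Apply the metric matrix A (a list of rows) to the vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def cosine_distance(A, x, y):
    """Assumed distance: 1 - cosine similarity of A x and A y."""
    ax, ay = transform(A, x), transform(A, y)
    dot = sum(p * q for p, q in zip(ax, ay))
    norm = math.sqrt(sum(p * p for p in ax)) * math.sqrt(sum(q * q for q in ay))
    if norm == 0:
        return 1.0  # assumption: a zero vector is treated as maximally dissimilar
    return 1.0 - dot / norm


def k_nearest(A, x, samples, k):
    """Return the indices of the k samples closest to x under the metric matrix A."""
    distances = sorted((cosine_distance(A, x, s), idx) for idx, s in enumerate(samples))
    return [idx for _, idx in distances[:k]]
```

Calling `k_nearest` with the global matrix over the whole sample set gives the first neighboring samples, and calling it with the local matrix over only the samples of the target local cluster gives the second neighboring samples.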

In a second aspect, the present embodiment provides a cardiovascular and cerebrovascular disease risk prediction apparatus, including:

the set acquisition module is used for acquiring a sample set;

the sample set is determined from a plurality of labeled samples in a patient medical database set; each sample includes: the patient's number, characteristics and characteristic data; the labels include: a first label and a second label; the first label identifies a sample of a patient with cardiovascular and cerebrovascular diseases; the second label identifies a sample of a patient without cardiovascular and cerebrovascular diseases;

the sample acquisition module is used for acquiring an input sample;

the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;

the matrix calculation module is used for performing metric learning by using a cosine large-margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

the first local cluster determining module is used for dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;

the first distance determining module is used for calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm according to the global measurement matrix to form a first distance set;

the first sample determining module is used for determining, by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the first neighboring samples of the input sample, the number of first neighboring samples being equal to the first K value;

a second local cluster determining module, configured to determine a local cluster in which the first neighboring sample is located;

a target local cluster determining module, configured to select, as a target local cluster, a local cluster in which the number of first neighboring samples exceeds a first preset threshold from among local clusters in which the first neighboring samples are located;

a local cluster dividing module, configured to divide the input sample into the target local cluster;

the second distance determination module is used for calculating, by using a cosine similarity algorithm according to a local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;

a second sample determining module, configured to determine, in the target local cluster, by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, the second neighboring samples of the input sample, the number of second neighboring samples being equal to the second K value;

the counting module is used for counting the number of the first labels and the number of the second labels of the second adjacent samples;

the label determining module is used for taking the first label as the label of the input sample if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold value, and otherwise, taking the second label as the label of the input sample;

the patient sample determination module is used for determining whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases according to the label of the input sample;

the cardiovascular and cerebrovascular disease patient determination module is used for determining that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient if the input sample is a sample of a cardiovascular and cerebrovascular disease patient;

and the non-cardiovascular and cerebrovascular disease patient determination module is used for determining that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient if the input sample is not a sample of a cardiovascular and cerebrovascular disease patient.

Optionally, the cardiovascular and cerebrovascular disease risk prediction apparatus provided in this embodiment further includes:

the high-risk determination module is used for determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;

the hospitalization suggestion module is used for making a suggestion of hospitalization treatment on the patient to be predicted if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases;

the physical examination increasing suggestion module is used for making a suggestion of increasing the physical examination frequency for the patient to be predicted if the patient to be predicted is not the high-risk cardiovascular and cerebrovascular disease patient;

the health determination module is used for determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;

the normal physical examination suggestion module is used for recommending that the patient to be predicted keep a normal physical examination frequency if the patient to be predicted is a healthy user;

the missed-diagnosis patient determination module is used for marking the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and adding the characteristic data of the missed-diagnosis patient into the patient medical database set.

Optionally, the set obtaining module includes:

the sample deleting submodule is used for deleting, from the labeled samples of the patient medical database set, each sample whose sample missing value is greater than the first threshold;

the feature deleting submodule is used for performing feature deletion, among the samples remaining after sample deletion, on each feature whose feature missing value is greater than the second threshold;

the first feature submodule is used for finding, in the samples remaining after feature deletion, the features that still lack feature data, as first features;

a missing value filling submodule, configured to fill in the missing feature data of the first features by using a multiple imputation method;

the data classification submodule is used for classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain a classification result;

the data processing submodule is used for processing the discrete characteristic data and the continuous characteristic data corresponding to the data type according to the classification result;

the set updating submodule is used for adding the characteristic data which is obtained by correspondingly processing the discrete characteristic data and the continuous characteristic data into the patient medical database set to be used as a first database set;

the balancing submodule is used for balancing the classes of the samples in the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;

the variance deleting submodule is used for calculating the variance of each feature's data in the second database set by using an analysis of variance method, and deleting the feature data whose variance is smaller than a preset variance threshold;

the weight calculation submodule is used for calculating, by using the Relief algorithm, the weight of each remaining feature after the feature data whose variance is smaller than the preset variance threshold has been deleted;

and the set determining submodule is used for determining the sample set by using a forward selection method according to the weight of each characteristic data and the second database set.

In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;

a memory for storing a computer program;

and the processor is used for realizing the cardiovascular and cerebrovascular disease risk prediction method when executing the program stored in the memory.

In yet another aspect of the present invention, there is also provided a computer-readable storage medium, having stored therein instructions, which when executed on a computer, cause the computer to execute a method for predicting risk of cardiovascular and cerebrovascular diseases according to any one of the above.

In yet another aspect of the present invention, the present invention also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute a method for predicting risk of cardiovascular and cerebrovascular diseases as described in any one of the above.

According to the method and the device for predicting cardiovascular and cerebrovascular disease risk provided by the embodiments of the invention, a sample set is obtained; the samples in the sample set are divided into a preset number of local clusters; a K-nearest neighbor algorithm is used, according to a preset first K value and the first distance set, to determine the first neighboring samples of the input sample (their number equal to the first K value) and thereby a target local cluster; the input sample is divided into the target local cluster; the distances between the input sample and the samples in the target local cluster are calculated to determine the second neighboring samples of the input sample (their number equal to the second K value); the number of first labels and the number of second labels among the second neighboring samples are counted to determine the label of the input sample and whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally it is determined whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. Because the characteristic data of patients with cardiovascular and cerebrovascular diseases are highly similar, the embodiments calculate the similarity distances between samples and determine the nearest samples of the input sample in order to decide whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases, which avoids the influence of dissimilar sample characteristic data on the training of a prediction model. The accuracy of predicting whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases can therefore be improved. Of course, no single product or method embodying the invention necessarily achieves all of the above advantages at the same time.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for predicting cardiovascular and cerebrovascular disease risk according to an embodiment of the present invention;

FIG. 2 is a flowchart illustrating the step S101 in FIG. 1 according to an embodiment of the present invention;

FIG. 3 is a structural diagram of a cardiovascular and cerebrovascular disease risk prediction device according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

This embodiment addresses the problem in the prior art that a prediction model trained with a decision tree, logistic regression, or an artificial neural network algorithm predicts with low accuracy whether a patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. Because the characteristic data of patients with cardiovascular and cerebrovascular diseases are similar to one another, the similarity distances between patient samples can be used to find samples similar to the sample of the patient to be predicted. From those similar samples it can be determined whether the sample of the patient to be predicted is a sample of a patient with cardiovascular and cerebrovascular diseases, and therefore whether the patient to be predicted is such a patient.

As shown in fig. 1, a method for predicting risk of cardiovascular and cerebrovascular diseases according to an embodiment of the present invention includes the following steps:

S101, obtaining a sample set; the sample set is determined from a plurality of labeled samples in the patient medical database set; each sample includes: the patient's number, characteristics and characteristic data; the labels include: a first label and a second label; the first label identifies a sample of a patient with cardiovascular and cerebrovascular diseases; the second label identifies a sample of a patient without cardiovascular and cerebrovascular diseases;

For example, a sample includes: patient number: 0001; characteristics: age, gender, city, occupation, family genetic history, disease history, eating habits, smoking habits, drinking habits, blood pressure, pulse, blood lipids, blood glucose, etc.; the characteristic data include: age: 50; gender: male; city: Wuhan; occupation: teacher; family genetic history: none; disease history: hypertension; eating habits: does not eat breakfast, eats wheat-based food or rice at lunch or dinner; smoking habits: two cigarettes a day; drinking habits: at least once every three days; blood pressure: 100-; pulse: 60-100 beats/min; serum total cholesterol: 2.9-5.17 mmol/L; serum triglycerides: 0.56-1.7 mmol/L; high-density lipoprotein cholesterol: 0.94-2.0 mmol/L; low-density lipoprotein cholesterol: 2.07-3.12 mmol/L; fasting blood glucose: 7.8-9.0 mmol/L; etc.

It can be understood that, in the present embodiment, the plurality of samples having the labels set in the patient medical data database set are formed by combining the medical examination data and the medical treatment data of the patient into one sample in advance, or the medical examination data and the medical treatment data of the patient may be combined into one sample each time the sample set is acquired. In view of the former time saving, the present embodiment combines the medical examination data and the medical visit data of the patient into a sample in advance, and then sets the label according to the health return visit data of the patient. The patient's health revisitation data contained: patient label, patient identified condition, and patient identified data characteristic of the condition. For example: health revisitation data includes: stroke, hypertension, coronary heart disease, hyperlipemia, hyperglycemia, hemoptysis, dizziness, cerebral infarction, heart failure and other cardiovascular and cerebrovascular diseases. The patient can confirm the disease condition by a doctor or by a patient's body or by a family member in the prior art, which is not described herein.

The medical examination data and the medical treatment data of the patient are the same as those in the prior art: the medical examination data comprises the patient number and the characteristic data of the patient, and the medical treatment data comprises the patient number and the basic information of the patient. The medical examination data and the medical treatment data are combined according to the patient number to form a sample; whether the sample is a cardiovascular and cerebrovascular disease patient sample is then determined according to the patient number and the confirmed condition contained in the health return visit data, and a label is set for the sample.
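
The combination and labeling described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the record layout, the field names, the second patient and the function `build_samples` are all hypothetical, and the labels 1 and 0 stand for the first and second label.

```python
# Hypothetical records keyed by patient number; all field names are illustrative.
exam_data = {
    "0001": {"age": 50, "pulse": 72},
    "0002": {"age": 43, "pulse": 80},
}
visit_data = {
    "0001": {"gender": "male", "city": "Wuhan"},
    "0002": {"gender": "female", "city": "Beijing"},
}
# Return visit data: patient number -> confirmed condition (None if none confirmed).
return_visit = {"0001": "hypertension", "0002": None}

# Conditions counted as cardiovascular and cerebrovascular diseases
# (taken from the example list in the text; the exact set is an assumption).
CCVD_CONDITIONS = {
    "stroke", "hypertension", "coronary heart disease", "hyperlipemia",
    "hyperglycemia", "cerebral infarction", "heart failure",
}

def build_samples(exam, visit, revisit):
    """Join exam and visit data on patient number and attach a 1/0 label."""
    samples = []
    for pid in sorted(exam.keys() & visit.keys()):
        features = {**visit[pid], **exam[pid]}
        label = 1 if revisit.get(pid) in CCVD_CONDITIONS else 0
        samples.append({"patient_number": pid, "features": features, "label": label})
    return samples

samples = build_samples(exam_data, visit_data, return_visit)
```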

S102, obtaining an input sample; the input sample is formed by combining medical health physical examination data and medical treatment data of a patient to be predicted;

S103, performing metric learning by using a cosine large-margin nearest neighbor (COS-LMNN) algorithm to obtain a global metric matrix of the sample set;

The COS-LMNN algorithm in this embodiment combines the cosine (COS) algorithm with the large-margin nearest neighbor (LMNN) algorithm to calculate the global metric matrix of the sample set; the combination manner is the same as in the prior art and is not described herein again.

S104, dividing the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;

The preset number is a number set according to industry experience, and the preset clustering algorithm may be a prior-art algorithm such as the k-means clustering algorithm, a hierarchical clustering algorithm, the self-organizing map (SOM) clustering algorithm or the fuzzy c-means (FCM) clustering algorithm. The preset number can be adjusted according to the clustering algorithm used.

It can be understood that the present embodiment divides the samples in the sample set into different local clusters. For example, there are 7 samples in the sample set, namely A, B, C, D, E, F and G, and the result after the division is: local cluster 1: A and B; local cluster 2: C, F and G; local cluster 3: D and E.

S105, calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm according to the global measurement matrix to form a first distance set;

S106, calculating, by using a K-nearest neighbor algorithm, the first adjacent samples of the input sample according to a preset first K value and the first distance set, the number of first adjacent samples being equal to the first K value;

In this embodiment, the first K value is preset according to practical experience, and it equals the number of first adjacent samples of the input sample calculated by the K-nearest neighbor algorithm.

For example, the sample set includes samples 1, 2, 3 and 4. Assuming samples 1, 2 and 3 are samples of patients with cardiovascular and cerebrovascular diseases, the sample distance between samples 1, 2 and 3 may be 0. If the input sample is a cardiovascular and cerebrovascular disease patient sample and the first K value is set to 2, then the first adjacent samples of the input sample are samples 1 and 2, or 2 and 3, or 1 and 3.

S107, determining a local cluster where the first adjacent sample is located;

S108, selecting, from the local clusters where the first adjacent samples are located, the local cluster in which the number of first adjacent samples exceeds a first preset threshold as the target local cluster;

S109, dividing the input sample into the target local cluster;

S110, calculating, by using the cosine similarity algorithm and according to a local metric matrix of the target local cluster obtained through learning by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster to form a second distance set;

S111, determining, by using the K-nearest neighbor algorithm, the second adjacent samples of the input sample in the target local cluster according to a preset second K value and the second distance set;

In this embodiment, the number of second adjacent samples determined for the input sample in the target local cluster equals the second K value, so there may be one or more second adjacent samples.

S112, counting the number of the first labels and the number of the second labels of the second adjacent samples;

S113, if the ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold value, taking the first label as the label of the input sample; otherwise, taking the second label as the label of the input sample;

S114, determining whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases according to the label of the input sample;

S115, if the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is a patient with cardiovascular and cerebrovascular diseases;

and S116, if the input sample is not the sample of the patient with the cardiovascular and cerebrovascular diseases, determining that the patient to be predicted in the input sample is not the patient with the cardiovascular and cerebrovascular diseases.
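
Steps S105 to S113 can be sketched as a two-stage nearest-neighbor search. The sketch below rests on several assumptions that the text does not fix: the distance is taken as one minus the cosine similarity of the transformed vectors, the metric matrices are supplied rather than learned by COS-LMNN, the target local cluster is the one holding the most first adjacent samples, and labels 1 and 0 stand for the first and second label. All function and parameter names are hypothetical.

```python
import math
from collections import Counter

def transform(A, x):
    """Apply metric matrix A (a list of rows) to vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def cosine_distance(A, x, y):
    """1 - cosine similarity of A x and A y (assumed form; vectors must be nonzero)."""
    ax, ay = transform(A, x), transform(A, y)
    dot = sum(p * q for p, q in zip(ax, ay))
    norm = math.sqrt(sum(p * p for p in ax)) * math.sqrt(sum(q * q for q in ay))
    return 1.0 - dot / norm

def nearest(A, x, candidates, samples, k):
    """Indices of the k candidate samples closest to x under metric matrix A."""
    return sorted(candidates, key=lambda j: cosine_distance(A, x, samples[j]))[:k]

def predict(x, samples, labels, cluster_of, A_global, A_local,
            k1, k2, cluster_threshold, label_threshold):
    # S105-S106: first adjacent samples over the whole sample set.
    first = nearest(A_global, x, range(len(samples)), samples, k1)
    # S107-S108: cluster holding the most first adjacent samples, if above threshold.
    cluster, count = Counter(cluster_of[j] for j in first).most_common(1)[0]
    if count <= cluster_threshold:
        raise ValueError("no local cluster exceeds the first preset threshold")
    # S109-S111: second adjacent samples inside the target local cluster.
    members = [j for j in range(len(samples)) if cluster_of[j] == cluster]
    second = nearest(A_local[cluster], x, members, samples, k2)
    # S112-S113: compare the label counts of the second adjacent samples.
    n_first = sum(1 for j in second if labels[j] == 1)
    n_second = len(second) - n_first
    if n_second == 0 or n_first / n_second > label_threshold:
        return 1
    return 0

# Toy data: two clusters, identity metric matrices (illustrative values only).
samples = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.0, 0.9], [0.2, 1.0]]
labels = [1, 1, 1, 0, 0, 0]
cluster_of = [0, 0, 0, 1, 1, 1]
identity = [[1.0, 0.0], [0.0, 1.0]]
args = (samples, labels, cluster_of, identity, {0: identity, 1: identity}, 3, 3, 1, 1.0)
label_a = predict([0.8, 0.1], *args)
label_b = predict([0.1, 0.9], *args)
```

How to proceed when no local cluster exceeds the first preset threshold is not specified in the text, so the sketch simply raises an error in that case.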

In the prior art, by contrast, data mining technology is used to mine the case data characteristics of cardiovascular and cerebrovascular diseases: the physical examination characteristic data and the return visit data of all patients form a training set, and a prediction model is trained by using a decision tree, logistic regression or an artificial neural network algorithm. The physical examination data of the patient to be predicted is then used as an input sample and input into the trained prediction model, which outputs whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

In the process of training the prediction model, the decision tree analyzes the physical examination data of all patients in the training set, takes the physical examination data with the largest information gain as the first node, takes the other physical examination data in turn as branches according to their information gain, and stops training when the data samples belong to only one class, thereby obtaining the prediction model. When the physical examination data with the largest information gain is that of patients without cardiovascular and cerebrovascular diseases, the prediction model trained in this way is dominated by that sample data, and the accuracy of the prediction results of the model trained by the decision tree is not high.

In the process of training the prediction model by using the logistic regression algorithm, the minimum of the loss function needs to be solved to determine the prediction model. Because the process of minimizing the loss function is easily influenced by differing sample data, the probability output by the prediction model that the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases is inaccurate, and so is the resulting determination of whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases.

This embodiment obtains a sample set and an input sample; divides the samples in the sample set into a preset number of local clusters; and calculates the first adjacent samples of the input sample, whose number equals the first K value, thereby determining the target local cluster. The second adjacent samples of the input sample, whose number equals the second K value, are determined by calculating the distances between the input sample and the samples in the target local cluster; the number of first labels and the number of second labels of the second adjacent samples are counted, thereby determining the label of the input sample and whether the input sample is a sample of a patient with cardiovascular and cerebrovascular diseases; and finally it is determined whether the patient to be predicted is a patient with cardiovascular and cerebrovascular diseases. In this embodiment, no prediction model needs to be trained. Considering that the characteristic data of patients with cardiovascular and cerebrovascular diseases are highly similar, the sample similarity distance is used to determine whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient, which avoids the influence of the sample data with the largest physical examination data information gain that arises when a prediction model is trained by a decision tree. Nor is a prediction model trained by a logistic regression algorithm, so no loss function is solved, which avoids the inaccuracy of the trained prediction model caused by the loss function minimization being influenced by differing sample data. Therefore, the accuracy of predicting whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient can be improved.

Optionally, in an embodiment of the method for predicting a cardiovascular and cerebrovascular disease risk of the present invention, after step S115 of determining, if the input sample is a cardiovascular and cerebrovascular disease patient sample, that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient, the method further includes:

Step one: determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;

If the characteristic data in the health return visit data exceed the normal indexes, the sign information is abnormal and the symptoms are abnormal, the patient to be predicted can be judged to be a high-risk patient with cardiovascular and cerebrovascular diseases. For example, the blood pressure is too high, the hemoglobin is too high, and there are abnormal symptoms such as fainting, cough and hemoptysis.

Step two: if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases, the patient to be predicted is recommended to be treated in hospitalization;

When the patient to be predicted is a high-risk patient with cardiovascular and cerebrovascular diseases, the number of hospitalization days and the treatment scheme corresponding to the characteristic data of the patient are provided to the patient according to that characteristic data.

In this embodiment, a cardiovascular and cerebrovascular patient treatment database is established in advance, and the database includes: characteristic data of patients, and the number of hospitalization days and the treatment scheme corresponding to the characteristic data. The treatment scheme includes: the dosage and frequency of insulin injection, the dosage and frequency of taking hypotensive drugs, physical exercise, whether surgical treatment is needed, and the like.

For example: for a high-risk cardiovascular and cerebrovascular disease patient with blood pressure: 120-; blood sugar: fasting 7.8-9.0 mmol/L, the corresponding number of hospitalization days is 20 days, and the treatment scheme corresponding to the characteristic data is 1 U of insulin injected per day.

In this embodiment, once the patient is judged to be a high-risk patient with cardiovascular and cerebrovascular diseases, a recommendation for hospitalization is given to the patient to be predicted, which saves the time doctors spend on diagnosis and suggestions and saves medical resources.

Step three: and if the patient to be predicted is not the patient with the high-risk cardiovascular and cerebrovascular diseases, advising that the physical examination frequency of the patient to be predicted is increased.

It can be understood that, in this embodiment, after the patient to be predicted is determined to be a patient with cardiovascular and cerebrovascular diseases, if the health return visit data of the patient shows that the characteristic data of the patient to be predicted falls within the range of the characteristic data of a certain number of patients with cardiovascular and cerebrovascular diseases, and that neither the sign information nor the symptoms of the patient to be predicted are abnormal, then the patient is not a high-risk patient with cardiovascular and cerebrovascular diseases. A physical examination frequency corresponding to the range of the characteristic data may then be recommended to the patient.

For example: the blood pressure range of 100 patients with cardiovascular and cerebrovascular diseases is 100-, the blood pressure of the patient to be predicted is 102-140 mmHg, and the patient does not have life-threatening symptoms such as fainting, cough or hemoptysis, so the patient is not a high-risk patient with cardiovascular and cerebrovascular diseases. Suppose the patient's previous physical examination frequency was once a month and the blood pressure value in the patient's characteristic data ranges from 102 to 140 mmHg. Assuming that the physical examination frequency corresponding to this characteristic data is twice a month, the patient is recommended to have a physical examination twice a month. The embodiment saves the time doctors spend on diagnosis and suggestions and saves medical resources.

Optionally, in an embodiment of the method for predicting a cardiovascular and cerebrovascular disease risk of the present invention, after step S116 of determining that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient, the method further includes:

Step one: determining whether the patient to be predicted is a healthy user according to the health return visit data of the patient;

A healthy user is a patient to be predicted whose characteristic data are all within the medically specified standard ranges. For example: if the medically specified normal blood pressure is 80-90/120-140 mmHg, the health return visit data shows that the blood pressure of the patient to be predicted is 82/125 mmHg, and the other characteristic data of the patient are all within the medically specified standard ranges, then the patient is a healthy user.

Step two: if the patient to be predicted is a healthy user, recommending that the patient keep the normal physical examination frequency;

It is understood that if the patient to be predicted is a healthy user, the physical examination frequency of the healthy user corresponds to the characteristic data of the patient, and the patient is advised to keep the same physical examination frequency as before. For example: if the patient's previous physical examination frequency was once a month, a frequency of once a month is recommended. The embodiment identifies healthy users and gives them appropriate suggestions, which saves the time doctors spend on diagnosis and suggestions and reduces the medical expenditure of the patients.

Step three: if the patient to be predicted is not a healthy user, marking the patient to be predicted as a missed patient, and adding the characteristic data of the missed patient into a patient medical database set;

It is understood that if the patient to be predicted is not a healthy user, whether the patient to be predicted is a cardiovascular and cerebrovascular disease patient can be determined from the health return visit data of the patient. If the patient to be predicted is a cardiovascular and cerebrovascular disease patient, the patient is marked as a missed-diagnosis patient and the characteristic data of the patient is added into the patient medical database set, so that wrong predictions are prevented when patients of the same type are later predicted, and the accuracy of predicting whether a patient to be predicted is a cardiovascular and cerebrovascular disease patient is improved.

Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, the first label identifying a sample of a patient with a cardiovascular and cerebrovascular disease includes:

Step one: determining the identification information of the cardiovascular and cerebrovascular disease patient according to the collected health return visit data of the patient;

The health return visit data of the patient includes: the patient's number, features, characteristic data and confirmed condition; the identification information includes: the confirmed condition, and the features and characteristic data corresponding to the confirmed condition;

Step two: determining, according to the identification information of the patients with cardiovascular and cerebrovascular diseases, the samples of patients with cardiovascular and cerebrovascular diseases in the patient medical database set;

Step three: setting a first label for the samples of patients with cardiovascular and cerebrovascular diseases.

In this embodiment, the samples in the patient medical database set are distinguished by means of the health return visit data of the patients, the samples of patients with cardiovascular and cerebrovascular diseases are determined in the set, and the first label is set, which saves time when determining the label of the input sample.

Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, the second label identifies a sample of a patient with a non-cardiovascular and cerebrovascular disease, and includes:

The present embodiment may use the health return visit data collected one month after the physical examination of the user, and sets labels for the cardiovascular and cerebrovascular disease samples and the non-cardiovascular and cerebrovascular disease samples according to the health return visit data. The label may consist of letters, numbers, symbols, and the like. For example: samples whose health return visit data describe cardiovascular and cerebrovascular diseases such as stroke, hypertension, coronary heart disease, hyperlipidemia, hyperglycemia, cerebral infarction and heart failure are taken as positive samples, a category label field is added, and the label '1' is set. All samples that are not positive samples are taken as negative samples and the label '0' is set, and the samples are added into the patient medical database set.

In this embodiment, the samples in the patient medical database set are distinguished by means of the health return visit data of the patients, the samples of patients without cardiovascular and cerebrovascular diseases are determined in the set, and the second label is set, which saves time when determining the label of the input sample.

Optionally, in an embodiment of the method for predicting risk of a cardiovascular and cerebrovascular disease, in the present invention, S101 obtains a sample set, including:

S201, deleting, from the plurality of labeled samples of the patient medical database set, the samples whose sample missing value is greater than a first threshold;

The first threshold is a value manually specified according to industry experience. The sample missing value is the ratio of the number of missing features in a sample to the total number of features in the sample. For example, if a sample includes 10 features and 7 of them are missing, the sample missing value is 7/10 = 0.7. Assuming that the prescribed first threshold is smaller than this value, the sample is deleted.

In this embodiment, the purpose of deleting the samples whose sample missing value is greater than the first threshold is to reduce the samples with little characteristic data in the patient medical database set, improve the quality of the samples in the patient medical database set, and save time for subsequent processing.
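
As a minimal sketch of step S201, a missing feature is represented here by `None`. That representation, the function names and the threshold value 0.5 are assumptions made for illustration, since the text does not state the threshold.

```python
def sample_missing_value(sample):
    """Ratio of missing features (None) to the total number of features in one sample."""
    return sum(value is None for value in sample.values()) / len(sample)

def drop_sparse_samples(samples, first_threshold):
    """Keep only the samples whose missing value does not exceed the first threshold."""
    return [s for s in samples if sample_missing_value(s) <= first_threshold]

# A sample with 10 features, 7 of which are missing, as in the example above.
sparse = {f"feature_{i}": None for i in range(7)}
sparse.update({"age": 50, "pulse": 72, "gender": "male"})
complete = {f"feature_{i}": 1.0 for i in range(10)}

ratio = sample_missing_value(sparse)
kept = drop_sparse_samples([sparse, complete], first_threshold=0.5)  # 0.5 is hypothetical
```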

S202, searching the plurality of samples remaining after the sample deletion, and deleting the features whose feature missing value is greater than a second threshold;

The feature missing value is: for the same feature across the plurality of samples, the ratio of the number of features lacking characteristic data to the total number of that feature;

The second threshold is a value manually specified according to industry experience. The feature missing value is illustrated below with the same feature, pulse, in 10 samples: the number of pulse features lacking characteristic data is 7 and the total number of pulse features is 10, so the feature missing value is 7/10 = 0.7. Assuming that the prescribed second threshold is smaller than this value, the pulse feature is deleted.

In this embodiment, the purpose of deleting the features whose feature missing value is greater than the second threshold is to reduce the features with little characteristic data in the samples of the patient medical database set, improve the quality of the characteristic data in the samples, and save time for subsequent processing.
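
Step S202 can be sketched in the same way, again with `None` marking missing characteristic data. The threshold 0.5 and all names are hypothetical choices for illustration.

```python
def feature_missing_value(samples, feature):
    """Ratio of samples lacking data for `feature` to the total number of samples."""
    return sum(s.get(feature) is None for s in samples) / len(samples)

def drop_sparse_features(samples, second_threshold):
    """Remove every feature whose missing value exceeds the second threshold."""
    keep = [f for f in samples[0]
            if feature_missing_value(samples, f) <= second_threshold]
    return [{f: s[f] for f in keep} for s in samples]

# 10 samples; the pulse feature lacks data in 7 of them, as in the example above.
patients = [{"age": 40 + i, "pulse": None if i < 7 else 70 + i} for i in range(10)]

pulse_ratio = feature_missing_value(patients, "pulse")
cleaned = drop_sparse_features(patients, second_threshold=0.5)  # 0.5 is hypothetical
```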

S203, searching, in the plurality of samples after the feature deletion, for the features lacking characteristic data, and taking them as first features;

S204, filling the missing values of the first features by using a multiple imputation method;

The missing values are filled by a module built on the multiple imputation method in IBM SPSS Statistics 23. For example, if the blood pressure characteristic data is missing in 2 samples, the module fills the missing values according to the other characteristic data of the patient: age: 50; blood lipid: 1. serum total cholesterol: 2.9-5.17 mmol/L; 2. serum triglycerides: 0.56-1.7 mmol/L; 3. high-density lipoprotein cholesterol: 0.94-2.0 mmol/L; 4. low-density lipoprotein cholesterol: 2.07-3.12 mmol/L; blood sugar: fasting 7.8-9.0 mmol/L; and the filled value of the blood pressure characteristic data in the 2 samples is 100-145 mmHg. The specific filling manner is the same as in the prior art and is not described in detail here.

The embodiment fills the missing values of the characteristic data missing from the plurality of samples, which improves the quality of the samples and thereby the quality of the obtained sample set.

S205, classifying the characteristic data of the plurality of samples filled with the missing values according to the data types to obtain classification results;

wherein, the classification result includes: discrete feature data and continuous feature data;

S206, processing, according to the classification result, the discrete feature data and the continuous feature data in the manner corresponding to their data type;

S207, adding the feature data obtained by the corresponding processing of the discrete feature data and the continuous feature data into the patient medical database set to form a first database set;

The processing of the discrete feature data and the continuous feature data corresponding to the data type includes: performing one-hot encoding on the discrete feature data; and normalizing the continuous feature data by using the Z-score standardization method (normal standardization);

The data types of the characteristic data of the patient include: discrete feature data and continuous feature data. For example, blood pressure data and heartbeat data are of the continuous type, and age data is of the discrete type.

For example, for the discrete feature data, one-hot encoding code applicable to the feature data is written. Taking the encoding of "age" as an example, the age feature is first segmented, according to the number of samples, into 7 intervals: "76 and above", "66-75", "55-65", "46-55", "36-45", "26-35" and "below 25"; if a person is 30 years old, the age value after one-hot encoding is 0000010. Other discrete features similar to the age feature, such as gender, city, occupation, family genetic history, disease history, eating habits, smoking habits, drinking habits and weekly exercise habits, are likewise converted by one-hot encoding.
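
The age example can be sketched as follows. The seven intervals are taken from the text in the order listed; how the boundary ages 55 and 25 are assigned is not stated, so the sketch assumes each interval starts at its lower bound and that every age below 26 falls into the last interval. The names are hypothetical.

```python
# Lower bounds of "76 and above", "66-75", "55-65", "46-55", "36-45", "26-35";
# every age under 26 is placed in the last interval, "below 25" (an assumption).
AGE_LOWER_BOUNDS = [76, 66, 55, 46, 36, 26]

def one_hot_age(age):
    """Return the 7-position one-hot code of an age as a string of 0s and 1s."""
    index = len(AGE_LOWER_BOUNDS)  # default: the last interval
    for i, lower in enumerate(AGE_LOWER_BOUNDS):
        if age >= lower:
            index = i
            break
    code = ["0"] * (len(AGE_LOWER_BOUNDS) + 1)
    code[index] = "1"
    return "".join(code)

code_30 = one_hot_age(30)
```

With these assumptions, `one_hot_age(30)` yields `0000010`, matching the example in the text.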

It is understood that the Z-score standardization of the continuous feature data in this embodiment is performed in the same manner as in the prior art and is not described herein. Because the discrete feature data and the continuous feature data are each processed according to their data type, the uniform data results that would be caused by processing all data with the same method are avoided, and the accuracy of data processing can be improved.
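
For reference, Z-score standardization subtracts the mean of a feature and divides by its standard deviation. The sketch below assumes the population standard deviation and maps a constant feature to zeros; both choices, the function name and the sample values are illustrative.

```python
import statistics

def z_score(values):
    """Standardize one continuous feature to zero mean and unit variance."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation (an assumption)
    if std == 0:
        return [0.0 for _ in values]  # a constant feature carries no spread
    return [(v - mean) / std for v in values]

pulse = [60.0, 70.0, 80.0]  # illustrative pulse values in times/min
pulse_z = z_score(pulse)
```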

S208, performing imbalance processing on the samples of the first database set by using undersampling and the SMOTE algorithm to obtain a second database set;

In this embodiment, the imbalance processing is performed on the samples of the first database set so that the samples of each class are distributed more evenly, thereby obtaining an accurate second database set.
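
A minimal sketch of the two techniques named in S208 follows: random undersampling of the larger class, and SMOTE-style synthesis of new minority samples by interpolating between a minority sample and one of its nearest minority neighbors. How the embodiment combines and parameterizes them is not stated, so the target sizes, `k` and all names here are assumptions.

```python
import random

def undersample(majority, target_size, rng):
    """Randomly keep target_size samples of the majority class."""
    return rng.sample(majority, target_size)

def smote(minority, n_new, k, rng):
    """Create n_new synthetic samples on segments between minority neighbors."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of base by squared Euclidean distance.
        neighbors = sorted(
            (m for m in minority if m is not base),
            key=lambda m: sum((a - b) ** 2 for a, b in zip(base, m)),
        )[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([a + gap * (b - a) for a, b in zip(base, other)])
    return synthetic

rng = random.Random(0)
majority = [[float(i), 0.0] for i in range(8)]
minority = [[1.0, 1.0], [2.0, 3.0]]

kept_majority = undersample(majority, 4, rng)
new_minority = smote(minority, 2, 1, rng)
balanced_minority = minority + new_minority
```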

S209, calculating the variance of the same characteristic data in the second database set by using an analysis of variance method, deleting the characteristic data of which the variance value is smaller than a preset variance threshold value, and obtaining a third database set;

In this embodiment, the feature data whose variance is smaller than the preset variance threshold is deleted, which reduces the data that differs little between samples, and the second database set after this deletion is used as the third database set. It can be understood that: the larger the variance, the larger the difference between the samples, and the higher the accuracy of distinguishing the cardiovascular and cerebrovascular disease samples from the non-cardiovascular and cerebrovascular disease samples.
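
The variance filter of S209 can be sketched as below. The sketch computes the population variance of each feature column and drops the columns under a threshold; the column layout, the names and the threshold 0.01 are assumptions for illustration.

```python
import statistics

def drop_low_variance(columns, variance_threshold):
    """columns maps a feature name to its list of values across samples."""
    return {
        name: values
        for name, values in columns.items()
        if statistics.pvariance(values) >= variance_threshold
    }

columns = {
    "pulse": [0.1, -1.2, 1.1],     # varies between samples
    "constant": [0.5, 0.5, 0.5],   # identical in every sample
}
third_set = drop_low_variance(columns, variance_threshold=0.01)  # 0.01 is hypothetical
```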

S210, calculating, by using the Relief algorithm, the weight of each feature data remaining after the feature data whose variance is smaller than the preset variance threshold has been deleted;

S211, deleting, according to the weight of the feature data and the score value corresponding to that weight, the feature data whose score value is smaller than a preset score threshold together with the corresponding features, to obtain a fourth database set;

In this embodiment, a weight database is established in advance, and the weight database includes: the weights of feature data and the score values corresponding to those weights. According to the weight of each feature data, the score value corresponding to the weight is looked up in the weight database, and each feature data is scored. The feature data whose score value does not exceed the score threshold, and the corresponding features, are deleted from the third database set, where the score threshold is a numerical value set according to industry experience.
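
The weight calculation of S210 can be illustrated with a basic two-class Relief: for each sample, the nearest sample of the same class (hit) and of the other class (miss) are found, and a feature gains weight when it differs more from the miss than from the hit. The sketch assumes numeric features scaled to [0, 1], visits every sample once instead of drawing a random subset, and leaves out the lookup of score values in the weight database; all names and data are hypothetical.

```python
def relief_weights(X, y):
    """Basic Relief feature weights for two classes with features in [0, 1]."""
    n_features = len(X[0])
    weights = [0.0] * n_features

    def dist(a, b):
        return sum(abs(p - q) for p, q in zip(a, b))

    for i, (xi, yi) in enumerate(zip(X, y)):
        hits = [X[j] for j in range(len(X)) if j != i and y[j] == yi]
        misses = [X[j] for j in range(len(X)) if y[j] != yi]
        hit = min(hits, key=lambda h: dist(xi, h))
        miss = min(misses, key=lambda m: dist(xi, m))
        for f in range(n_features):
            # Reward separation from the nearest miss, penalize distance to the nearest hit.
            weights[f] += (abs(xi[f] - miss[f]) - abs(xi[f] - hit[f])) / len(X)
    return weights

# Feature 0 separates the two classes; feature 1 does not.
X = [[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 0.8]]
y = [0, 0, 1, 1]
weights = relief_weights(X, y)
```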

And S212, determining a sample set by using a forward selection method according to the fourth database set.

It is to be understood that, in the process of determining the sample set by using the forward selection method, the evaluation function may be used to evaluate the feature data corresponding to each feature in the fourth database set, and determine the evaluation function value of the feature data corresponding to each feature.

In some examples, for feature data corresponding to the same feature, a sample corresponding to feature data corresponding to a feature having the same evaluation function value may be used as a model sample set, where the model sample set includes a plurality of samples, and the plurality of samples includes: at least one same characteristic and characteristic data corresponding to the characteristic; and then evaluating each model sample set, and finally selecting the model sample set with the highest evaluation function value as the sample set. Evaluating each model sample set may include: calculating the average value of the evaluation function values of all the characteristic data in the model sample set, or selecting the average value of the characteristic data related to cardiovascular and cerebrovascular diseases in the model sample set, and evaluating each model sample set can also use the method of the evaluation set in the prior art, which is not described herein again.

The following example illustrates: suppose that there are 3 feature data: blood pressure: 100-; pulse: 60-100 times/min; blood sugar: fasting 7.8-9.0 mmol/L. The evaluation function values of the three feature data are 64, 78 and 12 respectively. The samples with pulse: 60-100 times/min form model sample set 1; the samples with blood pressure: 100-145 mmHg form model sample set 2; the samples with blood sugar: fasting 7.8-9.0 mmol/L form model sample set 3. The average value of the feature data evaluation function values is 50 points in model sample set 1, 45 points in model sample set 2 and 65 points in model sample set 3, so model sample set 3 is used as the sample set.

According to the embodiment, the quality of the sample is improved by preprocessing the sample of the patient medical database set and the characteristic data in the sample, so that the quality of the sample set can be improved.

Optionally, in step S105, according to the global metric matrix, calculating distances between the input samples and the samples in the sample set by using a cosine similarity algorithm, and forming a first distance set, where the method includes:

wherein the cosine similarity algorithm formula is as follows:

Where i represents the index of the input sample and x_i represents the i-th input sample; the sample set is X; the global metric matrix is A; M = A^T A; j represents the index of a sample in the sample set and x_j represents the j-th sample in the sample set; i and j are positive integers; d(x_i, x_j) represents the distance between the input sample x_i and the j-th sample in the set X under the global metric matrix; and A(x_i, x_j) represents the distance between x_i and x_j after transformation by the matrix A.
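
One form of cosine distance consistent with the symbols defined above can be written as follows. This is offered only as an assumption, in particular the choice of one minus the cosine as the distance and the identification of d with A(x_i, x_j); it is not necessarily the exact expression used in the embodiment.

```latex
d(x_i, x_j) = A(x_i, x_j)
            = 1 - \frac{x_i^{T} M x_j}{\sqrt{x_i^{T} M x_i}\,\sqrt{x_j^{T} M x_j}},
\qquad M = A^{T} A
```

Under this form, two samples whose transformed vectors point in the same direction have distance 0, which agrees with the earlier remark that the distance between samples of the same class may be 0.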

Optionally, in step S110, according to a local measurement matrix of the target local cluster obtained by learning the COS-LMNN algorithm, calculating by using a cosine similarity algorithm, inputting distances between samples and samples in the target local cluster, and forming a second distance set, where the method includes:

calculating the distance between the input sample and the sample in the sample set by using a cosine similarity algorithm formula;

the cosine similarity algorithm formula is as follows:

where i is the index of the input sample and x_i denotes the i-th input sample; x_si denotes a sample of the same class as i; A_S is the local metric matrix, with M_S = A_S^T A_S; i is a positive integer; d(x_i, x_si) denotes the distance, under the local metric matrix, between the input sample x_i and a sample of the same class as i in the target local cluster; and A_S(x_i, x_si) denotes the distance between x_i and x_si after transformation by the matrix A_S.

As shown in fig. 3, a cardiovascular and cerebrovascular disease risk prediction apparatus provided in an embodiment of the present invention includes:

a set obtaining module 301, configured to obtain a sample set;

the sample set is determined according to a plurality of samples of the patient medical database set with the set labels; one sample includes: patient's number, characteristics and characteristic data; the label includes: a first tag and a second tag; the first label identifies a sample of a patient with cardiovascular disease; the second label identifies a non-cardiovascular patient sample;

a sample obtaining module 302, configured to obtain an input sample;

the matrix calculation module 303 is configured to perform metric learning by using a cosine-large-interval nearest neighbor COS-LMNN algorithm to obtain a global metric matrix of the sample set;

a first local cluster determining module 304, configured to divide the samples in the sample set into a preset number of local clusters by using a preset clustering algorithm;

a first distance determining module 305, configured to calculate, according to the global metric matrix, distances between the input samples and the samples in the sample set by using a cosine similarity algorithm, so as to form a first distance set;

a first sample determining module 306, configured to calculate, according to a preset first K value and the first distance set, first neighboring samples of the input sample by using a K-nearest neighbor algorithm, the number of first neighboring samples being equal to the first K value;

a second local cluster determining module 307, configured to determine a local cluster where the first neighboring sample is located;

a target local cluster determining module 308, configured to select, as a target local cluster, a local cluster in which the number of first neighboring samples exceeds a first preset threshold from among local clusters in which neighboring samples are located;

a local cluster dividing module 309, configured to assign the input sample to the target local cluster;

the second distance determining module 310 is configured to calculate, by using a cosine similarity algorithm and according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster, so as to form a second distance set;

a second sample determining module 311, configured to determine, in the target local cluster, second neighboring samples of the input sample according to a preset second K value and the second distance set by using a K-nearest neighbor algorithm, the number of second neighboring samples being equal to the second K value;

a counting module 312, configured to count the number of the first labels and the number of the second labels of the second neighboring samples;

the label determining module 313 is configured to use the first label as a label of the input sample if a ratio of the number of the first labels to the number of the second labels exceeds a preset label threshold, and otherwise use the second label as a label of the input sample;

a patient sample determination module 314, configured to determine whether the input sample is a sample of a patient with a cardiovascular disease according to the label of the input sample;

a cardiovascular and cerebrovascular disease patient determination module 315, configured to determine that the patient to be predicted in the input sample is a cardiovascular and cerebrovascular disease patient if the input sample is a cardiovascular and cerebrovascular disease patient sample;

a non-cardiovascular and cerebrovascular disease patient determination module 316, configured to determine that the patient to be predicted in the input sample is not a cardiovascular and cerebrovascular disease patient if the input sample is not a cardiovascular and cerebrovascular disease patient sample.
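The data flow through modules 305 to 313 can be sketched as follows. This is a simplified illustration under several assumptions: the cluster assignments and the metric matrices are passed in ready-made (the clustering and COS-LMNN learning steps are not shown), the distance is taken to be one minus the cosine similarity after the linear map, the target cluster is chosen as the cluster holding the most first neighboring samples, and all names (`predict_label`, `cosine_distances`, the threshold parameters) are hypothetical.

```python
import numpy as np
from collections import Counter

FIRST_LABEL, SECOND_LABEL = 1, 0  # 1: patient sample, 0: non-patient sample


def cosine_distances(x, X, A):
    """Cosine distance from x to every row of X after the linear map A."""
    Ax, AX = A @ x, X @ A.T
    sims = (AX @ Ax) / (np.linalg.norm(AX, axis=1) * np.linalg.norm(Ax))
    return 1.0 - sims


def predict_label(x, X, y, clusters, A_global, A_local, k1, k2,
                  cluster_threshold, label_threshold):
    # Stage 1: k1 nearest neighbours in the whole sample set (global metric).
    d1 = cosine_distances(x, X, A_global)
    first_neighbours = np.argsort(d1)[:k1]

    # Target local cluster: the cluster holding the most first neighbours,
    # accepted only if that count exceeds the first preset threshold.
    counts = Counter(clusters[first_neighbours].tolist())
    target, count = counts.most_common(1)[0]
    if count <= cluster_threshold:
        raise ValueError("no local cluster exceeds the threshold")

    # Stage 2: k2 nearest neighbours inside the target cluster (local metric).
    members = np.flatnonzero(clusters == target)
    d2 = cosine_distances(x, X[members], A_local[target])
    second_neighbours = members[np.argsort(d2)[:k2]]

    # Label vote: ratio of first-label count to second-label count.
    n_first = int(np.sum(y[second_neighbours] == FIRST_LABEL))
    n_second = int(np.sum(y[second_neighbours] == SECOND_LABEL))
    ratio = float("inf") if n_second == 0 else n_first / n_second
    return FIRST_LABEL if ratio > label_threshold else SECOND_LABEL


# Toy data: two clusters of three samples; identity matrices are placeholders
# for the learned global and local metric matrices.
X = np.array([[1.0, 0.1], [1.0, 0.2], [0.9, 0.1],
              [0.1, 1.0], [0.2, 1.0], [0.1, 0.9]])
y = np.array([1, 1, 0, 0, 0, 1])
clusters = np.array([0, 0, 0, 1, 1, 1])
A_local = {0: np.eye(2), 1: np.eye(2)}
label = predict_label(np.array([1.0, 0.15]), X, y, clusters, np.eye(2),
                      A_local, k1=3, k2=3, cluster_threshold=1,
                      label_threshold=1.0)
```

In the toy data the three first neighboring samples all fall in cluster 0, whose members carry two first labels and one second label, so the ratio 2 exceeds the label threshold of 1 and the first label is returned. Treating a zero count of second labels as an infinite ratio is a choice made here to avoid division by zero; the text does not address that case.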

Optionally, the device for predicting risk of cardiovascular and cerebrovascular diseases provided in the embodiment of the present invention further includes:

the high-risk determining module is used for determining whether the patient to be predicted is a high-risk cardiovascular and cerebrovascular disease patient according to the health return visit data of the patient;

the hospitalization suggestion module is used for recommending hospitalization for the patient to be predicted if the patient to be predicted is a patient with high risk of cardiovascular and cerebrovascular diseases.

Optionally, the set obtaining module 301 includes:

the data processing submodule is used for processing the discrete feature data and the continuous feature data corresponding to the data type according to the classification result;

the variance deleting submodule is used for calculating the variance of the same characteristic data in the second database set by using an analysis of variance method, deleting the characteristic data of which the variance value is smaller than a preset variance threshold value, and obtaining a third database set;

the weight calculation submodule is used for calculating the weight of each feature data in the third database set by using the Relief algorithm;

the score deletion submodule is used for deleting the feature data with the score value smaller than a preset score threshold value and the corresponding features in the third database set according to the weight of the feature data and the score value corresponding to the weight of the feature data to obtain a fourth database set;

and the set determining submodule is used for determining a sample set by using a forward selection method according to the fourth database set.
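The variance-based deletion step can be illustrated with a short Python sketch. It assumes that each column of a NumPy array holds one feature and that the population variance is used; the function name, the feature names and the threshold value are hypothetical, and the Relief weighting and forward selection steps are not shown.

```python
import numpy as np


def drop_low_variance_features(data, feature_names, variance_threshold):
    """Keep only the feature columns whose variance reaches the threshold."""
    variances = data.var(axis=0)  # population variance of each column
    keep = variances >= variance_threshold
    kept_names = [name for name, flag in zip(feature_names, keep) if flag]
    return data[:, keep], kept_names


# Toy data: four samples, three features; the middle feature never varies.
data = np.array([
    [120.0, 1.0, 5.6],
    [135.0, 1.0, 7.9],
    [110.0, 1.0, 6.1],
    [145.0, 1.0, 8.8],
])
names = ["blood_pressure", "constant_flag", "blood_sugar"]
filtered, kept = drop_low_variance_features(data, names,
                                            variance_threshold=0.5)
```

A feature that takes the same value for every patient has zero variance and cannot help separate the two labels, which is why it is removed before the weights are calculated.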

The cardiovascular and cerebrovascular disease risk prediction device of this embodiment further includes:

the missed-diagnosis patient determining module is used for marking the patient to be predicted as a missed-diagnosis patient if the patient to be predicted is not a healthy user, and adding the feature data of the missed-diagnosis patient to the patient medical database set.

Optionally, the first distance determining module is specifically configured to calculate the distances between the input sample and the samples in the sample set by using a cosine similarity algorithm formula according to the global metric matrix, so as to form a first distance set;

the cosine similarity algorithm formula is as follows:

where i is the index of the input sample and x_i denotes the i-th input sample; X is the sample set; A is the global metric matrix, with M = A^T A; j is the index of a sample in the sample set and x_j denotes the j-th sample in the sample set; i and j are positive integers; d(x_i, x_j) denotes the distance, under the global metric matrix, between the input sample x_i and the j-th sample in the set X; and A(x_i, x_j) denotes the distance between x_i and x_j after transformation by the matrix A.

Optionally, the second distance determining module is specifically configured to:

the cosine similarity algorithm formula is as follows:

An embodiment of the present invention further provides an electronic device, as shown in fig. 4, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 complete mutual communication through the communication bus 404,

a memory 403 for storing a computer program;

the processor 401, when executing the program stored in the memory 403, implements the following steps:

obtaining a sample set; the sample set is determined according to a plurality of samples of the patient medical database set with the set labels; one sample includes: patient's number, characteristics and characteristic data; the label includes: a first tag and a second tag; the first label identifies a sample of a patient with cardiovascular disease; the second label identifies a non-cardiovascular patient sample;

calculating the distances between the input sample and the samples in the sample set by using a cosine similarity algorithm according to the global metric matrix, so as to form a first distance set;

calculating first neighboring samples of the input sample by using a K-nearest neighbor algorithm according to a preset first K value and the first distance set, the number of first neighboring samples being equal to the first K value;

determining a local cluster in which the first neighboring sample is located;

selecting local clusters with the number of first adjacent samples exceeding a first preset threshold value from local clusters where the adjacent samples are located as target local clusters;

assigning the input sample to the target local cluster;

calculating, by using a cosine similarity algorithm and according to the local metric matrix of the target local cluster learned by the COS-LMNN algorithm, the distances between the input sample and the samples in the target local cluster, so as to form a second distance set;

determining second neighboring samples of the input sample in the target local cluster by using a K-nearest neighbor algorithm according to a preset second K value and the second distance set, the number of second neighboring samples being equal to the second K value;

The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic equipment and other equipment.

The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but also Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components.

In yet another embodiment of the present invention, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the computer is caused to execute a cardiovascular disease risk prediction method as described in any one of the above embodiments.

In yet another embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform a method for cardiovascular disease risk prediction as described in any of the above embodiments.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, as for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A cardiovascular disease risk prediction device, the device comprising:

the set acquisition module is used for acquiring a sample set;

the label determining module is used for taking the first label as the label of the input sample if the ratio of the number of the first labels to the number of the second labels of the second adjacent sample exceeds a preset label threshold value, and otherwise, taking the second label as the label of the input sample;

the non-cardiovascular and cerebrovascular disease patient determination module is used for determining that the patient to be predicted in the input sample is not the patient with the cardiovascular and cerebrovascular diseases if the input sample is not the sample of the patient with the cardiovascular and cerebrovascular diseases;

the set acquisition module includes:

the equalization processing submodule is used for carrying out imbalance processing on the samples of the first database set by using an undersampling and SMOTE algorithm to obtain a second database set;

and the set determining submodule determines a sample set by using a forward selection method according to the fourth database set.

2. The apparatus of claim 1, further comprising: