CN113223701A - Sudden heart disease prediction method based on Transformer-MHP model - Google Patents
- Publication number
- CN113223701A CN202110531057.4A
- Authority
- CN
- China
- Prior art keywords
- data
- model
- transformer
- heart disease
- mhp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/20—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/30—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for calculating health indices; for individual health risk assessment
Abstract
The invention discloses a method for predicting sudden heart disease based on a Transformer-MHP model, comprising four parts: data preprocessing, feature analysis, model construction and training, and performance evaluation. First, the acquired heart disease data samples are preprocessed; next, principal component analysis (PCA) is used to reduce the dimensionality of the data set; finally, a Spearman correlation analysis algorithm is used to screen fourteen feature attributes for model training. The Transformer algorithm has mainly been applied to natural language processing, where it has achieved remarkable results. The invention improves on the traditional Transformer framework and, by combining it with a highly scalable parallel processing algorithm, proposes a new Transformer-MHP algorithm model for predicting the probability of sudden heart disease in the AI medical field, so as to help improve the efficiency and accuracy of medical work. Finally, the performance of the model is evaluated experimentally, and the results show that the Transformer-MHP heart disease prediction algorithm has better accuracy and interpretability than traditional algorithms.
Description
Technical Field
The invention belongs to the field of AI medical treatment, and particularly relates to a method for predicting sudden heart disease based on a Transformer-MHP model.
Background
Changes in population size and structure, together with uncontrollable environmental factors, have put the medical industry under increasing pressure. At the same time, with the breakthrough and popularization of artificial intelligence technology, its application scenarios are becoming richer and more general. Drawing on the high-performance, high-efficiency data processing of computers combined with big data analysis and deep learning, artificial intelligence has changed medical practice to a great extent, markedly reducing cost and improving efficiency.
At present, machine learning algorithms such as MLP (multilayer perceptron), decision trees, SVM (support vector machine), and K-Means are used to construct prediction models in the field of sudden heart disease prediction, but training results show that these algorithms have certain defects and leave room for improving model accuracy and efficiency. It is therefore necessary to construct an efficient machine learning algorithm to assist the prediction of sudden heart disease.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, provides a novel sudden heart disease prediction method based on a Transformer-MHP model, constructs an artificial intelligence technology system fusing medical health and modern science, and provides auxiliary support for predicting sudden heart disease and improving the efficiency and accuracy of medical staff.
The Transformer is a model proposed in 2017 by researchers at Google for seq2seq tasks. Unlike models built on sequence-aligned RNNs or on CNNs, the traditional Transformer relies entirely on the self-attention mechanism to compute representations of its input and output, so it can capture global information directly and perform computation in parallel.
The technical scheme of the invention is as follows:
The sudden heart disease prediction method based on the Transformer-MHP model comprises the following steps:
(1) collecting a considerable number of paper or electronic medical records, and screening a heart disease data set;
(2) preprocessing the data for subsequent feature analysis;
(3) performing data analysis on the data set processed in step (2) to obtain a training set D of size N for the algorithm;
(4) selecting, with reference to expert advice, the fourteen feature attributes with the highest correlation coefficients from the training set D obtained in step (3), and constructing a new training set D' for subsequent model training;
(5) constructing a model using Transformer-MHP, and training the model;
(6) testing and evaluating the Transformer-MHP model obtained in step (5) to verify the accuracy and interpretability of the model.
Further, the step (2) is specifically as follows:
2.1) Missing values in all attributes of the obtained data samples are handled by mean interpolation. If the attribute is measured on a continuous (interval) scale, the missing value is filled with the mean of the attribute's valid values; if the attribute is measured in numerical grades, the missing value is filled with the mode of the attribute's valid values.
2.2) The data obtained in step 2.1) are standardized using the Z-Score standardization method, so that each processed attribute has zero mean and unit standard deviation (the scale of the standard normal distribution), which eliminates errors caused by differing dimensions.
2.3) Based on the idea of the Isolation Forest anomaly detection algorithm, the heart disease data set is recursively and randomly partitioned and local models are built, each isolation tree being used to identify a specific attribute subsample. The anomaly score of each sample point is calculated and sorted, and sample points whose anomaly score is close to 1 are classified as outliers. Sample points marked as outliers are deleted directly, thereby removing anomalous data that are sparsely distributed and far from high-density populations.
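Steps 2.1) and 2.2) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the helper names `impute` and `z_score`, the `kind` flag, and the sample values are hypothetical, and the population standard deviation is assumed for the Z-Score.

```python
from statistics import mean, mode, pstdev

def impute(column, kind):
    """Fill None entries: mean for continuous attributes, mode for graded ones."""
    valid = [v for v in column if v is not None]
    fill = mean(valid) if kind == "continuous" else mode(valid)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]

# Hypothetical values, for illustration only.
chol = impute([200, None, 240, 220], "continuous")  # missing value -> mean
cp = impute([1, 4, None, 4], "graded")              # missing value -> mode
chol_z = z_score(chol)
```

Standardizing after imputation, as in the order of steps 2.1) and 2.2), keeps the filled values on the same scale as the observed ones.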
Further, step (3) is specifically as follows: based on the idea of the PCA (principal component analysis) data processing algorithm, dimensionality reduction is performed on the high-dimensional heart disease data set, retaining important features and removing noise, thereby reducing the attribute indexes to be analyzed and increasing data processing speed. After the data are standardized, the eigenvalues λ of the sample covariance matrix and their corresponding eigenvectors are calculated, the eigenvalues are sorted from largest to smallest, the first k eigenvalues are selected in that order, and the k corresponding eigenvectors are taken to obtain the set $\{(\lambda_1, u_1), (\lambda_2, u_2), \ldots, (\lambda_k, u_k)\}$. The original features are then projected onto the selected eigenvectors to obtain the k-dimensional features after dimensionality reduction.
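The eigen-decomposition and projection described in step (3) can be sketched with NumPy as follows. The function name `pca_project` and the random input are assumptions made for illustration; the patent does not specify an implementation or a value of k.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k covariance eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                  # center each attribute
    cov = np.cov(Xc, rowvar=False)           # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # symmetric input, eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest eigenvalues
    return Xc @ eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                # hypothetical standardized data
Z, lam = pca_project(X, k=2)                 # Z holds the k-dimensional features
```

Each column of `Z` has sample variance equal to the corresponding eigenvalue in `lam`, which is why sorting by eigenvalue keeps the directions that carry the most variance.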
Further, step (4) is specifically as follows: based on the idea of the Spearman correlation analysis algorithm, the degree of correlation between the rank-ordered heart disease attribute variables is measured to obtain rank correlation coefficients, and the correlation of the attribute variables is evaluated in order to perform feature selection and remove attributes that are irrelevant or reduce model accuracy. In the invention, fourteen feature attributes are selected from the heart disease data set and form a new training set for model training.
Further, the step (5) is specifically as follows:
5.1) The encoder-decoder layer is improved on the basis of the encoder-decoder structure of the traditional Transformer. The Transformer framework was proposed for the NLP field: its input and output layers are suited only to processing words, and it adds a Positional Encoding preprocessing mechanism to supply relative position information. Its internal structure is now modified so that it fully matches, and can process, the input of the fourteen heart disease feature attributes.
5.2) Based on the idea of the highly scalable MHP parallel analysis algorithm, the target data fed to the decoder and the input data fed to the encoder are processed as parallel data streams. False data races are filtered out during data processing, which greatly reduces model overhead and improves model efficiency.
5.3) Model training is carried out based on the idea of the Wrapper recursive feature elimination algorithm. The base model is trained for multiple rounds; after each round, several features with the smallest weight coefficients are removed and the next round is trained on the new feature set, iterating until the required number of features is reached.
5.4) The training set obtained in step (4) is fed into the Transformer-MHP machine learning model proposed by the invention and, with the data label as the output variable Y, training yields the recursion formula relating the input variables to the output variable.
Further, step (6) is specifically as follows: evaluation indexes of the model such as accuracy and recall are calculated from the test set and the confusion matrix, and the performance of multiple models on the sudden heart disease probability prediction problem is compared against the expected final prediction performance. The Transformer-MHP heart disease prediction model has better accuracy and generalization than traditional models.
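The evaluation indexes named in step (6) follow directly from the four cells of a binary confusion matrix. The sketch below uses hypothetical labels and predictions, not results from the patent; label 1 is assumed to denote the positive (sudden heart disease) class.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    """Share of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

def recall(tp, tn, fp, fn):
    """Share of true positive cases that the model found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical test-set labels and model predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
acc, rec = accuracy(*counts), recall(*counts)
```

In a screening setting, recall is usually read alongside accuracy, because a missed positive case (FN) and a false alarm (FP) have different costs.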
Further, the Spearman rank correlation data analysis method in the step (4) comprises the following steps:
4.1) The data of the two variables (X, Y) are ranked to obtain their positions after sorting (X', Y'), i.e. their ranks; $d_i$ is the difference between the two ranks of the i-th observation, and n is the number of data in each variable.
4.2) The Spearman rank correlation coefficient is calculated:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
4.3) The Spearman rank correlation coefficients are analyzed, and fourteen feature attributes are finally determined through screening and investigation: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect type (thal), and target (target). The value of "sex" is 1 for male and 0 for female; the value of "cp" is 1 for typical angina pectoris, 2 for atypical angina pectoris, 3 for non-anginal pain, and 4 for no symptoms; the value of "restecg" is 0 for normal, 1 for ST-T wave abnormality, and 2 for left ventricular hypertrophy; the value of "exang" is 1 for yes and 0 for no; the value of "slope" is 1 for an upward slope, 2 for a flat slope, and 3 for a downward slope; the value of "ca" ranges from 0 to 3; the value of "thal" is 0 for normal, 1 for fixed defect, and 2 for reversible defect; the value of "target" is 0 for a lower chance of heart attack and 1 for a higher chance of heart attack.
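Steps 4.1) and 4.2) can be sketched in plain Python as follows. The function names and sample values are hypothetical, and the simple rank-difference formula assumes there are no tied values; attributes with many ties (such as the categorical ones listed above) would need average ranks and the general Pearson-on-ranks form instead.

```python
def ranks(values):
    """Assign ranks 1..n by ascending value (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Hypothetical values: heart rate falls as age rises, so the ranks are reversed.
age = [29, 41, 54, 63, 70]
thalach = [190, 172, 160, 150, 131]
rho = spearman(age, thalach)
```

Here every rank in `thalach` is the mirror image of the rank in `age`, so `rho` is -1, the strongest possible negative monotonic relationship.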
Further, step 5.1) is specifically as follows: the Input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced by the input of the fourteen data attribute features. After input, layer normalization is applied first to normalize the input values, which speeds up training and improves training stability. The normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and distributes them. After processing by the multi-head self-attention layer, the data are passed to a fully connected feedforward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.
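The encoder path described above (layer normalization followed by self-attention over the fourteen attributes) can be illustrated with a small NumPy sketch. Several choices here are assumptions that the patent text does not specify: each attribute is treated as a one-dimensional token, a single attention head stands in for the multi-head layer, the projection width `d` is 8, and the weights are random rather than trained.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize along the last axis to zero mean and (near) unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Scaled dot-product self-attention: softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
features = rng.normal(size=14)             # one hypothetical patient record
tokens = layer_norm(features)[:, None]     # shape (14, 1): one token per attribute
d = 8                                      # assumed projection width
Wq, Wk, Wv = (rng.normal(size=(1, d)) for _ in range(3))
out, weights = self_attention(tokens, Wq, Wk, Wv)
```

Each row of `weights` sums to 1 and describes how strongly one attribute attends to the other thirteen, which is the quantity the text refers to as the self-attention weight.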
Further, step 5.3) is specifically as follows: for a prediction model that assigns weights to features, RFE selects the required features by recursively shrinking the feature set. Each feature is first assigned a weight, and the prediction model is trained on these original features. Once the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute value is eliminated. This cycle is repeated recursively until the number of remaining features reaches the required number.
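The elimination loop can be sketched as follows. As an explicit substitution, an ordinary least-squares linear model stands in for the Transformer-MHP base model, since the patent does not say how feature weights are read out of the network; the function name `rfe` and the synthetic data are likewise hypothetical. Features are assumed to be standardized so that weight magnitudes are comparable.

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination with a least-squares linear base model."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Refit on the surviving features and read off one weight per feature.
        w, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(w)))   # smallest absolute weight
        del remaining[weakest]                # eliminate it, then repeat
    return remaining

# Synthetic data: only columns 0 and 3 actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=200)
selected = rfe(X, y, n_keep=2)
```

Refitting after every removal matters: weights of correlated features can change once a neighbor is dropped, which is the reason the procedure is recursive rather than a single ranking.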
The invention has the beneficial effects that:
The invention applies the Transformer algorithm model to the field of sudden heart disease prediction and, through algorithmic innovation and improvement, proposes a new Transformer-MHP algorithm model that can improve medical quality and service efficiency, reduce misdiagnosis and mistreatment, and contribute to the interdisciplinary combination of medicine and artificial intelligence.
Drawings
FIG. 1 is the overall structure of the Transformer-MHP model;
FIG. 2 is a schematic diagram of the Transformer-MHP encoder-decoder architecture;
FIG. 3 is a schematic structural diagram of the dot-product attention module.
Detailed Description
To make the technical means, inventive features, objectives, and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
The sudden heart disease prediction method based on the Transformer-MHP model comprises the following steps:
(1) a considerable number of paper or electronic medical records are collected and heart disease data set screening is performed.
(2) Data preprocessing. First, missing values are handled by mean interpolation; the data are then standardized using the Z-Score standardization method so that the processed data have zero mean and unit standard deviation, eliminating errors caused by differing dimensions. Finally, anomalous data are deleted using the Isolation Forest anomaly detection algorithm. The heart disease data set is recursively and randomly partitioned and local models are built, each isolation tree being used to identify a specific attribute subsample. The anomaly score of each sample point is calculated and sorted, outliers are identified, and the sample points marked as outliers are deleted, thereby removing anomalous data that are sparsely distributed and far from high-density populations before the subsequent feature analysis. During training, each isolation tree randomly selects a subset of the samples, and the algorithm becomes more stable as the number of trees increases. Because no distance or density indexes need to be calculated, speed is greatly improved and system overhead is reduced.
(3) Dimensionality reduction is performed on the high-dimensional heart disease data set based on PCA (principal component analysis), retaining important features and removing noise, thereby reducing the attribute indexes to be analyzed. Dimensionality reduction simplifies the model and compresses the data while preserving as much of the original information as possible, which effectively addresses the large volume, high dimensionality, and difficulty of analysis of medical data and reduces computational cost.
(4) Based on the idea of the Spearman correlation analysis algorithm, the degree of correlation between the rank-ordered heart disease attribute variables is measured to obtain rank correlation coefficients, and the correlation of the attribute variables is evaluated in order to perform feature selection and remove attributes that are irrelevant or reduce model accuracy. With reference to expert advice, fourteen highly correlated feature attributes are finally selected from the heart disease data set and form a new training set for model training.
(5) Based on the traditional Transformer framework, a new Transformer-MHP algorithm model is constructed by combining it with a highly scalable parallel processing algorithm. The input/output mode and the relative position vector processing mechanism in the encoder-decoder layer, originally suited to the NLP field, are improved so that the layer fully matches and can process the input and output of the fourteen heart disease feature attributes. The target data fed to the decoder and the input data fed to the encoder can be processed in parallel, and false data races are filtered out during data processing, which greatly reduces model overhead and improves model efficiency. The resulting Transformer-MHP algorithm model is then trained based on the Wrapper recursive feature elimination algorithm: the base model is trained for multiple rounds; after each round, several features with the smallest weight coefficients are removed and the next round is trained on the new feature set, iterating until the required number of features is reached. Finally, the training set obtained in step (4) is fed into the machine learning model and, with the data label as the output variable Y, training yields the recursion formula relating the input variables to the output variable.
(6) Evaluation indexes of the model such as accuracy and recall are calculated from the test set and the confusion matrix, and the performance of multiple models on the sudden heart disease probability prediction problem is compared. The confusion matrix is a standard visual format for presenting the accuracy evaluation of a model, and it shows simply and clearly how well the predicted values agree with the true values. Comparing the model evaluation indexes demonstrates that the Transformer-MHP heart disease prediction model proposed by the invention has better accuracy and generalization than traditional models.
The step (4) of constructing the training set containing fourteen feature attributes comprises the following steps:
(41) The data of the two variables (X, Y) are ranked to obtain their positions after sorting (X', Y'), i.e. their ranks; $d_i$ is the difference between the two ranks of the i-th observation, and n is the number of data in each variable.
(42) The Spearman rank correlation coefficient is calculated:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$
(43) The Spearman rank correlation coefficients are analyzed, and fourteen feature attributes are finally determined through screening and investigation: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect type (thal), and target (target). The value of "sex" is 1 for male and 0 for female; the value of "cp" is 1 for typical angina pectoris, 2 for atypical angina pectoris, 3 for non-anginal pain, and 4 for no symptoms; the value of "restecg" is 0 for normal, 1 for ST-T wave abnormality, and 2 for left ventricular hypertrophy; the value of "exang" is 1 for yes and 0 for no; the value of "slope" is 1 for an upward slope, 2 for a flat slope, and 3 for a downward slope; the value of "ca" ranges from 0 to 3; the value of "thal" is 0 for normal, 1 for fixed defect, and 2 for reversible defect; the value of "target" is 0 for a lower chance of heart attack and 1 for a higher chance of heart attack.
In step (5), improving the encoder-decoder layer based on the traditional Transformer structure comprises the following: the Input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced by the input of the fourteen data attribute features. After input, layer normalization is applied first to normalize the input values, which speeds up training and improves training stability. The normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and distributes them. After processing by the multi-head self-attention layer, the data are passed to a fully connected feedforward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.
The model training based on the Wrapper recursive feature elimination algorithm in step (5) comprises the following: for a prediction model that assigns weights to features, RFE selects the required features by recursively shrinking the feature set. Each feature is first assigned a weight, and the prediction model is trained on these original features. Once the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute value is eliminated. This cycle is repeated recursively until the number of remaining features reaches the required number.
Examples
As shown in fig. 1, the method of the invention comprises the following steps:
S1: Data collection. A considerable number of paper or electronic medical records are collected and heart disease data set screening is performed.
S2: Data preprocessing.
S3: Dimensionality reduction is performed on the high-dimensional heart disease data set based on PCA (principal component analysis) to simplify the model and compress the data.
S4: Feature selection is performed based on the Spearman correlation analysis algorithm, removing attributes that are irrelevant or reduce model accuracy. With reference to expert advice, fourteen highly correlated feature attributes are finally selected from the heart disease data set and form a new training set for model training.
S5: The model is constructed using Transformer-MHP, as shown in FIG. 1, and trained.
S6: the Transformer-MHP model provided by the invention is tested and evaluated, and compared with a basic model to verify the accuracy and generalization of the model.
Wherein, in S1, collecting and selecting data comprises the following steps:
S101: The UCI heart disease data set is selected; it contains 76 attributes, of which 14 are used by the present invention.
S102: Medical record data and clinical data are collected from the 408 beds of the hospital's cardiac surgery department, covering 2016 to 2021 and totaling 5126 records. 5000 records are finally selected for the model experiments and stored in an Excel spreadsheet, with 76 feature attributes.
Wherein, in S2, the data preprocessing includes the following steps:
S201: Missing value processing is performed on all attributes of the obtained data samples using mean interpolation. If an attribute is measured on a continuous (interval) scale, its missing values are interpolated with the mean of the attribute's valid values; if an attribute is measured on a graded scale, its missing values are interpolated with the mode of the attribute's valid values. In total, 32 records have missing attribute values, all of which are filled in manually.
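A minimal sketch of the interpolation rule in S201 is given below. The function name `impute`, the `kind` labels, and the sample values are hypothetical; missing entries are represented here by `None`, which is an assumption about the data layout.

```python
from statistics import mean, mode

def impute(values, kind):
    """Fill None entries: mean for continuous attributes, mode for graded ones."""
    valid = [v for v in values if v is not None]
    fill = mean(valid) if kind == "continuous" else mode(valid)
    return [fill if v is None else v for v in values]

# Hypothetical attribute columns, not taken from the patent's data set.
chol = impute([233, None, 250, 204], "continuous")  # serum cholesterol
cp = impute([1, 4, None, 4, 2], "graded")           # chest pain type
```

The mean is used only where averaging is meaningful; for coded attributes such as `cp`, the mode keeps the filled value inside the set of valid codes.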
S202: The data is standardized using the Z-Score method, which requires the mean and standard deviation of the raw data. Each standardized value is the original value minus the mean, divided by the standard deviation.
Here μ is the overall mean of the data, x is an observed value, σ is the overall standard deviation, N is the number of samples of the attribute, and j indexes the sample points of the attribute. The formulas are as follows:
$$\mu = \frac{1}{N}\sum_{j=1}^{N} x_j, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(x_j - \mu\right)^2}, \qquad z = \frac{x - \mu}{\sigma}$$
where z denotes the standardized value.
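The Z-Score standardization of S202 can be sketched as follows, using the overall (population) standard deviation that the description names. The function name `z_score` and the sample blood pressure values are assumptions for illustration.

```python
import math

def z_score(values):
    """Standardize one attribute column: subtract the mean, divide by sigma."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)  # population std
    return [(x - mu) / sigma for x in values]

# Hypothetical resting blood pressure readings.
scaled = z_score([120, 130, 140, 150, 160])
```

A column in which every value is identical would give sigma equal to zero; a practical implementation would need to guard against that case.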
S203: The acquired data set is preprocessed based on the Isolation Forest anomaly detection method, as follows. First, k points are randomly selected from the training data as a subsample and placed in the root node of an isolation tree. A dimension d is chosen, and a cut point p is generated at random between the maximum and minimum values of that dimension in the current node, producing a hyperplane that cuts the subspace. Points smaller than p are placed on the left branch of the current node and the remaining points on the right branch. New child nodes are constructed recursively until a leaf node holds only one data point or the tree has reached the set height. In this way t isolation trees are obtained; h(x) is the height of x in each tree, and c(k) is the average path length for a given number of samples k, used to normalize the path length h(x) of sample x. The anomaly score s is then calculated, and points whose anomaly score is close to 1 are judged to be anomalies and removed.
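The description does not reproduce the anomaly score formula. The sketch below assumes the standard Isolation Forest definition, s = 2^(-E[h(x)] / c(k)) with c(k) = 2H(k-1) - 2(k-1)/k and H(i) approximated by ln(i) plus the Euler-Mascheroni constant; the function names and the value k = 256 are illustrative choices.

```python
import math

EULER_GAMMA = 0.5772156649

def c(k):
    """Average path length of an unsuccessful binary-search-tree lookup over k samples."""
    if k <= 1:
        return 0.0
    if k == 2:
        return 1.0
    harmonic = math.log(k - 1) + EULER_GAMMA  # approximation of H(k-1)
    return 2.0 * harmonic - 2.0 * (k - 1) / k

def anomaly_score(mean_path_length, k):
    """s = 2 ** (-E[h(x)] / c(k)); values near 1 suggest an anomaly."""
    return 2.0 ** (-mean_path_length / c(k))

k = 256                            # assumed subsample size
short = anomaly_score(2.0, k)      # point isolated after very few cuts
typical = anomaly_score(c(k), k)   # point with an average path length
```

A point that is isolated after only a couple of cuts scores well above 0.5, while a point with an average path length scores exactly 0.5, which is why scores close to 1 are treated as anomalies.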
In S3, performing dimensionality reduction on the high-dimensional heart disease data set based on PCA (principal component analysis) includes the following steps. The data is standardized: the mean of each feature is computed and then subtracted from that feature's value in every sample. The covariance matrix is then calculated; for any two features X and Y, the larger the absolute value of their covariance, the greater their influence on each other, and the smaller the absolute value, the smaller the influence. The eigenvalues λ of the covariance matrix and their corresponding eigenvectors are calculated, the eigenvalues are sorted from largest to smallest, the first k eigenvalues are selected, and the k corresponding eigenvectors are taken out to obtain the set $\{(\lambda_1, u_1), (\lambda_2, u_2), \ldots, (\lambda_k, u_k)\}$. Finally, the original features are projected onto the selected eigenvectors to obtain the reduced k-dimensional features.
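The PCA steps of S3 can be sketched in plain Python as shown below. Power iteration with deflation is used here only to keep the sketch free of external libraries; it is a substitute chosen for illustration, not a technique named in the description. The sample covariance (division by n - 1), the function names, and the toy data are likewise assumptions.

```python
import math

def covariance_matrix(rows):
    """Center each feature and return (centered rows, sample covariance matrix)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    return centered, cov

def top_eigenpairs(matrix, k, iters=500):
    """Return the k largest (eigenvalue, eigenvector) pairs of a symmetric matrix."""
    d = len(matrix)
    m = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        v = [1.0 / math.sqrt(d)] * d
        for _step in range(iters):
            w = [sum(m[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(m[i][j] * v[j] for j in range(d)) for i in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found before searching for the next.
        m = [[m[i][j] - lam * v[i] * v[j] for j in range(d)] for i in range(d)]
    return pairs

def pca(rows, k):
    """Project centered rows onto the k leading eigenvectors."""
    centered, cov = covariance_matrix(rows)
    pairs = top_eigenpairs(cov, k)
    projected = [[sum(r[j] * u[j] for j in range(len(u))) for _, u in pairs]
                 for r in centered]
    return pairs, projected

# Hypothetical data: 6 samples, 3 features.
data = [[2.5, 2.4, 1.0], [0.5, 0.7, 0.9], [2.2, 2.9, 1.1],
        [1.9, 2.2, 1.0], [3.1, 3.0, 0.8], [2.3, 2.7, 1.2]]
pairs, reduced = pca(data, k=2)
```

Each entry of `pairs` corresponds to one (λ, u) pair in the set described above, in descending order of eigenvalue, and `reduced` holds the k-dimensional features.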
Wherein selecting features based on Spearman correlation analysis in S4 comprises the steps of:
S401: The data of any two variables (X, Y) is rank-ordered to obtain the rank positions (X', Y'); $d_i$ is the difference between the ranks of the i-th pair, and n is the number of data points in each variable. The Spearman rank correlation coefficient ρ is calculated as:
$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n\left(n^2 - 1\right)}$$
S402: The Spearman rank correlation coefficients are analyzed, and fourteen feature attributes are finally determined through screening and investigation: age (age), gender (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect (thal), and target (target). Here, a "sex" value of 1 indicates male and 0 indicates female; a "cp" value of 1 indicates typical angina pectoris, 2 atypical angina pectoris, 3 non-anginal pain, and 4 no symptoms; a "restecg" value of 0 indicates normal, 1 ST-T wave abnormality, and 2 left ventricular hypertrophy; an "exang" value of 1 indicates yes and 0 indicates no; a "slope" value of 1 indicates an upward slope, 2 a flat slope, and 3 a downward slope; "ca" takes values from 0 to 3; a "thal" value of 0 indicates normal, 1 a fixed defect, and 2 a reversible defect; a "target" value of 0 indicates a lower chance of heart attack and 1 a higher chance.
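The rank correlation of S401 can be sketched as follows, using the usual rank-difference form of the Spearman coefficient. The sketch assumes there are no tied values (ties would require average ranks); the function names and the sample columns are hypothetical.

```python
def ranks(values):
    """Assign 1-based ranks; assumes no tied values for simplicity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(x, y):
    """rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), d_i being the rank difference."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical attribute columns, not taken from the patent's data set.
age = [29, 41, 54, 63, 70]
chol = [180, 210, 205, 260, 240]
rho = spearman(age, chol)
```

For these made-up columns the squared rank differences sum to 4, so the coefficient is 1 - 24/120 = 0.8, indicating a strong monotonic relationship.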
Wherein, in S5, the model construction and model training comprises the following steps:
S501: The input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced by the fourteen data attribute features as input. The input first undergoes layer normalization, which normalizes the input values, speeds up training, and improves training stability. An abstract representation of the improved Transformer-MHP encoder-decoder is shown in fig. 2. The normalized data is fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and assigns them. The output of the multi-head self-attention layer is passed to a fully connected feed-forward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.
The attention mechanism used by the Transformer-MHP model is dot-product attention, and the structure of the dot-product attention module in the multi-head self-attention layer is shown in fig. 3. The Mask operation is optional: the dot-product attention module in the encoder does not need it, whereas the dot-product attention module in the decoder applies it to the input vectors. In the dot-product attention module, the input data plays three roles, Query, Key, and Value (denoted Q, K, and V respectively), where K is a keyword dictionary, V holds the values associated with the keywords, Q is the task to be queried, and $d_k$ is the dimension of the input data matrix. The purpose of this module is to use Q to find the corresponding V through K, that is, the query task locates keywords in the keyword dictionary and retrieves their corresponding values.
The formula for dot-product attention is as follows:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
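As a sketch only, scaled dot-product attention with the optional Mask operation can be written in plain Python as below. The function names, the use of a large negative constant to implement masking, and the toy Q, K, V matrices are assumptions; the description states only that the decoder applies a mask and the encoder does not.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(Q, K, V, mask=None):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output, weights = [], []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            # Masked positions get a very negative score, so their weight becomes 0.
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        w = softmax(scores)
        weights.append(w)
        output.append([sum(w[j] * V[j][c] for j in range(len(V)))
                       for c in range(len(V[0]))])
    return output, weights

# Toy inputs: two positions, two dimensions.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, attn = dot_product_attention(Q, K, V)                    # encoder style, no mask
causal = [[True, False], [True, True]]
masked_out, masked_attn = dot_product_attention(Q, K, V, mask=causal)  # decoder style
```

With the mask applied, the first position can attend only to itself, so its output equals the first row of V; without the mask each query still favors its matching key but mixes in the other value.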
S502: Model training is carried out based on the idea of the Wrapper recursive feature elimination algorithm. Each feature is first assigned a weight, and the prediction model is trained on the original features. Once the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is eliminated. This procedure is repeated recursively until the number of remaining features reaches the required number.
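The elimination loop of S502 can be illustrated with the sketch below. A small logistic regression trained by gradient descent stands in for the prediction model; this stand-in, the hyperparameters, the function names, and the toy data are all assumptions made so that the example is self-contained, and are not the Transformer-MHP model itself.

```python
import math

def fit_logistic_weights(X, y, epochs=300, lr=0.1):
    """Train a tiny logistic regression by gradient descent; return feature weights."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, row)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - label
            grad_w = [g + err * xi for g, xi in zip(grad_w, row)]
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w

def rfe(X, y, names, n_keep):
    """Repeatedly retrain and drop the feature with the smallest absolute weight."""
    kept = list(range(len(names)))
    while len(kept) > n_keep:
        sub = [[row[j] for j in kept] for row in X]
        weights = fit_logistic_weights(sub, y)
        worst = min(range(len(kept)), key=lambda i: abs(weights[i]))
        kept.pop(worst)
    return [names[j] for j in kept]

# Hypothetical standardized toy data: "signal" tracks the label, the others barely do.
X = [[1.2, 0.1, -0.3], [0.9, -0.2, 0.4], [1.1, 0.3, 0.1],
     [-1.0, 0.2, -0.1], [-1.3, -0.1, 0.3], [-0.9, -0.3, -0.4]]
y = [1, 1, 1, 0, 0, 0]
selected = rfe(X, y, ["signal", "noise_a", "noise_b"], n_keep=1)
```

Because the model is retrained after every elimination, the weights used to rank features always reflect the current, smaller feature set, which is the distinguishing property of the wrapper approach.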
In S6, the testing and evaluating of the model includes the following steps:
S601: A decision tree model, an MLP neural network, an SVM (support vector machine), and the unsupervised K-Means algorithm are selected to build traditional sudden heart disease prediction models for comparison. Meanwhile, based on the Transformer-MHP model, the collected data is added into the machine learning model according to the weights.
S602: The accuracy, recall, and F1 score of each model are calculated from the test set and the confusion matrix and used as evaluation indexes to compare the performance of the models in predicting the probability of sudden heart disease. The accuracy of the supervised MLP (multi-layer perceptron) model is 0.7631578947368421; the accuracy of the supervised SVM prediction model is 0.8026315789473685; and the accuracy of the prediction model using the K-Means algorithm is 0.75. The accuracy of the Transformer-MHP prediction model reaches 0.91387324, which is clearly superior to these three models. The comparison shows that the new heart disease prediction model provided by the invention has good accuracy and interpretability.
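The evaluation indexes named in S602 can be computed from confusion-matrix counts as sketched below. The function name `evaluate` and the label vectors are hypothetical and do not reproduce the patent's test data or its reported accuracies.

```python
def evaluate(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test labels and predictions.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = evaluate(y_true, y_pred)
```

Reporting recall and F1 alongside accuracy matters for a screening task of this kind, since a model can reach high accuracy while still missing a large share of positive cases.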
The foregoing shows and describes the basic principles, main features, and advantages of the present invention. The embodiments and description above merely illustrate the principles of the invention, and variations and modifications may be made without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.
Claims (8)
1. A sudden heart disease prediction method based on the Transformer-MHP model, characterized by comprising the following steps:
(1) collecting a large number of paper medical records or electronic medical records, and screening a heart disease data set;
(2) preprocessing the data of the heart disease data set obtained in the step (1) for subsequent characteristic analysis;
(3) performing data analysis by using the data set processed in the step (2) to obtain a training set D of the algorithm, wherein the training set D has a size of N;
(4) selecting fourteen feature attributes with the highest correlation coefficient from the training set D obtained in the step (3) to construct a new training set D' for subsequent model training;
(5) constructing a model by using a Transformer-MHP, and training the model;
(6) testing and evaluating the Transformer-MHP model obtained in step (5) to verify the accuracy and interpretability of the model.
2. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 1, characterized in that step (2) specifically comprises:
2.1) performing missing value processing on all attributes of the obtained data samples using mean interpolation; if an attribute is measured on a continuous (interval) scale, its missing values are interpolated with the mean of the attribute's valid values, and if an attribute is measured on a graded scale, its missing values are interpolated with the mode of the attribute's valid values;
2.2) standardizing the data obtained in step 2.1) by the Z-Score standardization method, so that the processed data conforms to the standard normal distribution, thereby eliminating errors caused by differing units of measurement;
2.3) based on the idea of the Isolation Forest anomaly detection algorithm, recursively and randomly partitioning the heart disease data set and building local models, wherein each isolation tree is used to identify specific attribute subsamples; calculating and ranking the anomaly score of each sample point, judging sample points whose anomaly score is close to 1 to be anomalies, and directly deleting the sample points marked as anomalies, thereby removing anomalous data that is sparsely distributed and far from the high-density population.
3. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 2, characterized in that step (3) specifically comprises: based on the idea of the PCA (principal component analysis) data processing algorithm, performing dimensionality reduction on the high-dimensional heart disease data set, retaining important features and removing noise, thereby reducing the number of attribute indexes to be analyzed and improving data processing speed; after the data is standardized, calculating the eigenvalues λ of the sample covariance matrix and their corresponding eigenvectors, sorting the eigenvalues from largest to smallest, selecting the first k eigenvalues, and taking out the k corresponding eigenvectors to obtain the set $\{(\lambda_1, u_1), (\lambda_2, u_2), \ldots, (\lambda_k, u_k)\}$; and projecting the original features onto the selected eigenvectors to obtain the reduced k-dimensional features.
4. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 3, characterized in that step (4) specifically comprises: based on the idea of the Spearman correlation analysis algorithm, measuring the degree of correlation between the rank-ordered heart disease attribute variables, obtaining the rank correlation coefficients, and evaluating the correlation, so as to carry out feature selection and eliminate attributes that are irrelevant or that reduce model precision; fourteen feature attributes are selected from the heart disease data set and form a new training set for model training.
5. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 4, characterized in that the Spearman rank correlation analysis of step (4) comprises the following steps:
4.1) rank-ordering the data of two variables (X, Y) to obtain the ordered positions (X', Y'), i.e. the ranks, where $d_i$ is the difference between the ranks and n is the number of data points in each variable;
4.2) calculating the Spearman rank correlation coefficient;
4.3) analyzing the Spearman rank correlation coefficients, and finally determining fourteen feature attributes through screening and investigation: age (age), gender (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect (thal), and target (target), wherein a "sex" value of 1 indicates male and 0 indicates female; a "cp" value of 1 indicates typical angina pectoris, 2 atypical angina pectoris, 3 non-anginal pain, and 4 no symptoms; a "restecg" value of 0 indicates normal, 1 ST-T wave abnormality, and 2 left ventricular hypertrophy; an "exang" value of 1 indicates yes and 0 indicates no; a "slope" value of 1 indicates an upward slope, 2 a flat slope, and 3 a downward slope; "ca" takes values from 0 to 3; a "thal" value of 0 indicates normal, 1 a fixed defect, and 2 a reversible defect; a "target" value of 0 indicates a lower chance of heart attack and 1 a higher chance.
6. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 5, characterized in that step (5) specifically comprises:
5.1) improving the encoder-decoder layer based on the encoder-decoder structure of the traditional Transformer; the Transformer framework was proposed for the NLP field, its input and output layers are designed for processing words, and a Positional Encoding preprocessing mechanism is added to supply relative position information; its internal structure is modified to fully match and process the input of the fourteen heart disease feature attributes;
5.2) based on the idea of the highly scalable MHP parallel analysis algorithm, enabling the target data input to the decoder and the input data input to the encoder to be processed as parallel data streams, and filtering out false data races during the processing;
5.3) carrying out model training based on the idea of the Wrapper recursive feature elimination algorithm: performing multiple rounds of training with a base model, removing several features according to their weight coefficients after each round, carrying out the next round of training on the new feature set, and iterating until the required number of features is obtained;
5.4) adding the training set obtained in step (4) to the Transformer-MHP machine learning model provided by the invention, taking the data label as the output variable Y, and training to obtain the recursive formula between the input variables and the output variable.
7. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 6, characterized in that step 5.1) specifically comprises: removing the input Embedding layer and the Positional Encoding layer of the encoder and replacing them with the fourteen data attribute features as input; first applying layer normalization to the input, which normalizes the input values, speeds up training, and improves training stability; feeding the normalized data into a multi-head self-attention layer, which computes the self-attention weights of the input data and assigns them; passing the output of the multi-head self-attention layer to a fully connected feed-forward neural network; and having the decoder receive the output of the encoder and the output of the decoder's first sublayer, apply layer normalization to the data, and finally output a True or False state indicating whether symptoms of sudden heart disease are present.
8. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 7, characterized in that step 5.3) specifically comprises: for a prediction model that assigns weights to features, RFE selects the required features by recursively shrinking the feature set; each feature is first assigned a weight, the prediction model is trained on the original features, and once the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is removed; this procedure is repeated recursively until the number of remaining features reaches the required number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110531057.4A CN113223701A (en) | 2021-05-16 | 2021-05-16 | Sudden heart disease prediction method based on Transformer-MHP model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113223701A (en) | 2021-08-06 |
Family
ID=77092105
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110531057.4A Pending CN113223701A (en) | 2021-05-16 | 2021-05-16 | Sudden heart disease prediction method based on Transformer-MHP model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113223701A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113887089A (en) * | 2021-11-17 | 2022-01-04 | 中冶赛迪重庆信息技术有限公司 | Wire rod mechanical property prediction method and computer readable storage medium |
CN114536104A (en) * | 2022-03-25 | 2022-05-27 | 成都飞机工业(集团)有限责任公司 | Dynamic prediction method for tool life |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111080032A (en) * | 2019-12-30 | 2020-04-28 | 成都数之联科技有限公司 | Load prediction method based on Transformer structure |
CN111243751A (en) * | 2020-01-17 | 2020-06-05 | 河北工业大学 | Heart disease prediction method based on dual feature selection and XGboost algorithm |
CN112336310A (en) * | 2020-11-04 | 2021-02-09 | 吾征智能技术(北京)有限公司 | Heart disease diagnosis system based on FCBF and SVM fusion |
Non-Patent Citations (2)
Title |
---|
HDX柿子: "Machine Learning in Action: Feature Selection with Recursive Feature Elimination", pages 1 - 36, Retrieved from the Internet <URL:https://blog.csdn.net/zxjoke/article/details/105501640> * |
初识-CV: "Transformer Model Explained in Detail (Most Complete Illustrated Version)", pages 1 - 15, Retrieved from the Internet <URL:https://blog.csdn.net/qq_38410428/article/details/112348321> * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||