CN113223701A

CN113223701A - Sudden heart disease prediction method based on Transformer-MHP model

Info

Publication number: CN113223701A
Application number: CN202110531057.4A
Authority: CN
Inventors: 王宇嘉; 蔡虓; 姚可越; 冯艺
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2021-05-16
Filing date: 2021-05-16
Publication date: 2021-08-06

Abstract

The invention discloses a method for predicting sudden heart disease based on a Transformer-MHP model, comprising four parts: data preprocessing, feature analysis, model construction and training, and performance evaluation. First, the acquired heart disease data samples are preprocessed; then Principal Component Analysis (PCA) is used to reduce the dimensionality of the data set; finally, a Spearman correlation analysis algorithm is used to screen fourteen feature attributes for model training. The Transformer algorithm has mainly been applied to natural language processing, where it has achieved remarkable results. The invention improves on the traditional Transformer framework and, by combining it with a highly scalable parallel processing algorithm, provides a new Transformer-MHP algorithm model for predicting the probability of sudden heart disease in the AI medical field, so as to help improve the efficiency and accuracy of medical work. Finally, the performance of the model is evaluated experimentally, and the results show that the Transformer-MHP heart disease prediction algorithm has better accuracy and interpretability than traditional algorithms.

Description

Sudden heart disease prediction method based on Transformer-MHP model

Technical Field

The invention belongs to the field of AI medical treatment, and particularly relates to a method for predicting sudden heart disease based on a Transformer-MHP model.

Background

Changes in population size and structure, together with uncontrollable environmental factors, have placed increasing pressure on the medical industry. However, with the breakthrough and popularization of artificial intelligence technology, its application scenarios are becoming ever richer and more general. By exploiting the high-performance, high-efficiency data processing of computers and combining big data analysis with deep learning, artificial intelligence has changed medical practice to a great extent, markedly reducing costs and improving efficiency.

At present, machine learning algorithms such as MLP (multilayer perceptron), decision trees, SVM (support vector machine) and K-Means are used to construct prediction models in the field of sudden heart disease prediction, but training results show that these algorithms have certain shortcomings, leaving room to improve model accuracy and efficiency. It is therefore necessary to construct an efficient machine learning algorithm to assist in predicting sudden heart disease.

Disclosure of Invention

The invention aims to overcome the defects of the prior art, provides a novel sudden heart disease prediction method based on a Transformer-MHP model, constructs an artificial intelligence technology system fusing medical health and modern science, and provides auxiliary support for predicting sudden heart disease and improving the efficiency and accuracy of medical staff.

The Transformer is a model proposed in 2017 by researchers at Google for seq2seq tasks. Compared with sequence-aligned RNNs or CNNs, the traditional Transformer is a transduction model that relies entirely on a self-attention mechanism to compute representations of its input and output, so it can capture global information directly and be computed in parallel.

The technical scheme of the invention is as follows:

The sudden heart disease prediction method based on the Transformer-MHP model comprises the following steps:

(1) collecting a considerable number of paper or electronic medical records, and screening a heart disease data set;

(2) data preprocessing for subsequent feature analysis;

(3) performing data analysis by using the data set processed in the step (2) to obtain a training set D of the algorithm, wherein the training set D has a size of N;

(4) selecting fourteen feature attributes with the highest correlation coefficient from the training set D obtained in the step (3) according to expert suggestion, and constructing a new training set D' for subsequent model training;

(5) constructing a model by using a Transformer-MHP, and training the model;

(6) testing and evaluating the Transformer-MHP model obtained in the step (5) to verify the accuracy and the interpretability of the model.

Further, the step (2) is specifically as follows:

2.1) Missing values in all attributes of the obtained data samples are handled by mean interpolation. If an attribute is measured on a continuous (interval) scale, its missing values are filled with the mean of the valid values of that attribute; if an attribute is measured on an ordinal (graded) scale, its missing values are filled with the mode of the valid values of that attribute.

2.2) The data obtained in the step 2.1) are standardized by the Z-Score method so that the processed data have zero mean and unit standard deviation, which eliminates errors caused by differing dimensions (units).

2.3) Based on the idea of the Isolation Forest anomaly detection algorithm, the heart disease data set is recursively and randomly partitioned and local models are built, with each isolation tree used to identify a specific attribute subsample. The anomaly score of each sample point is calculated and sorted, and sample points whose anomaly score is close to 1 are classified as anomalous. Sample points marked as anomalous are deleted directly, thereby removing abnormal data that are sparsely distributed and far from high-density populations.
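Steps 2.1) and 2.2) can be illustrated with a minimal sketch. The function names `impute` and `z_score` and the toy columns are assumptions made for illustration and do not come from the patent; the population standard deviation is assumed for the Z-Score step.

```python
from statistics import mean, mode, pstdev

def impute(column, ordinal=False):
    """Fill None entries with the mean (continuous) or mode (ordinal) of the valid values."""
    valid = [v for v in column if v is not None]
    fill = mode(valid) if ordinal else mean(valid)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Standardize a column to zero mean and unit population standard deviation.

    Assumes the column is not constant (sigma > 0).
    """
    mu = mean(column)
    sigma = pstdev(column)
    return [(v - mu) / sigma for v in column]

# Hypothetical toy columns: resting blood pressure (continuous), chest pain type (ordinal).
trestbps = impute([120, None, 140, 130], ordinal=False)
cp = impute([1, 3, None, 3], ordinal=True)
trestbps_z = z_score(trestbps)
```

In this sketch the missing blood pressure value is replaced by the mean of the three valid readings, and the missing chest pain type is replaced by the most frequent category.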

Further, the step (3) is specifically as follows: based on the idea of the PCA principal component analysis data processing algorithm, dimensionality reduction is performed on the high-dimensional heart disease data set, retaining important features and removing noise, so that fewer attribute indexes need to be analyzed and data processing is faster. After the data are standardized, the eigenvalues λ of the sample covariance matrix and their corresponding eigenvectors are calculated, the eigenvalues are sorted from largest to smallest, the first k eigenvalues are selected in that order, and the k corresponding eigenvectors are taken to obtain the set $\{(\lambda_1,u_1),(\lambda_2,u_2),\ldots,(\lambda_k,u_k)\}$. The original features are then projected onto the selected eigenvectors to obtain the k-dimensional features after dimensionality reduction.
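The PCA procedure described above can be sketched as follows, assuming NumPy is available. The function name `pca` and the synthetic data are illustrative assumptions, not part of the patent.

```python
import numpy as np

def pca(X, k):
    """Project the rows of X onto the k covariance eigenvectors with the largest eigenvalues."""
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = np.cov(Xc, rowvar=False)             # feature-by-feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # symmetric input, eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    return Xc @ eigvecs[:, order], eigvals[order]

# Hypothetical data: three features, two of which are almost perfectly correlated.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200), 0.1 * rng.normal(size=200)])
Z, lam = pca(X, k=2)
```

The variance of each projected column equals the corresponding eigenvalue, which is why sorting by eigenvalue keeps the directions that carry the most information.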

Further, the step (4) is specifically as follows: based on the idea of the Spearman correlation analysis algorithm, the degree of correlation between rank-ordered heart disease attribute variables is measured to obtain a rank correlation coefficient, and the correlation of the attribute variables is evaluated so that feature selection can be carried out and attributes that are irrelevant or reduce model accuracy can be removed. In the invention, fourteen feature attributes are selected from the heart disease data set, and these fourteen feature attributes form a new training set for model training.

Further, the step (5) is specifically as follows:

5.1) The encoder-decoder layer is improved on the basis of the encoder-decoder structure of the traditional Transformer. The Transformer framework was proposed for the NLP field: its input and output layers are suited only to processing words, and a Positional Encoding preprocessing mechanism is added to supply relative position information. The internal structure is now modified so that it fully matches and processes the input of the fourteen heart disease feature attributes.

5.2) Based on the idea of the highly scalable MHP parallel analysis algorithm, the target data fed to the decoder and the input data fed to the encoder are processed in parallel as data streams. False data races are filtered out during data processing, which greatly reduces model overhead and improves model efficiency.

5.3) Model training is carried out based on the idea of the Wrapper recursive feature elimination algorithm. The base model is trained for multiple rounds; after each round the features with the smallest weight coefficients are removed, the next round is trained on the new feature set, and the iteration continues until the required number of features is obtained.

5.4) The training set obtained in the step (4) is fed into the Transformer-MHP machine learning model provided by the invention, with the data label as the output variable Y, and the model is trained to obtain the recursion formula between the input variables and the output variable.

Further, the step (6) is specifically as follows: evaluation indexes of the model such as accuracy and recall are calculated based on the test set and the confusion matrix, and the performance of multiple models in predicting the probability of sudden heart disease is compared to confirm that the expected prediction performance is achieved. The Transformer-MHP heart disease prediction model has better accuracy and generalization than traditional models.

Further, the Spearman rank correlation data analysis method in the step (4) comprises the following steps:

4.1) rank-ordering the data of two variables (X, Y) to obtain their ordered positions (X', Y'), i.e. their ranks, where $d_i$ is the difference between the ranks of the i-th pair and n is the number of data points in each variable.

4.2) calculating the Spearman rank correlation coefficient:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}$$

4.3) analyzing the Spearman rank correlation coefficients and finally determining fourteen feature attributes through screening and investigation: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect type (thal) and target (target). Here, a "sex" value of 1 denotes male and 0 denotes female; a "cp" value of 1 denotes typical angina, 2 atypical angina, 3 non-anginal pain and 4 no symptoms; a "restecg" value of 0 denotes normal, 1 ST-T wave abnormality and 2 left ventricular hypertrophy; an "exang" value of 1 denotes yes and 0 denotes no; a "slope" value of 1 denotes upsloping, 2 flat and 3 downsloping; "ca" takes values from 0 to 3; a "thal" value of 0 denotes normal, 1 a fixed defect and 2 a reversible defect; a "target" value of 0 denotes a lower chance of a heart attack and 1 a higher chance.
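The rank-difference form of the Spearman coefficient, 1 - 6Σd_i²/(n(n²-1)), can be sketched in a few lines. The helper names `ranks` and `spearman` and the sample values are illustrative assumptions; this simple form also assumes there are no tied values, and tied data would require averaged ranks.

```python
def ranks(values):
    """Return 1-based ranks of the values; assumes no two values are equal."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation via the rank-difference formula (no ties)."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical values for two attributes measured on five patients.
rho = spearman([10, 20, 30, 40, 50], [1, 3, 2, 5, 4])
```

Attributes whose coefficient against the target is close to zero are candidates for removal during feature selection.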

Further, the step 5.1) is specifically as follows: the input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced with an input of the fourteen data feature attributes. After input, layer normalization is first applied to normalize the input values, which speeds up training and improves training stability. The normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and distributes them. After processing by the multi-head self-attention layer, the data are passed to a fully connected feed-forward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.
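The layer normalization applied directly to the attribute input can be sketched as below. This is only an illustration under stated assumptions: the function name `layer_norm`, the `eps` constant and the fourteen sample values are hypothetical, and the learnable gain and bias used in typical layer normalization are omitted.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one input vector to zero mean and (approximately) unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

# Hypothetical record of fourteen attribute values fed in place of word embeddings.
record = [63, 1, 1, 145, 233, 1, 2, 150, 0, 2.3, 3, 0, 1, 1]
normed = layer_norm(record)
```

Because the attributes are numeric values rather than tokens, no embedding lookup or positional encoding is needed before this step.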

Further, the step 5.3) is specifically as follows: for a prediction model that assigns weights to features, RFE selects the required features by recursively shrinking the feature set. Each feature is first assigned a weight, and the prediction model is trained on the original features. After the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is eliminated. This cycle is repeated recursively until the number of remaining features reaches the required number.
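A minimal sketch of recursive feature elimination follows, assuming NumPy is available. A least-squares linear model stands in for the base model here purely for illustration; the patent's base model is the Transformer-MHP network, and the function name `rfe` and the synthetic data are assumptions.

```python
import numpy as np

def rfe(X, y, n_keep):
    """Recursive feature elimination with a least-squares linear base model.

    Assumes the columns of X are on comparable scales (for example after Z-Score
    standardization), so that weight magnitudes can be compared across features.
    """
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        w, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)  # train on current features
        weakest = int(np.argmin(np.abs(w)))                       # smallest absolute weight
        del remaining[weakest]                                    # eliminate it, then repeat
    return remaining

# Hypothetical data: only features 0 and 3 actually influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=300)
selected = rfe(X, y, n_keep=2)
```

Retraining after every elimination matters because the weights of the remaining features can change once a correlated feature has been removed.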

The invention has the beneficial effects that:

The invention applies the Transformer algorithm model to the field of sudden heart disease prediction and, through algorithmic innovation and improvement, provides a new Transformer-MHP algorithm model that can improve medical quality and service efficiency, reduce misdiagnosis and mistreatment, and contribute to the interdisciplinary integration of medicine and artificial intelligence.

Drawings

FIG. 1 shows the overall structure of the Transformer-MHP model;

FIG. 2 is a schematic diagram of the Transformer-MHP encoder-decoder architecture;

FIG. 3 is a schematic structural diagram of the dot product attention module.

Detailed Description

In order to make the technical means, the creation characteristics, the achievement purposes and the effects of the invention easy to understand, the invention is further described with the specific embodiments.

(1) A considerable number of paper or electronic medical records are collected, and a heart disease data set is screened.

(2) Data preprocessing. First, missing values are handled by mean interpolation; the data are then standardized by the Z-Score method so that the processed data have zero mean and unit standard deviation, which eliminates errors caused by differing dimensions (units). Finally, abnormal data are deleted based on the Isolation Forest anomaly detection algorithm. The heart disease data set is recursively and randomly partitioned and local models are built, with each isolation tree used to identify a specific attribute subsample. The anomaly score of each sample point is calculated and sorted, anomalous points are identified, and the sample points marked as anomalous are deleted, thereby removing abnormal data that are sparsely distributed and far from high-density populations in preparation for subsequent feature analysis. During training, each isolation tree randomly selects a subset of the samples, and the algorithm becomes more stable as the number of trees increases. Because no distance or density indexes need to be calculated, speed is greatly improved and system overhead is reduced.

(3) Dimensionality reduction is performed on the high-dimensional heart disease data set based on PCA principal component analysis, retaining important features and removing noise so as to reduce the attribute indexes that need to be analyzed. Dimensionality reduction simplifies the model and compresses the data while preserving as much of the original information as possible, effectively addressing the large volume, high dimensionality and analytical difficulty of medical data and reducing computational cost.

(4) Based on the idea of the Spearman correlation analysis algorithm, the degree of correlation between rank-ordered heart disease attribute variables is measured to obtain a rank correlation coefficient, and the correlation of the attribute variables is evaluated so that feature selection can be carried out and attributes that are irrelevant or reduce model accuracy can be removed. Following expert suggestion, fourteen highly correlated feature attributes are finally selected from the heart disease data set, and these fourteen feature attributes form a new training set for model training.

(5) Based on the traditional Transformer framework, a new Transformer-MHP algorithm model is constructed by combining it with a highly scalable parallel processing algorithm. The input and output mode and the relative position vector mechanism of the encoder-decoder layer, originally suited to the NLP field, are modified so that the layer fully matches and processes the input and output of the fourteen heart disease feature attributes. The target data fed to the decoder and the input data fed to the encoder are processed in parallel, and false data races are filtered out during data processing, which greatly reduces model overhead and improves model efficiency. The new Transformer-MHP algorithm model is then trained based on the Wrapper recursive feature elimination algorithm. The base model is trained for multiple rounds; after each round the features with the smallest weight coefficients are removed, the next round is trained on the new feature set, and the iteration continues until the required number of features is obtained. Finally, the training set obtained in the step (4) is fed into the machine learning model, with the data label as the output variable Y, and the model is trained to obtain the recursion formula between the input variables and the output variable.

(6) Evaluation indexes of the model such as accuracy and recall are calculated based on the test set and the confusion matrix, and the performance of multiple models in predicting the probability of sudden heart disease is compared. The confusion matrix is a standard visual format for presenting the accuracy evaluation of a model, and shows simply and clearly how well the predicted values agree with the true values. Comparing the model evaluation indexes demonstrates that the Transformer-MHP heart disease prediction model provided by the invention has better accuracy and generalization than traditional models.

The step (4) of constructing the training set containing fourteen feature attributes comprises the following steps:

(41) Rank-ordering the data of the two variables (X, Y) to obtain their ordered positions (X', Y'), i.e. their ranks, where $d_i$ is the difference between the ranks of the i-th pair and n is the number of data points in each variable.

(42) Calculating the Spearman rank correlation coefficient:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}$$

(43) Analyzing the Spearman rank correlation coefficients and finally determining fourteen feature attributes through screening and investigation: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect type (thal) and target (target). Here, a "sex" value of 1 denotes male and 0 denotes female; a "cp" value of 1 denotes typical angina, 2 atypical angina, 3 non-anginal pain and 4 no symptoms; a "restecg" value of 0 denotes normal, 1 ST-T wave abnormality and 2 left ventricular hypertrophy; an "exang" value of 1 denotes yes and 0 denotes no; a "slope" value of 1 denotes upsloping, 2 flat and 3 downsloping; "ca" takes values from 0 to 3; a "thal" value of 0 denotes normal, 1 a fixed defect and 2 a reversible defect; a "target" value of 0 denotes a lower chance of a heart attack and 1 a higher chance.

In the step (5), improving the encoder-decoder layer based on the traditional Transformer structure comprises the following: the input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced with an input of the fourteen data feature attributes. After input, layer normalization is first applied to normalize the input values, which speeds up training and improves training stability. The normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and distributes them. After processing by the multi-head self-attention layer, the data are passed to a fully connected feed-forward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.

The model training based on the Wrapper recursive feature elimination algorithm in the step (5) comprises the following: for a prediction model that assigns weights to features, RFE selects the required features by recursively shrinking the feature set. Each feature is first assigned a weight, and the prediction model is trained on the original features. After the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is eliminated. This cycle is repeated recursively until the number of remaining features reaches the required number.

Examples

As shown in fig. 1, the method of the invention comprises the following steps:

S1: Data are collected: a considerable number of paper or electronic medical records are gathered and a heart disease data set is screened.

S2: Data preprocessing.

S3: Dimensionality reduction is performed on the high-dimensional heart disease data set based on PCA principal component analysis to simplify the model and compress the data.

S4: Feature selection is carried out based on the Spearman correlation analysis algorithm, and attributes that are irrelevant or reduce model accuracy are removed. Following expert suggestion, fourteen highly correlated feature attributes are finally selected from the heart disease data set, and these fourteen feature attributes form a new training set for model training.

S5: The model is constructed using Transformer-MHP, as shown in FIG. 1, and model training is performed.

S6: the Transformer-MHP model provided by the invention is tested and evaluated, and compared with a basic model to verify the accuracy and generalization of the model.

Wherein, in S1, collecting data and selecting data comprises the steps of:

S101: The UCI heart disease data set is selected; it contains 76 attributes, 14 of which are used by the present invention.

S102: Medical record data and clinical data are collected from the 408 beds of the hospital's cardiac surgery department, covering the years 2016 to 2021 and totaling 5126 experimental records. 5000 records are finally selected for the model experiments and stored in an Excel spreadsheet, with 76 feature attributes.

Wherein, in S2, the data preprocessing includes the following steps:

S201: Missing values in all attributes of the obtained data samples are handled by mean interpolation. If an attribute is measured on a continuous (interval) scale, its missing values are filled with the mean of the valid values of that attribute; if an attribute is measured on an ordinal (graded) scale, its missing values are filled with the mode of the valid values of that attribute. In total, 32 records had missing attribute values, and all of them were filled in manually.

S202: The data are standardized by the Z-Score method, which requires the mean and standard deviation of the raw data. Each new value is the original value minus the mean, divided by the standard deviation.

Here μ is the overall mean of the data, x is an observed value, σ is the overall standard deviation, N is the number of sample points of the attribute, j indexes the sample points of the attribute, and $z_j$ denotes the standardized value of sample point j. The formulas are as follows:

$$\mu = \frac{1}{N}\sum_{j=1}^{N} x_j,\qquad \sigma = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(x_j-\mu\right)^{2}},\qquad z_j = \frac{x_j-\mu}{\sigma}$$

S203: Preprocessing the acquired data set based on the Isolation Forest anomaly detection method comprises the following: first, k points are randomly selected from the training data as a subsample and placed in the root node of an isolation tree; a dimension d is chosen, and a cut point p is generated between the maximum and minimum values of the current node's data in that dimension, producing a hyperplane that splits the subspace. Points smaller than p in dimension d are placed in the left branch of the current node and the remaining points in the right branch, and new child nodes are constructed recursively until a leaf node holds only one data point or the tree has grown to the set height. In this way t isolation trees are obtained; h(x) is the height of x in each tree, and c(k) is the average path length for a given number of samples k, which is used to normalize the path length h(x) of sample x. The anomaly score s is then calculated, points whose anomaly score is close to 1 are judged to be anomalous, and the anomalous points are removed.

In S3, performing dimensionality reduction on the high-dimensional heart disease data set based on PCA principal component analysis comprises the following: the data are standardized by computing the mean of each feature and subtracting that mean from the feature for all samples. The covariance matrix is then calculated; for any two features X and Y, the larger the absolute value of their covariance, the stronger their influence on each other, and vice versa. The eigenvalues λ of the covariance matrix and their corresponding eigenvectors are calculated, the eigenvalues are sorted from largest to smallest, the first k eigenvalues are selected in that order, and the k corresponding eigenvectors are taken to obtain the set $\{(\lambda_1,u_1),(\lambda_2,u_2),\ldots,(\lambda_k,u_k)\}$. Finally, the original features are projected onto the selected eigenvectors to obtain the k-dimensional features after dimensionality reduction.

Wherein selecting features based on Spearman correlation analysis in S4 comprises the steps of:

S401: The data of any two variables (X, Y) are rank-ordered to obtain their ordered positions (X', Y'), where $d_i$ is the difference between the ranks of the i-th pair and n is the number of data points in each variable, and the Spearman rank correlation coefficient is calculated by the formula:

$$\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n\,(n^{2}-1)}$$

S402: The Spearman rank correlation coefficients are analyzed, and fourteen feature attributes are finally determined through screening and investigation: age (age), sex (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), exercise-induced ST depression relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect type (thal) and target (target). Here, a "sex" value of 1 denotes male and 0 denotes female; a "cp" value of 1 denotes typical angina, 2 atypical angina, 3 non-anginal pain and 4 no symptoms; a "restecg" value of 0 denotes normal, 1 ST-T wave abnormality and 2 left ventricular hypertrophy; an "exang" value of 1 denotes yes and 0 denotes no; a "slope" value of 1 denotes upsloping, 2 flat and 3 downsloping; "ca" takes values from 0 to 3; a "thal" value of 0 denotes normal, 1 a fixed defect and 2 a reversible defect; a "target" value of 0 denotes a lower chance of a heart attack and 1 a higher chance.

Wherein, in S5, the model construction and model training comprises the following steps:

S501: The input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced with an input of the fourteen data feature attributes. After input, layer normalization is first applied to normalize the input values, which speeds up training and improves training stability. An abstract representation of the improved Transformer-MHP encoder-decoder is shown in FIG. 2. The normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and distributes them. After processing by the multi-head self-attention layer, the data are passed to a fully connected feed-forward neural network. The decoder receives the output of the encoder and the output of the decoder's first sublayer, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.

The attention mechanism used by the Transformer-MHP model is dot product attention, and the structure of the dot product attention module in the multi-head self-attention layer is shown in FIG. 3. The Mask operation is optional: the dot product attention module in the encoder does not need it, whereas the dot product attention module in the decoder applies the Mask operation to its input vectors. In the dot product attention module, the input data play three roles, Query, Key and Value (denoted Q, K and V respectively), where K is a keyword dictionary, V holds the values associated with the keywords, Q is the task to be queried, and $d_k$ is the dimension of the input data matrix. The purpose of the module is to use Q to find the corresponding V values through K (that is, to locate the keywords in the keyword dictionary according to the query task and retrieve their corresponding values). The formula for dot product attention is as follows:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
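Scaled dot product attention, softmax(QK^T / sqrt(d_k)) V, with an optional mask can be sketched in plain Python as below. The function names, the mask convention (a truthy entry means the position may be attended to) and the toy matrices are assumptions for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot_product_attention(Q, K, V, mask=None):
    """Return softmax(Q K^T / sqrt(d_k)) V; a falsy mask[i][j] hides key j from query i."""
    d_k = len(K[0])
    output = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else -1e9 for j, s in enumerate(scores)]
        weights = softmax(scores)
        output.append([sum(w * v[col] for w, v in zip(weights, V))
                       for col in range(len(V[0]))])
    return output

# Hypothetical toy inputs: two queries, two keys, two values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
encoder_out = dot_product_attention(Q, K, V)                          # no mask (encoder)
decoder_out = dot_product_attention(Q, K, V, mask=[[1, 0], [1, 1]])   # causal mask (decoder)
```

With the causal mask, the first query can attend only to the first key, so its output row equals the first value row.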

S502: Model training is carried out based on the idea of the Wrapper recursive feature elimination algorithm. Each feature is first assigned a weight, and the prediction model is trained on the original features. After the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is eliminated. This cycle is repeated recursively until the number of remaining features reaches the required number.

In S6, the testing and evaluating of the model includes the following steps:

S601: A decision tree model, an MLP neural network, an SVM support vector machine and the K-Means unsupervised algorithm are selected to construct traditional sudden heart disease prediction models for comparison. At the same time, based on the Transformer-MHP model, the collected data are fed into the machine learning model according to their weights.

S602: The accuracy, recall and F1 score of each model are calculated from the test set and the confusion matrix as evaluation indexes, and the performance of the models in predicting the probability of sudden heart disease is compared. The accuracy of the supervised MLP (multilayer perceptron) model is 0.7631578947368421; the accuracy of the supervised SVM prediction model is 0.8026315789473685; and the accuracy of the K-Means prediction model is 0.75. The accuracy of the Transformer-MHP prediction model reaches 0.91387324, which is clearly superior to these three models. The comparison shows that the new heart disease prediction model provided by the invention has good accuracy and interpretability.
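The evaluation indexes named above can be derived from the confusion matrix as in the following sketch. The function names and the toy labels are hypothetical and are not the patent's experimental data.

```python
def confusion_matrix(y_true, y_pred):
    """Count true positives, false positives, false negatives and true negatives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def evaluate(y_true, y_pred):
    """Accuracy, precision, recall and F1 score computed from the confusion matrix."""
    tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical labels: 1 = higher chance of a heart attack, 0 = lower chance.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
report = evaluate(y_true, y_pred)
```

Reporting recall alongside accuracy is useful in this setting because a missed positive case (a false negative) is usually the more costly error in medical screening.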

The foregoing shows and describes the basic principles, main features, and advantages of the present invention. Those skilled in the art will understand that the embodiments and description above merely illustrate the principles of the invention, and that variations and modifications are possible without departing from the spirit and scope of the invention as claimed. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. A sudden heart disease prediction method based on the Transformer-MHP model, characterized by comprising the following steps:

(1) collecting a large number of paper medical records or electronic medical records, and screening a heart disease data set;

(2) preprocessing the data of the heart disease data set obtained in the step (1) for subsequent characteristic analysis;

(4) selecting fourteen feature attributes with the highest correlation coefficient from the training set D obtained in the step (3) to construct a new training set D' for subsequent model training;

(5) constructing a model by using a Transformer-MHP, and training the model;

2. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 1, characterized in that step (2) is specifically as follows:

2.1) performing missing-value processing on all attributes of the obtained data samples by mean imputation: if the attribute is measured on a continuous numeric scale, a missing value is filled with the mean of the attribute's valid values; if the attribute is measured in discrete grades, a missing value is filled with the mode of the attribute's valid values;

2.2) standardizing the data obtained in step 2.1) by the Z-Score method, so that each processed attribute has zero mean and unit variance, on the scale of the standard normal distribution, thereby eliminating errors caused by different dimensions (units of measurement);

2.3) based on the idea of the Isolation Forest anomaly detection algorithm, recursively and randomly partitioning the heart disease data set and building local models, wherein each isolation tree is built on a subsample with specific attributes; calculating and ranking the anomaly score of each sample point; judging sample points whose anomaly score is close to 1 to be anomalies; and directly deleting the sample points marked as anomalies, thereby removing abnormal data that are sparsely distributed and far from the high-density population.
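
Steps 2.1) to 2.3) can be illustrated with a small standard-library sketch. It is a simplified reading of the steps, not the claimed implementation: all function names, the toy values, the use of `None` for a missing value, the tree count, and the 0.7 score threshold are assumptions. The anomaly score is the usual isolation-forest form 2^(-E[h(x)] / c(psi)), where h(x) is the path length of a point and c(psi) is the average path length for a subsample of size psi.

```python
import math
import random
import statistics

def impute(column, continuous):
    """Step 2.1: fill None with the mean (continuous) or the mode (graded attribute)."""
    valid = [v for v in column if v is not None]
    fill = statistics.mean(valid) if continuous else statistics.mode(valid)
    return [fill if v is None else v for v in column]

def z_score(column):
    """Step 2.2: rescale to zero mean and unit (population) standard deviation."""
    mu, sigma = statistics.mean(column), statistics.pstdev(column)
    return [(v - mu) / sigma for v in column]

def c(n):
    """Average path length of an unsuccessful binary-search-tree lookup among n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n

def build_tree(rows, depth, limit, rng):
    """Recursively split on a random attribute at a random value."""
    if depth >= limit or len(rows) <= 1:
        return ("leaf", len(rows))
    j = rng.randrange(len(rows[0]))
    lo, hi = min(r[j] for r in rows), max(r[j] for r in rows)
    if lo == hi:
        return ("leaf", len(rows))
    split = rng.uniform(lo, hi)
    left = [r for r in rows if r[j] < split]
    right = [r for r in rows if r[j] >= split]
    return ("node", j, split,
            build_tree(left, depth + 1, limit, rng),
            build_tree(right, depth + 1, limit, rng))

def path_length(tree, x, depth=0):
    if tree[0] == "leaf":
        return depth + c(tree[1])
    _, j, split, left, right = tree
    return path_length(left if x[j] < split else right, x, depth + 1)

def anomaly_scores(rows, n_trees=100, sample_size=256, seed=0):
    """Step 2.3: scores near 1 suggest anomalies (needs at least 2 rows)."""
    rng = random.Random(seed)
    psi = min(sample_size, len(rows))
    limit = math.ceil(math.log2(psi))
    trees = [build_tree(rng.sample(rows, psi), 0, limit, rng) for _ in range(n_trees)]
    return [2 ** (-statistics.mean(path_length(t, x) for t in trees) / c(psi))
            for x in rows]

# Toy values (assumptions), with None marking a missing entry.
chol = impute([200.0, None, 240.0, 220.0], continuous=True)
cp = impute([1, 2, None, 2], continuous=False)
scaled = z_score(chol)

# A dense cluster of 20 points plus one far-away point.
rows = [[0.1 * i, 0.1 * i] for i in range(20)] + [[50.0, 50.0]]
scores = anomaly_scores(rows)
cleaned = [r for r, s in zip(rows, scores) if s < 0.7]  # 0.7 is an illustrative cut-off
```

The far-away point receives the highest score and is the only one removed from `cleaned`.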

3. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 2, characterized in that step (3) is specifically as follows: based on the idea of the PCA (principal component analysis) data processing algorithm, dimensionality reduction is performed on the high-dimensional heart disease data set, retaining important features and removing noise, thereby reducing the number of attribute indexes to be analyzed and improving data processing speed; after the data are standardized, the eigenvalues λ of the sample covariance matrix and the corresponding eigenvectors u are calculated, the eigenvalues are sorted from largest to smallest, the first k eigenvalues are selected in that order, and the k corresponding eigenvectors are taken out to obtain the set {(λ₁, u₁), (λ₂, u₂), …, (λ_k, u_k)}; the original features are then projected onto the selected eigenvectors to obtain the reduced k-dimensional features.
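
A minimal NumPy sketch of this eigen-decomposition route to PCA is given below. The function name `pca`, the synthetic three-column data, and the choice k = 2 are assumptions for illustration only.

```python
import numpy as np

def pca(X, k):
    """Project X (n_samples, n_features) onto its top-k principal components."""
    Xc = X - X.mean(axis=0)                    # center each attribute
    cov = np.cov(Xc, rowvar=False)             # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)     # eigh suits symmetric matrices
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    lam, U = eigvals[order], eigvecs[:, order]
    return Xc @ U, lam, U                      # reduced features, eigenvalues, eigenvectors

# Synthetic data (assumption): the third column is almost a copy of the first,
# so two components carry nearly all of the variance.
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.column_stack([base[:, 0], base[:, 1],
                     base[:, 0] + 0.01 * rng.normal(size=100)])

Z, lam, U = pca(X, k=2)
retained = lam.sum() / np.trace(np.cov(X, rowvar=False))  # fraction of variance kept
```

The ratio `retained` is one common way to choose k, although the text does not state which criterion is used.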

4. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 3, characterized in that step (4) is specifically as follows: based on the idea of the Spearman correlation analysis algorithm, the degree of correlation between the rank-ordered heart disease attribute variables is measured to obtain rank correlation coefficients, and the correlation is evaluated in order to perform feature selection; attributes that are irrelevant or that reduce model precision are eliminated, and fourteen feature attributes are selected from the heart disease data set to form a new training set for model training.

5. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 4, characterized in that the Spearman rank correlation analysis of step (4) comprises the following steps:

4.1) ranking the data of the two variables (X, Y) to obtain their ordered positions (X', Y'), i.e. their ranks, where d_i is the difference between the two ranks of the i-th observation and n is the number of data points in each variable;

4.2) calculating the Spearman rank correlation coefficient $\rho = 1 - \frac{6\sum_{i=1}^{n} d_i^{2}}{n(n^{2}-1)}$;

4.3) analyzing the Spearman rank correlation coefficients and, through screening and investigation, finally determining fourteen feature attributes: age (age), gender (sex), chest pain type (cp), resting blood pressure (trestbps), serum cholesterol level (chol), fasting blood glucose (fbs), resting electrocardiogram (restecg), maximum heart rate (thalach), exercise-induced angina (exang), ST depression induced by exercise relative to rest (oldpeak), slope of the peak exercise ST segment (slope), number of major vessels (ca), defect (thal), and target (target), wherein "sex" is 1 for male and 0 for female; "cp" is 1 for typical angina, 2 for atypical angina, 3 for non-anginal pain, and 4 for no symptoms; "restecg" is 0 for normal, 1 for ST-T wave abnormality, and 2 for left ventricular hypertrophy; "exang" is 1 for yes and 0 for no; "slope" is 1 for an upward slope, 2 for a flat slope, and 3 for a downward slope; "ca" takes values from 0 to 3; "thal" is 0 for normal, 1 for fixed defect, and 2 for reversible defect; and "target" is 0 for a reduced chance of heart attack and 1 for an increased chance of heart attack.
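
The rank-difference computation in steps 4.1) and 4.2) can be sketched in plain Python as below, using the closed form 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)). The function names and the toy values are assumptions. Tied values share an average rank here; the closed form is exact only when there are no ties.

```python
def ranks(values):
    """Rank positions starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation from the squared rank differences d_i."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Toy values (assumption): rank differences are 0, -1, 1, -1, 1, so sum(d_i^2) = 4.
x = [10, 20, 30, 40, 50]
y = [1, 3, 2, 5, 4]
rho = spearman(x, y)   # 1 - 6*4 / (5*24) = 0.8
```

For feature screening, one plausible use is to compute this coefficient between each attribute and the target and keep the attributes with the largest absolute values.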

6. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 5, characterized in that step (5) is specifically as follows:

5.1) improving the encoder-decoder layers on the basis of the encoder-decoder structure of the conventional Transformer; the Transformer framework was proposed for the NLP field, so its input and output layers are designed to process words, and a Positional Encoding preprocessing mechanism is added to supply relative position information; the internal structure is modified so that it fully matches and processes the input of the fourteen heart disease feature attributes;

5.2) based on the idea of the highly scalable MHP parallel analysis algorithm, having the target data input to the decoder and the input data input to the encoder processed as parallel data streams, while filtering out false data races during processing;

5.3) carrying out model training based on the idea of the Wrapper recursive feature elimination algorithm: a base model is trained over multiple rounds, several features are removed according to their weight coefficients after each round, the next round is trained on the new feature set, and the iteration continues until the required number of features is obtained;

5.4) feeding the training set obtained in step (4) into the Transformer-MHP machine learning model provided by the invention and, with the data label as the output variable Y, training to obtain the recursion formula between the input variables and the output variable.

7. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 6, characterized in that step 5.1) is specifically as follows: the input Embedding layer and the Positional Encoding layer of the encoder are removed and replaced by an input of fourteen data attribute features; layer normalization is applied first to this input, normalizing the input values, which speeds up training and improves training stability; the normalized data are fed into a multi-head self-attention layer, which computes the self-attention weights of the input data and assigns them; the output of the multi-head self-attention layer is passed to a fully connected feedforward neural network; the decoder receives the output result of the encoder and the output result of a first sublayer of the encoder, applies layer normalization to the data, and finally outputs a True or False state indicating whether symptoms of sudden heart disease are present.
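
The data flow described in this claim (layer normalization, multi-head self-attention, feedforward network, a second layer normalization, True/False output) can be traced with the untrained NumPy sketch below. It is a shape-level illustration only and rests on several assumptions not stated in the text: the rows of the input matrix are treated as the sequence positions, the model width equals the fourteen features, two heads of width seven are used, the decoder side is reduced to a layer normalization followed by a sigmoid output head, and all weights are random. All function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, p, n_heads):
    d_k = x.shape[1] // n_heads
    Q, K, V = x @ p["Wq"], x @ p["Wk"], x @ p["Wv"]
    heads = []
    for h in range(n_heads):
        s = slice(h * d_k, (h + 1) * d_k)
        weights = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_k))  # no mask on the encoder side
        heads.append(weights @ V[:, s])
    return np.concatenate(heads, axis=1) @ p["Wo"]

def forward(x, p, n_heads=2):
    h = layer_norm(x)                               # 1. layer normalization of the raw features
    h = multi_head_self_attention(h, p, n_heads)    # 2. multi-head self-attention
    h = np.maximum(0.0, h @ p["W1"]) @ p["W2"]      # 3. fully connected feedforward network (ReLU)
    h = layer_norm(h)                               # 4. layer normalization before the output
    prob = 1.0 / (1.0 + np.exp(-(h @ p["w_out"])))  # 5. sigmoid head (assumption)
    return prob > 0.5                               # True / False state per record

# Random, untrained parameters (assumption): 14 features, hidden width 32.
rng = np.random.default_rng(0)
d, d_ff = 14, 32
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "Wo": (d, d),
          "W1": (d, d_ff), "W2": (d_ff, d), "w_out": (d,)}
p = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

batch = rng.normal(size=(5, d))   # five toy records
states = forward(batch, p)        # boolean array of length 5
```

Because the weights are random, the outputs carry no predictive meaning; the sketch only shows how a batch of fourteen-feature records moves through the layers in the order the claim lists them.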

8. The sudden heart disease prediction method based on the Transformer-MHP model according to claim 7, characterized in that step 5.3) is specifically as follows: for a prediction model that assigns weights to features, RFE (recursive feature elimination) selects the required features by recursively shrinking the feature set; each feature is first assigned a weight, and the prediction model is then trained on the original features; after the feature weights are obtained, their absolute values are taken and the feature with the smallest absolute weight is removed; this procedure is repeated recursively until the number of remaining features reaches the required number.