CN116959725A

CN116959725A - Disease risk prediction method based on multi-mode data fusion

Info

Publication number: CN116959725A
Application number: CN202310951791.5A
Authority: CN
Inventors: 马梦媛
Original assignee: Individual
Current assignee: Individual
Priority date: 2023-07-31
Filing date: 2023-07-31
Publication date: 2023-10-27

Abstract

The invention relates to the technical field of disease risk prediction, and in particular to a disease risk prediction method based on multi-mode data fusion. The method comprises the following steps: data preprocessing, in which data of different modalities such as medical images, genome data and electronic medical records are cleaned, standardized and normalized; feature extraction, in which features are selected and extracted from the preprocessed data using deep learning algorithms, clustering algorithms and other methods; data fusion, in which the extracted features are fused by a fusion algorithm to form comprehensive features; prediction model construction, in which a disease risk prediction model is built on the comprehensive features; and prediction output, in which risk is predicted for a new case and the prediction result is output. The invention fuses and uses multi-modal data efficiently, provides a more accurate and stable disease risk prediction method, and has practical value and broad application prospects in medicine and health care.

Description

Disease risk prediction method based on multi-mode data fusion

Technical Field

The invention relates to the technical field of disease risk prediction, in particular to a disease risk prediction method based on multi-mode data fusion.

Background

In the medical field, disease risk prediction is a vital task. It helps doctors identify high-risk groups in advance, enabling early prevention and early treatment and reducing the risk of disease onset. Existing disease risk prediction methods are mainly based on single-modality data, such as medical images, genome data or electronic medical record data.

However, because the various kinds of data differ in their characteristics and information content, data of a single modality cannot fully reflect the complexity of a disease, and the information in the other kinds of data goes unused, which limits the accuracy and stability of the prediction result.

Disclosure of Invention

In view of the above problem, the present invention provides a disease risk prediction method based on multi-modal data fusion.

A disease risk prediction method based on multi-mode data fusion comprises the following steps:

step one: data preprocessing, in which data of different modalities, namely medical images, genome data and electronic medical records, are cleaned, standardized and normalized;

step two: feature extraction, in which features are selected and extracted from the preprocessed data using a deep learning algorithm and a clustering algorithm;

step three: data fusion, in which the extracted features are fused by a fusion algorithm to form comprehensive features, followed by prediction model construction, in which a disease risk prediction model is built on the comprehensive features;

step four: prediction output, in which risk is predicted for a new case and the prediction result is output.

Further, the data preprocessing step comprises cleaning, standardizing and normalizing the medical image data, genome data and electronic medical record data, wherein:

data cleaning includes identifying and handling missing data, duplicate data and outliers to reduce their negative impact on the prediction result;

standardization converts data with different measurement units or scales into unitless relative values, so as to eliminate the influence of differing units and allow data from different sources to be compared and fused;

normalization converts the data into a unified value range, so as to eliminate the influence of differing value ranges and ensure stable model training and reliable prediction results.
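The three preprocessing operations can be sketched as follows. This is a minimal illustration under stated assumptions, not the method's actual implementation: the text does not name specific formulas, so z-score standardization, min-max normalization and a clipping threshold of 3 standard deviations are assumed here, and the sample records (height, weight) are hypothetical.

```python
import math

def clean(rows):
    """Drop rows with missing values (None/NaN) and exact duplicate rows."""
    seen, kept = set(), []
    for row in rows:
        if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in row):
            continue
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            kept.append(list(row))
    return kept

def standardize(column):
    """Z-score: unitless values with mean 0 and (population) std 1."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column)) or 1.0
    return [(v - mean) / std for v in column]

def clip_outliers(z_scores, limit=3.0):
    """Clamp standardized values to [-limit, limit]; limit is an assumed choice."""
    return [max(-limit, min(limit, z)) for z in z_scores]

def min_max(column):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(column), max(column)
    span = (hi - lo) or 1.0  # constant columns map to 0.0
    return [(v - lo) / span for v in column]

# Hypothetical records: [height_cm, weight_kg]
rows = [[170.0, 65.0], [170.0, 65.0], [180.0, None], [160.0, 55.0], [190.0, 85.0]]
cleaned = clean(rows)                       # duplicate and incomplete rows removed
heights = [r[0] for r in cleaned]
z_heights = clip_outliers(standardize(heights))
scaled_heights = min_max(heights)
```

Standardization and normalization are kept as separate functions because the text treats them as distinct operations: the first removes units, the second fixes the value range.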

Further, the feature extraction step uses a deep learning algorithm, a clustering algorithm and a feature selection algorithm, wherein:

the deep learning algorithm extracts spatial features from the medical image data;

the clustering algorithm processes the electronic medical record data, grouping the records by similarity and extracting group features;

the feature selection algorithm selects, from the extracted features, those with the greatest influence on the prediction result.
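The clustering and feature selection operations can be illustrated with a short sketch. The text does not name specific algorithms, so k-means and ranking by absolute Pearson correlation are used here as assumed stand-ins; the deep learning extractor for images is omitted, and all data values are hypothetical.

```python
import math

def kmeans(points, k, iters=20):
    """Plain k-means; the first k points seed the centroids (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_features(X, y, n_keep):
    """Return indices of the n_keep columns most correlated with the label."""
    scores = [abs(pearson(list(col), y)) for col in zip(*X)]
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# Hypothetical medical-record vectors grouped by similarity.
records = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9]]
groups, centers = kmeans(records, k=2)

# Hypothetical feature matrix and binary outcome labels.
X = [[0.1, 0.9, 0.5], [0.2, 0.1, 0.5], [0.8, 0.8, 0.5], [0.9, 0.2, 0.5]]
y = [0, 0, 1, 1]
kept = select_top_features(X, y, n_keep=1)
```

The cluster label of each record (or its distance to each centroid) can then serve as a group feature for that record.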

Further, the data fusion step uses linear fusion, which integrates the features of the different modalities, namely the features of the medical images, genome data and electronic medical records.

Further, the linear fusion operates as follows:

first, the feature matrices from the different modalities are normalized so that every feature takes values in the same range; this keeps the weighting fair during fusion, so that features with a large value range do not dominate the fusion result;

next, the features of the different modalities are weighted and summed, using preset weights or weights learned in training; the weights are determined by cross-validation so that the contribution of the fused features to the prediction model is maximized, and the importance of each modality's features for disease risk prediction is taken into account when determining them;

finally, the feature matrix obtained by linear fusion is used as the input data for training the prediction model and for making predictions.
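A minimal sketch of the linear fusion described above follows. It assumes, beyond what the text states, that every modality has already been reduced to a feature matrix of the same shape so that a weighted sum is well defined; the modality names, weights and values are hypothetical, and `cv_score` stands for any user-supplied function that returns a cross-validated score for a candidate weight setting.

```python
def min_max_columns(matrix):
    """Scale every column of a samples-by-features matrix to [0, 1]."""
    scaled_cols = []
    for col in zip(*matrix):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0  # constant columns map to 0.0
        scaled_cols.append([(v - lo) / span for v in col])
    return [list(row) for row in zip(*scaled_cols)]

def linear_fuse(modalities, weights):
    """Weighted sum of normalized feature matrices.

    modalities: dict name -> matrix (n_samples x d), same shape for all (assumed)
    weights:    dict name -> non-negative weight; rescaled here to sum to 1
    """
    total = sum(weights.values())
    scaled = {name: min_max_columns(m) for name, m in modalities.items()}
    first = next(iter(scaled.values()))
    fused = [[0.0] * len(first[0]) for _ in first]
    for name, matrix in scaled.items():
        w = weights[name] / total
        for i, row in enumerate(matrix):
            for j, value in enumerate(row):
                fused[i][j] += w * value
    return fused

def pick_weights(candidates, cv_score):
    """Keep the candidate weight setting with the best cross-validated score."""
    return max(candidates, key=cv_score)

# Hypothetical two-sample, two-feature matrices per modality.
modalities = {
    "image":  [[0.0, 10.0], [2.0, 30.0]],
    "genome": [[5.0, 1.0], [15.0, 3.0]],
    "emr":    [[100.0, 7.0], [300.0, 7.0]],
}
weights = {"image": 0.5, "genome": 0.3, "emr": 0.2}
fused = linear_fuse(modalities, weights)
```

When modalities have feature vectors of different lengths, a common alternative is to scale each modality by its weight and concatenate instead of summing; the text does not specify which variant is intended.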

Further, the data fusion step uses model-based fusion, which operates as follows:

first, the features extracted from each modality are input into a separate prediction model suited to the characteristics of that data: for medical image data, a convolutional neural network (CNN) model is used; for genome data, a recurrent neural network (RNN) model is used; for electronic medical record data, a support vector machine (SVM) model is used;

then, each model's disease risk prediction is taken as a new feature, and these features are combined into a prediction result feature matrix;

next, the prediction result feature matrix is input into a new prediction model, which is a linear regression model that learns how best to combine the predictions of the individual models so as to obtain the most accurate disease risk prediction;

finally, the new prediction model is used to predict unknown data, and the result is the disease risk prediction based on multi-modal data fusion.
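The second stage of this model-based fusion (often called stacking) can be sketched as below. The per-modality models are not implemented: their outputs are represented by hypothetical lists of risk scores, and the target values are synthetic, built from a known combination so that the fit can be checked. The least-squares solver is one assumed way to fit the linear regression model.

```python
def fit_linear_regression(X, y):
    """Ordinary least squares via the normal equations (A^T A) w = A^T y."""
    A = [[1.0] + list(row) for row in X]  # prepend an intercept column
    n = len(A[0])
    # Augmented system [A^T A | A^T y].
    M = [
        [sum(r[i] * r[j] for r in A) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(A, y))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]  # [intercept, w1, w2, ...]

def predict(coefs, row):
    """Apply the fitted meta-model to one row of base-model predictions."""
    return coefs[0] + sum(w * v for w, v in zip(coefs[1:], row))

# Hypothetical base-model risk predictions for five training cases.
image_pred = [0.9, 0.2, 0.7, 0.1, 0.5]   # stand-in for CNN output
genome_pred = [0.8, 0.3, 0.4, 0.2, 0.9]  # stand-in for RNN output
emr_pred = [0.7, 0.1, 0.9, 0.3, 0.2]     # stand-in for SVM output

# Prediction result feature matrix: one row per case, one column per model.
stacked = [list(t) for t in zip(image_pred, genome_pred, emr_pred)]

# Synthetic targets from a known combination, for demonstration only.
targets = [0.5 * a + 0.3 * b + 0.2 * c for a, b, c in stacked]

meta = fit_linear_regression(stacked, targets)
new_risk = predict(meta, [0.6, 0.5, 0.4])  # base-model outputs for a new case
```

In practice the base-model predictions used to train the meta-model are usually produced on held-out data, so that the meta-model does not learn from overfitted base predictions.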

Further, the method includes a step of validating the prediction model, in which the model is evaluated on an independent test data set to determine its performance; this step includes calculating accuracy, recall and the area under the ROC curve (AUC) to evaluate the model's predictive performance comprehensively.
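The three evaluation metrics can be computed as in the following sketch. The labels and scores are hypothetical test-set values, the 0.5 decision threshold is an assumption, and the AUC is computed by its pairwise-ranking definition (the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half).

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(y_true, y_pred):
    """Fraction of truly positive cases that the model flags as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical independent test set: true outcomes and predicted risk scores.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7, 0.8, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Accuracy and recall depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three are reported together.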

Further, the output prediction result comprises a specific disease risk value, a disease risk grade and a confidence interval for the disease risk.
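One way to produce the three outputs is sketched below. The text does not say how the confidence interval is obtained, so this sketch assumes a set of risk estimates for the same case (for example from models trained on bootstrap resamples) and takes a percentile interval; the grade cut-offs 0.3 and 0.7 and the grade names are hypothetical.

```python
import math

def summarize_risk(estimates, low_cut=0.3, high_cut=0.7, level=0.95):
    """Turn repeated risk estimates for one case into value, grade and interval.

    estimates: risk values in [0, 1] for the same case (assumed to come from
               resampled or ensemble models)
    low_cut, high_cut: hypothetical grade boundaries
    """
    ordered = sorted(estimates)
    n = len(ordered)
    value = sum(ordered) / n
    # Percentile interval; indices are rounded outward to stay conservative.
    tail = (1.0 - level) / 2.0
    lower = ordered[math.floor(tail * (n - 1))]
    upper = ordered[math.ceil((1.0 - tail) * (n - 1))]
    if value < low_cut:
        grade = "low"
    elif value < high_cut:
        grade = "medium"
    else:
        grade = "high"
    return {"risk": value, "grade": grade, "interval": (lower, upper)}

# Hypothetical estimates for a single new case.
report = summarize_risk([0.62, 0.58, 0.66, 0.60, 0.64])
```

With only a handful of estimates the interval collapses to the minimum and maximum; a meaningful 95% interval needs many more resamples.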

The invention has the beneficial effects that:

through feature extraction, data fusion, model construction and validation, the invention integrates multi-modal data from medical images, genome data and electronic medical records. Compared with traditional single-modality disease risk prediction methods, the method considers and uses data of several modalities more comprehensively, improving the accuracy and stability of the prediction result.

Through model-based fusion, the method not only combines the information of each modality but can also discover potential associations between modalities, further improving predictive performance. In addition, periodically updating and optimizing the prediction model helps keep its performance high and lets it adapt to changes in medical data and disease patterns.

Drawings

In order to illustrate the invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below relate only to the present invention, and a person skilled in the art can derive other drawings from them without inventive effort.

FIG. 1 is a schematic flow chart of a prediction method according to an embodiment of the present invention.

Detailed Description

The present invention will be further described in detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent.

It is to be noted that unless otherwise defined, technical or scientific terms used herein should be taken in a general sense as understood by one of ordinary skill in the art to which the present invention belongs. The terms "first," "second," and the like, as used herein, do not denote any order, quantity, or importance, but rather are used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that elements or items preceding the word are included in the element or item listed after the word and equivalents thereof, but does not exclude other elements or items. The terms "connected" or "connected," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", etc. are used merely to indicate relative positional relationships, which may also be changed when the absolute position of the object to be described is changed.

Example 1

As shown in FIG. 1, a disease risk prediction method based on multi-modal data fusion includes the following steps:

step one: data preprocessing, in which data of different modalities such as medical images, genome data and electronic medical records are cleaned, standardized and normalized;

step two: feature extraction, in which features are selected and extracted from the preprocessed data using deep learning algorithms, clustering algorithms and other methods;

step four: prediction output, in which risk is predicted for a new case and the prediction result is output;

in this way, the method not only uses data of different modalities but also improves the accuracy of disease risk prediction through feature extraction and fusion.

The data preprocessing step comprises cleaning, standardizing and normalizing the medical image data, genome data and electronic medical record data, wherein:

normalization converts the data into a uniform value range (for example, between 0 and 1), so as to eliminate the influence of differing value ranges and ensure stable model training and reliable prediction results;

the aim of this step is to bring the data of the different modalities onto a common standard for subsequent processing, which further improves the accuracy and stability of disease risk prediction.

The feature extraction step uses a deep learning algorithm, a clustering algorithm and a feature selection algorithm, wherein:

the feature selection algorithm selects, from the extracted features, those with the greatest influence on the prediction result;

the selected features constitute a feature set intended to give the best predictive performance while avoiding overfitting. The method extracts representative features, improves the accuracy of the prediction result, and remains efficient and practical on large data sets.

In addition, the data fusion step can also use a deep fusion method, for example fusing the features with a deep neural network (DNN), which can extract higher-level, more abstract fused features through the nonlinear transformations and mappings of the network. In this way the data of the different modalities are used fully, and prediction accuracy and stability improve. This step is the core of the method: it determines whether the prediction model can use the information of every modality effectively and thereby outperform any single-modality prediction.
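The deep fusion variant can be illustrated with a forward pass through a very small network. This is only a sketch under assumptions the text does not state: the modality features are concatenated before entering the network, there is a single hidden layer with ReLU activation and a sigmoid output, and the weights are random and untrained, so the resulting number has no predictive meaning. All feature values are hypothetical.

```python
import math
import random

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: activation(W x + b)."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def init_layer(n_in, n_out, rng):
    """Random, untrained parameters for a layer (illustration only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def deep_fuse(image_feat, genome_feat, emr_feat, params):
    """Concatenate modality features, then map them nonlinearly to a risk score."""
    x = image_feat + genome_feat + emr_feat        # concatenation of the modalities
    (w1, b1), (w2, b2) = params
    hidden = dense(x, w1, b1, relu)                # nonlinear fused representation
    return dense(hidden, w2, b2, sigmoid)[0]       # scalar in (0, 1)

rng = random.Random(0)                             # fixed seed for repeatability
params = [init_layer(6, 4, rng), init_layer(4, 1, rng)]
risk = deep_fuse([0.2, 0.7], [0.1, 0.9], [0.5, 0.3], params)
```

In a real system the parameters would be learned from labelled cases; the hidden layer is where cross-modality interactions that a weighted sum cannot express are captured.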

The linear fusion operates as follows:

first, the feature matrices from the different modalities are normalized so that every feature takes values in the same range, for example 0 to 1; this keeps the weighting fair during fusion, so that features with a large value range do not dominate the fusion result;

next, the features of the different modalities are weighted and summed, using preset weights or weights learned in training; the weights are determined by cross-validation so that the contribution of the fused features to the prediction model is maximized, and the importance of each modality's features for disease risk prediction is taken into account when determining them; for example, for certain diseases genome data may be more important than medical image data;

finally, the feature matrix obtained by linear fusion is used as the input data for training the prediction model and for making predictions, which lets the model use the multi-modal information fully and improves the accuracy and stability of prediction;

with this linear fusion, information from the different modalities is combined effectively, making the disease risk prediction more accurate and stable.

The step of validating the prediction model comprises evaluating the model on an independent test data set to determine its performance, including calculating accuracy, recall and the area under the ROC curve (AUC) to evaluate the model's predictive performance comprehensively. This step shows how the model performs on unseen data, which helps confirm its generalization ability and detect overfitting.

The output prediction result comprises a specific disease risk value, a disease risk grade and a confidence interval for the disease risk. This form of output gives doctors more comprehensive and specific information and helps them make more accurate diagnosis and treatment decisions.

Example 2

The method comprises the following steps:

The data fusion step uses model-based fusion, which operates as follows:

then, each model's disease risk prediction is taken as a new feature, and these features are combined into a prediction result feature matrix; this step rests on the observation that different models' predictions for the same task often carry different information, and combining that information can yield a richer and more accurate prediction;

finally, the new prediction model is used to predict unknown data, and the result is the disease risk prediction based on multi-modal data fusion;

model-based fusion combines data from different modalities and can discover potential associations between them, improving the accuracy and robustness of disease risk prediction. The approach is well suited to high-dimensional, complex medical data and is an important means of implementing the present invention.

Those of ordinary skill in the art will appreciate that: the discussion of any of the embodiments above is merely exemplary and is not intended to suggest that the scope of the invention is limited to these examples; the technical features of the above embodiments or in the different embodiments may also be combined within the idea of the invention, the steps may be implemented in any order and there are many other variations of the different aspects of the invention as described above, which are not provided in detail for the sake of brevity.

The present invention is intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the present invention should be included in the scope of the present invention.

Claims

1. The disease risk prediction method based on multi-mode data fusion is characterized by comprising the following steps of:

step three: data fusion, wherein the extracted features are fused by using a fusion algorithm to form a comprehensive feature;

step four: constructing a prediction model, and constructing a disease risk prediction model based on comprehensive characteristics;

step five: and outputting a prediction result, predicting risk of a new case, and outputting the prediction result.

2. The method for predicting disease risk by multi-modal data fusion as defined in claim 1, wherein the data preprocessing step includes cleaning, standardizing and normalizing the medical image data, genome data and electronic medical record data,

3. The method for predicting disease risk by multi-modal data fusion as defined in claim 1, wherein the feature extraction step employs a deep learning algorithm, a clustering algorithm and a feature selection algorithm,

4. The method for predicting disease risk by multi-modal data fusion according to claim 1, wherein the data fusion step is performed by linear fusion, and the linear fusion is used for integrating features of different modalities, including features of medical images, genome data and electronic medical records.

5. The method for predicting disease risk by multi-modal data fusion according to claim 4, wherein the linear fusion specifically operates as follows:

6. The disease risk prediction method based on multi-modal data fusion according to claim 1, wherein the data fusion step is based on model fusion, and the operation flow is as follows:

first, the features extracted from each modality are input into a separate prediction model suited to the characteristics of that data: for medical image data, a convolutional neural network model is used; for genome data, a recurrent neural network model is used; for electronic medical record data, a support vector machine model is used;

7. The method of claim 5, wherein the step of validating the model comprises evaluating the model using an independent test data set to determine the model performance, the step comprising calculating the accuracy, recall, and area under ROC curve metrics to fully evaluate the model's predicted performance.

8. The method for predicting disease risk by multi-modal data fusion according to claim 1, wherein the output prediction result includes a specific disease risk value, a disease risk grade and a confidence interval for the disease risk.