CN114121291A - Disease grading prediction method and device, electronic equipment and storage medium - Google Patents

Disease grading prediction method and device, electronic equipment and storage medium

Info

Publication number
CN114121291A
Authority
CN
China
Prior art keywords
target
feature
preset
user
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111250817.0A
Other languages
Chinese (zh)
Inventor
梁爽
赵成
刘岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Taikang Insurance Group Co Ltd
Original Assignee
Taikang Insurance Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Taikang Insurance Group Co Ltd filed Critical Taikang Insurance Group Co Ltd
Priority to CN202111250817.0A priority Critical patent/CN114121291A/en
Publication of CN114121291A publication Critical patent/CN114121291A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00 ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60 ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/0033 Features or image-related aspects of imaging apparatus classified in A61B5/00, e.g. for MRI, optical tomography or impedance tomography apparatus; arrangements of imaging apparatus in a room
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/05 Detecting, measuring or recording for diagnosis by means of electric currents or magnetic fields; Measuring using microwaves or radio waves
    • A61B5/055 Detecting, measuring or recording for diagnosis by means of electric currents or magnetic fields; Measuring using microwaves or radio waves involving electronic [EMR] or nuclear [NMR] magnetic resonance, e.g. magnetic resonance imaging
    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B5/00 Measuring for diagnostic purposes; Identification of persons
    • A61B5/48 Other medical applications
    • A61B5/4842 Monitoring progression or stage of a disease
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/0002 Inspection of images, e.g. flaw detection
    • G06T7/0012 Biomedical image inspection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/50 Mutagenesis
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/70 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10072 Tomographic images
    • G06T2207/10088 Magnetic resonance imaging [MRI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30004 Biomedical image processing
    • G06T2207/30096 Tumor; Lesion

Landscapes

  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Pathology (AREA)
  • Biophysics (AREA)
  • Nuclear Medicine, Radiotherapy & Molecular Imaging (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Veterinary Medicine (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Radiology & Medical Imaging (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • Primary Health Care (AREA)
  • Epidemiology (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • High Energy & Nuclear Physics (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Genetics & Genomics (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides a disease grading prediction method, a disease grading prediction device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user, the target users comprising first users in a preset training set and second users in a preset test set; determining target characteristics of the target user according to the target image characteristics, the target gene characteristics and the target clinical characteristics; training a preset grading prediction model based on the target characteristics and the preset labels corresponding to the first users to obtain a target grading prediction model; and inputting the target characteristics of a second user into the target grading prediction model to obtain the target grading corresponding to the second user. Because the target grading prediction model is obtained through training and grading prediction is performed automatically by this model, interference from human factors is reduced and the accuracy of disease grading prediction is improved.

Description

Disease grading prediction method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of medical image analysis, and in particular, to a disease grading prediction method, apparatus, electronic device, and storage medium.
Background
A brain tumor is formed by abnormal cells that divide and grow uncontrollably in brain tissue; its incidence is high, its mortality rate exceeds 3 percent, and it seriously endangers human health. Glioma is one of the most common intracranial brain tumors and, according to the World Health Organization classification, is divided into four grades, I-IV: grades I-II are low-grade gliomas and grades III-IV are high-grade gliomas. High-grade gliomas (HGGs), such as glioblastoma multiforme (GBM), have a mean survival time of 23 months, a two-year survival rate of 47.4%, and a four-year survival rate of only 18.5%. Low-grade gliomas (LGGs), such as oligodendroglioma and astrocytoma, have a ten-year survival rate of 57%. Accurate grading of brain glioma is therefore a prerequisite for saving patients' lives, and has positive clinical significance for treatment decisions, for the monitoring and management of chemoradiotherapy, and for prognosis evaluation.
For brain tumors, Magnetic Resonance Imaging (MRI) is a typical non-invasive imaging technique: it generates high-quality brain images without tissue damage or skull artifacts, provides comprehensive information for brain tumor analysis, and is the main technical means for analyzing and processing brain tumors.
In the prior art, diseases such as brain tumors are generally graded by a radiologist who qualitatively analyzes MRI images in combination with personal experience. Such experience-based grading is strongly affected by human factors, misgrading occurs frequently, and the accuracy of disease grading is therefore low.
Disclosure of Invention
The embodiment of the invention provides a disease grading prediction method and device, electronic equipment and a storage medium, and aims to solve the problem that the accuracy of existing manual disease grading is low.
In order to solve the above problem, the embodiment of the present invention is implemented as follows:
in a first aspect, an embodiment of the present invention discloses a disease grading prediction method, including:
acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user; the target users comprise first users in a preset training set and second users in a preset test set;
determining a target feature of the target user according to the target image feature, the target gene feature and the target clinical feature;
training a preset grading prediction model based on the target feature and a preset label corresponding to the first user to obtain a target grading prediction model;
and inputting the target characteristics of the second user into the target grading prediction model to obtain the target grading corresponding to the second user.
Optionally, the obtaining of the target image feature of the target user includes:
acquiring a target image of the target user;
segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image;
and extracting the image characteristics of the target area according to a preset image processing algorithm to obtain the target image characteristics of the target user.
Optionally, the acquiring target gene characteristics of the target user includes:
obtaining target genomics data of the target user;
selecting candidate genes from the target genomics data based on the type of the target disease; the candidate gene is related to the target disease and/or the mutation rate of the candidate gene in the target user is higher than a first preset threshold value;
and extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user.
Optionally, the acquiring the target clinical characteristics of the target user includes:
acquiring preset clinical data of the target user;
according to the type of the target disease, screening preset clinical data related to the target disease from the preset clinical data to obtain target clinical information;
and extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user.
Optionally, the determining the target feature of the target user according to the target image feature, the target gene feature and the target clinical feature includes:
merging the target image features, the target gene features and the target clinical features to obtain target full-scale features;
selecting target features of the target user from the target full-scale features according to a preset feature selection algorithm; the target features are the features whose importance degrees rank in the top N among the target full-scale features; N is the target number of the target features; and N is an integer greater than 0.
Optionally, the merging the target image feature, the target gene feature and the target clinical feature includes:
vectorizing the target image features, the target gene features and the target clinical features respectively to obtain target image feature vectors, target gene feature vectors and target clinical feature vectors;
and merging the target image feature vector, the target gene feature vector and the target clinical feature vector to obtain a first target full-scale feature vector corresponding to the target full-scale feature.
Optionally, the selecting, according to a preset feature selection algorithm, a target feature of the target user from the target full-scale features includes:
normalizing the first target full-scale feature vector to obtain a second target full-scale feature vector corresponding to the first target full-scale feature;
determining the importance degree of each feature in the second target full-scale feature vector according to the preset feature selection algorithm;
determining a target number N corresponding to the target feature according to the importance degree and the preset feature selection algorithm;
and screening the features whose importance degrees rank in the top N in the second target full-scale feature vector as the target features.
Optionally, the determining of the target number N corresponding to the target features includes:
sequentially selecting different numbers of features according to the ranking of the importance degrees;
inputting the features of each quantity into the preset feature selection algorithm for verification respectively, and determining the average accuracy corresponding to each feature quantity respectively;
and taking the feature quantity with the highest average accuracy as a target quantity N corresponding to the target feature.
In a second aspect, an embodiment of the present invention discloses a disease grading prediction apparatus, including:
the acquisition module is used for acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user; the target users comprise first users in a preset training set and second users in a preset test set;
a determination module, configured to determine a target feature of the target user according to the target image feature, the target gene feature, and the target clinical feature;
the training module is used for training a preset grading prediction model based on the target characteristics and a preset label corresponding to the first user to obtain a target grading prediction model;
and the input module is used for inputting the target characteristics of the second user into the target grading prediction model to obtain the target grading corresponding to the second user.
Optionally, the obtaining module is specifically configured to:
acquiring a target image of the target user;
segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image;
and extracting the image characteristics of the target area according to a preset image processing algorithm to obtain the target image characteristics of the target user.
Optionally, the obtaining module is further specifically configured to:
obtaining target genomics data of the target user;
selecting candidate genes from the target genomics data based on the type of the target disease; the candidate gene is related to the target disease and/or the mutation rate of the candidate gene in the target user is higher than a first preset threshold value;
and extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user.
Optionally, the obtaining module is further specifically configured to:
acquiring preset clinical data of the target user;
according to the type of the target disease, screening preset clinical data related to the target disease from the preset clinical data to obtain target clinical information;
and extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user.
Optionally, the determining module is specifically configured to:
merging the target image features, the target gene features and the target clinical features to obtain target full-scale features;
selecting target features of the target user from the target full-scale features according to a preset feature selection algorithm; the target features are the features whose importance degrees rank in the top N among the target full-scale features; N is the target number of the target features; and N is an integer greater than 0.
Optionally, the determining module is specifically configured to: vectorizing the target image features, the target gene features and the target clinical features respectively to obtain target image feature vectors, target gene feature vectors and target clinical feature vectors;
and merging the target image feature vector, the target gene feature vector and the target clinical feature vector to obtain a first target full-scale feature vector corresponding to the target full-scale feature.
Optionally, the determining module is further specifically configured to:
normalizing the first target full-scale feature vector to obtain a second target full-scale feature vector corresponding to the first target full-scale feature;
determining the importance degree of each feature in the second target full-scale feature vector according to the preset feature selection algorithm;
determining a target number N corresponding to the target feature according to the importance degree and the preset feature selection algorithm;
and screening the features whose importance degrees rank in the top N in the second target full-scale feature vector as the target features.
Optionally, the determining module is further specifically configured to:
sequentially selecting different numbers of features according to the ranking of the importance degrees;
inputting the features of each quantity into the preset feature selection algorithm for verification respectively, and determining the average accuracy corresponding to each feature quantity respectively;
and taking the feature quantity with the highest average accuracy as a target quantity N corresponding to the target feature.
In a third aspect, an embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the disease grading prediction method according to the first aspect.
In a fourth aspect, the present invention further provides a computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the disease grading prediction method according to the first aspect.
In the embodiment of the invention, the target image characteristics, the target gene characteristics and the target clinical characteristics of a target user are obtained; the target users comprise first users in a preset training set and second users in a preset test set; target characteristics of the target user are determined according to the target image characteristics, the target gene characteristics and the target clinical characteristics; a preset grading prediction model is trained based on the target characteristics and the preset labels corresponding to the first users to obtain a target grading prediction model; and the target characteristics of the second user are input into the target grading prediction model to obtain the target grading corresponding to the second user. Therefore, the target grading prediction model is obtained by training with the target image characteristics, the target gene characteristics and the target clinical characteristics of the target user, and automatic grading prediction is performed based on the target grading prediction model, so that the interference of human factors is reduced and the accuracy of disease grading prediction is improved.
Drawings
FIG. 1 is a flow chart illustrating the steps of a disease grading prediction method according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating disease grading prediction according to an embodiment of the present invention;
FIG. 3 is a block diagram showing a disease grading prediction apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, a flow chart of the steps of a disease grading prediction method of the present invention is shown. The execution subject in the embodiment of the present invention may be a computing device, such as a medical device, an intelligent device or a server, which is not specifically limited in the embodiment of the present invention. The disease grading prediction method specifically comprises the following steps:
step 101, acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user; the target users comprise first users in a preset training set and second users in a preset testing set.
In an embodiment of the present invention, the target user may refer to a target patient with a disease. Before the final target grading prediction model is obtained, the corresponding training and testing processes need to be completed based on the relevant data of target users. The first users in the preset training set may be used for training the model, and the second users in the preset test set may be used for testing the accuracy of the model.
The target image features may refer to relevant features of an MRI image of the target user. For example, the target image features may include first-order statistic features, shape features, texture features and the like, where the first-order statistic features may include the mean value, the variance and the like, the shape features may include the volume, the surface area and the like, and the texture features may include the gray level co-occurrence matrix, the gray level size zone matrix and the like. The specific features may be determined based on the kind of the actual disease and the specific grading prediction requirement, which is not limited in the embodiment of the present invention.
The target gene characteristics may refer to gene information of a target user, and may specifically include a gene expression level, a gene mutation state, and the like. The target clinical characteristics may refer to clinical data of the target user, and may specifically include data of the target user such as age, sex, height, weight, and blood pressure, and the embodiments of the present invention are not limited to specific types of the target gene characteristics and the target clinical characteristics.
The traditional manual film-reading approach can only rely on manual experience and part of the histological grading characteristics visible in the medical image. The gene features introduced in the embodiments of the invention reflect changes in molecular markers associated with treatment and prognosis, which are more critical predictors. Illustratively, several molecular genetic markers (including IDH1/2 mutation, TP53 mutation, MGMT promoter methylation, 1p/19q co-deletion, etc.) play an important role in tumorigenesis. In the embodiment of the invention, the target image features and the target gene features are fused and combined with the target clinical features of the target user, so that the advantages of both imaging and genomics are exploited, molecular-level grading information is fused into the imaging-based method, and the actual clinical data of the target user are taken into account, thereby further improving the accuracy of disease grading prediction.
And 102, determining the target characteristics of the target user according to the target image characteristics, the target gene characteristics and the target clinical characteristics.
In the embodiment of the present invention, the target features may refer to the features of the target user that are used for model training. There may be many target image features, target gene features and target clinical features for a target user; in this step, the features that are more important for disease grading prediction are screened out through feature fusion and extraction to obtain the target features, and the model is subsequently trained based on these target features, so that the accuracy of model prediction can be improved.
Step 103, training a preset grading prediction model based on the target characteristics and a preset label corresponding to the first user to obtain a target grading prediction model.
In the embodiment of the present invention, the preset label may be a grading label corresponding to a first user in the preset training set, and may be determined and annotated in advance. The preset grading prediction model may refer to a preset classification model, specifically a Support Vector Machine (SVM) classification model or the like; of course, other classifiers may also be used. The target grading prediction model refers to the trained grading prediction model that can be used for prediction.
In this step, after the target features of the target users are obtained, the preset grading prediction model may be trained based on the data of the first users in the preset training set: the target features of a plurality of first users and the preset labels of those first users are input into the preset grading prediction model, and the training process of the preset grading prediction model is executed to obtain the final target grading prediction model.
And 104, inputting the target characteristics of the second user into the target grading prediction model to obtain the target grading corresponding to the second user.
In an embodiment of the present invention, the target grading may refer to the prediction result of the disease grade of the second user. After the preset grading prediction model is trained and the target grading prediction model is obtained, the target features of the second users in the preset test set can be input into the target grading prediction model to obtain the target grading of each second user. When disease grading prediction subsequently needs to be carried out for other patients, the target image features, target gene features and target clinical features of those patients can be collected, their target features can be obtained based on these features, and the target features can be input into the target grading prediction model to obtain the corresponding disease grades. The method is therefore convenient and rapid to use, and its accuracy is high.
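For illustration only, the following Python sketch outlines how steps 101 to 104 could be realized with an off-the-shelf SVM classifier; the use of scikit-learn, the linear kernel and the toy data are assumptions of this sketch rather than part of the disclosure.

```python
# A minimal sketch of steps 101-104, assuming scikit-learn and an SVM as the
# "preset grading prediction model"; the library, kernel and toy data are
# illustrative assumptions, not part of the disclosure.
import numpy as np
from sklearn.svm import SVC

def train_grading_model(train_features: np.ndarray, train_labels: np.ndarray) -> SVC:
    """Train the preset grading prediction model on the first users (training set)."""
    model = SVC(kernel="linear")  # assumed kernel; the disclosure does not specify one
    model.fit(train_features, train_labels)
    return model

def predict_grading(model: SVC, test_features: np.ndarray) -> np.ndarray:
    """Predict the target grading for the second users (test set)."""
    return model.predict(test_features)

# Toy usage: 10 training users and 3 test users, 19 selected target features each.
rng = np.random.default_rng(0)
X_train = rng.random((10, 19))
y_train = np.array([0, 1] * 5)  # 0 = low grade, 1 = high grade
X_test = rng.random((3, 19))
model = train_grading_model(X_train, y_train)
print(predict_grading(model, X_test))
```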
In the embodiment of the invention, the target image characteristics, target gene characteristics and target clinical characteristics of a target user are obtained; the target users comprise first users in a preset training set and second users in a preset test set; the target characteristics of the target user are determined according to the target image characteristics, the target gene characteristics and the target clinical characteristics; a preset grading prediction model is trained based on the target characteristics and the preset labels corresponding to the first users to obtain a target grading prediction model; and the target characteristics of the second user are input into the target grading prediction model to obtain the target grading corresponding to the second user. Therefore, the target grading prediction model is obtained by training with the target image characteristics, the target gene characteristics and the target clinical characteristics of the target user, and automatic grading prediction is performed based on the target grading prediction model, so that the interference of human factors is reduced and the accuracy of disease grading prediction is improved.
Optionally, in the embodiment of the present invention, the obtaining of the target image feature of the target user in step 101 may be specifically implemented through the following steps S21 to S23:
and step S21, acquiring a target image of the target user.
In an embodiment of the present invention, the target image may refer to a multi-modal MRI magnetic resonance image of the target user (including four sequences: T1, FLAIR, T1c and T2). T1 and T2 denote the two magnetic-resonance relaxation weightings: a T1-weighted sequence is mainly used to show anatomical structure, and a T2-weighted sequence is mainly used to show lesions. FLAIR is a fluid-attenuated inversion recovery sequence, also known as water-suppression imaging. A T1c sequence is acquired after a contrast agent is injected into the blood before the MR scan; bright regions indicate a rich blood supply, and strongly enhanced regions correspond to rapid blood flow, which is characteristic of tumor tissue. The T1c sequence can therefore further indicate intratumoral conditions and distinguish tumors from non-neoplastic lesions. Acquiring the multi-modal target image ensures the comprehensiveness and integrity of the image data and improves the accuracy of the subsequent grading prediction.
Illustratively, when performing grading prediction for the target disease brain glioma, the embodiment of the present invention may acquire and use the TCGA-LGG low-grade brain glioma data set and the TCGA-GBM high-grade brain glioma data set, which together contain pre-operative multi-modal MRI magnetic resonance images of 65 low-grade brain tumor patients and 102 high-grade brain tumor patients, and perform feature extraction and the subsequent model training and testing processes on them.
And step S22, segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image.
In the embodiment of the present invention, the preset image segmentation algorithm may be used for image segmentation, and specifically may refer to a Semantic Feature Pyramid Network (SFPN) deep learning model, and the like. The target region may refer to a disease region in the target image, and for a brain tumor, the target region may refer to a tumor region.
In this step, after the target image is obtained, the target image may be preprocessed. The preprocessing may specifically include size scaling and normalization (Z-score normalization), which scale the target images to a uniform size and processing standard, thereby ensuring the accuracy of subsequent feature extraction and avoiding, to the greatest extent, the influence of irrelevant factors such as image size on the accuracy of grading prediction. Then, the SFPN deep learning model is used to automatically segment the target image at the pixel level, classifying the region to which each pixel belongs in order to determine the target region in the target image. For example, after pixel-level segmentation of the target image of a brain tumor patient, the classification result of each pixel can be determined, outputting 1 for pixels belonging to the tumor region and 0 for pixels belonging to non-tumor regions.
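For illustration, the preprocessing described above (size scaling followed by Z-score normalization) could be sketched as follows; the use of NumPy/SciPy and the target volume shape are assumptions, not part of the disclosure.

```python
# A minimal preprocessing sketch (size scaling + Z-score normalization), assuming
# the MRI volume is already loaded as a NumPy array; the resampling library and
# the target shape are assumptions of this sketch.
import numpy as np
from scipy.ndimage import zoom

def preprocess_volume(volume: np.ndarray, target_shape=(240, 240, 155)) -> np.ndarray:
    # Scale the volume to a uniform size using linear interpolation.
    factors = [t / s for t, s in zip(target_shape, volume.shape)]
    resized = zoom(volume, factors, order=1)
    # Z-score normalization: zero mean, unit variance.
    return (resized - resized.mean()) / (resized.std() + 1e-8)
```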
And step S23, extracting the image features of the target area according to a preset image processing algorithm to obtain the target image features of the target user.
In the embodiment of the invention, the preset image processing algorithm can be used for extracting the image features. After the target region of the disease is determined according to the original target image and the preset image segmentation algorithm, the image processing algorithm can be used to extract the image features of the target region images under 4 modalities (T1, FLAIR, T1c, T2) respectively to form the target image features.
For example, for a target user with a brain tumor, the embodiment of the present invention first acquires an MRI image of the target user, performs image segmentation on the MRI image and determines the target region, namely the tumor region image. It then extracts 386 image features of the tumor region image under the 4 modalities, comprising first-order statistic features, shape features and texture features, to form an image feature vector R: the first-order statistic features comprise 18 features such as the mean value, variance, kurtosis, energy and entropy; the shape features comprise 14 features such as the volume, surface area, sphericity and maximum three-dimensional diameter; and the texture features comprise 75 features from the gray level co-occurrence matrix, gray level size zone matrix, gray level run length matrix, gray level dependence matrix and neighbouring gray tone difference matrix. Table 1 shows the types and numbers of image features extracted for the different modalities in the embodiment of the present invention. Note that since the shape features do not depend on the modality and are related only to the segmentation result of the target region, only one set of them is extracted.
Table 1 (presented as an image in the original publication) lists the types and numbers of image features extracted for each modality.
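For illustration, the per-modality feature extraction of step S23 could be sketched with the open-source pyradiomics package as the "preset image processing algorithm"; the library choice, the enabled feature classes and the file names are assumptions of this sketch.

```python
# A sketch of step S23 using pyradiomics; feature classes mirror the families
# named above, but the concrete configuration is an assumption of this sketch.
from radiomics import featureextractor

extractor = featureextractor.RadiomicsFeatureExtractor()
extractor.disableAllFeatures()
extractor.enableFeatureClassByName("firstorder")  # mean, variance, kurtosis, energy, entropy, ...
extractor.enableFeatureClassByName("shape")       # volume, surface area, sphericity, max 3D diameter, ...
extractor.enableFeatureClassByName("glcm")        # gray level co-occurrence matrix
extractor.enableFeatureClassByName("glszm")       # gray level size zone matrix
extractor.enableFeatureClassByName("glrlm")       # gray level run length matrix
extractor.enableFeatureClassByName("gldm")        # gray level dependence matrix
extractor.enableFeatureClassByName("ngtdm")       # neighbouring gray tone difference matrix

def extract_modality_features(image_path: str, mask_path: str) -> dict:
    """Extract radiomic features of the segmented target (tumor) region for one modality."""
    result = extractor.execute(image_path, mask_path)
    # Keep only feature values, dropping the pyradiomics diagnostic entries.
    return {k: v for k, v in result.items() if not k.startswith("diagnostics")}

# One call per modality (T1, FLAIR, T1c, T2) against the same tumor mask; paths are illustrative.
features_t1 = extract_modality_features("patient01_T1.nii.gz", "patient01_tumor_mask.nii.gz")
```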
In the embodiment of the invention, a target image of a target user is obtained; segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image; and extracting the image characteristics of the target area according to a preset image processing algorithm to obtain the target image characteristics of the target user. Therefore, the target area can be automatically segmented and determined, the image features of the target area can be subsequently extracted to obtain the target image features, the image processing efficiency is improved, the extracted image features have interpretability, and the accuracy of subsequent model prediction is further ensured.
Optionally, in the embodiment of the present invention, the obtaining of the target gene feature of the target user in step 101 may be specifically implemented by steps S31 to S33 as follows:
and step S31, obtaining target genomics data of the target user.
In the embodiment of the present invention, the target genomics data may refer to genetic data of a target user.
Step S32, selecting candidate genes from the target genomics data based on the types of target diseases; the candidate gene is related to the target disease and/or the mutation rate of the candidate gene in the target user is higher than a first preset threshold value.
In the embodiment of the present invention, the target disease may refer to a disease targeted by hierarchical prediction, such as brain glioma. The candidate gene can be a gene which is highly related to a target disease and has a high mutation rate in target genomics data. The first preset threshold may be a preset mutation rate threshold, and when the mutation rate is higher than the first preset threshold, it may be determined that the mutation rate of the gene is higher.
Specifically, in this step, the first n genes most strongly related to the grading of the target disease may be selected as first candidate genes according to the type of the target disease. Then, the gene mutation states of the target users are obtained, where a gene mutation state is either wild type (denoted as 0) or mutant (denoted as 1). Based on the mutation rates of the different genes among the target users (the mutation rate of a gene being the number of patients in the data set carrying a mutation of that gene divided by the total number of patients in the data set), the first q genes with the highest mutation rates are selected as second candidate genes; alternatively, the genes whose mutation rate is higher than the first preset threshold may be selected directly as the second candidate genes. After the first candidate genes and the second candidate genes are obtained, their intersection or union can be taken to determine the final candidate genes, so that the genes most strongly related to the target disease or with the highest mutation rates are determined accurately, ensuring the accuracy and comprehensiveness of the subsequent gene feature extraction.
And step S33, extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user.
In the embodiment of the invention, after the candidate genes are determined, the gene characteristics of each target user for each candidate gene, specifically including the gene expression level, the gene mutation state and the like, can be respectively obtained, so as to obtain the target gene characteristics of the target user.
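For illustration, the candidate-gene selection and gene feature extraction of steps S31 to S33 could be sketched as follows; the tabular data layout, the column names, the thresholds and the use of the union rule are assumptions consistent with the description above.

```python
# A sketch of steps S31-S33, assuming the genomics data are held in pandas DataFrames.
import pandas as pd

def select_candidate_genes(mutation_status: pd.DataFrame,
                           disease_related_genes: list,
                           n_top_related: int = 20,
                           mutation_rate_threshold: float = 0.05) -> list:
    """mutation_status: rows = patients, columns = genes, values 0 (wild type) / 1 (mutant)."""
    # First candidate genes: genes known to be most related to the target disease.
    first_candidates = set(disease_related_genes[:n_top_related])
    # Second candidate genes: mutation rate above the first preset threshold.
    mutation_rate = mutation_status.mean(axis=0)  # mutated patients / all patients, per gene
    second_candidates = set(mutation_rate[mutation_rate > mutation_rate_threshold].index)
    # Union of the two sets (the description also allows taking the intersection).
    return sorted(first_candidates | second_candidates)

def extract_gene_features(expression: pd.DataFrame,
                          mutation_status: pd.DataFrame,
                          candidates: list) -> pd.DataFrame:
    """Target gene features: expression level and mutation state of each candidate gene."""
    expr = expression[candidates].add_suffix("_expr")
    mut = mutation_status[candidates].add_suffix("_mut")
    return pd.concat([expr, mut], axis=1)
```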
In the embodiment of the invention, target genomics data of a target user are obtained; selecting candidate genes from the target genomics data based on the type of the target disease; the candidate gene is related to a target disease and/or the mutation rate of the candidate gene in a target user is higher than a first preset threshold value; and extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user. Therefore, in the embodiment of the invention, by determining the candidate gene most related to the target disease or having the highest mutation rate and extracting the gene characteristics of the candidate gene of the target user, the genomics characteristics and the image characteristics can be fused, and the accuracy of disease grading prediction is improved.
Optionally, in this embodiment of the present invention, the obtaining of the target clinical characteristics of the target user in step 101 may be specifically implemented by steps S41 to S43 as follows:
and step S41, acquiring preset clinical data of the target user.
In an embodiment of the present invention, the preset clinical data may refer to various clinical data of the target user.
And step S42, according to the type of the target disease, screening preset clinical data related to the target disease from the preset clinical data to obtain target clinical information.
In an embodiment of the present invention, the target clinical information may refer to the clinical data, among the preset clinical data, that are related to the target disease. Data screening filters out the clinical data that are irrelevant to the target disease and ensures the validity of the clinical data.
And step S43, extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user.
In the embodiment of the invention, after the target clinical information related to the target disease is determined, data processing operations such as discretization and labeling can be performed on the target clinical information, and the target clinical characteristics of the target user can be extracted.
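For illustration, the clinical-data screening and feature extraction of steps S41 to S43 could be sketched as follows; the chosen columns, the discretization bins and the label encodings are assumptions of this sketch.

```python
# A sketch of steps S41-S43, assuming the clinical records are a pandas DataFrame.
import pandas as pd

def extract_clinical_features(clinical: pd.DataFrame,
                              related_columns=("age", "gender")) -> pd.DataFrame:
    # Keep only the clinical data related to the target disease.
    target_info = clinical[list(related_columns)].copy()
    # Discretize continuous values and label categorical ones.
    target_info["age"] = pd.cut(target_info["age"],
                                bins=[0, 40, 60, 120], labels=[0, 1, 2]).astype(int)
    target_info["gender"] = target_info["gender"].map({"male": 0, "female": 1})
    return target_info
```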
In the embodiment of the invention, preset clinical data of a target user are acquired; screening preset clinical data related to the target disease from the preset clinical data according to the type of the target disease to obtain target clinical information; and extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user. Therefore, in the embodiment of the invention, the target clinical data of the target user is determined to be used in the subsequent model training process through data screening and data extraction processing, so that the classification accuracy of diseases can be improved.
Optionally, in this embodiment of the present invention, step 102 may be specifically implemented by the following steps 1021 to 1022:
and 1021, merging the target image characteristic, the target gene characteristic and the target clinical characteristic to obtain a target total quantity characteristic.
In the embodiment of the present invention, the target full-scale feature may refer to a feature set formed by combining a target image feature, a target gene feature, and a target clinical feature. The target total quantity features comprise all the features in the target image features, the target gene features and the target clinical features determined in the previous steps, and can be used as a data basis for subsequent feature screening.
Step 1022, selecting target features of the target user from the target full-scale features according to a preset feature selection algorithm; the target features are the features whose importance degrees rank in the top N among the target full-scale features; N is the target number of the target features; and N is an integer greater than 0.
In the embodiment of the present invention, the preset feature selection algorithm may be a preset feature screening algorithm, for example a cross-validation recursive feature elimination method; the embodiment of the present invention does not limit its specific type. The importance degree of a feature in the target full-scale features indicates how important that feature is for grading prediction. The target number N may refer to the optimal number of target features. It should be noted that different numbers of target features yield different grading prediction accuracies; in this step, the target number N may first be determined based on the importance degrees and the preset feature selection algorithm, and then N target features are selected to form the target feature set finally used for grading prediction.
In the embodiment of the invention, the target image features, the target gene features and the target clinical features are combined to obtain the target full-scale features; the target features of the target user are then selected from the target full-scale features according to a preset feature selection algorithm, the target features being the features whose importance degrees rank in the top N among the target full-scale features. Therefore, the target full-scale features are obtained through feature combination, and the target features with higher importance degrees are then determined based on the feature selection algorithm, so that the optimal target feature subset can be selected; training the classifier with this optimal feature subset can further improve the accuracy of the target grading prediction model.
Optionally, in the embodiment of the present invention, the step 1021 may be specifically implemented by the following steps S51 to S52:
step S51, vectorizing the target image feature, the target gene feature and the target clinical feature respectively to obtain a target image feature vector, a target gene feature vector and a target clinical feature vector.
In the embodiment of the invention, after the target image feature, the target gene feature and the target clinical feature are obtained, vectorization of the feature set can be performed to obtain a target image feature vector R, a target gene feature vector G and a target clinical feature vector C.
And step S52, merging the target image feature vector, the target gene feature vector and the target clinical feature vector to obtain a first target full-scale feature vector corresponding to the target full-scale feature.
In an embodiment of the present invention, the first target full-scale feature vector may refer to the vector corresponding to the original target full-scale features, containing all imaging-genomics features. Specifically, the target image feature vector R, the target gene feature vector G and the target clinical feature vector C may be merged (concatenated) to obtain the first target full-scale feature vector F, i.e., F = R + G + C.
In the embodiment of the invention, the target image characteristic, the target gene characteristic and the target clinical characteristic are respectively vectorized to obtain a target image characteristic vector, a target gene characteristic vector and a target clinical characteristic vector; and merging the target image feature vector, the target gene feature vector and the target clinical feature vector to obtain a first target full-scale feature vector corresponding to the target full-scale feature. Therefore, the first target full-quantity feature vector is obtained by vectorizing and merging the features, the operation speed can be improved by performing feature processing based on the vector, and the comprehensiveness of the features in subsequent feature screening can be ensured by merging the vectors.
Optionally, in the embodiment of the present invention, the step 1022 specifically includes the following steps S61 to S64:
step S61, performing normalization processing on the first target full-scale feature vector to obtain a second target full-scale feature vector corresponding to the first target full-scale feature.
In this embodiment of the present invention, the second target full-scale feature vector may refer to the first target full-scale feature vector after normalization. In this step, after feature merging yields the first target full-scale feature vector F, a normalization operation may be performed on F: all quantized feature data in F are normalized to the range [-1, 1] to obtain the normalized second target full-scale feature vector F'. The specific normalization formula (shown only as an image in the original publication) may take the standard min-max form:
f' = 2(f - f_min)/(f_max - f_min) - 1, where f_min and f_max are the minimum and maximum values of the corresponding feature.
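For illustration, the feature merging and [-1, 1] normalization could be sketched as follows; feature-wise min-max scaling is an assumption consistent with the stated normalization range.

```python
# A sketch of feature merging (F = R + G + C) followed by normalization to [-1, 1].
import numpy as np

def merge_and_normalize(R: np.ndarray, G: np.ndarray, C: np.ndarray) -> np.ndarray:
    """R, G, C: per-user feature matrices of shape (n_users, n_features_*)."""
    F = np.concatenate([R, G, C], axis=1)      # first target full-scale feature vector
    f_min, f_max = F.min(axis=0), F.max(axis=0)
    # Second target full-scale feature vector F', scaled feature-wise to [-1, 1].
    return 2.0 * (F - f_min) / (f_max - f_min + 1e-12) - 1.0
```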
and step S62, determining the importance degree of each feature in the second target full-quantity feature vector according to the preset feature selection algorithm.
In the embodiment of the invention, after the normalization is performed, the importance degree of each feature in the second target full-scale feature vector can be calculated using the second target full-scale feature vector and the preset feature selection algorithm. Exemplarily, taking the support-vector-machine-based recursive feature elimination method (SVM-RFE) as the preset feature selection algorithm, feature selection based on the second target full-scale feature vector F' proceeds as follows: the data set is randomly divided into 5 subsets, of which 4 subsets are used as the training set and the remaining subset as the validation set, giving 5 pairs of training and validation sets; an SVM classifier is used to model each of the 5 pairs, the importance degree of each feature is calculated, and the importance ranking of all features is obtained with the recursive feature elimination (RFE) procedure.
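For illustration, the SVM-RFE importance ranking could be sketched with scikit-learn's RFE; the linear kernel and the elimination step size are assumptions of this sketch (RFE requires an estimator that exposes feature weights, such as a linear SVM).

```python
# A sketch of SVM-RFE importance ranking using scikit-learn.
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

def rank_features(F_norm: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Return feature indices ordered from most to least important."""
    rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=1, step=1)
    rfe.fit(F_norm, labels)
    # ranking_ == 1 marks the last surviving (most important) feature.
    return np.argsort(rfe.ranking_)
```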
And step S63, determining the target number N corresponding to the target feature according to the importance degree and the preset feature selection algorithm.
In the embodiment of the invention, after the importance degree of each feature in the second target full-scale feature vector is determined, verification can be performed again based on the importance degree and the preset feature selection algorithm, the optimal number N corresponding to the target feature is determined, and the accuracy in the subsequent model training process is further improved.
And step S64, screening the features whose importance degrees rank in the top N in the second target full-scale feature vector as the target features.
In the embodiment of the present invention, after the target number N is determined, the features whose importance degrees rank in the top N are selected from the second target full-scale feature vector, based on the ranking of the importance degrees, so as to form the target feature set.
In the embodiment of the invention, the first target full-scale feature vector is normalized to obtain the second target full-scale feature vector corresponding to the first target full-scale feature; the importance degree of each feature in the second target full-scale feature vector is determined according to a preset feature selection algorithm; the target number N corresponding to the target features is determined according to the importance degrees and the preset feature selection algorithm; and the features whose importance degrees rank in the top N in the second target full-scale feature vector are screened as the target features. Therefore, by determining the importance degree of each feature and the optimal number of target features, the rationality of feature screening is improved and the optimal target feature set is obtained, which can further improve the classification accuracy of the target grading prediction model obtained by subsequent training.
Optionally, in the embodiment of the present invention, step S63 may be specifically implemented by steps S631 to S633 as follows:
and S631, sequentially selecting different numbers of features according to the ranking of the importance degrees.
In the embodiment of the invention, after the importance degree of each feature in the second target full-quantity feature vector is determined, the optimal number of the target features can be further determined. In this step, different numbers of features may be selected first, for example, the first 5 features, the first 10 features, the first 20 features, and the like, which are ranked in order of importance degree, may be selected from the second target full-scale feature vector, and then the feature set composed of the different numbers of features may be verified to determine the optimal number.
Step S632 is to input the features of each quantity into the preset feature selection algorithm for verification, and determine the average accuracy corresponding to each feature quantity.
In the embodiment of the present invention, the average accuracy may be an accuracy when the model trained based on the features of the feature number is finally used for the preset classification. Exemplarily, in this step, 5-fold cross validation may be adopted to determine the number of the selected optimal features, that is, the target number N, sequentially select different numbers of features according to the importance order, and perform cross validation on the selected feature sets to obtain average accuracies corresponding to the feature sets of different feature numbers, respectively.
Step S633, taking the feature quantity with the highest average accuracy as the target quantity N corresponding to the target feature.
In the embodiment of the present invention, after the average accuracies corresponding to the feature sets with different numbers of features are determined, the number of features in the set with the highest average accuracy is taken as the optimal number of features. An RFE (recursive feature elimination) operation may then be performed on the entire data set to obtain all target features and compose the target feature vector set. Illustratively, for the data sets of the aforementioned 65 low-grade and 102 high-grade brain tumor patients, the SVM-RFE method determines the optimal number of features to be 19. Table 2 shows the types of target features determined after feature selection in the embodiment of the present invention.
Table 2 (presented as an image in the original publication) lists the 19 target features retained after feature selection.
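For illustration, the determination of the target number N in steps S631 to S633 could be sketched as follows; the candidate feature counts and the linear kernel are assumptions of this sketch.

```python
# A sketch of steps S631-S633: evaluate top-k feature subsets with 5-fold
# cross-validation and keep the count with the highest average accuracy.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def choose_target_number(F_norm: np.ndarray, labels: np.ndarray,
                         ranked_idx: np.ndarray,
                         candidate_counts=(5, 10, 15, 19, 20, 30)) -> int:
    best_n, best_acc = candidate_counts[0], -1.0
    for n in candidate_counts:
        subset = F_norm[:, ranked_idx[:n]]  # top-n most important features
        acc = cross_val_score(SVC(kernel="linear"), subset, labels,
                              cv=5, scoring="accuracy").mean()
        if acc > best_acc:
            best_n, best_acc = n, acc
    return best_n
```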
In the embodiment of the invention, different numbers of features are sequentially selected according to the sequence of the importance degrees; respectively inputting the features of each quantity into a preset feature selection algorithm for verification, and determining the average accuracy corresponding to each feature quantity; and taking the feature quantity with the highest average accuracy as the target quantity N corresponding to the target features. Therefore, the optimal number of the features is determined through the importance degree and the preset feature selection algorithm, the optimal target feature subset can be selected, the classifier is trained by using the optimal feature subset, and the accuracy of model grading prediction can be further improved.
Illustratively, fig. 2 shows a flow chart of disease grading prediction according to an embodiment of the present invention. The specific procedure of the disease grading prediction method according to the embodiment of the present invention is described below with an example in which the target disease is brain glioma and the target users are brain glioma patients:
step 201, obtaining a multi-modal MRI image of a patient with brain glioma; automatically segmenting the multi-modal MRI image according to a preset image segmentation algorithm to determine a tumor region (target region); and extracting the target image characteristics of the tumor region image through a preset image processing algorithm.
Step 202, acquiring gene data (target genomics data) of a patient with glioma, and determining candidate genes through correlation analysis and mutation rate statistics; and then extracting the gene characteristics of the candidate genes of each patient to obtain the target gene characteristics.
Step 203, obtaining clinical data (preset clinical data) of the patient with glioma, determining target clinical information through data screening, and obtaining target clinical characteristics through characteristic extraction.
Steps 201 to 203 are described in detail in the foregoing, and the embodiments of the present invention are not described herein again.
And 204, combining the target image features, the target clinical features and the target gene features to obtain target full-scale features, and then extracting the features to obtain the target features.
In the embodiment of the invention, the TCGA-LGG low-grade brain glioma data set and the TCGA-GBM high-grade brain glioma data set are first obtained; together they contain pre-operative multi-modal MRI magnetic resonance images (comprising the four sequences T1, FLAIR, T1c and T2) of 65 low-grade and 102 high-grade brain tumor patients, together with the preset clinical data and target genomics data corresponding to each patient. For the MRI image data, a Semantic Feature Pyramid Network (SFPN) is adopted to automatically segment the tumor region, and an image processing algorithm is used to extract the first-order statistic features, shape features, texture features and the like of the tumor region to form the target image features. For the gene data, candidate genes are determined through correlation analysis and mutation rate statistics, and the gene expression levels and gene mutation states of the candidate genes are extracted as the target gene features. For the clinical data, the target clinical information, such as the age and sex of the patient, is obtained through data screening, and the target clinical information is discretized, labeled and so on to obtain the target clinical features. The three kinds of features are combined to obtain the target full-scale features, and feature selection is performed on the target full-scale features using the cross-validation recursive feature elimination method to obtain the target feature vector set corresponding to the target features.
Step 205, inputting the target features and the preset labels into a preset grading prediction model (an SVM classifier), and training to obtain the target grading prediction model.
In this step, the target feature vector set corresponding to the target features is randomly divided into 5 subsets; a Support Vector Machine (SVM) classifier is adopted as the preset grading prediction model, and model training is performed with 5-fold cross-validation to obtain the target grading prediction model. The target feature vectors of the patients (second users) in the preset test set can then be input into the target grading prediction model in turn, and the model outputs the target grading, namely the tumor grading prediction result. The cross-validation method adopted in the embodiment of the invention specifically groups the original data, takes one part as the training set and the remaining part as the test set, first trains the classifier with the training set, and then tests the trained model with the test set, the test result serving as the performance index for evaluating the classifier.
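The five-fold training and evaluation loop described above might be sketched as follows (illustrative only; the stratified splitting, the linear kernel and the random seed are assumptions):

```python
# Illustrative sketch: 5-fold cross-validation of an SVM grading classifier,
# returning the per-fold accuracies and their mean.
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def five_fold_svm(X, y, seed=0):
    """X: (n_patients, n_target_features) array; y: 0/1 grade labels."""
    accs = []
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X, y):
        clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])  # train on 4 subsets
        pred = clf.predict(X[test_idx])                             # test on the held-out subset
        accs.append(accuracy_score(y[test_idx], pred))
    return accs, float(np.mean(accs))
```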
In the embodiment of the present invention, for the prediction result, the classification accuracy may be used as the performance evaluation index of the model. The calculation formula is accuracy = (TP + TN)/(P + N), where TP is the number of samples correctly classified as positive examples, TN is the number of samples correctly classified as negative examples, and P + N is the sum of all positive and negative examples, i.e. the total number of samples.
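In code this index reduces to a few lines (the convention that label 1 is the positive class, i.e. high grade, is an assumption):

```python
# Illustrative sketch of the evaluation index: accuracy = (TP + TN) / (P + N).
def grading_accuracy(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # correctly predicted positives
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly predicted negatives
    return (tp + tn) / len(y_true)                                    # P + N = total number of samples
```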
In the above example, the five-fold cross-validation evaluation results for the 167 patients are shown in Table 3. To verify the performance of the disease grading prediction method of the embodiment of the invention, the imaging omics method (grading prediction using only image features) is compared with the embodiment of the invention. It can be seen that the disease grading prediction method of the embodiment of the invention achieves higher grading accuracy on 4 of the 5 subsets (subsets 1, 2, 4 and 5); the average accuracy over the 167 patients is 94%, an improvement of 2.4% over the imaging omics method used alone.
TABLE 3 Five-fold cross-validation evaluation results
From the above results it can be seen that, compared with the imaging omics method that uses only image features, the disease grading method provided by the embodiment of the invention comprehensively quantifies the tumor phenotype characteristics, reduces the interference of human factors through automatic quantitative analysis, and introduces the genomics features and clinical features of the patient into the prediction, thereby improving the accuracy of grading prediction and providing a basis for establishing an optimal diagnosis and treatment scheme for the patient.
It should be noted that, for simplicity of description, the method embodiments are described as a series or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments and that the acts involved are not necessarily required to implement the invention.
Referring to fig. 3, a block diagram of a disease grading prediction apparatus according to an embodiment of the present invention is shown, and specifically, the apparatus 30 may include the following modules:
an obtaining module 301, configured to obtain a target image feature, a target gene feature, and a target clinical feature of a target user; the target users comprise first users in a preset training set and second users in a preset test set;
a determining module 302, configured to determine a target feature of the target user according to the target image feature, the target gene feature, and the target clinical feature;
a training module 303, configured to train a preset grading prediction model based on the target feature and a preset label corresponding to the first user, to obtain a target grading prediction model;
an input module 304, configured to input the target feature of the second user into the target grading prediction model, so as to obtain a target grading corresponding to the second user.
In summary, the disease grading prediction apparatus provided in the embodiment of the present invention obtains the target image feature, the target gene feature and the target clinical feature of a target user, the target users comprising first users in a preset training set and second users in a preset test set; determines the target feature of the target user according to the target image feature, the target gene feature and the target clinical feature; trains a preset grading prediction model based on the target feature and the preset label corresponding to the first user to obtain a target grading prediction model; and inputs the target feature of the second user into the target grading prediction model to obtain the target grading corresponding to the second user. In this way, the target grading prediction model is trained by introducing the target image feature, the target gene feature and the target clinical feature of the target user, and automatic grading prediction is performed based on the target grading prediction model, which reduces the interference of human factors and improves the accuracy of disease grading prediction.
Optionally, the obtaining module 301 is specifically configured to:
acquiring a target image of the target user;
segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image;
and extracting the image characteristics of the target area according to a preset image processing algorithm to obtain the target image characteristics of the target user.
Optionally, the obtaining module 301 is further specifically configured to:
obtaining target genomics data of the target user;
selecting candidate genes from the target genomics data based on the type of the target disease; the candidate gene is related to the target disease and/or the mutation rate of the candidate gene in the target user is higher than a first preset threshold value;
and extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user.
Optionally, the obtaining module 301 is further specifically configured to:
acquiring preset clinical data of the target user;
according to the type of the target disease, screening preset clinical data related to the target disease from the preset clinical data to obtain target clinical information;
and extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user.
Optionally, the determining module 302 is specifically configured to:
merging the target image features, the target gene features and the target clinical features to obtain the target full-scale features;
selecting the target features of the target user from the target full-scale features according to a preset feature selection algorithm; the target features are the features whose importance degrees rank in the top N among the target full-scale features; N is the target number of the target features; and N is an integer greater than 0.
Optionally, the selecting, according to a preset feature selection algorithm, a target feature of the target user from the target full-scale features includes:
normalizing the first target full-scale feature vector to obtain a second target full-scale feature vector corresponding to the first target full-scale feature;
determining the importance degree of each feature in the second target full-scale feature vector according to the preset feature selection algorithm;
determining a target number N corresponding to the target feature according to the importance degree and the preset feature selection algorithm;
and screening the features whose importance degrees rank in the top N in the second target full-scale feature vector as the target features.
Optionally, the determining the target number N corresponding to the target features includes:
sequentially selecting different numbers of features according to the ranking of the importance degrees;
inputting the features of each quantity into the preset feature selection algorithm for verification respectively, and determining the average accuracy corresponding to each feature quantity respectively;
and taking the feature quantity with the highest average accuracy as a target quantity N corresponding to the target feature.
Optionally, an embodiment of the present invention further provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and executable on the processor; when executed by the processor, the computer program implements each process of the above disease grading prediction method embodiment and can achieve the same technical effect, which is not repeated here to avoid repetition.
Optionally, an embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored; when executed by a processor, the computer program implements each process of the above disease grading prediction method embodiment and can achieve the same technical effect, which is not repeated here to avoid repetition.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will readily occur to a person skilled in the art, any combination of the above embodiments is possible, and every such combination is therefore an embodiment of the present invention; for reasons of space, they are not described in detail here.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the invention and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the claims, any of the claimed embodiments may be used in any combination.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second, third and so on does not indicate any ordering; these words may be interpreted as names.

Claims (11)

1. A disease grading prediction method, comprising:
acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user; the target users comprise first users in a preset training set and second users in a preset test set;
determining a target feature of the target user according to the target image feature, the target gene feature and the target clinical feature;
training a preset grading prediction model based on the target feature and a preset label corresponding to the first user to obtain a target grading prediction model;
and inputting the target characteristics of the second user into the target grading prediction model to obtain the target grading corresponding to the second user.
2. The method of claim 1, wherein the obtaining target image features of the target user comprises:
acquiring a target image of the target user;
segmenting the target image according to a preset image segmentation algorithm, and determining a target area in the target image;
and extracting the image characteristics of the target area according to a preset image processing algorithm to obtain the target image characteristics of the target user.
3. The method of claim 1, wherein the obtaining the target gene characteristics of the target user comprises:
obtaining target genomics data of the target user;
selecting candidate genes from the target genomics data based on the type of the target disease; the candidate gene is related to the target disease and/or the mutation rate of the candidate gene in the target user is higher than a first preset threshold value;
and extracting the gene characteristics of the candidate genes to obtain the target gene characteristics of the target user.
4. The method of claim 1, wherein the obtaining the target clinical characteristics of the target user comprises:
acquiring preset clinical data of the target user;
according to the type of the target disease, screening preset clinical data related to the target disease from the preset clinical data to obtain target clinical information;
and extracting the clinical characteristics of the target clinical information to obtain the target clinical characteristics of the target user.
5. The method of any one of claims 1 to 4, wherein said determining a target feature of the target user based on the target image feature, the target gene feature, and the target clinical feature comprises:
merging the target image features, the target gene features and the target clinical features to obtain the target full-scale features;
selecting the target features of the target user from the target full-scale features according to a preset feature selection algorithm; the target features are the features whose importance degrees rank in the top N among the target full-scale features; N is the target number of the target features; and N is an integer greater than 0.
6. The method of claim 5, wherein the merging the target image features, the target gene features and the target clinical features comprises:
vectorizing the target image features, the target gene features and the target clinical features respectively to obtain target image feature vectors, target gene feature vectors and target clinical feature vectors;
and merging the target image feature vector, the target gene feature vector and the target clinical feature vector to obtain a first target full-scale feature vector corresponding to the target full-scale feature.
7. The method according to claim 6, wherein the selecting the target feature of the target user from the target full-scale features according to a preset feature selection algorithm comprises:
normalizing the first target full-scale feature vector to obtain a second target full-scale feature vector corresponding to the first target full-scale feature;
determining the importance degree of each feature in the second target full-scale feature vector according to the preset feature selection algorithm;
determining a target number N corresponding to the target feature according to the importance degree and the preset feature selection algorithm;
and screening the features whose importance degrees rank in the top N in the second target full-scale feature vector as the target features.
8. The method according to claim 7, wherein the determining the target number N corresponding to the target feature comprises:
sequentially selecting different numbers of features according to the ranking of the importance degrees;
inputting the features of each quantity into the preset feature selection algorithm for verification respectively, and determining the average accuracy corresponding to each feature quantity respectively;
and taking the feature quantity with the highest average accuracy as a target quantity N corresponding to the target feature.
9. A disease grading prediction apparatus, comprising:
the acquisition module is used for acquiring target image characteristics, target gene characteristics and target clinical characteristics of a target user; the target users comprise first users in a preset training set and second users in a preset test set;
a determination module, configured to determine a target feature of the target user according to the target image feature, the target gene feature, and the target clinical feature;
the training module is used for training a preset grading prediction model based on the target characteristics and a preset label corresponding to the first user to obtain a target grading prediction model;
and the input module is used for inputting the target characteristics of the second user into the target grading prediction model to obtain the target grading corresponding to the second user.
10. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing the disease grading prediction method as claimed in any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the disease grading prediction method according to any one of claims 1 to 8.
CN202111250817.0A 2021-10-26 2021-10-26 Disease grading prediction method and device, electronic equipment and storage medium Pending CN114121291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111250817.0A CN114121291A (en) 2021-10-26 2021-10-26 Disease grading prediction method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111250817.0A CN114121291A (en) 2021-10-26 2021-10-26 Disease grading prediction method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114121291A true CN114121291A (en) 2022-03-01

Family

ID=80377116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111250817.0A Pending CN114121291A (en) 2021-10-26 2021-10-26 Disease grading prediction method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114121291A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107280697A (en) * 2017-05-15 2017-10-24 北京市计算中心 Lung neoplasm grading determination method and system based on deep learning and data fusion
CN109817332A (en) * 2019-02-28 2019-05-28 南京信息工程大学 The stage division of Pancreatic Neuroendocrine Tumors based on CT radiation group
CN111242174A (en) * 2019-12-31 2020-06-05 浙江大学 Liver cancer image feature extraction and pathological classification method and device based on imaging omics
US20210200988A1 (en) * 2019-12-31 2021-07-01 Zhejiang University Method and equipment for classifying hepatocellular carcinoma images by combining computer vision features and radiomics features
CN111476754A (en) * 2020-02-28 2020-07-31 中国人民解放军陆军军医大学第二附属医院 Artificial intelligence auxiliary grading diagnosis system and method for bone marrow cell image
CN112117003A (en) * 2020-09-03 2020-12-22 中国科学院深圳先进技术研究院 Tumor risk grading method, system, terminal and storage medium
CN112991363A (en) * 2021-03-17 2021-06-18 泰康保险集团股份有限公司 Brain tumor image segmentation method and device, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
BIAN XIUWU et al.: "Molecular Pathology and Precision Diagnosis", 31 December 2020, Shanghai Jiao Tong University Press, pages 464-465 *
SUN XIANTING et al.: "Prediction of glioma grading using a radiomics-based logistic regression model", Journal of Central South University (Medical Science), vol. 46, no. 4, 15 April 2021 (2021-04-15), pages 385-392 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024066722A1 (en) * 2022-09-27 2024-04-04 京东方科技集团股份有限公司 Target-model acquisition method and apparatus, prognostic-evaluation-value determination method and apparatus, and device and medium

Similar Documents

Publication Publication Date Title
Zhou et al. Radiomics in brain tumor: image assessment, quantitative feature descriptors, and machine-learning approaches
Thawani et al. Radiomics and radiogenomics in lung cancer: a review for the clinician
Vaithinathan et al. A novel texture extraction technique with T1 weighted MRI for the classification of Alzheimer’s disease
Stoyanova et al. Prostate cancer radiomics and the promise of radiogenomics
Acharya et al. Towards precision medicine: from quantitative imaging to radiomics
Santos et al. Automatic detection of small lung nodules in 3D CT data using Gaussian mixture models, Tsallis entropy and SVM
Qiu et al. Reproducibility and non-redundancy of radiomic features extracted from arterial phase CT scans in hepatocellular carcinoma patients: impact of tumor segmentation variability
Gao et al. Prostate segmentation by sparse representation based classification
Zhang et al. Effective staging of fibrosis by the selected texture features of liver: Which one is better, CT or MR imaging?
CN107016395B (en) Identification system for sparsely expressed primary brain lymphomas and glioblastomas
Perez et al. Automated lung cancer diagnosis using three-dimensional convolutional neural networks
Li et al. Molecular Subtypes Recognition of Breast Cancer in Dynamic Contrast‐Enhanced Breast Magnetic Resonance Imaging Phenotypes from Radiomics Data
Bandyk et al. MRI and CT bladder segmentation from classical to deep learning based approaches: Current limitations and lessons
Lee et al. Quality of radiomic features in glioblastoma multiforme: impact of semi-automated tumor segmentation software
Gut et al. Benchmarking of deep architectures for segmentation of medical images
Zhang et al. A review of breast tissue classification in mammograms
CN112561869B (en) Pancreatic neuroendocrine tumor postoperative recurrence risk prediction method
Bhatele et al. Machine learning application in glioma classification: review and comparison analysis
Florez et al. Emergence of radiomics: novel methodology identifying imaging biomarkers of disease in diagnosis, response, and progression
Zhang et al. A fully automatic extraction of magnetic resonance image features in glioblastoma patients
CN115457361A (en) Classification model obtaining method, expression class determining method, apparatus, device and medium
Bakas et al. Segmentation of gliomas in multimodal magnetic resonance imaging volumes based on a hybrid generative-discriminative framework
Kaushik et al. Brain tumor segmentation using genetic algorithm
Gundreddy et al. Assessment of performance and reproducibility of applying a content‐based image retrieval scheme for classification of breast lesions
Decuyper et al. Binary glioma grading: radiomics versus pre-trained CNN features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination