CN112884754A - Multi-modal Alzheimer's disease medical image recognition and classification method and system
- Publication number: CN112884754A
- Application number: CN202110265610.4A
- Authority: CN (China)
- Prior art keywords: data, snp, classifier, alzheimer, classifiers
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G06T7/0012: Biomedical image inspection
- G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148: Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
- G06F18/24: Classification techniques
- G06N20/20: Ensemble learning
- G16B20/20: Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
- G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G06T2207/10088: Magnetic resonance imaging [MRI]
- G06T2207/20081: Training; Learning
- G06T2207/20084: Artificial neural networks [ANN]
- G06V2201/03: Recognition of patterns in medical or anatomical images
Abstract
The application discloses a multi-modal Alzheimer's disease medical image recognition and classification method and system that combine two data modalities, medical images and genomics. By reading image data, applying genome-wide association analysis, and combining the image data with the gene data, the diagnosis of Alzheimer's disease becomes more accurate and reliable. This solves the technical problem that the multi-modal fusion of image data and genetic data in existing medical diagnosis of Alzheimer's disease performs poorly, which impairs the accuracy of recognizing and classifying populations at different stages of the disease.
Description
Technical Field
The application relates to the technical field of medical image analysis, in particular to a method and a system for recognizing and classifying multi-modal medical images of Alzheimer's disease.
Background
Alzheimer's disease (AD) is a progressive degenerative disease of the nervous system with an insidious onset. Clinically, it is characterized by generalized dementia, including memory impairment, aphasia, apraxia, agnosia, impairment of visual-spatial skills, executive dysfunction, and personality and behavioral changes. Onset before the age of 65 is called presenile dementia; onset after the age of 65 is called senile dementia.
Classifying populations at different stages of Alzheimer's disease makes it possible to identify people at an early stage, obtain informative genetic markers, and assist the prevention and diagnosis of early-stage Alzheimer's disease. Traditional multi-modal fusion of image data and genetic data in the medical diagnosis of Alzheimer's disease performs poorly, mainly in two respects. First, many of the features extracted from preprocessed medical images contribute nothing to population classification, which lowers classification accuracy. Second, when SNP (single nucleotide polymorphism) data are used for diagnosis, the SNPs of disease-related genes are usually selected manually; manual selection can miss SNPs, many disease-related SNPs are not recorded, and the computational complexity of SNP data is high. Improving the multi-modal fusion of image data and genetic data in the medical diagnosis of Alzheimer's disease, and thereby the accuracy of recognizing and classifying populations at different stages of the disease, therefore remains a technical problem to be solved by those skilled in the art.
Disclosure of Invention
The application provides a multi-modal Alzheimer's disease medical image recognition and classification method and system, which solve the technical problem that the multi-modal fusion of image data and genetic data in existing medical diagnosis of Alzheimer's disease performs poorly, impairing the accuracy of recognizing and classifying populations at different stages of the disease.
In view of the above, a first aspect of the present application provides a method for recognizing and classifying multi-modal alzheimer medical images, including:
constructing medical databases of different populations of Alzheimer's disease, wherein the medical databases comprise coronal MRI image data and gene SNP data;
after image preprocessing is carried out on the MRI image data, a CNN (convolutional neural network) is used for constructing classifiers, and at least three optimal classifiers are selected as high-quality MRI-based classifiers;
preprocessing the gene SNP data by using a GWAS whole genome association analysis method to obtain a coded SNP locus data set;
constructing classifiers by using a decision tree as a base classifier and using three integration strategies of a random forest classifier, a Bagging classifier and an XGBoost classifier to obtain three SNP-based classifiers;
performing ensemble learning on all the high-quality MRI-based classifiers and the SNP-based classifier based on an improved probability weight ensemble learning mode to obtain a final enhanced version classifier;
and performing multi-modal Alzheimer's disease medical image recognition classification by using the enhanced classifier.
Optionally, the preprocessing the gene SNP data using GWAS genome-wide association analysis to obtain an encoded SNP locus data set, includes:
performing GWAS genome-wide association analysis on the gene SNP data using PLINK software, wherein the analysis comprises: screening the gene SNP data according to the per-sample site missing rate, screening according to the per-SNP information missing rate, screening according to Hardy-Weinberg equilibrium, screening according to linkage disequilibrium, screening according to individual independence, analyzing with a logistic regression model to obtain the significance p-value of the association between each SNP and the phenotype, and selecting and encoding SNPs with high relevance according to the p-value to form the encoded SNP site data set.
Optionally, the image pre-processing the MRI image data comprises:
performing skull removal and registration processing on the MRI image data;
smoothing the MRI image data;
performing gray scale normalization on the MRI image data;
two-dimensional slicing is performed on the MRI image data.
Optionally, the MRI image data is image pre-processed using SPM12 software.
Optionally, the ensemble learning mode based on the improved probability weights is:
$$p(x) = \mathrm{sigmoid}(w_1)\,p(x \mid h_1) + \mathrm{sigmoid}(w_2)\,p(x \mid h_2) + \cdots + \mathrm{sigmoid}(w_n)\,p(x \mid h_n)$$
wherein n is the number of classifiers, sigmoid () is an activation function, w is a performance index of the classifier, p is the probability of the current classifier, and h is the number of network layers.
The second aspect of the present application provides a multimodal alzheimer's disease medical image recognition and classification system, comprising:
the data module is used for constructing medical databases of different populations of Alzheimer's disease, and the medical databases comprise coronal MRI image data and gene SNP data;
the MRI image processing module is used for preprocessing the MRI image data, constructing classifiers by using CNN (convolutional neural network), and selecting at least three optimal classifiers as high-quality MRI-based classifiers;
the first gene data processing module is used for preprocessing the gene SNP data by using a GWAS whole genome association analysis method to obtain an encoded SNP locus data set;
the second gene data processing module is used for constructing classifiers by using a decision tree as a base classifier and using three integration strategies of a random forest classifier, a Bagging classifier and an XGBoost classifier to obtain three SNP-based classifiers;
the ensemble learning reinforcement module is used for carrying out ensemble learning on all the high-quality MRI-based classifiers and the SNP-based classifiers based on an improved probability weight ensemble learning mode to obtain a final reinforcement version classifier;
and the recognition and classification module is used for performing multi-modal Alzheimer disease medical image recognition and classification by using the enhanced classifier.
Optionally, the first genetic data processing module is specifically configured to:
performing GWAS genome-wide association analysis on the gene SNP data using PLINK software, wherein the analysis comprises: screening the gene SNP data according to the per-sample site missing rate, screening according to the per-SNP information missing rate, screening according to Hardy-Weinberg equilibrium, screening according to linkage disequilibrium, screening according to individual independence, analyzing with a logistic regression model to obtain the significance p-value of the association between each SNP and the phenotype, and selecting and encoding SNPs with high relevance according to the p-value to form the encoded SNP site data set.
Optionally, the image pre-processing the MRI image data comprises:
performing skull removal and registration processing on the MRI image data;
smoothing the MRI image data;
performing gray scale normalization on the MRI image data;
two-dimensional slicing is performed on the MRI image data.
Optionally, the MRI image data is image pre-processed using SPM12 software.
Optionally, the ensemble learning mode based on the improved probability weights is:
$$p(x) = \mathrm{sigmoid}(w_1)\,p(x \mid h_1) + \mathrm{sigmoid}(w_2)\,p(x \mid h_2) + \cdots + \mathrm{sigmoid}(w_n)\,p(x \mid h_n)$$
wherein n is the number of classifiers, sigmoid () is an activation function, w is a performance index of the classifier, p is the probability of the current classifier, and h is the number of network layers.
According to the technical scheme, the embodiment of the application has the following advantages:
The application provides a multi-modal Alzheimer's disease medical image recognition and classification method, which comprises the following steps: constructing medical databases of different populations of Alzheimer's disease, wherein the medical databases comprise coronal MRI image data and gene SNP data; after image preprocessing is carried out on the MRI image data, constructing classifiers using a CNN (convolutional neural network) and selecting at least three optimal classifiers as high-quality MRI-based classifiers; preprocessing the gene SNP data using the GWAS genome-wide association analysis method to obtain an encoded SNP locus data set; constructing classifiers using a decision tree as a base classifier and three integration strategies, namely a random forest classifier, a Bagging classifier and an XGBoost classifier, to obtain three SNP-based classifiers; performing ensemble learning on all the high-quality MRI-based classifiers and the SNP-based classifiers in an improved probability-weight ensemble learning mode to obtain a final enhanced classifier; and performing multi-modal Alzheimer's disease medical image recognition and classification using the enhanced classifier.
In the method, a group of base classifiers is trained with a deep convolutional neural network on the two-dimensional slices of the MRI images, and at least three slice classifiers that classify the disease populations well are then selected as the image base classifiers for integration. This ensures that the selected slices agree to some extent with the clinical manifestations of the disease while keeping the image classifiers diverse. Because the performance of ensemble learning depends on both the performance and the diversity of the classifiers, integrating several image classifiers works better than integrating a single one.
In the application, GWAS (genome-wide association analysis) is used to preprocess the genome data by analyzing the association between SNP sites and phenotypes, so that phenotype-related SNPs are screened out. The phenotype can be of two kinds: a linear phenotype, such as height, weight or intelligence; or a binary phenotype, such as diseased and unaffected (known as case and control), with 0 denoting diseased and 1 denoting unaffected. Using GWAS to analyze the SNP data and reduce its dimensionality greatly reduces computational complexity, reduces recognition errors caused by redundant information, and improves recognition accuracy. In addition, to improve SNP classification performance, the invention constructs the SNP classifiers with several integration strategies, which improves both classification performance and the diversity of the SNP classifiers.
The method combines two data modalities, medical images and genomics. By reading image data, applying genome-wide association analysis, and combining the image data with the gene data, it makes the diagnosis of Alzheimer's disease more accurate and reliable, and solves the technical problem that the multi-modal fusion of image data and genetic data in existing medical diagnosis of Alzheimer's disease performs poorly, which impairs the accuracy of recognizing and classifying populations at different stages of the disease.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other related drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a multi-modal Alzheimer's disease medical image recognition and classification method according to an embodiment of the present application;
Fig. 2 is a schematic diagram of the MRI image data preprocessing process in an embodiment of the present application;
Fig. 3 is a schematic flowchart of constructing a classifier using a CNN in an embodiment of the present application;
Fig. 4 is a schematic diagram of the gene data preprocessing process in an embodiment of the present application;
Fig. 5 is a schematic diagram of the process of constructing classifiers with the SNP classifier model in an embodiment of the present application;
Fig. 6 is a schematic diagram of the ensemble learning process in an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Example 1
For ease of understanding, referring to Fig. 1, the present application provides an embodiment of a multi-modal Alzheimer's disease medical image recognition and classification method, comprising:
Step 101, constructing medical databases of different populations of Alzheimer's disease, wherein the medical databases comprise coronal MRI image data and gene SNP data.
The invention relates to multi-modal ensemble learning, which needs to combine two data modalities, medical images and genomics. A stable and reliable database of medical images and SNP data for different populations of Alzheimer's disease, containing coronal MRI image data and gene SNP data, therefore needs to be constructed in advance.
Step 102, after image preprocessing is carried out on the MRI image data, constructing classifiers using a CNN (convolutional neural network), and selecting at least three optimal classifiers as high-quality MRI-based classifiers.
Coronal MRI image data are acquired from the medical database and preprocessed as shown in Fig. 2. The preprocessing, which may be performed with SPM12 software, normalizes the original images and reduces noise to facilitate the subsequent medical image classification. It includes:
1. Noise and the influence of non-brain tissue are first removed through operations such as head-motion correction and skull stripping. The structural images of all subjects are then spatially normalized, registering the MRI images of different subjects to a common coordinate space to eliminate inter-individual differences.
2. The resulting images are Gaussian-smoothed to reduce the influence of noise, so that the data more closely follow a normal distribution and parametric tests become more valid.
3. Gray-level normalization is performed on the images.
4. Two-dimensional slicing is performed.
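Steps 3 and 4 above can be sketched as follows. This is only an illustrative sketch, not the SPM12 procedure: min-max scaling to [0, 1] is an assumed form of gray-level normalization, the function names are hypothetical, and the volume is a tiny nested list indexed as `volume[x][y][z]`, with `y` assumed to be the anterior-posterior axis along which coronal slices are taken.

```python
def normalize_gray(volume):
    """Min-max scale all voxel intensities of a 3-D volume to [0, 1] (assumed scheme)."""
    flat = [v for plane in volume for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:  # constant volume: avoid division by zero
        return [[[0.0 for _ in row] for row in plane] for plane in volume]
    return [[[(v - lo) / (hi - lo) for v in row] for row in plane] for plane in volume]


def coronal_slices(volume):
    """Cut volume[x][y][z] into 2-D slices, one per index along the y axis."""
    n_y = len(volume[0])
    return [[plane[j] for plane in volume] for j in range(n_y)]


# Synthetic 2 x 3 x 2 stand-in for a registered, smoothed MRI volume.
volume = [
    [[0, 10], [20, 30], [40, 50]],
    [[60, 70], [80, 90], [100, 110]],
]
normalized = normalize_gray(volume)
slices = coronal_slices(normalized)
print(len(slices), len(slices[0]), len(slices[0][0]))
```

Each resulting slice would then serve as the input of one slice-level CNN classifier. A real pipeline would operate on array data rather than nested lists; the lists only keep the sketch dependency-free.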
After image preprocessing, classifiers are constructed using a CNN (convolutional neural network), and at least three optimal classifiers are selected as high-quality MRI-based classifiers. As shown in Fig. 3, the CNN model consists of 6 convolutional layers (conv in Fig. 3), 3 pooling layers (pool in Fig. 3) and 3 fully connected layers (FC in Fig. 3); the last fully connected layer has only two nodes and uses a softmax function for binary classification. Each CNN base classifier is trained for 40 epochs; testing showed that 40 epochs are enough for a base classifier to converge, with classification accuracy on the original training-set slices reaching 100%. All convolutional layers use the ReLU activation function, Adam is used as the gradient update algorithm, the learning rate is set to 0.0001, and the number of input slices per batch (batch size) is set to 200.
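The selection of the slice classifiers can be sketched as below. The training settings in `TRAIN_CONFIG` are taken from the paragraph above; everything else is an assumption made for illustration: ranking by validation accuracy is one plausible reading of "optimal", `select_top_classifiers` is a hypothetical helper, and the slice indices and accuracies are made-up numbers, not results from the application.

```python
# Training settings stated in the description.
TRAIN_CONFIG = {
    "epochs": 40,
    "optimizer": "Adam",
    "learning_rate": 0.0001,
    "batch_size": 200,
    "conv_activation": "ReLU",
}


def select_top_classifiers(val_accuracy, k=3):
    """Return the k slice indices whose classifiers score highest (assumed criterion)."""
    if k < 3:
        raise ValueError("the method keeps at least three MRI-based classifiers")
    ranked = sorted(val_accuracy, key=val_accuracy.get, reverse=True)
    return ranked[:k]


# Hypothetical validation accuracies, keyed by coronal slice index.
val_accuracy = {40: 0.71, 55: 0.83, 60: 0.86, 62: 0.84, 75: 0.69}
selected = select_top_classifiers(val_accuracy)
print(selected)
```

Only the classifiers of the selected slices would be passed on to the ensemble stage.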
Step 103, preprocessing the gene SNP data using the GWAS genome-wide association analysis method to obtain an encoded SNP locus data set.
Fig. 4 shows the gene data preprocessing process. GWAS genome-wide association analysis can be performed with PLINK software, and the resulting SNP site data are encoded as 0, 1, 2 (AA-0, Aa-1, aa-2). The process is as follows:
(1) Screening according to heterozygosity
In the genotype data, every two characters represent the genotype of one SNP. For example, GGGCAATA contains the genotypes of four SNPs, namely GG, GC, AA and TA, of which GG and AA are homozygous and GC and TA are heterozygous. According to the laws of genetics, the heterozygous genotype frequencies of different samples in a natural population are similar. Subjects whose data deviate from this pattern are treated as abnormal and eliminated.
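The genotype string from the example can be parsed as follows. This is a minimal sketch: the function names are hypothetical, and flagging samples whose heterozygosity lies more than a chosen number of standard deviations from the mean is an assumed rule, since the text does not state the cut-off.

```python
def split_genotypes(seq):
    """Split a genotype string into two-character genotypes, one per SNP."""
    if not seq or len(seq) % 2:
        raise ValueError("genotype string must be non-empty and of even length")
    return [seq[i:i + 2] for i in range(0, len(seq), 2)]


def heterozygosity_rate(seq):
    """Fraction of SNPs whose two alleles differ."""
    genotypes = split_genotypes(seq)
    return sum(1 for g in genotypes if g[0] != g[1]) / len(genotypes)


def flag_outliers(rates, n_sd=3.0):
    """Sample IDs whose rate deviates from the mean by more than n_sd SDs (assumed rule)."""
    values = list(rates.values())
    mean = sum(values) / len(values)
    sd = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [s for s, v in rates.items() if sd > 0 and abs(v - mean) > n_sd * sd]


print(split_genotypes("GGGCAATA"))
print(heterozygosity_rate("GGGCAATA"))
```

For the string in the text, two of the four genotypes (GC and TA) are heterozygous, so the rate is 0.5.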
(2) Screening according to site missing rate
The SNP missing rate of a sample is an important indicator of the quality of that sample's genotype data. If a sample's site missing rate is too high, its data quality is poor and the sample must be removed so that it does not affect subsequent analysis.
(3) Screening according to site information missing rate
The site information missing rate is the missing rate of a given SNP across all subjects. If it is too high, the data quality of that SNP is poor, the SNP is unsuitable for subsequent analysis, and its information must be deleted.
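Screening steps (2) and (3) amount to row-wise and column-wise missingness filters on a sample-by-SNP matrix. The sketch below assumes missing calls are stored as `None`; the thresholds and function names are illustrative choices, not values given in the application.

```python
def sample_missing_rates(matrix):
    """Per-sample (row) fraction of missing genotype calls."""
    return [sum(g is None for g in row) / len(row) for row in matrix]


def snp_missing_rates(matrix):
    """Per-SNP (column) fraction of missing genotype calls."""
    n = len(matrix)
    return [sum(row[j] is None for row in matrix) / n for j in range(len(matrix[0]))]


def filter_by_missingness(matrix, max_sample, max_snp):
    """Drop poor-quality samples first, then poor-quality SNPs."""
    rows = [r for r, rate in zip(matrix, sample_missing_rates(matrix)) if rate <= max_sample]
    if not rows:
        return []
    cols = [j for j, rate in enumerate(snp_missing_rates(rows)) if rate <= max_snp]
    return [[row[j] for j in cols] for row in rows]


# Rows are subjects, columns are SNPs; None marks a missing call.
genotypes = [
    [0, 1, 2, 0],
    [1, None, 2, 0],
    [None, None, None, 1],
    [2, 0, 1, 0],
]
cleaned = filter_by_missingness(genotypes, max_sample=0.5, max_snp=0.3)
print(cleaned)
```

Filtering samples before SNPs is a design choice: one badly genotyped subject would otherwise inflate the missing rate of every SNP.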
(4) Screening according to Hardy-Weinberg equilibrium
The Hardy-Weinberg law of equilibrium, also known as the law of genetic equilibrium, is an important law of population genetics, demonstrated independently in 1908 and 1909 by the English mathematician G. H. Hardy (Godfrey Harold Hardy) and the German physician Wilhelm Weinberg, respectively. Its main content is that in an idealized population (free of disturbing factors such as non-random mating, natural selection, population migration, mutation, or limited population size), gene frequencies and genotype frequencies remain constant over generations and stay in a stable equilibrium.
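One common way to check a SNP against this law is to compare the observed genotype counts with those expected from the allele frequencies. The sketch below uses a one-degree-of-freedom chi-square goodness-of-fit test as a simple stand-in; it is not claimed to be the test PLINK applies, and the function name is hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic and p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotype counts supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p                      # frequency of allele B
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


print(hwe_chi_square(25, 50, 25))  # counts exactly at equilibrium
print(hwe_chi_square(50, 0, 50))   # no heterozygotes at all
```

A SNP whose p-value falls below a chosen threshold would be treated as deviating from equilibrium and removed; the threshold itself is not specified in the text.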
(5) Screening according to linkage disequilibrium
Linkage disequilibrium (LD) refers to the non-random association of alleles at two or more loci within a population. Put simply, if two genes are not inherited completely independently, linkage disequilibrium exists between them. In practice, r² is commonly used to express the strength of linkage disequilibrium between SNPs: the larger r², the stronger the linkage disequilibrium and the weaker the independence of the SNPs concerned. Because GWAS analysis ultimately seeks highly independent SNPs, SNPs in strong linkage are removed according to linkage disequilibrium (typically, one SNP is retained from each set of linked SNPs).
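The "keep one SNP per linked set" idea can be sketched as a greedy pruning pass. Here r² is approximated by the squared Pearson correlation of the 0/1/2 genotype codes, which is an assumption of this sketch rather than a haplotype-based estimate; the threshold of 0.8, the SNP names and the data are all illustrative.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype-code vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # a monomorphic SNP carries no LD information
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(snps, threshold=0.8):
    """Keep a SNP only if its r^2 with every SNP already kept is at most threshold."""
    kept = []
    for name, codes in snps.items():
        if all(r_squared(codes, snps[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical genotype codes for six subjects.
snps = {
    "snp_a": [0, 1, 2, 1, 0, 2],
    "snp_b": [0, 1, 2, 1, 0, 2],  # perfectly linked with snp_a
    "snp_c": [2, 0, 1, 1, 0, 1],
}
print(ld_prune(snps))
```

Because the pass is greedy, the SNP that survives from each linked set is simply the one that appears first in the input order.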
(6) Screening according to Individual independence
Data independence should be maintained as far as possible. If samples are closely related, or data from the same sample are collected more than once, the SNP distribution no longer reflects its natural state and the analysis results are biased. The kinship coefficient, also called the coefficient of relationship, expresses numerically the degree of similarity in genetic composition between individuals in a population and thus reflects how closely two individuals are related.
(7) Association analysis
In GWAS, the phenotype can be of two kinds: a linear phenotype, such as height, weight or intelligence; or a binary phenotype, such as diseased and unaffected (known as case and control), with 0 denoting diseased and 1 denoting unaffected. When the phenotype to be analyzed is a binary trait, a logistic regression model is typically used; when it is a linear trait, an ordinary linear regression model is typically used. The method uses a logistic regression model to obtain the significance p-value of the association between each SNP and the phenotype, selects the SNPs with high relevance according to the p-value, and encodes them as 0, 1 and 2 to form a data set.
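The final selection and encoding can be sketched as follows, assuming the per-SNP p-values have already been produced by the association step. The additive coding counts copies of the minor allele; the significance threshold, the SNP names, the p-values and the genotypes are all hypothetical and only serve the illustration.

```python
def encode_genotype(genotype, minor_allele):
    """Additive coding: number of copies of the minor allele (0, 1 or 2)."""
    if len(genotype) != 2:
        raise ValueError("a genotype is expected to hold exactly two alleles")
    return sum(1 for allele in genotype if allele == minor_allele)


def select_snps(p_values, alpha):
    """Names of SNPs with p < alpha, most significant first."""
    hits = [name for name, p in p_values.items() if p < alpha]
    return sorted(hits, key=p_values.get)


# Hypothetical association results and one subject's genotypes.
p_values = {"snp_1": 1e-7, "snp_2": 0.2, "snp_3": 3e-6}
minor = {"snp_1": "C", "snp_2": "T", "snp_3": "A"}
subject = {"snp_1": "GC", "snp_2": "TT", "snp_3": "GG"}

selected = select_snps(p_values, alpha=1e-5)
row = [encode_genotype(subject[name], minor[name]) for name in selected]
print(selected, row)
```

Repeating the last step for every subject yields the encoded SNP site data set that the SNP classifiers are trained on.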
Step 104: construct classifiers by using a decision tree as the base classifier and three integration strategies, namely a random forest classifier, a Bagging classifier and an XGBoost classifier, to obtain three SNP base classifiers.
Fig. 5 shows the SNP classifier model, which is constructed by using a decision tree as the base classifier and three integration modes, namely a random forest classifier, a Bagging classifier and an XGBoost classifier.
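To make the "decision tree as base classifier plus integration strategy" idea concrete, the following sketch implements only the Bagging strategy, with depth-1 trees (stumps) over 0/1/2 SNP codes. It is a simplified illustration under assumptions: the tree depth, the number of estimators and every name here are hypothetical. A random forest would additionally subsample features at each split, and XGBoost builds its trees sequentially by gradient boosting; neither is shown.

```python
import random


def fit_stump(X, y):
    """Depth-1 decision tree: choose the (feature, threshold, label) split
    with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in (0, 1):  # thresholds between genotype codes 0|1 and 1|2
            for hi_label in (0, 1):  # label predicted when x[j] > t
                errors = sum(
                    (hi_label if x[j] > t else 1 - hi_label) != label
                    for x, label in zip(X, y)
                )
                if best is None or errors < best[0]:
                    best = (errors, j, t, hi_label)
    _, j, t, hi_label = best
    return lambda x: hi_label if x[j] > t else 1 - hi_label


def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train stumps on bootstrap resamples; return a function giving the
    fraction of stumps that vote for class 1."""
    rng = random.Random(seed)
    n = len(X)
    stumps = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        stumps.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / len(stumps)


# Hypothetical encoded SNP data: the first SNP determines the label
X = [[0, 2], [0, 1], [1, 0], [2, 1], [2, 0], [0, 0], [2, 2], [1, 2]]
y = [0, 0, 1, 1, 1, 0, 1, 1]
predict_proba = fit_bagging(X, y)
```

The returned vote fraction plays the role of the class probability that a later weighted integration step would consume.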
Step 105: perform ensemble learning on all the high-quality MRI base classifiers and the SNP base classifiers based on the improved probability weight ensemble learning mode to obtain the final enhanced classifier.
Step 106: perform multi-modal Alzheimer's disease medical image recognition and classification by using the enhanced classifier.
As shown in fig. 6, after the MRI classifiers and the SNP classifiers are constructed, the learning mode based on improved probability weight integration is applied to obtain the final enhanced classifier. The integration method based on improved probability weighting is:
p(x) = sigmoid(w_1)·p(x|h_1) + sigmoid(w_2)·p(x|h_2) + ··· + sigmoid(w_n)·p(x|h_n)
wherein n is the number of classifiers, sigmoid() is an activation function, w is a performance index of the classifier and is derived from its probability on the validation set, p is the probability output by the current classifier, and h is the number of network layers. The method effectively mitigates imbalance among the classifier weights, thereby forming an efficient enhanced classifier. The enhanced classifier is then used for multi-modal Alzheimer's disease medical image recognition and classification.
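The weighted sum in the formula above can be sketched directly. This is an illustration under assumptions: the class labels "AD" and "CN", the weights and the per-classifier probabilities are invented for the example, and the per-class argmax is one plausible way to turn the scores into a decision. The source does not say whether the sum is normalized, so the score is left unnormalized and may exceed 1.

```python
import math


def sigmoid(w):
    """Logistic activation applied to a classifier's performance index."""
    return 1.0 / (1.0 + math.exp(-w))


def weighted_ensemble(weights, probs):
    """Score = sum_i sigmoid(w_i) * p(x | h_i); not normalized."""
    return sum(sigmoid(w) * p for w, p in zip(weights, probs))


# Hypothetical performance indices of three base classifiers
weights = [0.9, 0.8, 0.7]
# Hypothetical class probabilities from each classifier for one subject
class_probs = {
    "AD": [0.7, 0.6, 0.4],
    "CN": [0.3, 0.4, 0.6],
}
scores = {label: weighted_ensemble(weights, p) for label, p in class_probs.items()}
predicted = max(scores, key=scores.get)
```

Because the sigmoid compresses the performance indices into (0, 1), no single classifier can dominate the sum through a very large raw weight, which matches the stated aim of reducing weight imbalance.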
The ensemble learning result depends not only on the performance of the individual classifiers but also on the diversity among the integrated classifiers. For MRI, the base classifiers to be integrated are selected according to the performance of each slice classifier, which ensures that the selected slices are to some extent consistent with the clinical manifestations of the disease while also ensuring diversity among the image classifiers during integration. Practice with convolutional networks has shown that convolutional neural networks help reduce the risk of over-fitting while learning deep features of the image.
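Selecting base classifiers by slice performance can be sketched as a simple ranking. The slice indices and validation accuracies below are invented, and the function name is hypothetical. The sketch ranks by score only and does not model the diversity criterion.

```python
def select_top_k(val_scores, k=3):
    """Return the k slice indices with the highest validation score, best first."""
    ranked = sorted(val_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [slice_id for slice_id, _ in ranked[:k]]


# Hypothetical validation accuracy of the CNN trained on each coronal slice
val_acc = {40: 0.71, 55: 0.83, 62: 0.86, 70: 0.79, 88: 0.64}
best_slices = select_top_k(val_acc, k=3)
```

With `k=3` this corresponds to the minimum of three high-quality MRI base classifiers required by the method.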
GWAS is used to analyze the SNP data and reduce its dimensionality, which greatly reduces computational complexity, reduces recognition errors caused by redundant information, and improves recognition accuracy. The SNP classifiers take a decision tree as the base classifier and are constructed in several integration modes, which improves both the performance and the diversity of the SNP classifiers.
The method combines two data modalities, medical imaging and genomics. By reading the image data and combining it with gene data through genome-wide association analysis, it makes the diagnosis of Alzheimer's disease more accurate and reliable, and solves the technical problem that, in existing medical diagnosis of Alzheimer's disease, image data and genetic data are poorly fused across modalities, which impairs the accuracy of recognizing and classifying people at different stages of Alzheimer's disease.
The application also provides an embodiment of a multi-modal Alzheimer's disease medical image recognition and classification system, which comprises:
the data module, used for constructing medical databases of different populations of Alzheimer's disease, the medical databases comprising coronal MRI image data and gene SNP data;
the MRI image processing module, used for preprocessing the MRI image data, constructing classifiers by using a CNN (convolutional neural network), and selecting at least three optimal classifiers as high-quality MRI base classifiers;
the first gene data processing module, used for preprocessing the gene SNP data by using a GWAS whole genome association analysis method to obtain an encoded SNP locus data set;
the second gene data processing module, used for constructing classifiers by using a decision tree as the base classifier and three integration strategies of a random forest classifier, a Bagging classifier and an XGBoost classifier to obtain three SNP base classifiers;
the ensemble learning reinforcement module, used for performing ensemble learning on all the high-quality MRI base classifiers and the SNP base classifiers based on the improved probability weight ensemble learning mode to obtain the final enhanced classifier;
and the recognition and classification module, used for performing multi-modal Alzheimer's disease medical image recognition and classification by using the enhanced classifier.
The first gene data processing module is specifically configured to:
perform GWAS whole genome association analysis on the gene SNP data by using PLINK software, including: screening gene SNP data according to site deletion rate, screening gene SNP data according to site information deletion rate, screening gene SNP data according to Hardy-Weinberg equilibrium, screening gene SNP data according to linkage disequilibrium, screening gene SNP data according to individual independence, analyzing by using a Logistic regression model to obtain the p value for the significance of the association between each SNP and the phenotype, and selecting SNPs with high relevance according to the p value for encoding, to form an encoded SNP site data set.
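One possible way to drive these steps from a script is sketched below. It only builds PLINK 1.9 argument lists and executes nothing. The mapping of each screening step to a flag is an assumption (in particular `--geno` for the site deletion rate and `--mind` for the site information deletion rate), and the file prefixes and every threshold are hypothetical values, not taken from the source.

```python
def plink_qc_commands(bfile="snp_data", out="qc"):
    """Return PLINK 1.9 argument lists for the screening and association steps.

    All thresholds are illustrative placeholders.
    """
    return [
        # per-SNP missing-call rate
        ["plink", "--bfile", bfile, "--geno", "0.05",
         "--make-bed", "--out", f"{out}_geno"],
        # per-sample missing-call rate
        ["plink", "--bfile", f"{out}_geno", "--mind", "0.05",
         "--make-bed", "--out", f"{out}_mind"],
        # Hardy-Weinberg equilibrium test
        ["plink", "--bfile", f"{out}_mind", "--hwe", "1e-6",
         "--make-bed", "--out", f"{out}_hwe"],
        # linkage disequilibrium pruning: window, step, r^2 threshold
        ["plink", "--bfile", f"{out}_hwe", "--indep-pairwise", "50", "5", "0.2",
         "--out", f"{out}_ld"],
        # relatedness cutoff between individuals
        ["plink", "--bfile", f"{out}_hwe", "--extract", f"{out}_ld.prune.in",
         "--rel-cutoff", "0.125", "--out", f"{out}_rel"],
        # logistic-regression association test on the retained SNPs and samples
        ["plink", "--bfile", f"{out}_hwe", "--extract", f"{out}_ld.prune.in",
         "--keep", f"{out}_rel.rel.id", "--logistic", "--out", f"{out}_assoc"],
    ]


commands = plink_qc_commands()
for cmd in commands:
    print(" ".join(cmd))
```

Each list could be passed to `subprocess.run` once PLINK is installed and the input files exist; the printed form is convenient for reviewing the pipeline first.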
Image preprocessing of the MRI image data includes:
performing skull removal and registration processing on the MRI image data;
smoothing the MRI image data;
performing gray level normalization on the MRI image data;
taking two-dimensional slices of the MRI image data.
Image pre-processing of the MRI image data is performed using SPM12 software.
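The last two preprocessing steps can be illustrated on a toy volume. This sketch is not SPM12 and makes several assumptions: the volume is a nested list indexed as `volume[x][y][z]` with the second axis running anterior to posterior (so fixing `y` yields a coronal plane), and min-max scaling stands in for gray level normalization. Skull removal, registration and smoothing are not shown.

```python
def normalize_gray(volume):
    """Min-max scale the voxel intensities of a nested-list 3D volume to [0, 1]."""
    flat = [v for plane in volume for row in plane for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant volume
    return [[[(v - lo) / span for v in row] for row in plane] for plane in volume]


def coronal_slice(volume, y):
    """Extract the 2D slice at index y along the second axis (assumed coronal)."""
    return [plane[y] for plane in volume]


# Hypothetical 2 x 2 x 2 volume of raw intensities
volume = [
    [[0, 50], [100, 150]],
    [[200, 250], [300, 400]],
]
normalized = normalize_gray(volume)
slice_2d = coronal_slice(normalized, 1)
```

Each extracted 2D slice would then be the input to one CNN slice classifier.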
The ensemble learning mode based on improved probability weights is:
p(x) = sigmoid(w_1)·p(x|h_1) + sigmoid(w_2)·p(x|h_2) + ··· + sigmoid(w_n)·p(x|h_n)
wherein n is the number of classifiers, sigmoid() is an activation function, w is a performance index of the classifier, p is the probability of the current classifier, and h is the number of network layers.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Claims (10)
1. A multi-modal Alzheimer's disease medical image recognition and classification method is characterized by comprising the following steps:
constructing medical databases of different populations of Alzheimer's disease, wherein the medical databases comprise coronal MRI image data and gene SNP data;
after image preprocessing is carried out on the MRI image data, a CNN (convolutional neural network) is used for constructing classifiers, and at least three optimal classifiers are selected as high-quality MRI base classifiers;
preprocessing the gene SNP data by using a GWAS whole genome association analysis method to obtain an encoded SNP locus data set;
constructing classifiers by using a decision tree as a base classifier and using three integration strategies of a random forest classifier, a Bagging classifier and an XGBoost classifier to obtain three SNP base classifiers;
performing ensemble learning on all the high-quality MRI base classifiers and the SNP base classifiers based on an improved probability weight ensemble learning mode to obtain a final enhanced classifier;
and performing multi-modal Alzheimer's disease medical image recognition classification by using the enhanced classifier.
2. The multi-modal Alzheimer's disease medical image recognition and classification method according to claim 1, wherein the preprocessing of the gene SNP data by using a GWAS whole genome association analysis method to obtain an encoded SNP locus data set comprises:
performing GWAS whole genome association analysis on the gene SNP data by using PLINK software, wherein the GWAS whole genome association analysis comprises the following steps: screening gene SNP data according to site deletion rate, screening gene SNP data according to site information deletion rate, screening gene SNP data according to Hardy-Weinberg equilibrium, screening gene SNP data according to linkage disequilibrium, screening gene SNP data according to individual independence, analyzing by using a Logistic regression model to obtain the p value for the significance of the association between each SNP and the phenotype, and selecting SNPs with high relevance according to the p value for encoding, to form an encoded SNP site data set.
3. The multi-modal Alzheimer's disease medical image recognition and classification method according to claim 1, wherein the image preprocessing of the MRI image data comprises:
performing skull removal and registration processing on the MRI image data;
smoothing the MRI image data;
performing gray scale normalization on the MRI image data;
two-dimensional slicing is performed on the MRI image data.
4. The multi-modal Alzheimer's disease medical image recognition and classification method according to claim 3, wherein the MRI image data is image preprocessed using SPM12 software.
5. The multi-modal Alzheimer's disease medical image recognition and classification method according to claim 1, wherein the ensemble learning mode based on improved probability weights is:
p(x) = sigmoid(w_1)·p(x|h_1) + sigmoid(w_2)·p(x|h_2) + ··· + sigmoid(w_n)·p(x|h_n)
wherein n is the number of classifiers, sigmoid() is an activation function, w is a performance index of the classifier, p is the probability of the current classifier, and h is the number of network layers.
6. A multi-modal Alzheimer's disease medical image recognition and classification system is characterized by comprising:
the data module is used for constructing medical databases of different populations of Alzheimer's disease, and the medical databases comprise coronal MRI image data and gene SNP data;
the MRI image processing module is used for preprocessing the MRI image data, constructing classifiers by using a CNN (convolutional neural network), and selecting at least three optimal classifiers as high-quality MRI base classifiers;
the first gene data processing module is used for preprocessing the gene SNP data by using a GWAS whole genome association analysis method to obtain an encoded SNP locus data set;
the second gene data processing module is used for constructing classifiers by using a decision tree as a base classifier and using three integration strategies of a random forest classifier, a Bagging classifier and an XGBoost classifier to obtain three SNP base classifiers;
the ensemble learning reinforcement module is used for performing ensemble learning on all the high-quality MRI base classifiers and the SNP base classifiers based on an improved probability weight ensemble learning mode to obtain a final enhanced classifier;
and the recognition and classification module is used for performing multi-modal Alzheimer's disease medical image recognition and classification by using the enhanced classifier.
7. The multi-modal Alzheimer's disease medical image recognition and classification system according to claim 6, wherein the first gene data processing module is specifically configured to:
perform GWAS whole genome association analysis on the gene SNP data by using PLINK software, wherein the GWAS whole genome association analysis comprises the following steps: screening gene SNP data according to site deletion rate, screening gene SNP data according to site information deletion rate, screening gene SNP data according to Hardy-Weinberg equilibrium, screening gene SNP data according to linkage disequilibrium, screening gene SNP data according to individual independence, analyzing by using a Logistic regression model to obtain the p value for the significance of the association between each SNP and the phenotype, and selecting SNPs with high relevance according to the p value for encoding, to form an encoded SNP site data set.
8. The multi-modal Alzheimer's disease medical image recognition and classification system according to claim 6, wherein image preprocessing of the MRI image data comprises:
performing skull removal and registration processing on the MRI image data;
smoothing the MRI image data;
performing gray scale normalization on the MRI image data;
two-dimensional slicing is performed on the MRI image data.
9. The multi-modal Alzheimer's disease medical image recognition and classification system according to claim 8, wherein the MRI image data is image preprocessed using SPM12 software.
10. The multi-modal Alzheimer's disease medical image recognition and classification system according to claim 6, wherein the ensemble learning mode based on improved probability weights is:
p(x) = sigmoid(w_1)·p(x|h_1) + sigmoid(w_2)·p(x|h_2) + ··· + sigmoid(w_n)·p(x|h_n)
wherein n is the number of classifiers, sigmoid() is an activation function, w is a performance index of the classifier, p is the probability of the current classifier, and h is the number of network layers.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265610.4A CN112884754A (en) | 2021-03-11 | 2021-03-11 | Multi-modal Alzheimer's disease medical image recognition and classification method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110265610.4A CN112884754A (en) | 2021-03-11 | 2021-03-11 | Multi-modal Alzheimer's disease medical image recognition and classification method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN112884754A (en) | 2021-06-01 |
Family
ID=76041325
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110265610.4A Pending CN112884754A (en) | 2021-03-11 | 2021-03-11 | Multi-modal Alzheimer's disease medical image recognition and classification method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112884754A (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109589092A (en) * | 2018-10-08 | 2019-04-09 | 广州市本真网络科技有限公司 | Method and system are determined based on the Alzheimer's disease of integrated study |
CN110097128A (en) * | 2019-05-07 | 2019-08-06 | 广东工业大学 | Medical Images Classification apparatus and system |
CN110232679A (en) * | 2019-05-24 | 2019-09-13 | 潘丹 | A kind of Alzheimer's disease genetic biomarkers object determines method and system |
CN110516758A (en) * | 2019-09-02 | 2019-11-29 | 广东工业大学 | A kind of alzheimer's disease classification prediction technique and system |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113380379A (en) * | 2021-06-08 | 2021-09-10 | 上海健康医学院 | Imaging phenotype-based whole genome association analysis method, medium and equipment |
CN114372497A (en) * | 2021-08-18 | 2022-04-19 | 中电长城网际系统应用有限公司 | Multi-modal security data classification method and classification system |
CN113724863A (en) * | 2021-09-08 | 2021-11-30 | 山东建筑大学 | Automatic discrimination system, storage medium and equipment for autism spectrum disorder |
CN114202524A (en) * | 2021-12-10 | 2022-03-18 | 中国人民解放军陆军特色医学中心 | Performance evaluation method and system of multi-modal medical image |
CN117349714A (en) * | 2023-12-06 | 2024-01-05 | 中南大学 | Classification method, system, equipment and medium for medical image of Alzheimer disease |
CN117349714B (en) * | 2023-12-06 | 2024-02-13 | 中南大学 | Classification method, system, equipment and medium for medical image of Alzheimer disease |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112884754A (en) | Multi-modal Alzheimer's disease medical image recognition and classification method and system | |
US7133856B2 (en) | Binary tree for complex supervised learning | |
AU2002359549B2 (en) | Methods for the identification of genetic features | |
US7653491B2 (en) | Computer systems and methods for subdividing a complex disease into component diseases | |
US20030224394A1 (en) | Computer systems and methods for identifying genes and determining pathways associated with traits | |
CN113517066B (en) | Depression assessment method and system based on candidate gene methylation sequencing and deep learning | |
WO2004013727A2 (en) | Computer systems and methods that use clinical and expression quantitative trait loci to associate genes with traits | |
Hejase et al. | A deep-learning approach for inference of selective sweeps from the ancestral recombination graph | |
US7640113B2 (en) | Methods and apparatus for complex genetics classification based on correspondence analysis and linear/quadratic analysis | |
Bi et al. | Detecting risk gene and pathogenic brain region in EMCI using a novel GERF algorithm based on brain imaging and genetic data | |
Ying et al. | Multi-modal data analysis for alzheimer’s disease diagnosis: An ensemble model using imagery and genetic features | |
Kumar et al. | An amalgam method efficient for finding of cancer gene using CSC from micro array data | |
CN109215738B (en) | Method for predicting Alzheimer's disease-related gene | |
Alatrany et al. | Transfer learning for classification of Alzheimer's disease based on genome wide data | |
Alatrany et al. | A novel hybrid machine learning approach using deep learning for the prediction of Alzheimer disease using genome data | |
Abd El Hamid et al. | Identifying genetic biomarkers associated to Alzheimer's disease using Support Vector Machine | |
Filipovych et al. | A composite multivariate polygenic and neuroimaging score for prediction of conversion to Alzheimer's disease | |
Hejase et al. | Sia: Selection inference using the ancestral recombination graph | |
US20030077617A1 (en) | Method for diagnosis of a disease by using multiple SNP (single nucleotide polymorphism) variations and clinical data | |
JP5852902B2 (en) | Gene interaction analysis system, method and program thereof | |
Cudic et al. | Prediction of sorghum bicolor genotype from in-situ images using autoencoder-identified SNPs | |
CN110993031B (en) | Analysis method, analysis device, apparatus and storage medium for autism candidate gene | |
AU2021207383B2 (en) | Ancestry inference based on convolutional neural network | |
Sherwood et al. | Brain evolution: Mapping the inner Neandertal | |
Nahlawi | Genetic feature selection using dimensionality reduction approaches: A comparative study |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20210601 |