CN114722918A - Tumor classification method based on DNA methylation - Google Patents

Info

Publication number
CN114722918A
Authority
CN
China
Prior art keywords
data
methylation
analysis
dna methylation
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210271294.6A
Other languages
Chinese (zh)
Inventor
马杰
王佳甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/23213 Clustering: non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415 Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24323 Tree-organised classifiers
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Abstract

The invention discloses a DNA methylation-based neural tumor classification algorithm comprising four steps: methylation data correction processing, methylation data cluster analysis, methylation data classification analysis, and high-dimensional small-sample processing. Applied to DNA methylation-based tumor classification, the method achieves higher accuracy than the conventional SVM (support vector machine), logistic regression and decision tree methods.

Description

Tumor classification method based on DNA methylation
Technical Field
The invention relates to the field of DNA methylation analysis, in particular to a tumor classification algorithm based on DNA methylation.
Background
Cluster analysis of methylation data belongs to unsupervised learning in machine learning. By definition, clustering derives a grouping scheme from the characteristics of the data themselves for a large body of data or samples, partitions the data according to that scheme, and finally places similar data in the same group, so that items of the same kind are grouped together and items of different kinds are kept apart. For methylation data, central nervous system tumors and soft tissue sarcomas comprise many types, so their DNA methylation data necessarily differ. Unsupervised cluster analysis can reveal these different types, which can then be defined in light of the application, and it may even uncover new tumor types.
Classification of methylation data is a basic data mining technique. Based on the characteristics of the methylation data, data objects can be divided into different parts and types and then analyzed further to uncover their underlying nature. Classification is a supervised learning method for modeling or predicting discrete random variables.
The purpose of classification learning is to learn a classification function or classification model, often called a classifier, from a given set of manually labeled training samples. When new data arrive, this function is used to make a prediction that maps each new data item to one of the given classes. A classifier designed specifically for DNA methylation data can, when new methylation data are input, quickly output the corresponding classification result while maintaining reliability.
Methylation data are divided into 450K, 850K and so on according to the number of probes, so in machine learning the number of features is very large while the number of samples is relatively small. Data processing therefore involves high-dimensional small samples, and a suitable approach is to reduce the dimensionality of the feature data. In addition, the raw output of a classifier is only a numerical score and does not represent a probability. A classifier calibration model addresses this problem: its output indicates how reliable a classification is, so that the classification result is well founded and the classification performance can be evaluated.
Therefore, classifying tumor data efficiently on the basis of DNA methylation requires the algorithm to be designed and optimized.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks of the prior art, the present invention aims to provide a neural tumor classification algorithm based on DNA methylation.
By classifying methylation data with the algorithm of the present invention, 81 tumor types can be identified using methylation fingerprints. In addition, the t-SNE method can be applied to visualize the clustering of the samples.
A neural tumor classification algorithm based on DNA methylation, comprising the following steps:
a methylation data correction processing step:
processing the DNA methylation-based neural tumor data in the R language to obtain the signal intensities of the original IDAT files, and finally obtaining, through data correction calculation, the signal values of all methylation probes for further analysis;
a methylation data cluster analysis step:
performing cluster analysis on the methylation sample data by unsupervised K-means clustering, and evaluating the clustering result to find the optimal number of clusters;
displaying the different clusters of data through t-SNE dimension reduction analysis;
a methylation data classification analysis step:
establishing a random forest model by supervised classification analysis;
then dividing the data set into a plurality of parts and performing 10-fold cross-validation, in each iteration holding out one part of the data as the validation set and using the remaining parts as the training set;
then training the classification function with the random forest method, the random forest using a plurality of nodes for classification learning;
a high-dimensional small-sample processing step:
ranking all features through feature selection using a calibrated classifier, and selecting the feature set whose feature importance exceeds 95%, which in essence prevents overfitting of the model;
the DNA methylation neural tumor data being specifically 850K probe data or 450K probe data.
In a preferred embodiment of the present invention, the R language processing comprises the following steps:
first, reading the raw IDAT data of the methylation chip and creating an RGset object;
second, performing quality-control filtering on the data from the first step, with a filtering threshold of p < 0.05;
third, normalizing the filtered data, the normalization specifically correcting the probe beta values so that the beta-value distributions of replicate samples are closer and the differences between replicate samples are reduced; normalization uses the preprocessSWAN() function in the minfi package;
fourth, filtering the normalized data to remove duplicate data and to remove X-chromosome and Y-chromosome methylation data;
fifth, outputting the data remaining after the fourth step as a data matrix.
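The filtering logic of the second, fourth and fifth steps can be illustrated with a minimal Python sketch. This is not the R/minfi pipeline named in the text: the `ProbeRecord` layout, the `filter_probes` helper and the toy values are hypothetical, the quality-control p-value is simplified to one value per probe, and the SWAN normalization of the third step is assumed to have been applied already.

```python
from typing import NamedTuple


class ProbeRecord(NamedTuple):
    """Hypothetical per-probe record used only for this illustration."""
    probe_id: str      # probe identifier
    chromosome: str    # e.g. "chr1", "chrX"
    p_value: float     # quality-control p-value (simplified: one per probe)
    betas: tuple       # already-normalized beta values, one per sample


def filter_probes(records, p_threshold=0.05):
    """QC-filter, de-duplicate and drop chrX/chrY probes, then build a matrix."""
    seen = set()
    kept = []
    for rec in records:
        if rec.p_value >= p_threshold:           # second step: keep p < 0.05
            continue
        if rec.chromosome in ("chrX", "chrY"):   # fourth step: drop sex chromosomes
            continue
        if rec.probe_id in seen:                 # fourth step: drop duplicate data
            continue
        seen.add(rec.probe_id)
        kept.append(rec)
    # fifth step: probe-by-sample data matrix plus its row labels
    probe_ids = [rec.probe_id for rec in kept]
    matrix = [list(rec.betas) for rec in kept]
    return probe_ids, matrix


records = [
    ProbeRecord("cg0001", "chr1", 0.001, (0.10, 0.80)),
    ProbeRecord("cg0002", "chrX", 0.001, (0.50, 0.55)),
    ProbeRecord("cg0003", "chr2", 0.200, (0.30, 0.35)),
    ProbeRecord("cg0001", "chr1", 0.001, (0.10, 0.80)),
    ProbeRecord("cg0004", "chr7", 0.010, (0.90, 0.20)),
]
probe_ids, matrix = filter_probes(records)
print(probe_ids)
print(matrix)
```

Only the two autosomal probes that pass the threshold and are not duplicates survive in this toy input.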
In a preferred embodiment of the present invention, the data correction calculation applies a normalization function in the minfi package of the R language to perform background noise reduction and normalization.
In a preferred embodiment of the present invention, the unsupervised K-means cluster analysis of the methylation sample data is performed as follows:
step S1, randomly selecting 91 sample points from the sample set as the initial mean points;
step S2, calculating the Euclidean distance between each sample and each mean point;
step S3, assigning each sample to the cluster of its nearest mean point;
step S4, calculating the mean of all samples in each cluster and taking it as the updated mean point;
step S5, repeating steps S2 to S4 until the cluster centers no longer change, to obtain the clustered data;
step S6, obtaining the final 91 clusters of data, which completes the cluster analysis.
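Steps S1 to S6 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the invention: the function name `kmeans`, the `seed` and `max_iter` parameters, and the six toy points are hypothetical, and the toy run uses k = 2 instead of the 91 clusters named in the text.

```python
import math
import random


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain K-means following steps S1 to S6 (illustrative sketch)."""
    rng = random.Random(seed)
    # S1: pick k distinct samples at random as the initial mean points
    centers = [list(p) for p in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # S2 + S3: assign each sample to the nearest mean point (Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: math.dist(x, centers[c]))
            for x in samples
        ]
        # S5: unchanged assignments imply unchanged centers, so stop
        if new_labels == labels:
            break
        labels = new_labels
        # S4: recompute each mean point from the samples assigned to it
        for c in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # S6: final cluster assignment and centers
    return labels, centers


samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
labels, centers = kmeans(samples, k=2)
print(labels)
```

On real methylation data each sample would be a vector of probe values, and the number of clusters would be chosen with the evaluation indices described next.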
In a preferred embodiment of the present invention, the clustering result is evaluated using the following indices 1 to 3:
index 1 is the inertia index; inertia is an attribute of the K-Means model object, serves as an unsupervised evaluation index that needs no ground-truth class labels, and expresses the sum of the distances from each sample to its nearest cluster center;
the smaller the inertia the better: a smaller value means the samples are more tightly concentrated within their clusters;
index 2 is the Rand index, in which C denotes the actual class partition and K denotes the clustering result;
a is defined as the number of pairs of instances that are in the same class in C and in the same cluster in K; b is defined as the number of pairs of instances that are in different classes in C and in different clusters in K; n is defined as the total number of instances; the Rand index is then $\mathrm{RI} = \frac{2(a + b)}{n(n - 1)}$;
index 3 is the mutual information index, which measures the similarity between two labelings of the same data, that is, how closely the two label distributions agree;
using mutual information to measure the clustering effect requires knowledge of the actual class information.
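The three indices can be computed with a short standard-library sketch. The function names and toy labels are hypothetical. One assumption is labeled explicitly: `inertia` below sums squared Euclidean distances, which is the conventional K-means objective, whereas the text speaks only of "distances".

```python
import math
from collections import Counter
from itertools import combinations


def inertia(samples, centers):
    """Sum over samples of the squared distance to the nearest center (assumption: squared)."""
    return sum(min(math.dist(x, c) ** 2 for c in centers) for x in samples)


def rand_index(true_labels, cluster_labels):
    """RI = 2(a + b) / (n(n - 1)), with a and b counted over all instance pairs."""
    n = len(true_labels)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_class and same_cluster:
            a += 1          # same class in C and same cluster in K
        elif not same_class and not same_cluster:
            b += 1          # different class in C and different cluster in K
    return 2 * (a + b) / (n * (n - 1))


def mutual_information(true_labels, cluster_labels):
    """Mutual information (natural log) between two labelings of the same data."""
    n = len(true_labels)
    joint = Counter(zip(true_labels, cluster_labels))
    class_counts = Counter(true_labels)
    cluster_counts = Counter(cluster_labels)
    return sum(
        (n_ck / n) * math.log(n * n_ck / (class_counts[c] * cluster_counts[k]))
        for (c, k), n_ck in joint.items()
    )


C = [0, 0, 1, 1]            # actual classes
K_good = [1, 1, 0, 0]       # same partition, cluster ids permuted
K_bad = [0, 1, 0, 1]        # partition unrelated to the classes
print(rand_index(C, K_good), rand_index(C, K_bad))
print(mutual_information(C, K_good), mutual_information(C, K_bad))
```

Both supervised indices are invariant to how the cluster identifiers are numbered, which is why `K_good` scores perfectly even though its ids are swapped relative to `C`.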
In a preferred embodiment of the invention, the t-SNE dimension reduction analysis uses the symmetric version of SNE, which simplifies the formula for the gradient of the cost function;
in the low-dimensional space, the similarity between two points is expressed using a t-distribution instead of a Gaussian distribution.
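The two properties named here can be made concrete with a small sketch of the standard t-SNE formulation, which the text does not spell out and which is therefore stated as an assumption: the low-dimensional similarity is q_ij proportional to (1 + ||y_i - y_j||^2)^(-1), a Student t kernel with one degree of freedom, and the simplified gradient is 4 * sum_j (p_ij - q_ij)(y_i - y_j)(1 + ||y_i - y_j||^2)^(-1). The function names and toy embedding are hypothetical.

```python
def low_dim_affinities(Y):
    """Student-t similarities q_ij between embedded points, normalized over all pairs."""
    n = len(Y)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
                w[i][j] = 1.0 / (1.0 + d2)   # heavy-tailed t kernel
    total = sum(map(sum, w))
    q = [[w[i][j] / total for j in range(n)] for i in range(n)]
    return q, w


def tsne_gradient(P, Y):
    """Gradient of the KL cost with respect to each embedded point (symmetric form)."""
    q, w = low_dim_affinities(Y)
    n, dim = len(Y), len(Y[0])
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                coeff = 4.0 * (P[i][j] - q[i][j]) * w[i][j]
                for d in range(dim):
                    grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad


Y = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]   # toy 2-D embedding
q, _ = low_dim_affinities(Y)
print(q[0][1], q[0][2])
print(tsne_gradient(q, Y))   # zero when the high-dimensional P already equals q
```

A full t-SNE run would also compute the symmetric high-dimensional similarities P from the probe data and iterate gradient steps on Y; those parts are omitted here.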
In a preferred embodiment of the invention, the data set is divided into 10 parts, and the random forest adopts 10000 nodes for classification learning.
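The 10-part split can be sketched as follows, with each part held out once as the validation set while the other nine form the training set. The generator name `k_fold_indices`, the shuffling seed and the sample count of 25 are hypothetical choices made for illustration.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (training, validation) index lists, one pair per fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices into n_folds nearly equal parts
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        validation = folds[held_out]
        training = [idx for f, fold in enumerate(folds)
                    if f != held_out for idx in fold]
        yield training, validation


splits = list(k_fold_indices(25))
for training, validation in splits[:2]:
    print(len(training), len(validation))
```

Every sample appears in exactly one validation set, so averaging the ten validation accuracies uses each sample once for evaluation.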
In a preferred embodiment of the present invention, the calibration classifier takes the classification results as a new training set and fits a relationship by logistic regression to obtain specific probability values; the calibration scores differ between tumor types, and the probability values reflect the reliability of the result;
a sigmoid function is applied to regularize the calibration scores, so that the classification result is output together with both a calibration score and a probability value.
In a preferred embodiment of the invention, the logistic regression is a multi-classification model represented by the conditional probability distribution P(Y | X), in the form of a parameterized logistic distribution;
the random variable X takes real values, and the random variable Y takes the value 0 or 1;
the model parameters are estimated by a supervised learning method: $P(Y = 1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$.
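As a minimal sketch of how such a calibrator might be fitted, the following code estimates w and b of the formula above by batch gradient descent on the log-loss, mapping raw classifier scores to probabilities. The optimizer, learning rate, epoch count, function names and the toy scores and labels are all assumptions made for illustration; the text does not specify how the parameters are estimated beyond "a supervised learning method".

```python
import math


def sigmoid(z):
    """exp(z) / (1 + exp(z)), evaluated in a numerically stable way."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_calibrator(scores, labels, lr=0.5, epochs=2000):
    """Fit P(Y=1|x) = sigmoid(w*x + b) by gradient descent on the mean log-loss."""
    w, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(scores, labels):
            err = sigmoid(w * x + b) - y   # derivative of the log-loss wrt (w*x + b)
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# toy raw scores for one class and whether the sample truly belongs to it
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 0, 1, 1]
w, b = fit_calibrator(scores, labels)
calibrated = [sigmoid(w * s + b) for s in scores]
print(round(w, 3), round(b, 3))
```

For a multi-class problem, one such calibrator per class (one class against the rest) followed by renormalization is a common arrangement, though the text does not say which arrangement is used.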
The beneficial effect of the invention is as follows:
applied to DNA methylation-based tumor classification, the method achieves higher accuracy than the conventional SVM (support vector machine), logistic regression and decision tree methods.
The features of the present invention will be apparent from the accompanying drawings and from the detailed description of the preferred embodiments which follows.
Drawings
Fig. 1 is a schematic diagram of the principle of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood, however, that the specific embodiments described here are intended only to illustrate the invention and not to limit its scope. Moreover, in the following description, well-known structures and techniques are omitted so as not to obscure the concepts of the present invention unnecessarily.
First, in the data preprocessing part, the present embodiment performs the following steps:
Step 1, data extraction: the methylation data set numbered GSE109381 was downloaded from the GEO database.
Step 2, probe screening: probes on the X and Y chromosomes were removed, as were probes not present in 850K. In the end, 428,799 methylation probes remained.
Next, in the model training part, the present embodiment performs the following steps:
Step 3: only the probe values corresponding to the screened probes are considered. The original training data are taken as the vector X, which is divided into a training-set vector and a test-set vector at a ratio of 8:2. The same operation is applied to the class vector Y corresponding to the original training data, giving a training-set Y and a test-set Y.
Step 4: the vector X and the corresponding class-label vector Y are input into a random forest model consisting of N decision trees (N is set to 10000 in this implementation). The model randomly samples the input training samples with replacement to obtain N groups of training data. With 1000 decision trees the accuracy is 79%; with 2000 decision trees, 82%; with 5000 decision trees, 92%; and with 10000 decision trees, 98.1%.
Step 5: the N groups of training data from step 4 are used to train N different classifiers (decision trees). Averaging the class probabilities that the N classifiers predict for a sample gives the class prediction vector P of that sample.
Step 6: so that the class prediction probability P reflects the confidence of the model's prediction, a Platt-style sigmoid calibrator is used to calibrate P, giving the calibrated probability P.
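The resampling and averaging in steps 4 and 5 can be illustrated with a deliberately small sketch. It is not a full random forest: each base learner is a depth-1 decision tree (a stump) on one randomly chosen feature, used here as a stand-in for the full decision trees of the embodiment, and the toy run uses 50 learners rather than N = 10000. All function names and the six toy samples are hypothetical.

```python
import random
from collections import Counter


def bootstrap(X, y, rng):
    """Draw len(X) samples with replacement (one group of training data)."""
    idx = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in idx], [y[i] for i in idx]


def class_probs(labels, classes):
    counts = Counter(labels)
    return {c: counts[c] / len(labels) for c in classes}


def fit_stump(X, y, classes, rng):
    """Depth-1 tree on one random feature; leaves store class frequencies."""
    f = rng.randrange(len(X[0]))
    values = sorted({row[f] for row in X})
    best = None
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        # size-weighted Gini impurity of the two leaves
        score = sum(
            len(side) * (1 - sum(p * p for p in class_probs(side, classes).values()))
            for side in (left, right)
        )
        if best is None or score < best[0]:
            best = (score, t, class_probs(left, classes), class_probs(right, classes))
    if best is None:                      # feature is constant in this bootstrap group
        probs = class_probs(y, classes)
        return lambda row: probs
    _, t, left_p, right_p = best
    return lambda row: left_p if row[f] <= t else right_p


def fit_forest(X, y, n_trees, seed=0):
    rng = random.Random(seed)
    classes = sorted(set(y))
    trees = [fit_stump(*bootstrap(X, y, rng), classes, rng) for _ in range(n_trees)]

    def predict_proba(row):
        votes = [tree(row) for tree in trees]
        # step 5: average the class probabilities over all base classifiers
        return {c: sum(v[c] for v in votes) / n_trees for c in classes}

    return predict_proba


X = [(0.1, 0.2), (0.2, 0.1), (0.15, 0.3), (0.8, 0.9), (0.9, 0.7), (0.85, 0.8)]
y = ["A", "A", "A", "B", "B", "B"]
predict_proba = fit_forest(X, y, n_trees=50)
print(predict_proba((0.1, 0.1)), predict_proba((0.9, 0.9)))
```

The averaged vector returned by `predict_proba` plays the role of the class prediction vector P, which step 6 then passes to the calibrator.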
Finally, in the model test section, the present embodiment performs the following steps:
Step 7: the test-set data are processed as in the data preprocessing part to obtain the corresponding test-set vector X.
Step 8: the test-set vector X is input into the model and the probability calibrator to obtain the predicted probabilities for the input test sample, and the class with the highest probability is selected as the output class of the model;
the predicted probability of the output class is recorded as the model's reference score. A reference score greater than 0.9 is considered a trustworthy result.
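The decision rule of step 8 amounts to taking the most probable class and checking its calibrated probability against the 0.9 threshold. A minimal sketch follows; the function name `classify` and the class names in the toy inputs are hypothetical.

```python
def classify(calibrated_probs, threshold=0.9):
    """Return (output class, reference score, whether the score exceeds the threshold)."""
    label = max(calibrated_probs, key=calibrated_probs.get)
    score = calibrated_probs[label]
    return label, score, score > threshold


print(classify({"class_a": 0.95, "class_b": 0.03, "class_c": 0.02}))
print(classify({"class_a": 0.60, "class_b": 0.30, "class_c": 0.10}))
```

In the second call the top class is still reported, but its reference score falls below the threshold and the result would not be treated as trustworthy.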
The results of comparing the method of the invention with other methods are shown in Table 1:
Table 1
Algorithm                        Accuracy
Method of the invention          98.1%
SVM (support vector machine)     93.5%
Logistic regression              85.6%
Decision tree                    90.4%
The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof.
It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims (9)

1. A neural tumor classification algorithm based on DNA methylation, comprising the following steps:
a methylation data correction processing step:
processing the DNA methylation-based neural tumor data in the R language to obtain the signal intensities of the original IDAT files, and finally obtaining, through data correction calculation, the signal values of all methylation probes for further analysis;
a methylation data cluster analysis step:
performing cluster analysis on the methylation sample data by unsupervised K-means clustering, and evaluating the clustering result to find the optimal number of clusters;
displaying the different clusters of data through t-SNE dimension reduction analysis;
a methylation data classification analysis step:
establishing a random forest model by supervised classification analysis;
then dividing the data set into a plurality of parts and performing 10-fold cross-validation, in each iteration holding out one part of the data as the validation set and using the remaining parts as the training set;
then training the classification function with the random forest method, the random forest using a plurality of nodes for classification learning;
a high-dimensional small-sample processing step:
ranking all features through feature selection using a calibrated classifier, and selecting the feature set whose feature importance exceeds 95%, which in essence prevents overfitting of the model;
the DNA methylation neural tumor data being specifically 850K probe data or 450K probe data.
2. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the R language processing comprises the following steps:
first, reading the raw IDAT data of the methylation chip and creating an RGset object;
second, performing quality-control filtering on the data from the first step, with a filtering threshold of p < 0.05;
third, normalizing the filtered data, the normalization specifically correcting the probe beta values so that the beta-value distributions of replicate samples are closer and the differences between replicate samples are reduced; normalization uses the preprocessSWAN() function in the minfi package;
fourth, filtering the normalized data to remove duplicate data and to remove X-chromosome and Y-chromosome methylation data;
fifth, outputting the data remaining after the fourth step as a data matrix.
3. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the data correction calculation applies a normalization function in the minfi package of the R language to perform background noise reduction and normalization.
4. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the unsupervised K-means cluster analysis of the methylation sample data is performed as follows:
step S1, randomly selecting 91 sample points from the sample set as the initial mean points;
step S2, calculating the Euclidean distance between each sample and each mean point;
step S3, assigning each sample to the cluster of its nearest mean point;
step S4, calculating the mean of all samples in each cluster and taking it as the updated mean point;
step S5, repeating steps S2 to S4 until the cluster centers no longer change, to obtain the clustered data;
step S6, obtaining the final 91 clusters of data, which completes the cluster analysis.
5. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the clustering result is evaluated using the following indices 1 to 3:
index 1 is the inertia index; inertia is an attribute of the K-Means model object, serves as an unsupervised evaluation index that needs no ground-truth class labels, and expresses the sum of the distances from each sample to its nearest cluster center;
the smaller the inertia the better: a smaller value means the samples are more tightly concentrated within their clusters;
index 2 is the Rand index, in which C denotes the actual class partition and K denotes the clustering result;
a is defined as the number of pairs of instances that are in the same class in C and in the same cluster in K; b is defined as the number of pairs of instances that are in different classes in C and in different clusters in K; n is defined as the total number of instances; the Rand index is then $\mathrm{RI} = \frac{2(a + b)}{n(n - 1)}$;
index 3 is the mutual information index, which measures the similarity between two labelings of the same data, that is, how closely the two label distributions agree;
using mutual information to measure the clustering effect requires knowledge of the actual class information.
6. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the t-SNE dimension reduction analysis uses the symmetric version of SNE, which simplifies the formula for the gradient of the cost function;
in the low-dimensional space, the similarity between two points is expressed using a t-distribution instead of a Gaussian distribution.
7. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the data set is divided into 10 parts, and the random forest uses 10000 nodes for classification learning.
8. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the calibration classifier takes the classification results as a new training set and fits a relationship by logistic regression to obtain specific probability values; the calibration scores differ between tumor types, and the probability values reflect the reliability of the result;
a sigmoid function is applied to regularize the calibration scores, so that the classification result is output together with both a calibration score and a probability value.
9. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the logistic regression is a multi-classification model represented by the conditional probability distribution P(Y | X), in the form of a parameterized logistic distribution;
the random variable X takes real values, and the random variable Y takes the value 0 or 1;
the model parameters are estimated by a supervised learning method: $P(Y = 1 \mid x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}$.
CN202210271294.6A 2022-03-18 2022-03-18 Tumor classification method based on DNA methylation Pending CN114722918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210271294.6A CN114722918A (en) 2022-03-18 2022-03-18 Tumor classification method based on DNA methylation

Publications (1)

Publication Number Publication Date
CN114722918A (en) 2022-07-08

Family

ID=82238089

Country Status (1)

Country Link
CN (1) CN114722918A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312794A (en) * 2023-01-09 2023-06-23 哈尔滨医科大学 Methylation sample clustering method fused with single cell analysis method
CN116312794B (en) * 2023-01-09 2023-11-14 哈尔滨医科大学 Methylation sample clustering method fused with single cell analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination