CN114722918A - Tumor classification method based on DNA methylation

Info

Publication number: CN114722918A
Application number: CN202210271294.6A
Authority: CN
Inventors: 马杰; 王佳甲
Original assignee: Individual
Current assignee: Individual
Priority date: 2022-03-18
Filing date: 2022-03-18
Publication date: 2022-07-08

Abstract

The invention discloses a neural tumor classification algorithm based on DNA methylation, which comprises four steps of methylation data correction processing, methylation data cluster analysis, methylation data classification analysis and high-dimensional small sample processing. The accuracy of the method applied to the tumor classification algorithm based on DNA methylation is higher than that of the traditional SVM (support vector machine), logistic regression and decision tree methods.

Description

Tumor classification method based on DNA methylation

Technical Field

The invention relates to the field of DNA methylation analysis, in particular to a tumor classification algorithm based on DNA methylation.

Background

Cluster analysis of methylation data is a form of unsupervised learning. By definition, clustering derives a grouping rule from the characteristics of the data themselves, applies it to a large number of data points or samples, and places similar data in the same group, so that members of the same group are alike and members of different groups differ. For methylation data, central nervous system tumors and soft tissue sarcomas each comprise many types, so their DNA methylation profiles necessarily differ. Unsupervised cluster analysis can separate these types, which can then be defined in light of the application, and it may even reveal new tumor types.

Classification of methylation data is a basic data mining task: according to the characteristics of the methylation data, data objects are divided into different types and then analyzed further to reveal their underlying nature. Classification is a supervised learning method for modeling or predicting discrete random variables.

The purpose of classification learning is to learn a classification function or classification model, often called a classifier, from a given set of manually labeled training samples. When new data arrive, this function is used for prediction, mapping each new data item to one of the given classes. A classifier designed specifically for DNA methylation data can, when new methylation data are input, quickly output the corresponding classification result while remaining reliable.

Methylation array data are designated 450K, 850K and so on according to the number of probes, so in machine learning terms the number of features is very large while the number of samples is relatively small. Data processing therefore involves a high-dimensional, small-sample problem, for which a suitable approach is dimension reduction of the feature data. In addition, the raw output of a classifier is only a numerical score and does not represent a probability. A classifier calibration model addresses this: its output indicates how reliable each classification result is and allows the classification performance to be evaluated.

Efficient classification of tumor data based on DNA methylation therefore requires the algorithm to be designed and optimized.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, the present invention aims to provide a neural tumor classification algorithm based on DNA methylation.

By classifying methylation data with the algorithm of the present invention, 81 tumor types can be identified from their methylation fingerprints. In addition, the t-SNE method can be applied to visualize the clustered samples.

A neural tumor classification algorithm based on DNA methylation, comprising the steps of:

a methylation data correction step:

processing the DNA methylation-based neural tumor data in the R language to obtain the signal intensities of the original IDAT files, and obtaining, through data correction calculation, the signal values of all methylation probes for further analysis;

a methylation data cluster analysis step:

performing cluster analysis on the methylation sample data by unsupervised K-means clustering, and evaluating the clustering result to find the optimal number of clusters;

displaying the resulting clusters by t-SNE dimension reduction analysis;

a methylation data classification analysis step:

establishing a random forest model by supervised classification analysis;

then dividing the data set into a plurality of parts and performing 10-fold cross-validation, in each iteration reserving one part of the data as the validation set and using the other parts as the training set;

then training the classification function by the random forest method, wherein the random forest uses a plurality of nodes for classification learning;

a high-dimensional small-sample processing step:

ranking all features by feature selection using a calibration classifier, and selecting the feature set whose feature importance exceeds 95%, which in essence prevents the model from overfitting;

wherein the DNA methylation neural tumor data are specifically 850K probe data or 450K probe data.

In a preferred embodiment of the present invention, the R language processing specifically adopts the following processing steps:

first, reading the raw data of the IDAT methylation array and creating an RGset object;

second, performing quality-control filtering on the data from the first step, the filtering threshold being a p value of less than 0.05;

third, normalizing the filtered data, the normalization specifically correcting the probe beta values so that the beta-value distributions of replicate samples are closer and the differences between replicates are reduced; the normalization uses the preprocessSWAN() function in the minfi package;

fourth, filtering the normalized data to remove duplicate data and to remove X-chromosome and Y-chromosome methylation data;

fifth, outputting the data remaining after the fourth step as a data matrix.
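The filtering logic of these steps can be illustrated outside R with a minimal sketch. It assumes that the p value of the quality-control step is a per-probe detection p value and that a probe is kept only if it is below the cutoff in every sample; the text does not state either point. The function `filter_probes`, the probe IDs and the chromosome annotation are hypothetical and are not part of minfi. Normalization is not shown.

```python
def filter_probes(beta, detection_p, chromosome, p_cutoff=0.05):
    """Keep probes that pass quality control and are not on chrX or chrY.

    beta, detection_p: dict mapping probe ID -> list of per-sample values
    chromosome: dict mapping probe ID -> chromosome name (assumed annotation)
    """
    kept = {}
    for probe, values in beta.items():
        # Drop probes located on the sex chromosomes.
        if chromosome.get(probe) in ("chrX", "chrY"):
            continue
        # Assumed rule: drop a probe if any sample fails the p-value cutoff.
        if any(p >= p_cutoff for p in detection_p[probe]):
            continue
        kept[probe] = values
    return kept


# Toy data with hypothetical probe IDs.
beta = {"cg01": [0.10, 0.12], "cg02": [0.80, 0.85], "cg03": [0.50, 0.55]}
detection_p = {"cg01": [0.001, 0.002], "cg02": [0.001, 0.20], "cg03": [0.01, 0.01]}
chromosome = {"cg01": "chr1", "cg02": "chr2", "cg03": "chrX"}
matrix = filter_probes(beta, detection_p, chromosome)
```

In this toy input, `cg02` fails quality control in one sample and `cg03` lies on the X chromosome, so only `cg01` reaches the output matrix.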

In a preferred embodiment of the present invention, the data correction calculation is performed by applying a normalization function in the minfi package of the R language for background noise reduction and normalization.

In a preferred embodiment of the present invention, the unsupervised K-means cluster analysis of the methylation sample data proceeds as follows:

step S1, randomly selecting 91 sample points from the sample set as the initial mean points;

step S2, calculating the Euclidean distance between each sample and each mean point;

step S3, assigning each sample to the cluster of its nearest mean point;

step S4, calculating the mean of all samples in each cluster and taking it as the updated mean point;

step S5, repeating steps S2 to S4 until the cluster centers no longer change, thereby obtaining the clusters;

step S6, obtaining the final 91 clusters, which completes the cluster analysis.
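Steps S1 to S6 can be sketched in plain Python as follows. This is an illustrative toy, not the implementation used in the invention: the function name `kmeans`, the fixed random seed, the iteration cap and the handling of empty clusters are assumptions, and the example uses k = 2 on six two-dimensional points rather than 91 clusters on methylation profiles.

```python
import math
import random


def kmeans(samples, k, seed=0, max_iter=100):
    """Plain K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(samples, k)          # S1: k random samples as initial means
    labels = [0] * len(samples)
    for _ in range(max_iter):
        # S2 and S3: assign each sample to its nearest mean (Euclidean distance).
        labels = [
            min(range(k), key=lambda c: math.dist(s, centers[c]))
            for s in samples
        ]
        # S4: recompute each mean from the samples assigned to it.
        new_centers = []
        for c in range(k):
            members = [s for s, lab in zip(samples, labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # assumption: keep an empty cluster's mean
        # S5: stop once the means no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers                     # S6: final cluster assignment


points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
labels, centers = kmeans(points, k=2)
```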

In a preferred embodiment of the present invention, the effect is evaluated using the following indices 1 to 3:

index 1 is inertia, an attribute of the K-Means model object; it serves as an unsupervised evaluation index that requires no ground-truth class labels, and expresses the sum of the distances from each sample to its nearest cluster center;

the smaller the inertia the better: a smaller value means the samples are more tightly concentrated around their cluster centers;

index 2 is the Rand index, in which C denotes the actual class partition and K denotes the clustering result;

defining a as the number of instance pairs placed in the same class in C and in the same cluster in K; defining b as the number of instance pairs placed in different classes in C and in different clusters in K; defining n as the total number of instances, the Rand index is $RI = \frac{2(a + b)}{n(n - 1)}$;

index 3 is mutual information, which measures the similarity between two labelings of the same data, that is, how closely the two label distributions agree;

measuring the clustering effect with mutual information requires knowledge of the actual class information.
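As an illustration of index 2, the Rand index RI = 2(a + b) / (n(n - 1)) can be computed by counting instance pairs directly. The function name `rand_index` and the toy labels are assumptions made for this sketch.

```python
from itertools import combinations


def rand_index(true_classes, clusters):
    """Rand index between an actual class partition C and a clustering K."""
    n = len(true_classes)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_class = true_classes[i] == true_classes[j]
        same_cluster = clusters[i] == clusters[j]
        if same_class and same_cluster:
            a += 1          # pair agrees: together in both C and K
        elif not same_class and not same_cluster:
            b += 1          # pair agrees: apart in both C and K
    return 2 * (a + b) / (n * (n - 1))


ri_perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
ri_partial = rand_index([0, 0, 1, 1], [0, 1, 1, 1])
```

The first call gives 1.0 because cluster names do not matter, only which instances are grouped together. In the second call a = 1 and b = 2 out of 6 pairs, so the index is 0.5.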

In a preferred embodiment of the invention, the t-SNE dimension reduction analysis simplifies the gradient formula by using the symmetric version of SNE; the gradient formula is the gradient (grad) of the cost function;

in the low-dimensional space, the similarity between two points is expressed using a t-distribution instead of a Gaussian distribution.
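The low-dimensional similarity can be sketched as follows, assuming the standard t-SNE formulation in which a Student t-distribution with one degree of freedom is used, so that the similarity of points i and j is proportional to 1 / (1 + ||y_i - y_j||^2) and is normalized over all pairs. The text does not give this formula, so the kernel, the function name `low_dim_similarities` and the example points are assumptions.

```python
def low_dim_similarities(points):
    """Pairwise similarities q_ij from a Student-t kernel (one degree of freedom)."""
    n = len(points)
    kernel = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                kernel[i][j] = 1.0 / (1.0 + sq_dist)
    total = sum(map(sum, kernel))
    # Normalize so that all q_ij with i != j sum to 1.
    return [[value / total for value in row] for row in kernel]


q = low_dim_similarities([(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)])
```

Compared with a Gaussian kernel, the heavier tails of the t-distribution give distant pairs a larger share of the similarity, which is the usual motivation for this choice in t-SNE.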

In a preferred embodiment of the invention, the data set is divided into 10 parts, and the random forest adopts 10000 nodes for classification learning.
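A minimal sketch of dividing a data set into 10 parts for cross-validation, with one part held out per iteration. The function name `k_fold_indices`, the shuffling with a fixed seed and the sample count of 25 are assumptions chosen for illustration.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k roughly equal parts
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


splits = list(k_fold_indices(25))
```

Each sample appears in exactly one validation set and in the training set of the other nine iterations.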

In a preferred embodiment of the present invention, the calibration classifier takes the classification results as a new training set and fits a relationship by logistic regression to obtain a specific probability value; the calibration scores differ between tumor types, and the probability value reflects the reliability of the result;

a sigmoid function is applied to normalize the calibration scores, so that the classification output includes both the calibration score and the probability value.

In a preferred embodiment of the invention, the logistic regression is a multi-classification model represented by the conditional probability distribution P(Y | X), in the form of a parameterized logistic distribution;

the random variable X takes real values, and the random variable Y takes the value 0 or 1;

the model parameters are estimated by a supervised learning method: $P(Y = 1 \mid X) = \frac{\exp(w \cdot X + b)}{1 + \exp(w \cdot X + b)}$.
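The conditional probability P(Y = 1 | X) = exp(w·X + b) / (1 + exp(w·X + b)) can be evaluated as in the sketch below for a scalar input. The function name `logistic_probability` and the parameter values are illustrative assumptions; the branch on the sign of z is only there to avoid overflow in `exp` and does not change the value.

```python
import math


def logistic_probability(x, w, b):
    """P(Y = 1 | X = x) for the logistic model with weight w and bias b."""
    z = w * x + b
    if z >= 0:
        # Algebraically equal to exp(z) / (1 + exp(z)), but safe for large z.
        return 1.0 / (1.0 + math.exp(-z))
    return math.exp(z) / (1.0 + math.exp(z))


p_mid = logistic_probability(0.0, w=2.0, b=0.0)
p_high = logistic_probability(3.0, w=2.0, b=0.0)
p_low = logistic_probability(-3.0, w=2.0, b=0.0)
```

Since Y takes only the values 0 and 1, P(Y = 0 | X) is 1 minus this value.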

The invention has the beneficial effects that:

the accuracy of the method applied to the tumor classification algorithm based on DNA methylation is higher than that of the traditional SVM (support vector machine), logistic regression and decision tree methods.

The features of the present invention will be apparent from the accompanying drawings and from the detailed description of the preferred embodiments which follows.

Drawings

Fig. 1 is a schematic diagram of the principle of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood, however, that the specific embodiments described here are intended only to illustrate the invention and not to limit its scope. Moreover, in the following description, well-known structures and techniques are omitted so as not to obscure the concepts of the present invention unnecessarily.

First, in the data preprocessing section, the present embodiment performs the following steps:

Step 1, data extraction: the methylation data set with accession number GSE109381 was downloaded from the GEO database.

Step 2, probe screening: probes on the X and Y chromosomes were removed, and probes not present on the 850K array were removed. In the end, 428,799 methylation probes remained.

Next, in the model training part, the present embodiment performs the following steps:

Step 3, only the probe values of the screened probes are considered. The original training data are taken as a vector X, which is divided into a training set vector and a test set vector in the ratio 8:2. The same operation is performed on the class vector Y corresponding to the original training data to obtain a training set Y and a test set Y.
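A minimal sketch of an 8:2 split that applies the same partition to X and Y so that each sample keeps its label. The text does not say whether the data are shuffled before splitting; the shuffle, the fixed seed and the function name `split_8_2` are assumptions.

```python
import random


def split_8_2(X, Y, seed=0):
    """Split X and Y with one shared random partition in the ratio 8:2."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    cut = len(order) * 8 // 10
    train, test = order[:cut], order[cut:]
    return (
        [X[i] for i in train], [Y[i] for i in train],
        [X[i] for i in test], [Y[i] for i in test],
    )


# Toy data: the label of sample x is x * x, so pairing can be checked afterwards.
X = list(range(10))
Y = [x * x for x in X]
X_train, Y_train, X_test, Y_test = split_8_2(X, Y)
```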

Step 4, the vector X and the corresponding class label vector Y are input into a random forest model consisting of N decision trees (in this implementation, N is set to 10000). The model samples the input training samples randomly with replacement to obtain N groups of training data. With 1000 decision trees the accuracy is 79%; with 2000 decision trees, 82%; with 5000 decision trees, 92%; and with 10000 decision trees, 98.1%.

Step 5, the N groups of training data from step 4 are used to train N different classifiers (decision trees). The class probabilities predicted for a sample by the N classifiers are averaged to obtain the class prediction vector P of the sample.
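The two ensemble operations described here, sampling with replacement and averaging the per-tree class probabilities, can be sketched without any tree-learning code. The function names `bootstrap_sample` and `average_probabilities` and the four probability vectors are illustrative assumptions; training of the decision trees themselves is not shown.

```python
import random


def bootstrap_sample(X, Y, rng):
    """Draw len(X) samples with replacement, keeping each sample with its label."""
    picks = [rng.randrange(len(X)) for _ in X]
    return [X[i] for i in picks], [Y[i] for i in picks]


def average_probabilities(per_tree_probabilities):
    """Average the class-probability vectors predicted by the individual trees."""
    n_trees = len(per_tree_probabilities)
    return [sum(column) / n_trees for column in zip(*per_tree_probabilities)]


rng = random.Random(0)
X_boot, Y_boot = bootstrap_sample(["s1", "s2", "s3", "s4"], [0, 1, 1, 2], rng)

# Hypothetical predictions of four trees for one sample over three classes.
P = average_probabilities([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])
```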

Step 6, so that the class prediction probability P reflects the confidence of the model's prediction, a Platt sigmoid calibrator is applied to P to obtain the calibrated probability P.
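A minimal sketch of sigmoid (Platt) calibration for a single class, assuming the usual parameterization p = 1 / (1 + exp(A·s + B)) fitted by minimizing the log loss. It is a simplification: plain gradient descent on raw 0/1 labels is used, the multi-class case would need one such fit per class, and the function names, learning rate, epoch count and toy scores are all assumptions.

```python
import math


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit A and B of p = 1 / (1 + exp(A * s + B)) by gradient descent on log loss."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_A = grad_B = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(A * s + B))
            # For this parameterization, d(loss)/dA = (y - p) * s and d(loss)/dB = y - p.
            grad_A += (y - p) * s
            grad_B += y - p
        A -= lr * grad_A / n
        B -= lr * grad_B / n
    return A, B


def calibrate(score, A, B):
    """Map an uncalibrated score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(A * score + B))


# Toy scores for one class; labels say whether the sample truly belongs to it.
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
A, B = fit_platt(scores, labels)
```

A negative fitted A means that higher scores map to higher calibrated probabilities, which is the expected direction when the uncalibrated score is informative.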

Finally, in the model test section, the present embodiment performs the following steps:

Step 7, the test set data are processed as in the data preprocessing part to obtain the corresponding test set vector X.

Step 8, the test set vector X is input into the model and the probability calibrator to obtain the probability prediction for each input test sample, and the class with the highest probability is selected as the output class of the model;

the predicted probability of the output class is recorded as the model's reference score. A result with a reference score greater than 0.9 is considered trustworthy.
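The decision rule of this step can be sketched as below. The function name `predict_with_score` and the class names are hypothetical; the 0.9 threshold is the one stated in the text.

```python
def predict_with_score(calibrated, threshold=0.9):
    """Return (output class, reference score, trustworthy flag).

    calibrated: dict mapping class name -> calibrated probability
    """
    label = max(calibrated, key=calibrated.get)   # class with the highest probability
    score = calibrated[label]                     # reference score of the prediction
    return label, score, score > threshold


confident = predict_with_score({"class_A": 0.95, "class_B": 0.03, "class_C": 0.02})
uncertain = predict_with_score({"class_A": 0.60, "class_B": 0.30, "class_C": 0.10})
```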

The results of the comparison of the process of the invention with other processes are shown in Table 1:

Algorithm	Accuracy
The method of the invention	98.1%
SVM (support vector machine)	93.5%
Logistic regression	85.6%
Decision tree	90.4%

The foregoing shows and describes the general principles and features of the present invention, together with the advantages thereof.

It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are merely illustrative of the principles of the present invention, but that various changes and modifications may be made without departing from the spirit and scope of the invention, which is defined by the appended claims and their equivalents.

Claims

1. A neural tumor classification algorithm based on DNA methylation, comprising the steps of:

a methylation data correction step:

a methylation data cluster analysis step:

a methylation data classification analysis step:

then dividing the data set into a plurality of parts and performing 10-fold cross-validation, in each iteration reserving one part of the data as the validation set and using the other parts as the training set;

a high-dimensional small-sample processing step:

2. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the R language processing specifically adopts the following processing steps:

in the first step, the first step is that,

fourth, filtering the normalized data to remove duplicate data and to remove X-chromosome and Y-chromosome methylation data;

3. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the data correction calculation is performed by applying a normalization function in the minfi package of the R language for background noise reduction and normalization.

4. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the unsupervised K-means cluster analysis of the methylation sample data specifically comprises:

5. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the effect is evaluated using the following criteria 1-3:

defining a as the number of instance pairs placed in the same class in C and in the same cluster in K; defining b as the number of instance pairs placed in different classes in C and in different clusters in K; defining n as the total number of instances, the Rand index is $RI = \frac{2(a + b)}{n(n - 1)}$;

6. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the t-SNE dimension reduction analysis simplifies the gradient formula by using the symmetric version of SNE; the gradient formula is the gradient (grad) of the cost function;

7. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the data set is divided into 10 parts, and the random forest uses 10000 nodes for classification learning.

8. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the calibration classifier takes the classification results as a new training set and fits a relationship by logistic regression to obtain a specific probability value; the calibration scores differ between tumor types, and the probability value reflects the reliability of the result;

9. The DNA methylation-based neural tumor classification algorithm of claim 1, wherein the logistic regression is a multi-classification model represented by the conditional probability distribution P(Y | X), in the form of a parameterized logistic distribution;