CN112382342A - Cancer methylation data classification method based on integrated feature selection

Info

Publication number: CN112382342A
Application number: CN202011329335.XA
Authority: CN
Inventors: 潘晓光; 田奇; 董虎弟; 陈智娇; 白丽霞
Original assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Current assignee: Shanxi Sanyouhe Smart Information Technology Co Ltd
Priority date: 2020-11-24
Filing date: 2020-11-24
Publication date: 2021-02-19

Abstract

The invention belongs to the technical field of data processing, and particularly relates to a cancer methylation data classification method based on integrated feature selection, which comprises the following steps: inputting a data set of cancer and normal samples measured at methylation sites, wherein each row in the data set represents a tested individual and is labeled as normal or cancer, and each column represents a feature site; preprocessing the data, including filtering out missing values in the data set; selecting robust differential methylation sites by an integrated feature selection method; training a multi-classifier model based on the robust differential methylation sites, and voting on the prediction results of the classifiers to obtain a final classification decision; and outputting the final classification result. The method can effectively address both the identification of differential sites in high-throughput methylation data and the classification of samples whose status is uncertain. The invention is used for classification of cancer methylation data.

Description

Cancer methylation data classification method based on integrated feature selection

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a cancer methylation data classification method based on integrated feature selection.

Background

With the development of computing and sequencing technologies, more and more large-scale biological data are being generated, and mining the value contained in these data is an important means of advancing precision medicine. DNA methylation, a widely studied epigenetic marker, plays a crucial role in tumorigenesis. Advances in high-throughput measurement technologies, such as the Infinium 450K platform, have made it possible to obtain genome-scale DNA methylation data at single-CpG-site resolution. On this basis, identifying sites that are differentially methylated between normal and cancer samples, and thereby characterizing the epigenetic differences between cancer patients and healthy persons, can improve the early detection and prevention of cancer. However, in the data currently available there is a strong imbalance between the number of samples and the number of sites (about 1:1000), which makes large-scale analysis of methylation data from cancer patients and healthy persons particularly difficult. Methods for distinguishing cancer from normal samples based on large-scale methylation data already exist, but most rely on simple feature preprocessing and a single classifier, so they have difficulty both in accurately distinguishing cancer from normal samples and in obtaining the differential methylation sites that are crucial for that distinction.

Disclosure of Invention

Aiming at the technical problem that existing methods based on large-scale methylation data have difficulty in accurately distinguishing cancer from normal samples, the invention provides a method for classifying cancer methylation data based on integrated feature selection that offers high classification accuracy, strong discriminative capability and high efficiency.

In order to solve the technical problems, the invention adopts the technical scheme that:

A method for classifying cancer methylation data based on integrated feature selection, comprising the following steps:

S1, inputting a data set of cancer and normal samples measured at methylation sites, wherein each row in the data set represents a tested individual and is labeled as normal or cancer, and each column represents a feature site;

S2, preprocessing the data, including filtering out missing values in the data set;

S3, selecting robust differential methylation sites by an integrated feature selection method;

S4, training a multi-classifier model based on the robust differential methylation sites, and voting on the prediction results of the classifiers to obtain a final classification decision;

S5, outputting the final classification result.

The data preprocessing method in step S2 comprises the following steps:

S2.1, searching for missing values in the data and, if missing values exist in the original data, filtering out the columns (features) that contain them;

S2.2, correcting the batch effect in the data that no longer contain missing values;

S2.3, filtering out the set of sites with the smallest variance: the variance of the methylation value of each site across all measured samples is calculated, all sites are sorted by variance from large to small, and approximately the last 1/3 of the sites are discarded.

In S2.2, an empirical Bayes (EB) method is adopted to eliminate the influence of the batch effect.
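
As a non-limiting illustration, steps S2.1 and S2.3 can be sketched in Python with NumPy. The function names, the toy matrix and the rounding of the 1/3 cut-off are assumptions made for this example and are not specified by the invention; the batch-effect correction of S2.2 is omitted here.

```python
import numpy as np

def drop_sites_with_missing(X):
    """S2.1: remove every column (site) that contains at least one NaN."""
    X = np.asarray(X, dtype=float)
    keep = ~np.isnan(X).any(axis=0)
    return X[:, keep], np.flatnonzero(keep)

def drop_low_variance_sites(X, drop_fraction=1 / 3):
    """S2.3: sort sites by variance (large to small) and discard the tail."""
    variances = X.var(axis=0)
    order = np.argsort(-variances, kind="stable")      # descending variance
    n_drop = int(round(X.shape[1] * drop_fraction))    # assumed rounding rule
    kept = np.sort(order[: X.shape[1] - n_drop])       # restore column order
    return X[:, kept], kept

# Toy data: 4 samples (rows) x 4 sites (columns); values are hypothetical.
beta = np.array([
    [0.10, 0.50, np.nan, 0.90],
    [0.80, 0.50, 0.30, 0.10],
    [0.20, 0.51, 0.40, 0.85],
    [0.70, 0.50, 0.35, 0.15],
])
complete, kept_a = drop_sites_with_missing(beta)       # site 2 has a NaN
filtered, kept_b = drop_low_variance_sites(complete)   # site 1 is nearly constant
site_ids = kept_a[kept_b]                              # original column indices kept
print(site_ids, filtered.shape)
```

Returning the retained column indices alongside the filtered matrix keeps the mapping back to the original site identifiers, which the later feature selection steps need.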

The integrated feature selection method in step S3 comprises the following steps:

S3.1, introducing sample diversity: the original data are randomly sampled multiple times, preserving the class proportions, to obtain different sample subsets, and a feature selection method is then applied to each sample subset to obtain different feature site sets;

S3.2, introducing function diversity: different feature selection methods are applied to the same sample subset to obtain different differential methylation site sets;

S3.3, extracting difference site sets with a plurality of feature selection methods, so that each sample subset yields two feature site subsets; taking the union of the two to obtain the feature subset corresponding to each sample subset; and finally taking the intersection of the feature subsets corresponding to all sample subsets to obtain a robust difference site set.
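
As a non-limiting illustration, the combination logic of S3.1 to S3.3 (stratified sample subsets, union within a subset, intersection across subsets) can be sketched in plain Python. The helper names and the two toy selectors are hypothetical stand-ins; in practice each selector would close over the methylation matrix and labels and return the indices of the sites it selects.

```python
import random
from typing import Callable, List, Sequence, Set

# A selector receives the training sample indices and returns selected site indices.
Selector = Callable[[List[int]], Set[int]]

def stratified_parts(labels: Sequence[str], m: int, seed: int = 0) -> List[List[int]]:
    """Split sample indices into m parts that preserve the class proportions."""
    rng = random.Random(seed)
    parts: List[List[int]] = [[] for _ in range(m)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for k, i in enumerate(idx):
            parts[k % m].append(i)
    return parts

def stable_sites(labels: Sequence[str], selectors: Sequence[Selector],
                 m: int = 5, seed: int = 0) -> Set[int]:
    """Union over selectors within each subset, intersection across subsets."""
    parts = stratified_parts(labels, m, seed)
    per_subset = []
    for held_out in range(m):
        # The sample subset is formed by the m-1 parts that are not held out.
        train = [i for k, part in enumerate(parts) if k != held_out for i in part]
        union: Set[int] = set()
        for select in selectors:
            union |= select(train)
        per_subset.append(union)
    return set.intersection(*per_subset)

# Toy selectors (hypothetical): the first one also returns a subset-dependent
# "noisy" site, which the final intersection is expected to remove.
def selector_a(train: List[int]) -> Set[int]:
    return {0, 1, 100 + train[0]}

def selector_b(train: List[int]) -> Set[int]:
    return {1, 2}

labels = ["normal"] * 6 + ["cancer"] * 6
stable = stable_sites(labels, [selector_a, selector_b], m=3)
print(sorted(stable))
```

The union keeps a site if either method finds it on a given subset, while the intersection keeps only sites that survive on every subset, which is what makes the final set robust to resampling.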

The method for obtaining the final classification decision in S4 comprises the following steps:

S4.1, training a logistic regression classifier on the result of the integrated feature selection method, wherein the logistic regression classifier is fitted by maximizing the likelihood function and maps its output through the sigmoid function to a probability over the labels {0,1}, thereby dividing the samples;

S4.2, classifying the samples with a support vector machine, which divides the samples by finding the support vectors among them and maximizing the margin between the two classes;

S4.3, classifying the samples with a random forest classifier, which divides the samples step by step according to the values of the features through tree structures;

S4.4, integrating the prediction results of the three classifiers by voting.
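
As a non-limiting illustration, the three classifiers of S4.1 to S4.3 and the vote of S4.4 can be sketched with scikit-learn. The choice of library, the linear SVM kernel, all hyperparameters and the synthetic data are assumptions made for this example; the invention does not specify them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def build_voting_model(seed=0):
    """Logistic regression, SVM and random forest combined by hard (majority) voting."""
    return VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("svm", SVC(kernel="linear")),          # kernel is an assumption
            ("forest", RandomForestClassifier(n_estimators=100, random_state=seed)),
        ],
        voting="hard",
    )

# Synthetic stand-in for the selected sites: 0 = normal, 1 = cancer.
rng = np.random.default_rng(0)
n_per_class, n_sites = 30, 8
normal = rng.normal(0.3, 0.05, size=(n_per_class, n_sites))
cancer = rng.normal(0.7, 0.05, size=(n_per_class, n_sites))
X = np.vstack([normal, cancer])
y = np.array([0] * n_per_class + [1] * n_per_class)

model = build_voting_model().fit(X, y)
new_samples = np.array([[0.3] * n_sites, [0.7] * n_sites])
predicted = model.predict(new_samples)
print(predicted)
```

Hard voting returns the class chosen by the majority of the three fitted classifiers, which corresponds to the voting step described above.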

Compared with the prior art, the invention has the following beneficial effects:

The method can effectively address both the identification of differential sites in high-throughput methylation data and the classification of samples whose status is uncertain. Through the integrated feature selection method, robust differential methylation sites in the input methylation data can be identified efficiently, and samples are classified on the basis of these robust differential methylation sites. Compared with traditional methods based on a single feature selection method and a single classifier, the method introduces integrated feature selection into the identification of differential sites, so that more reliable and more discriminative differential methylation sites can be obtained, and it effectively improves the classification accuracy for the samples to be evaluated by fusing multiple classifiers through voting.

Drawings

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a schematic diagram of the main steps of the present invention;

FIG. 3 is a flowchart illustrating an integrated feature selection method according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

A method for classifying cancer methylation data based on integrated feature selection, as shown in fig. 1, comprising the steps of:

Step 1, taking Infinium 450K platform data as an example, a data set of cancer and normal samples containing large-scale methylation sites is input, wherein each row represents a sample, i.e. a tested individual, labeled as normal or cancer, and each column represents a feature, i.e. a site;

Step 2, after the data are input, preprocessing is carried out first. The first step is to search for missing values in the data: if missing values exist in the original data, then, considering that the data are high-dimensional and contain hundreds of thousands of measured sites, the columns (features) containing missing values are filtered out. The second step is to correct for batch effects in the data that no longer contain missing values. Batch effects arise because, in practice, only a limited number of samples can be measured in a single run, so further samples may be measured days or months apart; this introduces systematic, non-biological differences that make samples from different batches not directly comparable and can lead to errors in the data. Here we use an empirical Bayes (EB) method to eliminate the influence of the batch effect. EB methods perform very well on microarray problems because they handle high-dimensional data robustly when the sample size is small. The data processed by the EB method can be used for subsequent computational analysis. The third step is to filter out the set of sites with the smallest variance. Here, the variance of the methylation value of each column (feature, or site) across all measured samples is calculated, all sites are sorted by variance from large to small, and approximately the last 1/3 of the sites are discarded. On the one hand, sites with small variance show little difference between normal and cancer samples and therefore cannot guide the subsequent classification; on the other hand, filtering out the sites with small variance reduces the dimensionality of the data, thereby saving computing resources in subsequent computational analysis.
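
As a non-limiting illustration of the idea behind empirical Bayes batch correction, the following Python sketch adjusts only the additive (location) batch shift of each site and shrinks each per-site estimate toward the batch-wide average using a normal-normal posterior mean. This is a deliberately simplified stand-in and not the full EB method referred to above: it ignores multiplicative batch effects and biological covariates, and the function name and synthetic data are assumptions.

```python
import numpy as np

def eb_location_adjust(X, batch, eps=1e-12):
    """Remove per-batch additive shifts, shrunk across sites (location-only sketch)."""
    X = np.asarray(X, dtype=float)
    batch = np.asarray(batch)
    site_mean = X.mean(axis=0)
    site_sd = X.std(axis=0, ddof=1)
    site_sd = np.where(site_sd > 0, site_sd, 1.0)
    Z = (X - site_mean) / site_sd                 # standardize each site
    Z_adj = Z.copy()
    for b in np.unique(batch):
        rows = batch == b
        n_b = int(rows.sum())
        gamma_hat = Z[rows].mean(axis=0)          # raw per-site shift of this batch
        gamma_bar = gamma_hat.mean()              # prior mean, pooled over sites
        tau2 = gamma_hat.var()                    # prior variance, pooled over sites
        sigma2 = Z[rows].var(axis=0) + eps        # within-batch variance per site
        # Posterior mean under a normal prior: shrink gamma_hat toward gamma_bar.
        gamma_star = (n_b * tau2 * gamma_hat + sigma2 * gamma_bar) / (n_b * tau2 + sigma2)
        Z_adj[rows] = Z[rows] - gamma_star
    return Z_adj * site_sd + site_mean            # back to the original scale

# Synthetic example: the second batch is shifted upward by 0.2 at every site.
rng = np.random.default_rng(0)
batch = np.array([0] * 10 + [1] * 10)
X = rng.normal(0.5, 0.05, size=(20, 50))
X[batch == 1] += 0.2
adjusted = eb_location_adjust(X, batch)
gap_before = abs(X[batch == 1].mean() - X[batch == 0].mean())
gap_after = abs(adjusted[batch == 1].mean() - adjusted[batch == 0].mean())
print(gap_before, gap_after)
```

Pooling information across sites is what lets such estimates remain stable when each batch contains only a few samples. Note that if batch membership is confounded with the normal/cancer label, a correction of this kind would also remove biological signal.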

Step 3, after completing the above preprocessing, we select robust differential methylation sites by the integrated feature selection method shown in fig. 3. The integrated feature selection method achieves robust feature selection from two angles. First, we introduce "sample diversity": the original data are randomly sampled multiple times, preserving the class proportions, to obtain different sample subsets, and a feature selection method is then applied to each sample subset to obtain different feature site sets. Second, we introduce "functional diversity": different feature selection methods are applied to the same sample subset to obtain different sets of differentially methylated sites. Specifically, we combine cross-validation with multiple feature selection methods to extract a robust site set. Borrowing the idea of multi-fold cross-validation, we first divide the preprocessed data evenly into m parts while preserving the original proportion of normal and cancer samples; one part is used as a test set to evaluate the classification performance of the feature selection result, and the remaining m-1 parts are used as the training set that serves as input. A difference site set is then extracted with several feature selection methods. Here we use the elastic net (ElasticNet) and Relief feature selection algorithms. The former combines L1 and L2 regularization to filter out irrelevant and redundant features; the latter assigns each feature a weight according to its relevance to the class label and selects the feature sites most relevant to the classification result. For each sample subset we thus obtain two feature site subsets; we take the union of the two to obtain the feature subset corresponding to that sample subset, and finally intersect the feature subsets corresponding to the m sample subsets to obtain a robust difference site set. The specific algorithm principle is shown in fig. 2.
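
As a non-limiting illustration of the Relief weighting described above, the following Python sketch implements the basic two-class Relief update: for each sampled instance, a site's weight increases by its normalized difference to the nearest sample of the other class and decreases by its difference to the nearest sample of the same class. The function names, the Manhattan distance, the number of iterations and the synthetic data are assumptions; the elastic-net selector is not shown.

```python
import numpy as np

def relief_weights(X, y, n_iter=None, seed=0):
    """Basic Relief for two classes; needs at least two samples per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n_samples, n_sites = X.shape
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xn = (X - X.min(axis=0)) / span               # scale every site to [0, 1]
    rng = np.random.default_rng(seed)
    n_iter = n_samples if n_iter is None else n_iter
    picks = rng.choice(n_samples, size=n_iter, replace=n_iter > n_samples)
    weights = np.zeros(n_sites)
    for i in picks:
        diffs = np.abs(Xn - Xn[i])                # per-site differences to sample i
        dist = diffs.sum(axis=1)                  # Manhattan distance to sample i
        dist[i] = np.inf                          # never pick the sample itself
        same = y == y[i]
        hit = np.argmin(np.where(same, dist, np.inf))    # nearest same-class sample
        miss = np.argmin(np.where(~same, dist, np.inf))  # nearest other-class sample
        weights += (diffs[miss] - diffs[hit]) / n_iter
    return weights

def top_sites(weights, k):
    """Indices of the k sites with the largest Relief weights."""
    return set(np.argsort(-weights)[:k].tolist())

# Synthetic example: sites 0 and 3 separate the classes, the others are noise.
rng = np.random.default_rng(1)
y = np.array([0] * 20 + [1] * 20)
X = rng.uniform(0, 1, size=(40, 6))
X[:, 0] = np.where(y == 1, 0.8, 0.2) + rng.normal(0, 0.02, 40)
X[:, 3] = np.where(y == 1, 0.25, 0.75) + rng.normal(0, 0.02, 40)
weights = relief_weights(X, y)
selected = top_sites(weights, 2)
print(sorted(selected))
```

A site subset obtained in this way would then be united with the subset from the elastic-net selector for the same sample subset before the intersection across the m subsets is taken.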

Step 4, based on the obtained robust differential methylation site set, we build a classification model to predict whether an unknown sample is a cancer or a normal sample. Specifically, on the result of the integrated feature selection method, a logistic regression classifier, a support vector machine and a random forest classifier are trained to classify the samples. The logistic regression classifier divides the samples by maximizing the likelihood function and mapping its output through the sigmoid function to a probability over the labels {0,1}. The support vector machine divides the samples by finding the support vectors among them and maximizing the margin between the two classes. The random forest divides the samples step by step according to the values of the features through tree structures. Since the three classifiers analyze different aspects of the samples to obtain their partitions, their decisions for the same sample may not agree. Therefore, the prediction results of the classifiers are integrated by voting. Taking one run as an example, for a given sample to be evaluated, if the three classifiers output normal, normal and cancer respectively, then according to the voting principle the final prediction for the sample is normal.
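
As a non-limiting illustration, the voting rule in the example above can be written in a few lines of Python. The function names and the second set of predictions are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label output by the most classifiers for one sample."""
    return Counter(predictions).most_common(1)[0][0]

def vote_per_sample(per_classifier_predictions):
    """Each inner list holds one classifier's predictions for all samples."""
    return [majority_vote(votes) for votes in zip(*per_classifier_predictions)]

# The case from the text: two classifiers say normal, one says cancer.
single = majority_vote(["normal", "normal", "cancer"])

# Three samples scored by three classifiers (hypothetical outputs).
combined = vote_per_sample([
    ["normal", "cancer", "cancer"],   # logistic regression
    ["normal", "cancer", "normal"],   # support vector machine
    ["cancer", "cancer", "normal"],   # random forest
])
print(single, combined)
```

With three classifiers and two classes a tie cannot occur, so the majority label is always well defined.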

Step 5, after the construction of the classifiers is completed, the class of an unknown sample to be evaluated is predicted by inputting its methylation data.

Although only the preferred embodiments of the present invention have been described in detail, the present invention is not limited to the above embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art, and all changes are encompassed in the scope of the present invention.

Claims

1. A method for classifying cancer methylation data based on integrated feature selection, comprising the following steps:

S2, preprocessing the data, including filtering out missing values in the data set;

and S5, outputting the final classification result.

2. The method for classifying cancer methylation data based on integrated feature selection according to claim 1, wherein the data preprocessing method in step S2 comprises the following steps:

S2.2, correcting the batch effect in the data that no longer contain missing values;

3. The method for classifying cancer methylation data based on integrated feature selection according to claim 2, wherein in S2.2 an empirical Bayes (EB) method is adopted to eliminate the influence of the batch effect.

4. The method for classifying cancer methylation data based on integrated feature selection according to claim 1, wherein the integrated feature selection method in step S3 comprises the following steps:

5. The method for classifying cancer methylation data based on integrated feature selection according to claim 1, wherein the method for obtaining the final classification decision in S4 comprises the following steps: