CN117059165A

CN117059165A - Differential methylation region selection and screening method, system, terminal and medium based on ensemble learning

Info

Publication number: CN117059165A
Application number: CN202310935440.5A
Authority: CN
Inventors: 王磊; 李玉欣; 石涵; 杨峰; 洪跟东
Original assignee: Shanghai Ruijing Biotechnology Co ltd
Current assignee: Shanghai Ruijing Biotechnology Co ltd
Priority date: 2023-07-27
Filing date: 2023-07-27
Publication date: 2023-11-14

Abstract

The invention provides a method, a system, a terminal and a medium for selecting and screening differential methylation regions based on ensemble learning, which improve the accuracy and reliability of cancer detection by using ensemble learning to select and screen differential methylation sites. In addition, extensive optimization is applied to limit the interference of missing values with data analysis: the influence of missing values in the DNA methylation profile on the model is reduced, and redundant and non-critical information in the differential methylation regions is removed, so that the screened differential methylation regions show robust predictive performance and higher accuracy when assisting early cancer diagnosis and recurrence prediction.

Description

Differential methylation region selection and screening method, system, terminal and medium based on ensemble learning

Technical Field

The invention relates to the technical field of biomedicine, in particular to a differential methylation region selection and screening method, a system, a terminal and a medium based on ensemble learning.

Background

DNA methylation is an important epigenetic mechanism in cellular regulatory systems and, because of its role in regulating gene expression, is an important component of cellular processes such as development and differentiation. Recent advances in sequencing technology have made it possible to generate high-throughput methylation data and to measure methylation at single-base resolution. However, analysis of such large DNA methylation datasets is challenging, the most critical point being the determination of significant differences in base-pair methylation levels under different biological conditions.

Functional studies have shown that methylation is often associated with genomic regions rather than individual CpG sites, such as CpG islands, CpG island shores, genomic blocks or 2 kb genomic regions. Differential methylation regions (DMRs) specific to cancer cells have also been reported many times; they are involved in many different cancer types, tissues and cell types and are associated with gene expression levels. Thus, identification of DMRs is one of the key issues in profiling disease etiology. There are many methods for detecting DMRs based on high-throughput sequencing technology; however, DNA methylation sequencing data contain a large amount of redundant and non-critical information, which makes it more difficult to find differential methylation regions and use them for cancer screening. Thus, there is a need to develop new methods to screen out true tumor markers from potential differential methylation regions to improve diagnostic accuracy and reliability.

Machine learning has been widely used for benign and malignant differential diagnosis of tumors and has the advantage of being more efficient than traditional algorithms, but its model training requires a large amount of high quality data, while the actual DNA methylation spectrum usually contains a number of missing values that affect analytical modeling of the data, which is particularly pronounced when the sample size is small. The selection of the classifier generally adopts a random forest, a support vector machine and other machine learning methods, and different machine learning methods obviously have different prediction effects, so that how to select the best machine learning classifier is still a difficult problem.

Disclosure of Invention

In view of the above drawbacks of the prior art, an object of the present invention is to provide a method, a system, a terminal and a medium for selecting and screening a differential methylation region based on ensemble learning, which are used for solving the above problems of the prior art.

To achieve the above and other related objects, the present invention provides a differential methylation region selection and screening method based on ensemble learning, the method comprising: comparing the methylation high throughput platform sequencing sample data to a human reference genome and obtaining methylation data for a plurality of methylation samples; wherein the types of methylation data samples include: disease samples and normal control samples; the methylation data includes: methylation site information corresponding to the sample and methylation conversion rate; preprocessing methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data; screening the differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data, and obtaining corresponding differential methylation site methylation rate matrix data; selecting a to-be-selected differential methylation region based on the differential methylation site methylation rate matrix data and the methylation depth matrix data, and obtaining methylation rate matrix data of each to-be-selected differential methylation region; removing potential deviation factors based on the methylation rate matrix data of each candidate difference methylation region to obtain corresponding methylation rate matrix data of each candidate difference methylation region from which potential deviation is removed; and training an integrated learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with the potential deviation removed, so as to obtain a final differential methylation region.

In one embodiment of the invention, aligning the methylation high-throughput platform sequencing sample data to the human reference genome and obtaining methylation data for a plurality of methylation samples comprises: aligning the methylation high-throughput platform sequencing sample data to a human reference genome to obtain alignments for a plurality of disease samples and normal control samples, and storing the alignments of each sample in a bam file; extracting methylation site information from each bam file; wherein the methylation site information comprises: genomic chromosome, each methylation site, and site data for each methylation site; the site data for each methylation site includes: methylation site depth and methylation rate; and calculating, based on the human reference genome and the extracted methylation site information, the methylation ratio of the sites of types CHG and CHH for each methylation sample to obtain the methylation conversion rate corresponding to each methylation sample.

In one embodiment of the invention, the preprocessing the sample methylation data comprises: performing binomial distribution test on each C site based on methylation data of each methylation sample, and screening each methylation site which is checked to be true in each methylation sample; performing depth filtration on each methylation site in each methylation sample based on the total depth of sequencing; integrating the methylation site depth and the methylation rate of each methylation site subjected to depth filtration in each methylation sample into a methylation depth matrix table and a methylation rate matrix table; wherein the methylation depth matrix table comprises methylation site depths for each methylation site of each disease sample and normal control sample; the methylation rate matrix table includes the methylation rates of each methylation site of each disease sample and normal control sample; deleting the methylation depth matrix table and the site data of the methylation sites with the missing sample proportion exceeding the preset proportion in the methylation rate matrix table to obtain a first methylation depth matrix table and a first methylation rate matrix table; calculating Beta distribution parameters for the methylation rates of all the methylation sites in the first methylation rate matrix table in all samples, and randomly taking values from Beta distribution to fill in the missing values of the methylation rates in the first methylation rate matrix table so as to obtain a second methylation rate matrix table; removing methylation depth batch effect from the first methylation depth matrix table to obtain a second methylation depth matrix table; and carrying out methylation depth smoothing treatment on the second methylation depth matrix table to obtain the smooth depth of each methylation site so as to obtain a third methylation depth matrix table.
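The missing-value steps of this preprocessing can be illustrated with a minimal sketch. It assumes that Beta parameters are estimated per site by the method of moments and that `None` marks a missing methylation rate; the function names, the 0.2 missing-proportion cutoff and the fixed random seed are hypothetical choices for illustration, not values given in this description.

```python
import random
from statistics import mean, pvariance

def drop_sparse_sites(rate_rows, max_missing=0.2):
    """Keep sites whose proportion of missing samples is <= max_missing.

    rate_rows maps a site label to a list of methylation rates, one per
    sample, with None marking a missing value (assumed representation).
    """
    kept = {}
    for site, rates in rate_rows.items():
        missing = sum(r is None for r in rates) / len(rates)
        if missing <= max_missing:
            kept[site] = rates
    return kept

def fit_beta_moments(values, eps=1e-6):
    """Method-of-moments estimate of Beta(alpha, beta) from rates in [0, 1]."""
    clipped = [min(max(v, eps), 1.0 - eps) for v in values]
    m = mean(clipped)
    v = max(pvariance(clipped), 1e-8)  # floor avoids division by zero
    common = m * (1.0 - m) / v - 1.0   # positive because v < m * (1 - m)
    return m * common, (1.0 - m) * common

def impute_missing(rate_rows, seed=0):
    """Fill missing rates with random draws from the per-site Beta fit."""
    rng = random.Random(seed)
    filled = {}
    for site, rates in rate_rows.items():
        observed = [r for r in rates if r is not None]
        if not observed or len(observed) == len(rates):
            filled[site] = list(rates)  # nothing to fit or nothing to fill
            continue
        alpha, beta = fit_beta_moments(observed)
        filled[site] = [rng.betavariate(alpha, beta) if r is None else r
                        for r in rates]
    return filled

# Toy input (synthetic values, not data from this description)
rows = {
    "chr1:100": [0.8, 0.9, None, 0.85, 0.7, 0.95, 0.9, 0.8, 0.75, 0.88],
    "chr1:200": [None, None, None, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.2],
}
kept = drop_sparse_sites(rows, max_missing=0.2)
filled = impute_missing(kept, seed=1)
```

Drawing from a fitted Beta distribution, rather than filling with the mean, keeps the imputed rates inside [0, 1] and preserves some of the per-site spread.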

In one embodiment of the present invention, performing the methylation depth smoothing on the second methylation depth matrix table includes: based on a sliding window method with a preset width, respectively calculating a median value or a fitting value of the methylation site depth of each methylation site within a preset width range for the second methylation depth matrix table so as to obtain the smooth depth of each methylation site; wherein the means for calculating the median or fit value for the methylation site depth comprises: if the number of samples in the preset width range accords with a preset threshold value, performing moving median calculation on the depth of each methylation site in the corresponding preset width range; if the number of samples in the preset width range does not accord with the preset threshold value, fitting value calculation is carried out on the methylation site depths in the corresponding preset width range by adopting local weighted linear regression.
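A minimal sketch of this smoothing rule follows. It interprets "the number of samples in the preset width range" as the number of sites falling in the window, uses a tricube kernel for the locally weighted linear regression, and defaults the regression bandwidth to twice the window width; all three are assumptions made for illustration, and the function names are hypothetical.

```python
from statistics import median

def tricube(u):
    """Tricube kernel weight, zero outside |u| < 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def local_linear_fit(xs, ys, x0, bandwidth):
    """Value at x0 of a tricube-weighted least-squares line through (xs, ys)."""
    weights = [tricube((x - x0) / bandwidth) for x in xs]
    total = sum(weights)
    if total == 0.0:
        return None
    mx = sum(w * x for w, x in zip(weights, xs)) / total
    my = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    if sxx == 0.0:
        return my  # only one distinct position carries weight
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

def smooth_depths(positions, depths, width, min_points, bandwidth=None):
    """Moving median where the window is dense enough, local fit otherwise."""
    half = width / 2.0
    if bandwidth is None:
        bandwidth = 2.0 * width  # assumed default, not from the description
    smoothed = []
    for x0 in positions:
        window = [d for p, d in zip(positions, depths) if abs(p - x0) <= half]
        if len(window) >= min_points:
            smoothed.append(median(window))
        else:
            smoothed.append(local_linear_fit(positions, depths, x0, bandwidth))
    return smoothed

# A depth spike at position 2 is removed by the moving median
dense = smooth_depths([0, 1, 2, 3, 4], [10, 10, 50, 10, 10], width=4, min_points=3)
# Isolated sites fall back to the local fit, which here returns each site's own depth
sparse = smooth_depths([0, 100, 200], [5, 7, 9], width=10, min_points=2)
```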

In one embodiment of the present invention, screening the differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data, and obtaining corresponding differential methylation site methylation rate matrix data comprises: based on a logistic regression model, performing a significance test between the disease group samples and the normal control group samples of the second methylation rate matrix table, and calculating the p value of each methylation site using the smooth depth of each methylation site as a weight; performing multiple-testing correction on the p value of each methylation site using an R language function to obtain the q value of each methylation site, and screening potential differential methylation sites to obtain a third methylation rate matrix table of the screened potential differential methylation sites; screening each potential differential methylation site based on the calculated methylation rate difference between the disease group samples and the normal control group samples to obtain a plurality of first screened differential methylation sites; filtering each first screened differential methylation site against the dbSNP database of the NCBI website to obtain the differential methylation sites and thereby the third methylation rate matrix table; and removing the site data of each differential methylation site located on the X and Y chromosomes from the third methylation rate matrix table to obtain a differential methylation site methylation rate matrix table.
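The conversion of p values to q values can be sketched as follows. The description only states that an R language function is used; this sketch assumes the Benjamini-Hochberg procedure (the method R's `p.adjust` offers as `"BH"`), which is one common choice and may differ from the correction actually applied.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values) in input order."""
    n = len(p_values)
    # Walk from the largest p value down so a running minimum enforces monotonicity
    order = sorted(range(n), key=lambda i: p_values[i], reverse=True)
    q_values = [0.0] * n
    running_min = 1.0
    for offset, idx in enumerate(order):
        rank = n - offset  # 1-based rank of this p value in ascending order
        running_min = min(running_min, p_values[idx] * n / rank)
        q_values[idx] = running_min
    return q_values

q = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Sites would then be kept as potential differential methylation sites when their q value falls below a chosen cutoff; the cutoff itself is not stated in this passage.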

In one embodiment of the present invention, selecting the candidate differential methylation regions based on the differential methylation site methylation rate matrix data and the methylation depth matrix data comprises: calculating Pearson correlation coefficients between adjacent sites of each differential methylation site based on the differential methylation site methylation rate matrix table to determine an outward extension start site; calculating a Pearson correlation coefficient matrix for each window by applying a sliding window method of preset width to the differential methylation site methylation rate matrix table, and, when the Pearson correlation coefficient matrix meets the extension conditions, extending outward from the outward extension start site of the corresponding window to obtain a plurality of methylation regions; merging overlapping methylation regions to obtain each candidate differential methylation region; calculating, for each candidate differential methylation region, the average methylation rate of all C sites or of the differentially methylated C sites, and obtaining a methylation rate matrix table of each candidate region; obtaining a methylation site depth matrix table of each candidate region based on the third methylation depth matrix table; calculating the average depth of each methylation sample in each candidate differential methylation region based on the methylation site depth matrix table of each candidate region; applying a logarithmic transformation to each average depth to obtain the log depth of each methylation sample in each candidate differential methylation region; grouping the log depths of each methylation sample by chromosome and normalizing them using the median; correcting each normalized log depth according to the GC content of the corresponding region using a moving window method so as to obtain low-coverage-depth differential methylation regions; and performing region overlap detection between each low-coverage-depth differential methylation region and the methylation rate matrix table of each candidate region, and removing the low-coverage-depth differential methylation regions from the methylation rate matrix table of each candidate region to obtain the corresponding methylation rate matrix table of the candidate differential methylation regions.
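The correlation-driven extension and the merging of overlapping regions can be illustrated with a simplified sketch. It links adjacent sites when their methylation rates across samples are strongly correlated and the sites are close together, which stands in for the windowed correlation matrix described above; the thresholds `min_r` and `max_gap` and all function names are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: treat as uncorrelated
    return sxy / sqrt(sxx * syy)

def extend_regions(positions, rates, min_r=0.8, max_gap=200):
    """Group position-sorted sites into regions of correlated neighbours.

    rates[i] holds the methylation rates of site i across all samples.
    """
    regions, start = [], 0
    for i in range(1, len(positions) + 1):
        linked = (i < len(positions)
                  and positions[i] - positions[i - 1] <= max_gap
                  and pearson(rates[i - 1], rates[i]) >= min_r)
        if not linked:  # close the current region at the previous site
            regions.append((positions[start], positions[i - 1]))
            start = i
    return regions

def merge_overlapping(regions):
    """Merge (start, end) intervals that overlap."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Synthetic example: sites 100 and 150 co-vary, 180 is anti-correlated, 1000 is far away
positions = [100, 150, 180, 1000]
rates = [[0.1, 0.2, 0.8, 0.9], [0.15, 0.25, 0.85, 0.95],
         [0.9, 0.8, 0.2, 0.1], [0.1, 0.2, 0.8, 0.9]]
regions = extend_regions(positions, rates)
merged = merge_overlapping([(100, 150), (140, 200), (300, 310)])
```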

In one embodiment of the present invention, removing potential deviation factors based on the methylation rate matrix data of each candidate differential methylation region comprises: performing a potential deviation factor removal operation, based on decomposition to the currently set dimension, on the methylation rate matrix table of each candidate differential methylation region to obtain corresponding processed data; judging, based on the processed data, whether the potential deviation factor removal at the currently set dimension is qualified; if not qualified, repeating the potential deviation factor removal operation based on decomposition to the currently set dimension until the removal at the currently set dimension is judged to be qualified; if qualified, performing a potential deviation factor qualification judgment operation and, when that judgment is also passed, acquiring the corresponding methylation rate matrix data of the candidate differential methylation regions with potential deviation factors removed; wherein performing the potential deviation factor removal operation based on decomposition to the currently set dimension on the methylation rate matrix table of each candidate differential methylation region comprises: applying standard score (z-score) normalization to the methylation rate matrix table of the corresponding candidate differential methylation region to obtain a first differential methylation region methylation rate matrix table; decomposing the first differential methylation region methylation rate matrix table to the currently set dimension by singular value decomposition (SVD) to obtain a first right singular vector matrix; taking the differential methylation region as the response variable, performing a beta-distribution log-linear fit on the right singular vector matrix to obtain a first estimated value matrix; taking the sample as the response variable, performing a binomial-distribution log-linear fit on the first estimated value matrix to obtain a second estimated value matrix; multiplying the transpose of the second estimated value matrix by the first estimated value matrix and centering the result to obtain a first matrix; and decomposing the first matrix to a fixed dimension by SVD to obtain a second right singular vector matrix of the first matrix.

In one embodiment of the present invention, judging whether the potential deviation factor removal at the currently set dimension is qualified comprises: judging whether the second right singular vector matrix is identical to the first right singular vector matrix; if they are identical, judging that the potential deviation factor removal at the currently set dimension is qualified; if they differ, judging that it is not qualified. The potential deviation factor qualification judgment operation comprises: judging whether the first differential methylation region methylation rate matrix table has converged to an expected threshold; if so, the first differential methylation region methylation rate matrix table is qualified and is used as the methylation rate matrix data of the candidate differential methylation regions with potential deviation factors removed; if not, the currently set dimension is updated to the next dimension, and the potential deviation factor removal operation is performed based on decomposition to the next dimension until the first differential methylation region methylation rate matrix table converges to the expected threshold.
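Two building blocks of this procedure, z-score normalization followed by a truncated SVD and the comparison of two right singular vector matrices, can be sketched with NumPy. The beta and binomial log-linear fitting steps are omitted. Because singular vectors are only defined up to sign, the sketch compares the subspaces they span rather than testing element-wise identity; that comparison rule, the tolerance and the function names are assumptions made for illustration.

```python
import numpy as np

def zscore_rows(matrix):
    """Standard-score normalize each row (one row per candidate region)."""
    m = np.asarray(matrix, dtype=float)
    mean = m.mean(axis=1, keepdims=True)
    std = m.std(axis=1, keepdims=True)
    std[std == 0] = 1.0  # leave constant rows at zero instead of dividing by 0
    return (m - mean) / std

def right_singular_vectors(matrix, k):
    """First k right singular vectors (as rows) of the matrix."""
    _, _, vt = np.linalg.svd(matrix, full_matrices=False)
    return vt[:k]

def same_subspace(v1, v2, tol=1e-6):
    """True when two sets of orthonormal rows span the same subspace."""
    p1 = v1.T @ v1  # projection matrices are invariant to sign and rotation
    p2 = v2.T @ v2
    return bool(np.linalg.norm(p1 - p2) < tol)

# Synthetic matrix: 6 candidate regions x 5 samples
rng = np.random.default_rng(0)
rates = rng.random((6, 5))
normalized = zscore_rows(rates)
first_v = right_singular_vectors(normalized, k=2)
```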

In one embodiment of the present invention, training the ensemble learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with potential deviation removed, so as to obtain the final differential methylation regions, comprises: obtaining the CpG density of each candidate differential methylation region from the methylation rate matrix data with potential deviation factors removed, and deleting low-density CpG regions from the methylation rate matrix data of the corresponding candidate differential methylation regions to obtain methylation rate matrix data of each screened differential methylation region; calculating the average methylation difference between the disease group samples and the normal control group samples in the methylation rate matrix data of each screened differential methylation region, and screening final potential differential methylation regions to obtain the methylation rate matrix data of each final potential differential methylation region; removing redundant regions from the final potential differential methylation regions using a recursive feature elimination method, and taking each remaining differential methylation region as a feature to form an initial feature subset; performing an importance score calculation operation on the initial feature subset based on the ensemble learning model to obtain the importance score of each feature and the classification accuracy of the initial feature subset; removing the feature with the lowest importance score from the initial feature subset to obtain a new feature subset, then performing the importance score calculation operation based on the ensemble learning model again and removing the feature with the lowest importance score to obtain a further new feature subset, and repeating until the new feature subset is empty; and taking the feature subset with the highest classification accuracy among the feature subsets obtained in this loop as the optimal feature combination to obtain the final differential methylation regions.
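The elimination loop in the last steps can be sketched independently of any particular model. Here `evaluate` is a hypothetical callback that stands in for the importance score calculation operation and returns the classification accuracy of a subset together with a per-feature importance score; the toy scorer below is synthetic and exists only to make the sketch runnable.

```python
def backward_elimination(features, evaluate):
    """Drop the least important feature each round; keep the best subset seen.

    evaluate(subset) must return (accuracy, {feature: importance}).
    """
    current = list(features)
    best_subset, best_accuracy = [], float("-inf")
    while current:
        accuracy, importance = evaluate(current)
        # ">=" prefers the smaller subset when two subsets tie on accuracy
        if accuracy >= best_accuracy:
            best_subset, best_accuracy = list(current), accuracy
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return best_subset, best_accuracy

# Synthetic scorer: two informative regions, two noise regions
IMPORTANCE = {"dmr_a": 0.5, "dmr_b": 0.3, "noise_1": 0.01, "noise_2": 0.02}

def toy_evaluate(subset):
    accuracy = (0.5 + 0.2 * ("dmr_a" in subset) + 0.2 * ("dmr_b" in subset)
                - 0.05 * sum(f.startswith("noise") for f in subset))
    return accuracy, {f: IMPORTANCE[f] for f in subset}

best, best_acc = backward_elimination(list(IMPORTANCE), toy_evaluate)
```

In this toy run the two noise features are removed first, and the subset containing only `dmr_a` and `dmr_b` scores highest.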

In an embodiment of the present invention, the ensemble learning model adopts a two-layer structure, wherein the first layer uses three learners, namely multinomial naive Bayes, random forest and gradient boosting, and the second layer uses a logistic regression classifier.
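One way to express such a two-layer structure is scikit-learn's `StackingClassifier`; the description does not name a library, so this is a sketch under that assumption. The hyperparameters, the synthetic data and the function name `build_stacked_model` are illustrative only. Note that multinomial naive Bayes requires non-negative inputs, so the sketch feeds it rates in [0, 1] rather than z-scored values.

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

def build_stacked_model(random_state=0):
    """First layer: three base learners. Second layer: logistic regression."""
    base_learners = [
        ("mnb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=random_state)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=random_state)),
    ]
    return StackingClassifier(estimators=base_learners,
                              final_estimator=LogisticRegression(),
                              cv=5)

# Synthetic methylation rates: 30 normal and 30 disease samples, 6 regions
rng = np.random.default_rng(0)
normal = rng.beta(2, 8, size=(30, 6))   # mostly low methylation
disease = rng.beta(8, 2, size=(30, 6))  # mostly high methylation
X = np.vstack([normal, disease])
y = np.array([0] * 30 + [1] * 30)

model = build_stacked_model().fit(X, y)
train_accuracy = model.score(X, y)
```

With `cv=5`, the second-layer classifier is trained on out-of-fold predictions of the base learners, which limits leakage from the first layer into the second.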

In one embodiment of the present invention, the importance score calculation operation includes: dividing the disease group samples and the normal control group samples of the methylation rate matrix data of each final potential differential methylation region equally into a set number of parts to obtain corresponding split data; and training and predicting with the ensemble learning model on the split data using cross-validation with the set number of folds, obtaining the predicted result and the true result of each sample for model optimization, and obtaining the importance score of each feature.
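The equal division of both groups can be read as a stratified k-fold split, in which every fold receives the same share of disease and normal control samples. A minimal sketch under that reading follows; the function names and the fixed seed are hypothetical.

```python
import random

def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds, balancing each class across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        indices = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % k].append(i)  # deal indices round-robin into the folds
    return folds

def cross_validation_splits(labels, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = stratified_folds(labels, k, seed)
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(i for f, fold in enumerate(folds)
                       if f != held_out for i in fold)
        yield train, test

labels = ["disease"] * 10 + ["normal"] * 10
splits = list(cross_validation_splits(labels, k=5))
```

Each sample appears in exactly one test fold, so every sample receives one out-of-fold prediction that can be compared with its true label.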

To achieve the above and other related objects, the present invention provides a differential methylation region selection and screening system based on ensemble learning, the system comprising: a methylation data extraction module, used for aligning the methylation high-throughput platform sequencing sample data to a human reference genome and obtaining methylation data of a plurality of methylation samples; wherein the types of methylation data samples include: disease samples and normal control samples; the methylation data includes: methylation site information corresponding to the sample and methylation conversion rate; a methylation data preprocessing module, connected with the methylation data extraction module and used for preprocessing the methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data; a differential methylation site acquisition module, connected with the methylation data preprocessing module and used for screening differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data and obtaining corresponding differential methylation site methylation rate matrix data; a candidate differential methylation region selection module, connected with the differential methylation site acquisition module and the methylation data preprocessing module and used for selecting the candidate differential methylation regions based on the differential methylation site methylation rate matrix data and the methylation depth matrix data and obtaining the methylation rate matrix data of each candidate differential methylation region; a potential deviation factor removal module, connected with the candidate differential methylation region selection module and used for removing potential deviation factors based on the methylation rate matrix data of each candidate differential methylation region to obtain the corresponding methylation rate matrix data of each candidate differential methylation region with potential deviation removed; and a differential methylation region screening module, connected with the potential deviation factor removal module and used for training the ensemble learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with potential deviation removed, so as to obtain the final differential methylation region.

To achieve the above and other related objects, the present invention provides a differential methylation area selection and screening terminal based on ensemble learning, comprising: one or more memories and one or more processors; the one or more memories are used for storing computer programs; the one or more processors are coupled to the memory for executing the computer program to perform the ensemble learning based differential methylation region selection and screening method.

To achieve the above and other related objects, the present invention provides a computer-readable storage medium storing a computer program which, when executed by one or more processors, performs the ensemble learning-based differential methylation region selection and screening method.

As described above, the method, system, terminal and medium of the invention for selecting and screening differential methylation regions based on ensemble learning have the following beneficial effects: the invention improves the accuracy and reliability of cancer detection by using ensemble learning to select and screen differential methylation sites. In addition, extensive optimization is applied to limit the interference of missing values with data analysis: the influence of missing values in the DNA methylation profile on the model is reduced, and redundant and non-critical information in the differential methylation regions is removed, so that the screened differential methylation regions show robust predictive performance and higher accuracy when assisting early cancer diagnosis and recurrence prediction.

Drawings

FIG. 1 is a flow chart of a differential methylation region selection and screening method based on ensemble learning according to an embodiment of the present invention.

FIG. 2 is a schematic diagram of a methylation data extraction process according to an embodiment of the invention.

FIG. 3 is a schematic diagram showing a methylation data preprocessing flow in an embodiment of the invention.

FIG. 4 is a schematic diagram showing a differential methylation site acquisition procedure in an embodiment of the present invention.

FIG. 5 is a schematic diagram showing a process for selecting a region of differential methylation to be selected according to an embodiment of the present invention.

FIG. 6 is a schematic diagram of a potential bias factor removal process according to an embodiment of the invention.

FIG. 7 is a schematic diagram of a differential methylation region screening process according to one embodiment of the present invention.

FIG. 8 is a schematic diagram of the training and prediction process of the learning model in accordance with an embodiment of the present invention.

FIG. 9 is a schematic diagram of a differential methylation region selection and screening system based on ensemble learning according to an embodiment of the present invention.

FIG. 10 is a schematic diagram showing the structure of a differential methylation region selection and screening terminal based on ensemble learning according to an embodiment of the present invention.

Detailed Description

Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or applied through other, different embodiments, and the details of this description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other provided they do not conflict.

In the following description, reference is made to the accompanying drawings, which illustrate several embodiments of the invention. It is to be understood that other embodiments may be utilized and that mechanical, structural, electrical, and operational changes may be made without departing from the spirit and scope of the present invention. The following detailed description is not to be taken in a limiting sense, and the scope of embodiments of the present invention is defined only by the claims of the issued patent. The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. Spatially relative terms, such as "upper," "lower," "left," "right," "below," "above," and the like, may be used herein to facilitate a description of one element or feature as illustrated in the figures relative to another element or feature.

Throughout the specification, when a portion is said to be "connected" to another portion, this includes not only the case of "direct connection" but also the case of "indirect connection" with other elements interposed therebetween. In addition, when a certain component is said to be "included" in a certain section, unless otherwise stated, other components are not excluded, but it is meant that other components may be included.

The terms first, second, and third are used herein to describe various portions, components, regions, layers and/or sections, which are not limited by these terms. These terms are only used to distinguish one portion, component, region, layer or section from another portion, component, region, layer or section. Thus, a first portion, component, region, layer or section discussed below could be termed a second portion, component, region, layer or section without departing from the scope of the present invention.

Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions or operations is in some way inherently mutually exclusive.

The invention provides a method for selecting and screening differential methylation regions based on ensemble learning, which improves the accuracy and reliability of cancer detection by using ensemble learning to select and screen differential methylation sites. In addition, extensive optimization is applied to limit the interference of missing values with data analysis: the influence of missing values in the DNA methylation profile on the model is reduced, and redundant and non-critical information in the differential methylation regions is removed, so that the screened differential methylation regions show robust predictive performance and higher accuracy when assisting early cancer diagnosis and recurrence prediction.

The embodiments of the present invention will be described in detail below with reference to the attached drawings so that those skilled in the art to which the present invention pertains can easily implement the present invention. This invention may be embodied in many different forms and is not limited to the embodiments described herein.

FIG. 1 shows a flow chart of a differential methylation region selection and screening method based on ensemble learning in an embodiment of the present invention.

The method comprises the following steps:

Step S1: the methylation high-throughput platform sequencing sample data are aligned to a human reference genome, and methylation data for a plurality of methylation samples are obtained.

In detail, the types of the methylation data samples include: a disease sample corresponding to the disease group and a normal control sample corresponding to the normal control group; and methylation data for the methylation samples includes: methylation site information and methylation conversion rate of the corresponding samples.

The high-throughput methylation sequencing platform used to collect the sequencing sample data includes, but is not limited to, existing sequencing platforms such as Illumina and DNBSEQ. The human reference genome is taken from the UCSC website and may be of different versions, such as hg19/GRCh37 or hg38/GRCh38.

In one embodiment, as shown in fig. 2, step S1 includes:

step S11: aligning the high-throughput methylation sequencing sample data to a human reference genome to obtain a plurality of disease samples and normal control samples, and storing each disease sample and normal control sample in a BAM file;

step S12: extracting the methylation site information from each BAM file; wherein the methylation site information comprises: the genomic chromosome, each methylation site, and the site data of each methylation site; the site data of each methylation site includes: the methylation site depth and the methylation rate; wherein the methylation site depth comprises: the depth of C bases where methylation occurs and the depth of C bases where no methylation occurs;

The methylation rate is calculated as:

$$\text{methylation rate} = \frac{mC}{mC + umC}$$

where mC is the depth of the C base where methylation occurs and umC is the depth of the C base where no methylation occurs.

Step S13: the methylation ratio of the sites of the chromosome locus types CHG and CHH for each methylation sample is calculated based on the human reference genome and the extracted methylation locus information to obtain the methylation conversion rate corresponding to each methylation sample.

A site is of type CHG when the C site together with its two following bases reads CAG, CTG or CCG; the corresponding cases on the reverse complementary strand are also counted. A site is of type CHH when the C site together with its two following bases reads CAA, CTA, CCA, CAC, CTC, CCC, CAT, CTT or CCT; the corresponding cases on the reverse complementary strand are also counted.

The methylation conversion (bs_ratio) is calculated as:

wherein total mC is the total depth of methylated C bases of CHG and CHH, and total umC is the total depth of unmethylated C bases of CHG and CHH.
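The context classification and the pooling of CHG/CHH depths described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Site` record and both function names are hypothetical, the three bases are assumed to be given already in the orientation of the strand carrying the C, and the function returns the pooled methylated fraction total mC / (total mC + total umC). Whether bs_ratio is this fraction or its complement is not recoverable from the text and is left as an assumption for the reader.

```python
from collections import namedtuple

# Hypothetical record: the C plus its two following bases, and the two depths.
Site = namedtuple("Site", "context_bases mC umC")


def cytosine_context(bases):
    """Classify a C site as 'CG', 'CHG' or 'CHH' from the C and its next two bases."""
    bases = bases.upper()
    if len(bases) != 3 or bases[0] != "C" or any(b not in "ACGT" for b in bases):
        raise ValueError("expected three bases A/C/G/T starting with C")
    if bases[1] == "G":
        return "CG"
    if bases[2] == "G":
        return "CHG"   # CAG, CTG, CCG
    return "CHH"       # CAA, CTA, CCA, CAC, CTC, CCC, CAT, CTT, CCT


def non_cpg_methylated_fraction(sites):
    """Pool mC and umC over all CHG and CHH sites and return mC / (mC + umC)."""
    total_mC = total_umC = 0
    for s in sites:
        if cytosine_context(s.context_bases) in ("CHG", "CHH"):
            total_mC += s.mC
            total_umC += s.umC
    depth = total_mC + total_umC
    return total_mC / depth if depth else float("nan")
```

CpG sites are deliberately excluded from the pooled counts, since only the CHG and CHH types enter the conversion calculation in step S13.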

Step S2: preprocessing the methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data.

In one embodiment, as shown in fig. 3, step S2 includes:

step S21: performing a binomial distribution test on each C site based on the methylation site information of each methylation sample, and retaining the methylation sites in each methylation sample that the test verifies as true;

specifically, a binomial test X ~ Binom(n, β) is performed on each C site in the methylation site information of each methylation sample to check whether the site is a true methylation site; the q value of the binomial test may be calculated using R language functions, for example: p value = binom.test(mC, mC + umC, bs_ratio); q value = p.adjust(p value, 'fdr'); where mC is the depth at which methylation occurs at the site, umC is the depth at which no methylation occurs at the site, and bs_ratio is the methylation conversion rate in the methylation data.

Sites with a q value less than or equal to 0.05 are retained as true methylation sites, and sites with a q value greater than 0.05 are deleted.
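The test and the FDR correction of step S21 can be sketched in Python using only the standard library. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the p value is taken as the one-sided upper tail P(X ≥ mC) (R's `binom.test` is two-sided by default, so this choice is an assumption), and `bh_adjust` implements the Benjamini-Hochberg procedure that `p.adjust(..., 'fdr')` refers to.

```python
import math


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binom(n, p), summed in log space for stability at high depth."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_term)
    return min(1.0, total)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p values (q values), returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    q = [0.0] * m
    running = 1.0
    for offset, i in enumerate(order):
        rank = m - offset                      # rank of pvals[i] in ascending order
        running = min(running, pvals[i] * m / rank)
        q[i] = running
    return q


def true_site_mask(counts, bs_ratio, q_cut=0.05):
    """counts: list of (mC, umC) per C site; True marks sites kept as true methylation sites."""
    pvals = [binom_upper_tail(mC, mC + umC, bs_ratio) for mC, umC in counts]
    return [q <= q_cut for q in bh_adjust(pvals)]
```

Sites with zero methylated reads always receive a p value of 1 under this one-sided form and are therefore never retained.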

Step S22: performing depth filtering on each methylation site in each methylation sample based on the total sequencing depth;

specifically, among the methylation sites retained in step S21, those with a total sequencing depth of less than 10 are deleted.

Step S23: the methylation site depth and the methylation rate of each depth-filtered methylation site in each methylation sample are integrated into a methylation depth matrix table and a methylation rate matrix table.

Wherein the methylation depth matrix table comprises methylation site depths for each methylation site of each disease sample and normal control sample; the methylation rate matrix table includes the methylation rates of each methylation site for each disease sample and normal control sample.

Specifically, the methylation site depth and the methylation rate of each depth-filtered methylation site in each methylation sample obtained in step S22 are integrated into a methylation depth matrix table M and a methylation rate matrix table R, respectively;

the methylation depth matrix table M contains the methylation site depths of the disease samples and normal control samples; each column of the table represents a sample, namely a tested individual, labeled as normal or cancer, and each row represents a feature, namely a methylation site.

The methylation rate matrix table R contains the methylation rates of the disease samples and normal control samples; each column of the table represents a sample, namely a tested individual, labeled as normal or cancer, and each row represents a feature, namely a methylation site.

Step S24: deleting, from the methylation depth matrix table and the methylation rate matrix table, the site data of methylation sites whose proportion of missing samples exceeds a preset proportion, to obtain a first methylation depth matrix table and a first methylation rate matrix table;

preferably, the site data of methylation sites whose proportion of missing samples exceeds 20% are deleted from the methylation depth matrix table and the methylation rate matrix table.
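As a minimal sketch of the missing-value filter in step S24, the two tables can be held as dictionaries keyed by site, with `None` marking a missing sample. The data layout and the function name are assumptions made for illustration only; missingness is judged here on the rate table and the same sites are then dropped from both tables.

```python
def drop_sparse_sites(depth_table, rate_table, max_missing=0.20):
    """Keep only sites whose fraction of missing samples is at most max_missing.

    Both tables map a site key to a list of per-sample values (None = missing).
    """
    kept_depth, kept_rate = {}, {}
    for site, rates in rate_table.items():
        missing_fraction = sum(v is None for v in rates) / len(rates)
        if missing_fraction <= max_missing:
            kept_rate[site] = rates
            kept_depth[site] = depth_table[site]
    return kept_depth, kept_rate
```

With five samples, a site missing in one sample (20%) is kept, while a site missing in two samples (40%) is removed.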

Step S25: calculating Beta distribution parameters for the methylation rates of all the methylation sites in the first methylation rate matrix table in all samples, and randomly taking values from Beta distribution to fill in the missing values of the methylation rates in the first methylation rate matrix table so as to obtain a second methylation rate matrix table;

specifically, the parameters of the Beta distribution Y ~ Beta(α, β) are calculated for each site, row by row, from the first methylation rate matrix table, and values are then drawn at random from this Beta distribution to fill in the missing values; α and β are calculated as follows:

$$\alpha = \mu\left(\frac{\mu(1-\mu)}{\sigma^{2}} - 1\right), \qquad \beta = (1-\mu)\left(\frac{\mu(1-\mu)}{\sigma^{2}} - 1\right)$$

where μ is the mean and σ² is the variance of the methylation rates of the site.
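The Beta-based imputation of step S25 can be sketched as follows, assuming the usual method-of-moments estimator for α and β from the mean μ and variance σ² of the observed rates in a row. The function names are hypothetical, the use of the sample variance is an assumption, and rows for which the moments do not define a valid Beta distribution raise an error rather than being handled silently.

```python
import random
from statistics import mean, variance


def beta_moments(values):
    """Method-of-moments estimates (alpha, beta) from rates strictly inside (0, 1)."""
    mu = mean(values)
    var = variance(values)          # sample variance; an assumption of this sketch
    if not 0.0 < mu < 1.0 or var <= 0.0 or var >= mu * (1.0 - mu):
        raise ValueError("moments do not define a valid Beta distribution")
    common = mu * (1.0 - mu) / var - 1.0
    return mu * common, (1.0 - mu) * common


def fill_missing(row, rng=None):
    """Replace None entries of one site's rates with draws from the fitted Beta."""
    rng = rng or random.Random()
    observed = [v for v in row if v is not None]
    alpha, beta = beta_moments(observed)
    return [v if v is not None else rng.betavariate(alpha, beta) for v in row]
```

Passing a seeded `random.Random` instance makes the random fill reproducible between runs.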

Step S26: and removing methylation depth batch effect from the first methylation depth matrix table to obtain a second methylation depth matrix table.

Specifically, the k-nearest-neighbor method of the R language package kBET is used to remove the methylation depth batch effect from the first methylation depth matrix table.

Step S27: and carrying out methylation depth smoothing treatment on the second methylation depth matrix table to obtain the smooth depth of each methylation site so as to obtain a third methylation depth matrix table.

In one embodiment, step S27 includes:

based on a sliding window method with a preset width, respectively calculating a median value or a fitting value of the methylation site depth of each methylation site within a preset width range for the second methylation depth matrix table so as to obtain the smooth depth of each methylation site;

wherein the means for calculating the median or fit value for the methylation site depth comprises:

if the number of samples in the preset width range meets the preset threshold condition, a moving median is calculated over the methylation site depths within the corresponding preset width range; specifically, if the number of samples in the preset width range is less than the preset threshold (e.g., 10), smoothing is performed using the moving median.

If the number of samples in the preset width range does not meet the preset threshold condition, fitted values of the methylation site depths within the corresponding preset width range are calculated by locally weighted linear regression; specifically, if the number of samples in the preset width range exceeds the preset threshold, smoothing is performed by locally weighted linear regression (a non-parametric algorithm); in the locally weighted linear regression of the methylation depth, a weight w is applied on the basis of the least-squares fit, and the optimal model parameter θ is calculated, namely:

where the weight w is a Gaussian kernel function, and the bandwidth parameter τ is a preset fixed value (e.g., 0.1).

The methylation depth at the corresponding position is then smoothed according to the fitted model.
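The two smoothing branches of step S27 can be illustrated with a small sketch. It assumes a one-dimensional predictor (the site position, rescaled to roughly [0, 1] so that a bandwidth of τ = 0.1 is meaningful) and the common Gaussian weight exp(-(x - x0)² / (2τ²)); the function names, the half-width parameter and the rescaling are all assumptions, not details given in the text.

```python
import math
from statistics import median


def moving_median(depths, half_width):
    """Median of each value's neighbourhood of up to 2 * half_width + 1 points."""
    return [median(depths[max(0, i - half_width): i + half_width + 1])
            for i in range(len(depths))]


def lwlr_predict(xs, ys, x0, tau=0.1):
    """Locally weighted linear regression fit at x0 with Gaussian kernel weights."""
    w = [math.exp(-((x - x0) ** 2) / (2.0 * tau * tau)) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("all weights underflowed; rescale xs or increase tau")
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0.0:
        return my                       # degenerate window: fall back to weighted mean
    slope = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys)) / sxx
    return my + slope * (x0 - mx)


def smooth_window(positions, depths, threshold=10, half_width=2, tau=0.1):
    """Moving median below the threshold, locally weighted regression otherwise."""
    if len(depths) < threshold:
        return moving_median(depths, half_width)
    return [lwlr_predict(positions, depths, x0, tau) for x0 in positions]
```

The closed-form weighted slope and intercept used here are the two-parameter case of weighted least squares, which keeps the example free of matrix libraries.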

Step S3: screening the differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data, and obtaining corresponding differential methylation site methylation rate matrix data.

In one embodiment, as shown in fig. 4, step S3 includes:

step S31: based on a logistic regression model, performing significance test on each sample of the second methylation rate matrix table in a disease group and a normal control group, and calculating the p value of each methylation site by using the smooth depth of each methylation site as a weight;

specifically, a logistic regression model based on binomial response variables is used to predict cancer and normal classification for each disease sample of the disease group and each normal control sample of the normal control group, and the p-value of each methylation site is calculated using the smoothed depth of each methylation site as a weight.

Step S32: performing multiple-testing correction on the p value of each methylation site using an R language function to obtain the q value of each methylation site, and screening the potential differential methylation sites to obtain a third methylation rate matrix table of the screened potential differential methylation sites;

Specifically, multiple-testing correction is applied to the p value of each methylation site using the R language function p.adjust(p value, 'fdr') to obtain the q value of each methylation site; if the q value is less than or equal to the set threshold, the site is a potential differential methylation site; otherwise it is not, and it is deleted. For example, with the threshold set to 0.05, methylation sites with a q value of 0.05 or less are potential differential methylation sites, and the remaining sites are deleted.

Step S33: screening each potential difference methylation site based on the calculated methylation rate difference value between the disease group sample and the normal control group sample to obtain a plurality of first screening difference methylation sites;

specifically, the difference in methylation rate between the disease samples of the disease group and the normal control samples of the normal control group is calculated, and methylation sites whose absolute difference is greater than a preset threshold (e.g., 20%) are retained.
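The difference filter of step S33 can be sketched as below. The text does not state how the per-sample rates are summarized, so this sketch assumes the difference of group means; the data layout, the label strings and the function name are hypothetical.

```python
from statistics import mean


def delta_filter(rate_table, labels, min_abs_diff=0.20):
    """Keep sites whose |mean(cancer) - mean(normal)| exceeds min_abs_diff.

    rate_table maps a site key to per-sample rates; labels gives 'cancer' or
    'normal' for each sample, in the same order as the rates.
    """
    kept = {}
    for site, rates in rate_table.items():
        cancer = [r for r, lab in zip(rates, labels) if lab == "cancer"]
        normal = [r for r, lab in zip(rates, labels) if lab == "normal"]
        if abs(mean(cancer) - mean(normal)) > min_abs_diff:
            kept[site] = rates
    return kept
```

Rates are expressed as fractions here, so the 20% threshold corresponds to 0.20.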

Step S34: filtering each first screened differential methylation site using the dbSNP database of the NCBI website to obtain the differential methylation sites, so as to obtain a third methylation rate matrix table;

specifically, sites among the first screened differential methylation sites that are recorded in the dbSNP database of the NCBI website are deleted, and the remaining sites are taken as the differential methylation sites, so as to obtain the third methylation rate matrix table.

Step S35: and removing site data of each differential methylation site on X and Y chromosomes in the third methylation rate matrix table to obtain a differential methylation site methylation rate matrix table.

Step S4: selecting candidate differential methylation regions based on the differential methylation site methylation rate matrix data and the methylation depth matrix data, and obtaining the methylation rate matrix data of each candidate differential methylation region.

In one embodiment, as shown in fig. 5, step S4 includes:

step S41: calculating pearson correlation coefficients of adjacent sites of each differential methylation site based on the differential methylation site methylation rate matrix table to determine an outward extension start site;

specifically, adjacent site pearson correlation coefficients of each differential methylation site are calculated based on the differential methylation site methylation rate matrix table, and sites having adjacent site pearson correlation coefficients greater than a preset threshold are determined as outward extension start sites.
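A minimal sketch of the adjacent-site correlation in step S41 follows. The threshold value of 0.8 is a placeholder for the unspecified preset threshold, and reporting the left-hand site of each qualifying adjacent pair as the start site is an assumption of this sketch; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")             # undefined for a constant vector
    return sxy / math.sqrt(sxx * syy)


def extension_starts(rows, min_r=0.8):
    """rows: list of (position, per-sample rates) sorted by position.

    Returns positions whose correlation with the next site exceeds min_r.
    """
    starts = []
    for (pos, rates), (_, next_rates) in zip(rows, rows[1:]):
        if pearson(rates, next_rates) > min_r:   # NaN compares False and is skipped
            starts.append(pos)
    return starts
```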

Step S42: calculating a pearson correlation coefficient matrix of each window by adopting a preset width sliding window method for the methylation rate matrix table of the differential methylation sites, and outwards extending based on outwards extending starting sites of the corresponding windows when the pearson correlation coefficient matrix meets extension conditions to obtain a plurality of methylation areas;

Specifically, the differential methylation site methylation rate matrix table is traversed with a sliding window of preset width, and the Pearson correlation coefficient matrix of each window is obtained from the Pearson correlation coefficients of the adjacent sites of each differential methylation site; when the mean R² of the Pearson correlation coefficient matrix of a window is greater than a preset threshold and the interval does not exceed 100 bp, the window is extended outward from its outward extension start site, yielding a plurality of methylation regions. Some of these methylation regions partially overlap.

Step S43: merging the overlapping methylation regions to obtain the candidate differential methylation regions;

specifically, overlapping methylation regions are merged into one region, and each resulting region is taken as a candidate differential methylation region.
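Merging overlapping regions, as in step S43, is a standard interval-merge operation and can be sketched as follows. Regions are assumed to be `(chromosome, start, end)` tuples with inclusive coordinates; this layout and the function name are illustrative choices rather than details from the text.

```python
def merge_regions(regions):
    """Merge overlapping (chrom, start, end) regions; coordinates are inclusive."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps the previous region on the same chromosome: extend it.
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged
```

Regions on different chromosomes are never merged, even when their coordinates overlap numerically.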

Step S44: calculating, for each candidate differential methylation region, the average methylation rate of all C sites or of the differentially methylated C sites, to obtain the methylation rate matrix table of each candidate region;

specifically, the average methylation rate (methyl ratio) of all C sites or the average methylation rate (diff methyl ratio) of the differentially methylated C sites of each candidate differentially methylated region is calculated as follows:

Or,

wherein total mC is the sum of all C bases methylated in the region, total dmC is the sum of all C bases unmethylated in the region, total umC is the sum of all C bases unmethylated in the region.

Step S45: based on the third methylation depth matrix table, obtaining a methylation site depth matrix table of each region to be selected;

specifically, the smooth depth of each differential methylation site of each candidate differential methylation region of the third methylation depth matrix table is extracted, and a methylation site depth matrix table of each candidate region is obtained.

Step S46: calculating the average depth of each methylation sample of each different methylation area to be selected based on the methylation site depth matrix table of each area to be selected;

specifically, based on the methylation depth matrix table of each candidate region, the average depth of each methylation sample in each region is calculated segment by segment according to a preset bed file (if no bed file is provided, all regions of the genome are used by default).

Step S47: carrying out logarithmic conversion on each average depth to obtain the logarithmic depth of each methylation sample of each difference methylation area to be selected;

Specifically, log2 conversion is performed on each average depth to obtain the log depth of each methylation sample of each alternative differential methylation region.

Step S48: grouping the logarithmic depth of each methylation sample according to chromosomes, and carrying out standardization treatment by utilizing a median value;

specifically, the log depth of each methylation sample was grouped by chromosome of the methylation sample, and normalized using the median.
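Steps S47 and S48 can be illustrated for a single sample as follows. The text says only that the log depths are normalized "using the median"; this sketch assumes that the per-chromosome median is subtracted from each log2 depth, and it rejects non-positive depths rather than adding a pseudocount. The function name and the input layout are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median


def normalized_log_depth(regions):
    """regions: list of (chrom, mean_depth) for one sample.

    Returns log2 depths centred on the median of their chromosome, in input order.
    """
    log_depths = []
    by_chrom = defaultdict(list)
    for chrom, depth in regions:
        if depth <= 0:
            raise ValueError("mean depth must be positive for log2 conversion")
        value = math.log2(depth)
        log_depths.append((chrom, value))
        by_chrom[chrom].append(value)
    medians = {chrom: median(values) for chrom, values in by_chrom.items()}
    return [value - medians[chrom] for chrom, value in log_depths]
```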

Step S49: correcting each log depth subjected to normalization treatment by using a moving window method according to the GC content of the corresponding region so as to obtain a low coverage depth difference methylation region;

specifically, the normalized logarithmic depths are corrected by using a moving window method according to the GC content of the corresponding region, and the methylation region with the average depth value smaller than the preset threshold value is used as the methylation region with the low coverage depth difference.

Step S410: performing region overlap detection between each low coverage depth differential methylation region and the methylation rate matrix table of each candidate region, and removing the low coverage depth differential methylation regions from the methylation rate matrix table of each candidate region, to obtain the corresponding candidate differential methylation region methylation rate matrix table.

Step S5: removing potential deviation factors based on the methylation rate matrix data of each candidate differential methylation region, to obtain the corresponding methylation rate matrix data of each candidate differential methylation region with potential deviation removed.

Some sources of bias in methylation NGS sequencing can be measured directly (e.g., sequencing experiments, reagents, tissue sampling), but other biases that cannot be measured, and potentially unknown biases, are also introduced; potential systematic bias therefore needs to be removed before modeling.

In one embodiment, as shown in fig. 6, step S5 includes:

performing potential deviation factor removal operation based on current set dimension decomposition on each methylation rate matrix table of the difference methylation region to be selected to obtain corresponding processing data;

judging whether the potential deviation factor removal process of the current set dimension decomposition is qualified or not based on the obtained processing data;

in one embodiment, judging whether the potential deviation factor removal processing of the currently set dimension decomposition is qualified includes: judging whether the second right singular vector matrix is identical to the first right singular vector matrix; if they are identical, the potential deviation factor removal processing of the currently set dimension decomposition is judged to be qualified; if they differ, it is judged to be unqualified;

if unqualified, the potential deviation factor removal operation based on the currently set dimension decomposition is repeated until the potential deviation factor removal processing of the currently set dimension decomposition is judged to be qualified;

if qualified, a qualification judgment operation for the removal of potential deviation factors is performed, and, when the result is judged to be qualified, the methylation rate matrix data of the corresponding candidate differential methylation region with potential deviation factors removed is obtained; in one embodiment, the qualification judgment for the removal of potential deviation factors includes: judging whether the corresponding first differential methylation region methylation rate matrix table converges to an expected threshold; if yes, the result is qualified, and the first differential methylation region methylation rate matrix table is taken as the methylation rate matrix data of the candidate differential methylation region with potential deviation factors removed; if not, the result is unqualified, and step S56 is performed: updating the currently set dimension to the next dimension, and performing the potential deviation factor removal operation based on decomposition in the next dimension, until the first differential methylation region methylation rate matrix table converges to the expected threshold.

Wherein the removing potential deviation factor operation comprises:

performing standard-score (z-score) normalization on the methylation rate matrix table of the corresponding candidate differential methylation region to obtain a first differential methylation region methylation rate matrix table;

performing singular value decomposition (SVD) of the first differential methylation region methylation rate matrix table in the currently set dimension to obtain a first right singular vector matrix;

performing log-linear fitting with a Beta distribution on the first right singular vector matrix, taking the differential methylation region as the response variable, to obtain a first estimated value matrix;

performing log-linear fitting with a binomial distribution on the first estimated value matrix, taking the sample as the response variable, to obtain a second estimated value matrix;

multiplying the first estimated value matrix by the transpose of the second estimated value matrix, and centering the product to obtain a first matrix;

and performing SVD of the first matrix in the fixed dimension to obtain a second right singular vector matrix of the first matrix.

For example, step S5 includes:

5.1, normalizing the methylation rate data obtained in step 4.11 to obtain a matrix M;

5.2, performing a K-dimensional singular value decomposition (SVD) of the matrix M obtained in step 5.1, and taking the right singular vector matrix as h-old;

5.3, performing log-linear fitting with a Beta distribution on h-old, taking the methylation region as the response variable, to obtain an estimated value matrix g;

5.4, performing log-linear fitting with a binomial distribution on g, taking the sample as the response variable, to obtain an estimated value matrix h;

5.5, centering g × h^T to obtain a matrix Z;

5.6, performing a K-dimensional SVD of Z, and taking the right singular vector matrix as h-new;

5.7, cycling through steps 5.3-5.6 until h-old = h-new or the overall variance converges to a desired threshold;

5.8, updating K (for example, within 1 to 10) and repeating step 5.7 until M converges to the expected threshold;

5.9, obtaining the methylation rate data with potential deviation factors removed.
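Only the first two steps of this procedure lend themselves to a short standard-library sketch: z-score normalization (step 5.1) and, for the special case K = 1, the leading right singular vector obtained by power iteration on MᵀM (step 5.2). Normalizing per row, the choice of power iteration, the start vector and the function names are all assumptions; the Beta and binomial log-linear fits of steps 5.3 and 5.4 are not shown.

```python
import math
from statistics import mean, pstdev


def zscore_rows(matrix):
    """Standard-score normalization of each row; constant rows become zeros."""
    out = []
    for row in matrix:
        mu, sd = mean(row), pstdev(row)
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return out


def leading_right_singular_vector(matrix, iters=200):
    """Unit-length leading right singular vector of matrix via power iteration."""
    n_cols = len(matrix[0])
    # Non-constant start vector, so it is not orthogonal to z-scored (zero-sum) rows.
    v = [float(j + 1) for j in range(n_cols)]
    for _ in range(iters):
        u = [sum(a * b for a, b in zip(row, v)) for row in matrix]          # u = M v
        w = [sum(row[j] * u_i for row, u_i in zip(matrix, u))               # w = M^T u
             for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the row space")
        v = [x / norm for x in w]
    return v
```

The sign of a singular vector is arbitrary, so comparisons such as h-old = h-new would in practice need to allow for a sign flip or compare the spanned subspaces.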

Step S6: training an ensemble learning model based on the feature subsets obtained by processing the methylation rate matrix data of each candidate differential methylation region with potential deviation removed, so as to obtain the final differential methylation regions.

In one embodiment, as shown in fig. 7, step S6 includes:

step S61: obtaining CpG density of each candidate difference methylation area from the methylation rate matrix data of each candidate difference methylation area with potential deviation factors removed, and deleting low-density CpG areas in the methylation rate matrix data of the corresponding candidate difference methylation area to obtain methylation rate matrix data of each screening difference methylation area; preferably, the low-density CpG regions with less than 3 CpG's per 100bp range in each of the candidate differential methylation region methylation rate matrix data are deleted.
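The CpG density filter of step S61 can be sketched as below. The text does not say how density is measured over regions that are not exactly 100 bp long, so this sketch assumes a density normalized to CpG count per 100 bp of region sequence; the function names and the sequence-based input are illustrative assumptions.

```python
def cpg_per_100bp(sequence):
    """Number of CG dinucleotides per 100 bp of the given region sequence."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return seq.count("CG") * 100.0 / len(seq)


def keep_dense_regions(region_sequences, min_cpg_per_100bp=3.0):
    """Drop regions with fewer than min_cpg_per_100bp CpGs per 100 bp."""
    return {name: seq for name, seq in region_sequences.items()
            if cpg_per_100bp(seq) >= min_cpg_per_100bp}
```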

Step S62: calculating the average methylation difference between the disease group samples and the normal control group samples in the methylation rate matrix data of each screened differential methylation region, and screening out the final potential differential methylation regions, to obtain the methylation rate matrix data of each final potential differential methylation region;

Specifically, calculating the average methylation difference value of the disease group sample and the normal control group sample in the methylation rate matrix data of each screening differential methylation region, and reserving the differential methylation region with the difference value of more than or equal to 20% as a final potential differential methylation region to obtain the methylation rate matrix data of each final potential differential methylation region.

Step S63: removing redundant regions from the final potential differential methylation regions using recursive feature elimination, and taking each remaining differential methylation region as a feature to form an initial feature subset;

step S64: performing an importance score calculation operation on the initial feature subset based on the ensemble learning model, and obtaining the importance score of each feature and the classification accuracy of the initial feature subset;

step S65: removing the feature with the lowest importance score from the initial feature subset to obtain a new feature subset, performing the importance score calculation operation on it based on the ensemble learning model, again removing the feature with the lowest importance score to obtain a new feature subset, and repeating until the new feature subset is empty;

specifically, the feature with the lowest importance is removed from the current feature subset D to obtain a new feature subset E; E is input again into the ensemble learning model of step S64, the importance score of each feature in E is calculated, and the classification accuracy of E is obtained by cross-validation. The feature subset E then replaces the initial feature subset D, and steps S64 to S65 are repeated recursively until the feature subset E is empty.

Step S66: taking the feature subset with the highest classification accuracy among the feature subsets obtained in the loop as the optimal feature combination, to obtain the final differential methylation regions.
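The elimination loop of steps S64 to S66 can be sketched independently of any particular model. In this sketch the ensemble model is replaced by a hypothetical callback `evaluate(subset)` that returns a classification accuracy and a dictionary of importance scores; the callback, the function name and the tie-breaking rule (an earlier, larger subset wins on equal accuracy) are assumptions made for illustration.

```python
def recursive_feature_elimination(features, evaluate):
    """Repeatedly drop the least important feature; return the best subset seen.

    evaluate(subset) must return (accuracy, {feature: importance}) for the
    features in subset. It stands in for training and cross-validating the
    ensemble model.
    """
    current = list(features)
    best_subset, best_accuracy = [], float("-inf")
    while current:
        accuracy, importance = evaluate(current)
        if accuracy > best_accuracy:
            best_subset, best_accuracy = list(current), accuracy
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)          # next round runs on the reduced subset
    return best_subset, best_accuracy
```

Each round calls `evaluate` once, so a run over n features trains the model n times.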

In a specific embodiment, optimizing the algorithm and tuning its parameters can generally yield higher accuracy than the original single-algorithm model, and generalization capability is also enhanced. However, because methylation data is relatively complex and variable and contains many outliers, and because any single prediction algorithm has its own limitations, the ensemble learning model used in step S64 adopts the Stacking ensemble method in order to improve on the generalization capability of a single-algorithm model: four heterogeneous learners, namely multinomial naive Bayes, random forest, Gradient Boosting and logistic regression, are combined nonlinearly to construct a two-layer Stacking ensemble learning model.

Methylation data inevitably contains missing values owing to methylation sequencing and other causes. Tree models, such as random forests and Gradient Boosting, have low sensitivity to missing values and can be used in most cases where data is missing. When a distance measure is involved, that is, when the distance between two points is calculated, missing data becomes important, and improper handling of missing values gives poor results; examples are the K-nearest-neighbor algorithm (KNN) and the support vector machine (SVM), which are commonly used in biology. The cost function of a linear model often involves a distance calculation, namely the difference between the predicted value and the actual value, which easily makes such models (e.g., LASSO, LDA) sensitive to missing values. Neural networks are robust and not very sensitive to missing data, but sufficient data for them is generally not available. Bayesian models are stable with respect to missing data, and are a more suitable choice when the amount of data is small.

Therefore, the ensemble learning model adopts a two-layer structure. In the first layer, in order to improve the applicability of the model, three primary learners are used: multinomial naive Bayes, random forest and Gradient Boosting. In the second layer, in order to reduce the risk of overfitting, a logistic regression classifier is used as the secondary learner, which fuses the outputs of the primary learners to obtain the ensemble learning model.

In one embodiment, as shown in fig. 8, the importance score calculating operation in step S64 includes:

step S641: dividing the disease group samples and the normal control group samples of the methylation rate matrix data of each final potential differential methylation region equally into a set threshold number of parts to obtain corresponding split data;

step S642: training the ensemble learning model and predicting on the split data using cross-validation with the set threshold number of folds, obtaining the predicted result and the true result of each sample for model optimization, and obtaining the importance score of each feature.

Preferably, the disease group samples and normal control group samples of the methylation rate matrix data of each final potential differential methylation region are divided equally into 10 parts; the split data is trained and predicted using 10-fold cross-validation and the results are averaged; the 10 training-set prediction results are combined as P, and the 10 prediction-set original results are combined as A; to prevent overfitting, logistic regression is trained with P as the training set to predict A; the classification probability predicted for each sample is extracted as a credibility score, and parameters such as prediction accuracy are calculated; the importance score of each feature is extracted from the model.
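The equal division of disease and control samples into 10 parts (step S641) amounts to a stratified fold assignment, which can be sketched with the standard library alone. The function names, the fixed seed and the round-robin assignment are assumptions of this sketch; model training on each split is not shown.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of sample indices with each class spread evenly."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, label in enumerate(labels) if label == cls]
        rng.shuffle(indices)
        for j, i in enumerate(indices):
            folds[j % n_folds].append(i)     # round-robin keeps class balance
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds rounds."""
    folds = stratified_folds(labels, n_folds, seed)
    for k, test_fold in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test_fold)
```

With 20 disease and 30 control samples, every fold holds 2 disease and 3 control samples, so each round trains on 45 samples and predicts on 5.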

Similar to the principles of the embodiments described above, the present invention provides a differentially methylated region selection and screening system based on ensemble learning.

Specific embodiments are provided below with reference to the accompanying drawings:

FIG. 9 shows a schematic diagram of a differential methylation region selection and screening system based on ensemble learning in an embodiment of the present invention.

The system comprises:

a methylation data extraction module 1 for aligning the high-throughput methylation sequencing sample data to a human reference genome and obtaining methylation data of a plurality of methylation samples; wherein the types of methylation data samples include: disease samples and normal control samples; the methylation data includes: the methylation site information and the methylation conversion rate of the corresponding samples;

the methylation data preprocessing module 2 is connected with the methylation data extraction module 1 and is used for preprocessing the methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data;

the differential methylation site acquisition module 3 is connected with the methylation data preprocessing module 2 and is used for screening differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data and acquiring corresponding differential methylation site methylation rate matrix data;

The to-be-selected differential methylation region selection module 4 is connected with the differential methylation site acquisition module 3 and the methylation data preprocessing module 2 and is used for selecting the to-be-selected differential methylation region based on the differential methylation site methylation rate matrix data and the methylation depth matrix data and obtaining the methylation rate matrix data of each to-be-selected differential methylation region;

the potential deviation factor removing module 5 is connected with the to-be-selected differential methylation region selecting module 4 and is used for removing potential deviation factors based on the methylation rate matrix data of each to-be-selected differential methylation region to obtain corresponding methylation rate matrix data of each to-be-selected differential methylation region from which potential deviation is removed;

and the differential methylation region screening module 6 is connected with the potential deviation factor removing module 5 and is used for training the integrated learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with the potential deviation removed so as to obtain the final differential methylation region.
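By way of non-limiting illustration, the connections between modules 1 to 6 listed above can be written as a dependency graph and resolved into an execution order. The dictionary `UPSTREAM` and the function `execution_order` are hypothetical names introduced only for this sketch, and the standard-library `graphlib` module requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

# Upstream connections of each module, as listed for FIG. 9:
# module 4 receives data from both module 3 and module 2.
UPSTREAM = {
    1: [],        # methylation data extraction
    2: [1],       # methylation data preprocessing
    3: [2],       # differential methylation site acquisition
    4: [3, 2],    # to-be-selected differential methylation region selection
    5: [4],       # potential deviation factor removal
    6: [5],       # differential methylation region screening
}


def execution_order(upstream):
    """Return the modules ordered so that every module follows its inputs."""
    return list(TopologicalSorter(upstream).static_order())
```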

It should be understood that the division of the modules in the system embodiment of FIG. 9 is merely a division of logical functions; in an actual implementation, the modules may be fully or partially integrated into one physical entity or may be physically separate. These modules may all be implemented as software invoked by a processing element, or may all be implemented in hardware; alternatively, some modules may be implemented as software invoked by a processing element while the remaining modules are implemented in hardware.

For example, the above modules may be one or more integrated circuits configured to implement the above method, e.g.: one or more application specific integrated circuits (Application Specific Integrated Circuit, abbreviated as ASIC), or one or more digital signal processors (Digital Signal Processor, abbreviated as DSP), or one or more field programmable gate arrays (Field Programmable Gate Array, abbreviated as FPGA), or the like. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a central processing unit (Central Processing Unit, CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Since the implementation principle of the differential methylation region selection and screening system based on ensemble learning is described in the foregoing embodiments, a detailed description is omitted here.

FIG. 10 shows a schematic diagram of the structure of the ensemble learning-based differential methylation region selection and screening terminal 30 in an embodiment of the present invention.

The ensemble learning-based differential methylation region selection and screening terminal 30 includes: a memory 31 and a processor 32. The memory 31 is used for storing a computer program; the processor 32 runs the computer program to implement the ensemble learning-based differential methylation region selection and screening method as shown in FIG. 1.

Alternatively, the number of the memories 31 may be one or more, and the number of the processors 32 may be one or more, and one is taken as an example in fig. 10.

Optionally, the processor 32 in the ensemble learning-based differential methylation region selection and screening terminal 30 loads one or more instructions corresponding to the processes of the application program into the memory 31 according to the steps shown in FIG. 1, and the processor 32 runs the application program stored in the memory 31, so as to implement the various functions of the ensemble learning-based differential methylation region selection and screening method shown in FIG. 1.

Optionally, the memory 31 may include, but is not limited to, high-speed random access memory and non-volatile memory, such as one or more disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.

Optionally, the processor 32 may be a general-purpose processor, including but not limited to a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

The invention also provides a computer-readable storage medium storing a computer program which, when run, implements the ensemble learning-based differential methylation region selection and screening method as shown in FIG. 1. The computer-readable storage medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories), magneto-optical disks, ROMs (read-only memories), RAMs (random access memories), EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a product that is not connected to a computer device, or may be a component used by a computer device to which it is connected.

Compared with the prior art, the invention has the advantages that:

1. The invention provides a method for selecting and screening differential methylation regions, which can effectively reduce redundant information in the methylation regions while maintaining accuracy.

2. Because DNA methylation profiles usually contain missing values that interfere with data modeling and prediction, the invention performs a number of optimization operations that effectively reduce the influence of missing values on model building.

3. The invention adopts the beta-binomial distribution logarithmic fit model, effectively removes other potential unknown system deviation factors which cannot be measured, reduces the interference of background noise, and enables the model to reflect the essential characteristics of DNA methylation more accurately.

4. The invention adopts the integrated learning model to collect the advantages of various machine learning algorithms, and improves the precision and generalization capability of feature prediction.

In summary, the method, the system, the terminal and the medium for selecting and screening the differential methylation region based on the ensemble learning improve the accuracy and the reliability of cancer detection by selecting and screening the differential methylation sites by utilizing the algorithm thought of the ensemble learning. Meanwhile, a great amount of optimization processing is carried out on the interference of the missing value on the data analysis, the influence of the DNA methylation spectrum missing value on the model is reduced, and the redundant and non-critical information of the differential methylation region is reduced, so that the screened differential methylation region has robust prediction performance and higher accuracy in assisting the early cancer diagnosis and recurrence prediction process. Therefore, the invention effectively overcomes various defects in the prior art and has high industrial utilization value.

The above embodiments are merely illustrative of the principles of the present invention and its effectiveness, and are not intended to limit the invention. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. It is therefore intended that all equivalent modifications and changes made by those skilled in the art without departing from the spirit and technical ideas disclosed by the present invention shall be covered by the appended claims.

Claims

1. A method for selecting and screening a differentially methylated region based on ensemble learning, the method comprising:

aligning the methylation high-throughput platform sequencing sample data to a human reference genome and obtaining methylation data for a plurality of methylation samples; wherein the types of methylation samples include: disease samples and normal control samples; the methylation data includes: the methylation site information and the methylation conversion rate corresponding to each sample;

preprocessing methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data;

screening the differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data, and obtaining corresponding differential methylation site methylation rate matrix data;

selecting a to-be-selected differential methylation region based on the differential methylation site methylation rate matrix data and the methylation depth matrix data, and obtaining methylation rate matrix data of each to-be-selected differential methylation region;

removing potential deviation factors based on the methylation rate matrix data of each candidate difference methylation region to obtain corresponding methylation rate matrix data of each candidate difference methylation region from which potential deviation is removed;

and training an integrated learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with the potential deviation removed, so as to obtain a final differential methylation region.

2. The ensemble-based differential methylation region selection and screening method of claim 1, wherein aligning methylation high throughput platform sequencing sample data to a human reference genome and obtaining methylation data for a plurality of methylation samples comprises:

aligning the methylation high-throughput platform sequencing sample data to a human reference genome to obtain alignment results for a plurality of disease samples and normal control samples, and storing the alignment result of each sample in a separate BAM file;

extracting methylation site information in each bam file; wherein the methylation site information comprises: genomic chromosome, each methylation site, and site data for each methylation site; the site data for each methylation site includes: methylation site depth and methylation rate;

calculating, based on the human reference genome and the extracted methylation site information, the methylation ratio of the sites of types CHG and CHH for each methylation sample, to obtain the methylation conversion rate corresponding to each methylation sample.
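By way of non-limiting illustration, the conversion rate calculation of claim 2 could be sketched as follows. The sketch assumes, as is common practice for bisulfite data, that the conversion rate is taken as one minus the pooled methylation ratio at CHG and CHH sites; the claim itself does not give the formula. The `Site` record and the function `conversion_rate` are hypothetical names.

```python
from typing import Iterable, NamedTuple


class Site(NamedTuple):
    chrom: str
    pos: int
    context: str      # "CG", "CHG" or "CHH"
    methylated: int   # reads supporting a methylated C
    depth: int        # total reads covering the site


def conversion_rate(sites: Iterable[Site]) -> float:
    """Assumed estimate: 1 - pooled methylation ratio at CHG/CHH sites."""
    meth = depth = 0
    for s in sites:
        if s.context in ("CHG", "CHH"):
            meth += s.methylated
            depth += s.depth
    if depth == 0:
        raise ValueError("no CHG/CHH coverage; conversion rate is undefined")
    return 1.0 - meth / depth
```

For example, a sample with 2 methylated reads out of 200 reads at CHG/CHH sites would be assigned a conversion rate of about 0.99 under this assumption, regardless of its CpG methylation.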

3. The ensemble-based differential methylation region selection and screening method of claim 2, wherein preprocessing the sample methylation data comprises:

performing a binomial distribution test on each C site based on the methylation data of each methylation sample, and retaining each methylation site in each methylation sample that the test confirms to be truly methylated;

performing depth filtration on each methylation site in each methylation sample based on the total depth of sequencing;

integrating the methylation site depth and the methylation rate of each methylation site subjected to depth filtration in each methylation sample into a methylation depth matrix table and a methylation rate matrix table; wherein the methylation depth matrix table comprises methylation site depths for each methylation site of each disease sample and normal control sample; the methylation rate matrix table includes the methylation rates of each methylation site of each disease sample and normal control sample;

deleting, from the methylation depth matrix table and the methylation rate matrix table, the site data of the methylation sites whose proportion of missing samples exceeds a preset proportion, to obtain a first methylation depth matrix table and a first methylation rate matrix table;

calculating Beta distribution parameters for the methylation rates of all the methylation sites in the first methylation rate matrix table in all samples, and randomly taking values from Beta distribution to fill in the missing values of the methylation rates in the first methylation rate matrix table so as to obtain a second methylation rate matrix table;

removing methylation depth batch effect from the first methylation depth matrix table to obtain a second methylation depth matrix table;

and carrying out methylation depth smoothing treatment on the second methylation depth matrix table to obtain the smooth depth of each methylation site so as to obtain a third methylation depth matrix table.
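By way of non-limiting illustration, the missing-value filtering and Beta-distribution filling steps of claim 3 could be sketched as follows. The sketch assumes that the Beta parameters are estimated per site by the method of moments and that a site falls back to its observed mean when the fit is degenerate; neither detail is specified in the claim. The names `beta_params` and `filter_and_impute` and the default `max_missing=0.3` are hypothetical.

```python
import random
from statistics import mean, pvariance


def beta_params(rates):
    """Method-of-moments estimate of Beta(alpha, beta) from observed rates."""
    m, v = mean(rates), pvariance(rates)
    if v <= 0 or v >= m * (1 - m):
        return None  # degenerate fit; the caller falls back to the mean
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


def filter_and_impute(rate_table, max_missing=0.3, seed=0):
    """Drop sites with too many missing samples, then fill the remaining gaps.

    rate_table maps a site key to a list of per-sample rates (None = missing).
    """
    rng = random.Random(seed)
    filled = {}
    for site, rates in rate_table.items():
        observed = [r for r in rates if r is not None]
        if not observed or 1 - len(observed) / len(rates) > max_missing:
            continue  # missing proportion exceeds the preset proportion
        params = beta_params(observed)
        filled[site] = [
            r if r is not None
            else (rng.betavariate(*params) if params else mean(observed))
            for r in rates
        ]
    return filled
```

Drawing the fill values at random from the fitted distribution, rather than inserting the mean everywhere, keeps the per-site spread of methylation rates closer to what was observed.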

4. The method of ensemble learning-based differential methylation region selection and screening as set forth in claim 3, wherein performing methylation depth smoothing on the second methylation depth matrix table comprises:

if the number of samples within the preset width range meets a preset threshold, performing a moving median calculation on the depths of the methylation sites within the corresponding preset width range;

if the number of samples within the preset width range does not meet the preset threshold, calculating fitted values for the depths of the methylation sites within the corresponding preset width range by locally weighted linear regression.
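By way of non-limiting illustration, the two smoothing branches of claim 4 could be sketched as follows. The sketch assumes that the quantity compared with the preset threshold is the number of data points falling within the window around each site, and that the locally weighted regression uses tricube weights with its own bandwidth; the claim does not fix either choice. The names `local_linear_fit`, `smooth_depths`, `min_sites` and `bandwidth` are hypothetical.

```python
from statistics import median


def local_linear_fit(positions, depths, x0, bandwidth):
    """Tricube-weighted linear regression evaluated at x0."""
    w = [max(0.0, 1 - (abs(x - x0) / bandwidth) ** 3) ** 3 for x in positions]
    sw = sum(w)
    if sw == 0:
        raise ValueError("no sites fall within the bandwidth of x0")
    mx = sum(wi * x for wi, x in zip(w, positions)) / sw
    my = sum(wi * y for wi, y in zip(w, depths)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, positions))
    if sxx == 0:
        return my  # one effective point: fall back to the weighted mean
    sxy = sum(wi * (x - mx) * (y - my)
              for wi, x, y in zip(w, positions, depths))
    return my + sxy / sxx * (x0 - mx)


def smooth_depths(positions, depths, width, min_sites, bandwidth):
    """Moving median where the window is well populated, local fit otherwise."""
    smoothed = []
    for x0 in positions:
        window = [j for j, x in enumerate(positions) if abs(x - x0) <= width]
        if len(window) >= min_sites:
            smoothed.append(median(depths[j] for j in window))
        else:
            smoothed.append(local_linear_fit(positions, depths, x0, bandwidth))
    return smoothed
```

The median branch is robust to isolated depth spikes in dense windows, while the regression branch borrows information from more distant sites when a window is too sparse for a median to be meaningful.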

5. The ensemble learning based differential methylation region selection and screening method of claim 2, wherein screening differential methylation sites based on methylation depth matrix data and methylation rate matrix data and obtaining corresponding differential methylation site methylation rate matrix data comprises:

based on a logistic regression model, performing significance test on each sample of the second methylation rate matrix table in a disease group and a normal control group, and calculating the p value of each methylation site by using the smooth depth of each methylation site as a weight;

performing multiple testing correction on the p value of each methylation site by using an R language function to obtain the q value of each methylation site, and screening potential differential methylation sites to obtain a third methylation rate matrix table of the screened potential differential methylation sites;

screening each potential difference methylation site based on the calculated methylation rate difference value between the disease group sample and the normal control group sample to obtain a plurality of first screening difference methylation sites;

filtering each first screening differential methylation site by utilizing a dbSNP database of the NCBI website to obtain a differential methylation site so as to obtain a third methylation rate matrix table;

and removing site data of each differential methylation site on X and Y chromosomes in the third methylation rate matrix table to obtain a differential methylation site methylation rate matrix table.
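By way of non-limiting illustration, the q value calculation and the methylation rate difference screening of claim 5 could be sketched as follows. The claim does not name the R function used; the sketch assumes a Benjamini-Hochberg correction of the kind provided by R's `p.adjust` with `method = "BH"`, written here in Python. The names `bh_qvalues` and `select_sites` and the cut-offs `q_cut=0.05` and `min_diff=0.1` are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    q = [0.0] * n
    running_min = 1.0
    for rank_from_top, i in enumerate(order):
        rank = n - rank_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        q[i] = running_min
    return q


def select_sites(pvalues, disease_mean, control_mean,
                 q_cut=0.05, min_diff=0.1):
    """Indices of sites passing both the q value and rate-difference cut-offs."""
    q = bh_qvalues(pvalues)
    return [i for i in range(len(pvalues))
            if q[i] < q_cut
            and abs(disease_mean[i] - control_mean[i]) >= min_diff]
```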

6. The ensemble learning based differential methylation region selection and screening method of claim 5, wherein the performing the candidate differential methylation region selection based on the differential methylation site methylation rate matrix data and the methylation depth matrix data comprises:

calculating pearson correlation coefficients of adjacent sites of each differential methylation site based on the differential methylation site methylation rate matrix table to determine an outward extension start site;

calculating a pearson correlation coefficient matrix of each window by adopting a preset width sliding window method for the methylation rate matrix table of the differential methylation sites, and outwards extending based on outwards extending starting sites of the corresponding windows when the pearson correlation coefficient matrix meets extension conditions to obtain a plurality of methylation areas;

combining the overlapped methylation regions to obtain each difference methylation region to be selected;

calculating, for each difference methylation region to be selected, the average methylation rate of all C sites or of the differentially methylated C sites in the region, and obtaining a methylation rate matrix table of each region to be selected;

based on the third methylation depth matrix table, obtaining a methylation site depth matrix table of each region to be selected;

calculating the average depth of each methylation sample of each different methylation area to be selected based on the methylation site depth matrix table of each area to be selected;

carrying out logarithmic conversion on each average depth to obtain the logarithmic depth of each methylation sample of each difference methylation area to be selected;

grouping the logarithmic depth of each methylation sample according to chromosomes, and carrying out standardization treatment by utilizing a median value;

correcting each log depth subjected to normalization treatment by using a moving window method according to the GC content of the corresponding region so as to obtain a low coverage depth difference methylation region;

and carrying out region overlap detection between each low coverage depth difference methylation region and the methylation rate matrix table of each region to be selected, and removing the low coverage depth difference methylation regions from the methylation rate matrix table of each region to be selected, to obtain the corresponding methylation rate matrix table of the difference methylation regions to be selected.
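By way of non-limiting illustration, the correlation-based extension and the merging of overlapping regions in claim 6 could be sketched as follows. This is a deliberately simplified stand-in: instead of a sliding-window Pearson correlation matrix, it chains adjacent sites whose pairwise Pearson correlation and spacing pass assumed thresholds. The names `pearson`, `extend_regions` and `merge_overlapping` and the defaults `min_r=0.8` and `max_gap=200` are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long rate vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant vector: treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def extend_regions(positions, rates, min_r=0.8, max_gap=200):
    """Chain adjacent sites while their correlation and spacing allow it.

    positions is sorted; rates[i] holds the per-sample rates of site i.
    """
    regions, start = [], 0
    for i in range(1, len(positions) + 1):
        linked = (i < len(positions)
                  and positions[i] - positions[i - 1] <= max_gap
                  and pearson(rates[i - 1], rates[i]) >= min_r)
        if not linked:
            regions.append((positions[start], positions[i - 1]))
            start = i
    return regions


def merge_overlapping(regions):
    """Merge (start, end) regions that overlap or touch."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```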

7. The ensemble-based differential methylation region selection and screening method of claim 1, wherein performing potential bias factor removal based on methylation rate matrix data for each candidate differential methylation region comprises:

if the potential deviation factor is qualified, performing qualification judgment operation of removing the potential deviation factor, and under the condition that the potential deviation factor is judged to be qualified, acquiring methylation rate matrix data of a corresponding to-be-selected difference methylation area of removing the potential deviation factor;

wherein, the operation of removing potential deviation factors based on the current set dimension decomposition on each methylation rate matrix table of the difference methylation areas to be selected comprises the following steps:

8. The method of ensemble learning-based differential methylation region selection and screening of claim 7, wherein determining whether the current set dimension decomposition removal potential deviation factor process is acceptable comprises:

judging whether the second right singular vector matrix is identical to the first right singular vector matrix;

if they are the same, judging that the potential deviation factor removal process based on the current set dimension decomposition is qualified;

if they are different, judging that the potential deviation factor removal process based on the current set dimension decomposition is not qualified;

The potential deviation factor qualification judgment operation comprises the following steps:

judging whether the methylation rate matrix table corresponding to the first differential methylation region converges to an expected threshold value or not;

if yes, the first difference methylation area methylation rate matrix table is qualified, and the first difference methylation area methylation rate matrix table is used as the methylation rate matrix data of the difference methylation area to be selected for removing potential deviation factors;

if not, the current set dimension is updated to be the next dimension, and potential deviation factor removal operation based on next dimension decomposition is performed until the methylation rate matrix table of the first difference methylation region converges to an expected threshold.

9. The ensemble-based differential methylation region selection and screening method of claim 1, wherein training the ensemble-learning model based on feature subsets obtained by processing the methylation rate matrix data of each candidate differential methylation region from which potential bias is removed to obtain the final differential methylation region comprises:

obtaining CpG density of each candidate difference methylation area from the methylation rate matrix data of each candidate difference methylation area with potential deviation factors removed, and deleting low-density CpG areas in the methylation rate matrix data of the corresponding candidate difference methylation area to obtain methylation rate matrix data of each screening difference methylation area;

calculating the average methylation difference between the disease group samples and the normal control group samples in the methylation rate matrix data of each screened differential methylation region, and screening out the final potential differential methylation regions to obtain the methylation rate matrix data of each final potential differential methylation region;

removing redundant areas of the final potential difference methylation areas by using a recursive feature elimination method, and taking each residual difference methylation area as a feature to form an initial feature subset;

performing importance score calculation operation on the initial feature subset based on the integrated learning model, and obtaining importance scores of each feature and classification accuracy of the initial feature subset;

removing the feature with the lowest importance score from the initial feature subset to obtain a new feature subset, then executing the importance score calculation operation on the new feature subset based on the integrated learning model and again removing the feature with the lowest importance score to obtain a further new feature subset, and repeating until the new feature subset is empty;

and taking the feature subset with the highest classification accuracy among the feature subsets obtained in the loop as the optimal feature combination to obtain the final differential methylation region.
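By way of non-limiting illustration, the elimination loop of claim 9 could be sketched as follows. The function `recursive_elimination` and its `evaluate` callback are hypothetical; the callback stands in for the importance score calculation operation of the integrated learning model, and the tie-breaking rule (a smaller subset wins when accuracies are equal) is an assumption of this sketch.

```python
def recursive_elimination(features, evaluate):
    """Drop the least important feature each round; keep the most accurate subset.

    evaluate(subset) must return (accuracy, {feature: importance}).
    """
    current = list(features)
    best_subset, best_accuracy = [], float("-inf")
    while current:
        accuracy, importance = evaluate(current)
        if accuracy >= best_accuracy:  # ties favour the later, smaller subset
            best_subset, best_accuracy = list(current), accuracy
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
    return best_subset, best_accuracy
```

Because the loop runs until the subset is empty, every nested subset along the elimination path is scored once, and the returned subset is the best of those rather than the best of all possible combinations.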

10. The method for selecting and screening a differentially methylated region based on ensemble learning according to claim 9, wherein the ensemble learning model adopts a two-layer structure, wherein the first layer adopts three learners, namely multinomial naive Bayes, random forest and gradient boosting, and the second layer adopts a logistic regression classifier.
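By way of non-limiting illustration, a two-layer model of the kind recited in claim 10 could be assembled with scikit-learn's `StackingClassifier`. The use of scikit-learn, the mapping of the naive Bayes learner to `MultinomialNB`, and the values `n_estimators=50` and `n_folds=10` are assumptions of this sketch; the function name `build_stacking_model` is hypothetical.

```python
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB


def build_stacking_model(n_folds=10, seed=0):
    """Two-layer stack: three first-layer learners, logistic regression on top."""
    first_layer = [
        ("nb", MultinomialNB()),  # requires non-negative inputs
        ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ("gb", GradientBoostingClassifier(n_estimators=50, random_state=seed)),
    ]
    return StackingClassifier(
        estimators=first_layer,
        final_estimator=LogisticRegression(),
        cv=n_folds,  # out-of-fold predictions feed the second layer
        stack_method="predict_proba",
    )
```

Methylation rates lie between 0 and 1, so they satisfy the non-negativity requirement of `MultinomialNB` without further transformation.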

11. The ensemble-based differential methylation region selection and screening method as recited in claim 9, wherein the importance score calculation operation includes:

equally dividing a disease group sample and a normal control group sample of each final potential difference methylation area methylation rate matrix data into a set threshold number of samples to obtain corresponding split data;

training and predicting the integrated learning model by adopting set threshold value fold cross validation on the split data, obtaining the prediction result and the real result of each sample to perform model optimization, and obtaining the importance score of each feature.

12. A differentially methylated region selection and screening system based on ensemble learning, the system comprising:

the methylation data extraction module is used for aligning the methylation high-throughput platform sequencing sample data to a human reference genome and obtaining methylation data of a plurality of methylation samples; wherein the types of methylation samples include: disease samples and normal control samples; the methylation data includes: the methylation site information and the methylation conversion rate corresponding to each sample;

the methylation data preprocessing module is connected with the methylation data extraction module and is used for preprocessing the methylation data of each methylation sample to obtain methylation depth matrix data and methylation rate matrix data;

the differential methylation site acquisition module is connected with the methylation data preprocessing module and is used for screening differential methylation sites based on the methylation depth matrix data and the methylation rate matrix data and obtaining corresponding differential methylation site methylation rate matrix data;

the to-be-selected differential methylation region selection module is connected with the differential methylation site acquisition module and the methylation data preprocessing module and is used for selecting the to-be-selected differential methylation region based on the differential methylation site methylation rate matrix data and the methylation depth matrix data and obtaining the methylation rate matrix data of each to-be-selected differential methylation region;

the potential deviation factor removing module is connected with the to-be-selected difference methylation area selecting module and is used for removing potential deviation factors based on methylation rate matrix data of each to-be-selected difference methylation area to obtain corresponding methylation rate matrix data of each to-be-selected difference methylation area from which potential deviation is removed;

and the differential methylation region screening module is connected with the potential deviation factor removing module and is used for training the integrated learning model based on the feature subset obtained by processing the methylation rate matrix data of each candidate differential methylation region with the potential deviation removed so as to obtain the final differential methylation region.

13. A differential methylation region selection and screening terminal based on ensemble learning, comprising: one or more memories and one or more processors;

the one or more memories are used for storing computer programs;

the one or more processors being coupled to the one or more memories for running the computer program to perform the method of any one of claims 1 to 11.

14. A computer-readable storage medium, characterized in that a computer program is stored, which, when being executed by one or more processors, performs the method of any of claims 1 to 11.