CN116758993A

CN116758993A - DNA methylation prediction method integrating multi-omics features

Info

Publication number: CN116758993A
Application number: CN202310718721.5A
Authority: CN
Inventors: 马宝山; 申忆文; 刘玉
Original assignee: Dalian Maritime University
Current assignee: Dalian Maritime University
Priority date: 2023-06-16
Filing date: 2023-06-16
Publication date: 2023-09-15

Abstract

The invention discloses a DNA methylation prediction method integrating multi-omics features. The method comprises: determining the CpG sites to be predicted; obtaining the multi-omics feature sets of the CpG sites in paracancerous tissue and the methylation feature set of the CpG sites in cancer tissue; calculating, based on the Pearson correlation coefficient, the correlation coefficients between the methylation features of the CpG sites in the cancer tissue and the miRNA, mRNA and methylation features of the paracancerous tissue; selecting K miRNA features, Q mRNA features and L methylation features according to the values of the correlation coefficients to construct a multi-omics correlated feature set; constructing a DNA methylation prediction model based on a deep neural network and training it on the multi-omics correlated feature set; calculating the evaluation indexes of the trained model; obtaining the evaluated model when the evaluation indexes meet a threshold; and predicting the methylation of the cancer tissue with the evaluated model. The method improves the prediction accuracy of DNA methylation.

Description

DNA methylation prediction method integrating multi-omics features

Technical Field

The invention relates to the field of DNA methylation prediction, and in particular to a DNA methylation prediction method integrating multi-omics features.

Background

DNA methylation in cancer tissue and paracancerous tissue is closely related to the occurrence and development of cancer, and analysis of DNA methylation variation helps reveal the molecular biological mechanisms of cancer. DNA methylation has been difficult to predict from a single omics data type alone. With the rapid development of sequencing technology, researchers have obtained large amounts of multi-omics data, and DNA methylation in cancer tissue and paracancerous tissue is significantly correlated with these multi-omics data, so it is highly desirable to predict DNA methylation by integrating multi-omics features. In summary, DNA methylation data are important for cancer research, yet existing studies predict DNA methylation using only the biological information provided by a single type of omics feature, and their prediction performance on DNA methylation data is poor.

Disclosure of Invention

The invention provides a DNA methylation prediction method integrating multi-omics features in order to overcome the above technical problems.

A DNA methylation prediction method integrating multi-omics features comprises:

Step one: determining the CpG sites to be predicted; obtaining the multi-omics feature sets of the CpG sites in the paracancerous tissue, the multi-omics feature sets comprising a miRNA feature set, an mRNA feature set and a methylation feature set; and obtaining the methylation feature set of the CpG sites in the cancer tissue;

Step two: calculating, based on the Pearson correlation coefficient, the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and, respectively, the miRNA features, the mRNA features and the methylation features of the paracancerous tissue; selecting K miRNA features from the miRNA feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the miRNA features of the paracancerous tissue; selecting Q mRNA features from the mRNA feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the mRNA features of the paracancerous tissue; and selecting L methylation features from the methylation feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the methylation features of the paracancerous tissue;

Step three: constructing a multi-omics correlated feature set from the K miRNA features, the Q mRNA features and the L methylation features; constructing a DNA methylation prediction model based on a deep neural network; training the DNA methylation prediction model on the multi-omics correlated feature set; calculating the evaluation indexes of the trained DNA methylation prediction model; obtaining the evaluated DNA methylation prediction model when the evaluation indexes meet a threshold; and predicting the methylation of the cancer tissue with the evaluated DNA methylation prediction model.

Preferably, calculating the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the miRNA features, the mRNA features and the methylation features of the paracancerous tissue based on the Pearson correlation coefficient comprises calculating the correlation coefficient according to formula (1):

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} \tag{1}$$

where $x_i$ is the CpG site methylation feature value, mRNA feature value or miRNA feature value of the paracancerous tissue of the i-th sample in the multi-omics feature sets, and $\bar{x}$ is the mean of that feature over all samples; $y_i$ is the methylation feature value of the corresponding CpG site in the cancer tissue of the i-th sample, $\bar{y}$ is the mean methylation feature value of that CpG site over all samples, and n is the number of samples.
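The Pearson correlation of formula (1) can be sketched in a few lines of plain Python. This is only an illustration: the function name `pearson`, the sample values, and the choice to return 0.0 for a constant feature are assumptions that do not come from the patent.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Denominator terms: sums of squared deviations.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        return 0.0  # assumption: a constant feature is treated as uncorrelated
    return cov / math.sqrt(ss_x * ss_y)


# Hypothetical values: one paracancerous-tissue feature and the methylation
# of the target CpG site in cancer tissue, across four samples.
paracancerous_feature = [0.10, 0.35, 0.50, 0.80]
cancer_methylation = [0.20, 0.40, 0.55, 0.90]
r = pearson(paracancerous_feature, cancer_methylation)
```

The same function applies unchanged to miRNA, mRNA and methylation features, since formula (1) treats all three feature types identically.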

Preferably, the DNA methylation prediction model constructed based on the deep neural network comprises v input neurons, k hidden-layer neurons and h output-layer neurons, wherein the input received by the q-th neuron of the hidden layer is given by formula (2) and the output of the q-th neuron of the hidden layer is given by formula (3),

wherein $w_{pq}$ is the weight between the p-th neuron of the input layer and the q-th neuron of the hidden layer, $x_i$ is the input vector, $b_j$ is a feature of the multi-omics correlated feature set, n is the number of samples, and the input received by the r-th neuron of the output layer is weighted by $e_{hr}$, the weight between the h-th neuron of the hidden layer and the r-th neuron of the output layer.

Preferably, calculating the evaluation indexes of the trained DNA methylation prediction model comprises calculating the absolute value of the Pearson correlation coefficient (R) according to formula (4) and calculating the mean absolute error (MAE) according to formula (5):

$$R = \left|\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{\hat{y}_i-\mu_{\hat{y}}}{\sigma_{\hat{y}}}\right)\left(\frac{y_i-\mu_{y}}{\sigma_{y}}\right)\right| \tag{4}$$

$$\mathrm{MAE} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|\hat{y}_{ij}-y_{ij}\right| \tag{5}$$

where $\hat{y}_i$ and $y_i$ are the predicted DNA methylation value and the actual DNA methylation feature value of the i-th sample, respectively; $\mu_{\hat{y}}$ and $\mu_{y}$ are the predicted mean and the actual mean, respectively; $\sigma_{\hat{y}}$ and $\sigma_{y}$ are the corresponding standard deviations; $\hat{y}_{ij}$ and $y_{ij}$ are the predicted DNA methylation value and the actual DNA methylation value of the j-th feature of the i-th sample, respectively; n is the number of samples and m is the number of predicted features.

Preferably, the parameters of the DNA methylation prediction model can also be optimized by ten-fold cross-validation.

The invention provides a DNA methylation prediction method integrating multi-omics features. The multi-omics features related to the target CpG sites are extracted by a feature selection method, and a model that integrates these multi-omics features to predict the DNA methylation level of cancer tissue is then established. The influence of key parameters, such as the neural network structure and the number of features, on the performance of the DNA methylation prediction model is analyzed by comparing performance indexes such as the mean absolute error, and the model parameters are optimized accordingly, which improves the accuracy of DNA methylation prediction.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings in the following description show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of the method of the present invention;

FIG. 2 is an implementation flow chart of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

FIG. 1 is a flowchart of the method of the present invention. As shown in FIG. 1, the method of the present embodiment may include:

Based on this scheme, the multi-omics features related to the target CpG sites are extracted by a feature selection method, and a model that integrates the multi-omics features to predict the DNA methylation level of cancer tissue is established. The influence of key parameters, such as the neural network structure and the number of features, on the performance of the DNA methylation prediction model is analyzed by comparing performance indexes such as the mean absolute error, and the model parameters are optimized accordingly, which improves the accuracy of DNA methylation prediction.

Specifically, the multi-omics features are first extracted by a feature selection method:

The miRNA data matrix is defined as $I_i = (\mathrm{miRNA}_{i1}, \mathrm{miRNA}_{i2}, \ldots, \mathrm{miRNA}_{ia})$,

the mRNA data matrix as $N_i = (\mathrm{mRNA}_{i1}, \mathrm{mRNA}_{i2}, \ldots, \mathrm{mRNA}_{ib})$,

and the methylation data matrix as $M_i = (\mathrm{CpG}_{i1}, \mathrm{CpG}_{i2}, \ldots, \mathrm{CpG}_{ic})$,

where i indexes the samples (n samples in total) and j indexes the features (a miRNA features, b mRNA features and c methylation features in total).

The correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the miRNA features, the mRNA features and the methylation features of the paracancerous tissue are calculated based on the Pearson correlation coefficient according to formula (1).

K miRNA features are selected from the miRNA feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the miRNA features of the paracancerous tissue; Q mRNA features are selected from the mRNA feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the mRNA features of the paracancerous tissue; and L methylation features are selected from the methylation feature set according to the values of the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and the methylation features of the paracancerous tissue.

Specifically, after the correlation coefficients between each CpG site of the cancer tissue (the target site) and the three types of omics features have been calculated, for each target site the top K miRNA features with the highest correlation coefficient values (top miRNA) are selected according to formula (2), the top Q mRNA features (top mRNA) according to formula (3), and the top L methylation features (top methyl) according to formula (4),

where $\mathrm{CpG}_{\mathrm{target}\,j}$ denotes the j-th target site to be predicted and m is the number of target sites; $\mathrm{miRNA}_1 \ldots \mathrm{miRNA}_K$ are the K miRNA features most correlated with $\mathrm{CpG}_{\mathrm{target}\,j}$, $\mathrm{mRNA}_1 \ldots \mathrm{mRNA}_Q$ are the Q mRNA features most correlated with $\mathrm{CpG}_{\mathrm{target}\,j}$, and $\mathrm{CpG}_1 \ldots \mathrm{CpG}_L$ are the L methylation features most correlated with $\mathrm{CpG}_{\mathrm{target}\,j}$. The total number of features required for each target site is v = K + Q + L.
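The correlation-based selection for a single target site might look like the following sketch. The feature names (`miR-a`, `gene-x`, `cg-1` and so on), the data values, and the helper `select_top` are hypothetical; ranking by the absolute value of the coefficient is also an assumption, since the text only says that features with high correlation coefficient values are chosen.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y) if ss_x and ss_y else 0.0


def select_top(features, target, count):
    """Return the names of the `count` features most correlated with `target`.

    `features` maps a feature name to its values across the n samples.
    Assumption: features are ranked by the absolute correlation value.
    """
    ranked = sorted(
        features,
        key=lambda name: abs(pearson(features[name], target)),
        reverse=True,
    )
    return ranked[:count]


# Hypothetical paracancerous-tissue features for n = 5 samples.
mirna = {"miR-a": [1, 2, 3, 4, 5], "miR-b": [5, 4, 3, 2, 1], "miR-c": [1, 5, 2, 4, 3]}
mrna = {"gene-x": [2, 1, 4, 3, 5], "gene-y": [3, 3, 1, 2, 3]}
cpg = {"cg-1": [0.9, 0.7, 0.6, 0.3, 0.2], "cg-2": [0.5, 0.5, 0.4, 0.6, 0.5]}
# Hypothetical methylation of one target CpG site in cancer tissue.
target = [0.1, 0.2, 0.3, 0.4, 0.5]

K, Q, L = 2, 1, 1
selected = select_top(mirna, target, K) + select_top(mrna, target, Q) + select_top(cpg, target, L)
v = len(selected)  # total input dimension for this target site, v = K + Q + L
```

Repeating the selection for each of the m target sites yields one v-dimensional feature list per site.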

Step three: a multi-omics correlated feature set is constructed from the K miRNA features, the Q mRNA features and the L methylation features, a DNA methylation prediction model is constructed based on a deep neural network, and the model is trained on the multi-omics correlated feature set. In constructing the DNA methylation prediction model, the multi-omics correlated features ($\mathrm{miRNA}_1 \ldots \mathrm{miRNA}_K$, $\mathrm{mRNA}_1 \ldots \mathrm{mRNA}_Q$, $\mathrm{CpG}_1 \ldots \mathrm{CpG}_L$) are used as input data; the dimension of the input vector is v and the dimension of the output vector is h.

The DNA methylation prediction model is constructed based on a deep neural network and comprises v input neurons, k hidden-layer neurons and h output-layer neurons, wherein the input received by the q-th neuron of the hidden layer is given by formula (5) and the output of the q-th neuron of the hidden layer is given by formula (6),

wherein $w_{pq}$ is the weight between the p-th neuron of the input layer and the q-th neuron of the hidden layer, $x_i$ is the input vector, $b_j$ is a feature of the multi-omics correlated feature set, n is the number of samples, and the input received by the r-th neuron of the output layer is weighted by $e_{hr}$, the weight between the h-th neuron of the hidden layer and the r-th neuron of the output layer.
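A forward pass through a network with v inputs, k hidden neurons and h outputs can be sketched as below. The text does not state the activation function or whether bias terms are used, so the sigmoid activation, the absence of biases, the random weights and the function name `forward` are all assumptions made for illustration only.

```python
import math
import random


def sigmoid(z):
    """Logistic activation (assumed; the source does not name the activation)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w, e):
    """One forward pass of a single-hidden-layer network.

    x: input vector of length v
    w: v x k weights, w[p][q] links input neuron p to hidden neuron q
    e: k x h weights, e[q][r] links hidden neuron q to output neuron r
    """
    k = len(w[0])
    h = len(e[0])
    # Input received by each hidden neuron: weighted sum of the inputs.
    hidden_in = [sum(w[p][q] * x[p] for p in range(len(x))) for q in range(k)]
    hidden_out = [sigmoid(a) for a in hidden_in]
    # Input received by each output neuron: weighted sum of hidden outputs.
    output_in = [sum(e[q][r] * hidden_out[q] for q in range(k)) for r in range(h)]
    # Sigmoid keeps predictions in (0, 1), like methylation beta values.
    return [sigmoid(b) for b in output_in]


random.seed(0)
v, k, h = 4, 3, 2  # hypothetical layer sizes
w = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(v)]
e = [[random.uniform(-1, 1) for _ in range(h)] for _ in range(k)]
prediction = forward([0.2, 0.5, 0.1, 0.9], w, e)
```

In practice the weights would be learned by training on the selected features rather than drawn at random, and more hidden layers could be stacked in the same way.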

The evaluation indexes of the trained DNA methylation prediction model are then calculated. This comprises calculating the absolute value of the Pearson correlation coefficient according to formula (7) and the mean absolute error according to formula (8).

When the evaluation indexes meet the threshold, the evaluated DNA methylation prediction model is obtained and used to predict the methylation of the cancer tissue. The parameters of the DNA methylation prediction model can also be optimized by ten-fold cross-validation. Specifically, the data set is divided into 10 mutually exclusive subsets of equal size while keeping the data distribution consistent across the subsets; in each round the union of 9 subsets is used as the training set and the remaining subset as the test set; training is carried out 10 times, and the average of the 10 results is taken.
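The ten-fold procedure can be sketched with the standard library alone. The names `ten_fold_splits` and `cross_validate` and the placeholder scoring function are hypothetical. The sketch shuffles the samples at random and does not enforce a consistent data distribution across folds, and the folds are of equal size only when the number of samples is a multiple of 10.

```python
import random


def ten_fold_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) for 10 mutually exclusive folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Every 10th shuffled index goes to the same fold.
    folds = [indices[i::10] for i in range(10)]
    for i in range(10):
        test = folds[i]
        # The union of the other 9 folds forms the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validate(n_samples, evaluate):
    """Average the score returned by `evaluate(train, test)` over the 10 folds."""
    scores = [evaluate(train, test) for train, test in ten_fold_splits(n_samples)]
    return sum(scores) / len(scores)


# Placeholder scoring function: a real one would train the prediction model
# on `train` and return its MAE (or R) on `test`.
mean_score = cross_validate(50, lambda train, test: len(test) / 50)
```

Comparing `mean_score` across candidate settings (number of layers, neurons, or selected features) is one way to carry out the parameter optimization the paragraph describes.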

In the mathematical model studied in this embodiment, predicting the DNA methylation data of the target tissue (cancer tissue) from the multi-omics data of the surrogate tissue (paracancerous tissue) can be decomposed into the following technical steps; the implementation flow chart is shown in FIG. 2:

matching and preprocessing the three types of omics data (miRNA, mRNA and DNA methylation) of the cancer tissue and the paracancerous tissue;

performing feature extraction and fusion on the multi-omics data using a correlation-based feature selection strategy;

establishing a mathematical model of DNA methylation between the paracancerous tissue and the cancer tissue;

predicting the DNA methylation data of the target tissue using the integrated multi-omics data of the surrogate tissue;

analyzing the influence of key parameters of the deep learning model, such as the number of layers, the number of neurons and the feature dimension, on model performance, and optimizing the model parameters by ten-fold cross-validation, that is, dividing the data set into 10 mutually exclusive subsets of equal size while keeping the data distribution consistent across the subsets, using the union of 9 subsets as the training set and the remaining subset as the test set in each round, training 10 times, and taking the average of the 10 results.

The absolute value of the Pearson correlation coefficient (R) and the mean absolute error (MAE) are used to evaluate the prediction performance of the model.
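The two evaluation indexes can be computed as in the sketch below. The function names and the numeric values are hypothetical, and the use of sample standard deviations (division by n - 1) in R is an assumption; any consistent normalization gives the same coefficient.

```python
import math


def abs_pearson(pred, actual):
    """Absolute Pearson correlation (R) for one CpG site across n samples.

    Assumes neither sequence is constant (non-zero standard deviation).
    """
    n = len(pred)
    mu_p, mu_a = sum(pred) / n, sum(actual) / n
    # Sample standard deviations (assumed normalization: n - 1).
    sd_p = math.sqrt(sum((p - mu_p) ** 2 for p in pred) / (n - 1))
    sd_a = math.sqrt(sum((a - mu_a) ** 2 for a in actual) / (n - 1))
    r = sum(((p - mu_p) / sd_p) * ((a - mu_a) / sd_a) for p, a in zip(pred, actual)) / (n - 1)
    return abs(r)


def mean_absolute_error(pred, actual):
    """MAE over all samples (rows) and all predicted features (columns)."""
    total, count = 0.0, 0
    for row_p, row_a in zip(pred, actual):
        for p, a in zip(row_p, row_a):
            total += abs(p - a)
            count += 1
    return total / count


# Hypothetical predictions for n = 3 samples and 2 target CpG sites.
predicted = [[0.1, 0.2], [0.3, 0.4], [0.6, 0.5]]
actual = [[0.2, 0.2], [0.1, 0.5], [0.6, 0.7]]

mae = mean_absolute_error(predicted, actual)
# R for the first target site, taken column-wise across samples.
r_site0 = abs_pearson([row[0] for row in predicted], [row[0] for row in actual])
```

A higher R and a lower MAE both indicate better agreement between predicted and measured methylation, so the two indexes are read together.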

The overall beneficial effects are as follows:

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A DNA methylation prediction method integrating multi-omics features, characterized by comprising the following steps:

2. The DNA methylation prediction method integrating multi-omics features according to claim 1, wherein calculating the correlation coefficients between the methylation features of the CpG sites of the cancer tissue and, respectively, the miRNA features, the mRNA features and the methylation features of the paracancerous tissue based on the Pearson correlation coefficient comprises calculating the correlation coefficient according to formula (1).

3. The DNA methylation prediction method integrating multi-omics features according to claim 1, wherein the DNA methylation prediction model constructed based on the deep neural network comprises v input neurons, k hidden-layer neurons and h output-layer neurons, the input received by the q-th neuron of the hidden layer is given by formula (2), and the output of the q-th neuron of the hidden layer is given by formula (3).

4. The DNA methylation prediction method integrating multi-omics features according to claim 1, wherein calculating the evaluation indexes of the trained DNA methylation prediction model comprises calculating the absolute value of the Pearson correlation coefficient (R) according to formula (4) and calculating the mean absolute error (MAE) according to formula (5):

$$R = \left|\frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{\hat{y}_i-\mu_{\hat{y}}}{\sigma_{\hat{y}}}\right)\left(\frac{y_i-\mu_{y}}{\sigma_{y}}\right)\right| \tag{4}$$

$$\mathrm{MAE} = \frac{1}{nm}\sum_{i=1}^{n}\sum_{j=1}^{m}\left|\hat{y}_{ij}-y_{ij}\right| \tag{5}$$

where $\hat{y}_i$ and $y_i$ are the predicted DNA methylation value and the actual DNA methylation feature value of the i-th sample, respectively; $\mu_{\hat{y}}$ and $\mu_{y}$ are the predicted mean and the actual mean, respectively; $\sigma_{\hat{y}}$ and $\sigma_{y}$ are the predicted standard deviation and the actual standard deviation, respectively; $\hat{y}_{ij}$ and $y_{ij}$ are the predicted DNA methylation value and the actual DNA methylation value of the j-th feature of the i-th sample, respectively; n is the number of samples and m is the number of predicted features.

5. The DNA methylation prediction method integrating multi-omics features according to claim 1, wherein the parameters of the DNA methylation prediction model are optimized by ten-fold cross-validation.