CN116312765A - Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer - Google Patents
Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer
- Publication number
- CN116312765A CN116312765A CN202310122535.5A CN202310122535A CN116312765A CN 116312765 A CN116312765 A CN 116312765A CN 202310122535 A CN202310122535 A CN 202310122535A CN 116312765 A CN116312765 A CN 116312765A
- Authority
- CN
- China
- Prior art keywords
- feature
- enhancer
- chromatin
- dna sequence
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a multi-stage prediction method for the influence of non-coding variation on enhancer activity, relating to the technical field of bioinformatics. The method comprises: obtaining enhancer-related features and preprocessing them; constructing and training a meta-learning-based chromatin feature prediction model; obtaining a joint representation of fused multi-chromatin features from a feature fusion model; constructing and training an enhancer activity prediction model based on the joint multi-chromatin-feature representation; predicting the influence of variation on enhancer activity using the chromatin feature prediction model and the enhancer activity prediction model; and screening functional variants by their effect on enhancer activity. The invention provides an effective enhancer activity prediction framework, realizes accurate prediction of the influence of variation on enhancer activity, and overcomes the poor performance of traditional DNA-sequence-based prediction methods.
Description
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a multi-stage prediction method for the influence of non-coding variation on enhancer activity.
Background
The human genome contains millions of enhancers, important cis-regulatory elements (CRE) that act as switches regulating when and where genes are expressed. Enhancer activity is closely related to gene expression; predicting it not only helps to understand cell-specific gene expression but also provides targets for gene therapy. Thousands of genome-wide association studies (GWAS) have revealed that 93% of common genetic variants associated with a particular trait or disease lie in non-coding regions. Although most of them have no significant individual effect, some genetic diseases are caused by the accumulation of many small-effect variants. Studies have shown that the variants identified by GWAS are enriched in regulatory regions and can control the expression of disease-associated genes by altering enhancer activity. Importantly, because enhancer activity is strongly cell-type specific, the effects of these variants differ across cell types; predicting enhancer activity and further inferring the cell-type-specific effect of variation on enhancer activity is therefore a critical problem.
Predicting the effect of variants on enhancer activity from multiple chromatin features is challenging because variants can alter chromatin state. To address this, ab initio approaches achieve mutation-impact prediction based on multiple chromatin features by predicting chromatin profiles from DNA sequences. ExPecto employs a deep-learning-based model that first predicts chromatin features from the DNA sequence and then uses these features to predict the effects of variation. Although these methods are effective at predicting the effects of non-coding variation in the human genome, their performance is still limited by the model structure and the complex relationships among chromatin features.
Disclosure of Invention
Aiming at the defects of the prior art, the multi-stage prediction method for the influence of non-coding variation on enhancer activity provided by the invention overcomes the poor performance of traditional DNA-sequence-based prediction methods.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the scheme provides a multi-stage prediction method for the influence of non-coding variation on enhancer activity, which comprises the following steps:
S1, acquiring relevant characteristics of an enhancer, and preprocessing the relevant characteristics;
s2, constructing and training a chromatin feature prediction model based on meta-learning based on the preprocessed relevant features of the enhancers;
s3, constructing a feature fusion model based on a self-encoding generative adversarial network, and obtaining the joint representation of the fused multi-chromatin features from the feature fusion model;
s4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
s5, predicting the influence of variation on the activity of the enhancer by using a chromatin characteristic prediction model and an enhancer activity prediction model;
s6, screening the functional variation according to the influence of the variation on the activity of the enhancer.
The beneficial effects of the invention are as follows: a high-resolution chromatin feature prediction model is first constructed based on meta-learning; a chromatin feature fusion model based on a self-encoding generative adversarial network is then proposed; the joint representation of the fused multi-chromatin features, together with the parameters trained for the chromatin feature prediction model, is then used to improve the accuracy of enhancer activity prediction; finally, simulated mutation is used to accurately predict the influence of variation on enhancer activity from the perspective of multiple chromatin features.
Further, the step S1 includes the steps of:
s101, acquiring an enhancer related characteristic data set;
s102, preprocessing an enhancer related characteristic data set to obtain positive and negative samples of a training set, wherein the positive and negative samples of the training set comprise DNA sequences and a plurality of corresponding chromatin characteristics;
s103, based on positive and negative samples of the training set, fixing the length of the sequence to obtain a DNA sequence with the length of 1001 bp;
s104, dividing the DNA sequence into k-mer base segments, encoding each base segment with one-hot encoding, and learning a distributed representation of the base segments using Word2vec, wherein chromatin feature values are scaled as log2(1+x), x being the chromatin feature value and log2(·) the base-2 logarithm;
s105, dividing the related characteristic data set of the enhancer to finish preprocessing the data.
The beneficial effects of the above further scheme are: the invention encodes the sequence as k-mer base segments, effectively capturing the intrinsic relationships between bases and overcoming the poor performance of traditional methods.
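The preprocessing in steps S103–S104 can be sketched as follows. This is a minimal illustration, assuming k = 3 and a one-hot vocabulary over all 4^k possible segments; the function names are illustrative, not from the patent, and the Word2vec training step is omitted.

```python
import numpy as np
from itertools import product

def to_kmers(seq: str, k: int = 3) -> list:
    """Split a DNA sequence into overlapping k-mer base segments."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def one_hot_kmers(kmers: list, k: int = 3) -> np.ndarray:
    """One-hot encode each k-mer over the 4**k possible base segments."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    mat = np.zeros((len(kmers), len(vocab)))
    for row, kmer in enumerate(kmers):
        mat[row, vocab[kmer]] = 1.0
    return mat

def scale_chromatin(x: np.ndarray) -> np.ndarray:
    """Scale chromatin feature values as log2(1 + x), as in S104."""
    return np.log2(1.0 + x)

kmers = to_kmers("ACGTAC", k=3)                       # ['ACG', 'CGT', 'GTA', 'TAC']
encoded = one_hot_kmers(kmers)                        # shape (4, 64)
signal = scale_chromatin(np.array([0.0, 1.0, 3.0]))   # [0., 1., 2.]
```

In the full pipeline these one-hot segments would then be fed to Word2vec to learn the distributed representation.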
Still further, the step S2 includes the steps of:
s201, constructing a chromatin feature prediction model based on the preprocessed relevant features of the enhancers;
S202, updating chromatin feature prediction model parameters;
s203, performing fine adjustment on each task in the chromatin feature prediction model based on the model obtained by meta learning, wherein the task is a chromatin feature;
s204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
The beneficial effects of the above-mentioned further scheme are: the invention can realize high-resolution chromatin feature prediction by constructing the chromatin feature prediction model.
Still further, the step S202 includes the steps of:
s2021, training a meta learning model by using a training set obtained by dividing a data set, wherein the training set comprises a query set and a support set;
s2022, initializing chromatin feature prediction model parameters according to normal distribution;
s2023, looping over epochs;
s2024, randomly sampling tasks to form a batch;
s2025, iterating over the batch, training the chromatin feature prediction model on the support set of each task to obtain a set of parameters and complete the first parameter update;
s2026, computing the loss value of each task in the batch on its query set, summing the loss values, applying stochastic gradient descent to the gradient to complete the second parameter update, thereby updating the chromatin feature prediction model parameters, and returning to step S2023.
The beneficial effects of the above further scheme are: the invention introduces meta-learning; because different chromatin features are intrinsically related, meta-learning can effectively extract the commonality among them.
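The two-level update of S2025–S2026 can be sketched with a MAML-style step on a toy linear model. Note this is a simplification under stated assumptions: it uses the first-order approximation (the query-loss gradient is taken at the adapted parameters without differentiating through the inner step), and all names are illustrative rather than the patent's.

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of the mean-squared-error loss for the linear model f(x) = X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.01):
    """One outer update: adapt on each task's support set (first parameter
    update), then sum the query-set gradients at the adapted parameters and
    descend (second parameter update)."""
    outer_grad = np.zeros_like(theta)
    for (Xs, ys), (Xq, yq) in tasks:  # each task = (support set, query set)
        theta_prime = theta - inner_lr * mse_grad(theta, Xs, ys)  # inner step
        outer_grad += mse_grad(theta_prime, Xq, yq)               # query-loss gradient
    return theta - outer_lr * outer_grad
```

Repeating `maml_step` over batches of sampled tasks corresponds to the epoch loop of S2023–S2026.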
Still further, the meta-learning objective function is expressed as follows:

min_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i'}),  with  θ_i' = θ − α ∇_θ L_{T_i}(f_θ)

where min_θ denotes minimizing the model loss, f_{θ'} denotes the model with the parameters after task-level training, L(·) is the mean-squared-error loss function, f_θ denotes the model with the parameters before task-level training, α is the learning rate, and ∇_θ is the gradient in the chain-rule derivative.
Still further, the chromatin feature prediction model comprises:
an encoding module, which extracts enhancer DNA sequence features using a multi-layer involution (inner convolution) network;
a self-attention module, which extracts the dependency relationships within higher-order features based on the enhancer DNA sequence features;
and a decoding and skip-connection module, which up-samples via quadratic interpolation based on the dependency relationships and performs feature fusion using a convolutional neural network.
The beneficial effects of the above-mentioned further scheme are: the invention effectively realizes high-resolution prediction of enhancer activity by adopting the U-shaped network structure.
Still further, the encoding module comprises:
a first involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 1001 × 32;
a first max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the first involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 500 × 32;
a second involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 500 × 32;
a second max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the second involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 250 × 32;
a third involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 250 × 32;
a third max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the third involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 125 × 32.
The self-attention module comprises:
a first attention layer, with 8 attention heads and a feed-forward neural network dimension of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
a second attention layer, with 8 attention heads and a feed-forward neural network dimension of 64; the output enhancer DNA sequence feature map has dimension 125 × 32.
The decoding and skip-connection module comprises:
a first sampling layer, which adds the enhancer DNA sequence features and the dependency relationships output by the self-attention module, applies batch normalization to the sum, and inputs the normalized features into the first up-sampling layer; the up-sampled dimension is 250 and the output feature map has dimension 250 × 32;
a first convolution layer, with 32 filters, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 250 × 32;
a second sampling layer, which adds the feature map output by the first convolution layer and the enhancer DNA sequence features output by the second max-pooling layer, applies batch normalization to the sum, and inputs the result into the second up-sampling layer; the up-sampled dimension is 497 and the output feature map has dimension 500 × 32;
a second convolution layer, with 4 filters, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 500 × 4;
a third sampling layer, which adds the feature map output by the second convolution layer and the enhancer DNA sequence features output by the third max-pooling layer, applies batch normalization to the sum, and inputs the result into the third up-sampling layer; the up-sampled dimension is 1001 and the output feature map has dimension 1001 × 4;
and a third convolution layer, with 1 filter, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 1001 × 1.
The beneficial effects of the above-mentioned further scheme are: and high-resolution chromatin feature prediction is realized by using a U-shaped network.
Still further, the loss function of the feature fusion model in step S3 is expressed as follows:

L = λ_1 L_1 + λ_2 L_2
L_1 = Σ_n || x_n − y_n ||²
y_n = ReLU(G_n'(z | Θ_{G_n'}))
z = S([z_1, z_2, ..., z_n, ..., z_m] | Θ_S)
L_2 = E_{z'~p(z)}[log(D(z' | Θ_D))] + E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))]

where L denotes the loss of the feature fusion model; λ_1 and λ_2 are weights; L_1 is the reconstruction error and L_2 the distribution error; x_n is the initial feature vector of the n-th chromatin feature; y_n is the n-th chromatin feature reconstructed by the feature fusion model; ReLU(·) is the ReLU activation function; G_n' and Θ_{G_n'} are the feature fusion model function and parameters of the n-th independent decoding layer; z is the low-dimensional latent feature and ||·|| denotes the L2 norm; S and Θ_S are the feature fusion model function and parameters of the feature-sharing layer; z_n and z_m are the low-dimensional latent features of the n-th and m-th chromatin features; E_{z'~p(z)}[log(D(z' | Θ_D))] is the mathematical expectation of log(D(z' | Θ_D)) when z' is sampled from the prior distribution p(z), the distribution of z' following a Gaussian; D and Θ_D are the feature fusion model function and parameters of the discriminator; E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))] is the mathematical expectation of log(1 − D(z | Θ_D)) when z is sampled from the data distribution Q(z | Θ_Q), with E_{z~Q(z|Θ_Q)}[log(D(z | Θ_D))] the corresponding generator term; Q and Θ_Q are the model function and parameters of the generator.
The beneficial effects of the above further scheme are: based on a self-encoding generative adversarial network, the invention effectively fuses high-dimensional heterogeneous chromatin features.
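As a rough numerical illustration of the loss above, the sketch below evaluates the reconstruction term and the adversarial distribution term, assuming linear stand-in decoders W_n for G_n' and precomputed discriminator outputs; every function and variable name here is hypothetical, not from the patent.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def reconstruction_error(xs, ws, z):
    """L1: sum over chromatin features n of ||x_n - ReLU(W_n @ z)||^2."""
    return sum(np.sum((x - relu(w @ z)) ** 2) for x, w in zip(xs, ws))

def distribution_error(d_prior, d_encoded):
    """L2: adversarial term E[log D(z')] + E[log(1 - D(z))], given the
    discriminator's outputs on prior samples (d_prior) and on encoded
    latent features (d_encoded)."""
    return np.mean(np.log(d_prior)) + np.mean(np.log(1.0 - d_encoded))

def fusion_loss(xs, ws, z, d_prior, d_encoded, lam1=1.0, lam2=0.1):
    """L = lam1 * L1 + lam2 * L2."""
    return (lam1 * reconstruction_error(xs, ws, z)
            + lam2 * distribution_error(d_prior, d_encoded))
```

In the actual model, G_n' would be a learned decoding layer per chromatin feature and D a learned discriminator; the linear/precomputed stand-ins only show how the two terms combine.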
Still further, the step S5 includes the steps of:
s501, modifying part of the base information in the DNA sequence by simulated mutation;
S502, predicting, with the chromatin feature prediction model and based on the modified result, the chromatin features C_ref and C_alt of S_ref and S_alt, where S_ref and S_alt denote the DNA sequences before and after the simulated mutation;
s503, inputting the chromatin features C_ref and C_alt together with the DNA sequence features into the feature fusion model to obtain the joint representation;
s504, inputting the joint representation obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after the mutation;
S505, calculating the effect of the variation on enhancer activity from the enhancer activities Y_ref and Y_alt.
The beneficial effects of the above further scheme are: the method predicts the influence of variation on the enhancer based on multiple chromatin features, effectively overcoming the poor performance of the traditional simulated-mutation method based on the DNA sequence alone.
Still further, the step S6 includes the steps of:
s601, randomly selecting a plurality of non-enhancer regions positioned in an open chromatin region as a control group;
s602, for each base of each DNA sequence in the control group, obtaining the set S of control-group variant effects according to the influence of variation on enhancer activity;
s603, taking the 2.5th and 97.5th percentiles of the set S as empirical significance thresholds;
S604, determining the functional variation according to the experience significance threshold.
The beneficial effects of the above further scheme are: the invention determines an empirical threshold from a control group of non-enhancer regions, thereby effectively screening out potential functional variants.
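The percentile-based screening of S602–S604 can be sketched in a few lines, assuming the control-group effect scores have already been computed; the function names are illustrative.

```python
import numpy as np

def significance_thresholds(control_effects):
    """Empirical thresholds: the 2.5th and 97.5th percentiles of the
    control-group variant-effect set S (S603)."""
    return np.percentile(control_effects, [2.5, 97.5])

def is_functional(effect, lo, hi):
    """A variant is called functional if its effect falls outside the
    empirical control distribution (S604)."""
    return effect < lo or effect > hi
```

Effects beyond either threshold are the candidates retained as functional variants.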
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a Word2vec model structure.
FIG. 3 is a block diagram of the high-resolution chromatin profile prediction model.
Fig. 4 is a meta learning training schematic.
FIG. 5 is a diagram of the self-encoder-based generative adversarial network model.
FIG. 6 is a block diagram of a predictive model of enhancer activity based on fusion of multiple chromatin features.
Detailed Description
The following description of the embodiments is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions making use of the inventive concept fall within the spirit and scope of the invention as defined in the appended claims.
Examples
As shown in fig. 1, the present invention provides a prediction method for influence of non-coding mutation on enhancer activity based on multiple stages, which is implemented as follows:
S1, obtaining relevant characteristics of the enhancers, and preprocessing the relevant characteristics, wherein the implementation method comprises the following steps:
s101, acquiring an enhancer related characteristic data set;
s102, preprocessing an enhancer related characteristic data set to obtain positive and negative samples of a training set, wherein the positive and negative samples of the training set comprise DNA sequences and a plurality of corresponding chromatin characteristics;
s103, based on positive and negative samples of the training set, fixing the length of the sequence to obtain a DNA sequence with the length of 1001 bp;
s104, dividing the DNA sequence into k-mer base segments, encoding each base segment by using single thermal encoding, and learning the distributed representation of the base segments by using a Word2vec mode, wherein the chromatin characteristics use log 2 Scaling (1+x), x representing the chromatin feature value, log 2 (. Cndot.) represents a log function based on 2;
s105, dividing the related characteristic data set of the enhancer to finish the pre-processing of the data.
In this embodiment, data are collected and preprocessed: enhancer-related features of common human cell lines, including DNA sequence features, enhancer activity features (STARR-seq), 11 histone modification features (histone ChIP-seq), and chromatin openness features (DNase-seq), are obtained and preprocessed.
In this embodiment, the enhancer-related feature dataset is downloaded from the open ENCODE database, and the downloaded files are preprocessed with the gkmSVM and deepTools tools to obtain the positive and negative samples of the training set. The sequence length is fixed to yield DNA sequences of 1001 bp. Data encoding: the DNA sequence is first divided into k-mer base segments, each base segment is one-hot encoded, and the distributed representation of the base segments is then learned with Word2vec. Chromatin features are scaled with log2(1+x). Dataset division: for each dataset, 70% of the samples are used for training, 10% for validation, and 20% for testing.
In this example, the enhancer activity and chromatin feature datasets are obtained from the open ENCODE database and preprocessed with gkmSVM and deepTools to obtain positive and negative samples:
1. acquiring data
The acquired datasets all come from the human Encyclopedia of DNA Elements (ENCODE). The downloaded data include the enhancer activity datasets (STARR-seq) of the A549, GM12878, HCT116, MCF-7, HepG2, and K562 cell lines and their corresponding 11 histone modification and chromatin openness datasets. Specifically, the 11 histone modifications are H2AFZ, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, and H4K20me1. Finally, a number of data files with the suffixes .bed and .bigwig are obtained.
2. Collecting positive and negative samples
1) Collection of positive samples
i. The files with the suffix .bed are processed with the gkmSVM tool: for each binding site, according to its start and end coordinates on the chromosome, the corresponding DNA sequence is extracted and used as a DNA sequence positive sample in the training set;
ii. The deepTools tool extracts the coordinates from the .bed file and the sequencing signal (signal) from the corresponding .bigwig file, which is used as a signal positive sample in the training set.
2) Negative sample collection
i. For each DNA sequence positive sample, a DNA sequence fragment with similar GC base content is matched across the whole genome and used as a DNA sequence negative sample in the training set;
ii. From each DNA sequence negative sample, the start and end coordinates on the chromosome are derived, and deepTools is used to extract the signal from the .bigwig file, yielding the signal negative samples of the training set.
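The GC-content matching used for negative-sample collection above can be sketched as follows (a minimal illustration; `gc_matched` and its tolerance are hypothetical helpers, not part of the patent):

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_matched(pos_seq, cand_seq, tol=0.05):
    """True if a candidate negative fragment has GC content within `tol`
    of the positive sample (the tolerance value is an assumption)."""
    return abs(gc_content(pos_seq) - gc_content(cand_seq)) <= tol
```

In practice a candidate fragment would be drawn from the whole genome and kept only if `gc_matched` holds.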
3. Fixed sequence length
The sequence is uniformly fixed to 1001bp:
1) For each DNA sequence, the midpoint is computed from its start and end coordinates as (start + end) / 2.
2) Taking the midpoint as the reference coordinate, 500 positions are extended in each direction to form a 1001 bp DNA sequence.
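The midpoint-centring step above can be sketched as (a minimal illustration; half-open genomic coordinates are an assumption):

```python
def center_window(start, end, length=1001):
    """Return (new_start, new_end) of a `length`-bp window centred on the
    midpoint of [start, end): midpoint = (start + end) // 2, then extend
    length // 2 positions in each direction (assumes `length` is odd)."""
    mid = (start + end) // 2
    half = length // 2
    return mid - half, mid + half + 1
```

For example, a binding site spanning [1000, 1200) is recentred to a 1001 bp window around position 1100.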
4. Data set partitioning
80% of the data were randomly selected as training set, 10% as validation set, and 10% as test set.
S2, constructing and training a chromatin feature prediction model based on meta-learning based on the preprocessed enhancer related features, wherein the implementation method is as follows:
s201, constructing a chromatin feature prediction model based on the preprocessed relevant features of the enhancers;
s202, updating parameters of a chromatin feature prediction model, wherein the implementation method comprises the following steps:
s2021, training a meta learning model by using a training set obtained by dividing a data set, wherein the training set comprises a query set and a support set;
s2022, initializing chromatin feature prediction model parameters according to normal distribution;
s2023, looping over epochs;
s2024, randomly sampling tasks to form a batch;
s2025, looping over the batch, training the chromatin feature prediction model on the support set of each task to obtain a set of parameters and complete the first parameter update;
s2026, for each task in the batch, computing a loss value on the query set, summing the losses, applying stochastic gradient descent to the gradient to complete the second parameter update and thereby update the chromatin feature prediction model parameters, and returning to step S2023;
S203, performing fine adjustment on each task in the chromatin feature prediction model based on the model obtained by meta learning, wherein the task is a chromatin feature;
s204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
The chromatin feature prediction model comprises:
the encoding module, which extracts enhancer DNA sequence features using a multi-layer involution network;
the self-attention module, which extracts dependencies within the higher-order features derived from the enhancer DNA sequence features;
and the decoding and skip-connection module, which performs upsampling by bilinear interpolation based on the dependency relationships and performs feature fusion using a fully convolutional neural network.
In this embodiment, a meta-learning-based chromatin feature prediction model is constructed to predict various chromatin features from DNA sequences; the model is trained with mean squared error as the loss function, the back-propagation algorithm, and a meta-learning strategy.
In this embodiment, a high-resolution chromatin feature prediction model is built on a U-shaped network using bilinear interpolation together with neural network operators such as Involution and Transformer; the model is trained with a strategy similar to Model-Agnostic Meta-Learning (MAML). Model fine-tuning: fine-tuning is performed for each task on the model obtained by meta-learning, so that each task (i.e., each chromatin feature) yields its own model; the DNA sequence is input into the fine-tuned model to obtain the predicted chromatin features.
In this embodiment, the meta-learning algorithm comprises the following steps: training uses the training set produced by the preprocessing step, whose data are divided into two sets, a query set and a support set; the model parameters are randomly initialized from a normal distribution; the algorithm loops over epochs; tasks are randomly sampled to form a batch; looping over the batch, the model is trained on the support set of each task to obtain a set of parameters (the first parameter update); then, for each task in the batch, a loss value is computed on the query set, the losses are summed, and stochastic gradient descent (SGD) is applied to the gradient to complete the second parameter update, after which the loop continues with the next epoch of training.
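The two-level update loop described above can be sketched with a linear model and analytic MSE gradients (a first-order simplification of MAML for illustration only; the real model is the U-shaped network, and `alpha`/`beta` are assumed learning rates):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean-squared error for a linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def maml_step(w, tasks, alpha=0.01, beta=0.01):
    """One outer MAML-style update (first-order sketch): an inner SGD step
    on each task's support set, then an outer update from the summed
    query-set gradients. Each task is ((Xs, ys), (Xq, yq))."""
    outer = np.zeros_like(w)
    for (Xs, ys), (Xq, yq) in tasks:
        w_task = w - alpha * mse_grad(w, Xs, ys)   # first update (support set)
        outer += mse_grad(w_task, Xq, yq)          # query-set loss gradient
    return w - beta * outer                        # second update (query sets)
```

One call to `maml_step` corresponds to one batch of tasks in steps S2024–S2026.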
In this embodiment, the data are encoded as follows. Since the bases in a DNA sequence do not occur independently, the sequence is divided into k-mer base segments, and each high-dimensional sparse base segment is encoded distributively with Skip-Gram-based Word2vec. As shown in FIG. 2, the DNA sequence is first divided into k-mer fragments to fully capture higher-order dependencies. To keep the sequence length consistent, the character "N" is padded before and after the sequence. For example, with 3-mers the DNA sequence "ATCGA" is represented as the five base segments (NAT, ATC, TCG, CGA, GAN). Because the chromatin feature prediction model requires binary vectors as input, each base segment is first represented with a one-hot encoding strategy; since five characters can occur at each position, the encoding dimension is 5^k, which leads to a high-dimensional sparsity problem, so the Word2vec strategy is used to learn a low-dimensional representation of each segment. Each one-hot encoded segment is converted into an n × d matrix in the distributed representation, where n = 1001 and d is the dimension of the distributed representation.
To learn the low-dimensional representation of the base segments, the objective function L_W of the Word2vec (word vector) model is designed as:
L_W = Σ log p(context(w) | w)
where p denotes probability and p(context(w) | w) is the probability of predicting the correct surrounding words given the center word w.
specifically, the context (w) of a given center word w is predicted using MLP. Since Skip-Gram has better performance than cbow, the strategy of Word2vec of Skip-Gram is used to learn the distributed representation of k-mer base segments, converting the DNA sequence into a two-dimensional vector of (1001, 32).
1) The Skip-gram strategy predicts the probability of surrounding base segments by inputting one base segment, thereby obtaining a higher order correlation.
2) Positive samples of enhancers of all cells were used to construct the dataset, 90% of which were training sets and 10% of which were test sets.
3) The loss function was optimized by small batch gradient descent, batch size=256, adam used as an optimizer, learning rate=0.001. When the test set loss is no longer reduced, training is stopped and the weight matrix of the hidden layer can be defined as a distributed representation of the segments.
The distributed encoding of the final DNA sequence can be expressed by the following formula:
S_o = [o_1, o_2, ..., o_i, ..., o_1000, o_1001]
where o_i denotes the feature encoding of the i-th base segment. The other chromatin features are normalized with log2(1+x) and expressed as a feature matrix.
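The k-mer segmentation with "N" padding described earlier can be sketched as follows (a minimal illustration for odd k; the Word2vec embedding step itself is omitted):

```python
def kmerize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, padding with 'N' on
    both sides so that one k-mer is produced per base, as in the 3-mer
    example above; one-hot dimension per k-mer would be 5**k (125 for k=3)."""
    pad = "N" * (k // 2)
    padded = pad + seq + pad
    return [padded[i:i + k] for i in range(len(seq))]
```

For a 1001 bp sequence this yields 1001 segments, matching the n = 1001 rows of the distributed-representation matrix.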
In this embodiment, a U-Net-based model is used to obtain high-resolution predictions, and a Transformer module is fused in to improve the performance and robustness of the model. Overall, the chromatin feature prediction model consists of four modules: (1) the encoder module, which extracts enhancer DNA sequence features using the Involution neural network operator; (2) the self-attention module, which extracts the dependencies within the high-order features using a Transformer-based self-attention model; (3) the decoder module, which upsamples by bilinear interpolation and performs feature fusion with a fully convolutional neural network; and (4) the skip-connection module, which adds the decoder features to the corresponding encoder-layer features. The four modules are described in detail below, and the model structure is shown in fig. 3.
In this embodiment, the encoding module includes:
the first involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 1001 × 32;
the first max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the first involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 500 × 32;
the second involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 500 × 32;
the second max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the second involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 250 × 32;
the third involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 250 × 32;
the third max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the third involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 125 × 32;
the self-attention module includes:
the first attention layer, whose attention mechanism uses 8 heads and a feed-forward network dimension dim_feedforward of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
the second attention layer, whose attention mechanism uses 8 heads and a feed-forward network dimension dim_feedforward of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
In this example, features are extracted from both the forward and reverse directions of the DNA sequence and added together, so that information from both directions is taken into account.
The decoding and hopping connection module includes:
the first sampling layer, which adds the enhancer DNA sequence features to the input of the self-attention module, batch-normalizes the sum to realize a residual connection, and feeds the normalized features into the first upsampling layer; the upsampled length is 250 and the output feature map has dimension 250 × 32;
the first convolution layer, with 32 filters, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 250 × 32;
the second sampling layer, which adds the feature map output by the first convolution layer to the enhancer DNA sequence features output by the second max-pooling layer to realize a residual connection, batch-normalizes the sum, and feeds the result into the second upsampling layer; the upsampled length is 497 and the output feature map has dimension 500 × 32;
the second convolution layer, with 4 filters, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 500 × 4;
the third sampling layer, which adds the feature map output by the second convolution layer to the enhancer DNA sequence features output by the third max-pooling layer to realize a residual connection, batch-normalizes the sum, and feeds the result into the third upsampling layer; the upsampled length is 1001 and the output feature map has dimension 1001 × 4;
and the third convolution layer, with 1 filter, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 1001 × 1.
In this embodiment, each convolution layer is used to adjust the upsampling feature value.
In this embodiment, the encoding module: this part consists of 3 Involution modules and max-pooling layers, where each module comprises an involution layer, a ReLU layer, and a dropout layer. Specifically, the encoder module progressively reduces the spatial dimension and encodes sequence-specific features. Compared with conventional convolution, the Involution operator has the two advantages of being spatial-specific and channel-agnostic, and its computational cost is also markedly lower than that of conventional convolution. The module can be expressed by the following formulas:
x (l) =ReLU(inv(S (l) ,W (l) ,b (l) ))
x (l) =MaxPool(x (l) )
x (l) =Dropout(x (l) ,r=0.2)
where S^(l), x^(l), W^(l), and b^(l) denote the input, output, weight matrix, and bias of the l-th encoding module, respectively. The number of channels equals the dimension learned by Word2vec (= 32), the involution kernel size K is set to 8, the pooling window size to 2, and the stride to 2.
In this embodiment, the self-attention module: because the higher-order features differ in importance and are interdependent, a Transformer module is used to extract the long-range dependencies among them. Each Transformer module consists of a self-attention layer and a position-wise feed-forward network (FFN), each followed by a residual connection and layer normalization. The present invention uses multi-head self-attention to capture richer feature information:
head_i = SM(Q^(l) W_Q^(l) (K^(l) W_K^(l))^T / √d_k) V^(l) W_V^(l),  MultiHead^(l) = Concat(head_1, ..., head_h) W_o^(l)
where Q^(l), K^(l), and V^(l) denote the query, key, and value vectors of the l-th block; W_o^(l), W_Q^(l), W_K^(l), and W_V^(l) are the corresponding weight matrices of the l-th block; SM(·) denotes the softmax activation function; and d_k is the scaling factor. After each multi-head self-attention module, an FFN is applied to perform a spatial transformation. It consists of a two-layer neural network with a ReLU activation function and can be described by the following formula:
x = max(0, x W_1 + b_1) W_2 + b_2
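The position-wise feed-forward formula can be written directly as a small numpy sketch (weights here are illustrative placeholders, not trained parameters):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU(x @ W1 + b1) @ W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

With identity weights and zero biases, the FFN reduces to a plain ReLU, which makes the formula easy to verify.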
To fully account for the global context information of the DNA sequence, the Transformer module is applied to both the forward and reverse directions of the sequence, followed by global average pooling to extract high-level features.
In this embodiment, the decoding and skip-connection module: the decoder decodes the high-order abstract features and ultimately predicts the high-resolution chromatin feature profile. The purpose of the skip-connection module is to combine encoder and decoder information so as to preserve more feature detail. The module comprises several upsampling layers and a mixing block, whose features are extracted through batch normalization, ReLU activation, and convolution operations:
o (l) =Bilinear(u (l) )
o (l) =BN(o (l) +x (l) )
o (l) =ReLU(Conv(o (l) ,W (l) ,b (l) ))
where Bilinear(·) denotes bilinear interpolation, u^(l) is the value obtained after the l-th upsampling layer, and W^(l) and b^(l) are the weight matrix and bias of the l-th block.
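The upsampling-plus-skip step in the formulas above can be sketched in one dimension (linear interpolation stands in for Bilinear(·); batch normalization and the convolution are omitted in this sketch):

```python
import numpy as np

def upsample_1d(x, target_len):
    """Linearly interpolate each channel of a (L, C) feature map to
    `target_len` positions — the 1-D analogue of the bilinear upsampling."""
    L, C = x.shape
    old = np.linspace(0.0, 1.0, L)
    new = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(new, old, x[:, c]) for c in range(C)], axis=1)

def skip_merge(up, enc):
    """Residual-style skip connection: elementwise add of decoder and
    encoder feature maps of matching shape."""
    return up + enc
```

For instance, a 125 × 32 encoder output would be interpolated to 250 × 32 before being merged with the matching encoder features.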
In this embodiment, a meta-learning model is trained with a strategy similar to MAML. Because of the small-sample problem, conventional deep learning training strategies often suffer from overfitting. The main goal of few-shot meta-learning is to learn model initialization parameters that can be adapted quickly to other related tasks. Furthermore, since the essence of MAML is feature reuse and our task here is to predict a variety of chromatin features, the close associations among the various chromatin features make the task well suited to training in a meta-learning fashion. As shown in fig. 4, meta-learning denotes the meta-training process and adaptation denotes the fine-tuning process; the objective function of the meta-learning model is expressed as:
wherein ,representing minimized model loss, f θ ' represents the parameters after model training, L (·) represents the mean square error loss function, f θ Representing parameters before model training, a represents learning rate, < ->Representing the gradient in the chain derivative.
S3, constructing a feature fusion model based on an autoencoder generative adversarial network, and obtaining a joint characterization of the fused multi-chromatin features from the feature fusion model.
In this example, enhancer activity prediction models based on multi-chromatin feature fusion were constructed and trained. The chromatin characteristics in the model are predicted by a chromatin characteristic prediction model, and the enhancer activity prediction model is trained by using a mean square error as a loss function and using a back propagation algorithm.
In this embodiment, the DNA sequence is input into the meta-learning-based chromatin feature prediction model to obtain first high-order features; the fine-tuned chromatin features are input into the meta-learning-based chromatin feature prediction model to obtain second high-order features; the first and second high-order features are then added together, and bilinear interpolation, skip connection, feature fusion, and related operations are applied in turn to obtain the predicted binding-site affinity signal intensity.
S4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
In this embodiment, fusion of multiple chromatin features is achieved by constructing an autoencoder-based generative adversarial network over the given multi-chromatin feature data. The specific implementation is as follows:
low-dimensional coding: whereas tensor stitching is difficult to integrate directly high-dimensional, heterogeneous features, and tends to ignore the contribution of low-dimensional features. The method utilizes the independent coding layer to map different data into the subspace with low dimension and isomorphism, and comprises the following calculation processes:
z n =ReLU(F n (x n |ΘF n ))
wherein ,xn An initial feature vector, z, which is the nth chromatin feature m Low-dimensional recessive features representing mth chromatin features, F n And ΘF respectively represent the model function and parameters corresponding to the nth independent coding layer.
Feature sharing: to capture the inherent correlations and dependencies among the features, the low-dimensional latent features obtained in the previous step are integrated in a unified way by a feature-sharing layer, computed as follows:
z = S([z_1, z_2, ..., z_n, ..., z_m] | Θ_S)
where S and Θ_S denote the feature fusion model function and parameters of the feature-sharing layer, respectively.
High-dimensional decoding: to make the generated abstract features as similar as possible to the real omics features, independent decoding layers reconstruct the latent features z back toward the initial feature vectors x_n. In this stage the low-dimensional encoder and the high-dimensional decoder are updated by minimizing the reconstruction error, computed as follows:
y_n = ReLU(G_{n'}(z | Θ_{G_{n'}}))
L_1 = Σ_n ‖x_n − y_n‖²
where G_{n'} and Θ_{G_{n'}} denote the feature fusion model function and parameters of the n'-th independent decoding layer, respectively, and ‖·‖ is the L2 norm.
Adversarial learning: mutual antagonism between the encoder and the discriminator ensures that the posterior distribution of the encoder output matches the prior distribution, so as to resolve the distribution differences among the features. The objective function of adversarial learning can be expressed as:
L_2 = E_{z'~p(z)}[log D(z' | Θ_D)] + E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))]
where D and Θ_D denote the model function and parameters of the discriminator, Q and Θ_Q denote the model function and parameters of the generator (the encoder), respectively, and p denotes the Gaussian prior distribution.
the final loss of the feature fusion model is represented by the formula L 1 and L2 The sum of the losses in (a) can be expressed as:
L=λ 1 L 1 +λ 2 L 2
where λ_1 and λ_2 are weights; L denotes the loss function of the feature fusion model, L_1 the reconstruction error, and L_2 the distribution error; y_n denotes the n-th chromatin feature reconstructed by the feature fusion model; ReLU(·) denotes the ReLU activation function; G_{n'} and Θ_{G_{n'}} denote the feature fusion model function and parameters of the n'-th independent decoding layer; z denotes the low-dimensional latent features and z_n the low-dimensional latent features of the n-th chromatin feature; ‖·‖ denotes the L2 norm; E_{z'~p(z)}[log D(z' | Θ_D)] denotes the mathematical expectation of log D(z' | Θ_D) when z' is sampled from the prior distribution p(z), where E denotes mathematical expectation, D denotes the discriminator function with parameters Θ_D, p denotes the Gaussian distribution, and log(·) denotes the logarithm; E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))] denotes the expectation of log(1 − D(z | Θ_D)) when z is sampled from the encoder distribution Q(z | Θ_Q), with Q and Θ_Q the model function and parameters of the generator. After the feature fusion model is fully trained, the low-dimensional latent features z constitute the integrated high-order feature set and serve as the input of the subsequent model.
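The combined loss L = λ_1·L_1 + λ_2·L_2 can be sketched numerically as follows (the discriminator scores `d_real`/`d_fake` stand in for D(z'|Θ_D) and D(z|Θ_D); all inputs are illustrative):

```python
import numpy as np

def fusion_loss(x_list, y_list, d_real, d_fake, lam1=1.0, lam2=1.0):
    """Total feature-fusion loss (sketch): L1 is the summed squared
    reconstruction error over chromatin features, L2 the adversarial term
    E[log D(z')] + E[log(1 - D(z))] evaluated on discriminator scores."""
    l1 = sum(float(np.sum((x - y) ** 2)) for x, y in zip(x_list, y_list))
    l2 = float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    return lam1 * l1 + lam2 * l2
```

With a perfect discriminator (scores 1 on prior samples, 0 on encoder samples) the adversarial term vanishes and only the reconstruction error remains.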
s5, predicting the influence of variation on the activity of the enhancer by using a chromatin characteristic prediction model and an enhancer activity prediction model, wherein the implementation method is as follows:
s501, modifying partial base information in a DNA sequence in a simulated mutation mode;
s502, based on the altered result, predicting the chromatin features C_ref and C_alt of S_ref and S_alt with the chromatin prediction model, where S_ref and S_alt denote the DNA sequences before and after the simulated mutation, respectively;
s503, inputting the chromatin features C_ref and C_alt together with the DNA sequence features into the feature fusion model to obtain the joint characterization;
s504, inputting the joint characterization obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after mutation;
s505, calculating the effect of the variation on enhancer activity from the enhancer activities Y_ref and Y_alt, i.e., (Y_ref − Y_alt).
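Steps S501–S505 can be sketched end-to-end with a stand-in predictor (the callable `predict_activity` is hypothetical and replaces the chained chromatin-feature, fusion, and activity models of steps S502–S504):

```python
def variant_effect(seq, pos, alt, predict_activity):
    """Effect of a single-base substitution on predicted enhancer activity:
    Y_ref - Y_alt, where `predict_activity` maps a sequence to a score."""
    s_ref = seq
    s_alt = seq[:pos] + alt + seq[pos + 1:]   # simulated mutation, e.g. A -> G
    return predict_activity(s_ref) - predict_activity(s_alt)
```

A toy GC-counting predictor suffices to check the sign convention: a mutation that raises the predicted activity yields a negative Y_ref − Y_alt.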
In this embodiment, to predict the influence of a mutation on enhancer activity, the chromatin features are first predicted with the chromatin feature prediction model; next, the joint characterization of the multiple features is obtained with the feature fusion model of step S3; finally, high-resolution enhancer activity prediction is achieved with the enhancer activity prediction model of step S4.
In this example, part of the base information in the DNA sequence is altered by simulated mutation, e.g., adenine (A) to guanine (G); the original DNA sequence is S_ref and the DNA sequence after the simulated mutation is S_alt. The trained chromatin feature prediction model is used to predict the chromatin features C_ref and C_alt of S_ref and S_alt. The trained feature fusion model is then used to jointly characterize the multiple features, yielding the joint characterizations Z_ref and Z_alt before and after mutation. The joint characterizations Z_ref and Z_alt, the chromatin features C_ref and C_alt, and the DNA sequence features are input into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after mutation. The variation effect can be expressed as Y_ref − Y_alt.
In this example, enhancer activity can be understood more fully through integrative analysis across multiple chromatin features. We therefore predict enhancer activity from the joint characterization obtained by the model above. The model is initialized with the parameters obtained by meta-learning; on this basis we fine-tune, using mean squared error as the loss function and optimizing by mini-batch gradient descent with batch size = 64, Adam as the optimizer, and learning rate = 0.001; training stops when the validation-set loss no longer decreases. The model structure is shown in fig. 6.
S6, screening the functional variation according to the influence of the variation on the activity of the enhancer, wherein the implementation method is as follows:
s601, randomly selecting a plurality of non-enhancer regions positioned in an open chromatin region as a control group;
s602, for each base of each DNA sequence in a control group, obtaining a set S of variation influence of the control group according to the influence of variation on the activity of an enhancer;
s603, determining the 2.5 th percentile and the 97.5 th percentile in the set S as experience significance thresholds;
S604, determining the functional variation according to the experience significance threshold.
In this example, functional variants were screened. Functional variation is determined by means of threshold screening. By evaluating the effect of non-enhancer region variation, an empirical threshold is obtained where variation has a significant impact on enhancer activity.
In this example, for each motif, to determine the empirical threshold at which variation affects enhancer activity, 10000 non-enhancer regions of the same length were randomly selected as the control set. For each base, the average mutation score over the three possible substitutions was calculated; after repeating this step for all positions in the control set, a large set of average mutation scores from random sequences was established, and the 2.5th and 97.5th percentiles of the empirical score distribution were taken as the significance thresholds.
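The percentile-threshold step can be sketched as follows (a minimal illustration using numpy's default linear percentile interpolation):

```python
import numpy as np

def significance_thresholds(control_scores, low=2.5, high=97.5):
    """Empirical significance thresholds from the control-set mutation
    scores: variants scoring outside the [2.5th, 97.5th] percentile band
    of the control distribution are called functional."""
    return (float(np.percentile(control_scores, low)),
            float(np.percentile(control_scores, high)))
```

A variant whose score falls below the lower or above the upper threshold would then be retained as a candidate functional variant.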
The beneficial effects of the present invention are verified by comparative experiments as follows.
The data used in this experiment were extracted from the human Encyclopedia of DNA Elements (ENCODE) database and comprise the enhancer activity datasets of six cell lines, together with their corresponding 11 histone ChIP-seq and DNase-seq datasets, for lung cancer human alveolar basal epithelial cells (A549), human B lymphocytes (GM12878), human colon cancer cells (HCT116), human breast cancer cells (MCF-7), human hepatoma cells (HepG2), and human chronic myelogenous leukemia cells (K562). Predictions were compared among BPNet (method 1), FCNsignal (method 2), and model one of the method of the present invention, where the invention's model was trained both conventionally (strategy 1) and with meta-learning (strategy 2). Table 1 reports the chromatin feature prediction results.
TABLE 1
Cell line | A549 | GM12878 | HCT116 | MCF-7 | HepG2 | K562 |
---|---|---|---|---|---|---|
Method 1 | 0.865 | 0.843 | 0.823 | 0.812 | 0.846 | 0.821 |
Method 2 | 0.854 | 0.831 | 0.810 | 0.808 | 0.814 | 0.816 |
The method 1 of the invention | 0.882 | 0.877 | 0.843 | 0.853 | 0.870 | 0.867 |
The method 2 of the invention | 0.895 | 0.884 | 0.879 | 0.890 | 0.883 | 0.896 |
As can be seen from Table 1, compared with the existing deep learning methods (method 1 and method 2), the method of the present invention achieves higher prediction accuracy (Pearson correlation coefficient, PCC) on all six cell line datasets in the experiment, indicating that it has stronger chromatin feature prediction capability. Moreover, the meta-learning strategy (the method 2 of the invention) performs better than conventional training (the method 1 of the invention), showing that meta-learning can effectively capture the intrinsic relationships between different chromatin features and thereby achieve better predictions.
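The accuracy metric reported in Tables 1 and 2 is the Pearson correlation coefficient between predicted and measured signals; a minimal implementation of that metric:

```python
import numpy as np

def pcc(y_true, y_pred):
    """Pearson correlation coefficient, the accuracy metric of Tables 1-2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))
```

Perfectly linearly related signals give a PCC of 1, so the values near 0.9 in the tables indicate a strong linear agreement between predicted and observed profiles.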
Using the same datasets, model two was used to predict enhancer activity, with the chromatin features input to model two being those predicted by model one. Prediction was compared between BPNet (method 1), FCNsignal (method 2), and model one (the method 1 of the invention) and model two (the method 2 of the invention); the results are shown in Table 2, the enhancer activity prediction results table.
TABLE 2
Cell line | A549 | GM12878 | HCT116 | MCF-7 | HepG2 | K562 |
---|---|---|---|---|---|---|
Method 1 | 0.893 | 0.879 | 0.891 | 0.893 | 0.904 | 0.908 |
Method 2 | 0.881 | 0.874 | 0.884 | 0.883 | 0.886 | 0.897 |
The method 1 of the invention | 0.912 | 0.908 | 0.916 | 0.913 | 0.919 | 0.921 |
The method 2 of the invention | 0.943 | 0.939 | 0.948 | 0.951 | 0.932 | 0.941 |
As can be seen from Table 2, compared with the existing deep learning methods (method 1 and method 2), the method of the present invention achieves higher prediction accuracy (PCC) on all six cell line datasets, indicating stronger enhancer activity prediction capability. Notably, prediction using multiple chromatin features (the method 2 of the invention) performs better than prediction using DNA sequence alone (the method 1 of the invention), demonstrating the effectiveness of the "two-step" strategy proposed in the present invention: first predicting chromatin features with model one, then predicting enhancer activity from the predicted chromatin features with model two.
It can be concluded that, compared with existing enhancer activity prediction methods, the method of the invention achieves higher prediction accuracy and can further predict the effect of variants on the basis of multiple chromatin features.
The present invention calculates a mutation score to infer the effect of a mutation on enhancer activity. The wild type (WT) is defined as the reference sequence (ref), and the variant is defined as the sequence containing the alternative allele (alt). The mutation score is obtained by subtracting the reference sequence signal from the alternative allele signal (panel D); strictly, it is defined as y_alt − y_ref. With these scores, the invention analyzes the relationship between the effect of a variant and its position, selecting an empirical threshold in order to study the proportion of functional variants in potential enhancer regions. Specifically, the invention randomly selects 10,000 DNA sequences located in open chromatin regions as a control set, then calculates the average mutation score of the 3 potential substitutions at each base position; after repeating this process for every sequence in the control set, a set of non-enhancer-region mutation scores is obtained, and finally the 2.5th and 97.5th percentiles of its empirical distribution are determined as the significance thresholds. With these significance thresholds, variants in enhancer regions were analyzed. As shown in Table 3, CTCF has the highest proportion of potentially functional variants and YY1 the lowest. Overall, these results demonstrate the strong performance of the invention in predicting the effect of variants on enhancer activity.
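The y_alt − y_ref score can be sketched as follows; `predict_activity` is a deterministic toy stand-in for the invention's two-model pipeline, not the actual predictor:

```python
# Toy stand-in for the two-stage predictor (model one + model two);
# in the patent this maps a 1001-bp sequence to an enhancer-activity signal.
def predict_activity(seq):
    return sum((ord(c) % 5) * 0.1 for c in seq)  # deterministic placeholder

def mutation_score(ref_seq, pos, alt_base):
    """Score a variant as y_alt - y_ref: the change in predicted
    enhancer activity when one base is substituted in silico."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict_activity(alt_seq) - predict_activity(ref_seq)
```

A positive score indicates the variant is predicted to increase enhancer activity, a negative score to decrease it; substituting a base with itself yields a score of zero by construction.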
TABLE 3
In summary, the invention designs a deep learning model based on multiple chromatin features to predict the effect of variants on enhancer activity. Unlike existing methods, which rely only on DNA sequence information, two models are constructed: the first predicts chromatin features from the DNA sequence, and the second predicts enhancer activity from the predicted chromatin features. The results of the invention can be applied in the biomedical field; in addition, functional variants are identified by threshold screening.
Claims (10)
1. A method for predicting the effect of non-coding variations on enhancer activity based on multiple phases, comprising the steps of:
S1, acquiring enhancer-related features, and preprocessing the related features;
S2, constructing and training a meta-learning-based chromatin feature prediction model on the preprocessed enhancer-related features;
S3, constructing a feature fusion model based on an autoencoding generative adversarial network, and obtaining a joint characterization of the fused multi-chromatin features from the feature fusion model;
S4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
S5, predicting the effect of variation on enhancer activity using the chromatin feature prediction model and the enhancer activity prediction model;
S6, screening functional variants according to the effect of the variation on enhancer activity.
2. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 1, wherein step S1 comprises the steps of:
S101, acquiring an enhancer-related feature dataset;
S102, preprocessing the enhancer-related feature dataset to obtain positive and negative training samples, wherein the positive and negative training samples comprise DNA sequences and the corresponding plurality of chromatin features;
S103, fixing the sequence length based on the positive and negative training samples to obtain DNA sequences of length 1001 bp;
S104, dividing the DNA sequence into k-mer base segments, encoding each base segment with one-hot encoding, and learning a distributed representation of the base segments by Word2vec, wherein the chromatin features are scaled by log2(1+x), where x represents the chromatin feature value and log2(·) denotes the base-2 logarithm;
S105, partitioning the enhancer-related feature dataset to complete the data preprocessing.
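A small sketch of the preprocessing in steps S103–S104; overlapping k-mers are assumed here, since the claim does not fix the stride, and the Word2vec embedding step is omitted:

```python
import numpy as np

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mer tokens; in the patent
    these tokens are then embedded with Word2vec."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def scale_chromatin(x):
    """log2(1 + x) scaling of chromatin feature values, as in step S104."""
    return np.log2(1.0 + np.asarray(x, dtype=float))

tokens = kmer_tokens("ACGTAC", k=3)        # ['ACG', 'CGT', 'GTA', 'TAC']
scaled = scale_chromatin([0.0, 1.0, 3.0])  # [0., 1., 2.]
```

The log2(1+x) transform compresses the heavy-tailed read-count signals so that a few very high peaks do not dominate training.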
3. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 2, wherein step S2 comprises the steps of:
S201, constructing a chromatin feature prediction model based on the preprocessed enhancer-related features;
S202, updating the chromatin feature prediction model parameters;
S203, fine-tuning the model obtained by meta-learning on each task in the chromatin feature prediction model, wherein a task is one chromatin feature;
S204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
4. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 3, wherein step S202 comprises the steps of:
S2021, training the meta-learning model with a training set obtained by partitioning the dataset, wherein the training set comprises a query set and a support set;
S2022, initializing the chromatin feature prediction model parameters from a normal distribution;
S2023, looping over epochs;
S2024, randomly sampling tasks to form a batch;
S2025, iterating over the batch, training the chromatin feature prediction model with the support set of each task to obtain a set of parameters and complete the first parameter update;
S2026, calculating a loss value for each task in the batch with the query set, summing the loss values, performing stochastic gradient descent on the gradient to complete the second parameter update, thereby updating the chromatin feature prediction model parameters, and returning to step S2023.
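The two-level update of steps S2025–S2026 follows the MAML pattern (inner update per task on the support set, outer update on the summed query-set losses). Below is a first-order toy sketch in which the chromatin-feature tasks are replaced by synthetic one-parameter least-squares problems; it illustrates only the update structure, not the real model:

```python
import numpy as np

def loss_grad(w, x, y):
    """Mean-squared-error loss and its gradient for a 1-D linear model."""
    pred = x * w
    return float(np.mean((pred - y) ** 2)), float(np.mean(2 * (pred - y) * x))

def maml_step(w, tasks, inner_lr=0.05, outer_lr=0.05):
    outer_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:           # (support, query) per task
        _, g = loss_grad(w, xs, ys)            # first update (S2025)
        w_task = w - inner_lr * g
        _, gq = loss_grad(w_task, xq, yq)      # query-set gradient (S2026)
        outer_grad += gq                       # first-order approximation
    return w - outer_lr * outer_grad           # second update

rng = np.random.default_rng(2)
tasks = []
for w_true in (1.0, 1.5, 2.0):                 # each "task" = one target slope
    x = rng.normal(size=(2, 16))
    tasks.append(((x[0], w_true * x[0]), (x[1], w_true * x[1])))

w = 0.0
for _ in range(100):
    w = maml_step(w, tasks)
```

After training, `w` settles near a compromise between the task optima, i.e. an initialization from which each task can be reached in a single inner step, which is exactly what the meta-learned initialization in the claim provides for the chromatin-feature tasks.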
5. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 4, wherein the expression of the meta-learning objective function is as follows:
6. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 5, wherein the chromatin feature prediction model comprises:
a coding module for extracting enhancer DNA sequence features using a multi-layer involution network;
a self-attention module for extracting dependency relationships within higher-order features based on the enhancer DNA sequence features;
and a decoding and skip-connection module for up-sampling by quadratic interpolation based on the dependency relationships and performing feature fusion with a convolutional neural network.
7. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 6, wherein the coding module comprises:
the first involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 1001 × 32;
the first max-pooling layer, for feature sampling of the enhancer DNA sequence feature map output by the first involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 500 × 32;
the second involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 500 × 32;
the second max-pooling layer, for feature sampling of the enhancer DNA sequence feature map output by the second involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 250 × 32;
the third involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 250 × 32;
the third max-pooling layer, for feature sampling of the enhancer DNA sequence features output by the third involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 125 × 32;
the self-attention module comprises:
the first attention layer, with 8 heads in the attention mechanism and a feed-forward neural network dimension of 64, the output enhancer DNA sequence feature map having dimension 125 × 32;
the second attention layer, with 8 heads in the attention mechanism and a feed-forward neural network dimension of 64, the output enhancer DNA sequence feature map having dimension 125 × 32;
the decoding and skip-connection module comprises:
the first sampling layer, which adds the enhancer DNA sequence features to the dependency relationships output by the self-attention module, batch-normalizes the sum and inputs the normalized features into the first up-sampling layer; the up-sampled dimension is 250 and the output feature map dimension is 250 × 32;
the first convolution layer, with 32 filters, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 250 × 32;
the second sampling layer, which adds the feature map output by the first convolution layer to the enhancer DNA sequence features output by the second max-pooling layer, batch-normalizes the sum and inputs the result into the second up-sampling layer; the up-sampled dimension is 497 and the output feature map dimension is 500 × 32;
the second convolution layer, with 4 filters, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 500 × 4;
the third sampling layer, which adds the feature map output by the second convolution layer to the enhancer DNA sequence features output by the third max-pooling layer, batch-normalizes the sum and inputs the result into the third up-sampling layer; the up-sampled dimension is 1001 and the output feature map dimension is 1001 × 4;
and the third convolution layer, with 1 filter, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 1001 × 1.
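The feature-map lengths claimed above (1001 → 500 → 250 → 125 through the encoder) follow from window-2, stride-2 pooling; a quick check of that arithmetic:

```python
def pool_len(n, window=2, stride=2):
    """Output length of a 1-D max-pooling layer (floor convention)."""
    return (n - window) // stride + 1

lengths = [1001]
for _ in range(3):
    lengths.append(pool_len(lengths[-1]))
# lengths == [1001, 500, 250, 125], matching the claimed feature-map sizes
```

The decoder's up-sampling layers then restore the length back to 1001 so that the final 1001 × 1 output assigns one predicted chromatin-feature value to each input base.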
8. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 7, wherein the loss function of the feature fusion model in step S3 is expressed as follows:
L = λ1L1 + λ2L2
yn = ReLU(Gn(z|ΘGn))
z = S([z1, z2, ..., zn, ..., zm]|ΘS)
wherein L represents the loss function of the feature fusion model, λ1 and λ2 both represent weights, L1 represents the reconstruction error, L2 represents the distribution error, xn represents the initial feature vector of the n-th chromatin feature, yn represents the n-th chromatin feature reconstructed by the feature fusion model, ReLU(·) represents the ReLU activation function, Gn and ΘGn respectively represent the feature fusion model function and parameters corresponding to the n-th independent decoding layer, z represents the low-dimensional latent feature, ‖·‖ represents the L2 norm, S and ΘS respectively represent the feature fusion model function and parameters corresponding to the feature sharing layer, zn represents the low-dimensional latent feature of the n-th chromatin feature, zm represents the low-dimensional latent feature of the m-th chromatin feature, E_{z'~p(z)}[log(D(z'|ΘD))] represents the mathematical expectation of log(D(z'|ΘD)) when the sample z' is drawn from the distribution p(z), where E denotes mathematical expectation, p denotes a Gaussian distribution, log(·) denotes taking the logarithm, the distribution of z' obeys the Gaussian distribution, D and ΘD respectively represent the feature fusion model function and parameters corresponding to the discriminator, E_{z~Q(z|ΘQ)}[log(1−D(z|ΘD))] represents the mathematical expectation of log(1−D(z|ΘD)) when z is drawn from the data distribution Q(z|ΘQ), E_{z~Q(z|ΘQ)}[log(D(z|ΘD))] represents the mathematical expectation of log(D(z|ΘD)) when z is drawn from the data distribution Q(z|ΘQ), and Q and ΘQ respectively represent the model function and parameters corresponding to the generator.
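A numeric sketch of how the claim-8 loss terms combine for the adversarial autoencoder: L1 is the reconstruction error over the m chromatin features, L2 the adversarial term that pushes the shared latent z toward a Gaussian prior. All networks are replaced by trivial stand-ins and the λ values are arbitrary; only the wiring L = λ1·L1 + λ2·L2 and the sign structure of the adversarial term are illustrated:

```python
import numpy as np

rng = np.random.default_rng(3)

def l2_recon(xs, ys):
    """L1: summed squared L2 reconstruction error over the m features."""
    return sum(float(np.sum((x - y) ** 2)) for x, y in zip(xs, ys))

def adv_loss(d_real, d_fake):
    """L2 (discriminator view): E_{z'~p(z)} log D(z') + E_{z~Q} log(1 - D(z));
    the generator, symmetrically, maximizes E_{z~Q} log D(z)."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

xs = [rng.normal(size=8) for _ in range(3)]   # m = 3 chromatin feature vectors
ys = [x * 0.9 for x in xs]                    # stand-in reconstructions
d_real = rng.uniform(0.6, 0.9, size=16)       # D's outputs on prior samples z'
d_fake = rng.uniform(0.1, 0.4, size=16)       # D's outputs on encoded samples z

lam1, lam2 = 1.0, 0.1                         # arbitrary illustrative weights
L = lam1 * l2_recon(xs, ys) + lam2 * adv_loss(d_real, d_fake)
```

The reconstruction term is always non-negative, while the adversarial log-terms are negative for discriminator outputs in (0, 1); training trades the two off through λ1 and λ2.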
9. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 8, wherein step S5 comprises the steps of:
S501, modifying part of the base information in a DNA sequence by simulated mutation;
S502, predicting the chromatin features Cref and Calt of Sref and Salt using the chromatin feature prediction model, wherein Sref and Salt respectively represent the DNA sequences before and after the simulated mutation;
S503, inputting the chromatin features Cref and Calt and the DNA sequence features into the feature fusion model to obtain the joint characterization;
S504, inputting the joint characterization obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Yref and Yalt before and after the mutation;
S505, calculating the effect of the variation on enhancer activity according to the enhancer activities Yref and Yalt.
10. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 9, wherein step S6 comprises the steps of:
S601, randomly selecting a plurality of non-enhancer regions located in open chromatin regions as a control group;
S602, for each base of each DNA sequence in the control group, obtaining a set S of variant effects for the control group according to the effect of variation on enhancer activity;
S603, determining the 2.5th percentile and the 97.5th percentile of the set S as the empirical significance thresholds;
S604, determining the functional variants according to the empirical significance thresholds.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310122535.5A CN116312765A (en) | 2023-02-15 | 2023-02-15 | Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116312765A true CN116312765A (en) | 2023-06-23 |
Family
ID=86791517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310122535.5A Pending CN116312765A (en) | 2023-02-15 | 2023-02-15 | Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116312765A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116884495A (en) * | 2023-08-07 | 2023-10-13 | 成都信息工程大学 | Diffusion model-based long tail chromatin state prediction method |
CN116884495B (en) * | 2023-08-07 | 2024-03-08 | 成都信息工程大学 | Diffusion model-based long tail chromatin state prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||