CN116312765A - Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer - Google Patents
Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer
- Publication number
- CN116312765A CN116312765A CN202310122535.5A CN202310122535A CN116312765A CN 116312765 A CN116312765 A CN 116312765A CN 202310122535 A CN202310122535 A CN 202310122535A CN 116312765 A CN116312765 A CN 116312765A
- Authority
- CN
- China
- Prior art keywords
- feature
- enhancer
- chromatin
- dna sequence
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/20—Allele or variant detection, e.g. single nucleotide polymorphism [SNP] detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02A—TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
- Y02A90/00—Technologies having an indirect contribution to adaptation to climate change
- Y02A90/10—Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation
Abstract
The invention provides a multi-stage prediction method for the influence of non-coding variation on enhancer activity, relating to the technical field of bioinformatics. The method comprises: obtaining enhancer-related features and preprocessing them; constructing and training a meta-learning-based chromatin feature prediction model; obtaining a joint representation of fused multi-chromatin features from a feature fusion model; constructing and training an enhancer activity prediction model based on the joint multi-chromatin-feature representation; predicting the influence of variation on enhancer activity using the chromatin feature prediction model and the enhancer activity prediction model; and screening functional variants by their effect on enhancer activity. The invention provides an effective enhancer activity prediction framework, realizes accurate prediction of the influence of variation on enhancer activity, and overcomes the poor performance of traditional DNA-sequence-based prediction methods.
Description
Technical Field
The invention relates to the technical field of bioinformatics, and in particular to a multi-stage prediction method for the influence of non-coding variation on enhancer activity.
Background
The human genome contains millions of enhancers, important cis-regulatory elements (CRE) that act as switches regulating when and where genes are expressed. Enhancer activity is closely related to gene expression; predicting it not only helps to understand cell-specific gene expression but also provides targets for gene therapy. Thousands of genome-wide association studies (GWAS) have revealed that 93% of common genetic variants associated with a particular trait or disease lie in non-coding regions. Although most of them have no significant individual effect, some genetic diseases are caused by the accumulation of many small-effect variants. Studies have shown that the variants identified by GWAS are enriched in regulatory regions and can control the expression of disease-associated genes by altering enhancer activity. Importantly, because enhancer activity is strongly cell-type specific, the effects of these variants differ across cell types; predicting enhancer activity and further inferring the cell-type-specific effect of variation on enhancer activity is therefore a critical problem.
Predicting the effect of variants on enhancer activity from multiple chromatin features is challenging because variants can alter chromatin state. To address this, ab initio approaches achieve mutation-impact prediction based on multiple chromatin features by predicting chromatin profiles from DNA sequences. ExPecto employs a deep-learning-based model that first predicts chromatin features from the DNA sequence and then uses these features to predict the effects of variation. Although these methods are effective at predicting the effects of non-coding variation in the human genome, their performance is still limited by the model structure and the complex relationships among chromatin features.
Disclosure of Invention
Aiming at the defects of the prior art, the multi-stage prediction method for the influence of non-coding variation on enhancer activity provided by the invention overcomes the poor performance of traditional DNA-sequence-based prediction methods.
In order to achieve the above purpose, the invention adopts the following technical scheme:
the scheme provides a multi-stage prediction method for the influence of non-coding variation on enhancer activity, which comprises the following steps:
S1, acquiring relevant characteristics of an enhancer, and preprocessing the relevant characteristics;
s2, constructing and training a chromatin feature prediction model based on meta-learning based on the preprocessed relevant features of the enhancers;
s3, constructing a feature fusion model based on a self-encoding generative adversarial network, and obtaining the joint representation of the fused multi-chromatin features from the feature fusion model;
s4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
s5, predicting the influence of variation on the activity of the enhancer by using a chromatin characteristic prediction model and an enhancer activity prediction model;
s6, screening the functional variation according to the influence of the variation on the activity of the enhancer.
The beneficial effects of the invention are as follows: a high-resolution chromatin feature prediction model is first constructed based on meta-learning; a chromatin feature fusion model based on a self-encoding generative adversarial network is then proposed; the joint representation of the fused multi-chromatin features, together with the parameters trained for the chromatin feature prediction model, is then used to improve the accuracy of enhancer activity prediction; finally, simulated mutation is used to accurately predict the influence of variation on enhancer activity from the perspective of multiple chromatin features.
Further, the step S1 includes the steps of:
s101, acquiring an enhancer related characteristic data set;
s102, preprocessing an enhancer related characteristic data set to obtain positive and negative samples of a training set, wherein the positive and negative samples of the training set comprise DNA sequences and a plurality of corresponding chromatin characteristics;
s103, based on positive and negative samples of the training set, fixing the length of the sequence to obtain a DNA sequence with the length of 1001 bp;
s104, dividing the DNA sequence into k-mer base segments, encoding each base segment with one-hot encoding, and learning a distributed representation of the base segments using Word2vec, wherein chromatin feature values are scaled as log2(1+x), x being the chromatin feature value and log2(·) the base-2 logarithm;
s105, dividing the related characteristic data set of the enhancer to finish preprocessing the data.
The beneficial effects of the above further scheme are: the invention encodes the sequence as k-mer base segments, effectively capturing the intrinsic relationships between bases and overcoming the poor performance of traditional methods.
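The preprocessing in steps S103–S104 can be sketched as follows. This is a minimal illustration, assuming k = 3 and a one-hot vocabulary over all 4^k possible segments; the function names are illustrative, not from the patent, and the Word2vec training step is omitted.

```python
import numpy as np
from itertools import product

def to_kmers(seq: str, k: int = 3) -> list:
    """Split a DNA sequence into overlapping k-mer base segments."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def one_hot_kmers(kmers: list, k: int = 3) -> np.ndarray:
    """One-hot encode each k-mer over the 4**k possible base segments."""
    vocab = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    mat = np.zeros((len(kmers), len(vocab)))
    for row, kmer in enumerate(kmers):
        mat[row, vocab[kmer]] = 1.0
    return mat

def scale_chromatin(x: np.ndarray) -> np.ndarray:
    """Scale chromatin feature values as log2(1 + x), as in S104."""
    return np.log2(1.0 + x)

kmers = to_kmers("ACGTAC", k=3)                       # ['ACG', 'CGT', 'GTA', 'TAC']
encoded = one_hot_kmers(kmers)                        # shape (4, 64)
signal = scale_chromatin(np.array([0.0, 1.0, 3.0]))   # [0., 1., 2.]
```

In the full pipeline these one-hot segments would then be fed to Word2vec to learn the distributed representation.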
Still further, the step S2 includes the steps of:
s201, constructing a chromatin feature prediction model based on the preprocessed relevant features of the enhancers;
S202, updating chromatin feature prediction model parameters;
s203, performing fine adjustment on each task in the chromatin feature prediction model based on the model obtained by meta learning, wherein the task is a chromatin feature;
s204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
The beneficial effects of the above-mentioned further scheme are: the invention can realize high-resolution chromatin feature prediction by constructing the chromatin feature prediction model.
Still further, the step S202 includes the steps of:
s2021, training a meta learning model by using a training set obtained by dividing a data set, wherein the training set comprises a query set and a support set;
s2022, initializing chromatin feature prediction model parameters according to normal distribution;
s2023, looping over epochs;
s2024, randomly sampling tasks to form a batch;
s2025, iterating over the batch, training the chromatin feature prediction model on the support set of each task to obtain a set of parameters and complete the first parameter update;
s2026, computing the loss value of each task in the batch on its query set, summing the loss values, applying stochastic gradient descent to the gradient to complete the second parameter update, thereby updating the chromatin feature prediction model parameters, and returning to step S2023.
The beneficial effects of the above further scheme are: the invention introduces meta-learning; because different chromatin features are intrinsically related, meta-learning can effectively extract the commonality among them.
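The two-level update of S2025–S2026 can be sketched with a MAML-style step on a toy linear model. Note this is a simplification under stated assumptions: it uses the first-order approximation (the query-loss gradient is taken at the adapted parameters without differentiating through the inner step), and all names are illustrative rather than the patent's.

```python
import numpy as np

def mse_grad(theta, X, y):
    """Gradient of the mean-squared-error loss for the linear model f(x) = X @ theta."""
    return 2.0 * X.T @ (X @ theta - y) / len(y)

def maml_step(theta, tasks, inner_lr=0.01, outer_lr=0.01):
    """One outer update: adapt on each task's support set (first parameter
    update), then sum the query-set gradients at the adapted parameters and
    descend (second parameter update)."""
    outer_grad = np.zeros_like(theta)
    for (Xs, ys), (Xq, yq) in tasks:  # each task = (support set, query set)
        theta_prime = theta - inner_lr * mse_grad(theta, Xs, ys)  # inner step
        outer_grad += mse_grad(theta_prime, Xq, yq)               # query-loss gradient
    return theta - outer_lr * outer_grad
```

Repeating `maml_step` over batches of sampled tasks corresponds to the epoch loop of S2023–S2026.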
Still further, the meta-learning objective function is expressed as follows:

min_θ Σ_{T_i ~ p(T)} L_{T_i}(f_{θ_i'}),  with  θ_i' = θ − α ∇_θ L_{T_i}(f_θ)

where min_θ denotes minimizing the model loss, f_{θ'} denotes the model with the parameters after task-level training, L(·) is the mean-squared-error loss function, f_θ denotes the model with the parameters before task-level training, α is the learning rate, and ∇_θ is the gradient in the chain-rule derivative.
Still further, the chromatin feature prediction model comprises:
an encoding module, which extracts enhancer DNA sequence features using a multi-layer involution (inner convolution) network;
a self-attention module, which extracts the dependency relationships within higher-order features based on the enhancer DNA sequence features;
and a decoding and skip-connection module, which up-samples via quadratic interpolation based on the dependency relationships and performs feature fusion using a convolutional neural network.
The beneficial effects of the above-mentioned further scheme are: the invention effectively realizes high-resolution prediction of enhancer activity by adopting the U-shaped network structure.
Still further, the encoding module comprises:
a first involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 1001 × 32;
a first max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the first involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 500 × 32;
a second involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 500 × 32;
a second max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the second involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 250 × 32;
a third involution layer, with involution kernel size K = 7, stride 1, 8 channels per group and 4 groups of channels, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 250 × 32;
a third max-pooling layer, which down-samples the enhancer DNA sequence feature map output by the third involution layer, with pooling window size 2 and stride 2; the output enhancer DNA sequence feature map has dimension 125 × 32.
The self-attention module comprises:
a first attention layer, with 8 attention heads and a feed-forward neural network dimension of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
a second attention layer, with 8 attention heads and a feed-forward neural network dimension of 64; the output enhancer DNA sequence feature map has dimension 125 × 32.
The decoding and skip-connection module comprises:
a first sampling layer, which adds the enhancer DNA sequence features and the dependency relationships output by the self-attention module, applies batch normalization to the sum, and inputs the normalized features into the first up-sampling layer; the up-sampled dimension is 250 and the output feature map has dimension 250 × 32;
a first convolution layer, with 32 filters, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 250 × 32;
a second sampling layer, which adds the feature map output by the first convolution layer and the enhancer DNA sequence features output by the second max-pooling layer, applies batch normalization to the sum, and inputs the result into the second up-sampling layer; the up-sampled dimension is 497 and the output feature map has dimension 500 × 32;
a second convolution layer, with 4 filters, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 500 × 4;
a third sampling layer, which adds the feature map output by the second convolution layer and the enhancer DNA sequence features output by the third max-pooling layer, applies batch normalization to the sum, and inputs the result into the third up-sampling layer; the up-sampled dimension is 1001 and the output feature map has dimension 1001 × 4;
and a third convolution layer, with 1 filter, convolution kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 1001 × 1.
The beneficial effects of the above-mentioned further scheme are: and high-resolution chromatin feature prediction is realized by using a U-shaped network.
Still further, the loss function of the feature fusion model in step S3 is expressed as follows:

L = λ_1 L_1 + λ_2 L_2
L_1 = Σ_n || x_n − y_n ||²
y_n = ReLU(G_n'(z | Θ_{G_n'}))
z = S([z_1, z_2, ..., z_n, ..., z_m] | Θ_S)
L_2 = E_{z'~p(z)}[log(D(z' | Θ_D))] + E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))]

where L denotes the loss of the feature fusion model; λ_1 and λ_2 are weights; L_1 is the reconstruction error and L_2 the distribution error; x_n is the initial feature vector of the n-th chromatin feature; y_n is the n-th chromatin feature reconstructed by the feature fusion model; ReLU(·) is the ReLU activation function; G_n' and Θ_{G_n'} are the feature fusion model function and parameters of the n-th independent decoding layer; z is the low-dimensional latent feature and ||·|| denotes the L2 norm; S and Θ_S are the feature fusion model function and parameters of the feature-sharing layer; z_n and z_m are the low-dimensional latent features of the n-th and m-th chromatin features; E_{z'~p(z)}[log(D(z' | Θ_D))] is the mathematical expectation of log(D(z' | Θ_D)) when z' is sampled from the prior distribution p(z), the distribution of z' following a Gaussian; D and Θ_D are the feature fusion model function and parameters of the discriminator; E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))] is the mathematical expectation of log(1 − D(z | Θ_D)) when z is sampled from the data distribution Q(z | Θ_Q), with E_{z~Q(z|Θ_Q)}[log(D(z | Θ_D))] the corresponding generator term; Q and Θ_Q are the model function and parameters of the generator.
The beneficial effects of the above further scheme are: based on a self-encoding generative adversarial network, the invention effectively fuses high-dimensional heterogeneous chromatin features.
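As a rough numerical illustration of the loss above, the sketch below evaluates the reconstruction term and the adversarial distribution term, assuming linear stand-in decoders W_n for G_n' and precomputed discriminator outputs; every function and variable name here is hypothetical, not from the patent.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def reconstruction_error(xs, ws, z):
    """L1: sum over chromatin features n of ||x_n - ReLU(W_n @ z)||^2."""
    return sum(np.sum((x - relu(w @ z)) ** 2) for x, w in zip(xs, ws))

def distribution_error(d_prior, d_encoded):
    """L2: adversarial term E[log D(z')] + E[log(1 - D(z))], given the
    discriminator's outputs on prior samples (d_prior) and on encoded
    latent features (d_encoded)."""
    return np.mean(np.log(d_prior)) + np.mean(np.log(1.0 - d_encoded))

def fusion_loss(xs, ws, z, d_prior, d_encoded, lam1=1.0, lam2=0.1):
    """L = lam1 * L1 + lam2 * L2."""
    return (lam1 * reconstruction_error(xs, ws, z)
            + lam2 * distribution_error(d_prior, d_encoded))
```

In the actual model, G_n' would be a learned decoding layer per chromatin feature and D a learned discriminator; the linear/precomputed stand-ins only show how the two terms combine.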
Still further, the step S5 includes the steps of:
s501, modifying part of the base information in the DNA sequence by simulated mutation;
S502, predicting, with the chromatin feature prediction model and based on the modified result, the chromatin features C_ref and C_alt of S_ref and S_alt, where S_ref and S_alt denote the DNA sequences before and after the simulated mutation;
s503, inputting the chromatin features C_ref and C_alt together with the DNA sequence features into the feature fusion model to obtain the joint representation;
s504, inputting the joint representation obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after the mutation;
S505, calculating the effect of the variation on enhancer activity from the enhancer activities Y_ref and Y_alt.
The beneficial effects of the above further scheme are: the method predicts the influence of variation on the enhancer based on multiple chromatin features, effectively overcoming the poor performance of the traditional simulated-mutation method based on the DNA sequence alone.
Still further, the step S6 includes the steps of:
s601, randomly selecting a plurality of non-enhancer regions positioned in an open chromatin region as a control group;
s602, for each base of each DNA sequence in the control group, obtaining the set S of control-group variant effects according to the influence of variation on enhancer activity;
s603, taking the 2.5th and 97.5th percentiles of the set S as empirical significance thresholds;
S604, determining the functional variation according to the experience significance threshold.
The beneficial effects of the above further scheme are: the invention determines an empirical threshold from a control group of non-enhancer regions, thereby effectively screening out potential functional variants.
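The percentile-based screening of S602–S604 can be sketched in a few lines, assuming the control-group effect scores have already been computed; the function names are illustrative.

```python
import numpy as np

def significance_thresholds(control_effects):
    """Empirical thresholds: the 2.5th and 97.5th percentiles of the
    control-group variant-effect set S (S603)."""
    return np.percentile(control_effects, [2.5, 97.5])

def is_functional(effect, lo, hi):
    """A variant is called functional if its effect falls outside the
    empirical control distribution (S604)."""
    return effect < lo or effect > hi
```

Effects beyond either threshold are the candidates retained as functional variants.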
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a schematic diagram of a Word2vec model structure.
FIG. 3 is a block diagram of the high-resolution chromatin profile prediction model.
Fig. 4 is a meta learning training schematic.
FIG. 5 is a diagram of the self-encoder-based generative adversarial network model.
FIG. 6 is a block diagram of a predictive model of enhancer activity based on fusion of multiple chromatin features.
Detailed Description
The following description of the embodiments is provided to facilitate understanding of the invention by those skilled in the art. It should be understood, however, that the invention is not limited to the scope of these embodiments; to those skilled in the art, all inventions making use of the inventive concept fall within the spirit and scope of the invention as defined in the appended claims.
Examples
As shown in fig. 1, the present invention provides a prediction method for influence of non-coding mutation on enhancer activity based on multiple stages, which is implemented as follows:
S1, obtaining relevant characteristics of the enhancers, and preprocessing the relevant characteristics, wherein the implementation method comprises the following steps:
s101, acquiring an enhancer related characteristic data set;
s102, preprocessing an enhancer related characteristic data set to obtain positive and negative samples of a training set, wherein the positive and negative samples of the training set comprise DNA sequences and a plurality of corresponding chromatin characteristics;
s103, based on positive and negative samples of the training set, fixing the length of the sequence to obtain a DNA sequence with the length of 1001 bp;
s104, dividing the DNA sequence into k-mer base segments, encoding each base segment by using single thermal encoding, and learning the distributed representation of the base segments by using a Word2vec mode, wherein the chromatin characteristics use log 2 Scaling (1+x), x representing the chromatin feature value, log 2 (. Cndot.) represents a log function based on 2;
s105, dividing the related characteristic data set of the enhancer to finish the pre-processing of the data.
In this embodiment, data are collected and preprocessed: enhancer-related features of common human cell lines, including DNA sequence features, enhancer activity features (STARR-seq), 11 histone modification features (histone ChIP-seq), and chromatin openness features (DNase-seq), are obtained and preprocessed.
In this embodiment, the enhancer-related feature dataset is downloaded from the open ENCODE database, and the downloaded files are preprocessed with the gkmSVM and deepTools tools to obtain the positive and negative samples of the training set. The sequence length is fixed to yield DNA sequences of 1001 bp. Data encoding: the DNA sequence is first divided into k-mer base segments, each base segment is one-hot encoded, and the distributed representation of the base segments is then learned with Word2vec. Chromatin features are scaled with log2(1+x). Dataset division: for each dataset, 70% of the samples are used for training, 10% for validation, and 20% for testing.
In this example, the enhancer activity and chromatin feature datasets are obtained from the open ENCODE database and preprocessed with gkmSVM and deepTools to obtain positive and negative samples:
1. acquiring data
The acquired datasets all come from the human Encyclopedia of DNA Elements (ENCODE). The downloaded data include the enhancer activity datasets (STARR-seq) of the A549, GM12878, HCT116, MCF-7, HepG2, and K562 cell lines and their corresponding 11 histone modification and chromatin openness datasets. Specifically, the 11 histone modifications are H2AFZ, H3K4me1, H3K4me2, H3K4me3, H3K9ac, H3K9me3, H3K27ac, H3K27me3, H3K36me3, H3K79me2, and H4K20me1. Finally, a number of data files with the suffixes .bed and .bigwig are obtained.
2. Collecting positive and negative samples
1) Collection of positive samples
i. The files with the suffix .bed are processed with the gkmSVM tool: for each binding site, according to its start and end coordinates on the chromosome, the corresponding DNA sequence is extracted and used as a DNA sequence positive sample in the training set;
ii. The deepTools tool extracts the coordinates from the .bed file and the sequencing signal (signal) from the corresponding .bigwig file, which is used as a signal positive sample in the training set.
2) Negative sample collection
i. For each DNA sequence positive sample, a DNA sequence fragment with similar GC base content is matched across the whole genome and used as a DNA sequence negative sample in the training set;
ii. From each DNA sequence negative sample, the start and end coordinates on the chromosome are derived, and deepTools is used to extract the signal from the .bigwig file, yielding the signal negative samples of the training set.
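The GC-content matching used for negative-sample collection above can be sketched as follows (a minimal illustration; `gc_matched` and its tolerance are hypothetical helpers, not part of the patent):

```python
def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (hypothetical helper)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_matched(pos_seq, cand_seq, tol=0.05):
    """True if a candidate negative fragment has GC content within `tol`
    of the positive sample (the tolerance value is an assumption)."""
    return abs(gc_content(pos_seq) - gc_content(cand_seq)) <= tol
```

In practice a candidate fragment would be drawn from the whole genome and kept only if `gc_matched` holds.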
3. Fixed sequence length
The sequence is uniformly fixed to 1001bp:
1) For each DNA sequence, the midpoint is computed from its start and end coordinates as (start + end) / 2.
2) Taking the midpoint as the reference coordinate, 500 positions are extended in each direction to form a 1001 bp DNA sequence.
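The midpoint-centring step above can be sketched as (a minimal illustration; half-open genomic coordinates are an assumption):

```python
def center_window(start, end, length=1001):
    """Return (new_start, new_end) of a `length`-bp window centred on the
    midpoint of [start, end): midpoint = (start + end) // 2, then extend
    length // 2 positions in each direction (assumes `length` is odd)."""
    mid = (start + end) // 2
    half = length // 2
    return mid - half, mid + half + 1
```

For example, a binding site spanning [1000, 1200) is recentred to a 1001 bp window around position 1100.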
4. Data set partitioning
80% of the data were randomly selected as training set, 10% as validation set, and 10% as test set.
S2, constructing and training a chromatin feature prediction model based on meta-learning based on the preprocessed enhancer related features, wherein the implementation method is as follows:
s201, constructing a chromatin feature prediction model based on the preprocessed relevant features of the enhancers;
s202, updating parameters of a chromatin feature prediction model, wherein the implementation method comprises the following steps:
s2021, training a meta learning model by using a training set obtained by dividing a data set, wherein the training set comprises a query set and a support set;
s2022, initializing chromatin feature prediction model parameters according to normal distribution;
s2023, looping over epochs;
s2024, randomly sampling tasks to form a batch;
s2025, looping over the batch, training the chromatin feature prediction model on the support set of each task to obtain a set of parameters and complete the first parameter update;
s2026, for each task in the batch, computing a loss value on the query set, summing the losses, applying stochastic gradient descent to the gradient to complete the second parameter update and thereby update the chromatin feature prediction model parameters, and returning to step S2023;
S203, performing fine adjustment on each task in the chromatin feature prediction model based on the model obtained by meta learning, wherein the task is a chromatin feature;
s204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
The chromatin feature prediction model comprises:
the encoding module, which extracts enhancer DNA sequence features using a multi-layer involution network;
the self-attention module, which extracts dependencies within the higher-order features derived from the enhancer DNA sequence features;
and the decoding and skip-connection module, which performs upsampling by bilinear interpolation based on the dependency relationships and performs feature fusion using a fully convolutional neural network.
In this embodiment, a meta-learning-based chromatin feature prediction model is constructed to predict various chromatin features from DNA sequences; the model is trained with mean squared error as the loss function, the back-propagation algorithm, and a meta-learning strategy.
In this embodiment, a high-resolution chromatin feature prediction model is built on a U-shaped network using bilinear interpolation together with neural network operators such as Involution and Transformer; the model is trained with a strategy similar to Model-Agnostic Meta-Learning (MAML). Model fine-tuning: fine-tuning is performed for each task on the model obtained by meta-learning, so that each task (i.e., each chromatin feature) yields its own model; the DNA sequence is input into the fine-tuned model to obtain the predicted chromatin features.
In this embodiment, the meta-learning algorithm comprises the following steps: training uses the training set produced by the preprocessing step, whose data are divided into two sets, a query set and a support set; the model parameters are randomly initialized from a normal distribution; the algorithm loops over epochs; tasks are randomly sampled to form a batch; looping over the batch, the model is trained on the support set of each task to obtain a set of parameters (the first parameter update); then, for each task in the batch, a loss value is computed on the query set, the losses are summed, and stochastic gradient descent (SGD) is applied to the gradient to complete the second parameter update, after which the loop continues with the next epoch of training.
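The two-level update loop described above can be sketched with a linear model and analytic MSE gradients (a first-order simplification of MAML for illustration only; the real model is the U-shaped network, and `alpha`/`beta` are assumed learning rates):

```python
import numpy as np

def mse_grad(w, X, y):
    """Gradient of mean-squared error for a linear model y_hat = X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def maml_step(w, tasks, alpha=0.01, beta=0.01):
    """One outer MAML-style update (first-order sketch): an inner SGD step
    on each task's support set, then an outer update from the summed
    query-set gradients. Each task is ((Xs, ys), (Xq, yq))."""
    outer = np.zeros_like(w)
    for (Xs, ys), (Xq, yq) in tasks:
        w_task = w - alpha * mse_grad(w, Xs, ys)   # first update (support set)
        outer += mse_grad(w_task, Xq, yq)          # query-set loss gradient
    return w - beta * outer                        # second update (query sets)
```

One call to `maml_step` corresponds to one batch of tasks in steps S2024–S2026.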
In this embodiment, the data are encoded as follows. Since the bases in a DNA sequence do not occur independently, the sequence is divided into k-mer base segments, and each high-dimensional sparse base segment is encoded distributively with Skip-Gram-based Word2vec. As shown in FIG. 2, the DNA sequence is first divided into k-mer fragments to fully capture higher-order dependencies. To keep the sequence length consistent, the character "N" is padded before and after the sequence. For example, with 3-mers the DNA sequence "ATCGA" is represented as the five base segments (NAT, ATC, TCG, CGA, GAN). Because the chromatin feature prediction model requires binary vectors as input, each base segment is first represented with a one-hot encoding strategy; since five characters can occur at each position, the encoding dimension is 5^k, which leads to a high-dimensional sparsity problem, so the Word2vec strategy is used to learn a low-dimensional representation of each segment. Each one-hot encoded segment is converted into an n × d matrix in the distributed representation, where n = 1001 and d is the dimension of the distributed representation.
To learn the low-dimensional representation of the base segments, the objective function L_W of the Word2vec (word vector) model is designed as:
L_W = Σ log p(context(w) | w)
where p denotes probability and p(context(w) | w) is the probability of predicting the correct surrounding words given the center word w.
specifically, the context (w) of a given center word w is predicted using MLP. Since Skip-Gram has better performance than cbow, the strategy of Word2vec of Skip-Gram is used to learn the distributed representation of k-mer base segments, converting the DNA sequence into a two-dimensional vector of (1001, 32).
1) The Skip-gram strategy predicts the probability of surrounding base segments by inputting one base segment, thereby obtaining a higher order correlation.
2) Positive samples of enhancers of all cells were used to construct the dataset, 90% of which were training sets and 10% of which were test sets.
3) The loss function was optimized by small batch gradient descent, batch size=256, adam used as an optimizer, learning rate=0.001. When the test set loss is no longer reduced, training is stopped and the weight matrix of the hidden layer can be defined as a distributed representation of the segments.
The distributed encoding of the final DNA sequence can be expressed by the following formula:
S_o = [o_1, o_2, ..., o_i, ..., o_1000, o_1001]
where o_i denotes the feature encoding of the i-th base segment. The other chromatin features are normalized with log2(1+x) and expressed as a feature matrix.
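The k-mer segmentation with "N" padding described earlier can be sketched as follows (a minimal illustration for odd k; the Word2vec embedding step itself is omitted):

```python
def kmerize(seq, k=3):
    """Split a DNA sequence into overlapping k-mers, padding with 'N' on
    both sides so that one k-mer is produced per base, as in the 3-mer
    example above; one-hot dimension per k-mer would be 5**k (125 for k=3)."""
    pad = "N" * (k // 2)
    padded = pad + seq + pad
    return [padded[i:i + k] for i in range(len(seq))]
```

For a 1001 bp sequence this yields 1001 segments, matching the n = 1001 rows of the distributed-representation matrix.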
In this embodiment, a U-Net-based model is used to obtain high-resolution predictions, and a Transformer module is fused in to improve the performance and robustness of the model. Overall, the chromatin feature prediction model consists of four modules: (1) the encoder module, which extracts enhancer DNA sequence features using the Involution neural network operator; (2) the self-attention module, which extracts the dependencies within the high-order features using a Transformer-based self-attention model; (3) the decoder module, which upsamples by bilinear interpolation and performs feature fusion with a fully convolutional neural network; and (4) the skip-connection module, which adds the decoder features to the corresponding encoder-layer features. The four modules are described in detail below, and the model structure is shown in fig. 3.
In this embodiment, the encoding module includes:
the first involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 1001 × 32;
the first max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the first involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 500 × 32;
the second involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 500 × 32;
the second max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the second involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 250 × 32;
the third involution layer, with involution kernel size K = 7, stride 1, 8 channels per group, 4 groups in total, and a channel reduction ratio of 4; the output enhancer DNA sequence feature map has dimension 250 × 32;
the third max-pooling layer, which downsamples the enhancer DNA sequence feature map output by the third involution layer, with pooling window size 2 and stride 2; the output feature map has dimension 125 × 32;
the self-attention module includes:
the first attention layer, whose attention mechanism uses 8 heads and a feed-forward network dimension dim_feedforward of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
the second attention layer, whose attention mechanism uses 8 heads and a feed-forward network dimension dim_feedforward of 64; the output enhancer DNA sequence feature map has dimension 125 × 32;
In this example, features are extracted from both the forward and reverse directions of the DNA sequence and added together, so that information from both directions is taken into account.
The decoding and hopping connection module includes:
the first sampling layer, which adds the enhancer DNA sequence features to the input of the self-attention module, batch-normalizes the sum to realize a residual connection, and feeds the normalized features into the first upsampling layer; the upsampled length is 250 and the output feature map has dimension 250 × 32;
the first convolution layer, with 32 filters, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 250 × 32;
the second sampling layer, which adds the feature map output by the first convolution layer to the enhancer DNA sequence features output by the second max-pooling layer to realize a residual connection, batch-normalizes the sum, and feeds the result into the second upsampling layer; the upsampled length is 497 and the output feature map has dimension 500 × 32;
the second convolution layer, with 4 filters, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 500 × 4;
the third sampling layer, which adds the feature map output by the second convolution layer to the enhancer DNA sequence features output by the third max-pooling layer to realize a residual connection, batch-normalizes the sum, and feeds the result into the third upsampling layer; the upsampled length is 1001 and the output feature map has dimension 1001 × 4;
and the third convolution layer, with 1 filter, kernel size 5, stride 1, ELU activation, and Same convolution; the output feature map has dimension 1001 × 1.
In this embodiment, each convolution layer is used to adjust the upsampling feature value.
In this embodiment, the encoding module: this part consists of 3 Involution modules and max-pooling layers, where each module comprises an involution layer, a ReLU layer, and a dropout layer. Specifically, the encoder module progressively reduces the spatial dimension and encodes sequence-specific features. Compared with conventional convolution, the Involution operator has the two advantages of being spatial-specific and channel-agnostic, and its computational cost is also markedly lower than that of conventional convolution. The module can be expressed by the following formulas:
x (l) =ReLU(inv(S (l) ,W (l) ,b (l) ))
x (l) =MaxPool(x (l) )
x (l) =Dropout(x (l) ,r=0.2)
where S^(l), x^(l), W^(l), and b^(l) denote the input, output, weight matrix, and bias of the l-th encoding module, respectively. The number of channels equals the dimension learned by Word2vec (= 32), the involution kernel size K is set to 8, the pooling window size to 2, and the stride to 2.
In this embodiment, the self-attention module: because the higher-order features differ in importance and are interdependent, a Transformer module is used to extract the long-range dependencies among them. Each Transformer module consists of a self-attention layer and a position-wise feed-forward network (FFN), each followed by a residual connection and layer normalization. The present invention uses multi-head self-attention to capture richer feature information:
head_i = SM(Q^(l) W_Q^(l) (K^(l) W_K^(l))^T / √d_k) V^(l) W_V^(l),  MultiHead^(l) = Concat(head_1, ..., head_h) W_o^(l)
where Q^(l), K^(l), and V^(l) denote the query, key, and value vectors of the l-th block; W_o^(l), W_Q^(l), W_K^(l), and W_V^(l) are the corresponding weight matrices of the l-th block; SM(·) denotes the softmax activation function; and d_k is the scaling factor. After each multi-head self-attention module, an FFN is applied to perform a spatial transformation. It consists of a two-layer neural network with a ReLU activation function and can be described by the following formula:
x = max(0, x W_1 + b_1) W_2 + b_2
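The position-wise feed-forward formula can be written directly as a small numpy sketch (weights here are illustrative placeholders, not trained parameters):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise feed-forward network: ReLU(x @ W1 + b1) @ W2 + b2."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2
```

With identity weights and zero biases, the FFN reduces to a plain ReLU, which makes the formula easy to verify.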
To fully account for the global context information of the DNA sequence, the Transformer module is applied to both the forward and reverse directions of the sequence, followed by global average pooling to extract high-level features.
In this embodiment, the decoding and skip-connection module: the decoder decodes the high-order abstract features and ultimately predicts the high-resolution chromatin feature profile. The purpose of the skip-connection module is to combine encoder and decoder information so as to preserve more feature detail. The module comprises several upsampling layers and a mixing block, whose features are extracted through batch normalization, ReLU activation, and convolution operations:
o (l) =Bilinear(u (l) )
o (l) =BN(o (l) +x (l) )
o (l) =ReLU(Conv(o (l) ,W (l) ,b (l) ))
where Bilinear(·) denotes bilinear interpolation, u^(l) is the value obtained after the l-th upsampling layer, and W^(l) and b^(l) are the weight matrix and bias of the l-th block.
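The upsampling-plus-skip step in the formulas above can be sketched in one dimension (linear interpolation stands in for Bilinear(·); batch normalization and the convolution are omitted in this sketch):

```python
import numpy as np

def upsample_1d(x, target_len):
    """Linearly interpolate each channel of a (L, C) feature map to
    `target_len` positions — the 1-D analogue of the bilinear upsampling."""
    L, C = x.shape
    old = np.linspace(0.0, 1.0, L)
    new = np.linspace(0.0, 1.0, target_len)
    return np.stack([np.interp(new, old, x[:, c]) for c in range(C)], axis=1)

def skip_merge(up, enc):
    """Residual-style skip connection: elementwise add of decoder and
    encoder feature maps of matching shape."""
    return up + enc
```

For instance, a 125 × 32 encoder output would be interpolated to 250 × 32 before being merged with the matching encoder features.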
In this embodiment, a meta-learning model is trained with a strategy similar to MAML. Because of the small-sample problem, conventional deep learning training strategies often suffer from overfitting. The main goal of few-shot meta-learning is to learn model initialization parameters that can be adapted quickly to other related tasks. Furthermore, since the essence of MAML is feature reuse and our task here is to predict a variety of chromatin features, the close associations among the various chromatin features make the task well suited to training in a meta-learning fashion. As shown in fig. 4, meta-learning denotes the meta-training process and adaptation denotes the fine-tuning process; the objective function of the meta-learning model is expressed as:
wherein ,representing minimized model loss, f θ ' represents the parameters after model training, L (·) represents the mean square error loss function, f θ Representing parameters before model training, a represents learning rate, < ->Representing the gradient in the chain derivative.
S3, constructing a feature fusion model based on an autoencoder generative adversarial network, and obtaining a joint characterization of the fused multi-chromatin features from the feature fusion model.
In this example, enhancer activity prediction models based on multi-chromatin feature fusion were constructed and trained. The chromatin characteristics in the model are predicted by a chromatin characteristic prediction model, and the enhancer activity prediction model is trained by using a mean square error as a loss function and using a back propagation algorithm.
In this embodiment, the DNA sequence is input into the meta-learning-based chromatin feature prediction model to obtain first high-order features; the fine-tuned chromatin features are input into the meta-learning-based chromatin feature prediction model to obtain second high-order features; the first and second high-order features are then added together, and bilinear interpolation, skip connection, feature fusion, and related operations are applied in turn to obtain the predicted binding-site affinity signal intensity.
S4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
In this embodiment, fusion of multiple chromatin features is achieved by constructing an autoencoder-based generative adversarial network over the given multi-chromatin feature data. The specific implementation is as follows:
low-dimensional coding: whereas tensor stitching is difficult to integrate directly high-dimensional, heterogeneous features, and tends to ignore the contribution of low-dimensional features. The method utilizes the independent coding layer to map different data into the subspace with low dimension and isomorphism, and comprises the following calculation processes:
z n =ReLU(F n (x n |ΘF n ))
wherein ,xn An initial feature vector, z, which is the nth chromatin feature m Low-dimensional recessive features representing mth chromatin features, F n And ΘF respectively represent the model function and parameters corresponding to the nth independent coding layer.
Feature sharing: to capture the inherent correlations and dependencies among the features, the low-dimensional latent features obtained in the previous step are integrated in a unified way by a feature-sharing layer, computed as follows:
z = S([z_1, z_2, ..., z_n, ..., z_m] | Θ_S)
where S and Θ_S denote the feature fusion model function and parameters of the feature-sharing layer, respectively.
High-dimensional decoding: to make the generated abstract features as similar as possible to the real omics features, independent decoding layers reconstruct the latent features z back toward the initial feature vectors x_n. In this stage the low-dimensional encoder and the high-dimensional decoder are updated by minimizing the reconstruction error, computed as follows:
y_n = ReLU(G_{n'}(z | Θ_{G_{n'}}))
L_1 = Σ_n ‖x_n − y_n‖²
where G_{n'} and Θ_{G_{n'}} denote the feature fusion model function and parameters of the n'-th independent decoding layer, respectively, and ‖·‖ is the L2 norm.
Adversarial learning: mutual antagonism between the encoder and the discriminator ensures that the posterior distribution of the encoder output matches the prior distribution, so as to resolve the distribution differences among the features. The objective function of adversarial learning can be expressed as:
L_2 = E_{z'~p(z)}[log D(z' | Θ_D)] + E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))]
where D and Θ_D denote the model function and parameters of the discriminator, Q and Θ_Q denote the model function and parameters of the generator (the encoder), respectively, and p denotes the Gaussian prior distribution.
the final loss of the feature fusion model is represented by the formula L 1 and L2 The sum of the losses in (a) can be expressed as:
L=λ 1 L 1 +λ 2 L 2
where λ_1 and λ_2 are weights; L denotes the loss function of the feature fusion model, L_1 the reconstruction error, and L_2 the distribution error; y_n denotes the n-th chromatin feature reconstructed by the feature fusion model; ReLU(·) denotes the ReLU activation function; G_{n'} and Θ_{G_{n'}} denote the feature fusion model function and parameters of the n'-th independent decoding layer; z denotes the low-dimensional latent features and z_n the low-dimensional latent features of the n-th chromatin feature; ‖·‖ denotes the L2 norm; E_{z'~p(z)}[log D(z' | Θ_D)] denotes the mathematical expectation of log D(z' | Θ_D) when z' is sampled from the prior distribution p(z), where E denotes mathematical expectation, D denotes the discriminator function with parameters Θ_D, p denotes the Gaussian distribution, and log(·) denotes the logarithm; E_{z~Q(z|Θ_Q)}[log(1 − D(z | Θ_D))] denotes the expectation of log(1 − D(z | Θ_D)) when z is sampled from the encoder distribution Q(z | Θ_Q), with Q and Θ_Q the model function and parameters of the generator. After the feature fusion model is fully trained, the low-dimensional latent features z constitute the integrated high-order feature set and serve as the input of the subsequent model.
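The combined loss L = λ_1·L_1 + λ_2·L_2 can be sketched numerically as follows (the discriminator scores `d_real`/`d_fake` stand in for D(z'|Θ_D) and D(z|Θ_D); all inputs are illustrative):

```python
import numpy as np

def fusion_loss(x_list, y_list, d_real, d_fake, lam1=1.0, lam2=1.0):
    """Total feature-fusion loss (sketch): L1 is the summed squared
    reconstruction error over chromatin features, L2 the adversarial term
    E[log D(z')] + E[log(1 - D(z))] evaluated on discriminator scores."""
    l1 = sum(float(np.sum((x - y) ** 2)) for x, y in zip(x_list, y_list))
    l2 = float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))
    return lam1 * l1 + lam2 * l2
```

With a perfect discriminator (scores 1 on prior samples, 0 on encoder samples) the adversarial term vanishes and only the reconstruction error remains.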
s5, predicting the influence of variation on the activity of the enhancer by using a chromatin characteristic prediction model and an enhancer activity prediction model, wherein the implementation method is as follows:
s501, modifying partial base information in a DNA sequence in a simulated mutation mode;
s502, based on the altered result, predicting the chromatin features C_ref and C_alt of S_ref and S_alt with the chromatin prediction model, where S_ref and S_alt denote the DNA sequences before and after the simulated mutation, respectively;
s503, inputting the chromatin features C_ref and C_alt together with the DNA sequence features into the feature fusion model to obtain the joint characterization;
s504, inputting the joint characterization obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after mutation;
s505, calculating the effect of the variation on enhancer activity from the enhancer activities Y_ref and Y_alt, i.e., (Y_ref − Y_alt).
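Steps S501–S505 can be sketched end-to-end with a stand-in predictor (the callable `predict_activity` is hypothetical and replaces the chained chromatin-feature, fusion, and activity models of steps S502–S504):

```python
def variant_effect(seq, pos, alt, predict_activity):
    """Effect of a single-base substitution on predicted enhancer activity:
    Y_ref - Y_alt, where `predict_activity` maps a sequence to a score."""
    s_ref = seq
    s_alt = seq[:pos] + alt + seq[pos + 1:]   # simulated mutation, e.g. A -> G
    return predict_activity(s_ref) - predict_activity(s_alt)
```

A toy GC-counting predictor suffices to check the sign convention: a mutation that raises the predicted activity yields a negative Y_ref − Y_alt.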
In this embodiment, to predict the influence of a mutation on enhancer activity, the chromatin features are first predicted with the chromatin feature prediction model; next, the joint characterization of the multiple features is obtained with the feature fusion model of step S3; finally, high-resolution enhancer activity prediction is achieved with the enhancer activity prediction model of step S4.
In this example, part of the base information in the DNA sequence is altered by simulated mutation, e.g., adenine (A) to guanine (G); the original DNA sequence is S_ref and the DNA sequence after the simulated mutation is S_alt. The trained chromatin feature prediction model is used to predict the chromatin features C_ref and C_alt of S_ref and S_alt. The trained feature fusion model is then used to jointly characterize the multiple features, yielding the joint characterizations Z_ref and Z_alt before and after mutation. The joint characterizations Z_ref and Z_alt, the chromatin features C_ref and C_alt, and the DNA sequence features are input into the enhancer activity prediction model to obtain the enhancer activities Y_ref and Y_alt before and after mutation. The variation effect can be expressed as Y_ref − Y_alt.
In this example, enhancer activity can be understood more fully through integrative analysis across multiple chromatin features. We therefore predict enhancer activity from the joint characterization obtained by the model above. The model is initialized with the parameters obtained by meta-learning; on this basis we fine-tune, using mean squared error as the loss function and optimizing by mini-batch gradient descent with batch size = 64, Adam as the optimizer, and learning rate = 0.001; training stops when the validation-set loss no longer decreases. The model structure is shown in fig. 6.
S6, screening the functional variation according to the influence of the variation on the activity of the enhancer, wherein the implementation method is as follows:
s601, randomly selecting a plurality of non-enhancer regions positioned in an open chromatin region as a control group;
s602, for each base of each DNA sequence in a control group, obtaining a set S of variation influence of the control group according to the influence of variation on the activity of an enhancer;
s603, determining the 2.5 th percentile and the 97.5 th percentile in the set S as experience significance thresholds;
S604, determining the functional variation according to the experience significance threshold.
In this example, functional variants were screened. Functional variation is determined by means of threshold screening. By evaluating the effect of non-enhancer region variation, an empirical threshold is obtained where variation has a significant impact on enhancer activity.
In this example, for each motif, to determine the empirical threshold at which variation affects enhancer activity, 10000 non-enhancer regions of the same length were randomly selected as the control set. For each base, the average mutation score over the three possible substitutions was calculated; after repeating this step for all positions in the control set, a large set of average mutation scores from random sequences was established, and the 2.5th and 97.5th percentiles of the empirical score distribution were taken as the significance thresholds.
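The percentile-threshold step can be sketched as follows (a minimal illustration using numpy's default linear percentile interpolation):

```python
import numpy as np

def significance_thresholds(control_scores, low=2.5, high=97.5):
    """Empirical significance thresholds from the control-set mutation
    scores: variants scoring outside the [2.5th, 97.5th] percentile band
    of the control distribution are called functional."""
    return (float(np.percentile(control_scores, low)),
            float(np.percentile(control_scores, high)))
```

A variant whose score falls below the lower or above the upper threshold would then be retained as a candidate functional variant.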
The beneficial effects of the present invention are verified by comparative experiments as follows.
The data used in this experiment were extracted from the human Encyclopedia of DNA Elements (ENCODE) database and comprise the enhancer activity datasets of six cell lines, together with their corresponding 11 histone ChIP-seq and DNase-seq datasets, for lung cancer human alveolar basal epithelial cells (A549), human B lymphocytes (GM12878), human colon cancer cells (HCT116), human breast cancer cells (MCF-7), human hepatoma cells (HepG2), and human chronic myelogenous leukemia cells (K562). Predictions were compared among BPNet (method 1), FCNsignal (method 2), and model one of the method of the present invention, where the invention's model was trained both conventionally (strategy 1) and with meta-learning (strategy 2). Table 1 reports the chromatin feature prediction results.
TABLE 1
Cell line | A549 | GM12878 | HCT116 | MCF-7 | HepG2 | K562 |
---|---|---|---|---|---|---|
Method 1 | 0.865 | 0.843 | 0.823 | 0.812 | 0.846 | 0.821 |
Method 2 | 0.854 | 0.831 | 0.810 | 0.808 | 0.814 | 0.816 |
The method 1 of the invention | 0.882 | 0.877 | 0.843 | 0.853 | 0.870 | 0.867 |
The method 2 of the invention | 0.895 | 0.884 | 0.879 | 0.890 | 0.883 | 0.896 |
As can be seen from Table 1, compared with the existing deep learning methods (method 1 and method 2), the method of the present invention achieves higher prediction accuracy (Pearson correlation coefficient, PCC) on all six cell line datasets in the experiment, indicating that it has stronger chromatin feature prediction capability. Moreover, the meta-learning strategy (the method 2 of the invention) performs better than conventional training (the method 1 of the invention), showing that meta-learning can effectively capture the intrinsic relationships between different chromatin features and thereby achieve better predictions.
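The accuracy metric reported in Tables 1 and 2 is the Pearson correlation coefficient between predicted and measured signals; a minimal implementation of that metric:

```python
import numpy as np

def pcc(y_true, y_pred):
    """Pearson correlation coefficient, the accuracy metric of Tables 1-2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    yt = y_true - y_true.mean()
    yp = y_pred - y_pred.mean()
    return float((yt * yp).sum() / np.sqrt((yt ** 2).sum() * (yp ** 2).sum()))
```

Perfectly linearly related signals give a PCC of 1, so the values near 0.9 in the tables indicate a strong linear agreement between predicted and observed profiles.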
Using the same datasets, model two was used to predict enhancer activity, with the chromatin features input to model two being those predicted by model one. Prediction was compared between BPNet (method 1), FCNsignal (method 2), and model one (the method 1 of the invention) and model two (the method 2 of the invention); the results are shown in Table 2, the enhancer activity prediction results table.
TABLE 2
Cell line | A549 | GM12878 | HCT116 | MCF-7 | HepG2 | K562 |
---|---|---|---|---|---|---|
Method 1 | 0.893 | 0.879 | 0.891 | 0.893 | 0.904 | 0.908 |
Method 2 | 0.881 | 0.874 | 0.884 | 0.883 | 0.886 | 0.897 |
The method 1 of the invention | 0.912 | 0.908 | 0.916 | 0.913 | 0.919 | 0.921 |
The method 2 of the invention | 0.943 | 0.939 | 0.948 | 0.951 | 0.932 | 0.941 |
As can be seen from Table 2, compared with the existing deep learning methods (method 1 and method 2), the method of the present invention achieves higher prediction accuracy (PCC) on all six cell line datasets, indicating stronger enhancer activity prediction capability. Notably, prediction using multiple chromatin features (the method 2 of the invention) performs better than prediction using DNA sequence alone (the method 1 of the invention), demonstrating the effectiveness of the "two-step" strategy proposed in the present invention: first predicting chromatin features with model one, then predicting enhancer activity from the predicted chromatin features with model two.
It can be concluded that, compared with existing enhancer activity prediction methods, the method of the invention achieves higher prediction accuracy and can further predict the effect of variants on the basis of multiple chromatin features.
The present invention calculates a mutation score to infer the effect of a mutation on enhancer activity. The wild type (WT) is defined as the reference sequence (ref), and the variant is defined as the sequence containing the alternative allele (alt). The mutation score is obtained by subtracting the reference sequence signal from the alternative allele signal (panel D); strictly, it is defined as y_alt − y_ref. With these scores, the invention analyzes the relationship between the effect of a variant and its position, selecting an empirical threshold in order to study the proportion of functional variants in potential enhancer regions. Specifically, the invention randomly selects 10,000 DNA sequences located in open chromatin regions as a control set, then calculates the average mutation score of the 3 potential substitutions at each base position; after repeating this process for every sequence in the control set, a set of non-enhancer-region mutation scores is obtained, and finally the 2.5th and 97.5th percentiles of its empirical distribution are determined as the significance thresholds. With these significance thresholds, variants in enhancer regions were analyzed. As shown in Table 3, CTCF has the highest proportion of potentially functional variants and YY1 the lowest. Overall, these results demonstrate the strong performance of the invention in predicting the effect of variants on enhancer activity.
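The y_alt − y_ref score can be sketched as follows; `predict_activity` is a deterministic toy stand-in for the invention's two-model pipeline, not the actual predictor:

```python
# Toy stand-in for the two-stage predictor (model one + model two);
# in the patent this maps a 1001-bp sequence to an enhancer-activity signal.
def predict_activity(seq):
    return sum((ord(c) % 5) * 0.1 for c in seq)  # deterministic placeholder

def mutation_score(ref_seq, pos, alt_base):
    """Score a variant as y_alt - y_ref: the change in predicted
    enhancer activity when one base is substituted in silico."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return predict_activity(alt_seq) - predict_activity(ref_seq)
```

A positive score indicates the variant is predicted to increase enhancer activity, a negative score to decrease it; substituting a base with itself yields a score of zero by construction.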
TABLE 3
In summary, the invention designs a deep learning model based on multiple chromatin features to predict the effect of variants on enhancer activity. Unlike existing methods, which rely only on DNA sequence information, two models are constructed: the first predicts chromatin features from the DNA sequence, and the second predicts enhancer activity from the predicted chromatin features. The results of the invention can be applied in the biomedical field; in addition, functional variants are identified by threshold screening.
Claims (10)
1. A method for predicting the effect of non-coding variations on enhancer activity based on multiple phases, comprising the steps of:
S1, acquiring enhancer-related features, and preprocessing the related features;
S2, constructing and training a meta-learning-based chromatin feature prediction model on the preprocessed enhancer-related features;
S3, constructing a feature fusion model based on an autoencoding generative adversarial network, and obtaining a joint characterization of the fused multi-chromatin features from the feature fusion model;
S4, constructing and training an enhancer activity prediction model based on multi-chromatin feature fusion according to the joint characterization and the chromatin feature prediction model parameters;
S5, predicting the effect of variation on enhancer activity using the chromatin feature prediction model and the enhancer activity prediction model;
S6, screening functional variants according to the effect of the variation on enhancer activity.
2. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 1, wherein step S1 comprises the steps of:
S101, acquiring an enhancer-related feature dataset;
S102, preprocessing the enhancer-related feature dataset to obtain positive and negative training samples, wherein the positive and negative training samples comprise DNA sequences and the corresponding plurality of chromatin features;
S103, fixing the sequence length based on the positive and negative training samples to obtain DNA sequences of length 1001 bp;
S104, dividing the DNA sequence into k-mer base segments, encoding each base segment with one-hot encoding, and learning a distributed representation of the base segments by Word2vec, wherein the chromatin features are scaled by log2(1+x), where x represents the chromatin feature value and log2(·) denotes the base-2 logarithm;
S105, partitioning the enhancer-related feature dataset to complete the data preprocessing.
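A small sketch of the preprocessing in steps S103–S104; overlapping k-mers are assumed here, since the claim does not fix the stride, and the Word2vec embedding step is omitted:

```python
import numpy as np

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mer tokens; in the patent
    these tokens are then embedded with Word2vec."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def scale_chromatin(x):
    """log2(1 + x) scaling of chromatin feature values, as in step S104."""
    return np.log2(1.0 + np.asarray(x, dtype=float))

tokens = kmer_tokens("ACGTAC", k=3)        # ['ACG', 'CGT', 'GTA', 'TAC']
scaled = scale_chromatin([0.0, 1.0, 3.0])  # [0., 1., 2.]
```

The log2(1+x) transform compresses the heavy-tailed read-count signals so that a few very high peaks do not dominate training.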
3. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 2, wherein step S2 comprises the steps of:
S201, constructing a chromatin feature prediction model based on the preprocessed enhancer-related features;
S202, updating the chromatin feature prediction model parameters;
S203, fine-tuning the model obtained by meta-learning on each task in the chromatin feature prediction model, wherein a task is one chromatin feature;
S204, inputting the DNA sequence into the fine-tuned model to obtain a trained chromatin feature prediction model.
4. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 3, wherein step S202 comprises the steps of:
S2021, training the meta-learning model with a training set obtained by partitioning the dataset, wherein the training set comprises a query set and a support set;
S2022, initializing the chromatin feature prediction model parameters from a normal distribution;
S2023, looping over epochs;
S2024, randomly sampling tasks to form a batch;
S2025, iterating over the batch, training the chromatin feature prediction model with the support set of each task to obtain a set of parameters and complete the first parameter update;
S2026, calculating a loss value for each task in the batch with the query set, summing the loss values, performing stochastic gradient descent on the gradient to complete the second parameter update, thereby updating the chromatin feature prediction model parameters, and returning to step S2023.
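The two-level update of steps S2025–S2026 follows the MAML pattern (inner update per task on the support set, outer update on the summed query-set losses). Below is a first-order toy sketch in which the chromatin-feature tasks are replaced by synthetic one-parameter least-squares problems; it illustrates only the update structure, not the real model:

```python
import numpy as np

def loss_grad(w, x, y):
    """Mean-squared-error loss and its gradient for a 1-D linear model."""
    pred = x * w
    return float(np.mean((pred - y) ** 2)), float(np.mean(2 * (pred - y) * x))

def maml_step(w, tasks, inner_lr=0.05, outer_lr=0.05):
    outer_grad = 0.0
    for (xs, ys), (xq, yq) in tasks:           # (support, query) per task
        _, g = loss_grad(w, xs, ys)            # first update (S2025)
        w_task = w - inner_lr * g
        _, gq = loss_grad(w_task, xq, yq)      # query-set gradient (S2026)
        outer_grad += gq                       # first-order approximation
    return w - outer_lr * outer_grad           # second update

rng = np.random.default_rng(2)
tasks = []
for w_true in (1.0, 1.5, 2.0):                 # each "task" = one target slope
    x = rng.normal(size=(2, 16))
    tasks.append(((x[0], w_true * x[0]), (x[1], w_true * x[1])))

w = 0.0
for _ in range(100):
    w = maml_step(w, tasks)
```

After training, `w` settles near a compromise between the task optima, i.e. an initialization from which each task can be reached in a single inner step, which is exactly what the meta-learned initialization in the claim provides for the chromatin-feature tasks.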
5. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 4, wherein the expression of the meta-learning objective function is as follows:
6. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 5, wherein the chromatin feature prediction model comprises:
a coding module for extracting enhancer DNA sequence features using a multi-layer involution network;
a self-attention module for extracting dependency relationships within higher-order features based on the enhancer DNA sequence features;
and a decoding and skip-connection module for up-sampling by quadratic interpolation based on the dependency relationships and performing feature fusion with a convolutional neural network.
7. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 6, wherein the coding module comprises:
the first involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 1001 × 32;
the first max-pooling layer, for feature sampling of the enhancer DNA sequence feature map output by the first involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 500 × 32;
the second involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 500 × 32;
the second max-pooling layer, for feature sampling of the enhancer DNA sequence feature map output by the second involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 250 × 32;
the third involution layer, with an involution kernel size K of 7, a stride of 1, 8 channels per group, 4 channel groups in total and a channel reduction ratio of 4, the output enhancer DNA sequence feature map having dimension 250 × 32;
the third max-pooling layer, for feature sampling of the enhancer DNA sequence features output by the third involution layer, with a pooling window of size 2 and a stride of 2, the output enhancer DNA sequence feature map having dimension 125 × 32;
the self-attention module comprises:
the first attention layer, with 8 heads in the attention mechanism and a feed-forward neural network dimension of 64, the output enhancer DNA sequence feature map having dimension 125 × 32;
the second attention layer, with 8 heads in the attention mechanism and a feed-forward neural network dimension of 64, the output enhancer DNA sequence feature map having dimension 125 × 32;
the decoding and skip-connection module comprises:
the first sampling layer, which adds the enhancer DNA sequence features to the dependency relationships output by the self-attention module, batch-normalizes the sum and inputs the normalized features into the first up-sampling layer; the up-sampled dimension is 250 and the output feature map dimension is 250 × 32;
the first convolution layer, with 32 filters, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 250 × 32;
the second sampling layer, which adds the feature map output by the first convolution layer to the enhancer DNA sequence features output by the second max-pooling layer, batch-normalizes the sum and inputs the result into the second up-sampling layer; the up-sampled dimension is 497 and the output feature map dimension is 500 × 32;
the second convolution layer, with 4 filters, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 500 × 4;
the third sampling layer, which adds the feature map output by the second convolution layer to the enhancer DNA sequence features output by the third max-pooling layer, batch-normalizes the sum and inputs the result into the third up-sampling layer; the up-sampled dimension is 1001 and the output feature map dimension is 1001 × 4;
and the third convolution layer, with 1 filter, a convolution kernel size of 5, a stride of 1, an ELU activation function and "same" convolution, the output feature map having dimension 1001 × 1.
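The feature-map lengths claimed above (1001 → 500 → 250 → 125 through the encoder) follow from window-2, stride-2 pooling; a quick check of that arithmetic:

```python
def pool_len(n, window=2, stride=2):
    """Output length of a 1-D max-pooling layer (floor convention)."""
    return (n - window) // stride + 1

lengths = [1001]
for _ in range(3):
    lengths.append(pool_len(lengths[-1]))
# lengths == [1001, 500, 250, 125], matching the claimed feature-map sizes
```

The decoder's up-sampling layers then restore the length back to 1001 so that the final 1001 × 1 output assigns one predicted chromatin-feature value to each input base.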
8. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 7, wherein the loss function of the feature fusion model in step S3 is expressed as follows:
L = λ1L1 + λ2L2
yn = ReLU(Gn(z|ΘGn))
z = S([z1, z2, ..., zn, ..., zm]|ΘS)
wherein L represents the loss function of the feature fusion model, λ1 and λ2 both represent weights, L1 represents the reconstruction error, L2 represents the distribution error, xn represents the initial feature vector of the n-th chromatin feature, yn represents the n-th chromatin feature reconstructed by the feature fusion model, ReLU(·) represents the ReLU activation function, Gn and ΘGn respectively represent the feature fusion model function and parameters corresponding to the n-th independent decoding layer, z represents the low-dimensional latent feature, ‖·‖ represents the L2 norm, S and ΘS respectively represent the feature fusion model function and parameters corresponding to the feature sharing layer, zn represents the low-dimensional latent feature of the n-th chromatin feature, zm represents the low-dimensional latent feature of the m-th chromatin feature, E_{z'~p(z)}[log(D(z'|ΘD))] represents the mathematical expectation of log(D(z'|ΘD)) when the sample z' is drawn from the distribution p(z), where E denotes mathematical expectation, p denotes a Gaussian distribution, log(·) denotes taking the logarithm, the distribution of z' obeys the Gaussian distribution, D and ΘD respectively represent the feature fusion model function and parameters corresponding to the discriminator, E_{z~Q(z|ΘQ)}[log(1−D(z|ΘD))] represents the mathematical expectation of log(1−D(z|ΘD)) when z is drawn from the data distribution Q(z|ΘQ), E_{z~Q(z|ΘQ)}[log(D(z|ΘD))] represents the mathematical expectation of log(D(z|ΘD)) when z is drawn from the data distribution Q(z|ΘQ), and Q and ΘQ respectively represent the model function and parameters corresponding to the generator.
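A numeric sketch of how the claim-8 loss terms combine for the adversarial autoencoder: L1 is the reconstruction error over the m chromatin features, L2 the adversarial term that pushes the shared latent z toward a Gaussian prior. All networks are replaced by trivial stand-ins and the λ values are arbitrary; only the wiring L = λ1·L1 + λ2·L2 and the sign structure of the adversarial term are illustrated:

```python
import numpy as np

rng = np.random.default_rng(3)

def l2_recon(xs, ys):
    """L1: summed squared L2 reconstruction error over the m features."""
    return sum(float(np.sum((x - y) ** 2)) for x, y in zip(xs, ys))

def adv_loss(d_real, d_fake):
    """L2 (discriminator view): E_{z'~p(z)} log D(z') + E_{z~Q} log(1 - D(z));
    the generator, symmetrically, maximizes E_{z~Q} log D(z)."""
    return float(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

xs = [rng.normal(size=8) for _ in range(3)]   # m = 3 chromatin feature vectors
ys = [x * 0.9 for x in xs]                    # stand-in reconstructions
d_real = rng.uniform(0.6, 0.9, size=16)       # D's outputs on prior samples z'
d_fake = rng.uniform(0.1, 0.4, size=16)       # D's outputs on encoded samples z

lam1, lam2 = 1.0, 0.1                         # arbitrary illustrative weights
L = lam1 * l2_recon(xs, ys) + lam2 * adv_loss(d_real, d_fake)
```

The reconstruction term is always non-negative, while the adversarial log-terms are negative for discriminator outputs in (0, 1); training trades the two off through λ1 and λ2.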
9. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 8, wherein step S5 comprises the steps of:
S501, modifying part of the base information in a DNA sequence by simulated mutation;
S502, predicting the chromatin features Cref and Calt of Sref and Salt using the chromatin feature prediction model, wherein Sref and Salt respectively represent the DNA sequences before and after the simulated mutation;
S503, inputting the chromatin features Cref and Calt and the DNA sequence features into the feature fusion model to obtain the joint characterization;
S504, inputting the joint characterization obtained in step S503 into the enhancer activity prediction model to obtain the enhancer activities Yref and Yalt before and after the mutation;
S505, calculating the effect of the variation on enhancer activity according to the enhancer activities Yref and Yalt.
10. The method of predicting the effect of multi-stage based non-coding variation on enhancer activity according to claim 9, wherein step S6 comprises the steps of:
S601, randomly selecting a plurality of non-enhancer regions located in open chromatin regions as a control group;
S602, for each base of each DNA sequence in the control group, obtaining a set S of variant effects for the control group according to the effect of variation on enhancer activity;
S603, determining the 2.5th percentile and the 97.5th percentile of the set S as the empirical significance thresholds;
S604, determining the functional variants according to the empirical significance thresholds.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310122535.5A CN116312765A (en) | 2023-02-15 | 2023-02-15 | Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116312765A true CN116312765A (en) | 2023-06-23 |
Family
ID=86791517
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310122535.5A Pending CN116312765A (en) | 2023-02-15 | 2023-02-15 | Multi-stage-based prediction method for influence of non-coding variation on activity of enhancer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116312765A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116884495A (en) * | 2023-08-07 | 2023-10-13 | 成都信息工程大学 | Diffusion model-based long tail chromatin state prediction method |
CN116884495B (en) * | 2023-08-07 | 2024-03-08 | 成都信息工程大学 | Diffusion model-based long tail chromatin state prediction method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||