CN115810398A

CN115810398A - TF-DNA binding identification method based on multi-feature fusion

Info

Publication number: CN115810398A
Application number: CN202211696499.5A
Authority: CN
Inventors: 张永清; 邹权; 刘宇航; 吴锡; 王紫轩; 熊术文; 王茂丞; 喻云; 林天华; 向艳辉
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2022-10-21
Filing date: 2022-12-28
Publication date: 2023-03-17

Abstract

The invention discloses a TF-DNA binding recognition method based on multi-feature fusion, comprising the following steps: S1: acquiring five kinds of raw data related to transcription factor binding in common human tissues; S2: preprocessing the data; S3: encoding the DNA sequence data in the preprocessed raw data; S4: normalizing the other data in the preprocessed raw data; S5: performing global dependency extraction, feature extraction and feature fusion on the encoded DNA sequence data and the normalized data using a multi-feature fusion self-attention mechanism and a convolutional neural network to obtain an overall feature map combination; S6: training the multi-feature fusion TF-DNA binding recognition model on the overall feature map combination to obtain a trained multi-feature fusion TF-DNA binding recognition model; S7: recognizing the data to be recognized using the trained multi-feature fusion TF-DNA binding recognition model.

Description

TF-DNA binding identification method based on multi-feature fusion

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a TF-DNA binding identification method based on multi-feature fusion.

Background

Precision medicine is a next-generation approach to disease prevention and treatment that takes into account individual differences in genes, environment and lifestyle. Compared with traditional diagnosis and treatment, it places more weight on the underlying characteristics of a disease: it seeks to find the causes and therapeutic targets of the disease precisely, to classify its stages of progression accurately, and ultimately to deliver personalized, precise treatment for specific diseases and patients, improving the benefit of prevention, diagnosis and treatment. The key is to identify the various gene transcription regulatory elements and analyze their relationships with chromatin features, so that human life activities can be understood at the molecular level.

A transcription factor (TF) is a protein with a specific function: it controls transcription by binding specifically to DNA sequences, thereby regulating gene expression. Numerous studies have demonstrated that TFs play a crucial role in human physiological processes because of their broad tissue-specific binding. They regulate the differential expression of genes across tissues, influence the onset and progression of human diseases, and guide the activities of cells within tissues. Studying TF-DNA binding, and through it the mechanisms of tissue-specific binding, is therefore crucial for understanding how TFs participate in transcriptional regulation, for exploring gene function, and for understanding cellular activities in different tissues.

With the spread of high-throughput sequencing technology, the large amount of accumulated experimental data has laid a foundation for studying gene transcription recognition by computational methods, overcoming the time and labor cost of traditional biological experiments. An experimental workflow in which gene transcription is first predicted computationally and then verified by biological experiment is gradually becoming mainstream. DeepBind used a convolutional neural network and pioneered the prediction of transcription factor binding sites by deep learning. However, because many TFs do not bind to DNA sequences in a sequence-specific manner, sequence-based methods produce a large number of false positives. A number of improved methods have since been proposed, among which gene transcription recognition methods based on multi-feature fusion have become the mainstream of research in this field.

Integrated analysis of multiple features gives a more comprehensive view of the biological process and therefore better predictions of TF-DNA binding. Studies have shown that the binding of transcription factors to DNA depends not only on the DNA sequence but also on features such as DNA shape, histone modification and chromatin accessibility.

Most existing studies focus only on DNA sequences and do not analyze the various features systematically. Furthermore, studies of transcription factors at the tissue level are currently limited. How to construct a high-performance, robust gene transcription recognition model based on multiple features, how to deal effectively with small sample sizes, how to handle the high-dimensional heterogeneity among multi-omics data, and how to represent the features reasonably all remain open problems. In addition, solving the "black box" problem of the model, that is, providing an effective interpretability method that identifies the regions influencing the model's decisions, is very important for understanding the biological process.

Disclosure of Invention

The invention aims to provide a TF-DNA binding identification method based on multi-feature fusion that effectively addresses small sample sizes and the high-dimensional heterogeneity among multi-omics data, and that represents the features reasonably.

The technical scheme for solving the technical problems is as follows:

The invention provides a TF-DNA binding recognition method based on multi-feature fusion, which comprises the following steps:

S1: acquiring five kinds of raw data related to transcription factor binding in common human tissues;

S2: preprocessing the data to obtain preprocessed raw data;

S3: encoding the DNA sequence data in the preprocessed raw data to obtain encoded DNA sequence data;

S4: normalizing the other data in the preprocessed raw data to obtain normalized data;

S5: performing global dependency extraction, feature extraction and feature fusion on the encoded DNA sequence data and the normalized data using a multi-feature fusion attention mechanism and a convolutional neural network to obtain an overall feature map combination;

S6: training the multi-feature fusion TF-DNA binding recognition model on the overall feature map combination using a transfer learning method to obtain a trained multi-feature fusion TF-DNA binding recognition model;

S7: recognizing the data to be recognized using the trained multi-feature fusion TF-DNA binding recognition model to obtain a recognition result.

Optionally, in step S2, the preprocessed raw data includes: DNA sequence data and its corresponding DNA shape data, chromatin accessibility data, histone modification data, and conservation data.

Optionally, the step S2 includes:

S21: extracting the DNA sequence data in the raw data using a ChIP-seq data set;

S22: processing the DNA sequence data using GKM-SVM to obtain positive samples and negative samples;

S23: acquiring three types of DNA shape data for the positive and negative samples;

S24: extracting the chromatin accessibility data and histone modification data corresponding to the DNA sequence data;

S25: generating conservation data for the positive and negative samples using phastCons100way according to the gene coordinates of the DNA sequence data;

S26: outputting the DNA sequence data, the DNA shape data, the chromatin accessibility data, the histone modification data and the conservation data as the preprocessed raw data, wherein the chromatin accessibility data, the histone modification data and the conservation data are the other data.

Optionally, the step S22 includes:

S221: for each data set, taking the 101 bp sequence centered on the peak, extended according to the gene coordinates, as a positive sample;

S222: selecting regions of the whole genome with GC content similar to that of the positive samples as the negative samples.

Optionally, the three types of DNA shape data include:

A: inter-bp: HelT, Rise, Roll, Shift, Slide and Tilt;

B: intra-bp: Buckle, Opening, ProT, Shear, Stagger and Stretch;

C: MGW and EP.

Optionally, the step S3 includes:

S31: dividing both the positive samples and the negative samples into a plurality of k-mer base segments of equal length;

S32: encoding the k-mer base segments using one-hot encoding to obtain a plurality of encoded k-mer base segments;

S33: converting the high-dimensional, sparse one-hot codes into distributed codes according to a word2vec strategy;

S34: representing the encoded k-mer base segments with the distributed codes to obtain the encoded DNA sequence data.

Optionally, in step S4, the normalized data is:

$S_m = [m_1, m_2, \ldots, m_i, \ldots, m_{100}, m_{101}]$

$S_h = [h_1, h_2, \ldots, h_i, \ldots, h_{100}, h_{101}]$

$S_d = [d_1, d_2, \ldots, d_i, \ldots, d_{100}, d_{101}]$

$S_c = [c_1, c_2, \ldots, c_i, \ldots, c_{100}, c_{101}]$

where $S_m$, $S_h$, $S_d$ and $S_c$ denote the feature matrices of DNA shape, chromatin accessibility, histone modification and conservation score, respectively, corresponding to sequence S, and $m_i$, $h_i$, $d_i$ and $c_i$ denote the DNA shape, chromatin accessibility, histone modification and conservation score, respectively, corresponding to the i-th base of the DNA sequence.

Optionally, in the step S5, the multi-feature fusion convolutional neural network comprises a multi-head self-attention module, a first feature fusion module, a convolutional feedforward neural network module, a second feature fusion module, convolutional layer one, a max pooling layer, convolutional layer two, a global pooling layer, a fully connected layer and a ReLU activation layer, arranged in sequence, wherein:

the multi-head self-attention module is used for receiving the encoded DNA sequence data/normalized data and capturing the long-range and short-range dependencies of the different features in the encoded DNA sequence data/normalized data;

the first feature fusion module is used for fusing the encoded DNA sequence data/normalized data with the long-range and short-range dependencies to obtain a first fusion result;

the convolutional feedforward neural network module is used for performing feature enhancement on the first fusion result to obtain an enhancement result;

the second feature fusion module is used for fusing the first fusion result and the enhancement result to obtain a global dependency extraction result;

convolutional layer one is used for extracting initial features from the global dependency extraction result and passing the initial features to the max pooling layer;

the max pooling layer is used for mapping the initial features to a feature map;

convolutional layer two is used for performing a convolution on the feature map to obtain a computed feature map;

the global pooling layer is used for sampling the computed feature map to obtain a feature extraction result;

and the fully connected layer and the ReLU activation layer are used for fusing the global dependency extraction result and the feature extraction result to obtain the overall feature map combination.

Optionally, in the step S6, the transfer learning method includes a weak transfer learning method and a strong transfer learning method, and the weak transfer learning method includes:

A1: for transcription factor A in tissue T (denoted AT), pre-training the multi-feature fusion TF-DNA binding recognition model on the remaining data sets until its loss no longer decreases;

A2: fine-tuning the multi-feature fusion TF-DNA binding recognition model on the data set of AT until the loss no longer decreases;

A3: outputting the multi-feature fusion TF-DNA binding recognition model;

the strong transfer learning method includes:

B1: for a given transcription factor A, pre-training the multi-feature fusion TF-DNA binding recognition model on the data sets of A from all tissues until its loss no longer decreases;

B2: freezing all network structures except the fully connected layer;

B3: fine-tuning the multi-feature fusion TF-DNA binding recognition model on the target tissue data until the loss no longer decreases;

B4: outputting the multi-feature fusion TF-DNA binding recognition model.
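The strong transfer learning steps can be sketched without any particular framework. The following is a minimal illustration, not the patented implementation: the callbacks `pretrain_epoch` and `finetune_epoch`, the layer names, and the rule "stop at the first epoch whose loss is not lower than the best so far" are all assumptions introduced here to make "until the loss no longer decreases" concrete.

```python
def train_until_plateau(run_epoch, max_epochs=100):
    """Call run_epoch() repeatedly; stop at the first epoch whose loss does not improve."""
    history = []
    best = float("inf")
    for _ in range(max_epochs):
        loss = run_epoch()
        history.append(loss)
        if loss >= best:  # loss no longer decreases
            break
        best = loss
    return history


def trainable_flags(layer_names, keep_trainable):
    """Map each layer name to True (trainable) or False (frozen)."""
    return {name: name in keep_trainable for name in layer_names}


def strong_transfer(pretrain_epoch, finetune_epoch, layer_names):
    # B1: pre-train on the data of one TF from all tissues.
    pretrain_history = train_until_plateau(pretrain_epoch)
    # B2: freeze everything except the fully connected layer
    # ("fc" is a hypothetical layer name).
    flags = trainable_flags(layer_names, {"fc"})
    # B3: fine-tune on the target tissue; the callback receives the flags
    # and is responsible for applying them to its own model.
    finetune_history = train_until_plateau(lambda: finetune_epoch(flags))
    # B4: the caller keeps the fine-tuned model; only the histories are returned here.
    return pretrain_history, flags, finetune_history
```

In a real training script the two callbacks would each run one epoch of optimization and return the epoch loss; the weak strategy differs only in which data sets feed the pre-training callback and in not freezing any layers.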

The invention also provides a TF-DNA binding recognition system comprising a processor, a memory and a computer program stored in the memory, wherein the computer program, when run on the processor, executes the TF-DNA binding recognition method based on multi-feature fusion.

The invention has the following beneficial effects:

1) The method represents the different kinds of data using word2vec and feature matrices, which effectively resolves the high-dimensional heterogeneity among them and enables data fusion;

2) The method uses a self-attention mechanism and convolution, with two branches that learn the DNA sequence and the other feature information in parallel; this avoids confusion between the features of the two branches, fully extracts the latent features of TF-DNA binding, and markedly improves model performance;

3) The method can analyze the importance of the various features for TF-DNA binding, giving a better understanding of tissue-specific TF-DNA binding;

4) The invention can identify key regions of the genome that influence TF-DNA binding, and these regions can be analyzed to explore gene transcription mechanisms;

5) The invention advances the prediction of transcription factor binding sites (TFBSs): it overcomes the inability of prior work to fully consider both local and global dependencies, and takes global information fully into account;

6) The transfer-learning-based model pre-training method is portable to a certain degree and can be applied where little data is available, in effect expanding the data set;

7) By training a neural network on known gene sequence data, the invention predicts whether unverified data exhibit an interaction, which guides biological experiments and effectively reduces experimental time and cost.

Drawings

FIG. 1 is a flow chart of the TF-DNA binding recognition method based on multi-feature fusion according to the present invention;

FIG. 2 is a data flow diagram of the DNA sequence and other feature codes proposed by the present invention;

FIG. 3 is a schematic diagram of a neural network structure based on TF-DNA binding recognition of multi-feature fusion constructed in an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a C-Transformer according to the present invention;

FIG. 5 is a graph showing the results of the experiment in example 2 of the present invention;

FIG. 6 is a schematic diagram of attention score extraction in example 3 of the present invention;

FIG. 7 shows the attention visualization result in example 4 of the present invention.

Detailed Description

The principles and features of this invention are described below in conjunction with the following drawings, which are set forth to illustrate, but are not to be construed to limit the scope of the invention.

The invention provides a TF-DNA binding recognition method based on multi-feature fusion which, referring to FIG. 1, comprises the following steps:

In the present invention, the TF-DNA binding sequences, histone modification data and chromatin accessibility data are obtained from the Encyclopedia of DNA Elements (ENCODE); DNA shape data are generated with DNAshapeR (a DNA shape generation tool), and conservation data are generated with phastCons100way. The original TF-DNA binding sequences are stored as gene coordinates in BED file format, and GKM-SVM is used to generate the sequences from those coordinates. Histone modification and chromatin accessibility data are stored in bigWig format (a binary file format), and deepTools is used to extract the features of the corresponding sequences. Finally, the five kinds of data are stored in FASTA format (a biological sequence storage file format), with records of the form "chromosome name: sequence position, [specific feature, tag (0 or 1)]".

S2: preprocessing the various data to obtain preprocessed original data;

Optionally, the step S2 includes:

S21: extracting the DNA sequence data from the raw data using a chromatin immunoprecipitation sequencing (ChIP-seq) data set;

S22: processing the DNA sequence data using GKM-SVM (a k-mer-based DNA sequence processing tool) to obtain positive samples and negative samples;

Optionally, the step S22 includes:

S221: for each ChIP-seq data set, a 101 bp sequence centered on the peak is taken according to the gene coordinates. If the sequence length L is less than 101 bp, (101-L)/2 bases are added at each of the front and rear ends; if the length N available at the front end is less than (101-L)/2, N bases are added at the front end and 101-L-N bases at the rear end. If the sequence length L exceeds 101 bp, only the 101 bp centered on the peak are retained.

S222: selecting regions of the whole genome with guanine-cytosine (GC) content similar to that of the positive samples as the negative samples.
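The 101 bp windowing rule for positive samples can be sketched as a small coordinate helper. This is one reading of the rule, under stated assumptions: coordinates are 0-based half-open intervals, "N" is taken to be the number of bases available upstream before coordinate 0, an odd shortfall puts the extra base at the rear, and the chromosome end is not checked. The function name `fit_window` is hypothetical.

```python
TARGET = 101  # window length in bp, as stated in the text


def fit_window(start, end, peak, target=TARGET):
    """Return (new_start, new_end) of length `target` for the region [start, end).

    start, end: 0-based half-open coordinates of the peak region
    peak: 0-based coordinate of the peak summit (used only when trimming)
    """
    length = end - start
    if length > target:
        # Region too long: keep only `target` bases centred on the peak.
        new_start = max(0, peak - target // 2)
        return new_start, new_start + target
    missing = target - length
    # Extend by half of the shortfall in front, limited by the N bases available.
    front = min(missing // 2, start)
    # The rest of the shortfall (101 - L - front) goes to the rear.
    back = missing - front
    return start - front, end + back
```

For example, `fit_window(10, 61, 30)` has only 10 bases upstream, so it extends 10 bases in front and 40 behind, giving `(0, 101)`.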

The numbers of positive and negative TF-DNA binding samples are imbalanced, which easily degrades model performance. The invention therefore balances the sample data so that the numbers of positive and negative samples are approximately equal.

1) Over-sampling of positive samples

i. The features of both the forward and reverse DNA strands are extracted, expanding a data set of m positive samples to 2m.

2) Down-sampling of negative samples

i. n regions with GC content similar to that of the positive samples are selected from the whole genome as negative samples, and 2m sequences are randomly selected from these n negative samples;

ii. the numbers of positive and negative samples are thereby equalized.
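The balancing procedure can be illustrated as follows. This sketch assumes that "reverse strand" means the reverse complement of each positive sequence and that the negative pool holds at least 2m sequences; the function names and the fixed random seed are introduced here for illustration only.

```python
import random

# Complement table; N (padding/unknown) maps to itself.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse-strand sequence read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def balance(positives, negatives, seed=0):
    """Expand m positives to 2m and draw 2m negatives from the pool of n."""
    augmented = positives + [reverse_complement(s) for s in positives]
    rng = random.Random(seed)
    # random.sample raises ValueError if the pool is smaller than 2m.
    sampled = rng.sample(negatives, len(augmented))
    return augmented, sampled
```

With two positive sequences and a pool of ten negatives, `balance` returns four positives and four negatives.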

S23: acquiring three types of DNA shape data for the positive and negative samples, and normalizing the three types of DNA shape data to [0,1];

the three types of DNA shape data include:

A: inter-base-pair (inter-bp): Helix Twist (HelT), Rise, Roll, Shift, Slide and Tilt;

B: intra-base-pair (intra-bp): Buckle, Opening, Propeller Twist (ProT), Shear, Stagger and Stretch;

C: Minor Groove Width (MGW) and Electrostatic Potential (EP).
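The normalization of shape values to [0,1] in step S23 is not specified further in the text. A common choice, assumed here, is min-max scaling applied separately to each shape feature; the helper below is a sketch of that assumption, not the method confirmed by the source.

```python
def min_max(values):
    """Scale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant track carries no information; map it to all zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Whether the minimum and maximum are taken per sequence or over the whole data set is a design choice the text leaves open; scaling over the whole data set keeps values comparable across samples.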

S25: generating conservation data for the positive and negative samples using phastCons100way (a conservation score calculation tool) based on the gene coordinates of the DNA sequence data;

Here, different data are represented in different ways. Specifically, for DNA sequences, the Skip-Gram word2vec (word vector) strategy is used to obtain a distributed representation of the k-mer base segments, and the other data are encoded as feature matrices.

Optionally, the step S3 includes:

Referring to FIG. 2, the positive and negative sample sequences are divided into k-mer base segments, which are one-hot encoded; 'N' is used to pad the front and rear of each sequence so that the sequence lengths remain consistent.

The DNA sequence is encoded using k-mer-based one-hot encoding. For example, 3-mer partitioning of the DNA sequence "ATCG" yields the base segments (NAT, ATC, TCG, CGN).

Each base segment is represented by a one-hot code. Because each position can hold one of 5 nucleotides (A, G, C, T and N), the code has dimension $5^k$.
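A short sketch of the k-mer partitioning and one-hot encoding follows. It assumes an odd k, symmetric padding with k // 2 copies of 'N' on each side, and the alphabet order A, G, C, T, N; the index layout inside the one-hot vector is an arbitrary choice made here.

```python
ALPHABET = "AGCTN"  # five possible symbols per position


def kmers(seq, k=3):
    """Split seq into len(seq) overlapping k-mers, one centred on each base (odd k)."""
    pad = k // 2
    padded = "N" * pad + seq + "N" * pad
    return [padded[i:i + k] for i in range(len(seq))]


def one_hot_index(kmer):
    """Position of the single 1 in the 5**k-dimensional one-hot vector."""
    index = 0
    for base in kmer:
        index = index * len(ALPHABET) + ALPHABET.index(base)
    return index


def one_hot(kmer):
    """Return the one-hot vector of length 5**k for a k-mer."""
    vec = [0] * (len(ALPHABET) ** len(kmer))
    vec[one_hot_index(kmer)] = 1
    return vec


print(kmers("ATCG", 3))     # ['NAT', 'ATC', 'TCG', 'CGN']
print(len(one_hot("ATC")))  # 125
```

Because the number of k-mers equals the sequence length, a 101 bp sample yields 101 base segments, each a sparse vector of length $5^k$.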

To address the high-dimensional sparsity of this encoding, the Skip-Gram word2vec method is used to learn a low-dimensional representation of the base segments. The objective function L is:

$L = \sum_{w} \log p(\mathrm{context}(w) \mid w)$

where p denotes probability, w denotes any word in a sentence, and context(w) denotes the words surrounding w; that is, given the input word w, the predicted probability of its surrounding words context(w) should be as high as possible. For example, in the sentence "there is an apple on the table", given a word w (such as apple), the model should predict that the surrounding words context(w) include an, on, the, table and so on.

Using the Skip-Gram word2vec strategy to learn a distributed representation of the one-hot encoded k-mer base segments converts a one-dimensional DNA sequence into a two-dimensional matrix of shape (101, 16).

1) The Skip-Gram strategy takes one DNA base segment as input and predicts the probability of the surrounding base segments, thereby capturing higher-order correlations.

2) 5000 sequences are randomly sampled from each positive data set, forming a training set of approximately 430000 sequences.

3) The loss function is optimized by mini-batch gradient descent with batch size = 256, Adam as the optimizer and learning rate = 0.001. After about five epochs the loss no longer decreases and training is stopped; the weight matrix of the hidden layer is then taken as the distributed representation of the base segments.
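To make the Skip-Gram setup concrete, the sketch below generates the (centre, context) training pairs that such a model consumes, treating each k-mer base segment as a "word". The window size of 2 is an assumption, since the text does not state one, and the function name is hypothetical. Training the embedding itself (batch size 256, Adam, learning rate 0.001) is left to a deep learning framework.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every token and its neighbours within `window`."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs
```

Each pair contributes one term $\log p(\mathrm{context} \mid w)$ to the objective; after training, the hidden-layer weight row for a base segment is its 16-dimensional embedding.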

The other features are encoded as feature matrices. DNA shape is converted to a two-dimensional matrix of shape (101, 14), the 8 histone modifications to (101, 8), chromatin accessibility to (101, 1), and the conservation score to (101, 1).
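The four feature matrices have 14 + 8 + 1 + 1 = 24 columns in total, which matches the 101 × 24 input size quoted for the non-sequence branch. A minimal sketch of assembling that input is given below; the concatenation order and the dictionary keys are assumptions made here for illustration.

```python
SEQ_LEN = 101
# Column counts per feature type, as listed in the text.
WIDTHS = {"shape": 14, "histone": 8, "accessibility": 1, "conservation": 1}


def stack_features(tracks):
    """Concatenate the per-position rows of each track into one 101 x 24 matrix."""
    for name, width in WIDTHS.items():
        track = tracks[name]
        if len(track) != SEQ_LEN or any(len(row) != width for row in track):
            raise ValueError(f"{name} must have shape ({SEQ_LEN}, {width})")
    return [
        [value for name in WIDTHS for value in tracks[name][pos]]
        for pos in range(SEQ_LEN)
    ]
```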

The encoded DNA sequence data were:

$S_o = [o_1, o_2, \ldots, o_i, \ldots, o_{100}, o_{101}]$

where $o_i$ denotes the feature code of the i-th base segment.

The normalized data are as follows:

$S_m = [m_1, m_2, \ldots, m_i, \ldots, m_{100}, m_{101}]$

$S_h = [h_1, h_2, \ldots, h_i, \ldots, h_{100}, h_{101}]$

$S_d = [d_1, d_2, \ldots, d_i, \ldots, d_{100}, d_{101}]$

$S_c = [c_1, c_2, \ldots, c_i, \ldots, c_{100}, c_{101}]$

S5: performing global dependency extraction, feature extraction and feature fusion on the encoded DNA sequence data and the normalized data using the multi-feature fusion convolutional neural network to obtain an overall feature map combination;

Referring to FIG. 3 and FIG. 4, the multi-feature fusion convolutional neural network comprises a multi-head self-attention module, a first feature fusion module, a convolutional feedforward neural network module, a second feature fusion module, convolutional layer one, a max pooling layer, convolutional layer two, a global pooling layer, a fully connected layer and a ReLU (rectified linear unit) activation layer, arranged in sequence.

The multi-head self-attention module is used for receiving the encoded DNA sequence data/normalized data and capturing the long-range and short-range dependencies of the different features in the encoded DNA sequence data/normalized data.

To capture the long-range and short-range dependencies present in the different features, a Transformer-based self-attention mechanism is used, with the number n of stacked Transformer modules set to 2.

The convolutional feedforward neural network module is used for performing feature enhancement on the first fusion result to obtain an enhancement result.

The final DNA sequence feature map has dimensions 101 × 16, and the feature map of the other transcription factor binding features has dimensions 101 × 24.

In convolutional layer one, the number of filters is set to 128, the convolution kernel size is 1 × 8, and the stride of the convolution window is 1. This layer extracts 128 features from the input data, and the output feature maps of the DNA sequence and of the other features have dimensions 94 × 128.

The max pooling layer is used for mapping the initial features to a feature map; the max pooling window size is set to 1 × 2 with a stride of 2, and the output feature maps of the DNA sequence and of the other features have dimensions 47 × 64.

Convolutional layer two is used for performing a convolution on the feature map to obtain the computed feature map; convolutional layer two has 256 filters, a convolution kernel size of 1 × 8, a stride of 2 and a ReLU activation function, and the output feature maps of the DNA sequence and of the other features have dimensions 40 × 256.

The global pooling layer is used for sampling the computed feature map to obtain the feature extraction result; the sliding window of the global pooling layer has the same size as the whole feature map, so that each W × H × C input feature map is converted into a 1 × 1 × C output. Using GlobalMaxPooling2D in pytorch, the output feature map of the DNA sequence has dimensions 20 × 256 and that of the other features has dimensions 1 × 256.

The feature maps of the DNA sequence and of the other data, after the pooling layer, are fused by addition: in the fusion process, the data at each corresponding position of the DNA sequence data and the other TF binding data are added, and the dimension after addition is 1 × 512;

The fused feature map is passed to a regularization layer, where regularization is performed with a dropout function; the dropout probability is set to 0.2 in this embodiment. The regularized feature map is then passed to the fully connected layer, whose hidden layer outputs a feature map of dimension 1 × 64 and uses the ReLU activation function. The final output layer has dimension 1 × 2 and is activated with the softmax function.
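The sequence lengths quoted for the convolution and pooling stages can be checked with the usual output-length formula for a sliding window, $\lfloor (L - k) / s \rfloor + 1$. The sketch below assumes no padding, which the text does not state explicitly, and only covers the length axis (not the number of channels).

```python
def conv_out_len(length, kernel, stride=1):
    """Output length of an unpadded 1-D convolution or pooling window."""
    return (length - kernel) // stride + 1


after_conv1 = conv_out_len(101, kernel=8, stride=1)                   # 94
after_pool = conv_out_len(after_conv1, kernel=2, stride=2)            # 47
after_conv2_stride1 = conv_out_len(after_pool, kernel=8, stride=1)    # 40
after_conv2_stride2 = conv_out_len(after_pool, kernel=8, stride=2)    # 20
print(after_conv1, after_pool, after_conv2_stride1, after_conv2_stride2)
```

Under this assumption a kernel of width 8 applied to a length of 47 yields 40 with stride 1 and 20 with stride 2; both lengths appear in the description of the stages following convolutional layer two.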

Given an input feature matrix X, the multi-head attention mechanism can be expressed, in the standard Transformer form, as:

$\hat{X} = X + P$

$A = \mathrm{LM}(\mathrm{MultiHead}(\hat{X}) + \hat{X})$

where $\hat{X}$ denotes the feature matrix containing position information, A denotes the output of multi-head self-attention, P denotes the position information and is obtained through training, MultiHead(·) denotes the multi-head attention mechanism, and LM(·) denotes the layer normalization operation. MultiHead(·) can be expressed as:

$\mathrm{MultiHead}(\hat{X}) = \mathrm{Concat}(H_1, \ldots, H_h) W^O$

$H_i = \mathrm{softmax}\left(\frac{(\hat{X} W_i^Q)(\hat{X} W_i^K)^T}{\sqrt{d_k}}\right) \hat{X} W_i^V$

where $H_i$ denotes the output of the i-th attention head, there are h = dmodel//2 heads in total, dmodel denotes the dimension of the position vector, $W^O$ is a weight matrix of the model, and $W_i^Q$, $W_i^K$ and $W_i^V$ are the weight matrices corresponding to the i-th attention head $H_i$. Concat(·) denotes the concatenation operation, and $d_k$ is the scaling factor: the inner product of the queries and keys is divided by $\sqrt{d_k}$ to prevent it from becoming too large.
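The role of the scaling factor $d_k$ can be seen in a single attention head written out in plain Python. This is the standard scaled dot-product attention, $\mathrm{softmax}(QK^T / \sqrt{d_k})\,V$, shown as a sketch; the projection matrices, multiple heads and layer normalization are omitted, and all function names are introduced here.

```python
import math


def softmax(row):
    """Numerically stable softmax of a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    # Dividing by sqrt(d_k) keeps the inner products from growing with dimension.
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights
```

Each row of the returned weights sums to 1 and gives, for one position, how much every other position contributes; these are the scores that attention visualization inspects.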

A2: fine-tuning the multi-feature fusion TF-DNA binding recognition model on the data set of AT until the loss no longer decreases;

A3: outputting the multi-feature fusion TF-DNA binding recognition model;

the strong transfer learning method includes:

B1: for a given transcription factor A, pre-training the multi-feature fusion TF-DNA binding recognition model on the data sets of A from all tissues until its loss no longer decreases;

B2: freezing all network structures except the fully connected layer;

B4: outputting the multi-feature fusion TF-DNA binding recognition model.

S7: and identifying the data to be identified by utilizing the trained multi-feature fused TF-DNA combined identification model to obtain an identification result.

Example 2

The beneficial effects of the present invention are verified by comparative experiments as follows.

The data used in this experiment were extracted from public databases and contain a total of 5 TF-DNA binding data across 34 tissues. The method of the invention is evaluated in four variants: GHTNet (the model is not pre-trained with a transfer learning strategy), GHTNet-DNA (the model is not pre-trained with a transfer learning strategy and uses only DNA sequence data), GHTNet-transit one (the model is pre-trained with the weak transfer learning strategy) and GHTNet-transit two (the model is pre-trained with the strong transfer learning strategy). The method is compared comprehensively with other methods on three metrics: ACC, AUROC and AUPRC.

Table 1 TF-DNA binding recognition comparison results:

In summary, the method of the invention can accurately predict whether a sequence contains a transcription factor binding site. Its average accuracy over the 86 data sets reaches 92.54%, ahead of method two (84.77%), method three (84.86%), method one (84.50%), method four (84.30%) and method five (91.93%). Similarly, the method of the invention also achieves the best average AUROC and AUPRC among the compared methods.

In addition, models that combine CNN and RNN perform better than models based on CNN alone. On average, method three is 0.0041 (p = 2.30e-5) and 0.0035 (p = 3.10e-4) higher in AUROC than method one and method two, respectively. This indicates that combining CNN and RNN allows gene transcription to be recognized more accurately. When the method of the invention uses only DNA sequence data to predict TFBSs, it outperforms the sequence-based methods (method one, method two and method three) on every evaluation metric. This indicates that the proposed method outperforms existing RNN + CNN based methods. After the transfer learning method is incorporated, model performance improves further, which indicates that the same transcription factor shares many common characteristics across different tissues.

It can therefore be concluded that the method has higher prediction accuracy than existing TF-DNA binding prediction methods.

Example 3

In a third embodiment of the invention, the importance of the various features for TF-DNA binding is analyzed using the leave-one-feature-out method. The construction and training of the model are the same as in the first embodiment.

First, the invention analyzes the effect of the 13 DNA shape features and the EP feature on gene transcription recognition. Experiments show that transcription factor binding sites can be predicted from these 14 features alone, but performance is lower than that of models relying on DNA sequence (FIG. 5b): on average, AUROC decreases by 0.055 (p = 3.06e-19). This suggests that, although DNA shape is less important than DNA sequence for gene transcription binding recognition, it may help in recognizing tissue-specific binding of transcription factors.

Next, the invention fuses DNA sequence and DNA shape data. After fusing the DNA shapes, the AUROC of the model improved by 0.0027 (p = 5.32e-3) compared with the sequence-only model (fig. 5b). This indicates that fusing DNA shape helps to recognize tissue-specific binding of transcription factors. To further study the influence of DNA shape on transcription factor binding, the inventors divided the DNA shapes into three groups, intra-base, inter-base and MGW, and analyzed the importance of each group using the leave-one-out method. Overall, the three groups contribute differently to binding recognition: the contributions of intra-base, inter-base and MGW are 36.4%, 37.9% and 25.6%, respectively. The inventors further studied each individual shape within the inter-base and intra-base groups and found that the contribution of each shape also differs. Roll and Buckle are the most important features of their respective groups, contributing 25.56% and 37.54% (fig. 5a). Notably, the HelT and Rise shapes contribute negatively to binding recognition: after removing these two shapes, the area under the receiver operating characteristic curve (AUROC) of the model increased by 0.0021 and 0.0017, respectively.

The invention also analyzes the importance of DNA sequence and DNA shape in different tissues, taking two transcription factors, CTCF and POLR2A, as examples. The importance of DNA sequence and DNA shape differs significantly between tissues. For CTCF and POLR2A, DNA sequence was least important in vagina and Peyer's patch tissue, respectively, and DNA shape was least important in spleen and thyroid gland tissue, respectively (fig. 5c). This indicates that both DNA shape and DNA sequence are important factors affecting tissue-specific binding of transcription factors.
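The leave-one-out analysis described above can be sketched as follows. This is a minimal illustration, not the inventors' implementation: the function names, the toy scoring function and its weights are hypothetical, and the normalisation of AUROC drops into percentage shares is an assumption, since the text does not state how the contribution percentages were derived.

```python
from typing import Callable, Dict, List


def leave_one_out_drops(
    groups: Dict[str, List[str]],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Return the AUROC drop caused by removing each feature group in turn."""
    all_features = [f for feats in groups.values() for f in feats]
    full_score = evaluate(all_features)
    drops = {}
    for name, feats in groups.items():
        kept = [f for f in all_features if f not in feats]
        # A negative drop means the model improved without this group.
        drops[name] = full_score - evaluate(kept)
    return drops


def to_shares(drops: Dict[str, float]) -> Dict[str, float]:
    """Convert positive drops to percentage shares (assumed normalisation)."""
    total = sum(d for d in drops.values() if d > 0)
    return {k: (100.0 * d / total if d > 0 else 0.0) for k, d in drops.items()}


# Hypothetical stand-in for "retrain the model and measure AUROC".
# The weights are invented for illustration only.
TOY_WEIGHTS = {"Roll": 0.004, "HelT": -0.002, "Buckle": 0.003, "MGW": 0.002}


def toy_auroc(features: List[str]) -> float:
    return 0.9 + sum(TOY_WEIGHTS[f] for f in features)


if __name__ == "__main__":
    groups = {"inter-bp": ["Roll", "HelT"], "intra-bp": ["Buckle"], "MGW": ["MGW"]}
    print(to_shares(leave_one_out_drops(groups, toy_auroc)))
```

In practice `evaluate` would retrain or re-evaluate the model on the reduced feature set. A feature whose removal raises AUROC, as reported for HelT and Rise, shows up as a negative drop.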

In the study of epigenomic features, the inventors found that TF-DNA binding can be predicted using only the eight histone modifications and the DNase data, but performance is lower than that of models relying on DNA sequence. Overall, DNase I hypersensitive sites (DNase) are more important than histone modifications for predicting TF-DNA binding. Furthermore, the inventors found that the importance of these features is strongly tissue-specific, i.e. the importance of the same feature for the same TF varies significantly between tissues.

It can be concluded that analyzing the importance of different features for TF-DNA binding with the method of the invention helps to explain the differential expression of genes in different human tissues.

Example 4

In a fourth embodiment of the present invention, to address the "black box" problem of the model, the attention mechanism of the model trained in the first embodiment is visualized to extract the regions that influence the model's decisions, and the convolution kernels are used for motif mining, so as to analyze tissue-specific binding of transcription factors. The embodiment includes the following steps:

X1: Train the model. The following experiments use the model trained in example one.

X2: Select the motif detectors. Potential motif detectors are picked from the convolution kernels by max pooling.

X3: Extract the TF-DNA binding regions. Let I be the maximum activation value of the motif detector. For each sequence of the positive samples, the position with the maximum activation value of the current motif detector is selected. If the activation value at that position is greater than the manually chosen threshold (0.7I), the position is regarded as a potential binding site.

X4: Compare motif similarity. The motifs extracted by the invention are compared with the validated motifs in the database and visualized using the TOMTOM tool.

X5: Visualize the attention mechanism. The importance of each position in the sequence is assessed and visualized with self-attention (fig. 6).
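Step X3 can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and the data layout are hypothetical, and I is taken to be the maximum activation of the detector over all positive sequences, which the text does not state explicitly.

```python
from typing import List, Optional


def potential_binding_sites(
    activations: List[List[float]],
    ratio: float = 0.7,
) -> List[Optional[int]]:
    """Locate candidate binding sites for one motif detector.

    activations[s][p] is the detector's activation at position p of
    positive sequence s. Returns, per sequence, the position of the
    maximum activation if it exceeds ratio * I, otherwise None.
    """
    # Assumption: I is the detector's maximum over all positive sequences.
    global_max = max(max(row) for row in activations)
    threshold = ratio * global_max
    sites: List[Optional[int]] = []
    for row in activations:
        best_pos = max(range(len(row)), key=row.__getitem__)
        sites.append(best_pos if row[best_pos] > threshold else None)
    return sites


if __name__ == "__main__":
    toy = [[0.1, 0.9, 0.2], [0.3, 0.2, 0.5], [1.0, 0.0, 0.4]]
    print(potential_binding_sites(toy))
```

The subsequences starting at the returned positions, with the length of the convolution kernel, could then be aligned into a position frequency matrix for the TOMTOM comparison in step X4; that part is not shown here.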

Specifically, the detailed process of step X2 is as follows:

X21: For each sequence, the maximum activation value of each convolution kernel is calculated.

X22: The convolution kernel that holds the largest of these maximum activation values is found.

X23: Steps X21 and X22 are carried out over the whole positive sample data set, and the number of times each convolution kernel holds the largest value is counted.

X24: A convolution kernel that holds the largest value in more than one fifth of the samples is used as a motif detector.
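Steps X21 to X24 can be sketched as follows. This is a minimal illustration, not the inventors' code: the function name, the nested-list layout of the feature maps and the `denominator` parameter are assumptions made for the example.

```python
from collections import Counter
from typing import List


def select_motif_detectors(
    feature_maps: List[List[List[float]]],
    denominator: int = 5,
) -> List[int]:
    """Pick the kernels that act as motif detectors.

    feature_maps[s][k][p] is the activation of convolution kernel k at
    position p of positive sequence s.
    """
    n_sequences = len(feature_maps)
    wins: Counter = Counter()
    for per_kernel in feature_maps:
        # X21: max pooling over positions, one value per kernel.
        maxima = [max(positions) for positions in per_kernel]
        # X22: the kernel holding the largest of these maxima.
        winner = max(range(len(maxima)), key=maxima.__getitem__)
        # X23: count how often each kernel wins over the data set.
        wins[winner] += 1
    # X24: keep kernels that win in more than 1/denominator of the samples.
    return sorted(k for k, c in wins.items() if c * denominator > n_sequences)


if __name__ == "__main__":
    toy = [
        [[0.9, 0.1], [0.2, 0.3], [0.1, 0.1]],
        [[0.5, 0.8], [0.1, 0.2], [0.3, 0.0]],
        [[0.1, 0.2], [0.7, 0.1], [0.0, 0.3]],
        [[0.6, 0.2], [0.1, 0.1], [0.2, 0.4]],
        [[0.3, 0.9], [0.2, 0.2], [0.1, 0.5]],
    ]
    print(select_motif_detectors(toy))
```

The comparison uses integer arithmetic (`c * denominator > n_sequences`) so that "more than one fifth" is tested exactly, without floating-point rounding.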

The data used in this experiment are those of example one. Motif similarity was compared using DeepBind (method 1), MEME-ChIP (method 2), gkm-SVM (method 3) and the method of the invention.

In a specific implementation, the prediction method provided by the invention can be run automatically as software. An apparatus for running the process also falls within the scope of the present invention.

The beneficial effects of the present invention are verified by experimental results below.

Table 2. Motif similarity comparison results

As can be seen from Table 2, the method of the invention captures transcription factor binding motifs more effectively than the other methods and is clearly superior on all three evaluation indexes. Furthermore, fig. 7 shows the attention scores for four ChIP-Seq-validated CTCF binding sites in pancreatic tissue and their corresponding attention maps. The attention mechanism of the invention thus assigns higher attention to TF-DNA binding regions, which helps to better understand the TF-DNA binding mechanism and indirectly verifies that the invention identifies TF-DNA binding well.

In conclusion, the invention provides a TF-DNA binding identification method based on multi-feature fusion that effectively improves the prediction performance for TF-DNA binding. The results of the invention can be applied in the field of biomedicine: researchers can use the method to predict TF-DNA binding and screen potential binding regions from massive data, overcoming the drawback that traditional experiments are time-consuming and labor-intensive. The method can also be used to analyze the features affecting the specific binding of transcription factors, which helps to explain differential gene expression in different human tissues and provides clues for diagnosing diseases, developing therapeutic targets and clarifying the etiology of diseases. In addition, the invention can identify important regions in the genome, and analyzing these regions can provide important biological insights.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A TF-DNA binding recognition method based on multi-feature fusion is characterized by comprising the following steps:

S2: preprocessing the various data to obtain preprocessed original data;

S5: performing global dependency extraction, feature extraction and feature fusion on the coded DNA sequence data and the normalized data by using a multi-feature fusion attention mechanism and a convolutional neural network to obtain an overall feature mapping combination;

S6: training the multi-feature fusion TF-DNA binding recognition model by using a transfer learning method according to the overall feature mapping combination to obtain the trained multi-feature fusion TF-DNA binding recognition model;

2. The TF-DNA binding recognition method based on multi-feature fusion of claim 1, wherein in the step S2, the preprocessed raw data comprises: DNA sequence data and its corresponding DNA shape data, chromatin accessibility data, histone modification data, and conservation data.

3. The TF-DNA binding identification method based on multi-feature fusion according to claim 1 or 2, wherein the step S2 comprises:

S22: processing the DNA sequence data by using GKM-SVM to obtain a positive sample and a negative sample;

4. The TF-DNA binding recognition method based on multi-feature fusion according to claim 3, wherein said step S22 comprises:

5. The multiple feature fusion based TF-DNA binding recognition method according to claim 3, wherein said three types of DNA shape data include:

A: inter-bp: HelT, Rise, Roll, Shift, Slide and Tilt;

B: intra-bp: Buckle, Opening, ProT, Shear, Stagger and Stretch;

C: MGW and EP.

6. The TF-DNA binding recognition method based on multi-feature fusion according to claim 3, wherein said step S3 comprises:

S33: converting the high-dimensional sparse one-hot codes into distributed codes according to a word2vec strategy;

7. The TF-DNA binding recognition method based on multi-feature fusion according to claim 3, wherein in the step S4, the normalized data is:

$S_m = [m_1, m_2, \ldots, m_i, \ldots, m_{100}, m_{101}]$

$S_h = [h_1, h_2, \ldots, h_i, \ldots, h_{100}, h_{101}]$

$S_d = [d_1, d_2, \ldots, d_i, \ldots, d_{100}, d_{101}]$

$S_c = [c_1, c_2, \ldots, c_i, \ldots, c_{100}, c_{101}]$

8. The TF-DNA binding recognition method based on multi-feature fusion according to claim 1, wherein the step S5 comprises: the multi-feature fusion convolutional neural network comprises a multi-head self-attention mechanism module, a first feature fusion module, a convolutional feedforward neural network module, a second feature fusion module, a convolutional layer I, a maximum pooling layer, a convolutional layer II, a global pooling layer, a full-connection layer and a ReLU activation layer which are arranged in sequence,

and the full connection layer and the ReLU activation layer are used for fusing the global dependency extraction result and the feature extraction result to obtain an integral feature mapping combination.

9. The TF-DNA binding recognition method based on multi-feature fusion according to claim 1, wherein in the step S6, the transfer learning method includes a weak transfer learning method and a strong transfer learning method, and the weak transfer learning method includes:

the strong transfer learning method comprises the following steps:

B2: freezing all network structures except the full connection layer;

B4: outputting the multi-feature fused TF-DNA binding recognition model.

10. A TF-DNA binding recognition system comprising a processor, a memory and a computer program stored on the memory, the computer program executing on the processor the TF-DNA binding recognition method based on multi-feature fusion according to any one of the claims 1 to 9.