CN111312329B - Transcription factor binding site prediction method based on deep convolution automatic encoder - Google Patents


Info

Publication number
CN111312329B
CN111312329B (application CN202010115572.XA)
Authority
CN
China
Prior art keywords: data set, training, automatic encoder, data, transcription factor
Prior art date
Legal status
Active
Application number
CN202010115572.XA
Other languages
Chinese (zh)
Other versions
CN111312329A (en)
Inventor
张永清
乔少杰
郜东瑞
曾圆麒
陈庆园
卢荣钊
林志宇
Current Assignee
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority: CN202010115572.XA
Publication of CN111312329A
Application granted
Publication of CN111312329B
Legal status: Active

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30 Drug targeting using structural data; Docking or binding prediction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. ICT SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Abstract

The invention discloses a transcription factor binding site prediction method based on a deep convolutional autoencoder, applied in the fields of computer technology and bioinformatics, which aims to remove the model's dependence on negative sequence samples lacking binding sites and to improve the model's generalization ability. The method specifically enriches DNA fragments bound to the target protein by chromatin immunoprecipitation (ChIP) to obtain an original data set; the original data set is then preprocessed to obtain a training data set; next, the training data set is fed into a convolutional autoencoder for training; finally, binding sites are recognized with the trained convolutional autoencoder. Experiments show that the method can predict the binding sites of different transcription factors in different cell lines with high recognition accuracy.

Description

Transcription factor binding site prediction method based on deep convolution automatic encoder
Technical Field
The invention belongs to the technical fields of computer technology and bioinformatics, and in particular relates to a transcription factor binding site prediction technique.
Background
In the early days of studying transcription factor binding sites, the recognition problem was approached experimentally: true binding sites were determined directly from DNA sequences. With the development of bioinformatics, various methods based on mathematical models have emerged, freeing researchers from relying solely on experimentally verified binding site information. The study of Transcription Factor Binding Sites (TFBS) has a long history and was first widely applied to studying transcriptional regulators in the upstream promoter regions of co-expressed genes. Because binding sites are relatively short and the same transcription factor binds to identical or merely similar DNA sequences, identifying binding sites accurately is challenging. Existing recognition algorithms are based either on consensus sequences or on probabilistic models; the main approaches can be summarized as follows.
Recognition of transcription factor binding sites based on consensus sequences is a pattern-driven approach that searches the input sequence space. A consensus sequence is a representation of a set of binding sites: assuming a binding site of length l with four possible bases at each position, there are 4^l possible forms in total. All similar instances are then sought in the input sequences, and significance is finally computed from the number of instances found. Such methods are suitable for binding sites that are short and highly conserved.
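The consensus-style search described above can be sketched in a few lines of Python. The function names and the mismatch allowance below are illustrative, not taken from the patent:

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def consensus_hits(sequences, consensus, max_mismatch=1):
    """Count occurrences of `consensus` (allowing up to `max_mismatch`
    substitutions per window) across a list of input DNA sequences."""
    l = len(consensus)
    hits = 0
    for seq in sequences:
        for i in range(len(seq) - l + 1):
            if hamming(seq[i:i + l], consensus) <= max_mismatch:
                hits += 1
    return hits

seqs = ["ACGTACGTTT", "TTTACGAACG"]
print(consensus_hits(seqs, "ACGT", max_mismatch=0))  # → 2
```

Significance would then be computed from the hit count against what is expected in background sequence.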
Besides consensus-sequence algorithms, there are algorithms based on a Position Weight Matrix (PWM). PWM-based algorithms are heuristic search algorithms. Since a binding site may appear at any position of an input sequence, similar subsequences are selected from each input sequence and aligned to generate a corresponding probability matrix, from which the significance of the subsequences relative to the background sequence is computed. The binding site recognition problem thus becomes a combinatorial optimization problem.
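A minimal sketch of the probability-matrix idea, assuming aligned candidate subsequences and a uniform background; the function names and pseudocount are illustrative:

```python
import math

def pwm_from_sites(sites, pseudocount=0.5):
    """Build a position probability matrix from aligned binding sites."""
    l = len(sites[0])
    pwm = []
    for i in range(l):
        col = [s[i] for s in sites]
        total = len(col) + 4 * pseudocount
        pwm.append({b: (col.count(b) + pseudocount) / total for b in "ACGT"})
    return pwm

def log_odds_score(pwm, subseq, background=0.25):
    """Log-odds of `subseq` under the PWM versus a uniform background."""
    return sum(math.log2(pwm[i][b] / background) for i, b in enumerate(subseq))

pwm = pwm_from_sites(["ACGT", "ACGA", "ACGT"])
print(log_odds_score(pwm, "ACGT") > log_odds_score(pwm, "TTTT"))  # → True
```

Subsequences scoring far above background are the candidate binding sites.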
Early transcription factor binding site recognition methods focused mainly on gene promoter regions, typically worked with a few hundred promoter sequences, and mostly used search algorithms. Because the ChIP-seq technique generates very large data volumes, many traditional algorithms cannot process such data. After ChIP-seq became widely used, improved algorithms were designed to greatly increase computation speed while sacrificing as little accuracy as possible.
Several algorithms target ChIP-seq data: the well-known classical motif discovery algorithm MEME; MEME-ChIP, its version for ChIP-seq data; a ChIP-adapted version of Gibbs Sampler; and STEME, another acceleration of MEME. Some methods search for the optimal binding site in a subset of the input data, reducing the time cost of oversized inputs; others build a suffix-tree index over the sequences to accelerate the EM algorithm for ChIP-seq binding site recognition. Nowadays more and more binding sites have been validated by wet-lab experiments, yet many binding sites remain unanalyzed and undiscovered. Two problems stand out. (1) Most deep models rely heavily on negative samples, and the way negative samples are generated has received little attention in previous studies; it may introduce noisy data and hurt the performance of TFBS prediction models. (2) Because data samples are unevenly distributed across transcription factors, the same model predicts different transcription factors with varying quality: transcription factors with many data samples are usually predicted well, while longer sample sequences reduce predictive performance, and the generalization ability of such models is insufficient.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a transcription factor binding site prediction method based on a deep convolutional autoencoder; the discovery of such motifs helps in understanding gene expression.
The technical scheme adopted by the invention is as follows: a method for transcription factor binding site prediction based on a deep convolutional autoencoder, comprising:
S1, specifically enriching DNA fragments bound to the target protein by chromatin immunoprecipitation (ChIP), thereby obtaining an original data set;
S2, preprocessing the original data set to obtain a training data set;
S3, inputting the training data set into a convolutional autoencoder for training;
and S4, identifying binding sites with the trained convolutional autoencoder.
The preprocessing in step S2 specifically comprises the following steps:
A1, screening the original data set: the original data set is reprocessed through a unified pipeline using four different peak calling tools; from the four peak sets of each data set, the set with the lowest S is iteratively excluded until the ratio between the largest and smallest S in the data set is no more than 2; if only one peak set remains, the entire data set is deleted; S denotes the empirical light-tail weight;
A2, removing data sets whose number of samples does not exceed 5000 after processing in step A1;
and A3, selecting from the data sets processed in step A2 by setting a sequence length, to obtain valid fixed-length DNA sequence data.
The step A3 specifically comprises the following sub-steps:
A31, making labels for the DNA sequences screened in step A2 and dividing the data set into two parts: for one part, counter-examples are generated by scrambling the sequences; the other part is mapped to a D-dimensional space;
A32, encoding the DNA sequences processed in step A31 with one-hot encoding, given a DNA sequence s = (s_1, s_2, …, s_l) of length l and a fixed motif scanner length m;
A33, obtaining the encoded matrix S by this scheme: the columns of matrix S correspond to the one-hot vectors of A, C, G or T, defined as [1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T and [0,0,0,1]^T.
The step S3 specifically comprises the following steps:
S31, inputting the training set processed in step S2 into an unsupervised convolutional autoencoder for training;
S32, importing the trained filter and pooling-window parameters of the unsupervised convolutional autoencoder into a supervised convolutional autoencoder;
and S33, inputting the training set processed in step S2 into the supervised convolutional autoencoder for training.
And S5, after the max-pooled output of the convolutional layer, the supervised convolutional autoencoder replaces the original MLP layer with a fully connected highway network.
The invention has the following beneficial effects. DNA fragments bound to the target protein are first specifically enriched by chromatin immunoprecipitation (ChIP), purified, and used to construct a library, after which the enriched fragments are sequenced at high throughput. The sequencing data are preprocessed to obtain, genome-wide, the DNA fragments interacting with histones, transcription factors, and the like; DNA sequences of different lengths are then adaptively extracted, and complete equal-length sequences with significant peaks are detected; next, a convolutional autoencoder automatically extracts feature values from the DNA fragment information interacting with transcription factors, and a transcription factor binding site prediction model is trained; finally, the model is used for binding site recognition. The method can predict the binding sites of different transcription factors in different cell lines and recognize human and mouse transcription factor binding sites with high accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of a complete algorithm for using the method of the present invention;
FIG. 3 is a diagram of the preliminary data preprocessing training feature extraction of the present invention;
FIG. 4 is a feature-enhanced model training diagram of the present invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
In the present invention, we design a hybrid deep neural network that integrates a convolutional autoencoder with a highway fully-connected MLP, taking into account the spatial and sequential characteristics of DNA sequences. Convolutional Neural Networks (CNNs) are a special kind of Artificial Neural Network (ANN) that employ a weight-sharing strategy to capture local patterns in data (e.g., DNA sequences). The method comprises a preliminary prediction algorithm on preprocessed DNA sequence data, feature extraction, and model building; the overall flow chart of the system is shown in FIG. 1. The details are as follows:
the invention aims to make full use of the information of the transcription factor binding sites and predict whether the DNA sequence has the binding sites by using network characteristics. Based on the above ideas, three prediction algorithms are provided, and the experimental flows of the algorithms are consistent.
The first method: prediction based on a Convolutional Neural Network (CNN). The feature-extraction capability of a CNN is used to extract feature information from the data source and to build a predictive model, specifically:
11. Select an appropriate number of network layers. This step is key to ensuring that the features of interest are extracted. For multimedia data recognition, pictures and videos usually carry complex feature information, so the corresponding networks are relatively deep, with tens or even hundreds of layers (e.g., GoogLeNet, VGGNet). Sequence models resemble text data, which carries relatively little feature information; an overly deep model contributes little to feature extraction while wasting memory and reducing efficiency, so the number of layers is generally chosen between a few and a dozen or so, with the exact number determined by the actual situation;
12. Perform model training and prediction with a suitable convolution kernel;
13. Evaluate the prediction results by computing AUC and PR-AUC.
The second method: prediction based on a deep network. The feature-extraction capability of a deep network is used to extract feature information from the data source and to build a predictive model, specifically:
21. Adjust training and testing according to the model. The final convolutional autoencoder structure comprises three convolutional layers, two pooling layers, two upsampling layers, and three deconvolution layers. The CNN used for classification output shares the encoder structure of the convolutional autoencoder so that it can receive the heuristically pre-trained parameters; five highway layers are then attached to build the deep network, and the network parameters are set;
22. Select the number of neurons for model training and prediction. The number of neurons corresponds to the number of features to be extracted from the data. Traditional models manually extract a certain number of data features via machine learning or other methods before training; a deep model avoids the manual feature-extraction step, so only the number of features the network should extract automatically, i.e., the number of neurons, needs to be set. This is one of the important parameters tuned experimentally in a deep model, and its best value does not simply scale with network depth or data complexity;
23. Evaluate the prediction results by computing AUC and PR-AUC.
The third method: prediction based on a Long Short-Term Memory (LSTM) network. The characteristics of the LSTM are used to mine dependencies among the data while retaining the data information, and to predict the result, specifically:
31. Construct an LSTM network;
32. Select suitable numbers of LSTM layers and neurons. The key to the LSTM is the cell state; an LSTM layer consists of individual neurons (cell states) connected in a chain. Cell states influence one another: the output of the previous cell state carries a learned weight into the input of the next cell node, reflecting the dependencies and relations among the data. The cell state acts somewhat like a conveyor belt that runs straight through the entire chain with only a few minor linear interactions, so the information it carries can flow along unchanged. The LSTM can add or delete information in the cell state through structures called gates;
33. Evaluate the prediction results by computing AUC and PR-AUC.
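All three methods are scored with AUC and PR-AUC. A dependency-free sketch of both metrics follows; the pairwise-comparison form of ROC AUC and the step-wise PR interpolation are common conventions, not specified by the patent:

```python
def roc_auc(labels, scores):
    """ROC AUC via pairwise comparison: the probability that a random
    positive is scored higher than a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def pr_auc(labels, scores):
    """Area under the precision-recall curve (step-wise interpolation)."""
    pairs = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    n_pos = sum(labels)
    area, prev_recall = 0.0, 0.0
    for _, y in pairs:
        tp += y
        fp += 1 - y
        recall = tp / n_pos
        area += (recall - prev_recall) * (tp / (tp + fp))
        prev_recall = recall
    return area

y, p = [1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]
print(roc_auc(y, p))  # → 0.75
```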
The complete flow-system framework of the invention is shown in FIG. 2: data sets are obtained from the ENCODE and GTRD databases; the original data set is preprocessed and then encoded; a network model matching the encoded-data interface is constructed; features are extracted through the network model and the binding site prediction value is output; results are visualized by evaluating the model's performance with evaluation functions; and once the model is built, the whole computation from training to testing is executed on a server with GPU computing resources.
1) Primary data preprocessing pre-training feature extraction algorithm
11) Pre-processing of DNA data
A1, using the Gene Transcription Regulatory Database (GTRD) as the source of systematically processed ChIP-Seq data; GTRD aggregates ChIP-Seq data from GEO.
A first screening of the data uses four different peak calling tools (MACS, PICS, SISSRs, GEM) in a unified reprocessing pipeline. These tools, aggregated in GTRD, are applied to several thousand uniformly processed ChIP-Seq experiments within the BioUML framework as part of systematic motif discovery; GTRD summarizes ChIP-Seq data from the Gene Expression Omnibus (GEO). Specifically:
for each peak caller, an empirical distribution of the number of peaks is constructed from all the peak sets obtained by that peak caller (ignoring the peak set that contains zero peaks). This allows us to replace the peak N in each set of peaks, with the empirical weight of the light tail defined as S = min (P (≧ N), P (≦ N)), where (P (≧ N), P (≦ N)) represents a two-sided hypothesis experiment, with the empirical probability of P being a peak set to contain a conditional number of peaks. A lower S value corresponds to a less likely (maximum or minimum) number of peaks. Consistent data sets are expected to have similar S values for different peak callers. Therefore, from the set of 4 peaks per dataset, we iteratively exclude the one with the low S until the ratio between the large and small S in the dataset becomes no >2. If only one peak set remains, deleting the entire data set;
A2, GTRD as the source of systematically processed ChIP-Seq data provides 3311/2623 data sets and 12612/9938 peak sets for 602/354 human/mouse transcription factors, respectively. The coarse filter eliminates nearly one third of the peak sets. The experiment keeps samples extracted by all four peak callers (MACS, GEM, PICS, SISSRs) and fixes input samples to a length of 300 bp to ease network input, while data sets with small sample sizes are discarded: in this embodiment, data sets with no more than 5000 samples after iteration are removed. Small-sample data sets hinder training and testing after model building, and the features they contain are not necessarily accurate; extensive experiments showed 5000 samples to be an appropriate sample-size boundary for the model of the invention.
And A3: with noise removed from the iterated data per A1, and sufficiently large samples screened per A2, and because the model requires fixed-length inputs, the sequence length is set so that data are selected from the screened data sets within a length range that loses as little binding-site information as possible, finally yielding valid fixed-length DNA sequence data.
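One plausible way to realize the fixed-length (300 bp) extraction is a simple center-crop/pad policy; the patent only states that a fixed length is used, so the policy and padding character here are assumptions:

```python
def fixed_length_window(seq, length=300, pad_char="N"):
    """Center-crop a peak sequence to `length` bp, or pad it
    symmetrically with N when shorter, so every model input is equal
    in length (assumed policy; the patent does not specify one)."""
    if len(seq) >= length:
        start = (len(seq) - length) // 2
        return seq[start:start + length]
    pad = length - len(seq)
    left = pad // 2
    return pad_char * left + seq + pad_char * (pad - left)

print(len(fixed_length_window("ACGT" * 100, 300)))  # → 300
```

Centering keeps the peak summit, where the binding site most likely sits, inside the window.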
The step A3 specifically comprises the following sub-steps:
A31, making labels for the screened DNA sequences and dividing the data set into two parts: one part is used for training directly, and the other generates counter-examples for training by scrambling the sequences;
A32, encoding the DNA sequences into image-like inputs using one-hot encoding, given a DNA sequence s = (s_1, s_2, …, s_l) of length l and a fixed motif scanner length m, where s_1, s_2, …, s_l denote the elements of the length-l DNA encoding vector;
A33, obtaining the encoded matrix by this scheme: the columns of matrix S correspond to the one-hot vectors of A, C, G or T, defined as [1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T and [0,0,0,1]^T.
As will be appreciated by those skilled in the art, A herein represents adenine deoxynucleotide, C represents cytosine deoxynucleotide, G represents guanine deoxynucleotide, and T represents thymine deoxynucleotide.
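The one-hot scheme of steps A32/A33, together with the sequence scrambling used to generate counter-examples in A31, can be sketched as follows (the seeded shuffle is an illustrative choice for reproducibility):

```python
import random

ONE_HOT = {"A": [1, 0, 0, 0], "C": [0, 1, 0, 0],
           "G": [0, 0, 1, 0], "T": [0, 0, 0, 1]}

def one_hot_encode(seq):
    """Encode a DNA sequence as an l x 4 matrix; each row is the
    one-hot vector of A, C, G or T at that position."""
    return [ONE_HOT[base] for base in seq]

def scrambled_negative(seq, seed=None):
    """Generate a counter-example by shuffling the bases of a positive
    sequence: base composition is preserved, site order is destroyed."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)

print(one_hot_encode("ACGT"))
```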
12) Convolutional autoencoder feature extraction
Briefly, an autoencoder is a compressor that transforms its input and then outputs a reconstruction of that input. A feature map is obtained by setting the filter size in the convolutional autoencoder, and the pooling-window size is set to select the strongest features.
the structure used in this example to detect transcription factor binding sites is an autoencoder, a neural network that is trained to reconstruct its input. Such a method for detecting abnormal data belongs to a family of reconstruction-based methods in which an algorithm is desired to reconstruct normal data with low error and abnormal data with higher error. The magnitude of the error is then used to determine whether the input data is normal. In the present invention, we evaluated multi-layered perceptrons (MLPs), convolutional Neural Networks (CNNs) and cyclic autoencoders, in particular autoencoders consisting of long short term memory network (LSTM) units. Unsupervised learning can learn the features of samples without labeling, while the purpose of convolutional autocoder is to utilize the convolution and pooling operations of a convolutional neural network to achieve unsupervised feature extraction for feature-invariant extraction. CNN has weight sharing and the input data can be original images, sequences, etc. The automatic encoder has the capability of rapidly extracting image essential components. The invention combines the advantages of two algorithms of a self-encoder and a convolutional neural network, and provides a heuristic network training method based on a Convolutional Automatic Encoder (CAE).
The implementation of feature extraction with the convolutional autoencoder is as follows:
21. Read the DNA sequences mapped to the D-dimensional space into the autoencoder;
22. Pre-train the autoencoder with the neural network to determine initial values;
23. Import the filter and pooling-window parameters trained by convolutional self-encoding, and use the D-dimensional DNA sequences together with the scrambled sequences as the training input of the neural network.
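The filter/pooling-window mechanics behind this feature extraction can be illustrated on a scalar signal; a real model would convolve the 4-channel one-hot matrix with many learned filters, so this is a simplified sketch:

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation) of a scalar signal:
    the filter slides over the input and produces a feature map."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool(feature_map, window):
    """Non-overlapping max pooling: keep the strongest response
    in each window, discarding any incomplete trailing window."""
    return [max(feature_map[i:i + window])
            for i in range(0, len(feature_map) - window + 1, window)]

fm = conv1d([0, 1, 0, 0, 1, 1], [1, 1])
print(max_pool(fm, 2))  # → [1, 1]
```

The filter size controls the motif width being scanned; the pooling window controls how aggressively only the strongest local feature is kept.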
2) Improved algorithm for extracting sequence features
21) The data preprocessing of step 1) is applied to multiple data samples corresponding to DNA sequences from different cell lines, yielding corresponding first component parameters, which are packaged into uniform interface parameters and displayed visually; the first component parameters are the variables in the neural network algorithm code extracted from the data samples;
22) The code required to create the DNA data set is encapsulated in the corresponding component, and the variables in that code are processed as the component's second component parameters;
the second component parameters comprise the first component parameters;
after the processing of step 22), the second component parameters are checked, packaged into the return value of the interface once the data set is built, and the built data set is passed to the next component;
23) The number of neural network layers is constructed from the set parameters, which completes the model configuration process;
the configured model is then trained and visualized, and the trained model file is saved under the configured path; the training results are stored, the full path of the model file is passed as a parameter, the processed prediction data are fed into the component, and the prediction output of the trained model is finally obtained.
Through steps 21)-23), the preliminary data preprocessing obtains a feature extraction map via heuristic training and classification training, as shown in FIG. 3.
3) Feature selection and binding site recognition
After the model network is initialized, it is fine-tuned backwards with the labeled data, and Sigmoid is chosen as the classifier. We initialize with an unsupervised convolutional auto-encoding network, both to reduce the noise caused by negative samples and to ensure that most useful features are extracted. Conventionally, the convolutional layers are followed by a fully connected MLP layer. A newer technique, the fully connected highway network, has proven effective for deeper representations: a highway network uses gating units that learn to regulate the flow of information through the network. Prior work demonstrates that a highway MLP after a series of convolutions is more efficient than a standard MLP, and suggests that highway networks suit convolutional layers particularly well because they can adaptively combine local features. After the max-pooled output of the convolutional layers, we use a fully connected highway network of depth 5; the output of the highway MLP is fed to a binary Sigmoid classifier.
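The highway gating described above follows the standard highway formulation y = t * H(x) + (1 - t) * x, where t is the transform gate and (1 - t) the carry gate. A minimal element-wise sketch, with illustrative weights and tanh as the transform H:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def highway_layer(x, wh, bh, wt, bt):
    """One highway unit per dimension: y = t * H(x) + (1 - t) * x,
    where H is a tanh transform and t = sigmoid(wt*x + bt) is the
    transform gate; the carry gate (1 - t) lets the input flow
    through unchanged."""
    out = []
    for xi, whi, bhi, wti, bti in zip(x, wh, bh, wt, bt):
        h = math.tanh(whi * xi + bhi)
        t = sigmoid(wti * xi + bti)
        out.append(t * h + (1 - t) * xi)
    return out

# With a strongly negative gate bias the layer approximates identity:
y = highway_layer([0.3, -0.7], wh=[1, 1], bh=[0, 0], wt=[0, 0], bt=[-20, -20])
print(y)  # ≈ [0.3, -0.7]: the carry gate dominates
```

This adaptive mixing is what lets a stack of five such layers stay trainable while combining the convolutional features.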
The feature-enhanced model training diagram is shown in FIG. 4. Extracting feature values of complete DNA fragments specifically comprises: selecting 10 human transcription factor data sets as the feature sets of the experimental data, performing dimension-reduced prediction on the feature sets with the algorithm based on a convolutional neural network and highway fully-connected layers, training with cross-validation, and finally evaluating the model with the evaluation functions.
The evaluation functions include: precision, recall, and accuracy.
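These three evaluation functions are computed from confusion-matrix counts; a standard formulation, sketched here for completeness:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)

print(precision([1, 1, 0, 0], [1, 0, 1, 0]))  # → 0.5
```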
The technical innovations and characteristics of the invention are as follows. Relying on an interdisciplinary background, the invention performs prediction on gene data across the biological and computer fields. By analyzing biological data, the data are integrated and mined at a deep level to discover laws of correlation. Integrating information from many data sources can provide more valuable information for prediction, so the depth model must be improved to accommodate more information; computing with artificial neural networks is such a large-scale algorithmic approach.
1) Research background innovation
Computer science and life science intersect here: mature algorithms from machine learning, data mining, and pattern recognition are used to analyze and solve biological problems. The project applies such methods to the prediction of transcription factor binding sites in the biological field, aiming to increase prediction accuracy and to provide guidance and direction for experiments in specific systems and fields.
2) Improved adaptive action fragment extraction algorithm
In bioinformatics, features are often computed from the physicochemical properties or structural information of the molecules themselves, but features obtained this way capture only intrinsic information and lack a description of the data within the whole data network. Studies in computational biology have shown that the composition of life forms is closely related to their networks of interrelations. The invention therefore provides a feature calculation method in which an artificial neural network simulates the dependencies of the biological network, laying a foundation for accurately predicting those dependencies.
3) Improved adaptive feature extraction algorithm
Common transcription factor binding site prediction uses supervised learning. When unlabeled data sources are present, the predictive model must be improved to accommodate the additional information. The invention therefore provides an algorithm that combines unsupervised and supervised learning over multiple data sources, improving prediction accuracy. The method extracts features from DNA sequence fragments and reduces their dimensionality to obtain a feature vector set with good characterization ability and a small computational cost; this feature vector set is then used to build a transcription factor binding site recognition model that achieves good accuracy.
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to help the reader understand the principles of the invention, which is not limited to the specifically recited embodiments and examples. Various modifications and alterations will become apparent to those skilled in the art; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the scope of its claims.

Claims (3)

1. A method for transcription factor binding site prediction based on a deep convolution automatic encoder, comprising:
s1, specifically enriching DNA fragments combined with target proteins by a chromatin co-immunoprecipitation technology, thereby obtaining an original data set;
s2, preprocessing the original data set to obtain a training data set;
s3, inputting the training data set into a convolution automatic encoder for training; the auto-encoder is pre-trained by a neural network to determine initial values, the neural network model configuration process comprising the steps of:
1) Acquiring the corresponding first component parameters according to the preprocessing of step S2, and packaging the first component parameters into uniform interface parameters, wherein the first component parameters are variables, extracted from each data sample, in the neural network algorithm code;
2) Packaging the code required by the data set obtained through the preprocessing of step S2 into corresponding components, and treating variables in the code as second component parameters of those components; the second component parameters comprise the first component parameters;
3) Constructing the layers of the generated neural network based on the set parameters, thereby completing the model configuration process;
s4, identifying binding sites according to the trained convolution automatic encoder;
the step S2 of preprocessing specifically comprises the following steps:
a1, screening an original data set; the original dataset was reprocessed through a unified pipeline using four different call peaking tools, iteratively excluding the one with the low S from the 4 peak sets of each dataset until the ratio between the large and small S in the dataset is less than or equal to 2, the dataset deleting the entire dataset if there is only one peak set left; s represents the empirical weight of the light tail;
a2, removing the data sets with the number of samples not more than 5000 after the samples are processed in the step A1;
a3, selecting from the data set processed in the step A2 by setting the sequence length to obtain effective data of a DNA sequence with a fixed length;
the step A3 specifically comprises the following sub-steps:
a31, making labels for the DNA sequences screened in the step A2, dividing a data set into two parts, and generating an opposite sample for one part by scrambling the sequences; another is mapped to a D-dimensional space;
a32, encoding the DNA sequence processed in the step A31 by using single heat encoding; given a DNA sequence of length L s = (s _1, s _2, \8230;, s _ L) and a fixed motif scanner length m;
a33, obtaining a coded matrix S by an equation, the columns of the matrix S corresponding to the monothermic vectors of A, C, G or T, the columns of the matrix S consisting of [1,0,0,0] T ,[0, 1, 0, 0] T ,[0, 0, 1, 0] T And [0,0,0,1]] T And (4) showing.
2. The method for transcription factor binding site prediction based on deep convolution automatic encoder according to claim 1, wherein step S3 is specifically:
s31, inputting the training set processed in the step S2 into an unsupervised convolution automatic encoder for training;
s32, importing the parameters of the filter and the pooling window of the trained unsupervised convolution automatic encoder into the supervised convolution automatic encoder;
S33, inputting the training set processed in step S2 into the supervised convolution automatic encoder for training.
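Steps S31 to S33 amount to copying the filter and pooling-window parameters learned by the unsupervised autoencoder into the supervised model before fine-tuning with labels. A minimal NumPy sketch (parameter names and shapes are illustrative assumptions, not the patent's actual values):

```python
import numpy as np

rng = np.random.default_rng(42)

# S31: parameters after unsupervised convolutional autoencoder training
unsupervised_cae = {
    "conv_filters": rng.standard_normal((16, 4, 24)),  # 16 motif filters over 4 bases, width 24
    "pool_window": 8,
}

# S32: import the trained filter / pooling parameters into the supervised model
supervised_cae = {
    "conv_filters": unsupervised_cae["conv_filters"].copy(),
    "pool_window": unsupervised_cae["pool_window"],
    # the classifier head is freshly initialised and trained with labels (S33)
    "head_weights": rng.standard_normal((16, 1)) * 0.01,
}

print(supervised_cae["conv_filters"].shape)  # (16, 4, 24)
```

Only the convolutional front end is transferred; the supervised head starts from scratch, so label noise does not corrupt the pre-trained motif detectors.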
3. The method of claim 2, wherein in step S33 the supervised convolution automatic encoder replaces the original MLP layer with a fully connected highway network after the max-pooled output of the convolutional layers.
CN202010115572.XA 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder Active CN111312329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010115572.XA CN111312329B (en) 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder


Publications (2)

Publication Number Publication Date
CN111312329A CN111312329A (en) 2020-06-19
CN111312329B true CN111312329B (en) 2023-03-24

Family

ID=71161946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010115572.XA Active CN111312329B (en) 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder

Country Status (1)

Country Link
CN (1) CN111312329B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349349A (en) * 2020-11-06 2021-02-09 西安奥卡云数据科技有限公司 Transcription factor binding site recognition discovery method and device based on Spark Streaming
CN113160877B (en) * 2021-01-11 2022-11-25 东南大学 Prediction method of cell-specific genome G-quadruplex
CN112735514B (en) * 2021-01-18 2022-09-16 清华大学 Training and visualization method and system for neural network extraction regulation and control DNA combination mode
CN112932499B (en) * 2021-01-28 2023-05-26 晨思(广州)医疗科技有限公司 Network training and single-lead electrocardio data processing method, computer device and medium
CN113035280B (en) * 2021-03-02 2022-03-11 四川大学 RBP binding site prediction algorithm based on deep learning
CN112992267B (en) * 2021-04-13 2024-02-09 中国人民解放军军事科学院军事医学研究院 Single-cell transcription factor regulation network prediction method and device
CN113593634B (en) * 2021-08-06 2022-03-11 中国海洋大学 Transcription factor binding site prediction method fusing DNA shape characteristics
US11831527B2 (en) * 2022-03-09 2023-11-28 Nozomi Networks Sagl Method for detecting anomalies in time series data produced by devices of an infrastructure in a network
CN114842914B (en) * 2022-04-24 2024-04-05 山东大学 Deep learning-based chromatin ring prediction method and system
CN114864002B (en) * 2022-04-28 2023-03-10 广西科学院 Transcription factor binding site recognition method based on deep learning
CN114639441B (en) * 2022-05-18 2022-08-05 山东建筑大学 Transcription factor binding site prediction method based on weighted multi-granularity scanning
CN116153404B (en) * 2023-02-28 2023-08-15 成都信息工程大学 Single-cell ATAC-seq data analysis method
CN116403645B (en) * 2023-03-03 2024-01-09 阿里巴巴(中国)有限公司 Method and device for predicting transcription factor binding site
CN116884495B (en) * 2023-08-07 2024-03-08 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN107563430A (en) * 2017-08-28 2018-01-09 昆明理工大学 A kind of convolutional neural networks algorithm optimization method based on sparse autocoder and gray scale correlation fractal dimension
CN109145948A (en) * 2018-07-18 2019-01-04 宁波沙塔信息技术有限公司 A kind of injection molding machine putty method for detecting abnormality based on integrated study
CN109215740A (en) * 2018-11-06 2019-01-15 中山大学 Full-length genome RNA secondary structure prediction method based on Xgboost

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
NZ759818A (en) * 2017-10-16 2022-04-29 Illumina Inc Semi-supervised learning for training an ensemble of deep convolutional neural networks
CN108763865B (en) * 2018-05-21 2023-10-20 成都信息工程大学 Integrated learning method for predicting DNA protein binding site
CN109559781A (en) * 2018-10-24 2019-04-02 成都信息工程大学 A kind of two-way LSTM and CNN model that prediction DNA- protein combines
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of site recognition methods of RNA alternative splicing and system
CN110334809A (en) * 2019-07-03 2019-10-15 成都淞幸科技有限责任公司 A kind of Component encapsulating method and system of intelligent algorithm


Also Published As

Publication number Publication date
CN111312329A (en) 2020-06-19

Similar Documents

Publication Publication Date Title
CN111312329B (en) Transcription factor binding site prediction method based on deep convolution automatic encoder
US11900225B2 (en) Generating information regarding chemical compound based on latent representation
Lanchantin et al. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks
CN111798921B (en) RNA binding protein prediction method and device based on multi-scale attention convolution neural network
Fu et al. Mimosa: Multi-constraint molecule sampling for molecule optimization
CN114787876A (en) System and method for image pre-processing
Fonnegra et al. Performance comparison of deep learning frameworks in image classification problems using convolutional and recurrent networks
CN111582506A (en) Multi-label learning method based on global and local label relation
Zhang et al. CAE-CNN: Predicting transcription factor binding site with convolutional autoencoder and convolutional neural network
CN116206688A (en) Multi-mode information fusion model and method for DTA prediction
Erfanian et al. Deep learning applications in single-cell genomics and transcriptomics data analysis
CN113571125A (en) Drug target interaction prediction method based on multilayer network and graph coding
Stewart et al. Learning flexible features for conditional random fields
CN116386733A (en) Protein function prediction method based on multi-view multi-scale multi-attention mechanism
CN113160886B (en) Cell type prediction system based on single cell Hi-C data
Rastogi et al. Semi-parametric inducing point networks and neural processes
KR20230171930A (en) Deep convolutional neural networks to predict variant pathogenicity using three-dimensional (3D) protein structures
KR20230170679A (en) Efficient voxelization for deep learning
Termritthikun et al. Neural architecture search and multi-objective evolutionary algorithms for anomaly detection
Javadinia et al. PDR-CapsNet: an Energy-Efficient Parallel Approach to Dynamic Routing in Capsule Networks
Bonat et al. Apply Machine Learning Algorithms for Genomics Data Classification
CN117637029B (en) Antibody developability prediction method and device based on deep learning model
Purohit Sequence-based Protein Interaction Site Prediction using Computer Vision and Deep Learning
Halsana et al. DensePPI: A Novel Image-based Deep Learning method for Prediction of Protein-Protein Interactions
Li et al. MetaAc4C: A multi-module deep learning framework for accurate prediction of N4-acetylcytidine sites based on pre-trained bidirectional encoder representation and generative adversarial networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant