CN111312329A - Transcription factor binding site prediction method based on deep convolution automatic encoder - Google Patents

Info

Publication number
CN111312329A
Authority
CN
China
Prior art keywords
automatic encoder
binding site
training
transcription factor
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010115572.XA
Other languages
Chinese (zh)
Other versions
CN111312329B (en)
Inventor
张永清
乔少杰
郜东瑞
曾圆麒
陈庆园
卢荣钊
林志宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
Original Assignee
Chengdu University of Information Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology
Priority to CN202010115572.XA
Publication of CN111312329A
Application granted
Publication of CN111312329B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00: ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/30: Drug targeting using structural data; Docking or binding prediction
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Chemical & Material Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Medicinal Chemistry (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a transcription factor binding site prediction method based on a deep convolutional autoencoder. It is applied in the fields of computer technology and bioinformatics, and aims to remove the model's dependence on negative sequence samples that contain no binding site and to improve the generalization ability of the model. The method first specifically enriches the DNA fragments bound to the target protein by chromatin immunoprecipitation (ChIP) to obtain an original data set; it then preprocesses the original data set to obtain a training data set; next, it inputs the training data set into a convolutional autoencoder for training; finally, it performs binding site recognition with the trained convolutional autoencoder. Experiments show that the method can predict binding sites of different transcription factors in different cell lines and recognizes them with high accuracy.

Description

Transcription factor binding site prediction method based on deep convolution automatic encoder
Technical Field
The invention belongs to the fields of computer technology and bioinformatics, and in particular relates to a technique for predicting transcription factor binding sites.
Background
In the early days of transcription factor binding site research, the recognition problem consisted of obtaining the true transcription factor binding sites from DNA sequences by experiment. With the development of bioinformatics, various methods based on mathematical models have been developed, so that researchers are no longer limited to binding site information alone. Research on transcription factor binding sites (TFBS) has a long history and was first widely applied to the study of transcriptional regulatory elements in the upstream promoter regions of co-expressed genes. Because binding sites are relatively short and the same transcription factor binds identical or merely similar DNA sequences, accurately identifying a binding site is challenging. Search algorithms for identifying transcription factor binding sites are based either on consensus sequences or on probabilistic models. Specifically, the approaches to the recognition problem fall into the following categories.
Recognition of transcription factor binding sites based on consensus sequences is a pattern-driven approach that searches the space of input sequences. A consensus sequence is one representation of a transcription factor binding site. Suppose a binding site of length l is sought: each position can hold one of four bases, so there are 4^l different candidate forms in total. All similar instances of each candidate are then sought in the input sequences, and significance is finally calculated from the number of instances. Such methods are suitable for binding sites that are short and highly conserved.
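The pattern-driven search described above can be illustrated with a minimal sketch. The function name `count_consensus_hits`, the mismatch parameter and the toy sequences are assumptions made for illustration; they are not taken from the patent, and the significance calculation is left out.

```python
from itertools import product

def count_consensus_hits(sequences, l, max_mismatches=0):
    """For every length-l candidate over {A, C, G, T} (4**l candidates),
    count how many input sequences contain at least one window that differs
    from the candidate in at most max_mismatches positions."""
    counts = {}
    for candidate in map("".join, product("ACGT", repeat=l)):
        hits = 0
        for seq in sequences:
            for i in range(len(seq) - l + 1):
                window = seq[i:i + l]
                mismatches = sum(a != b for a, b in zip(candidate, window))
                if mismatches <= max_mismatches:
                    hits += 1
                    break  # count each sequence at most once
        counts[candidate] = hits
    return counts

# Toy input (hypothetical): every sequence contains the 4-mer GTGA.
toy = ["ACGTGACC", "TTGTGAAA", "CCGTGATT"]
counts = count_consensus_hits(toy, l=4)
best = max(counts, key=counts.get)
```

The exhaustive enumeration grows as 4^l, which is why the text restricts this family of methods to short, highly conserved sites.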
In addition to consensus-based algorithms, there are algorithms based on the position weight matrix (PWM). PWM-based algorithms are heuristic search algorithms. When a probabilistic model is built on a position weight matrix, the binding site may appear at any position of an input sequence, so similar subsequences are selected from each input sequence and aligned to generate the corresponding probability matrix, and the significance of the subsequences relative to the background sequence is calculated from that matrix. The recognition problem is thereby converted into a combinatorial optimization problem.
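As a minimal sketch of the PWM idea, the code below builds a log-odds matrix from already aligned sites and slides it along a sequence. The uniform background of 0.25, the pseudocount of 1 and the example sites are assumptions chosen for illustration; the patent does not specify them.

```python
import math

def build_pwm(aligned_sites, pseudocount=1.0, background=0.25):
    """Return a log-odds position weight matrix as a list of dicts,
    one dict (base -> log2 odds) per aligned position."""
    length = len(aligned_sites[0])
    pwm = []
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        total = len(column) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column.count(base) + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pwm

def best_window(pwm, sequence):
    """Slide the PWM along the sequence and return (best_score, start)."""
    m = len(pwm)
    scores = [
        (sum(pwm[j][sequence[i + j]] for j in range(m)), i)
        for i in range(len(sequence) - m + 1)
    ]
    return max(scores)

# Hypothetical aligned sites and a sequence that embeds the motif at offset 4.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGACTAA"]
pwm = build_pwm(sites)
score, start = best_window(pwm, "CCCCTGACTCAGGGG")
```

Choosing which subsequences to align in the first place is the combinatorial optimization problem mentioned above; this sketch assumes that step has already been done.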
Early transcription factor binding site recognition methods focused mainly on gene promoter regions, typically worked with several hundred sequences, and mostly relied on search algorithms. Because the ChIP-seq technology generates very large amounts of data, many traditional algorithms cannot process it. After ChIP-seq became widely used, improved algorithms were designed to greatly increase computation speed while sacrificing as little accuracy as possible.
Algorithms for ChIP-seq data include MEME-ChIP, the ChIP-seq version of the well-known traditional binding site recognition algorithm MEME, a ChIP version of the Gibbs sampler, and STEME, another approach that accelerates MEME. Some of these search for the optimal transcription factor binding site on a subset of the input data set, which reduces the time overhead caused by very large inputs. Others build a suffix tree index to search the sequences, which speeds up the EM algorithm and addresses binding site recognition on ChIP-seq data. Today, more and more binding sites have been validated by wet-lab experiments, yet many binding sites have still not been analyzed or discovered. Existing approaches have the following problems. (1) When data are insufficient, most deep models rely heavily on negative samples. In previous studies, the method of generating negative samples may not have received much attention; the negative samples may contain noisy data and degrade the performance of the TFBS prediction model. (2) Because the data samples of different transcription factors are not uniform, the same model achieves limited prediction quality across different transcription factors. For example, transcription factors with a large number of data samples are often predicted well, while longer sample sequences reduce the predictive performance of the model. The generalization ability of the model is therefore insufficient.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a method for predicting transcription factor binding sites based on a deep convolutional autoencoder; the discovery of such motifs helps in understanding gene expression.
The technical scheme adopted by the invention is a method for transcription factor binding site prediction based on a deep convolutional autoencoder, comprising:
S1, specifically enriching the DNA fragments bound to the target protein by chromatin immunoprecipitation (ChIP), thereby obtaining an original data set;
S2, preprocessing the original data set to obtain a training data set;
S3, inputting the training data set into a convolutional autoencoder for training;
S4, performing binding site recognition with the trained convolutional autoencoder.
The preprocessing of step S2 specifically includes:
A1, screening the original data set: the original data set is reprocessed through a unified pipeline using four different peak-calling tools; from the 4 peak sets of each data set, the one with the lowest S is iteratively excluded until the ratio between the largest and the smallest S in the data set is no more than 2; if only one peak set is left, the entire data set is deleted; S denotes the empirical weight of the lighter tail;
A2, removing the data sets that contain no more than 5000 samples after the processing of step A1;
A3, selecting sequences from the data sets processed in step A2 by setting the sequence length, to obtain valid DNA sequence data of fixed length.
Step A3 specifically includes the following substeps:
A31, labeling the DNA sequences screened in step A2 and dividing the data set into two parts: negative (opposite) samples are generated from one part by scrambling the sequences, and the other part is mapped to a D-dimensional space;
A32, encoding the DNA sequences processed in step A31 using one-hot encoding, given a DNA sequence s = (s_1, s_2, …, s_l) of length l and a fixed motif scanner length m;
A33, obtaining the encoded matrix S by the equation, where each column of S is the one-hot vector of A, C, G or T, represented by [1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T and [0,0,0,1]^T respectively.
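The scrambling in substep A31 can be sketched as follows. This is only an illustration under stated assumptions: the even split, the labels 1 and 0, the fixed random seed and the helper names `make_negative` and `label_dataset` are hypothetical, and a plain base-level shuffle stands in for whatever scrambling procedure the inventors actually used.

```python
import random

def make_negative(sequence, rng):
    """Return a shuffled copy of the sequence; base composition is preserved."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)

def label_dataset(sequences, seed=0):
    """Split the sequences in half: keep the first half as positives (label 1)
    and shuffle the second half into negatives (label 0)."""
    rng = random.Random(seed)
    half = len(sequences) // 2
    positives = [(seq, 1) for seq in sequences[:half]]
    negatives = [(make_negative(seq, rng), 0) for seq in sequences[half:]]
    return positives + negatives

# Toy sequences (hypothetical).
data = label_dataset(["ACGTACGT", "TTGACTCA", "GGGCCCAA", "ATATCGCG"])
```

A shuffle of single bases destroys the motif while keeping the nucleotide frequencies, which is one common reason for generating negatives this way.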
Step S3 specifically includes:
S31, inputting the training set obtained in step S2 into an unsupervised convolutional autoencoder for training;
S32, importing the filter and pooling window parameters of the trained unsupervised convolutional autoencoder into a supervised convolutional autoencoder;
S33, inputting the training set obtained in step S2 into the supervised convolutional autoencoder for training.
In the supervised convolutional autoencoder of step S3, the original MLP layer that follows the max-pooled output of the convolutional layers is replaced with a fully connected highway network.
The beneficial effects of the invention are as follows. The DNA fragments bound to the target protein are first specifically enriched by chromatin immunoprecipitation (ChIP), purified and used to construct a library, and the enriched DNA fragments are then subjected to high-throughput sequencing. The acquired high-throughput sequencing data are preprocessed to obtain genome-wide information on the DNA segments that interact with histones, transcription factors and the like. DNA sequences of different lengths are then extracted adaptively to obtain sequences of equal length with clear peaks. Next, a convolutional autoencoder automatically extracts feature values from the DNA segments that interact with the transcription factors, and a transcription factor binding site prediction model is trained. Finally, the prediction model is used for binding site recognition. The method can predict binding sites of different transcription factors in different cell lines and recognizes human and mouse transcription factor binding sites with high accuracy.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the complete algorithm of the method of the present invention;
FIG. 3 is a diagram of the preliminary data preprocessing and pre-training feature extraction of the present invention;
FIG. 4 is a diagram of the feature-enhanced model training of the present invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
In the present invention, taking into account the spatial and sequential characteristics of DNA sequences, we have designed a hybrid deep neural network that integrates a convolutional autoencoder with a fully connected highway MLP. Convolutional neural networks (CNNs) are a special kind of artificial neural network (ANN) that use a weight-sharing strategy to capture local patterns in data such as DNA sequences. The method comprises preprocessing the DNA sequence data, a preliminary transcription factor binding site prediction algorithm, feature extraction and model building; the overall flow chart of the system is shown in FIG. 1. The details are as follows.
The invention aims to make full use of transcription factor binding site information and to predict, from the features learned by the network, whether a DNA sequence contains a binding site. Based on this idea, three prediction algorithms are provided; the experimental flow is the same for each.
Method one: prediction based on a convolutional neural network (CNN). This method uses the feature extraction ability of a convolutional neural network to extract the feature information of the data source and to build a predictive model. The specific steps are:
11. Select an appropriate number of network layers. This is the key step for ensuring that the features of greatest interest are extracted. In multimedia recognition, pictures and videos usually carry complex feature information, so the corresponding networks are relatively deep, with at least tens and sometimes hundreds of layers, as in GoogLeNet and VGGNet. Sequence data resemble text data and carry comparatively little feature information; an overly deep model contributes little to feature extraction, wastes memory and reduces efficiency. The number of layers is therefore generally chosen between a few and somewhat more than ten, with the exact number determined by the actual situation;
12. Perform model training and prediction with suitable convolution kernels;
13. Evaluate the prediction result by calculating AUC and PR-AUC.
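The two evaluation measures named in step 13 can be computed as in the sketch below. It is an illustration only: ROC AUC is computed with the pairwise (Mann-Whitney) formulation, and average precision is used as one common estimate of PR-AUC, which is an assumption because the patent does not say how PR-AUC is calculated. The labels and scores are made-up toy values, and the code assumes at least one positive and one negative sample.

```python
def roc_auc(labels, scores):
    """Probability that a random positive is scored above a random negative
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy predictions (hypothetical).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
auc = roc_auc(labels, scores)              # 7 of 9 positive/negative pairs ordered correctly
pr_auc = average_precision(labels, scores)
```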
Method two: prediction based on a deep network. This method uses the feature extraction ability of a deep network to extract the feature information of the data source and to build a predictive model. The specific steps are:
21. Tune the model through training and testing. The convolutional autoencoder finally used comprises three convolutional layers, two pooling layers, two upsampling layers and three deconvolution layers. The convolutional neural network used for the classification output has the same structure as the encoding layers of the convolutional autoencoder, so that it can receive the parameters obtained by heuristic training; five highway layers are then attached to build the deep network, and the network parameters are set;
22. Select the number of neurons and carry out model training and prediction. The number of neurons corresponds roughly to the number of features to be extracted from the data. With a traditional model, a certain number of data features are extracted manually, by machine learning or other methods, before the model is trained. A deep model avoids this manual step: only the number of features the network should extract automatically, that is, the number of neurons, has to be set. This is one of the important parameters of a deep model that must be tuned experimentally, because its effect varies with the number of network layers and the complexity of the data;
23. Evaluate the prediction result by calculating AUC and PR-AUC.
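To make the layer counts in step 21 concrete, the sketch below traces how the sequence length and channel count could change through three convolutional layers, two pooling layers, two upsampling layers and three deconvolution layers. Only the layer counts and the 300 bp input length come from the text; the layer ordering, the channel widths, the pooling factor of 2 and the assumption of length-preserving ("same") padding are hypothetical.

```python
# (kind, value): value is the output channel count for conv/deconv layers
# and the scale factor for pool/upsample layers. All values are assumptions.
LAYERS = [
    ("conv", 16), ("pool", 2), ("conv", 32), ("pool", 2), ("conv", 64),
    ("deconv", 32), ("upsample", 2), ("deconv", 16), ("upsample", 2), ("deconv", 4),
]

def trace_shapes(length, channels, layers):
    """Return the (length, channels) shape after the input and after each layer,
    assuming convolutions and deconvolutions preserve the sequence length."""
    shapes = [(length, channels)]
    for kind, value in layers:
        if kind in ("conv", "deconv"):
            channels = value
        elif kind == "pool":
            length //= value
        elif kind == "upsample":
            length *= value
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        shapes.append((length, channels))
    return shapes

# A one-hot encoded 300 bp sequence has 4 channels.
shapes = trace_shapes(300, 4, LAYERS)
```

Under these assumptions the encoder compresses the input to a length of 75 and the decoder restores the original 300 x 4 shape, which is what a reconstruction loss requires.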
Method three: prediction based on a long short-term memory (LSTM) network. In addition to using the information contained in the data, this method exploits the characteristics of the LSTM to model the dependencies among the data and to predict the result. The specific steps are:
31. Construct an LSTM network;
32. Select a suitable number of LSTM layers and neurons. The key to the LSTM is the cell state, and an LSTM layer consists of individual cells connected into a chain. The cell states influence one another: the output of the previous cell carries a certain weight in the input of the next cell, which reflects the dependencies and relations among the data. The cell state is somewhat like a conveyor belt that runs straight through the entire chain with only a few minor linear interactions, so the information it carries can easily flow along unchanged. The LSTM can add information to or remove information from the cell state, and this is controlled by structures called gates;
33. Evaluate the prediction result by calculating AUC and PR-AUC.
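The role of the gates described in step 32 can be seen in a single LSTM time step. The sketch below uses the standard LSTM equations in scalar form (one input, one hidden unit) to keep it short; the weight names and values are hypothetical and are not parameters from the patent.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. w maps names such as 'wf', 'uf', 'bf' to floats."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old, add part of the new
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c

# With the forget gate saturated open and the input gate closed, the cell state
# passes through almost unchanged (the "conveyor belt" behaviour).
weights = {
    "wf": 0.0, "uf": 0.0, "bf": 20.0,
    "wi": 0.0, "ui": 0.0, "bi": -20.0,
    "wo": 1.0, "uo": 1.0, "bo": 0.0,
    "wg": 1.0, "ug": 1.0, "bg": 0.0,
}
h, c = lstm_step(x=1.0, h_prev=0.2, c_prev=0.5, w=weights)
```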
The complete system framework of the invention is shown in FIG. 2. Data sets are obtained from the ENCODE database and the GTRD database; the original data sets are preprocessed and then encoded; a network model matching the interface of the encoded data is constructed; features are extracted by the network model and a predicted value for the binding site is output; the performance of the model is assessed with evaluation functions and the results are visualized. After the model is built, the whole computation, from training to testing, is executed on a server with GPU (graphics processing unit) computing resources.
1) Preliminary data preprocessing and pre-training feature extraction algorithm
11) Pre-processing of DNA data
A1, the Gene Transcription Regulation Database (GTRD) is used as the source of systematically processed ChIP-Seq data; GTRD gathers its ChIP-Seq data from GEO;
the data are first screened using four different peak-calling tools (MACS, PICS, SISSRs and GEM), with which the data were reprocessed through a unified pipeline; this follows the systematic motif discovery carried out on more than five thousand ChIP-Seq experiments that were uniformly processed within the BioUML framework using several ChIP-Seq peak-calling tools and aggregated in GTRD; GTRD collects ChIP-Seq data from the Gene Expression Omnibus (GEO). Specifically:
for each peak caller, an empirical distribution of the number of peaks is constructed from all the peak sets obtained with that peak caller (ignoring peak sets that contain zero peaks). This allows the number of peaks N in each peak set to be replaced by the empirical weight of the lighter tail, defined as S = min(P(≥N), P(≤N)), where P(≥N) and P(≤N) correspond to a two-sided test and P is the empirical probability that a peak set contains the given number of peaks. A lower S value corresponds to a less probable (too large or too small) number of peaks. Consistent data sets are expected to have similar S values across the different peak callers. Therefore, from the 4 peak sets of each data set, the one with the lowest S is iteratively excluded until the ratio between the largest and the smallest S in the data set is no more than 2. If only one peak set is left, the whole data set is deleted;
A2, GTRD is the source of the systematically processed ChIP-Seq data. GTRD provides 3311/2623 data sets and 12612/9938 peak sets for 602/354 human/mouse transcription factors, respectively. The coarse filter eliminates nearly one third of the peak sets. The experiment keeps the samples extracted by the four peak callers, such as MACS and GEM, as input samples with a fixed length of 300 bp to simplify network input, and discards data sets with a small sample size. In this embodiment, data sets that contain no more than 5000 samples after the iteration are removed: small data sets are unfavorable for training and testing once the model is built, and the features they contain are not necessarily accurate; a large number of experiments showed 5000 samples to be a suitable sample size boundary for the model of the invention;
A3, after noise has been removed by the iteration of A1 and data sets with a sufficiently large sample size have been selected by A2, sequences are selected from the remaining data sets by setting the sequence length. Because the model requires input samples of fixed length, the length is chosen within a range that preserves the binding site features as far as possible, which finally yields valid DNA sequence data of fixed length.
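The screening in steps A1 and A2 can be sketched as follows. The definition S = min(P(≥N), P(≤N)), the ratio threshold of 2 and the 5000-sample boundary come from the text; the function names, the dictionary layout and the example S values are assumptions made for illustration.

```python
def tail_weight(n_peaks, all_counts):
    """Empirical weight of the lighter tail, S = min(P(>=N), P(<=N)),
    computed over one peak caller's peak-set sizes (zero-peak sets ignored)."""
    counts = [c for c in all_counts if c > 0]
    p_ge = sum(c >= n_peaks for c in counts) / len(counts)
    p_le = sum(c <= n_peaks for c in counts) / len(counts)
    return min(p_ge, p_le)

def filter_peak_sets(s_by_caller, max_ratio=2.0):
    """Iteratively drop the peak set with the lowest S until the largest S is at
    most max_ratio times the smallest; return {} if only one peak set remains."""
    kept = dict(s_by_caller)
    while len(kept) > 1 and max(kept.values()) > max_ratio * min(kept.values()):
        del kept[min(kept, key=kept.get)]
    return kept if len(kept) > 1 else {}

def large_enough(n_samples, min_samples=5000):
    """Step A2: keep only data sets with more than min_samples samples."""
    return n_samples > min_samples

# Hypothetical S values for one data set, one per peak caller.
kept = filter_peak_sets({"macs": 0.40, "gem": 0.35, "pics": 0.30, "sissrs": 0.05})
dropped = filter_peak_sets({"macs": 0.50, "gem": 0.10})
```

In the first example only the outlying peak set is removed; in the second the data set collapses to a single peak set and is discarded as a whole.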
Step A3 specifically includes the following substeps:
A31, labeling the screened DNA sequences and dividing the data set into two parts: one part is used for training directly, and the other part is scrambled to generate negative (opposite) samples for training;
A32, encoding each DNA sequence into an image-like input using one-hot encoding, given a DNA sequence s = (s_1, s_2, …, s_l) of length l and a fixed motif scanner length m, where s_1, s_2, …, s_l denote the elements of the DNA sequence in a vector of length l;
A33, obtaining the encoded matrix S by the equation, where each column of S is the one-hot vector of A, C, G or T, represented by [1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T and [0,0,0,1]^T respectively.
As will be appreciated by those skilled in the art, A here represents the adenine deoxynucleotide, C the cytosine deoxynucleotide, G the guanine deoxynucleotide, and T the thymine deoxynucleotide.
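A minimal sketch of the one-hot encoding in substeps A32 and A33 is given below. It covers only the base-to-vector mapping stated in the text; any padding related to the motif scanner length m is not shown because the source does not reproduce the encoding equation. The function name and the row-major list layout are assumptions.

```python
ONE_HOT = {
    "A": [1, 0, 0, 0],
    "C": [0, 1, 0, 0],
    "G": [0, 0, 1, 0],
    "T": [0, 0, 0, 1],
}

def one_hot_encode(sequence):
    """Return a 4 x l matrix (list of 4 rows, in the order A, C, G, T)
    whose j-th column is the one-hot vector of the j-th base."""
    columns = [ONE_HOT[base] for base in sequence.upper()]
    return [list(row) for row in zip(*columns)]

S = one_hot_encode("ACGT")
```

Bases outside A, C, G and T (for example N) raise a `KeyError` in this sketch; how such bases should be handled is not stated in the text.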
12) Convolutional self-encoder feature extraction
Briefly, an autoencoder is a compressor that transforms its input and then outputs the same content as the input. In the convolutional autoencoder, a feature map is obtained by setting the filter size, and the pooling window size is set so as to select the strongest features.
The structure used to detect transcription factor binding sites in this example is an autoencoder, a neural network trained to reconstruct its input. As a method for detecting abnormal data, it belongs to the family of reconstruction-based methods, in which the algorithm is expected to reconstruct normal data with low error and abnormal data with higher error; the magnitude of the error is then used to decide whether the input data are normal. In the present invention, we evaluated multilayer perceptrons (MLPs), convolutional neural networks (CNNs) and recurrent autoencoders, in particular autoencoders composed of long short-term memory (LSTM) units. Unsupervised learning can learn the features of samples without labels, and the purpose of the convolutional autoencoder is to use the convolution and pooling operations of a convolutional neural network to extract invariant features without supervision. A CNN shares weights, and its input can be raw images, sequences and the like. The autoencoder can rapidly extract the essential components of an image. Combining the advantages of the autoencoder and the convolutional neural network, the invention provides a heuristic network training method based on a convolutional autoencoder (CAE).
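How a filter produces a feature map and how a pooling window keeps the strongest response can be sketched in plain Python. The filter here is simply the one-hot pattern of a hypothetical motif (TGA), not a learned parameter, and the valid (unpadded) convolution and non-overlapping pooling are assumptions chosen for brevity.

```python
ONE_HOT = {"A": (1, 0, 0, 0), "C": (0, 1, 0, 0), "G": (0, 0, 1, 0), "T": (0, 0, 0, 1)}

def encode(sequence):
    """4 x l one-hot matrix with rows in the order A, C, G, T."""
    return [list(row) for row in zip(*(ONE_HOT[b] for b in sequence))]

def conv1d(matrix, kernel):
    """Valid 1D cross-correlation of a 4 x l input with a 4 x m filter;
    returns a feature map of length l - m + 1."""
    l, m = len(matrix[0]), len(kernel[0])
    return [
        sum(matrix[r][i + j] * kernel[r][j] for r in range(4) for j in range(m))
        for i in range(l - m + 1)
    ]

def max_pool(feature_map, window):
    """Non-overlapping max pooling; the last window may be shorter."""
    return [max(feature_map[i:i + window]) for i in range(0, len(feature_map), window)]

feature_map = conv1d(encode("CCTGACC"), encode("TGA"))  # peak where TGA occurs
pooled = max_pool(feature_map, window=2)
```

The feature map peaks at the position where the filter pattern occurs, and pooling retains that peak while halving the resolution.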
The feature extraction of the convolutional autoencoder is implemented as follows:
21. The DNA sequences mapped to the D-dimensional space are read in by the autoencoder;
22. The autoencoder is pre-trained as a neural network to determine the initial values;
23. The filter and pooling window parameters obtained by training the convolutional autoencoder are imported, and the DNA sequences mapped to the D-dimensional space together with the scrambled sequences are used as the training input of the neural network.
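The hand-over of parameters in step 23 amounts to initializing the supervised network's encoder from the pre-trained autoencoder and discarding the decoder. The sketch below shows only this bookkeeping with plain dictionaries; the key names (`encoder`, `decoder`, `filters`, `pool_window`, `head`) and the values are hypothetical, and a real implementation would use the state of whichever deep learning framework is employed.

```python
import copy

def init_supervised_from_pretrained(pretrained, classifier_head):
    """Build the parameter set of the supervised model: encoder parameters are
    copied from the pre-trained autoencoder, the decoder is dropped, and a
    freshly initialized classifier head is attached."""
    return {
        "encoder": copy.deepcopy(pretrained["encoder"]),
        "head": copy.deepcopy(classifier_head),
    }

# Hypothetical pre-trained parameters.
pretrained = {
    "encoder": {"filters": [[0.2, -0.1, 0.4]], "pool_window": 2},
    "decoder": {"filters": [[0.3, 0.3, 0.3]]},
}
supervised = init_supervised_from_pretrained(pretrained, {"weights": [0.0], "bias": 0.0})
supervised["encoder"]["filters"][0][0] = 9.9  # later fine-tuning leaves the original untouched
```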
2) Improved algorithm for extracting sequence features
21) The data preprocessing of step 1) is applied to the data samples of DNA sequences from different cell lines to obtain the corresponding first component parameters, which are packaged into uniform interface parameters and displayed visually; the first component parameters are the variables of the neural network code that are extracted from the data samples;
22) The code required for creating the DNA data set is packaged into the corresponding component, and the variables in that code are handled as the second component parameters of the component;
the second component parameters include the first component parameters;
after the processing of step 22), the second component parameters are checked; once the data set has been created, they are packaged in the return value of the interface, and the created data set is passed to the next component;
23) The layers of the neural network are constructed from the set parameters, which completes the model configuration;
the configured model is then trained, the training is displayed visually, and the trained model file is stored in the configured path; the training result is saved, the full path of the model file is passed as a parameter, the processed prediction data are passed into the component, and the prediction of the trained model is finally obtained.
Through steps 21) to 23), the preliminary data preprocessing yields the feature extraction diagram shown in FIG. 3 via heuristic training and classification training.
3) Feature selection and binding site recognition
After the model network has been initialized, it is fine-tuned by backpropagation using the labeled data, with a sigmoid classifier. We use the unsupervised convolutional autoencoder network for initialization in order to reduce the noise introduced by the negative samples while ensuring that most useful features are extracted. Conventionally, the convolutional layers are followed by a fully connected MLP layer. A newer technique, the fully connected highway network, has proven effective for deeper representations. A highway network uses gating units that learn to regulate the flow of information through the network. Previous work has shown that a highway MLP placed after a series of convolutions is more effective than a standard MLP, and hypothesizes that highway networks are particularly well suited to convolutional layers because they can adaptively combine local features. After the max-pooled output of the convolutional layers, we use a fully connected highway network of depth 5. The output of the highway MLP is fed to a two-class sigmoid function.
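The gating described above can be written out for a single highway layer in the commonly used form y = T(x) * H(x) + (1 - T(x)) * x, where T is a sigmoid transform gate and H is a nonlinear transform. The sketch below is a plain-Python illustration; the ReLU nonlinearity for H and the toy weights are assumptions, since the patent specifies only the depth of 5 and the sigmoid output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(weights, bias, x):
    """Compute W x + b for a list-of-rows weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def highway_layer(x, w_h, b_h, w_t, b_t):
    """y = T(x) * H(x) + (1 - T(x)) * x with a ReLU transform H and sigmoid gate T."""
    h = [max(0.0, v) for v in affine(w_h, b_h, x)]
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
x = [0.5, -1.0]

# Gate closed (strongly negative bias): the layer carries its input through.
carried = highway_layer(x, identity, [0.0, 0.0], zeros, [-20.0, -20.0])
# Gate open (strongly positive bias): the layer outputs the transform H(x).
transformed = highway_layer(x, identity, [0.0, 0.0], zeros, [20.0, 20.0])

# A depth-5 stack, as in the text, applies the layer five times in sequence.
y = x
for _ in range(5):
    y = highway_layer(y, identity, [0.0, 0.0], zeros, [-20.0, -20.0])
```

Because each layer can fall back to copying its input, stacking five of them does not force every layer to transform the features, which is the property the text refers to as regulating the flow of information.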
The feature-enhanced model training diagram is shown in FIG. 4. The feature values of the complete DNA fragments are extracted as follows: 10 human transcription factor data sets are selected as the feature sets for the experiment; dimension reduction and prediction are performed on the feature sets with the algorithm based on the convolutional neural network and the fully connected highway network; training uses cross-validation; and the model is finally assessed with evaluation functions.
The evaluation functions include a precision function, a recall function and an accuracy function.
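The three evaluation functions named above follow their standard definitions and can be sketched as below. The function names and the toy labels are illustrative assumptions; binary labels 1 (binding site) and 0 (no binding site) are assumed, and the code expects at least one predicted positive and one true positive class member.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

def precision(y_true, y_pred):
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)

def recall(y_true, y_pred):
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)

def accuracy(y_true, y_pred):
    tp, _, _, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)

# Toy labels and thresholded predictions (hypothetical).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]
```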
The technical innovations and characteristics of the invention are as follows. The invention draws on an interdisciplinary background, spanning biology and computer science, to make predictions from gene data. By analyzing biological data, the data are integrated and mined in depth to discover correlation patterns. Integrating large amounts of data-source information can provide more valuable information for prediction; a deep model with the capacity to accommodate more information is therefore needed. Computation with artificial neural networks is a large-scale algorithmic approach.
1) Research background innovation
Computer science and the life sciences are combined in an interdisciplinary way, and mature algorithms from machine learning, data mining and pattern recognition are used to analyze and solve biological problems. The project applies these methods to the problem of transcription factor binding site prediction, aiming to increase prediction accuracy and to provide guidance and direction for experiments on specific systems and fields.
2) Improved adaptive action fragment extraction algorithm
In bioinformatics, features are often computed from the physicochemical properties or structural information of the molecules themselves, but such features contain only information about the molecule itself and lack any description of the data's place in the overall data network. Studies in computational biology have shown that the composition of living systems is closely related to their interaction networks. The invention therefore provides a feature calculation method in which an artificial neural network simulates the biological network to search for these dependencies, laying a foundation for predicting them accurately.
3) Improved adaptive feature extraction algorithm
Common transcription factor binding site prediction methods use supervised learning. When unlabeled data sources are present, the prediction model needs to be improved to accommodate more information. For multiple data sources, the invention provides an algorithm combining unsupervised and supervised learning, which improves prediction accuracy. The method extracts features from DNA sequence fragments and reduces their dimensionality to obtain a feature vector set with good representational power and low computational cost; this feature vector set is then used to construct a transcription factor binding site recognition model that achieves good accuracy.
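The combination of unsupervised and supervised learning can be pictured as a two-stage hand-over of filter weights. The sketch below is a deliberate simplification and not the patented network: a linear autoencoder with tied weights, trained on flattened one-hot sequence windows, stands in for the convolutional autoencoder, and all names (`pretrain_filters`, `init_supervised`) and hyperparameters are hypothetical.

```python
import numpy as np

def autoencoder_loss(X, W):
    """Mean reconstruction error of a tied-weight linear autoencoder: x_hat = x W W^T."""
    E = X @ W @ W.T - X
    return 0.5 * np.mean(np.sum(E ** 2, axis=1))

def pretrain_filters(X, n_filters, steps=300, lr=0.01, seed=0):
    """Stage 1 (unsupervised): learn filters from unlabeled windows by gradient descent."""
    rng = np.random.default_rng(seed)
    W = 0.1 * rng.standard_normal((X.shape[1], n_filters))
    n = X.shape[0]
    for _ in range(steps):
        E = X @ W @ W.T - X                          # reconstruction error
        grad = (X.T @ (E @ W) + E.T @ (X @ W)) / n   # gradient of autoencoder_loss w.r.t. W
        W -= lr * grad
    return W

def init_supervised(W_pretrained, seed=0):
    """Stage 2 (supervised): start the classifier from the pretrained filters."""
    rng = np.random.default_rng(seed)
    return {
        "filters": W_pretrained.copy(),  # imported from the unsupervised stage
        "w_out": 0.01 * rng.standard_normal(W_pretrained.shape[1]),
        "b_out": 0.0,
    }
```

In this sketch each column of `W` plays the role of one motif filter. The supervised stage would then fine-tune `filters` together with the output weights on the labeled data.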
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (5)

1. A method for transcription factor binding site prediction based on a deep convolution automatic encoder, comprising:
S1, specifically enriching DNA fragments bound to the target protein by chromatin immunoprecipitation, thereby obtaining an original data set;
S2, preprocessing the original data set to obtain a training data set;
S3, inputting the training data set into a convolution automatic encoder for training;
S4, performing binding site recognition with the trained convolution automatic encoder.
2. The method for predicting the transcription factor binding site based on the deep convolution automatic encoder as claimed in claim 1, wherein the preprocessing of step S2 is specifically:
A1, screening the original data set: the original data set is reprocessed through a unified pipeline using four different peak-calling tools; from the 4 peak sets of each data set, the peak set with the lowest S is iteratively excluded until the ratio between the largest and smallest S in the data set is less than or equal to 2, and the entire data set is deleted if only one peak set remains; S denotes the empirical weight of the light tail;
A2, removing the data sets with no more than 5000 samples after the processing of step A1;
A3, selecting from the data sets processed in step A2 by setting the sequence length, to obtain valid fixed-length DNA sequence data.
3. The method for transcription factor binding site prediction based on deep convolution automatic encoder according to claim 2, wherein step A3 specifically includes the following sub-steps:
A31, labeling the DNA sequences screened in step A2 and dividing the data set into two parts: for one part, negative samples are generated by shuffling the sequences; the other part is mapped to a D-dimensional space;
A32, encoding the DNA sequences processed in step A31 using one-hot encoding, given a DNA sequence s = (s_1, s_2, …, s_l) of length l and a fixed motif scanner length m;
A33, obtaining an encoded matrix S by the equation, the columns of matrix S corresponding to the one-hot vectors of A, C, G or T, represented by [1,0,0,0]^T, [0,1,0,0]^T, [0,0,1,0]^T and [0,0,0,1]^T.
4. The method for predicting the transcription factor binding site based on the deep convolution automatic encoder as claimed in claim 3, wherein the step S3 is specifically as follows:
S31, inputting the training set processed in step S2 into an unsupervised convolution automatic encoder for training;
S32, importing the filter and pooling-window parameters of the trained unsupervised convolution automatic encoder into the supervised convolution automatic encoder;
S33, inputting the training set processed in step S2 into the supervised convolution automatic encoder for training.
5. The method of claim 4, wherein in step S5 the supervised convolution automatic encoder replaces the original MLP layer with a fully connected highway network after the max-pooled output of the convolutional layers.
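The one-hot encoding and shuffled negative samples recited in claim 3 can be illustrated with the following minimal sketch. It assumes the row order A, C, G, T and a 4 × l matrix whose columns are the one-hot vectors; any padding related to the motif scanner length m is not specified here and is left out. The function names are illustrative.

```python
import random
import numpy as np

ALPHABET = "ACGT"  # assumed row order of the encoded matrix

def one_hot_encode(seq):
    """Encode a DNA sequence of length l as a 4 x l matrix with one-hot columns."""
    S = np.zeros((4, len(seq)))
    for j, base in enumerate(seq.upper()):
        # str.index raises ValueError for symbols outside A/C/G/T (for example N).
        S[ALPHABET.index(base), j] = 1.0
    return S

def shuffled_negative(seq, seed=0):
    """Build a negative sample by shuffling the bases, which preserves base composition."""
    bases = list(seq)
    random.Random(seed).shuffle(bases)
    return "".join(bases)
```

A positive sequence and its shuffled counterpart have the same nucleotide counts, so a classifier has to rely on the order of the bases rather than on composition alone.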
CN202010115572.XA 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder Active CN111312329B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010115572.XA CN111312329B (en) 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder

Publications (2)

Publication Number Publication Date
CN111312329A true CN111312329A (en) 2020-06-19
CN111312329B CN111312329B (en) 2023-03-24

Family

ID=71161946

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010115572.XA Active CN111312329B (en) 2020-02-25 2020-02-25 Transcription factor binding site prediction method based on deep convolution automatic encoder

Country Status (1)

Country Link
CN (1) CN111312329B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107563430A (en) * 2017-08-28 2018-01-09 昆明理工大学 A kind of convolutional neural networks algorithm optimization method based on sparse autocoder and gray scale correlation fractal dimension
CN108763865A (en) * 2018-05-21 2018-11-06 成都信息工程大学 A kind of integrated learning approach of prediction DNA protein binding sites
CN109145948A (en) * 2018-07-18 2019-01-04 宁波沙塔信息技术有限公司 A kind of injection molding machine putty method for detecting abnormality based on integrated study
CN109215740A (en) * 2018-11-06 2019-01-15 中山大学 Full-length genome RNA secondary structure prediction method based on Xgboost
CN109559781A (en) * 2018-10-24 2019-04-02 成都信息工程大学 A kind of two-way LSTM and CNN model that prediction DNA- protein combines
CN110010201A (en) * 2019-04-16 2019-07-12 山东农业大学 A kind of site recognition methods of RNA alternative splicing and system
CN110334809A (en) * 2019-07-03 2019-10-15 成都淞幸科技有限责任公司 A kind of Component encapsulating method and system of intelligent algorithm
CN110832596A (en) * 2017-10-16 2020-02-21 因美纳有限公司 Deep convolutional neural network training method based on deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
IVAN V KULAKOVSKIY et al.: "HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis", 《NUCLEIC ACIDS RESEARCH》 *
JINYU YANG et al.: "Prediction of regulatory motifs from human Chip-sequencing data using a deep learning framework", 《NUCLEIC ACIDS RESEARCH》 *
YUANQI ZENG et al.: "A Transcription Factor Binding Site Prediction Algorithm Based on Semi-Supervised Learning", 《2019 16TH INTERNATIONAL COMPUTER CONFERENCE ON WAVELET ACTIVE MEDIA TECHNOLOGY AND INFORMATION PROCESSING》 *
刘欢 (LIU Huan): "Research on DNA-protein binding site recognition based on DNase high-throughput sequencing information" (基于DNase高通测序信息的DNA蛋白结合位点识别研究), 《China Master's Theses Full-text Database, Information Science and Technology》 *
杨比特 (YANG Bite): "Research on enhancer regulatory sequence recognition based on deep learning" (基于深度学习的增强子调控序列识别研究), 《China Master's Theses Full-text Database, Basic Sciences》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112349349A (en) * 2020-11-06 2021-02-09 西安奥卡云数据科技有限公司 Transcription factor binding site recognition discovery method and device based on Spark Streaming
CN113160877A (en) * 2021-01-11 2021-07-23 东南大学 Prediction method of cell-specific genome G-quadruplex
CN112735514B (en) * 2021-01-18 2022-09-16 清华大学 Training and visualization method and system for neural network extraction regulation and control DNA combination mode
CN112735514A (en) * 2021-01-18 2021-04-30 清华大学 Training and visualization method and system for neural network extraction regulation and control DNA combination mode
CN112932499A (en) * 2021-01-28 2021-06-11 晨思(广州)医疗科技有限公司 Network training and single-lead-connection electrocardiogram data processing method, computer device and medium
CN113035280A (en) * 2021-03-02 2021-06-25 四川大学 RBP binding site prediction algorithm based on deep learning
CN113035280B (en) * 2021-03-02 2022-03-11 四川大学 RBP binding site prediction algorithm based on deep learning
CN112992267A (en) * 2021-04-13 2021-06-18 中国人民解放军军事科学院军事医学研究院 Single-cell transcription factor regulation network prediction method and device
CN112992267B (en) * 2021-04-13 2024-02-09 中国人民解放军军事科学院军事医学研究院 Single-cell transcription factor regulation network prediction method and device
CN113593634A (en) * 2021-08-06 2021-11-02 中国海洋大学 Transcription factor binding site prediction method fusing DNA shape characteristics
CN113593634B (en) * 2021-08-06 2022-03-11 中国海洋大学 Transcription factor binding site prediction method fusing DNA shape characteristics
US20230291668A1 (en) * 2022-03-09 2023-09-14 Nozomi Networks Sagl Method for detecting anomalies in time series data produced by devices of an infrastructure in a network
US11831527B2 (en) * 2022-03-09 2023-11-28 Nozomi Networks Sagl Method for detecting anomalies in time series data produced by devices of an infrastructure in a network
CN114842914B (en) * 2022-04-24 2024-04-05 山东大学 Deep learning-based chromatin ring prediction method and system
CN114842914A (en) * 2022-04-24 2022-08-02 山东大学 Chromatin loop prediction method and system based on deep learning
CN114864002A (en) * 2022-04-28 2022-08-05 广西科学院 Transcription factor binding site recognition method based on deep learning
CN114864002B (en) * 2022-04-28 2023-03-10 广西科学院 Transcription factor binding site recognition method based on deep learning
CN114639441B (en) * 2022-05-18 2022-08-05 山东建筑大学 Transcription factor binding site prediction method based on weighted multi-granularity scanning
CN114639441A (en) * 2022-05-18 2022-06-17 山东建筑大学 Transcription factor binding site prediction method based on weighted multi-granularity scanning
CN116153404A (en) * 2023-02-28 2023-05-23 成都信息工程大学 Single-cell ATAC-seq data analysis method
CN116153404B (en) * 2023-02-28 2023-08-15 成都信息工程大学 Single-cell ATAC-seq data analysis method
CN116403645B (en) * 2023-03-03 2024-01-09 阿里巴巴(中国)有限公司 Method and device for predicting transcription factor binding site
CN116403645A (en) * 2023-03-03 2023-07-07 阿里巴巴(中国)有限公司 Method and device for predicting transcription factor binding site
CN116884495A (en) * 2023-08-07 2023-10-13 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method
CN116884495B (en) * 2023-08-07 2024-03-08 成都信息工程大学 Diffusion model-based long tail chromatin state prediction method

Also Published As

Publication number Publication date
CN111312329B (en) 2023-03-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant