CN113096732A - Motif mining method based on a deep embedded convolutional neural network - Google Patents
Motif mining method based on a deep embedded convolutional neural network
- Publication number
- CN113096732A (application number CN202110509307.4A)
- Authority
- CN
- China
- Prior art keywords
- model
- embedded
- neural network
- convolutional
- edeepcnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a motif mining method based on a deep embedded convolutional neural network, comprising the following steps: S1, constructing the deep embedded convolutional neural network (eDeepCNN) model; S2, applying K-mer encoding to DNA sequences, using embedding vectors as the input representation of the K-mers in the model, training the model on this data set, and performing feature extraction and binding prediction; S3, comparing the eDeepCNN model with shallow networks and verifying its superiority. In the invention, K-mer encoding explicitly models the dependency between adjacent nucleotides in a DNA sequence, implicitly capturing DNA shape information, and the high-dimensional embedding vectors can fully represent the latent information contained in each K-mer.
Description
Technical Field
The invention relates to the technical field of computer recognition and deep learning, and in particular to a motif mining method based on a deep embedded convolutional neural network.
Background
Transcription factors play an important role in biological processes such as gene transcription, repair, and regulation. Genetic variation at transcription factor binding sites is closely related to several serious diseases. Mining transcription factor binding sites, also called motif mining, is therefore important for understanding the regulatory mechanisms of transcription factors. Traditionally, transcription factor binding sites are represented by a position weight matrix (PWM), computed by aligning motif sequences and counting the nucleotide distribution at each position. However, the PWM focuses only on the nucleotide distribution of the motif sequence and ignores information from the motif's neighbouring sequences; case studies show that the context sequence of a motif has a significant influence on binding behavior. Inspired by the position weight matrix, DeepBind built a single-layer convolutional neural network model for the motif mining task, and research shows that the nucleotide distribution of sequences adjacent to a binding site has an important influence on binding behavior. In real biological processes, multiple transcription factors may cooperate to affect the binding process. There may therefore be motif-motif interactions within a sequence, and a single-layer convolutional network cannot handle this case either.
A PWM assumes that the nucleotides in a DNA sequence are mutually independent and is only a coarse approximation of the true physical process. DeepBind uses one-hot encoding of single nucleotides, which is simple and intuitive but cannot fully express the interaction between adjacent nucleotides. A motif mining method based on a deep embedded convolutional neural network is therefore urgently needed.
Disclosure of Invention
The invention aims, for the transcription factor binding prediction task, to capture the interaction between a motif and its adjacent nucleotide sequence, and to construct a deep convolutional network model, eDeepCNN, on the basis of the DeepBind model.
In order to achieve the purpose, the invention provides the following scheme:
a motif mining method based on a deep embedded convolutional neural network comprises the following steps:
s1, constructing a deep embedded convolutional neural network eDeepCNN model;
s2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
s3, comparing the eDeepCNN model with a shallow network, and verifying the superiority of the eDeepCNN model.
Preferably, the eDeepCNN model in S1 includes three convolutional layers, and a local max pooling layer and a dropout layer are placed after each convolutional layer to help the deep embedded convolutional neural network model resist overfitting during training.
Preferably, the three convolutional layers are a first, a second, and a third convolutional layer; the first convolutional layer is responsible for extracting local sequence patterns, while the second and third convolutional layers model the interactions between those local patterns.
Preferably, the first convolutional layer computes a motif score sequence, which serves as the input to the second convolutional layer; the second layer identifies local distribution patterns of the score sequence, thereby capturing the interaction between a motif and its adjacent sequence. The third convolutional layer operates in the same way as the second.
Preferably, each embedding vector in S2 is a point in a high-dimensional latent space, and the relative positions of the embedding vectors of different K-mers in that space represent the interaction relationships between the K-mers; a one-to-one mapping between K-mer indices and their embedding vectors is implemented, yielding a sequence composed of K-mer indices.
Preferably, the embedding vector corresponding to each K-mer index is found by table lookup, the embedding vectors in order form a two-dimensional array, and the two-dimensional array is converted into an embedding vector matrix through an embedding vector layer.
Preferably, in S2, before training, the embedded vector matrix is randomly initialized, and the embedded vectors corresponding to the K-mers are adjusted and optimized according to training data.
Preferably, in S3, a five-fold cross-validation strategy is adopted for evaluating the accuracy of the eDeepCNN model.
The invention has the beneficial effects that:
the invention provides a method for combining K-mer coding and embedded vector representation and a deep embedded convolutional neural network eDeepCNN by capturing the interaction of a motif and an adjacent nucleotide sequence aiming at a transcription factor binding prediction task. Compared with a single-layer convolutional network, the multilayer convolutional network can capture the context information of the motif sequence and the interaction between the motif and the adjacent sequence, and the fitting capability of the convolutional neural network is fully utilized. The PBM model assumes mutual independence between adjacent nucleotides, the K-mer coding explicitly models the dependency relationship of the adjacent nucleotides in the DNA sequence, the shape information of the DNA sequence is implicit, the embedded vector representation has stronger representation capability and more flexibility compared with the one-hot coding, and the implicit information contained by the K-mer can be fully characterized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a deep-embedded convolutional neural network model structure according to the present invention;
FIG. 3 is a diagram illustrating a comparison between the one-hot encoding and the K-mer encoding of the present invention;
FIG. 4 is a schematic comparison of the neural network structure before and after applying the dropout strategy according to the present invention;
FIG. 5 is a schematic diagram of the model training and evaluation process under five-fold cross-validation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
A motif mining method based on a deep embedded convolutional neural network, whose flow is shown in FIG. 1, comprises:
S1, constructing a deep embedded convolutional neural network eDeepCNN model (shown in the attached figure 2);
deep convolutional network depcnn operated by three layers of convolution with loss and local pooling strategies. The first layer of convolution extracts the local pattern features of the DNA sequence, and calculates scores for all possible local motifs, which is the same as the Deepbind model. Second and third convolutional layers capable of capturing the interaction of motifs and adjacent sequences. The second convolutional layer receives as input the sequence of motif scores calculated by the first convolutional operation and identifies the local distribution pattern of the sequence of scores, and takes into account the interaction between adjacent motifs or the interaction between a motif and an adjacent sequence. According to the same logic, the third convolution layer has a larger receptive field than the second convolution layer, and can capture the interaction between local modes in a larger range in the sequence. Meanwhile, after the interaction of the local modes is preliminarily extracted through the convolution operation of the second layer, the third convolution layer can consider the high-order interaction between the local modes. Finally, the wider receptive field of the multilayer convolutional network can also adapt to the condition that the binding regions of the transcription factors are different in size. The fitting capability of the model is improved after the multilayer convolutional networks are combined, and the candidate sequence can be more comprehensively modeled. A local max pooling layer and a missing layer are laid down after each convolutional layer. The loss strategy plays an important role in the model. Because the number and complexity of model parameters are improved by the plurality of convolutional layers, the loss strategy can help the model to resist the over-fitting phenomenon in the training process so as to improve the model performance. 
After the convolutional network, a global maximum pooling layer is used to capture the global features of the DNA sequence and form a fixed-length feature vector to be sent to the fully-connected network for final prediction.
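The forward pass described above (three convolution blocks, each followed by local max pooling, then global max pooling into a fixed-length feature vector) can be sketched in plain numpy. This is a minimal illustration of the shape flow under assumed hyper-parameters (kernel widths, channel counts, ReLU activation, pool size); the patent's actual values are in Table 1 and are not reproduced here.

```python
import numpy as np

def conv1d(x, kernels):
    """Naive 1-D convolution with ReLU. x: (L, C_in); kernels: (K, W, C_in) -> (L-W+1, K)."""
    L, _ = x.shape
    K, W, _ = kernels.shape
    out = np.empty((L - W + 1, K))
    for i in range(L - W + 1):
        window = x[i:i + W]                    # (W, C_in)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)                # ReLU

def local_max_pool(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    L = (x.shape[0] // size) * size
    return x[:L].reshape(-1, size, x.shape[1]).max(axis=1)

def edeepcnn_forward(x, layers, pool=2):
    """Three conv + local-max-pool blocks, then global max pooling over positions."""
    for kernels in layers:
        x = local_max_pool(conv1d(x, kernels), pool)
    return x.max(axis=0)                       # fixed-length feature vector

rng = np.random.default_rng(0)
seq = rng.normal(size=(100, 16))               # embedded DNA sequence, e.g. 2-mer dim 16
layers = [rng.normal(size=(32, 8, 16)) * 0.1,  # conv1: motif scores
          rng.normal(size=(32, 4, 32)) * 0.1,  # conv2: motif-motif interactions
          rng.normal(size=(32, 4, 32)) * 0.1]  # conv3: higher-order interactions
features = edeepcnn_forward(seq, layers)
print(features.shape)                          # (32,)
```

The fixed-length vector `features` would then be fed to the fully-connected network for the final prediction.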
S2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
K-mer encoding uses a sliding window of length k and treats the k adjacent nucleotides in the window as the basic building block of the DNA sequence, directly and conveniently characterising the interdependence of adjacent nucleotides. When k = 1, a single nucleotide is the basic unit of the DNA sequence, giving 4 single nucleotides (A, C, G, T); this amounts to assuming, at the encoding level, that nucleotides are mutually independent. When k = 2, two adjacent nucleotides are treated as a whole, giving 16 (4²) dinucleotides: AA, AC, AG, AT, CA, CC, CG, CT, …, TA, TC, TG, TT; dinucleotide encoding explicitly accounts for the interaction between two neighbouring nucleotides. Similarly, when k = 3 there are 64 (4³) independent trinucleotides, allowing the dependence between three adjacent nucleotides to be modelled directly. The number of independent K-mers grows exponentially with k, so the total increases sharply as k increases.
The eDeepCNN model outperformed the comparison methods. With 1-mer encoding, the average R² of the eDeepCNN-1mer model over 20 data sets reaches 0.59, an absolute improvement of 0.025 (a relative gain of about 4%) over the one-hot-encoded DeepCNN model, and an absolute improvement of 0.109 (a relative gain of about 22%) over the single-layer convolutional DeepBind model.
After combining K-mer encoding with the embedding vector representation, the model metrics improve further: over 10 data sets, the average score of the eDeepCNN-2mer model is 0.596, higher than the 0.573 of the eDeepCNN-1mer model, a relative improvement of about 4%.
In this embodiment, the candidate sequence is traversed with a sliding window of length k, and the K-mer inside the window is recorded as its corresponding index, thereby converting the DNA sequence into an array that can be computed on. FIG. 3 illustrates the comparison between one-hot encoding and K-mer encoding. For example, when k = 2 there are 16 independent 2-mers: the dinucleotide AA corresponds to index 0, AC to 1, TG to 14, and TT to 15. For a K-mer s = s₁s₂…s_k, the corresponding index d(s) is:

    d(s) = Σ_{i=1..k} 4^(k−i) · d(s_i),  with d(A) = 0, d(C) = 1, d(G) = 2, d(T) = 3,

where s is the input K-mer, s_i is the nucleotide at position i of the K-mer, d(s_i) maps that nucleotide to its number, and d(s) outputs the index of the whole K-mer.
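The index computation and sliding-window encoding described above can be sketched as follows; the function names are illustrative, not from the patent.

```python
def kmer_index(kmer):
    """Map a K-mer string to its integer index, treating A,C,G,T as base-4 digits 0..3."""
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    idx = 0
    for ch in kmer:
        idx = idx * 4 + base[ch]   # Horner's rule for d(s) = sum 4^(k-i) * d(s_i)
    return idx

def encode_sequence(seq, k):
    """Slide a window of length k over the sequence and record each K-mer's index."""
    return [kmer_index(seq[i:i + k]) for i in range(len(seq) - k + 1)]

print(kmer_index("AA"), kmer_index("AC"), kmer_index("TG"), kmer_index("TT"))  # 0 1 14 15
print(encode_sequence("ACGT", 2))  # [1, 6, 11]
```

The printed values match the 2-mer examples in the text (AA → 0, AC → 1, TG → 14, TT → 15).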
Embedding vector representations are widely used in many fields such as natural language processing, information extraction, and recommendation systems. An embedding vector is a point in a high-dimensional latent space whose position carries rich information, giving it stronger representational power than one-hot encoding; the relative positions of the embedding vectors of different K-mers in this space can better represent the interaction relationships between K-mers. On the other hand, under one-hot encoding there are 4^k independent K-mers in total, each represented by a 4^k-dimensional one-hot vector. When k = 1 there are 4 different 1-mers; when k = 2 there are 16 different 2-mers; when k = 5 there are 1024 independent 5-mers, yielding a 1024-dimensional one-hot vector. The dimensionality of one-hot encoding rises exponentially with k, causing the number of model parameters to explode and making training difficult. Embedding vector encoding effectively reduces the dimensionality of the convolutional network's input and avoids the parameter explosion problem. Moreover, the embedding dimension is variable, so an optimal value can be searched for during model training, giving the approach great flexibility.
In this embodiment, a one-to-one mapping is constructed between K-mer indices and their embedding vectors; K-mer encoding of a candidate sequence yields a sequence of K-mer indices. The embedding vector corresponding to each index is found by table lookup, and the embedding vectors are assembled in order into a two-dimensional array. In the neural network, this conversion is realised by an embedding layer, which maintains an embedding vector matrix in which each row is the embedding vector for the corresponding index; in operation, the rows of the matrix are selected according to the indices of the input sequence. Before training begins, the embedding vector matrix is randomly initialised, and the embedding vectors of the K-mers are gradually adjusted and optimised according to the training data.
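The embedding-layer lookup described here is plain row indexing into a randomly initialised matrix; a minimal numpy sketch, with the embedding dimension (8) chosen arbitrarily for illustration:

```python
import numpy as np

k, dim = 2, 8
rng = np.random.default_rng(0)
# Randomly initialised embedding matrix: one row per K-mer; rows are tuned during training.
embedding = rng.normal(scale=0.1, size=(4 ** k, dim))

indices = [1, 6, 11]          # K-mer index sequence, e.g. "ACGT" encoded with k = 2
matrix = embedding[indices]   # table lookup -> (sequence length, embedding dim)
print(matrix.shape)           # (3, 8)
```

The resulting two-dimensional array `matrix` is what the first convolutional layer consumes.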
An optional embedding vector layer is added before the convolutional network, and the corresponding model is called eDeepCNN. The detailed parameter settings of the model, including the width and number of convolution kernels in each convolutional layer, are listed in Table 1 below. Some hyper-parameter settings inherit from DeepBind, the classic model for the motif mining task, whose values have proved to be good choices; the remaining hyper-parameters are determined by grid search during training.
TABLE 1
S3, comparing the deep embedded convolutional neural network eDeepCNN model with a shallow layer network, and verifying the superiority of the deep embedded convolutional neural network eDeepCNN model.
In the machine learning paradigm, the whole data set is divided into a training set and a test set; the model fits the task objective by optimising its parameters on the training set and learning the regularities of the training data, and the trained model is then applied to the test set to check its real effect. The key point is that the model should learn the general rules of the task from the training set so that it also performs well on the test set. In practice, however, a model tends to over-fit the noise in the training data, or learns rules specific to the training set that do not transfer to the test set. In that case the model performs well on the training set but poorly on the test set; this is the so-called overfitting phenomenon.
The dropout strategy is an effective means of combating overfitting in deep neural networks. During training, dropout randomly masks a subset of neurons at each parameter update, forcing their output values to zero, which is equivalent to temporarily discarding those neurons from the network. During that update, the weights of the discarded neurons remain unchanged, as shown in FIG. 4.
In this embodiment, the dropout strategy sets each neuron output to zero with probability p during training. To compensate for the resulting reduction of the input to the next layer, the unmasked outputs are scaled by a factor of 1/(1 − p). During testing, the neuron outputs are left unchanged, with no masking. The dropout computation is as follows:

    r_j ~ Bernoulli(1 − p)
    x̃ = (r ⊙ x) / (1 − p)
    y = f(w · x̃ + b)

where each component of the mask r is 0 with probability p and 1 otherwise, x is the neuron input vector, w and b are the network weight and bias parameters, ⊙ denotes element-wise multiplication, and f is the neuron activation function.
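The inverted-dropout scheme described here (zero with probability p during training, rescale survivors by 1/(1 − p), identity at test time) is a few lines of numpy; this sketch is illustrative, not the patent's implementation:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training and
    rescale survivors by 1/(1-p) so the expected value is preserved; identity at test time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p       # keep with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, 0.3, training=True, rng=rng)
print(abs(y.mean() - 1.0) < 0.05)                      # expectation preserved -> True
print(dropout(x, 0.3, training=False, rng=rng) is x)   # unchanged at test time -> True
```

Because the rescaling happens at training time, no adjustment is needed when the trained network is evaluated.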
Because the multiple convolutional layers increase the number and complexity of model parameters, the risk of overfitting rises substantially. Dropout layers are therefore placed in both the convolutional network and the fully-connected network and, combined with an L2 regularisation strategy, help the model resist overfitting during training and improve performance.
The coefficient of determination R² is used to measure the correlation between predicted outputs and measured values; the R² coefficient has been used in past research to measure a model's predictive performance on PBM in vitro data sets. R² is computed as:

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where y_i is the label value of sample i, ȳ is the mean of the label values, and ŷ_i is the predicted value for sample i.

1 − R² is thus the ratio of the mean squared error between the regression model's predictions and the measured values to the intrinsic variance of the measured data. The closer R² is to 1, the smaller the model's prediction error relative to the intrinsic variance of the data set, and the better the model's predictive behavior. Because R² is normalised by the intrinsic variance of each data set, the model's performance can be compared across different transcription factor data sets, and the evaluation metric can be averaged over multiple data sets to better gauge performance on the transcription factor binding task.
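The R² metric used here is a one-line ratio of sums of squares; a small sketch with a worked example:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (sum of squared errors) / (total label variance)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sse = np.sum((y_true - y_pred) ** 2)            # prediction error
    sst = np.sum((y_true - y_true.mean()) ** 2)     # intrinsic variance of the labels
    return 1.0 - sse / sst

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0 (perfect prediction)
print(r_squared([1, 2, 3, 4], [2, 2, 3, 3]))   # 0.6 (sse = 2, sst = 5)
```

The normalisation by `sst` is what makes scores comparable across transcription factor data sets with different intrinsic variances.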
To evaluate the model's performance accurately, a five-fold cross-validation strategy is used: the experiment is repeated five times, each time with a different partition into training and test sets. The whole data set is randomly divided into five equal parts; in each experiment four parts serve as the training set and the remaining part as the test set, so the five experiments select five different test sets in turn. During training, one eighth of the training set is randomly sampled as a validation set. The final performance of the model is the mean of its R² on the test sets over the five cross-validation runs. The five-fold cross-validation procedure is illustrated in FIG. 5.
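The five-fold partitioning described above can be sketched as an index-splitting generator; the seed and fold mechanics are illustrative assumptions (the further hold-out of one eighth for validation is noted in a comment rather than implemented):

```python
import numpy as np

def five_fold_splits(n, seed=0):
    """Randomly partition n sample indices into five equal folds; each fold in turn
    serves as the test set while the remaining four folds form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        # one eighth of `train` would further be sampled as a validation set
        yield train, test

n = 100
for train, test in five_fold_splits(n):
    assert len(test) == 20 and len(train) == 80
    assert set(train) | set(test) == set(range(n))   # folds are disjoint and exhaustive
print("five disjoint folds verified")
```

Averaging the per-fold test R² over the five runs gives the final score reported for each model.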
For the transcription factor binding prediction task, the invention captures the interaction between a motif and its adjacent nucleotide sequence by combining K-mer encoding with embedding vector representations in the deep embedded convolutional neural network eDeepCNN. Compared with a single-layer convolutional network, the multi-layer convolutional network can capture the context information of the motif sequence and the interaction between the motif and its adjacent sequence, making full use of the fitting capability of a convolutional neural network. Whereas the PWM model assumes mutual independence of adjacent nucleotides, K-mer encoding explicitly models the dependency between adjacent nucleotides in the DNA sequence and implicitly encodes DNA shape information; the embedding vector representation has stronger representational power and greater flexibility than one-hot encoding and can fully characterise the latent information contained in each K-mer.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (8)
1. A motif mining method based on a deep embedded convolutional neural network, characterised by comprising the following steps:
s1, constructing a deep embedded convolutional neural network eDeepCNN model;
s2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
s3, comparing the eDeepCNN model with a shallow network, and verifying the superiority of the eDeepCNN model.
2. The motif mining method based on the deep embedded convolutional neural network of claim 1, wherein the eDeepCNN model in S1 includes three convolutional layers, and a local max pooling layer and a dropout layer are placed after each convolutional layer to help the deep embedded convolutional neural network model resist overfitting during training.
3. The motif mining method based on the deep embedded convolutional neural network of claim 2, wherein the three convolutional layers are a first, a second, and a third convolutional layer; the first convolutional layer is responsible for extracting local sequence patterns, and the second and third convolutional layers model the interactions between those local patterns.
4. The motif mining method based on the deep embedded convolutional neural network of claim 3, wherein the first convolutional layer computes a motif score sequence, which serves as the input to the second convolutional layer for identifying the local distribution patterns of the score sequence and capturing the interaction between a motif and its adjacent sequence; the third convolutional layer operates in the same way as the second.
5. The method of claim 1, wherein in step S2 each embedding vector represents a point in a high-dimensional latent space, and the relative positions of the embedding vectors of different K-mers in that space encode the interaction relationships between the K-mers; a one-to-one mapping between K-mer indices and their corresponding embedding vectors is established, and the DNA sequence is converted into a sequence of K-mer indices.
6. The method of claim 5, wherein the embedding vectors are retrieved by table lookup according to the K-mer indices and assembled in order into a two-dimensional array, which the embedding vector layer converts into an embedding vector matrix.
7. The method of claim 1, wherein in step S2 the embedding vector matrix is randomly initialized before training, and the embedding vectors corresponding to the K-mers are adjusted and optimized according to the training data.
8. The motif mining method based on the deep embedded convolutional neural network of claim 1, wherein in S3 a five-fold cross-validation strategy is adopted to evaluate the accuracy of the eDeepCNN model.
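A minimal PyTorch sketch of the eDeepCNN structure outlined in claims 1-4: an embedding layer followed by three convolutional layers, each with local max pooling and a dropout layer to resist over-fitting. All concrete sizes (embedding dimension 32, 64 filters, kernel width 5, dropout rate 0.3) are illustrative assumptions, not values fixed by the claims.

```python
import torch
import torch.nn as nn

class EDeepCNN(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=32, n_filters=64,
                 kernel=5, dropout=0.3):
        super().__init__()
        # Lookup table mapping K-mer indices to embedding vectors (claim 6).
        self.embed = nn.Embedding(vocab_size, embed_dim)

        def block(c_in):
            # Convolution followed by local max pooling and dropout (claim 2).
            return nn.Sequential(
                nn.Conv1d(c_in, n_filters, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Dropout(dropout))

        self.conv1 = block(embed_dim)   # extracts local sequence patterns
        self.conv2 = block(n_filters)   # models interactions between patterns
        self.conv3 = block(n_filters)   # same operation as the second layer
        self.fc = nn.Linear(n_filters, 1)  # binding / non-binding score

    def forward(self, idx):
        x = self.embed(idx).transpose(1, 2)  # (batch, embed_dim, length)
        x = self.conv3(self.conv2(self.conv1(x)))
        x = x.max(dim=2).values              # global max over positions
        return torch.sigmoid(self.fc(x))

model = EDeepCNN()
scores = model(torch.randint(0, 64, (8, 100)))  # 8 sequences of 100 K-mer indices
```

The first convolutional layer's output plays the role of the motif score sequence of claim 4; the second and third layers consume it to capture interactions between a motif and its neighborhood.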
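The K-mer encoding of claims 5-7 can be sketched as a sliding window that maps a DNA sequence to a sequence of K-mer indices, each of which is later looked up in the (randomly initialized, trainable) embedding table. K=3 and the function names below are illustrative assumptions.

```python
from itertools import product

K = 3
# Assign every possible K-mer over the alphabet {A, C, G, T} a unique index.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def encode_kmers(seq, k=K):
    """Slide a window of width k over the DNA sequence and return the
    sequence of K-mer indices, one index per window position."""
    return [VOCAB[seq[i:i + k]] for i in range(len(seq) - k + 1)]

indices = encode_kmers("ACGTAC")
# Each index is then converted to its embedding vector by table lookup,
# and the vectors are stacked in order into the two-dimensional embedding
# matrix fed to the first convolutional layer.
```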
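The five-fold cross-validation strategy of claim 8 partitions the data set into five folds, each serving once as the held-out evaluation set while the other four are used for training; the function name below is an illustrative assumption.

```python
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    idx = list(range(n_samples))
    fold = n_samples // n_folds
    for k in range(n_folds):
        # The last fold absorbs any remainder when n_samples % n_folds != 0.
        test = idx[k * fold:(k + 1) * fold] if k < n_folds - 1 else idx[k * fold:]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test

splits = list(five_fold_splits(10))
```

The model is trained and evaluated once per split, and the five accuracy scores are averaged to estimate generalization performance.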
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110509307.4A CN113096732A (en) | 2021-05-11 | 2021-05-11 | Motif mining method based on deep embedded convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113096732A true CN113096732A (en) | 2021-07-09 |
Family
ID=76664951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110509307.4A Pending CN113096732A (en) | 2021-05-11 | 2021-05-11 | Motif mining method based on deep embedded convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096732A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102206699A (en) * | 2010-07-14 | 2011-10-05 | 上海聚类生物科技有限公司 | Method for prediction of transcription factor binding site (TFBS) |
CN110335639A (en) * | 2019-06-13 | 2019-10-15 | 哈尔滨工业大学(深圳) | Cross-transcription-factor binding site prediction algorithm and device |
CN111341386A (en) * | 2020-02-17 | 2020-06-26 | 大连理工大学 | Multi-scale CNN-BiLSTM method with attention mechanism for predicting non-coding RNA interaction relationships |
CN111667884A (en) * | 2020-06-12 | 2020-09-15 | 天津大学 | Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism |
CN111696624A (en) * | 2020-06-08 | 2020-09-22 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN112270955A (en) * | 2020-10-23 | 2021-01-26 | 大连民族大学 | Method for predicting RBP binding sites on lncRNA (long non-coding RNA) with an attention mechanism |
Non-Patent Citations (1)
Title |
---|
Yindong Zhang et al.: "Predicting in-Vitro Transcription Factor Binding Sites with Deep Embedding Convolution Network", ICIC 2020: Intelligent Computing Theories and Application * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | An end-to-end deep learning architecture for graph classification | |
CN106778014B (en) | Disease risk prediction modeling method based on recurrent neural network | |
CN110334843B (en) | Bi-LSTM hospitalization behavior prediction method and device improved with time-varying attention | |
CN109086805B (en) | Clustering method based on deep neural network and pairwise constraints | |
CN110490320B (en) | Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm | |
CN114927162A (en) | Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution | |
CN110993113B (en) | LncRNA-disease relation prediction method and system based on MF-SDAE | |
Jiang et al. | A hybrid intelligent model for acute hypotensive episode prediction with large-scale data | |
CN107577924A (en) | Deep-learning-based subcellular localization prediction algorithm for long non-coding RNA | |
CN112599187B (en) | Method for predicting drug and target protein binding fraction based on double-flow neural network | |
Maulik | Analysis of gene microarray data in a soft computing framework | |
CN112215259B (en) | Gene selection method and apparatus | |
Hota | Diagnosis of breast cancer using intelligent techniques | |
CN102073882A (en) | Method for matching and classifying spectrums of hyperspectral remote sensing image by DNA computing | |
CN113257359A (en) | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR | |
Shen et al. | Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification | |
CN112926640A (en) | Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium | |
CN101324926A (en) | Method for selecting characteristic facing to complicated mode classification | |
CN117034767A (en) | Ceramic roller kiln temperature prediction method based on KPCA-GWO-GRU | |
Nagae et al. | Automatic layer selection for transfer learning and quantitative evaluation of layer effectiveness | |
CN113096732A (en) | Motif mining method based on deep embedded convolutional neural network | |
CN116541785A (en) | Toxicity prediction method and system based on deep integration machine learning model | |
CN116504331A (en) | Frequency score prediction method for drug side effects based on multiple modes and multiple tasks | |
Ullah et al. | Crow-ENN: An Optimized Elman Neural Network with Crow Search Algorithm for Leukemia DNA Sequence Classification | |
CN114596913B (en) | Protein folding identification method and system based on depth central point model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210709 |