CN113096732A - Motif mining method based on a deep embedded convolutional neural network - Google Patents
Motif mining method based on a deep embedded convolutional neural network
- Publication number
- CN113096732A (application number CN202110509307.4A)
- Authority
- CN
- China
- Prior art keywords
- model
- embedded
- neural network
- convolutional
- edeepcnn
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention relates to a motif mining method based on a deep embedded convolutional neural network, comprising the following steps: S1, constructing the deep embedded convolutional neural network (eDeepCNN) model; S2, applying K-mer encoding to DNA sequences, using embedding vectors as the input representation of the K-mers in the model, training the model on this data set, and performing feature extraction and binding prediction; S3, comparing the eDeepCNN model with shallow networks and verifying its superiority. In the invention, K-mer encoding explicitly models the dependency between adjacent nucleotides in a DNA sequence, implicitly capturing DNA shape information, and the high-dimensional embedding vectors can fully represent the latent information contained in each K-mer.
Description
Technical Field
The invention relates to the technical field of computer recognition and deep learning, and in particular to a motif mining method based on a deep embedded convolutional neural network.
Background
Transcription factors play an important role in biological processes such as gene transcription, repair, and regulation. Genetic variation at transcription factor binding sites is closely related to several serious diseases. Mining transcription factor binding sites, also called motif mining, is therefore important for understanding the regulatory mechanisms of transcription factors. Traditionally, transcription factor binding sites are represented by a position weight matrix (PWM), computed by aligning motif sequences and counting the nucleotide distribution at each position. However, the PWM focuses only on the nucleotide distribution of the motif sequence and ignores information from the motif's neighbouring sequences; case studies show that the context sequence of a motif has a significant influence on binding behavior. Inspired by the position weight matrix, DeepBind built a single-layer convolutional neural network model for the motif mining task, and research shows that the nucleotide distribution of sequences adjacent to a binding site has an important influence on binding behavior. In real biological processes, multiple transcription factors may cooperate to affect the binding process. There may therefore be motif-motif interactions within a sequence, and a single-layer convolutional network cannot handle this case either.
A PWM assumes that the nucleotides in a DNA sequence are mutually independent and is only a coarse approximation of the true physical process. DeepBind uses one-hot encoding of single nucleotides, which is simple and intuitive but cannot fully express the interaction between adjacent nucleotides. A motif mining method based on a deep embedded convolutional neural network is therefore urgently needed.
Disclosure of Invention
The invention aims, for the transcription factor binding prediction task, to capture the interaction between a motif and its adjacent nucleotide sequence, and to construct a deep convolutional network model, eDeepCNN, on the basis of the DeepBind model.
In order to achieve the purpose, the invention provides the following scheme:
a motif mining method based on a deep embedded convolutional neural network comprises the following steps:
s1, constructing a deep embedded convolutional neural network eDeepCNN model;
s2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
s3, comparing the eDeepCNN model with a shallow network, and verifying the superiority of the eDeepCNN model.
Preferably, the eDeepCNN model in S1 includes three convolutional layers, and a local max pooling layer and a dropout layer are placed after each convolutional layer to help the deep embedded convolutional neural network model resist overfitting during training.
Preferably, the three convolutional layers are a first, a second, and a third convolutional layer; the first convolutional layer is responsible for extracting local sequence patterns, while the second and third convolutional layers model the interactions between those local patterns.
Preferably, the first convolutional layer computes a motif score sequence, which serves as the input to the second convolutional layer; the second layer identifies local distribution patterns of the score sequence, thereby capturing the interaction between a motif and its adjacent sequence. The third convolutional layer operates in the same way as the second.
Preferably, each embedding vector in S2 is a point in a high-dimensional latent space, and the relative positions of the embedding vectors of different K-mers in that space represent the interaction relationships between the K-mers; a one-to-one mapping between K-mer indices and their embedding vectors is implemented, yielding a sequence composed of K-mer indices.
Preferably, the embedding vector corresponding to each K-mer index is found by table lookup, the embedding vectors in order form a two-dimensional array, and the two-dimensional array is converted into an embedding vector matrix through an embedding vector layer.
Preferably, in S2, before training, the embedded vector matrix is randomly initialized, and the embedded vectors corresponding to the K-mers are adjusted and optimized according to training data.
Preferably, in S3, a five-fold cross-validation strategy is adopted for evaluating the accuracy of the eDeepCNN model.
The invention has the beneficial effects that:
the invention provides a method for combining K-mer coding and embedded vector representation and a deep embedded convolutional neural network eDeepCNN by capturing the interaction of a motif and an adjacent nucleotide sequence aiming at a transcription factor binding prediction task. Compared with a single-layer convolutional network, the multilayer convolutional network can capture the context information of the motif sequence and the interaction between the motif and the adjacent sequence, and the fitting capability of the convolutional neural network is fully utilized. The PBM model assumes mutual independence between adjacent nucleotides, the K-mer coding explicitly models the dependency relationship of the adjacent nucleotides in the DNA sequence, the shape information of the DNA sequence is implicit, the embedded vector representation has stronger representation capability and more flexibility compared with the one-hot coding, and the implicit information contained by the K-mer can be fully characterized.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a deep-embedded convolutional neural network model structure according to the present invention;
FIG. 3 is a diagram illustrating a comparison between the one-hot encoding and the K-mer encoding of the present invention;
FIG. 4 is a schematic comparison of the neural network structure before and after applying the dropout strategy according to the present invention;
FIG. 5 is a schematic diagram of the model training and evaluation process under five-fold cross-validation according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
A motif mining method based on a deep embedded convolutional neural network, whose flow is shown in FIG. 1, comprises:
S1, constructing a deep embedded convolutional neural network eDeepCNN model (shown in the attached figure 2);
deep convolutional network depcnn operated by three layers of convolution with loss and local pooling strategies. The first layer of convolution extracts the local pattern features of the DNA sequence, and calculates scores for all possible local motifs, which is the same as the Deepbind model. Second and third convolutional layers capable of capturing the interaction of motifs and adjacent sequences. The second convolutional layer receives as input the sequence of motif scores calculated by the first convolutional operation and identifies the local distribution pattern of the sequence of scores, and takes into account the interaction between adjacent motifs or the interaction between a motif and an adjacent sequence. According to the same logic, the third convolution layer has a larger receptive field than the second convolution layer, and can capture the interaction between local modes in a larger range in the sequence. Meanwhile, after the interaction of the local modes is preliminarily extracted through the convolution operation of the second layer, the third convolution layer can consider the high-order interaction between the local modes. Finally, the wider receptive field of the multilayer convolutional network can also adapt to the condition that the binding regions of the transcription factors are different in size. The fitting capability of the model is improved after the multilayer convolutional networks are combined, and the candidate sequence can be more comprehensively modeled. A local max pooling layer and a missing layer are laid down after each convolutional layer. The loss strategy plays an important role in the model. Because the number and complexity of model parameters are improved by the plurality of convolutional layers, the loss strategy can help the model to resist the over-fitting phenomenon in the training process so as to improve the model performance. 
After the convolutional network, a global maximum pooling layer is used to capture the global features of the DNA sequence and form a fixed-length feature vector to be sent to the fully-connected network for final prediction.
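The forward pass described above (three convolution blocks, each followed by local max pooling, then global max pooling into a fixed-length feature vector) can be sketched in plain numpy. This is a minimal illustration of the shape flow under assumed hyper-parameters (kernel widths, channel counts, ReLU activation, pool size); the patent's actual values are in Table 1 and are not reproduced here.

```python
import numpy as np

def conv1d(x, kernels):
    """Naive 1-D convolution with ReLU. x: (L, C_in); kernels: (K, W, C_in) -> (L-W+1, K)."""
    L, _ = x.shape
    K, W, _ = kernels.shape
    out = np.empty((L - W + 1, K))
    for i in range(L - W + 1):
        window = x[i:i + W]                    # (W, C_in)
        out[i] = np.tensordot(kernels, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0)                # ReLU

def local_max_pool(x, size):
    """Non-overlapping max pooling along the sequence axis."""
    L = (x.shape[0] // size) * size
    return x[:L].reshape(-1, size, x.shape[1]).max(axis=1)

def edeepcnn_forward(x, layers, pool=2):
    """Three conv + local-max-pool blocks, then global max pooling over positions."""
    for kernels in layers:
        x = local_max_pool(conv1d(x, kernels), pool)
    return x.max(axis=0)                       # fixed-length feature vector

rng = np.random.default_rng(0)
seq = rng.normal(size=(100, 16))               # embedded DNA sequence, e.g. 2-mer dim 16
layers = [rng.normal(size=(32, 8, 16)) * 0.1,  # conv1: motif scores
          rng.normal(size=(32, 4, 32)) * 0.1,  # conv2: motif-motif interactions
          rng.normal(size=(32, 4, 32)) * 0.1]  # conv3: higher-order interactions
features = edeepcnn_forward(seq, layers)
print(features.shape)                          # (32,)
```

The fixed-length vector `features` would then be fed to the fully-connected network for the final prediction.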
S2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
K-mer encoding uses a sliding window of length k and treats the k adjacent nucleotides in the window as the basic building block of the DNA sequence, directly and conveniently characterising the interdependence of adjacent nucleotides. When k = 1, a single nucleotide is the basic unit of the DNA sequence, giving 4 single nucleotides (A, C, G, T); this amounts to assuming, at the encoding level, that nucleotides are mutually independent. When k = 2, two adjacent nucleotides are treated as a whole, giving 16 (4²) dinucleotides: AA, AC, AG, AT, CA, CC, CG, CT, …, TA, TC, TG, TT; dinucleotide encoding explicitly accounts for the interaction between two neighbouring nucleotides. Similarly, when k = 3 there are 64 (4³) independent trinucleotides, allowing the dependence between three adjacent nucleotides to be modelled directly. The number of independent K-mers grows exponentially with k, so the total increases sharply as k increases.
The eDeepCNN model outperformed the comparison methods. With 1-mer encoding, the average R² of the eDeepCNN-1mer model over 20 data sets reaches 0.59, an absolute improvement of 0.025 (a relative gain of about 4%) over the one-hot-encoded DeepCNN model, and an absolute improvement of 0.109 (a relative gain of about 22%) over the single-layer convolutional DeepBind model.
After combining K-mer encoding with the embedding vector representation, the model metrics improve further: over 10 data sets, the average score of the eDeepCNN-2mer model is 0.596, higher than the 0.573 of the eDeepCNN-1mer model, a relative improvement of about 4%.
In this embodiment, the candidate sequence is traversed with a sliding window of length k, and the K-mer inside the window is recorded as its corresponding index, thereby converting the DNA sequence into an array that can be computed on. FIG. 3 illustrates the comparison between one-hot encoding and K-mer encoding. For example, when k = 2 there are 16 independent 2-mers: the dinucleotide AA corresponds to index 0, AC to 1, TG to 14, and TT to 15. For a K-mer s = s₁s₂…s_k, the corresponding index d(s) is:

    d(s) = Σ_{i=1..k} 4^(k−i) · d(s_i),  with d(A) = 0, d(C) = 1, d(G) = 2, d(T) = 3,

where s is the input K-mer, s_i is the nucleotide at position i of the K-mer, d(s_i) maps that nucleotide to its number, and d(s) outputs the index of the whole K-mer.
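The index computation and sliding-window encoding described above can be sketched as follows; the function names are illustrative, not from the patent.

```python
def kmer_index(kmer):
    """Map a K-mer string to its integer index, treating A,C,G,T as base-4 digits 0..3."""
    base = {"A": 0, "C": 1, "G": 2, "T": 3}
    idx = 0
    for ch in kmer:
        idx = idx * 4 + base[ch]   # Horner's rule for d(s) = sum 4^(k-i) * d(s_i)
    return idx

def encode_sequence(seq, k):
    """Slide a window of length k over the sequence and record each K-mer's index."""
    return [kmer_index(seq[i:i + k]) for i in range(len(seq) - k + 1)]

print(kmer_index("AA"), kmer_index("AC"), kmer_index("TG"), kmer_index("TT"))  # 0 1 14 15
print(encode_sequence("ACGT", 2))  # [1, 6, 11]
```

The printed values match the 2-mer examples in the text (AA → 0, AC → 1, TG → 14, TT → 15).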
Embedding vector representations are widely used in many fields such as natural language processing, information extraction, and recommendation systems. An embedding vector is a point in a high-dimensional latent space whose position carries rich information, giving it stronger representational power than one-hot encoding; the relative positions of the embedding vectors of different K-mers in this space can better represent the interaction relationships between K-mers. On the other hand, under one-hot encoding there are 4^k independent K-mers in total, each represented by a 4^k-dimensional one-hot vector. When k = 1 there are 4 different 1-mers; when k = 2 there are 16 different 2-mers; when k = 5 there are 1024 independent 5-mers, yielding a 1024-dimensional one-hot vector. The dimensionality of one-hot encoding rises exponentially with k, causing the number of model parameters to explode and making training difficult. Embedding vector encoding effectively reduces the dimensionality of the convolutional network's input and avoids the parameter explosion problem. Moreover, the embedding dimension is variable, so an optimal value can be searched for during model training, giving the approach great flexibility.
In this embodiment, a one-to-one mapping is constructed between K-mer indices and their embedding vectors; K-mer encoding of a candidate sequence yields a sequence of K-mer indices. The embedding vector corresponding to each index is found by table lookup, and the embedding vectors are assembled in order into a two-dimensional array. In the neural network, this conversion is realised by an embedding layer, which maintains an embedding vector matrix in which each row is the embedding vector for the corresponding index; in operation, the rows of the matrix are selected according to the indices of the input sequence. Before training begins, the embedding vector matrix is randomly initialised, and the embedding vectors of the K-mers are gradually adjusted and optimised according to the training data.
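The embedding-layer lookup described here is plain row indexing into a randomly initialised matrix; a minimal numpy sketch, with the embedding dimension (8) chosen arbitrarily for illustration:

```python
import numpy as np

k, dim = 2, 8
rng = np.random.default_rng(0)
# Randomly initialised embedding matrix: one row per K-mer; rows are tuned during training.
embedding = rng.normal(scale=0.1, size=(4 ** k, dim))

indices = [1, 6, 11]          # K-mer index sequence, e.g. "ACGT" encoded with k = 2
matrix = embedding[indices]   # table lookup -> (sequence length, embedding dim)
print(matrix.shape)           # (3, 8)
```

The resulting two-dimensional array `matrix` is what the first convolutional layer consumes.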
An optional embedding vector layer is added before the convolutional network, and the corresponding model is called eDeepCNN. The detailed parameter settings of the model, including the width and number of convolution kernels in each convolutional layer, are listed in Table 1 below. Some hyper-parameter settings inherit from DeepBind, the classic model for the motif mining task, whose values have proved to be good choices; the remaining hyper-parameters are determined by grid search during training.
TABLE 1
S3, comparing the deep embedded convolutional neural network eDeepCNN model with a shallow layer network, and verifying the superiority of the deep embedded convolutional neural network eDeepCNN model.
In the machine learning paradigm, the whole data set is divided into a training set and a test set; the model fits the task objective by optimising its parameters on the training set and learning the regularities of the training data, and the trained model is then applied to the test set to check its real effect. The key point is that the model should learn the general rules of the task from the training set so that it also performs well on the test set. In practice, however, a model tends to over-fit the noise in the training data, or learns rules specific to the training set that do not transfer to the test set. In that case the model performs well on the training set but poorly on the test set; this is the so-called overfitting phenomenon.
The dropout strategy is an effective means of combating overfitting in deep neural networks. During training, dropout randomly masks a subset of neurons at each parameter update, forcing their output values to zero, which is equivalent to temporarily discarding those neurons from the network. During that update, the weights of the discarded neurons remain unchanged, as shown in FIG. 4.
In this embodiment, the dropout strategy sets each neuron output to zero with probability p during training. To compensate for the resulting reduction of the input to the next layer, the unmasked outputs are scaled by a factor of 1/(1 − p). During testing, the neuron outputs are left unchanged, with no masking. The dropout computation is as follows:

    r_j ~ Bernoulli(1 − p)
    x̃ = (r ⊙ x) / (1 − p)
    y = f(w · x̃ + b)

where each component of the mask r is 0 with probability p and 1 otherwise, x is the neuron input vector, w and b are the network weight and bias parameters, ⊙ denotes element-wise multiplication, and f is the neuron activation function.
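The inverted-dropout scheme described here (zero with probability p during training, rescale survivors by 1/(1 − p), identity at test time) is a few lines of numpy; this sketch is illustrative, not the patent's implementation:

```python
import numpy as np

def dropout(x, p, training, rng):
    """Inverted dropout: zero each activation with probability p during training and
    rescale survivors by 1/(1-p) so the expected value is preserved; identity at test time."""
    if not training or p == 0.0:
        return x
    mask = rng.random(x.shape) >= p       # keep with probability 1-p
    return x * mask / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(10000)
y = dropout(x, 0.3, training=True, rng=rng)
print(abs(y.mean() - 1.0) < 0.05)                      # expectation preserved -> True
print(dropout(x, 0.3, training=False, rng=rng) is x)   # unchanged at test time -> True
```

Because the rescaling happens at training time, no adjustment is needed when the trained network is evaluated.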
Because the multiple convolutional layers increase the number and complexity of model parameters, the risk of overfitting rises substantially. Dropout layers are therefore placed in both the convolutional network and the fully-connected network and, combined with an L2 regularisation strategy, help the model resist overfitting during training and improve performance.
The coefficient of determination R² is used to measure the correlation between predicted outputs and measured values; the R² coefficient has been used in past research to measure a model's predictive performance on PBM in vitro data sets. R² is computed as:

    R² = 1 − Σ_i (y_i − ŷ_i)² / Σ_i (y_i − ȳ)²

where y_i is the label value of sample i, ȳ is the mean of the label values, and ŷ_i is the predicted value for sample i.

1 − R² is thus the ratio of the mean squared error between the regression model's predictions and the measured values to the intrinsic variance of the measured data. The closer R² is to 1, the smaller the model's prediction error relative to the intrinsic variance of the data set, and the better the model's predictive behavior. Because R² is normalised by the intrinsic variance of each data set, the model's performance can be compared across different transcription factor data sets, and the evaluation metric can be averaged over multiple data sets to better gauge performance on the transcription factor binding task.
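The R² metric used here is a one-line ratio of sums of squares; a small sketch with a worked example:

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (sum of squared errors) / (total label variance)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    sse = np.sum((y_true - y_pred) ** 2)            # prediction error
    sst = np.sum((y_true - y_true.mean()) ** 2)     # intrinsic variance of the labels
    return 1.0 - sse / sst

print(r_squared([1, 2, 3, 4], [1, 2, 3, 4]))   # 1.0 (perfect prediction)
print(r_squared([1, 2, 3, 4], [2, 2, 3, 3]))   # 0.6 (sse = 2, sst = 5)
```

The normalisation by `sst` is what makes scores comparable across transcription factor data sets with different intrinsic variances.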
To evaluate the model's performance accurately, a five-fold cross-validation strategy is used: the experiment is repeated five times, each time with a different partition into training and test sets. The whole data set is randomly divided into five equal parts; in each experiment four parts serve as the training set and the remaining part as the test set, so the five experiments select five different test sets in turn. During training, one eighth of the training set is randomly sampled as a validation set. The final performance of the model is the mean of its R² on the test sets over the five cross-validation runs. The five-fold cross-validation procedure is illustrated in FIG. 5.
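The five-fold partitioning described above can be sketched as an index-splitting generator; the seed and fold mechanics are illustrative assumptions (the further hold-out of one eighth for validation is noted in a comment rather than implemented):

```python
import numpy as np

def five_fold_splits(n, seed=0):
    """Randomly partition n sample indices into five equal folds; each fold in turn
    serves as the test set while the remaining four folds form the training set."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(5) if j != i])
        # one eighth of `train` would further be sampled as a validation set
        yield train, test

n = 100
for train, test in five_fold_splits(n):
    assert len(test) == 20 and len(train) == 80
    assert set(train) | set(test) == set(range(n))   # folds are disjoint and exhaustive
print("five disjoint folds verified")
```

Averaging the per-fold test R² over the five runs gives the final score reported for each model.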
For the transcription factor binding prediction task, the invention captures the interaction between a motif and its adjacent nucleotide sequence by combining K-mer encoding with embedding vector representations in the deep embedded convolutional neural network eDeepCNN. Compared with a single-layer convolutional network, the multi-layer convolutional network can capture the context information of the motif sequence and the interaction between the motif and its adjacent sequence, making full use of the fitting capability of a convolutional neural network. Whereas the PWM model assumes mutual independence of adjacent nucleotides, K-mer encoding explicitly models the dependency between adjacent nucleotides in the DNA sequence and implicitly encodes DNA shape information; the embedding vector representation has stronger representational power and greater flexibility than one-hot encoding and can fully characterise the latent information contained in each K-mer.
The above-described embodiments are merely illustrative of the preferred embodiments of the present invention, and do not limit the scope of the present invention, and various modifications and improvements of the technical solutions of the present invention can be made by those skilled in the art without departing from the spirit of the present invention, and the technical solutions of the present invention are within the scope of the present invention defined by the claims.
Claims (8)
1. A motif mining method based on a deep embedded convolutional neural network, characterised by comprising the following steps:
s1, constructing a deep embedded convolutional neural network eDeepCNN model;
s2, carrying out K-mer coding on the DNA sequence, training a data set of the eDeepCNN model by using an embedded vector as an input representation of a K-mer in the eDeepCNN model, and carrying out feature extraction and binding prediction;
s3, comparing the eDeepCNN model with a shallow network, and verifying the superiority of the eDeepCNN model.
2. The motif mining method based on the deep embedded convolutional neural network of claim 1, wherein the eDeepCNN model in S1 includes three convolutional layers, and a local max pooling layer and a dropout layer are placed after each convolutional layer to help the deep embedded convolutional neural network model resist overfitting during training.
3. The motif mining method based on the deep embedded convolutional neural network of claim 2, wherein the three convolutional layers are a first, a second, and a third convolutional layer; the first convolutional layer is responsible for extracting local sequence patterns, and the second and third convolutional layers model the interactions between those local patterns.
4. The motif mining method based on the deep embedded convolutional neural network of claim 3, wherein the first convolutional layer computes a motif score sequence, which serves as the input to the second convolutional layer for identifying the local distribution patterns of the score sequence and capturing the interaction between a motif and its adjacent sequence; the third convolutional layer operates in the same way as the second.
5. The method of claim 1, wherein in step S2 each embedding vector represents a point in a high-dimensional latent space, and the relative positions of the embedding vectors of different K-mers in that space encode the interaction relationships between the K-mers; a one-to-one mapping between K-mer indices and their corresponding embedding vectors is established, and the DNA sequence is converted into a sequence of K-mer indices.
6. The method of claim 5, wherein the embedding vectors are retrieved by table lookup according to the K-mer indices and assembled in order into a two-dimensional array, which the embedding vector layer converts into an embedding vector matrix.
7. The method of claim 1, wherein in step S2 the embedding vector matrix is randomly initialized before training, and the embedding vectors corresponding to the K-mers are adjusted and optimized according to the training data.
8. The motif mining method based on the deep embedded convolutional neural network of claim 1, wherein in S3 a five-fold cross-validation strategy is adopted to evaluate the accuracy of the eDeepCNN model.
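A minimal PyTorch sketch of the eDeepCNN structure outlined in claims 1-4: an embedding layer followed by three convolutional layers, each with local max pooling and a dropout layer to resist over-fitting. All concrete sizes (embedding dimension 32, 64 filters, kernel width 5, dropout rate 0.3) are illustrative assumptions, not values fixed by the claims.

```python
import torch
import torch.nn as nn

class EDeepCNN(nn.Module):
    def __init__(self, vocab_size=64, embed_dim=32, n_filters=64,
                 kernel=5, dropout=0.3):
        super().__init__()
        # Lookup table mapping K-mer indices to embedding vectors (claim 6).
        self.embed = nn.Embedding(vocab_size, embed_dim)

        def block(c_in):
            # Convolution followed by local max pooling and dropout (claim 2).
            return nn.Sequential(
                nn.Conv1d(c_in, n_filters, kernel, padding=kernel // 2),
                nn.ReLU(),
                nn.MaxPool1d(2),
                nn.Dropout(dropout))

        self.conv1 = block(embed_dim)   # extracts local sequence patterns
        self.conv2 = block(n_filters)   # models interactions between patterns
        self.conv3 = block(n_filters)   # same operation as the second layer
        self.fc = nn.Linear(n_filters, 1)  # binding / non-binding score

    def forward(self, idx):
        x = self.embed(idx).transpose(1, 2)  # (batch, embed_dim, length)
        x = self.conv3(self.conv2(self.conv1(x)))
        x = x.max(dim=2).values              # global max over positions
        return torch.sigmoid(self.fc(x))

model = EDeepCNN()
scores = model(torch.randint(0, 64, (8, 100)))  # 8 sequences of 100 K-mer indices
```

The first convolutional layer's output plays the role of the motif score sequence of claim 4; the second and third layers consume it to capture interactions between a motif and its neighborhood.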
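The K-mer encoding of claims 5-7 can be sketched as a sliding window that maps a DNA sequence to a sequence of K-mer indices, each of which is later looked up in the (randomly initialized, trainable) embedding table. K=3 and the function names below are illustrative assumptions.

```python
from itertools import product

K = 3
# Assign every possible K-mer over the alphabet {A, C, G, T} a unique index.
VOCAB = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=K))}

def encode_kmers(seq, k=K):
    """Slide a window of width k over the DNA sequence and return the
    sequence of K-mer indices, one index per window position."""
    return [VOCAB[seq[i:i + k]] for i in range(len(seq) - k + 1)]

indices = encode_kmers("ACGTAC")
# Each index is then converted to its embedding vector by table lookup,
# and the vectors are stacked in order into the two-dimensional embedding
# matrix fed to the first convolutional layer.
```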
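The five-fold cross-validation strategy of claim 8 partitions the data set into five folds, each serving once as the held-out evaluation set while the other four are used for training; the function name below is an illustrative assumption.

```python
def five_fold_splits(n_samples, n_folds=5):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    idx = list(range(n_samples))
    fold = n_samples // n_folds
    for k in range(n_folds):
        # The last fold absorbs any remainder when n_samples % n_folds != 0.
        test = idx[k * fold:(k + 1) * fold] if k < n_folds - 1 else idx[k * fold:]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        yield train, test

splits = list(five_fold_splits(10))
```

The model is trained and evaluated once per split, and the five accuracy scores are averaged to estimate generalization performance.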
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110509307.4A CN113096732A (en) | 2021-05-11 | 2021-05-11 | Motif mining method based on deep embedded convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113096732A true CN113096732A (en) | 2021-07-09 |
Family
ID=76664951
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110509307.4A Pending CN113096732A (en) | 2021-05-11 | 2021-05-11 | Motif mining method based on deep embedded convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113096732A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102206699A (en) * | 2010-07-14 | 2011-10-05 | 上海聚类生物科技有限公司 | Method for prediction of transcription factor binding site (TFBS) |
CN110335639A (en) * | 2019-06-13 | 2019-10-15 | 哈尔滨工业大学(深圳) | Cross-transcription-factor binding site prediction algorithm and device |
CN111341386A (en) * | 2020-02-17 | 2020-06-26 | 大连理工大学 | Multi-scale CNN-BiLSTM method with attention mechanism for predicting non-coding RNA interaction relationships |
CN111667884A (en) * | 2020-06-12 | 2020-09-15 | 天津大学 | Convolutional neural network model for predicting protein interactions using protein primary sequences based on attention mechanism |
CN111696624A (en) * | 2020-06-08 | 2020-09-22 | 天津大学 | DNA binding protein identification and function annotation deep learning method based on self-attention mechanism |
CN112270955A (en) * | 2020-10-23 | 2021-01-26 | 大连民族大学 | Method for predicting RBP binding sites on lncRNA (long non-coding RNA) with an attention mechanism |
Non-Patent Citations (1)
Title |
---|
Yindong Zhang et al.: "Predicting in-Vitro Transcription Factor Binding Sites with Deep Embedding Convolution Network", ICIC 2020: Intelligent Computing Theories and Application * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | An end-to-end deep learning architecture for graph classification | |
CN106778014B (en) | Disease risk prediction modeling method based on recurrent neural network | |
CN110334843B (en) | Bi-LSTM hospitalization behavior prediction method and device improved with time-varying attention | |
CN109086805B (en) | Clustering method based on deep neural network and pairwise constraints | |
CN110490320B (en) | Deep neural network structure optimization method based on fusion of prediction mechanism and genetic algorithm | |
CN114927162A (en) | Multi-set correlation phenotype prediction method based on hypergraph representation and Dirichlet distribution | |
CN110993113B (en) | LncRNA-disease relation prediction method and system based on MF-SDAE | |
Jiang et al. | A hybrid intelligent model for acute hypotensive episode prediction with large-scale data | |
CN107577924A (en) | Deep-learning-based subcellular localization prediction algorithm for long non-coding RNA | |
CN112599187B (en) | Method for predicting drug and target protein binding fraction based on double-flow neural network | |
Maulik | Analysis of gene microarray data in a soft computing framework | |
CN112215259B (en) | Gene selection method and apparatus | |
Hota | Diagnosis of breast cancer using intelligent techniques | |
CN102073882A (en) | Method for matching and classifying spectrums of hyperspectral remote sensing image by DNA computing | |
CN113257359A (en) | CRISPR/Cas9 guide RNA editing efficiency prediction method based on CNN-SVR | |
Shen et al. | Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification | |
CN112926640A (en) | Cancer gene classification method and equipment based on two-stage depth feature selection and storage medium | |
CN101324926A (en) | Method for selecting characteristic facing to complicated mode classification | |
CN117034767A (en) | Ceramic roller kiln temperature prediction method based on KPCA-GWO-GRU | |
Nagae et al. | Automatic layer selection for transfer learning and quantitative evaluation of layer effectiveness | |
CN113096732A (en) | Motif mining method based on deep embedded convolutional neural network | |
CN116541785A (en) | Toxicity prediction method and system based on deep integration machine learning model | |
CN116504331A (en) | Frequency score prediction method for drug side effects based on multiple modes and multiple tasks | |
Ullah et al. | Crow-ENN: An Optimized Elman Neural Network with Crow Search Algorithm for Leukemia DNA Sequence Classification | |
CN114596913B (en) | Protein folding identification method and system based on depth central point model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
Application publication date: 20210709 |