CN114898805A - Cross-species promoter prediction method and system - Google Patents


Info

Publication number: CN114898805A (granted as CN114898805B)
Application number: CN202210342942.2A
Authority: CN (China)
Applicant and current assignee: Shandong University
Inventors: 吴昊 (Wu Hao), 张鹏宇 (Zhang Pengyu)
Original language: Chinese (zh)
Prior art keywords: probability value, model, prediction, weight, neural network
Legal status: Active (granted)

Classifications

    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G06N20/20: Machine learning; ensemble learning
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The disclosure belongs to the technical field of data processing and provides a cross-species promoter prediction method and system. The method comprises: obtaining a DNA sequence and extracting a first feature and a second feature of the DNA sequence; obtaining a first prediction probability value from the first feature using a random forest model; obtaining a second prediction probability value from the second feature using a convolutional neural network model; assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value; determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.

Description

Cross-species promoter prediction method and system
Technical Field
The disclosure belongs to the technical field of data processing, and particularly relates to a cross-species promoter prediction method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Promoters are non-coding DNA regions located near the transcription start site (TSS); by cooperating with RNA polymerase (RNAP), they are essential for initiating transcription of particular genes and for gene expression in different species. In prokaryotes, promoters are involved in many biological functions, such as heat shock responses and nitrogen fixation. In eukaryotes, promoters control the exact starting position of transcription, cooperate with their distal regulatory elements through chromatin loops, and are involved in developmental diseases, tumorigenesis, and spatiotemporal gene expression. Promoter identification has therefore become a research hotspot. Early promoter detection typically relied on biological methods such as whole-genome mapping of histone modifications; however, these techniques are costly, time-consuming, and labor-intensive. Computational methods for predicting promoters were subsequently proposed to address these problems, but their predictive performance and generalization are poor, which greatly inconveniences practical application and falls short of the high precision and high generalization that prediction work requires. Furthermore, these methods have demonstrated performance on only one species, so their applicability in practice has not been established. Achieving high-precision, high-generalization promoter prediction across multiple species has therefore become an important research direction.
In practice, promoter prediction suffers from high data requirements, low prediction precision, and large differences among species. High data requirements make data acquisition costly and burdensome; low prediction precision makes results unreliable and hinders downstream analysis; and large inter-species differences mean that prediction capability varies greatly across species, so parameters must be readjusted for each species before prediction.
Disclosure of Invention
In order to solve the technical problems in the background art, the present disclosure provides a cross-species promoter prediction method and system, which construct a high-precision, high-generalization model for predicting promoters in different species based on a weighted-average ensemble learning method using only one type of data, namely DNA sequence data, and which can effectively predict promoters across cell lines and effectively distinguish enhancers from promoters.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
a first aspect of the disclosure provides a promoter prediction method across multiple species.
A promoter prediction method across multiple species, comprising:
obtaining a DNA sequence, and extracting a first feature and a second feature of the DNA sequence respectively;
obtaining a first prediction probability value from the first feature using a random forest model;
obtaining a second prediction probability value from the second feature using a convolutional neural network model;
assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
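The final combination step of the method above can be sketched as follows. This is an illustrative sketch rather than the patent's implementation; the function name and signature are assumptions, and the 0.5 threshold is the one stated in the detailed description.

```python
def combine_predictions(p_rf, p_cnn, w_rf, w_cnn, threshold=0.5):
    """Weighted sum of the two models' prediction probabilities.

    A combined probability above the threshold is judged to be a promoter.
    Returns (combined probability, is_promoter).
    """
    p = w_rf * p_rf + w_cnn * p_cnn
    return p, p > threshold
```

For example, with equal weights, probabilities of 0.9 (RF) and 0.6 (CNN) combine to 0.75, which is classified as a promoter.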
A second aspect of the disclosure provides a promoter prediction system across multiple species.
A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtain a DNA sequence, and extract a first feature and a second feature of the DNA sequence respectively;
a first prediction module configured to: obtain a first prediction probability value from the first feature using a random forest model;
a second prediction module configured to: obtain a second prediction probability value from the second feature using a convolutional neural network model;
a loss function construction module configured to: assume a weight for the random forest model and a weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
a determination module configured to: take the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of promoter prediction across multiple species as described in the first aspect above.
A fourth aspect of the present disclosure provides a computer device.
A computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps in the method of promoter prediction across multiple species as described in the first aspect above.
Compared with the prior art, the beneficial effect of this disclosure is:
the present disclosure first feature-codes DNA by word vector techniques, then extracts the different features of the input by convolution calculations using CNN, while extracting more complex features by stacking convolution layers. In addition, the present disclosure uses conventional feature extraction algorithms to extract features of DNA sequences and as input features to a random forest. The performance and generalization capability of the predicted promoter are effectively improved by capturing data characteristics through the deep learning neural network and the machine learning method, and meanwhile, the accuracy of the predicted promoter is further improved through the integrated learning research.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a block diagram of the iPro-WAEL model shown in an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating iPro-WAEL detection according to an embodiment of the present disclosure;
FIG. 3 is a graph comparing the performance of iPro-WAEL and other classifiers on six dataset intersections, as shown in an embodiment of the disclosure;
FIG. 4 is a graph illustrating the cross-cell-line prediction performance of iPro-WAEL, as shown in an embodiment of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in fig. 1, this embodiment provides a cross-species promoter prediction method. The embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server, implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. In this embodiment, the method includes the following steps:
obtaining a DNA sequence, and extracting a first feature and a second feature of the DNA sequence respectively;
obtaining a first prediction probability value from the first feature using a random forest model;
obtaining a second prediction probability value from the second feature using a convolutional neural network model;
assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
Specifically, the model structure of this embodiment, comprising the random forest model and the convolutional neural network model, is shown in fig. 1. The random forest (RF) model uses a training set, a weight set, and a test set. The convolutional neural network (CNN) model uses a training set, a validation set, a weight set, and a test set. The RF model is trained on the training set; the CNN model is trained on the training set and the validation set, where the validation set drives an early-stopping mechanism to avoid overfitting. The trained RF and CNN models then each predict on the weight set, with the two models assigned weights W1 and W2 respectively. For each sample of the weight set, a loss value is obtained from W1 multiplied by the RF model's predicted probability for that sample plus W2 multiplied by the CNN model's predicted probability for that sample, and the sum of the loss values over all samples is taken as the loss value of the whole weight set. The loss value is therefore a multivariate function of W1 and W2. Next, sequential least squares programming (SLSQP) is used to find the extremum of this function; the values of W1 and W2 at the minimum loss (the two weights sum to 1 and are both greater than 0) are the optimal weights of the two models. Finally, the prediction results of the two models on the test set are multiplied by their corresponding weights and summed to give the final prediction probability.
In the detection process of this embodiment, as shown in fig. 2, for a sequence to be predicted, five traditional features of the sequence are first extracted and fused: reverse complementary k-mer (RCKmer), mismatch k-mer (Mismatch), composition of k-spaced nucleic acid pairs (CKSNAP), trinucleotide physicochemical properties (TPCP), and pseudo trinucleotide composition (PseTNC). The fused feature is the first feature and serves as the feature vector of the random forest model. The Word2vec feature of the sequence, i.e., the second feature, is extracted using a pre-trained word embedding model, and the word embedding vectors are used as input to the convolutional neural network (CNN) model. Then, the sum of the RF model's prediction probability value multiplied by the RF model's weight and the CNN model's prediction probability value multiplied by the CNN model's weight is used as the final prediction probability value, and classification is performed against a threshold of 0.5: a value greater than 0.5 is considered a promoter; otherwise, it is considered a non-promoter.
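The five named descriptors are not reproduced here. As a simplified stand-in for this kind of traditional sequence feature, the sketch below computes a plain k-mer composition vector (normalised k-mer frequencies), the sort of fixed-length vector that could feed a random-forest classifier; it is an illustration, not one of the patent's descriptors.

```python
from itertools import product

def kmer_composition(seq, k=3):
    """Normalised k-mer frequency vector over the alphabet ACGT.

    Returns a fixed-length vector (4**k entries) whose components sum to 1
    for any sequence containing only A, C, G, T.
    """
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip windows with ambiguous bases (e.g. N)
            counts[km] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[km] / total for km in kmers]
```

A real pipeline would concatenate several such descriptor vectors to form the fused first feature.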
As one or more embodiments, the process of weight determination includes:
acquiring a DNA sequence dataset, and dividing it into a training set and a testing set;
respectively training a random forest model and a convolutional neural network model by using samples in the training set to obtain a trained random forest model and a trained convolutional neural network model;
based on the test set, testing the trained random forest model and the trained convolutional neural network model to respectively obtain a first prediction probability value and a second prediction probability value;
constructing a loss function by combining the weight set based on the first prediction probability value and the second prediction probability value;
the two weighted values when the loss value of the loss function is minimum are respectively the optimal weight of the random forest model and the optimal weight of the convolutional neural network model;
and the optimal weight of the random forest model and the optimal weight of the convolutional neural network model are the determined weights.
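The dataset division underlying this weight determination can be sketched as follows. The 7:1:2 (RF) and 6:1:1:2 (CNN) ratios are the ones given later in the description; shuffling with a fixed seed is an assumption, since the patent does not specify the split procedure.

```python
import random

def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle n sample indices and split them by the given integer ratios.

    With the default (7, 1, 2) this yields train / weight / test sets;
    (6, 1, 1, 2) yields train / weight / validation / test sets.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    parts, start, cut = [], 0, 0
    for r in ratios[:-1]:
        cut += r * n // total
        parts.append(idx[start:cut])
        start = cut
    parts.append(idx[start:])  # remainder goes to the last part
    return parts
```

For 100 samples the default split gives subsets of sizes 70, 10, and 20.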
To validate the scheme of this example, we first evaluated our method and existing methods on predicting human promoters. All results of our model were obtained with the same parameters, showing that the model can be applied to promoter detection in multiple species without readjusting parameters.
This is mainly because the model parameters chiefly concern how the model computes over features, and since the same feature extraction method is used for promoter sequences of different species, the feature dimensions are the same across species, so a model with the same parameters can be trained. However, promoter structures differ greatly between species, so the performance of the RF model and the CNN model varies greatly across species; the model architecture of the invention therefore trains and obtains a different weight combination for each species. In other words, the weights of the two models differ between species, and because the weights are obtained automatically by minimizing the loss value, the resulting model can be guaranteed to be optimal. In some species the classical features better reflect the sequence structure of the promoter, while in others the word-embedding features do. By minimizing the loss value, the better-performing model is given the higher weight, which improves overall performance and yields excellent promoter prediction across species.
We then further compared the generalization ability of the models. Given that some models cannot be applied to some datasets and that model performance differs across datasets, we compared the methods on the maximal intersection of datasets to which they can all be applied. The results are shown in FIG. 3: iPro-WAEL can be effectively applied to all the datasets and exhibits the best performance on them.
Next, we verified whether iPro-WAEL can predict promoters across cell lines. We used a model trained on data from one cell line to predict promoters in other cell lines; the results are shown in FIG. 4. Taking the upper-left graph as an example, the model trained on GM12878 data performs well on the test sets of the four cell lines, showing that the model can effectively perform cross-cell-line prediction with excellent generalization capability and performance. In addition, since previous studies indicate that promoters and enhancers have very similar sequence structures, we further evaluated whether iPro-WAEL can effectively distinguish enhancers from promoters; the results are shown in Table 1. Overall, the proposed method markedly improves prediction precision, generalization, and applicability, can effectively predict promoters across cell lines and distinguish enhancers from promoters, and meets the requirements of practical application.
TABLE 1 Performance of iPro-WAEL in distinguishing enhancers from promoters
Cell lines AUC Accuracy MCC
GM12878 0.9910 0.9622 0.9244
HeLa-S3 0.9943 0.9693 0.9387
HUVEC 0.9851 0.9437 0.8873
K562 0.9929 0.9695 0.9391
Because promoters differ between species, their sequence structures differ; a single type of feature may therefore not predict promoters of different species effectively. For this reason, we first fuse five traditional features for the random forest and then use word-vector embedding to extract features for the CNN model; a model incorporating multiple kinds of features is more robust when predicting promoters of different species.
Second, to integrate the two models reasonably, they are combined through a weighted-average ensemble algorithm. The weighted-average method obtains a combined output by averaging the outputs of the individual models with different weights, the weights reflecting their different importance; unlike simple averaging, it can give more weight to a well-performing model. Specifically, the final output of iPro-WAEL is calculated as follows:
H(x) = Σ_i w_i·h_i(x)   (1)
wherein w_i is the weight of the i-th model and h_i(x) is the output of the i-th model, and the weights are constrained by
Σ_i w_i = 1,  w_i > 0   (2)
To obtain the best weights while keeping the independent test set completely independent, we further divided the original training set into a training set and a dataset for obtaining weights (the weight set) in a 7:1 ratio. Thus, the ratio of training set, weight set, and test set for the RF model is 7:1:2, while the ratio of training set, weight set, validation set, and test set for the CNN model is 6:1:1:2. We then obtain the optimal weights by minimizing the loss value on the weight set. Specifically, the loss on the weight set is the cross-entropy between the predicted values and the true labels, defined as follows:
L_log(y, p) = -(y·log(p) + (1 - y)·log(1 - p))   (3)
where y represents the true label and p represents the predicted value. We then use the sequential least squares programming (SLSQP) algorithm to minimize the loss value; the weights at the minimum loss value are the optimal weights.
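A runnable sketch of this loss minimisation follows. With two weights constrained to sum to 1 the problem is one-dimensional, so a direct grid search over the RF weight stands in here for the SLSQP optimiser used in the text; the function names and the toy grid resolution are assumptions.

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy, equation (3), averaged over the weight set."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_weights(y_true, p_rf, p_cnn, steps=1000):
    """Search w_rf on a grid (with w_cnn = 1 - w_rf) for the minimum
    weight-set loss, returning the (w_rf, w_cnn) pair at the minimum."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        combined = [w * a + (1 - w) * b for a, b in zip(p_rf, p_cnn)]
        loss = cross_entropy(y_true, combined)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, 1.0 - best_w
```

On a toy weight set where the RF probabilities track the labels much better than the CNN ones, the search assigns nearly all weight to the RF model, matching the intuition that the better model earns the higher weight.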
Finally, to ensure optimal model performance, parameters were tuned by grid search. The tuned parameters include the learning rate, the number of convolution kernels, the kernel size, and the number of trees in the random forest. Tables 2 and 3 show results for some of the parameter combinations. Model performance is affected by the parameter settings; the best performance is obtained with a learning rate of 0.001, 32 kernels, a kernel size of 11, and 300 trees in the random forest. We therefore use this parameter combination to construct our model.
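The parameter tuning just described can be sketched as a generic exhaustive grid search. The evaluation function and parameter names below are illustrative assumptions; in the patent the score would come from training and evaluating the model for each combination.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive search over a parameter grid.

    `grid` maps parameter name to a list of candidate values; `evaluate`
    maps a parameter dict to a score (higher is better). Returns the best
    parameter dict and its score.
    """
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With a stub scorer that rewards the combination reported best in the text (learning rate 0.001, 32 kernels), the search returns exactly that combination.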
TABLE 2 Performance of RF model part parameter combinations
n_trees AUC Accuracy MCC
100 0.9803 0.9393 0.8787
200 0.9806 0.9400 0.8801
300 0.9809 0.9415 0.8831
400 0.9809 0.9404 0.8809
500 0.9808 0.9415 0.8831
600 0.9809 0.9415 0.8831
700 0.9809 0.9415 0.8831
800 0.9809 0.9415 0.8831
900 0.9809 0.9411 0.8824
1000 0.9808 0.9400 0.8803
TABLE 3 Performance of partial parameter combinations of CNN model
Example two
This example provides a promoter prediction system across multiple species.
A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtain a DNA sequence, and extract a first feature and a second feature of the DNA sequence respectively;
a first prediction module configured to: obtain a first prediction probability value from the first feature using a random forest model;
a second prediction module configured to: obtain a second prediction probability value from the second feature using a convolutional neural network model;
a loss function construction module configured to: assume a weight for the random forest model and a weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
a determination module configured to: take the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
It should be noted here that the acquisition and feature extraction module, the first prediction module, the second prediction module, the loss function construction module, the weight determination module, and the determination module correspond to the same examples and application scenarios as the steps in the first embodiment, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
Example three
The present embodiment provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the promoter prediction method across multiple species as described in the first embodiment above.
Example four
The present embodiment provides a computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the promoter prediction method across multiple species as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A method of promoter prediction across multiple species, comprising:
obtaining a DNA sequence, and respectively extracting a first characteristic and a second characteristic of the DNA sequence;
based on the first characteristic, a random forest model is adopted to obtain a first prediction probability value;
based on the second characteristic, a convolutional neural network model is adopted to obtain a second prediction probability value;
setting an assumed weight for the random forest model and an assumed weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function;
and determining the probability value for judging whether the DNA sequence is a promoter as the sum of the product of the weight of the random forest model and its predicted probability value and the product of the weight of the convolutional neural network model and its predicted probability value.
2. The method of promoter prediction across multiple species according to claim 1, wherein the first characteristic comprises: fused reverse complement k-mers, mismatched k-mers, k-spacer nucleic acid pair compositions, trinucleotide physicochemical properties, and pseudo-trinucleotide compositions.
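One of the feature families listed in claim 2 can be sketched as follows: counting k-mers after fusing each k-mer with its reverse complement, so a k-mer and its reverse complement share one feature. This is an illustrative sketch, not the patent's exact encoding; the function names are assumptions:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def fused_revcomp_kmer_counts(seq, k=2):
    """Count k-mers, fusing each k-mer with its reverse complement
    under a single canonical key (illustrative sketch)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        canonical = min(kmer, revcomp(kmer))  # one key per fused pair
        counts[canonical] = counts.get(canonical, 0) + 1
    return counts

fused_revcomp_kmer_counts("ATGCA", k=2)
# → {'AT': 1, 'CA': 2, 'GC': 1}   ("TG" is fused with its reverse complement "CA")
```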
3. The method of claim 1, wherein the second feature is a word embedding vector obtained by encoding the DNA sequence with a word embedding model.
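The second feature of claim 3 can be illustrated by tokenizing the DNA sequence into overlapping k-mer "words" and looking each word up in an embedding table. The table here is randomly initialised purely for illustration; in the actual method the vectors would come from a trained word embedding model:

```python
import random

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed(tokens, dim=4, seed=0):
    """Map each token to a vector via an embedding table.
    Randomly initialised here; a real embedding model learns these vectors."""
    rng = random.Random(seed)
    table = {}
    vectors = []
    for t in tokens:
        if t not in table:
            table[t] = [rng.uniform(-1, 1) for _ in range(dim)]
        vectors.append(table[t])
    return vectors

vecs = embed(kmer_tokens("ATGCGT"))  # 4 tokens → 4 vectors of length 4
```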
4. The method of predicting a promoter across multiple species according to claim 1, wherein the loss function is:

L = -[ y·log( w₁·h₁(x) + w₂·h₂(x) ) + (1 - y)·log( 1 - w₁·h₁(x) - w₂·h₂(x) ) ]

wherein wᵢ is the weight of the i-th model, hᵢ(x) is the predicted probability value output by the i-th model, and y is the true label.
5. The method of claim 1, wherein the weight constraint is:

w₁ + w₂ = 1, 0 ≤ wᵢ ≤ 1 (i = 1, 2).
6. the method of claim 1, wherein the weight determination process comprises:
acquiring a DNA sequence data set and dividing it into a training set and a test set;
respectively training a random forest model and a convolutional neural network model by adopting samples in the training set to obtain a trained random forest model and a trained convolutional neural network model;
based on the test set, testing the trained random forest model and the trained convolutional neural network model to respectively obtain a first prediction probability value and a second prediction probability value;
constructing a loss function by combining the weight set based on the first prediction probability value and the second prediction probability value;
the two weighted values when the loss value of the loss function is minimum are respectively the optimal weight of the random forest model and the optimal weight of the convolutional neural network model;
and the optimal weight of the random forest model and the optimal weight of the convolutional neural network model are the determined weights.
7. The method of claim 6, wherein the loss value for a given weight set is computed with the logarithmic loss:

L_log(y, p) = -( y·log(p) + (1 - y)·log(1 - p) )
where y represents the true label and p represents the prediction probability value.
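Under the constraint of claim 5 (w₁ + w₂ = 1), the weight determination of claim 6 can be sketched as a one-dimensional search minimizing the mean logarithmic loss over the test-set predictions. The grid search used here is one simple possibility; the patent does not state the minimization procedure, and the function names are assumptions:

```python
import math

def log_loss(y, p, eps=1e-12):
    """Logarithmic loss of claim 7, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def find_weights(p_rf, p_cnn, labels, steps=100):
    """Grid-search w_rf in [0, 1] (with w_cnn = 1 - w_rf) minimizing
    the mean log loss of the weighted combined prediction."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = sum(log_loss(y, w * p1 + (1 - w) * p2)
                   for y, p1, p2 in zip(labels, p_rf, p_cnn)) / len(labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, 1.0 - best_w
```

For example, when the random forest's probabilities match the labels much better than the CNN's, the search assigns it the dominant weight.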
8. A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtaining a DNA sequence, and respectively extracting a first characteristic and a second characteristic of the DNA sequence;
a first prediction module configured to: based on the first characteristics, a random forest model is adopted to obtain a first predicted probability value;
a second prediction module configured to: based on the second characteristic, a convolutional neural network model is adopted to obtain a second prediction probability value;
a loss function construction module configured to: set an assumed weight for the random forest model and an assumed weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function;
a determination module configured to: determine the probability value for judging whether the DNA sequence is a promoter as the sum of the product of the weight of the random forest model and its predicted probability value and the product of the weight of the convolutional neural network model and its predicted probability value.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps in the method of promoter prediction across multiple species according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps in the method of promoter prediction across multiple species according to any one of claims 1-7.
CN202210342942.2A 2022-04-02 2022-04-02 Multi-species-crossing promoter prediction method and system Active CN114898805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210342942.2A CN114898805B (en) 2022-04-02 2022-04-02 Multi-species-crossing promoter prediction method and system

Publications (2)

Publication Number Publication Date
CN114898805A true CN114898805A (en) 2022-08-12
CN114898805B CN114898805B (en) 2024-06-18

Family

ID=82715518



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN110298611A (en) * 2019-05-16 2019-10-01 重庆瑞尔科技发展有限公司 Regulate and control method and system based on the cargo shipping efficiency of random forest and deep learning
CN113744805A (en) * 2021-09-30 2021-12-03 山东大学 Method and system for predicting DNA methylation based on BERT framework


Similar Documents

Publication Publication Date Title
Forester et al. Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations
Caye et al. TESS3: fast inference of spatial population structure and genome scans for selection
US11615346B2 (en) Method and system for training model by using training data
CN110852755B (en) User identity identification method and device for transaction scene
CN112955883B (en) Application recommendation method and device, server and computer-readable storage medium
Meher et al. Prediction of donor splice sites using random forest with a new sequence encoding approach
Piao et al. A new ensemble method with feature space partitioning for high‐dimensional data classification
Li et al. Empirical research of hybridizing principal component analysis with multivariate discriminant analysis and logistic regression for business failure prediction
Guo et al. Compartmentalized gene regulatory network of the pathogenic fungus Fusarium graminearum
Basuchoudhary et al. Machine-learning techniques in economics: new tools for predicting economic growth
US11403550B2 (en) Classifier
van Putten et al. Distorted‐distance models for directional dispersal: a general framework with application to a wind‐dispersed tree
Cui et al. Comparative analysis and classification of cassette exons and constitutive exons
Fang et al. Prediction of antifungal peptides by deep learning with character embedding
CN114861531B (en) Model parameter optimization method and device for repeated purchase prediction of user
Roigé et al. Cluster validity and uncertainty assessment for self-organizing map pest profile analysis.
McKibben et al. Applying machine learning to classify the origins of gene duplications
KR20220138696A (en) Method and apparatus for classifying image
Chu et al. Binary quatre using time-varying transfer functions
CN114898805A (en) Cross-species promoter prediction method and system
Osman et al. Hybrid learning algorithm in neural network system for enzyme classification
Ng et al. Comparing the regression slopes of independent groups
Muzio et al. networkGWAS: A network-based approach to discover genetic associations
CN115936104A (en) Method and apparatus for training machine learning models
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant