CN114898805A - Cross-species promoter prediction method and system - Google Patents


Info

Publication number: CN114898805A (granted as CN114898805B)
Application number: CN202210342942.2A
Authority: CN (China)
Applicant and current assignee: Shandong University
Inventors: 吴昊 (Wu Hao), 张鹏宇 (Zhang Pengyu)
Original language: Chinese (zh)
Prior art keywords: probability value, model, prediction, weight, neural network
Legal status: Active (granted)

Classifications

    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G06N20/20: Machine learning; ensemble learning
    • G06N3/045: Neural networks; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding


Abstract

The disclosure belongs to the technical field of data processing and provides a cross-species promoter prediction method and system. The method comprises: obtaining a DNA sequence and extracting a first feature and a second feature of the DNA sequence; obtaining a first prediction probability value from the first feature using a random forest model; obtaining a second prediction probability value from the second feature using a convolutional neural network model; assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value; determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.

Description

Cross-species promoter prediction method and system
Technical Field
The disclosure belongs to the technical field of data processing, and particularly relates to a cross-species promoter prediction method and system.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Promoters are non-coding DNA regions located near the transcription start site (TSS); by cooperating with RNA polymerase (RNAP), they are essential for initiating transcription of particular genes and for gene expression in different species. In prokaryotes, promoters are involved in many biological functions, such as heat shock responses and nitrogen fixation. In eukaryotes, promoters control the exact starting position of transcription, cooperate with their distal regulatory elements through chromatin loops, and are involved in developmental diseases, tumorigenesis, and spatiotemporal gene expression. Promoter identification has therefore become a research hotspot. Early promoter detection typically relied on biological methods such as whole-genome mapping of histone modifications; however, these techniques are costly, time-consuming, and labor-intensive. Computational methods for predicting promoters were subsequently proposed to address these problems, but their predictive performance and generalization are poor, which greatly inconveniences practical application and falls short of the high precision and high generalization that prediction work requires. Furthermore, these methods have demonstrated performance on only one species, so their applicability in practice has not been established. Achieving high-precision, high-generalization promoter prediction across multiple species has therefore become an important research direction.
In practice, promoter prediction suffers from high data requirements, low prediction precision, and large differences among species. High data requirements make data acquisition costly and burdensome; low prediction precision makes results unreliable and hinders downstream analysis; and large inter-species differences mean that prediction capability varies greatly across species, so parameters must be readjusted for each species before prediction.
Disclosure of Invention
In order to solve the technical problems in the background art, the present disclosure provides a cross-species promoter prediction method and system, which construct a high-precision, high-generalization model for predicting promoters in different species based on a weighted-average ensemble learning method using only one type of data, namely DNA sequence data, and which can effectively predict promoters across cell lines and effectively distinguish enhancers from promoters.
In order to achieve the purpose, the following technical scheme is adopted in the disclosure:
a first aspect of the disclosure provides a promoter prediction method across multiple species.
A promoter prediction method across multiple species, comprising:
obtaining a DNA sequence, and extracting a first feature and a second feature of the DNA sequence respectively;
obtaining a first prediction probability value from the first feature using a random forest model;
obtaining a second prediction probability value from the second feature using a convolutional neural network model;
assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
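The final combination step of the method above can be sketched as follows. This is an illustrative sketch rather than the patent's implementation; the function name and signature are assumptions, and the 0.5 threshold is the one stated in the detailed description.

```python
def combine_predictions(p_rf, p_cnn, w_rf, w_cnn, threshold=0.5):
    """Weighted sum of the two models' prediction probabilities.

    A combined probability above the threshold is judged to be a promoter.
    Returns (combined probability, is_promoter).
    """
    p = w_rf * p_rf + w_cnn * p_cnn
    return p, p > threshold
```

For example, with equal weights, probabilities of 0.9 (RF) and 0.6 (CNN) combine to 0.75, which is classified as a promoter.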
A second aspect of the disclosure provides a promoter prediction system across multiple species.
A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtain a DNA sequence, and extract a first feature and a second feature of the DNA sequence respectively;
a first prediction module configured to: obtain a first prediction probability value from the first feature using a random forest model;
a second prediction module configured to: obtain a second prediction probability value from the second feature using a convolutional neural network model;
a loss function construction module configured to: assume a weight for the random forest model and a weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
a determination module configured to: take the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
A third aspect of the present disclosure provides a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of promoter prediction across multiple species as described in the first aspect above.
A fourth aspect of the present disclosure provides a computer device.
A computer apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor when executing the program implementing the steps in the method of promoter prediction across multiple species as described in the first aspect above.
Compared with the prior art, the beneficial effect of this disclosure is:
the present disclosure first feature-codes DNA by word vector techniques, then extracts the different features of the input by convolution calculations using CNN, while extracting more complex features by stacking convolution layers. In addition, the present disclosure uses conventional feature extraction algorithms to extract features of DNA sequences and as input features to a random forest. The performance and generalization capability of the predicted promoter are effectively improved by capturing data characteristics through the deep learning neural network and the machine learning method, and meanwhile, the accuracy of the predicted promoter is further improved through the integrated learning research.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a block diagram of the iPro-WAEL model shown in an embodiment of the present disclosure;
FIG. 2 is a flow chart illustrating iPro-WAEL detection according to an embodiment of the present disclosure;
FIG. 3 is a graph comparing the performance of iPro-WAEL and other classifiers on six dataset intersections, as shown in an embodiment of the disclosure;
FIG. 4 is a graph illustrating the cross-cell-line prediction performance of iPro-WAEL, as shown in an embodiment of the present disclosure.
Detailed Description
The present disclosure is further described with reference to the following drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
It is noted that the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of methods and systems according to various embodiments of the present disclosure. It should be noted that each block in the flowchart or block diagrams may represent a module, a segment, or a portion of code, which may comprise one or more executable instructions for implementing the logical function specified in the respective embodiment. It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Example one
As shown in fig. 1, this embodiment provides a cross-species promoter prediction method. The embodiment is illustrated by applying the method to a server; it is understood that the method may also be applied to a terminal, or to a system including a terminal and a server, implemented through interaction between the terminal and the server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein. In this embodiment, the method includes the following steps:
obtaining a DNA sequence, and extracting a first feature and a second feature of the DNA sequence respectively;
obtaining a first prediction probability value from the first feature using a random forest model;
obtaining a second prediction probability value from the second feature using a convolutional neural network model;
assuming a weight for the random forest model and a weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
taking the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
Specifically, the model structure of this embodiment, comprising the random forest model and the convolutional neural network model, is shown in fig. 1. The random forest (RF) model uses a training set, a weight set, and a test set. The convolutional neural network (CNN) model uses a training set, a validation set, a weight set, and a test set. The RF model is trained on the training set; the CNN model is trained on the training set and the validation set, where the validation set drives an early-stopping mechanism to avoid overfitting. The trained RF and CNN models then each predict on the weight set, with the two models assigned weights W1 and W2 respectively. For each sample of the weight set, a loss value is obtained from W1 multiplied by the RF model's predicted probability for that sample plus W2 multiplied by the CNN model's predicted probability for that sample, and the sum of the loss values over all samples is taken as the loss value of the whole weight set. The loss value is therefore a multivariate function of W1 and W2. Next, sequential least squares programming (SLSQP) is used to find the extremum of this function; the values of W1 and W2 at the minimum loss (the two weights sum to 1 and are both greater than 0) are the optimal weights of the two models. Finally, the prediction results of the two models on the test set are multiplied by their corresponding weights and summed to give the final prediction probability.
In the detection process of this embodiment, as shown in fig. 2, for a sequence to be predicted, five traditional features of the sequence are first extracted and fused: reverse complementary k-mer (RCKmer), mismatch k-mer (Mismatch), composition of k-spaced nucleic acid pairs (CKSNAP), trinucleotide physicochemical properties (TPCP), and pseudo trinucleotide composition (PseTNC). The fused feature is the first feature and serves as the feature vector of the random forest model. The Word2vec feature of the sequence, i.e., the second feature, is extracted using a pre-trained word embedding model, and the word embedding vectors are used as input to the convolutional neural network (CNN) model. Then, the sum of the RF model's prediction probability value multiplied by the RF model's weight and the CNN model's prediction probability value multiplied by the CNN model's weight is used as the final prediction probability value, and classification is performed against a threshold of 0.5: a value greater than 0.5 is considered a promoter; otherwise, it is considered a non-promoter.
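The five named descriptors are not reproduced here. As a simplified stand-in for this kind of traditional sequence feature, the sketch below computes a plain k-mer composition vector (normalised k-mer frequencies), the sort of fixed-length vector that could feed a random-forest classifier; it is an illustration, not one of the patent's descriptors.

```python
from itertools import product

def kmer_composition(seq, k=3):
    """Normalised k-mer frequency vector over the alphabet ACGT.

    Returns a fixed-length vector (4**k entries) whose components sum to 1
    for any sequence containing only A, C, G, T.
    """
    alphabet = "ACGT"
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip windows with ambiguous bases (e.g. N)
            counts[km] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[km] / total for km in kmers]
```

A real pipeline would concatenate several such descriptor vectors to form the fused first feature.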
As one or more embodiments, the process of weight determination includes:
acquiring a DNA sequence dataset, and dividing it into a training set and a testing set;
respectively training a random forest model and a convolutional neural network model by using samples in the training set to obtain a trained random forest model and a trained convolutional neural network model;
based on the test set, testing the trained random forest model and the trained convolutional neural network model to respectively obtain a first prediction probability value and a second prediction probability value;
constructing a loss function by combining the weight set based on the first prediction probability value and the second prediction probability value;
the two weighted values when the loss value of the loss function is minimum are respectively the optimal weight of the random forest model and the optimal weight of the convolutional neural network model;
and the optimal weight of the random forest model and the optimal weight of the convolutional neural network model are the determined weights.
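The dataset division underlying this weight determination can be sketched as follows. The 7:1:2 (RF) and 6:1:1:2 (CNN) ratios are the ones given later in the description; shuffling with a fixed seed is an assumption, since the patent does not specify the split procedure.

```python
import random

def split_indices(n, ratios=(7, 1, 2), seed=0):
    """Shuffle n sample indices and split them by the given integer ratios.

    With the default (7, 1, 2) this yields train / weight / test sets;
    (6, 1, 1, 2) yields train / weight / validation / test sets.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    parts, start, cut = [], 0, 0
    for r in ratios[:-1]:
        cut += r * n // total
        parts.append(idx[start:cut])
        start = cut
    parts.append(idx[start:])  # remainder goes to the last part
    return parts
```

For 100 samples the default split gives subsets of sizes 70, 10, and 20.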
To validate the scheme of this example, we first evaluated our method and existing methods on predicting human promoters. All results of our model were obtained with the same parameters, showing that the model can be applied to promoter detection in multiple species without readjusting parameters.
This is mainly because the model parameters chiefly concern how the model computes over features, and since the same feature extraction method is used for promoter sequences of different species, the feature dimensions are the same across species, so a model with the same parameters can be trained. However, promoter structures differ greatly between species, so the performance of the RF model and the CNN model varies greatly across species; the model architecture of the invention therefore trains and obtains a different weight combination for each species. In other words, the weights of the two models differ between species, and because the weights are obtained automatically by minimizing the loss value, the resulting model can be guaranteed to be optimal. In some species the classical features better reflect the sequence structure of the promoter, while in others the word-embedding features do. By minimizing the loss value, the better-performing model is given the higher weight, which improves overall performance and yields excellent promoter prediction across species.
We then further compared the generalization ability of the models. Given that some models cannot be applied to some datasets and that model performance differs across datasets, we compared the methods on the maximal intersection of datasets to which they can all be applied. The results are shown in FIG. 3: iPro-WAEL can be effectively applied to all the datasets and exhibits the best performance on them.
Next, we verified whether iPro-WAEL can predict promoters across cell lines. We used a model trained on data from one cell line to predict promoters in other cell lines; the results are shown in FIG. 4. Taking the upper-left graph as an example, the model trained on GM12878 data performs well on the test sets of the four cell lines, showing that the model can effectively perform cross-cell-line prediction with excellent generalization capability and performance. In addition, since previous studies indicate that promoters and enhancers have very similar sequence structures, we further evaluated whether iPro-WAEL can effectively distinguish enhancers from promoters; the results are shown in Table 1. Overall, the proposed method markedly improves prediction precision, generalization, and applicability, can effectively predict promoters across cell lines and distinguish enhancers from promoters, and meets the requirements of practical application.
TABLE 1 Performance of iPro-WAEL in distinguishing enhancers from promoters
Cell lines AUC Accuracy MCC
GM12878 0.9910 0.9622 0.9244
HeLa-S3 0.9943 0.9693 0.9387
HUVEC 0.9851 0.9437 0.8873
K562 0.9929 0.9695 0.9391
Because promoters differ between species, their sequence structures differ; a single type of feature may therefore not predict promoters of different species effectively. For this reason, we first fuse five traditional features for the random forest and then use word-vector embedding to extract features for the CNN model; a model incorporating multiple kinds of features is more robust when predicting promoters of different species.
Second, to integrate the two models reasonably, they are combined through a weighted-average ensemble algorithm. The weighted-average method obtains a combined output by averaging the outputs of the individual models with different weights, the weights reflecting their different importance; unlike simple averaging, it can give more weight to a well-performing model. Specifically, the final output of iPro-WAEL is calculated as follows:
H(x) = Σ_i w_i·h_i(x)   (1)
wherein w_i is the weight of the i-th model and h_i(x) is the output of the i-th model, and the weights are constrained by
Σ_i w_i = 1,  w_i > 0   (2)
To obtain the best weights while keeping the independent test set completely independent, we further divided the original training set into a training set and a dataset for obtaining weights (the weight set) in a 7:1 ratio. Thus, the ratio of training set, weight set, and test set for the RF model is 7:1:2, while the ratio of training set, weight set, validation set, and test set for the CNN model is 6:1:1:2. We then obtain the optimal weights by minimizing the loss value on the weight set. Specifically, the loss on the weight set is the cross-entropy between the predicted values and the true labels, defined as follows:
L_log(y, p) = -(y·log(p) + (1 - y)·log(1 - p))   (3)
where y represents the true label and p represents the predicted value. We then use the sequential least squares programming (SLSQP) algorithm to minimize the loss value; the weights at the minimum loss value are the optimal weights.
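A runnable sketch of this loss minimisation follows. With two weights constrained to sum to 1 the problem is one-dimensional, so a direct grid search over the RF weight stands in here for the SLSQP optimiser used in the text; the function names and the toy grid resolution are assumptions.

```python
import math

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Mean binary cross-entropy, equation (3), averaged over the weight set."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def best_weights(y_true, p_rf, p_cnn, steps=1000):
    """Search w_rf on a grid (with w_cnn = 1 - w_rf) for the minimum
    weight-set loss, returning the (w_rf, w_cnn) pair at the minimum."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        combined = [w * a + (1 - w) * b for a, b in zip(p_rf, p_cnn)]
        loss = cross_entropy(y_true, combined)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, 1.0 - best_w
```

On a toy weight set where the RF probabilities track the labels much better than the CNN ones, the search assigns nearly all weight to the RF model, matching the intuition that the better model earns the higher weight.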
Finally, to ensure optimal model performance, parameters were tuned by grid search. The tuned parameters include the learning rate, the number of convolution kernels, the kernel size, and the number of trees in the random forest. Tables 2 and 3 show results for some of the parameter combinations. Model performance is affected by the parameter settings; the best performance is obtained with a learning rate of 0.001, 32 kernels, a kernel size of 11, and 300 trees in the random forest. We therefore use this parameter combination to construct our model.
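The parameter tuning just described can be sketched as a generic exhaustive grid search. The evaluation function and parameter names below are illustrative assumptions; in the patent the score would come from training and evaluating the model for each combination.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Exhaustive search over a parameter grid.

    `grid` maps parameter name to a list of candidate values; `evaluate`
    maps a parameter dict to a score (higher is better). Returns the best
    parameter dict and its score.
    """
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

With a stub scorer that rewards the combination reported best in the text (learning rate 0.001, 32 kernels), the search returns exactly that combination.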
TABLE 2 Performance of RF model part parameter combinations
n_trees AUC Accuracy MCC
100 0.9803 0.9393 0.8787
200 0.9806 0.9400 0.8801
300 0.9809 0.9415 0.8831
400 0.9809 0.9404 0.8809
500 0.9808 0.9415 0.8831
600 0.9809 0.9415 0.8831
700 0.9809 0.9415 0.8831
800 0.9809 0.9415 0.8831
900 0.9809 0.9411 0.8824
1000 0.9808 0.9400 0.8803
TABLE 3 Performance of partial parameter combinations of CNN model
Example two
This example provides a promoter prediction system across multiple species.
A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtain a DNA sequence, and extract a first feature and a second feature of the DNA sequence respectively;
a first prediction module configured to: obtain a first prediction probability value from the first feature using a random forest model;
a second prediction module configured to: obtain a second prediction probability value from the second feature using a convolutional neural network model;
a loss function construction module configured to: assume a weight for the random forest model and a weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function; and
a determination module configured to: take the sum of the random forest model's weight value multiplied by its prediction probability value and the convolutional neural network model's weight value multiplied by its prediction probability value as the probability value for judging whether the sequence is a promoter.
It should be noted here that the acquisition and feature extraction module, the first prediction module, the second prediction module, the loss function construction module, the weight determination module, and the determination module correspond to the same examples and application scenarios as the steps in the first embodiment, but are not limited to the disclosure of the first embodiment. It should also be noted that the above modules, as part of a system, may be implemented in a computer system such as a set of computer-executable instructions.
Example three
The present embodiment provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps in the promoter prediction method across multiple species as described in the first embodiment above.
Example four
The present embodiment provides a computer device, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the steps of the promoter prediction method across multiple species as described in the first embodiment.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A method of promoter prediction across multiple species, comprising:
obtaining a DNA sequence, and respectively extracting a first characteristic and a second characteristic of the DNA sequence;
based on the first characteristic, a random forest model is adopted to obtain a first prediction probability value;
based on the second characteristic, a convolutional neural network model is adopted to obtain a second prediction probability value;
setting an assumed weight for the random forest model and an assumed weight for the convolutional neural network model respectively, and constructing a loss function based on the first prediction probability value and the second prediction probability value;
determining the weight values of the random forest model and the convolutional neural network model by minimizing the loss function;
and determining the probability value for judging whether the DNA sequence is a promoter as the sum of the product of the weight of the random forest model and its predicted probability value and the product of the weight of the convolutional neural network model and its predicted probability value.
2. The method of promoter prediction across multiple species according to claim 1, wherein the first characteristic comprises: fused reverse complement k-mers, mismatched k-mers, k-spacer nucleic acid pair compositions, trinucleotide physicochemical properties, and pseudo-trinucleotide compositions.
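One of the feature families listed in claim 2 can be sketched as follows: counting k-mers after fusing each k-mer with its reverse complement, so a k-mer and its reverse complement share one feature. This is an illustrative sketch, not the patent's exact encoding; the function names are assumptions:

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def revcomp(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def fused_revcomp_kmer_counts(seq, k=2):
    """Count k-mers, fusing each k-mer with its reverse complement
    under a single canonical key (illustrative sketch)."""
    counts = {}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        canonical = min(kmer, revcomp(kmer))  # one key per fused pair
        counts[canonical] = counts.get(canonical, 0) + 1
    return counts

fused_revcomp_kmer_counts("ATGCA", k=2)
# → {'AT': 1, 'CA': 2, 'GC': 1}   ("TG" is fused with its reverse complement "CA")
```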
3. The method of claim 1, wherein the second feature is a word embedding vector obtained by encoding the DNA sequence with a word embedding model.
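The second feature of claim 3 can be illustrated by tokenizing the DNA sequence into overlapping k-mer "words" and looking each word up in an embedding table. The table here is randomly initialised purely for illustration; in the actual method the vectors would come from a trained word embedding model:

```python
import random

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mer 'words'."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def embed(tokens, dim=4, seed=0):
    """Map each token to a vector via an embedding table.
    Randomly initialised here; a real embedding model learns these vectors."""
    rng = random.Random(seed)
    table = {}
    vectors = []
    for t in tokens:
        if t not in table:
            table[t] = [rng.uniform(-1, 1) for _ in range(dim)]
        vectors.append(table[t])
    return vectors

vecs = embed(kmer_tokens("ATGCGT"))  # 4 tokens → 4 vectors of length 4
```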
4. The method of predicting a promoter across multiple species according to claim 1, wherein the loss function is:

L = -[ y·log( w₁·h₁(x) + w₂·h₂(x) ) + (1 - y)·log( 1 - w₁·h₁(x) - w₂·h₂(x) ) ]

wherein wᵢ is the weight of the i-th model, hᵢ(x) is the predicted probability value output by the i-th model, and y is the true label.
5. The method of claim 1, wherein the weight constraint is:

w₁ + w₂ = 1, 0 ≤ wᵢ ≤ 1 (i = 1, 2).
6. the method of claim 1, wherein the weight determination process comprises:
acquiring a DNA sequence data set and dividing it into a training set and a test set;
respectively training a random forest model and a convolutional neural network model by adopting samples in the training set to obtain a trained random forest model and a trained convolutional neural network model;
based on the test set, testing the trained random forest model and the trained convolutional neural network model to respectively obtain a first prediction probability value and a second prediction probability value;
constructing a loss function by combining the weight set based on the first prediction probability value and the second prediction probability value;
the two weighted values when the loss value of the loss function is minimum are respectively the optimal weight of the random forest model and the optimal weight of the convolutional neural network model;
and the optimal weight of the random forest model and the optimal weight of the convolutional neural network model are the determined weights.
7. The method of claim 6, wherein the loss value for a given weight set is computed with the logarithmic loss:

L_log(y, p) = -( y·log(p) + (1 - y)·log(1 - p) )
where y represents the true label and p represents the prediction probability value.
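Under the constraint of claim 5 (w₁ + w₂ = 1), the weight determination of claim 6 can be sketched as a one-dimensional search minimizing the mean logarithmic loss over the test-set predictions. The grid search used here is one simple possibility; the patent does not state the minimization procedure, and the function names are assumptions:

```python
import math

def log_loss(y, p, eps=1e-12):
    """Logarithmic loss of claim 7, with clipping to avoid log(0)."""
    p = min(max(p, eps), 1 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def find_weights(p_rf, p_cnn, labels, steps=100):
    """Grid-search w_rf in [0, 1] (with w_cnn = 1 - w_rf) minimizing
    the mean log loss of the weighted combined prediction."""
    best_w, best_loss = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        loss = sum(log_loss(y, w * p1 + (1 - w) * p2)
                   for y, p1, p2 in zip(labels, p_rf, p_cnn)) / len(labels)
        if loss < best_loss:
            best_w, best_loss = w, loss
    return best_w, 1.0 - best_w
```

For example, when the random forest's probabilities match the labels much better than the CNN's, the search assigns it the dominant weight.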
8. A promoter prediction system across multiple species, comprising:
an acquisition and feature extraction module configured to: obtaining a DNA sequence, and respectively extracting a first characteristic and a second characteristic of the DNA sequence;
a first prediction module configured to: based on the first characteristics, a random forest model is adopted to obtain a first predicted probability value;
a second prediction module configured to: based on the second characteristic, a convolutional neural network model is adopted to obtain a second prediction probability value;
a loss function construction module configured to: set an assumed weight for the random forest model and an assumed weight for the convolutional neural network model respectively, and construct a loss function based on the first prediction probability value and the second prediction probability value;
a weight determination module configured to: determine the weight values of the random forest model and the convolutional neural network model by minimizing the loss function;
a determination module configured to: determine the probability value for judging whether the DNA sequence is a promoter as the sum of the product of the weight of the random forest model and its predicted probability value and the product of the weight of the convolutional neural network model and its predicted probability value.
9. A computer readable storage medium having stored thereon a computer program, which when executed by a processor performs the steps in the method of promoter prediction across multiple species according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps in the method of promoter prediction across multiple species according to any one of claims 1-7.
CN202210342942.2A 2022-04-02 2022-04-02 Multi-species-crossing promoter prediction method and system Active CN114898805B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210342942.2A CN114898805B (en) 2022-04-02 2022-04-02 Multi-species-crossing promoter prediction method and system

Publications (2)

Publication Number Publication Date
CN114898805A true CN114898805A (en) 2022-08-12
CN114898805B CN114898805B (en) 2024-06-18

Family

ID=82715518



Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108231067A (en) * 2018-01-13 2018-06-29 福州大学 Sound scenery recognition methods based on convolutional neural networks and random forest classification
CN110298611A (en) * 2019-05-16 2019-10-01 重庆瑞尔科技发展有限公司 Regulate and control method and system based on the cargo shipping efficiency of random forest and deep learning
CN113744805A (en) * 2021-09-30 2021-12-03 山东大学 Method and system for predicting DNA methylation based on BERT framework


Similar Documents

Publication Publication Date Title
Forester et al. Comparing methods for detecting multilocus adaptation with multivariate genotype–environment associations
Caye et al. TESS3: fast inference of spatial population structure and genome scans for selection
US11615346B2 (en) Method and system for training model by using training data
CN110852755B (en) User identity identification method and device for transaction scene
CN112955883B (en) Application recommendation method and device, server and computer-readable storage medium
Meher et al. Prediction of donor splice sites using random forest with a new sequence encoding approach
Piao et al. A new ensemble method with feature space partitioning for high‐dimensional data classification
Li et al. Empirical research of hybridizing principal component analysis with multivariate discriminant analysis and logistic regression for business failure prediction
Guo et al. Compartmentalized gene regulatory network of the pathogenic fungus Fusarium graminearum
Basuchoudhary et al. Machine-learning techniques in economics: new tools for predicting economic growth
US11403550B2 (en) Classifier
van Putten et al. Distorted‐distance models for directional dispersal: a general framework with application to a wind‐dispersed tree
Cui et al. Comparative analysis and classification of cassette exons and constitutive exons
Fang et al. Prediction of antifungal peptides by deep learning with character embedding
CN114861531B (en) Model parameter optimization method and device for repeated purchase prediction of user
Roigé et al. Cluster validity and uncertainty assessment for self-organizing map pest profile analysis.
McKibben et al. Applying machine learning to classify the origins of gene duplications
KR20220138696A (en) Method and apparatus for classifying image
Chu et al. Binary quatre using time-varying transfer functions
CN114898805A (en) Cross-species promoter prediction method and system
Osman et al. Hybrid learning algorithm in neural network system for enzyme classification
Ng et al. Comparing the regression slopes of independent groups
Muzio et al. networkGWAS: A network-based approach to discover genetic associations
CN115936104A (en) Method and apparatus for training machine learning models
CN109308565B (en) Crowd performance grade identification method and device, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant