CN111899792A - Method for screening small open reading frames with peptide coding capacity - Google Patents

Publication number
CN111899792A
Authority
CN
China
Prior art keywords
screening
genome
sorfs
genomes
peptide coding
Prior art date
Legal status
Granted
Application number
CN202010777126.5A
Other languages
Chinese (zh)
Other versions
CN111899792B (en)
Inventor
郭丽
钱博文
于家峰
刘健
龚乐君
窦相华
姜雯雯
Current Assignee
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date
2020-08-05
Filing date
2020-08-05
Publication date
2020-11-06
Application filed by Nanjing University of Posts and Telecommunications
Priority to CN202010777126.5A
Publication of CN111899792A
Application granted
Publication of CN111899792B
Legal status: Active

Links

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Artificial Intelligence (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Analytical Chemistry (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

The invention belongs to the technical field of medicine and relates to a method for screening small open reading frames (sORFs) with peptide coding capacity. Prokaryotic genomes with sORF annotations are selected from a database on the basis of genome GC content and grouped by GC content. In each genome, the sORFs with a definite biological function are screened out and redundancy is removed with the CD-Hit program. The peptide-coding sORFs of each genome in turn serve as positive samples and the corresponding randomly scrambled sequences as negative samples, forming a preliminary training set that is used to predict the sORFs of the other genomes; in each GC content interval, the genome giving the best prediction performance is taken as a source genome for the final training set. Based on the selected training set, the codon usage frequency of each sequence is used as the characteristic parameter, and a classifier is trained to complete the screening. The method screens sORFs well in both prokaryotes and eukaryotes.

Description

Method for screening small open reading frames with peptide coding capacity
Technical Field
The invention belongs to the technical field of medicine, and in particular relates to a method for screening small open reading frames with peptide coding capacity.
Background
In recent years, omics sequencing technologies such as genomics and transcriptomics have developed rapidly, driving a shift in life-science research from qualitative to quantitative approaches. However, in the roughly 20 years since the human genome map was published, it has become increasingly clear that interpreting the genome is far more complex than expected. From protein-coding genes to non-coding RNAs, more and more genomic "dark matter" of milestone significance has been discovered, posing serious challenges to researchers. In 2016, the research group of Professor Eric Olson at the University of Texas Southwestern Medical Center reported in Science a small peptide, DWORF, encoded by a small open reading frame (sORF) located on a lncRNA; it is only 34 amino acids long and plays an important role in myocardial contraction. The discovery of DWORF aroused unprecedented interest in sORFs and their encoded polypeptides (SEPs), a class of genomic elements that had long been overlooked.
Small open reading frames (sORFs) capable of encoding polypeptides are functionally important genomic elements discovered in recent years. A large body of research has shown that peptide-coding sORFs are widespread and carry out important biological functions; their encoded sequences are shorter than 100 amino acids. Research on sORFs is nevertheless still at an early stage, because they have low expression levels and abundance, their sequences are short, and suitable experimental techniques are lacking. Bioinformatics algorithms and database resources that can effectively identify sORFs are also lacking, as are effective identification and screening methods. To avoid false positives, traditional gene screening methods usually set 300 bases (100 amino acids) as an artificial minimum gene length, and they are clearly species-dependent. Developing an effective technique for predicting and screening peptide-coding sORFs is therefore important for the mining of functional polypeptides.
Disclosure of Invention
This work combines sequence characteristics with machine learning to develop a method that efficiently predicts peptide-coding sORFs. The method screens sORFs well in both prokaryotes and eukaryotes and can provide a new technique for future research on functional peptides.
Because sORFs are short, difficult to study, and not covered by effective data sets, screening them is hard and few methods are currently available. The invention provides a method for effectively screening peptide-coding sORFs by building a dedicated sORF data set and combining it with machine learning.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows:
A method for screening small open reading frames with peptide coding capacity, the method comprising the following steps:
Prokaryotic genomes with sORF annotations are selected from a database on the basis of genome GC content and grouped by GC content. The sORFs with a definite biological function in each genome are screened out, and redundancy is removed with the CD-Hit program. The peptide-coding sORFs of each genome in turn are used as positive samples and the corresponding randomly scrambled sequences as negative samples, forming a preliminary training set that is used to predict the sORFs in the remaining genomes; the genome with the best prediction performance in each GC content interval is taken as a source genome for the final training set. Based on the selected training set, the codon usage frequency of each sequence is used as the characteristic parameter, and Logistic regression is used as the classifier to train and complete the screening.
Further, the codon usage frequency cof of each sequence is calculated as follows:
[Equation image BDA0002618850190000021: definition of cof]
where c_i denotes the count of each codon and N denotes the sequence length.
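The codon counting described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function name `codon_usage_frequency` and the fixed alphabetical codon order are hypothetical, and because the text does not say whether N is measured in bases or codons, the sketch assumes N is the number of complete, unambiguous in-frame codons.

```python
from itertools import product
from typing import List

# All 64 codons in a fixed (alphabetical) order, so that every sequence
# maps to a feature vector with the same layout.
CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_usage_frequency(seq: str) -> List[float]:
    """Return 64 values c_i / N for a DNA sequence read in frame.

    Assumption: N is the number of complete, unambiguous codons.
    """
    seq = seq.upper().replace("U", "T")
    counts = dict.fromkeys(CODONS, 0)
    # Walk the sequence codon by codon, ignoring a trailing partial codon.
    for i in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons with N or other ambiguity codes
            counts[codon] += 1
    n = sum(counts.values())
    if n == 0:
        return [0.0] * 64
    return [counts[c] / n for c in CODONS]
```

Each sequence then yields one 64-element vector, which matches the 64 input values the description later feeds to the classifier.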
Further, the genomes are grouped by GC content into five intervals: < 29%, 30%-39%, 40%-49%, 50%-59%, and ≥ 60%.
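As a rough illustration of the grouping step, the sketch below computes a genome's GC percentage and assigns it to one of the five groups. The function names are hypothetical. The stated intervals leave gaps (for example between 29% and 30%), so the sketch assumes cut points at 30, 40, 50 and 60%, which places such values in the lower group.

```python
def gc_content(genome: str) -> float:
    """Percentage of G and C among the unambiguous bases A, C, G, T."""
    genome = genome.upper()
    gc = genome.count("G") + genome.count("C")
    total = gc + genome.count("A") + genome.count("T")
    return 100.0 * gc / total if total else 0.0


def gc_group(gc_percent: float) -> int:
    """Map a GC percentage to one of five groups, numbered 0 to 4.

    Assumption: boundaries at 30, 40, 50 and 60 %, so a value that falls
    between two stated intervals (e.g. 29.5 %) goes to the lower group.
    """
    for group, upper in enumerate((30.0, 40.0, 50.0, 60.0)):
        if gc_percent < upper:
            return group
    return 4
```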
Further, the threshold for redundancy processing is 80%.
Further, the start and stop codons of the negative sample are identical to those of the positive sample.
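One way to build such a negative sample is sketched below. The text does not say whether scrambling is done per nucleotide or per codon, so this sketch assumes a nucleotide-level shuffle of the interior, with the first and last three bases left in place; `scrambled_negative` is a hypothetical name.

```python
import random


def scrambled_negative(sorf: str, rng: random.Random) -> str:
    """Shuffle the interior of an sORF, keeping start and stop codons fixed.

    Assumption: shuffling is done at the nucleotide level, so the interior
    of the result may contain in-frame stop codons.
    """
    if len(sorf) < 6:
        raise ValueError("sequence must contain a start and a stop codon")
    start, interior, stop = sorf[:3], list(sorf[3:-3]), sorf[-3:]
    rng.shuffle(interior)
    return start + "".join(interior) + stop
```

Passing an explicit `random.Random(seed)` keeps the negative set reproducible between runs.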
Furthermore, the genome with the best prediction performance in each GC content interval is selected, giving five genomes in total; the qualifying sORFs in these five genomes and their scrambled sequences form the final training set, and the sORFs in the remaining genomes form the test set.
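The selection of source genomes can be sketched as a small cross-prediction loop. This is only an illustration under stated assumptions: `cross_score` is a hypothetical callback that trains on one genome and returns a score (for example accuracy) on another, and "best prediction effect" is taken here to mean the highest mean score over all other genomes, which the text does not specify.

```python
from typing import Callable, Dict


def select_source_genomes(
    groups: Dict[str, int],
    cross_score: Callable[[str, str], float],
) -> Dict[int, str]:
    """Pick, per GC group, the genome whose model predicts the others best.

    groups maps genome id -> GC group. cross_score(train_id, test_id) is a
    hypothetical callback returning the score obtained on test_id by a
    classifier trained only on train_id. At least two genomes are required.
    """
    best: Dict[int, tuple] = {}
    for train_id, group in groups.items():
        others = [g for g in groups if g != train_id]
        # Assumption: genomes are ranked by their mean score on all others.
        mean = sum(cross_score(train_id, g) for g in others) / len(others)
        if group not in best or mean > best[group][0]:
            best[group] = (mean, train_id)
    return {group: genome for group, (_, genome) in best.items()}
```

With five GC groups, the returned dictionary holds one genome per group, corresponding to the five source genomes of the final training set.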
Furthermore, the screening efficiency is evaluated using the accuracy and the Matthews correlation coefficient as evaluation parameters.
Further, the accuracy (ACC) and the Matthews correlation coefficient (MCC) are expressed respectively as:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
TP and FN represent the number of positive samples that are correctly predicted and mispredicted, respectively, and TN and FP represent the number of negative samples that are correctly predicted and mispredicted, respectively.
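Both evaluation parameters follow directly from the four counts defined above. The sketch below uses the standard definitions of accuracy and the Matthews correlation coefficient; returning 0.0 when the MCC denominator is zero is a common convention and an assumption here, not something the text states.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were predicted correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, in the range -1 to 1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0.0 when a whole row or column of the confusion
    # matrix is empty and the coefficient is undefined.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```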
Further, the 64 codon usage frequency values of each sequence are used as the input values of the logistic regression classifier.
Advantageous effects
The method provided by the invention screens peptide-coding sORFs quickly and accurately. It combines genome data with machine learning, achieves high screening accuracy, and offers a new approach for research on functional genes and polypeptides.
The method is not species-dependent and is applicable to different species. For sORFs of 50-100 amino acids, the prediction accuracy and the Matthews correlation coefficient (MCC) are 98.36% and 0.9673, respectively; for sORFs shorter than 50 amino acids, they are 92.27% and 0.8454, respectively. The method can provide an effective technique for screening and studying peptide-coding sORFs and the functional polypeptides they encode.
Detailed Description
Example 1
Purpose: Because sORFs are short, difficult to study, and not covered by effective data sets, screening them is hard and few methods are currently available. The invention provides a method for effectively screening peptide-coding sORFs by building a dedicated sORF data set and combining it with machine learning.
Technical scheme: To solve the above technical problems, the invention adopts the following technical scheme:
1. Establishing a standard data set. The core of the method is an effective data set screening strategy that yields a reliable peptide-coding sORF data set usable across species. Based on genome GC content, 61 prokaryotic genomes with sORF annotations are first selected and downloaded from the RefSeq database and divided into five groups by GC content: < 29%, 30%-39%, 40%-49%, 50%-59%, and ≥ 60%. The sORFs with a definite biological function in each genome are then screened out, and redundancy is removed with the CD-Hit program (80% threshold) to form the data set required for this work. On this basis, the peptide-coding sORFs of each genome in turn are used as positive samples and the corresponding randomly scrambled sequences as negative samples (with the same start and stop codons as the positive samples); this preliminary training set is used to predict the sORFs in the remaining 60 genomes, and the genome with the best prediction performance in each GC content interval is taken as a source genome for the final training set. Finally, the qualifying sORFs in the five genomes NC_009089, NC_00313, NC_012962, NC_000913 and NC_008380, together with their scrambled sequences, are used as the final training set, and the sORFs in the remaining 59 genomes are used as the test set.
2. Classifier. This work uses Logistic regression as the coding/non-coding classifier.
3. Characteristic parameters. The characteristic parameter of the algorithm is the percentage content of each codon.
The method specifically comprises the following steps:
1. Standard data set. Establishing a reliable peptide-coding sORF data set that is usable across species is the key point of the method. Based on genome GC content, 61 prokaryotic genomes with sORF annotations are first selected and downloaded from the RefSeq database and divided into five groups by GC content: < 29%, 30%-39%, 40%-49%, 50%-59%, and ≥ 60%. The sORFs with a definite biological function in each genome are then screened out, and redundancy is removed with the CD-Hit program (80% threshold) to form the data set required for this work (Table 1). On this basis, the peptide-coding sORFs of each genome in turn are used as positive samples and the corresponding randomly scrambled sequences as negative samples (with the same start and stop codons as the positive samples); this preliminary training set is used to predict the sORFs in the remaining 60 genomes, and the genome with the best prediction performance in each GC content interval is taken as a source genome for the final training set. Finally, the qualifying sORFs in the five genomes NC_009089, NC_00313, NC_012962, NC_000913 and NC_008380, together with their scrambled sequences, are used as the final training set, and the sORFs in the remaining 59 genomes are used as the test set.
TABLE 1 information on peptides encoding sORFs satisfying the conditions in each genome
[Table 1 images BDA0002618850190000051 and BDA0002618850190000061 not reproduced]
2. Characteristic parameter
Most conventional characteristic parameters cannot be applied to short sequences. In this work, three parameters were tested for each sequence: codon usage frequency (cof), amino acid usage frequency (amf), and sequence complexity (sc). They are defined as follows:
[Equation images BDA0002618850190000062, BDA0002618850190000071 and BDA0002618850190000072: definitions of cof, amf and sc]
where c_i denotes the count of each codon, N denotes the sequence length, and AA_j denotes the count of each amino acid.
3. Classifier
The method uses Logistic regression as the classifier.
The classifier is a classification model that distinguishes positive from negative samples. Once it has been trained on the training set, a user can input an sORF sequence and the trained classifier judges whether that sequence has peptide coding capacity.
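To make the train-then-predict flow concrete, here is a dependency-free sketch of a logistic regression classifier fitted by batch gradient descent. It is an illustration only: the patent does not describe the solver, regularisation or decision threshold, so the learning rate, epoch count, 0.5 threshold and all function names are assumptions. In practice a library implementation would normally be used, with the 64 codon usage frequencies of each sequence as the feature vector.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(z: float) -> float:
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    lr: float = 0.5,
    epochs: int = 2000,
) -> Tuple[List[float], float]:
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = _sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_coding(x: Sequence[float], w: Sequence[float], b: float,
                   threshold: float = 0.5) -> bool:
    """True if the sequence is classified as peptide-coding (label 1)."""
    return _sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold
```

Here label 1 stands for a positive (peptide-coding) sample and label 0 for a scrambled negative sample.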
4. Evaluation of efficiency
To evaluate the screening efficiency of the method, the accuracy (ACC) and the Matthews correlation coefficient (MCC) are used as evaluation parameters, expressed respectively as:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
TP and FN represent the number of positive samples that are correctly predicted and mispredicted, respectively, and TN and FP represent the number of negative samples that are correctly predicted and mispredicted, respectively.
5. Efficiency of screening
To ensure the best screening performance, the three characteristic parameters from step 2, namely codon usage frequency (cof), amino acid usage frequency (amf) and sequence complexity (sc), are each used in turn as the input feature for training, and the most effective one according to the results in Table 2 is selected as the final characteristic parameter.
Table 2 gives the screening efficiency of the invention on the different data sets. The results show that codon usage frequency gives the best screening performance for sORFs shorter than 50 amino acids and also performs well for the other sORFs, so it is the most effective of the three characteristic parameters and is adopted as the final characteristic parameter of the method. The results were obtained by using the 64 codon usage frequency values as the input values of a logistic regression classifier. To further show the stability of the screening efficiency, Table 3 gives the prediction efficiency for the sORFs of each individual genome. Across genomes, the mean prediction accuracy and mean MCC are 0.9777 and 0.9562, respectively, and the medians are 0.9846 and 0.9697, respectively, which shows that the method is robust.
TABLE 2 screening efficiency of peptide-encoded sORFs
[Table 2 image BDA0002618850190000081 not reproduced]
TABLE 3 prediction efficiency of results from screening of peptide-encoded sORFs in different genomes
[Table 3 images BDA0002618850190000082, BDA0002618850190000091 and BDA0002618850190000101 not reproduced]
The method for screening peptide-coding sORFs provided by the invention combines genome data with machine learning, achieves high screening accuracy, and offers a new approach for research on functional genes and polypeptides.

Claims (9)

1. A method of screening for small open reading frames having peptide coding capacity, said method comprising the steps of:
selecting prokaryotic genomes with sORF annotations from a database on the basis of genome GC content, and grouping the prokaryotic genomes by GC content; screening out the sORFs with a definite biological function in each genome, and removing redundancy with the CD-Hit program; using the peptide-coding sORFs of each genome in turn as positive samples and the corresponding randomly scrambled sequences as negative samples to form a preliminary training set, and using it to predict the sORFs in the remaining genomes; taking the genome with the best prediction performance in each GC content interval as a source genome for the final training set; and, based on the selected training set, using the codon usage frequency of each sequence as the characteristic parameter and Logistic regression as the classifier to train and complete the screening.
2. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the codon usage frequency cof of each sequence is calculated as follows:
[Equation image FDA0002618850180000011: definition of cof]
where c_i denotes the count of each codon and N denotes the sequence length.
3. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the genomes are grouped by GC content into five intervals: < 29%, 30%-39%, 40%-49%, 50%-59%, and ≥ 60%.
4. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the threshold for redundancy removal is 80%.
5. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the start and stop codons of the negative samples are identical to those of the positive samples.
6. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the genome with the best prediction performance in each GC content interval is selected, giving five genomes in total; the qualifying sORFs in the five genomes and their scrambled sequences are used as the final training set, and the sORFs in the remaining genomes are used as the test set.
7. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the screening efficiency is evaluated using the accuracy and the Matthews correlation coefficient as evaluation parameters.
8. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the accuracy (ACC) and the Matthews correlation coefficient (MCC) are expressed respectively as:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
TP and FN represent the number of positive samples that are correctly predicted and mispredicted, respectively, and TN and FP represent the number of negative samples that are correctly predicted and mispredicted, respectively.
9. The method for screening small open reading frames with peptide coding capacity according to claim 1, wherein the 64 codon usage frequency values are used as the input values of the logistic regression classifier.
CN202010777126.5A 2020-08-05 2020-08-05 Method for screening small open reading frames with peptide coding capacity Active CN111899792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010777126.5A CN111899792B (en) 2020-08-05 2020-08-05 Method for screening small open reading frames with peptide coding capacity

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010777126.5A CN111899792B (en) 2020-08-05 2020-08-05 Method for screening small open reading frames with peptide coding capacity

Publications (2)

Publication Number Publication Date
CN111899792A true CN111899792A (en) 2020-11-06
CN111899792B CN111899792B (en) 2022-10-14

Family

ID=73245681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010777126.5A Active CN111899792B (en) 2020-08-05 2020-08-05 Method for screening small open reading frames with peptide coding capacity

Country Status (1)

Country Link
CN (1) CN111899792B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109599149A (en) * 2018-10-25 2019-04-09 Huazhong University of Science and Technology Prediction method of RNA coding potential

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xu Xiaojie et al.: "Study on re-annotation of the coding genes of Porphyromonas gingivalis", Bioinformatics (《生物信息学》) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116453599A (en) * 2023-06-19 2023-07-18 Shenzhen University Open reading frame prediction method, apparatus and storage medium
CN116453599B (en) * 2023-06-19 2024-03-19 Shenzhen University Open reading frame prediction method, apparatus and storage medium

Also Published As

Publication number Publication date
CN111899792B (en) 2022-10-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant