CN106845149A - A new protein sequence representation method based on gene ontology information - Google Patents

Info

Publication number
CN106845149A
Authority
CN
China
Prior art keywords
protein
information
gene ontology
label
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710071092.6A
Other languages
Chinese (zh)
Other versions
CN106845149B (en)
Inventor
肖绚
程翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai simudi Medical Information Technology Co.,Ltd.
Original Assignee
Jingdezhen Ceramic Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-02-09
Filing date
2017-02-09
Publication date
2017-06-13
Application filed by Jingdezhen Ceramic Institute
Priority to CN201710071092.6A
Publication of CN106845149A
Application granted
Publication of CN106845149B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment


Abstract

The present invention relates to a new protein sequence representation method based on gene ontology (GO) information. First, the BLAST program is used to search the Swiss-Prot database for all protein sequences similar to protein sequence P, and every protein in the training dataset is submitted to the GO database to retrieve the GO information it carries. Next, the annotated GO information of protein P is looked up in the GO database. Finally, according to the M labels of the prediction problem, protein P is defined as a discrete vector of M elements. By fusing the GO information of the proteins in the sequence set into the vector description of the new protein P, the method greatly reduces the dimension of GO-based representations. Applied to multi-label prediction of protein subcellular localization and of antimicrobial peptide function, it can significantly improve the prediction success rate of the corresponding predictors and has broad application prospects.

Description

A new protein sequence representation method based on gene ontology information
Technical field
The present invention relates to the fields of bioinformatics, pseudo amino acid composition of proteins, and traditional protein sequence analysis, and in particular to a new protein sequence representation method based on gene ontology information.
Background
With the advance of sequencing technology over the past two decades, bioinformatics has fully entered the post-genomic era. How to analyze the hundreds of millions of available genome sequences is the focus of current research: in which subcellular compartment a protein works, what function it has, what secondary, tertiary, and quaternary structure it adopts, how these genes drive the activity of living organisms, and which proteins may be potential drug targets.
Because answering these questions with biological experiments is time-consuming and laborious, bioinformatics has developed rapidly in recent years and a series of online predictors has emerged. Although the results of these predictors still need experimental verification, they are very helpful to biologists, for example by narrowing the scope of experiments and assisting genome-based drug design.
Some of these predictors are based on sequence information, some on structural information, and some on the latest sequencing information. Sequence-based predictors generally perform worse than structure-based ones, but the information they need is almost always available, so they have been developed extensively. Most sequence-based predictors describe a protein sequence by its pseudo amino acid composition, using components such as dipeptide composition, tripeptide composition, grey-theory factors, and complexity factors. Some of these components describe the local amino acid order of a sequence well and others describe its global amino acid order well; all have contributed to sequence-based prediction of protein structure and functional class.
In recent years the Gene Ontology (GO) has become a particularly important method and tool in bioinformatics and has greatly deepened the integration and use of biological data. Predicting protein structure and function from GO information generally works better than methods based on functional domains or pseudo amino acid composition. GO divides the vocabulary of genes and gene products into three categories covering three aspects of biology: 1) cellular component; 2) molecular function; 3) biological process. The number of terms in the GO database has grown from a few thousand to more than 50,000. GO is organized as a directed acyclic graph, and the current GO uses three relations: is_a, part_of, and regulates. The GO-based prediction methods in common use rely on a 0-1 discrete vector: for each GO term, the corresponding element is 1 if the protein is annotated with that term and 0 otherwise. This encoding records only presence or absence. Some researchers have improved on it by counting how many times a given GO term occurs for a protein, turning the 0-1 vector into an integer vector that adds frequency information. As the GO vocabulary grows, both encodings suffer from the curse of dimensionality. For this reason, some researchers use only the subset of GO terms relevant to the prediction problem instead of the whole vocabulary, which reduces the dimension of the discrete vector and removes some irrelevant information.
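The two encodings described above can be sketched as follows. This is a minimal illustration in Python, not code from the patent: the function names and the four-term vocabulary are assumptions chosen for the example. A real GO vocabulary has tens of thousands of entries, which is where the dimensionality problem comes from.

```python
from collections import Counter


def binary_go_vector(protein_terms, vocabulary):
    """0-1 encoding: 1 if the protein is annotated with the term, else 0."""
    present = set(protein_terms)
    return [1 if term in present else 0 for term in vocabulary]


def count_go_vector(protein_terms, vocabulary):
    """Integer encoding: how many times each term occurs for the protein."""
    counts = Counter(protein_terms)
    return [counts[term] for term in vocabulary]


# Hypothetical four-term vocabulary and annotation list (illustration only).
vocabulary = ["GO:0001669", "GO:0016021", "GO:0022857", "GO:0030054"]
annotations = ["GO:0016021", "GO:0001669", "GO:0016021"]

binary_vec = binary_go_vector(annotations, vocabulary)
count_vec = count_go_vector(annotations, vocabulary)
```

In both cases the vector length equals the vocabulary size, regardless of how many labels the prediction problem has.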
Besides discrete vectors, there are semantic similarity algorithms based on GO, mainly algorithms for term similarity within the same GO branch and algorithms for term similarity across GO branches. They are important for popular research areas such as the analysis, comparison, and prediction of gene function. However, as the number of GO terms has grown sharply, the complexity and running time of these algorithms have grown as well.
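As a rough illustration of what a GO semantic similarity measure works with, the sketch below walks a toy directed acyclic graph and compares two terms by the overlap of their ancestor sets (Jaccard index). This is a deliberately simple stand-in, not one of the algorithms the text refers to; the term names and the `parents` mapping are hypothetical.

```python
def ancestors(term, parents):
    """Return the term itself plus every term reachable through parent edges."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, ()))
    return seen


def jaccard_similarity(term_a, term_b, parents):
    """Shared ancestors divided by all ancestors of either term."""
    a = ancestors(term_a, parents)
    b = ancestors(term_b, parents)
    return len(a & b) / len(a | b)


# Toy DAG: each term maps to its parent terms (hypothetical names).
parents = {
    "root": [],
    "membrane": ["root"],
    "vesicle": ["root"],
    "vesicle_membrane": ["membrane", "vesicle"],
    "plasma_membrane": ["membrane"],
}

sim = jaccard_similarity("vesicle_membrane", "plasma_membrane", parents)
```

Every comparison traverses part of the graph, so the cost grows with the size of the ontology, which is the scaling concern raised above.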
All of the above methods either perform simple summation statistics over GO terms or compute similarities. However, not every protein has relevant information in the GO database, which is a weakness of GO-based methods. The present invention therefore fuses the GO information of a protein with that of other, similar proteins and sizes the description vector by the number of classes in the prediction problem, which reduces the dimension of the GO description vector. The resulting GO-based protein sequence description method is intended to support sequence-based prediction of protein function, structural class, and related properties.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new protein sequence representation method based on gene ontology information. The method fuses the GO information of other proteins into the vector description of a new protein P in order to address the low success rate of multi-label prediction of protein subcellular localization.
To solve the above technical problem, the technical solution of the present invention is a new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:
(1) Use the BLAST program to search the Swiss-Prot database for all protein sequences similar to protein sequence P;
(2) Submit every protein in the training dataset to the GO database and retrieve the GO information of each protein; the GO database website is http://www.geneontology.org/;
(3) Look up the annotated GO information of protein P in the GO database; if P has no such information, look up the GO information of the similar protein sequences one by one in descending order of similarity to P until at least one GO term is found, and use it as the GO information of P;
(4) Suppose the protein function or other prediction problem for P has M labels, denoted $A_1, A_2, \ldots, A_M$. Protein P is then defined as a discrete vector of M elements:

$$P = [\delta_1, \delta_2, \ldots, \delta_M]$$

where $\delta_1$ represents the probability that P belongs to the first label, $\delta_2$ the probability that P belongs to the second label, and so on up to $\delta_M$, the probability that P belongs to the M-th label. All elements are initialized to 0.
$\delta_i$ ($i = 1, 2, \ldots, M$) is computed as follows:
For each GO term of protein P in turn, find the proteins in the training dataset that carry the same term. Suppose n training proteins, $P_1, P_2, \ldots, P_n$, carry that term. If $P_1$ has labels $A_i$ and $A_j$, then $\delta_i$ and $\delta_j$ are each increased by 1; if $P_2$ has labels $A_r$, $A_t$, and $A_y$, then $\delta_r$, $\delta_t$, and $\delta_y$ are each increased by 1; and so on, until every GO term of P has been processed in this way. The resulting vector is the new GO-based description of the protein.
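The counting procedure in step (4) can be sketched in Python as below. This is one possible reading of the text, not the inventors' implementation: the function name, the dictionary layout, and the toy identifiers `T1` to `T3` are all assumptions, and each training protein is counted once per matching GO term.

```python
def go_label_vector(query_terms, train_go, train_labels, num_labels):
    """Build the M-element vector for one query protein.

    query_terms  : iterable of GO terms assigned to the query protein P
    train_go     : dict, training protein ID -> set of GO terms
    train_labels : dict, training protein ID -> list of 1-based label indices
    num_labels   : M, the number of labels in the prediction problem
    """
    delta = [0] * num_labels  # all elements start at 0
    for term in query_terms:
        for protein, terms in train_go.items():
            if term in terms:
                # add 1 to every label of a training protein sharing this term
                for label in train_labels[protein]:
                    delta[label - 1] += 1
    return delta


# Toy training set with hypothetical identifiers and GO terms.
train_go = {"T1": {"GO:a", "GO:b"}, "T2": {"GO:a"}, "T3": {"GO:c"}}
train_labels = {"T1": [1, 3], "T2": [2], "T3": [3]}

vector = go_label_vector(["GO:a", "GO:c"], train_go, train_labels, num_labels=3)
```

The length of `vector` is the number of labels M, independent of how many GO terms exist, which is the dimension reduction the method relies on.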
When the method is used for multi-label prediction of protein subcellular localization, the absolute success rate of the corresponding predictor improves by 5% to 10%.
Compared with existing GO-based methods, the method proposed by the present invention greatly reduces dimensionality. Existing methods reach dimensions of more than ten thousand, whereas here the dimension equals the number of labels to be predicted, typically only a few tens. If the protein to be predicted has no GO information, the GO information of its most similar protein is used instead, which widens the range of proteins to which GO-based methods apply. Applied to multi-label prediction of protein subcellular localization and of antimicrobial peptide function, the method can significantly improve the prediction success rate of the corresponding predictors and has broad application prospects.
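The fallback to similar proteins can be sketched as a small lookup helper. This is an illustrative Python sketch under stated assumptions: the function name, the return shape, and the identifiers `QUERY` and `HIT_A` to `HIT_C` are hypothetical, and `ranked_hits` is assumed to be already sorted from most to least similar (for example, from a BLAST search).

```python
def resolve_go_terms(query_id, ranked_hits, go_annotations):
    """Return (GO terms, source protein ID) for the query.

    Uses the query's own annotations when present; otherwise walks the
    similar proteins in ranked order and stops at the first one that has
    at least one GO term.
    """
    own = go_annotations.get(query_id)
    if own:
        return set(own), query_id
    for hit in ranked_hits:
        terms = go_annotations.get(hit)
        if terms:
            return set(terms), hit
    return set(), None  # nothing found for the query or any of its hits


# Hypothetical annotation table: the query and its best hit have no GO terms.
go_annotations = {
    "HIT_A": set(),
    "HIT_B": {"GO:0016021"},
    "HIT_C": {"GO:0030054"},
}

terms, source = resolve_go_terms("QUERY", ["HIT_A", "HIT_B", "HIT_C"], go_annotations)
```

Returning the source ID alongside the terms makes it easy to record whether a representation was built from the protein's own annotations or from a borrowed set.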
Detailed description of the embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. The specific embodiment described here only explains the invention and does not limit it. The example is a multi-label prediction algorithm for animal protein subcellular localization.
The new protein sequence representation method based on gene ontology information of the present invention comprises the following steps:
(1) Use the BLAST program to search the Swiss-Prot database for all protein sequences similar to protein sequence P.
Protein P can be entered directly on the BLAST tool web page of the Swiss-Prot database, at http://www.uniprot.org/blast/, with the default BLAST parameters. Alternatively, BLAST can be downloaded from NCBI and configured locally (the local version used here is Blast-2.2.28+), with all protein sequences downloaded from the Swiss-Prot protein database. For example, entering protein Q63564 returns a series of similar proteins in descending order of similarity: Q8BG39, A0A091DVS5, H0VBF0, ...
(2) Submit every protein in the training dataset to the GO database and retrieve the GO information of each protein; the GO database website is http://www.geneontology.org/.
For example, the GO information of protein Q63564 is (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085).
(3) Look up the annotated GO information of protein P in the GO database; if P has no such information, look up the GO information of the similar protein sequences one by one in descending order of similarity to P until at least one GO term is found, and use it as the GO information of P.
Q63564 has its own GO information in the database, so that information is used directly. If it had none, the GO information of Q8BG39, A0A091DVS5, H0VBF0, ... would be looked up in turn, in the descending similarity order obtained in step (1), and used as the GO information of Q63564.
(4) In the existing database for multi-label prediction of animal protein subcellular localization there are 20 subcellular locations, so the subcellular localization of protein P has 20 labels, denoted $A_1, A_2, \ldots, A_{20}$. Protein P is defined as a discrete vector of 20 elements:

$$P = [\delta_1, \delta_2, \ldots, \delta_{20}]$$

where $\delta_1$ represents the probability that P belongs to the first label, $\delta_2$ the probability that P belongs to the second label, and so on up to $\delta_{20}$, the probability that P belongs to the 20th label. All elements are initialized to 0.
$\delta_i$ ($i = 1, 2, \ldots, 20$) is computed as follows:
For each GO term of protein P (Q63564) in turn (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085), find the proteins in the training dataset that carry that term. For example, the training-set proteins carrying GO:0001669 are Q29108, Q32PB3, Q6AXZ6, Q29016, Q63053, A0JN61, P79136, Q63053, P79136, Q29016, Q6AXZ6, Q32PB3, Q29108, Q63053, denoted $P_1, P_2, \ldots, P_{14}$. $P_1$ (Q29108) has label 1, so $\delta_1$ is increased by 1. $P_2$ (Q32PB3) has labels 1, 2, and 18, so $\delta_1$, $\delta_2$, and $\delta_{18}$ are each increased by 1. $P_3$ (Q6AXZ6) has label 1, so $\delta_1$ is increased by 1. $P_4$ (Q29016) has label 1, so $\delta_1$ is increased by 1. $P_5$ (Q63053) has labels 2, 5, 6, 7, 9, 18, and 20, so $\delta_2$, $\delta_5$, $\delta_6$, $\delta_7$, $\delta_9$, $\delta_{18}$, and $\delta_{20}$ are each increased by 1. $P_6$ (A0JN61) has labels 2 and 18, so $\delta_2$ and $\delta_{18}$ are each increased by 1. This continues until every GO term of P (Q63564) has been processed in the same way, which yields the new GO-based description of protein Q63564.
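The partial tally spelled out in the text can be reproduced with a few lines of Python. The sketch covers only the six training proteins whose labels are quoted for GO:0001669, so the result is an intermediate state of the vector, not the final representation of Q63564; the variable names are assumptions made for the example.

```python
NUM_LABELS = 20

# Labels quoted in the text for the first six training proteins carrying GO:0001669.
labels_for_go_0001669 = [
    ("Q29108", [1]),
    ("Q32PB3", [1, 2, 18]),
    ("Q6AXZ6", [1]),
    ("Q29016", [1]),
    ("Q63053", [2, 5, 6, 7, 9, 18, 20]),
    ("A0JN61", [2, 18]),
]

delta = [0] * NUM_LABELS  # all 20 elements start at 0
for _accession, labels in labels_for_go_0001669:
    for label in labels:
        delta[label - 1] += 1  # labels are 1-based in the text

# Keep only the non-zero entries, keyed by 1-based label number.
nonzero = {index + 1: value for index, value in enumerate(delta) if value}
print(nonzero)
```

After these six proteins, labels 1, 2, and 18 carry most of the weight; the remaining training proteins and the other six GO terms of Q63564 would add further counts.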
When the method is used for multi-label prediction of protein subcellular localization, the absolute success rate of the corresponding predictor improves by 8%.
The above is only a preferred embodiment of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims (2)

1. A new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:
(1) Use the BLAST program to search the Swiss-Prot database for all protein sequences similar to protein sequence P;
(2) Submit every protein in the training dataset to the GO database and retrieve the GO information of each protein; the GO database website is http://www.geneontology.org/;
(3) Look up the annotated GO information of protein P in the GO database; if P has no such information, look up the GO information of the similar protein sequences one by one in descending order of similarity to P until at least one GO term is found, and use it as the GO information of P;
(4) Suppose the protein function or other prediction problem for P has M labels, denoted $A_1, A_2, \ldots, A_M$. Protein P is then defined as a discrete vector of M elements:

$$P = [\delta_1, \delta_2, \ldots, \delta_M]$$

where $\delta_1$ represents the probability that P belongs to the first label, $\delta_2$ the probability that P belongs to the second label, and so on up to $\delta_M$, the probability that P belongs to the M-th label. All elements are initialized to 0.
$\delta_i$ ($i = 1, 2, \ldots, M$) is computed as follows:
For each GO term of protein P in turn, find the proteins in the training dataset that carry the same term. Suppose n training proteins, $P_1, P_2, \ldots, P_n$, carry that term. If $P_1$ has labels $A_i$ and $A_j$, then $\delta_i$ and $\delta_j$ are each increased by 1; if $P_2$ has labels $A_r$, $A_t$, and $A_y$, then $\delta_r$, $\delta_t$, and $\delta_y$ are each increased by 1; and so on, until every GO term of P has been processed in this way. The resulting vector is the new GO-based description of the protein.
2. The protein sequence representation method based on gene ontology information according to claim 1, characterized in that: when the method is used for multi-label prediction of protein subcellular localization, the absolute success rate of the corresponding predictor improves by 5% to 10%.
CN201710071092.6A 2017-02-09 2017-02-09 A protein sequence representation method based on gene ontology information Active CN106845149B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710071092.6A 2017-02-09 2017-02-09 A protein sequence representation method based on gene ontology information

Publications (2)

Publication Number Publication Date
CN106845149A 2017-06-13
CN106845149B 2019-04-09

Family

ID=59122266

Country Status (1)

Country Link
CN (1) CN106845149B (en)


Also Published As

Publication number Publication date
CN106845149B (en) 2019-04-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210818

Address after: 200000 room jt2132, floor 2, building 39, No. 52, Chengliu Road, Jiading District, Shanghai

Patentee after: Shanghai simudi Medical Information Technology Co.,Ltd.

Address before: 333001 Tao Yang South Road, new Pearl River plant, Jingdezhen, Jiangxi 27

Patentee before: JINGDEZHEN CERAMIC INSTITUTE