CN106845149A - A new protein sequence representation method based on gene ontology information - Google Patents
- Publication number
- CN106845149A (application CN201710071092.6A)
- Authority
- CN
- China
- Prior art keywords
- protein
- information
- gene ontology
- label
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B15/00—ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Engineering & Computer Science (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biotechnology (AREA)
- Medical Informatics (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Chemical & Material Sciences (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Crystallography & Structural Chemistry (AREA)
- Analytical Chemistry (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a new protein sequence representation method based on gene ontology (GO) information. First, the BLAST program is used to search the Swiss-Prot database for all protein sequences similar to protein sequence P, and every protein in the training data set is looked up in the GO database to retrieve the GO information it carries. Next, the annotated gene ontology information of protein P is looked up in the gene ontology database. Finally, according to the M labels of the prediction problem, protein P is defined as a discrete vector of M elements. By fusing the GO information of the proteins in the sequence set into the vector description of the new protein P, the method greatly reduces the dimensionality of GO-based representations. Applied to multi-label protein subcellular localization prediction and multi-label antimicrobial peptide function prediction, it can significantly improve the prediction success rate of the relevant predictors and has broad application prospects.
Description
Technical field
The present invention relates to the technical fields of bioinformatics, pseudo amino acid composition of proteins, and conventional protein sequence analysis, and more particularly to a new protein sequence representation method based on gene ontology information.
Background art
With the progress of sequencing technology over the past two decades, bioinformatics has fully entered the post-genome era. How to analyze the hundreds of millions of genome sequences now available is a focus of current research: in which subcellular compartment a protein works, what function it has, what secondary, tertiary, and quaternary structure it adopts, how these genes keep an organism alive and active, and which proteins may be potential drug targets.
Because answering these questions with biological experiments is time-consuming and laborious, bioinformatics has developed rapidly in recent years, and a series of online predictors has emerged. Although the results of these predictors still need to be verified experimentally, they are very helpful to biologists, for example by narrowing the scope of experiments and assisting genome-based drug design.
Some of these predictors are based on sequence information, some on structural information, and some on the latest sequencing information. Predictors based on sequence information generally perform worse than those based on structural information, but the information they require is almost always available, so they have been widely developed. Most sequence-based predictors describe a protein sequence by its pseudo amino acid composition, using components such as dipeptide composition, tripeptide composition, grey-theory factors, and complexity factors. Some of these describe the local amino acid order information of a protein sequence well and others describe its global amino acid order information well; all have contributed to sequence-based prediction of protein structure and function classes.
In recent years the Gene Ontology (GO) has become a particularly important method and tool in bioinformatics, greatly deepening our ability to integrate and use biological data. Predicting protein structure and function from GO information generally works better than other approaches such as functional domains and pseudo amino acid composition. The gene and gene product vocabulary covered by GO is divided into three categories, covering three aspects of biology: 1) cellular component; 2) molecular function; 3) biological process. The number of terms in the GO database has grown from a few thousand to more than 50,000. GO is structured as a directed acyclic graph, and three kinds of relations are used in the current GO: is_a, part_of, and regulates. The most common GO-based prediction method uses a 0-1 discrete vector with one element per GO term: the element is 1 if the protein sequence is annotated with that term and 0 otherwise. This records only presence or absence. Some researchers have improved on it by counting how many times a specific GO term occurs for a protein, changing the 0-1 discrete vector into an integer vector that adds frequency information. As the vocabulary of the GO database grows, both methods suffer from the curse of dimensionality. For this reason, some researchers use only the part of the GO vocabulary that is relevant to the prediction problem rather than the full dictionary, which reduces the dimension of the discrete vector and removes some irrelevant information.
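The two existing encodings described above can be illustrated with a minimal Python sketch. The function names, the four-term vocabulary, and the annotation list are hypothetical and chosen only for illustration; a real GO vocabulary would have tens of thousands of entries, which is the dimensionality problem the background describes.

```python
from collections import Counter

def binary_go_vector(annotations, vocabulary):
    """0-1 encoding: 1 if the protein is annotated with the term, else 0."""
    present = set(annotations)
    return [1 if term in present else 0 for term in vocabulary]

def count_go_vector(annotations, vocabulary):
    """Integer encoding: how many times each term occurs for the protein."""
    counts = Counter(annotations)
    return [counts[term] for term in vocabulary]

# Hypothetical toy vocabulary and annotation records (illustration only).
vocabulary = ["GO:0001669", "GO:0016021", "GO:0022857", "GO:0030054"]
annotations = ["GO:0016021", "GO:0030054", "GO:0016021"]

binary = binary_go_vector(annotations, vocabulary)
counts = count_go_vector(annotations, vocabulary)
print(binary)  # [0, 1, 0, 1]
print(counts)  # [0, 2, 0, 1]
```

In both encodings the vector length equals the vocabulary size, so it grows with every term added to the GO database.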
Besides discrete vector methods, there are also semantic similarity algorithms based on GO, mainly algorithms for term similarity within the same GO branch and algorithms for term similarity across GO branches. These are important for popular research areas such as the analysis, comparison, and prediction of gene function. However, as the number of GO terms increases sharply, the complexity and running time of these algorithms also increase.
The methods above all rely on simple summation statistics over GO terms or on similarity calculations. However, not every protein has relevant information in the GO database, which is a weakness of GO-based methods. To address this, the present invention fuses a protein's own GO information with the GO information of similar proteins and, based on the number of classes in the prediction problem, reduces the dimension of the GO description vector. The resulting new GO-based protein sequence description method is intended to help sequence-based prediction of protein function, structure type, and the like.
Summary of the invention
The technical problem to be solved by the present invention is to provide a new protein sequence representation method based on gene ontology information that fuses the GO information of other proteins into the vector description of a new protein P, in order to address the relatively low success rate of multi-label protein subcellular localization prediction.
To solve the above technical problem, the technical solution of the present invention is a new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:
(1) Search the Swiss-Prot database with the BLAST program to find all protein sequences similar to protein sequence P;
(2) Input all proteins in the training data set into the GO database and look up the GO information of each protein; the GO database website is http://www.geneontology.org/;
(3) Look up the annotated gene ontology information of protein P in the gene ontology database; if protein P has no relevant information, look up the GO information of the similar protein sequences in descending order of similarity to P until at least one GO term is found, and use it as the GO information of protein P;
(4) Suppose the function of protein P, or another prediction problem, has M labels, denoted A1, A2, …, AM. Protein P is defined as a discrete vector of M elements, as shown in the following formula:
P = [δ1, δ2, …, δM]
δ1 represents the probability that protein P belongs to the first label, δ2 the probability that it belongs to the second label, and so on, with δM the probability that it belongs to the M-th label; their initial values are all 0;
δi (i = 1, 2, …, M) is calculated as follows:
For each GO term of protein P in turn, find the proteins in the training data set that contain that term. Suppose n proteins in the training set contain the term, denoted P1, P2, …, Pn. If P1 has labels Ai and Aj, then δi and δj are each increased by 1; if P2 has labels Ar, At, and Ay, then δr, δt, and δy are each increased by 1; and so on, until all GO terms of protein P have been processed in this way. This gives the new GO-based description of the protein.
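The calculation of δi in step (4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `go_label_vector`, the data layout (dictionaries keyed by protein identifier), and the toy identifiers `T1` to `T3` and `GO:a` to `GO:c` are all hypothetical. The sketch counts each training protein once per GO term; the source does not say how repeated annotation records should be handled.

```python
def go_label_vector(query_go_terms, training_go, training_labels, num_labels):
    """Return [δ1, ..., δM] for one query protein.

    query_go_terms : GO terms of the query protein P
    training_go    : training protein id -> set of GO terms
    training_labels: training protein id -> set of labels in 1..M
    num_labels     : M, the number of labels in the prediction problem
    """
    delta = [0] * num_labels  # all δi start at 0
    for term in query_go_terms:
        for protein, terms in training_go.items():
            if term in terms:
                # every label of a matching training protein gains 1
                for label in training_labels[protein]:
                    delta[label - 1] += 1
    return delta

# Hypothetical toy training set (illustration only).
training_go = {"T1": {"GO:a", "GO:b"}, "T2": {"GO:b"}, "T3": {"GO:c"}}
training_labels = {"T1": {1, 3}, "T2": {2}, "T3": {3}}

delta = go_label_vector(["GO:a", "GO:b"], training_go, training_labels, 4)
print(delta)  # [2, 1, 2, 0]
```

The resulting values are raw counts. The text calls them probabilities, so a predictor may want to normalize them, but the source does not describe a normalization step.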
When the method is used for multi-label protein subcellular localization prediction, the absolute success rate of the relevant predictor improves by 5~10%.
Compared with existing GO-based methods, the method proposed by the present invention greatly reduces dimensionality. Existing methods reach more than ten thousand dimensions, whereas with this method the dimension equals the number of labels to be predicted, typically only a few dozen. If the protein to be predicted has no GO information, the GO information of its most similar protein is used, which widens the range of proteins to which GO-based methods apply. Applied to multi-label protein subcellular localization prediction and multi-label antimicrobial peptide function prediction, the method can significantly improve the prediction success rate of the relevant predictors and has broad application prospects.
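The fallback to the most similar annotated protein can be sketched as below. The function name `resolve_go_terms`, the dictionary layout, and the identifiers `NEW`, `KNOWN`, and `HIT1` to `HIT3` are hypothetical; the ranked list is assumed to come from a prior BLAST search, already sorted from most to least similar.

```python
def resolve_go_terms(query_id, ranked_similar_ids, go_annotations):
    """Return the query's own GO terms, or those of its best annotated hit.

    ranked_similar_ids: similar proteins, most similar first (assumed to be
                        the ordering produced by the BLAST search)
    go_annotations    : protein id -> set of GO terms; unannotated proteins
                        are simply absent or map to an empty set
    """
    own = go_annotations.get(query_id)
    if own:
        return set(own)
    for hit in ranked_similar_ids:
        terms = go_annotations.get(hit)
        if terms:  # stop at the first hit with at least one GO term
            return set(terms)
    return set()  # neither the query nor any hit is annotated

# Hypothetical toy data (illustration only).
go_annotations = {
    "KNOWN": {"GO:0001669"},
    "HIT2": {"GO:0016021"},
    "HIT3": {"GO:0030054"},
}
ranked = ["HIT1", "HIT2", "HIT3"]

print(resolve_go_terms("NEW", ranked, go_annotations))    # {'GO:0016021'}
print(resolve_go_terms("KNOWN", ranked, go_annotations))  # {'GO:0001669'}
```

Stopping at the first annotated hit matches the wording "until at least one GO term is found"; pooling terms from several hits would be a different design that the source does not describe.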
Detailed description of embodiments
To make the purpose, technical solution, and advantages of the present invention clearer, the invention is described in further detail below with reference to an embodiment. It should be understood that the specific embodiment described here only explains the invention and does not limit it. The example here is a multi-label prediction algorithm for the subcellular localization of animal proteins.
The new protein sequence representation method based on gene ontology information of the present invention is applied in the following steps:
(1) Search the Swiss-Prot database with the BLAST program to find all protein sequences similar to protein sequence P.
Protein P can be entered directly on the BLAST tool web page of the Swiss-Prot database at http://www.uniprot.org/blast/ with the default BLAST parameters. Alternatively, BLAST can be downloaded from NCBI and configured locally; the local version used here is blast-2.2.28+, with all protein sequences downloaded from the Swiss-Prot protein database. For example, entering protein Q63564 returns a series of similar proteins sorted by descending similarity: Q8BG39, A0A091DVS5, HOVBF0, …
(2) Input all proteins in the training data set into the GO database and look up the GO information of each protein; the GO database website is http://www.geneontology.org/.
For example, the GO information of protein Q63564 is (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085).
(3) Look up the annotated gene ontology information of protein P in the gene ontology database; if protein P has no relevant information, look up the GO information of the similar protein sequences in descending order of similarity to P until at least one GO term is found, and use it as the GO information of protein P.
Q63564 has its own gene ontology information in the database, so that information is used directly. If it could not be obtained, the gene ontology information of Q8BG39, A0A091DVS5, HOVBF0, … would be looked up in turn, in the descending similarity order obtained in the first step, and used as the ontology information of the Q63564 sequence.
(4) In the existing database for multi-label prediction of animal protein subcellular localization there are 20 subcellular locations, so the subcellular localization of protein P has 20 labels, denoted A1, A2, …, A20. Protein P is defined as a discrete vector of 20 elements, as shown in the following formula:
P = [δ1, δ2, …, δ20]
δ1 represents the probability that protein P belongs to the first label, δ2 the probability that it belongs to the second label, and so on, with δ20 the probability that it belongs to the 20th label; their initial values are all 0;
δi (i = 1, 2, …, 20) is calculated as follows:
For each GO term of protein Q63564 in turn (GO:0001669, GO:0016021, GO:0022857, GO:0030054, GO:0030672, GO:0043195, GO:0055085), find the proteins in the training data set that contain that term. For example, the proteins in the training set that contain GO:0001669 are Q29108, Q32PB3, Q6AXZ6, Q29016, Q63053, A0JN61, P79136, Q63053, P79136, Q29016, Q6AXZ6, Q32PB3, Q29108, Q63053, denoted P1, P2, …, P14. P1 (Q29108) has label 1, so δ1 is increased by 1. P2 (Q32PB3) has labels 1, 2, and 18, so δ1, δ2, and δ18 are each increased by 1. P3 (Q6AXZ6) has label 1, so δ1 is increased by 1. P4 (Q29016) has label 1, so δ1 is increased by 1. P5 (Q63053) has labels 2, 5, 6, 7, 9, 18, and 20, so δ2, δ5, δ6, δ7, δ9, δ18, and δ20 are each increased by 1. P6 (A0JN61) has labels 2 and 18, so δ2 and δ18 are each increased by 1. This continues until all GO terms of protein Q63564 have been processed in this way, which gives the new GO-based description of protein Q63564.
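The first six updates of the worked example above can be checked with a few lines of Python. The protein identifiers and labels are those given in the text for P1 to P6 under GO:0001669; the variable names are illustrative. The result is only a partial vector, because the text does not list the labels of P79136 or the matches for the other six GO terms.

```python
# Labels of the first six training proteins containing GO:0001669,
# as stated in the embodiment (P1..P6).
labels_of = {
    "Q29108": [1],
    "Q32PB3": [1, 2, 18],
    "Q6AXZ6": [1],
    "Q29016": [1],
    "Q63053": [2, 5, 6, 7, 9, 18, 20],
    "A0JN61": [2, 18],
}
first_six = ["Q29108", "Q32PB3", "Q6AXZ6", "Q29016", "Q63053", "A0JN61"]

delta = [0] * 20  # δ1..δ20 all start at 0
for protein in first_six:
    for label in labels_of[protein]:
        delta[label - 1] += 1  # labels are 1-based, list indices 0-based

# Keep only the non-zero entries, keyed by label number, for readability.
nonzero = {i + 1: v for i, v in enumerate(delta) if v}
print(nonzero)
# {1: 4, 2: 3, 5: 1, 6: 1, 7: 1, 9: 1, 18: 3, 20: 1}
```

After these six proteins, δ1 = 4 and δ2 = δ18 = 3, so labels 1, 2, and 18 already dominate the partial description of Q63564.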
When the method is used for multi-label protein subcellular localization prediction, the absolute success rate of the relevant predictor improves by 8%.
The above is only a preferred embodiment of the present invention and does not limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (2)
1. A new protein sequence representation method based on gene ontology information, characterized by comprising the following steps:
(1) Search the Swiss-Prot database with the BLAST program to find all protein sequences similar to protein sequence P;
(2) Input all proteins in the training data set into the GO database and look up the GO information of each protein; the GO database website is http://www.geneontology.org/;
(3) Look up the annotated gene ontology information of protein P in the gene ontology database; if protein P has no relevant information, look up the GO information of the similar protein sequences in descending order of similarity to P until at least one GO term is found, and use it as the GO information of protein P;
(4) Suppose the function of protein P, or another prediction problem, has M labels, denoted A1, A2, …, AM. Protein P is defined as a discrete vector of M elements, as shown in the following formula:
P = [δ1, δ2, …, δM]
δ1 represents the probability that protein P belongs to the first label, δ2 the probability that it belongs to the second label, and so on, with δM the probability that it belongs to the M-th label; their initial values are all 0;
δi (i = 1, 2, …, M) is calculated as follows:
For each GO term of protein P in turn, find the proteins in the training data set that contain that term. Suppose n proteins in the training set contain the term, denoted P1, P2, …, Pn. If P1 has labels Ai and Aj, then δi and δj are each increased by 1; if P2 has labels Ar, At, and Ay, then δr, δt, and δy are each increased by 1; and so on, until all GO terms of protein P have been processed in this way. This gives the new GO-based description of the protein.
2. The protein sequence representation method based on gene ontology information according to claim 1, characterized in that: when the method is used for multi-label protein subcellular localization prediction, the absolute success rate of the relevant predictor improves by 5~10%.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710071092.6A CN106845149B (en) | 2017-02-09 | 2017-02-09 | A kind of protein sequence representation method based on gene ontology information |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710071092.6A CN106845149B (en) | 2017-02-09 | 2017-02-09 | A kind of protein sequence representation method based on gene ontology information |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106845149A (en) | 2017-06-13 |
CN106845149B (en) | 2019-04-09 |
Family
ID=59122266
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710071092.6A Active CN106845149B (en) | 2017-02-09 | 2017-02-09 | A kind of protein sequence representation method based on gene ontology information |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106845149B (en) |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105046103A (en) * | 2015-07-03 | 2015-11-11 | 景德镇陶瓷学院 | Novel representation method for protein sequence fusing genetic information |
Non-Patent Citations (1)
Title |
---|
XUAN XIAO ET AL: "A Multi-Label Classifier for Predicting the Subcellular Localization of Gram-Negative Bacterial Proteins with Both Single and Multiple Sites", 《PLOS ONE》 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111091874A (en) * | 2019-12-20 | 2020-05-01 | 东软集团股份有限公司 | Protein feature construction method, device, equipment, storage medium and program product |
CN111091874B (en) * | 2019-12-20 | 2024-01-19 | 东软集团股份有限公司 | Protein feature construction method, protein feature construction device, protein feature construction apparatus, protein feature construction program product, and protein feature construction program product |
CN112201300A (en) * | 2020-10-23 | 2021-01-08 | 天津大学 | Protein subcellular localization method based on depth image features and threshold learning strategy |
CN112201300B (en) * | 2020-10-23 | 2022-05-13 | 天津大学 | Protein subcellular localization method based on depth image features and threshold learning strategy |
CN115565607A (en) * | 2022-10-20 | 2023-01-03 | 抖音视界有限公司 | Method, device, readable medium and electronic equipment for determining protein information |
CN115565607B (en) * | 2022-10-20 | 2024-02-23 | 抖音视界有限公司 | Method, device, readable medium and electronic equipment for determining protein information |
Also Published As
Publication number | Publication date |
---|---|
CN106845149B (en) | 2019-04-09 |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
TR01 | Transfer of patent right | Effective date of registration: 20210818. Address after: 200000 room jt2132, floor 2, building 39, No. 52, Chengliu Road, Jiading District, Shanghai. Patentee after: Shanghai simudi Medical Information Technology Co.,Ltd. Address before: 333001 Tao Yang South Road, new Pearl River plant, Jingdezhen, Jiangxi 27. Patentee before: JINGDEZHEN CERAMIC INSTITUTE |