CN105260626B

CN105260626B - Full-information prediction method for the spatial conformation of protein structures

Info

Publication number: CN105260626B
Application number: CN201510623583.8A
Authority: CN
Inventors: 杨家安
Original assignee: MAIKERO MEDICINE TECHNOLOGY (WUHAN)CO Ltd
Current assignee: McCullough Biotechnology (Shanghai) Co.,Ltd.
Priority date: 2015-09-25
Filing date: 2015-09-25
Publication date: 2017-11-14
Anticipated expiration: 2035-09-25
Also published as: CN105260626A

Abstract

The invention relates to a full-information prediction method for the spatial conformation of protein structures and belongs to the field of bioinformatics. For any protein sequence, protein structure fingerprint technology is used to obtain the corresponding protein folding conformations directly by high-throughput screening of the 5AAPFSC database. Each folding conformation is represented by a protein folding shape code (PFSC) letter, and these folding structures cover both secondary and tertiary structure. All possible folding shape codes are aligned into an array, which yields a PFSC protein spatial conformation band as the prediction result. Tests on a large number of proteins with known three-dimensional structures have verified the reliability and validity of the method.

Description

Full-information prediction method for the spatial conformation of protein structures

Technical field

The invention relates to a full-information prediction method for the spatial conformation of protein structures and belongs to the field of bioinformatics.

Background art

Protein structure is important information for research in genomics, bioinformatics, drug development, and biotechnology [1,2]. To date, however, the three-dimensional structures of less than about 1% of proteins have been determined by experimental methods such as X-ray crystallography or nuclear magnetic resonance [3]. The sequences of more than about 5,002,000,000 proteins still lack three-dimensional structural information and data [4], and biomedical research urgently needs to determine the spatial structures of these proteins. Over many years, numerous computer-modeling methods and applications for protein structure prediction have been developed. Since 1994, the biennial Critical Assessment of protein Structure Prediction (CASP) has served as an exchange platform for protein molecular bioscientists worldwide [5,6]. Given the complexity of protein structure and the exponential number of possible folding patterns, the problem of predicting protein structure has been listed as one of the 100 great challenges of modern science in the 21st century [7].

To date, methods for predicting protein structure fall into three broad classes. The first class is sequence-based modeling [8,9,10]. It uses known protein structures to solve unknown ones. It relies on the degree of similarity between sequences to extract information, and the reliability of its predictions has always been in question. The second class is threading modeling based on fold recognition [11,12,13,14,15]. It uses statistical methods to screen a specific protein database for correlations between folded fragments and sequences. Statistical methods can cover most folding patterns, but folding patterns of low frequency are often ignored. The third class is ab initio modeling [16,17,18]. It uses a computer to calculate iteratively the interactions between the amino acids and atoms in a protein until the whole conformational system settles into a relatively low energy state. It consumes large amounts of computer time and resources, and a prediction yields only one possible spatial structure for the protein concerned. Biologists have long hoped that prediction methods would yield reliable and uncontested protein structures. With that goal, many studies have tried to improve protein structure prediction, but progress has been far from satisfactory. The root cause is the inherent complexity and variability of protein structure.

Summary of the invention

The technical problem to be solved by the invention is to provide a full-information prediction method for protein conformation (Complete Prediction for Protein Conformation, CPPC). The method uses a digital model to simplify the complexity of protein structure, while using full-information structural data to account for the variability of protein structure. It can predict the structure of a protein quickly and provide all possible protein spatial conformations.

The full-information prediction method for protein conformation of the invention is a new method for predicting protein structure [19], developed on the basis of the protein folding shape code (Protein Folding Shape Code, PFSC) disclosed in the inventor's earlier patent ZL200880003164.2. The rigorously derived PFSC can completely describe the folding shape of any fragment of 5 consecutive amino acids. Any folding shape of a 5-amino-acid fragment in a protein can be described by one of 27 PFSC vectors, which are denoted by the 26 letters of the English alphabet plus the $ symbol. More importantly, the 27 PFSC vectors together cover a complete mathematical space. Moreover, the folding shapes of the 27 PFSC vectors are closely related to one another: each PFSC vector can be converted into another vector through a transition.

From a mathematical point of view, 5 amino acids can form different arrangements according to their order in the sequence. Drawing 5 amino acids arbitrarily from the full set of 20 gives 20^5 = 3,200,000 different arrangements in total. The possible folding conformations of each arrangement can be obtained from the worldwide Protein Data Bank (PDB) and then expressed as protein folding shape codes (PFSC). On this basis, we created a database that collects the folding conformations of these 3,200,000 arrangements. This new database is named 5AAPFSC. In it, the folding shapes associated with each arrangement are stored in full as the corresponding PFSC codes.
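The counting argument and the shape of such a database can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of uppercase letters for the PFSC alphabet, and the in-memory dictionary layout are assumptions made for illustration only, and the observations passed in are toy data rather than PDB-derived records.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"            # the 20 standard one-letter codes
PFSC_ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ$"   # 26 letters plus '$' (uppercase is an assumption)


def count_arrangements(alphabet=AMINO_ACIDS, length=5):
    """Number of ordered arrangements with repetition: len(alphabet) ** length."""
    return len(alphabet) ** length


def enumerate_arrangements(alphabet=AMINO_ACIDS, length=5):
    """Lazily yield every arrangement as a string, e.g. 'AAAAA', 'AAAAC', ..."""
    for combo in product(alphabet, repeat=length):
        yield "".join(combo)


def build_toy_database(observations):
    """Hypothetical 5AAPFSC-like table.

    observations: iterable of (pentapeptide, pfsc_code) pairs.
    Returns {pentapeptide: [codes ordered by descending frequency]};
    ties are broken alphabetically so the result is deterministic.
    """
    counts = {}
    for peptide, code in observations:
        counts.setdefault(peptide, Counter())[code] += 1
    return {
        peptide: [c for c, _ in sorted(ctr.items(), key=lambda kv: (-kv[1], kv[0]))]
        for peptide, ctr in counts.items()
    }
```

Storing the codes of each arrangement already sorted by frequency means that a later lookup returns a ready-made column for the conformation band.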

The full-information prediction method for the spatial conformation of protein structures of the invention comprises the following steps:

1) 5 amino acids are drawn arbitrarily from the full set of 20 amino acids, forming 3,200,000 different arrangements in total. The possible folding conformations of each arrangement are obtained from the worldwide Protein Data Bank (PDB) and then expressed as protein folding shape codes (PFSC). A database is created to collect these arrangements and their corresponding protein folding shape codes. The database is named 5AAPFSC, as shown in Fig. 1.

2) For any protein whose structure is to be predicted, the sequence is traversed step by step from the N-terminus to the C-terminus, reading each run of 5 consecutive amino acids in turn. The folding conformations that each run may adopt are obtained directly from the 5AAPFSC database and expressed as PFSC characters. The code of the folding conformation that occurs most frequently in the Protein Data Bank is placed first, the code with the second-highest frequency is placed second, and so on, forming a column from top to bottom until all codes are collected. Different runs of 5 consecutive amino acids have different numbers of possible folding conformations.

3) All possible folding shape codes of the protein under test form an array called the protein folding conformation band, as shown in Fig. 2, which represents all possible folding conformations along the protein sequence. For each protein sequence, all possible conformations can be obtained exactly by substituting its possible local folding conformations for one another. The total number of possible conformations is the product of the numbers of possible folding conformations over all runs of 5 amino acids.
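Steps 2) and 3) can be sketched in Python as a sliding window over the sequence followed by a table lookup. This is an illustrative sketch under stated assumptions: `database` stands in for 5AAPFSC as a plain dictionary mapping each 5-residue string to its PFSC codes in descending order of frequency, and the function names are hypothetical.

```python
from math import prod


def five_residue_windows(sequence, size=5):
    """Read consecutive overlapping windows from the N-terminus to the C-terminus."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]


def build_conformation_band(sequence, database):
    """Return one column per window; each column lists the PFSC codes for
    that window, most frequent first. A window missing from the toy
    database yields an empty column."""
    return [list(database.get(window, [])) for window in five_residue_windows(sequence)]


def count_possible_conformations(band):
    """Total number of conformations: the product of the column heights."""
    return prod(len(column) for column in band)
```

A sequence of n residues gives n - 4 windows, so the band has n - 4 columns. With a toy database `{"ACDEF": ["A", "B"], "CDEFG": ["H", "J", "V"]}` and the sequence `"ACDEFG"`, the band has two columns and 2 x 3 = 6 possible conformations.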

For any protein under test, although the total number of possible spatial conformations is enormous, the most probable spatial conformations are obtained from the local folding conformations that occur most frequently. For example, the first spatial conformation consists of the folding shape codes with the highest frequency of occurrence. The second spatial conformation consists of the folding shape codes with the second-highest frequency, with the highest-frequency code used as a supplement at positions that have no second-ranked conformation. The third spatial conformation consists of the folding shape codes with the third-highest frequency, with the highest-frequency code used as a supplement at positions that have no third-ranked conformation. Continuing in this way yields a series of predicted conformations of higher probability.

Therefore, a string of protein folding shape codes made up of high-frequency conformations is a protein spatial conformation of higher probability. From the protein folding conformation band, further local changes and substitutions can be identified and applied to form additional related possible spatial conformations.
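The rule for reading the k-th ranked conformation out of the band, with fallback to the most frequent code, can be written in a few lines of Python. The function name and the list-of-columns representation of the band are assumptions for illustration, and every column is assumed to hold at least one code.

```python
def ranked_conformation(band, rank):
    """Build the conformation string for a given rank (1 = most frequent).

    band: list of columns, each a non-empty list of PFSC codes ordered
    from most to least frequent. Where a column has fewer than `rank`
    codes, its most frequent code is used as the supplement.
    """
    if rank < 1:
        raise ValueError("rank must be 1 or greater")
    return "".join(
        column[rank - 1] if len(column) >= rank else column[0]
        for column in band
    )
```

For the toy band `[["A", "B"], ["H"], ["A", "J", "V"]]`, ranks 1, 2, and 3 give `"AHA"`, `"BHJ"`, and `"AHV"`: the middle column has only one code, so it contributes `"H"` at every rank.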

The protein spatial conformation band obtained by this analysis provides a full-information prediction of the spatial folding conformations of a protein structure and also reveals small variations in any of its possible local conformations. The significance of the full-information prediction (CPPC) method is that it creates the necessary conditions for building, in the future, a brand-new database of gene and protein structure composition. CPPC is a new method for predicting protein structure, and it will advance the development of protein structural genomics. The technique we have developed not only provides complete folding conformations for protein structure prediction, but is also of great significance for a comprehensive understanding of protein structures obtained by experimental measurement.

Brief description of the drawings

Fig. 1: Structure of the 5AAPFSC database.

Fig. 2: Construction of the protein folding conformation band.

Fig. 3: Comparison of the known conformations of the 2XCW protein fragment (residues 3-62) of human cytosolic 5'-nucleotidase II with the full-information prediction results. The first row of the table is the amino acid sequence of the fragment (residues 3-62). It is followed by the folding conformations of 8 known structures, expressed as protein folding shape codes (PFSC). The lower half of the table contains 9 predicted possible spatial conformations.

Fig. 4: Predicted spatial conformations of the 32-amino-acid calcitonin from the marine silver shark (Callorhinchus milii).

Detailed description

For any protein sequence, protein structure fingerprint technology (PSFT) is used to obtain the corresponding protein folding conformations directly by high-throughput screening of the 5AAPFSC database. Each folding conformation is represented by a protein folding shape code (PFSC) letter, and each letter denotes the characteristics of its own particular folding structure. These folding structures cover both secondary and tertiary structure. All possible folding shape codes are aligned into an array, which yields a PFSC protein spatial conformation band as the prediction result. Tests on a large number of proteins with known three-dimensional structures have verified the reliability and validity of the method.

Embodiment one: a protein with a known three-dimensional structure is taken as an example and compared with the prediction result.

Human cytosolic 5'-nucleotidase II is a protein with a known three-dimensional structure, determined by X-ray crystallography. The upper half of Fig. 3 lists the spatial conformations of 8 structures of human cytosolic 5'-nucleotidase II measured by X-ray crystallography. Each of these three-dimensional structures can be obtained from the Protein Data Bank. Each conformation is then expressed in folding codes and aligned into an array. Each spatial conformation represents one experimentally measured structural state. The lower half of Fig. 3 lists the 9 most probable spatial conformations predicted by the method of the invention. They are obtained by the steps described above. The first spatial conformation is the string of folding conformation codes with the highest frequency of occurrence. The second is the string of codes with the second-highest frequency, supplemented by the highest-frequency codes. The third is the string of codes with the third-highest frequency, supplemented by the highest-frequency codes. Continuing in this way yields 9 conformations of higher probability. As the table shows, when the known conformation of the 60-residue sequence (residues 3-62) of the 2XCW protein fragment is taken as the reference and compared with the first row of the prediction, 45 folding conformations are identical, 5 are similar, and 10 are different. Considering only the first row of the prediction, the accuracy is about 80%.
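The position-by-position comparison behind these counts can be sketched in Python. The text does not define which pairs of PFSC codes count as "similar" or state how the figure of about 80% is derived, so the `similar_pairs` argument and the function name below are assumptions. For reference, the reported counts give 45/60 = 75% identical positions and (45 + 5)/60, or about 83%, identical or similar positions.

```python
def compare_codes(predicted, reference, similar_pairs=frozenset()):
    """Count identical, similar, and different positions between two
    PFSC strings of equal length.

    similar_pairs: a set of frozensets, each holding two codes that the
    caller treats as similar (the similarity relation is an assumption).
    Returns a tuple (identical, similar, different).
    """
    if len(predicted) != len(reference):
        raise ValueError("strings must have the same length")
    identical = similar = different = 0
    for p, r in zip(predicted, reference):
        if p == r:
            identical += 1
        elif frozenset((p, r)) in similar_pairs:
            similar += 1
        else:
            different += 1
    return identical, similar, different


def percentage(part, total):
    """Share of positions, in percent."""
    return 100.0 * part / total
```

With `compare_codes("AABV", "AAJH", {frozenset("BJ")})` the result is `(2, 1, 1)`: two identical positions, one similar pair (B and J), and one difference.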

On the other hand, molecular biologists recognize that structural data measured by X-ray crystallography capture only particular static states of a protein and cannot reflect all of its possible dynamic conformations. The upper half of the table lists 8 known spatial conformations of the protein, and these conformations show the variability of its structure. Compared against these variations, the predicted band of the full-information prediction covers the varied folding patterns completely. The data in the table strongly suggest that the full-information prediction technique we have developed not only provides complete folding conformations for protein structure prediction, but is also of great significance for a comprehensive understanding of protein structures obtained by experimental measurement.
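The coverage claim can be checked mechanically: a known conformation is covered when, at every position, its code appears somewhere in the corresponding column of the band. The following Python sketch uses hypothetical function names and toy data, not the values of Fig. 3.

```python
def uncovered_positions(band, known):
    """Return the indices at which the known PFSC string uses a code that
    is absent from the corresponding column of the band."""
    if len(band) != len(known):
        raise ValueError("band and known conformation must have the same length")
    return [i for i, (code, column) in enumerate(zip(known, band)) if code not in column]


def band_covers(band, known):
    """True when every position of the known conformation is in the band."""
    return not uncovered_positions(band, known)


def band_covers_all(band, known_conformations):
    """True when the band covers each of several known conformations."""
    return all(band_covers(band, known) for known in known_conformations)
```

Reporting the uncovered positions, rather than only a yes or no answer, shows where an experimental structure departs from the predicted band.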

Embodiment two: a protein with an unknown three-dimensional structure is taken as an example; its three-dimensional conformation can be obtained by full-information protein prediction. Fig. 4 shows the predicted spatial conformations of a calcitonin polypeptide of 32 amino acids from the marine silver shark (Callorhinchus milii). These spatial conformations are obtained by the steps described above. The first spatial conformation is the string of folding conformation codes with the highest frequency of occurrence. The second is the string of codes with the second-highest frequency, supplemented by the highest-frequency codes. The third is the string of codes with the third-highest frequency, supplemented by the highest-frequency codes. Continuing in this way yields 13 conformations of higher probability. The predicted protein spatial conformation band of silver shark calcitonin consists of 13 strings of protein folding shape codes (PFSC). The band is a complete prediction of the spatial conformation of silver shark calcitonin and shows the possible variations of its local conformations.

The full-information prediction (CPPC) method for protein conformation of the invention has the following four major features and breakthroughs.

1. The full-information prediction (CPPC) of protein conformation is based on rigorous mathematical derivation combined with the structural features of proteins. First, the 27 PFSC protein folding shape codes represent a complete, closed space in the essential sense, which ensures that the prediction results have no gaps or omissions. On the basis of 5 amino acids, a correlation is established between the 20 amino acids and the 27 PFSC protein folding shape codes. Drawing on the worldwide protein database and closely tied to the structural features of proteins, the 5AAPFSC database was created to enumerate all mathematically possible arrangements of any 5 of the 20 amino acids. Compared with traditional methods for protein tertiary structure, the new CPPC method, built on the correlation between these arrangements and the PFSC codes, has a solid mathematical foundation.

2. The full-information prediction (CPPC) of protein conformation provides a way to predict protein structure quickly. Suppose that, at the current level of computer technology, one conformation is calculated every 10^-13 seconds. For a protein sequence of 100 amino acids, allowing each amino acid 10 spatial positions produces 10^100 spatial conformations in total. Evaluating all of these conformations would take 10^77 years. For a protein sequence of the same size, the CPPC technique needs only about 30 seconds. Even for a protein sequence thousands of residues long, CPPC completes the structure prediction in only about 120 seconds.
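The brute-force estimate can be reproduced with a short calculation. This Python sketch takes the three figures stated in the text (100 residues, 10 positions per residue, 10^-13 seconds per conformation) and assumes a year of 365.25 days; the function name is hypothetical. Under these assumptions the arithmetic gives on the order of 10^79 years, which is larger than the 10^77 years quoted above. The text does not state how its figure was derived, and the conclusion that exhaustive enumeration is impractical is unaffected.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year, in seconds


def brute_force_years(n_residues=100, positions_per_residue=10,
                      seconds_per_conformation=1e-13):
    """Years needed to evaluate every conformation one at a time."""
    conformations = positions_per_residue ** n_residues   # 10^100 by default
    total_seconds = conformations * seconds_per_conformation
    return total_seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{brute_force_years():.2e} years")
```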

3. The full-information prediction (CPPC) of protein conformation uses PFSC protein folding shape codes to show all possible local folding changes along the protein sequence. Combined, these local folding changes give an exponentially large total number of spatial conformations. This mass of information is fully exposed in the full-information predicted conformation band.

4. The full-information prediction (CPPC) of protein conformation can predict the probable protein conformations. From the enormous number of spatial conformations, the probable ones are predicted according to the frequency of occurrence of the local folding conformations.

Bibliography:

[1] PSI Assessment Panel. "Report of the Protein Structure Initiative Assessment Panel". Retrieved December 5, 2008.

[2] Baker, D.; Sali, A. (Oct 2001). "Protein structure prediction and structural genomics". Science 294(5540): 93–6.

[3] Yonath, Ada. X-ray crystallography at the heart of life science. Current Opinion in Structural Biology, Volume 21, Issue 5, October 2011, Pages 622–626.

[4] Rigden, Daniel J. From Protein Structure to Function with Bioinformatics. Springer Science, 2009. ISBN 978-1-4020-9057-8.

[5] Moult J. et al. A large-scale experiment to assess protein structure prediction methods. Proteins 23, 1995.

[6] http://predictioncenter.org

[7] Editorial: So much more to know. Science 2005, 309: 78-102.

[8] Zhang Y (2008). "Progress and challenges in protein structure prediction". Curr Opin Struct Biol 18(3): 342–8.

[9] Yi He, S. Rackovsky, Yanping Yin, and Harold A. Scheraga. Alternative approach to protein structure prediction based on sequential similarity of physical properties. PNAS, 2015, 112(16): 5029-5032.

[10] Ashtawy, H.M.; Mahapatra, N.R. "A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction". IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 12, no. 2, pp. 335-347, 2015.

[11] Bowie JU, Lüthy R, Eisenberg D (1991). "A method to identify protein sequences that fold into a known three-dimensional structure". Science 253(5016): 164–170.

[12] JT. Huang, T Wang, SR. Huang and X Li. Reduced alphabet for protein folding prediction. Proteins, 2015, 83-4, 631–63

[13] Bowie JU, Lüthy R, Eisenberg D (1991). "A method to identify protein sequences that fold into a known three-dimensional structure". Science 253(5016): 164–170.

[14] Jones DT, Taylor WR, Thornton JM (1992). "A new approach to protein fold recognition". Nature 358(6381): 86–89.

[15] Peng, Jian; Jinbo Xu (2011). "RaptorX: exploiting structure information for protein alignment by statistical inference". Proteins. 79 Suppl 10:

[16] Pierce, Levi C.T.; Salomon-Ferrer, Romelia; Augusto F. de Oliveira, Cesar; McCammon, J. Andrew; Walker, Ross C. (2012). "Routine Access to Millisecond Time Scale Events with Accelerated Molecular Dynamics". Journal of Chemical Theory and Computation 8(9): 2997–3002.

[17] Nugent, T.; Jones, D.T. (2012). "Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis". Proc Natl Acad Sci U S A 109(24): E1540–7.

[18] Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, DS.; Sander, C.; Zecchina, R.; Onuchic, JN. et al. (Dec 2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families". Proc Natl Acad Sci U S A 108(49): E1293–301.

[19] Yang J. Comprehensive description of protein structures using protein folding shape code. Proteins 2008; 71(3): 1497-1518.

Claims

1. A full-information prediction method for the spatial conformation of protein structures, characterized by comprising the following steps:

1) 5 amino acids are drawn arbitrarily from the full set of 20 amino acids, forming 3,200,000 different arrangements in total; the potential folding conformations of each arrangement are obtained from the worldwide Protein Data Bank and then expressed as protein folding shape codes; a database is created to collect these arrangements and their corresponding protein folding shape codes, and the database is named 5AAPFSC;

2) for any protein whose structure is to be predicted, the sequence is traversed step by step from the N-terminus to the C-terminus, reading each run of 5 consecutive amino acids in turn; the potential folding conformations of each run are obtained directly from the 5AAPFSC database and expressed as protein folding shape code characters; the code of the folding conformation that occurs most frequently in the Protein Data Bank is placed first, the code with the second-highest frequency is placed second, and so on, forming a column from top to bottom until all codes are collected; different runs of 5 consecutive amino acids have different numbers of potential folding conformations;

3) all potential folding shape codes of the protein under test form an array called the protein folding conformation band, which represents all potential folding conformations along the protein sequence; for each protein sequence, all potential conformations can be obtained exactly by substituting its potential local folding conformations for one another; the total number of potential conformations is the product of the numbers of potential folding conformations over all runs of 5 amino acids.

2. The full-information prediction method according to claim 1, characterized in that one predicted spatial conformation consists of the folding shape codes with the highest frequency of occurrence.

3. The full-information prediction method according to claim 1, characterized in that one predicted spatial conformation consists of the folding shape codes with the second-highest frequency of occurrence, with the highest-frequency folding shape code used as a supplement at positions that have no second-ranked conformation.

4. The full-information prediction method according to claim 1, characterized in that one predicted spatial conformation consists of the folding shape codes with the third-highest frequency of occurrence, with the highest-frequency folding shape code used as a supplement at positions that have no third-ranked conformation.