CN105260626A

CN105260626A - Complete prediction method for protein structure spatial conformation

Info

Publication number: CN105260626A
Application number: CN201510623583.8A
Authority: CN
Inventors: 杨家安
Original assignee: MAIKERO MEDICINE TECHNOLOGY (WUHAN)CO Ltd
Current assignee: McCullough Biotechnology (Shanghai) Co.,Ltd.
Priority date: 2015-09-25
Filing date: 2015-09-25
Publication date: 2016-01-20
Anticipated expiration: 2035-09-25
Also published as: CN105260626B

Abstract

The invention relates to a complete prediction method for a protein structure spatial conformation, and belongs to the field of bioinformatics. For any protein sequence, a corresponding protein folding conformation can be obtained by adopting a protein structure fingerprint technique directly through high-throughput screening of a 5AAPFSC database. Each folding conformation is represented by protein folding shape code (PFSC) letters, and folding structures cover a secondary structure and a tertiary structure. All possible folding shape codes can be aligned to form an array, thereby generating a PFSC protein spatial conformation spectral band as a prediction result. Through tests of a large amount of known three-dimensional structured proteins, the reliability and effectiveness of the method are well verified.

Description

Complete prediction method for protein structure spatial conformation

Technical field

The present invention relates to a complete prediction method for the spatial conformation of protein structures and belongs to the field of bioinformatics.

Background

Protein structure is essential information for genomics, bioinformatics, drug development and biotechnology research [1,2]. To date, however, the three-dimensional structures of fewer than about 1% of proteins have been determined by experimental techniques such as X-ray crystallography or nuclear magnetic resonance [3]. More than 52,000,000 protein sequences still have no three-dimensional structural information or data [4], and biomedical research urgently needs the spatial structures of these proteins. Over the years, many computer-modeling methods and applications for protein structure prediction have been developed. Since 1994, the biennial Critical Assessment of protein Structure Prediction (CASP) has served as an exchange platform for protein scientists worldwide [5,6]. Because of the complexity of protein structure and the exponential number of possible folds, the prediction of protein structure has been listed among the 100 major open challenges of 21st-century science [7].

Existing methods for predicting protein structure fall broadly into three classes. The first class is sequence-based modeling [8,9,10], which uses known protein structures to solve unknown ones. It depends on information extracted by comparing sequence similarity, so the reliability of its predictions has always been in question. The second class is assembly modeling based on fold recognition [11,12,13,14,15], which uses statistics to screen a given protein database for relationships between folded fragments and sequences. Statistical methods can cover most folds, but folds that occur at low frequency are often ignored. The third class is ab initio modeling [16,17,18], which iteratively computes the interactions between the amino acids and atoms of a protein until the whole conformational system settles into a low-energy state. It consumes large amounts of computer time and resources, and it yields only one possible spatial structure for the protein in question. Biologists have long hoped that prediction methods would deliver reliable and undisputed protein structures. Many studies have tried to improve prediction toward that goal, but progress has been unsatisfactory. The root cause is the complexity and variability of protein structure itself.

Summary of the invention

The technical problem addressed by the present invention is to provide a Complete Prediction for Protein Conformation (CPPC) method. The method uses a mathematical model to reduce the complexity of protein structure and uses complete structural data to capture its variability. It predicts protein structure quickly and provides all possible spatial conformations of the protein.

The complete prediction method of the present invention builds on the protein folding shape code (PFSC), a method previously developed by the inventor and disclosed in patent ZL200880003164.2 [19]. The PFSC, obtained by rigorous derivation, fully describes the folding shape of any five consecutive amino acid residues. Any folding shape of a five-residue fragment in a protein can be described by one of 27 PFSC vectors, which are denoted by the 26 letters of the English alphabet plus the "$" symbol. More importantly, the 27 PFSC vectors together cover a complete mathematical space, and their folding shapes are closely related: each PFSC vector can be converted into another through a gradual transition.

Mathematically, five amino acids form different arrangements depending on their order. Drawing five residues, with repetition allowed, from the 20 amino acids gives 20^5 = 3,200,000 different arrangements. The possible folding conformations of each arrangement are obtained from the worldwide Protein Data Bank (PDB) and represented by protein folding shape codes (PFSC). On this basis, we created a database that collects the folding conformations of these 3,200,000 arrangements. This new database is named 5AAPFSC. In it, the folding shapes associated with each arrangement are stored in full as the corresponding PFSC codes.
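The count of arrangements can be checked with a short sketch. This is an illustration only: the one-letter amino acid alphabet and the names `AMINO_ACIDS`, `n_arrangements` and `first_keys` are assumptions made for the example and do not come from the patent.

```python
from itertools import product

# The 20 standard amino acids as one-letter codes (assumed alphabet).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Ordered 5-residue arrangements with repetition allowed: 20 ** 5.
n_arrangements = len(AMINO_ACIDS) ** 5
print(n_arrangements)  # 3200000

# Lazily enumerate the first few keys a 5AAPFSC-like table would hold.
first_keys = ["".join(p) for _, p in zip(range(3), product(AMINO_ACIDS, repeat=5))]
print(first_keys)  # ['AAAAA', 'AAAAC', 'AAAAD']
```

Enumerating lazily avoids materializing all 3,200,000 keys when only the count or a sample is needed.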

The complete prediction method for protein structure spatial conformation of the present invention comprises the following steps:

1) five amino acids are drawn from the 20 amino acids, giving a total of 3,200,000 different arrangements; the possible folding conformations of each arrangement are obtained from the worldwide Protein Data Bank (PDB) and represented by protein folding shape codes (PFSC); a database is created to collect these arrangements and their corresponding protein folding shape codes, and this database is named 5AAPFSC, as shown in Fig. 1;

2) for any protein whose structure is to be predicted, the sequence is traversed stepwise from the N-terminus to the C-terminus, reading each set of 5 consecutive amino acids in turn; the folding conformations each set may adopt are obtained directly from the 5AAPFSC database and represented by protein folding shape code (PFSC) characters; the code character with the highest frequency of occurrence in the Protein Data Bank is placed first, the character with the second-highest frequency is placed second, and so on from top to bottom to form a column, until all have been collected; different sets of 5 consecutive amino acids have different numbers of possible folding conformations;

3) all possible folding shape codes of the protein under study form an array, called the protein folding conformation spectral band, as shown in Fig. 2, which represents all possible folding conformations along the protein sequence; for each protein sequence, every possible conformation can be obtained by substituting its possible local folding conformations for one another; the total number of possible conformations is the product of the numbers of possible folding conformations of all sets of 5 consecutive amino acids.
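Steps 2) and 3) can be sketched as a sliding-window lookup followed by a product. This is a minimal sketch under stated assumptions: `TOY_5AAPFSC` is a hypothetical three-entry stand-in for the real database, its PFSC letters are invented for illustration, and the function names are not from the patent.

```python
from math import prod

# Hypothetical toy stand-in for 5AAPFSC: pentapeptide -> PFSC letters,
# most frequent first. The entries are invented for illustration only.
TOY_5AAPFSC = {
    "MKTAY": ["A", "B"],
    "KTAYI": ["A"],
    "TAYIA": ["H", "A", "J"],
}

def conformation_band(sequence, table):
    """Slide a 5-residue window from the N- to the C-terminus and collect,
    for each window, its ranked list of possible PFSC letters."""
    windows = [sequence[i:i + 5] for i in range(len(sequence) - 4)]
    # Windows absent from the table get an empty column in this sketch.
    return [table.get(w, []) for w in windows]

def total_conformations(band):
    """Product of the number of possible folding shapes per window."""
    return prod(len(column) for column in band)

band = conformation_band("MKTAYIA", TOY_5AAPFSC)
print(band)                       # [['A', 'B'], ['A'], ['H', 'A', 'J']]
print(total_conformations(band))  # 6
```

Each inner list is one column of the spectral band. Because the total is a product over columns, it grows exponentially with sequence length, and a window missing from the table would make the product zero in this simplified version.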

For any protein under study, the number of possible spatial conformations is huge, but the conformations of high likelihood are obtained from the local folding conformations that occur most frequently. For example, the first spatial conformation is composed of the folding shape codes with the highest frequency of occurrence. The second spatial conformation is composed of the codes with the second-highest frequency, with the highest-frequency code filled in at positions that have no second-ranked conformation. The third spatial conformation is composed of the codes with the third-highest frequency, with the highest-frequency code filled in at positions that have no third-ranked conformation. Continuing in this way gives a series of predicted conformations of higher likelihood.
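The ranking rule above can be written as a short function. This is a sketch, not the inventor's implementation: the band values are hypothetical, ranks are counted from zero, and every column is assumed to hold at least one code.

```python
def ranked_conformation(band, rank):
    """Build the conformation for a given rank (rank 0 = most frequent).
    Where a window has no code at that rank, fall back to its top code."""
    return "".join(
        column[rank] if rank < len(column) else column[0]
        for column in band
    )

# Hypothetical band: one column per 5-residue window, most frequent first.
band = [["A", "B"], ["A"], ["H", "A", "J"]]

for rank in range(3):
    print(rank + 1, ranked_conformation(band, rank))
# 1 AAH
# 2 BAA
# 3 AAJ
```

The fallback to `column[0]` is what "filled in with the highest-frequency code" means in practice: the second row keeps the top-ranked letter in the middle window, which has only one candidate.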

Therefore, a string of protein folding shape codes composed of high-frequency conformations represents a protein spatial conformation of higher likelihood. Using the protein folding conformation spectral band, further local variations and substitutions can be identified and applied to generate additional possible spatial conformations.

The protein spatial conformation spectral band obtained by this method gives a complete prediction of the folded spatial conformation of a protein structure and at the same time reveals the subtle variations possible in any local conformation. One significance of the CPPC method is that it creates the necessary conditions for building a brand-new protein structural genomics database in the future. CPPC is a new method for predicting protein structure and will advance protein structural genomics. The CPPC technique we have developed not only provides complete folding conformations as a prediction of protein structure, but is also of great value for fully understanding protein structures obtained by measurement.

Brief description of the drawings

Fig. 1: structure of the 5AAPFSC database.

Fig. 2: construction of the protein folding conformation spectral band.

Fig. 3: comparison of the known conformations of the 2XCW protein fragment (residues 3-62) of human cytosolic 5'-nucleotidase II with the complete prediction results. The first row of the table is the amino acid sequence segment (residues 3-62) of the protein. It is followed by the folding conformations of 8 known structures, expressed as protein folding shape codes (PFSC). The lower part of the table shows 9 predicted possible spatial conformations.

Fig. 4: predicted spatial conformations of the 32-amino-acid calcitonin of the marine elephant shark (Callorhinchus milii).

Detailed description

For any protein sequence, the protein structure fingerprint technique (PSFT) obtains the corresponding protein folding conformations directly by high-throughput screening of the 5AAPFSC database. Each folding conformation is represented by protein folding shape code (PFSC) letters, each letter denoting a specific folding shape; these folding shapes cover both secondary and tertiary structure. All possible folding shape codes are aligned into an array, generating the PFSC protein spatial conformation spectral band as the prediction result. Tests on a large number of proteins of known three-dimensional structure have verified the reliability and effectiveness of the method.

Embodiment 1 takes a protein of known three-dimensional structure as an example and compares it with the prediction results.

Human cytosolic 5'-nucleotidase II is a protein of known three-dimensional structure, determined by X-ray crystallography. The upper half of Fig. 3 lists the spatial conformations of 8 structures of human cytosolic 5'-nucleotidase II measured by X-ray crystallography. Each three-dimensional structure is available from the Protein Data Bank. Each conformation is expressed as folding shape codes, and the codes are aligned into an array; each spatial conformation represents one experimentally measured structural state. The lower half of Fig. 3 lists the 9 most probable spatial conformations predicted by the method of the invention, obtained by the steps described above. The first spatial conformation is the string of folding shape codes with the highest frequency of occurrence. The second is the string of codes with the second-highest frequency, with the highest-frequency code filled in where no second-ranked code exists. The third is the string of codes with the third-highest frequency, with the highest-frequency code filled in where no third-ranked code exists. Continuing in this way gives 9 conformations of higher likelihood. As the table shows, when the known conformation of the 60-residue 2XCW fragment (residues 3-62) is taken as the reference and compared with the first row of the prediction, 45 folding shape codes are identical, 5 are similar and 10 differ. Considering the first predicted row alone, the accuracy is about 80%.
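The text does not state how the figure of about 80% is derived from the counts. As a hedged illustration, the two most natural readings, counting identical codes only or counting identical plus similar codes, bracket that figure; the variable names below are assumptions made for the example.

```python
identical, similar, different = 45, 5, 10
total = identical + similar + different  # 60 positions (residues 3-62)

strict = identical / total               # identical codes only
lenient = (identical + similar) / total  # identical or similar codes

print(total)             # 60
print(f"{strict:.1%}")   # 75.0%
print(f"{lenient:.1%}")  # 83.3%
```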

On the other hand, molecular biologists recognize that structural data from X-ray crystallography capture only certain static states of a protein and cannot reflect all of its possible dynamic conformations. The 8 known spatial conformations listed in the upper part of the table in Fig. 3 show the variability of the structure. Compared against these variations, the spectral band from the complete prediction fully covers the observed folding shapes. The data in the table show that the CPPC technique we have developed not only provides complete folding conformations as a prediction of protein structure, but is also of great value for fully understanding protein structures obtained by measurement.

Embodiment 2 takes a protein of unknown three-dimensional structure as an example; its three-dimensional conformations can be obtained by complete prediction. Fig. 4 shows the predicted spatial conformations of the 32-amino-acid calcitonin polypeptide of the marine elephant shark (Callorhinchus milii), obtained by the steps described above. The first spatial conformation is the string of folding shape codes with the highest frequency of occurrence. The second is the string of codes with the second-highest frequency, with the highest-frequency code filled in where no second-ranked code exists. The third is the string of codes with the third-highest frequency, with the highest-frequency code filled in where no third-ranked code exists. Continuing in this way gives 13 conformations of higher likelihood. The predicted protein spatial conformation spectral band of the calcitonin therefore consists of 13 strings of protein folding shape code (PFSC) characters. This spectral band is the complete prediction of the spatial conformation of the calcitonin and shows the possible variations of its local conformations.

The complete prediction (CPPC) method for protein conformation of the present invention has the following four characteristics and advances.

1. CPPC is based on rigorous mathematical derivation combined with the features of protein structure. First, the 27 PFSC protein folding shape codes represent a complete, closed space, which ensures that the prediction results have no gaps or omissions. Second, on the basis of five-residue fragments, a correlation is established between the 20 amino acids and the 27 PFSC codes. Drawing on the worldwide protein database and the features of protein structure, the 5AAPFSC database was created, enumerating every possible mathematical arrangement of any 5 of the 20 amino acids. Compared with traditional methods for predicting protein tertiary structure, CPPC, built on the correlation between these arrangements and the PFSC codes, has a solid mathematical basis.

2. CPPC provides a fast route to protein structure prediction. Suppose that, at the current level of computer technology, one conformation can be calculated every 10^-13 seconds. For a protein sequence of 100 amino acids, if each amino acid is allowed 10 spatial positions, a total of 10^100 spatial conformations results, and computing them all would take about 10^77 years. For a sequence of the same size, CPPC needs only about 30 seconds. Even for protein sequences thousands of residues long, CPPC needs only about 120 seconds.

3. CPPC uses the PFSC protein folding shape codes to show all possible local folding variations along the protein sequence. Combined, these local variations form an exponentially large number of spatial conformations. All of this information is laid out in the predicted conformation spectral band.

4. CPPC can predict the likely protein conformations. From the enormous number of spatial conformations, the likely ones are predicted according to the frequency of occurrence of the local folding conformations.
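The exhaustive-enumeration estimate in the second characteristic can be reproduced as a back-of-the-envelope calculation. The sketch below assumes a 365.25-day year; the rate of one conformation per 10^-13 seconds and the count of 10^100 conformations are the figures given in the text. With these inputs the arithmetic comes out near 3 × 10^79 years, somewhat larger than the rounded figure quoted above; either way, enumeration is far out of reach next to the seconds-scale lookup described for CPPC.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # about 3.16e7 s (assumed year length)

conformations = 10 ** 100  # 10 positions for each of 100 residues
seconds_each = 1e-13       # assumed time to compute one conformation

years = conformations * seconds_each / SECONDS_PER_YEAR
print(f"{years:.1e}")  # 3.2e+79
```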

References:

1. PSI Assessment Panel. "Report of the Protein Structure Initiative Assessment Panel". Retrieved December 5, 2008.

2. Baker, D.; Sali, A. (Oct 2001). "Protein structure prediction and structural genomics". Science 294 (5540): 93–6.

3. Yonath, Ada. X-ray crystallography at the heart of life science. Current Opinion in Structural Biology. Volume 21, Issue 5, October 2011, Pages 622–626.

4. Rigden, Daniel J. From Protein Structure to Function with Bioinformatics. Springer Science. 2009. ISBN 978-1-4020-9057-8.

5. Moult J. et al. A large-scale experiment to assess protein structure prediction methods, 1995; Proteins 23.

6. http://predictioncenter.org

7. Editorial: So much more to know. Science 2005, 309: 78-102.

8. Zhang Y (2008). "Progress and challenges in protein structure prediction". Curr Opin Struct Biol 18 (3): 342–8.

9. Yi He, S. Rackovsky, Yanping Yin, and Harold A. Scheraga, Alternative approach to protein structure prediction based on sequential similarity of physical properties, PNAS, 2015, 112 (16): 5029-5032.

10. Ashtawy, H. M.; Mahapatra, N. R., "A Comparative Assessment of Predictive Accuracies of Conventional and Machine Learning Scoring Functions for Protein-Ligand Binding Affinity Prediction," Computational Biology and Bioinformatics, IEEE/ACM Transactions on, vol. 12, no. 2, pp. 335, 347, 2015.

11. Bowie JU, Lüthy R, Eisenberg D (1991). "A method to identify protein sequences that fold into a known three-dimensional structure". Science 253 (5016): 164–170.

12. JT. Huang, T Wang, SR. Huang and X Li, Reduced alphabet for protein folding prediction, Proteins, 2015, 83-4, 631–63

13. Bowie JU, Lüthy R, Eisenberg D (1991). "A method to identify protein sequences that fold into a known three-dimensional structure". Science 253 (5016): 164–170.

14. Jones DT, Taylor WR, Thornton JM (1992). "A new approach to protein fold recognition". Nature 358 (6381): 86–89.

15. Peng, Jian; Jinbo Xu (2011). "RaptorX: exploiting structure information for protein alignment by statistical inference". Proteins. 79 Suppl 10:

16. Pierce, Levi C. T.; Salomon-Ferrer, Romelia; Augusto F. de Oliveira, Cesar; McCammon, J. Andrew; Walker, Ross C. (2012). "Routine Access to Millisecond Time Scale Events with Accelerated Molecular Dynamics". Journal of Chemical Theory and Computation 8 (9): 2997–3002.

17. Nugent, T.; Jones, D. T. (2012). "Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis". Proc Natl Acad Sci USA 109 (24): E1540–7.

18. Morcos, F.; Pagnani, A.; Lunt, B.; Bertolino, A.; Marks, DS.; Sander, C.; Zecchina, R.; Onuchic, JN. et al. (Dec 2011). "Direct-coupling analysis of residue coevolution captures native contacts across many protein families". Proc Natl Acad Sci USA 108 (49): E1293–301.

19. Yang J. Comprehensive description of protein structures using protein folding shape code. Proteins 2008; 71.3: 1497-1518.

Claims

1. A complete prediction method for protein structure spatial conformation, characterized in that it comprises the following steps:

1) five amino acids are drawn from the 20 amino acids, giving a total of 3,200,000 different arrangements; the possible folding conformations of each arrangement are obtained from the worldwide Protein Data Bank and represented by protein folding shape codes; a database is created to collect these arrangements and their corresponding protein folding shape codes, and this database is named 5AAPFSC;

2) for any protein whose structure is to be predicted, the sequence is traversed stepwise from the N-terminus to the C-terminus, reading each set of 5 consecutive amino acids in turn; the folding conformations each set may adopt are obtained directly from the 5AAPFSC database and represented by protein folding shape code characters; the code character with the highest frequency of occurrence in the Protein Data Bank is placed first, the character with the second-highest frequency is placed second, and so on from top to bottom to form a column, until all have been collected; different sets of 5 consecutive amino acids have different numbers of possible folding conformations;

3) all possible folding shape codes of the protein under study form an array, called the protein folding conformation spectral band, which represents all possible folding conformations along the protein sequence; for each protein sequence, every possible conformation can be obtained by substituting its possible local folding conformations for one another; the total number of possible conformations is the product of the numbers of possible folding conformations of all sets of 5 consecutive amino acids.

2. The complete prediction method according to claim 1, characterized in that one predicted spatial conformation is composed of the folding shape codes with the highest frequency of occurrence.

3. The complete prediction method according to claim 1, characterized in that one predicted spatial conformation is composed of the folding shape codes with the second-highest frequency of occurrence, with the highest-frequency folding shape code filled in at positions that have no second-ranked conformation.

4. The complete prediction method according to claim 1, characterized in that one predicted spatial conformation is composed of the folding shape codes with the third-highest frequency of occurrence, with the highest-frequency folding shape code filled in at positions that have no third-ranked conformation.