CN105653892A

CN105653892A - Distance spectrum intelligence based normal distribution distance receiving probability model construction method

Info

Publication number: CN105653892A
Application number: CN201511008767.XA
Authority: CN
Inventors: 张贵军; 俞旭锋; 周晓根; 郝小虎; 王柳静; 徐东伟
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2015-12-29
Filing date: 2015-12-29
Publication date: 2016-06-08

Abstract

The present invention provides a method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge. The method comprises: first, downloading high-resolution protein files of known structure from a protein database, and removing sequences whose homology exceeds a preset threshold by comparing sequence similarity, to form a non-redundant template library; next, comparing the protein structures in the template library with a query sequence by means of a sliding window, and selecting the M highest-scoring fragments at each position of the query sequence to form a fragment library file; then, for two positions of the query sequence, selecting from the fragment library the distances between fragments that come from the same template structure, to form a distance spectrum; and finally, according to the distance distribution of the inter-residue distance spectrum, extracting a predicted distance and a variance to construct a normal-distribution probability density function, comparing the structural similarity of a decoy conformation, and accepting the conformation according to the normal-distribution distance acceptance probability. The method has a stronger conformational-space sampling ability and higher update precision.

Description

A method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge

Technical field

The present invention relates to the fields of bioinformatics and computer applications, and in particular to a method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge.

Background art

Protein molecules play a vital role in the chemical reactions of living cells. Their structural models and biologically active states are important for understanding and curing many diseases. A protein can perform its specific biological function only after folding into a specific three-dimensional structure. Therefore, to understand the function of a protein, its three-dimensional structure must be obtained.

Protein tertiary structure prediction is an important task in bioinformatics. The greatest challenge currently facing protein conformation optimization is searching the extremely complex surface of the protein energy function. Protein energy models account for the bonded interactions of the molecular system as well as non-bonded interactions such as van der Waals forces, electrostatics, hydrogen bonding and hydrophobic effects, which makes the resulting energy surface extremely rugged, and the number of local minima corresponding to conformations grows exponentially with sequence length. The mechanism by which a conformation prediction algorithm can find the stable structure of a protein is that a large number of metastable protein structures make up the low-energy region; the key to finding the globally most stable structure is therefore for the algorithm to find a large number of metastable structures, that is, to increase the population diversity of the algorithm. Consequently, for more accurate protein force field models, choosing an effective conformational-space optimization algorithm, so that new structure prediction algorithms are more general and more efficient, has become a focal issue of protein structure prediction in bioinformatics.

At present, protein structure prediction methods can be broadly divided into two classes: template-based methods and template-free methods. Among them, the template-free ab initio prediction method is the most widely used. It is applicable to most proteins with homology below 25%, generates a new structure from the sequence alone, and is significant for research on protein molecular design, protein folding and the like. Several relatively successful ab initio prediction methods currently exist: the TASSER (Threading/Assembly/Refinement) method developed jointly by Yang Zhang and Jeffrey Skolnick, the Rosetta method designed by David Baker and his team, the FeLTr method designed by Shehu et al., and others. However, no fully satisfactory method for predicting the three-dimensional structure of proteins exists so far; even where good predictions are obtained, they hold only for certain proteins. The main technical bottleneck lies in two aspects: the first is the sampling method, as the prior art has a weak ability to sample conformational space; the second is the conformation update method, as the update precision of the prior art remains insufficient.

Therefore, existing conformational space search methods have shortcomings and need to be improved.

Summary of the invention

In order to overcome the weak spatial sampling ability and low update precision of existing conformational space search methods, the present invention provides a method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge that has a stronger spatial sampling ability and higher update precision.

The technical solution adopted by the present invention to solve the technical problem is as follows:

A method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge, the model construction method comprising the following steps:

1) build a non-redundant template library:

1.1) download from the Protein Data Bank (PDB) website high-precision protein structures whose resolution is below a threshold given in Å, where Å is the unit of distance, 1 Å = 10⁻¹⁰ m;

1.2) split proteins containing multiple polypeptide chains into single chains, retain the longest chain and compare its sequence similarity with the other chains, and remove redundant polypeptide chains whose similarity exceeds a preset threshold;

1.3) compute the pairwise sequence similarity I_mn of the remaining polypeptide chains and, for each chain m, accumulate its cumulative similarity by summing I_mn over n, where m and n are the indices of the polypeptide chains and N is the total number of remaining chains;

1.4) sort the N chains in descending order of cumulative similarity; starting from the chain with the largest cumulative similarity, compare it in turn with the other chains and remove the chains whose sequence similarity exceeds the preset threshold, obtaining the non-redundant protein template library;

2) input the query sequence;

3) generate the fragment library:

3.1) construct the structural similarity function f(i, j), where i is a residue position of the query sequence and j is a fragment structure;

3.1.1) the query sequence is aligned by PSI-BLAST to obtain the sequence frequency profile term P_q(i, k) over a preset number of amino acid types, where i is a residue position of the query sequence, k indexes the preset number of amino acid types, and q denotes the query sequence;

3.1.2) L_q(i, k) and L_t(j, k) are the logarithmic profiles of the query sequence and the template sequence obtained by PSI-BLAST;

3.1.3) the secondary structure ss_t of the template structure is computed by PSSpred;

3.1.4) the secondary structure prediction ss_q of the query sequence is obtained by a neural network program trained on sequence profiles;

3.1.5) the solvent accessibility sa_t of the template protein is computed by EDTSurf;

3.1.6) the solvent accessibility sa_q of the query sequence is predicted by a neural network program;

3.1.7) the dihedral angles of the query sequence are predicted by a two-layer neural network program trained on the sequence profile and the secondary structure;

3.1.8) the centroid-atom dihedral angles of the template structure are obtained by looking up the protein dictionary;

3.1.9) SP_t(j, k) is the frequency matrix of each residue in the template structure over the preset number of residue types;

3.1.10) the structural similarity function f(i, j) is computed from the terms defined in 3.1.1) to 3.1.9),

where w₁, w₂, w₃, w₄, w₅ are the weights;

3.2) by gapless threading, with 3 residues as the unit, the fragment structures in the non-redundant template library are matched against the query sequence and scored according to the structural similarity function f(i, j);

3.3) a sliding window is used when matching the query sequence with the template fragment structures; the similarity score f(i, j) between position i of the query sequence and the j-th fragment is compared, and the M highest-scoring fragments at each position are selected to form the fragment library;

4) obtain the distance spectrum:

4.1) traverse the M highest-similarity fragments at each position of the query sequence, where F_k^i (k = 1, ..., M) are the fragments at the i-th position of the query sequence and F_l^j (l = 1, ..., M) are the fragments at the j-th position;

4.2) let a_ik and a_jl denote fragment structures selected at positions i and j that come from the same template structure;

4.3) compute the distance d_ij between a_ik and a_jl in the original template structure;

4.4) collect the distances between query-sequence residue pairs derived from the template fragments, counting only residue-pair distances below a cutoff, and draw a histogram to obtain the distance spectrum; the abscissa of the histogram is divided into distance bins of fixed width, and whenever a residue-pair distance in a template falls within a bin, the count of that bin is increased by 1; if the line graph shows a peak in some distance bin within the counted range, the distance bin corresponding to that peak is the predicted distance between residue i and residue j of the target sequence;

5) extract the predicted distance D_{profile}^{p} and the variance parameter σ_p from the p-th distance spectrum, and construct the distance acceptance probability for the decoy conformation:

P_{accept}^{decoy} = \frac{1}{N} \sum_{p=1}^{N} \frac{1}{\sigma_{p} \sqrt{2\pi}} \exp\left( -\frac{\left( D_{decoy}^{p(a,b)} - D_{profile}^{p} \right)^{2}}{2 \sigma_{p}^{2}} \right)

where decoy denotes the decoy (induced) conformation, profile denotes the distance spectrum, N is the number of distance spectra predicted for the decoy conformation, p is the distance spectrum index, a and b are the residue position indices recorded in the p-th distance spectrum, D_{decoy}^{p(a,b)} is the spatial distance between residue a and residue b of the decoy conformation, D_{profile}^{p} is the predicted distance of the p-th distance spectrum, and σ_p characterizes the variance of the distribution of the p-th distance spectrum and enters the normal density as its standard deviation.
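The acceptance probability of step 5) can be evaluated directly from its definition. The following Python sketch is only an illustration under stated assumptions: the function name and argument names are hypothetical, and each sigma is treated as the standard deviation of the corresponding normal density, as it appears in the formula.

```python
import math


def distance_acceptance_probability(decoy_distances, predicted_distances, sigmas):
    """Average of N normal densities, one per distance spectrum p.

    decoy_distances[p]     -- distance between residues a and b in the decoy
    predicted_distances[p] -- predicted distance taken from spectrum p
    sigmas[p]              -- spread of spectrum p (used as a standard deviation)
    All names are illustrative; none of them come from the patent text.
    """
    n = len(decoy_distances)
    if n == 0 or n != len(predicted_distances) or n != len(sigmas):
        raise ValueError("inputs must be non-empty and of equal length")

    total = 0.0
    for d_decoy, d_profile, sigma in zip(decoy_distances, predicted_distances, sigmas):
        if sigma <= 0:
            raise ValueError("sigma must be positive")
        z = (d_decoy - d_profile) / sigma
        # Normal probability density evaluated at the decoy distance.
        total += math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))
    return total / n
```

A decoy whose residue-pair distances lie close to the predicted distances receives a larger value than one whose distances deviate from them, which is the property the acceptance step relies on.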

The technical concept of the present invention is as follows: in the method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge, first, high-resolution protein files of known structure are downloaded from the Protein Data Bank, and sequences with homology greater than 30% are removed by comparing sequence similarity, to form a non-redundant template library; second, the protein structures in the template library are compared with the query sequence through a sliding window, and the 200 highest-scoring fragments at each position of the query sequence are selected to form a fragment library file; then, for two positions of the query sequence, the distances between fragments in the fragment library that come from the same template structure are collected to form a distance spectrum; finally, according to the distance distribution of the inter-residue distance spectrum, the predicted distance and variance are extracted to construct a normal-distribution probability density function, the structural similarity of the decoy conformation is compared, and the conformation is accepted according to the normal-distribution distance acceptance probability.
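The redundancy-removal part of this concept (steps 1.3 and 1.4) can be sketched as a greedy filter. The Python below is a hedged illustration, not the patented procedure: it assumes the pairwise similarities are already available as a square matrix of fractions, and it assumes that, when two chains exceed the threshold, the chain visited later in the descending order is the one removed. The function name is hypothetical.

```python
def remove_redundant_chains(similarity, threshold=0.30):
    """Return the indices of chains kept after greedy redundancy removal.

    similarity[m][n] is the pairwise sequence similarity I_mn as a fraction.
    The 0.30 default mirrors the 30% threshold stated in the description.
    """
    n = len(similarity)
    # Cumulative similarity of each chain: sum of I_mn over the other chains.
    cumulative = [
        sum(similarity[m][k] for k in range(n) if k != m) for m in range(n)
    ]
    # Visit chains in descending order of cumulative similarity.
    order = sorted(range(n), key=lambda m: cumulative[m], reverse=True)

    removed = set()
    for m in order:
        if m in removed:
            continue
        for k in order:
            # Assumption: the chain compared against the current one is dropped.
            if k != m and k not in removed and similarity[m][k] > threshold:
                removed.add(k)
    return [m for m in range(n) if m not in removed]
```

For three chains where chains 0 and 1 are 90% similar and chain 2 is dissimilar to both, this keeps one of the similar pair together with chain 2.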

The beneficial effects of the present invention are a stronger conformational-space sampling ability and higher update precision.

Brief description of the drawings

Fig. 1 is a schematic diagram of the distance spectrum between the 20th residue and the 29th residue of protein 1ENH.

Fig. 2 is a schematic diagram of the distance acceptance probability for the distance between the 20th residue and the 29th residue of protein 1ENH.

Detailed description of the invention

The present invention is further described below with reference to the accompanying drawings.

Referring to Fig. 1 and Fig. 2, a method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge comprises the following steps:

1) build a non-redundant template library:

1.2) split proteins containing multiple polypeptide chains into single chains, retain the longest chain and compare its sequence similarity with the other chains, and remove redundant polypeptide chains whose similarity exceeds a preset threshold (taken as 30%);

1.4) sort the N chains in descending order of cumulative similarity; starting from the chain with the largest cumulative similarity, compare it in turn with the other chains and remove the chains whose sequence similarity exceeds the preset threshold (taken as 30%), obtaining the non-redundant protein template library;

2) input the query sequence;

3) generate the fragment library:

3.1.1) the query sequence is aligned by PSI-BLAST to obtain the sequence frequency profile term P_q(i, k) over a preset number (taken as 20) of amino acid types, where i is a residue position of the query sequence, k indexes the preset number (taken as 20) of amino acid types, and q denotes the query sequence;

3.1.9) SP_t(j, k) is the frequency matrix of each residue in the template structure over the preset number (taken as 20) of residue types;

3.1.10) the structural similarity function f(i, j) is computed from the terms defined above,

where w₁, w₂, w₃, w₄, w₅ are the weights;

4) obtain the distance spectrum:

4.3) compute the distance d_ij between a_ik and a_jl in the original template structure;

4.4) collect the distances between query-sequence residue pairs derived from the template fragments, counting only residue-pair distances below a cutoff (the interaction force between a residue pair decreases as the distance increases), and draw a histogram to obtain the distance spectrum; the abscissa of the histogram is divided into distance bins of fixed width, and whenever a residue-pair distance in a template falls within a bin, the count of that bin is increased by 1; if the line graph shows a peak in some distance bin within the counted range, the distance bin corresponding to that peak is the predicted distance between residue i and residue j of the target sequence;

P_{accept}^{decoy} = \frac{1}{N} \sum_{p=1}^{N} \frac{1}{\sigma_{p} \sqrt{2\pi}} \exp\left( -\frac{\left( D_{decoy}^{p(a,b)} - D_{profile}^{p} \right)^{2}}{2 \sigma_{p}^{2}} \right)
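The histogram construction of step 4.4) and the extraction of a predicted distance can be sketched as follows. This is a minimal Python illustration under several assumptions: the cutoff and bin width are passed in by the caller because their values are not given here, the predicted distance is taken as the centre of the most populated bin, and the spread is estimated as the population standard deviation of the counted distances, which is only one possible choice. All function names are hypothetical.

```python
import math
import statistics


def distance_spectrum(distances, cutoff, bin_width):
    """Histogram of residue-pair distances below `cutoff`.

    Returns (kept, counts): the distances that were counted and the
    per-bin counts, where bin b covers [b * bin_width, (b + 1) * bin_width).
    """
    kept = [d for d in distances if 0.0 <= d < cutoff]
    n_bins = math.ceil(cutoff / bin_width)
    counts = [0] * n_bins
    for d in kept:
        # Clamp guards against floating-point edge cases at the last bin.
        index = min(int(d / bin_width), n_bins - 1)
        counts[index] += 1
    return kept, counts


def predicted_distance_and_sigma(distances, cutoff, bin_width):
    """Peak-bin centre and spread of the counted distances, or None if empty."""
    kept, counts = distance_spectrum(distances, cutoff, bin_width)
    if not kept:
        return None
    peak_bin = max(range(len(counts)), key=counts.__getitem__)
    predicted = (peak_bin + 0.5) * bin_width
    # Assumption: spread estimated from all counted distances.
    sigma = statistics.pstdev(kept)
    return predicted, sigma
```

If only one distance is counted, the estimated spread is zero, so a caller feeding the result into a normal density would need to impose a minimum spread.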

The present embodiment takes protein 1ENH, with a sequence length of 54, as an example. The method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge comprises the following steps:

1) build a non-redundant template library:

1.1) download from the Protein Data Bank (PDB) website high-precision protein structures whose resolution is below the threshold;

1.2) split proteins containing multiple polypeptide chains into single chains, retain the longest chain and compare its sequence similarity with the other chains, and remove redundant polypeptide chains whose similarity exceeds 30%;

1.3) compute the pairwise sequence similarity I_mn of the remaining polypeptide chains and, for each chain m, accumulate its cumulative similarity by summing I_mn over n, where m and n are the indices of the polypeptide chains and N is the total number of remaining chains, N = 35627;

1.4) sort the N chains in descending order of cumulative similarity; starting from the chain with the largest cumulative similarity, compare it in turn with the other chains and remove the chains whose sequence similarity exceeds 30%, obtaining the non-redundant protein template library;

2) input the query sequence;

3) generate the fragment library:

3.1.1) the query sequence is aligned by PSI-BLAST to obtain the sequence frequency profile term P_q(i, k) over the 20 amino acid types, where i is a residue position of the query sequence, k indexes the 20 amino acid types, and q denotes the query sequence;

3.1.9) SP_t(j, k) is the frequency matrix of each residue in the template structure over the 20 residue types;

3.1.10) the structural similarity function f(i, j) is computed from the terms defined above,

where w₁ = 2, w₂ = 6, w₃ = 2.5, w₄ = 12 and w₅ = 10 are the weights;

3.3) a sliding window is used when matching the query sequence with the template fragment structures; the similarity score f(i, j) between position i of the query sequence and the j-th fragment is compared, and the 200 highest-scoring fragments at each position are selected to form the fragment library;

4) obtain the distance spectrum:

4.1) traverse the 200 highest-similarity fragments at each position of the query sequence, where F_k^i (k = 1, ..., 200) are the fragments at the i-th position of the query sequence and F_l^j (l = 1, ..., 200) are the fragments at the j-th position;

4.3) compute the distance d_ij between a_ik and a_jl in the original template structure;

P_{accept}^{decoy} = \frac{1}{N} \sum_{p=1}^{N} \frac{1}{\sigma_{p} \sqrt{2\pi}} \exp\left( -\frac{\left( D_{decoy}^{p(a,b)} - D_{profile}^{p} \right)^{2}}{2 \sigma_{p}^{2}} \right)

where decoy denotes the decoy (induced) conformation, profile denotes the distance spectrum, N is the number of distance spectra predicted for the decoy conformation, p is the distance spectrum index, a and b are the residue position indices recorded in the p-th distance spectrum, D_{decoy}^{p(a,b)} is the spatial distance between residue a and residue b of the decoy conformation, D_{profile}^{p} is the predicted distance of the p-th distance spectrum, and σ_p characterizes the variance of the distribution of the p-th distance spectrum and enters the normal density as its standard deviation;
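The fragment selection of step 3.3) amounts to scoring every template window against each query position and keeping the best M (200 in this embodiment). The Python sketch below illustrates that selection only; the structural similarity function f(i, j) is not reproduced here, so `score` is a hypothetical caller-supplied stand-in, and the fragment representation as a (template_id, start) pair is likewise an assumption.

```python
import heapq


def sliding_windows(template_lengths, window):
    """Yield (template_id, start) for every window of `window` residues."""
    for template_id, length in template_lengths.items():
        for start in range(length - window + 1):
            yield (template_id, start)


def build_fragment_library(query_length, fragments, score, m):
    """Keep the m highest-scoring fragments for each query position.

    score(i, fragment) stands in for the similarity function f(i, j);
    it is supplied by the caller and is not defined by this sketch.
    """
    fragments = list(fragments)
    library = {}
    for i in range(query_length):
        # Bind i as a default argument so each position uses its own index.
        library[i] = heapq.nlargest(
            m, fragments, key=lambda frag, i=i: score(i, frag)
        )
    return library
```

Using `heapq.nlargest` avoids sorting the full fragment list at every position, which matters when the template library contains many windows and M is comparatively small.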

Taking protein 1ENH, with a sequence length of 54, as the example, the above method is used to obtain the distance spectra of this protein and the inter-residue distance acceptance probability; the residue-pair distance spectrum is shown in Fig. 1, and the normal-distribution distance acceptance probability of 1ENH based on distance spectrum knowledge is shown in Fig. 2.
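Steps 4.1) to 4.3) of the method pair up fragments at two query positions that originate from the same template and measure their separation in that template. A minimal Python sketch follows, under the assumptions that each fragment is recorded as a (template_id, residue_index) pair identifying the template residue aligned to the query position, and that each template provides one 3D coordinate per residue; these data layouts and the function name are hypothetical.

```python
import math


def same_template_distances(fragments_i, fragments_j, template_coords):
    """Distances between fragments at positions i and j sharing a template.

    fragments_i, fragments_j -- lists of (template_id, residue_index)
    template_coords          -- {template_id: [(x, y, z), ...]} per residue
    """
    # Group the fragments at position j by the template they come from.
    by_template = {}
    for template_id, residue_index in fragments_j:
        by_template.setdefault(template_id, []).append(residue_index)

    distances = []
    for template_id, res_i in fragments_i:
        coords = template_coords.get(template_id)
        if coords is None:
            continue
        for res_j in by_template.get(template_id, []):
            # Euclidean distance measured in the original template structure.
            distances.append(math.dist(coords[res_i], coords[res_j]))
    return distances
```

The returned list is the raw material for one distance spectrum: the distances collected for a given pair of query positions are what the histogram step bins.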

What is described above is the excellent result shown by one embodiment of the present invention. Obviously, the present invention is not limited to the above embodiment; many variations may be made and implemented without departing from the basic spirit of the invention or going beyond its substantive content.

Claims

1. A method for constructing a normal-distribution distance acceptance probability model based on distance spectrum knowledge, characterized in that the model construction method comprises the following steps:

1) build a non-redundant template library:

1.1) download from the Protein Data Bank website high-precision protein structures whose resolution is below a threshold given in Å, where Å is the unit of distance, 1 Å = 10⁻¹⁰ m;

2) input the query sequence;

3) generate the fragment library:

3.1.10) the structural similarity function f(i, j) is computed from the terms defined above,

where w₁, w₂, w₃, w₄, w₅ are the weights;

4) obtain the distance spectrum:

4.1) traverse the M highest-similarity fragments at each position of the query sequence, where F_k^i (k = 1, ..., M) are the fragments at the i-th position of the query sequence and F_l^j (l = 1, ..., M) are the fragments at the j-th position;

4.3) compute the distance d_ij between a_ik and a_jl in the original template structure;

P_{accept}^{decoy} = \frac{1}{N} \sum_{p=1}^{N} \frac{1}{\sigma_{p} \sqrt{2\pi}} \exp\left( -\frac{\left( D_{decoy}^{p(a,b)} - D_{profile}^{p} \right)^{2}}{2 \sigma_{p}^{2}} \right)