CN108563921B - Protein structure prediction algorithm evaluation index construction method

Publication number: CN108563921B
Other versions: CN108563921A
Application number: CN201810238748.3A
Authority: CN (China)
Other languages: Chinese (zh)
Legal status: Active
Inventors: 张贵军, 谢腾宇, 王柳静, 王小奇, 郝小虎, 周晓根
Current assignee: Zhejiang University of Technology (ZJUT)
Original assignee: Zhejiang University of Technology (ZJUT)
Prior art keywords: state, population, individual, algorithm, calculating

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00: ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations

Abstract

A method for constructing an evaluation index for protein structure prediction algorithms. First, the Rosetta Abinitio protocol is used to explore the search space, and potential native-state regions are located by clustering the resulting background points. Next, the iterative process of the prediction algorithm under evaluation is executed, and the evolutionary state of each generation of the population is analyzed. Then, the state transition matrix between two successive generations of the population is computed, and the change in population state is quantified using Shannon entropy. Finally, the history of entropy values is recorded, reflecting the role the algorithm plays in protein structure prediction. The resulting index intuitively reflects, to a certain extent, the state of the algorithm during prediction, and the entropy values allow the roles of several algorithms during prediction to be compared.

Description

Protein structure prediction algorithm evaluation index construction method
Technical Field
The invention relates to the fields of bioinformatics, intelligent optimization and computer application, in particular to a method for constructing an evaluation index of a protein structure prediction algorithm.
Background
Proteins are formed when polypeptide chains, built from amino acids by dehydration condensation, coil and fold into a specific spatial structure that enables a specific function. The three-dimensional structure of a protein is of decisive importance in drug design, protein engineering and biotechnology, so protein structure prediction is an important research problem.
Experimental methods for determining protein structure include X-ray crystallography, nuclear magnetic resonance spectroscopy and electron microscopy. Experimental structures are currently available for fewer than 1/1000 of the proteins whose sequences are known, so modeling plays an important role in providing structural information for a wide range of biological problems. According to the Anfinsen principle, the three-dimensional structure of a protein can be predicted directly from its amino acid sequence by computer, using an appropriate algorithm; this is currently a major research topic in bioinformatics. Over the last 20 years of CASP experiments, the field of protein structure prediction has changed greatly. In 1994, only 229 unique protein folds were known (http://www.pdb.org), so most sequences of interest had no detectable homology to known structures and could only be modeled by "de novo" methods. Such modeling is regarded as a "significant challenge" in computational biology.
Many research groups have since developed de novo prediction methods, and the accuracy of protein structure prediction has gradually improved; Rosetta, QUARK and others have stood out in the CASP competitions. The Rosetta Abinitio protocol builds a fragment library from known three-dimensional protein structures and the target sequence, then optimizes an energy model using fragment assembly and a basic Monte Carlo algorithm. However, the prediction accuracy of this method drops sharply when the target sequence is long.
To address this problem, researchers have proposed a range of prediction algorithms, the most widely used being population-based evolutionary algorithms; many current protein structure prediction algorithms, such as EDA, are built on a population framework. However, the quality of a population-based prediction method is usually judged only by its final prediction accuracy, running time and similar measures. The role the algorithm plays during the prediction process cannot be understood intuitively, which hinders researchers in improving the algorithm. Analyzing, comparing and reasonably evaluating protein structure prediction methods from the algorithmic point of view is therefore important for further improving sampling efficiency and prediction accuracy.
Existing population-based protein structure prediction methods are therefore deficient in algorithm evaluation and need improvement.
Disclosure of Invention
To overcome the deficiency of existing population-based protein structure prediction methods with respect to algorithm evaluation, the invention provides a method for constructing an evaluation index that analyzes a protein structure prediction algorithm directly.
The technical solution adopted by the invention to solve this technical problem is as follows:
A method for constructing an evaluation index of a protein structure prediction algorithm comprises the following steps:
1) Given the input sequence information, obtain the fragment library of the sequence using the Robetta server;
2) Perform an initial exploration of the search space and build a Markov state model, as follows:
2.1) Obtain m background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;
2.2) Compute the root-mean-square deviation (RMSD) distance between each pair of the m background points to form a distance matrix D;
2.3) Based on the distance matrix D, classify the m background points with a k-means clustering method to obtain k cluster centers, which serve as the k Markov states, where k < m, as follows:
2.3.1) Randomly select k of the m background points as the current cluster centers;
2.3.2) Classify each background point $b_i$, $i \in \{1, \dots, m\}$: compute the distances $d_{ij}$, $j \in \{1, \dots, k\}$, between the background point and the k cluster centers; the class index $c_i$ of the background point then satisfies
$c_i = \arg\min_{j \in \{1, \dots, k\}} d_{ij}$;
2.3.3) Find the cluster center of each class of background points: for each point, compute the sum of its distances to all other points in the same class; the point with the smallest sum is the cluster center of that class;
2.3.4) If any cluster center has changed, return to step 2.3.2) and continue the clustering iteration; otherwise the cluster centers are unchanged and the next step is executed;
3) Evaluate the population-based protein structure prediction method, as follows:
3.1) Classify the initial population to represent the initial state: the population size is NP and the population is denoted $P = \{C_1, C_2, \dots, C_{NP}\}$, where $C_n$, $n \in \{1, \dots, NP\}$, is the n-th individual. Compute the root-mean-square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class p. The state of the whole population is denoted $state_{last} = \{state_1, state_2, \dots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;
3.2) Execute the next iteration on the population to obtain the next-generation population; the steps of the iteration are determined by the algorithm being evaluated;
3.3) Compute the current population state: classify each individual $C_n$, $n \in \{1, \dots, NP\}$, of the current population by computing the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class q. The state of the whole population is denoted $state_{now} = \{state'_1, state'_2, \dots, state'_{NP}\}$, where $state_{now}$ represents the current population state;
3.4) Obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for each conformation $C_n$, $n \in \{1, \dots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$ is the entry in row p and column q of the matrix T and represents the state transition frequency, with initial value 0;
3.5) Compute the Shannon entropy from the state transition matrix T: $\mathrm{Entropy} = \sum_{p,q} -t_{pq} \ln t_{pq}$;
3.6) Update the population state: $state_{last} = state_{now}$;
3.7) Determine whether the iteration of the algorithm has finished; if so, output the final prediction result and the history of entropy values and end; otherwise return to step 3.2).
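Steps 2.3.1) to 2.3.4) can be sketched in a few lines of Python. This is a minimal illustration, not the inventors' implementation: the function name `cluster_background_points`, the `seed` and `max_iter` parameters, and the six toy points are assumptions made for the example. Because step 2.3.3) selects an existing background point as each cluster center, the update is medoid-style and needs only the distance matrix D.

```python
import random


def cluster_background_points(D, k, seed=0, max_iter=100):
    """Cluster m points from their m x m distance matrix D (list of lists).

    Returns (centers, labels): the indices of the k center points and, for
    each point, the position (0..k-1) of its nearest center in `centers`.
    """
    m = len(D)
    rng = random.Random(seed)
    # Step 2.3.1: pick k of the m background points as initial centers.
    centers = rng.sample(range(m), k)

    def assign(cs):
        # Step 2.3.2: each point takes the class of its nearest center.
        return [min(range(k), key=lambda j: D[i][cs[j]]) for i in range(m)]

    for _ in range(max_iter):
        labels = assign(centers)
        new_centers = list(centers)
        for j in range(k):
            members = [i for i in range(m) if labels[i] == j]
            if not members:
                continue  # assumption: an empty class keeps its old center
            # Step 2.3.3: the member with the smallest summed distance to
            # the other members of its class becomes the class center.
            new_centers[j] = min(
                members, key=lambda a: sum(D[a][b] for b in members)
            )
        # Step 2.3.4: stop once no center has changed.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(centers)


# Toy data (hypothetical): six points on a line, |a - b| standing in for RMSD.
pts = [0, 1, 2, 100, 101, 102]
D = [[abs(a - b) for b in pts] for a in pts]
centers, labels = cluster_background_points(D, k=2)
```

In the toy run the two centers settle on the middle point of each group (indices 1 and 4), and the labels split the points into the two groups.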
The technical concept of the invention is as follows: first, the search space is explored using the Rosetta Abinitio protocol, and potential native-state regions are located by clustering the background points; next, the iterative process of the prediction algorithm under evaluation is executed, and the evolutionary state of each generation of the population is analyzed; then, the state transition matrix between two successive generations is computed, and the change in population state is quantified using Shannon entropy; finally, the history of entropy values is recorded, reflecting the role the algorithm plays in protein structure prediction.
The beneficial effects of the invention are as follows: on the one hand, the index intuitively reflects, to a certain extent, the state of the algorithm during prediction; on the other hand, the entropy values can be used to compare the roles of several algorithms during prediction.
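A small worked case may help in reading the entropy index. The numbers are hypothetical: k = 2 states, NP = 4 individuals, the increment 1/k given in the detailed description and in claim 1, and the convention that a zero entry contributes nothing to the sum (an assumption, since the text does not say how zero entries are handled). Suppose two individuals stay in state 1, one moves from state 1 to state 2, and one stays in state 2.

```latex
% Toy case: k = 2, NP = 4, increment 1/k = 1/2.
% Transitions: 1->1 (twice), 1->2 (once), 2->2 (once).
\[
T = \begin{pmatrix} t_{11} & t_{12} \\ t_{21} & t_{22} \end{pmatrix}
  = \begin{pmatrix} 1 & \tfrac12 \\ 0 & \tfrac12 \end{pmatrix}
\]
\[
\mathrm{Entropy} = \sum_{p,q} -t_{pq} \ln t_{pq}
  = -\left( 1 \cdot \ln 1 + \tfrac12 \ln \tfrac12 + \tfrac12 \ln \tfrac12 \right)
  = \ln 2 \approx 0.693
\]
```

With this increment the entries of T sum to NP/k (here 2) rather than to 1.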
Drawings
FIG. 1 is a basic flowchart of the method for constructing an evaluation index of a protein structure prediction algorithm.
FIG. 2 is the entropy curve obtained when the evaluation index construction method is applied to the population-based Rosetta Abinitio protocol predicting the target protein 1ACF.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1 and FIG. 2, a method for constructing an evaluation index of a protein structure prediction algorithm includes the following steps:
1) Given the input sequence information, obtain the fragment library of the sequence using the Robetta server;
2) Perform an initial exploration of the search space and build a Markov state model, as follows:
2.1) Obtain m background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;
2.2) Compute the root-mean-square deviation (RMSD) distance between each pair of the m background points to form a distance matrix D;
2.3) Based on the distance matrix D, classify the m background points with a k-means clustering method to obtain k cluster centers, which serve as the k Markov states, where k < m, as follows:
2.3.1) Randomly select k of the m background points as the current cluster centers;
2.3.2) Classify each background point $b_i$, $i \in \{1, \dots, m\}$: compute the distances $d_{ij}$, $j \in \{1, \dots, k\}$, between the background point and the k cluster centers; the class index $c_i$ of the background point then satisfies
$c_i = \arg\min_{j \in \{1, \dots, k\}} d_{ij}$;
2.3.3) Find the cluster center of each class of background points: for each point, compute the sum of its distances to all other points in the same class; the point with the smallest sum is the cluster center of that class;
2.3.4) If any cluster center has changed, return to step 2.3.2) and continue the clustering iteration; otherwise the cluster centers are unchanged and the next step is executed;
3) Evaluate the population-based protein structure prediction method, as follows:
3.1) Classify the initial population to represent the initial state: the population size is NP and the population is denoted $P = \{C_1, C_2, \dots, C_{NP}\}$, where $C_n$, $n \in \{1, \dots, NP\}$, is the n-th individual. Compute the root-mean-square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class p. The state of the whole population is denoted $state_{last} = \{state_1, state_2, \dots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;
3.2) Execute the next iteration on the population to obtain the next-generation population; the steps of the iteration are determined by the algorithm being evaluated;
3.3) Compute the current population state: classify each individual $C_n$, $n \in \{1, \dots, NP\}$, of the current population by computing the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class q. The state of the whole population is denoted $state_{now} = \{state'_1, state'_2, \dots, state'_{NP}\}$, where $state_{now}$ represents the current population state;
3.4) Obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for each conformation $C_n$, $n \in \{1, \dots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$ is the entry in row p and column q of the matrix T and represents the state transition frequency, with initial value 0;
3.5) Compute the Shannon entropy from the state transition matrix T: $\mathrm{Entropy} = \sum_{p,q} -t_{pq} \ln t_{pq}$;
3.6) Update the population state: $state_{last} = state_{now}$;
3.7) Determine whether the iteration of the algorithm has finished; if so, output the final prediction result and the history of entropy values and end; otherwise return to step 3.2).
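The bookkeeping in steps 3.1) and 3.3) to 3.5) can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, states are 0-based rather than numbered 1..k, the increment 1/k follows step 3.4) as written, and treating a zero entry as contributing nothing to the entropy is an assumption, since the text does not say how empty entries of T are handled.

```python
import math


def classify(dist_to_centers):
    """Steps 3.1 / 3.3: state of an individual = index of its nearest center.

    dist_to_centers[n][j] is the RMSD between individual n and center j.
    """
    return [min(range(len(row)), key=row.__getitem__) for row in dist_to_centers]


def transition_matrix(state_last, state_now, k):
    """Step 3.4: add 1/k to t[p][q] for every individual that moved p -> q."""
    T = [[0.0] * k for _ in range(k)]
    for p, q in zip(state_last, state_now):
        T[p][q] += 1.0 / k
    return T


def shannon_entropy(T):
    """Step 3.5: sum of -t * ln(t) over the non-zero entries of T."""
    return sum(-t * math.log(t) for row in T for t in row if t > 0)


# Hypothetical example: NP = 4 individuals, k = 2 states.
state_last = classify([[0.2, 3.0], [0.4, 2.5], [1.0, 1.5], [4.0, 0.3]])
state_now = classify([[0.1, 3.1], [0.5, 2.0], [2.0, 0.9], [3.5, 0.2]])
T = transition_matrix(state_last, state_now, k=2)
H = shannon_entropy(T)
```

Recomputing T from scratch for each pair of generations keeps every entropy value tied to a single generation-to-generation transition, which matches the reset to 0 stated in step 3.4).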
In this embodiment, the method for constructing an evaluation index of a protein structure prediction algorithm is illustrated using the population-based Rosetta Abinitio protocol predicting the target protein 1ACF, and includes the following steps:
1) Given the sequence information of 1ACF, obtain the fragment library of the sequence using the Robetta server;
2) Perform an initial exploration of the search space and build a Markov state model, as follows:
2.1) Obtain m = 1000 background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;
2.2) Compute the root-mean-square deviation (RMSD) distance between each pair of the m background points to form a distance matrix D;
2.3) Based on the distance matrix D, classify the m background points with a k-means clustering method to obtain k cluster centers, which serve as the k Markov states, where k < m, as follows:
2.3.1) Randomly select k of the m background points as the current cluster centers;
2.3.2) Classify each background point $b_i$, $i \in \{1, \dots, m\}$: compute the distances $d_{ij}$, $j \in \{1, \dots, k\}$, between the background point and the k cluster centers; the class index $c_i$ of the background point then satisfies
$c_i = \arg\min_{j \in \{1, \dots, k\}} d_{ij}$;
2.3.3) Find the cluster center of each class of background points: for each point, compute the sum of its distances to all other points in the same class; the point with the smallest sum is the cluster center of that class;
2.3.4) If any cluster center has changed, return to step 2.3.2) and continue the clustering iteration; otherwise the cluster centers are unchanged and the next step is executed;
3) Evaluate the population-based protein structure prediction method, as follows:
3.1) Classify the initial population to represent the initial state: the population size is NP = 300 and the population is denoted $P = \{C_1, C_2, \dots, C_{NP}\}$, where $C_n$, $n \in \{1, \dots, NP\}$, is the n-th individual. Compute the root-mean-square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class p. The state of the whole population is denoted $state_{last} = \{state_1, state_2, \dots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;
3.2) Execute the next iteration on the population to obtain the next-generation population; the steps of the iteration are determined by the algorithm being evaluated;
3.3) Compute the current population state: classify each individual $C_n$, $n \in \{1, \dots, NP\}$, of the current population by computing the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class q. The state of the whole population is denoted $state_{now} = \{state'_1, state'_2, \dots, state'_{NP}\}$, where $state_{now}$ represents the current population state;
3.4) Obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for each conformation $C_n$, $n \in \{1, \dots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$ is the entry in row p and column q of the matrix $T_{k \times k}$ and represents the state transition frequency, with initial value 0;
3.5) Compute the Shannon entropy from the state transition matrix T: $\mathrm{Entropy} = \sum_{p,q} -t_{pq} \ln t_{pq}$;
3.6) Update the population state: $state_{last} = state_{now}$;
3.7) Determine whether the iteration of the algorithm has finished; if so, output the final prediction result and the history of entropy values and end; otherwise return to step 3.2).
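The loop formed by steps 3.1) to 3.7) can be sketched end to end. This is only an illustration under stated assumptions: `track_entropy`, `nearest`, `step` and `distance` are hypothetical names, a fixed number of generations stands in for the algorithm's own termination test, and the toy data use plain numbers with |a - b| in place of conformations and RMSD. The embodiment itself uses m = 1000 background points and NP = 300 individuals.

```python
import math


def nearest(x, centers, distance):
    """Index of the cluster center closest to conformation x."""
    return min(range(len(centers)), key=lambda j: distance(x, centers[j]))


def track_entropy(population, centers, distance, step, generations):
    """Record one entropy value per generation of the algorithm `step`."""
    k = len(centers)
    # Step 3.1: initial state of every individual.
    state_last = [nearest(c, centers, distance) for c in population]
    history = []
    for _ in range(generations):  # step 3.7: fixed budget (assumption)
        population = step(population)  # step 3.2: algorithm under evaluation
        # Step 3.3: state of every individual in the new generation.
        state_now = [nearest(c, centers, distance) for c in population]
        # Step 3.4: accumulate transition frequencies with increment 1/k.
        T = [[0.0] * k for _ in range(k)]
        for p, q in zip(state_last, state_now):
            T[p][q] += 1.0 / k
        # Step 3.5: entropy of T, skipping zero entries (assumption).
        history.append(sum(-t * math.log(t) for row in T for t in row if t > 0))
        state_last = state_now  # step 3.6
    return population, history


# Toy stand-ins: numbers instead of conformations, and a "prediction
# algorithm" that simply halves every individual each generation.
final_pop, history = track_entropy(
    population=[1.0, 2.0, 9.0, 12.0],
    centers=[0.0, 10.0],
    distance=lambda a, b: abs(a - b),
    step=lambda pop: [x / 2 for x in pop],
    generations=3,
)
```

Passing the algorithm in as a `step` callable mirrors step 3.2), where the content of an iteration is left to the algorithm being evaluated, so the same tracking code could wrap different population-based methods for comparison.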
Taking the prediction of the target protein 1ACF by the population-based Rosetta Abinitio protocol as the embodiment, the method constructs an entropy index that intuitively reflects the role of a population-based algorithm in protein structure prediction.
The above describes the entropy-change results obtained by the invention using the population-based Rosetta Abinitio protocol predicting the target protein 1ACF as an example. It is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.
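The RMSD distance used in steps 2.2), 3.1) and 3.3) can be sketched as below. The description does not say whether conformations are superimposed before the distance is taken, so this example assumes the coordinates are already aligned; the function names and the three two-atom "conformations" are hypothetical.

```python
import math


def rmsd(a, b):
    """RMSD between two conformations given as equal-length lists of
    (x, y, z) coordinates. Assumption: the two are already superimposed."""
    if len(a) != len(b):
        raise ValueError("conformations must have the same number of atoms")
    sq = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(sq / len(a))


def distance_matrix(conformations):
    """Step 2.2: symmetric m x m matrix D of pairwise RMSD distances."""
    m = len(conformations)
    D = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            D[i][j] = D[j][i] = rmsd(conformations[i], conformations[j])
    return D


# Three hypothetical two-atom conformations.
confs = [
    [(0, 0, 0), (1, 0, 0)],
    [(0, 0, 0), (1, 0, 2)],
    [(0, 0, 1), (1, 0, 1)],
]
D = distance_matrix(confs)
```

Only the upper triangle is computed and then mirrored, which halves the number of RMSD evaluations for the m background points.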

Claims (1)

1. A method for constructing an evaluation index of a protein structure prediction algorithm, characterized in that the evaluation index construction method comprises the following steps:
1) Given the input sequence information, obtain the fragment library of the sequence using the Robetta server;
2) Perform an initial exploration of the search space and build a Markov state model, as follows:
2.1) Obtain m background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;
2.2) Compute the root-mean-square deviation (RMSD) distance between each pair of the m background points to form a distance matrix D;
2.3) Based on the distance matrix D, classify the m background points with a k-means clustering method to obtain k cluster centers, which serve as the k Markov states, where k < m, as follows:
2.3.1) Randomly select k of the m background points as the current cluster centers;
2.3.2) Classify each background point $b_i$, $i \in \{1, \dots, m\}$: compute the distances $d_{ij}$, $j \in \{1, \dots, k\}$, between the background point and the k cluster centers; the class index $c_i$ of the background point then satisfies
$c_i = \arg\min_{j \in \{1, \dots, k\}} d_{ij}$;
2.3.3) Find the cluster center of each class of background points: for each point, compute the sum of its distances to all other points in the same class; the point with the smallest sum is the cluster center of that class;
2.3.4) If any cluster center has changed, return to step 2.3.2) and continue the clustering iteration; otherwise the cluster centers are unchanged and the next step is executed;
3) Evaluate the population-based protein structure prediction method, as follows:
3.1) Classify the initial population to represent the initial state: the population size is NP and the population is denoted $P = \{C_1, C_2, \dots, C_{NP}\}$, where $C_n$, $n \in \{1, \dots, NP\}$, is the n-th individual. Compute the root-mean-square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class p. The state of the whole population is denoted $state_{last} = \{state_1, state_2, \dots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;
3.2) Execute the next iteration on the population to obtain the next-generation population; the steps of the iteration are determined by the algorithm being evaluated;
3.3) Compute the current population state: classify each individual $C_n$, $n \in \{1, \dots, NP\}$, of the current population by computing the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \dots, k\}$, indicating that $C_n$ belongs to class q. The state of the whole population is denoted $state'_{now} = \{state'_1, state'_2, \dots, state'_{NP}\}$, where $state'_{now}$ represents the current population state;
3.4) Obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for each conformation $C_n$, $n \in \{1, \dots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$ is the entry in row p and column q of the matrix T and represents the state transition frequency, with initial value 0;
3.5) Compute the Shannon entropy from the state transition matrix T: $\mathrm{Entropy} = \sum_{p,q} -t_{pq} \ln t_{pq}$;
3.6) Update the population state: $state_{last} = state'_{now}$;
3.7) Determine whether the iteration of the algorithm has finished; if so, output the final prediction result and the history of entropy values and end; otherwise return to step 3.2).

Application and Publication Data

Application number: CN201810238748.3A (Active)
Title: Protein structure prediction algorithm evaluation index construction method
Priority date: 2018-03-22
Filing date: 2018-03-22
Publications (2): CN108563921A, published 2018-09-21; CN108563921B, published 2021-05-18
Family ID: 63532064
Family applications (1): CN201810238748.3A
Country status (1): CN, CN108563921B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106503486A (en) * 2016-09-23 2017-03-15 浙江工业大学 A kind of differential evolution protein structure ab initio prediction method based on multistage subgroup coevolution strategy
CN107491664A (en) * 2017-08-29 2017-12-19 浙江工业大学 A kind of protein structure ab initio prediction method based on comentropy

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"A seqlet-based maximum entropy Markov approach for protein secondary structure prediction"; DONG Qiwen et al.; Science in China Ser. C Life Sciences; 2005-12-31; entire document *
"An Overview and Practical Guide to Building Markov State Models"; Gregory R. Bowman; Advances in Experimental Medicine and Biology; 2014-12-31; entire document *
"Refined Markov clustering Algorithm for Mycobacterium Tuberculosis Protein Sequence analysis"; Dr. D. Ramyachitra et al.; International Journal of Computer Science & Engineering Technology (IJCSET); 2014-08-31; entire document *
"Toward a detailed understanding of search trajectories in fragment assembly approaches to protein structure prediction"; Shaun M. Kandathil et al.; Proteins; 2016-01-21; entire document *

Also Published As

Publication number Publication date
CN108563921A (en) 2018-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant