CN108563921B - Protein structure prediction algorithm evaluation index construction method

Info

Publication number: CN108563921B
Application number: CN201810238748.3A
Authority: CN
Inventors: 张贵军; 谢腾宇; 王柳静; 王小奇; 郝小虎; 周晓根
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2018-03-22
Filing date: 2018-03-22
Publication date: 2021-05-18
Anticipated expiration: 2038-03-22
Also published as: CN108563921A

Abstract

A method for constructing an evaluation index for protein structure prediction algorithms. First, the Rosetta Abinitio protocol is used to explore the search space, and potential native-state regions are found by clustering the resulting background points. Then the iterative process of the prediction algorithm under evaluation is executed, and the evolutionary state of each generation of the population is analyzed. Next, the state transition matrix between two successive generations of the population is calculated, and the change in population state is quantified with Shannon entropy. Finally, the history of entropy values is recorded, reflecting the effect of the algorithm on protein structure prediction. The resulting index intuitively reflects, to a certain extent, the state of the algorithm during prediction, and the entropy values also allow the behavior of several algorithms during prediction to be compared.

Description

Protein structure prediction algorithm evaluation index construction method

Technical Field

The invention relates to the fields of bioinformatics, intelligent optimization and computer application, in particular to a method for constructing an evaluation index of a protein structure prediction algorithm.

Background

Proteins are substances with a defined spatial structure, formed when polypeptide chains, built from amino acids by dehydration condensation, coil and fold; this structure enables their specific functions. The three-dimensional structure of a protein is of decisive importance in drug design, protein engineering and biotechnology, so protein structure prediction is an important research problem.

Experimental methods for determining protein structure include X-ray crystallography, nuclear magnetic resonance spectroscopy, electron microscopy and the like. Experimental structures are currently available for fewer than 1/1000 of the proteins with known sequences, so modeling plays an important role in providing structural information for a wide range of biological problems. Predicting the three-dimensional structure of a protein directly from its amino acid sequence, using a computer and an appropriate algorithm in accordance with the Anfinsen principle, is currently a major research topic in bioinformatics. Over the last 20 years of CASP experiments, the field of protein structure prediction has changed greatly. In 1994, only 229 unique protein folds were known (http://www.pdb.org), so most sequences of interest had no detectable homology to known structures and could only be modeled by "de novo" methods. Such modeling is regarded as a "significant challenge" in computational biology.

Many research groups have since developed de novo prediction methods, and the accuracy of protein structure prediction has gradually improved; Rosetta, QUARK and others have stood out in the CASP competitions. The Rosetta Abinitio protocol builds a fragment library from known protein three-dimensional structures and the target sequence, then optimizes an energy model using fragment assembly and a basic Monte Carlo algorithm. However, the prediction accuracy of this method drops sharply when the target sequence is long.

To address this problem, researchers have proposed various prediction algorithms, of which population-based evolutionary algorithms are the most widely applied; many current protein structure prediction algorithms, such as EDA, are built on a population framework. However, the quality of a population-based prediction method is generally judged only by final prediction accuracy, running time and similar measures. These do not show intuitively how the algorithm behaves during prediction, which makes it harder for researchers to improve the algorithm. Analyzing, comparing and reasonably evaluating protein structure prediction methods from an algorithmic perspective is therefore important for further improving sampling efficiency and prediction accuracy.

Therefore, existing population-based protein structure prediction methods are deficient in algorithm evaluation and need improvement.

Disclosure of Invention

To overcome the deficiency of existing population-based protein structure prediction methods with respect to algorithm evaluation, the invention provides a method for constructing an evaluation index that analyzes a protein structure prediction algorithm directly.

The technical scheme adopted by the invention for solving the technical problems is as follows:

A method for constructing an evaluation index of a protein structure prediction algorithm, comprising the following steps:

1) given the input sequence information, obtain a fragment library for the sequence using the Robetta server;

2) perform an initial exploration of the search space and establish a Markov state model, as follows:

2.1) acquire m background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;

2.2) calculate the root mean square deviation (RMSD) distance between every pair of the m background points to form a distance matrix D;

2.3) based on the distance matrix D, classify the m background points with a k-means clustering method to obtain k cluster centers as the k Markov states, where k < m, as follows:

2.3.1) randomly select k of the m background points as the current cluster centers;

2.3.2) classify each background point $b_i$, $i \in \{1, \ldots, m\}$: calculate the distance between the background point and each of the k cluster centers; the category number to which the background point belongs is then $c_i$, where $c_i$ is the index of the cluster center at the smallest distance from $b_i$;

2.3.3) find the cluster center of each category of background points: within each category, calculate for every point the sum of its distances to all other points in the same category; the point with the smallest sum is the cluster center of that category;

2.3.4) if any cluster center has changed, return to step 2.3.2) and continue the clustering iteration; otherwise, the cluster centers are final and the next step is executed;

3) evaluate the population-based protein structure prediction method, as follows:

3.1) classify the initial population to represent the initial state: the population size is NP and the population is expressed as $P = \{C_1, C_2, \ldots, C_{NP}\}$, where $C_n$, $n \in \{1, \ldots, NP\}$, is the n-th individual of the population; calculate the root mean square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \ldots, k\}$, indicating that individual $C_n$ belongs to class p; the state of the entire population is expressed as $state_{last} = \{state_1, state_2, \ldots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;

3.2) execute the next iteration on the population to obtain the next-generation population; the steps of the iteration are determined by the algorithm under evaluation;

3.3) calculate the current population state: classify each individual $C_n$, $n \in \{1, \ldots, NP\}$, of the current population by calculating the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \ldots, k\}$, indicating that individual $C_n$ belongs to class q; the state of the entire population is expressed as $state'_{now} = \{state'_1, state'_2, \ldots, state'_{NP}\}$, where $state'_{now}$ represents the current population state;

3.4) obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for conformation $C_n$, $n \in \{1, \ldots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$, the entry in row p and column q of the matrix T, represents the state transition frequency and has an initial value of 0;

3.5) calculate the Shannon entropy from the state transition matrix T: $Entropy = \sum -t_{pq} \ln t_{pq}$;

3.6) update the population state: $state_{last} = state'_{now}$;

3.7) determine whether the iterative process of the algorithm has finished; if so, output the final prediction result and the history of entropy values and end; otherwise, return to step 3.2).
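The clustering in steps 2.3.1) to 2.3.4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the inventors' implementation: the function name `cluster_background_points`, the `seed` and `max_iter` parameters, and the handling of an empty category are all hypothetical, and a toy distance matrix built from points on a line stands in for the RMSD matrix D. Note that step 2.3.3) picks an existing background point as the center (a medoid-style update), which the sketch follows. Categories are numbered from 0 in the code.

```python
import random


def cluster_background_points(D, k, seed=0, max_iter=100):
    """Cluster m points from an m x m distance matrix D.

    Returns (centers, labels): centers holds the indices of the k points
    chosen as cluster centers, labels[i] is the 0-based category of point i.
    """
    m = len(D)
    rng = random.Random(seed)
    centers = rng.sample(range(m), k)  # step 2.3.1: k random points as centers
    labels = [0] * m
    for _ in range(max_iter):  # max_iter is a safety cap, not part of the source
        # Step 2.3.2: each point joins the category of its nearest center.
        labels = [min(range(k), key=lambda j: D[i][centers[j]]) for i in range(m)]
        # Step 2.3.3: the member with the smallest summed distance to the
        # other members becomes the new center of its category.
        new_centers = []
        for j in range(k):
            members = [i for i in range(m) if labels[i] == j]
            if not members:  # assumption: an empty category keeps its old center
                new_centers.append(centers[j])
                continue
            new_centers.append(
                min(members, key=lambda c: sum(D[c][o] for o in members))
            )
        # Step 2.3.4: stop once no center changed.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Toy usage: six points on a line, with |x - y| standing in for RMSD.
points = [0, 1, 2, 100, 101, 102]
D = [[abs(a - b) for b in points] for a in points]
centers, labels = cluster_background_points(D, k=2)
```

In the toy data the two groups of three points end up in separate categories, each centered on its middle point. Working from the precomputed matrix D means the RMSD calculation of step 2.2) is done once and reused in every clustering iteration.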

The technical concept of the invention is as follows: first, the Rosetta Abinitio protocol is used to explore the search space, and potential native-state regions are found by clustering the background points; then the iterative process of the prediction algorithm under evaluation is executed, and the evolutionary state of each generation of the population is analyzed; next, the state transition matrix between two successive generations of the population is calculated, and the change in population state is quantified with Shannon entropy; finally, the history of entropy values is recorded, reflecting the effect of the algorithm on protein structure prediction.

The beneficial effects of the invention are as follows: on the one hand, the state of the algorithm during prediction is reflected intuitively to a certain extent; on the other hand, the entropy values can be used to compare the behavior of several algorithms during prediction.
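The state assignment, transition matrix and entropy calculation of steps 3.1) and 3.3) to 3.5) might look like the sketch below. The function names, the `distance` callback and the `increment` parameter are hypothetical. The increment defaults to 1/k, the value given in the claims, and is left adjustable because the source does not discuss other normalizations. Skipping zero entries in the sum (treating 0 ln 0 as 0) is also an assumption, since the logarithm is undefined there. States are numbered from 0 in the code rather than from 1.

```python
import math


def assign_states(population, centers, distance):
    """Steps 3.1/3.3: 0-based index of the nearest center for each individual."""
    return [
        min(range(len(centers)), key=lambda j: distance(ind, centers[j]))
        for ind in population
    ]


def transition_matrix(state_last, state_now, k, increment=None):
    """Step 3.4: accumulate one increment per observed p -> q transition."""
    if increment is None:
        increment = 1.0 / k  # value stated in the claims
    T = [[0.0] * k for _ in range(k)]
    for p, q in zip(state_last, state_now):
        T[p][q] += increment
    return T


def shannon_entropy(T):
    """Step 3.5: sum of -t ln t over all entries, skipping zeros (assumption)."""
    return sum(-t * math.log(t) for row in T for t in row if t > 0.0)


# Toy usage: 1-D "conformations" with |x - y| standing in for RMSD.
centers = [0.0, 10.0]
dist = lambda a, b: abs(a - b)
state_last = assign_states([1.0, 2.0, 9.0, 8.0], centers, dist)
state_now = assign_states([1.5, 7.0, 9.5, 8.5], centers, dist)
T = transition_matrix(state_last, state_now, k=2)
entropy = shannon_entropy(T)
```

Here one of the four individuals moves from state 0 to state 1, giving T = [[0.5, 0.5], [0.0, 1.0]] and an entropy of ln 2.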

Drawings

FIG. 1 is a basic flowchart of a method for constructing an evaluation index of a protein structure prediction algorithm.

FIG. 2 is the entropy curve obtained when the evaluation index construction method is applied to a population-based Rosetta Abinitio protocol predicting the target protein 1ACF.

Detailed Description

The invention is further described below with reference to the accompanying drawings.

Referring to FIG. 1 and FIG. 2, a method for constructing an evaluation index of a protein structure prediction algorithm includes the following steps:

3.4) obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for conformation $C_n$, $n \in \{1, \ldots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$, the entry in row p and column q of the matrix T, represents the state transition frequency and has an initial value of 0;

3.5) calculate the Shannon entropy from the state transition matrix T: $Entropy = \sum -t_{pq} \ln t_{pq}$;

3.6) update the population state: $state_{last} = state'_{now}$;

In this embodiment, the method for constructing an evaluation index of a protein structure prediction algorithm is applied to a population-based Rosetta Abinitio protocol predicting the target protein 1ACF, and includes the following steps:

1) given the sequence information of 1ACF, obtain a fragment library for the sequence using the Robetta server;

2.1) acquire m = 1000 background points: run the Rosetta Abinitio protocol m times and record the conformation produced by each run as a background point;

3.1) classify the initial population to represent the initial state: the population size is NP = 300 and the population is expressed as $P = \{C_1, C_2, \ldots, C_{NP}\}$, where $C_n$, $n \in \{1, \ldots, NP\}$, is the n-th individual of the population; calculate the root mean square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \ldots, k\}$, indicating that individual $C_n$ belongs to class p; the state of the entire population is expressed as $state_{last} = \{state_1, state_2, \ldots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;

3.4) obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for conformation $C_n$, $n \in \{1, \ldots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$, the entry in row p and column q of the matrix $T_{k \times k}$, represents the state transition frequency and has an initial value of 0;

3.5) calculate the Shannon entropy from the state transition matrix T: $Entropy = \sum -t_{pq} \ln t_{pq}$;

3.6) update the population state: $state_{last} = state'_{now}$;
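The per-generation loop of the embodiment, which records one entropy value per iteration, can be sketched roughly as below. Everything beyond the step structure is an assumption made for illustration: `entropy_history`, `nearest_state` and `toy_step` are hypothetical names, the algorithm under evaluation is replaced by a toy update on 1-D values, three hand-picked centers stand in for the cluster centers obtained from the 1000 background points, |x - y| stands in for RMSD, and the loop runs for a fixed number of generations. Only NP = 300 is taken from the embodiment.

```python
import math
import random


def nearest_state(individual, centers, distance):
    """0-based index of the cluster center nearest to one individual."""
    return min(range(len(centers)), key=lambda j: distance(individual, centers[j]))


def entropy_history(population, centers, distance, next_generation,
                    generations, increment=None):
    """Run the evaluated algorithm and return (final population, entropy list)."""
    k = len(centers)
    if increment is None:
        increment = 1.0 / k  # increment as written in step 3.4
    state_last = [nearest_state(c, centers, distance) for c in population]  # 3.1
    history = []
    for _ in range(generations):  # 3.7: here simply a fixed generation count
        population = next_generation(population)  # 3.2
        state_now = [nearest_state(c, centers, distance) for c in population]  # 3.3
        T = [[0.0] * k for _ in range(k)]
        for p, q in zip(state_last, state_now):  # 3.4
            T[p][q] += increment
        # 3.5: zero entries are skipped (assumes 0 ln 0 = 0)
        history.append(sum(-t * math.log(t) for row in T for t in row if t > 0.0))
        state_last = state_now  # 3.6
    return population, history


# Toy stand-in for the algorithm under evaluation: drift toward the value 10.
def toy_step(pop):
    return [x + 0.5 * (10.0 - x) for x in pop]


rng = random.Random(1)
centers = [0.0, 5.0, 10.0]
population = [rng.uniform(0.0, 10.0) for _ in range(300)]  # NP = 300
final_population, history = entropy_history(
    population, centers, lambda a, b: abs(a - b), toy_step, generations=20)
```

Once every individual stays in the same state from one generation to the next, T stops changing and the recorded entropy becomes constant, which is the kind of behavior an entropy curve such as FIG. 2 is meant to make visible.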

Taking the prediction of the target protein 1ACF by a population-based Rosetta Abinitio protocol as an embodiment, the method constructs an entropy index that intuitively reflects the behavior of a population-based algorithm in protein structure prediction.

The above describes the entropy-change result obtained by the invention when a population-based Rosetta Abinitio protocol is used to predict the target protein 1ACF as an example. It is not intended to limit the scope of the invention, and various modifications and improvements can be made without departing from that scope.

Claims

1. A method for constructing an evaluation index of a protein structure prediction algorithm, characterized in that the evaluation index construction method comprises the following steps:

3.1) classify the initial population to represent the initial state: the population size is NP and the population is expressed as $P = \{C_1, C_2, \ldots, C_{NP}\}$, where $C_n$, $n \in \{1, \ldots, NP\}$, is the n-th individual of the population; calculate the root mean square deviation (RMSD) distance between individual $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the p-th cluster center, the current state of the individual is $state_n = p$, $p \in \{1, \ldots, k\}$, indicating that individual $C_n$ belongs to class p; the state of the entire population is expressed as $state_{last} = \{state_1, state_2, \ldots, state_{NP}\}$, where $state_{last}$ represents the state of the previous-generation population;

3.3) calculate the current population state: classify each individual $C_n$, $n \in \{1, \ldots, NP\}$, of the current population by calculating the RMSD distance between $C_n$ and each of the k cluster centers; if $C_n$ is nearest to the q-th cluster center, the current state of the individual is $state'_n = q$, $q \in \{1, \ldots, k\}$, indicating that individual $C_n$ belongs to class q; the state of the entire population is expressed as $state'_{now} = \{state'_1, state'_2, \ldots, state'_{NP}\}$, where $state'_{now}$ represents the current population state;

3.4) obtain the Markov state transition matrix T from the state statistics of the previous and current generations: for conformation $C_n$, $n \in \{1, \ldots, NP\}$, the two successive states $state_n = p$ and $state'_n = q$ indicate a transition from state p to state q, so $t_{pq} = t_{pq} + 1/k$, where $t_{pq}$, the entry in row p and column q of the matrix T, represents the state transition frequency and has an initial value of 0;

3.5) calculate the Shannon entropy from the state transition matrix T: $Entropy = \sum -t_{pq} \ln t_{pq}$;

3.6) update the population state: $state_{last} = state'_{now}$;