JP2008090777A

JP2008090777A - Protein folding order prediction method

Info

Publication number: JP2008090777A
Application number: JP2006273816A
Authority: JP
Inventors: Kentaro Onizuka; 健太郎鬼塚
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2006-10-05
Filing date: 2006-10-05
Publication date: 2008-04-17

Abstract

PROBLEM TO BE SOLVED: To provide a method for identifying, from the characteristics of the amino acid residue sequence of a protein to be predicted, the order in which the parts of the protein fold, for use in three-dimensional protein structure prediction.

SOLUTION: The method divides the amino acid residue sequence of the protein to be predicted into fragments of length n (n=5-9), computes for each sequence fragment the standard deviation of its energy values, based on a multidimensional mean force field potential, over different three-dimensional structures, and predicts that three-dimensional structure folding proceeds in descending order of the standard deviation.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a method for estimating the order of protein folding, which forms part of a three-dimensional structure prediction method that predicts the three-dimensional structure of a protein of unknown structure from its amino acid sequence.

A protein is a chain polymer in which several hundred residues, drawn from the 20 kinds of biological amino acids, are joined by peptide bonds, and which folds into a stable, specific three-dimensional shape. The specific three-dimensional shape of each protein is called its three-dimensional structure, and it is closely related to the function of the protein. In particular, the molecular recognition function common to many proteins is realized by a specific region of this structure binding specifically to a particular molecule.

The three-dimensional structure of a protein can be determined by X-ray crystallography or nuclear magnetic resonance (NMR). X-ray crystallography first requires a single crystal of one kind of protein; if crystallization fails, the analysis cannot be performed. NMR needs no crystallization and gives the structure in aqueous solution, but for large proteins it yields too little structural information, so the structure is not always determined accurately. The Protein Data Bank (PDB), a database that collects the protein structures determined to date, currently holds tens of thousands of proteins. Although this number grows every year, it is still quite small compared with, for example, DNA sequence databases, to which newly determined sequences are added almost daily.

This is because determining the three-dimensional structure of a single protein by X-ray crystallography or NMR can take from several months to more than a year. In contrast, even when the structure is unknown, the amino acid residue sequence can be obtained relatively easily from genetic information or directly from the protein. As a result, proteins whose sequence is known but whose structure is not outnumber proteins of known structure by a factor of ten or more. This is why a technique for predicting the three-dimensional structure of a protein from its amino acid residue sequence is needed.

As described above, a protein is a chain molecule in which several hundred amino acids are linked. Its molecular structure consists of a main chain running along the chain, from which the side chain specific to each amino acid projects. The main chain contains hydrogen and oxygen atoms that can form hydrogen bonds, and it often adopts a stable structure by forming hydrogen bonds between main-chain segments that are near to or far from each other along the sequence. The most typical such structure is the helix. In particular, the α helix, in which the hydrogen bonded to the main-chain nitrogen atom of the i-th residue forms a hydrogen bond with the oxygen bonded to the main-chain carbon atom of the (i+4)-th residue, is the most frequently occurring local structure. The β sheet, in which hydrogen bonds form between the main chains of residues that are distant along the sequence, also appears quite frequently. A β sheet is built from long, extended β strands, each joined to another β strand distant along the sequence by main-chain hydrogen bonds. Both the α helix and the β sheet are semi-regular structures. Other special partial structures that involve hydrogen bonds also occur; these are sometimes called turn structures, and there are many kinds of them. These characteristic partial structures (α helix, β sheet, turn, and so on), together with the remaining random structures, are collectively called secondary structure. The three-dimensional structure of a protein is formed when each part of the sequence forms a secondary structure and these secondary structures fold up further.

Attempts to predict the three-dimensional structure of a protein from its amino acid residue sequence began in the 1960s, immediately after the structure of myoglobin was determined by X-ray crystallography. In the 1970s, secondary structure prediction, which predicts which part of a protein sequence adopts which secondary structure, was studied mainly with statistical methods. These methods determine statistically, from proteins of known structure, in which secondary structure each of the 20 kinds of amino acid residue is most often found, and use this to predict the secondary structure of each residue. However, the propensity of each of the 20 residue types for a particular secondary structure turned out not to be very strong, and very few protein structures (a few dozen) were known at the time, so prediction accuracy was very poor.

In the 1980s, physicists in particular tried to analyze the protein folding phenomenon itself physically, and much research was done, using lattice models and other simplified models of real proteins, on questions such as what conditions are required for folding to succeed (Go N., Abe H., (1983) Randomness of the process of protein folding. Int. J. Pept. Protein Res. 22, 622-632).

However, neither experiments nor simulations at the time provided sufficient knowledge about protein folding, so this research did not lead to practical protein structure prediction techniques.

In the 1990s, as the number of proteins of known structure passed one thousand, research began on a different question: rather than predicting the structure of a protein outright, which of the known protein structures is a given sequence likely to resemble? This is called fold recognition. The amino acid residue sequence of the protein to be predicted is fitted to the three-dimensional structure of a protein of known structure, and the compatibility between sequence and structure is calculated from properties such as which secondary structure each amino acid tends to occupy and whether it tends to lie on the protein surface or be buried in the interior. If the compatibility is high, the sequence is considered likely to adopt a structure similar to that highly compatible structure. This is sometimes called the sequence profile method (Non-Patent Document 3).

Against this background, Sippl (Non-Patent Document 1) proposed deriving a mean force field potential (potential of mean force) from the distribution of amino acid residues in the three-dimensional structures of proteins of known structure, and using it to evaluate the compatibility between a sequence and a structure (FIG. 5).

For amino acid residue types A and B among the 20 kinds, the distances at which they occur in the three-dimensional structures of proteins of known structure are processed statistically, and the logarithm of the frequency distribution over distance is taken as the potential.
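As a rough illustration of taking the logarithm of a distance frequency distribution as a potential, the sketch below bins observed distances for one residue-type pair and converts the counts into energies. The sign convention (negative logarithm, so that frequent distances are favorable), the bin width, the pseudocount, the omission of a reference state, and the name `distance_potential` are all assumptions made for this example; the description above states only that the logarithm of the frequency distribution is used.

```python
import math
from collections import Counter

def distance_potential(distances, bin_width=1.0, n_bins=15, pseudocount=1.0):
    """Turn observed A-B distances into a per-bin potential -ln(frequency).

    Assumed convention: frequent distances get low (favorable) energy.
    Energies are in units of kT; normalization against a reference
    state is deliberately left out of this sketch.
    """
    counts = Counter()
    for d in distances:
        b = int(d // bin_width)
        if 0 <= b < n_bins:
            counts[b] += 1
    # The pseudocount keeps empty bins finite instead of giving -ln(0).
    total = sum(counts.values()) + pseudocount * n_bins
    return [-math.log((counts[b] + pseudocount) / total) for b in range(n_bins)]

# Hypothetical distances (in angstroms) between residue types A and B.
observed = [5.1, 5.4, 5.8, 5.2, 6.1, 9.7, 5.5, 12.3]
pot = distance_potential(observed)
```

In this toy data the 5 to 6 angstrom bin is the most populated, so it receives the lowest energy.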

With the introduction of the sequence profile method and the mean force field potential, and with the number of proteins of known structure exceeding 10,000 in the 1990s, fold recognition reached an accuracy that makes it reasonably practical as a structure prediction technique. In parallel, comparative modeling, which predicts structure from sequence comparison on the basis that proteins with similar sequences have similar structures, was also developed. Using these methods, the structures of roughly half of the proteins of unknown structure can now be predicted with considerable confidence.

At the end of the 1990s, the fragment assembly method was proposed as a way to handle proteins with entirely new structural patterns, whose structures cannot be predicted by fold recognition or comparative modeling. In this method, fragments of a fixed length (for example, about 10 amino acid residues) are cut from the sequence of the protein to be predicted, the three-dimensional structures that each fragment can adopt are cut from the structure data of proteins of known structure, and these are joined to build the whole structure (FIG. 7, FIG. 8). The method rests on the expectation that, even if the overall structure is entirely new, once it is broken into fragments of about 10 residues, typical secondary structures and the like are already contained in known protein structures, so the whole structure can be built by joining them.

Various fine improvements were made, for example to the way candidate structure fragments are selected for each sequence fragment, and as a result the fragment assembly method can now sometimes predict three-dimensional structures quite accurately.

Fold recognition and the fragment assembly method have brought protein structure prediction to a reasonably practical stage, but this does not mean that the mechanism of protein folding has been elucidated. In the fragment assembly method, about 100 candidate structures are assigned to each sequence fragment; starting from a randomly generated three-dimensional structure, candidates are fitted to each part of the sequence, the compatibility of the whole structure is calculated, and the structure is made to converge so that the compatibility increases. This is an application of combinatorial optimization. Convergence is faster than with, for example, a completely random Monte Carlo method, and the final structure is sometimes quite close to the correct one. Even so, this does not reveal the mechanism by which a protein builds its structure, that is, how it folds.

From the latter half of the 1990s, attempts also began to obtain the protein folding process by numerical simulation, and for small proteins (a few dozen amino acids) results were published reporting that simulation yielded stable structures reasonably close to the correct one. However, proteins fold in aqueous solution, and various problems remain, such as the difficulty of introducing the solvent effect of the aqueous solution into the simulation; the computation time is also not at a practical level. Even so, it is of real value that these simulations have made it possible to follow the protein folding process by means other than experiment.

Non-Patent Document 1: Sippl M.J., (1990) "Calculation of Conformational Ensembles from Potentials of Mean Force: An Approach to the Knowledge-based Prediction of Local Structure in Globular Proteins." J. Mol. Biol., 213, 850-883
Non-Patent Document 2: Onizuka K., Noguchi T., Akiyama Y., Matsuda H., (2002) "Using Data Compression for Multidimensional Distribution Analysis" IEEE Intelligent Systems 17(3), 48-54
Non-Patent Document 3: Bowie J.U., Luthy R.L., Eisenberg D., (1991) "A Method to Identify Protein Sequence That Fold into a Known Three-Dimensional Structure." Science, 256, 164-170

The greatest problem with prior-art protein structure prediction stems from the fact that the folding mechanism is unknown. Because of this, structure optimization, the final step of protein structure prediction, approaches the optimal structure by changing the structure randomly, as in the fragment assembly method described above. The optimization therefore performs wasteful searches, prediction accuracy is unstable, and the computation time required for prediction becomes very long.

To solve this problem, knowledge is needed about the processes or states through which a protein passes on its way from the fully extended main chain, before folding begins, to the state with the final stable structure. This knowledge can be obtained by developing a method for predicting the order of folding.

The problem to be solved by the present invention is therefore how to estimate the order of protein folding: that is, to develop a method that identifies, from the characteristics of the amino acid residue sequence of a protein, which part of the protein begins to fold first, which part folds next, and which part folds last.

Assume that the principles of statistical mechanics govern protein folding, and that when each partial sequence fragment within the whole amino acid residue sequence of a protein takes a possible partial structure fragment, the dwell time in that state is determined statistically by the energy value of that partial structure fragment and the ambient temperature.

On this premise, the order of protein folding is estimated by a method comprising: a step of fitting the partial sequence fragment s(n,i), consisting of n consecutive residues starting from the i-th amino acid residue of the amino acid residue sequence S of a given protein, to every possible partial structure fragment c(k,n,j) of length n residues of K template protein structures selected from a protein three-dimensional structure database (where k denotes one of the K templates, n is the fragment length, and j indicates that the structure fragment starts at the j-th amino acid residue), and calculating the energy value E(n,i,k,j) obtained when the sequence fragment s(n,i) takes the structure of the structure fragment c(k,n,j); a step of calculating the standard deviation σE(n,i) of the energy values obtained by calculating E(n,i,k,j) for all possible k, j; and a step of calculating this standard deviation σE(n,i) for all possible i in the sequence S. Folding is estimated to start at the location of the sequence fragment with i=I at which σE(n,i) is largest, and then to proceed through the regions of the corresponding partial sequences in descending order of σE(n,i) (claim 1).
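The three steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: positions are 0-based here, each template is represented as a plain string of per-residue states, and the energy function is supplied by the caller. The names `sigma_profile`, `folding_order`, and `toy_energy`, and the toy scoring rule itself, are hypothetical.

```python
from statistics import pstdev

def sigma_profile(sequence, templates, n, energy):
    """Return sigma_E(n, i) for every start position i (0-based) of `sequence`.

    `templates` holds the K template structures; `energy(seq_frag, struct_frag)`
    is a caller-supplied scoring function standing in for E(n, i, k, j).
    """
    sigmas = []
    for i in range(len(sequence) - n + 1):
        s = sequence[i:i + n]                      # s(n, i)
        energies = [
            energy(s, tmpl[j:j + n])               # E(n, i, k, j)
            for tmpl in templates                  # all k
            for j in range(len(tmpl) - n + 1)      # all j
        ]
        sigmas.append(pstdev(energies))            # population standard deviation
    return sigmas

def folding_order(sigmas):
    """Start positions sorted by descending sigma_E (estimated folding order)."""
    return sorted(range(len(sigmas)), key=lambda i: -sigmas[i])

def toy_energy(seq_frag, struct_frag):
    # Purely illustrative: residue 'A' is rewarded when placed in state 'H'.
    return sum((-1.0 for aa, st in zip(seq_frag, struct_frag)
                if aa == "A" and st == "H"), 0.0)

templates = ["HHHHCCCC", "CCCCEEEE"]
sigmas = sigma_profile("AAAGGG", templates, 3, toy_energy)
order = folding_order(sigmas)
```

With this toy scoring, the window starting at position 0 has the widest energy spread and is ranked first, while the all-glycine window has zero spread and is ranked last.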

Suppose that a partial sequence fragment s of n amino acid residues has energy E(s,c) when it takes partial structure fragment c, and energy E(s,c') when it takes another partial structure fragment c'. With T the temperature of the system and K the Boltzmann constant (here K denotes the Boltzmann constant, not the number of templates), the principles of statistical mechanics give the statistical ratio between the dwell time in structure c and the dwell time in structure c' as exp[(E(s,c)-E(s,c'))/KT]. If the mean energy <E(s,c)> over the various structures the partial sequence fragment can take is adopted as an estimate of the energy of the random structure before folding has progressed, then the difference E(s,c')-<E(s,c)> between the energy of a particular structure c' and the estimated energy of the random structure indicates how long the fragment dwells in structure c'. If this difference is a large negative value, the sequence fragment should change from the random structure to structure c' within a very short time. Hence, for a given sequence fragment, the standard deviation sqrt[<E(s,c)²>-<E(s,c)>²] of the energies obtained over the various structures is a measure of how much the energy varies with structure. When this value is large, the fragment should move relatively quickly to a low-energy structure. When it is small, the fragment is expected to remain random and keep trying various structures for some time. Therefore, by taking every possible sequence fragment of length n residues from the full protein sequence and calculating, for each, the standard deviation of the energies obtained by fitting it to various structure fragments, one can estimate how quickly each sequence fragment changes toward its particular energy-optimal structure. The location in the whole sequence where this fragment energy standard deviation is largest is expected to be the first to move from random toward a particular favorable structure fragment, so the location of that sequence fragment is taken to be the point where folding starts.
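For reference, the two relations used in this paragraph can be written compactly as below. The symbol τ for dwell time and the notation k_B for the Boltzmann constant (written K in the text) are introduced here for illustration, and reading the ratio as τ(c')/τ(c) is an assumption: it is the orientation under which the stated exponent agrees with the usual Boltzmann weighting, in which the lower-energy structure is occupied longer.

```latex
\frac{\tau(c')}{\tau(c)} = \exp\!\left[\frac{E(s,c) - E(s,c')}{k_B T}\right],
\qquad
\sigma_E(s) = \sqrt{\langle E(s,c)^2 \rangle - \langle E(s,c) \rangle^2}
```

Under this reading, E(s,c') < E(s,c) gives a ratio greater than 1, so the fragment spends more time in c' than in c.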

In the step of fitting each s(n,i) to c(k,n,j), calculating the energy value, and repeating this for all k, j, the method further includes a step of storing the k, j at which the energy value is smallest. For the fragment with i=I at which the standard deviation σE(n,i) is largest, the structure fragment c(k,n,j) that gave this smallest energy value can then be predicted to be very close to the structure that the partial sequence s(n,i=I) takes in the final three-dimensional structure (claim 2).
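A minimal sketch of remembering the lowest-energy (k, j) while scanning the templates might look like this. The function name, the string representation of templates, and the mismatch-counting energy are hypothetical stand-ins chosen only to make the example runnable; indices are 0-based.

```python
def best_structure_fragment(seq_frag, templates, energy):
    """Return (E_min, k, j) for the template fragment with the lowest energy."""
    n = len(seq_frag)
    best = None
    for k, tmpl in enumerate(templates):
        for j in range(len(tmpl) - n + 1):
            e = energy(seq_frag, tmpl[j:j + n])
            # Keep only the running minimum, so no energy list has to be stored.
            if best is None or e < best[0]:
                best = (e, k, j)
    return best

# Toy energy: number of positions whose state differs from an assumed preference.
PREFERRED = {"A": "H", "V": "E", "G": "C"}

def mismatch_energy(seq_frag, struct_frag):
    return float(sum(PREFERRED[aa] != st for aa, st in zip(seq_frag, struct_frag)))

best = best_structure_fragment("AAV", ["CCHHEE", "HHHCCC"], mismatch_energy)
```

Here the fragment "HHE" at position 2 of the first template matches the assumed preferences exactly, so it is the one recorded.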

As described above, the three-dimensional structure of a protein is a folded arrangement of relatively regular secondary structures such as α helices and β sheets. Thus, once folding begins, partial structures that were spatially far apart in the random-structure stage approach one another and form β sheets or α-helix bundles. For these secondary structures to fold together, the partial structures that will eventually approach one another to form a β sheet or α-helix bundle must already have shapes suited to approaching or joining. For example, the two parts that form a β sheet must both be β strands. If one of them were an α helix, it would first have to be unwound and extended before a sheet could form.

In addition, when two partial structures approach each other, the main-chain structure of the segment connecting them must permit, or favor, that approach. However much two structures would be stabilized by forming a sheet or a bundle, if the connecting segment has a main-chain structure that does not permit it, the two structures never approach each other and so are never stabilized. Parts that do not adopt a stable secondary structure such as an α helix or a β sheet (or its constituent β strands) are turn structures or loop structures made of random coil. Because stabilization by hydrogen bonds and the like is not very pronounced in these parts, they are relatively unstable structures. Consequently, when the standard deviation of the fragment sequence energy described above is calculated for such parts, it is relatively small. A β strand is also energetically unstable on its own, so sequence fragments in such regions likewise show a relatively small standard deviation.

It follows that, by calculating the standard deviation of the sequence fragment energy, regions where the value is large are predicted to become α helices, and regions where it is small are predicted to become β strands or loop regions.

The above outlines the calculation of the energy standard deviation of sequence fragments for predicting the folding start position and folding order of a protein, and its effects.

The key to realizing this technique is how to calculate the energy value of a sequence fragment in a particular structure. Unless this energy value reflects reality sufficiently well, the predictions of both the folding start position and the folding order will be highly inaccurate.

Accordingly, in the step of fitting sequence fragment s(n,i) to structure fragment c(k,n,j) and calculating the energy, the folding order prediction described above becomes possible by calculating the energy as the sum of mean force field potentials extracted from, and defined by, a protein three-dimensional structure database (claim 3).
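Computing a fragment energy as a sum of pairwise potentials can be sketched as below. The data layout is an assumption made for this example: one representative (x, y, z) coordinate per residue taken from the template fragment, and a lookup table keyed by (residue type, residue type, distance bin). The function name, bin width, and table values are hypothetical.

```python
import math
from itertools import combinations

def fragment_energy(seq_frag, coords, potential, bin_width=2.0, default=0.0):
    """Sum pair potentials over all residue pairs of one fragment.

    seq_frag: residue types of s(n, i), e.g. "ALV".
    coords: one (x, y, z) per residue of the structure fragment c(k, n, j).
    potential: dict mapping (type_a, type_b, distance_bin) -> energy.
    Keys are looked up in sequence order only; a symmetric table would
    need both (a, b) and (b, a) entries.
    """
    total = 0.0
    for a, b in combinations(range(len(seq_frag)), 2):
        d = math.dist(coords[a], coords[b])
        key = (seq_frag[a], seq_frag[b], int(d // bin_width))
        total += potential.get(key, default)   # unseen pairs contribute `default`
    return total

# Three residues spaced 3.8 angstroms apart along a line (toy geometry).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
potential = {("A", "L", 1): -0.5, ("L", "V", 1): -0.2, ("A", "V", 3): 0.3}
e = fragment_energy("ALV", coords, potential)
```

The same sequence fragment scored against a different set of coordinates gives a different sum, which is what produces the spread of E(n,i,k,j) over k and j.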

Furthermore, the mean force field potential is defined on the basis of the amino acid residue types forming the pair, their relative position along the sequence (the relative position of the i-th and j-th amino acid residues is given by i-j), and the relative arrangement of the amino acid residues in space. In particular, as the relative arrangement in space, a multi-dimensionally extended mean force field potential is used that takes into account not only the distance between the residues but also the position and orientation of one residue as seen from the other. Calculating the energy of structure fragments with this potential makes the folding order prediction more accurate (claim 4).
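One way to picture "the position of one residue as seen from the other" is to attach a local coordinate frame to the first residue and express the second residue in that frame. The sketch below does this with three backbone atoms; the frame construction, the binning, and every function name are assumptions for illustration only, and the orientation component mentioned in the text is omitted to keep the example short.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(a / norm for a in u)

def local_frame(n_atom, ca_atom, c_atom):
    """Orthonormal axes attached to one residue (assumed construction)."""
    x = unit(sub(c_atom, ca_atom))
    z = unit(cross(x, sub(n_atom, ca_atom)))
    y = cross(z, x)
    return x, y, z

def relative_position(frame, origin, point):
    """Coordinates of `point` expressed in the frame located at `origin`."""
    v = sub(point, origin)
    return tuple(dot(v, axis) for axis in frame)

def potential_key(type_a, type_b, seq_sep, rel_pos, bin_width=2.0):
    """Hypothetical lookup key: residue types, i - j, binned relative position."""
    bins = tuple(math.floor(c / bin_width) for c in rel_pos)
    return (type_a, type_b, seq_sep, bins)

# Toy backbone atoms of residue i, and the C-alpha of residue j = i + 3.
frame = local_frame((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0))
rel = relative_position(frame, (0.0, 0.0, 0.0), (3.0, 4.5, -1.0))
key = potential_key("A", "L", -3, rel)
```

A distance-only potential would collapse all three relative coordinates into a single number; keeping them separate is what makes the table multi-dimensional.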

With this method, folding simulations based on the fragment assembly method were carried out for several proteins, and it was confirmed that the folding order observed when the correct three-dimensional structure was finally obtained broadly agreed with the folding order estimated from the analysis based on the standard deviation of the sequence fragment energy values.

An outline of the invention is given here.

An amino acid residue sequence S of the protein whose structure is to be predicted is given. Let the length of S be N. A fragment sequence s(n,i) of length n residues is taken starting from the i-th amino acid residue of S. From the protein three-dimensional structure database PDB, K structures that are as diverse as possible are selected and used as structure templates. For the statistics to be stable, K must be 1000 or more. Let C(k) denote the k-th selected structure, and let c(k,n,j) denote the partial structure fragment of length n residues starting from the j-th amino acid residue of C(k).

The sequence s(n,i) is fitted to c(k,n,j), and the energy value E(n,i,k,j) given by the mean force field potential is calculated. This is done for all k, j, and the standard deviation σE(n,i) of E(n,i,k,j) is calculated. The standard deviation σE(n,i) is then calculated for all possible i.
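Because K is 1000 or more and each template contributes many start positions j, the number of energy values per sequence fragment can be large. One possible way to avoid storing them all is to accumulate the sum of E and of E² and apply the form sqrt[<E²>-<E>²] given earlier in the description. The class below is a sketch under that assumption; its name and interface are hypothetical.

```python
import math

class EnergyStats:
    """Accumulate energy values one at a time; report mean and standard deviation."""

    def __init__(self):
        self.count = 0
        self.total = 0.0
        self.total_sq = 0.0

    def add(self, e):
        self.count += 1
        self.total += e
        self.total_sq += e * e

    def mean(self):
        return self.total / self.count

    def std(self):
        # sqrt(<E^2> - <E>^2); max() guards against a tiny negative
        # variance caused by floating-point rounding.
        var = self.total_sq / self.count - self.mean() ** 2
        return math.sqrt(max(var, 0.0))

stats = EnergyStats()
for e in [1.0, 2.0, 3.0, 4.0]:   # stand-ins for E(n, i, k, j) over all k, j
    stats.add(e)
```

For energies with a large mean and a small spread, this two-sum form can lose precision; a numerically stabler running-variance update could be substituted without changing the interface.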

In view of the size and stability of secondary structures and the diversity of partial structures in the database, the fragment length n is preferably from 5 to 17. Suppose the largest of the resulting standard deviations σE(n,i) occurs at i=I. Then the partial sequence consisting of the I-th to the (I+n-1)-th amino acid residues of the protein to be predicted is predicted to fold at the earliest stage, and the resulting folded structure is predicted to be very close, in the final three-dimensional structure, to the structure fragment c(k,n,j) for which E(n,i=I,k,j) is smallest. Suppose the second-largest standard deviation occurs at i=I₂. Provided that the sequence fragment s(n,i=I) does not overlap the sequence fragment s(n,i=I₂) on the original sequence S, that is, provided I₂>I+n-1 or I>I₂+n-1, the two are independent. The part at i=I₂ then begins folding second, and its fragment sequence is considered to take, in the final three-dimensional structure, a structure very close to the c(k,n,j) for which E(n,i=I₂,k,j) is smallest.
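The independence condition above (I₂>I+n-1 or I>I₂+n-1) amounts to requiring that two start positions differ by at least n. A greedy selection of mutually independent fragments in descending order of σE can be sketched as follows; skipping a candidate that overlaps an already chosen fragment is an assumption of this sketch, since the text does not say how overlapping candidates are handled, and the function name is hypothetical.

```python
def independent_fragments(sigmas, n):
    """Pick start positions in descending sigma order, skipping overlaps.

    Two fragments of length n starting at a and b are independent when
    abs(a - b) >= n, which is the same as b > a + n - 1 or a > b + n - 1.
    """
    chosen = []
    for i in sorted(range(len(sigmas)), key=lambda idx: -sigmas[idx]):
        if all(abs(i - c) >= n for c in chosen):
            chosen.append(i)
    return chosen

# Hypothetical sigma_E(n, i) profile for a short sequence, with n = 3.
sigmas = [0.2, 0.9, 0.8, 0.1, 0.3, 0.7, 0.4]
picked = independent_fragments(sigmas, 3)
```

In this toy profile, position 2 has the second-largest value but overlaps position 1, so the next independent fragment is the one starting at position 5.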

In this way, the part of the sequence S from which folding starts can be predicted. The important point is that the same set of three-dimensional structure templates C(k) is always used when calculating the standard deviation of the energy values of the sequence fragments. If different templates were used for different fragments, there would be no common basis for comparing their standard deviations.

Determining how protein folding proceeds is expected to make both protein structure prediction and the design of proteins that adopt a given three-dimensional structure more accurate and faster. Many conventional structure prediction methods optimize the energy of a three-dimensional structure with techniques such as simulated annealing (SA), in which a randomly generated structure is deformed randomly little by little and the energy value is recalculated each time, so that the structure gradually approaches the minimum energy. With the present invention, no random structure is generated. Instead, starting from an initial structure, each part can be folded in the folding order suggested by the standard deviation σE(n,i) of each fragment. The optimal structure is therefore expected to be obtained faster than with random optimization, and the accuracy of the prediction is expected to improve substantially. However, the structure optimization algorithm itself, such as how to assign priority when the folding orders obtained by the present invention conflict during structure optimization, is not part of the present invention.

The present invention consists of the following parts.

1) Input unit for the amino acid residue sequence of the protein whose structure is to be predicted (1 in FIG. 1)
This unit stores the amino acid residue sequence S of the protein whose structure is to be predicted in an array in the memory of the computing system.

2) Template protein structure reading unit (2 in FIG. 1)
This unit reads, one at a time, the three-dimensional structures C(k) selected from the protein three-dimensional structure database PDB.

3) Standard deviation calculation unit (3 in FIG. 1)
Each time a structure C(k) is read, this unit fits every sequence fragment s(n,i) to each partial structure fragment c(k,n,j) of C(k) and obtains E(n,i,k,j) from the energy calculation unit based on the mean force field potential, described next. In preparation for later three-dimensional structure prediction by structure optimization, whenever E(n,i,k,j) is smaller than every value of E(n,i,k,j) calculated so far for that sequence fragment, the corresponding k and j are stored.

4) Energy calculation unit based on the mean force field potential (4 in FIG. 1)
This unit receives a sequence fragment s(n,i) and a partial structure fragment c(k,n,j) from the standard deviation calculation unit and calculates the mean force field potential energy obtained when s(n,i) is fitted to c(k,n,j). The mean force field potential is calculated for every pair among the n amino acid residues in the fragment, and the sum is taken as the energy value of the partial structure fragment c(k,n,j) when it carries the sequence fragment s(n,i). This energy value is returned to the standard deviation calculation unit. FIG. 6 (SEQ ID NO: 1) and FIG. 7 are conceptual diagrams of the mean force field potential; FIG. 7 illustrates its multidimensional extension.
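
The pairwise summation performed by the energy calculation unit might look like the following minimal sketch. The name `pair_potential` and its argument list are hypothetical stand-ins for a lookup into the mean force field potential tables, and each unordered residue pair is counted once, which is an assumption because the text only says "all pairs".

```python
from itertools import combinations


def fragment_energy(seq_fragment, struct_fragment, pair_potential):
    """Sum a pair potential over every residue pair of one fragment.

    seq_fragment:    the n residue codes of s(n, i)
    struct_fragment: the n per-residue geometry records of c(k, n, j)
    pair_potential:  hypothetical callable
                     (aa_a, aa_b, geom_a, geom_b, separation) -> float
    """
    if len(seq_fragment) != len(struct_fragment):
        raise ValueError("sequence and structure fragments must have equal length")
    total = 0.0
    for a, b in combinations(range(len(seq_fragment)), 2):
        total += pair_potential(
            seq_fragment[a], seq_fragment[b],
            struct_fragment[a], struct_fragment[b],
            b - a,  # separation along the sequence
        )
    return total


# Toy potential that just returns the sequence separation of the pair.
toy = lambda aa_a, aa_b, geom_a, geom_b, separation: float(separation)
print(fragment_energy("ACDE", [None] * 4, toy))  # prints 10.0
```

A fragment of n residues contributes n(n-1)/2 pair terms, so the cost of one energy evaluation grows quadratically with the fragment length.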

The calculation proceeds as follows for the given full sequence S. In unit 2 of FIG. 1, the outermost loop runs over k and reads the selected three-dimensional structures C(k) one at a time. Inside it, a loop runs over j, and the innermost loop runs over i. With this ordering there is no need to keep in memory all of the structure fragments c(k,n,j) that have been read, and the calculation of the mean force field potential can be made faster.
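
The loop ordering can be sketched as below, under several assumptions. `RunningStats`, `scan_templates`, and the `energy` callable are hypothetical names; templates are represented as plain sequences of per-residue records; and the standard deviation is accumulated with Welford's online update, a technique chosen here so that no energy values need to be stored, not one named in the text. The accumulator also records the (k, j) of the lowest energy seen for each i.

```python
import math


class RunningStats:
    """Online mean/variance (Welford) plus the (k, j) of the lowest energy."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0
        self.best_energy = math.inf
        self.best_kj = None

    def add(self, energy, k, j):
        self.count += 1
        delta = energy - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (energy - self.mean)
        if energy < self.best_energy:
            self.best_energy = energy
            self.best_kj = (k, j)

    def std(self):
        # Population standard deviation (an assumption).
        return math.sqrt(self.m2 / self.count) if self.count else 0.0


def scan_templates(sequence, templates, n, energy):
    """Return one RunningStats per start position i of the sequence."""
    num_i = len(sequence) - n + 1
    stats = [RunningStats() for _ in range(num_i)]
    for k, template in enumerate(templates):        # outer loop: structures C(k)
        for j in range(len(template) - n + 1):      # middle loop: start j
            struct_fragment = template[j:j + n]     # c(k, n, j), used then dropped
            for i in range(num_i):                  # inner loop: start i
                stats[i].add(energy(sequence[i:i + n], struct_fragment), k, j)
    return stats
```

Because `templates` is only iterated once, it could equally be a generator that reads one structure file at a time.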

The mean force field potential used in the energy calculation unit (4 in FIG. 1) is a multidimensional extension (FIG. 7) of the potential of Sippl's Document 1 (Non-Patent Document 1) (FIG. 6). Not only the distance between amino acid residues in the three-dimensional structure but also the direction and the orientation are binned, and the frequency distribution is obtained as a histogram. For the direction of amino acid residue B relative to A, a local coordinate system is defined with the Cα atom of residue A at the origin, the N atom in the XZ plane, and the Cβ atom on the Z axis. The position of the Cα atom of residue B in this local coordinate system is expressed in polar coordinates and binned by zenith angle θ and longitude φ. For the orientation, the angle θ of the Euler angles (θ,φ,ψ) of the rotation that brings residue B into the same orientation as A is used. This corresponds to the angle between the vector from the Cα atom to the Cβ atom of residue A and the same vector of residue B. The other Euler angles (φ,ψ) are ignored. Unlike Document 2 (Non-Patent Document 2), no data compression is applied here; a simple histogram is used. The number of bins is 3 for the zenith angle (the cap around the zenith and the cap around the opposite pole are not subdivided by angle, and each cap spans a zenith angle of π/6), 6 for the longitude, and 3 for the Euler angle θ. The distance is binned in steps of 1 Å over the range 0 to 12 Å.
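
The local coordinate construction can be made concrete with a short sketch. The function names and the dictionary layout of a residue are hypothetical, the N atom is assumed to lie on the positive-X side of the XZ plane (the text does not fix the sign), and residues without a Cβ atom, such as glycine, would need a separately constructed virtual Cβ that is not shown here. Only the distance binning is included, because the exact zenith-angle bin boundaries are not fully specified in the text.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(a):
    length = math.sqrt(dot(a, a))
    return tuple(x / length for x in a)


def relative_geometry(res_a, res_b):
    """Return (r, zenith, longitude, orientation) of residue B seen from A.

    Each residue is a dict with 'N', 'CA', 'CB' mapped to (x, y, z) tuples.
    """
    z_axis = unit(sub(res_a["CB"], res_a["CA"]))            # CB on the Z axis
    to_n = sub(res_a["N"], res_a["CA"])
    # Remove the Z component so that N lies in the XZ plane.
    x_axis = unit(sub(to_n, tuple(dot(to_n, z_axis) * c for c in z_axis)))
    y_axis = cross(z_axis, x_axis)
    p = sub(res_b["CA"], res_a["CA"])                        # CA of B, origin at CA of A
    r = math.sqrt(dot(p, p))
    zenith = math.acos(max(-1.0, min(1.0, dot(p, z_axis) / r)))
    longitude = math.atan2(dot(p, y_axis), dot(p, x_axis))
    cb_b = unit(sub(res_b["CB"], res_b["CA"]))
    orientation = math.acos(max(-1.0, min(1.0, dot(z_axis, cb_b))))
    return r, zenith, longitude, orientation


def distance_bin(r, width=1.0, max_r=12.0):
    """Index of the 1 Å distance bin, or None outside the 0 to 12 Å range."""
    return int(r // width) if 0.0 <= r < max_r else None
```

The orientation value is the angle between the two Cα to Cβ vectors, which is the quantity the text identifies with the Euler angle θ.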

As in Sippl's original method, net potentials (the term used in that literature) are used to select candidate structure fragments.

(Example)
In making the present invention, the first step was to use an independently developed protein three-dimensional structure prediction system to observe the process leading from a random structure to the final three-dimensional structure (FIG. 2, FIG. 3).

The protein three-dimensional structure prediction system used the fragment assembly method (FIG. 8 (SEQ ID NO: 2), FIG. 9 (SEQ ID NO: 3)).

The multidimensional mean force field potential was used to select candidate structure fragments in the fragment assembly method. The same multidimensional mean force field potential was also used for structure optimization in the fragment assembly method (FIG. 5, FIG. 6).

Proteins of known structure (1QKK and 1STM, both registered in the PDB) were used as prediction targets. The structure prediction system was run under a variety of conditions so that each protein reached a structure close to its X-ray structure registered in the PDB, and the course of structure optimization was observed.

This observation showed that, for both proteins, folding proceeds through a fairly consistent series of stages on the way from the initial random structure to a form close to the X-ray structure (FIG. 3). Because the fragment assembly method is used, secondary structures form at nearly the correct positions and in the correct shapes at a relatively early stage. That is, the positions of α-helices and β-strands become almost fixed quite early in the optimization by fragment assembly. In the subsequent stage, in which these elements fold into the overall structure, a folded region with nearly the same shape as in the final structure first appears in a specific region of the sequence. This region then acts as a nucleus around which the surrounding main chain wraps, and the final structure is reached.

This observation indicates that a protein sequence contains parts that begin folding at a relatively early stage, and that these parts act as nuclei that lead to the overall structure.

We therefore considered that being able to predict which part of a protein sequence begins folding first would help elucidate the folding mechanism.

Accordingly, at the stage of selecting candidate structure fragments for the sequence fragments from the database, the mean and the standard deviation of the mean force field potential energy over the fitted structures were calculated for every sequence fragment (FIG. 4, FIG. 5). The standard deviation was found to vary widely from one sequence fragment to another. For some sequence fragments the energy value differs greatly among structures, giving a very large standard deviation. For others the energy value changes little across structures, giving a small standard deviation. Combining this result with the folding process observed in the folding simulations above suggested that the fragments with a large standard deviation of the energy value may be the parts that begin folding first.

Each method is described in detail below.

2) Method of selecting candidate structure fragments for sequence fragments (FIG. 8)
With the fragment length set to W, all sequence fragments of length W are taken from the sequence of the target protein, starting at the N-terminus. Each fragment is fitted to every position of each of 2700 protein three-dimensional structures selected from the PDB, and candidate structure fragments are taken in descending order of fitness. The number of candidate structures per sequence fragment was limited to about 16 to 32. Because the target proteins 1QKK and 1STM are themselves among these 2700 proteins, a sufficiently accurate fitness measure always selects, for each fragment, the corresponding part of the target protein's own structure as a candidate.
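
Candidate selection for one sequence fragment can be sketched as a top-N search. This is an illustration only: `select_candidates` and the `energy` callable are hypothetical, and "fitness" is assumed here to mean low mean force field energy, so the candidates with the lowest energies are kept.

```python
import heapq


def select_candidates(seq_fragment, templates, energy, max_candidates=32):
    """Return up to max_candidates tuples (energy, k, j), lowest energy first."""
    w = len(seq_fragment)
    scored = (
        (energy(seq_fragment, template[j:j + w]), k, j)
        for k, template in enumerate(templates)
        for j in range(len(template) - w + 1)
    )
    # nsmallest keeps only max_candidates items in memory while scanning.
    return heapq.nsmallest(max_candidates, scored)


# Toy templates whose "structure" is a list of numbers, scored by the first one.
toy_templates = [[5, 1, 4], [2, 3]]
toy_energy = lambda seq, struct: float(struct[0])
print(select_candidates("A", toy_templates, toy_energy, max_candidates=2))
# prints [(1.0, 0, 1), (2.0, 1, 0)]
```

Each returned tuple identifies a structure fragment c(k,W,j) by its template index k and start position j.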

3) Mean force field potential (FIG. 6, FIG. 7)
The mean force field potential is a multidimensional extension (FIG. 7) of Sippl's potential (FIG. 6). Not only the distance between amino acid residues in the three-dimensional structure but also the direction and the orientation are binned, and the frequency distribution is obtained as a histogram. For the direction of amino acid residue B relative to A, local coordinates (r,θ,φ) are defined with the Cα atom of residue A at the origin, the N atom in the XZ plane, and the Cβ atom on the Z axis, and the direction is binned by zenith angle θ and longitude φ. For the orientation, the Euler angle θ of the rotation that brings residue B into the same orientation as A is used. This corresponds to the angle between the vector from the Cα atom to the Cβ atom of residue A and the same vector of residue B. The other Euler angles are ignored.

As in Sippl's original method (the Sippl document cited above), net potentials were used to select candidate structure fragments. For folding optimization, a slight cohesive force was added to them. The cohesive force was a very weak potential proportional to the square of the distance within a certain range.

4) Algorithm of the fragment assembly method (FIG. 9)
The fragment assembly method generally uses the SA method; here, elements of a genetic algorithm (hereinafter, the GA method) were added to it. First, 16 to 32 random structures are generated. Two operations are then repeated. In the first, one position in one of the structures is chosen, one of the candidate structure fragments for that position is selected, and the fragment region is replaced with it (a small deformation of the structure, corresponding to mutation in the GA method). In the second, performed at least once in every 10 iterations, a region of at least 2 residues and less than half the length of the full sequence is copied from one random structure to another (corresponding to crossover in the GA method). Following the SA method, whether each change is accepted is decided according to the temperature parameter. The temperature applied to the energy values was varied over the range 300 K to 1200 K in absolute temperature.
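
The three moves of this combined SA and GA scheme can be sketched as follows. Everything here is illustrative: the function names are hypothetical, a structure is reduced to a list of per-residue conformation labels, the acceptance test is the standard Metropolis rule (the text only says that acceptance follows the SA method and the temperature parameter), and `k_b=1.0` is a placeholder because the text does not give the scale that relates the energy values to temperature.

```python
import math
import random


def accept(delta_e, temperature, rng, k_b=1.0):
    """Metropolis rule: always accept downhill moves, sometimes uphill ones."""
    if delta_e <= 0.0:
        return True
    return rng.random() < math.exp(-delta_e / (k_b * temperature))


def mutate(structure, position, candidate_fragment):
    """Replace the fragment region at `position` with a candidate fragment.

    The caller must ensure position + len(candidate_fragment) <= len(structure).
    """
    new = list(structure)
    new[position:position + len(candidate_fragment)] = candidate_fragment
    return new


def crossover(donor, receiver, rng):
    """Copy a region of >= 2 residues and < half the length from donor.

    Requires at least 5 residues, otherwise no valid region length exists.
    """
    length = len(receiver)
    region_len = rng.randint(2, (length - 1) // 2)   # strictly less than half
    start = rng.randint(0, length - region_len)
    child = list(receiver)
    child[start:start + region_len] = donor[start:start + region_len]
    return child
```

A driver loop would apply `mutate` on most iterations and `crossover` at least once in every 10, score each trial structure with the potential, and pass the energy difference to `accept` while the temperature is varied between 300 K and 1200 K.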

Observation of the folding process in folding simulations by the fragment assembly method showed that, in most simulations, folding first progresses in specific sequence regions, then other parts also begin to fold, and these regions then influence one another until the overall structure is folded. Moreover, even when various parameters and conditions of the folding simulation were changed, runs that reached a final structure close to the X-ray structure always did so through almost the same process. To estimate the region from which folding starts, the standard deviation of the energy values obtained when a fragment sequence is fitted to many structures was therefore calculated. The locations where folding starts were found to coincide broadly with the sequence fragments that have a large standard deviation of energy.

The examples revealed the following points.

Protein folding is, in the end, considered to follow the principles of statistical mechanics. Each part of the sequence folds toward the structure with the lowest energy. Suppose that a sequence fragment A has energy E(A,X) when it takes structure X and E(A,Y) when it takes structure Y. If E(A,Y) is lower (more stable) than E(A,X), the time spent in structure Y is exp[(E(A,X)-E(A,Y))/kT] times longer than the time spent in structure X. A large standard deviation of the energy values over the various structures of a sequence fragment means that the energy of that fragment changes greatly with structure, so stabilization through lowering the energy of that part contributes greatly to the stabilization of the overall structure. For a sequence fragment with a small standard deviation, by contrast, stabilization of that part contributes little to the stabilization of the overall structure. In addition, because of the relation between residence times given by the statistical-mechanical principle above, a sequence fragment with a large standard deviation reaches a low-energy stable structure in a short time, whereas a fragment with a small standard deviation takes a long time to reach a stable structure. Therefore, in the overall structure of a protein in which many such fragments are connected, folding starts at the partial sequences with a large standard deviation, after which the fragments compete as each tries to stabilize.
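
A small numerical illustration of this argument follows. The function names and the toy energy values are made up, energies are expressed in units of kT, and the second function reports an equilibrium Boltzmann occupancy rather than a folding rate, so it illustrates the residence-time relation only and is not a kinetic model.

```python
import math


def residence_time_ratio(e_x, e_y, kt):
    """How many times longer the fragment stays in Y than in X."""
    return math.exp((e_x - e_y) / kt)


def lowest_state_occupancy(energies, kt):
    """Boltzmann weight of the lowest-energy structure among those given."""
    e_min = min(energies)
    weights = [math.exp(-(e - e_min) / kt) for e in energies]
    return 1.0 / sum(weights)


# Same mean energy, different spread (toy values in units of kT).
wide = [-3.0, 0.0, 3.0]     # large standard deviation
narrow = [-0.3, 0.0, 0.3]   # small standard deviation
print(residence_time_ratio(0.0, -3.0, 1.0))   # about 20.1
print(lowest_state_occupancy(wide, 1.0))      # about 0.95
print(lowest_state_occupancy(narrow, 1.0))    # about 0.44
```

With the wide spread the fragment spends most of its time in its best structure, whereas with the narrow spread no single structure dominates, which matches the contrast drawn between fragments with large and small standard deviations.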

A further viewpoint is needed to explain how parts that are distant in the sequence come close together and stabilize to form a sheet structure or a helix bundle (two α-helices approaching each other in parallel). First, for two such distant parts to approach and stabilize, the partial sequence lying between them must favor a specific structure that promotes the approach. In addition, each of the two parts that later join must already have taken a structure that makes joining easy. For a β-sheet to form from sequence fragments A and B, both A and B must be in the form of β-strands. Even if those parts would be stabilized once they finally form a sheet, if both have taken an α-helical structure at an early stage of folding, they will almost never form a β-sheet when they approach each other, either probabilistically or in terms of statistical mechanics. The sequence fragments that make up a β-sheet must therefore have a sequence pattern whose energy does not become low in an α-helix. The same applies to the random-structure parts that connect α-helices and β-sheets: such parts must have sequences whose energy does not become very low in either an α-helix or a β-strand. As a result, sequence fragments in loop regions have a small standard deviation of the energy value over various structures. They do not take a specific structure until relatively late in folding and remain random, and only when the other parts have reached sufficiently stable structures do they fold into the final loop structure, as if guided by their surroundings.

One more phenomenon needs further consideration: the part that folds first in the initial stage becomes a nucleus that draws in the surrounding parts, so that the nucleus grows. In a structure that has become a nucleus, each part is prone to nucleus formation, and the resulting structure is very stable, so the time spent in that state is long. This is why the surrounding unstable structures stabilize by binding to the nucleus.

Protein folding is often likened to origami. It is probably closer to origami made with paper coated in glue: if a wrong fold is made and the paper sticks, it is difficult to peel it apart and undo the fold. In other words, as has long been said, folding has an order. If the order is wrong, folding does not reach the final stable structure. What determines the order is the speed at which each partial sequence fragment in the full sequence changes toward its specific structure. The part that folds first is one that takes a very low energy in that structure, and the standard deviation of the energy of that sequence fragment is large. The part that folds last has no fixed structure until the end, which means that its speed of folding into a specific structure is low. Consequently, a "slow-folding sequence" that folds later does not necessarily reach the most stable structure with the lowest energy. If the parts of the full sequence compete with one another to reach the lowest-energy state, it is quite natural that the part with the strongest tendency toward a specific structure begins folding first and that the other parts determine their final structures under its influence.

Of course, the folding mechanism described here does not necessarily hold for all proteins. The interior of a living cell, where protein folding takes place, is a very complex environment. Not only the temperature but also the pH varies. Electrolytes, lipids, and other molecules are suspended in the surroundings, and many proteins of the same or different kinds are present. If proteins are "designed" to reach their target structure while avoiding these, it is a mistake to consider only the folding of an isolated molecule. If organisms have evolved and proteins have evolved with them, protein sequences have been tuned so that each protein functions, and folding takes place in a way suited to the environment in which each protein folds. Depending on these conditions, some proteins must fold in a very short time, while others may need to fold very slowly. There are also proteins that must fold only under specific conditions of pH, temperature, and so on. Building a theory that incorporates all of this is very difficult. At present, therefore, a universal method for protein structure prediction cannot be constructed.

Nevertheless, the folding mechanism described here is useful when considering artificial proteins that fold in an ideal environment. The folding mechanism obtained in this example rests on applying the principles of statistical mechanics to the partial sequence fragments within the full sequence, with the final structure reached through competition among the sequence fragments, so there is theoretically little room for doubt. When designing a protein with a specific target structure, one first considers the folding order that leads to the target structure and then arranges for the partial sequence fragments to fold toward their respective final structures at speeds that follow that order. In this way, the optimal sequence design algorithm can be presented as a definite procedure.

The present invention yields, as described below, a protein folding mechanism, a basic policy for the algorithm needed for protein structure prediction, and a basic policy for the sequence design of artificial proteins.

The conclusions are summarized in three points.

1) Protein folding mechanism
The mechanism of protein folding has been clarified. A partial sequence fragment in the amino acid residue sequence of a protein can take various structures, and the residence time in each structure follows the principles of statistical mechanics and is determined by the energy value of that structure. A protein folds as the partial sequences in the full sequence compete with one another, each heading toward the structure that gives it the lowest energy value.

2) Protein structure prediction
Protein structure prediction is carried out as follows. First, for each partial sequence fragment, the standard deviation of the energy values over the various structures it can take is calculated. Then, starting from the fragment with the largest standard deviation and proceeding in order, each fragment is folded toward the optimal structure that gives it the lowest permitted energy. If partial structures approach each other along the way, those partial structures and the part connecting them are adjusted and optimized so that the approach lowers the energy. Some wild-type proteins fold in special environments, so this method does not always predict the correct three-dimensional structure. However, many water-soluble proteins that are stable in ordinary environments are expected to be predictable by this method.

3) Sequence design of artificial proteins
To design the sequence of a protein with a target structure, a folding order that leads to the target structure is set, and each part of the sequence is designed to have a folding speed appropriate to its place in that order.

Once the three-dimensional structure of proteins can be predicted, the structures of many proteins of unknown structure can be predicted. This makes it possible to develop drugs that act on proteins for which only the sequence is known and to estimate protein-protein interactions based on three-dimensional structure, which is a major contribution to drug discovery, medicine, and biology in general. The present invention forms part of a protein three-dimensional structure prediction method.

FIG. 1: Configuration diagram of the apparatus
FIG. 2: Three-dimensional structure and schematic diagram of 1QKK used in the folding simulation
FIG. 3: Conceptual diagram of the simulation observation results
FIG. 4: Graph of the energy standard deviation of sequence fragments (for 1QKK)
FIG. 5: Three-dimensional structure of protein 1STM and graph of the energy standard deviation of its sequence fragments
FIG. 6: Conceptual diagram of the mean force field potential (prior art)
FIG. 7: Conceptual diagram of the multidimensional extension of the mean force field potential (prior art)
FIG. 8: Conceptual diagram explaining the selection of candidate structure fragments in the fragment assembly method (prior art)
FIG. 9: Conceptual diagram explaining the structure optimization method by fragment assembly combining the SA method and the GA method (prior art)

Claims

A protein folding order prediction method for estimating the folding process of a given protein, namely the order of folding (from which part of the sequence folding starts, how it then proceeds, and how the final structure is reached), the method comprising: a step of fitting a partial sequence fragment s(n,i), consisting of n consecutive residues starting at the i-th amino acid residue of the amino acid residue sequence S of the given protein, to every possible partial structure fragment c(k,n,j) of length n residues in K template protein structures selected from a protein three-dimensional structure database (where k denotes one of the K structures, n denotes the fragment length, and j indicates that the structure fragment starts at the j-th amino acid residue), and calculating the energy value E(n,i,k,j) obtained when the sequence fragment s(n,i) takes the structure of the structure fragment c(k,n,j); a step of calculating the energy value E(n,i,k,j) for all possible k and j and calculating the standard deviation σE(n,i) of the energy values obtained; and a step of calculating the standard deviation σE(n,i) for every possible i in the sequence S and estimating that folding starts at the location of the sequence fragment with i=I at which σE(n,i) is largest and then proceeds through the regions of the partial sequences in descending order of σE(n,i).

A partial structure prediction method according to claim 1, further comprising: a step of storing, in the step of fitting each s(n,i) to c(k,n,j) and calculating the energy value for all k and j, the values of k and j at which the energy value is smallest; and a step of predicting, for the fragment with i=I at which the standard deviation σE(n,i) is largest, that the structure fragment c(k,n,j) with the smallest energy value is very close to the structure that the partial sequence s(n,i=I) takes in the final three-dimensional structure.

A protein folding order prediction method according to claim 1 or claim 2, wherein, in the step of fitting the sequence fragment s(n,i) to the structure fragment c(k,n,j) and calculating the energy, the energy is calculated as the sum of mean force field potentials defined by extraction from a protein structure database.

A protein folding order prediction method according to claim 3, wherein the mean force field potential is defined by the types of the amino acid residues forming a pair, their relative position on the sequence (the relative position of the i-th and j-th amino acid residues being given by i-j), and the relative arrangement of the amino acid residues in space, and wherein the energy of a structure fragment is calculated using a multidimensionally extended mean force field potential that takes into account, as the relative arrangement in space, not only the distance between the amino acid residues but also the position and orientation of the other residue.