JP2551297B2

JP2551297B2 - Protein three-dimensional structure prediction method

Info

Publication number: JP2551297B2
Application number: JP12481792A
Authority: JP
Inventors: 拓馬見塚; 健司山西
Original assignee: Nippon Electric Co Ltd
Current assignee: NEC Corp
Priority date: 1992-05-18
Filing date: 1992-05-18
Publication date: 1996-11-06
Anticipated expiration: 2011-11-06
Also published as: JPH0713959A

Description

Detailed Description of the Invention

【０００１】[0001]

TECHNICAL FIELD The present invention relates to a method for predicting the three-dimensional structure of a protein from the amino acid sequence of a protein whose structure is unknown.

【０００２】[0002]

PRIOR ART One of the problems in predicting the three-dimensional structure within a protein from its amino acid sequence information is the protein secondary structure prediction problem. Secondary structure refers to coherent local structures inside a protein, such as the α-helix and the β-sheet. The secondary structure prediction problem is to predict, from the amino acid sequence information of a protein, the one secondary structure among three (or four) types that corresponds to each residue of the primary sequence (hereinafter, the residue to be predicted is called the central residue). It is thought that once secondary structure can be predicted, the three-dimensional structure of the protein can also be predicted. FIG. 3 is a schematic diagram showing the secondary structure (α-helix) region prediction method of the present invention. Conventional methods for predicting protein secondary structure include, for example: the paper "Prediction of protein conformation" by Chou and Fasman, published in the U.S. journal Biochemistry, vol. 23, pp. 222-245, 1974 (hereinafter, the CF method); the paper "Analysis of the accuracy and implications of simple method for predicting the secondary structure of globular proteins" by Garnier et al., published in the U.S. journal Journal of Molecular Biology, vol. 120, pp. 97-120, 1978 (hereinafter, the GOR method); the paper "Further developments of protein secondary structure prediction using information theory: New parameters and consideration of residue pairs" by Gibrat et al., published in the Journal of Molecular Biology, vol. 198, pp. 425-443, 1987 (hereinafter, the GGR method); and the paper "Predicting the secondary structure of globular proteins using neural network models" by Qian et al., published in the Journal of Molecular Biology, vol. 202, pp. 865-884, 1988 (hereinafter, the QS method). The CF method derives the statistical frequency of occurrence of each amino acid in each secondary structure from a database of protein structures, and uses this frequency table to make predictions based on empirical rules. The GOR method computes, for the secondary structure of the central residue, the sum of the information contributed independently by residues located several residues away from it, and predicts from the relative values. The GGR method predicts the secondary structure of the central residue from the sum of the information contributed by that residue together with residues located several residues away from it. The QS method uses a three-layer feedforward network whose input is the sequence containing the eight residues before and after the central residue, and predicts by using the neural network to extract the contributions of the central residue and the surrounding residues to the secondary structure.

【０００３】[0003]

PROBLEMS TO BE SOLVED BY THE INVENTION Prediction that selects, from among three types of secondary structure, the secondary structure corresponding to each residue of an amino acid sequence is called three-state prediction. The prediction rate, which is the measure of the prediction result, is in the 60% range for three-state prediction with every one of the conventional methods, and a prediction method with a higher prediction rate has been desired, even if limited to the α-helix alone. Furthermore, conventional predictions are residue-level predictions, which predict the secondary structure corresponding to each central residue in the primary sequence. Although region-level prediction, which determines which region of the primary sequence corresponds to which secondary structure, is also important, such prediction schemes have not been studied sufficiently. In addition, no prediction method had been established that treats the amino acid sequence not merely as a character string but also takes the properties of the amino acids (hydrophobicity, molecular weight, and so on) into account.

【０００４】[0004]

MEANS FOR SOLVING THE PROBLEMS The first invention is characterized in that, in order to predict the structure of a protein from its amino acid sequence, it comprises: a training data extraction step of extracting, using not only proteins of known structure but also proteins of unknown structure, training data consisting of positive examples, which are sequences corresponding to a certain three-dimensional structure, and negative examples, which are sequences known not to correspond to it; a learning step of learning a probabilistic rule from the training data consisting of the positive and negative examples; and a prediction step of predicting the structure of each region of test amino acid sequence data using the learned probabilistic rule.

【０００５】[0005] The second invention is characterized in that the training data extraction step comprises: a step of aligning proteins belonging to the same family against the amino acid sequence of a protein of known structure, and extracting the subsequences corresponding to the secondary structure region to be predicted as positive examples of that secondary structure region; and a step of aligning each sequence of a database of proteins of known structure against the subsequence of the known-structure protein that corresponds to the secondary structure to be predicted, and extracting subsequences that do not correspond to that secondary structure as negative examples of the secondary structure region.

【０００６】[0006] The third invention is characterized in that the learning step uses a probabilistic rule to estimate the real-valued parameters of that rule from the amino acid types in the training data consisting of the positive and negative examples.

【０００７】[0007] The fourth invention is characterized in that the learning step uses a probabilistic rule to estimate the real-valued parameters of that rule from the real-valued attributes of the amino acids in the training data consisting of the positive and negative examples.

【０００８】[0008] The fifth invention is characterized in that the learning step comprises: a step of using a probabilistic rule to estimate the real-valued parameters of that rule from the real-valued attributes of the amino acids in the training data consisting of the positive and negative examples; and a step of optimizing the model of the probabilistic rule using an information criterion.

【０００９】[0009] The sixth invention is characterized in that the prediction step comprises: a step of computing, using the probabilistic rule learned in the learning step, the activity of each region of the test amino acid sequence data; and a step of selecting the optimum value from among the computed activities.

【００１０】[0010]

EXAMPLES Next, the present invention will be described in detail with reference to the drawings.

【００１１】[0011] FIG. 1 is a flowchart illustrating an embodiment of the protein three-dimensional structure prediction method of the present invention. In this embodiment, the α-helix is treated as the target secondary structure.

【００１２】[0012] Step 10 is part of the second invention. In this step, proteins of the same family, for example the same protein from different species, are aligned against the amino acid sequence of a protein whose α-helix regions are known, and the subsequences corresponding to the α-helices are extracted as positive examples of the α-helix.

【００１３】[0013] For example, in the case of the β chain of the protein hemoglobin, the positions of the α-helices of human hemoglobin have been determined by X-ray crystallography, and the chain is known to have eight α-helix regions. Accordingly, the hemoglobin β chains of other species, for example chimpanzee and horse, are aligned against the human hemoglobin β chain, and the regions corresponding to the eight α-helices are extracted as positive examples of the α-helix.
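The positive-example extraction of step 10 can be sketched as follows. This is a minimal illustration rather than the patent's procedure: it assumes a pairwise alignment is already available as two equal-length gapped strings, the function name `extract_positive_examples` and the toy sequences in the usage note are hypothetical, and segments in which the homolog has a gap are simply skipped.

```python
def extract_positive_examples(ref_aligned, hom_aligned, helix_ranges):
    """Map helix ranges of a reference protein onto an aligned homolog.

    ref_aligned, hom_aligned: equal-length gapped strings ('-' marks a gap).
    helix_ranges: (start, end) pairs in ungapped reference coordinates,
    0-based, end exclusive (an assumed convention).
    """
    if len(ref_aligned) != len(hom_aligned):
        raise ValueError("aligned sequences must have equal length")
    # Alignment column of each ungapped reference residue.
    ref_to_col = [col for col, ch in enumerate(ref_aligned) if ch != "-"]
    positives = []
    for start, end in helix_ranges:
        cols = ref_to_col[start:end]
        segment = "".join(hom_aligned[c] for c in cols)
        # Keep only gap-free segments so every example has the helix length.
        if "-" not in segment:
            positives.append(segment)
    return positives
```

With the made-up alignment `"MV-HLTPEEK"` / `"MVQHLSAEEK"` and a helix at reference residues 3 to 7, the function returns the homolog fragment aligned to those residues.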

【００１４】[0014] Step 20 is part of the second invention. In this step, each sequence in an amino acid sequence database with known α-helix positions is aligned against the subsequence corresponding to an α-helix of a protein whose α-helix positions are known, and subsequences that do not correspond to an α-helix are extracted as negative examples relative to the positive examples of the α-helix extracted in step 10.

【００１５】[0015] In the hemoglobin β chain example, the subsequences corresponding to the eight α-helices are aligned against a number of sequences in a protein structure database such as the PDB (Protein Data Bank), and each subsequence obtained from the alignment is extracted as a negative example if its structure is not an α-helix. For instance, in the alignment for negative example extraction, subsequences that retain at least a certain degree of homology may be taken as negative examples. Specifically, one method is to take subsequences whose homology in the alignment is 30% or more as negative examples.
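The negative-example extraction with a 30% homology threshold can be sketched roughly as below. Several simplifications are assumptions of this sketch, not statements of the patent: alignment is replaced by an ungapped sliding window, homology is measured as percent identity, database entries carry a per-residue structure string in which `'H'` marks helix, and a fragment counts as "not an α-helix" only if it contains no `'H'` at all.

```python
def percent_identity(a, b):
    """Percentage of identical positions between two equal-length fragments."""
    if len(a) != len(b) or not a:
        raise ValueError("fragments must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def extract_negative_examples(helix_seq, db_entries, threshold=30.0):
    """Collect non-helical database fragments that resemble a known helix.

    db_entries: list of (sequence, structure) pairs of equal length,
    where structure uses 'H' for helical residues (assumed encoding).
    """
    w = len(helix_seq)
    negatives = []
    for seq, structure in db_entries:
        for start in range(len(seq) - w + 1):
            fragment = seq[start:start + w]
            if "H" in structure[start:start + w]:
                continue  # fragment overlaps a helix, so it is not a negative
            if percent_identity(helix_seq, fragment) >= threshold:
                negatives.append(fragment)
    return negatives
```

A real pipeline would use a gapped alignment tool and a similarity matrix in place of the identity count; the threshold of 30% is the value mentioned in the text.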

【００１６】[0016] As for the number of examples to extract, one option is to make the ratio of positive to negative examples the same for every region that serves as a positive example of the α-helix; another is, for example, to set that ratio so that the numbers of positive and negative examples are equal.

【００１７】[0017] Step 30 is common to the third, fourth, and fifth inventions, and is the step of estimating the real-valued parameters of the probabilistic rule. In this step, a probabilistic rule is used to estimate its real-valued parameters from the training data consisting of the positive examples obtained in step 10 and the negative examples obtained in step 20. The structure of the probabilistic rule used in this step is described below.

【００１８】[0018] Here, a probabilistic rule is a probability distribution that gives, for any given region of a sequence, the probability that the region corresponds to an α-helix. Let each $\chi_i$ ($i = 1, \dots, n$) be a space of attribute values, and write $\chi$ for their direct product, that is, $\chi = \chi_1 \times \chi_2 \times \cdots \times \chi_n$.

【００１９】[0019] For example, $\chi$ may be a single set consisting of the 20 types of amino acids, or $\chi = \chi_1 \times \chi_2$, where $\chi_1$ is the range of values representing hydrophobicity and $\chi_2$ is the range of values representing molecular weight. The former case is used in the third invention, and the other cases are used in the fourth and fifth inventions. Let $S$ be the sequence of length $w$ of some region, with each $S$ regarded as an element of $\chi \times \chi \times \cdots \times \chi$; let $X_i$ be the $i$-th residue of $S$ counted from the left; and let $P(\alpha \mid X_i)$ be the probability that the secondary structure corresponding to $X_i$ is an α-helix. The probability $P(\alpha \mid S)$ that the secondary structure corresponding to the sequence $S$ is an α-helix is assumed to be expressible as the product of the $P(\alpha \mid X_i)$, as follows.

【００２０】[0020] $P(\alpha \mid S) = \prod_{i=1}^{w} P(\alpha \mid X_i)$. Further, as a concrete representation of each $P(\alpha \mid X_i)$, a finite-partitioning probabilistic rule, for example, is used. A finite-partitioning probabilistic rule is a conditional probability distribution with the following structure, constructed as follows. The range of real values that the attribute at the $i$-th residue of the sequence $S$ can take is divided into non-overlapping subregions (hereinafter called cells). With $m$ the total number of cells and $C_k$ the $k$-th cell, when $X_i$ falls in $C_k$ among the $m$ cells, $P(\alpha \mid X_i) = P_k(i)$. Here,

【００２１】[0021]

【数１】 [Equation 1]

【００２２】[0022] and this is called the probability parameter. FIG. 4 is a schematic diagram showing the structure of the finite-partitioning probabilistic rule; as an example, it shows the case where the probability parameter is estimated from a single attribute whose value ranges from 0 to 1.
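The product form of $P(\alpha \mid S)$ and the cell lookup of the finite-partitioning rule can be sketched as follows. The names `cell_index` and `region_probability`, the representation of cells by sorted interior cut points, and the use of one shared set of cut points for all positions are assumptions made for illustration only.

```python
import bisect


def cell_index(value, boundaries):
    """Return the 0-based index of the cell containing `value`.

    boundaries: sorted interior cut points; m cells need m - 1 cut points.
    """
    return bisect.bisect_right(boundaries, value)


def region_probability(attribute_values, boundaries, params):
    """Product over positions i of P_k(i), where k is the cell of X_i.

    params[i][k] holds the probability parameter P_k(i) (0-based indices).
    """
    prob = 1.0
    for i, x in enumerate(attribute_values):
        prob *= params[i][cell_index(x, boundaries)]
    return prob
```

For a single attribute ranging from 0 to 1 split into three cells, `boundaries` could be `[0.3, 0.6]`; these cut points are illustrative, not values from the source.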

【００２３】[0023] The probability parameters are estimated using the numbers of positive and negative examples that fall in each cell. Let $m$ be the number of cells, $N_k^{+}(i)$ the number of positive examples in the $k$-th cell at the $i$-th position, $N_k^{-}(i)$ the number of negative examples in the $k$-th cell at the $i$-th position, and $N_k(i)$ the sum of the numbers of positive and negative examples in the $k$-th cell at the $i$-th position. The estimate for the $k$-th cell at the $i$-th position is denoted as follows.

【００２４】[0024]

【数２】 [Equation 2]

【００２５】[0025] For example, the probability parameter for each cell is computed with the Laplace estimator given by the following equation.

【００２６】[0026]

【数３】 [Equation 3]

【００２７】[0027] However, the estimator is not limited to the Laplace estimator; many other estimators can be used.
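As a hedged illustration of the parameter estimation in step 30, the sketch below counts positive and negative examples per position and cell and applies the Laplace estimator in the form stated later in the worked example of the description, $(N_k^{+}(i) + 1)/(N_k(i) + 2)$. The helper name `estimate_parameters` and the `cell_of` callback that maps a residue to a cell index are hypothetical.

```python
def laplace_estimate(n_pos, n_neg):
    """Laplace estimate (N+ + 1) / (N + 2), with N = N+ + N-."""
    return (n_pos + 1) / (n_pos + n_neg + 2)


def estimate_parameters(positives, negatives, cell_of, m):
    """Estimate P_k(i) for every position i and cell k.

    positives, negatives: lists of sequences, all of the same length w.
    cell_of: function mapping a residue to a cell index in 0..m-1.
    Returns a w x m table params[i][k].
    """
    w = len(positives[0])
    pos = [[0] * m for _ in range(w)]
    neg = [[0] * m for _ in range(w)]
    for counts, examples in ((pos, positives), (neg, negatives)):
        for seq in examples:
            for i, residue in enumerate(seq):
                counts[i][cell_of(residue)] += 1
    return [[laplace_estimate(pos[i][k], neg[i][k]) for k in range(m)]
            for i in range(w)]
```

Because one is added to the numerator and two to the denominator, an empty cell yields 0.5 rather than an undefined ratio, which keeps later products and logarithms well defined.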

【００２８】[0028] Step 40 is part of the sixth invention. In this step, the probabilistic rule learned in step 30 is used to compute the activity of each region of the test data sequence.

【００２９】[0029] Here, the likelihood is used as the activity.

【００３０】[0030] Specifically, consider an α-helix region of length $w$ for which a probabilistic rule has been constructed. For the amino acid sequence of the test data, all $w - t + 1$ subregions of length $t$, where $t$ is smaller than the region length $w$, are set up. Each of these $w - t + 1$ subregions is applied in turn, starting from the left of the test amino acid sequence, and the likelihood of each region of the test sequence is computed.

【００３１】[0031] For the $k$-th subregion of length $t$, the probability parameters of the α-helix region arranged in order from the left are written $\xi_k = (\theta_1, \dots, \theta_t)$, $\theta_i = (P_1(i), \dots, P_m(i))$ ($i = 1, \dots, t$).

【００３２】[0032] Here, $m$ is the number of cells, and the values of $\theta_i$ have already been obtained by learning.

【００３３】[0033] Since one such $mt$-dimensional parameter is obtained for each of the $w - t + 1$ subregion positions, there are $w - t + 1$ of them; they are denoted $\xi_1, \dots, \xi_{w-t+1}$.

【００３４】[0034] Using these parameters, $w - t + 1$ likelihoods can be computed for any test amino acid sequence $\Gamma$ of length $t$, as follows.

【００３５】[0035] $P(\alpha \mid \Gamma : \xi_k)$ ($k = 1, \dots, w - t + 1$), where, for each $k$, $P(\alpha \mid \Gamma : \xi_k) = \prod_{i=1}^{t} P(\alpha \mid \Gamma : \theta_i)$. Here, $P(\alpha \mid \Gamma : \theta_i)$ is computed as $P_l(i)$ ($l = 1, \dots, m$) if $X_i$ falls in the $l$-th cell. The values $P_l(i)$ ($l = 1, \dots, m$) have already been learned.
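The $w - t + 1$ likelihoods for a test window can be sketched as below. This is an illustrative reading of the text, assuming that the $k$-th subregion uses the learned parameters of helix positions $k, \dots, k + t - 1$, and that the test window has already been converted to 0-based cell indices; the function name `subregion_likelihoods` is hypothetical.

```python
def subregion_likelihoods(window_cells, params):
    """Likelihood of a length-t test window under each of the w - t + 1 subregions.

    window_cells: cell index (0-based) of each of the t residues of the window.
    params: w x m table, params[i][l] = P_l(i) learned for one helix region.
    Returns a list whose entry k is the product over i of params[k + i][cell_i].
    """
    w, t = len(params), len(window_cells)
    if t > w:
        raise ValueError("window length t must not exceed region length w")
    likelihoods = []
    for k in range(w - t + 1):
        value = 1.0
        for i, cell in enumerate(window_cells):
            value *= params[k + i][cell]
        likelihoods.append(value)
    return likelihoods
```

Sliding the window one residue at a time along the test sequence and calling this function at each offset would give the likelihoods for every region of the test sequence.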

【００３６】[0036] For example, suppose that the attribute space of the amino acids in the finite-partitioning probabilistic rule consists of a single attribute, that the number of cells is 3, and that the estimate for each cell is obtained with the Laplace estimator. For the $k$-th subregion of length 5 of some α-helix region, let $N_l^{+}(i)$ be the number of positive examples in the $l$-th cell at the $i$-th position, $N_l^{-}(i)$ the number of negative examples in the $l$-th cell, and $N_l(i)$ the sum of the numbers of positive and negative examples. The estimate for the $l$-th cell at the $i$-th position is then obtained as, for example, $P_l(i) = (N_l^{+}(i) + 1) / (N_l(i) + 2)$.

【００３７】[0037] Now suppose that a test is performed on a region $\Gamma$ of the test amino acid sequence with window size 5, and that the residues of $\Gamma$ have real attribute values that fall in the 1st, 2nd, 3rd, 2nd, and 1st cells, respectively, of the learned rule constructed for the subregion. The activity of the region $\Gamma$ of the test amino acid sequence under the $k$-th subregion is then computed as the likelihood $P(\alpha \mid \Gamma : k)$ as follows.

$$P(\alpha \mid \Gamma : k) = \frac{N_1^{+}(1) + 1}{N_1(1) + 2} \cdot \frac{N_2^{+}(2) + 1}{N_2(2) + 2} \cdot \frac{N_3^{+}(3) + 1}{N_3(3) + 2} \cdot \frac{N_2^{+}(4) + 1}{N_2(4) + 2} \cdot \frac{N_1^{+}(5) + 1}{N_1(5) + 2}$$

This likelihood computation is carried out with the $w - t + 1$ subregions for every region that the test amino acid sequence can take. If there are several α-helix regions, the same likelihood computation is carried out for each of them.
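The arithmetic of the five-residue example can be checked with a small script. The counts below are invented purely for illustration and do not come from the patent; only the cell pattern 1, 2, 3, 2, 1 and the Laplace form follow the text. Exact fractions are used so that the product can be read off without rounding.

```python
from fractions import Fraction


def window_likelihood(cells, n_pos, n_total):
    """Product of Laplace estimates along a window.

    cells: 1-based cell number of each residue, e.g. [1, 2, 3, 2, 1].
    n_pos[i][l-1], n_total[i][l-1]: positive count and total count for
    cell l at position i + 1 (hypothetical counts in the usage below).
    """
    result = Fraction(1)
    for i, cell in enumerate(cells):
        result *= Fraction(n_pos[i][cell - 1] + 1, n_total[i][cell - 1] + 2)
    return result


# Hypothetical counts for 5 positions x 3 cells.
N_POS = [[3, 1, 0], [1, 2, 1], [0, 1, 3], [1, 4, 1], [2, 0, 1]]
N_TOTAL = [[4, 2, 2], [2, 2, 2], [2, 2, 4], [2, 6, 2], [2, 2, 2]]
print(window_likelihood([1, 2, 3, 2, 1], N_POS, N_TOTAL))
```

With these made-up counts the five factors are 2/3, 3/4, 2/3, 5/8, and 3/4.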

【００３８】[0038] Consequently, a likelihood is obtained as output for every region of the test amino acid sequence that matches the window size.

【００３９】[0039] With the window-based likelihood computation for each region described above, the probability that an α-helix corresponds to each subregion of the test amino acid sequence can be computed as a likelihood, rather than computing the probability that an α-helix corresponds to each individual residue.

【００４０】[0040] Step 50 is part of the sixth invention. In this step, the single optimum activity for $\Gamma$ is determined from among the multiple activities computed in step 40, and the variation of the activity over the entire test amino acid sequence is output.

【００４１】[0041] As in step 40, the likelihood is used here as the activity.

【００４２】[0042] For example, the optimum value $P(\alpha \mid \Gamma : \xi_k^{*})$ is defined as follows: $P(\alpha \mid \Gamma : \xi_k^{*}) = \max\{P(\alpha \mid \Gamma : \xi_1), \dots, P(\alpha \mid \Gamma : \xi_{w-t+1})\}$. If there are several α-helix regions, another possible method is to carry out the same likelihood computation on the same $\Gamma$ for each region and to select the maximum likelihood over all α-helix regions as the optimum value.

【００４３】[0043] Furthermore, the variation of the likelihood over the entire test amino acid sequence is output using a method such as the following: for each region of the test sequence that has been assigned a likelihood, the maximum likelihood is taken as the optimum value of each residue in that region; or, for each residue, the average of the likelihoods obtained for the regions containing that residue is taken as the optimum value of the residue.
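Step 50 can be sketched as two small functions: one that picks the optimum (maximum) likelihood for a window, and one that turns per-window values into a per-residue profile by either the maximum or the average over the windows covering each residue. The function names and the `mode` parameter are assumptions of this sketch.

```python
def optimum_likelihood(likelihoods):
    """Optimum value for one window: the maximum over all subregions."""
    return max(likelihoods)


def per_residue_profile(region_scores, seq_len, t, mode="max"):
    """Convert window scores into one value per residue.

    region_scores[s]: optimum likelihood of the window starting at residue s
    (0-based) and covering t residues; one score per possible start.
    mode: "max" or "mean" over the windows that contain each residue.
    """
    if len(region_scores) != seq_len - t + 1:
        raise ValueError("expected one score per window position")
    covering = [[] for _ in range(seq_len)]
    for start, score in enumerate(region_scores):
        for residue in range(start, start + t):
            covering[residue].append(score)
    if mode == "max":
        return [max(scores) for scores in covering]
    if mode == "mean":
        return [sum(scores) / len(scores) for scores in covering]
    raise ValueError("mode must be 'max' or 'mean'")
```

The resulting list is the likelihood profile along the test sequence; how a threshold is then chosen to call helix regions is not specified in this passage.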

【００４４】[0044] The learning and prediction method of FIG. 1 described above can also be applied to the prediction of secondary structures other than the α-helix.

【００４５】[0045] FIG. 2 is a flowchart illustrating an embodiment of the protein three-dimensional structure prediction method of the present invention. In this embodiment, the α-helix is treated as the target secondary structure.

【００４６】[0046] Step 60 performs the same processing as step 10 of FIG. 1 and extracts the positive examples needed for α-helix region prediction.

【００４７】[0047] Step 70 performs the same processing as step 20 of FIG. 1 and extracts the negative examples needed for α-helix region prediction.

【００４８】[0048] Step 80 performs the same processing as step 30 of FIG. 1, applying a probabilistic rule and estimating its real-valued parameters.

【００４９】[0049] Step 90 is part of the fifth invention only. In this step, the model of the probabilistic rule is optimized using an information criterion. One information criterion that may be used is, for example, the MDL (minimum description length) criterion.

【００５０】[0050] Applied to the finite-partitioning probabilistic rule, the MDL principle holds that the optimal probabilistic rule has been constructed when the sum of the description length of the data and the description length of the finite-partitioning probabilistic rule is minimal. The MDL principle is described in detail in the paper "Modeling by shortest data description" by Rissanen, published in the U.S. journal Automatica, vol. 14, pp. 465-471, 1978.

【００５１】[0051] The description length of the data with respect to the finite-partitioning probabilistic rule can be computed as the negative of the log-likelihood of the rule, as follows. All logarithms below are taken to base 2.

【００５２】[0052]

【数４】 [Equation 4]

【００５３】[0053] The description length of the finite-partitioning probabilistic rule itself can be computed as follows, since the estimate of each probability parameter $P_k(i)$ can be described in approximately $\log N_k(i)$ bits.

【００５４】[0054]

【数５】 [Equation 5]

【００５５】[0055] Therefore, according to the MDL principle, the number of cells $m$ that minimizes the following expression is taken as the optimal number of cells constituting the probabilistic rule.

【００５６】[0056]

【数６】 [Equation 6]
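Equations 4 to 6 are not reproduced in this text, so the following is only a sketch of the MDL-based selection of the number of cells, built from the prose description: the data description length is taken as the negative base-2 log-likelihood of the positive and negative labels under the cell probabilities, and the rule description length as the sum of $\log_2 N_k(i)$ over non-empty cells. Using the Laplace estimate inside the likelihood, skipping empty cells, and the names `description_length` and `best_cell_count` are all assumptions.

```python
import math


def description_length(n_pos, n_neg):
    """Total description length in bits (data part plus rule part).

    n_pos[i][k], n_neg[i][k]: positive and negative counts for cell k
    at position i under one candidate partition.
    """
    data_bits = 0.0
    model_bits = 0.0
    for pos_row, neg_row in zip(n_pos, n_neg):
        for p, n in zip(pos_row, neg_row):
            total = p + n
            prob = (p + 1) / (total + 2)  # Laplace estimate, never 0 or 1
            # Negative log-likelihood of the labels in this cell.
            data_bits -= p * math.log2(prob) + n * math.log2(1 - prob)
            if total > 0:
                model_bits += math.log2(total)  # about log N bits per parameter
    return data_bits + model_bits


def best_cell_count(counts_by_m):
    """Pick the number of cells m whose counts give the shortest description.

    counts_by_m: dict mapping m to a (n_pos, n_neg) pair of count tables.
    """
    return min(counts_by_m, key=lambda m: description_length(*counts_by_m[m]))
```

In use, the training data would be re-binned for each candidate $m$, the two count tables rebuilt, and the $m$ with the smallest total kept.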

【００５７】[0057] Step 100 performs the same processing as step 40 of FIG. 1: using the probabilistic rule whose model was optimized in step 90, it computes the activity of each region of the test amino acid sequence data.

【００５８】[0058] Step 110 performs the same processing as step 50 of FIG. 1 and, from the multiple activities obtained in step 40, outputs the variation of the activity over the entire sequence.

【００５９】[0059] The learning and prediction method of FIG. 2 described above can also be applied to the prediction of secondary structures other than the α-helix.

【００６０】[0060]

EFFECTS OF THE INVENTION From the amino acid sequence information of proteins whose secondary structure is known, the secondary structure of a protein whose secondary structure is unknown can be predicted with higher accuracy than with the prior art. In particular, the α-helix regions of albumin can be predicted with a high accuracy of 70% or more. In addition, optimizing the model with an information criterion such as the MDL principle makes it possible to optimize the structure of the probabilistic rule on a theoretical basis.

【図面の簡単な説明】[Brief description of drawings]

FIG. 1 is a flowchart showing one embodiment of the protein three-dimensional structure prediction method of the present invention.

FIG. 2 is a flowchart showing one embodiment of the protein three-dimensional structure prediction method of the present invention.

FIG. 3 is a schematic diagram of the secondary structure (α-helix) region prediction method of the present invention.

FIG. 4 is a schematic diagram showing a specific example of the finite-partitioning probabilistic rule, which is one example of the probabilistic rule used in the present invention.

[Explanation of symbols]

10 Positive example extraction
20 Negative example extraction
30 Real-valued parameter estimation by probabilistic rule
40 Activity calculation for each region of the test sequence
50 Predicted value calculation for the test sequence
60 Positive example extraction
70 Negative example extraction
80 Real-valued parameter estimation by probabilistic rule
90 Optimization by information criterion
100 Activity calculation for each region of the test sequence
110 Predicted value calculation for the test sequence

Claims

(57) [Claims]

1. A protein three-dimensional structure prediction method for predicting the structure of a protein from the amino acid sequence of the protein, comprising: a training data extraction step of extracting, using not only proteins of known structure but also proteins of unknown structure, training data consisting of positive examples, which are sequences corresponding to a certain three-dimensional structure, and negative examples, which are sequences known not to correspond to it; a learning step of learning a probabilistic rule from the training data consisting of the positive examples and the negative examples; and a prediction step of predicting the structure of each region of test amino acid sequence data using the learned probabilistic rule.

2. The protein three-dimensional structure prediction method according to claim 1, wherein the training data extraction step comprises: a step of aligning proteins belonging to the same family against the amino acid sequence of a protein of known structure, and extracting the subsequences corresponding to the secondary structure region to be predicted as positive examples of the secondary structure region; and a step of aligning each sequence of a database of proteins of known structure against the subsequence of the known-structure protein that corresponds to the secondary structure to be predicted, and extracting subsequences that do not correspond to the secondary structure to be predicted as negative examples of the secondary structure region.

3. The protein three-dimensional structure prediction method according to claim 1, wherein the learning step uses a probabilistic rule to estimate the real-valued parameters of the probabilistic rule from the amino acid types in the training data consisting of the positive examples and the negative examples.

4. The protein three-dimensional structure prediction method according to claim 1, wherein the learning step uses a probabilistic rule to estimate the real-valued parameters of the probabilistic rule from the real-valued attributes of the amino acids in the training data consisting of the positive examples and the negative examples.

5. The protein three-dimensional structure prediction method according to claim 1, wherein the learning step comprises: a step of using a probabilistic rule to estimate the real-valued parameters of the probabilistic rule from the real-valued attributes of the amino acids in the training data consisting of the positive examples and the negative examples; and a step of optimizing the model of the probabilistic rule using an information criterion.

6. The protein three-dimensional structure prediction method according to claim 1, wherein the prediction step comprises: a step of computing, using the probabilistic rule learned in the learning step, the activity of each region of the test amino acid sequence data; and a step of selecting the optimum value from among the computed activities.