JP3867863B2

JP3867863B2 - 3D structure processing equipment

Info

Publication number: JP3867863B2
Application number: JP17302493A
Authority: JP
Inventors: 真弓冨川; 聖一相川; 史子松澤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1993-07-13
Filing date: 1993-07-13
Publication date: 2007-01-17
Anticipated expiration: 2022-01-17
Also published as: JPH0728844A

Description

【０００１】
【産業上の利用分野】
本発明は，物質の二つの立体構造を重ね合わせたり，類似部分を抽出したりする立体構造処理装置に関するものである。
【０００２】
物理・化学の分野では，新しい（未知の）物質の性質を調べたり，新しい物質を人工的に創製するために，分子構造を分析し，物質が持つ機能の発現メカニズムを解明する研究が行われている。これまでの研究成果により，物質の持つ機能と立体構造との間には密接な関係があることが知られており，構造的に類似した部分（あるいは特異的な部分）が物質の機能に大きく関与すると考えられている。このため，Ｘ線結晶解析装置やＮＭＲなどによって物質の立体構造を決定し，その結果明らかになった立体構造のデータベース化が図られている。
【０００３】
例えば，このデータベースから立体構造中の類似部分を計算機で自動的に抽出したり，検索したりすることができれば，従来研究者が行ってきた一連の作業を軽減することができる。
【０００４】
【従来の技術】
物理・化学の分野では，新しい（未知の）物質の性質を調べたり，新しい物質を人工的に創製するために，Ｘ線結晶解析装置やＮＭＲ等の手法で物質の立体構造を決定し，決定された立体構造の情報をデータベースに蓄積することが行われている。代表的なデータベースとして，タンパク質のＸ線結晶解析により明らかになったタンパク質等の立体構造を登録したプロテイン・データ・バンク（ＰＤＢ：Protein Data Bank)が広く知られており，世界的に用いられている。また，化学物質が登録されているデータベースとして，ケンブリッジ・ストラクチャー・データベース（ＣＳＤ：Cambridge Structural Database)が知られている。
【０００５】
タンパク質は複数のアミノ酸が一本の鎖のように連結し，この鎖が生体内で折りたたまることによって立体構造を形成し，各種の機能を発現するようになっている。各アミノ酸はＮ端末からＣ端末に向けて１から順に番号付けることによって表現される。これらの番号はアミノ酸番号，アミノ酸配列番号またはアミノ酸残基番号などと呼ばれている。また，各アミノ酸はその種類に応じて複数の原子より構成される。したがって，上記ＰＤＢには，タンパク質の名前，管理番号，タンパク質を形成するアミノ酸番号，各アミノ酸を構成する各原子の種類と三次元座標等の情報が登録されている。
【０００６】
これまでの化学的な研究成果から，物質の立体構造とその機能との間には密接な関係があることが知られており，物質の改変や新しい機能を持つ物質を創製するために，化学的な実験を通じて立体構造と機能の関係が解明されつつある。その中でも，同じ機能を持つ物質間で構造的に類似した部分（あるいは特異的な部分）が物質の機能に大きく関与すると考えられるため，立体構造中に共通に存在する類似な構造を探し出すことは必要不可欠である。
【０００７】
しかし，三次元座標から直接特徴的な部分を取り出す手法がないため，研究者は，各立体構造を三次元グラフィックシステムで表示し，それを見て人手で特徴的な部分を探しているのが現状である。一般に物質には定まった向きの決め方がなく，片方の物質を基準としてもう一方の物質を回転させながら特徴的な部分を探し出すため，これらの作業にはかなりの時間を要している。
【０００８】
研究者が類似する立体構造を探す際，物質の立体構造の類似性の尺度として r.m.s.d.(root mean square distance)値が使用されている。r.m.s.d.値は互いに対応付けられた物質の構成要素間の平均二乗距離の平方根を表す値である。経験的には，物質間のr.m.s.d.値が１Å以下の場合に，それらの物質はきわめて類似していると考えられている。
【０００９】
図２にr.m.s.d.値の算出方法を示す。
例えば図２の(a) に示すような点集合Ａ＝｛ａ₁，ａ₂，…，ａ_i，…，ａ_m｝で表される物質と，(b) に示すような点集合Ｂ＝｛ｂ₁，ｂ₂，…，ｂ_j，…，ｂ_n｝で表される物質があったとする。これらの物質ＡとＢを構成する要素を(c) に示すように互いに対応付け，対応付けられた要素間のr.m.s.d.値が最小になるように物質Ｂを回転，移動させていく。r.m.s.d.値は，対応付ける点の数をＮ，回転行列をＵ，各重みをｗ_nとすると，次式の
r.m.s.d.＝｛（Σｗ_n（Ｕｂ_n−ａ_n）²）／Ｎ｝^1/2
（ただし，ｎは１からＮまでの総和）
で求められる。なお，この対応付けられた点同士間でr.m.s.d.値を最小化する物質の回転，移動を求める手法はカブシュ等により提案され，現在広く利用されている。しかし，本手法は同数の点同士を比較するため，一方の物質中のどの構成要素を他方の物質のどの構成要素に対応付けるとr.m.s.d.値が最小になるかについて，研究者が試行錯誤しながら求めているのが現状である。
【００１０】
また，新しい物質を創製するためには，既存の物質を調べる必要がある。例えばある物質の耐熱性を強化したい場合には，耐熱性の強い物質に共通する構造を探し出し，そのような構造を新たに作成する物質に付加することによって機能の強化を図ったりする。これには，データベースの中から必要な構造を検索する機能が必要になるが，前述の理由により同様にコンピュータグラフィックシステムを用いて研究者が試行錯誤しながら必要な構造をデータベースから探し出しているのが現状である。
【００１１】
以上のように，従来，作業者は解析したい物質の立体構造をグラフィックシステムを用いてグラフィカルに表示し，画面上で他の分子との視覚的な比較，重ね合わせなどの操作により手作業で解析を行わなければならなかった。
【００１２】
そこで，これらの一連の処理を計算機を使用して，自動的に行うことが考えられるが，対象となる２つの立体構造を形成するそれぞれの点集合間の要素を対応付けて，類似構造を抽出する場合でも，それぞれの点集合の構成要素数をｍ，ｎ個とすると，比較するために生成される組合せ数はｍⁿ個という膨大な数となり，実際には計算不可能である。
【００１３】
さらに，これらの対応付けを生成するにあたり，各点集合の要素間の距離などの幾何学的関係，r.m.s.d.値，ｎｉｌの個数などのしきい値条件，また，構成要素の属性（タンパク質の場合，アミノ酸の種類など）を考慮することにより，最適な組合せを生成することができるが，それでもなお，立体構造の形状，点集合を構成する要素数，幾何学的制約，しきい値などの設定条件によっては計算時間が非常に長くなるという問題がある。
【００１４】
【発明が解決しようとする課題】
以上のように従来，ＣＳＤやＰＤＢを利用して物質の立体構造を解析する場合，大量のデータからの構造的な検索，構造比較を手作業，または非常に長い処理時間のために，多くの時間や労力を要し，作業者の大きな負担となっていた。また，そのためにデータベースのデータを有効に利用することができず，物質の構造解析を十分にできないといった問題点があった。したがって，高速で効率の良い実用的な立体構造データベースの立体構造の類似性による検索システムが必要とされていた。
【００１５】
本発明は上記問題点の解決を図るため，対象となる立体構造を形成する点集合間の要素を対応付ける場合，それぞれの点集合を構成する要素の順序を考慮し，さらに，しきい値条件や点の属性を絞り込むことによって，立体構造の類似した部分構造を自動的に抽出することによる複数の立体構造の重ね合わせ表示や，データベースからの類似構造の検索や，化学物質やタンパク質の構造解析における構造比較や，類似構造の検索の作業を効率化し，作業に要する時間や作業者の負担を軽減し，また従来ではできなかった全データを対象とした立体構造の解析を可能とするシステムを提供することを目的とする。
【００１６】
すなわち，本発明は，立体構造中の類似部分を計算機で自動的に抽出することによってコンピュータグラフィックスにおける立体構造の重ね合わせ表示の自動化を可能とすることを目的とする。
【００１７】
また，本発明は，データベースから類似な立体構造を検索することを可能とし，各作業に要する時間，人員等のコストを削減するとともに，各作業を効率化する物質の解析・検索システムを提供することを目的とする。
【００１８】
また，本発明は，物質の立体構造が格納されたデータベースから，ある既知の機能を発現する立体構造と類似するものを探し出すことによって，機能データベースを自動生成する手段を提供することを目的とする。
【００１９】
また，本発明は，物質の機能と関連する立体構造に関するデータが格納された機能データベースをもとに，ある立体構造の未知の機能を予測する手段を提供することを目的とする。
【００２０】
また，本発明は，組合せ候補のうち，無駄または無意味な候補を除外して，以上の処理を高速化し，実用上十分な短時間で処理するシステムを提供することを目的とする。
【００２１】
【課題を解決するための手段】
図１は本発明の原理説明図である。
図１において，１０はＣＰＵおよびメモリなどからなる処理装置，１１はオペレーティング・システムなどの制御部，１４は二つの集合の各要素を対応付ける組合せを所定の制約条件のもとに生成する組合せ生成処理手段，１５は二つの物質等を重ね合わせる重ね合わせ算出処理手段，１６は二つの物質の類似性を調べる類似性算出処理手段，１７は物質等のデータを入力するデータ入力処理手段，１８は検索結果等を出力する結果出力処理手段，１９は機能データベースを自動生成する機能データベース生成処理手段，２０は指定された物質の未知の機能を予測する機能予測処理手段，２１は物質の立体構造に関するデータが格納されたデータベース，２２はキーボードやポインティング・デバイス等の入力装置，２３は表示装置，２４は機能データベースを表す。
【００２２】
組合せ生成処理手段１４は，タンパク質や各種分子等の立体構造を構成する要素に関する二つの集合の各要素を対応付ける際に，順序による候補の絞り込み，幾何学的な関係による候補の絞り込み，所定のしきい値条件による候補の絞り込み，または点の属性による候補の絞り込みを行い，これらの幾何学的制約，しきい値条件または点の属性を満たす要素の組合せを生成する処理手段である。
【００２３】
重ね合わせ算出処理手段１５は，組合せ生成処理手段１４により生成された組合せの中で各要素間の平均二乗距離の平方根（root mean square distance, r.m.s.d. ，以下r.m.s.d.値と記す）が所定のしきい値（threshold ）より小さく，かつ対応付けられる要素間の距離も所定のしきい値（Emax）より小さい対応付けを探し出すことによって二つの立体構造の最もよく一致する重ね合わせのための位置と方向を算出する処理手段である。
【００２４】
結果出力処理手段１８は，算出された重ね合わせの結果または後述する検索結果などを表示装置２３に出力する処理手段である。
データベース２１には，タンパク質などの物質の立体構造に関するデータが登録されている。データ入力処理手段１７は，データベース２１に登録されているデータを読み込んだり，また入力装置２２からユーザが入力したコマンド等の指示情報を読み込んだりする処理手段である。
【００２５】
類似性算出処理手段１６は，指定された物質の立体構造を検索キーとし，データベース２１中の類似した立体構造を検索するために，組合せ生成処理手段１４が生成した二つの立体構造の要素の組合せについて，空間的に類似するものを探し出す処理手段である。
【００２６】
機能データベース生成処理手段１９は，指定されたある機能を発現する立体構造をもとに，それに類似する機能を発現する立体構造をデータベース２１中から抽出して収集するために，類似性算出処理手段１６が選び出した高い類似性を持つ立体構造を収集し，その収集した立体構造に関するデータを機能別に格納することにより機能データベース２４を生成する処理手段である。
【００２７】
機能予測処理手段２０は，指定された機能が未知の物質の立体構造と機能データベース２４に登録されている機能が既知の立体構造とを照合し，組合せ生成処理手段１４および類似性算出処理手段１６の処理により選び出された類似性の高い立体構造が発現する機能を，指定された物質の立体構造が発現する機能であるとして予測し，その結果を出力する処理手段である。
【００２８】
また，組合せ生成処理手段１４は，立体構造を形成する２つの順序付けられた点集合Ａ，Ｂの各要素を対応付ける際に，点集合Ｂの要素の先頭から順に点集合Ａの要素と対応づけていく。
【００２９】
【作用】
本発明によれば，立体構造を形成する点集合，順序関係のある点集合および部分的に対応付けられた点集合同士をr.m.s.d.値が最適な値をとるように重ね合わせることができるため，従来人手に頼っていた立体構造の重ね合わせ作業を自動化することができ，作業の効率化を図ることができる。
【００３０】
また，本発明によれば，タンパク質の立体構造データベースより，類似性の高い構造を探し出すことを自動化することにより，作業者の負担を軽減することができる。従来のように，グラフィカルに表示して他のタンパク質と比較する手間を減らすことができる。また，作業を自動化することにより，従来ではできなかった全データを対象とした立体構造の解析を行うことが可能になる。
【００３１】
また，機能データベースの自動生成や，機能データベースを用いることによる機能予測の自動化が可能になる。
特に，対応付けられた要素間の距離が所定のしきい値より大きい場合には，組合せの候補から外されるので，絞り込みの処理が高速化される。
【００３２】
また，本発明によれば，立体構造を形成する点集合の各要素の順序に着目し，幾何学的制約，しきい値条件または点の属性を考慮しながら，一方の集合の要素を先頭から順に他方の集合の要素に対応づけて組合せ候補とし，検索のキーとなる構造が複数サイト（連続した複数要素の列部分）からなる場合においても，各サイト内で要素が連続していれば１番目のサイトから順に１サイトずつ対応付けを行い，対応付けの候補と考えられる二点対間の距離が一定値以下であるものだけを候補とし，また，生成した組合せの中で，各点間のr.m.s.d.値が所定のしきい値（threshold ）より小さく，かつ対応付けられる点間の距離も所定のしきい値（Emax）より小さい候補に絞り込み，さらに，例えばアミノ酸残基番号が連続しない箇所のように不連続なサイト数や，サイトを構成する要素の残数を考慮して，組合せ候補のうち，無駄または無意味な候補を除外するので，処理が高速になり，実用上十分な短時間で処理するシステムの提供が可能となる。
【００３３】
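[Embodiments]
Embodiments of the present invention are described below with reference to the drawings.
First, the basic processing by which the combination generation processing means 14 shown in FIG. 1 associates points is described.
[Description of the association processing performed by the combination generation processing means 14]
(1) Association of point sets
Suppose that substance A and substance B are formed by the point sets A = {a₁, a₂, …, a_i, …, a_m}, 1 ≤ i ≤ m, and B = {b₁, b₂, …, b_j, …, b_n}, 1 ≤ j ≤ n, respectively, and that each point is a three-dimensional coordinate a_i = (x_i, y_i, z_i), b_j = (x_j, y_j, z_j). In this case, the elements of the two point sets can in principle be associated by pairing the points in each set one after another, and all combinations can be generated by building a tree structure such as the one shown in FIG. 3(A).
[0034]
FIG. 3(B) shows an example of association when point set A has three elements and point set B has four, that is, the association of A = {a₁, a₂, a₃} with B = {b₁, b₂, b₃, b₄}. Dotted lines represent the generated candidates, and the optimal association among them (a₁ with b₂, a₂ with b₃, a₃ with b₄) is drawn with solid lines.
[0035]
In the figure, nil corresponds to the case where no corresponding point exists. By associating nil, an optimal association can be generated even when the point sets being compared have different numbers of elements. The best association is obtained by selecting, from the combinations generated in this way, the one whose root mean square distance (r.m.s.d. value) is smallest.
[0036]
FIG. 4 shows an example of an algorithm for associating point sets A and B whose elements are not ordered.
Elements a are taken from point set A one at a time, each is combined with an element b_j that is not yet contained in an ancestor or sibling in the tree structure, and the pair is checked against the constraint conditions. If the constraints are satisfied, the pair is registered in the tree structure and association proceeds to the next element.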
【００３７】
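However, with this approach on the order of n to the power m combinations are generated, so the problem is in general computationally intractable. Specifically, for an unordered point set A (m points) and point set B (n points), the number of combinations generated, with i denoting the number of nils, is
[0038]
[Math. 1]

[0039]
Here, with n = 4 and m = 3, this becomes
[0040]
[Math. 2]

[0041]
That is, for point set A (3 points) and point set B (4 points) as shown in FIG. 3(B), 73 combinations are generated. In practice the number of points (elements) is usually far larger, so the number of combinations becomes enormous.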
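The tree enumeration for the unordered case can be sketched as follows. This is a minimal illustration, not the procedure of FIG. 4: it applies no constraint checks, and the function name `unordered_matchings` and the use of `None` to stand for nil are assumptions made here. Each a_i either takes nil or any b_j that is not already used on the path from the root.

```python
def unordered_matchings(m, n):
    """Yield every association of m points of A with n points of B.

    Each result is a tuple t of length m, where t[i] is the index of the
    point of B associated with a_i, or None when a_i is associated with nil.
    """
    def extend(prefix, used):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        # a_i may be left without a partner (nil) ...
        yield from extend(prefix + [None], used)
        # ... or take any b_j not already used higher up in the tree.
        for j in range(n):
            if j not in used:
                yield from extend(prefix + [j], used | {j})

    yield from extend([], frozenset())


if __name__ == "__main__":
    print(sum(1 for _ in unordered_matchings(3, 4)))
```

Under these assumptions the count for m = 3 and n = 4 is 73, which agrees with the figure stated in the text.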
【００４２】
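Therefore, when performing these associations, optimal combinations are generated by taking into account the geometric relationships within each point set, the threshold conditions, and the point attributes described in (4), (5), and (6) below.
[0043]
(2) Association of ordered point sets
Suppose that substance A and substance B are formed by the point sets A = {a₁, a₂, …, a_i, …, a_m}, 1 ≤ i ≤ m, and B = {b₁, b₂, …, b_j, …, b_n}, 1 ≤ j ≤ n, respectively, that each point is a three-dimensional coordinate a_i = (x_i, y_i, z_i), b_j = (x_j, y_j, z_j), that the ordering a₁ < a₂ < … < a_i < … < a_m (or a₁ > a₂ > … > a_i > … > a_m) holds within point set A, and that the ordering b₁ < b₂ < … < b_j < … < b_n (or b₁ > b₂ > … > b_j > … > b_n) likewise holds within point set B.
[0044]
In this case, the elements of the two point sets can in principle be associated by pairing the points in each set one after another according to the ordering, and all combinations can be generated by building the tree structure shown in FIG. 5(A). FIG. 5(B) is an example in which point set A has three points and point set B has four. That is, FIG. 5(B) shows the association of the ordered point set A = {a₁, a₂, a₃} (ordering a₁ < a₂ < a₃) with the similarly ordered point set B = {b₁, b₂, b₃, b₄} (ordering b₁ < b₂ < b₃ < b₄).
[0045]
Dotted lines represent the generated candidates, and the optimal association among them (a₁ with b₂, a₂ with b₃, a₃ with b₄) is drawn with solid lines. In the figure, nil corresponds to the case where no corresponding point exists. By associating nil, an optimal association can be generated even when the point sets being compared have different numbers of elements. The best association is obtained by selecting, from the combinations generated in this way, the one with the smallest r.m.s.d. value.
[0046]
Introducing an ordering on the elements of each point set greatly reduces the number of combinations generated compared with (1) above. Furthermore, when these associations are performed, optimal combinations can be generated by taking into account the geometric relationships within each point set, the thresholds, and the point attributes described later in (4), (5), and (6).
[0047]
FIG. 6 shows an example of an algorithm for associating ordered point sets.
Elements a are taken from point set A one at a time, each is combined with an element b_j that is not yet contained in an ancestor or sibling in the tree structure and that is greater than the element of the parent node, and the pair is checked against the constraint conditions. If the constraints are satisfied, the pair is registered in the tree structure and association proceeds to the next element.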
【００４８】
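For ordered point sets of this kind, the number of combinations generated is
[0049]
[Math. 3]

[0050]
Here, with n = 4 and m = 3, this becomes
[0051]
[Math. 4]

[0052]
so for point set A (3 points) and point set B (4 points) as in FIG. 5(B), 35 combinations are generated.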
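The effect of the ordering can be checked with a small enumeration. The sketch below is illustrative only: it omits the constraint checks of FIG. 6, and the name `ordered_matchings` and the `None`-for-nil convention are assumptions. The only rule it applies is that each b_j chosen must come after the b chosen most recently for an earlier a.

```python
def ordered_matchings(m, n):
    """Yield order-preserving associations of m points of A with n points of B.

    Each result is a tuple t of length m; t[i] is an index into B or None
    (nil). The non-None indices are strictly increasing.
    """
    def extend(prefix, last):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        # Associate the current element of A with nil ...
        yield from extend(prefix + [None], last)
        # ... or with any b_j that follows the last b already used.
        for j in range(last + 1, n):
            yield from extend(prefix + [j], j)

    yield from extend([], -1)


if __name__ == "__main__":
    print(sum(1 for _ in ordered_matchings(3, 4)))
```

For m = 3 and n = 4 this yields 35 associations, matching the count stated in the text and roughly halving the 73 obtained without the ordering.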
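(3) Association of partially associated point sets or ordered point sets
In case (1) or (2) above, some pairs of points may already be known to correspond. In this case, the elements of the point sets are associated by referring to the information on the points whose correspondence is given in advance and associating the remaining points one after another in the same way as in (1) and (2), and all combinations can be generated by building a tree structure such as the one shown in FIG. 7.
[0053]
The × marks in FIG. 7 indicate the parts pruned by the partial association. The figure shows the association when element a₁ of point set A and element b₂ of point set B have been associated in advance. As in (1) and (2), optimal combinations can be generated by taking into account the geometric relationships within each point set, the threshold conditions, and the point attributes described later in (4), (5), and (6).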
【００５４】
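(4) Narrowing candidates by geometric relationships
Associating the elements of the point sets on the basis of geometric relationships prevents useless combinations from being generated, so the point sets can be associated efficiently.
[0055]
(a) Narrowing candidates by the relationship between pairs of points in space
Because the r.m.s.d. value is an averaged distance, it can come out small even when one associated pair of points is far apart, provided the distances between the other pairs are small. Such a search result is meaningless, so a restriction is needed that keeps the distance between associated points below a certain value. Consider superimposing a pair of points (a_i, a_k) in space, as shown in FIG. 8, on another pair (b_j, b_l) so that the distances between corresponding points (a_i and b_j, a_k and b_l) are at most a fixed value (Emax). Writing the distances within each pair as |a_i − a_k| = d_a and |b_j − b_l| = d_b, the difference between these distances satisfies, as is clear from FIGS. 9 and 10,
|d_a − d_b| ≤ 2 × Emax ... (Equation 1)
Therefore, when the elements of the point sets are associated, the candidate points can be narrowed by checking whether this expression is satisfied.
[0056]
However, (Equation 1) is not a necessary and sufficient condition, so after the rotation matrix U has been calculated it is necessary to check again that the distance between each pair of points is smaller than Emax. Because this re-check guarantees that the distance between each pair of associated points is smaller than Emax, the r.m.s.d. value satisfies the following condition (Equation 2).
[0057]
[Math. 5]

[0058]
(b) Narrowing candidates by the distance relationship (Equation 1)
When element a_i of point set A is associated with element b_j of point set B, (Equation 1) is checked against each of the s elements of point sets A and B that have already been associated. Because the distance relationship must then hold among multiple points, the candidate points can be narrowed more efficiently. FIG. 11 shows the narrowing of candidate points by distance relationships among multiple points. Points already associated are shown as filled circles, and the figure represents the case where the point corresponding to a_i is selected from b_j-3, b_j-2, and b_j. In FIG. 11(B), point b_j is selected on the basis of its distance relationships with the points already associated.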
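The pre-filter of (Equation 1), applied against every pair already associated as in (b), might look like the sketch below. The function names and the representation of points as coordinate tuples are assumptions for illustration. The check is only a pre-filter: as noted in the text, the distances must be checked against Emax again once the rotation has been computed.

```python
import math


def dist(p, q):
    """Euclidean distance between two 3-D points given as tuples."""
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))


def passes_distance_filter(a_i, b_j, matched_pairs, e_max):
    """Check (Equation 1) for a_i and b_j against all matched pairs.

    matched_pairs is a list of (a_k, b_l) tuples already associated.
    The pair (a_i, b_j) survives only if, for every (a_k, b_l),
    | |a_i - a_k| - |b_j - b_l| | <= 2 * e_max.
    """
    return all(
        abs(dist(a_i, a_k) - dist(b_j, b_l)) <= 2.0 * e_max
        for a_k, b_l in matched_pairs
    )
```

With one matched pair at the origin, a candidate a_i = (3, 0, 0) and Emax = 0.5, a point b_j at distance 3.5 from its matched partner passes the filter, while one at distance 5 is rejected.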
【００５９】
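(c) Narrowing candidates by angle
When three-dimensional structures are similar, the angular relationships among the points that form them can also be considered similar. In three dimensions, as shown in FIG. 12(A), there is the angle θ among three points and the angle φ between the planes each formed by three of four points. The method of narrowing the points to be associated is described below using the three-point angle θ as an example.
[0060]
During association, for the angle formed between element a_i in point set A and its s neighboring points (2 ≤ s ≤ m−1, n−1), only those points for which the angle between element b_j in point set B and its s neighboring elements lies within the tolerance Δθ are selected and associated, which narrows the number of candidates.
[0061]
FIG. 12(B) shows an example in which the angles between elements are taken as the geometric relationship holding within point set A and the points of point set B are associated on that basis.
When the angle formed between element a_i in point set A and its s = 2 neighboring points a_i-1 and a_i-2 is θ_a, and the angles formed between each of the elements b_p, b_q, b_r in point set B and the two adjacent elements b_j-1 and b_j-2 are θ_b1, θ_b2, and θ_b3 respectively, only the points whose angle difference is within the tolerance Δθ are selected and associated. In this figure only point b_q satisfies the tolerance |θ_a − θ_b1| ≤ Δθ, so b_q is selected as the candidate for b_j.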
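A minimal sketch of the three-point angle test with s = 2 follows. The text does not say which of the three points is the vertex of θ; the sketch assumes the middle point (a_i-1 or b_j-1) is the vertex, and the function names are likewise assumptions.

```python
import math


def angle_deg(p, q, r):
    """Angle in degrees at vertex q between the segments q->p and q->r."""
    v1 = [pk - qk for pk, qk in zip(p, q)]
    v2 = [rk - qk for rk, qk in zip(r, q)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0.0 or n2 == 0.0:
        raise ValueError("angle is undefined for coincident points")
    cos_t = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    cos_t = max(-1.0, min(1.0, cos_t))  # guard against rounding error
    return math.degrees(math.acos(cos_t))


def filter_by_angle(theta_a, b_prev2, b_prev1, candidates, delta_theta):
    """Keep the candidates whose angle with the two preceding points of B
    differs from theta_a by at most delta_theta (all in degrees)."""
    return [
        c for c in candidates
        if abs(theta_a - angle_deg(b_prev2, b_prev1, c)) <= delta_theta
    ]
```

Here `theta_a` would be computed once as `angle_deg(a_prev2, a_prev1, a_i)` and then compared against each candidate of point set B.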
【００６２】
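(d) Narrowing candidates by distance and angle from the centroid
When three-dimensional structures are similar, the distances and angles from the centroid are also considered similar. Therefore, by computing the centroid of the selected points and comparing distances and angles by the same methods as in (a) and (b) above, the candidates for association can be narrowed.
[0063]
(5) Narrowing candidates by threshold conditions
In methods (1) to (4) above, association can be made more efficient by setting a predetermined threshold and pruning the search when an attribute value of a candidate exceeds it. For example, a limit on the number of nils or a limit on the r.m.s.d. value can be used as this threshold.
[0064]
(a) Limit on the number of nils
If the total number of nils in a generated combination becomes too large, meaningless candidates are generated as a result. Therefore, when the elements of point set A and point set B are associated, a combination is excluded from the candidates once its total number of nils reaches a certain threshold, which avoids generating useless candidates and makes the association efficient.
[0065]
FIG. 13 shows an example of pruning when the total number of nils is limited to zero in associating the point sets A = {a₁, a₂, a₃} and B = {b₁, b₂, b₃, b₄}. In the figure, the parts of the tree structure marked with × are the parts that are pruned.
[0066]
(b) Limit on the r.m.s.d. value
If associating element b_j of point set B with element a_i of point set A makes the r.m.s.d. value over all points associated so far extremely poor, it is desirable to exclude that point from the candidates. Therefore, the r.m.s.d. value over all points when b_j is associated with a_i is calculated; the point is kept as a candidate if the value is at or below a certain threshold and excluded otherwise, so that the candidate points are generated efficiently.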
【００６７】
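(c) Limit on the number of discontinuous sites
If the number of places in a generated combination where the amino acid residue numbers are not consecutive (the number of discontinuous sites) becomes too large, meaningless candidates are generated as a result. Therefore, when the elements of point set A and point set B are associated, a combination is excluded from the candidates when its number of discontinuous sites reaches or exceeds a certain threshold, which avoids generating useless candidates and makes the association efficient. For example, when the number of discontinuous sites is limited to at most 3, b₁ b₃ b₄ b₇ b₁₀ contains four discontinuous sites and is therefore pruned when b₁₀ is associated.
[0068]
(d) Limit on the number of residues per site
If the number of amino acid residues that make up a site (the site residue count) in a generated combination becomes too small, meaningless candidates are generated as a result. Therefore, when the elements of point set A and point set B are associated, a combination is excluded from the candidates when its site residue count falls to or below a certain threshold, which avoids generating useless candidates and makes the association efficient. For example, when the site residue count is limited to 2, in b₁ b₂ b₃ b₅ b₁₀ the element b₅ forms a site of a single residue, so the combination is pruned when any element other than b₆ (here b₁₀) is associated.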
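The two site-based limits in (c) and (d) can be illustrated with the sketch below, which works on the list of residue numbers of point set B associated so far. The function names, and the reading that only sites already closed by a later gap are checked against the minimum length, are assumptions drawn from the two worked examples in the text.

```python
def split_into_sites(indices):
    """Group increasing residue numbers into runs of consecutive numbers."""
    sites = []
    for idx in indices:
        if sites and idx == sites[-1][-1] + 1:
            sites[-1].append(idx)   # continues the current site
        else:
            sites.append([idx])     # a gap starts a new site
    return sites


def should_prune(indices, max_sites, min_site_len):
    """Return True when the partial association violates either limit.

    The last site is still open (it may grow), so only the sites that are
    already closed are checked against min_site_len.
    """
    sites = split_into_sites(indices)
    if len(sites) > max_sites:
        return True
    return any(len(site) < min_site_len for site in sites[:-1])
```

With these definitions `split_into_sites([1, 3, 4, 7, 10])` gives four sites, so the combination is pruned under a limit of three, and `[1, 2, 3, 5, 10]` is pruned under a minimum site length of two because residue 5 is left as a one-residue site.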
【００６９】
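(6) Narrowing candidates by point attributes
When element b_j of point set B is associated with element a_i of point set A, the candidate points can be narrowed by using the attributes of the points. Examples of point attributes include the type of atom, atomic group, or molecule, hydrophilicity or hydrophobicity, and the sign of the electric charge. Whether a point is added to the candidates is decided by checking whether these attributes match.
[0070]
For example, when the constituent elements of proteins are associated, the candidates can be narrowed by using the type of amino acid residue (which corresponds to an atomic group) as the point attribute. For the types of amino acid residues, see reference material such as "Fundamentals of Biochemistry" (生化学の基礎, Tokyo Kagaku Dojin), pp. 21-26.
[0071]
The candidate points can likewise be narrowed by placing restrictions on particular elements. For example, a restriction that nil may not be inserted at a certain point, or the specification of a point attribute for a certain point, allows finer-grained narrowing of what is searched for.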
【００７２】
次に，物質の立体構造としてタンパク質を題材とした例を説明する。ただし，対象は基本的には立体座標であれば特に限定されないため，一般の分子構造についても同様の手法で適用することができる。
【００７３】
(1) 分子構造を重ね合わせる処理装置の例
物質の性質を調べるときには，各分子同士を重ね合わせてお互いに共通な部分や特異的な部分を判別することによって，各物質の性質を分析したり，あるいは予測したりすることができる。これらの作業は，従来人手で行われているため，各分子構造を自動的に重ね合わせて表示する装置が必要とされている。
【００７４】
それを実現する分子構造重ね合わせ表示装置のシステムは，図１に示す装置構成のうち，物質の立体構造に関する情報を登録したデータベース２１，登録されたデータおよびユーザからの入力コマンドを読み込むデータ入力処理手段１７，重ね合わせる立体構造の要素を対応付ける組合せ生成処理手段１４，データベース２１から読み込んだ物質の立体構造（三次元座標）に基づいて，r.m.s.d.値が最小となるように立体構造同士を重ね合わせる重ね合わせ算出処理手段１５および算出された結果に基づいて立体構造同士を重ね合わせて表示する結果出力処理手段１８等を用いて構成される。
【００７５】
(a) データベース２１
物質の立体構造に関する情報を格納したデータベース２１である。データベース２１には，物質の名称，物質を構成する原子の三次元座標等が格納されている。
【００７６】
(b) データ入力処理手段１７
データ入力処理手段１７では，ユーザの入力コマンドに基づいて重ね合わせる物質のデータ（三次元座標）をデータベース２１から読み込み，組合せ生成処理手段１４，重ね合わせ算出処理手段１５へ送る処理を行う。
【００７７】
(c) 組合せ生成処理手段１４
各要素を対応付ける際に，幾何学的な関係による候補の絞り込み，所定のしきい値条件による候補の絞り込み，または点の属性による候補の絞り込みを行い，これらの幾何学的制約，しきい値条件または点の属性を満たす要素の組合せを生成する。
【００７８】
ここでの対応付けでは，タンパク質を構成するアミノ酸配列順序に基づいて空間的に類似な部分を対応付ける機能とアミノ酸配列順序とは無関係に空間的に類似な部分を対応付ける機能を提供する。アミノ酸配列順序に基づいて空間的に類似な部分を検索する場合には，タンパク質を構成する各アミノ酸をアミノ酸配列番号により順序付けられた順序集合として捉えることができ，対応付け処理の説明における(2) 〜(6) で述べた方式で類似な部分を算出することができる。また，各アミノ酸を単なる集合として捉えることによって，上記(1) ，(3) 〜(6) で述べた方式により，アミノ酸配列順序とは無関係に空間的に類似な部分を算出することができる。
【００７９】
(d) 重ね合わせ算出処理手段１５
重ね合わせ算出処理手段１５では，物質の立体構造（三次元座標）について，r.m.s.d.値が最適な値をとるように物質を構成する要素間の対応付けを行い，その座標変換情報を求めて，結果を結果出力処理手段１８へ送る。
【００８０】
(e) 結果出力処理手段１８
結果出力処理手段１８によるグラフィック表示では，重ね合わせ算出処理手段１５で算出された結果に基づいて物質の立体構造を重ね合わせて表示する。表示結果を回転させながら見ることができるようにすることによって，どの部分がどのように重なっているかを三次元グラフィックにより判別することを，さらに容易化することができる。
【００８１】
タンパク質Calmodulinのアミノ酸残基配列を図１４（Ａ）に，TroponinＣのアミノ酸残基配列を図１４（Ｂ）に示す。図１４（Ｃ）はTroponinＣの立体構造を示している。なお，図１４（Ａ），（Ｂ）は，プロテイン・データ・バンク（ＰＤＢ）に登録されているアミノ酸残基配列を抜粋したものである。
【００８２】
図１４（Ａ）に示したアミノ酸残基配列は，本来のアミノ酸残基配列に対してアミノ酸配列番号１−４，１４８番に相当するアミノ酸が欠落しているため番号がずれているが，以降の説明では，図に示したアミノ酸配列番号を使用することにする。
【００８３】
Calmodulinは生化学的な実験結果より，４個のＣａ²⁺を結合することが知られている。Ｃａ²⁺の結合部位はアミノ酸配列中に４箇所（サイト）あり，その中でアミノ酸配列番号８１−１０８，１１７−１４３は，TroponinＣ中に２箇所あるＣａ²⁺の結合部位と各々似た骨格を取ることが知られている。タンパク質はアミノ酸で構成されるが，その骨格は各アミノ酸を構成する一つの原子（Ｃα）の座標で代表できることが知られている。
【００８４】
そこで，CalmodulinのＣａ²⁺の結合部位８１−１０８を検索キー（検索キーをプローブともいう）として，アミノ酸配列順序に基づいて空間的に類似な部分（単一サイト）を検索した結果を，図１５に示す。図１５の結果から，CalmodulinのＣａ²⁺の結合部位８１−１０８に対応付けられたTroponinＣ中のアミノ酸配列番号は９６−１２３番であることがわかる。本結果は，生化学的な実験結果と一致するものである。
【００８５】
図１６にCalmodulinのＣａ²⁺の結合部位８１−１０８と１１７−１４３とをプローブとしてアミノ酸配列順序に基づいて空間的に類似な部分（複数サイト）を検索した結果を示す。図１６から，CalmodulinのＣａ²⁺の結合部位８１−１０８，１１７−１４３に対応付けられたTroponinＣ中のアミノ酸配列番号は，各々９６−１２３，１３２−１５８であることがわかる。本結果も同様に生化学的な実験結果と一致するものである。
【００８６】
このように，本装置を使用すれば，物質間の立体構造のr.m.s.d.値が小さくなるような物質の構成要素間の対応付けがなされるため，対応付けられた部分を重ね合わせて表示することによって，物質の最適な重ね合わせ表示を実現することができる。
【００８７】
(2) 立体構造検索装置，機能データベース生成装置の例
新薬の開発のように新しい機能を持つ物質を開発したり，既に存在する物質の機能強化を図るためには，物質の機能と構造の相関関係を解明することが不可欠である。このような作業を進めるに当たって，類似な立体構造を持つ物質を多数参照する必要が生じる。そのためには，データベースから立体構造が類似した物質を簡単に取り出せる立体構造検索装置が必要になる（また，このような装置があると機能と関係する立体構造を収集した機能データベースを作成することができる。機能データベースについては次の(3) で後述する）。
【００８８】
それを実現する立体構造検索装置は，図１に示す装置構成のうち，物質の立体構造に関する情報を登録したデータベース２１，登録されたデータおよびユーザからの入力コマンドを読み込むデータ入力処理手段１７，比較する立体構造の要素を対応づける組合せ生成処理手段１４，データベース２１から読み込んだ物質の立体構造（三次元座標）について，r.m.s.d.値が最小となる類似な構造を検索する類似性算出処理手段１６，および検索結果を表示する結果出力処理手段１８等を用いて構成される。
【００８９】
同様に機能データベース生成装置は，データベース２１，データ入力処理手段１７，組合せ生成処理手段１４，類似性算出処理手段１６および機能データベース生成処理手段１９等を用いて構成される。
【００９０】
(a) データベース２１
物質の立体構造に関する情報を格納したデータベースである。データベースには，物質の名称，物質を構成する原子の三次元座標等が格納されている。
【００９１】
(b) データ入力処理手段１７
立体構造のデータおよびユーザからの入力コマンドに基づいて検索のキーとなる立体構造および検索時に参照するデータベース２１に登録されている立体構造のデータを読み込み，組合せ生成処理手段１４および類似性算出処理手段１６へ送る。
【００９２】
(c) 組合せ生成処理手段１４および類似性算出処理手段１６
類似性判定のために各要素を対応付ける際に，幾何学的な関係による候補の絞り込み，所定のしきい値条件による候補の絞り込み，または点の属性による候補の絞り込みを行い，これらの幾何学的制約，しきい値条件または点の属性を満たす要素の組合せを生成する。この際，タンパク質を構成するアミノ酸配列順序に基づいて空間的に類似な部分を検索する機能とアミノ酸配列順序とは無関係に空間的に類似な部分を検索する機能を提供する。アミノ酸配列順序に基づいて空間的に類似な部分を検索する場合には，タンパク質を構成する各アミノ酸をアミノ酸配列番号により順序付けられた順序集合として捉えることができ，対応付け処理の説明における(2) 〜(6) で述べた方式で類似な部分を算出することができる。また，各アミノ酸を単なる集合として捉えることによって，上記(1) ，(3) 〜(6) で述べた方式により，アミノ酸配列順序とは無関係に空間的に類似な部分を算出することができる。これにより類似性算出処理手段１６は，類似性の高い立体構造をデータベース２１中から選出する。
【００９３】
(d) 結果出力処理手段１８
結果出力処理手段１８では，類似性算出処理手段１６の結果に基づいて類似な部分をアミノ酸配列名およびアミノ酸番号で表し，類似性の尺度としてr.m.s.d.値を表示する。
【００９４】
(e) 機能データベース生成処理手段１９
また，機能データベース生成装置の場合，機能データベース生成処理手段１９は，類似性算出処理手段１６の結果に基づいて，プローブの立体構造と類似する立体構造を，プローブの立体構造が発現する機能と同じ機能を持つものとする。そして，その情報を機能データベース２４に格納する。
【００９５】
タンパク質elongation factor のＧＴＰ（グアノシン３リン酸）のリン酸結合部位であるアミノ酸残基番号７から１４に対応するＣαの座標をプローブとしてＰＤＢから類似な立体構造の検索を行った結果を，図１７に示す。
【００９６】
この例では，プロテイン・データ・バンク（ＰＤＢ）に登録されているデータ９０５件中でタンパク質の立体構造７４４件を検索の対象とした。検索結果として，図１７に示すように，検索したターゲットタンパク質のアミノ酸残基番号，アミノ酸残基配列，プローブのアミノ酸残基配列およびターゲットとプローブの立体構造のr.m.s.d.値等の情報が出力される。途中の出力を図示省略するが，検索の結果，８個の立体構造が検索されている（プローブ自身を含む）。
【００９７】
タンパク質の種類別にみると，adenylate kinaseが３件，elongation factor が２件（内１件はプローブ自身），ras protein が３件であり，いずれもＡＴＰまたはＧＴＰのリン酸結合部位であった。これから，ＡＴＰまたはＧＴＰのリン酸との結合という機能と立体構造は非常に密接な関係にあり，かつ他のリン酸結合部位とは無関係な構造と偶然一致することがないため非常に特異的な構造を持っていることがわかる。
【００９８】
このように，本装置を使用すれば，プローブ（検索キー）となる物質の立体構造を指定することによって，物質の立体構造を格納したデータベース２１から類似構造を検索することができる。また，この検索結果を利用し，機能データベース２４を自動生成することができる。
【００９９】
(3) 機能予測装置
図１７に示した結果から推察されるように，タンパク質は，ある機能を発現するためにその機能に特異的な立体構造を持っていると考えられている。したがって，機能ごとにその機能に特異的な立体構造のデータベース（図１に示す機能データベース２４）があると，新しくＸ線結晶解析やＮＭＲ等の手法で物質の立体構造が決定された際に，その機能データベース２４に登録されている構造が立体構造内にあるかを調べることによって，その物質がどのような機能を持っており，その機能は立体構造中のどの部分（以下機能部位とよぶ）によって司られているかを予測することができる。
【０１００】
それを実現する機能予測装置は，図１に示す装置構成のうち，データ入力処理手段１７，機能と関連した立体構造を登録した機能データベース２４，これから読み込んだ立体構造の要素と，比較する立体構造の要素とを対応付ける組合せ生成処理手段１４，類似性を調べる類似性算出処理手段１６，その結果から機能と機能部位を予測する機能予測処理手段２０等を用いて構成される。
【０１０１】
(a) データ入力処理手段１７
物質を構成する立体構造のデータを読み込み，組合せ生成処理手段１４へ送る。
【０１０２】
(b) 機能データベース２４
物質の機能とその機能に特異的な立体構造に関する情報を格納したデータベースである。機能データベース２４には，機能の名称，機能に特異的な立体構造を構成する原子の三次元座標等が格納されている。
【０１０３】
(c) 組合せ生成処理手段１４および類似性算出処理手段１６
ここでは，機能データベース２４に登録されている立体構造と入力された立体構造間の最適な重ね合わせを算出する。この際，タンパク質を構成するアミノ酸配列順序に基づいて空間的に類似な部分を検索する機能とアミノ酸配列順序とは無関係に空間的に類似な部分を検索する機能を提供する。アミノ酸配列順序に基づいて空間的に類似な部分を検索する場合には，タンパク質を構成する各アミノ酸をアミノ酸配列番号により順序付けられた順序集合として捉えることができ，対応付け処理の説明における(2) 〜(6) で述べた方式で類似な部分を算出することができる。また，各アミノ酸を単なる集合として捉えることによって，上記(1) ，(3) 〜(6) で述べた方式により，アミノ酸配列順序とは無関係に空間的に類似な部分を算出することができる。
【０１０４】
(d) 機能予測処理手段２０
機能予測処理手段２０では，類似性算出処理手段１６の結果に基づいて機能データベース２４に登録されている機能名，機能部位のアミノ酸配列名およびアミノ酸残基番号で表し，類似性の尺度としてr.m.s.d.値を表示する。
【０１０５】
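In the three-dimensional structure search system described in the embodiments above, the association of elements between the point sets that form three-dimensional structures was automated by associating the points in each set one after another. In this method, nil is associated when there is no point to be associated, so that an optimal association can be generated even when the point sets being compared have different numbers of elements. The best association is then produced by finding, among the generated combinations, an association in which the r.m.s.d. value between the points is smaller than a predetermined threshold and the distance between each pair of associated points is also smaller than a predetermined threshold.
[0106]
If the combinatorial search is not pruned, then with m and n elements in the respective point sets, mⁿ combinations are generated, which makes the problem effectively intractable. For that reason, these associations take into account geometric relationships such as the distances between elements of each point set, threshold conditions such as the r.m.s.d. value and the number of nils, and element attributes (for proteins, the type of amino acid and so on) in order to generate optimal combinations. Even so, it is difficult to establish a method that is always fast under every condition.
[0107]
However, the amino acids that make up a protein carry amino acid residue numbers running from the N-terminus to the C-terminus and can therefore be treated as an ordered point set. Moreover, in analyses of protein function and the like, the three-dimensional structure used for similar-structure extraction or as a search key is often contiguous in the amino acid sequence (that is, it consists of consecutive elements of the point set). For such contiguous structures, performing the association with the element order of point sets A and B taken into account makes the processing still faster.
[0108]
In this embodiment, attention is paid to the order of the elements constituting point sets A and B when elements are associated between the point sets A and B that form three-dimensional structures. In addition, as described above, combinations are generated efficiently by taking into account geometric relationships such as the distances between the point sets, threshold conditions such as the r.m.s.d. value and the number of nils, or element attributes. Furthermore, even when the structure serving as the search key consists of multiple sites, provided the elements within each site are contiguous, the multiple sites as a whole are associated by associating one site at a time, starting from the first site. The means for obtaining the subsets that are candidates for association are described below.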
【０１０９】
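(1) Association following the element order of ordered point sets A and B
FIG. 18 illustrates an example in which the number of elements m of point set A is obtained and m elements at a time are selected from point set B in element order and associated with the elements of point set A. Point sets A and B are ordered point sets whose elements are A = {a₁, …, a_m} and B = {b₁, …, b_i, …, b_n}.
[0110]
FIG. 19 shows an algorithm for association following the element order of ordered point sets A and B.
The following processing is performed on the ordered point sets A = {a₁, …, a_m} and B = {b₁, …, b_i, …, b_n}.
[0111]
Step 1: Obtain the number of elements m of point set A.
Step 2: Set i = 1.
Step 3: Select m elements of point set B in order, starting from the i-th element, and associate them with the elements of point set A.
[0112]
Step 4: Terminate when the last element of point set B has been associated. If the last element of point set B has not yet been associated, set i = i + 1 and return to Step 3.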
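Steps 1 to 4 amount to sliding a window of m consecutive elements along point set B. A minimal sketch, in which the function name `sliding_windows` and the use of Python lists for the point sets are assumptions:

```python
def sliding_windows(a, b):
    """Yield one association per window of len(a) consecutive elements of b.

    Each association is a list of (a_element, b_element) pairs. The loop
    ends with the window that contains the last element of b, as in Step 4.
    """
    m = len(a)                           # Step 1
    for i in range(len(b) - m + 1):      # Steps 2 and 4 (0-based index i)
        yield list(zip(a, b[i:i + m]))   # Step 3


if __name__ == "__main__":
    for pairs in sliding_windows(["a1", "a2", "a3"], ["b1", "b2", "b3", "b4", "b5"]):
        print(pairs)
```

This produces only n − m + 1 candidate associations, compared with the much larger counts of the tree enumerations, at the cost of requiring the matched elements of B to be contiguous.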
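(2) Association of ordered point sets A and B considering element order and constraint conditions
The number of elements m of point set A is obtained, and m elements at a time are selected from point set B in element order and associated with the elements of point set A. In doing so, the optimal association is made by taking into account the geometric relationships within each point set, the threshold conditions, and the point attributes described above.
[0113]
FIG. 20(A) shows an algorithm for association of ordered point sets A and B that considers element order and constraint conditions, and FIG. 20(B) shows the algorithm of the association processing.
[0114]
In FIG. 20(A), for the ordered point sets A = {a₁, …, a_m} and B = {b₁, …, b_i, …, b_n}, the number of elements m of point set A is obtained and, as a constraint condition, nilmax is set to the threshold on the total number of nils. The initial settings are i = 1 and nil = 0. The association processing shown in FIG. 20(B) is then called. This association processing performs the following steps.
[0115]
Step 1: Set k = i.
Step 2: Select elements of point set B one at a time, starting from the i-th element (point b_k). If point b_k satisfies the constraint conditions, associate b_k with an element of point set A. If it does not, associate it with nil and set nil = nil + 1.
[0116]
Step 3: Repeat Step 2 with k = k + 1 until m elements have been associated or nil exceeds the value of nilmax.
Step 4: Terminate when the last element of point set B has been selected. If the last element of point set B has not yet been selected, set i = i + 1 and return to Step 1.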
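One possible reading of Steps 1 to 4 is sketched below. Several points are assumptions rather than statements of the text: that "associating with nil" means skipping b_k, that the nil counter is reset for each start position i, and that the constraint conditions are supplied as a caller-defined predicate `satisfies(a_elem, b_elem, pairs)`. The function names are hypothetical.

```python
def match_from(a, b, start, nil_max, satisfies):
    """Scan b from index start and try to associate every element of a.

    Returns a list of (a_index, b_index) pairs, or None if more than
    nil_max elements of b had to be skipped or b ran out first.
    """
    pairs = []
    nil_count = 0
    k = start                                      # Step 1
    while len(pairs) < len(a) and k < len(b):      # Step 3
        if satisfies(a[len(pairs)], b[k], pairs):  # Step 2
            pairs.append((len(pairs), k))
        else:
            nil_count += 1                         # b[k] is paired with nil
            if nil_count > nil_max:
                return None
        k += 1
    return pairs if len(pairs) == len(a) else None


def find_matches(a, b, nil_max, satisfies):
    """Try every start position in b (Step 4) and collect distinct results."""
    results = []
    for start in range(len(b)):
        pairs = match_from(a, b, start, nil_max, satisfies)
        if pairs is not None and pairs not in results:
            results.append(pairs)
    return results
```

In a real search the predicate would combine the distance, angle, and attribute checks; comparing residue letters, as in `lambda x, y, pairs: x == y`, is enough to exercise the control flow.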
【０１１７】
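(3) Association of ordered point sets A and B considering element order and constraint conditions (multiple sites)
When point set A, the structure serving as the key, consists of multiple sites as shown in FIG. 21(A), site 1 is first associated by the method described in (2) above, as shown in FIG. 21(B). If that association succeeds, site 2 is associated by the method described in (2) above, as shown in FIG. 21(C), using only the elements of the target point set B that come after the elements associated with site 1. Processing continues in the same way up to the last site.
[0118]
FIG. 22 shows an algorithm for associating multiple sites of ordered point sets A and B considering element order and constraint conditions.
The following processing is performed on the ordered point sets A = {a₁, …, a_m} and B = {b₁, …, b_i, …, b_n}.
[0119]
Step 1: Obtain the total number of sites in point set A. Set
sitemax = total number of sites,
nilmax = threshold on the total number of nils,
start = 1, n = 1.
[0120]
Step 2: Perform the association processing shown in FIG. 20(B).
Step 3: If the association of the n-th site fails, terminate.
Step 4: If all sites have been associated or nil exceeds the value of nilmax, terminate.
[0121]
Step 5: Take the element following the position at which the n-th site was associated as the start position (start), set n = n + 1, and repeat from Step 2.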
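The site-by-site loop of Steps 1 to 5 can be sketched as below. The single-site matcher is passed in as a function so that the loop itself stays visible; `first_exact_run` is a deliberately simple, hypothetical stand-in for the constrained association of FIG. 20(B), and the global nil counter of Step 4 is omitted from this sketch. Sites are assumed to be non-empty lists.

```python
def match_sites(sites, b, match_one):
    """Associate the probe sites with b one site at a time, in order.

    match_one(site, b, start) must return the list of indices of b that
    were associated with the site, or None on failure.
    """
    start = 0                           # Step 1 (0-based start position)
    result = []
    for site in sites:
        hit = match_one(site, b, start)  # Step 2
        if hit is None:
            return None                 # Step 3: this site failed
        result.append(hit)
        start = hit[-1] + 1             # Step 5: continue after the match
    return result                       # Step 4: all sites associated


def first_exact_run(site, b, start):
    """Hypothetical matcher: first contiguous run in b[start:] equal to site."""
    m = len(site)
    for i in range(start, len(b) - m + 1):
        if b[i:i + m] == site:
            return list(range(i, i + m))
    return None
```

Because each site is searched only after the previous match, the order of the sites in the probe is preserved in the target, which is what restricts the search space.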
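A system according to this embodiment for superimposing and displaying the three-dimensional structures of substances can be realized with the same configuration as in the embodiment described first. The difference lies in the association method used by the combination generation processing means 14 shown in FIG. 1.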
【０１２２】
始めに，１つのサイトからなるプローブ（検索キー）の具体例を説明する。タンパク質Trypsin のアミノ酸（残基）配列を図２３（Ａ）に，Elastaseのアミノ酸配列を図２３（Ｂ）に示す。図２３はＰＤＢに登録されているアミノ酸配列を抜粋したものである。図２３に示したアミノ酸配列番号は，ＰＤＢに記載されているアミノ酸に対して単純に１から番号をふったものであるので，本来のアミノ酸番号とは異なっているが，以降の説明では図に示したアミノ酸番号を使用することにする。
【０１２３】
図２３に示したTrypsin ，Elastaseはセリンプロテアーゼと呼ばれるタンパク質分解酵素の仲間で，活性部位にヒスチジン，セリン，アスパラギン酸が不可欠な酵素である。これらの酵素の基質特異性は全く異なるが，構造および触媒機構などの点で類似していることから進化的には一群の酵素であろうと考えられている。
【０１２４】
図２４（Ａ）に，Trypsin のヒスチジン活性部位（３６−４１）をプローブとして，Elastaseのヒスチジン活性部位を検索した結果を示す。この結果によると，Trypsin の活性部位３６−４１に対して，Elastaseの４１−４６が対応付けられたことがわかる。
【０１２５】
また，図２４（Ｂ）に，Trypsin のセリン活性部位（１７５−１７９）をプローブとして，Elastaseのセリン活性部位を検索した結果を示す。この結果によると，Trypsin の活性部位１７５−１７９に対して，Elastaseの１８６−１９０が対応付けられたことがわかる。これらの結果は，生化学的な実験で得られた結果と合致するものである。
【０１２６】
次に，複数のサイトからなるプローブ（検索キー）の具体例として，タンパク質elongation factor のＧＴＰ（グアノシン３リン酸）結合部位であるアミノ酸残基番号７−１４，４８−５１，１０３−１０６の３サイトに対応する構造をプローブとして，ras protein から類似構造を検索した結果を図２５に示す。
【０１２７】
ras protein もＧＴＰを結合することが知られており，図２６に示すように，elongation factor のサイト１（７−１４）に対してras protein の１０−１７残基が，サイト２（４８−５１）に対して５７−６０残基が，サイト３（１０３−１０６）に対して１１１−１１４残基が対応付けられたことがわかる。これらの結果は，生化学的な実験で得られた結果と合致するものである。
【０１２８】
このように，本装置によってプローブとなる物質の立体構造を指定することによって，物質の立体構造を格納したデータベースからr.m.s.d.値が最小となる対応付けが算出できるため，対応付けられた部分を重ね合わせて表示することによって物質の最適な重ね合わせ表示を実現することができる。
【０１２９】
例えば，新薬の開発のように求める機能を持つ物質を開発する場合などには，物質の機能と構造の相関関係を解明するため，類似な立体構造を持つ物質を多数参照する必要が生じる。そのため，データベースから立体構造が類似した物質を簡単に取り出せる立体構造の検索システムが必要になるが，前述した例と同様に高速な検索システムを構成することができる。
【０１３０】
先に説明した実施例の手法により，組合せ生成処理手段１４によって効率よく要素の組合せを生成し，類似性算出処理手段１６によって類似する立体構造をデータベース２１から抽出する。そして，結果出力処理手段１８により，類似性算出処理手段１６の結果に基づいて，類似な部分をアミノ酸配列名，およびアミノ酸番号で表し，類似性の尺度としてr.m.s.d.値を表示する。
【０１３１】
このように，本装置によってプローブ（検索キー）となる物質の立体構造を指定することによって，物質の立体構造を格納したデータベース２１からr.m.s.d.値が最小となる類似構造を検索することができる。
【０１３２】
【発明の効果】
以上説明したように，本発明によれば，立体構造を形成する点同士を自動的に重ね合わせることができるため，グラフィックシステムにより立体構造を自動的に重ね合わせて表示することや，立体構造が類似した構造をデータベースから自動的に検索することが可能になる。また，機能データベースを生成し，新しい物質の機能を予測することが可能になる。したがって，新薬の開発やタンパク質の研究等に寄与するところが大きい。
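[Brief description of the drawings]
[FIG. 1] A diagram explaining the principle of the present invention.
[FIG. 2] A diagram showing the method of calculating the r.m.s.d. value relevant to the present invention.
[FIG. 3] A diagram explaining the association of unordered point sets A and B.
[FIG. 4] A diagram showing an example of an algorithm for associating unordered point sets A and B.
[FIG. 5] A diagram explaining the association of ordered point sets A and B.
[FIG. 6] A diagram showing an example of an algorithm for associating ordered point sets A and B.
[FIG. 7] A diagram explaining the association of partially associated point sets A and B.
[FIG. 8] A diagram showing the relationship between two points to be associated.
[FIG. 9] A diagram explaining the distance between two points (d_a ≥ 2 × Emax).
[FIG. 10] A diagram explaining the distance between two points (d_a < 2 × Emax).
[FIG. 11] A diagram explaining the narrowing of candidates by distance relationships among multiple points.
[FIG. 12] A diagram explaining the narrowing of candidates by geometric constraints.
[FIG. 13] A diagram explaining the association of point sets A and B with a limited number of nils.
[FIG. 14] A diagram showing examples of proteins.
[FIG. 15] A diagram showing an example of search results according to an embodiment of the present invention.
[FIG. 16] A diagram showing an example of search results according to an embodiment of the present invention.
[FIG. 17] A diagram showing an example of three-dimensional structure search results.
[FIG. 18] A diagram explaining association following the element order of ordered point sets A and B.
[FIG. 19] A diagram showing an example of an algorithm for association following the element order of ordered point sets A and B.
[FIG. 20] A diagram showing an example of an algorithm for association of ordered point sets A and B considering element order and constraint conditions.
[FIG. 21] A diagram explaining the association of multiple sites considering the element order of the point sets and constraint conditions.
[FIG. 22] A diagram showing an example of an algorithm for associating multiple sites considering the element order of the point sets and constraint conditions.
[FIG. 23] A diagram showing examples of amino acid sequences (single site).
[FIG. 24] A diagram showing an example of search results according to this embodiment (single site).
[FIG. 25] A diagram showing examples of amino acid sequences (multiple sites).
[FIG. 26] A diagram showing an example of search results according to this embodiment (multiple sites).
[Explanation of reference signs]
10 Processing device
11 Control unit
14 Combination generation processing means
15 Superposition calculation processing means
16 Similarity calculation processing means
17 Data input processing means
18 Result output processing means
19 Function database generation processing means
20 Function prediction processing means
21 Database
22 Input device
23 Display device
24 Function database

[0001]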
[Industrial application fields]
The present invention relates to a three-dimensional structure processing apparatus that superimposes two three-dimensional structures of a substance or extracts a similar portion.
[0002]
In physics and chemistry, research is conducted to analyze molecular structures and elucidate the mechanisms by which substances express their functions, in order to investigate the properties of new (unknown) substances and to create new substances artificially. Past research has shown that the function of a substance is closely related to its three-dimensional structure, and structurally similar parts (or distinctive parts) are thought to contribute greatly to that function. For this reason, the three-dimensional structures of substances are determined with X-ray crystallography equipment, NMR, and similar methods, and the resulting structures are being compiled into databases.
[0003]
For example, if similar parts of three-dimensional structures could be extracted from, or searched for in, such a database automatically by computer, the series of tasks that researchers have traditionally performed could be reduced.
[0004]
[Prior art]
In physics and chemistry, the three-dimensional structures of substances are determined by methods such as X-ray crystallography and NMR, and the resulting structural information is accumulated in databases, in order to investigate the properties of new (unknown) substances and to create new substances artificially. As a representative database, the Protein Data Bank (PDB), which registers three-dimensional structures of proteins and other molecules determined by X-ray crystallographic analysis, is widely known and used worldwide. The Cambridge Structural Database (CSD) is known as a database in which chemical substances are registered.
[0005]
A protein consists of multiple amino acids linked like a single chain; this chain folds in vivo to form a three-dimensional structure and express various functions. Each amino acid is identified by numbering from 1 in order from the N-terminus to the C-terminus. These numbers are called amino acid numbers, amino acid sequence numbers, or amino acid residue numbers. Each amino acid is composed of multiple atoms depending on its type. Accordingly, the PDB registers information such as the protein name, the management number, the amino acid numbers forming the protein, and the type and three-dimensional coordinates of each atom constituting each amino acid.
[0006]
Chemical research to date has shown that the three-dimensional structure of a substance is closely related to its function, and the relationship between structure and function is being elucidated through chemical experiments in order to modify substances and create substances with new functions. In particular, structurally similar parts (or distinctive parts) among substances with the same function are thought to contribute greatly to that function, so finding similar structures that occur in common within three-dimensional structures is indispensable.
[0007]
However, because there is no method for extracting characteristic parts directly from three-dimensional coordinates, researchers currently display each three-dimensional structure on a three-dimensional graphics system and look for characteristic parts by eye. In general, substances have no fixed convention for orientation, so characteristic parts must be found while rotating one substance relative to the other, and this work takes considerable time.
[0008]
When researchers search for similar three-dimensional structures, the r.m.s.d. (root mean square distance) value is used as a measure of structural similarity between substances. The r.m.s.d. value is the square root of the mean squared distance between the mutually associated constituent elements of the substances. Empirically, substances are considered very similar when the r.m.s.d. value between them is 1 Å or less.
[0009]
FIG. 2 shows a method for calculating the r.m.s.d. value.
For example, suppose there is a substance represented by a point set A = {a₁, a₂, …, a_i, …, a_m} as shown in FIG. 2(a) and a substance represented by a point set B = {b₁, b₂, …, b_j, …, b_n} as shown in (b). The elements constituting substances A and B are associated with each other as shown in (c), and substance B is rotated and translated so that the r.m.s.d. value between the associated elements is minimized. With N the number of associated points, U the rotation matrix, and w_n the weight of each pair, the r.m.s.d. value is given by

$$\mathrm{r.m.s.d.} = \left\{ \frac{\sum_{n=1}^{N} w_n \left( U b_n - a_n \right)^2}{N} \right\}^{1/2}$$

where the sum runs over n from 1 to N. The method for finding the rotation and translation of a substance that minimizes the r.m.s.d. value between associated points was proposed by Kabsch and others and is now widely used. However, because this method compares equal numbers of points, researchers currently determine by trial and error which constituent element of one substance should be associated with which constituent element of the other in order to minimize the r.m.s.d. value.
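As a concrete reading of the formula, the sketch below evaluates the weighted r.m.s.d. for a given rotation matrix U. It does not search for the optimal U (that is the role of the Kabsch-type method mentioned above), it assumes any translation has already been applied to the coordinates, and the function and constant names are chosen here for illustration.

```python
import math


def rmsd(a_points, b_points, rotation, weights=None):
    """Weighted r.m.s.d. between points a_n and rotated points U b_n.

    a_points, b_points: equally long lists of (x, y, z) tuples.
    rotation: 3x3 rotation matrix U given as a list of rows.
    weights: optional list of w_n (defaults to 1 for every pair).
    """
    n = len(a_points)
    if n == 0 or n != len(b_points):
        raise ValueError("point lists must be non-empty and equally long")
    if weights is None:
        weights = [1.0] * n
    total = 0.0
    for a, b, w in zip(a_points, b_points, weights):
        # Apply U to b, then accumulate the weighted squared distance to a.
        ub = [sum(rotation[r][c] * b[c] for c in range(3)) for r in range(3)]
        total += w * sum((ub[k] - a[k]) ** 2 for k in range(3))
    return math.sqrt(total / n)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Rotation by 90 degrees about the z axis: (x, y, z) -> (-y, x, z).
ROT_Z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

For instance, the points (1, 0, 0) and (0, 1, 0) rotated by `ROT_Z_90` coincide with (0, 1, 0) and (−1, 0, 0), so the r.m.s.d. against those targets is zero.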
[0010]
Also, creating a new substance requires examining existing substances. For example, to enhance the heat resistance of a substance, one searches for a structure common to highly heat-resistant substances and adds that structure to the newly created substance to strengthen the function. This requires a function for retrieving the needed structure from a database, but for the reasons given above, researchers currently find the needed structures in the database by trial and error using a computer graphics system.
[0011]
As described above, the operator has conventionally had to display the three-dimensional structure of the substance to be analyzed graphically using a graphics system and analyze it by hand through operations such as visual comparison and superposition with other molecules on the screen.
[0012]
It is therefore conceivable to perform this series of processes automatically on a computer. However, even when similar structures are extracted by associating the elements of the two point sets that form the two target three-dimensional structures, if the point sets have m and n elements respectively, the number of combinations generated for comparison is mⁿ, an enormous number that cannot be computed in practice.
[0013]
Furthermore, in generating these associations, optimal combinations can be generated by considering geometric relationships such as the distances between elements of each point set, threshold conditions such as the r.m.s.d. value and the number of nils, and element attributes (for proteins, the type of amino acid and so on). Even so, depending on the shape of the three-dimensional structure, the number of elements in the point sets, and settings such as geometric constraints and thresholds, the calculation time can become very long.
[0014]
[Problems to be solved by the invention]
As described above, when the three-dimensional structure of a substance is analyzed using CSD or PDB, searching a large amount of data and comparing structures has been done manually or with very long processing times. This has required considerable time and labor and placed a heavy burden on workers. As a result, the data in the database could not be used effectively, and the structure of substances could not be analyzed sufficiently. There is therefore a need for a fast, efficient, and practical search system for three-dimensional structure databases based on the similarity of three-dimensional structures.
[0015]
To solve the above problems, the present invention narrows down candidates by considering the order of the elements constituting each point set and the attributes of the points when associating elements between the point sets that form the target three-dimensional structures. Its purpose is to provide a system that automatically extracts partial structures with similar three-dimensional structures, superimposes multiple three-dimensional structures, and retrieves similar structures from databases. Such a system enables efficient structure comparison and similar-structure search in the structural analysis of chemical substances and proteins, reduces the time required for the work and the burden on the operator, and makes it possible to analyze the three-dimensional structures of all the data, which could not be done before.
[0016]
That is, an object of the present invention is to enable automation of the superposition display of a three-dimensional structure in computer graphics by automatically extracting similar parts in the three-dimensional structure by a computer.
[0017]
A further object of the present invention is to provide a substance analysis and retrieval system that makes it possible to retrieve similar three-dimensional structures from a database, reduces the time and personnel cost required for each task, and makes each task more efficient.
[0018]
Another object of the present invention is to provide means for automatically generating a function database by searching a database in which the three-dimensional structures of substances are stored for structures similar to a three-dimensional structure expressing a certain known function.
[0019]
Another object of the present invention is to provide means for predicting an unknown function of a certain three-dimensional structure based on a function database in which data relating to a three-dimensional structure related to the function of a substance is stored.
[0020]
It is another object of the present invention to provide a system that speeds up the above processing by excluding useless or meaningless candidates from combination candidates, and performs processing in a practically short time.
[0021]
[Means for Solving the Problems]
FIG. 1 is a diagram illustrating the principle of the present invention.
In FIG. 1, 10 is a processing device comprising a CPU and a memory; 11 is a control unit such as an operating system; 14 is combination generation processing means for generating combinations that associate the elements of two sets based on predetermined constraints; 15 is overlay calculation processing means for superimposing two substances; 16 is similarity calculation processing means for examining the similarity between two substances; 17 is data input processing means for inputting data on substances and the like; 18 is result output processing means for outputting search results and the like; 19 is function database generation processing means for automatically generating a function database; 20 is function prediction processing means for predicting an unknown function of a specified substance; 21 is a database in which data relating to the three-dimensional structure of substances is registered; 22 is an input device such as a keyboard or pointing device; 23 is a display device; and 24 is the function database.
[0022]
The combination generation processing means 14 is processing means that, when associating the elements of two sets of elements constituting three-dimensional structures such as proteins and various molecules, narrows down candidates by order, by geometric relationship, by predetermined threshold conditions, or by point attributes, and generates combinations of elements that satisfy these geometric constraints, threshold conditions, or point attributes.
[0023]
The overlay calculation processing means 15 is processing means that calculates the position and orientation at which the two three-dimensional structures best match. It does so by searching the combinations generated by the combination generation processing means 14 for a correspondence whose root mean square distance (hereinafter referred to as the r.m.s.d. value) is smaller than a predetermined threshold and in which the distance between associated elements is also less than a predetermined threshold (Emax).
[0024]
The result output processing means 18 is a processing means for outputting the calculated overlay result or a search result to be described later to the display device 23.
Data relating to the three-dimensional structure of a substance such as protein is registered in the database 21. The data input processing means 17 is a processing means for reading data registered in the database 21 and reading instruction information such as a command input by the user from the input device 22.
[0025]
The similarity calculation processing means 16 is processing means that, using the three-dimensional structure of a designated substance as a search key, finds spatially similar correspondences among the element combinations of two three-dimensional structures generated by the combination generation processing means 14, in order to search the database 21 for similar three-dimensional structures.
[0026]
The function database generation processing means 19 is processing means that, in order to extract and collect three-dimensional structures expressing a similar function on the basis of a three-dimensional structure expressing a specified function, collects the highly similar three-dimensional structures selected by the similarity calculation processing means 16 and stores the collected three-dimensional structure data by function, thereby generating the function database 24.
[0027]
The function prediction processing means 20 is processing means that collates the three-dimensional structure of a designated substance whose function is unknown with the three-dimensional structures of known function registered in the function database 24, predicts that the function expressed by a highly similar three-dimensional structure selected by the combination generation processing means 14 and the similarity calculation processing means 16 is the function expressed by the three-dimensional structure of the designated substance, and outputs the result.
[0028]
Further, the combination generation processing means 14 associates the elements of two ordered point sets A and B forming three-dimensional structures by associating the elements of point set B, in order from the head, with the elements of point set A.
[0029]
[Action]
According to the present invention, point sets forming three-dimensional structures, including ordered point sets and partially associated point sets, can be superimposed so that the r.m.s.d. value takes its optimum value. The superposition of three-dimensional structures, which has relied on manual work, can therefore be automated and work efficiency improved.
[0030]
Further, according to the present invention, the search for highly similar structures in a protein three-dimensional structure database is automated, which reduces the burden on the operator. As before, the results can be displayed graphically and compared with other proteins. Automating the work also makes it possible to analyze the three-dimensional structures of all the data, which could not be done in the past.
[0031]
In addition, a function database can be generated automatically, and function prediction can be automated by using the function database.
In particular, when the distance between the associated elements is larger than a predetermined threshold value, it is excluded from the combination candidates, so that the narrowing-down process is speeded up.
[0032]
Further, according to the present invention, attention is paid to the order of the elements of the point sets forming the three-dimensional structures, and the elements of one set are associated, in order from the head, with the elements of the other set while taking into account geometric constraints, threshold conditions, or point attributes. Even when the search key structure is composed of multiple sites (each a run of consecutive elements), if the elements are continuous within each site, the sites are matched one by one in order from the first site. Only point pairs for which the distance relationship between the two candidate point pairs is within a certain value are kept as candidates, and the candidates are narrowed down to those for which the r.m.s.d. value between the points in the generated combination is smaller than a predetermined threshold and the distance between associated points is smaller than a predetermined threshold (Emax). In addition, by taking into account the number of discontinuous sites and the number of residues constituting each site, unnecessary or meaningless candidates are excluded from the combination candidates, so that the processing can be performed at high speed.
[0033]
[Example]
Embodiments of the present invention will be described below with reference to the drawings.
First, basic processing for associating points by the combination generation processing means 14 shown in FIG. 1 will be described.
[Description of Association Processing by Combination Generation Processing Unit 14]
(1) Point set mapping
Substance A and substance B are represented by the point sets A = {a₁, a₂, ..., a_i, ..., a_m}, 1 ≦ i ≦ m, and B = {b₁, b₂, ..., b_j, ..., b_n}, 1 ≦ j ≦ n, where each point is given by three-dimensional coordinates a_i = (x_i, y_i, z_i) and b_j = (x_j, y_j, z_j). In this case, in principle, all combinations can be generated by associating the elements of the point sets with each other in order and creating a tree structure as shown in the figure.
[0034]
FIG. 3B shows an example of correspondence when point set A has three elements and point set B has four, that is, A = {a₁, a₂, a₃} and B = {b₁, b₂, b₃, b₄}. The dotted lines represent the generated candidates, and the optimum correspondence (a₁ and b₂, a₂ and b₃, a₃ and b₄) is represented by solid lines.
[0035]
Nil in the figure corresponds to the case where there is no corresponding point. By allowing association with nil, an optimum association can be generated even when the point sets being compared have different numbers of elements. The best association is obtained by selecting, from the combinations generated in this way, the one that minimizes the root mean square distance (r.m.s.d. value).
[0036]
FIG. 4 shows an example of an association algorithm of point sets A and B made up of unordered point elements.
Elements a_i are taken from point set A one at a time, and each element b_j of point set B that is not yet included among the ancestors or siblings in the tree structure is checked to see whether it satisfies the constraints. If the constraints are satisfied, the pair is registered in the tree structure, and the next element is associated.
[0037]
However, when such a method is adopted, on the order of n to the m-th power combinations are generated, which generally makes the calculation infeasible. Specifically, for an unordered point set A (m points) and point set B (n points), the number of generated combinations is as follows:
[0038]
[Expression 1]

[0039]
Here, if n = 4 and m = 3,
[0040]
[Expression 2]

[0041]
That is, in the case of point set A (3 points) and point set B (4 points) shown in FIG. 3B, the number of generated combinations is 73. In practice, the number of points (elements) is usually much larger, so the number of combinations becomes enormous.
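The count of 73 can be checked with a short sketch. The expressions themselves are not reproduced in this text, so the closed form below (choose i elements of A, then assign them distinct elements of B in order) is an assumption that happens to agree with the stated value. The function names and the use of None for nil are illustrative only.

```python
from math import comb, perm


def count_unordered(m: int, n: int) -> int:
    """Closed form (assumed): sum over i of C(m, i) * P(n, i)."""
    return sum(comb(m, i) * perm(n, i) for i in range(min(m, n) + 1))


def enumerate_unordered(m: int, n: int):
    """Yield every correspondence as a tuple of length m.

    Entry i is the 0-based index of the element of B paired with a_(i+1),
    or None when a_(i+1) is paired with nil.
    """
    def extend(prefix, used):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        # Branch 1: the next element of A has no partner (nil).
        yield from extend(prefix + [None], used)
        # Branch 2: pair it with any element of B not used by an ancestor.
        for j in range(n):
            if j not in used:
                yield from extend(prefix + [j], used | {j})

    yield from extend([], frozenset())


if __name__ == "__main__":
    print(count_unordered(3, 4), len(list(enumerate_unordered(3, 4))))
```

Both the closed form and the exhaustive tree walk give the same number for m = 3 and n = 4, which illustrates how quickly the unordered search space grows.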
[0042]
Therefore, when making these correspondences, optimal combinations are generated by considering the geometric relationships, threshold conditions, and point attributes in each point set described in (4), (5), and (6) below.
[0043]
(2) Matching ordered point sets
Substance A and substance B are represented by the point sets A = {a₁, a₂, ..., a_i, ..., a_m}, 1 ≦ i ≦ m, and B = {b₁, b₂, ..., b_j, ..., b_n}, 1 ≦ j ≦ n, where each point is given by a_i = (x_i, y_i, z_i) and b_j = (x_j, y_j, z_j). In point set A, the order relation a₁ < a₂ < ... < a_i < ... < a_m (or a₁ > a₂ > ... > a_i > ... > a_m) holds, and similarly in point set B, b₁ < b₂ < ... < b_j < ... < b_n (or b₁ > b₂ > ... > b_j > ... > b_n) holds.
[0044]
In this case, in principle, all combinations can be generated by associating the points of the two sets in sequence on the basis of the order relation and creating a tree structure as shown in the figure. FIG. 5B shows an example in which point set A has three elements and point set B has four, that is, an ordered point set A = {a₁, a₂, a₃} (with the order relation a₁ < a₂ < a₃) and a similarly ordered point set B = {b₁, b₂, b₃, b₄} (with the order relation b₁ < b₂ < b₃ < b₄).
[0045]
The dotted lines represent the generated candidates, and the optimum correspondence (a₁ and b₂, a₂ and b₃, a₃ and b₄) is represented by solid lines. Nil in the figure corresponds to the case where there is no corresponding point. By allowing association with nil, an optimum association can be generated even when the point sets being compared have different numbers of elements. The best correspondence is obtained by selecting, from the combinations generated in this way, the one that minimizes the r.m.s.d. value.
[0046]
In addition, by introducing an order relation among the elements of each point set, the number of combinations to be generated can be greatly reduced compared with (1) above. Furthermore, when making these correspondences, optimal combinations can be generated by considering the geometric relationships, threshold conditions, and point attributes in each point set described in (4), (5), and (6) below.
[0047]
FIG. 6 shows an example of an ordered point set association algorithm.
Elements a_i are taken from point set A one at a time, and each element b_j that is larger than the element of the parent node, among the elements not yet included in the ancestors or siblings in the tree structure, is checked to see whether it satisfies the constraints. If the constraints are satisfied, the pair is registered in the tree structure, and the next element is associated.
[0048]
For such an ordered set of points, the number of combinations generated is
[0049]
[Expression 3]

[0050]
Here, if n = 4 and m = 3,
[0051]
[Expression 4]

[0052]
Thus, in the case of point set A (3 points) and point set B (4 points) as shown in FIG. 5B, the number of generated combinations is 35.
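The ordered case can be sketched in the same way. The expressions themselves are not reproduced in this text, so the closed form used here (choose i elements of A and i elements of B, then pair them in order) is an assumption that agrees with the stated count of 35. Function names and the None-for-nil convention are illustrative.

```python
from math import comb


def count_ordered(m: int, n: int) -> int:
    """Closed form (assumed): sum over i of C(m, i) * C(n, i)."""
    return sum(comb(m, i) * comb(n, i) for i in range(min(m, n) + 1))


def enumerate_ordered(m: int, n: int):
    """Yield order-preserving correspondences as tuples of length m.

    Entry i is the 0-based index in B paired with a_(i+1), or None for nil.
    A new partner must come after the last partner used on the path,
    which is the "larger than the parent node element" rule.
    """
    def extend(prefix, last):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        yield from extend(prefix + [None], last)   # pair with nil
        for j in range(last + 1, n):               # only later elements of B
            yield from extend(prefix + [j], j)

    yield from extend([], -1)


if __name__ == "__main__":
    print(count_ordered(3, 4), len(list(enumerate_ordered(3, 4))))
```

For m = 3 and n = 4 the tree shrinks from 73 leaves in the unordered case to 35, and the correspondence a₁-b₂, a₂-b₃, a₃-b₄ from FIG. 5B appears as the tuple (1, 2, 3).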
(3) Correspondence between partially associated point sets or ordered point sets
In the case of (1) or (2) above, some points may already be associated with each other. In this case, all combinations can be generated by associating the remaining points of the two sets in sequence, in the same way as in methods (1) and (2), while referring to the information on the points that were associated beforehand, and creating a tree structure as shown in the figure.
[0053]
The crosses in FIG. 7 represent the parts that are pruned by the partial association. The figure shows the association in the case where element a₁ of point set A and element b₂ of point set B are associated in advance. As in (1) and (2), optimal combinations can be generated when performing these correspondences by considering the geometric relationships, threshold conditions, and point attributes in each point set described in (4), (5), and (6) below.
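As a rough illustration of pruning by partial association, the following sketch enumerates order-preserving correspondences when some pairs are fixed beforehand. It assumes the ordered case of (2); the `fixed` dictionary, the 0-based indices, and the function name are hypothetical conveniences, not part of the described device.

```python
def enumerate_with_fixed(m: int, n: int, fixed: dict):
    """Yield order-preserving correspondences that respect fixed pairs.

    fixed maps a 0-based index in A to the 0-based index in B that it
    must be paired with. None in a result tuple stands for nil.
    """
    def extend(prefix, last):
        i = len(prefix)
        if i == m:
            yield tuple(prefix)
            return
        if i in fixed:
            j = fixed[i]
            # Every branch other than the fixed partner is pruned, and the
            # fixed partner itself must still respect the order relation.
            if j > last:
                yield from extend(prefix + [j], j)
            return
        yield from extend(prefix + [None], last)
        for j in range(last + 1, n):
            yield from extend(prefix + [j], j)

    yield from extend([], -1)


if __name__ == "__main__":
    # a1 is associated with b2 in advance (indices 0 and 1).
    for combo in enumerate_with_fixed(3, 4, {0: 1}):
        print(combo)
```

With a₁ fixed to b₂, only six of the 35 ordered combinations for m = 3 and n = 4 remain in this sketch.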
[0054]
(4) Narrowing candidates by geometric relationship
By associating the elements of the point set based on the geometric relationship, it is possible to prevent generation of useless combinations, and thus it is possible to efficiently associate the point sets.
[0055]
(a) Narrowing down candidates based on the relationship between two point pairs in space
Since the r.m.s.d. value is an averaged distance, even if the distance between one pair of associated points is large, the resulting r.m.s.d. value is small when the distances between the other points are small. Because such a search result is meaningless, the distance between associated points must be limited to less than a certain value. Suppose, therefore, that one two-point pair (a_i, a_k) is superimposed on another two-point pair (b_j, b_l) so that the distance between corresponding points (a_i and b_j, a_k and b_l) is no more than a certain value (Emax). If the distances within each pair are |a_i − a_k| = d_a and |b_j − b_l| = d_b, then, from FIGS. 9 and 10, the difference between these distances satisfies
|d_a − d_b| ≦ 2 × Emax (Formula 1)
Therefore, when associating the elements of the point sets, the candidate points to be associated can be narrowed down by checking whether the above formula is satisfied.
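Formula 1 follows from the triangle inequality: if a_i lies within Emax of b_j and a_k lies within Emax of b_l, then d_a ≦ Emax + d_b + Emax, and symmetrically for d_b. A minimal sketch of the check, assuming points are given as coordinate tuples (the function and parameter names are illustrative):

```python
from math import dist


def passes_formula_1(a_i, a_k, b_j, b_l, e_max: float) -> bool:
    """Return True when |d_a - d_b| <= 2 * Emax.

    d_a is the distance between a_i and a_k, and d_b is the distance
    between b_j and b_l. The test needs no superposition, because
    distances within a structure do not change under rotation or
    translation.
    """
    d_a = dist(a_i, a_k)
    d_b = dist(b_j, b_l)
    return abs(d_a - d_b) <= 2 * e_max


if __name__ == "__main__":
    a_i, a_k = (0.0, 0.0, 0.0), (3.0, 0.0, 0.0)
    print(passes_formula_1(a_i, a_k, (0.0, 0.0, 0.0), (3.5, 0.0, 0.0), 0.5))
    print(passes_formula_1(a_i, a_k, (0.0, 0.0, 0.0), (5.0, 0.0, 0.0), 0.5))
```

Because the test uses only internal distances, it can be applied while the tree is being built, before any rotation is computed.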
[0056]
However, since (Formula 1) is not a necessary and sufficient condition, it is necessary, after calculating the rotation matrix U, to check again whether the distance between each pair of associated points is smaller than Emax. Because this recheck guarantees that the distance between associated points is smaller than Emax, the r.m.s.d. value satisfies the following conditional expression (Formula 2).
[0057]
[Expression 5]
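The expression itself is not reproduced in this text. From the preceding sentence, a bound of the following form follows. Here N (the number of associated point pairs) and d_k (the distance between the k-th associated pair after superposition) are symbols assumed for illustration, and the original expression may be written differently.

```latex
d_k < E_{\max} \quad (k = 1, \dots, N)
\;\Longrightarrow\;
\mathrm{r.m.s.d.} = \sqrt{\frac{1}{N} \sum_{k=1}^{N} d_k^{2}} \;<\; E_{\max}
```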

[0058]
(b) Narrowing down candidates based on distance relationship (Formula 1)
When element a_i of point set A is associated with element b_j of point set B, it is checked whether (Formula 1) is satisfied with respect to each of the s pairs of elements of point sets A and B that have already been associated. Since the distance relationship must then be satisfied for a plurality of points, the candidate points can be narrowed down more efficiently. FIG. 11 shows the narrowing of candidate points based on the distance relationship among a plurality of points. The points already associated are shown as filled circles, and the figure represents the case where the point corresponding to a_i is selected from b_{j-3}, b_{j-2}, and b_j. In FIG. 11B, point b_j is selected on the basis of the distance relationship with the already associated points.
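The check against several already associated points can be sketched as a filter over candidate points. This is an illustration under assumed data structures: `matched` is a hypothetical list of (point of A, point of B) coordinate pairs, and the function names are not from the source.

```python
from math import dist


def consistent_with_matched(a_i, b_j, matched, e_max: float) -> bool:
    """Check Formula 1 for (a_i, b_j) against every associated pair."""
    return all(
        abs(dist(a_i, a_k) - dist(b_j, b_l)) <= 2 * e_max
        for a_k, b_l in matched
    )


def filter_candidates(a_i, candidates, matched, e_max: float):
    """Keep only the candidate points of B that pass the check."""
    return [b for b in candidates
            if consistent_with_matched(a_i, b, matched, e_max)]


if __name__ == "__main__":
    matched = [((0.0, 0.0, 0.0), (0.0, 0.0, 0.0)),
               ((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))]
    candidates = [(2.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
    print(filter_candidates((2.0, 0.0, 0.0), candidates, matched, 0.25))
```

In this example the third candidate passes the test against the first associated pair but fails against the second, which is why checking several pairs narrows the candidates further than a single pair does.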
[0059]
(c) Narrowing candidates by angle
When the three-dimensional structure is similar, it can be considered that the angle relationship between the points forming the three-dimensional structure is also similar. In the three-dimensional object, as shown in FIG. 12A, there are an angle θ between three points and an angle φ between faces formed from three points among four points. In the following, a method for narrowing down the points to be associated will be described using the angle θ between the three points as an example.
[0060]
For the element a_i in point set A to be matched, the angle formed with s adjacent points (2 ≦ s ≦ m−1, n−1) is considered, and the candidates to be matched are narrowed down by selecting, as element b_j in point set B, only points for which the angle formed with the adjacent s elements agrees within the allowable error range Δθ.
[0061]
FIG. 12B shows an example in which the angle between elements is considered as a geometric relationship within point set A, and the points of point set B are associated on the basis of that angle.
Let θ_a be the angle formed by element a_i in point set A and its s = 2 neighboring points a_{i-1} and a_{i-2}, and let θ_b1, θ_b2, and θ_b3 be the angles formed by the elements b_p, b_q, and b_r in point set B, respectively, with the two neighboring elements b_{j-1} and b_{j-2}. Only the points whose angle difference is within the allowable error range Δθ are selected and associated. In this figure, only point b_q satisfies the error range |θ_a − θ_b1| ≦ Δθ, so b_q is selected as the candidate for b_j.
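A minimal sketch of the angle test for s = 2 follows. It assumes that the angle is measured at the middle point (a_{i-1} or b_{j-1}), that the three points are distinct, and that Δθ is given in degrees. These choices, like the function names, are illustrative assumptions rather than details stated in the source.

```python
from math import acos, degrees, dist


def angle_deg(p, q, r) -> float:
    """Angle in degrees at vertex q formed by the points p, q, r."""
    u = [p[k] - q[k] for k in range(3)]
    v = [r[k] - q[k] for k in range(3)]
    cos_t = sum(x * y for x, y in zip(u, v)) / (dist(p, q) * dist(r, q))
    # Clamp to [-1, 1] to guard against rounding before calling acos.
    return degrees(acos(max(-1.0, min(1.0, cos_t))))


def filter_by_angle(a_prev2, a_prev1, a_i, b_prev2, b_prev1,
                    candidates, delta_deg: float):
    """Keep candidates whose angle differs from theta_a by <= delta_deg."""
    theta_a = angle_deg(a_prev2, a_prev1, a_i)
    return [b for b in candidates
            if abs(theta_a - angle_deg(b_prev2, b_prev1, b)) <= delta_deg]


if __name__ == "__main__":
    prev2, prev1 = (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)
    candidates = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0),
                  (0.0, 0.0, 2.0), (-1.0, 0.1, 0.0)]
    print(filter_by_angle(prev2, prev1, (0.0, 1.0, 0.0),
                          prev2, prev1, candidates, 10.0))
```

The same pattern extends to the dihedral angle φ between faces by replacing `angle_deg` with a function of four points.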
[0062]
(d) Narrowing candidates by distance and angle from the center of gravity
If the three-dimensional structures are similar, the distance and angle from the center of gravity are considered to be similar. Accordingly, by calculating the center of gravity between the selected points and comparing the distance and angle by the same method as in the above (a) and (b), the candidates to be matched can be narrowed down.
[0063]
(5) Narrowing candidates by threshold conditions
In the above methods (1) to (4), a predetermined threshold is set, and the search is pruned whenever a candidate's attribute value exceeds the threshold, which improves efficiency. As this threshold, for example, a limit on the number of nil or a limit on the r.m.s.d. value can be used.
[0064]
(a) Limit on the number of nil
If the total number of nil in a generated combination becomes too large, the result is a meaningless combination candidate. Therefore, when the elements of point set A and point set B are associated, a combination whose total number of nil exceeds a certain threshold is removed from the combination candidates, which avoids generating useless candidates and makes the association efficient.
[0065]
FIG. 13 shows an example of pruning for the point sets A = {a₁, a₂, a₃} and B = {b₁, b₂, b₃, b₄} when the total number of nil is limited to zero. In the figure, the parts of the tree structure marked with x are the parts to be pruned.
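The effect of the nil limit can be sketched by adding a counter to the tree walk. The sketch assumes the ordered case of (2) and counts a nil each time an element of A is left without a partner. Whether FIG. 13 uses exactly this convention is not stated, so treat it as an illustration with hypothetical names.

```python
def enumerate_ordered_max_nil(m: int, n: int, max_nil: int):
    """Yield order-preserving correspondences with at most max_nil nils.

    None in a result tuple stands for nil. A branch is pruned as soon as
    taking another nil would exceed the limit.
    """
    def extend(prefix, last, nils):
        if len(prefix) == m:
            yield tuple(prefix)
            return
        if nils < max_nil:
            yield from extend(prefix + [None], last, nils + 1)
        for j in range(last + 1, n):
            yield from extend(prefix + [j], j, nils)

    yield from extend([], -1, 0)


if __name__ == "__main__":
    for combo in enumerate_ordered_max_nil(3, 4, 0):
        print(combo)
```

With the limit set to zero, only the combinations that pair all three elements of A survive.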
[0066]
(b) Limit on the r.m.s.d. value
If associating element a_i of point set A with element b_j of point set B makes the r.m.s.d. value over all the points associated so far extremely poor, it is desirable to exclude that point from the candidates. Therefore, when element a_i is associated with element b_j, the r.m.s.d. value over all the associated points is calculated. If the r.m.s.d. value is below a certain threshold, the point is selected as a candidate; otherwise it is excluded from the candidates. In this way candidates can be generated efficiently.
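A simplified sketch of this test is shown below. It assumes the two structures are already expressed in a common, superimposed coordinate frame, so the r.m.s.d. is computed directly from the paired coordinates; in the described device the value would be evaluated after the optimal superposition. The function names and the `matched` list are hypothetical.

```python
from math import dist, sqrt


def rmsd(pairs) -> float:
    """Square root of the mean squared distance between paired points."""
    return sqrt(sum(dist(a, b) ** 2 for a, b in pairs) / len(pairs))


def accept_candidate(matched, a_i, b_j, threshold: float) -> bool:
    """Accept (a_i, b_j) only if the r.m.s.d. over all pairs stays low."""
    return rmsd(matched + [(a_i, b_j)]) <= threshold


if __name__ == "__main__":
    matched = [((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),
               ((1.0, 0.0, 0.0), (1.0, 0.0, 1.0))]
    print(accept_candidate(matched, (2.0, 0.0, 0.0), (2.0, 0.0, 1.0), 1.5))
    print(accept_candidate(matched, (2.0, 0.0, 0.0), (2.0, 0.0, 4.0), 1.5))
```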
[0067]
(c) Limit on number of discontinuous sites
If a generated combination contains too many places where the amino acid residue numbers are not consecutive (the number of discontinuous sites), the result is a meaningless combination candidate. Therefore, when the elements of point set A and point set B are associated, a combination whose number of discontinuous sites exceeds a certain threshold is excluded from the combination candidates, which avoids generating useless candidates and makes the association efficient. For example, when the number of discontinuous sites is limited to 3, the sequence b₁ b₃ b₄ b₇ b₁₀ has 4 discontinuous sites, so it is pruned when b₁₀ is matched.
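The example above can be reproduced with a short sketch. It assumes that the number of discontinuous sites is counted as the number of runs of consecutive residue numbers, which is the reading consistent with the value 4 given for b₁ b₃ b₄ b₇ b₁₀. The function names are illustrative.

```python
def count_sites(residue_numbers) -> int:
    """Count runs of consecutive residue numbers in an increasing list."""
    if not residue_numbers:
        return 0
    breaks = sum(1 for prev, cur in zip(residue_numbers, residue_numbers[1:])
                 if cur != prev + 1)
    return breaks + 1


def may_extend(matched, candidate: int, max_sites: int) -> bool:
    """Allow the candidate only if the site count stays within the limit."""
    return count_sites(matched + [candidate]) <= max_sites


if __name__ == "__main__":
    print(count_sites([1, 3, 4, 7, 10]))
    print(may_extend([1, 3, 4], 7, 3), may_extend([1, 3, 4, 7], 10, 3))
```

With a limit of 3, matching b₇ is still allowed, and the branch is cut only when b₁₀ would start a fourth site.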
[0068]
(d) Restriction on the number of residues constituting the site
If the number of amino acid residues constituting a site (the number of site-constituting residues) in a generated combination becomes too small, the result is a meaningless combination candidate. Therefore, when the elements of point set A and point set B are associated, a combination in which the number of site-constituting residues falls below a certain threshold is excluded from the combination candidates, which avoids generating useless candidates and makes the association efficient. For example, when the number of site-constituting residues is limited to 2, in the sequence b₁ b₂ b₃ b₅ b₁₀ one site would consist of the single residue b₅, so any element other than b₆ (here b₁₀) is pruned.
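A sketch of this rule under the same reading of a site as a run of consecutive residue numbers: a candidate that would start a new site is allowed only if the site being closed already has the minimum length. The function name and list representation are hypothetical, and a complete implementation would also check the last site when the combination is finished.

```python
def may_extend_min_site(matched, candidate: int, min_len: int) -> bool:
    """Check the minimum site length before starting a new site.

    matched is an increasing list of residue numbers already chosen.
    """
    if not matched or candidate == matched[-1] + 1:
        return True  # first element, or the current site simply continues
    # Length of the run of consecutive numbers at the end of matched.
    run = 1
    while run < len(matched) and matched[-run] == matched[-run - 1] + 1:
        run += 1
    return run >= min_len


if __name__ == "__main__":
    print(may_extend_min_site([1, 2, 3, 5], 6, 2))
    print(may_extend_min_site([1, 2, 3, 5], 10, 2))
```

After b₁ b₂ b₃ b₅, only b₆ keeps the site containing b₅ at two or more residues, so b₁₀ is rejected, as in the example above.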
[0069]
(6) Narrowing candidates by point attributes
When element a_i of point set A is associated with element b_j of point set B, the point candidates can be narrowed down by using point attributes. Examples of point attributes include the types of atoms, atomic groups, and molecules, hydrophilicity or hydrophobicity, and positive or negative charge. Whether a point is added to the candidates is determined by checking whether these attributes match.
[0070]
For example, when associating the constituent elements of proteins, the candidates for association can be narrowed down by using the type of amino acid residue (corresponding to an atomic group) as a point attribute. For the types of amino acid residues and related information, refer to reference materials such as "Basics of Biochemistry" (Tokyo Kagaku Dojin), pp. 21-26.
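Attribute-based narrowing can be sketched as a simple filter. The three-letter residue codes and the optional `key` function (for example, mapping residues to a hydrophobic or hydrophilic class) are illustrative assumptions; the source does not prescribe a particular representation.

```python
def attribute_candidates(a_attr, b_attrs, key=lambda attr: attr):
    """Return 0-based indices of B whose attribute matches that of a_i.

    key maps a raw attribute to the value that is compared, so the same
    function can match on the exact residue type or on a coarser class.
    """
    target = key(a_attr)
    return [j for j, attr in enumerate(b_attrs) if key(attr) == target]


if __name__ == "__main__":
    b_types = ["ALA", "GLY", "ASP", "GLY", "GLU"]
    print(attribute_candidates("GLY", b_types))

    # Coarser matching: treat ASP and GLU as one "acidic" class.
    acidic = {"ASP", "GLU"}
    print(attribute_candidates(
        "ASP", b_types,
        key=lambda t: "acidic" if t in acidic else t))
```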
[0071]
In addition, by applying a restriction to a specific element, it is possible to narrow down candidate points to be associated in the same manner. For example, by restricting not inserting nil for a certain point, or by specifying a point attribute for a certain point, it becomes possible to narrow down more finely the search target.
[0072]
Next, an example will be described in which proteins are used as the three-dimensional structures of substances. However, since the target is basically not limited as long as it is given as three-dimensional coordinates, the same method can be applied to general molecular structures.
[0073]
(1) Example of processing equipment for superimposing molecular structures
When investigating the properties of a substance, the properties of each substance can be analyzed or predicted by overlapping each molecule and discriminating a common part or a specific part. Since these operations are conventionally performed manually, a device for automatically superimposing and displaying each molecular structure is required.
[0074]
The molecular structure superposition display device that realizes this is configured, within the device configuration shown in FIG. 1, using the database 21 in which information on the three-dimensional structure of substances is registered, the data input processing means 17 that reads the registered data and input commands from the user, the combination generation processing means 14 that associates the elements of the three-dimensional structures to be superimposed, the overlay calculation processing means 15 that superimposes the three-dimensional structures (three-dimensional coordinates) of the substances read from the database 21 so as to minimize the r.m.s.d. value, and the result output processing means 18 that displays the superimposed three-dimensional structures on the basis of the calculated results.
[0075]
(a) Database 21
The database 21 stores information on the three-dimensional structure of substances, such as the names of substances and the three-dimensional coordinates of the atoms constituting them.
[0076]
(b) Data input processing means 17
In the data input processing means 17, the data (three-dimensional coordinates) of the material to be superimposed is read from the database 21 based on the user input command and sent to the combination generation processing means 14 and the overlay calculation processing means 15.
[0077]
(c) Combination generation processing means 14
When associating the elements, the combination generation processing means 14 narrows down candidates based on geometric relationships, predetermined threshold conditions, or point attributes, and generates combinations of elements that satisfy the geometric constraints, threshold conditions, or point attributes.
[0078]
The association here provides a function of associating spatially similar parts based on the order of the amino acid sequence constituting the protein, and a function of associating spatially similar parts regardless of the amino acid sequence order. When searching for spatially similar parts based on the amino acid sequence order, the amino acids constituting the protein can be regarded as an ordered set ordered by amino acid sequence number, so similar parts can be calculated by the methods described in (2) to (6) above. By regarding the amino acids as a simple set, spatially similar parts can also be calculated regardless of the amino acid sequence order by the methods described in (1) and (3) to (6) above.
[0079]
(d) Overlay calculation processing means 15
The overlay calculation processing means 15 superimposes the three-dimensional structures (three-dimensional coordinates) of the substances, on the basis of the association of their constituent elements, so that the r.m.s.d. value takes its optimum value, obtains the coordinate transformation information, and sends the result to the result output processing means 18.
[0080]
(e) Result output processing means 18
In the graphic display by the result output processing means 18, the three-dimensional structure of the substance is superimposed and displayed based on the result calculated by the overlay calculation processing means 15. By making it possible to view the display result while rotating it, it is possible to further easily determine which part is overlapped and how by a three-dimensional graphic.
[0081]
The amino acid residue sequence of the protein Calmodulin is shown in FIG. 14(A), and the amino acid residue sequence of Troponin C is shown in FIG. 14(B). FIG. 14(C) shows the three-dimensional structure of Troponin C. FIGS. 14(A) and 14(B) are excerpts of amino acid residue sequences registered in the Protein Data Bank (PDB).
[0082]
The amino acid residue sequence shown in FIG. 14A is deviated from the original amino acid residue sequence because the amino acids corresponding to amino acid sequence numbers 1-4 and 148 are missing. In the description, the amino acid sequence numbers shown in the figure are used.
[0083]
Calmodulin is known from the results of biochemical experiments to bind Ca²⁺. There are four Ca²⁺ binding sites in its amino acid sequence, and of these, the sites at amino acid sequence numbers 81-108 and 117-143 are known to take skeletons similar to the two Ca²⁺ binding sites in Troponin C. Proteins are composed of amino acids, and their skeletons can be represented by the coordinates of one atom (Cα) constituting each amino acid.
[0084]
FIG. 15 shows the result of searching for a spatially similar portion (single site) based on the amino acid sequence order, using Calmodulin's Ca²⁺ binding site at amino acid sequence numbers 81-108 as the search key (a search key is also referred to as a probe). From the results in FIG. 15, it can be seen that the amino acid sequence numbers in Troponin C associated with Calmodulin's Ca²⁺ binding site 81-108 are 96-123. This result is consistent with the biochemical experimental results.
[0085]
FIG. 16 shows the result of searching for spatially similar portions (multiple sites) based on the amino acid sequence order, using Calmodulin's Ca²⁺ binding sites 81-108 and 117-143 as the probes. From FIG. 16, it can be seen that the amino acid sequence numbers in Troponin C associated with Calmodulin's Ca²⁺ binding sites 81-108 and 117-143 are 96-123 and 132-158, respectively. This result is also consistent with the biochemical experimental results.
[0086]
In this way, with this device, the constituent elements of the substances are associated so that the r.m.s.d. value between their three-dimensional structures is optimal, and an optimal superposition display of the substances can be realized.
[0087]
(2) Examples of 3D structure search device and function database generation device
In order to develop a substance with a new function, as in the development of a new drug, or to enhance the function of an existing substance, it is essential to elucidate the correlation between the function and the structure of the substance. In proceeding with such work, it is necessary to refer to many substances having similar three-dimensional structures. This requires a three-dimensional structure search device that can easily retrieve substances with similar three-dimensional structures from a database. With such a device, it is also possible to create a function database that collects three-dimensional structures related to functions (the function database is described in (3) below).
[0088]
A three-dimensional structure search apparatus that realizes the above is configured using a database 21 in which information relating to the three-dimensional structures of substances is registered, data input processing means 17 that reads the registered data and an input command from a user, combination generation processing means 14 for associating the elements of the three-dimensional structures to be compared, similarity calculation processing means 16 for searching for a similar structure having a minimum r.m.s.d. value with respect to the three-dimensional structure (three-dimensional coordinates) of a substance read from the database 21, and result output processing means 18 for displaying the search result.
[0089]
Similarly, the function database generation device is configured using a database 21, data input processing means 17, combination generation processing means 14, similarity calculation processing means 16, function database generation processing means 19, and the like.
[0090]
(a) Database 21
It is a database that stores information on the three-dimensional structure of a substance. The database stores the names of substances, the three-dimensional coordinates of atoms constituting the substances, and the like.
[0091]
(b) Data input processing means 17
Based on the three-dimensional structure data and the input command from the user, the data input processing means 17 reads the three-dimensional structure serving as the search key and the three-dimensional structure data registered in the database 21 to be referred to at the time of retrieval, and sends them to the combination generation processing means 14 and the similarity calculation processing means 16.
[0092]
(c) Combination generation processing means 14 and similarity calculation processing means 16
When associating elements for similarity determination, the combination generation processing means 14 narrows down candidates based on geometric relationships, predetermined threshold conditions, or point attributes, and generates combinations of elements that satisfy the geometric constraints, threshold conditions, or point attributes. For this purpose, two functions are provided: searching for a spatially similar portion based on the order of the amino acid sequence constituting the protein, and searching for a spatially similar portion regardless of the amino acid sequence order. When searching based on the amino acid sequence order, the amino acids constituting the protein can be regarded as an ordered set, ordered by amino acid sequence number, and similar parts can be calculated by the methods described in (2) to (6) above. By regarding the amino acids as a simple (unordered) set, spatially similar parts can be calculated regardless of the amino acid sequence order by the methods described in (1) and (3) to (6) above. In this way, the similarity calculation processing means 16 selects highly similar three-dimensional structures from the database 21.
[0093]
(d) Result output processing means 18
The result output processing means 18 represents similar parts by amino acid sequence names and amino acid numbers based on the result of the similarity calculation processing means 16, and displays the r.m.s.d. value as a measure of similarity.
[0094]
(e) Function database generation processing means 19
Further, in the case of the function database generation device, the function database generation processing means 19 assumes, based on the result of the similarity calculation processing means 16, that a three-dimensional structure similar to the three-dimensional structure of the probe has the same function as the probe. This information is then stored in the function database 24.
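The storing step described above can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the function name `build_function_database`, the dictionary layout, and the sample hits (protein labels, residue ranges, and r.m.s.d. values) are all hypothetical and are not taken from the embodiment. The default cutoff of 1 Å follows the empirical guideline mentioned in the background section.

```python
def build_function_database(function_name, hits, rmsd_max=1.0):
    """Keep the hits whose r.m.s.d. is within rmsd_max and file them
    under the function of the probe (hypothetical data layout)."""
    selected = [h for h in hits if h["rmsd"] <= rmsd_max]
    selected.sort(key=lambda h: h["rmsd"])  # most similar first
    return {function_name: selected}


# Hypothetical search results: labels and values are made up for illustration.
example_hits = [
    {"protein": "P1", "residues": (10, 17), "rmsd": 0.4},
    {"protein": "P2", "residues": (5, 12), "rmsd": 2.3},
    {"protein": "P3", "residues": (15, 22), "rmsd": 0.2},
]
function_db = build_function_database("phosphate binding", example_hits)
```

A real function database would also hold the three-dimensional coordinates of the stored structures; they are omitted here to keep the sketch short.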
[0095]
FIG. 17 shows the result of searching the PDB for similar three-dimensional structures, using as the probe the Cα coordinates of amino acid residues 7 to 14, which form the phosphate binding site for GTP (guanosine triphosphate) in the protein elongation factor.
[0096]
In this example, the search targets are 744 protein three-dimensional structures among the 905 entries registered in the Protein Data Bank (PDB). As shown in FIG. 17, the search result includes information such as the amino acid residue numbers of the retrieved target protein, its amino acid residue sequence, the amino acid residue sequence of the probe, and the r.m.s.d. value between the three-dimensional structures of the target and the probe. Although part of the output is omitted from the figure, eight three-dimensional structures were retrieved (including the probe itself).
[0097]
By protein type, there were 3 adenylate kinases, 2 elongation factors (one of which was the probe itself), and 3 ras proteins, all of which were matched at ATP or GTP phosphate binding sites. This shows that the phosphate-binding function for ATP or GTP is very closely related to the three-dimensional structure and, since the probe did not match any structure unrelated to phosphate binding, that the structure is highly specific.
[0098]
In this way, by using this apparatus, a similar structure can be searched from the database 21 storing the three-dimensional structure of the substance by designating the three-dimensional structure of the substance that is a probe (search key). Further, the function database 24 can be automatically generated by using the search result.
[0099]
(3) Function prediction device
As inferred from the results shown in FIG. 17, a protein is considered to have a three-dimensional structure specific to a function in order to express that function. Therefore, if a database of three-dimensional structures specific to each function (the function database 24 shown in FIG. 1) is available, then when the three-dimensional structure of a substance is newly determined by a technique such as X-ray crystal analysis or NMR, examining whether a structure registered in the function database 24 is present in that three-dimensional structure makes it possible to predict what function the substance has and which part of the three-dimensional structure (hereinafter referred to as the functional part) governs that function.
[0100]
The function prediction apparatus that realizes this is configured, within the functional configuration shown in the figure, using the data input processing means 17, the function database 24 in which three-dimensional structures related to functions are registered, the combination generation processing means 14 for associating the elements of the three-dimensional structures to be compared, the similarity calculation processing means 16 for examining the similarity, the function prediction processing means 20 for predicting the function and the functional part from the result, and the like.
[0101]
(a) Data input processing means 17
The data of the three-dimensional structure constituting the substance is read and sent to the combination generation processing means 14.
[0102]
(b) Function database 24
It is a database that stores information on the function of a substance and the three-dimensional structure specific to that function. The function database 24 stores function names, three-dimensional coordinates of atoms constituting a three-dimensional structure specific to the function, and the like.
[0103]
(c) Combination generation processing means 14 and similarity calculation processing means 16
Here, the optimum overlay between a three-dimensional structure registered in the function database 24 and the input three-dimensional structure is calculated. For this purpose, two functions are provided: searching for a spatially similar portion based on the order of the amino acid sequence constituting the protein, and searching for a spatially similar portion regardless of the amino acid sequence order. When searching based on the amino acid sequence order, the amino acids constituting the protein can be regarded as an ordered set, ordered by amino acid sequence number, and similar parts can be calculated by the methods described in (2) to (6) above. By regarding the amino acids as a simple (unordered) set, spatially similar parts can be calculated regardless of the amino acid sequence order by the methods described in (1) and (3) to (6) above.
[0104]
(d) Function prediction processing means 20
Based on the result of the similarity calculation processing means 16, the function prediction processing means 20 displays the function name registered in the function database 24, the amino acid sequence name and amino acid residue numbers of the functional site, and the r.m.s.d. value as a measure of similarity.
[0105]
In the three-dimensional structure search system described in the above embodiment, the association of elements between the point sets forming the three-dimensional structures is automated by sequentially associating the points in each set. In this method, when there is no point to be associated, nil is associated, so that the optimum association can be generated even when the point sets to be compared have different numbers of elements. Then, from among the generated combinations, an association is searched for in which the r.m.s.d. value over the associated points is smaller than a predetermined threshold and the distance between each pair of associated points is also smaller than a predetermined threshold, and the optimum association is generated.
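The acceptance test described above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the names `rmsd_of_association` and `is_acceptable` are hypothetical, nil is represented by `None`, and the two point sets are assumed to have already been brought into a common orientation (the superposition step itself is not shown).

```python
import math


def rmsd_of_association(points_a, points_b, association):
    """association[i] is the index in points_b paired with points_a[i],
    or None when points_a[i] is associated with nil.
    Assumes both point sets are already superposed."""
    dists = [
        math.dist(points_a[i], points_b[j])
        for i, j in enumerate(association)
        if j is not None
    ]
    if not dists:
        return None, []
    rmsd = math.sqrt(sum(d * d for d in dists) / len(dists))
    return rmsd, dists


def is_acceptable(points_a, points_b, association, rmsd_max, dist_max):
    """True when the r.m.s.d. and every pairwise distance are below
    their thresholds (strict comparison, as an assumption)."""
    rmsd, dists = rmsd_of_association(points_a, points_b, association)
    if rmsd is None:
        return False
    return rmsd < rmsd_max and all(d < dist_max for d in dists)


# Point set A has three elements, B only two, so one element of A gets nil.
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
B = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
```

Elements paired with nil are simply left out of the average here; limiting how many such elements are allowed is a separate threshold condition.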
[0106]
Here, when the search over combinations is not pruned, if the numbers of elements of the two point sets are m and n, the number of combinations generated is on the order of mⁿ, which makes the calculation practically impossible. Therefore, when performing these associations, the optimum combination is generated by taking into consideration geometric relationships such as the distances between the elements of each point set, threshold conditions such as the r.m.s.d. value and the number of nil, and the attributes of the constituent elements (in the case of proteins, the types of amino acids). Even so, it is difficult to establish a method that always performs high-speed processing under any conditions.
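As a rough illustration of this growth (the sizes are chosen here only as an example and are not taken from the embodiment), two point sets of 20 elements each already give:

```latex
m^{n} = 20^{20} = 2^{20} \times 10^{20} \approx 1.05 \times 10^{26}
```

Even at a billion combinations per second, enumerating this many candidates would take on the order of 10¹⁷ seconds, which is why the candidates are narrowed down before combinations are generated.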
[0107]
However, the amino acids that make up a protein are numbered from the N-terminus to the C-terminus and can be regarded as an ordered set of points. Moreover, when analyzing protein functions, the similar structures to be extracted and the three-dimensional structures serving as search keys are often contiguous in the amino acid sequence (contiguous elements in the point set). Processing can therefore be sped up further by performing the association in consideration of the order of the elements of the point sets A and B.
[0108]
In this embodiment, when associating elements between the point sets A and B forming the three-dimensional structures, attention is paid to the order of the elements constituting the point sets A and B. In addition, when associating elements as described above, combinations are generated efficiently by taking into account geometric relationships such as the distances within each point set, threshold conditions such as the r.m.s.d. value and the number of nil, and the attributes of the constituent elements. Furthermore, even when the search key structure consists of multiple sites, as long as the elements are contiguous within each site, the whole structure is associated by associating one site at a time, starting with the first site. In the following, means for obtaining a subset that is a candidate for association will be described.
[0109]
(1) Correspondence according to the component order of the ordered point sets A and B
FIG. 18 is a diagram for explaining an example in which the number m of elements of the point set A is obtained, and m elements are selected from the point set B according to the order of the elements and associated with the elements of the point set A. The point sets A and B are ordered point sets whose elements are A = {a₁, ..., a_m} and B = {b₁, ..., b_i, ..., b_n}.
[0110]
FIG. 19 is a diagram showing an association algorithm according to the component order of the ordered point sets A and B.
For the ordered point sets A = {a₁, ..., a_m} and B = {b₁, ..., b_i, ..., b_n}, the following processing is performed.
[0111]
Process 1: The number m of elements of the point set A is obtained.
Process 2: i = 1.
Process 3: m elements are selected in order from the i-th element of the point set B and associated with each element of the point set A.
[0112]
Process 4: The processing ends when the last element of the point set B has been associated. When the last element of the point set B has not yet been associated, i = i + 1 is set and the processing returns to Process 3.
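Processes 1 to 4 above amount to sliding a window of length m along the point set B. A minimal sketch, assuming 0-based indexing and a hypothetical function name `windows_in_order` (the element labels in the usage lines are placeholders):

```python
def windows_in_order(a, b):
    """Pair the elements of a, in order, with every run of len(a)
    consecutive elements of b."""
    m = len(a)                       # Process 1: number of elements of A
    associations = []
    i = 0                            # Process 2 (0-based here, i = 1 in the text)
    while i + m <= len(b):           # Process 4: stop once the last element of B is used
        window = b[i:i + m]          # Process 3: m elements starting at the i-th
        associations.append(list(zip(a, window)))
        i += 1
    return associations


candidates = windows_in_order(["a1", "a2"], ["b1", "b2", "b3", "b4"])
```

With m elements in A and n in B this produces n - m + 1 candidate associations, instead of the exponential number obtained when the order is ignored.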
(2) Correspondence taking into account the order of the components of the ordered point sets A and B and the constraints
The number m of elements of the point set A is obtained, and m elements are selected from the point set B according to the order of the elements and associated with the elements of the point set A. At that time, the optimum association is obtained by taking into account the geometric relationships within each point set, the threshold conditions, and the point attributes described above.
[0113]
FIG. 20A is a diagram showing an association algorithm that takes into account the order of the elements of the ordered point sets A and B and the constraint conditions, and FIG. 20B is a diagram showing the algorithm of the association processing.
[0114]
In FIG. 20A, for the ordered point sets A = {a₁, ..., a_m} and B = {b₁, ..., b_i, ..., b_n}, the number m of elements of the point set A is obtained, and nilmax, the threshold for the total number of nil, is set as a constraint. Initially, i = 1 and nil = 0. Thereafter, the association process shown in FIG. 20B is performed. In this association processing, the following processing is performed.
[0115]
Process 1: k = i.
Process 2: Elements are selected one by one starting from the i-th element of the point set B (point b_k). If the point b_k satisfies the constraints, the point b_k is associated with an element of the point set A. If not, nil is associated and nil = nil + 1 is set.
[0116]
Process 3: Process 2 is repeated with k = k + 1 until m elements have been associated or nil exceeds the value of nilmax.
Process 4: The processing ends when the last element of the point set B has been selected. When the last element of the point set B has not yet been selected, i = i + 1 is set and the processing returns to Process 1.
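The association processing above can be sketched as follows. Several points are assumptions made only for this illustration: indexing is 0-based, the constraint check is passed in as a hypothetical predicate `satisfies`, a point b_k that fails the check is the one paired with nil (the current element of A then waits for the next point of B), and nil is reset for each start position i. The one-letter residue strings in the usage lines are made up.

```python
def associate_from(a, b, i, nilmax, satisfies):
    """Try to associate every element of a, in order, starting at b[i].
    Returns a list of (a element or None, b element) pairs, or None on failure."""
    pairs, nil, matched = [], 0, 0
    k = i                                    # Process 1
    while matched < len(a) and k < len(b):
        if satisfies(a[matched], b[k]):      # Process 2: b_k meets the constraints
            pairs.append((a[matched], b[k]))
            matched += 1
        else:                                # otherwise b_k is paired with nil
            pairs.append((None, b[k]))
            nil += 1
            if nil > nilmax:                 # Process 3: too many nil, give up
                return None
        k += 1                               # Process 3: next point of B
    return pairs if matched == len(a) else None


def associate_ordered(a, b, nilmax, satisfies):
    """Process 4: repeat the association for every start position in b."""
    results = []
    for i in range(len(b)):
        pairs = associate_from(a, b, i, nilmax, satisfies)
        if pairs is not None:
            results.append((i, pairs))
    return results


same_type = lambda x, y: x == y              # point attribute check (residue type)
hits = associate_ordered("GKT", "AGKSTGKT", 1, same_type)
```

In this sketch a start position whose first point fails the check still consumes one nil, so some hits differ only by a leading nil; a practical implementation would probably filter those out.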
[0117]
(3) Correspondence considering the order of the elements of the ordered point sets A and B and the constraints (in the case of multiple sites)
As shown in FIG. 21(A), when the point set A, which is the key structure, is composed of a plurality of sites, the first site (site 1) is first associated, as shown in FIG. 21(B), by the method described in (2) above. If the association is successful, then as shown in FIG. 21(C), the second site (site 2) is associated, by the method described in (2) above, with the elements of the target point set B that follow the elements associated with site 1. Processing continues in the same way up to the last site.
[0118]
FIG. 22 is a diagram showing an algorithm for associating a plurality of sites in consideration of the component order of the ordered point sets A and B and the constraint conditions.
For the ordered point sets A = {a₁, ..., a_m} and B = {b₁, ..., b_i, ..., b_n}, the following processing is performed.
[0119]
Process 1: The total number of sites in the point set A is obtained, and the following are set:
sitemax = total number of sites,
nilmax = threshold for the total number of nil,
start = 1, and n = 1.
[0120]
Process 2: The association process is performed for the n-th site by the method described in (2) above.
Process 3: If the n-th site association fails, the process ends.
Process 4: If all sites are associated or nil exceeds the value of nilmax, the process is terminated.
[0121]
Process 5: The element following the position where the n-th site was associated is set as the start position (start), n = n + 1 is set, and Process 2 and the subsequent steps are repeated.
The system for superimposing and displaying the three-dimensional structures of substances according to this embodiment can also be realized with the same configuration as in the first embodiment described above. However, the association method used by the combination generation processing means 14 shown in FIG. 1 is different.
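The site-by-site procedure of Processes 1 to 5 can be sketched as follows. This is an illustration under assumptions, not the algorithm of FIG. 22 itself: the function names are hypothetical, indexing is 0-based, only the first successful placement of each site is kept, and nilmax is treated as a budget shared by all sites. The residue strings are made up.

```python
def match_site(site, b, start, nil_budget, satisfies):
    """Find the first placement of one site at or after position start.
    Returns (first index, last index, nil used) or None."""
    for i in range(start, len(b)):
        matched, nil, k = 0, 0, i
        while matched < len(site) and k < len(b) and nil <= nil_budget:
            if satisfies(site[matched], b[k]):
                matched += 1
            else:
                nil += 1                 # b[k] is skipped (paired with nil)
            k += 1
        if matched == len(site) and nil <= nil_budget:
            return i, k - 1, nil
    return None


def match_sites(sites, b, nilmax, satisfies):
    """Associate the sites one at a time, each after the previous one."""
    start, nil_total, spans = 0, 0, []   # Process 1
    for site in sites:                   # n = 1 .. sitemax
        hit = match_site(site, b, start, nilmax - nil_total, satisfies)  # Process 2
        if hit is None:                  # Process 3: this site cannot be placed
            return None
        first, last, nil = hit
        spans.append((first, last))
        nil_total += nil
        start = last + 1                 # Process 5: continue after this site
    return spans                         # Process 4: all sites associated


same_type = lambda x, y: x == y
spans = match_sites(["GK", "DT"], "AGKAADTG", 0, same_type)
```

Because each site is searched only in the part of B that follows the previous site, swapping the order of the sites in the probe changes the outcome, which matches the ordered treatment described above.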
[0122]
First, a specific example of a probe (search key) consisting of one site will be described. The amino acid (residue) sequence of the protein Trypsin is shown in FIG. 23 (A), and the amino acid sequence of Elastase is shown in FIG. 23 (B). FIG. 23 shows an excerpt of the amino acid sequences registered in the PDB. The amino acid sequence numbers shown in FIG. 23 differ from the original amino acid numbers because the amino acids described in the PDB are simply numbered from 1. In the following, the amino acid numbers shown in FIG. 23 are used.
[0123]
Trypsin and Elastase shown in FIG. 23 belong to the family of proteolytic enzymes called serine proteases, in which histidine, serine, and aspartic acid are indispensable in the active site. Although the substrate specificities of these enzymes are completely different, they are considered to be an evolutionarily related group of enzymes because they are similar in structure and catalytic mechanism.
[0124]
FIG. 24 (A) shows the results of searching for the histidine active site of Elastase using the histidine active site (36-41) of Trypsin as a probe. According to this result, it can be seen that 41-46 of Elastase is associated with the active site 36-41 of Trypsin.
[0125]
FIG. 24 (B) shows the results of searching for the serine active site of Elastase using the serine active site (175-179) of Trypsin as a probe. According to this result, it can be seen that 186-190 of Elastase is associated with the active site 175-179 of Trypsin. These results are consistent with those obtained from biochemical experiments.
[0126]
Next, as a specific example of a probe (search key) consisting of a plurality of sites, FIG. 25 shows the result of searching for a similar structure in ras protein, using as the probe the structure corresponding to amino acid residue numbers 7-14, 48-51, and 103-106, which are the GTP (guanosine triphosphate) binding sites of a protein elongation factor.
[0127]
It is known that ras protein also binds GTP. As shown in FIG. 26, residues 10-17 of ras protein are associated with site 1 (7-14) of the elongation factor, residues 57-60 with site 2 (48-51), and residues 111-114 with site 3 (103-106). These results are consistent with those obtained from biochemical experiments.
[0128]
In this way, by designating with this device the three-dimensional structure of the substance to be used as a probe, the correspondence that minimizes the r.m.s.d. value can be calculated against the database storing the three-dimensional structures of substances, and by displaying it, an optimum overlay display of substances can be realized.
[0129]
For example, when developing a substance having a desired function, as in the development of a new drug, it is necessary to refer to many substances having similar three-dimensional structures in order to clarify the correlation between the function and the structure of the substance. For this reason, a three-dimensional structure retrieval system that can easily extract substances having similar three-dimensional structures from a database is required. A high-speed retrieval system can be configured in the same way as in the above-described example.
[0130]
The combination generation processing means 14 efficiently generates combinations of elements by the method of the embodiment described above, and the similarity calculation processing means 16 extracts similar three-dimensional structures from the database 21. Then, based on the result of the similarity calculation processing means 16, the result output processing means 18 represents the similar part by its amino acid sequence name and amino acid numbers, and displays the r.m.s.d. value as a measure of similarity.
[0131]
In this way, by designating the three-dimensional structure of a substance to be a probe (search key) with this apparatus, a similar structure having the smallest r.m.s.d. value can be retrieved from the database 21 storing the three-dimensional structure of the substance.
[0132]
[Effect of the invention]
As described above, according to the present invention, the points forming three-dimensional structures can be associated automatically, so that three-dimensional structures can be automatically superimposed and displayed by a graphic system, and similar structures can be automatically retrieved from a database. In addition, a function database can be created and used to predict the functions of new substances. The invention therefore contributes greatly to the development of new drugs and to protein research.
[Brief description of the drawings]
FIG. 1 is a diagram illustrating the principle of the present invention.
FIG. 2 is a diagram showing a method for calculating an r.m.s.d. value related to the present invention.
FIG. 3 is an explanatory diagram of the association between unordered point sets A and B.
FIG. 4 is a diagram illustrating an example of an association algorithm for unordered point sets A and B.
FIG. 5 is an explanatory diagram of the association between ordered point sets A and B.
FIG. 6 is a diagram illustrating an example of an association algorithm for ordered point sets A and B.
FIG. 7 is an explanatory diagram of the association between partially associated point sets A and B.
FIG. 8 is a diagram illustrating the relationship between two points that are associated with each other.
FIG. 9 is a diagram for explaining the distance between two points (d_a ≧ 2 × Emax).
FIG. 10 is a diagram for explaining the distance between two points (d_a < 2 × Emax).
FIG. 11 is a diagram for explaining the narrowing down of candidates based on the distance relationship between a plurality of points.
FIG. 12 is a diagram for explaining the narrowing down of candidates by geometric constraints.
FIG. 13 is an explanatory diagram of association between point sets A and B with a limited number of nil.
FIG. 14 shows examples of proteins.
FIG. 15 is a diagram illustrating an example of a search result according to the embodiment of the present invention.
FIG. 16 is a diagram illustrating an example of a search result according to the embodiment of the present invention.
FIG. 17 is a diagram illustrating an example of a three-dimensional structure search result.
FIG. 18 is a diagram for explaining the association according to the element order of the ordered point sets A and B.
FIG. 19 is a diagram illustrating an example of an association algorithm according to the element order of the ordered point sets A and B.
FIG. 20 is a diagram illustrating an example of an association algorithm that takes into account the element order of the ordered point sets A and B and the constraint conditions.
FIG. 21 is a diagram for explaining the association of a plurality of sites in consideration of the component order of a point set and constraint conditions.
FIG. 22 is a diagram illustrating an example of an association algorithm of a plurality of sites in consideration of the component order of a point set and constraint conditions.
FIG. 23 is a diagram showing an example of an amino acid sequence (in the case of one site).
FIG. 24 is a diagram showing an example of a search result according to the present embodiment (in the case of one site).
FIG. 25 shows an example of an amino acid sequence (in the case of multiple sites).
FIG. 26 is a diagram showing an example of a search result according to the present embodiment (in the case of a plurality of sites).
[Explanation of symbols]
10 Processing device
11 Control unit
14 Combination generation processing means
15 Overlay calculation processing means
16 Similarity calculation processing means
17 Data input processing means
18 Result output processing means
19 Function database generation processing means
20 Function prediction processing means
21 Database
22 Input device
23 Display device
24 Function database

Claims

A three-dimensional structure processing apparatus that processes, by computer, data relating to two three-dimensional structures of proteins and superimposes these two three-dimensional structures, comprising:
a database (21) in which data on the three-dimensional structures of proteins are registered;
data input processing means (17) for reading the data registered in the database (21) and instruction information from a user;
combination generation processing means (14) that, based on the data read by the data input processing means (17), acquires, for each of the two three-dimensional structures of the proteins, data described as a set of elements, each element taking an amino acid constituting the protein as a unit and having an amino acid attribute, an order relationship according to the amino acid sequence number, and coordinates as a point, that, when associating elements between the sets of the two three-dimensional structures of the proteins, performs, with respect to a candidate element to be newly associated from the sets, candidate narrowing by geometric constraints that restrict the association according to the difference between the sets in the distance relationship between two elements to be satisfied between an already associated element and the candidate element, in the distance relationship between a plurality of already associated elements and the candidate element, in the angular relationship between a plurality of already associated elements and the candidate element, or in a combination thereof, candidate narrowing by a predetermined threshold condition relating to a limit on the number of nil indicating that no corresponding element exists, a limit on the value of the square root of the mean square distance between elements already associated between the sets and between candidate elements, a limit on the number of discontinuous sites, or a limit on the number of residues constituting a site, and candidate narrowing by amino acid attributes, and that incorporates, as an associated element, a new element that can be associated while satisfying the geometric constraints, the threshold condition, and the amino acid attributes, thereby generating combinations of associated elements;
overlay calculation processing means (15) that searches, from among the plurality of combinations of associated elements generated by the combination generation processing means (14), for a combination in which the square root of the mean square distance between the elements associated between the sets is smaller than a predetermined threshold and the distance between each pair of associated elements is also smaller than a predetermined threshold, and that calculates, for the combination found, the position and orientation for superposition at which the two three-dimensional structures of the proteins best match; and
result output processing means (18) for superimposing and displaying the two three-dimensional structures of the proteins based on the result calculated by the overlay calculation processing means (15).

A three-dimensional structure processing apparatus that includes a database (21) storing data on the three-dimensional structures of proteins and that extracts similar three-dimensional structures from the database (21) by searching the database (21) using a designated three-dimensional structure as a search key, comprising:
data input processing means (17) for reading the data registered in the database (21) and instruction information, from a user, on the three-dimensional structure serving as the search key;
combination generation processing means (14) that, based on the data read by the data input processing means (17), for determining the similarity between two three-dimensional structures, namely a three-dimensional structure of a protein registered in the database (21) and the three-dimensional structure serving as the search key, acquires, for each of the two three-dimensional structures, data described as a set of elements, each element taking an amino acid constituting the protein as a unit and having an amino acid attribute, an order relationship according to the amino acid sequence number, and coordinates as a point, that, when associating elements between the sets of the two three-dimensional structures, performs, with respect to a candidate element to be newly associated from the sets, candidate narrowing by geometric constraints that restrict the association according to the difference between the sets in the distance relationship between two elements to be satisfied between an already associated element and the candidate element, in the distance relationship between a plurality of already associated elements and the candidate element, in the angular relationship between a plurality of already associated elements and the candidate element, or in a combination thereof, candidate narrowing by a predetermined threshold condition relating to a limit on the number of nil indicating that no corresponding element exists, a limit on the value of the square root of the mean square distance between elements already associated between the sets and between candidate elements, a limit on the number of discontinuous sites, or a limit on the number of residues constituting a site, and candidate narrowing by amino acid attributes, and that incorporates, as an associated element, a new element that can be associated while satisfying the geometric constraints, the threshold condition, and the amino acid attributes, thereby generating combinations of associated elements;
similarity calculation processing means (16) that searches, from among the plurality of combinations of associated elements generated by the combination generation processing means (14), for a combination in which the square root of the mean square distance between the elements associated between the sets is smaller than a predetermined threshold and the distance between each pair of associated elements is also smaller than a predetermined threshold, and that selects, from among the three-dimensional structures of proteins registered in the database (21), a three-dimensional structure having high similarity to the three-dimensional structure serving as the search key; and
result output processing means (18) for outputting, as a search result, the three-dimensional structure with high similarity selected by the similarity calculation processing means (16).

Comprises a database (21) containing the data regarding the three-dimensional structure of the protein, based on the three-dimensional structure expressing certain specified functions, the conformation expressing similar functions from the in the database (21) to it A three-dimensional structure processing device that extracts and collects and generates a function database (24) in which data on three-dimensional structures related to protein functions is stored ,
Data input processing means (17) for reading data registered in the database (21) and three-dimensional structure instruction information expressing a certain function from the user;
Based on said data input processing means (17) is read data, the determination of the similarity between two three-dimensional structures and three-dimensional structure that expresses the certain function and the three-dimensional structure of the protein that the registered in the database (21) Therefore, for the two three-dimensional structures, data described as a set of elements having amino acid attributes as a unit, having an amino acid attribute in the amino acid sequence number, and having coordinates as points each acquired when associating each element among a set of two three-dimensional structure of the protein, with respect to newly associate candidate elements from the current case, to be satisfied with the already the associated element and the candidate elements multiple elements and candidate needed to distance relationship or previously associated between the multiple-element and the candidate elements kicked with corresponding distance relationship or already between two elements Between angular relation or geometric candidate narrowing down by constraints to constrain the association according to the difference between the set of combinations thereof, and the corresponding element number of nil indicating that no restrictions or set between the The candidate according to a predetermined threshold condition regarding the limit of the square root of the mean square distance between the elements already associated with each other and the limit of the number of discontinuous sites or the number of site constituent residues Refine, and to narrow down the candidates by attributes of amino acids, associated with a the associated elements incorporating new elements corresponding is attached to meet the attributes of these geometrical constraints and threshold condition and amino acids A combination generation processing means (14) for generating a combination of elements;
Similarity calculation processing means (16) which searches, from among the plurality of combinations of associated elements generated by the combination generation processing means (14), for a combination in which the square root of the mean square distance between the elements associated between the sets is smaller than a predetermined threshold value and in which each distance between associated elements is also smaller than a predetermined threshold value, and which thereby selects, from the three-dimensional structures of proteins registered in the database (21), a three-dimensional structure highly similar to the three-dimensional structure that expresses the certain function;
Function database generation means (19) for collecting the highly similar three-dimensional structures selected by the similarity calculation processing means (16) and generating the function database (24) by storing data on the collected three-dimensional structures in association with the function designated by the instruction information from the user; the three-dimensional structure processing apparatus being characterized by comprising the foregoing means.
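
The candidate narrowing recited for the combination generation processing means (14) can be pictured with a small backtracking sketch. It is illustrative only and is not the claimed implementation: the element layout `(sequence_number, amino_acid, (x, y, z))`, the function name `generate_combinations`, and the parameters `tol` and `max_nil` are hypothetical, the attribute test is assumed to be an exact match of residue type, and the association is assumed to preserve sequence order. Only the distance form of the geometric constraint and the nil-count threshold are shown.

```python
from math import dist

# Hypothetical element layout: (sequence_number, amino_acid, (x, y, z)).
query = [(1, "G", (0.0, 0.0, 0.0)), (2, "A", (1.5, 0.0, 0.0)), (3, "S", (1.5, 1.5, 0.0))]
target = [(1, "L", (9.0, 9.0, 9.0)), (2, "G", (5.0, 5.0, 5.0)),
          (3, "A", (6.5, 5.0, 5.0)), (4, "S", (6.5, 6.5, 5.0))]

def generate_combinations(a, b, tol=1.0, max_nil=1):
    """Enumerate order-preserving associations between element lists a and b.

    Each result is a list of (i, j) index pairs; (i, None) marks a nil,
    meaning element i of a has no corresponding element in b.
    """
    results = []

    def consistent(pairs, i, j):
        # Attribute narrowing (assumed here: residue types must be equal).
        if a[i][1] != b[j][1]:
            return False
        # Geometric narrowing: the distance from the candidate to every
        # already associated element must agree in both structures within tol.
        for pi, pj in pairs:
            if pj is None:
                continue
            if abs(dist(a[i][2], a[pi][2]) - dist(b[j][2], b[pj][2])) > tol:
                return False
        return True

    def extend(i, next_j, pairs, nils):
        if i == len(a):
            results.append(list(pairs))
            return
        for j in range(next_j, len(b)):
            if consistent(pairs, i, j):
                pairs.append((i, j))
                extend(i + 1, j + 1, pairs, nils)
                pairs.pop()
        # Threshold narrowing: allow a nil only while under the nil limit.
        if nils < max_nil:
            pairs.append((i, None))
            extend(i + 1, next_j, pairs, nils + 1)
            pairs.pop()

    extend(0, 0, [], 0)
    return results

exact = generate_combinations(query, target, max_nil=0)
```

Because the constraint compares internal distances, it does not depend on how either structure is oriented, so candidates can be discarded before any superposition is attempted.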

A three-dimensional structure processing apparatus for predicting, when the three-dimensional structure of a protein is given, the function expressed by that three-dimensional structure on the basis of a function database (24) in which data on three-dimensional structures associated with the functions of proteins are stored:
Data input processing means (17) for reading the data on the three-dimensional structures of proteins registered in the function database (24) and instruction information on the three-dimensional structure of a protein designated by the user;
Combination generation processing means (14) which, based on the data read by the data input processing means (17), in order to determine the similarity between two three-dimensional structures, namely the three-dimensional structure of a protein registered in the function database (24) and the three-dimensional structure of the designated protein, acquires for each of the two three-dimensional structures data described as a set of elements, each element taking an amino acid as its unit, having the attributes of that amino acid, being ordered by amino acid sequence number, and having coordinates as a point; which, when associating elements between the two sets, narrows down the candidate elements to be newly associated by geometric constraints that constrain the association according to differences between the sets in the distance relationship between an already associated element and the candidate element, or the distance relationship between a plurality of already associated elements and the candidate element, or the angular relationship between two already associated elements and the candidate element, or a combination thereof; which further narrows down the candidates by predetermined threshold conditions on the number of nil entries indicating that no corresponding element exists, or on the value of the square root of the mean square distance between the sets over the already associated elements and the candidate elements, or on the number of discontinuous sites, or on the number of residues constituting a site; which further narrows down the candidates by amino acid attributes; and which generates combinations of associated elements by incorporating into the already associated elements each new element whose association satisfies the geometric constraints, the threshold conditions, and the amino acid attributes;
Similarity calculation processing means (16) which searches, from among the plurality of combinations of associated elements generated by the combination generation processing means (14), for a combination in which the square root of the mean square distance between the elements associated between the sets is smaller than a predetermined threshold value and in which each distance between associated elements is also smaller than a predetermined threshold value, and which thereby selects, from the three-dimensional structures of proteins registered in the function database (24), a three-dimensional structure highly similar to the three-dimensional structure of the designated protein;
Function prediction processing means (20) for obtaining from the function database (24) the function expressed by the highly similar three-dimensional structure selected by the similarity calculation processing means (16), predicting that function to be the function expressed by the three-dimensional structure of the designated protein, and outputting information on the function and the functional site; the three-dimensional structure processing apparatus being characterized by comprising the foregoing means.
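
The selection and prediction steps of the similarity calculation processing means (16) and the function prediction processing means (20) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the names `rmsd`, `is_similar`, `predict_function`, and `pair_max`, the dictionary layout of the function database, and the function labels are all hypothetical. The sketch also assumes that the elements are already associated one to one and that the coordinates are already superposed, so no rotation or translation is fitted. The default `rmsd_max` of 1.0 follows the empirical 1 Å guideline mentioned in the description.

```python
from math import dist, sqrt

def rmsd(pairs):
    """Square root of the mean square distance over associated point pairs."""
    return sqrt(sum(dist(p, q) ** 2 for p, q in pairs) / len(pairs))

def is_similar(pairs, rmsd_max=1.0, pair_max=2.0):
    # Both conditions must hold: the overall r.m.s.d. is below its threshold
    # and every individual associated pair is below the per-pair threshold.
    return rmsd(pairs) < rmsd_max and all(dist(p, q) < pair_max for p, q in pairs)

def predict_function(query, function_db, rmsd_max=1.0, pair_max=2.0):
    """Return (function_name, rmsd) hits, most similar first.

    function_db is a hypothetical mapping from a function label to the
    coordinates of the stored site, assumed pre-associated and pre-superposed.
    """
    hits = []
    for name, coords in function_db.items():
        if not query or len(coords) != len(query):
            continue
        pairs = list(zip(query, coords))
        if is_similar(pairs, rmsd_max, pair_max):
            hits.append((name, rmsd(pairs)))
    return sorted(hits, key=lambda hit: hit[1])

query = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
function_db = {
    "hypothetical_site_A": [(0.1, 0.0, 0.0), (1.5, 0.1, 0.0), (1.5, 1.5, 0.1)],
    "hypothetical_site_B": [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (4.0, 4.0, 0.0)],
}
hits = predict_function(query, function_db)
```

Checking each pair distance in addition to the overall r.m.s.d. rejects associations in which one badly matched element is hidden by many well matched ones.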