JP5170630B2

JP5170630B2 - Protein function identification device

Info

Publication number: JP5170630B2
Application number: JP2007326742A
Authority: JP
Inventors: 弘毅塚本; 達也吉川; 祐一郎蓬来; 一彦福井
Original assignee: National Institute of Advanced Industrial Science and Technology AIST
Current assignee: National Institute of Advanced Industrial Science and Technology AIST
Priority date: 2007-12-19
Filing date: 2007-12-19
Publication date: 2013-03-27
Anticipated expiration: 2027-12-19
Also published as: JP2009151406A

Description

The present invention relates to a technique that uses bioinformatics to identify the function of a protein from its three-dimensional structure.

Conventionally, in the field of bioinformatics, many attempts have been made to identify function from protein structure. In this type of technique, the structure of a protein of known function is compared in some way with the structure of a protein of unknown function in order to identify the function of the latter.

The most widely used method at present is a similarity search on primary structure information, that is, sequence, against proteins of known function. This method exploits the correlation "sequence - evolutionary relationship - function", and effort goes into detecting relationships that are as distant as possible. However, estimating function from a distant relationship is limited in principle. The limit arises because the three-dimensional structure step in the first principle of protein function (sequence - three-dimensional structure - function) is omitted. In fact, sequence analysis alone has yielded function estimates for only about 50% of genes, and the functions of the remaining genes are still unknown.

The method of Non-Patent Document 1 is known as a function identification method that uses tertiary structure information. This prior art regards the function of a protein as binding, that is, molecular recognition, which is the first step in the expression of that function, and proposes an approach that predicts binding sites. It identifies function on the assumption that proteins with similar shape, electrostatic potential, and hydrophilicity/hydrophobicity in a given molecular surface region bind similar small-molecule ligands. Non-Patent Document 2 proposes a method that identifies function by focusing on partial structures of a protein called modules.

Here, many proteins (1) are triggered by a protein-protein interaction and then (2) undergo a chemical reaction between the protein and another protein or a low-molecular-weight chemical substance. Non-Patent Document 1 regards the chemical reaction of (2) as the protein function and does not perform function identification based on the protein-protein interaction of (1). Non-Patent Document 2 likewise does not perform function identification based on the protein-protein interaction of (1). Although Non-Patent Document 2 mentions protein-protein interaction, it does not succeed in using that interaction for function identification.

Thus, conventional function identification techniques do not reflect protein-protein interaction. In other words, the interaction between a protein and other proteins is not used to identify the function of that protein. Although protein-protein interactions have been studied, no example has been found in which they were successfully incorporated into function identification.

In addition, Non-Patent Documents 1 and 2 perform function identification by focusing on partial features of the three-dimensional structure. As a result, when a part of the structure other than the part in focus is actually related to the function, the function cannot be identified accurately. Furthermore, because these methods assume that the part involved in the function is known in advance, they cannot identify the function when that part is not known.

The above is the current state of function identification using three-dimensional structure. Although expectations for structure-based function identification are high, sufficient discrimination ability has not yet been achieved, and further improvement is always desired.

As another related technique, ZDOCK of Non-Patent Document 3 is known. ZDOCK is a technique that, from the three-dimensional structures of two proteins, obtains the structure of the complex in which the two proteins are docked (bound). ZDOCK evaluates the shape complementarity between the three-dimensional structures of the two proteins, that is, how well the surface irregularities of the two proteins fit each other. ZDOCK can therefore determine protein-protein interaction. However, ZDOCK does not provide a method for identifying the function of a protein.

Another related technique is Non-Patent Document 4, which provides samples of receptors and ligands.
Non-Patent Document 1: Kengo Kinoshita, Haruki Nakamura, Identification of protein biochemical functions by similarity search using the molecular surface database eF-site, Protein Science, Cold Spring Harbor Laboratory Press, 2003, 12: 1589-1595
Non-Patent Document 2: K. Yura, M. Shionyu, K. Kawaktani, M Go, Repetitive use of a phosphate-binding module in DNA polymerase, Oct-1 POU domain and phage repressors, CMLS Cellular and Molecular Life Science, Birkhauser Verlag, Basel, 1999, 55: 472-486
Non-Patent Document 3: Rong Chen, Zhiping Weng, A Novel Shape Complementarity Scoring Function for Protein-Protein Docking, PROTEINS: Structure, Function, and Genetics, WILEY-LISS, INC., 2003, 51: 397-408
Non-Patent Document 4: Julian Mintseris, Kevin Wiehe, Brian Pierce, Robert Anderson, Rong Chen, Joel Janin, and Zhiping Weng, Protein-Protein Docking Benchmark 2.0: An Update, PROTEINS: Structure, Function, and Genetics, WILEY-LISS, INC., 2005, 60: 214-216

The present invention has been made against the background described above. Its object is to provide a technique that can perform function identification that takes protein-protein interaction into account and that achieves high discrimination ability.

One aspect of the present invention is a protein function identification device that identifies the function of a function-unknown protein based on the three-dimensional structures of function-known proteins and the three-dimensional structure of the function-unknown protein. The device comprises: a receptor storage unit that stores the three-dimensional structures of a plurality of receptors as function-known proteins; a ligand storage unit that stores the three-dimensional structures of a plurality of ligands; a teacher data generation unit that, based on the three-dimensional structures of the plurality of receptors and of the plurality of ligands, calculates as teacher data for function identification, for each of the function-known receptors, a plurality of shape complementarity evaluation values obtained when that receptor is docked with each of the plurality of ligands, together with charge information of that receptor; an unknown protein input unit that inputs the three-dimensional structure of the function-unknown protein; an identification input data generation unit that, based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculates as identification input data for function identification a plurality of shape complementarity evaluation values obtained when the function-unknown protein is docked with each of the plurality of ligands, together with charge information of the function-unknown protein; and a learning identification unit that learns the teacher data and identifies the function of the function-unknown protein. The learning identification unit identifies the function of the function-unknown protein by determining, based on the similarity of the plurality of shape complementarity evaluation values and the charge information among receptors that share a function, the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the function-unknown protein.

According to the present invention, teacher data is obtained from the three-dimensional structures of a plurality of function-known receptors and the three-dimensional structures of a plurality of ligands, as described above. For each receptor, the teacher data includes a plurality of shape complementarity evaluation values obtained when that receptor is docked with each of the ligands, and the charge information of that receptor. Identification input data is likewise obtained from the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the ligands; it includes a plurality of shape complementarity evaluation values obtained when the function-unknown protein is docked with each of the ligands, and the charge information of the function-unknown protein. The present invention then learns the teacher data and identifies the function of the function-unknown protein. By using shape complementarity evaluation values and charge information in this way, function identification can take protein-protein interaction into account and can be performed with high discrimination ability.

The teacher data generation unit may perform, for each combination of a receptor three-dimensional structure and a ligand three-dimensional structure, a docking simulation that fits the surface irregularities of the two structures to each other, and may calculate a shape complementarity evaluation value that expresses how well the surface irregularities of the two structures fit. The identification input data generation unit may perform the same docking simulation for each combination of the three-dimensional structure of the function-unknown protein and a ligand three-dimensional structure to calculate the shape complementarity evaluation value. According to the present invention, shape complementarity can be evaluated appropriately by performing a docking simulation.
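The idea of scoring how well two surfaces fit can be illustrated with a toy grid search. The following is a minimal sketch only, assuming a simple product-sum score on 2D grids with a penalty for overlapping the receptor interior; it is not ZDOCK's scoring function and not the method of this patent. The names `score_at`, `best_shape_score`, and the weight `CORE_R` are hypothetical.

```python
from itertools import product

CORE_R = -15  # assumed penalty for a ligand cell overlapping the receptor interior

def score_at(receptor, ligand, dx, dy):
    """Sum of cell-by-cell products with the ligand shifted by (dx, dy)."""
    total = 0
    for (x, y), lig_value in ligand.items():
        total += receptor.get((x + dx, y + dy), 0) * lig_value
    return total

def best_shape_score(receptor, ligand, span):
    """Best score over all integer translations within +/- span (no rotations)."""
    shifts = product(range(-span, span + 1), repeat=2)
    return max(score_at(receptor, ligand, dx, dy) for dx, dy in shifts)

# Toy receptor: a U-shaped interior with a two-cell pocket marked as surface (1).
receptor = {
    (0, 0): CORE_R, (1, 0): CORE_R, (2, 0): CORE_R,
    (0, 1): CORE_R, (2, 1): CORE_R,
    (1, 1): 1, (1, 2): 1,
}
# Toy ligand: a two-cell rod that fits the pocket.
ligand = {(0, 0): 1, (0, 1): 1}

print(best_shape_score(receptor, ligand, span=3))  # 2: both ligand cells sit in the pocket
```

A real docking program also searches over rotations and works on 3D grids; this sketch keeps translations only so that the scoring idea stays visible.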

The teacher data generation unit and the identification input data generation unit may perform, before the docking simulation, a process that determines the protein surface from the three-dimensional structure of the protein, and the thickness of the protein surface may be adjustable. According to the present invention, the surface thickness of the three-dimensional structure used in the docking simulation can be adjusted, which can improve the discrimination ability.

The charge information calculated by the teacher data generation unit and the identification input data generation unit may include the difference between the total charge of the whole protein and the total charge of its surface, where the protein is each receptor or the function-unknown protein. According to the present invention, using charge information that includes this difference as teacher data can improve the discrimination ability.

The charge information calculated by the teacher data generation unit and the identification input data generation unit may include, for the protein that is each receptor or the function-unknown protein: the solvent-exposed area of the protein; the number of positively charged residues in the whole protein; the number of negatively charged residues in the whole protein; the number of histidine residues in the whole protein; the total charge of the whole protein; the number of positively charged residues on the protein surface; the number of negatively charged residues on the protein surface; the number of histidine residues on the protein surface; the total charge of the protein surface; and the difference between the total charge of the whole protein and the total charge of the surface. According to the present invention, using charge information that includes these parameters as teacher data can improve the discrimination ability.
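The ten charge parameters listed above can be assembled into a feature list as in the following minimal sketch. Several points are assumptions made for illustration and are not stated in this passage: ARG and LYS count as positive, ASP and GLU as negative, histidine is counted separately and contributes nothing to the total charge, and the total charge is the positive count minus the negative count. The function names and the input format (residue name plus a surface flag, with the solvent-exposed area supplied by the caller) are hypothetical.

```python
POSITIVE = {"ARG", "LYS"}   # assumed positively charged residues
NEGATIVE = {"ASP", "GLU"}   # assumed negatively charged residues

def count_charges(names):
    """Return (positive count, negative count, histidine count, total charge)."""
    pos = sum(n in POSITIVE for n in names)
    neg = sum(n in NEGATIVE for n in names)
    his = sum(n == "HIS" for n in names)
    return pos, neg, his, pos - neg

def charge_features(residues, exposed_area):
    """residues: list of (three-letter residue name, is_surface) pairs."""
    whole = [name for name, _ in residues]
    surface = [name for name, on_surface in residues if on_surface]
    w_pos, w_neg, w_his, w_total = count_charges(whole)
    s_pos, s_neg, s_his, s_total = count_charges(surface)
    return [exposed_area,
            w_pos, w_neg, w_his, w_total,
            s_pos, s_neg, s_his, s_total,
            w_total - s_total]  # whole-protein charge minus surface charge

residues = [("LYS", True), ("ASP", True), ("ARG", False),
            ("HIS", True), ("GLU", True), ("ALA", False)]
print(charge_features(residues, exposed_area=1234.5))
# [1234.5, 2, 2, 1, 0, 1, 2, 1, -1, 1]
```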

The learning identification unit may classify the charge information and the shape complementarity evaluation values included in the teacher data into a plurality of functional categories according to receptor function, and may determine to which functional category the charge information and the shape complementarity evaluation values of the function-unknown protein included in the identification input data belong. According to the present invention, information on a plurality of function-known receptors can be learned appropriately, and the function of a function-unknown protein can be identified based on that information.

The plurality of shape complementarity evaluation values of each receptor with the plurality of ligands, together with the charge information of that receptor, may constitute a receptor vector for that receptor. The plurality of shape complementarity evaluation values of the function-unknown protein with the plurality of ligands, together with the charge information of the function-unknown protein, may constitute an identification input vector for the function-unknown protein. The learning identification unit may detect a separation plane that divides the receptor vectors corresponding to the plurality of receptors into the plurality of functional categories, and may determine the functional category to which the identification input vector belongs from the positional relationship between the separation plane and the identification input vector. According to the present invention, using vector data makes it possible to learn information on a plurality of function-known receptors appropriately and to identify the function of a function-unknown protein based on that information. A support vector machine may be used for the learning and identification.
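The vector construction and the use of a separation plane can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: a plain perceptron is swapped in for the support vector machine (it finds some separating hyperplane for linearly separable data, not the maximum-margin one), only two categories are shown, and all names and toy numbers are hypothetical.

```python
def make_vector(shape_scores, charge_info):
    """Concatenate docking scores and charge information into one feature vector."""
    return list(shape_scores) + list(charge_info)

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def train_perceptron(vectors, labels, epochs=100):
    """Find a hyperplane (w, b) separating labels +1 and -1, if one exists."""
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(vectors, labels):
            if y * (dot(w, x) + b) <= 0:      # wrong side of (or on) the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:                     # every receptor vector is separated
            break
    return w, b

def side_of_plane(w, b, x):
    """+1 or -1 depending on which side of the plane the vector lies."""
    return 1 if dot(w, x) + b > 0 else -1

# Toy receptor vectors: [score vs ligand A, score vs ligand B, charge difference].
receptor_vectors = [
    make_vector([2.0, 0.5], [1.0]), make_vector([1.8, 0.4], [1.0]),    # category +1
    make_vector([0.5, 2.0], [-1.0]), make_vector([0.4, 1.9], [-1.0]),  # category -1
]
labels = [1, 1, -1, -1]
w, b = train_perceptron(receptor_vectors, labels)

unknown = make_vector([1.9, 0.6], [1.0])   # identification input vector
print(side_of_plane(w, b, unknown))        # 1: falls on the +1 side
```

With more than two functional categories, one common arrangement is to train one such plane per category against the rest; whether the patent does this is not stated in this passage.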

The plurality of functional categories may be three categories: antibodies, enzymes, and other functions. According to the present invention, such broad functional classes can be identified appropriately.

Another aspect of the present invention is a protein function identification device that identifies the function of a function-unknown protein based on the three-dimensional structures of function-known proteins and the three-dimensional structure of the function-unknown protein. The device comprises: a teacher data storage unit that stores, as teacher data for function identification, for each of a plurality of receptors as function-known proteins, information on the function of that receptor, a plurality of shape complementarity evaluation values obtained when the three-dimensional structure of that receptor is docked with the three-dimensional structure of each of a plurality of ligands, and charge information of that receptor; an identification input data generation unit that calculates, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the three-dimensional structure of the function-unknown protein is docked with the three-dimensional structure of each of the plurality of ligands, together with charge information of the function-unknown protein; and a learning identification unit that learns the teacher data and identifies the function of the function-unknown protein. The learning identification unit identifies the function of the function-unknown protein by determining, based on the similarity of the plurality of shape complementarity evaluation values and the charge information among receptors that share a function, the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. This aspect also provides the advantages of the present invention described above.

Another aspect of the present invention is a protein function identification method that identifies the function of a function-unknown protein based on the three-dimensional structures of function-known proteins and the three-dimensional structure of the function-unknown protein. The method comprises: reading the three-dimensional structures of a plurality of receptors as function-known proteins from a receptor storage unit; reading the three-dimensional structures of a plurality of ligands from a ligand storage unit; calculating, based on the three-dimensional structures of the plurality of receptors and of the plurality of ligands, as teacher data for function identification, for each of the function-known receptors, a plurality of shape complementarity evaluation values obtained when that receptor is docked with each of the plurality of ligands, together with charge information of that receptor; inputting the three-dimensional structure of the function-unknown protein; calculating, based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein is docked with each of the plurality of ligands, together with charge information of the function-unknown protein; and performing learning identification that learns the teacher data and identifies the function of the function-unknown protein. The learning identification identifies the function of the function-unknown protein by determining, based on the similarity of the plurality of shape complementarity evaluation values and the charge information among receptors that share a function, the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. This aspect also provides the advantages of the present invention described above.

Another aspect of the present invention is a protein function identification program that causes a computer to execute a process of identifying the function of a function-unknown protein based on the three-dimensional structures of function-known proteins and the three-dimensional structure of the function-unknown protein. The program causes the computer to execute: reading the three-dimensional structures of a plurality of receptors as function-known proteins from a receptor storage unit; reading the three-dimensional structures of a plurality of ligands from a ligand storage unit; calculating, based on the three-dimensional structures of the plurality of receptors and of the plurality of ligands, as teacher data for function identification, for each of the function-known receptors, a plurality of shape complementarity evaluation values obtained when that receptor is docked with each of the plurality of ligands, together with charge information of that receptor; inputting the three-dimensional structure of the function-unknown protein; calculating, based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein is docked with each of the plurality of ligands, together with charge information of the function-unknown protein; and performing learning identification that learns the teacher data and identifies the function of the function-unknown protein. The learning identification process identifies the function of the function-unknown protein by determining, based on the similarity of the plurality of shape complementarity evaluation values and the charge information among receptors that share a function, the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. This aspect also provides the advantages of the present invention described above.

According to the present invention, the function of a function-unknown protein is identified using its shape complementarity with a plurality of ligands and its charge information, as described above. Function identification can therefore take protein-protein interaction into account and can be performed with high discrimination ability.

Preferred embodiments of the present invention are described below with reference to the drawings.

FIG. 1 shows a protein function identification device according to an embodiment of the present invention. The protein function identification device 1 is a computer device that includes a CPU as an arithmetic unit, storage devices such as RAM and ROM, input devices such as a keyboard and a pointing device, output devices such as a display and a printer, and an external storage device such as a hard disk. The protein function identification device 1 may have a function for communicating with a network, and this communication function may serve as an information input/output device. A configuration for reading and writing data on an external recording medium may also serve as an input/output device. The storage device stores a program that causes the computer to perform the various processes of the present invention, and the protein function identification device 1 is realized by executing this program. The protein function identification device 1 may consist of a single computer device or of a plurality of computer devices, which may be arranged in a distributed manner.

In outline, the protein function identification device 1 identifies the function of a function-unknown protein based on the three-dimensional structures of function-known proteins and the three-dimensional structure of the function-unknown protein. Three-dimensional structure data of a plurality of receptors is used as the information on function-known proteins. Three-dimensional structure data of a plurality of ligands is also used. Strictly speaking, the receptors and ligands are receptor proteins and ligand proteins, but they are referred to here simply as receptors and ligands. Based on these three-dimensional structure data, the protein function identification device 1 determines the function of a function-unknown protein from its three-dimensional structure.

As shown in FIG. 1, the protein function identification device 1 includes: a receptor storage unit 3 that stores the three-dimensional structures of a plurality of receptors; a ligand storage unit 5 that stores the three-dimensional structures of a plurality of ligands; a teacher data unit 7 that generates teacher data from the three-dimensional structure data of the plurality of receptors and the plurality of ligands; an unknown protein input unit 9 that inputs the three-dimensional structure of a function-unknown protein; an identification input data unit 11 that generates identification input data from the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands; a learning identification unit 13 that processes the teacher data and the identification input data to identify the function of the function-unknown protein; and an output unit 15 that outputs information on the identified function.

As described above, the receptor storage unit 3 stores the three-dimensional structure data of a plurality of function-known receptors, and the ligand storage unit 5 stores the three-dimensional structure data of a plurality of ligands. Data for function-known ligands may also be stored. Data in PDB (Protein Data Bank) format is suitable as the three-dimensional structure data.

The teacher data unit 7 includes a teacher data generation unit 21 and a teacher data storage unit 23. The teacher data generation unit 21 generates teacher data from the three-dimensional structures of the receptors in the receptor storage unit 3 and the three-dimensional structures of the ligands in the ligand storage unit 5, and the generated teacher data is stored in the teacher data storage unit 23. The teacher data generation process is described in detail below.

FIG. 2 shows the configuration of the teacher data generation unit 21. As shown in FIG. 2, the teacher data generation unit 21 includes a teacher docking evaluation unit 31, a teacher docking data storage unit 33, a teacher residue charge evaluation unit 35, and a teacher residue charge data storage unit 37. The teacher docking evaluation unit 31 performs a docking simulation for each combination formed from the plurality of receptors and the plurality of ligands, calculates a shape complementarity evaluation value, and stores the calculated value in the teacher docking data storage unit 33. The teacher residue charge evaluation unit 35 calculates charge information for each of the receptors and stores the calculated charge information in the teacher residue charge data storage unit 37.

As illustrated, the teacher docking evaluation unit 31 includes grid conversion units 41 and 43, PSC conversion units 45 and 47, and a docking simulation unit 49.

The three-dimensional structure of a receptor is input from the receptor storage unit 3 to the grid conversion unit 41. The grid conversion unit 41 converts the receptor structure, expressed in PDB format, into grid data. Similarly, the three-dimensional structure of a ligand is input from the ligand storage unit 5 to the grid conversion unit 43, which converts the ligand structure, expressed in PDB format, into grid data. PDB-format data consists of the coordinates of each atom in the protein. Grid data, in contrast, consists of data for a large number of equally spaced grid points, and each grid point has a structural attribute value, that is, information on which part of the protein structure (surface, interior, or exterior) the point lies in. Data in the Voxel representation format is preferably used as the grid data. The Voxel representation is described later. The converted grid data of each receptor is called receptor grid data, and the converted grid data of each ligand is called ligand grid data. The receptor grid data is input to the PSC conversion unit 45, and the ligand grid data is input to the PSC conversion unit 47.

The PSC conversion units 45 and 47 convert the input grid data of a protein three-dimensional structure into the PSC representation format required to run the docking simulation. PSC stands for the Pairwise Shape Complementarity contact surface score function defined in ZDOCK (see Non-Patent Document 3), which was developed by Weng et al. at Boston University. The PSC conversion unit 45 converts the grid data of each receptor into the PSC representation according to the PSC contact surface score function. Similarly, the PSC conversion unit 47 converts the grid data of each ligand into the PSC representation according to the PSC contact surface score function. The PSC contact surface score function is described later. The converted PSC data of each receptor is called receptor PSC data, and the converted PSC data of each ligand is called ligand PSC data. The receptor PSC data and the ligand PSC data are input to the docking simulation unit 49.

The docking simulation unit 49 runs the docking simulation calculation using the receptor PSC data and the ligand PSC data. The docking simulation fits the surface relief of two proteins together and calculates a shape complementarity evaluation value that expresses how well the two surfaces match. In the present embodiment, the contact surface score value Spsc is obtained as the shape complementarity evaluation value. Details of the docking simulation are described later.

The teacher docking evaluation unit 31 repeats the data processing described above a plurality of times, that is, the acquisition of data from the receptor storage unit 3 and the ligand storage unit 5 and the processing in the grid conversion units 41 and 43, the PSC conversion units 45 and 47, and the docking simulation unit 49. As a result, a docking simulation is performed, and a shape complementarity evaluation value is obtained, for each of the many combinations formed from the plurality of receptors and the plurality of ligands.

This repeated processing is implemented as a calculation with nested inner and outer loops. The inner loop runs over all the ligand data stored in the ligand storage unit 5. It focuses on one receptor and repeats the calculation while replacing the ligand in turn. For example, if the ligand storage unit 5 holds 200 data items, the inner loop is repeated 200 times and 200 Spsc values are obtained. In this example, therefore, one teacher docking data record consists of the receptor name and 200 Spsc values.

The outer loop runs over all the receptor data stored in the receptor storage unit 3. If the receptor storage unit 3 holds 100 data items, the outer loop is repeated 100 times. Taking the outer and inner loops together, the teacher docking data therefore contains 200 × 100 = 20,000 Spsc values in total.

Note that the conversion to grid data and the conversion to PSC data may be performed only once for each receptor and each ligand. In that case, the plurality of receptor PSC data and the plurality of ligand PSC data are retained, and the docking simulation is then performed on all combinations of the receptor PSC data and the ligand PSC data.

FIG. 4 shows the data obtained by the teacher docking evaluation unit 31. As illustrated, an Spsc value is obtained for every combination of the receptors and the ligands. In the example shown, the receptors and ligands are numbered, and S(i, j) denotes the Spsc of receptor i and ligand j. FIG. 4 corresponds to the example above: 20,000 Spsc values are obtained from 100 receptors and 200 ligands.
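The nested loop described above can be sketched as follows. This is a minimal illustration, not the device's implementation: `build_score_table` and `toy_dock` are hypothetical names, and `toy_dock` is a stand-in that only multiplies two numbers so the example runs; the actual docking simulation producing Spsc is described later in the document.

```python
from typing import Callable, Dict, List


def build_score_table(
    receptors: Dict[str, object],
    ligands: Dict[str, object],
    dock: Callable[[object, object], float],
) -> Dict[str, List[float]]:
    """Return {receptor name: [score against each ligand, in ligand order]}."""
    table: Dict[str, List[float]] = {}
    for r_name, r_data in receptors.items():      # outer loop: all receptors
        row = []
        for l_data in ligands.values():           # inner loop: all ligands
            row.append(dock(r_data, l_data))
        table[r_name] = row
    return table


def toy_dock(receptor: object, ligand: object) -> float:
    """Placeholder scorer for demonstration only; NOT a docking simulation."""
    return float(receptor * ligand)


# 100 receptors and 200 ligands, as in the example in the text.
receptors = {f"R{i}": i for i in range(1, 101)}
ligands = {f"L{j}": j for j in range(1, 201)}

scores = build_score_table(receptors, ligands, toy_dock)
total = sum(len(row) for row in scores.values())
print(len(scores), total)
```

Each row of the table corresponds to one teacher docking data record (a receptor name plus its 200 scores), and the rows together hold the 20,000 values S(i, j).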

Next, the generation of charge information by the teacher residue charge evaluation unit 35 is described. The teacher residue charge evaluation unit 35 acquires the three-dimensional structure data of the plurality of receptors from the receptor storage unit 3. The following processing is applied to the three-dimensional structure data of each receptor to calculate the charge information.

The teacher residue charge evaluation unit 35 includes a molecular surface calculation unit 51, an amino acid residue counting unit 53, and a charge calculation unit 55. The molecular surface calculation unit 51 calculates the solvent-exposed area and detects the amino acid residues exposed on the protein surface (surface-exposed residues). Here, a probe sphere about the size of a water molecule is rolled along the protein surface, and the residues that the probe sphere touches are detected as surface-exposed residues. The area of the part touched by the probe sphere during detection (the part over which it rolled) is taken as the solvent-exposed area. The amino acid residue counting unit 53 counts the numbers of positively charged residues, negatively charged residues, and histidine residues. For the surface counts, only the surface-exposed residues determined in the solvent-exposed area calculation are considered, and all of their charged residues in the vacuum state are counted. Histidine is a residue whose charge sign changes with pH; here it is treated as neutral. From the receptor data and the results of the molecular surface calculation unit 51 and the amino acid residue counting unit 53, the charge calculation unit 55 obtains the total charge of the whole protein (receptor), the total charge of the protein surface, and the difference between these two charges. In this way, as shown in FIG. 5, the teacher residue charge evaluation unit 35 obtains the following charge information (E1) to (E10) for the vacuum state. The obtained charge information is stored in the teacher residue charge data storage unit 37.

(E1) Solvent-exposed area of the protein (receptor)
(E2) Number of positively charged residues in the whole protein
(E3) Number of negatively charged residues in the whole protein
(E4) Number of histidine residues in the whole protein
(E5) Total charge of the whole protein (positive charge + negative charge)
     The numbers of positive and negative charges and the total charge are calculated for the vacuum state, without considering the acid dissociation constant pKa or similar factors. As stated above, histidine is assumed to be neutral and is not included in this charge calculation.
(E6) Number of positively charged residues on the protein surface
(E7) Number of negatively charged residues on the protein surface
(E8) Number of histidine residues on the protein surface
(E9) Total charge of the protein surface (positive charge + negative charge)
(E10) Difference between the total charge of the whole protein and the total charge of the protein surface (= E5 − E9)
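As a rough illustration of how the ten values relate to one another, the sketch below computes (E1) to (E10) from a list of residues that have already been flagged as exposed or buried. Several things here are assumptions not stated in the text: lysine and arginine are taken as the positively charged residues, aspartate and glutamate as the negatively charged ones, each contributes a charge of +1 or −1, and the exposure flag and exposed area are supplied by some earlier surface calculation. The names `Residue` and `charge_info` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

POSITIVE = {"LYS", "ARG"}  # assumption: residues counted as +1
NEGATIVE = {"ASP", "GLU"}  # assumption: residues counted as -1


@dataclass
class Residue:
    name: str      # three-letter residue code
    exposed: bool  # True if judged surface-exposed by the surface calculation


def _counts(residues: List[Residue]) -> Tuple[int, int, int, int]:
    """Return (positive count, negative count, histidine count, net charge)."""
    pos = sum(r.name in POSITIVE for r in residues)
    neg = sum(r.name in NEGATIVE for r in residues)
    his = sum(r.name == "HIS" for r in residues)
    return pos, neg, his, pos - neg  # histidine is treated as neutral


def charge_info(residues: List[Residue], exposed_area: float) -> Dict[str, float]:
    """Assemble E1..E10 for one protein."""
    surface = [r for r in residues if r.exposed]
    p, n, h, q = _counts(residues)
    sp, sn, sh, sq = _counts(surface)
    return {
        "E1": exposed_area,
        "E2": p, "E3": n, "E4": h, "E5": q,
        "E6": sp, "E7": sn, "E8": sh, "E9": sq,
        "E10": q - sq,
    }


example = [
    Residue("LYS", True), Residue("ARG", False), Residue("ASP", True),
    Residue("GLU", True), Residue("HIS", True), Residue("ALA", False),
]
info = charge_info(example, exposed_area=123.4)
print(info)
```

In this toy input the whole protein is neutral (two positive, two negative), while the surface carries a net charge of −1, so E10 comes out as +1.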

As described above, in the teacher data generation unit 21, the teacher docking evaluation unit 31 performs the docking simulation of each receptor and ligand and stores the resulting Spsc (shape complementarity evaluation value) in the teacher docking data storage unit 33. The teacher residue charge evaluation unit 35 calculates the charge information of each receptor and stores it in the teacher residue charge data storage unit 37. These data are stored in the teacher data storage unit 23 as teacher data. In addition, information on the known function of each receptor is read from the receptor storage unit 3 and stored in the teacher data storage unit 23 as part of the teacher data. In the example of the present embodiment, there are three kinds of receptor function, namely antibody, enzyme, and other, and the teacher data storage unit 23 stores which kind of function each receptor has.

The teacher data unit 7 has been described above. Next, the identification input data unit 11 is described. The identification input data unit 11 generates, from the three-dimensional structure data of the unknown-function protein, the identification input data to which the identification process based on the teacher data is applied. The teacher data unit 7 and the identification input data unit 11 have largely the same configuration but process different data. That is, the teacher data unit 7 processes the three-dimensional structure data of receptors of known function obtained from the receptor storage unit 3, whereas the identification input data unit 11 processes the three-dimensional structure data of the unknown-function protein input from the unknown protein input unit 9. In the following description, matters in common with the teacher data unit 7 are omitted where appropriate.

As shown in FIG. 1, the identification input data unit 11 includes an identification input data generation unit 25 and an identification input data storage unit 27. The three-dimensional structure data of the unknown-function protein is input to the identification input data generation unit 25 from the unknown protein input unit 9. Like the data in the receptor storage unit 3 and the ligand storage unit 5, this three-dimensional structure data is in PDB format. The identification input data generation unit 25 generates identification input data from the three-dimensional structure of the unknown-function protein and the three-dimensional structures of the ligands in the ligand storage unit 5, and the generated identification input data is stored in the identification input data storage unit 27.

FIG. 3 shows the configuration of the identification input data generation unit 25. As shown in FIG. 3, the identification input data generation unit 25 includes an input docking evaluation unit 61, an input docking data storage unit 63, an input residue charge evaluation unit 65, and an input residue charge data storage unit 67. The input docking evaluation unit 61 performs a docking simulation for each combination formed from the unknown-function protein and the plurality of ligands, calculates a shape complementarity evaluation value, and stores the calculated value in the input docking data storage unit 63. The input residue charge evaluation unit 65 calculates the charge information of the unknown-function protein and stores it in the input residue charge data storage unit 67.

The input docking evaluation unit 61 includes grid conversion units 71 and 73, PSC conversion units 75 and 77, and a docking simulation unit 79. The grid conversion unit 71 converts the PDB-format three-dimensional structure data of the unknown-function protein into grid data. Here too, data in the Voxel representation is generated. The grid data of the unknown-function protein is converted into PSC data by the PSC conversion unit 75. As in the generation of the teacher data, the grid conversion unit 73 and the PSC conversion unit 77 convert the three-dimensional structure data of each ligand into grid data and then into PSC data.

The docking simulation unit 79 performs the docking simulation on the unknown-function protein and a ligand. The docking simulation process may be the same as that used in teacher data generation, except that the PSC data of the unknown-function protein, instead of receptor PSC data, is docked with the ligand PSC data. The Spsc of the unknown-function protein and the ligand is obtained as the shape complementarity evaluation value.

By repeating the above processing a plurality of times, the input docking evaluation unit 61 obtains the Spsc between the unknown-function protein and each of the ligands. A single loop is preferably used for these repeated simulations. The loop runs over all the ligands stored in the ligand storage unit 5 and is the same as the inner loop described for the teacher data generation process. If the ligand storage unit 5 holds 200 ligand data items, the loop is repeated 200 times and 200 Spsc values are obtained. The 200 Spsc values are stored in the input docking data storage unit 63 together with the name of the unknown-function protein.

Next, the generation of charge information by the input residue charge evaluation unit 65 is described. The input residue charge evaluation unit 65 has the same configuration as the teacher residue charge evaluation unit 35. However, whereas the teacher residue charge evaluation unit 35 processes the three-dimensional structure data of receptors, the input residue charge evaluation unit 65 processes the three-dimensional structure data of the unknown-function protein. The input residue charge evaluation unit 65 calculates the data (E1) to (E10) described above as the charge information of the unknown-function protein (FIG. 5). The charge information is stored in the input residue charge data storage unit 67.

The input residue charge evaluation unit 65 includes a molecular surface calculation unit 81, an amino acid residue counting unit 83, and a charge calculation unit 85, which correspond respectively to the molecular surface calculation unit 51, the amino acid residue counting unit 53, and the charge calculation unit 55 of the teacher residue charge evaluation unit 35.

As described above, in the identification input data generation unit 25, the input docking evaluation unit 61 performs the docking simulation of the unknown-function protein and each ligand and stores the resulting Spsc (shape complementarity evaluation value) in the input docking data storage unit 63. The input residue charge evaluation unit 65 calculates the charge information of the unknown-function protein and stores it in the input residue charge data storage unit 67. These data are stored in the identification input data storage unit 27 as identification input data.

Next, the learning identification unit 13 is described. The learning identification unit 13 learns from the teacher data and identifies the function of the unknown-function protein.

FIG. 6 shows the concept of the learning identification process. The teacher data consists of the function of each receptor, the Spsc of every combination of the receptors and the ligands, and the charge information of each receptor. In FIG. 6, the teacher data is organized by receptor. The receptors and ligands are numbered. S(i, j) in the figure is the Spsc of receptor i and ligand j, and E1(i) is the charge information E1 of receptor i (likewise for the charge information E2 to E10; the same applies below). As illustrated, each receptor 1, 2, ... has the Spsc values from docking with all the ligands 1 to 200 and the charge information E1 to E10. The receptors are classified by function (antibody, enzyme, other) into functional categories 1, 2, and 3.

The identification input data, on the other hand, is data on the unknown-function protein, as illustrated. S(t, j) in the figure denotes the Spsc of the unknown-function protein and ligand j, and E1(t) denotes the charge information E1 of the unknown-function protein. Like the receptor data, it comprises a plurality of Spsc values and charge information.

The present invention focuses on the fact that receptors sharing a common function have similar teacher data, that is, similar shape complementarity evaluation values (200 Spsc values) and similar charge information (E1 to E10). The learning identification unit 13 therefore identifies the function of the unknown-function protein by finding the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the unknown-function protein. In other words, the learning identification unit 13 determines which of the three functional categories holds the information most similar to that of the unknown-function protein.

To carry out this processing, the learning identification unit 13 classifies the charge information and shape complementarity evaluation values contained in the teacher data into a plurality of functional categories according to receptor function, and determines which functional category the charge information and shape complementarity evaluation values of the unknown-function protein, contained in the identification input data, belong to. The learning identification unit 13 is configured to perform what is known as machine learning. Specifically, its processing is implemented with a support vector machine (hereinafter, SVM), using a radial basis function (RBF) as the kernel.
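The kernel named above is presumably the standard radial basis function (Gaussian) kernel, K(x, y) = exp(−γ‖x − y‖²). The following minimal sketch evaluates it for two feature vectors; the parameter `gamma` is a hyperparameter whose value the text does not give, and the function name `rbf_kernel` is chosen here for illustration.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


# Identical vectors give 1.0; the value decays toward 0 as vectors move apart.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
apart = rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5)
print(same, apart)
```

Because the kernel depends only on the distance between two vectors, receptors with similar Spsc profiles and charge information yield kernel values close to 1, which matches the similarity-based reasoning of the preceding paragraphs.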

FIG. 7 shows the concept of the SVM. In an SVM, a plurality of vectors is divided into a plurality of groups, and a separating plane between the groups is obtained. The separating plane is a plane that divides the vector space and is called a hyperplane.

As shown in FIG. 7, in the present embodiment a plurality of receptor vectors, one for each receptor, is obtained from the teacher data. As illustrated, a receptor vector consists of a receptor name, a plurality of Spsc values, charge information, and a function. The Spsc values correspond to the respective ligands. The charge information is all of the information (E1) to (E10) obtained by the teacher residue charge evaluation unit 35 described above. The function is one of three kinds: antibody, enzyme, or other. In FIG. 7, as in FIG. 6, the receptors and ligands are numbered; Spsc(i, j) is the Spsc of receptor i and ligand j, and E1(i) is the charge information E1 of receptor i. The example of FIG. 7 corresponds to that of FIG. 6, with 100 receptors and 200 ligands.

As shown in FIG. 7, similar vector data is obtained from the identification input data. This vector data is called identification input vector data. It comprises a plurality of Spsc values and charge information. Because the function is unknown, however, the identification input vector data carries no function information.
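The numeric part of such a vector can be assembled as in the sketch below, which concatenates the 200 Spsc values with the ten charge values to give a 210-element vector. This layout is an assumption for illustration: the text does not state the ordering of elements or whether any scaling is applied, and `to_feature_vector` is a hypothetical helper. The name and, for receptors, the function label would be kept alongside the vector rather than inside it.

```python
from typing import List, Mapping, Sequence

CHARGE_KEYS = [f"E{k}" for k in range(1, 11)]  # "E1" .. "E10"


def to_feature_vector(
    spsc: Sequence[float], charge: Mapping[str, float]
) -> List[float]:
    """Concatenate the Spsc values with E1..E10 in a fixed order."""
    return [float(s) for s in spsc] + [float(charge[k]) for k in CHARGE_KEYS]


# Dummy values standing in for one protein's docking scores and charge data.
spsc = [float(j) for j in range(200)]
charge = {key: 0.0 for key in CHARGE_KEYS}

vec = to_feature_vector(spsc, charge)
print(len(vec))
```

The same function applies to a receptor and to the unknown-function protein, which is what allows the two kinds of vector to be compared in one vector space.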

First, the receptor vector data corresponding to the plurality of receptors is input to the SVM. The SVM finds the separating planes that divide the receptor vector data into the functional categories (antibody, enzyme, other). To illustrate the concept of the SVM, the figure shows the separating plane between antibody and enzyme. In practice, the separating plane between antibody and other and the separating plane between enzyme and other are also found. These separating planes may be held in the storage device of the protein function identification device 1.

After the separating planes have been obtained as described above, the identification input vector data is input to the SVM. The SVM determines the functional category to which the identification input vector belongs from the positional relationship between the separating planes and the identification input vector. The function corresponding to the determined functional category is taken as the function of the unknown-function protein. The function information obtained by the learning identification unit 13 is output from the output unit 15 as the identification result.
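One common way to turn three pairwise separating planes into a single category is majority voting, sketched below. The text does not say how the device combines the planes, so the voting rule, the sign convention (a non-negative decision value votes for the first category of the pair), and the toy decision functions are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Sequence, Tuple

Decision = Callable[[Sequence[float]], float]


def classify(x: Sequence[float], planes: Dict[Tuple[str, str], Decision]) -> str:
    """Each pairwise plane casts one vote; the category with most votes wins."""
    votes: Counter = Counter()
    for (first, second), decision in planes.items():
        votes[first if decision(x) >= 0 else second] += 1
    return votes.most_common(1)[0][0]


# Toy decision functions standing in for trained separating planes.
planes: Dict[Tuple[str, str], Decision] = {
    ("antibody", "enzyme"): lambda x: x[0] - 1.0,
    ("antibody", "other"): lambda x: x[0] - 0.5,
    ("enzyme", "other"): lambda x: x[1] - 1.0,
}

label_a = classify([2.0, 0.0], planes)
label_b = classify([0.0, 2.0], planes)
print(label_a, label_b)
```

With three categories a three-way tie is possible (one vote each); how such a case would be resolved is not covered by the text and is left to the order of insertion here.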

The configuration and processing of each part of the protein function identification device 1 have been described above. Next, the grid data conversion, the contact surface score, and the docking simulation are described in more detail.

"Grid data conversion (Voxel representation)"
As described above, in the present embodiment the three-dimensional structure data of each protein (receptor, ligand, and unknown-function protein) is converted into grid data before the docking simulation. The grid data is generated in the Voxel representation format described below.

The Voxel representation is obtained by applying a conversion process called Voxelization to a protein three-dimensional structure in PDB format. Voxelization discretizes the target protein onto a three-dimensional l × m × n grid and, according to the geometric characteristics arising from the shape of the protein, assigns to each grid point one of the protein structural attribute values SURFACE (S), INNER (I), CAVITY (C), or OUTER (O). The values of l, m, and n must be large enough that the space spanned by the grid covers the size of every protein to be processed. In the example of the present embodiment, l = m = n = 64 and the grid spacing is 1.2 angstroms.
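With these settings the grid spans 64 × 1.2 = 76.8 angstroms along each axis. The sketch below shows one possible way to map an atomic coordinate to a grid index and to detect coordinates that fall outside the box. The centering convention (the box is centered on a caller-supplied point such as the protein's centroid) and the function name `to_grid_index` are assumptions; the text specifies only the grid size and spacing.

```python
import math
from typing import Optional, Sequence, Tuple

GRID_N = 64      # l = m = n = 64, as in the embodiment
SPACING = 1.2    # grid spacing in angstroms, as in the embodiment


def to_grid_index(
    coord: Sequence[float], center: Sequence[float]
) -> Optional[Tuple[int, int, int]]:
    """Map an (x, y, z) coordinate to grid indices, or None if outside the box.

    Assumes the box is centered on `center`, so index GRID_N // 2 holds it.
    """
    index = []
    for value, origin in zip(coord, center):
        i = math.floor((value - origin) / SPACING) + GRID_N // 2
        if not 0 <= i < GRID_N:
            return None
        index.append(i)
    return (index[0], index[1], index[2])


box_edge = GRID_N * SPACING
at_center = to_grid_index((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))
nearby = to_grid_index((1.3, -0.1, 0.0), (0.0, 0.0, 0.0))
outside = to_grid_index((100.0, 0.0, 0.0), (0.0, 0.0, 0.0))
print(box_edge, at_center, nearby, outside)
```

A `None` result signals that the protein does not fit in the grid, which is the situation the size requirement on l, m, and n is meant to rule out.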

FIGS. 8 to 13 show the flow of the Voxelization process. The Voxel model Voxel (V) has two main features. First, to model the relief of the protein surface more faithfully, the radius of the probe sphere used to determine the Connolly surface is defined adaptively according to atom type. Furthermore, by adding a bonus radius that depends on the van der Waals radius of the atom type, a more sharply resolved surface relief can be determined than the protein surface obtained with the commonly used constant radius, such as that of a water molecule. Second, to calculate the shape complementarity of two proteins more accurately, the surface thickness is changed in the final stage of Voxelization so that the protein surface and interior are distinguished more strictly. The bonus radius and the thickness of the Voxelized surface are preferably tuned so that the final identification performance is maximized.

The Voxel model Voxel(V) is defined in Equation (1). Here, INNER (I) denotes the region where protein molecules are present, and SURFACE (S) denotes the region of the protein surface. CAVITY (C) denotes a region that is inside the protein and contains no protein molecule. OUTER (O) denotes a region that contains no protein molecule and is not CAVITY (C). DEFAULT (D) is the attribute value of the initial region of Voxel(V) and denotes the empty set, which takes none of the attribute values I, O, S, or C. TMP-SURFACE (Stmp) denotes a region temporarily defined as the protein surface.

The voxelization flow is formulated in Equation (2). FIGS. 8 to 13 show the details of this flow step by step.

・Step 1 (Determination of DEFAULT): The initial attribute value DEFAULT is assigned to all grid points of the Voxel.

・Step 2 (Determination of INNER): INNER is assigned to the grid points contained in a sphere of radius Ri (= Rvi + Δri) centred on the spatial centre of each atom. Here, Rvi is the van der Waals radius of the atom and Δri is a radius increment defined for each atom type.

・Step 3 (Determination of OUTER): OUTER is assigned to the grid points in the region enclosed between the boundary of the Voxel and INNER.

・Step 4 (Determination of TMP-SURFACE): TMP-SURFACE is temporarily assigned to the INNER points adjacent to OUTER. TMP-SURFACE is the boundary layer between INNER and OUTER and serves as the basis for determining the final SURFACE in Step 5.

・Step 5 (Determination of SURFACE): SURFACE is assigned to each TMP-SURFACE point that is adjacent to at least a threshold number of OUTER points. INNER is assigned to the remaining TMP-SURFACE points. Changing this threshold adjusts the thickness of the voxelized protein surface.

・Step 6 (Determination of CAVITY): CAVITY is assigned to the remaining DEFAULT points.
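The six steps above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: the function names, the integer labels, the use of a 6-connected (face-sharing) neighbourhood, and the flood fill from the grid boundary used to find OUTER are all assumptions, since Equations (1) and (2) are not reproduced in this text.

```python
from collections import deque

import numpy as np

# Hypothetical integer labels for the attribute values.
DEFAULT, INNER, OUTER, TMP_SURFACE, SURFACE, CAVITY = range(6)
# Assumption: face-sharing (6-connected) neighbours; the text names no neighbourhood.
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def count_neighbours(mask):
    """For every grid point, count the face neighbours where `mask` is True."""
    padded = np.pad(mask, 1, constant_values=False)
    nx, ny, nz = mask.shape
    total = np.zeros(mask.shape, dtype=int)
    for dx, dy, dz in NEIGHBOURS:
        total += padded[1 + dx:1 + dx + nx, 1 + dy:1 + dy + ny, 1 + dz:1 + dz + nz]
    return total


def voxelize(centres, radii, size, spacing=1.2, outer_threshold=1):
    """centres: atom centres in angstroms; radii: Ri = Rvi + delta_ri per atom."""
    # Step 1: every grid point starts as DEFAULT.
    grid = np.full((size, size, size), DEFAULT, dtype=np.int8)

    # Step 2: INNER = grid points inside the sphere of radius Ri around each atom.
    axis = np.arange(size) * spacing
    x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
    for (cx, cy, cz), r in zip(centres, radii):
        grid[(x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2 <= r ** 2] = INNER

    # Step 3: OUTER = DEFAULT points reachable from the grid boundary (flood fill).
    boundary = np.zeros(grid.shape, dtype=bool)
    boundary[[0, -1], :, :] = True
    boundary[:, [0, -1], :] = True
    boundary[:, :, [0, -1]] = True
    queue = deque()
    for idx in np.argwhere(boundary & (grid == DEFAULT)):
        grid[tuple(idx)] = OUTER
        queue.append(tuple(idx))
    while queue:
        i, j, k = queue.popleft()
        for dx, dy, dz in NEIGHBOURS:
            n = (i + dx, j + dy, k + dz)
            if all(0 <= c < size for c in n) and grid[n] == DEFAULT:
                grid[n] = OUTER
                queue.append(n)

    # Step 4: TMP-SURFACE = INNER points adjacent to at least one OUTER point.
    outer_count = count_neighbours(grid == OUTER)
    tmp = (grid == INNER) & (outer_count >= 1)
    grid[tmp] = TMP_SURFACE

    # Step 5: SURFACE where enough OUTER neighbours exist; the rest reverts to INNER.
    grid[tmp & (outer_count >= outer_threshold)] = SURFACE
    grid[grid == TMP_SURFACE] = INNER

    # Step 6: whatever is still DEFAULT is enclosed by the protein: CAVITY.
    grid[grid == DEFAULT] = CAVITY
    return grid
```

In this sketch, raising `outer_threshold` leaves fewer points labelled SURFACE, which is one way to read the statement that the threshold controls surface thickness.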

"PSC contact surface score function"
Next, the PSC contact surface score function is used to convert the grid data of the Voxel representation into PSC data for simulation.

The PSC contact surface score function is defined to evaluate the shape complementarity between proteins; it specifies how scores are assigned to the voxelized protein (INNER, OUTER, SURFACE, CAVITY). Using this function, the contact surface score Spsc is calculated in the docking simulation described later. In the present embodiment, Pairwise Shape Complementarity (PSC) as defined in ZDOCK (see Non-Patent Document 3) is adopted as the contact surface score function. The PSC contact surface score function is described further below; the function used in the present embodiment follows Non-Patent Document 3.

FIG. 14 shows an overview of the PSC contact surface score function, which assigns a score to each grid point of the receptor and of the ligand. Equation (3) is the PSC contact surface score function that yields the scores in FIG. 14. A feature of Equation (3) is that the scoring scheme differs between receptor and ligand. In addition, high scores are placed around the receptor pockets so as to guide the ligand into the receptor pocket structure.

The contact surface score function of Equation (3) is now described in detail. Rpsc and Lpsc in the equation are the contact surface score functions of the receptor and the ligand, respectively. Re[ ] and Im[ ] denote the real part and the imaginary part of a complex function, respectively. Equation (3) defines the real and imaginary parts of the receptor contact surface score function Rpsc and of the ligand contact surface score function Lpsc.

Consider first the real part of the receptor contact surface score function Rpsc in Equation (3). A parameter called bonusR is introduced so that the convex surface of the ligand is guided effectively into the receptor pocket, as described above. bonusR is used to determine the score of OUTER. Specifically, each OUTER grid point receives as its score the number of SURFACE grid points lying within a distance equal to the van der Waals radius plus bonusR. Under this scheme, OUTER points close to SURFACE receive a score of 1 or more, whereas OUTER points far from SURFACE receive a score of 0. The OUTER points near SURFACE (the outer layer adjacent to the surface) therefore receive a score of 1 or more. Moreover, because many SURFACE points lie near an OUTER grid point inside a pocket, the score is higher inside pockets. This allows the convex surface of the ligand to be guided effectively into the pocket in the docking simulation described below. In the example of FIG. 14, the score at the bottom of the pocket is 5.

Also in Equation (3), the real part of the receptor contact surface score function Rpsc is 0 everywhere except OUTER. For the real part of the ligand contact surface score function Lpsc, SURFACE and INNER are given 1 and all other regions are given 0. The imaginary part is common to receptor and ligand: SURFACE (the surface) is given 3i, INNER and CAVITY (the interior) are given 9i, and OUTER (the exterior) has no imaginary part.

In the protein function identification device 1, when grid data (Voxel data) are converted into PSC data, a score is assigned to every grid point according to the contact surface score function of Equation (3). That is, the real part and the imaginary part are determined according to Equation (3) at each grid point of the receptor and of the ligand.
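The scoring rules described for Equation (3) can be sketched as below. This is an illustrative reading of the prose, not Equation (3) itself, which is not reproduced in this text. The integer labels and the single `cutoff` parameter (standing for the van der Waals radius plus bonusR, expressed in grid units) are assumptions.

```python
import numpy as np

# Hypothetical integer labels for the voxel attribute values.
INNER, OUTER, SURFACE, CAVITY = 1, 2, 4, 5


def imaginary_part(grid):
    """Imaginary part shared by receptor and ligand: SURFACE 3, INNER/CAVITY 9, OUTER 0."""
    core = (grid == INNER) | (grid == CAVITY)
    return np.where(grid == SURFACE, 3, 0) + np.where(core, 9, 0)


def receptor_psc(grid, cutoff):
    """Rpsc: each OUTER point scores the number of SURFACE points within `cutoff`.

    Assumption: `cutoff` is (van der Waals radius + bonusR) in grid units.
    """
    surface = grid == SURFACE
    r = int(np.floor(cutoff))
    padded = np.pad(surface, r, constant_values=False)
    nx, ny, nz = grid.shape
    count = np.zeros(grid.shape, dtype=int)
    # Sum shifted copies of the surface mask over all offsets inside the cutoff sphere.
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                if dx * dx + dy * dy + dz * dz <= cutoff ** 2:
                    count += padded[r + dx:r + dx + nx,
                                    r + dy:r + dy + ny,
                                    r + dz:r + dz + nz]
    real = np.where(grid == OUTER, count, 0)  # real part is 0 outside OUTER
    return real + 1j * imaginary_part(grid)


def ligand_psc(grid):
    """Lpsc: real part 1 on SURFACE and INNER, 0 elsewhere."""
    real = np.where((grid == SURFACE) | (grid == INNER), 1, 0)
    return real + 1j * imaginary_part(grid)
```

An OUTER point flanked by SURFACE points on several sides accumulates a higher count, which is the pocket effect the text describes for FIG. 14.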

"Docking simulation"
Next, the docking simulation using the contact surface score function is described. As explained above, the contact surface score function assigns a score to every grid point of the grid data (Voxel data), converting them into PSC data. In the docking simulation, Spsc (the contact surface score value) is calculated from the PSC data of the receptor and the PSC data of the ligand. To calculate Spsc, the PSC data of the receptor and of the ligand are superimposed in space and the scores of overlapping grid points are multiplied together. The sum of these score products over all pairs of grid points is then taken. Spsc is an evaluation value expressing how well the surface irregularities fit when the receptor and the ligand are docked, and is one kind of shape complementarity evaluation value of the present invention.

The docking simulation searches for the maximum value of Spsc as the relative position of the receptor and the ligand is varied in small steps. This yields the relative position at which the irregularities of the receptor and the ligand fit best, together with the corresponding Spsc. The search algorithm exhaustively evaluates the full space of rotations and translations of the ligand relative to the receptor. The rotational space is fixed once the rotation angle increment is chosen. For example, the increment is set to 15°, in which case there are 3600 rotated copies of the ligand. FFT is preferably used in the search algorithm for fast search of the translational space between the receptor and each rotated ligand.

Equation (4) gives Spsc. Once scores have been assigned to all grid points of the Voxel according to Equation (3) above, Spsc can be calculated using Equation (4).

Here, o, p, and q in Equation (4) are the numbers of grid points by which the ligand is shifted relative to the receptor. For example, when the receptor and the ligand are not in contact, the contact surface score Spsc is 0. When the two surfaces are in contact, the OUTER of the receptor overlaps the SURFACE of the ligand and Spsc is 1 (= 1 × 1); the imaginary part is ignored. As shown in FIG. 14, the OUTER score of the receptor exceeds 1 in pockets and similar regions, so Spsc also exceeds 1 there. When INNER overlaps INNER, −81 (= 9i × 9i) is given as a penalty for the physically forbidden collision between interiors. When SURFACE overlaps SURFACE, −9 (= 3i × 3i) is likewise given as a penalty.
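Equation (4) is not reproduced in this text. The following is a hedged reconstruction from the surrounding description (a sum of products over overlapping grid points, with the ligand shifted by o, p, q, and the real part taken); the summation limits and the direction of the shift are assumptions. The second line checks the per-point values quoted above, using Re[(a+bi)(c+di)] = ac − bd.

```latex
S_{psc}(o,p,q) = \mathrm{Re}\!\left[\sum_{x=1}^{l}\sum_{y=1}^{m}\sum_{z=1}^{n}
    R_{psc}(x,y,z)\, L_{psc}(x+o,\; y+p,\; z+q)\right]

\mathrm{Re}\,[(1)(1+3i)] = 1, \qquad
\mathrm{Re}\,[(9i)(1+9i)] = -81, \qquad
\mathrm{Re}\,[(3i)(1+3i)] = -9
```

The three products correspond to receptor OUTER on ligand SURFACE, INNER on INNER, and SURFACE on SURFACE, and match the values 1, −81, and −9 given in the text.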

Finally, Spsc can be calculated using the correlation function of Rpsc and Lpsc given in Equation (5). Carrying out this calculation constitutes the execution of the docking simulation, and Spsc is its output value.
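A correlation of this kind can be evaluated for all translations at once with FFT. The sketch below is an assumption-laden illustration rather than Equation (5), which is not reproduced in this text: the function names are hypothetical, the shift convention is L(x + o), and the correlation is circular, so in practice the grids would need enough empty OUTER margin (or zero padding) to avoid wrap-around. The loop over ligand rotations is omitted.

```python
import numpy as np


def spsc_all_translations(r_psc, l_psc):
    """Return Re[sum_x R(x) * L(x + o)] for every circular shift o, via FFT.

    The product is a plain complex product (no conjugation of either factor),
    so that imaginary-imaginary overlaps such as 9i * 9i contribute -81.
    """
    spectrum = np.conj(np.fft.fftn(np.conj(r_psc))) * np.fft.fftn(l_psc)
    return np.fft.ifftn(spectrum).real


def best_translation(r_psc, l_psc):
    """Return the shift (o, p, q) with the highest score and that score."""
    scores = spsc_all_translations(r_psc, l_psc)
    shift = np.unravel_index(np.argmax(scores), scores.shape)
    return shift, scores[shift]
```

For an N × N × N grid this replaces an O(N^6) direct sum over all translations with O(N^3 log N) work per ligand rotation, which is the usual reason for using FFT in this kind of search.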

"Application example of the present embodiment"
Next, a specific example of the identification process carried out with the protein function identification device 1 of the present embodiment is described.

"About the samples"
FIGS. 15 and 16 show the receptor samples and the ligand samples, respectively. In this example, 52 receptors and 50 ligands (including two duplicates) were used as teacher data and input data. All samples were taken from the Protein-Protein Docking Benchmark 2.0 of Weng et al. at Boston University (Non-Patent Document 4). In this benchmark, as in data sets of receptor-ligand protein pairs generally, structures fall into two classes, bound and unbound, depending on how the three-dimensional structure was obtained experimentally. Bound refers to structure data obtained with the protein pair still bound together; unbound refers to structure data obtained separately for each protein while the pair is not bound. From this viewpoint, most of the data registered in the PDB (Protein Data Bank) are classified as unbound. In this example, protein pairs were selected from the benchmark so that the biochemical functions (antibody-antigen, enzyme-inhibitor, enzyme-substrate, and so on) are represented with as little bias as possible, taking into account the difficulty levels defined by Weng et al. (FIGS. 15 and 16). The breakdown is as follows. The 52 receptors comprise 12 "antibodies", 19 "enzymes", and 21 "others". The 50 ligands comprise 11 "antigens" (including two duplicates), 19 "substrates or inhibitors", and 20 "others".

"SVM conditions"
The SVM conditions are shown in FIG. 17. LIBSVM was used for the implementation, with the default kernel and default parameters throughout.

"Identification results"
FIGS. 18 and 19 show the Spsc evaluation results for the samples. FIG. 18 shows Spsc for a single receptor, that is, the Spsc values between one receptor and the 50 ligands. FIG. 19 shows Spsc for all 52 receptors; each line in FIG. 19 represents the Spsc values between one receptor and the 50 ligands. Both figures plot the normalized Spsc of each receptor against the ligands. Normalized Spsc is the Spsc obtained when the maximum of the group of Spsc values for a given receptor (one value per ligand) is scaled to 1. In FIGS. 18 and 19, the horizontal axis is the ligand and the vertical axis is normalized Spsc. FIGS. 18 and 19 show Spsc only. In the protein function identification device 1, charge information is input to the SVM in addition to the Spsc values of FIGS. 18 and 19, and function identification is then performed.
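The normalization described above can be sketched as follows. The function name and the array layout (one row per receptor, one column per ligand) are assumptions, as is the requirement that each receptor have a positive maximum Spsc.

```python
import numpy as np


def normalise_spsc(spsc):
    """Scale each receptor's row of Spsc values so that its maximum becomes 1.

    spsc: array of shape (number of receptors, number of ligands).
    """
    spsc = np.asarray(spsc, dtype=float)
    row_max = spsc.max(axis=1, keepdims=True)
    # Assumption: every receptor has at least one positive Spsc value.
    if np.any(row_max <= 0):
        raise ValueError("each receptor needs a positive maximum Spsc")
    return spsc / row_max
```

With the samples of this example the input would have shape (52, 50), and each row of the result corresponds to one line in FIG. 19.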

FIG. 20 shows the identification results in table form. In this example, at a prevalence of 33%, the protein function identification device 1 achieved an accuracy of 92%, a sensitivity of 88%, a specificity of 94%, a recall of 88%, and a precision of 88%.
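The reported figures are mutually consistent, which can be checked from the standard definitions of the confusion-matrix metrics. The short sketch below works in fractions of the sample set; the function name is hypothetical.

```python
def accuracy_and_precision(prevalence, sensitivity, specificity):
    """Derive accuracy and precision from prevalence, sensitivity and specificity."""
    tp = prevalence * sensitivity              # true positives
    tn = (1 - prevalence) * specificity        # true negatives
    fp = (1 - prevalence) * (1 - specificity)  # false positives
    accuracy = tp + tn
    precision = tp / (tp + fp)
    return accuracy, precision


# Values reported for FIG. 20: prevalence 33%, sensitivity 88%, specificity 94%.
acc, prec = accuracy_and_precision(0.33, 0.88, 0.94)
print(round(acc, 2), round(prec, 2))  # 0.92 0.88
```

Recall is by definition the same quantity as sensitivity, which is why both are reported as 88%.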

According to the above identification results, protein functions can be identified at the level of broad classes (antibody, enzyme, other). These broad classes can be regarded as biological functions. In contrast, the individual functions of each protein within a broad class can be regarded as biochemical functions. The above identification results can give experimental researchers in biochemistry an experimental strategy for characterizing proteins of unknown function. Furthermore, they can greatly reduce the search space for biochemical function identification. For example, in techniques such as those of Non-Patent Documents 1 and 2, the broad class of the protein function is unknown, so small features must be searched for in a huge search space. If the broad class is known, the search space is greatly reduced, and improved search efficiency and accuracy can be expected.

The following can also be said about the classification of functions. In the present embodiment, protein functions were identified at the level of broad classes, as described above. Within each broad class, the present invention may be applied again to identify functions at the level of smaller classes: three-dimensional structures of a plurality of proteins of known function are prepared within each smaller class, and the invention described above is applied. Functions may also be identified at still finer levels of classification. By performing the function identification of the present invention in multiple stages in this way, more detailed identification of functions is possible. The present invention can thus be applied not only to the broad-class biological functions described above but also to the identification of biochemical functions in smaller classes.

In addition, the present embodiment handles the shape complementarity evaluation value and the charge information in the manner described below, which provides advantageous effects.

First, the processing of the present invention is summarized in comparison with the prior art. Shape complementarity and charge information both play important roles in protein function. Conventionally, a single integrated score is calculated with both shape complementarity and charge information as parameters. In other words, the integrated score is a function of shape complementarity and charge information.

In contrast, the docking simulation of the present invention calculates only Spsc, the shape complementarity evaluation value. The charge information is processed separately from the docking simulation result and is input to the SVM together with the shape complementarity evaluation value from the docking simulation. At no stage does the charge information become a parameter of the docking simulation.

In this respect, ZDOCK shares with the present invention the principle used to evaluate shape complementarity. However, ZDOCK also calculates an integrated score as a function of shape complementarity and charge information, and in this it differs from the present invention.

The configuration of the present invention described above is advantageous in the following respect. The prior art calculates an integrated score that is a function of both shape complementarity and charge information. This integrated score presupposes that the data of the proteins to be docked (receptor and ligand) are complete, that is, that no part is missing. When part of a protein is missing because of limited experimental accuracy, the integrated score is strongly affected and accuracy drops substantially.

In the present invention, by contrast, the shape complementarity evaluation value and the charge information are handled separately. The effect of missing parts is then confined to the shape complementarity evaluation value. The charge information is processed separately and is less susceptible to missing parts. This reduces the influence of missing parts on the final identification result and yields higher identification accuracy.

In the present invention, the charge information items E1 to E10 were used. Of these, E10 is particularly important. E10 is the difference between the total charge of the whole protein and the total charge of the protein surface (= E5 − E9). Inputting E10 to the SVM as a vector element in addition to E1 to E9 improved the identification performance.
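As a small illustration of how E10 enters the SVM input, the sketch below builds one input vector from a receptor's normalized Spsc values and its charge items. The function name and the ordering of the vector elements are assumptions; only the relation E10 = E5 − E9 is taken from the text.

```python
import numpy as np


def feature_vector(normalised_spsc_row, e1_to_e9):
    """Concatenate normalized Spsc values with E1..E9 and the derived E10.

    Assumption: the charge items are ordered E1, E2, ..., E9 in `e1_to_e9`.
    """
    e = np.asarray(e1_to_e9, dtype=float)
    if e.shape != (9,):
        raise ValueError("expected exactly nine charge items E1..E9")
    e10 = e[4] - e[8]  # E10 = E5 - E9 (whole-protein charge minus surface charge)
    return np.concatenate([np.asarray(normalised_spsc_row, dtype=float), e, [e10]])
```

With 50 ligands this would give a 60-element vector per protein under the assumed layout: 50 shape complementarity values followed by the ten charge items.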

A preferred embodiment of the present invention has been described above. As stated, according to the present invention, teacher data are obtained from the three-dimensional structures of a plurality of receptors of known function and the three-dimensional structures of a plurality of ligands. For each receptor, the teacher data include a plurality of shape complementarity evaluation values, one for the docking of that receptor with each of the ligands, and the charge information of that receptor. Identification input data are likewise obtained from the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands; they include a plurality of shape complementarity evaluation values, one for the docking of the function-unknown protein with each of the ligands, and the charge information of the function-unknown protein. The present invention then learns the teacher data and identifies the function of the function-unknown protein. Using the shape complementarity evaluation values and the charge information in this way allows function identification that takes protein-protein interactions into account and achieves high identification performance.

A preferred embodiment of the present invention has been described above. However, the present invention is not limited to this embodiment, and those skilled in the art can of course modify the embodiment within the scope of the present invention.

As described above, the protein function identification device according to the present invention is useful as a technique for identifying protein functions from the three-dimensional structure of a protein by bioinformatics.

FIG. 1 is a diagram showing a protein function identification device according to an embodiment of the present invention.
FIG. 2 is a diagram showing the configuration of the teacher data generation unit.
FIG. 3 is a diagram showing the configuration of the identification input data generation unit.
FIG. 4 is a diagram showing the shape complementarity evaluation value data obtained by the teacher docking evaluation unit.
FIG. 5 is a diagram showing the charge information.
FIG. 6 is a diagram conceptually showing the learning identification process.
FIG. 7 is a diagram showing the learning identification process using a support vector machine.
FIG. 8 is a diagram showing the flow of the voxelization process, illustrating the assignment of DEFAULT.
FIG. 9 is a diagram showing the flow of the voxelization process, illustrating the assignment of INNER.
FIG. 10 is a diagram showing the flow of the voxelization process, illustrating the assignment of OUTER.
FIG. 11 is a diagram showing the flow of the voxelization process, illustrating the assignment of TMP-SURFACE.
FIG. 12 is a diagram showing the flow of the voxelization process, illustrating the assignment of SURFACE.
FIG. 13 is a diagram showing the flow of the voxelization process, illustrating the assignment of CAVITY.
FIG. 14 is a diagram for explaining the PSC contact surface score function.
FIG. 15 is a diagram showing the receptor samples in the specific example.
FIG. 16 is a diagram showing the ligand samples in the specific example.
FIG. 17 is a diagram showing the processing conditions of the support vector machine in the specific example.
FIG. 18 is a diagram showing the Spsc evaluation results for the samples.
FIG. 19 is a diagram showing the Spsc evaluation results for the samples.
FIG. 20 is a diagram showing an example of the identification results according to the present invention.

Explanation of symbols

1 Protein function identification device
3 Receptor storage unit
5 Ligand storage unit
7 Teacher data unit
9 Unknown protein input unit
11 Identification input data unit
13 Learning identification unit
15 Output unit
21 Teacher data generation unit
23 Teacher data storage unit
25 Identification input data generation unit
27 Identification input data storage unit
31 Teacher docking evaluation unit
33 Teacher docking data storage unit
35 Teacher residue charge evaluation unit
37 Teacher residue charge data storage unit
41, 43, 71, 73 Grid conversion unit
45, 47, 75, 77 PSC conversion unit
49, 79 Docking simulation unit
51, 81 Molecular surface calculation unit
53, 83 Amino acid residue counting unit
55, 85 Charge calculation unit
61 Input docking evaluation unit
63 Input docking data storage unit
65 Input residue charge evaluation unit
67 Input residue charge data storage unit

Claims

A protein function identification device for identifying the function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
A receptor storage unit that stores the three-dimensional structure of a plurality of receptors as proteins of known function;
A ligand storage unit that stores a three-dimensional structure of a plurality of ligands;
A teacher data generation unit that, based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculates, as teacher data for function identification and for each of the plurality of receptors of known function, a plurality of shape complementarity evaluation values obtained when that receptor docks with each of the plurality of ligands, and calculates charge information of each receptor;
An unknown protein input unit for inputting the three-dimensional structure of the function-unknown protein;
An identification input data generation unit that, based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculates, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with each of the plurality of ligands, and calculates charge information of the function-unknown protein,
A learning identification unit that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification unit, based on the similarity, among a plurality of receptors having a common function, of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information, identifies the function of the function-unknown protein by finding the function of the receptors whose shape complementarity evaluation values and charge information are similar to those of the function-unknown protein,
The teacher data generation unit performs, on each combination of the three-dimensional structure of a receptor and the three-dimensional structure of a ligand, a docking simulation that fits together the surface irregularities of the two three-dimensional structures, and calculates the shape complementarity evaluation value expressing how well the surface irregularities of the two structures match, and the identification input data generation unit calculates the shape complementarity evaluation value by performing the docking simulation on each combination of the three-dimensional structure of the function-unknown protein and the three-dimensional structure of a ligand,
The teacher data generation unit and the identification input data generation unit perform, before the docking simulation, a process of identifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable, and
The charge information calculated by the teacher data generation unit and the identification input data generation unit includes the difference between the total charge of the whole protein, that protein being each receptor or the function-unknown protein, and the total charge of its surface.

A protein function identification device for identifying the function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
A receptor storage unit that stores the three-dimensional structure of a plurality of receptors as proteins of known function;
A ligand storage unit that stores a three-dimensional structure of a plurality of ligands;
A teacher data generation unit that, based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculates, as teacher data for function identification, a plurality of shape complementarity evaluation values obtained when each of the plurality of receptors of known function docks with the plurality of ligands, and calculates charge information of each receptor;
An unknown protein input unit for inputting the three-dimensional structure of the function-unknown protein;
An identification input data generation unit that, based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculates, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with the plurality of ligands, and calculates charge information of the function-unknown protein;
A learning identification unit that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification unit, based on the similarity of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information among a plurality of receptors having a common function, identifies the function of the function-unknown protein by obtaining the function of the receptor whose plurality of shape complementarity evaluation values and charge information are similar to those of the function-unknown protein,
The teacher data generation unit performs a docking simulation that fits together the surface irregularities of the docking partners for each combination of the three-dimensional structure of each receptor and the three-dimensional structure of each ligand, and calculates the shape complementarity evaluation value, which expresses how well the surface irregularities match, and the identification input data generation unit calculates the shape complementarity evaluation value by performing the docking simulation on each combination of the three-dimensional structure of the function-unknown protein and the three-dimensional structure of each ligand,
The teacher data generation unit and the identification input data generation unit perform, before the docking simulation, a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
The charge information calculated by the teacher data generation unit and the identification input data generation unit includes, for the protein that is each receptor or the function-unknown protein: the solvent-exposed area of the protein, the number of positively charged residues of the whole protein, the number of negatively charged residues of the whole protein, the number of histidine residues of the whole protein, the total charge of the whole protein, the number of positively charged residues on the protein surface, the number of negatively charged residues on the protein surface, the number of histidine residues on the protein surface, the total charge of the protein surface, and the difference between the total charge of the whole protein and the total charge of the surface. A protein function identification device characterized by the above.
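
The ten charge-information items listed above can be assembled into a feature list as in the following minimal sketch. The assumptions are: lysine and arginine count as positively charged, aspartate and glutamate as negatively charged, histidine is counted separately, total charge is taken as positives minus negatives, and the surface flag and solvent-exposed area are supplied by an earlier step. The names `Residue` and `charge_information` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

POSITIVE = {"LYS", "ARG"}  # assumption: residues counted as positively charged
NEGATIVE = {"ASP", "GLU"}  # assumption: residues counted as negatively charged


@dataclass
class Residue:
    name: str         # three-letter residue code
    on_surface: bool  # result of the (separate) surface-specifying step


def _counts(residues):
    """Return (positive, negative, histidine, total charge) for a residue list."""
    pos = sum(r.name in POSITIVE for r in residues)
    neg = sum(r.name in NEGATIVE for r in residues)
    his = sum(r.name == "HIS" for r in residues)
    return pos, neg, his, pos - neg  # assumption: total charge = positives - negatives


def charge_information(residues: List[Residue], solvent_exposed_area: float) -> list:
    """Build the ten-item charge information in the order the claim lists it."""
    surface = [r for r in residues if r.on_surface]
    w_pos, w_neg, w_his, w_total = _counts(residues)
    s_pos, s_neg, s_his, s_total = _counts(surface)
    return [
        float(solvent_exposed_area),
        w_pos, w_neg, w_his, w_total,
        s_pos, s_neg, s_his, s_total,
        w_total - s_total,  # difference between whole-protein and surface total charge
    ]
```

Keeping the items in a fixed order lets the same function serve both the teacher data and the identification input data.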

The learning identification unit classifies the charge information and the shape complementarity evaluation values included in the teacher data into a plurality of functional categories according to receptor function, and determines the functional category to which the charge information and the shape complementarity evaluation values of the function-unknown protein included in the identification input data belong. The protein function identification device according to claim 1 or 2, characterized by the above.

The plurality of shape complementarity evaluation values of each of the receptors with the plurality of ligands and the charge information of each of the receptors constitute a receptor vector of each of the receptors,
The plurality of shape complementarity evaluation values of the function-unknown protein with the plurality of ligands and the charge information of the function-unknown protein constitute an identification input vector of the function-unknown protein,
The learning identification unit finds a separation plane that divides the plurality of receptor vectors corresponding to the plurality of receptors into the plurality of functional categories, and determines the functional category to which the identification input vector belongs from the positional relationship between the separation plane and the identification input vector. The protein function identification device according to claim 3, characterized by the above.
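
The separation-plane idea can be sketched for two functional categories as follows. A plain perceptron is swapped in here as a simple stand-in for whatever learner finds the plane; the claim does not name one. The function names and the toy two-dimensional vectors are hypothetical, and handling more than two categories (for example one plane per category) is left out.

```python
def train_separation_plane(vectors, labels, epochs=100, lr=1.0):
    """Perceptron training: returns (w, b) of a plane w.x + b = 0.

    `labels` must be +1 or -1. Convergence is only expected for linearly separable data.
    """
    w = [0.0] * len(vectors[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(vectors, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified (or on the plane): nudge the plane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                errors += 1
        if errors == 0:
            break
    return w, b


def side_of_plane(w, b, x):
    """Return +1 or -1 depending on which side of the plane the vector lies."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy receptor vectors: +1 for one functional category, -1 for the other.
receptor_vectors = [[2.0, 1.0], [3.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
categories = [1, 1, -1, -1]
w, b = train_separation_plane(receptor_vectors, categories)
predicted = side_of_plane(w, b, [2.5, 1.5])  # identification input vector
```

The category of the identification input vector is read off from the sign of `w.x + b`, which is the positional relationship to the plane that the claim refers to.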

The protein function identification device according to claim 3 or 4, wherein the plurality of function categories are three categories of antibodies, enzymes, and other functions.

A protein function identification method for identifying the function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
Reading the three-dimensional structures of a plurality of receptors, as proteins of known function, from a receptor storage unit,
Reading the three-dimensional structures of a plurality of ligands from a ligand storage unit,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculating, as teacher data for function identification, a plurality of shape complementarity evaluation values obtained when each of the plurality of receptors of known function docks with the plurality of ligands, by performing a docking simulation that fits together the surface irregularities of the docking partners, and calculating charge information including the difference between the total charge of the whole of each receptor and the total charge of its surface,
Inputting the three-dimensional structure of the function-unknown protein,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculating, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with the plurality of ligands, by performing the docking simulation, and calculating charge information including the difference between the total charge of the whole function-unknown protein and the total charge of its surface,
Performing learning identification that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification, based on the similarity of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information among a plurality of receptors having a common function, identifies the function of the function-unknown protein by obtaining the function of the receptor whose plurality of shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. A protein function identification method characterized by the above.
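
The final step of the method, finding the receptor whose evaluation values and charge information are similar to those of the function-unknown protein, can be sketched as a nearest-neighbour lookup. This is only one possible reading of "similar": Euclidean distance, the names `build_vector` and `identify_function`, and the toy numbers are all assumptions, and the docking scores and charge information are taken as already computed.

```python
import math


def build_vector(shape_scores, charge_info):
    """Concatenate per-ligand shape complementarity scores with charge information."""
    return list(shape_scores) + list(charge_info)


def identify_function(teacher_data, query_vector):
    """Return the function label of the teacher vector closest to the query.

    `teacher_data` is a list of (function label, vector) pairs.
    Assumption: similarity is measured by Euclidean distance.
    """
    best_function, best_distance = None, math.inf
    for function, vector in teacher_data:
        distance = math.dist(vector, query_vector)
        if distance < best_distance:
            best_function, best_distance = function, distance
    return best_function


# Toy teacher data: two ligands, one charge-information item per receptor.
teacher_data = [
    ("enzyme", build_vector([0.9, 0.8], [1.0])),
    ("antibody", build_vector([0.2, 0.3], [-2.0])),
]
query = build_vector([0.85, 0.75], [0.0])  # function-unknown protein
result = identify_function(teacher_data, query)
```

In practice the shape scores and charge items have different scales, so some normalisation before measuring distance would likely be needed; that choice is outside what the claim specifies.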

A protein function identification method for identifying the function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
Reading the three-dimensional structures of a plurality of receptors, as proteins of known function, from a receptor storage unit,
Reading the three-dimensional structures of a plurality of ligands from a ligand storage unit,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculating, as teacher data for function identification, a plurality of shape complementarity evaluation values obtained when each of the plurality of receptors of known function docks with the plurality of ligands, by performing a docking simulation that fits together the surface irregularities of the docking partners, and calculating charge information including the solvent-exposed area of each receptor, the number of positively charged residues of the whole protein, the number of negatively charged residues of the whole protein, the number of histidine residues of the whole protein, the total charge of the whole protein, the number of positively charged residues on the surface, the number of negatively charged residues on the surface, the number of histidine residues on the surface, the total charge of the surface, and the difference between the total charge of the whole protein and the total charge of the surface,
Inputting the three-dimensional structure of the function-unknown protein,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculating, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with the plurality of ligands, by performing the docking simulation, and calculating charge information including the solvent-exposed area of the function-unknown protein, the number of positively charged residues of the whole protein, the number of negatively charged residues of the whole protein, the number of histidine residues of the whole protein, the total charge of the whole protein, the number of positively charged residues on the surface, the number of negatively charged residues on the surface, the number of histidine residues on the surface, the total charge of the surface, and the difference between the total charge of the whole protein and the total charge of the surface,
Performing learning identification that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification, based on the similarity of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information among a plurality of receptors having a common function, identifies the function of the function-unknown protein by obtaining the function of the receptor whose plurality of shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. A protein function identification method characterized by the above.

A protein function identification program for causing a computer to execute a process of identifying a function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
Reading the three-dimensional structures of a plurality of receptors, as proteins of known function, from a receptor storage unit,
Reading the three-dimensional structures of a plurality of ligands from a ligand storage unit,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculating, as teacher data for function identification, a plurality of shape complementarity evaluation values obtained when each of the plurality of receptors of known function docks with the plurality of ligands, by performing a docking simulation that fits together the surface irregularities of the docking partners, and calculating charge information including the difference between the total charge of the whole of each receptor and the total charge of its surface,
Inputting the three-dimensional structure of the function-unknown protein,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculating, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with the plurality of ligands, by performing the docking simulation, and calculating charge information including the difference between the total charge of the whole function-unknown protein and the total charge of its surface,
Causing the computer to execute a learning identification process that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification process, based on the similarity of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information among a plurality of receptors having a common function, identifies the function of the function-unknown protein by obtaining the function of the receptor whose plurality of shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. A protein function identification program characterized by the above.

A protein function identification program for causing a computer to execute a process of identifying a function of a protein with unknown function based on the three-dimensional structure of a protein with known function and the three-dimensional structure of a protein with unknown function,
Reading the three-dimensional structures of a plurality of receptors, as proteins of known function, from a receptor storage unit,
Reading the three-dimensional structures of a plurality of ligands from a ligand storage unit,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structures of the plurality of receptors and the three-dimensional structures of the plurality of ligands, calculating, as teacher data for function identification, a plurality of shape complementarity evaluation values obtained when each of the plurality of receptors of known function docks with the plurality of ligands, by performing a docking simulation that fits together the surface irregularities of the docking partners, and calculating charge information including the solvent-exposed area of each receptor, the number of positively charged residues of the whole protein, the number of negatively charged residues of the whole protein, the number of histidine residues of the whole protein, the total charge of the whole protein, the number of positively charged residues on the surface, the number of negatively charged residues on the surface, the number of histidine residues on the surface, the total charge of the surface, and the difference between the total charge of the whole protein and the total charge of the surface,
Inputting the three-dimensional structure of the function-unknown protein,
Performing a process of specifying the protein surface from the three-dimensional structure of the protein, in which the thickness of the protein surface is adjustable,
Based on the three-dimensional structure of the function-unknown protein and the three-dimensional structures of the plurality of ligands, calculating, as identification input data for function identification, a plurality of shape complementarity evaluation values obtained when the function-unknown protein docks with the plurality of ligands, by performing the docking simulation, and calculating charge information including the solvent-exposed area of the function-unknown protein, the number of positively charged residues of the whole protein, the number of negatively charged residues of the whole protein, the number of histidine residues of the whole protein, the total charge of the whole protein, the number of positively charged residues on the surface, the number of negatively charged residues on the surface, the number of histidine residues on the surface, the total charge of the surface, and the difference between the total charge of the whole protein and the total charge of the surface,
Causing the computer to execute a learning identification process that learns the teacher data and identifies the function of the function-unknown protein, wherein the learning identification process, based on the similarity of the plurality of shape complementarity evaluation values for docking with the plurality of ligands and of the charge information among a plurality of receptors having a common function, identifies the function of the function-unknown protein by obtaining the function of the receptor whose plurality of shape complementarity evaluation values and charge information are similar to those of the function-unknown protein. A protein function identification program characterized by the above.