JP5773406B2 - GPI-anchored protein determination device, determination method, and determination program

Info

Publication number: JP5773406B2
Application number: JP2010169324A
Authority: JP
Inventors: 有理池田; 大貴田中; 昌朗吉澤; 貴規佐々木; 修己池田
Original assignee: MEIJI UNIVERSITY LEGAL PERSON
Current assignee: MEIJI UNIVERSITY LEGAL PERSON
Priority date: 2010-07-28
Filing date: 2010-07-28
Publication date: 2015-09-02
Anticipated expiration: 2030-07-28
Also published as: JP2012032163A

Description

The present invention relates to a determination apparatus, a determination method, and a determination program for GPI-anchored proteins, each of which determines whether a test protein is a GPI (glycosylphosphatidylinositol)-anchored protein.

Many proteins in living organisms undergo post-translational modification by sugar chains, lipids, glycolipids, and the like, and these modifications are known to affect protein function and intracellular localization. Among these modifications, modification by the GPI anchor, a glycolipid composed of a lipid and a sugar chain, is considered particularly important. This is evident from the facts that GPI anchors are widely conserved in eukaryotes and archaea, that yeasts and protozoa lacking GPI anchors cannot survive, and that humans lacking GPI anchors develop abnormalities in hematopoietic stem cells.
Proteins modified by GPI are called GPI-anchored proteins. Because a GPI-anchored protein has a signal peptide for endoplasmic reticulum transport at the N-terminus of its amino acid sequence, its translation is completed after it has been transported into the endoplasmic reticulum. The propeptide located on the C-terminal side of the GPI anchor modification site (ω site) is then cleaved and removed by a transamidase, and the protein binds to a GPI anchor biosynthesized in the endoplasmic reticulum. The GPI-anchored protein bound to the GPI anchor is transported through the Golgi apparatus to the cell membrane surface, where the GPI anchor tethers it to the membrane.
GPI-anchored proteins are known to have a highly hydrophobic N-terminal signal peptide and a highly hydrophobic C-terminal propeptide, and to have amino acids with small side chains in the vicinity of the ω site.

Many of the GPI-anchored proteins discovered so far are extremely important for biological reactions, including receptors such as CD14 and CD16b and enzymes such as 5'-nucleotidase and alkaline phosphatase. Proteins involved in serious diseases have also been found, such as the prion protein associated with mad cow disease and the cancer-related human carcinoembryonic antigen (CEA). However, only about 100 GPI-anchored proteins are known in eukaryotes to date, and many more are thought to remain undiscovered. In recent years, therefore, attempts have been made to find new GPI-anchored proteins from amino acid sequences by computer-based bioinformatics techniques.

For example, Non-Patent Document 1 describes a method that uses eukaryotic GPI-anchored proteins as the training data set and a determination technique combining a hidden Markov model with a support vector machine (SVM) to determine, from the amino acid sequence information of a test protein, whether the test protein is a GPI-anchored protein.
Non-Patent Document 2 describes a method that uses prokaryotic and eukaryotic GPI-anchored proteins as the training data set, scores the properties and frequencies of occurrence of amino acids in the sequence around the ω site, predicts the GPI anchor modification site, and determines whether the test protein is a GPI-anchored protein.
Non-Patent Document 3 describes a method that uses a Kohonen self-organizing map, a type of neural network, to determine whether a eukaryotic test protein is a GPI-anchored protein.

Non-Patent Document 1: Pierleoni et al., "BMC Bioinformatics", 2008, vol. 9, no. 392, pp. 1-11
Non-Patent Document 2: Eisenhaber et al., "Journal of Molecular Biology", 1999, vol. 292, pp. 741-758
Non-Patent Document 3: Frankhauser et al., "Bioinformatics", 2005, vol. 21, no. 9, pp. 1846-1852

The conventional GPI-anchored protein determination methods described above use the amino acid occurrence probabilities, hydrophobicity values, and molecular weights of GPI-anchored proteins as input values to the analysis means (a neural network, an SVM, or the like). As a result, they make no judgment about how much a sequence resembles a non-GPI-anchored protein, and their sensitivity and selectivity in identifying novel GPI-anchored proteins are insufficient. There is therefore a demand for determining, with higher sensitivity and selectivity, whether a test protein is a GPI-anchored protein.
The present invention has been made in view of these circumstances, and its object is to provide a GPI-anchored protein determination apparatus, determination method, and determination program capable of determining with high sensitivity and high selectivity whether a test protein is a GPI-anchored protein.

The present invention has been made to solve the above problem and is a GPI-anchored protein determination apparatus that determines whether a test protein is a GPI-anchored protein, the apparatus comprising: a sequence acquisition unit that acquires amino acid sequence information of the test protein; a side chain size calculation unit that identifies, as a region including the propeptide region of known GPI-anchored proteins in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus of the amino acid sequence information, extracts the amino acid residues of that region, and, using a required number for side chain size characteristic extraction, which is the number of residues used to average the side chain sizes of the amino acid residues in that region, calculates for the extracted residues a plurality of average side chain sizes, each being the mean of the side chain size index values of that number of consecutive amino acid residues, while shifting one residue at a time; a score numerical sequence generation unit that acquires position-specific scores, each indicating the degree to which a type of amino acid residue appears at an amino acid residue position of known GPI-anchored proteins and each obtained from the frequency of occurrence of that type of residue at a position within a predetermined region of known GPI-anchored proteins and the frequency of occurrence of that type of residue at a position within the predetermined region of known non-GPI-anchored proteins, that takes as a reference position the position at which the average side chain size calculated by the side chain size calculation unit is smallest, that identifies, based on the position-specific scores, the position-specific score of each amino acid residue in the partial sequence occupying a predetermined region consisting of a predetermined number of consecutive amino acid residues on the N-terminal side and the C-terminal side of the reference position, and that generates a score numerical sequence, which is the sequence of numbers giving the position-specific score of each of those residues; a classification unit that receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating how closely the test protein resembles a GPI-anchored protein, the classification unit having been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input; and a GPI-anchored protein determination unit that determines that the test protein is not a GPI-anchored protein when it determines that the expected value output by the classification unit is less than 0.5.
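The side chain size calculation and the score numerical sequence described above can be sketched roughly as follows. This is only an illustration under stated assumptions: this section does not give the side chain size index values, so the table below substitutes the number of non-hydrogen side chain atoms per residue; each window is indexed by its first residue; positions that fall outside the sequence are padded with 0.0; and the `scores` dictionary keyed by (relative offset, residue) is a hypothetical stand-in for the position-specific score table.

```python
# Stand-in side chain size index: number of non-hydrogen side chain atoms.
# The index values actually used by the patent are not given in this section.
SIDE_CHAIN_SIZE = {
    "G": 0, "A": 1, "S": 2, "C": 2, "T": 3, "V": 3, "P": 3,
    "D": 4, "N": 4, "L": 4, "I": 4, "M": 4, "E": 5, "Q": 5,
    "K": 5, "H": 6, "F": 7, "R": 7, "Y": 8, "W": 10,
}


def average_side_chain_sizes(sequence, region_len, window):
    """Return (start_index, mean_size) for every window in the C-terminal region."""
    offset = max(0, len(sequence) - region_len)
    region = sequence[offset:]
    result = []
    for i in range(len(region) - window + 1):
        sizes = [SIDE_CHAIN_SIZE[aa] for aa in region[i:i + window]]
        result.append((offset + i, sum(sizes) / window))
    return result


def reference_position(sequence, region_len, window):
    """Index (in the full sequence) of the window with the smallest mean size."""
    averages = average_side_chain_sizes(sequence, region_len, window)
    return min(averages, key=lambda item: item[1])[0]


def score_sequence(sequence, ref, n_before, n_after, scores):
    """Look up one score per residue around the reference position.

    `scores` maps (relative_offset, residue) to a position-specific score
    (hypothetical table layout).
    """
    values = []
    for rel in range(-n_before, n_after + 1):
        pos = ref + rel
        if 0 <= pos < len(sequence):
            values.append(scores.get((rel, sequence[pos]), 0.0))
        else:
            values.append(0.0)  # assumption: pad positions beyond the sequence ends
    return values
```

The resulting list has `n_before + n_after + 1` elements, which is the vector a classifier would receive. If the window is longer than the C-terminal region, `reference_position` raises `ValueError` because there is no window to compare.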

In the present invention, the classification unit is a neural network having a layered structure that includes at least an input layer composed of as many nodes as there are elements in the score numerical sequence generated by the score numerical sequence generation unit, a hidden layer composed of a plurality of nodes, and an output layer composed of one node. Each node of the input layer outputs the value of the element of the score numerical sequence associated with it to each node of the hidden layer. Each node of the hidden layer substitutes the values output by the nodes of the input layer into a predetermined transfer function and outputs the result to the node of the output layer. The node of the output layer substitutes the values output by the nodes of the hidden layer into a predetermined transfer function and outputs the result as the expected value.
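A minimal sketch of such a three-layer forward pass is given below. It assumes that the "coefficients" of each node's transfer function are a weight per incoming value plus a bias, and it uses the sigmoid, which the document names as the transfer function; the parameter names are illustrative, not taken from the patent.

```python
import math


def sigmoid(x):
    """Logistic transfer function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def forward(scores, hidden_weights, hidden_biases, output_weights, output_bias):
    """Propagate a score numerical sequence through hidden and output layers.

    hidden_weights holds one weight list per hidden node, each as long as
    `scores`; output_weights holds one weight per hidden node.
    Returns (hidden_outputs, expected_value).
    """
    hidden = [
        sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    expected = sigmoid(
        sum(w * h for w, h in zip(output_weights, hidden)) + output_bias
    )
    return hidden, expected


def is_gpi_candidate(expected):
    """An expected value below 0.5 means 'not a GPI-anchored protein'."""
    return expected >= 0.5
```

Because the output node also uses a sigmoid, the expected value always lies strictly between 0 and 1, so the 0.5 cut-off can be applied directly.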

In the present invention, the classification unit has been trained by changing the coefficients of the transfer functions of the nodes so that it outputs 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input, and by changing the coefficients of the transfer functions of the nodes so that it outputs 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input.
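One common way to adjust the coefficients toward targets of 1 and 0 is gradient descent with backpropagation on a squared-error loss. The sketch below uses that technique as an assumption; this section does not state which learning rule the patent uses, and the learning rate, epoch count, initial weight range, and toy data are all illustrative.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(x, params):
    """Expected value for one score numerical sequence."""
    w_h, b_h, w_o, b_o = params
    h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b) for ws, b in zip(w_h, b_h)]
    return sigmoid(sum(w * v for w, v in zip(w_o, h)) + b_o)


def train(samples, n_hidden, epochs=2000, rate=0.5, seed=0):
    """samples: list of (score_sequence, target), target 1 (GPI) or 0 (non-GPI)."""
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_o = 0.0
    for _ in range(epochs):
        for x, target in samples:
            h = [sigmoid(sum(w * v for w, v in zip(ws, x)) + b)
                 for ws, b in zip(w_h, b_h)]
            y = sigmoid(sum(w * v for w, v in zip(w_o, h)) + b_o)
            # Error signal at the output node for the loss 0.5 * (y - target)^2.
            d_o = (y - target) * y * (1.0 - y)
            for j in range(n_hidden):
                # Hidden error signal uses the output weight before it is updated.
                d_h = d_o * w_o[j] * h[j] * (1.0 - h[j])
                w_o[j] -= rate * d_o * h[j]
                for i in range(n_in):
                    w_h[j][i] -= rate * d_h * x[i]
                b_h[j] -= rate * d_h
            b_o -= rate * d_o
    return w_h, b_h, w_o, b_o
```

With a toy training set such as `[([1.0, 1.0], 1), ([-1.0, -1.0], 0)]`, the trained parameters push `predict` above 0.5 for the first input and below 0.5 for the second.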

In the present invention, each of the nodes uses a sigmoid function as its transfer function.

In the present invention, the required number for side chain size characteristic extraction is the value for which, when average side chain sizes are calculated with that number over the small side chain size determination regions of a plurality of known GPI-anchored proteins, the number of proteins in which the amino acid residue adjacent on the C-terminal side to the residue with the smallest average side chain size is the GPI anchor modification site is largest.
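The selection of this window length can be illustrated as a small search over candidate values. In the sketch below, each training item is a hypothetical pair of per-residue size index values for the determination region and the known ω-site index within that region; each window is indexed by its first residue, which is an assumption, and ties between candidates are resolved in favour of the earlier candidate.

```python
def sliding_means(values, window):
    """Mean of every run of `window` consecutive values, shifted by one each time."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def best_window(training, candidate_windows):
    """Pick the window length whose minimum most often sits just before the ω site.

    training: list of (size_values, omega_index) pairs (hypothetical layout).
    Returns (window, hit_count).
    """
    best, best_hits = None, -1
    for window in candidate_windows:
        hits = 0
        for sizes, omega in training:
            means = sliding_means(sizes, window)
            if not means:
                continue  # region shorter than the window
            ref = means.index(min(means))
            # Hit when the residue C-terminally adjacent to the minimum is the ω site.
            if ref + 1 == omega:
                hits += 1
        if hits > best_hits:
            best, best_hits = window, hits
    return best, best_hits
```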

In the present invention, the small side chain size determination region is a region that contains the position at which the average side chain size of known GPI-anchored proteins is smallest.

In the present invention, the position-specific score is calculated from formula (4).

In the present invention, the frequency of occurrence of amino acid residue type i at position p in the predetermined region is calculated from formula (3).

The present invention further comprises: an N-terminal hydrophobicity value calculation unit that identifies, as a region corresponding to the highly hydrophobic N-terminal region of known GPI-anchored proteins in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the N-terminus of the amino acid sequence information, extracts the amino acid residues of that region, and, using a required number for N-terminal hydrophobic characteristic extraction, which is the number of residues used to average the hydrophobicity values of the amino acid residues in that region, calculates for the extracted residues a plurality of N-terminal average hydrophobicity values, each being the mean of the hydrophobicity index values of that number of consecutive amino acid residues, while shifting one residue at a time; and an N-terminal hydrophobicity determination unit that determines whether the largest of the plurality of N-terminal average hydrophobicity values calculated by the N-terminal hydrophobicity value calculation unit is greater than or equal to an N-terminal hydrophobicity threshold that characterizes the N-terminal average hydrophobicity values of known GPI-anchored proteins. The side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit process the amino acid sequence information for which the N-terminal hydrophobicity determination unit has determined that the largest N-terminal average hydrophobicity value calculated by the N-terminal hydrophobicity value calculation unit is greater than or equal to the N-terminal hydrophobicity threshold.
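The N-terminal pre-filter can be sketched as a sliding-window maximum compared against a threshold. This section does not say which hydrophobicity index the patent uses, so the sketch substitutes the widely used Kyte-Doolittle scale as an assumption; the region length, window length, and threshold passed in are illustrative parameters.

```python
# Kyte-Doolittle hydropathy values, used here only as a stand-in index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def max_window_hydrophobicity(region, window):
    """Largest mean hydrophobicity over all windows of `window` residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in region]
    means = [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]
    return max(means)


def passes_n_terminal_filter(sequence, region_len, window, threshold):
    """True when the N-terminal region is at least as hydrophobic as the threshold."""
    n_region = sequence[:region_len]
    return max_window_hydrophobicity(n_region, window) >= threshold
```

Only sequences for which `passes_n_terminal_filter` returns `True` would go on to the side chain size, scoring, and classification steps.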

In the present invention, the N-terminal hydrophobicity threshold is the smallest value in the set of maximum N-terminal average hydrophobicity values obtained by calculating the N-terminal average hydrophobicity values in advance for a plurality of known GPI-anchored proteins.

In the present invention, the required number for N-terminal hydrophobic characteristic extraction is determined as follows. Using a candidate value of that number, N-terminal average hydrophobicity values are calculated for the amino acid residues in the highly hydrophobic N-terminal region of a plurality of known GPI-anchored proteins, and the smallest value in the set of maximum N-terminal average hydrophobicity values calculated from the known GPI-anchored proteins is extracted. Using the same candidate value, N-terminal average hydrophobicity values are calculated for the amino acid residues in the region of a plurality of known non-GPI-anchored proteins that corresponds to the highly hydrophobic N-terminal region of known GPI-anchored proteins. The required number is the value for which the number of maximum N-terminal average hydrophobicity values calculated from the known non-GPI-anchored proteins that exceed the extracted smallest value is smallest.
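The threshold and the window length described above can be derived together, as in the following sketch. Each profile is assumed to be a list of per-residue hydrophobicity index values for the N-terminal region of one protein, every profile is assumed to be at least as long as the largest candidate window, and ties between candidates go to the earlier one; all names are illustrative.

```python
def sliding_means(values, window):
    """Mean of every run of `window` consecutive values, shifted by one each time."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def threshold_for(gpi_profiles, window):
    """Smallest of the per-protein maxima over known GPI-anchored proteins."""
    return min(max(sliding_means(p, window)) for p in gpi_profiles)


def choose_window(gpi_profiles, non_gpi_profiles, candidates):
    """Return (window, threshold, false_hits) letting the fewest non-GPI proteins through."""
    best = None
    for window in candidates:
        threshold = threshold_for(gpi_profiles, window)
        false_hits = sum(
            1 for p in non_gpi_profiles
            if max(sliding_means(p, window)) > threshold
        )
        if best is None or false_hits < best[2]:
            best = (window, threshold, false_hits)
    return best
```

By construction every known GPI-anchored protein reaches the threshold, so the search only trades off how many known non-GPI-anchored proteins exceed it.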

The present invention further comprises: a non-N-terminal hydrophobicity value calculation unit that identifies, as a region corresponding to the highly hydrophobic N-terminal region of known GPI-anchored proteins in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the N-terminus of the amino acid sequence information, extracts the amino acid residues outside that region, and, using a required number for non-N-terminal hydrophobic characteristic extraction, which is the number of residues used to average the hydrophobicity values of the amino acid residues outside that region, calculates for the extracted residues a plurality of non-N-terminal average hydrophobicity values, each being the mean of the hydrophobicity index values of that number of consecutive amino acid residues, while shifting one residue at a time; and a non-N-terminal hydrophobicity determination unit that determines whether the largest of the plurality of non-N-terminal average hydrophobicity values calculated by the non-N-terminal hydrophobicity value calculation unit is greater than or equal to a non-N-terminal hydrophobicity threshold that characterizes the non-N-terminal average hydrophobicity values of known GPI-anchored proteins. The side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit process the amino acid sequence information for which the non-N-terminal hydrophobicity determination unit has determined that the largest non-N-terminal average hydrophobicity value calculated by the non-N-terminal hydrophobicity value calculation unit is greater than or equal to the non-N-terminal hydrophobicity threshold.

In the present invention, the non-N-terminal hydrophobicity threshold is the smallest value in the set of maximum non-N-terminal average hydrophobicity values obtained by calculating the non-N-terminal average hydrophobicity values in advance for a plurality of known GPI-anchored proteins.

In the present invention, the required number for non-N-terminal hydrophobic characteristic extraction is determined as follows. Using a candidate value of that number, non-N-terminal average hydrophobicity values are calculated for the amino acid residues of a plurality of known GPI-anchored proteins that lie outside the highly hydrophobic N-terminal region, and the smallest value in the set of maximum non-N-terminal average hydrophobicity values calculated from the known GPI-anchored proteins is extracted. Using the same candidate value, non-N-terminal average hydrophobicity values are calculated for the amino acid residues of a plurality of known non-GPI-anchored proteins that lie outside the region corresponding to the highly hydrophobic N-terminal region of known GPI-anchored proteins. The required number is the value for which the number of maximum non-N-terminal average hydrophobicity values calculated from the known non-GPI-anchored proteins that exceed the extracted smallest value is smallest.

In the present invention, the region corresponding to the highly hydrophobic N-terminal region of known GPI-anchored proteins is a region that contains the position at which the N-terminal average hydrophobicity value is largest in known GPI-anchored proteins.

The present invention further comprises a C-terminal maximum hydrophobic position determination unit that identifies, as a region corresponding to the highly hydrophobic C-terminal region of known GPI-anchored proteins, the amino acid residues of a predetermined number of residues from the C-terminus of the amino acid sequence information, and determines whether the position of the amino acid residue at which the non-N-terminal average hydrophobicity value calculated by the non-N-terminal hydrophobicity value calculation unit is largest lies within the identified region. The side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit process the amino acid sequence information for which the C-terminal maximum hydrophobic position determination unit has determined that the position of the amino acid residue with the largest non-N-terminal average hydrophobicity value lies within the region corresponding to the highly hydrophobic C-terminal region of known GPI-anchored proteins.
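The position check described above can be sketched as follows. The input is assumed to be a list of per-residue hydrophobicity index values for the whole sequence, each window is indexed by its first residue, and the first maximum is taken when several windows tie; these conventions and the parameter names are assumptions for illustration.

```python
def max_position_outside_n_terminal(values, n_region_len, window):
    """Return (position, mean) of the most hydrophobic window outside the N-terminal region.

    The position is the index, in the full sequence, of the window's first residue.
    """
    rest = values[n_region_len:]
    means = [
        sum(rest[i:i + window]) / window
        for i in range(len(rest) - window + 1)
    ]
    idx = means.index(max(means))
    return n_region_len + idx, means[idx]


def peak_in_c_terminal_region(values, n_region_len, c_region_len, window):
    """True when the hydrophobic peak lies within the last `c_region_len` residues."""
    pos, _ = max_position_outside_n_terminal(values, n_region_len, window)
    return pos >= len(values) - c_region_len
```

A sequence whose most hydrophobic stretch outside the signal peptide lies in the middle of the protein, as in a typical transmembrane helix, would be rejected at this step.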

In the present invention, the region corresponding to the highly hydrophobic C-terminal region of known GPI-anchored proteins is a region that contains the position at which the non-N-terminal average hydrophobicity value is largest within the part of known GPI-anchored proteins other than the region corresponding to the highly hydrophobic N-terminal region.

The present invention is also a determination method using a GPI-anchored protein determination apparatus that determines whether a test protein is a GPI-anchored protein. In the method, the sequence acquisition unit of the determination apparatus acquires amino acid sequence information of the test protein. The side chain size calculation unit of the determination apparatus identifies, as a region including the propeptide region of known GPI-anchored proteins in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus of the amino acid sequence information, extracts the amino acid residues of that region, and, using a required number for side chain size characteristic extraction, which is the number of residues used to average the side chain sizes of the amino acid residues in that region, calculates for the extracted residues a plurality of average side chain sizes, each being the mean of the side chain size index values of that number of consecutive amino acid residues, while shifting one residue at a time. The score numerical sequence generation unit of the determination apparatus acquires position-specific scores, each indicating the degree to which a type of amino acid residue appears at an amino acid residue position of known GPI-anchored proteins and each obtained from the frequency of occurrence of that type of residue at a position within a predetermined region of known GPI-anchored proteins and the frequency of occurrence of that type of residue at a position within the predetermined region of known non-GPI-anchored proteins; takes as a reference position the position at which the average side chain size calculated by the side chain size calculation unit is smallest; identifies, based on the position-specific scores, the position-specific score of each amino acid residue in the partial sequence occupying a predetermined region consisting of a predetermined number of consecutive amino acid residues on the N-terminal side and the C-terminal side of the reference position; and generates a score numerical sequence, which is the sequence of numbers giving the position-specific score of each of those residues. The classification unit of the determination apparatus, having been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input, receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating whether the test protein is a GPI-anchored protein. The GPI-anchored protein determination unit of the determination apparatus determines that the test protein is not a GPI-anchored protein when it determines that the expected value output by the classification unit is less than 0.5.

The present invention also provides a determination program that causes a GPI-anchored protein determination device, which determines whether or not a test target protein is a GPI-anchored protein, to function as: a sequence acquisition unit that acquires amino acid sequence information of the test target protein; a side chain size calculation unit that specifies, as a region including the propeptide region of a known GPI-anchored protein in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus of the amino acid sequence information, extracts the amino acid residues of that region, and, for each extracted residue, calculates a plurality of average side chain sizes while shifting by one residue at a time, each average side chain size being the mean of the side chain size index values of as many consecutive amino acid residues as the required number for side chain size characteristic extraction, that is, the number of residues used to average the side chain sizes in the region including the propeptide region; a score numerical sequence generation unit that acquires position-specific scores, which indicate the degree of occurrence of each amino acid residue type at each residue position of known GPI-anchored proteins and are obtained from the occurrence frequencies of the residue types at positions within a predetermined region of known GPI-anchored proteins and of known non-GPI-anchored proteins, takes as a reference position the position at which the average side chain size calculated by the side chain size calculation unit is smallest, identifies, on the basis of the position-specific scores, the position-specific score of each amino acid residue of the partial sequence in a predetermined region consisting of a predetermined number of consecutive residues on the N-terminal and C-terminal sides of the reference position, and generates a score numerical sequence, that is, a numerical sequence of those position-specific scores; a classification unit that receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating whether or not the protein is a GPI-anchored protein, the classification unit having been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and 0 when the score numerical sequence of a known non-GPI-anchored protein is input; and a GPI-anchored protein determination unit that determines that the test target protein is not a GPI-anchored protein when the expected value output by the classification unit is less than 0.5.

According to the present invention, a PSSM (position-specific scoring matrix) is used to generate a score numerical sequence indicating the position-specific score of each amino acid residue in the amino acid sequence of the test target protein. A machine-learned classification unit then receives this score numerical sequence and outputs an expected value of 0 or more and 1 or less indicating how likely the protein is to be a GPI-anchored protein, and this value is used to determine whether or not the test target protein is a GPI-anchored protein. As a result, the GPI-anchored protein determination device according to the present invention can determine with high sensitivity and high selectivity whether or not the test target protein is a GPI-anchored protein.

A schematic block diagram showing the configuration of a GPI-anchored protein determination device according to an embodiment of the present invention.
A diagram showing the information stored in the hydrophobicity index value storage unit.
A diagram showing the information stored in the side chain size index value storage unit.
A diagram showing the PSSM stored in the PSSM storage unit.
A diagram showing the PSSM stored in the PSSM storage unit.
A flowchart showing the operation of the GPI-anchored protein determination device 100.
A first graph showing the hydrophobicity profile of a GPI-anchored protein.
A diagram showing the method of calculating the N-terminal side average hydrophobicity value.
A graph showing the distribution of the maximum N-terminal side average hydrophobicity value within 30 residues of the N-terminus of known GPI-anchored proteins.
A second graph showing the hydrophobicity profile of a GPI-anchored protein.
A graph showing the maximum N-terminal outside average hydrophobicity values of known GPI-anchored proteins and known non-GPI-anchored proteins.
A graph showing the side chain size profile of a GPI-anchored protein.
A diagram showing the method of extracting the amino acid sequence.
A diagram showing the method of assigning position-specific scores.
A diagram showing the 113 SWISS-PROT entry names included in the GPI-anchored protein data set from which redundancy has been removed.
A diagram showing the configuration of the neural network used in this embodiment.
A first table showing the determination accuracy of the GPI-anchored protein determination device according to this embodiment.
A second table showing the determination accuracy of the GPI-anchored protein determination device according to this embodiment.
A table showing the determination accuracy when the predetermined range including the reference position is changed from (-12 residues to +12 residues) to (-10 residues to +12 residues) relative to the reference position.
A table showing the determination accuracy when the predetermined range including the reference position is changed from (-12 residues to +12 residues) to (-12 residues to +9 residues) relative to the reference position.

Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 1 is a schematic block diagram showing the configuration of a GPI-anchored protein determination device according to an embodiment of the present invention.
The GPI-anchored protein determination device 100 includes a sequence storage unit 101, a sequence acquisition unit 102, a hydrophobicity index value storage unit 103, a hydrophobicity index value specifying unit 104, an N-terminal side hydrophobicity value calculation unit 105, an N-terminal side hydrophobicity determination unit 106, an N-terminal outside hydrophobicity value calculation unit 107, an N-terminal outside hydrophobicity determination unit 108, a C-terminal side maximum hydrophobic position determination unit 109, a side chain size index value storage unit 110, a side chain size index value specifying unit 111, a side chain size calculation unit 112, a PSSM storage unit 113, a score numerical sequence generation unit 114, a neural network 115 (classification unit), and a GPI-anchored protein determination unit 116.

The sequence storage unit 101 stores full-length amino acid sequence information of mammalian proteins whose functions are unknown.
The sequence acquisition unit 102 acquires the amino acid sequence information of the protein to be tested from the sequence storage unit 101.
The hydrophobicity index value storage unit 103 stores, for each amino acid residue, the hydrophobicity index value of that residue.
The hydrophobicity index value specifying unit 104 looks up the hydrophobicity index value of each amino acid residue of the sequence acquired by the sequence acquisition unit 102 in the hydrophobicity index value storage unit 103 and generates a continuous numerical sequence of the per-residue hydrophobicity index values.
The N-terminal side hydrophobicity value calculation unit 105 calculates, from the numerical sequence generated by the hydrophobicity index value specifying unit 104, the average hydrophobicity value of consecutive amino acid residues on the N-terminal side of the acquired amino acid sequence (the N-terminal side average hydrophobicity value).
The N-terminal side hydrophobicity determination unit 106 determines whether or not the maximum of the average hydrophobicity values calculated by the N-terminal side hydrophobicity value calculation unit 105 is equal to or greater than the N-terminal side hydrophobicity threshold. Here, the N-terminal side hydrophobicity threshold is a threshold that characterizes the N-terminal side average hydrophobicity values of known GPI-anchored proteins.

The N-terminal outside hydrophobicity value calculation unit 107 calculates, from the numerical sequence generated by the hydrophobicity index value specifying unit 104, the average hydrophobicity value of consecutive amino acid residues (the N-terminal outside average hydrophobicity value) in the part of the acquired amino acid sequence that lies outside the range over which the N-terminal side hydrophobicity value calculation unit 105 calculated its averages.
The N-terminal outside hydrophobicity determination unit 108 determines whether or not the maximum of the average hydrophobicity values calculated by the N-terminal outside hydrophobicity value calculation unit 107 is equal to or greater than the N-terminal outside hydrophobicity threshold. Here, the N-terminal outside hydrophobicity threshold is a threshold that characterizes the N-terminal outside average hydrophobicity values of known GPI-anchored proteins.
The C-terminal side maximum hydrophobic position determination unit 109 determines whether or not the position of the amino acid residue at which the average hydrophobicity value calculated by the N-terminal outside hydrophobicity value calculation unit 107 is largest lies within the region corresponding to the C-terminal side highly hydrophobic region of known GPI-anchored proteins.

The side chain size index value storage unit 110 stores, for each amino acid residue, the side chain size index value of that residue.
The side chain size index value specifying unit 111 looks up the side chain size index value of each amino acid residue of the sequence acquired by the sequence acquisition unit 102 in the side chain size index value storage unit 110 and generates a continuous numerical sequence of the per-residue side chain size index values.
The side chain size calculation unit 112 calculates, from the numerical sequence generated by the side chain size index value specifying unit 111, the average residue size of the amino acid residues on the C-terminal side of the acquired amino acid sequence.
The PSSM storage unit 113 stores a PSSM holding position-specific scores that indicate the degree of occurrence of each amino acid residue type at each residue position of GPI-anchored proteins. Here, a position-specific score is a value indicating the likelihood of being a GPI-anchored protein; the larger the value, the higher the likelihood.
Based on the PSSM stored in the PSSM storage unit 113, the score numerical sequence generation unit 114 generates a score numerical sequence for a predetermined region whose reference position is the position of the amino acid residue at which the average side chain size calculated by the side chain size calculation unit 112 is smallest. The score numerical sequence generated here is an array whose elements are the position-specific scores of the individual amino acid residues in the predetermined region of the protein to be tested.
The neural network 115 receives the score numerical sequence generated by the score numerical sequence generation unit 114 and outputs an expected value of 0 or more and 1 or less indicating how likely the protein is to be a GPI-anchored protein. The neural network 115 has been trained in advance to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and 0 when the score numerical sequence of a known non-GPI-anchored protein is input.
The GPI-anchored protein determination unit 116 determines whether or not the protein to be tested, acquired by the sequence acquisition unit 102, is a GPI-anchored protein.

FIG. 2 is a diagram showing the information stored in the hydrophobicity index value storage unit.
As shown in FIG. 2, the hydrophobicity index value storage unit 103 stores, for each amino acid residue, an index value indicating the hydrophobicity of that residue. In this embodiment, the hydrophobicity index values of KYTJ820101 (Kyte J., Doolittle R., "Journal of Molecular Biology", 1982, vol. 157, no. 1, pp. 105-132) are used. In FIG. 2, "A" denotes alanine, "R" arginine, "N" asparagine, "D" aspartic acid, "C" cysteine, "Q" glutamine, "E" glutamic acid, "G" glycine, "H" histidine, "I" isoleucine, "L" leucine, "K" lysine, "M" methionine, "F" phenylalanine, "P" proline, "S" serine, "T" threonine, "W" tryptophan, "Y" tyrosine, and "V" valine.

FIG. 3 is a diagram showing the information stored in the side chain size index value storage unit.
As shown in FIG. 3, the side chain size index value storage unit 110 stores, for each amino acid residue, an index value indicating the size of the side chain of that residue. In this embodiment, the side chain size index values of DAWD720101 (Dawson D.M., "The Biological Genetics of Man", Academic Press, 1972, pp. 1-38) are used.

FIG. 4 and FIG. 5 are diagrams showing the PSSM stored in the PSSM storage unit.
As shown in FIG. 4 and FIG. 5, the PSSM storage unit 113 stores a PSSM whose elements are position-specific scores indicating the degree of occurrence of each amino acid residue type at each residue position. In FIG. 4 and FIG. 5, the reference position is numbered 0, negative positions lie on the N-terminal side, and positive positions lie on the C-terminal side. The method of creating the PSSM is described later. Here, the reference position is the position of the amino acid residue adjacent, on the C-terminal side, to the GPI anchor modification site (ω site) of a GPI-anchored protein.
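The lookup described above can be pictured as a table indexed by relative position and residue type. The following is a minimal sketch under stated assumptions: the dictionary layout, the name `score_sequence`, the toy scores, and the choice of a default score for residues or positions missing from the table are all illustrative and are not taken from FIG. 4 or FIG. 5.

```python
from typing import Dict, List

# Hypothetical toy PSSM: relative position -> {residue letter: score}.
# Position 0 is the reference position; negative offsets lie on the
# N-terminal side and positive offsets on the C-terminal side.
# The scores are made-up placeholders, not the values of FIG. 4 and FIG. 5.
TOY_PSSM: Dict[int, Dict[str, float]] = {
    -1: {"S": 0.8, "G": 0.6, "L": -0.4},
    0: {"S": 0.5, "A": 0.7, "L": -0.2},
    1: {"A": 0.3, "G": 0.9, "L": 0.1},
}


def score_sequence(seq: str, ref_index: int,
                   pssm: Dict[int, Dict[str, float]],
                   default: float = 0.0) -> List[float]:
    """Return the position-specific scores around ref_index (0-based)."""
    scores = []
    for offset in sorted(pssm):
        i = ref_index + offset
        if 0 <= i < len(seq):
            # Residue types absent from the toy table fall back to `default`
            # (an assumption made only for this sketch).
            scores.append(pssm[offset].get(seq[i], default))
        else:
            # The window runs past the end of the sequence (assumed handling).
            scores.append(default)
    return scores
```

With a real matrix, the offsets would cover the whole predetermined region around the reference position rather than the three positions used here.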

In the GPI-anchored protein determination device 100, the sequence acquisition unit 102 acquires the amino acid sequence information of the test target protein, and the side chain size calculation unit 112 calculates, for each amino acid residue in the region of that sequence corresponding to the propeptide region of known GPI-anchored proteins, the average side chain size, that is, the mean of the side chain size index values of as many consecutive residues as the required number for side chain size characteristic extraction. The score numerical sequence generation unit 114 then acquires from the PSSM storage unit 113 the position-specific scores indicating the degree of occurrence of each amino acid residue type at each residue position of known GPI-anchored proteins, takes as the reference position the position of the amino acid residue at which the average side chain size calculated by the side chain size calculation unit 112 is smallest, identifies the position-specific score of each residue of the partial sequence in a predetermined region consisting of a predetermined number of consecutive residues on the N-terminal and C-terminal sides of the reference position, and generates a score numerical sequence, that is, a numerical sequence of those position-specific scores. Next, the neural network 115 receives the score numerical sequence generated by the score numerical sequence generation unit 114 and outputs an expected value of 0 or more and 1 or less indicating how likely the protein is to be a GPI-anchored protein. The neural network 115 has been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and 0 when the score numerical sequence of a known non-GPI-anchored protein is input.
When the GPI-anchored protein determination unit 116 determines that the expected value output by the neural network 115 is less than 0.5, it determines that the test target protein is not a GPI-anchored protein.
In this way, the GPI-anchored protein determination device 100 determines with high sensitivity and high selectivity whether or not the test target protein is a GPI-anchored protein.
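Two of the steps summarized above lend themselves to a short illustration: locating the reference position from the windowed side chain sizes, and applying the 0.5 cut-off to the classifier output. This is a sketch only. The function names are hypothetical, the window size is left as a parameter because it is not stated at this point in the text, and attributing each window to its central residue is an assumption of the sketch.

```python
from typing import Sequence


def reference_position(sizes: Sequence[float], window: int) -> int:
    """0-based index of the centre of the window with the smallest mean size.

    `sizes` holds the side chain size index values of the C-terminal region
    only; attributing a window to its central residue is an assumption.
    """
    if window < 1 or window > len(sizes):
        raise ValueError("window must be between 1 and len(sizes)")
    best_start = min(
        range(len(sizes) - window + 1),
        key=lambda s: sum(sizes[s:s + window]) / window,
    )
    return best_start + window // 2


def is_not_gpi_anchored(expected_value: float, threshold: float = 0.5) -> bool:
    """Decision rule from the text: below 0.5 means 'not GPI-anchored'."""
    return expected_value < threshold
```

The classifier itself is omitted; any model that maps a score numerical sequence to a value between 0 and 1 could feed `is_not_gpi_anchored`.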

Next, the operation of the GPI-anchored protein determination device 100 is described.
FIG. 6 is a flowchart showing the operation of the GPI-anchored protein determination device 100.
<Step S1: Acquire sequence>
First, when the GPI-anchored protein determination device 100 starts operating in response to a start instruction from the user, the sequence acquisition unit 102 acquires the amino acid sequence information of the protein to be tested from the sequence storage unit 101.

<Step S2: Specify hydrophobicity index values>
When the sequence acquisition unit 102 has acquired the amino acid sequence information, the hydrophobicity index value specifying unit 104 refers to the hydrophobicity index value storage unit 103, specifies the hydrophobicity index value of each amino acid residue in the acquired sequence, and generates a numerical sequence of those values. For example, when the acquired amino acid sequence information represents the sequence "MLLEPGRGCC...", the hydrophobicity index value specifying unit 104 generates, from the index values of FIG. 2 stored in the hydrophobicity index value storage unit 103, the numerical sequence "1.9, 3.8, 3.8, -3.5, -1.6, -0.4, -4.5, -0.4, 2.5, 2.5, ...".
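Step S2 amounts to a per-residue table lookup, which can be sketched as follows. The table uses the published Kyte-Doolittle hydropathy values as recalled here; FIG. 2 is not reproduced in this text, so the values should be checked against it. The names `KYTE_DOOLITTLE` and `hydrophobicity_sequence` are illustrative.

```python
from typing import List

# Kyte-Doolittle hydropathy values (assumed to match FIG. 2; the ten residues
# of the example in the text agree with these values).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_sequence(seq: str) -> List[float]:
    """Map each one-letter residue code to its hydrophobicity index value."""
    # A KeyError is raised for letters outside the 20 standard residues.
    return [KYTE_DOOLITTLE[aa] for aa in seq.upper()]


print(hydrophobicity_sequence("MLLEPGRGCC"))
# [1.9, 3.8, 3.8, -3.5, -1.6, -0.4, -4.5, -0.4, 2.5, 2.5]
```

The printed list reproduces the numerical sequence given in the text for "MLLEPGRGCC".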

<Step S3: Extract N-terminal side hydrophobicity index values>
When the hydrophobicity index value specifying unit 104 has generated the numerical sequence of hydrophobicity index values in step S2, the N-terminal side hydrophobicity value calculation unit 105 extracts from it the partial numerical sequence for the amino acid residues in the region corresponding to the N-terminal side highly hydrophobic region of GPI-anchored proteins.
In this embodiment, the amino acid residues within 30 residues of the N-terminus are used as the region corresponding to the N-terminal side highly hydrophobic region of GPI-anchored proteins. This region is the one that, when the average hydrophobicity value (N-terminal side average hydrophobicity value) is calculated for each amino acid residue of a plurality of known GPI-anchored proteins by the same processing as step S4 described later, contains the amino acid residue located at the center of the residue run with the largest calculated average.

FIG. 7 is a first graph showing the hydrophobicity profile of a GPI-anchored protein.
FIG. 7 shows the N-terminal side average hydrophobicity values (11-residue average) calculated for the BY55_HUMAN (181 aa) entry of SWISS-PROT ver. 54.0 by the same processing as step S4 described later. The horizontal axis shows the position, counted from the N-terminus, of the amino acid residue at the center of each run of consecutive residues whose length is the required number for N-terminal side hydrophobic characteristic extraction, and the vertical axis shows the N-terminal side average hydrophobicity value.
As FIG. 7 shows, the N-terminal side region of a known GPI-anchored protein is highly hydrophobic, and the position at which the N-terminal side average hydrophobicity value is largest lies within 30 residues of the N-terminus.

<Step S4: Calculate N-terminal side average hydrophobicity values>
FIG. 8 is a diagram showing the method of calculating the N-terminal side average hydrophobicity value.
Having extracted in step S3 the partial numerical sequence for the amino acid residues in the region corresponding to the N-terminal side highly hydrophobic region of GPI-anchored proteins, the N-terminal side hydrophobicity value calculation unit 105 calculates the N-terminal side average hydrophobicity value, that is, the mean of as many consecutive hydrophobicity index values of the partial numerical sequence as the required number for N-terminal side hydrophobic characteristic extraction (the averaging window size), while shifting by one residue at a time as shown in FIG. 8.
Here, the N-terminal side average hydrophobicity value for a run of consecutive residues whose central residue is the r-th residue from the N-terminus can be calculated using equation (1).
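Equation (1) itself is not reproduced in this text. From the definitions of n and H(i) that accompany it, a plausible reconstruction is the centered moving average below; the symbol \bar{H}(r) for the N-terminal side average hydrophobicity value is an assumed notation.

```latex
\bar{H}(r) = \frac{1}{2n+1} \sum_{i=r-n}^{r+n} H(i) \tag{1}
```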

Here, n denotes the number of residues on each side used for averaging, so 2n+1 is the required number for N-terminal side hydrophobic characteristic extraction. H(i) denotes the hydrophobicity index value of the i-th amino acid residue from the N-terminus.
In other words, the N-terminal side average hydrophobicity value of the residue run centered on the r-th residue from the N-terminus is the mean of the hydrophobicity index values from the (r-n)-th residue to the (r+n)-th residue from the N-terminus. Because the average over the n preceding and n following residues cannot be calculated for residues within n residues of the N-terminus, it is advisable to assign, for example, a NULL value as the N-terminal side average hydrophobicity value for those residues.
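The windowed averaging of step S4 can be sketched as below. The function name `windowed_average` is hypothetical, `None` stands in for the NULL value mentioned in the text, and treating the trailing edge of the input the same way as the N-terminal edge is an assumption of this sketch.

```python
from typing import List, Optional, Sequence


def windowed_average(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Average each value with its n neighbours on both sides (2n+1 window)."""
    result: List[Optional[float]] = []
    for r in range(len(values)):
        if r - n < 0 or r + n >= len(values):
            # The full window is not available: record a NULL-like marker.
            result.append(None)
        else:
            window = values[r - n:r + n + 1]
            result.append(sum(window) / (2 * n + 1))
    return result


print(windowed_average([1.0, 2.0, 3.0, 4.0, 5.0], 1))
# [None, 2.0, 3.0, 4.0, None]
```

For the 11-residue window used in this embodiment, `n` would be 5.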

In this embodiment, 11 residues are used as the required number for N-terminal side hydrophobic characteristic extraction. That is, the N-terminal side average hydrophobicity value is the mean of the hydrophobicity index values of the r-th residue from the N-terminus and the five residues on each side of it. The method by which this number was determined to be 11 residues is described next.

First, for the N-terminal side highly hydrophobic region of each of a plurality of known GPI-anchored proteins, that is, the amino acid residues within 30 residues of the N-terminus, the average hydrophobicity value over a candidate number of consecutive residues is calculated while shifting by one residue at a time. Next, the maximum average hydrophobicity value is extracted for each of the known GPI-anchored proteins. The minimum of this set of maxima is then extracted.
Next, for the corresponding region of a plurality of known non-GPI-anchored proteins, that is, the amino acid residues within 30 residues of their N-termini, the average hydrophobicity value over the same candidate number of consecutive residues is calculated while shifting by one residue at a time. Among the maximum average hydrophobicity values calculated from the non-GPI-anchored proteins, those larger than the minimum extracted from the set of maxima of the known GPI-anchored proteins are counted.
This process is repeated with different candidate values, and the candidate for which this count is smallest is adopted as the required number for N-terminal side hydrophobic characteristic extraction.
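The selection procedure just described can be sketched as a small search over candidate window sizes. This is an illustrative reading of the text, not the authors' code: the function names are hypothetical, each profile is assumed to be the list of hydrophobicity index values for the first 30 residues of one protein, and ties between candidates are resolved in favor of the first one tried.

```python
from typing import Iterable, List, Optional, Sequence


def max_window_mean(values: Sequence[float], window: int) -> float:
    """Largest mean over all runs of `window` consecutive values.

    Raises ValueError if the profile is shorter than the window.
    """
    return max(
        sum(values[s:s + window]) / window
        for s in range(len(values) - window + 1)
    )


def choose_window(gpi: List[Sequence[float]],
                  non_gpi: List[Sequence[float]],
                  candidates: Iterable[int]) -> int:
    """Pick the window size that lets the fewest non-GPI profiles pass."""
    best_size: Optional[int] = None
    best_count: Optional[int] = None
    for w in candidates:
        # Minimum, over the known GPI-anchored proteins, of their maxima.
        floor = min(max_window_mean(p, w) for p in gpi)
        # Non-GPI proteins whose maximum exceeds that minimum.
        count = sum(1 for p in non_gpi if max_window_mean(p, w) > floor)
        if best_count is None or count < best_count:
            best_size, best_count = w, count
    if best_size is None:
        raise ValueError("no candidate window sizes given")
    return best_size
```

Because the floor is the minimum over the known GPI-anchored proteins, every one of them passes by construction, and the search only trades off how many non-GPI-anchored proteins also pass.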

In this embodiment, the above method was executed using a full-length amino acid sequence data set of known mammalian GPI-anchored proteins and a full-length amino acid sequence data set of known mammalian non-GPI-anchored proteins, both obtained from SWISS-PROT ver. 54.0. As a result, the required number for N-terminal hydrophobic property extraction was determined to be 11 residues.

<Step S5: Determination of the maximum N-terminal average hydrophobicity value>
When the N-terminal hydrophobicity value calculation unit 105 has calculated, in step S4, the N-terminal average hydrophobicity values from the hydrophobicity index values of the partial numerical sequence, the N-terminal hydrophobicity determination unit 106 determines whether or not the maximum of the calculated N-terminal average hydrophobicity values is equal to or greater than the N-terminal hydrophobicity threshold. The N-terminal hydrophobicity threshold is a threshold that characterizes the N-terminal average hydrophobicity value of GPI-anchored proteins. In this embodiment, 1.50 is used as the N-terminal hydrophobicity threshold. The value 1.50 was obtained in advance by calculating the N-terminal average hydrophobicity values for a plurality of known GPI-anchored proteins and taking the minimum value in the set of their maximum N-terminal average hydrophobicity values.

FIG. 9 is a graph showing the distribution of the maximum N-terminal average hydrophobicity value within 30 residues of the N-terminus of known GPI-anchored proteins. The horizontal axis indicates the maximum N-terminal average hydrophobicity value, and the vertical axis indicates the frequency with which GPI-anchored proteins take that maximum value.
As shown in FIG. 9, the maximum N-terminal average hydrophobicity value calculated from the amino acid residues within 30 residues of the N-terminus of a known GPI-anchored protein is equal to or greater than the N-terminal hydrophobicity threshold of 1.50. Therefore, if the maximum N-terminal average hydrophobicity value calculated from the amino acid residues within 30 residues of the N-terminus of the test target protein is 1.50 or more, the test target protein is likely to be a GPI-anchored protein; if the maximum value is less than 1.50, the test target protein can be judged unlikely to be a GPI-anchored protein.
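The step S5 decision can be illustrated with a short Python sketch. The function name `n_terminal_passes` is hypothetical, the input is assumed to be the protein's hydrophobicity index values in N-to-C order, and the averaging windows are assumed to lie entirely inside the first 30 residues; the defaults 11, 30 and 1.50 are the values stated for this embodiment.

```python
from typing import Sequence


def n_terminal_passes(
    hydro: Sequence[float],
    window: int = 11,
    region: int = 30,
    threshold: float = 1.50,
) -> bool:
    """True if the largest windowed mean in the N-terminal region
    is at least the N-terminal hydrophobicity threshold."""
    head = hydro[:region]  # residues within `region` of the N-terminus
    if len(head) < window:
        return False  # assumption: too short to evaluate counts as a fail
    best = max(
        sum(head[i:i + window]) / window
        for i in range(len(head) - window + 1)
    )
    return best >= threshold
```

A protein that fails this test is judged unlikely to be GPI-anchored and needs no further steps; one that passes proceeds to step S6.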

<Step S6: Extract hydrophobicity index values outside the N-terminal region>
When the N-terminal hydrophobicity determination unit 106 determines in step S5 that the maximum of the calculated N-terminal average hydrophobicity values is equal to or greater than the N-terminal hydrophobicity threshold (step S5: YES), the non-N-terminal hydrophobicity value calculation unit 107 extracts, from the numerical sequence generated by the hydrophobicity index value specifying unit 104 in step S2, the partial numerical sequence that remains after removing the partial numerical sequence extracted by the N-terminal hydrophobicity value calculation unit 105 in step S3. That is, it extracts the partial numerical sequence representing the amino acid residues that follow the first 30 residues from the N-terminus.

<Step S7: Calculate the non-N-terminal average hydrophobicity value>
Next, the non-N-terminal hydrophobicity value calculation unit 107 calculates the non-N-terminal average hydrophobicity value, which is the average of the hydrophobicity index values of a run of consecutive residues in the partial numerical sequence, while shifting by one residue at a time. The run length is the required number for non-N-terminal hydrophobic property extraction.
When the central amino acid residue of such a run is the r-th residue from the N-terminus, the non-N-terminal average hydrophobicity value can be calculated using formula (1), in the same way as the N-terminal average hydrophobicity value. For amino acid residues within n residues of the C-terminus, the average over the n residues before and after cannot be calculated, so it is advisable to assign, for example, a NULL value as the non-N-terminal average hydrophobicity value.
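Steps S6 and S7 can be sketched in Python as below. The names `centered_means` and `non_n_terminal_profile` are hypothetical. `None` stands in for the NULL value, and it is assumed here that positions too close to either end of the partial sequence (not only the C-terminal end) receive it. The default `n=8` corresponds to the 17-residue window of this embodiment.

```python
from typing import List, Optional, Sequence


def centered_means(values: Sequence[float], n: int) -> List[Optional[float]]:
    """Mean of the 2n+1 values centered on each position.

    Positions without n neighbors on both sides get None (the NULL value).
    """
    size = 2 * n + 1
    out: List[Optional[float]] = [None] * len(values)
    for r in range(n, len(values) - n):
        out[r] = sum(values[r - n:r + n + 1]) / size
    return out


def non_n_terminal_profile(
    hydro: Sequence[float], n: int = 8, region: int = 30
) -> List[Optional[float]]:
    """Step S6: drop the first `region` residues; step S7: windowed means."""
    return centered_means(hydro[region:], n)
```

Index 0 of the returned list corresponds to the first residue after the N-terminal region, so a caller that needs positions counted from the N-terminus has to add `region` back.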

In this embodiment, 17 residues are used as the required number for non-N-terminal hydrophobic property extraction. That is, the non-N-terminal average hydrophobicity value is the average of the hydrophobicity index values of the r-th amino acid residue from the N-terminus and the 8 residues on either side of it. The method used to determine this number as 17 residues is described below.

First, for each of a plurality of known GPI-anchored proteins, the average hydrophobicity value is calculated over runs of consecutive amino acid residues taken from the region other than the N-terminal highly hydrophobic region, that is, from the amino acid residues that follow the first 30 residues from the N-terminus, while shifting the run by one residue at a time. The run length is a candidate for the required number for non-N-terminal hydrophobic property extraction. Next, the maximum average hydrophobicity value of each known GPI-anchored protein is extracted. Then, the minimum value in the extracted set of maximum values is extracted.
Next, for each of a plurality of known non-GPI-anchored proteins, the average hydrophobicity value over runs of the same candidate length is calculated, while shifting by one residue at a time, in the corresponding region, that is, among the amino acid residues that follow the first 30 residues from the N-terminus of each non-GPI-anchored protein. Then, among the maximum average hydrophobicity values calculated from the non-GPI-anchored proteins, the number of values larger than the minimum value extracted from the set of maximum average hydrophobicity values of the known GPI-anchored proteins is counted.
This process is repeated while changing the candidate value, and the candidate for which that count is smallest is determined as the required number for non-N-terminal hydrophobic property extraction.

FIG. 10 is a second graph showing the hydrophobicity profile of a GPI-anchored protein.
FIG. 10 shows the non-N-terminal average hydrophobicity values (17-residue average) calculated for the BY55_HUMAN (181 aa) entry of SWISS-PROT ver. 54.0 by the same processing as in step S7. The horizontal axis indicates the position, counted from the N-terminus, of the amino acid residue at the center of each run of consecutive residues, and the vertical axis indicates the non-N-terminal average hydrophobicity value.
As shown in FIG. 10, the C-terminal region of a known GPI-anchored protein has the next highest hydrophobicity after the 30 residues from the N-terminus.

In this embodiment, the above method was executed using a full-length amino acid sequence data set of known mammalian GPI-anchored proteins and a full-length amino acid sequence data set of known mammalian non-GPI-anchored proteins, both obtained from SWISS-PROT ver. 54.0. As a result, the required number for non-N-terminal hydrophobic property extraction was determined to be 17 residues.

<Step S8: Determination of the maximum non-N-terminal average hydrophobicity value>
When the non-N-terminal hydrophobicity value calculation unit 107 has calculated, in step S7, the non-N-terminal average hydrophobicity values while shifting by one residue at a time, the non-N-terminal hydrophobicity determination unit 108 determines whether or not the maximum of the calculated non-N-terminal average hydrophobicity values is equal to or greater than the non-N-terminal hydrophobicity threshold. The non-N-terminal hydrophobicity threshold is a threshold that characterizes the non-N-terminal average hydrophobicity value of known GPI-anchored proteins. In this embodiment, 1.38 is used as the non-N-terminal hydrophobicity threshold.
The value 1.38 was obtained in advance by calculating the non-N-terminal average hydrophobicity values for a plurality of known GPI-anchored proteins and taking the minimum value in the set of their maximum non-N-terminal average hydrophobicity values.

FIG. 11 is a graph showing the maximum non-N-terminal average hydrophobicity values of known GPI-anchored proteins and known non-GPI-anchored proteins. The horizontal axis indicates the position, counted from the C-terminus, of the amino acid residue at the center of the run of consecutive residues, and the vertical axis indicates the non-N-terminal average hydrophobicity value.
As shown in FIG. 11, the maximum non-N-terminal average hydrophobicity value calculated from the amino acid residues that follow the first 30 residues from the N-terminus of a known GPI-anchored protein is equal to or greater than the non-N-terminal hydrophobicity threshold of 1.38. Therefore, if the maximum non-N-terminal average hydrophobicity value calculated in the same way for the test target protein is 1.38 or more, the test target protein is likely to be a GPI-anchored protein; if the maximum value is less than 1.38, the test target protein can be judged unlikely to be a GPI-anchored protein.

<Step S9: Determination of the amino acid residue position at which the non-N-terminal average hydrophobicity value is maximal>
When the non-N-terminal hydrophobicity determination unit 108 determines that the maximum of the calculated non-N-terminal average hydrophobicity values is equal to or greater than the non-N-terminal hydrophobicity threshold (step S8: YES), the C-terminal maximum hydrophobic position determination unit 109 determines whether or not the position of the amino acid residue at which the non-N-terminal average hydrophobicity value calculated in step S7 is maximal lies within the region corresponding to the C-terminal highly hydrophobic region of GPI-anchored proteins.
In this embodiment, the amino acid residues within 14 residues of the C-terminus are used as the region corresponding to the C-terminal highly hydrophobic region of GPI-anchored proteins. This region was chosen as follows: when the non-N-terminal average hydrophobicity value is calculated for each of a plurality of known GPI-anchored proteins over the region other than the N-terminal highly hydrophobic region (the amino acid residues that follow the first 30 residues from the N-terminus), the amino acid residue at the center of the run with the maximum non-N-terminal average hydrophobicity value falls within this region.

As shown in FIG. 11, in a known GPI-anchored protein, the amino acid residue at the center of the run with the maximum non-N-terminal average hydrophobicity value lies within the C-terminal highly hydrophobic region. Therefore, if the corresponding central residue of the test target protein lies within the region corresponding to the C-terminal highly hydrophobic region of GPI-anchored proteins, the test target protein is likely to be a GPI-anchored protein; if it does not lie within that region, the test target protein can be judged unlikely to be a GPI-anchored protein.
In other words, the shaded rectangle in FIG. 11 indicates the range that satisfies both the non-N-terminal hydrophobicity threshold and the C-terminal highly hydrophobic region condition. The non-N-terminal hydrophobicity threshold and the region corresponding to the C-terminal highly hydrophobic region are determined so that the number of non-GPI-anchored proteins falling within this range is minimized.
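The combined test of steps S8 and S9 (the shaded rectangle of FIG. 11) can be sketched in Python as follows. The function name `c_terminal_check` is hypothetical. The input is assumed to be a list of non-N-terminal average hydrophobicity values in N-to-C order with `None` for NULL entries, and "within 14 residues of the C-terminus" is assumed to count the last residue as position 1.

```python
from typing import Optional, Sequence


def c_terminal_check(
    profile: Sequence[Optional[float]],
    threshold: float = 1.38,
    c_region: int = 14,
) -> bool:
    """True if the maximum average is at least `threshold` (step S8) and its
    central residue lies within `c_region` residues of the C-terminus (S9)."""
    valid = [i for i, v in enumerate(profile) if v is not None]
    if not valid:
        return False  # nothing could be averaged
    peak = max(valid, key=lambda i: profile[i])  # first position of the maximum
    distance_from_c = len(profile) - peak  # the last residue has distance 1
    return profile[peak] >= threshold and distance_from_c <= c_region
```

Because a 17-residue window leaves the last 8 positions NULL, the peak can only sit 9 to 14 residues from the C-terminus under these assumptions.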

<Step S10: Extract the residues of the small side chain size determination region>
When the C-terminal maximum hydrophobic position determination unit 109 determines that the amino acid residue at which the non-N-terminal average hydrophobicity value is maximal lies within 14 residues of the C-terminus (step S9: YES), the side chain size index value specifying unit 111 extracts, from the amino acid sequence information acquired by the sequence acquisition unit 102 in step S1, the partial sequence corresponding to the amino acid residues of the small side chain size determination region. The small side chain size determination region is a region that includes the propeptide region of known GPI-anchored proteins. In this embodiment, the amino acid residues within 30 residues of the C-terminus are used. This region was chosen because, when the average side chain size of known GPI-anchored proteins is calculated by the same processing as in step S12 described later, the amino acid residue at the center of the run with the minimum average side chain size falls within it.

<Step S11: Specify the side chain size index values>
Having extracted in step S10 the partial sequence corresponding to the amino acid residues of the small side chain size determination region, the side chain size index value specifying unit 111 refers to the side chain size index value storage unit 110 and generates a numerical sequence in which a side chain size index value is assigned to each amino acid residue of the extracted partial sequence (step S11). For example, when the amino acid sequence information acquired by the sequence acquisition unit 102 indicates the sequence "MLLEPGRGCC...", the side chain size index value specifying unit 111 generates the numerical sequence "6, 5.5, 5.5, 5, 5.5, 0.5, 7.5, 0.5, 3, 3, ..." from the index values of FIG. 3 stored in the side chain size index value storage unit 110.
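The lookup in step S11 amounts to a per-residue table lookup, as the following Python sketch shows. The table here is deliberately partial: it holds only the seven index values that can be read off the "MLLEPGRGCC" example above, not the full table of FIG. 3, and the names `SIDE_CHAIN_SIZE` and `to_side_chain_sizes` are hypothetical.

```python
from typing import Dict, List

# Partial table: only the values recoverable from the worked example in the
# text. The complete side chain size index of FIG. 3 is not reproduced here.
SIDE_CHAIN_SIZE: Dict[str, float] = {
    "M": 6.0,
    "L": 5.5,
    "E": 5.0,
    "P": 5.5,
    "G": 0.5,
    "R": 7.5,
    "C": 3.0,
}


def to_side_chain_sizes(
    sequence: str, table: Dict[str, float] = SIDE_CHAIN_SIZE
) -> List[float]:
    """Replace each one-letter residue code with its side chain size index.

    Raises KeyError for residues missing from the (partial) table.
    """
    return [table[residue] for residue in sequence]


print(to_side_chain_sizes("MLLEPGRGCC"))
# [6.0, 5.5, 5.5, 5.0, 5.5, 0.5, 7.5, 0.5, 3.0, 3.0]
```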

<Step S12: Calculate the average side chain size>
When the side chain size index value specifying unit 111 has generated, in step S11, the numerical sequence of side chain size index values, the side chain size calculation unit 112 calculates the average side chain size, which is the average of the side chain size index values of a run of consecutive residues in that numerical sequence, while shifting by one residue at a time. The run length is the required number for side chain size property extraction.
When the central amino acid residue of such a run is the r-th residue from the N-terminus, the average side chain size can be calculated using formula (2).

Here, n denotes the number of residues on each side used for averaging, so 2n+1 is the required number for side chain size property extraction. V(i) denotes the side chain size index value of the amino acid residue at the i-th position from the N-terminus.
In other words, the average side chain size of the run centered on the r-th residue from the N-terminus is the average of the side chain size index values from the (r-n)-th residue to the (r+n)-th residue from the N-terminus. For amino acid residues within n residues of the C-terminus, the average over the n residues before and after cannot be calculated, so it is advisable to assign, for example, a NULL value as the average side chain size.
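Formula (2) itself is not reproduced in this text. From the definitions given above (an average of V(i) over the 2n+1 residues from r-n to r+n), one reading consistent with them is the following, where the symbol $\bar{V}(r)$ for the average side chain size is an assumption introduced here:

```latex
\bar{V}(r) = \frac{1}{2n+1} \sum_{i=r-n}^{r+n} V(i)
```

For example, with n = 1 and side chain size index values 5.5, 0.5 and 7.5 at positions r-1, r and r+1, this gives (5.5 + 0.5 + 7.5) / 3 = 4.5.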

In this embodiment, 3 residues are used as the required number for side chain size property extraction. That is, the average side chain size is the average of the side chain size index values of the r-th amino acid residue from the N-terminus and the residues adjacent to it on either side. The method used to determine this number as 3 residues is described below.

First, for each of a plurality of known GPI-anchored proteins, the average side chain size is calculated over runs of consecutive amino acid residues taken from the small side chain size determination region, that is, from the amino acid residues within 30 residues of the C-terminus, while shifting the run by one residue at a time. The run length is a candidate for the required number for side chain size property extraction. Next, for each known GPI-anchored protein, the amino acid residue at which the average side chain size is minimal is identified. Then, the number of proteins in which the residue adjacent to the identified residue on its C-terminal side is the GPI anchor modification site (ω site) is counted.
This process is repeated while changing the candidate value, and the candidate for which, among all the GPI-anchored proteins, the number of proteins whose GPI anchor modification site is adjacent on the C-terminal side to the residue with the minimum average side chain size is largest is determined as the required number for side chain size property extraction.
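A minimal Python sketch of this selection follows. It rests on several assumptions not stated in the text: each known GPI-anchored protein is given as a pair of its side chain size index values and the 0-based index of its annotated ω site, candidates are given as the half-width n of a centered window of 2n+1 residues, and ties go to the first position or candidate encountered. The function names are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


def argmin_centered_mean(values: Sequence[float], n: int) -> Optional[int]:
    """Index of the first position whose centered (2n+1) mean is smallest."""
    best_r, best_mean = None, None
    for r in range(n, len(values) - n):
        mean = sum(values[r - n:r + n + 1]) / (2 * n + 1)
        if best_mean is None or mean < best_mean:
            best_r, best_mean = r, mean
    return best_r


def choose_side_chain_window(
    proteins: List[Tuple[Sequence[float], int]],
    half_widths: Sequence[int],
    region: int = 30,
) -> Optional[Tuple[int, int]]:
    """Return (window length, hits) for the candidate with the most proteins
    whose omega site is the residue just C-terminal of the minimum."""
    best = None
    for n in half_widths:
        hits = 0
        for sizes, omega_index in proteins:
            offset = max(len(sizes) - region, 0)  # start of C-terminal region
            r = argmin_centered_mean(sizes[offset:], n)
            if r is not None and offset + r + 1 == omega_index:
                hits += 1
        if best is None or hits > best[1]:
            best = (2 * n + 1, hits)
    return best
```

The embodiment reports a window of 3 residues (n = 1) as the result of this procedure on its data set.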

In this embodiment, the above method was executed using a full-length amino acid sequence data set of known mammalian GPI-anchored proteins and a full-length amino acid sequence data set of known mammalian non-GPI-anchored proteins, both obtained from SWISS-PROT ver. 54.0. As a result, the required number for side chain size property extraction was determined to be 3 residues.

FIG. 12 is a graph showing the side chain size profile of a GPI-anchored protein.
FIG. 12 shows the average side chain size calculated for the BY55_HUMAN (181 aa) entry of SWISS-PROT ver. 54.0 by the same processing as in step S12. The horizontal axis indicates the position, counted from the C-terminus, of the amino acid residue at the center of each run, and the vertical axis indicates the average side chain size.
As shown in FIG. 12, the GPI anchor modification site of a known GPI-anchored protein is adjacent, on the C-terminal side, to the amino acid residue at which the average side chain size is minimal.

<Step S13: Extract the amino acid residues of a predetermined region>
FIG. 13 is a diagram showing the method of extracting the amino acid sequence.
When the side chain size calculation unit 112 has calculated the average side chain size in step S12, the score numerical sequence generation unit 114 determines, as shown in FIG. 13(1), the position of the amino acid residue at which the calculated average side chain size is minimal as the reference position. Next, as shown in FIG. 13(2), the score numerical sequence generation unit 114 extracts the amino acid residues in a predetermined region containing the reference position from the amino acid sequence information acquired by the sequence acquisition unit 102 in step S1.
In this embodiment, the predetermined region consists of the 12 consecutive amino acid residues on the N-terminal side of the reference position and the 12 consecutive amino acid residues on its C-terminal side.
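Steps S12 and S13 can be sketched together in Python as below. The names `reference_position` and `extract_region` are hypothetical. The sketch assumes 0-based indexing, a centered window of 2n+1 residues with n = 1, a search restricted to the last 30 residues, the reference residue itself being included in the extracted region, and clipping of the region when it would run past either end of the sequence; the text does not say how that last case is handled.

```python
from typing import Optional, Sequence


def reference_position(
    sizes: Sequence[float], n: int = 1, region: int = 30
) -> Optional[int]:
    """0-based index (in the full sequence) of the residue with the smallest
    centered average side chain size in the C-terminal `region` residues."""
    offset = max(len(sizes) - region, 0)
    tail = sizes[offset:]
    best_r, best_mean = None, None
    for r in range(n, len(tail) - n):
        mean = sum(tail[r - n:r + n + 1]) / (2 * n + 1)
        if best_mean is None or mean < best_mean:
            best_r, best_mean = r, mean
    return None if best_r is None else offset + best_r


def extract_region(sequence: str, ref: int, flank: int = 12) -> str:
    """Residues from `flank` before the reference position to `flank` after
    it, including the reference residue, clipped to the sequence ends."""
    start = max(ref - flank, 0)
    return sequence[start:ref + flank + 1]
```

When nothing is clipped, the result has 25 residues and the reference residue sits at index 12 of the extracted string.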

<Step S14: Assign position-specific scores>
FIG. 14 is a diagram showing the method of assigning position-specific scores.
Next, based on the PSSM stored in the PSSM storage unit 113, the score numerical sequence generation unit 114 specifies the position-specific score of each amino acid residue in the extracted predetermined region and generates a numerical sequence of those position-specific scores. For example, when the extracted amino acid residues form the sequence "CQNA...S" as shown in FIG. 14, the score numerical sequence generation unit 114 refers to the PSSM shown in FIGS. 4 and 5 and generates the numerical sequence "0.21, -0.54, 2.69, -0.77, ..., 1.13".
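The lookup in step S14 can be illustrated with the Python sketch below. The PSSM is assumed to be stored as one dictionary per position that maps a residue code to its score, and the function name `score_region` is hypothetical. The toy matrix contains only the four scores quoted in the example above for "CQNA"; the actual matrix of FIGS. 4 and 5 is not reproduced here.

```python
from typing import Dict, List


def score_region(region: str, pssm: List[Dict[str, float]]) -> List[float]:
    """Look up the position-specific score of every residue in `region`.

    `pssm[p][residue]` is the score of `residue` at position p.
    Raises KeyError if a residue has no entry at its position.
    """
    if len(region) > len(pssm):
        raise ValueError("region is longer than the PSSM")
    return [pssm[p][residue] for p, residue in enumerate(region)]


# Toy matrix: first four positions only, with the scores from the example.
toy_pssm = [{"C": 0.21}, {"Q": -0.54}, {"N": 2.69}, {"A": -0.77}]
print(score_region("CQNA", toy_pssm))  # [0.21, -0.54, 2.69, -0.77]
```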

The method of creating the PSSM used in step S14 is described below.
First, a full-length amino acid sequence data set of known mammalian GPI-anchored proteins and a full-length amino acid sequence data set of known mammalian non-GPI-anchored proteins are obtained. In this embodiment, these data sets were obtained from SWISS-PROT ver. 54.0. From the GPI-anchored protein data set, entries whose properties as a GPI-anchored protein translated from the amino acid sequence have not been demonstrated, entries that are clearly not full length, and the like were excluded. As a result, the GPI-anchored protein data set contained 391 entries and the non-GPI-anchored protein data set contained 48,983 entries.

Once the data sets have been obtained, each entry in them is screened for hydrophobicity.
First, using formula (1) and the hydrophobicity index values shown in FIG. 2, with the required number for N-terminal hydrophobic property extraction set to 11 residues (that is, n = 5 in formula (1)), the N-terminal average hydrophobicity values of each entry are calculated, and the entries whose maximum N-terminal average hydrophobicity value in the region within 30 residues of the N-terminus is 1.50 or more are extracted. Next, for each extracted entry, the average hydrophobicity values are calculated using formula (1) and the hydrophobicity index values shown in FIG. 2, with the required number for non-N-terminal hydrophobic property extraction set to 17 residues (that is, n = 8 in formula (1)), and the entries are extracted for which the maximum non-N-terminal average hydrophobicity value over the entire region excluding the 30 residues from the N-terminus is 1.38 or more and the residue position showing that maximum is within 14 residues of the C-terminus. As a result, entries that are not actually full length and entries whose expression as a protein is only putative are excluded. In this embodiment, after the hydrophobicity screening, the GPI-anchored protein data set contained 121 entries and the non-GPI-anchored protein data set contained 218 entries.

Next, redundancy is eliminated by removing entries with identical amino acid sequences from the data sets extracted by the hydrophobicity screening. As a result, in this embodiment, the GPI-anchored protein data set contained 113 entries and the non-GPI-anchored protein data set contained 210 entries. FIG. 15 shows the 113 SWISS-PROT entry names included in the GPI-anchored protein data set after redundancy was eliminated.
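Removing entries with identical sequences can be sketched in Python as follows. The function name `remove_redundant` is hypothetical, entries are assumed to be (entry name, sequence) pairs, and keeping the first occurrence of each sequence is an assumption; the text does not say which of several identical entries is retained.

```python
from typing import List, Tuple


def remove_redundant(entries: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Keep the first entry for each distinct amino acid sequence."""
    seen = set()
    unique = []
    for name, sequence in entries:
        if sequence not in seen:
            seen.add(sequence)
            unique.append((name, sequence))
    return unique
```

Input order is preserved, so the surviving entry for each sequence is the one that appears first in the screened data set.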

上記により得られた各データセット中の各エントリーのＣ末端から３０アミノ酸残基までの平均側鎖サイズを、上述した式（２）及び図３に示す側鎖サイズ指標値を用いて、側鎖サイズ特性抽出必要数を３に設定して（すなわち、式（２）においてｎ＝１に設定して）算出する。
そして、データセットのうちＧＰＩアンカー型タンパク質の各エントリーの、平均側鎖サイズが最小となるアミノ酸残基の位置を基準位置とする所定の範囲（基準位置のアミノ酸残基と基準位置からＮ末端側に連続する１２残基のアミノ酸残基とＣ末端側に連続する１２残基のアミノ酸残基とからなる範囲）におけるアミノ酸残基から、式（３）を用いて既知のＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度を算出する。 For each entry in each data set obtained as described above, the average side chain size over the 30 amino acid residues from the C-terminus is calculated using the above-described formula (2) and the side chain size index values shown in FIG. 3, with the required number for side chain size characteristic extraction set to 3 (that is, with n = 1 in formula (2)).
Then, for each GPI-anchored protein entry in the data set, the appearance frequency of amino acid residue type i at position p within the predetermined region of known GPI-anchored proteins is calculated using formula (3) from the amino acid residues in a predetermined range whose reference position is the amino acid residue with the smallest average side chain size (the range consisting of the residue at the reference position, the 12 consecutive residues on its N-terminal side, and the 12 consecutive residues on its C-terminal side).
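The selection of the reference position and of the 25-residue range around it might be sketched as follows. This is a sketch under stated assumptions: the number of non-hydrogen side chain atoms stands in for the side chain size index of FIG. 3; formula (2) is taken to be a centered mean over 2n + 1 residues; and an entry whose range does not fit inside the sequence is simply skipped (returned as `None`), a case the description here does not address. The function names are hypothetical.

```python
# Non-hydrogen side chain atom counts: an ASSUMED stand-in for the size index of FIG. 3.
SIZE = {
    "G": 0, "A": 1, "S": 2, "C": 2, "P": 3, "T": 3, "V": 3,
    "D": 4, "N": 4, "I": 4, "L": 4, "M": 4, "E": 5, "Q": 5,
    "K": 5, "H": 6, "F": 7, "R": 7, "Y": 8, "W": 10,
}


def reference_position(seq, search_len=30, n=1, index=SIZE):
    """Return the 0-based center, within the last search_len residues, with the smallest mean size."""
    start = max(n, len(seq) - search_len)
    centers = range(start, len(seq) - n)  # only centers whose window fits

    def mean_size(c):
        return sum(index[a] for a in seq[c - n:c + n + 1]) / (2 * n + 1)

    return min(centers, key=mean_size)


def window_around(seq, ref, before=12, after=12):
    """Return the residues from ref-before to ref+after, or None if the range does not fit."""
    if ref - before < 0 or ref + after >= len(seq):
        return None
    return seq[ref - before:ref + after + 1]
```

With `before=12` and `after=12` the returned range has 25 residues, with the reference residue at index 12.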

但し、ｎ_ｉｐは、種類ｉのアミノ酸残基が位置ｐに存在する既知のＧＰＩアンカー型タンパク質の個数を示す。また、εは算出する出現頻度の調整値を示し、本実施形態では１を用いている。また、ｓは、アミノ酸残基の種類数を示す。
これにより、データセットの全てのエントリーにおいて位置ｐに種類ｉが存在しない場合にも、ゼロで除算を行うことを防ぐことができる。
同様に、データセットのうち非ＧＰＩアンカー型タンパク質の各エントリーの、平均側鎖サイズが最小となるアミノ酸残基の位置を基準位置とする所定の範囲におけるアミノ酸残基から、式（３）を用いて既知の非ＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度を算出する。 Here, n_ip indicates the number of known GPI-anchored proteins in which an amino acid residue of type i is present at position p. ε is an adjustment value for the calculated appearance frequency, and 1 is used in the present embodiment. s indicates the number of types of amino acid residues.
As a result, division by zero can be prevented even when type i is not present at position p in any entry of the data set.
Similarly, for each non-GPI-anchored protein entry in the data set, the appearance frequency of amino acid residue type i at position p within the predetermined region of known non-GPI-anchored proteins is calculated using formula (3) from the amino acid residues in the predetermined range whose reference position is the amino acid residue with the smallest average side chain size.
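Formula (3) itself is not reproduced in this text. One form consistent with the symbols defined for it (the count n_ip, the adjustment value ε, and the number of residue types s) is a pseudocount estimate, (n_ip + ε) / (Σ_i n_ip + s·ε). The sketch below assumes that form purely for illustration; it is not asserted to be the formula of the original drawing, and the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residue types (s = 20)


def position_frequencies(windows, eps=1.0):
    """For each position p, return {residue type i: smoothed appearance frequency}.

    ASSUMED form: (n_ip + eps) / (sum_i n_ip + s * eps), so no frequency is ever zero.
    """
    s = len(AMINO_ACIDS)
    freqs = []
    for p in range(len(windows[0])):
        counts = Counter(w[p] for w in windows)  # n_ip for this position
        total = sum(counts[a] for a in AMINO_ACIDS)
        freqs.append({a: (counts[a] + eps) / (total + s * eps) for a in AMINO_ACIDS})
    return freqs
```

Under this assumed form, a residue type never observed at a position still receives the small non-zero frequency ε / (Σ_i n_ip + s·ε), which is what keeps a later ratio of frequencies well defined.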

既知のＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度、及び既知の非ＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度を算出すると、次に、式（４）を用いて、アミノ酸残基の位置ｐにおけるアミノ酸残基の種類ｉの位置特異的スコアを算出する。 The frequency of occurrence of the type i of the amino acid residue present at position p within a given region of a known GPI-anchored protein, and the amino acid residue present at position p within a given region of a known non-GPI anchored protein Once the appearance frequency of type i is calculated, the position-specific score of type i of the amino acid residue at position p of the amino acid residue is then calculated using equation (4).

但し、ｆ_ｉｐ ^{ｐｏｓｉｔｉｖｅ}は、既知のＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度を示す。また、ｆ_ｉｐ ^{ｎｅｇａｔｉｖｅ}は、既知の非ＧＰＩアンカー型タンパク質の所定の領域内の位置ｐに存在するアミノ酸残基の種類ｉの出現頻度を示す。つまり、位置特異的スコアは、所定の範囲におけるあるアミノ酸残基の位置におけるアミノ酸残基の種類の、ＧＰＩアンカー型タンパク質における出現度合いを示している。
このように算出された位置特異的スコアを要素とする２５（所定の領域内のアミノ酸残基数）×２０（アミノ酸残基の種類数）の行列をＰＳＳＭとして生成し、ＰＳＳＭ記憶部１１３に格納しておく。これにより、図４及び図５に示すＰＳＳＭを生成することができる。 Here, f_ip^positive indicates the appearance frequency of amino acid residue type i at position p within the predetermined region of known GPI-anchored proteins, and f_ip^negative indicates the appearance frequency of amino acid residue type i at position p within the predetermined region of known non-GPI-anchored proteins. In other words, the position-specific score indicates the degree to which a given type of amino acid residue appears, at a given position in the predetermined range, in GPI-anchored proteins.
A 25 (number of amino acid residues in the predetermined region) × 20 (number of types of amino acid residues) matrix whose elements are the position-specific scores calculated in this way is generated as the PSSM and stored in the PSSM storage unit 113. The PSSM shown in FIGS. 4 and 5 can thus be generated.
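Formula (4) is likewise not reproduced in this text. Because it combines f_ip^positive and f_ip^negative and is said to require protection against division by zero, the sketch below assumes a log-ratio score, ln(f_ip^positive / f_ip^negative). Both the ratio form and the natural logarithm are assumptions for illustration, and the function names are hypothetical.

```python
import math


def build_pssm(pos_freqs, neg_freqs):
    """Return one {residue type: score} dict per position (ASSUMED log-ratio score)."""
    return [
        {a: math.log(fp[a] / fn[a]) for a in fp}
        for fp, fn in zip(pos_freqs, neg_freqs)
    ]


def score_sequence(window, pssm):
    """Assign each residue in the range its position-specific score (the score numerical sequence)."""
    return [pssm[p][a] for p, a in enumerate(window)]
```

With 25 positions and 20 residue types this yields the 25 × 20 matrix described above, and `score_sequence` turns a 25-residue range into 25 numbers.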

＜ステップＳ１５：ニューラルネットワークによる期待値出力＞
ステップＳ１４でスコア数値列生成部１１４がスコア数値列を生成すると、ニューラルネットワーク１１５は、当該スコア数値列を入力し、ＧＰＩアンカー型タンパク質らしさを示す０以上１以下の期待値を出力する。なお、ＰＳＳＭから得られた複数の位置特異的スコアは、従来、その平均値の高低によって検査対象タンパク質が目的タンパク質であるか否かを判定するために用いられている。本発明の骨子は、スコアの算出に用いられていた複数の位置特異的スコアをニューラルネットワーク１１５の入力値とした点にある。 <Step S15: Expected value output by neural network>
When the score numerical sequence generation unit 114 generates a score numerical sequence in step S14, the neural network 115 takes that sequence as input and outputs an expected value between 0 and 1 inclusive that indicates how likely the input is to be a GPI-anchored protein. Conventionally, the multiple position-specific scores obtained from a PSSM have been used to determine whether a test target protein is the target protein according to whether their average value is high or low. The gist of the present invention is that the multiple position-specific scores previously used to calculate such a score are instead used as the input values of the neural network 115.

ここで、ニューラルネットワーク１１５の処理について詳細に説明する。
図１６は、本実施形態で用いるニューラルネットワークの構成を示す図である。
ニューラルネットワーク１１５は、入力層Ｓ_１、隠れ層Ｓ_２、出力層Ｓ_３の３段の階層構造を有する。
入力層Ｓ_１は、スコア数値列生成部１１４が生成するスコア数値列の要素数と同数のノードＮ_１−１〜Ｎ_１−２５（以下、ノードＮ_１−１〜Ｎ_１−２５を総称する場合は、ノードＮ_１と記載する）で構成される。
隠れ層Ｓ_２は、入力層Ｓ_１のノード数と同数のノードＮ_２−１〜Ｎ_２−２５（以下、ノードＮ_２−１〜Ｎ_２−２５を総称する場合は、ノードＮ_２と記載する）で構成される。
出力層Ｓ_３は、１つのノードＮ_３で構成される。 Here, the processing of the neural network 115 will be described in detail.
FIG. 16 is a diagram showing a configuration of a neural network used in the present embodiment.
The neural network 115 has a three-layer hierarchical structure consisting of an input layer S₁, a hidden layer S₂, and an output layer S₃.
The input layer S₁ consists of nodes N₁-1 to N₁-25 (hereinafter collectively referred to as nodes N₁), equal in number to the elements of the score numerical sequence generated by the score numerical sequence generation unit 114.
The hidden layer S₂ consists of nodes N₂-1 to N₂-25 (hereinafter collectively referred to as nodes N₂), equal in number to the nodes of the input layer S₁.
The output layer S₃ consists of a single node N₃.

ノードＮ_１のそれぞれは、スコア数値列生成部１１４が生成するスコア数値列のうち、自身に対応づけられた要素の値を入力し、ノードＮ_２のそれぞれに出力する。ノードＮ_２は、ノードＮ_１のそれぞれが出力する値を入力し、当該入力した値を所定の記憶領域に記憶した伝達関数に代入し、得られた値をノードＮ_３に出力する。ノードＮ_３は、ノードＮ_２のそれぞれが出力する値を入力し、当該入力した値を所定の記憶領域に記憶した伝達関数に代入し、得られた値を期待値として出力する。
なお、ノードＮ_２、Ｎ_３が用いる伝達関数とは、前段のノードから入力したそれぞれの値と入力元のノードに対応する結合加重との積を総和し、得られる値が所定の閾値を超えた場合にのみ値を発火（出力）する関数である。ここで、ノードＮ_２の伝達関数を式（５）に、ノードＮ_３の伝達関数を式（６）に示す。 Each node N₁ receives, from the score numerical sequence generated by the score numerical sequence generation unit 114, the value of the element associated with it, and outputs that value to each of the nodes N₂. Each node N₂ receives the values output by the nodes N₁, substitutes them into a transfer function stored in a predetermined storage area, and outputs the resulting value to the node N₃. The node N₃ receives the values output by the nodes N₂, substitutes them into a transfer function stored in a predetermined storage area, and outputs the resulting value as the expected value.
The transfer function used by the nodes N₂ and N₃ sums the products of each value received from a node in the preceding layer and the connection weight corresponding to that source node, and fires (outputs) a value only when the resulting sum exceeds a predetermined threshold. The transfer function of the nodes N₂ is shown in formula (5), and that of the node N₃ in formula (6).

但し、ｎは、ノードＮ_１の総数を示す値であり、本実施形態では２５となる。また、ｗ_ｉは、ノードＮ_１−ｉに対応する結合加重を示す。また、ｘ_ｉは、ノードＮ_１−ｉから入力した値を示す。また、ｍは、ノードＮ_２の総数を示す値であり、本実施形態では２５となる。また、ｗ_ｊは、ノードＮ_２−ｊに対応する結合加重を示す。また、ｘ_ｊは、ノードＮ_２−ｊから入力した値を示す。また、θは、発火のための閾値を示す。また、関数ｆは、０以上１以下の値を出力するシグモイド関数である。なお、シグモイド関数は、式（７）に示す関数である。 Here, n is the total number of nodes N₁, which is 25 in the present embodiment; w_i is the connection weight corresponding to node N₁-i; and x_i is the value received from node N₁-i. Likewise, m is the total number of nodes N₂, which is 25 in the present embodiment; w_j is the connection weight corresponding to node N₂-j; and x_j is the value received from node N₂-j. θ is the threshold for firing. The function f is a sigmoid function that outputs a value between 0 and 1 inclusive, and is given by formula (7).
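Formulas (5) to (7) are not reproduced in this text. From the definitions given for them, each node appears to compute f(Σ w·x − θ) with f a sigmoid. The following forward-pass sketch assumes exactly that form and the standard logistic function 1 / (1 + e^(−u)); any gain or scaling constant in the original formula (7) is unknown. The data layout and function names are hypothetical.

```python
import math


def sigmoid(u):
    """Standard logistic function (ASSUMED form of the sigmoid of formula (7))."""
    return 1.0 / (1.0 + math.exp(-u))


def node_output(inputs, weights, theta):
    """Weighted sum of the inputs minus the firing threshold, passed through the sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) - theta)


def forward(scores, hidden, output):
    """Compute the expected value for one score numerical sequence.

    hidden: list of (weights, theta) pairs, one per node N2.
    output: a single (weights, theta) pair for node N3.
    """
    hidden_values = [node_output(scores, w, t) for w, t in hidden]
    return node_output(hidden_values, *output)
```

With all weights and thresholds at zero, every node outputs 0.5, so the expected value is 0.5 regardless of the input; training is what moves the output toward 0 or 1.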

また、ニューラルネットワーク１１５は、既知のＧＰＩアンカー型タンパク質のスコア数値列を入力とした場合に、期待値として１を出力し、既知の非ＧＰＩアンカー型タンパク質のスコア数値列を入力した場合に、期待値として０を出力するように学習されている。
ここで、ニューラルネットワーク１１５の学習方法を説明する。 The neural network 115 is trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input, and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input.
Here, a learning method of the neural network 115 will be described.

まず、ＰＳＳＭの作成に用いたＧＰＩアンカー型タンパク質データセット及び非ＧＰＩアンカー型タンパク質データセットを読み出す。次に、当該データセットの各エントリーから、平均側鎖サイズが最小となるアミノ酸残基の位置を基準位置とする所定の範囲（基準位置のアミノ酸残基と基準位置からＮ末端側に連続する１２残基のアミノ酸残基とＣ末端側に連続する１２残基のアミノ酸残基とからなる範囲）におけるアミノ酸残基のそれぞれに対して、ＰＳＳＭ記憶部１１３が記憶する位置特異的スコアを割り当て、スコア数値列を生成する。 First, the GPI-anchored protein data set and the non-GPI-anchored protein data set used to create the PSSM are read. Next, for each entry of these data sets, the position-specific score stored in the PSSM storage unit 113 is assigned to each amino acid residue in the predetermined range whose reference position is the amino acid residue with the smallest average side chain size (the range consisting of the residue at the reference position, the 12 consecutive residues on its N-terminal side, and the 12 consecutive residues on its C-terminal side), and a score numerical sequence is thereby generated.

次に、生成したスコア数値列をニューラルネットワーク１１５の入力層Ｓ_１の各ノードＮ_１に入力する。ノードＮ_１のそれぞれは、入力した値をノードＮ_２のそれぞれに出力する。ノードＮ_２は、ノードＮ_１のそれぞれが出力する値を伝達関数に代入し、得られた値をノードＮ_３に出力する。ノードＮ_３は、ノードＮ_２のそれぞれが出力する値を伝達関数に代入し、得られる値を期待値として出力する。 Next, the generated score numerical sequence is input to the nodes N₁ of the input layer S₁ of the neural network 115. Each node N₁ outputs the value it received to each of the nodes N₂. Each node N₂ substitutes the values output by the nodes N₁ into its transfer function and outputs the resulting value to the node N₃. The node N₃ substitutes the values output by the nodes N₂ into its transfer function and outputs the resulting value as the expected value.

他方、ニューラルネットワーク１１５のノードＮ_３は、教師データを入力する。教師データとは、入力したデータに対して期待される出力値を示すデータのことである。本実施形態においては、ＧＰＩアンカー型タンパク質のスコア数値列を入力した場合、教師データは１であり、非ＧＰＩアンカー型タンパク質のスコア数値列を入力した場合、教師データは０である。次に、ニューラルネットワーク１１５の各ノードは、教師データと出力した期待値との誤差を最小にするように、自身が用いる伝達関数の結合加重ｗ_ｉ、閾値θを変化させる。
この処理をＰＳＳＭの作成に用いたＧＰＩアンカー型タンパク質データセット及び非ＧＰＩアンカー型タンパク質データセットのそれぞれのエントリーに対して実行する。これにより、ニューラルネットワーク１１５は、既知のＧＰＩアンカー型タンパク質のスコア数値列を入力とした場合に、期待値として１を出力し、既知の非ＧＰＩアンカー型タンパク質のスコア数値列を入力した場合に、期待値として０を出力することとなる。 Meanwhile, the node N₃ of the neural network 115 receives teacher data. Teacher data is data indicating the output value expected for the input data. In the present embodiment, the teacher data is 1 when the score numerical sequence of a GPI-anchored protein is input, and 0 when the score numerical sequence of a non-GPI-anchored protein is input. Next, each node of the neural network 115 changes the connection weights w_i and the threshold θ of its own transfer function so as to minimize the error between the teacher data and the output expected value.
This process is executed for each entry of the GPI-anchored protein data set and non-GPI-anchored protein data set used to create the PSSM. Thereby, the neural network 115 outputs 1 as an expected value when a score value sequence of a known GPI anchor type protein is input, and when a score value sequence of a known non-GPI anchor type protein is input, 0 is output as the expected value.
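The description does not name the procedure by which the connection weights and thresholds are changed. As one possible illustration only, the sketch below uses online gradient descent with backpropagation on a squared error, which is a common choice for a network of this shape but is an assumption here, as are the learning rate, the initial weight range, and all names.

```python
import math
import random


def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))


def init_network(n_in, n_hidden, rng):
    """Small random weights, zero thresholds (ASSUMED initialization)."""
    hidden = [{"w": [rng.uniform(-0.5, 0.5) for _ in range(n_in)], "theta": 0.0}
              for _ in range(n_hidden)]
    out = {"w": [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)], "theta": 0.0}
    return hidden, out


def forward(x, hidden, out):
    """Return the hidden node values and the expected value for input x."""
    h = [sigmoid(sum(w * xi for w, xi in zip(node["w"], x)) - node["theta"])
         for node in hidden]
    y = sigmoid(sum(w * hj for w, hj in zip(out["w"], h)) - out["theta"])
    return h, y


def train_step(x, target, hidden, out, lr=0.5):
    """One gradient step on E = (y - target)^2 / 2; returns E before the update."""
    h, y = forward(x, hidden, out)
    delta_out = (y - target) * y * (1.0 - y)
    # Hidden deltas are computed with the output weights as they were before this update.
    deltas_h = [delta_out * out["w"][j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    for j, hj in enumerate(h):
        out["w"][j] -= lr * delta_out * hj
    out["theta"] += lr * delta_out  # dE/dtheta = -delta_out
    for node, d in zip(hidden, deltas_h):
        for i, xi in enumerate(x):
            node["w"][i] -= lr * d * xi
        node["theta"] += lr * d
    return 0.5 * (y - target) ** 2
```

In use, `train_step` would be called repeatedly over the score numerical sequences of both data sets, with `target` set to the teacher data (1 for GPI-anchored entries, 0 for non-GPI-anchored entries).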

＜ステップＳ１６：スコアの判定＞
ステップＳ１５でニューラルネットワーク１１５が期待値を出力すると、ＧＰＩアンカー型タンパク質判定部１１６は、出力した期待値が０．５以上であるか否かを判定する。つまり、ＧＰＩアンカー型タンパク質判定部１１６は、ニューラルネットワーク１１５が出力した期待値が、ＧＰＩアンカー型タンパク質を示す１と非ＧＰＩアンカー型タンパク質を示す０との何れに近いかを判定する。 <Step S16: Determination of Score>
When the neural network 115 outputs an expected value in step S15, the GPI-anchored protein determination unit 116 determines whether or not the output expected value is 0.5 or more. That is, the GPI-anchored protein determination unit 116 determines whether the expected value output by the neural network 115 is closer to 1, which indicates a GPI-anchored protein, or to 0, which indicates a non-GPI-anchored protein.

＜ステップＳ１７：ＧＰＩアンカー型タンパク質と判定＞
ＧＰＩアンカー型タンパク質判定部１１６は、ステップＳ１６でニューラルネットワーク１１５が出力した期待値が０．５以上であると判定した場合（ステップＳ１６：ＹＥＳ）、ステップＳ１で配列取得部１０２が取得したアミノ酸配列情報が、ＧＰＩアンカー型タンパク質のものであると判定する。 <Step S17: Determination as GPI-anchored protein>
When the GPI-anchored protein determination unit 116 determines in step S16 that the expected value output by the neural network 115 is 0.5 or more (step S16: YES), it determines that the amino acid sequence information acquired by the sequence acquisition unit 102 in step S1 is that of a GPI-anchored protein.

＜ステップＳ１８：非ＧＰＩアンカー型タンパク質と判定＞
他方、ステップＳ５でＮ末端側疎水性判定部１０６が、算出したＮ末端側平均疎水性値の最大値がＮ末端側疎水性閾値未満であると判定した場合（ステップＳ５：ＮＯ）、ステップＳ８でＮ末端外疎水性判定部１０８が、算出したＮ末端外平均疎水性値の最大値がＮ末端外疎水性閾値未満であると判定した場合（ステップＳ８：ＮＯ）、ステップＳ９でＣ末端側最大疎水位置判定部１０９が、Ｎ末端外平均疎水性値が最大となるアミノ酸残基の位置がＣ末端側の高疎水性領域に対応する領域内にないと判定した場合（ステップＳ９：ＮＯ）、またはステップＳ１６でニューラルネットワーク１１５が出力した期待値が０．５未満であると判定した場合（ステップＳ１６：ＮＯ）、ＧＰＩアンカー型タンパク質判定部１１６は、ステップＳ１で配列取得部１０２が取得したアミノ酸配列情報が、非ＧＰＩアンカー型タンパク質のものであると判定する。 <Step S18: Determination as non-GPI anchor type protein>
On the other hand, the GPI-anchored protein determination unit 116 determines that the amino acid sequence information acquired by the sequence acquisition unit 102 in step S1 is that of a non-GPI-anchored protein in any of the following cases: when the N-terminal-side hydrophobicity determination unit 106 determines in step S5 that the maximum of the calculated N-terminal-side average hydrophobicity values is less than the N-terminal-side hydrophobicity threshold (step S5: NO); when the N-terminal outside hydrophobicity determination unit 108 determines in step S8 that the maximum of the calculated N-terminal outside average hydrophobicity values is less than the N-terminal outside hydrophobicity threshold (step S8: NO); when the C-terminal-side maximum hydrophobic position determination unit 109 determines in step S9 that the position of the amino acid residue with the maximum N-terminal outside average hydrophobicity value is not within the region corresponding to the highly hydrophobic region on the C-terminal side (step S9: NO); or when it is determined in step S16 that the expected value output by the neural network 115 is less than 0.5 (step S16: NO).
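The decision flow of steps S5, S8, S9, and S16 to S18 can be summarized in a short sketch. The function and parameter names are hypothetical; the neural network is passed as a callable so that, as in the flow above, it is evaluated only for sequences that pass all three hydrophobicity checks.

```python
def classify(n_term_ok, outside_ok, c_position_ok, expected_value_fn, threshold=0.5):
    """Return the determination result for one sequence.

    n_term_ok, outside_ok, c_position_ok: outcomes of steps S5, S8, and S9.
    expected_value_fn: zero-argument callable returning the expected value of step S15.
    """
    if not (n_term_ok and outside_ok and c_position_ok):
        return "non-GPI-anchored"  # step S18, reached without running the network
    if expected_value_fn() >= threshold:  # step S16
        return "GPI-anchored"  # step S17
    return "non-GPI-anchored"  # step S18
```

An expected value of exactly 0.5 is classified as GPI-anchored, matching the "0.5 or more" condition of step S16.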

上述した動作により、ＧＰＩアンカー型タンパク質判定装置１００は、高感度且つ高選択的に検査対象タンパク質がＧＰＩアンカー型タンパク質であるか否かを判定することができる。
なお、ＧＰＩアンカー型タンパク質及び非ＧＰＩアンカー型タンパク質それぞれの判定精度を求める方法としては、ｎ−ｆｏｌｄｃｒｏｓｓｖａｌｉｄａｔｉｏｎ法（ｎ分割交差検定法）、ｂｏｏｔｓｔｒａｐ法、ｊａｃｋｋｎｉｆｅ法、Ｓｅｌｆ−ｃｏｎｓｉｓｔｅｎｃｙ（自己無撞着）な手法などを挙げることができる。ここで、判定精度とは、判定の感度、選択性、及び成功率のことを言う。
以下に、４分割交差検定法及び自己無撞着な手法について詳述する。 By the operation described above, the GPI anchor type protein determination apparatus 100 can determine whether or not the test target protein is a GPI anchor type protein with high sensitivity and high selectivity.
Methods for obtaining the determination accuracy for GPI-anchored proteins and for non-GPI-anchored proteins include the n-fold cross-validation method, the bootstrap method, the jackknife method, and the self-consistency method. Here, the determination accuracy refers to the sensitivity, selectivity, and success rate of the determination.
The 4-fold cross-validation method and the self-consistency method are described in detail below.

４分割交差検定法による判定精度とは、以下の処理により算出した判定精度である。
まず、既知のＧＰＩアンカー型タンパク質及び既知の非ＧＰＩアンカー型タンパク質のデータセットを４等分する。次に、分割したデータセットのうち３つの部分データセットを用いてＰＳＳＭを生成する。また、分割したデータセットのうち３つの部分データセットを用いてニューラルネットワーク１１５の学習を行う。次に、３つの部分データセットを用いて生成したＰＳＳＭに基づいて、他の１つの部分データセットの各エントリーのスコア数値列を生成する。次に、当該算出したスコアに基づいて、感度、選択性、成功率を算出する。そして、ＰＳＳＭを生成する部分データセットとスコアを算出する部分データセットとの全ての組み合わせに対して判定精度を算出し、それぞれの平均値をデータセット全体に対する判定精度として算出する。 The determination accuracy by the 4-fold cross-validation method is the determination accuracy calculated by the following processing.
First, the data set of known GPI-anchored proteins and known non-GPI-anchored proteins is divided into four equal parts. Next, a PSSM is generated using three of the four partial data sets. The neural network 115 is also trained using three of the partial data sets. Next, based on the PSSM generated from the three partial data sets, a score numerical sequence is generated for each entry of the remaining partial data set. Next, the sensitivity, selectivity, and success rate are calculated based on the calculated scores. The determination accuracy is then calculated for every combination of the partial data sets used to generate the PSSM and the partial data set used to calculate the scores, and the average of each measure is taken as the determination accuracy for the entire data set.
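The partitioning used in 4-fold cross-validation might be sketched as follows. The random shuffling, the `seed` parameter, and the function name are assumptions for illustration; the description does not state how entries are assigned to the four parts.

```python
import random


def k_fold_splits(n_items, k=4, seed=0):
    """Yield (train_indices, test_indices) for each of the k held-out parts."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)  # ASSUMED: random assignment to parts
    folds = [idx[i::k] for i in range(k)]  # k parts of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test
```

For each pair, the PSSM and the neural network would be built from the `train` entries and evaluated on the `test` entries, and the four resulting accuracy figures would then be averaged.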

自己無撞着な手法による判定精度とは、以下の処理により算出した判定精度である。
まず、上述したスコア判定閾値の決定方法と同様に、既知のＧＰＩアンカー型タンパク質及び既知の非ＧＰＩアンカー型タンパク質のデータセットを用いてＰＳＳＭを生成する。また、当該データセットを用いてニューラルネットワーク１１５の学習を行う。次に、当該ＰＳＳＭを用いて、ＰＳＳＭの生成に用いたデータセットの各エントリーのスコアを算出する。そして、当該算出したスコアに基づいてデータセット全体に対する判定精度を算出する。但し、本実施形態では、ニューラルネットワーク１１５が、既知のＧＰＩアンカー型タンパク質のスコア数値列を入力とした場合に、期待値として必ず１を出力し、既知の非ＧＰＩアンカー型タンパク質のスコア数値列を入力した場合に、期待値として必ず０を出力するように学習されている。そのため、自己無撞着な手法によって算出された感度、選択性、成功率は、すべて１００％となる。 The determination accuracy by the self-consistent method is a determination accuracy calculated by the following processing.
First, in the same manner as the score determination threshold determination method described above, a PSSM is generated using the data set of known GPI-anchored proteins and known non-GPI-anchored proteins. The neural network 115 is also trained using that data set. Next, the score of each entry of the data set used to generate the PSSM is calculated using that PSSM. The determination accuracy for the entire data set is then calculated based on the calculated scores. In the present embodiment, however, the neural network 115 has been trained so that it always outputs 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input, and always outputs 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input. Therefore, the sensitivity, selectivity, and success rate calculated by the self-consistency method are all 100%.

４分割交差検定法について、図１７〜図２０を用いて、さらに具体的に説明する。
図１７は、本実施形態によるＧＰＩアンカー型タンパク質判定装置の判定精度を示す第１の表である。
図１７では、ＧＰＩアンカー型タンパク質判定装置１００がＧＰＩアンカー型タンパク質であると判定した検査対象タンパク質の判定精度、及び非ＧＰＩアンカー型タンパク質であると判定した検査対象タンパク質の判定精度を示している。また、図１７に示すＧＰＩアンカー型タンパク質及び非ＧＰＩアンカー型タンパク質それぞれの判定精度を求めるにあたり、４分割交差検定法を用いた。 The 4-fold cross-validation method will be described more specifically with reference to FIGS. 17 to 20.
FIG. 17 is a first table showing the determination accuracy of the GPI anchor type protein determination apparatus according to the present embodiment.
FIG. 17 shows the determination accuracy of the test target protein determined by the GPI anchor type protein determination device 100 as a GPI anchor type protein and the determination accuracy of the test target protein determined as a non-GPI anchor type protein. In addition, a four-fold cross-validation method was used to determine the determination accuracy of each of the GPI anchor type protein and the non-GPI anchor type protein shown in FIG.

図１７に示すように、本実施形態による、ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、感度が９１．５％、選択性が９１．５％、成功率が０．９１５であった。また、非ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、感度が９８．２％、選択性が９３．１％、成功率が０．９５６であった。なお、図１７に示す判定制度は、１００回試行のうち、成功率が最高値のときのものである。 As shown in FIG. 17, the determination accuracy for GPI-anchored proteins by the 4-fold cross-validation method according to the present embodiment was a sensitivity of 91.5%, a selectivity of 91.5%, and a success rate of 0.915. The determination accuracy for non-GPI-anchored proteins by the 4-fold cross-validation method was a sensitivity of 98.2%, a selectivity of 93.1%, and a success rate of 0.956. The determination accuracy shown in FIG. 17 is that of the trial with the highest success rate among 100 trials.

図１８は、本実施形態によるＧＰＩアンカー型タンパク質判定装置の判定精度を示す第２の表である。
図１７では、１００回試行のうち、成功率が最高値のときの判別精度を示したが、図１８では、１００回試行のうち、成功率上位１０％の平均精度を示す。
図１８に示すように、本実施形態による、ＧＰＩアンカー型タンパク質の４分割交差検定法による成功率上位１０％の平均精度は、感度が９１．４％、選択性が９０．２％、成功率が０．９０７であった。また、非ＧＰＩアンカー型タンパク質の４分割交差検定法による成功率上位１０％の平均精度は、感度が９４．８％、選択性が９１．３％、成功率が０．９４９であった。このように、本実施形態によれば、成功率が最高の場合に限らず、平均的に高い判定精度を得ることができることが分かる。 FIG. 18 is a second table showing the determination accuracy of the GPI anchor type protein determination apparatus according to the present embodiment.
FIG. 17 shows the discrimination accuracy when the success rate is the highest value among the 100 trials, but FIG. 18 shows the average accuracy of the top 10% success rate among the 100 trials.
As shown in FIG. 18, in the present embodiment the average accuracy over the top 10% of trials by success rate in 4-fold cross-validation was, for GPI-anchored proteins, a sensitivity of 91.4%, a selectivity of 90.2%, and a success rate of 0.907. For non-GPI-anchored proteins, it was a sensitivity of 94.8%, a selectivity of 91.3%, and a success rate of 0.949. Thus, the present embodiment achieves high determination accuracy on average, and not only in the trial with the highest success rate.

以下に、基準位置を含む所定の範囲を変化させてＧＰＩアンカー型タンパク質の判定を行った場合の判定精度を示す。 The determination accuracy when the GPI anchor type protein is determined by changing a predetermined range including the reference position is shown below.

図１９は、基準位置を含む所定の範囲を基準位置から（−１２残基〜＋１２残基）を（−１０残基〜＋１２残基）に変更した場合の判定精度を示す表である。
図１９に示すように、所定の範囲を、基準位置からＮ末端側に１０残基、Ｃ末端側に１２残基の範囲とした場合の、ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、成功率が最高の場合、感度が９０．０％、選択性が９２．３％、成功率が０．９１１であった。また、１００回試行のうち成功率上位１０％の平均精度は、感度が９０．５％、選択性が９０．０％、成功率が０．９０１であった。
他方、非ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、成功率が最高の場合、感度が９５．５％、選択性が９４．７％、成功率が０．９５１であった。また、１００回試行のうち成功率上位１０％の平均精度は、感度が９４．５％、選択性が９４．９％、成功率が０．９４７であった。
図１９に示す本実施形態による判定精度（基準位置を含む所定の範囲を、基準位置からＮ末端側に１０残基、Ｃ末端側に１２残基の範囲とした場合の判定精度）を、図１７に示す本実施形態による判定精度（基準位置を含む所定の範囲を、基準位置からＮ末端側に１２残基、Ｃ末端側に１２残基の範囲とした場合の判定精度）と比較すると、ＧＰＩアンカー型タンパク質と非ＧＰＩ型タンパク質とで図１７に示す本実施形態による判定精度の方が感度と成功率が高いことが分かる。 FIG. 19 is a table showing determination accuracy when a predetermined range including the reference position is changed from (−12 residue to +12 residue) to (−10 residue to +12 residue) from the reference position.
As shown in FIG. 19, when the predetermined range is set to 10 residues on the N-terminal side and 12 residues on the C-terminal side of the reference position, the determination accuracy for GPI-anchored proteins by the 4-fold cross-validation method in the trial with the highest success rate was a sensitivity of 90.0%, a selectivity of 92.3%, and a success rate of 0.911. The average accuracy over the top 10% of the 100 trials by success rate was a sensitivity of 90.5%, a selectivity of 90.0%, and a success rate of 0.901.
For non-GPI-anchored proteins, the determination accuracy by the 4-fold cross-validation method in the trial with the highest success rate was a sensitivity of 95.5%, a selectivity of 94.7%, and a success rate of 0.951. The average accuracy over the top 10% of the 100 trials by success rate was a sensitivity of 94.5%, a selectivity of 94.9%, and a success rate of 0.947.
Comparing the determination accuracy shown in FIG. 19 (where the predetermined range including the reference position is 10 residues on the N-terminal side and 12 residues on the C-terminal side of the reference position) with that shown in FIG. 17 (where the range is 12 residues on the N-terminal side and 12 residues on the C-terminal side), it can be seen that the accuracy shown in FIG. 17 has the higher sensitivity and success rate for both GPI-anchored and non-GPI-anchored proteins.

図２０は、基準位置を含む所定の範囲を基準位置から（−１２残基〜＋１２残基）を（−１２残基〜＋９残基）に変更した場合の判定精度を示す表である。
図２０に示すように、所定の範囲を、基準位置からＮ末端側に１２残基、Ｃ末端側に９残基の範囲とした場合の、ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、成功率が最高の場合、感度が９２．９％、選択性が９０．５％、成功率が０．９１６であった。また、１００回試行のうち成功率上位１０％の平均精度は、感度が９０．８％、選択性が８９．４％、成功率が０．９００であった。
他方、非ＧＰＩアンカー型タンパク質の４分割交差検定法による判定精度は、成功率が最高の場合、感度が９４．９％、選択性が９６．２％、成功率が０．９５５であった。また、１００回試行のうち成功率上位１０％の平均精度は、感度が９４．２％、選択性が９５．０％、成功率が０．９４６であった。
図２０に示す本実施形態による判定精度（基準位置を含む所定の範囲を、基準位置からＮ末端側に１２残基、Ｃ末端側に９残基の範囲とした場合の判定精度）を、図１７に示す本実施形態による判定精度（基準位置を含む所定の範囲を、基準位置からＮ末端側に１２残基、Ｃ末端側に１２残基の範囲とした場合の判定精度）と比較すると、ＧＰＩアンカー型タンパク質では図２０に示す本実施形態による判定精度の方が感度と成功率が高いことが分かる。 FIG. 20 is a table showing determination accuracy when the predetermined range including the reference position is changed from the reference position (−12 residue to +12 residue) to (−12 residue to +9 residue).
As shown in FIG. 20, when the predetermined range is set to 12 residues on the N-terminal side and 9 residues on the C-terminal side of the reference position, the determination accuracy for GPI-anchored proteins by the 4-fold cross-validation method in the trial with the highest success rate was a sensitivity of 92.9%, a selectivity of 90.5%, and a success rate of 0.916. The average accuracy over the top 10% of the 100 trials by success rate was a sensitivity of 90.8%, a selectivity of 89.4%, and a success rate of 0.900.
For non-GPI-anchored proteins, the determination accuracy by the 4-fold cross-validation method in the trial with the highest success rate was a sensitivity of 94.9%, a selectivity of 96.2%, and a success rate of 0.955. The average accuracy over the top 10% of the 100 trials by success rate was a sensitivity of 94.2%, a selectivity of 95.0%, and a success rate of 0.946.
Comparing the determination accuracy shown in FIG. 20 (where the predetermined range including the reference position is 12 residues on the N-terminal side and 9 residues on the C-terminal side of the reference position) with that shown in FIG. 17 (where the range is 12 residues on the N-terminal side and 12 residues on the C-terminal side), it can be seen that, for GPI-anchored proteins, the accuracy shown in FIG. 20 has the higher sensitivity and success rate.

このように、本実施形態によれば、ＧＰＩアンカー型タンパク質判定装置１００は、ＰＳＳＭによって検査対象タンパク質のアミノ酸配列の各アミノ酸残基の位置特異的スコアを示すスコア数値列を生成する。そして、ニューラルネットワーク１１５が当該スコア数値列を入力し、ＧＰＩアンカー型タンパク質らしさを示す０以上１以下の期待値を出力することで検査対象タンパク質がＧＰＩアンカー型タンパク質であるか否かを判定する。ＰＳＳＭは、既知のＧＰＩアンカー型タンパク質のアミノ酸出現頻度と既知の非ＧＰＩアンカー型タンパク質のアミノ酸出現頻度とを用いて生成されるため、ＰＳＳＭから生成されたスコア数値列は、ＧＰＩアンカー型タンパク質らしさのみならず非ＧＰＩアンカー型タンパク質らしさをも示すこととなる。これにより、ＧＰＩアンカー型タンパク質判定装置１００は、高感度且つ高選択的に検査対象タンパク質がＧＰＩアンカー型タンパク質であるか否かを判定することができる。 As described above, according to the present embodiment, the GPI-anchored protein determination apparatus 100 uses the PSSM to generate a score numerical sequence indicating the position-specific score of each amino acid residue in the amino acid sequence of the test target protein. The neural network 115 then takes that score numerical sequence as input and outputs an expected value between 0 and 1 inclusive indicating how likely the input is to be a GPI-anchored protein, whereby it is determined whether or not the test target protein is a GPI-anchored protein. Because the PSSM is generated using both the amino acid appearance frequencies of known GPI-anchored proteins and those of known non-GPI-anchored proteins, the score numerical sequence generated from the PSSM indicates not only how closely the input resembles a GPI-anchored protein but also how closely it resembles a non-GPI-anchored protein. The GPI-anchored protein determination apparatus 100 can therefore determine with high sensitivity and high selectivity whether or not the test target protein is a GPI-anchored protein.

また、本実施形態によれば、Ｎ末端側疎水性判定部１０６、Ｎ末端外疎水性判定部１０８、及びＣ末端側最大疎水位置判定部１０９による判定処理をした後に、ニューラルネットワーク１１５による期待値の算出を行う。これにより、ニューラルネットワーク１１５の処理対象となるアミノ酸配列情報の量を減らすことができ、ニューラルネットワーク１１５による期待値算出処理の計算量が多い場合にも、処理の高速化を図ることができる。 Further, according to the present embodiment, after the determination processing by the N-terminal side hydrophobicity determination unit 106, the N-terminal outside hydrophobicity determination unit 108, and the C-terminal side maximum hydrophobic position determination unit 109, the expected value by the neural network 115 Is calculated. Thereby, the amount of amino acid sequence information to be processed by the neural network 115 can be reduced, and the processing speed can be increased even when the calculation amount of the expected value calculation processing by the neural network 115 is large.

以上、図面を参照してこの発明の一実施形態について詳しく説明してきたが、具体的な構成は上述のものに限られることはなく、この発明の要旨を逸脱しない範囲内において様々な設計変更等をすることが可能である。
例えば、本実施形態では、タンパク質の完全長アミノ酸配列情報を検査対象として判定を行ったが、これに限られず、完全長塩基配列情報を検査対象として判定を行っても良い。但し、この場合、ステップＳ１で配列取得部１０２が完全長塩基配列情報を取得した後、図示しない翻訳処理部が、常法によるイントロ配列の除去処理及びアミノ酸配列情報への翻訳処理を行い、当該アミノ酸配列情報を用いてステップＳ２以降の処理を行う。 As described above, the embodiment of the present invention has been described in detail with reference to the drawings. However, the specific configuration is not limited to the above, and various design changes and the like can be made without departing from the scope of the present invention. It is possible to
For example, in the present embodiment the determination is made with the full-length amino acid sequence information of a protein as the test target, but the determination is not limited to this and may instead be made with full-length base sequence information as the test target. In that case, after the sequence acquisition unit 102 acquires the full-length base sequence information in step S1, a translation processing unit (not shown) removes intron sequences and translates the result into amino acid sequence information by a conventional method, and the processing from step S2 onward is performed using that amino acid sequence information.
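The translation step alone might be sketched as below, using the standard genetic code. This sketch assumes that intron sequences have already been removed and that the input begins in the correct reading frame; it stops at the first stop codon and ignores any trailing partial codon. The function name is hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate an intron-free coding sequence (DNA or RNA letters) into one-letter residues."""
    cds = cds.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(cds) - 2, 3):
        residue = CODON_TABLE[cds[pos:pos + 3]]
        if residue == "*":  # stop codon ends translation
            break
        protein.append(residue)
    return "".join(protein)
```

The resulting one-letter amino acid string could then be passed to the processing from step S2 onward in place of an acquired amino acid sequence.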

また、本実施形態では、期待値を算出する分類部としてニューラルネットワーク１１５を用いる場合を説明したが、これに限られず、例えば、サポートベクターマシンや、ベイジアンネットワークなど、分類部として他の解析手法を用いても良い。 In this embodiment, the case where the neural network 115 is used as a classification unit for calculating an expected value has been described. However, the present invention is not limited to this, and other analysis methods may be used as a classification unit such as a support vector machine or a Bayesian network. It may be used.

また、本実施形態では、ニューラルネットワーク１１５が入力層Ｓ_１、隠れ層Ｓ_２、出力層Ｓ_３の３層構造である場合を説明したが、これに限られず、ニューラルネットワーク１１５が複数の隠れ層を有する４層以上の構造を有していても良い。但し、隠れ層の数が増えると、学習時に、最適解（期待値と教師データとの誤差が最小値となる値）に到達せずに、局所解（期待値と教師データとの誤差が極小値となる値）に陥り、最適な学習がなされない可能性がある。 In the present embodiment, the case where the neural network 115 has a three-layer structure consisting of the input layer S₁, the hidden layer S₂, and the output layer S₃ has been described. However, the neural network 115 is not limited to this and may have a structure of four or more layers with a plurality of hidden layers. Note that as the number of hidden layers increases, training may fail to reach the optimal solution (the values at which the error between the expected value and the teacher data is at its global minimum) and may instead fall into a local solution (values at which that error is only at a local minimum), so that optimal learning is not achieved.

また、本実施形態では、隠れ層のノード数と入力層のノード数とを同数とする場合を説明したが、これに限られず、隠れ層のノード数を入力層のノード数より多くしても良いし、隠れ層のノード数を入力層のノード数より少なくしても良い。但し、隠れ層のノード数を多くした場合、本実施形態と比較して、学習時に、局所解に陥る可能性が高くなり、また計算量が増える。また、隠れ層のノード数を少なくした場合、本実施形態と比較して計算量が減る一方、判別精度が低くなる。 In this embodiment, the number of hidden layer nodes and the number of input layer nodes are the same. However, the present invention is not limited to this, and the number of hidden layer nodes may be greater than the number of input layer nodes. The number of nodes in the hidden layer may be smaller than the number of nodes in the input layer. However, when the number of nodes in the hidden layer is increased, the possibility of falling into a local solution at the time of learning is increased and the amount of calculation is increased as compared with the present embodiment. Further, when the number of hidden layer nodes is reduced, the amount of calculation is reduced as compared with the present embodiment, but the discrimination accuracy is lowered.

The GPI-anchored protein determination apparatus 100 described above contains a computer system. The operation of each processing unit described above is stored, in the form of a program, on a computer-readable recording medium, and the processing is carried out when the computer reads and executes this program. Here, a computer-readable recording medium means a magnetic disk, a magneto-optical disk, a CD-ROM, a DVD-ROM, a semiconductor memory, or the like. The computer program may also be distributed to a computer over a communication line, and the computer that receives it may execute the program.

The program may implement only some of the functions described above. It may also be a so-called difference file (difference program), which implements those functions in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
100 ... GPI-anchored protein determination apparatus
101 ... sequence storage unit
102 ... sequence acquisition unit
103 ... hydrophobicity index value storage unit
104 ... hydrophobicity index value specifying unit
105 ... N-terminal-side hydrophobicity value calculation unit
106 ... N-terminal-side hydrophobicity determination unit
107 ... N-terminal-outside hydrophobicity value calculation unit
108 ... N-terminal-outside hydrophobicity determination unit
109 ... C-terminal-side maximum hydrophobic position determination unit
110 ... side chain size index value storage unit
111 ... side chain size index value specifying unit
112 ... side chain size calculation unit
113 ... PSSM storage unit
114 ... score numerical sequence generation unit
115 ... neural network
116 ... GPI-anchored protein determination unit

Claims

A GPI-anchored protein determination apparatus that determines whether or not a test target protein is a GPI-anchored protein, the apparatus comprising:
a sequence acquisition unit that acquires amino acid sequence information of the test target protein;
a side chain size calculation unit that specifies, in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus as a region including the propeptide region of known GPI-anchored proteins, extracts the amino acid residues in that region, and, using a required number for side chain size characteristic extraction, which is the number of residues used to average the side chain sizes of the amino acid residues in that region, calculates a plurality of average side chain sizes, each being the average of the side chain size index values of that required number of consecutive amino acid residues, while shifting by one residue at a time over the extracted amino acid residues;
a score numerical sequence generation unit that acquires position-specific scores, each indicating the degree to which a type of amino acid residue appears at an amino acid residue position of known GPI-anchored proteins, obtained from the frequency of occurrence of the types of amino acid residues present at positions in a predetermined region of known GPI-anchored proteins and the frequency of occurrence of the types of amino acid residues present at positions in the predetermined region of known non-GPI-anchored proteins; that specifies, based on the position-specific scores, the position-specific score of each amino acid residue in the partial sequence lying in a predetermined region consisting of a predetermined number of consecutive amino acid residues on the N-terminal side and the C-terminal side of a reference position, the reference position being the position at which the average side chain size calculated by the side chain size calculation unit is minimized; and that generates a score numerical sequence, which is a sequence of numerical values indicating the position-specific score of each amino acid residue;
a classification unit that receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating the likelihood of being a GPI-anchored protein, the classification unit having been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input; and
a GPI-anchored protein determination unit that determines that the test target protein is not a GPI-anchored protein when the expected value output by the classification unit is determined to be less than 0.5.
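The side chain size calculation recited in this claim can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the claimed implementation: the values in `TOY_SIZE_INDEX` are made up, the function names are hypothetical, and treating the first residue of the minimizing window as "the position" is a choice made here because the claim does not say which residue of the window is meant.

```python
from typing import Dict, List

# Hypothetical side chain size index values (illustration only, not from the source).
TOY_SIZE_INDEX: Dict[str, float] = {"G": 1.0, "A": 2.0, "S": 2.5, "L": 5.0, "F": 7.0}


def sliding_averages(values: List[float], window: int) -> List[float]:
    """Average each run of `window` consecutive values, shifting one residue at a time."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def min_average_position(region: str, size_index: Dict[str, float], window: int) -> int:
    """Return the start index of the window with the smallest average side chain size.

    Assumption: the first residue of the window stands for "the position";
    ties go to the window nearest the start of the region.
    """
    sizes = [size_index[residue] for residue in region]
    averages = sliding_averages(sizes, window)
    return min(range(len(averages)), key=averages.__getitem__)


if __name__ == "__main__":
    c_terminal_region = "LFGASLF"  # toy stand-in for the C-terminal residues
    print(min_average_position(c_terminal_region, TOY_SIZE_INDEX, window=3))
```

Here `window` plays the role of the required number for side chain size characteristic extraction, and the returned index would serve as the reference position around which the score numerical sequence is built.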

The GPI-anchored protein determination apparatus according to claim 1, wherein:
the classification unit is a neural network having a hierarchical structure that includes at least an input layer composed of the same number of nodes as the number of elements of the score numerical sequence generated by the score numerical sequence generation unit, a hidden layer composed of a plurality of nodes, and an output layer composed of one node;
each node of the input layer outputs, to each node of the hidden layer, the value of the element of the score numerical sequence associated with that node;
each node of the hidden layer substitutes the values output from the nodes of the input layer into a predetermined transfer function and outputs the obtained value to the node of the output layer; and
the node of the output layer substitutes the values output from the nodes of the hidden layer into a predetermined transfer function and outputs the obtained value as the expected value.

The GPI-anchored protein determination apparatus according to claim 2, wherein the classification unit is trained by changing the coefficients of the transfer functions of the nodes so that 1 is output as the expected value when the score numerical sequence of a known GPI-anchored protein is input, and by changing the coefficients of the transfer functions of the nodes so that 0 is output as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input.

The GPI-anchored protein determination device according to claim 2 or 3, wherein each of the nodes uses a sigmoid function as a transfer function.
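As a rough illustration of claims 2 to 4, the following Python sketch builds a three-layer network with sigmoid transfer functions, trains it toward 1 for positive examples and 0 for negative ones, and applies the 0.5 rule of claim 1. Several points are assumptions made here and not stated in the claims: each node is modeled as a sigmoid of a weighted sum plus a bias (the weights and biases standing in for the "coefficients of the transfer function"), the update rule is plain gradient descent on the squared error, and the class and function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Sigmoid transfer function."""
    return 1.0 / (1.0 + math.exp(-x))


class ThreeLayerNet:
    """Input layer -> one hidden layer -> single output node (illustrative sketch)."""

    def __init__(self, w_hidden: List[List[float]], b_hidden: List[float],
                 w_out: List[float], b_out: float) -> None:
        self.w_hidden, self.b_hidden = w_hidden, b_hidden
        self.w_out, self.b_out = w_out, b_out

    def _hidden(self, scores: Sequence[float]) -> List[float]:
        # Each hidden node receives every element of the score numerical sequence.
        return [sigmoid(sum(w * x for w, x in zip(ws, scores)) + b)
                for ws, b in zip(self.w_hidden, self.b_hidden)]

    def expected_value(self, scores: Sequence[float]) -> float:
        """Forward pass; the result lies between 0 and 1."""
        hidden = self._hidden(scores)
        return sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)

    def train_step(self, scores: Sequence[float], target: float, rate: float = 0.5) -> None:
        """One gradient-descent update on the squared error (out - target)^2 / 2."""
        hidden = self._hidden(scores)
        out = sigmoid(sum(w * h for w, h in zip(self.w_out, hidden)) + self.b_out)
        d_out = (out - target) * out * (1.0 - out)
        for j, h in enumerate(hidden):
            d_hid = d_out * self.w_out[j] * h * (1.0 - h)  # uses w_out before its update
            self.w_out[j] -= rate * d_out * h
            self.b_hidden[j] -= rate * d_hid
            for i, x in enumerate(scores):
                self.w_hidden[j][i] -= rate * d_hid * x
        self.b_out -= rate * d_out


def is_rejected(expected: float) -> bool:
    """Rule of claim 1: below 0.5 means 'not a GPI-anchored protein'."""
    return expected < 0.5


if __name__ == "__main__":
    net = ThreeLayerNet([[0.1, -0.2], [0.3, 0.2]], [0.0, 0.0], [0.2, -0.1], 0.0)
    for _ in range(2000):
        net.train_step([1.0, 0.0], target=1.0)  # toy "known GPI-anchored" example
        net.train_step([0.0, 1.0], target=0.0)  # toy "known non-GPI-anchored" example
    print(net.expected_value([1.0, 0.0]), net.expected_value([0.0, 1.0]))
```

The two-element toy inputs stand in for real score numerical sequences, whose length would fix the number of input nodes. Because gradient descent can settle in a local solution, the initial coefficients and the learning rate matter in practice.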

The GPI-anchored protein determination apparatus according to any one of claims 1 to 4, wherein the required number for side chain size characteristic extraction is the value that, when the average side chain size is calculated over the small side chain size determination regions of a plurality of known GPI-anchored proteins using that required number, maximizes the number of those GPI-anchored proteins for which the amino acid residue adjacent, on the C-terminal side, to the amino acid residue at which the calculated average side chain size is smallest is the GPI anchor modification site.

The GPI-anchored protein determination apparatus according to claim 5, wherein the small side chain size determination region is a region including the position at which the average side chain size of known GPI-anchored proteins is minimized.

The GPI-anchored protein determination apparatus according to any one of claims 1 to 6, wherein the position-specific score is computed from f_ip^positive, the frequency of occurrence of amino acid residue type i at position p in the predetermined region whose reference position is the position at which the average side chain size is minimized in a plurality of known GPI-anchored proteins, and f_ip^negative, the frequency of occurrence of amino acid residue type i at position p in the predetermined region whose reference position is the position at which the average side chain size is minimized in a plurality of known non-GPI-anchored proteins.

The GPI-anchored protein determination apparatus according to claim 7, wherein the frequency of occurrence of amino acid residue type i at position p in the predetermined region is computed from n_ip, the number of known GPI-anchored proteins in which an amino acid residue of type i is present at position p; ε, an adjustment value for the frequency of occurrence; and s, the number of types of amino acid residues.

The GPI-anchored protein determination apparatus according to claim 1, further comprising:
an N-terminal-side hydrophobicity value calculation unit that specifies, in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the N-terminus as a region corresponding to the N-terminal-side highly hydrophobic region of known GPI-anchored proteins, extracts the amino acid residues in that region, and, using a required number for N-terminal-side hydrophobic characteristic extraction, which is the number of residues used to average the hydrophobicity values of the amino acid residues in that region, calculates a plurality of N-terminal-side average hydrophobicity values, each being the average of the hydrophobicity index values of that required number of consecutive amino acid residues, while shifting by one residue at a time over the extracted amino acid residues; and
an N-terminal-side hydrophobicity determination unit that determines whether or not the maximum of the plurality of N-terminal-side average hydrophobicity values calculated by the N-terminal-side hydrophobicity value calculation unit is greater than or equal to an N-terminal-side hydrophobicity threshold, which represents the characteristic of the N-terminal-side average hydrophobicity value in known GPI-anchored proteins,
wherein the side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit perform their processing on amino acid sequence information for which the N-terminal-side hydrophobicity determination unit has determined that the maximum of the N-terminal-side average hydrophobicity values calculated by the N-terminal-side hydrophobicity value calculation unit is greater than or equal to the N-terminal-side hydrophobicity threshold.

The GPI-anchored protein determination apparatus according to claim 9, wherein the N-terminal-side hydrophobicity threshold is the minimum value in the set of maximum N-terminal-side average hydrophobicity values calculated in advance for a plurality of known GPI-anchored proteins.
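The threshold rule of claim 10 and the pre-filter of claim 9 can be sketched in Python as follows. This is a hedged illustration: the function names are hypothetical, the toy sequences are invented, and the Kyte-Doolittle hydropathy values listed are used only as an example index, with no claim that this is the index the apparatus stores.

```python
from typing import Dict, List

# Kyte-Doolittle hydropathy values for a few residues, used only as an example
# index; the index actually stored by the apparatus is not assumed here.
EXAMPLE_HYDROPHOBICITY: Dict[str, float] = {
    "L": 3.8, "A": 1.8, "G": -0.4, "S": -0.8, "K": -3.9,
}


def max_window_average(region: str, index: Dict[str, float], window: int) -> float:
    """Largest average hydrophobicity over all runs of `window` consecutive residues."""
    if not 1 <= window <= len(region):
        raise ValueError("window must be between 1 and len(region)")
    values = [index[residue] for residue in region]
    return max(sum(values[i:i + window]) / window
               for i in range(len(values) - window + 1))


def threshold_from_known(known_regions: List[str], index: Dict[str, float],
                         window: int) -> float:
    """Minimum, over known GPI-anchored proteins, of each protein's maximum average."""
    return min(max_window_average(region, index, window) for region in known_regions)


def passes_filter(region: str, index: Dict[str, float], window: int,
                  threshold: float) -> bool:
    """True when the maximum average is greater than or equal to the threshold."""
    return max_window_average(region, index, window) >= threshold


if __name__ == "__main__":
    known = ["LLAK", "ALKS"]  # toy N-terminal regions of "known" proteins
    t = threshold_from_known(known, EXAMPLE_HYDROPHOBICITY, window=2)
    print(t, passes_filter("GSKK", EXAMPLE_HYDROPHOBICITY, 2, t))
```

Taking the minimum of the per-protein maxima means every known GPI-anchored protein used to derive the threshold passes the filter by construction.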

The GPI-anchored protein determination apparatus according to claim 9 or 10, wherein the required number for N-terminal-side hydrophobic characteristic extraction is the value that minimizes the following count: using that required number, the N-terminal-side average hydrophobicity value is calculated over the amino acid residues in the N-terminal-side highly hydrophobic region of each of a plurality of known GPI-anchored proteins, and the minimum value in the set of maximum N-terminal-side average hydrophobicity values calculated from those known GPI-anchored proteins is extracted; using the same required number, the N-terminal-side average hydrophobicity value is calculated over the amino acid residues in the region, of each of a plurality of known non-GPI-anchored proteins, corresponding to the N-terminal-side highly hydrophobic region of known GPI-anchored proteins; and the count is the number of known non-GPI-anchored proteins whose maximum N-terminal-side average hydrophobicity value is larger than the extracted minimum value.
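The selection rule of claim 11 amounts to a small search over candidate window sizes. The sketch below, in Python, works directly on lists of hydrophobicity index values so that no particular index is assumed; the function names, the candidate list, and the tie-breaking rule (the first candidate with the smallest count wins) are assumptions made for illustration.

```python
from typing import List, Sequence


def max_window_average(values: Sequence[float], window: int) -> float:
    """Largest average over all runs of `window` consecutive values."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    return max(sum(values[i:i + window]) / window
               for i in range(len(values) - window + 1))


def false_pass_count(positives: List[Sequence[float]],
                     negatives: List[Sequence[float]], window: int) -> int:
    """Count known non-GPI-anchored proteins whose maximum exceeds the threshold.

    The threshold is the minimum of the per-protein maxima of the known
    GPI-anchored proteins, computed with the same window.
    """
    threshold = min(max_window_average(v, window) for v in positives)
    return sum(1 for v in negatives if max_window_average(v, window) > threshold)


def choose_window(positives: List[Sequence[float]],
                  negatives: List[Sequence[float]],
                  candidates: Sequence[int]) -> int:
    """Pick the candidate window with the fewest false passes (first one on ties)."""
    return min(candidates, key=lambda w: false_pass_count(positives, negatives, w))


if __name__ == "__main__":
    positives = [[4.0, 4.0, 4.0, 0.0], [3.0, 4.0, 4.0, 1.0]]  # toy values
    negatives = [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 1.0]]  # toy values
    print(choose_window(positives, negatives, candidates=[1, 2, 3]))
```

In the toy data a window of one residue lets both negatives through because each contains a single very hydrophobic residue, while wider windows average that spike away.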

The GPI-anchored protein determination apparatus according to claim 1, further comprising:
an N-terminal-outside hydrophobicity value calculation unit that specifies, in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the N-terminus as a region corresponding to the N-terminal-side highly hydrophobic region of known GPI-anchored proteins, extracts the amino acid residues outside that region, and, using a required number for N-terminal-outside hydrophobic characteristic extraction, which is the number of residues used to average the hydrophobicity values of the amino acid residues outside that region, calculates a plurality of N-terminal-outside average hydrophobicity values, each being the average of the hydrophobicity index values of that required number of consecutive amino acid residues, while shifting by one residue at a time over the extracted amino acid residues; and
an N-terminal-outside hydrophobicity determination unit that determines whether or not the maximum of the plurality of N-terminal-outside average hydrophobicity values calculated by the N-terminal-outside hydrophobicity value calculation unit is greater than or equal to an N-terminal-outside hydrophobicity threshold, which represents the characteristic of the N-terminal-outside average hydrophobicity value in known GPI-anchored proteins,
wherein the side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit perform their processing on amino acid sequence information for which the N-terminal-outside hydrophobicity determination unit has determined that the maximum of the N-terminal-outside average hydrophobicity values calculated by the N-terminal-outside hydrophobicity value calculation unit is greater than or equal to the N-terminal-outside hydrophobicity threshold.

The GPI-anchored protein determination apparatus according to claim 12, wherein the N-terminal-outside hydrophobicity threshold is the minimum value in the set of maximum N-terminal-outside average hydrophobicity values calculated in advance for a plurality of known GPI-anchored proteins.

The GPI-anchored protein determination apparatus according to claim 12 or 13, wherein the required number for N-terminal-outside hydrophobic characteristic extraction is the value that minimizes the following count: using that required number, the N-terminal-outside average hydrophobicity value is calculated over the amino acid residues outside the N-terminal-side highly hydrophobic region of each of a plurality of known GPI-anchored proteins, and the minimum value in the set of maximum N-terminal-outside average hydrophobicity values calculated from those known GPI-anchored proteins is extracted; using the same required number, the N-terminal-outside average hydrophobicity value is calculated over the amino acid residues, of each of a plurality of known non-GPI-anchored proteins, outside the region corresponding to the N-terminal-side highly hydrophobic region of known GPI-anchored proteins; and the count is the number of known non-GPI-anchored proteins whose maximum N-terminal-outside average hydrophobicity value is larger than the extracted minimum value.

The GPI-anchored protein determination apparatus according to any one of claims 9 to 11, wherein the region corresponding to the N-terminal-side highly hydrophobic region of known GPI-anchored proteins is a region including the position at which the N-terminal-side average hydrophobicity value is maximized in known GPI-anchored proteins.

The GPI-anchored protein determination apparatus according to any one of claims 12 to 14, further comprising:
a C-terminal-side maximum hydrophobic position determination unit that specifies a region of a predetermined number of residues from the C-terminus of the amino acid sequence information as a region corresponding to the C-terminal-side highly hydrophobic region of known GPI-anchored proteins, and determines whether or not the position of the amino acid residue at which the N-terminal-outside average hydrophobicity value calculated by the N-terminal-outside hydrophobicity value calculation unit is maximized lies within the specified region,
wherein the side chain size calculation unit, the score numerical sequence generation unit, the classification unit, and the GPI-anchored protein determination unit perform their processing on amino acid sequence information for which the C-terminal-side maximum hydrophobic position determination unit has determined that the position of the amino acid residue at which the N-terminal-outside average hydrophobicity value calculated by the N-terminal-outside hydrophobicity value calculation unit is maximized lies within the region corresponding to the C-terminal-side highly hydrophobic region of known GPI-anchored proteins.
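The position check of claim 16 can be sketched in Python as below. The sketch is illustrative only: the function and parameter names are hypothetical, the input is a plain list of hydrophobicity index values, and the first residue of the maximizing window is taken as "the position" because the claim does not specify which residue of the window is meant.

```python
from typing import Sequence


def argmax_window_start(values: Sequence[float], window: int) -> int:
    """Start index of the window with the largest average (first one on ties)."""
    if not 1 <= window <= len(values):
        raise ValueError("window must be between 1 and len(values)")
    averages = [sum(values[i:i + window]) / window
                for i in range(len(values) - window + 1)]
    return max(range(len(averages)), key=averages.__getitem__)


def peak_in_c_terminal_region(values: Sequence[float], n_region_len: int,
                              c_region_len: int, window: int) -> bool:
    """Check whether the hydrophobicity peak outside the N-terminal region
    falls within the last `c_region_len` residues of the sequence."""
    outside = values[n_region_len:]  # residues outside the N-terminal region
    peak = n_region_len + argmax_window_start(outside, window)  # index in full sequence
    return peak >= len(values) - c_region_len


if __name__ == "__main__":
    toy = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 1.0]  # toy index values
    print(peak_in_c_terminal_region(toy, n_region_len=3, c_region_len=4, window=2))
```

A sequence that fails this check would be screened out before the side chain size calculation and the classification are carried out.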

The GPI-anchored protein determination apparatus according to claim 16, wherein the region corresponding to the C-terminal-side highly hydrophobic region of known GPI-anchored proteins is a region including the position at which the N-terminal-outside average hydrophobicity value is maximized within the part of known GPI-anchored proteins outside the region corresponding to the N-terminal-side highly hydrophobic region.

A determination method using a GPI-anchored protein determination apparatus that determines whether or not a test target protein is a GPI-anchored protein, wherein:
a sequence acquisition unit of the GPI-anchored protein determination apparatus acquires amino acid sequence information of the test target protein;
a side chain size calculation unit of the GPI-anchored protein determination apparatus specifies, in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus as a region including the propeptide region of known GPI-anchored proteins, extracts the amino acid residues in that region, and, using a required number for side chain size characteristic extraction, which is the number of residues used to average the side chain sizes of the amino acid residues in that region, calculates a plurality of average side chain sizes, each being the average of the side chain size index values of that required number of consecutive amino acid residues, while shifting by one residue at a time over the extracted amino acid residues;
a score numerical sequence generation unit of the GPI-anchored protein determination apparatus acquires position-specific scores, each indicating the degree to which a type of amino acid residue appears at an amino acid residue position of known GPI-anchored proteins, obtained from the frequency of occurrence of the types of amino acid residues present at positions in a predetermined region of known GPI-anchored proteins and the frequency of occurrence of the types of amino acid residues present at positions in the predetermined region of known non-GPI-anchored proteins; specifies, based on the position-specific scores, the position-specific score of each amino acid residue in the partial sequence lying in a predetermined region consisting of a predetermined number of consecutive amino acid residues on the N-terminal side and the C-terminal side of a reference position, the reference position being the position at which the average side chain size calculated by the side chain size calculation unit is minimized; and generates a score numerical sequence, which is a sequence of numerical values indicating the position-specific score of each amino acid residue;
a classification unit of the GPI-anchored protein determination apparatus, which has been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input, receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating whether or not the protein is a GPI-anchored protein; and
a GPI-anchored protein determination unit of the GPI-anchored protein determination apparatus determines that the test target protein is not a GPI-anchored protein when the expected value output by the classification unit is determined to be less than 0.5.

A determination program that causes a GPI-anchored protein determination apparatus, which determines whether or not a test target protein is a GPI-anchored protein, to function as:
a sequence acquisition unit that acquires amino acid sequence information of the test target protein;
a side chain size calculation unit that specifies, in the amino acid sequence information acquired by the sequence acquisition unit, a region of a predetermined number of residues from the C-terminus as a region including the propeptide region of known GPI-anchored proteins, extracts the amino acid residues in that region, and, using a required number for side chain size characteristic extraction, which is the number of residues used to average the side chain sizes of the amino acid residues in that region, calculates a plurality of average side chain sizes, each being the average of the side chain size index values of that required number of consecutive amino acid residues, while shifting by one residue at a time over the extracted amino acid residues;
a score numerical sequence generation unit that acquires position-specific scores, each indicating the degree to which a type of amino acid residue appears at an amino acid residue position of known GPI-anchored proteins, obtained from the frequency of occurrence of the types of amino acid residues present at positions in a predetermined region of known GPI-anchored proteins and the frequency of occurrence of the types of amino acid residues present at positions in the predetermined region of known non-GPI-anchored proteins; that specifies, based on the position-specific scores, the position-specific score of each amino acid residue in the partial sequence lying in a predetermined region consisting of a predetermined number of consecutive amino acid residues on the N-terminal side and the C-terminal side of a reference position, the reference position being the position at which the average side chain size calculated by the side chain size calculation unit is minimized; and that generates a score numerical sequence, which is a sequence of numerical values indicating the position-specific score of each amino acid residue;
a classification unit that receives the score numerical sequence generated by the score numerical sequence generation unit and outputs an expected value of 0 or more and 1 or less indicating whether or not the protein is a GPI-anchored protein, the classification unit having been trained to output 1 as the expected value when the score numerical sequence of a known GPI-anchored protein is input and to output 0 as the expected value when the score numerical sequence of a known non-GPI-anchored protein is input; and
a GPI-anchored protein determination unit that determines that the test target protein is not a GPI-anchored protein when the expected value output by the classification unit is determined to be less than 0.5.