JPH08137826A

JPH08137826A - Secondary structure predicting device

Info

Publication number: JPH08137826A
Application number: JP27522094A
Authority: JP
Inventors: Fumiyoshi Sasagawa; 文義笹川; Toru Nakagawa; 徹中川
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1994-11-09
Filing date: 1994-11-09
Publication date: 1996-05-31

Abstract

PURPOSE: To predict secondary structure with high accuracy, taking chemical and physical properties into account, by feeding a neural network with data that a preprocessing layer has coded using the properties of amino acid residues. CONSTITUTION: The preprocessing layer 1 takes an amino acid residue sequence as input and generates data coded using the properties of each residue in the sequence. The neural network 2 consists of an input layer that receives the data generated by the preprocessing layer 1, an intermediate layer, and an output layer, and it learns and predicts secondary structure. Specifically, an amino acid residue sequence is input to the preprocessing layer 1 as teacher data, each residue in the input sequence is replaced with its corresponding chemical and physical properties, and the result is input to the input layer of the neural network 2. While the difference between the output signal of the output layer and the teacher signal representing the secondary structure is larger than a specified value, learning is repeated so that the error decreases.

Description

Detailed Description of the Invention

[0001]

[Industrial Field of Application] The present invention relates to a secondary structure prediction device that predicts secondary structure from an amino acid residue sequence.

[0002] In protein engineering, it is desirable to predict the three-dimensional structure (secondary structure) of a protein from the sequence order (primary structure) of the amino acid residues that constitute it.

[0003]

[Prior Art] Knowing the molecular structure of a protein, which is a biopolymer (ultimately, its entire structure in three-dimensional space), and understanding that structure in connection with the protein's function in the living body, is important as a basis for physiological understanding, medical treatment, drug development, and the like. A protein molecule is known to have an irregular structure: 20 kinds of amino acid residues are linked in sequence to a length of several tens to several hundreds, the resulting chain is cross-linked in places, and it is further held together by hydrogen bonds at many points.

[0004] With current science and technology, the three-dimensional structure of a protein is determined by X-ray crystal structure analysis or by NMR (nuclear magnetic resonance). Several hundred proteins currently have structures determined experimentally by these methods. Parts of these protein structures consist of characteristic helical structures (such as the α-helix), sheet-like structures (such as the β-sheet), bends (such as turns), and irregular portions (coil); these partial structures are called the secondary structure of the protein. The sequence order of the amino acid residues in a protein (called the primary structure), on the other hand, can be determined experimentally with relative ease, and tens of thousands of sequences are now known across various proteins.

[0005] It would therefore be useful to be able to estimate a protein's three-dimensional structure and function from its primary structure (the sequence order of its amino acid residues). As one step toward this goal, research on estimating secondary structure (partial three-dimensional structure) from primary structure has been carried out in many places. The basic approach is to analyze and learn the relationship between primary and secondary structure for a large number of protein molecules whose three-dimensional structure is known, and then to use that knowledge to estimate (predict), from its primary structure, the secondary structure of a protein molecule whose three-dimensional structure is not yet known. Such a method is called a protein secondary structure prediction method.

[0006] In recent years, methods using neural networks, which are modeled on the nervous systems of living organisms, have come into use for pattern recognition and have also been applied to the prediction of protein secondary structure. In one such method, as shown in FIG. 9 for example, a layered network of an input layer, an intermediate layer, and an output layer is used. Training data are given to the input layer, the resulting values from the output layer are compared with the values of the training data, and the weight coefficients and other parameters of each layer are adjusted repeatedly until the error is at or below a threshold. After training, the amino acid residue sequence of an unknown protein is input and the prediction is obtained from the output layer.

[0007]

[Problems to Be Solved by the Invention] Conventional protein secondary structure prediction using a neural network takes as input a part of the protein's amino acid residue sequence (usually an odd number, L, of consecutive residues) and codes that run of residues with 20 amino acid type codes (plus special codes indicating uncertainty and the like). In this coding, the type of each amino acid has conventionally been represented by preparing one input cell for each of the 20 kinds of amino acids, turning on (value 1) only the one input cell that corresponds to the amino acid type, and leaving the other input cells off (value 0). The output of the neural network for this input is made to correspond to the type of secondary structure at the center position of the partial sequence used as input (the sequence inside the input window). The input window is then scanned so that the input partial sequence is taken in turn from the beginning of the protein chain (N-terminus) to its end (C-terminus).

[0008] As a result of studying the neural-network-based protein secondary structure prediction described above, the present inventors found that the conventional coding of an amino acid residue sequence merely distinguishes the residues by representing each with an independent bit, and expresses nothing about the properties of the residues or the similarities among them. More specifically, representing each of the 20 kinds of amino acid residues by an independent bit only distinguishes the individual types by what amount to arbitrarily chosen names, abbreviations, letters, or numbers, and the neural network does not understand the meaning of those names or abbreviations. That is, the properties of each residue are not coded (in other words, the coding says only that "all residues have separate properties"). There was therefore the problem that no information at all is coded about which residues resemble or differ from which others in their various chemical and physical properties, or about how the residues are ordered with respect to a given property.

[0009] To solve these problems, the present invention aims to code amino acid residues in a preprocessing layer in a way that expresses their chemical, physical, and other properties, to use the coded data as the input to a neural network, and thereby to realize secondary structure prediction that takes those properties into account.

[0010]

[Means for Solving the Problems] FIG. 1 is a block diagram of the principle of the present invention. In FIG. 1, the preprocessing layer 1 takes an amino acid residue sequence as input and generates data coded using the properties of each amino acid residue in the sequence.

[0011] The neural network 2 consists of an input layer, which receives the data generated by the preprocessing layer 1, an intermediate layer, and an output layer, and it learns and predicts secondary structure.

[0012]

[Operation] In the present invention, as shown in FIG. 1, the preprocessing layer 1 takes an amino acid residue sequence as input, generates data coded using the properties of each amino acid residue in the sequence, and supplies the data to the input layer of the neural network 2. Based on the data received at the input layer, the neural network outputs the predicted secondary structure from the output layer by way of the intermediate layer.

[0013] In doing so, the preprocessing layer 1 may use chemical properties as the properties of the amino acid residues, generating the coded data and inputting it to the input layer of the neural network 2.

[0014] The preprocessing layer 1 may also use physical properties as the properties of the amino acid residues, generating the coded data and inputting it to the input layer of the neural network 2.

[0015] The preprocessing layer 1 may also apply a linear or nonlinear transformation to the properties of the amino acid residues, generate data coded as the resulting integers or real numbers, and input it to the input layer of the neural network 2.

[0016] The preprocessing layer 1 may also generate data coded by combining two or more of the following as the properties of the amino acid residues: chemical properties, physical properties, and integers or real numbers obtained by a linear or nonlinear transformation. It then inputs the data to the input layer of the neural network 2.

[0017] Accordingly, by coding the amino acid residues in the preprocessing layer in a way that expresses their chemical, physical, and other properties, and by using the coded data as the input to the neural network to output the secondary structure prediction, secondary structure can be predicted with high accuracy while taking those properties into account.

[0018]

[Embodiments] Next, the configuration and operation of embodiments of the present invention are described in detail, in order, with reference to FIGS. 2 to 8.

[0019] FIG. 2 shows the learning flowchart of the present invention. It shows the procedure for training the neural network with teacher data under the configuration of FIG. 1. In FIG. 2, S1 inputs the teacher data (an already known amino acid residue sequence) to the preprocessing layer from the left side of FIG. 1.

[0020] S2 replaces each amino acid residue with its properties in the preprocessing layer. Referring to the amino acid residue information table shown in FIG. 3, described later, it replaces each residue of the input amino acid residue sequence with the corresponding chemical properties, physical properties, and other properties.

[0021] S3 inputs the data to the input layer of the neural network. That is, the data in which S2 replaced the amino acid residues with their properties is input to the input layer of the neural network 2.

[0022] S4 obtains the difference between the teacher signal and the output signal at the output layer. That is, it obtains the difference (error) between the output data paired with the input data in the teacher data given in S1 (the teacher signal) and the output signal from the output layer of the neural network 2.

[0023] S5 determines whether the difference obtained in S4 is smaller than a predetermined value. If YES, the difference is smaller than the predetermined value and learning ends. If NO, the difference is larger than the predetermined value and learning is not yet sufficient, so in S6 the weights ω and thresholds θ of the intermediate layer and the output layer are adjusted by learning with the backpropagation method, and the procedure returns to S1 and repeats.

[0024] In summary, the teacher data consist of an amino acid residue sequence and its known secondary structure. The amino acid residue sequence is input to the preprocessing layer 1 of FIG. 1, which converts it into data in which each residue is replaced with its properties, and this data is input to the input layer of the neural network. The difference between the output signal (secondary structure) from the output layer and the teacher signal representing the known secondary structure in the teacher data is then obtained. While this difference is larger than the predetermined value, learning is repeated: the weights ω and thresholds θ of the intermediate layer and the output layer are adjusted by the backpropagation method in the direction that reduces the error. Learning ends when the difference becomes smaller than the predetermined value.
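The learning procedure of FIG. 2 can be illustrated with a small three-layer network trained by backpropagation. The sketch below is an assumption-laden toy, not the patent's implementation: the network size, learning rate, tolerance, sigmoid activation, and all function names are hypothetical, the input vectors stand in for data already produced by the preprocessing layer, and the stopping test is applied once per pass over the teacher data rather than per sample.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, net):
    """Return (hidden activations, output activations) for input vector x."""
    w_hid, th_hid, w_out, th_out = net
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) - th)
         for row, th in zip(w_hid, th_hid)]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h)) - th)
         for row, th in zip(w_out, th_out)]
    return h, y


def train(samples, n_hidden=3, tolerance=0.2, rate=0.5, max_epochs=5000, seed=0):
    """Adjust weights (omega) and thresholds (theta) until the largest
    teacher/output difference seen in a pass drops below `tolerance`."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
    th_hid, th_out = [0.0] * n_hidden, [0.0] * n_out
    net = (w_hid, th_hid, w_out, th_out)
    for epoch in range(1, max_epochs + 1):
        worst = 0.0
        for x, teacher in samples:                       # S1 to S3
            h, y = forward(x, net)
            worst = max(worst, max(abs(t - o) for t, o in zip(teacher, y)))  # S4
            # Error terms for squared error with sigmoid units.
            d_out = [(o - t) * o * (1 - o) for o, t in zip(y, teacher)]
            d_hid = [hj * (1 - hj) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j, hj in enumerate(h)]
            for k in range(n_out):                       # S6: output layer
                for j in range(n_hidden):
                    w_out[k][j] -= rate * d_out[k] * h[j]
                th_out[k] += rate * d_out[k]             # unit input is w.h - theta
            for j in range(n_hidden):                    # S6: intermediate layer
                for i in range(n_in):
                    w_hid[j][i] -= rate * d_hid[j] * x[i]
                th_hid[j] += rate * d_hid[j]
        if worst < tolerance:                            # S5
            return net, epoch
    return net, max_epochs
```

For instance, `train([([1, 0], [1, 0]), ([0, 1], [0, 1])])` trains on two toy patterns and returns the adjusted network together with the number of passes used.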

[0025] FIG. 3 shows an example of the amino acid residue information table of the present invention. It is an example of known chemical information, physical information, and other information on the amino acid residues and, as described for S2 of FIG. 2, it is used to replace the residues of an amino acid residue sequence with their chemical, physical, and other properties.

[0026] Here, in the columns "acidic", "basic", and "Cys?", the values 0 and 1 denote NO and YES respectively, and 0.5 denotes that neither can be determined. The hydrophobicity index expresses the degree of hydrophobicity (a negative value indicates hydrophilicity). The surface area, as shown in FIG. 4 described later, is the area of the surface traced by the center of a water molecule as it moves while in contact with the side chain.

[0027] As described above, the chemical, physical, and other properties of each amino acid residue are registered in advance in the table of FIG. 3. When the amino acid residue sequence of FIG. 1 is input, each residue in the sequence can then be replaced with its chemical, physical, and other properties and output.
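The table lookup performed by the preprocessing layer can be sketched as a dictionary keyed by residue. The entries below are illustrative only and are not the contents of FIG. 3, which is not reproduced in the text: the hydropathy values are taken from the published Kyte-Doolittle scale (the Arg value of -4.5 agrees with the one quoted for FIG. 5), the acidic, basic, and Cys flags follow standard chemistry, the surface-area column is left out, and the names `PROPERTY_TABLE` and `preprocess` are hypothetical.

```python
# Illustrative subset of a residue property table in the spirit of FIG. 3.
# Columns: (acidic, basic, is_cys, hydropathy)
PROPERTY_TABLE = {
    "D": (1, 0, 0, -3.5),  # aspartic acid
    "E": (1, 0, 0, -3.5),  # glutamic acid
    "K": (0, 1, 0, -3.9),  # lysine
    "R": (0, 1, 0, -4.5),  # arginine
    "C": (0, 0, 1, 2.5),   # cysteine
    "A": (0, 0, 0, 1.8),   # alanine
    "L": (0, 0, 0, 3.8),   # leucine
}


def preprocess(sequence):
    """Replace each residue letter with its property tuple (step S2)."""
    return [PROPERTY_TABLE[residue] for residue in sequence]
```

Calling `preprocess("RAC")` returns three property tuples, one per residue, ready to be flattened into input cells.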

[0028] FIG. 4 illustrates the definition of the surface area of an amino acid residue in the present invention. As shown schematically, the surface area of a residue is the area of the surface traced by the center of a water molecule as it moves while in contact with the residue's side chain (the surface area, or solvent-exposed surface area).

[0029] FIG. 5 is a diagram explaining the properties used in the present invention. It presents, in a visually accessible form, a classification of the amino acid residues by the hydrophilicity/hydrophobicity index (hydropathy) on the horizontal axis and the solvent-exposed surface area on the vertical axis. The hydropathy index on the horizontal axis is a normalized measure of the energy difference between the residue being in the solvent and not being in it. The vertical axis shows the solvent-exposed surface area of the residue (in square angstroms). For example, Arg has a solvent-exposed surface area of 225 and a hydrophobicity index of -4.5. Cys is treated specially because two Cys residues can form a disulfide bond (S-S bond), which the other residues cannot.

[0030] FIG. 6 shows the configuration of the present invention when the input cells handle only Boolean values. This is the configuration in which the input cells of the neural network's input layer handle only the value 0 or the value 1.

[0031] In FIG. 6, the amino acid residue sequence is the sequence of residues input to the preprocessing layer 1. The preprocessing layer 1 takes the amino acid residue sequence as input and supplies the input layer of the neural network 2 with data in which each residue has been replaced with its chemical, physical, and other properties.

[0032] The neural network 2 consists of an input layer, an intermediate layer, and an output layer. In the intermediate and output layers, the weights ω and thresholds θ can each be adjusted freely. A known amino acid residue sequence, which is the input part of the teacher data, is input to the preprocessing layer 1; the output signal from the output layer is compared with the output part of the teacher data; and learning is performed by adjusting the weights ω and thresholds θ of the intermediate and output layers by the backpropagation method so that the difference is at or below a threshold. Here, the cells of each layer handle only the value 0 or the value 1.

[0033] Embodiments using the configuration of FIG. 6 are described in detail below, in order.

(1) Example 1: The input window on the protein's amino acid residue sequence is 13 residues wide. For each input residue, the preprocessing layer 1 encodes the residue into a plurality of bits by a predetermined method, according to the residue type, so as to express its classification by chemical, physical, and other properties, and sends the resulting signal to the input layer of the neural network 2. In the neural network 2, the input-layer cells handle only the value 0 or the value 1. For the system consisting of the preprocessing layer 1 and the neural network of input, intermediate, and output layers, teacher data are input to the preprocessing layer 1 and learning is performed by the backpropagation method so that the difference between the output signal from the output layer (the secondary structure) and the teacher signal in the teacher data is at or below a threshold. After learning, an unknown amino acid residue sequence is input to the preprocessing layer 1 and the prediction is obtained from the output layer.

[0034] (2) Example 2: As a concrete instance of the encoding of Example 1, residues are encoded using the following chemical and physical properties of the amino acid residues.

[0035] ① Hydrophilicity/hydrophobicity is given priority: the index of Doolittle is divided into 9 sections, which are represented with 9 bits by the independent bit assignment method.

② Next, to introduce a measure of residue size, the solvent-exposed surface area is divided into 10 sections, which are represented with 10 bits by the independent bit assignment method.

[0036] ③ The auxiliary symbols indicating uncertainty and the like (Z, B, X, !, and non-amino-acid) are each represented by 1 bit (5 bits in total).

In Example 2, each amino acid residue is thus represented with 24 bits in total. This encoding essentially expresses and classifies the residues using just two chemical and physical properties. However, because each of the two properties is subdivided and represented by the independent bit assignment method, the representation has a strongly "sectioned" character.
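A minimal sketch of the 24-bit layout of Example 2 follows. The section boundaries are not given in the text, so the equal-width sections, the surface-area range of 0 to 250, the all-zero property bits for auxiliary symbols, and every name in the code are assumptions made only for illustration; the hydropathy range of -4.5 to 4.5 is the span of the Kyte-Doolittle scale.

```python
HYDROPATHY_RANGE = (-4.5, 4.5)   # span of the Kyte-Doolittle scale
AREA_RANGE = (0.0, 250.0)        # hypothetical span for surface area
AUX_SYMBOLS = ["Z", "B", "X", "!", "other"]  # "other" = non-amino-acid


def section_one_hot(value, lo, hi, n_sections):
    """Independent bit assignment: turn on only the bit of value's section.

    Assumption: equal-width sections; the patent's boundaries are not stated.
    """
    width = (hi - lo) / n_sections
    index = min(n_sections - 1, max(0, int((value - lo) / width)))
    return [1 if i == index else 0 for i in range(n_sections)]


def encode_example2(hydropathy=None, area=None, aux=None):
    """Return 24 bits: 9 hydropathy + 10 surface area + 5 auxiliary."""
    if aux is not None:
        # Assumption: property bits stay off for auxiliary symbols.
        return [0] * 19 + [1 if symbol == aux else 0 for symbol in AUX_SYMBOLS]
    return (section_one_hot(hydropathy, *HYDROPATHY_RANGE, 9)
            + section_one_hot(area, *AREA_RANGE, 10)
            + [0] * 5)
```

With these assumed sections, the Arg values quoted for FIG. 5 (hydropathy -4.5, area 225) turn on the first hydropathy bit and the last surface-area bit.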

[0037] (3) Example 3: As a concrete instance of the encoding of Example 1, residues are encoded using the following information from the chemical structural formulas of the amino acid residues.

[0038] ① As the size of the residue, the number of heavy atoms in the side chain is used; it is divided into 8 sections (counts of 0, 1, 2, 3, 4, 5, 6-7, and 8 or more) and represented with 7 bits in bar-graph form.

[0039] ② Possible hindrance of side-chain rotation (1 bit) and ring formation in proline (1 bit).

③ Presence of a CO group (1 bit), presence of a COOH group (1 bit), presence of an NH₂ or NH group and whether it is single or multiple (2 bits), and presence of an SH group (1 bit).

[0040] ④ Presence of S (1 bit) and presence of an SH group (1 bit).

⑤ Ionic character (1 bit), hydrophilicity (1 bit), and hydrophobicity (1 bit).

[0041] ⑥ Aromaticity (1 bit) and heterocyclic character (1 bit).

⑦ Uncertainty (1 bit for the symbols Z, B, and X together) and chain break (1 bit).

[0042] In Example 3, each amino acid residue is represented with 23 bits in total. This encoding is based on information from the chemical structural formulas. It adopts the idea that various chemical and physical properties arise from, or correlate substantially with, features of chemical structure, and it expresses these by grouping. It is a representation in which the "sectioning" aspect is strong.
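The bar-graph representation of side-chain size in Example 3 can be read as a cumulative (thermometer) code, in which a larger count turns on more bits. That reading is an assumption, as are the names below; the section boundaries come from the text (0, 1, 2, 3, 4, 5, 6-7, 8 or more). The bit budget at the end simply re-adds the per-item bit counts listed for Example 3.

```python
# Lower bounds of sections 2..8; section 1 (count 0) leaves every bit off.
SECTION_LOWER_BOUNDS = [1, 2, 3, 4, 5, 6, 8]


def bar_graph_bits(heavy_atoms):
    """Encode a side-chain heavy-atom count as 7 cumulative bits."""
    return [1 if heavy_atoms >= bound else 0 for bound in SECTION_LOWER_BOUNDS]


# Bits per item as listed for Example 3.
BIT_BUDGET = {
    "size (bar graph)": 7,
    "rotation hindrance, proline ring": 2,
    "CO, COOH, NH2/NH (2 bits), SH": 5,
    "S, SH": 2,
    "ionic, hydrophilic, hydrophobic": 3,
    "aromatic, heterocyclic": 2,
    "uncertainty, chain break": 2,
}
TOTAL_BITS = sum(BIT_BUDGET.values())
```

Under this reading, a count of 0 (as for glycine) gives all zeros, counts of 6 and 7 give the same code, and `TOTAL_BITS` agrees with the 23 bits stated in the text.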

[0043] (4) Example 4: As a concrete instance of the encoding of Example 1, residues are encoded using the following chemical and physical properties of the amino acid residues.

[0044] ① Hydrophilic/hydrophobic character is given priority: the index of Doolittle is divided into 4 sections and represented with 3 bits by the section bit method.

② Next, as a measure of residue size, the solvent-exposed surface area is divided into 5 sections and represented with 4 bits by the section bit representation.

[0045] ③ As a special element, cysteine, which forms main-chain cross-links, is distinguished from the other residues (1 bit).

④ Residues having a COOH group (or CO group) (acidic) and residues having an NH₂ group (or NH group) (basic) are each identified as a group (2 bits).

[0046] ⑤ When it is unknown whether a residue is glutamic acid or glutamine (symbol Z), it is represented as a compromise between the two. The case in which it is unknown whether a residue is aspartic acid or asparagine (symbol B) is represented in the same way.

[0047] ⑥ When the residue is indeterminate (symbol X), the hydropathy index and the solvent-exposed surface area are represented so as to be the weighted average over all amino acid residues.

FIG. 5 illustrates the classification of Example 4. All the amino acid residues are plotted with the hydropathy index of ① on the horizontal axis and the solvent-exposed surface area of ② on the vertical axis. Cysteine is marked ◎, acidic residues □, basic residues △, and the indeterminate case of ⑥ ×. The vertical and horizontal lines show how ① and ② are sectioned, and the range over which each input-layer bit is turned on is shown outside the frame. In FIG. 5, residues in the same grid cell are regarded as identical with respect to properties ① and ②.

There are many possible choices for exactly which properties to adopt and how to section them, both for chemical behavior toward the solvent (here, hydrophilicity/hydrophobicity) and for residue size (here, solvent-exposed surface area). When relatively coarse sections are used, as in Example 4, differences of detail need not matter much. To supplement this coarse classification, the properties of ③ and ④ bring in further elements for distinguishing residues. In addition, the methods of the items above make bits for the auxiliary symbols indicating uncertainty and the like (Z, B, X, !) unnecessary, and inessential distinctions are dropped.

[0048] In Example 4, 3 bits are used for the hydropathy index and 4 bits for the size index; together with the 3 bits for ③ and ④, the 20 kinds of amino acid residues are classified and represented with 10 bits in total. It is a representation that keeps sectioning to a minimum and focuses on broad grouping and ordering.
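One possible reading of the compromise coding for the ambiguous symbols Z and B, and of the weighted average for X, is a weighted mean of the property values of the candidate residues. The sketch below follows that reading only as an assumption: equal weights are a placeholder, the hydropathy values are from the Kyte-Doolittle scale, and `PROPS` and `blend` are hypothetical names. How such a mean would then be mapped onto the section bits of Example 4 is not specified in the text and is not attempted here.

```python
# (acidic, basic, is_cys, hydropathy); hydropathy from the Kyte-Doolittle scale
PROPS = {
    "E": (1, 0, 0, -3.5),  # glutamic acid
    "Q": (0, 0, 0, -3.5),  # glutamine
    "D": (1, 0, 0, -3.5),  # aspartic acid
    "N": (0, 0, 0, -3.5),  # asparagine
}


def blend(residues, weights=None):
    """Weighted mean of each property over the candidate residues.

    Assumption: equal weights unless given; the patent's weights are unknown.
    """
    if weights is None:
        weights = [1.0] * len(residues)
    total = sum(weights)
    columns = zip(*(PROPS[residue] for residue in residues))
    return tuple(sum(w * v for w, v in zip(weights, column)) / total
                 for column in columns)


Z = blend(["E", "Q"])  # Glu or Gln
B = blend(["D", "N"])  # Asp or Asn
```

With equal weights the acidic flag of `Z` comes out as 0.5, which is consistent with the use of 0.5 in the FIG. 3 table for a value that cannot be determined.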

[0049] FIG. 7 shows the configuration of the present invention when the input cells can handle both Boolean values and real values in [0, 1]. This is the configuration in which the input cells of the neural network's input layer can handle not only the value 0 or the value 1 but also real values between 0 and 1.

[0050] In FIG. 7, the amino acid residue sequence is the sequence of residues input to the preprocessing layer 1. The preprocessing layer 1 takes the amino acid residue sequence as input and supplies the input layer of the neural network 2 with data in which each residue has been replaced with its chemical, physical, and other properties.

【００５１】The neural network 2 consists of an input layer, an intermediate layer, and an output layer. The weights ω and thresholds θ of the intermediate layer and the output layer can be adjusted freely. A known amino acid group sequence, which is the input part of the teacher data, is input to the preprocessing layer 1, and the output signal from the output layer is compared with the output part of the teacher data. Learning is performed by adjusting the weights ω and thresholds θ of the intermediate layer and the output layer by the backpropagation method so that the difference falls to or below a threshold value. Among the cells of each layer, □ denotes a cell that handles only the value 0 or the value 1, and ■ denotes a cell that handles real values from 0 to 1.
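The learning procedure in this paragraph can be illustrated with a small sketch. It is not the patent's implementation: the sigmoid cell function, the squared-error gradient, the learning rate, the epoch limit, and every function name (`init_layer`, `forward`, `update`, `train`) are assumptions made for illustration. Only the overall shape comes from the text: three layers, adjustable weights ω and thresholds θ in the intermediate and output layers, and repetition until the difference from the teacher output is no larger than a given value.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Random weights (omega) and thresholds (theta) for one layer."""
    omega = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    theta = [rng.uniform(-0.5, 0.5) for _ in range(n_out)]
    return omega, theta

def forward(inputs, layer):
    """Each cell outputs sigmoid(sum(omega * input) - theta) (assumed cell function)."""
    omega, theta = layer
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) - th)
            for row, th in zip(omega, theta)]

def update(layer, inputs, deltas, rate):
    """One gradient step on the weights and thresholds of a layer."""
    omega, theta = layer
    for k, d in enumerate(deltas):
        for j, x in enumerate(inputs):
            omega[k][j] += rate * d * x
        theta[k] -= rate * d

def train(samples, n_hidden, tolerance, rate=0.5, max_epochs=5000, seed=0):
    """Repeat backpropagation until the largest error in an epoch is <= tolerance."""
    rng = random.Random(seed)
    hidden = init_layer(len(samples[0][0]), n_hidden, rng)
    output = init_layer(n_hidden, len(samples[0][1]), rng)
    worst = float("inf")
    for _ in range(max_epochs):
        worst = 0.0
        for x, target in samples:
            h = forward(x, hidden)
            y = forward(h, output)
            worst = max(worst, max(abs(t - o) for t, o in zip(target, y)))
            # Error terms for the output layer, then propagated back to the hidden layer.
            d_out = [(t - o) * o * (1 - o) for t, o in zip(target, y)]
            d_hid = [hj * (1 - hj) * sum(d * row[j] for d, row in zip(d_out, output[0]))
                     for j, hj in enumerate(h)]
            update(output, h, d_out, rate)
            update(hidden, x, d_hid, rate)
        if worst <= tolerance:
            break
    return hidden, output, worst
```

Here `samples` is a list of `(input_vector, teacher_output)` pairs; in the device, each input vector would be the coded data produced by the preprocessing layer 1.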

【００５２】Embodiments using the configuration of FIG. 7 are described in detail below, in order.
(5) Embodiment 5: Depending on the type of the input amino acid group, the input cells are used for a bit representation that classifies chemical and physical properties and, at the same time, for representing chemical and physical properties as numerical values. The input layer of the neural network 2 combines cells that handle only the value 0 or the value 1 with cells that handle real values in [0, 1] (values from 0 to 1).

【００５３】(6) Embodiment 6: This embodiment encodes numerically what Embodiment 4 encoded as a bit representation. That is, the hydrophilicity/hydrophobicity index and the solvent-exposed surface area are handled as numerical values and encoded as real values in [0, 1], as follows.

【００５４】To represent the hydrophilic/hydrophobic character, the Doolittle index, with range [−4.5, +4.5], is linearly transformed and expressed as a real number in [0, 1].
As the characteristic of amino acid group size, the solvent-exposed surface area is used, and the interval [50 Å², 300 Å²] is linearly transformed and expressed as a real number in [0, 1].
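The two linear transformations above amount to x' = (x − a) / (b − a) for an interval [a, b]. A minimal sketch follows; the function names are hypothetical, and values outside the stated intervals are not clipped, since the text does not say how they should be treated.

```python
def rescale(value, low, high):
    """Linearly map `value` from the interval [low, high] onto [0, 1]."""
    return (value - low) / (high - low)

def encode_hydropathy(index):
    """Hydrophilicity/hydrophobicity index in [-4.5, +4.5] -> real number in [0, 1]."""
    return rescale(index, -4.5, 4.5)

def encode_surface_area(area):
    """Solvent-exposed surface area in [50, 300] square angstroms -> real number in [0, 1]."""
    return rescale(area, 50.0, 300.0)
```

With this mapping, the most hydrophilic end of the scale goes to 0, the most hydrophobic end to 1, and the midpoint of each interval to 0.5.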

【００５５】As a special element, cysteine, which forms cross-links between main chains, is distinguished from the others (1 bit).
Those having a COOH group (or CO group) (acidic) and those having an NH₂ group (or NH group) (basic) are each identified as a group (2 bits).
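Taken together, the elements described so far give each amino acid group two real-valued cells and three Boolean cells. The sketch below shows one way such an input could be assembled. The cell order, the function name `input_cells`, and the membership of the acidic and basic sets (D and E as acidic; K, R, and H as basic) are assumptions for illustration; the text does not list which amino acid groups belong to each class.

```python
CYSTEINE = {"C"}
ACIDIC = {"D", "E"}       # assumed members: aspartic acid, glutamic acid
BASIC = {"K", "R", "H"}   # assumed members: lysine, arginine, histidine

def input_cells(symbol, hydropathy01, area01):
    """Return five input cells for one amino acid group.

    hydropathy01 and area01 are the already rescaled [0, 1] values;
    the last three cells are Boolean flags stored as 0.0 or 1.0.
    """
    return [
        hydropathy01,                          # real-valued cell
        area01,                                # real-valued cell
        1.0 if symbol in CYSTEINE else 0.0,    # 1 bit: cysteine
        1.0 if symbol in ACIDIC else 0.0,      # 2 bits: acidic / basic
        1.0 if symbol in BASIC else 0.0,
    ]
```

Storing the flags as 0.0 and 1.0 lets the Boolean cells and the real-valued cells share one input vector, which matches the mixed □ and ■ cells of FIG. 7.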

【００５６】When it is unknown whether a group is glutamic acid or glutamine (symbol Z), it is represented by the average of the two values. The same applies when it is unknown whether a group is aspartic acid or asparagine (symbol B).
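The averaging rule for Z and B, together with the weighted average that paragraph 【００５７】 specifies for the undetermined symbol X, might be sketched as follows. The function name, the dictionary layout, and the weights are hypothetical; the text does not state which weights the weighted average uses, so they are passed in as a parameter.

```python
def encode_symbol(symbol, values, weights):
    """Encoded value of one symbol, including the ambiguity codes Z, B and X.

    values:  encoded [0, 1] value for each definite one-letter code
    weights: relative weight of each definite code (hypothetical; the
             text does not say which weights are used)
    """
    if symbol == "Z":   # glutamic acid (E) or glutamine (Q)
        return (values["E"] + values["Q"]) / 2.0
    if symbol == "B":   # aspartic acid (D) or asparagine (N)
        return (values["D"] + values["N"]) / 2.0
    if symbol == "X":   # undetermined: weighted average over all groups
        total = sum(weights[s] for s in values)
        return sum(values[s] * weights[s] for s in values) / total
    return values[symbol]
```

The same function can be applied separately to each real-valued property, for example once to the rescaled hydrophilicity/hydrophobicity index and once to the rescaled surface area.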

【００５７】When the amino acid group is undetermined (symbol X), it is represented by the weighted average over all amino acid groups.
FIG. 8 shows the teacher data of the present invention and an example of its learning results. It shows the teacher data and the corresponding learning results, namely the learning results for the amino acid group sequence of the protein denoted by the abbreviation 1CPV and its secondary structure. Here, the amino acid groups are replaced in advance by their properties in the preprocessing layer 1 and then input to the input layer of the neural network. H denotes an α-helix, E a β-sheet, and a hyphen (−) a coil. The symbol o indicates that learning succeeded.

【００５８】FIG. 9 shows a comparison of the protein secondary structure prediction results of the neural network of the present invention. Here, Embodiments 1 to 6 are the embodiments already described in (1) to (6).

【００５９】[0059]

[Effects of the Invention] As described above, according to the present invention, the amino acid groups are coded in the preprocessing layer so as to express their physical, chemical, and other properties, and the coded data is input to a neural network that outputs the secondary structure prediction. Secondary structure can therefore be predicted with high accuracy, taking physical, chemical, and other properties into account. Thus, instead of learning and predicting from blind patterns as in conventional secondary structure prediction by neural networks, the present invention makes it possible to learn and predict with high accuracy by incorporating the chemical and physical properties of each amino acid group constituting the amino acid group sequence.

[Brief description of drawings]

FIG. 1 is a block diagram of the principle of the present invention.

FIG. 2 is a learning flowchart of the present invention.

FIG. 3 is an example of an amino acid group information table of the present invention.

FIG. 4 is a diagram explaining the definition of the surface area of an amino acid group in the present invention.

FIG. 5 is a diagram explaining the properties in the present invention.

FIG. 6 is a configuration diagram for the case where the input cells of the present invention handle only Boolean values.

FIG. 7 is a configuration diagram for the case where the input cells of the present invention can handle Boolean values and real values in [0, 1].

FIG. 8 shows the teacher data of the present invention and an example of its learning results.

FIG. 9 is an explanatory diagram of the prior art.

[Explanation of symbols]

1: Preprocessing layer
2: Neural network

Claims

[Claims]

1. A secondary structure prediction device comprising: a preprocessing layer (1) which receives an amino acid group sequence as input and generates data coded using the properties of each amino acid group constituting the amino acid group sequence; and a neural network (2) including an input layer for inputting the data generated by the preprocessing layer (1), an intermediate layer, and an output layer.

2. The secondary structure prediction device according to claim 1, wherein the coded data is generated by using a chemical property as a property of the amino acid group.

3. The secondary structure prediction device according to claim 1, wherein physical properties are used as properties of the amino acid group to generate coded data.

4. The secondary structure prediction device according to claim 1, wherein the property of the amino acid group is subjected to linear or non-linear conversion and the coded data is generated using an integer or a real number.

5. A secondary structure prediction device comprising a combination of any two or more of claims 2 to 4.