JP2008100918A

JP2008100918A - Similarity calculation processing system, processing method and program of the same

Info

Publication number: JP2008100918A
Application number: JP2006282101A
Authority: JP
Inventors: Tsutomu Osouda; 勉襲田
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2006-10-17
Filing date: 2006-10-17
Publication date: 2008-05-01

Abstract

PROBLEM TO BE SOLVED: To provide a similarity calculation processing system capable of accurately extracting, for example, compounds that are not similar to each other as a whole but are partially similar, and to provide a processing method and program for the same.

SOLUTION: The similarity calculation processing system includes: a training data memory device 101 for storing inputted training data; a prediction data memory device 102 for storing data for prediction; at least one descriptor selecting device 103 for extracting a part of the descriptors of the input training data and of the prediction data; a similarity calculating device 104, provided corresponding to each descriptor selecting device 103, for calculating a similarity between the training data and the prediction data; and a calculation result collecting device 105 for calculating, for each item of data stored in the prediction data memory device, a similarity from the calculation results of the similarity calculating devices 104.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to a similarity calculation processing system for similar compounds, and to a processing method and program for the same. More specifically, it relates to a similarity calculation processing system, processing method, and program that are effective when searching for similar compounds in drug discovery.

Similarity calculation is widely known as a technique for finding similar data and is used in a variety of settings.

For example, the compound estimation method disclosed in Patent Document 1 estimates compounds that have pharmacological activity or toxicity toward an unknown cell from the gene information of known cells whose gene information is similar to that of the unknown cell.

The compound alignment method disclosed in Patent Document 2 determines the alignment of compounds by calculating the similarity of alignments between compounds from three-dimensional structure information.

The common structure extraction device disclosed in Patent Document 3 finds corresponding common points in two different three-dimensional structures and thereby calculates the similarity of the overall structures.

The Tanimoto coefficient is a well-known measure of the similarity of compounds. By using a bit of 0 or 1 to indicate the presence or absence of a given structure, and preparing a plurality of such descriptors, a compound can be represented as a string of 0 and 1 bits. The Tanimoto coefficient for such bit strings is defined as follows; the closer its value is to 1, the more similar the compounds are judged to be, and the closer it is to 0, the less similar.

Assume that the two compounds are represented as shown in Formula 1 below.

Given the representation in Formula 1, the Tanimoto coefficient is defined as in Formula 2 below.

In Formula 2, "and" and "or" each denote a bitwise operation and are defined as in Formula 3 below.
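Formulas 1 to 3 are not reproduced in this text. As an illustration only, the following minimal sketch uses the commonly used bit-string form of the Tanimoto coefficient (bits set in both strings divided by bits set in at least one), which matches the "and"/"or" description above. The function name `tanimoto` and the handling of two all-zero strings are assumptions, not taken from the source.

```python
from typing import Sequence


def tanimoto(x: Sequence[int], y: Sequence[int]) -> float:
    """Tanimoto coefficient of two equal-length 0/1 bit strings."""
    if len(x) != len(y):
        raise ValueError("bit strings must have the same length")
    both = sum(a & b for a, b in zip(x, y))    # bits set in both strings ("and")
    either = sum(a | b for a, b in zip(x, y))  # bits set in at least one ("or")
    # Assumption: two all-zero strings are scored 0.0 to avoid dividing by zero.
    return both / either if either else 0.0


if __name__ == "__main__":
    # Two compounds described by four structural descriptors each.
    print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))
```

Identical non-empty bit strings score 1.0 and strings with no set bit in common score 0.0, in line with the interpretation given in the text.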

In drug discovery and development, it is necessary to search compound libraries containing hundreds of thousands to millions of compounds for compounds that are active against a target. In such cases, similarity is calculated on the basis of information on active compounds described in the literature, such as patents, and assay experiments are carried out only on compounds with high similarity. This makes it possible to find active compounds with high probability without testing every compound in the library, and it is one of the in silico screening techniques.

Conventional similarity calculation has used either some distance function (Euclidean, binary, and so on) or a measure such as the Tanimoto coefficient described above, even if it does not satisfy the axioms of a distance. When a distance function is used, data at a short distance are regarded as highly similar; when the Tanimoto coefficient is used, data with a value closer to 1 are regarded as more similar.

Patent Document 1: JP 2006-107395 A
Patent Document 2: JP 2006-134252 A
Patent Document 3: JP-A-7-287717
Non-Patent Document 1: Byung-Je Sung et al., Nature, vol. 425, pp. 98-102, September (2003).

As shown in Patent Documents 1, 2, and 3, conventional similarity calculation methods and devices calculate similarity by capturing overall features.

When screening compounds, compounds whose overall features are similar can be said to have similar properties, but the overall features need not always be similar. As the lock-and-key analogy for proteins and drugs suggests, a compound that has a certain specific feature may act on a protein even if it is not similar overall. In such cases, the similarity calculation methods described above may fail to pick out active compounds that are dissimilar as a whole but partially similar.

A technical object of the present invention is therefore to provide a similarity calculation processing system, and a processing method and program for the same, capable of accurately extracting, for example, compounds that are dissimilar as a whole but partially similar.

According to the present invention, there is provided a similarity calculation processing method comprising: storing input training data; storing data for prediction; extracting a part of the descriptors of the input training data and prediction data; calculating the similarity between the training data and the prediction data; and calculating, from the results of the similarity calculation, a similarity for each item of the stored prediction data.

Also according to the present invention, there is provided the above similarity calculation processing method, wherein the similarity calculation is performed using the Tanimoto coefficient.

Also according to the present invention, there is provided any of the above similarity calculation processing methods, wherein the similarity calculation outputs the maximum value among the calculated similarities.

Also according to the present invention, there is provided any one of the above similarity calculation processing methods, wherein, in calculating the similarity, a mean and a variance are calculated by aggregating the values corresponding to the stored prediction data, and a final numerical value is output using a function of those two values.

Also according to the present invention, there is provided any one of the above similarity calculation processing methods, wherein the descriptors are extracted a plurality of times, and the same number of random numbers is generated in each of the extraction operations.

Also according to the present invention, there is provided any one of the above similarity calculation processing methods, wherein, in calculating the similarity, a final numerical value is output using a function of the distance from a predetermined plane.

Also according to the present invention, there is provided a similarity calculation processing system comprising: a training data storage device for storing input training data; a prediction data storage device for storing data for prediction; at least one descriptor selection device for extracting a part of the descriptors of the input training data and prediction data; similarity calculation devices, each provided corresponding to a respective descriptor selection device, that calculate the similarity between the training data and the prediction data; and a calculation result totaling device that calculates, from the results calculated by the similarity calculation devices, a similarity for each item of data stored in the prediction data storage device.

Also according to the present invention, there is provided the above similarity calculation processing system, wherein the similarity calculation device calculates similarity using the Tanimoto coefficient.

Also according to the present invention, there is provided any of the above similarity calculation processing systems, wherein the similarity calculation device outputs the maximum value among the calculated similarities.

Also according to the present invention, there is provided any one of the above similarity calculation processing systems, wherein the calculation result totaling device calculates a mean and a variance by aggregating the values corresponding to the data stored in the prediction data storage device, and outputs a final numerical value using a function of those two values.

Also according to the present invention, there is provided any one of the above similarity calculation processing systems, wherein each of the descriptor selection devices generates the same number of random numbers as the other descriptor selection devices.

Also according to the present invention, there is provided any one of the above similarity calculation processing systems, wherein the calculation result totaling device outputs a final numerical value using a function of the distance from a predetermined plane.

Also according to the present invention, there is provided a similarity calculation processing program comprising: training data storage means for storing input training data; prediction data storage means for storing data for prediction; at least one descriptor selection means for extracting a part of the descriptors of the input training data and prediction data; similarity calculation means, each provided corresponding to a respective descriptor selection device, that calculate the similarity between the training data and the prediction data; and calculation result totaling means that calculates, from the results calculated by the similarity calculation device, a similarity for each item of data stored in the prediction data storage device.

Also according to the present invention, there is provided the above similarity calculation processing program, wherein the similarity calculation means calculates similarity using the Tanimoto coefficient.

Also according to the present invention, there is provided any of the above similarity calculation processing programs, wherein the similarity calculation means outputs the maximum value among the calculated similarities.

Also according to the present invention, there is provided any one of the above similarity calculation processing programs, wherein the calculation result totaling means calculates a mean and a variance by aggregating the values corresponding to the data stored in the prediction data storage device, and outputs a final numerical value using a function of those two values.

Also according to the present invention, there is provided any one of the above similarity calculation processing programs, wherein each of the descriptor selection means generates the same number of random numbers as the other descriptor selection means.

Also according to the present invention, there is provided any one of the above similarity calculation processing programs, wherein the calculation result totaling device outputs a final numerical value using a function of the distance from a predetermined plane.

The present invention can provide a similarity calculation processing system, and a processing method and program for the same, capable of accurately extracting, for example, compounds that are dissimilar as a whole but partially similar.

Embodiments of the present invention will now be described with reference to the drawings.

FIG. 1 shows the schematic configuration of a similarity calculation processing apparatus according to an embodiment of the present invention. Referring to FIG. 1, the similarity calculation processing apparatus 100 comprises: a training data storage device 101 for storing input training data; a prediction data storage device 102 for storing data for prediction; descriptor selection devices 103 for extracting a part of the descriptors of the input training data and prediction data; similarity calculation devices 104 that calculate the similarity between the training data and the prediction data and output the value with the highest similarity; and a calculation result totaling device 105 that, for each item of data stored in the prediction data storage device 102, calculates the mean, variance, and the like of the results calculated by the similarity calculation devices 104 and calculates a similarity from them.

In FIG. 1, descriptor selection devices 103a, 103b, ... (hereinafter collectively 103) are attached to the respective similarity calculation devices 104a, 104b, ... (hereinafter collectively 104), and each is depicted as operating independently.

Because they operate independently, it is not strictly necessary to install a plurality of similarity calculation devices 104 and descriptor selection devices 103; a single descriptor selection device 103 and a single similarity calculation device 104 may instead repeat the same processing sequentially.

The calculation result totaling device 105 receives the similarities calculated by the similarity calculation devices 104, calculates their mean and variance internally, and calculates a similarity from those indices.

FIG. 2 shows the specific configuration of the descriptor selection device 103. Referring to FIG. 2, the descriptor selection device 103 comprises a descriptor extraction device 201 that randomly generates numbers corresponding to the descriptors to be selected from the input training data and prediction data, and a data conversion device 202 that selects the descriptors corresponding to the generated numbers from the input training data and prediction data and converts the data.

Next, the operation of the similarity calculation processing apparatus 100 of FIG. 1 will be described with reference to FIG. 3. FIG. 3 is a diagram used to explain the operation of the similarity calculation processing apparatus 100 of FIG. 1.

Referring to FIG. 3, training data and prediction data are input to the descriptor selection device (step 301).

The input training data are stored in the training data storage device 101, and the input prediction data are stored in the prediction data storage device 102 (step 302).

The input training data and prediction data must use the same descriptors, listed in the same order. In this description each item of data corresponds to one row, but the data need not take that form; each item of data may instead correspond to one column.

When each item of data corresponds to a column, "row" in the following description should be read as "column" and "column" as "row".

The training data and prediction data stored in the training data storage device 101 and the prediction data storage device 102 are each sent to the descriptor selection device 103.

The internal operation of the descriptor selection device 103 will now be described with reference again to FIG. 2. Suppose, for example, that the data shown in Formula 4 below are sent from the training data storage device 101 and the prediction data storage device 102.

Because the training data and the prediction data have the same representation, only one of them is shown in the example above. In practice the same processing is applied to the training data and the prediction data input at the same time; for simplicity, only one is shown here.

Suppose that, for the data input to the descriptor selection device 103, the descriptor extraction device 201 generates the random numbers shown in Formula 5 below (step 303).

These random numbers are sent to the data conversion device 202, as are the data input to the descriptor selection device 103. In accordance with the numbers selected by the descriptor extraction device 201, the data conversion device 202 extracts the first, third, and fifth columns of the input data and outputs the data shown in Formula 6 below (step 304).

These data are output to each similarity calculation device 104. In this example, for ease of understanding, the columns output are those whose numbers equal the values output by the descriptor extraction device 201, but the correspondence need not take this form; for example, for a value i, column ((the remainder of (i + 5) divided by the number of columns) + 1) may be selected. Nor is it necessary to map one value to one column; several columns may be extracted for one value. That is, the correspondence may be one-to-one, one-to-many, many-to-one, or many-to-many.
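Formulas 4 to 6 are not reproduced in this text, so the following is only a sketch of the column selection described above, using made-up data. The function names, the example rows, and the choice to draw distinct column numbers are assumptions; the source says only that the numbers are generated randomly. Column numbers are 1-based, as in the description.

```python
import random
from typing import List, Sequence


def draw_column_numbers(n_columns: int, k: int, rng: random.Random) -> List[int]:
    """Draw k column numbers in 1..n_columns (distinct here, by assumption)."""
    return sorted(rng.sample(range(1, n_columns + 1), k))


def select_columns(rows: Sequence[Sequence[int]],
                   column_numbers: Sequence[int]) -> List[List[int]]:
    """Keep only the given 1-based columns of every row (one row per data item)."""
    return [[row[c - 1] for c in column_numbers] for row in rows]


def shifted_column(i: int, n_columns: int) -> int:
    """Alternative mapping mentioned in the text: ((i + 5) mod n_columns) + 1."""
    return (i + 5) % n_columns + 1


if __name__ == "__main__":
    data = [[1, 0, 1, 1, 0, 0],   # hypothetical bit strings, six descriptors each
            [0, 1, 1, 0, 1, 0]]
    print(select_columns(data, [1, 3, 5]))            # direct mapping, columns 1, 3, 5
    print(draw_column_numbers(6, 3, random.Random(0)))
    print([shifted_column(i, 7) for i in (1, 3, 5)])  # shifted mapping for 7 columns
```

The same column numbers have to be applied to the training data and the prediction data so that the resulting bit strings remain comparable.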

The descriptor selection device 103 outputs data corresponding to the training data and to the prediction data, and these are sent to the similarity calculation device 104. For simplicity, these outputs are called training data A and prediction data A, respectively. For each individual item of the input prediction data A, the similarity calculation device 104 calculates the similarity to each item of training data A (step 305). In other words, for one item of prediction data A, as many similarities are calculated as there are items in training data A, and the highest of these values is output. Because one similarity is output for each item in prediction data A, in total as many values are output as there are items of prediction data. This processing is carried out independently in each installed similarity calculation device 104. Because the processing is independent, a plurality of devices is not strictly necessary; a single similarity calculation device may be installed and perform the processing sequentially.
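The step just described can be sketched as follows, assuming the Tanimoto coefficient as the similarity measure (one of the options named in this document). The names `tanimoto` and `max_similarity_per_item` and the example data are illustrative assumptions.

```python
from typing import List, Sequence


def tanimoto(x: Sequence[int], y: Sequence[int]) -> float:
    """Tanimoto coefficient of two equal-length 0/1 bit strings."""
    both = sum(a & b for a, b in zip(x, y))
    either = sum(a | b for a, b in zip(x, y))
    return both / either if either else 0.0  # assumption: all-zero pair scores 0.0


def max_similarity_per_item(training_a: Sequence[Sequence[int]],
                            prediction_a: Sequence[Sequence[int]]) -> List[float]:
    """For each prediction item, the highest similarity to any training item."""
    return [max(tanimoto(p, t) for t in training_a) for p in prediction_a]


if __name__ == "__main__":
    training_a = [[1, 1, 0], [0, 1, 1]]               # hypothetical "training data A"
    prediction_a = [[1, 1, 0], [1, 0, 0], [0, 0, 1]]  # hypothetical "prediction data A"
    print(max_similarity_per_item(training_a, prediction_a))
```

One value is returned per prediction item, so running this once per descriptor subset yields one list of values per similarity calculation device.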

Next, the similarities corresponding to the input prediction data, output by the similarity calculation devices 104, are sent to the calculation result totaling device 105. The calculation result totaling device 105 aggregates the calculated similarities for each item of prediction data (step 305). Here, the aggregation calculates the mean and the variance. Mean and variance are used in this description for simplicity, but the mean itself is not required; any function of it, such as a constant multiple of the mean, may be used. The same applies to the variance. The variance may also be replaced further, for example by finding the maximum and minimum values and using their difference, or a function of that difference. In total, a mean and a variance of the similarities are calculated for each item of prediction data. These indices are then mapped to a single value by a function, and that value is output. As such a function, for example, with the mean on the horizontal axis and the variance on the vertical axis, the distance from a given point to a given straight line can be used. If a sufficiently distant straight line parallel to the vertical axis is chosen, the final output is determined by the mean of the similarities alone. Likewise, if a sufficiently distant straight line that is not parallel to either axis is chosen, the result is a function of both the mean and the variance, and the value of that function is output.
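As a rough sketch of this aggregation, the code below computes the mean and variance of the similarities obtained for one prediction item across descriptor subsets, then maps the pair to a single value as the distance from the point (mean, variance) to a line a*x + b*y + c = 0. The function names, the use of the population variance, and the example numbers are assumptions; the source does not say which variance estimator is used.

```python
import math
from statistics import fmean, pvariance
from typing import Sequence, Tuple


def aggregate(similarities: Sequence[float]) -> Tuple[float, float]:
    """Mean and (population) variance of one prediction item's similarities."""
    return fmean(similarities), pvariance(similarities)


def distance_to_line(mean: float, variance: float,
                     a: float, b: float, c: float) -> float:
    """Distance from the point (mean, variance) to the line a*x + b*y + c = 0."""
    return abs(a * mean + b * variance + c) / math.hypot(a, b)


if __name__ == "__main__":
    sims = [0.2, 0.4, 0.6]  # hypothetical outputs of three similarity calculators
    m, v = aggregate(sims)
    # A distant line parallel to the vertical axis (x = 100): the result
    # depends on the mean only, as noted in the text.
    print(distance_to_line(m, v, 1.0, 0.0, -100.0))
```

With the vertical line x = 100 the distance is 100 minus the mean, so ranking by this value is equivalent to ranking by the mean similarity.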

Next, the effect of the similarity calculation processing apparatus according to the embodiment will be described concretely with reference to FIG. 4.

The effect of the present invention is that partially similar data, which were difficult to find with conventional similarity calculation methods, become easier to find.

As training data, the structures of the three compounds described in Non-Patent Document 1, converted to bit strings, were input; as evaluation data, the structures of about 16,000 commercially available compounds, converted to bit strings, were input.

The descriptor extraction device 201 was set to generate random numbers taking values from 1 to the number of columns.

Each of the descriptor selection devices 103 was set to generate the same number of random numbers.

The data conversion device 202 was set to extract the columns corresponding to the generated random numbers.

The similarity calculation device 104 was set to calculate similarity using the Tanimoto coefficient, the calculation result totaling device 105 was set to calculate the mean and the variance, and the function shown in Formula 7 below was used.

Formula 7 represents the distance to the plane x + 10*y - 10. The results calculated in this way are shown in FIG. 4.
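Formula 7 itself is not reproduced in this text. If x is taken to be the mean and y the variance (the axis assignment used in the earlier description of the aggregation, assumed here to carry over), and the quoted expression is read as the line x + 10y - 10 = 0, the standard point-to-line distance would be:

```latex
d(x, y) = \frac{\lvert x + 10y - 10 \rvert}{\sqrt{1^{2} + 10^{2}}}
        = \frac{\lvert x + 10y - 10 \rvert}{\sqrt{101}}
```

For instance, a mean of 0.5 and a variance of 0.01 would give d of about 0.935 under this reading. Whether the original formula includes the normalizing denominator cannot be confirmed from this text.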

The horizontal axis of FIG. 4 shows the value output by the apparatus; the closer the value is to 0, the higher the rank, and the closer it is to 1, the lower the rank. The vertical axis shows the number of active compounds included up to that rank.

Of the two lines in FIG. 4, the upper one shows the result for the index according to the present invention, and the lower one shows the result for the conventional method of ranking by the Tanimoto coefficient.

It can be seen that the method of the present invention places more active compounds in the high-ranking portion than the conventional method does. If a subset of commercially available compounds is to be selected using such a ranking, the active compounds are ranked higher with the present invention, so more active compounds can be selected.
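The kind of curve described for FIG. 4 can be sketched as a running count of active compounds in rank order, where a smaller output value means a higher rank. The function name and the example scores and activity labels are made up for illustration; they are not data from the experiment.

```python
from typing import List, Sequence


def cumulative_actives(scores: Sequence[float],
                       is_active: Sequence[bool]) -> List[int]:
    """Number of active compounds found within each rank, best rank first.

    Compounds are ranked by ascending score (closer to 0 means higher rank).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    counts, found = [], 0
    for i in order:
        found += int(is_active[i])
        counts.append(found)
    return counts


if __name__ == "__main__":
    scores = [0.9, 0.2, 0.5, 0.7]        # hypothetical output values
    active = [True, True, False, False]  # hypothetical activity labels
    print(cumulative_actives(scores, active))
```

Comparing two ranking methods then amounts to comparing their cumulative counts at the same rank; the method whose counts rise earlier places more actives near the top.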

As described above, the present invention is effective when searching for similar compounds in drug discovery and is applicable to fields such as pharmaceuticals.

FIG. 1 is a diagram showing the schematic configuration of a similarity calculation processing apparatus according to an embodiment of the present invention.
FIG. 2 is a diagram showing the specific configuration of the descriptor selection device of FIG. 1.
FIG. 3 is a diagram used to explain the operation of the similarity calculation processing apparatus 100 of FIG. 1.
FIG. 4 is a diagram used to explain concretely the effect of the similarity calculation processing apparatus according to the embodiment of the present invention.

Explanation of symbols

100 Similarity calculation processing apparatus
101 Training data storage device
102 Prediction data storage device
103 Descriptor selection device
104 Similarity calculation device
105 Calculation result totaling device
201 Descriptor extraction device
202 Data conversion device

Claims

1. A similarity calculation processing method comprising: storing input training data; storing data for prediction; extracting a part of the descriptors of the input training data and prediction data; calculating a similarity between the training data and the prediction data; and calculating, from the results of the similarity calculation, a similarity for each item of the stored prediction data.

2. The similarity calculation processing method according to claim 1, wherein the similarity calculation is performed by calculating a similarity using a Tanimoto coefficient.

3. The similarity calculation processing method according to claim 1, wherein the similarity calculation outputs a maximum value from the calculated similarities.

4. The similarity calculation processing method according to any one of claims 1 to 3, wherein, in calculating the similarity for each item of the stored prediction data, an average value and a variance are calculated by aggregating the values corresponding to the stored prediction data, and a final numerical value is output using a function of the two values.

5. The similarity calculation processing method according to claim 1, wherein the descriptors are extracted a plurality of times, and the same number of random numbers is generated in each of the plurality of extraction operations.

6. The similarity calculation processing method according to claim 1, wherein, in calculating the similarity, a final numerical value is output using a function of a distance from a predetermined plane.

7. A similarity calculation processing system comprising: a training data storage device for storing input training data; a prediction data storage device for storing data for prediction; at least one descriptor selection device for extracting a part of the descriptors of the input training data and of the prediction data; a similarity calculation device that is provided corresponding to each descriptor selection device and calculates the similarity between the training data and the prediction data; and a calculation result aggregation device that calculates, from the calculation results of the similarity calculation device, a similarity for each item of data stored in the prediction data storage device.

8. The similarity calculation processing system according to claim 7, wherein the similarity calculation device performs similarity calculation using a Tanimoto coefficient.

9. The similarity calculation processing system according to claim 7 or 8, wherein the similarity calculation device outputs a maximum value from the calculated similarities.

10. The similarity calculation processing system according to any one of claims 7 to 9, wherein the calculation result aggregation device calculates an average value and a variance by aggregating the values corresponding to the data stored in the prediction data storage device, and outputs a final numerical value using a function of the two values.

11. The similarity calculation processing system according to any one of claims 7 to 10, wherein each of the descriptor selection devices generates the same number of random numbers as the others.

12. The similarity calculation processing system according to claim 7, wherein the calculation result aggregation device outputs a final numerical value using a function of a distance from a predetermined plane.

13. A similarity calculation processing program comprising: training data storage means for storing input training data; prediction data storage means for storing data for prediction; at least one descriptor selecting means for extracting a part of the descriptors of the input training data and of the prediction data; similarity calculating means, provided corresponding to each of the descriptor selecting devices, for calculating the similarity between the training data and the prediction data; and calculation result aggregation means for calculating, based on the results calculated by the similarity calculating device, a similarity for each item of data stored in the prediction data storage device.

14. The similarity calculation processing program according to claim 13, wherein the similarity calculation means performs similarity calculation using a Tanimoto coefficient.

15. The similarity calculation processing program according to claim 13, wherein the similarity calculation means outputs the maximum value from the calculated similarities.

16. The similarity calculation processing program according to any one of claims 13 to 15, wherein the calculation result aggregation means calculates an average value and a variance by aggregating the values corresponding to the data stored in the prediction data storage device, and outputs a final numerical value using a function of the two values.

17. The similarity calculation processing program according to any one of claims 13 to 16, wherein each of the descriptor selection means generates the same number of random numbers as the others.

18. The similarity calculation processing program according to any one of claims 13 to 17, wherein the calculation result aggregation device outputs a final numerical value using a function of a distance from a predetermined plane.
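For orientation only, the flow recited in the claims (descriptor subset selection, Tanimoto similarity, maximum over the training data, aggregation into an average and a variance) might be sketched in Python as below. This is a hedged sketch and not the claimed implementation. It assumes each compound is represented as a set of "on" descriptor indices, that "the same number of random numbers" means every subset has the same size, and that the final value is the average plus the standard deviation; the claims only require "a function of the two values". All function and parameter names are hypothetical.

```python
import random
from statistics import mean, pvariance


def tanimoto(a, b):
    """Tanimoto coefficient of two sets of 'on' descriptor indices."""
    union = len(a | b)
    # Assumption: two empty sets are treated as having similarity 0.
    return len(a & b) / union if union else 0.0


def select_subsets(n_descriptors, subset_size, n_subsets, seed=0):
    """Draw n_subsets random descriptor subsets, each of the same size."""
    rng = random.Random(seed)
    return [frozenset(rng.sample(range(n_descriptors), subset_size))
            for _ in range(n_subsets)]


def score_compound(query, training, subsets,
                   combine=lambda m, v: m + v ** 0.5):
    """Score one prediction compound against the training compounds.

    For each descriptor subset, both compounds are restricted to that
    subset and the maximum Tanimoto similarity over the training set
    is kept. The per-subset maxima are then aggregated into an average
    and a variance, which `combine` (an assumed function) turns into
    the final value.
    """
    per_subset = []
    for subset in subsets:
        q = query & subset
        per_subset.append(max(tanimoto(q, t & subset) for t in training))
    return combine(mean(per_subset), pvariance(per_subset))


# Toy usage with made-up compounds over 8 hypothetical descriptors.
training = [{0, 1, 2}, {5, 6}]
subsets = select_subsets(n_descriptors=8, subset_size=4, n_subsets=5)
print(score_compound({0, 1, 7}, training, subsets))
```

Restricting the comparison to descriptor subsets is what allows a compound that matches a training compound only in part to score highly on at least some subsets. The distance-from-a-plane variant recited in claims 6, 12, and 18 is not covered by this sketch.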