JP7218274B2

JP7218274B2 - Compound Property Prediction Apparatus, Compound Property Prediction Program, and Compound Property Prediction Method for Predicting Properties of Compound

Info

Publication number: JP7218274B2
Application number: JP2019200488A
Authority: JP
Inventors: 諒亮亀澤; 和樹藤川; 正弘望月
Original assignee: DeNA Co Ltd
Current assignee: DeNA Co Ltd
Priority date: 2019-11-05
Filing date: 2019-11-05
Publication date: 2023-02-06
Anticipated expiration: 2039-11-05
Also published as: JP2021076890A

Description

The present invention relates to a compound property prediction device, a compound property prediction program, and a compound property prediction method for predicting the properties of a compound.

In drug discovery research, a compound identified as a candidate for a new drug (hereinafter, a lead compound) serves as the starting point, and the structure of the drug compound is gradually optimized by repeatedly designing, synthesizing, and evaluating compounds in which the structure of the lead compound is modified (hereinafter, an optimization program). In this process, compounds are sought that improve the ADMET attributes, which describe a compound's properties with respect to absorption, distribution, metabolism, excretion, and toxicity in humans and animals, while maintaining the main pharmacological activity (efficacy) of the lead compound.

In an optimization program that starts from a lead compound, the compounds to be predicted are not chosen at random from a vast set of compounds. In many cases it is preferable to select compounds that are structurally similar to compounds proposed during the optimization program whose properties, such as ADMET attributes, are already known. To shorten optimization programs and reduce their cost, a technique is therefore desired for predicting the ADMET attributes of compounds that have not yet been examined in the optimization program.

One aspect of the present invention is a compound property prediction device for predicting a property of a compound, the device being able to access storage means storing a compound database in which, for a plurality of compounds, the property actually measured for each compound is associated with that compound, the device comprising: property learning means for constructing a compound property prediction model by machine learning for predicting the property of a prediction target compound, using as supervised training data at least combinations of (i) the common structure and difference structures of selected compounds, which are two compounds selected from the compound database, and (ii) the properties of the selected compounds; and property prediction means for obtaining a prediction of the property of the prediction target compound as the output of the compound property prediction model by inputting into the model the common structure and difference structures of the prediction target compound and a compound selected from the compound database.

Here, the property learning means preferably uses a graph neural network (GNN), with the common structure used as a common graph structure and the difference structure used as a difference graph structure in the supervised training data.

The common structure of the selected compounds is preferably obtained by maximum common substructure (MCS) analysis.

The property is preferably at least one ADMET attribute of the compound.

The compound database preferably includes compounds obtained in lead compound optimization programs in drug discovery research, together with the properties actually measured for those compounds.

Preferably, the data in the compound database are arranged in chronological order for each optimization program and divided, and the machine learning is performed using the earlier part as the supervised training data and the later part as validation data or evaluation data.

Preferably, the compound database includes compounds obtained in a plurality of optimization programs for the lead compound, together with the properties actually measured for those compounds, and the machine learning is repeated while the validation data are selected from each optimization program in turn.

Another aspect of the present invention is a compound property prediction program for predicting a property of a compound, the program causing a computer that can access storage means storing a compound database, in which, for a plurality of compounds, the property actually measured for each compound is associated with that compound, to function as: property learning means for constructing a compound property prediction model by machine learning for predicting the property of a prediction target compound, using as supervised training data at least combinations of (i) the common structure and difference structures of selected compounds, which are two compounds selected from the compound database, and (ii) the properties of the selected compounds; and property prediction means for obtaining a prediction of the property of the prediction target compound as the output of the compound property prediction model by inputting into the model the common structure and difference structures of the prediction target compound and a compound selected from the compound database.

Another aspect of the present invention is a compound property prediction method for predicting a property of a compound, the method using a computer that can access storage means storing a compound database in which, for a plurality of compounds, the property actually measured for each compound is associated with that compound, the method comprising: a property learning step of constructing a compound property prediction model by machine learning for predicting the property of a prediction target compound, using as supervised training data at least combinations of (i) the common structure and difference structures of selected compounds, which are two compounds selected from the compound database, and (ii) the properties of the selected compounds; and a property prediction step of obtaining a prediction of the property of the prediction target compound as the output of the compound property prediction model by inputting into the model the common structure and difference structures of the prediction target compound and a compound selected from the compound database.

One object of the embodiments of the present invention is to provide a compound property prediction device, a compound property prediction program, and a compound property prediction method that make it possible to predict the properties of target compounds in an optimization program for a lead compound. Other objects of the embodiments will become apparent from the specification as a whole.

FIG. 1 is a diagram showing the configuration of the compound property prediction device according to an embodiment of the present invention.
FIG. 2 is a flowchart showing the compound property prediction processing according to the embodiment.
FIG. 3 is a diagram showing examples of compound structures according to the embodiment.
FIG. 4 is a diagram showing an example of the compound database according to the embodiment.
FIG. 5 is a diagram showing optimization programs according to the embodiment.
FIG. 6 is a diagram explaining the data division processing according to the embodiment.
FIG. 7 is a diagram explaining the processing for obtaining the common structure and difference structures of compounds according to the embodiment.
FIG. 8 is a diagram explaining the machine learning according to the embodiment.
FIG. 9 is a diagram explaining the processing for predicting the properties of a compound according to the embodiment.

As shown in FIG. 1, the compound property prediction device 100 according to an embodiment of the present invention includes a processing unit 10, a storage unit 12, an input unit 14, an output unit 16, and a communication unit 18.

The compound property prediction device 100 can be implemented on a general-purpose computer. The processing unit 10 includes a CPU and the like and performs the overall processing of the compound property prediction device 100. By executing the compound property prediction program stored in the storage unit 12, the processing unit 10 performs the compound property prediction processing of this embodiment. The storage unit 12 stores the information needed for the compound property prediction processing, such as the compound property prediction model (compound property predictor) used in that processing and a compound database that associates compounds obtained in drug discovery research with their properties. The storage unit 12 can be, for example, a semiconductor memory or a hard disk. It may be provided inside the compound property prediction device 100, or provided externally so that the processing unit 10 can access it over a wired or wireless information network. The input unit 14 includes means for inputting information to the compound property prediction device 100. The output unit 16 includes means for displaying information processed by the compound property prediction device 100. The communication unit 18 includes an interface for exchanging information with external devices (such as servers). The communication unit 18 enables communication with external devices by, for example, being connected to an information communication network such as the Internet.

[Compound Property Prediction Processing]
The compound property prediction processing of this embodiment is described below with reference to the flowchart of FIG. 2. By executing the compound property prediction program, the compound property prediction device 100 performs machine learning for predicting compound properties using known learning data that include compounds and their properties, thereby generating a compound property prediction model, and then uses that model to predict the properties of a prediction target compound.

In this embodiment, compounds already evaluated in optimization programs, in which design, synthesis, and evaluation are repeated to modify the structure of a lead compound found as a new drug candidate in drug discovery research, are stored in the storage unit 12 as a compound database in association with their ADMET attributes. The machine learning uses the compounds and ADMET attributes in the compound database as supervised training data.

However, the compounds used for learning and the compounds to be predicted are not limited to compounds related to drug discovery. The properties of a compound are likewise not limited to ADMET attributes and may be any information about the compound. All items of the ADMET attributes may be used, or only some of them.

FIG. 3 shows examples of compounds. FIG. 3(a) shows the structural formula of the lead compound that serves as the starting point of an optimization program. FIGS. 3(b) and 3(c) show the structural formulas of analog A and analog B, in which part of the structure of the lead compound has been changed to another structure. The structures of the lead compound and the analogs are examples only and are not limiting. An analog may be, for example, a structure obtained by removing part of the lead compound, a structure obtained by replacing part of the lead compound with another structure, or a structure obtained by adding another structure to the lead compound.

FIG. 4 shows an example of the compound database. The compound database stores, in association with one another, an optimization program ID (PID) uniquely assigned to each series of optimization programs carried out in drug discovery research, the optimization program name, a compound ID uniquely assigned to each compound, the compound name, the structure of the compound, the properties of the compound evaluated in the optimization program, and the evaluation date and time. As the structure of a compound, the atoms constituting the compound and their bonding states are stored, as illustrated in FIG. 3. The structure may be registered in the compound database in, for example, SMILES notation.
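One row of such a database might be modeled as follows. This is a minimal sketch, not the schema of the patent: the class name, field names, IDs, SMILES strings, and property values are all made-up placeholders chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class CompoundRecord:
    """One row of a compound database (all field names are hypothetical)."""
    program_id: str        # optimization program ID (PID)
    program_name: str
    compound_id: str
    compound_name: str
    smiles: str            # structure in SMILES notation
    properties: dict       # measured properties keyed by name
    evaluated_at: datetime # evaluation date and time


def group_by_program(records):
    """Group records by program and sort each group by evaluation time."""
    grouped = {}
    for rec in records:
        grouped.setdefault(rec.program_id, []).append(rec)
    for recs in grouped.values():
        recs.sort(key=lambda r: r.evaluated_at)
    return grouped


# Placeholder rows; the numbers are invented, not measured data.
records = [
    CompoundRecord("P-a", "program a", "C-002", "analog A", "Nc1ccccc1",
                   {"cyp3a4_inhibition": 25.0}, datetime(2019, 2, 5)),
    CompoundRecord("P-a", "program a", "C-001", "lead compound", "Oc1ccccc1",
                   {"cyp3a4_inhibition": 40.0}, datetime(2019, 1, 10)),
    CompoundRecord("P-b", "program b", "C-101", "analog X", "Cc1ccccc1",
                   {"cyp3a4_inhibition": 10.0}, datetime(2019, 3, 1)),
]
grouped = group_by_program(records)
```

Keeping the evaluation time on every row is what later allows the data to be ordered chronologically within each optimization program.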

FIG. 4 shows an example in which the compound ID, compound name, structure, and properties are associated for only three compounds evaluated in optimization program a. In general, however, the properties of many compounds derived from the lead compound are evaluated in each optimization program and stored in the compound database. In addition, this embodiment gives only the CYP3A4 inhibition rate and the solubility in JP1 as examples of ADMET attributes, but the properties are not limited to these; values of other ADMET items or other properties of the compound may be used.

As shown in FIG. 5, in drug discovery research multiple optimization programs are run from a single lead compound. When multiple optimization programs are run, the compound ID, compound name, compound structure and properties, and a value indicating chronological order, such as the evaluation date and time, are therefore stored in the compound database in association with each optimization program.

In step S10, the compound database is divided. Through this step, the compound property prediction device 100 functions as data division means. For machine learning, the compound property prediction device 100 divides the data stored in the compound database into supervised training data, validation data, and evaluation data.

The processing unit 10 reads the compound database from the storage unit 12 and performs the following processing. In this embodiment, as shown in FIG. 6, the data stored in the compound database are sorted chronologically by evaluation date and time for each optimization program and then divided into training data, validation data, and evaluation data.

Here, training data are the data used to construct the compound property prediction model of the compound property prediction device 100 by machine learning. Validation data are the data used to determine the hyperparameters of the machine learning and to select a model. Evaluation data are the data used to evaluate whether the compound property prediction model constructed by machine learning is appropriate.

The processing unit 10 arranges the data stored in the compound database in chronological order for each optimization program and divides them into two groups, an earlier group and a later group. Among the multiple optimization programs (programs a to d), the later group of the optimization program targeted for prediction (program a) is used as the evaluation data. The later groups of the optimization programs other than the one from which the evaluation data were taken (programs b to d) are used as the validation data.

To make the machine learning more reliable, cross-validation may be applied so that the machine learning is repeated with different validation data. For example, as shown in learning processes 1 to 3 of FIG. 6, the optimization program from which the validation data are taken may be changed in turn among the optimization programs other than the one supplying the evaluation data (programs b to d), and the machine learning repeated for each choice.

In this embodiment (FIG. 6), the data in the compound database are divided for each optimization program into the first 30% and the remaining 70%, but the division is not limited to this and other ratios may be used. That is, the data may be divided in any ratio that lets the compound property prediction model be trained so that it appropriately outputs the properties of the prediction target compound.
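The chronological split and the rotation of the validation program described above could be sketched as follows. This is an illustration under stated assumptions: records are plain tuples of (program ID, compound ID, evaluation order), the function names are hypothetical, the earlier groups of all programs are pooled as training data, and the later groups of programs not chosen for validation in a given fold are simply left unused, which the text does not specify.

```python
def split_by_time(records, front_percent=30):
    """Sort each program's records by time and cut them into front/back groups."""
    by_program = {}
    for rec in records:
        by_program.setdefault(rec[0], []).append(rec)
    front, back = {}, {}
    for pid, recs in by_program.items():
        ordered = sorted(recs, key=lambda r: r[2])
        cut = len(ordered) * front_percent // 100  # integer arithmetic, no float rounding
        front[pid], back[pid] = ordered[:cut], ordered[cut:]
    return front, back


def folds(front, back, target_pid):
    """Yield one fold per non-target program, rotating the validation source."""
    train = [r for pid in sorted(front) for r in front[pid]]
    for pid in sorted(back):
        if pid == target_pid:
            continue
        yield {
            "train": train,                   # earlier groups (assumption: all programs pooled)
            "validation": back[pid],          # later group of one other program
            "evaluation": back[target_pid],   # later group of the target program
            "validation_program": pid,
        }


# Toy data: four programs a-d with ten compounds each, ordered by an integer time stamp.
records = [(pid, f"{pid}-{i}", i) for pid in "abcd" for i in range(10)]
front, back = split_by_time(records)
all_folds = list(folds(front, back, target_pid="a"))
```

Because the evaluation data always come from later time points than the training data of the same program, the evaluation mimics predicting compounds that have not yet been examined.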

In step S12, pairs of compounds are selected. Through this step, the compound property prediction device 100 functions as compound selection means. From the compound database stored in the storage unit 12, the processing unit 10 selects two compounds as selected compounds from the training data associated with the same optimization program. The data associated with the selected compounds are used as supervised training data for training the compound property prediction model.

Here, when pairs of two compounds are selected from the training data associated with the same optimization program, simply training on every possible pair of compounds as the data set may cause the compound property prediction model to overfit. Therefore, one compound of the pair is first sampled uniformly from the training data associated with the same optimization program, so that one of the two compounds is drawn from the training data without bias. Then, a compound that can be paired with the sampled compound is sampled uniformly from the remaining training data, and the two compounds are combined and selected as the selected compounds. This ensures that at least one compound of each pair is chosen from the training data without bias. In an implementation, a set of candidate partner compounds may be prepared for each compound in the compound database, and compounds may be selected from that set in turn.
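The two-stage uniform sampling described above might look like the following sketch. The function name and the compound IDs are hypothetical, and the sketch assumes that every other compound of the same program is an eligible partner; the text leaves open how the set of possible partners is defined.

```python
import random


def sample_pairs(compound_ids, n_pairs, rng):
    """Draw pairs from one program's training compounds.

    The first compound is sampled uniformly from all compounds; its partner is
    then sampled uniformly from the remaining ones, instead of enumerating
    every possible pair.
    """
    if len(compound_ids) < 2:
        raise ValueError("need at least two compounds to form a pair")
    pairs = []
    for _ in range(n_pairs):
        first = rng.choice(compound_ids)
        partner = rng.choice([c for c in compound_ids if c != first])
        pairs.append((first, partner))
    return pairs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
pairs = sample_pairs(["C-001", "C-002", "C-003", "C-004"], n_pairs=8, rng=rng)
```

With n compounds there are n(n-1)/2 unordered pairs, so sampling a fixed number of pairs keeps the training set size under control as the program grows.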

Similarly, the processing unit 10 selects two compounds as selected compounds from the validation data in the compound database stored in the storage unit 12. Likewise, the processing unit 10 selects two compounds as selected compounds from the evaluation data in the compound database stored in the storage unit 12.

In step S14, the common structure and difference structures of the compounds are extracted. Through this step, the compound property prediction device 100 functions as structure analysis means. For each pair of compounds selected in step S12, the processing unit 10 extracts the chemical structure the two compounds share and the chemical structures they do not share as the common structure and the difference structures, respectively, and vectorizes them. For example, the common structure of the two selected compounds can be extracted with rdFMCS.FindMCS() of RDKit, which performs maximum common substructure (MCS) analysis. For each of the two compounds, the structure other than the common structure is then extracted as a difference structure. The common structure and difference structures can be expressed in, for example, SMILES notation.

For example, as shown in FIG. 7, the common structure and difference structures are extracted for each pair of selected compounds (selected compound 1 and selected compound 2). Here, the structure present in selected compound 1 but absent from selected compound 2 is extracted as difference structure 1, and the structure present in selected compound 2 but absent from selected compound 1 is extracted as difference structure 2.
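Once an MCS tool has matched the atoms of the common structure between the two compounds, deriving difference structure 1 and difference structure 2 is a set difference over atom indices. The sketch below assumes such an atom mapping is already available; here it is written by hand for toluene (Cc1ccccc1) and phenol (Oc1ccccc1), with atoms numbered in SMILES order, rather than computed by RDKit. The function name is hypothetical.

```python
def split_by_match(n_atoms_1, n_atoms_2, mapping):
    """Partition atom indices into common and difference parts.

    mapping: list of (atom index in compound 1, atom index in compound 2)
    pairs that make up the common structure.
    """
    common_1 = {i for i, _ in mapping}
    common_2 = {j for _, j in mapping}
    diff_1 = sorted(set(range(n_atoms_1)) - common_1)  # only in compound 1
    diff_2 = sorted(set(range(n_atoms_2)) - common_2)  # only in compound 2
    return sorted(common_1), sorted(common_2), diff_1, diff_2


# Toluene: atom 0 is the methyl carbon, atoms 1-6 the ring.
# Phenol:  atom 0 is the oxygen,        atoms 1-6 the ring.
# Hand-written mapping of the shared benzene ring (an assumption, not tool output).
ring_mapping = [(i, i) for i in range(1, 7)]
common_1, common_2, diff_1, diff_2 = split_by_match(7, 7, ring_mapping)
```

In this toy case difference structure 1 is the methyl carbon and difference structure 2 is the hydroxyl oxygen, while the six ring atoms form the common structure.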

In step S16, machine learning is performed to construct the compound property prediction model. Through this step, the compound property prediction device 100 functions as property learning means. As shown in FIG. 8, the processing unit 10 treats the common structure and the difference structures (difference structure 1 and difference structure 2) of the selected compounds in the training data, extracted in step S14, as a common graph structure and difference graph structures, respectively. It combines these with the properties of the two selected compounds stored in the compound database as the teacher data, and trains the compound property prediction model so that, given an input containing the common structure and difference structures of the two selected compounds, the model outputs the properties of those compounds.

A graph neural network (GNN) is preferably used for the compound property prediction model. A GNN is a neural network that operates on graph structures, and many GNN models have been proposed. Although not limited to it, GIN (Graph Isomorphism Network) [Xu+, ICLR2019] is preferably used to construct the compound property prediction model of the compound property prediction device 100. The number of layers of the neural network, the activation function, the loss function, and so on are preferably chosen as appropriate for the machine learning model.

Specifically, the processing may be performed, for example, as follows. The common structure of the selected compounds extracted in step S14 is input to the GNN, graph convolution is performed on the common structure as a subgraph, and a readout is then performed over the entire graph of the common structure to obtain a feature vector for the whole common structure. Here, a readout is an operation that, for example, computes the sum or the maximum of the vectors assigned to all nodes (atoms) in the graph structure. Likewise, the difference structures of the selected compounds extracted in step S14 (difference structure 1 and difference structure 2) are input to the GNN, graph convolution is performed on each difference structure as a subgraph, and a readout is performed over the graph of the difference structure to obtain a feature vector for the difference structure. Alternatively, instead of convolving the difference structures alone, graph convolution may be performed on the full structures of the two selected compounds, followed by a readout restricted to the difference structure or a readout over the whole compound (all nodes of the compound's graph).
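The convolution and readout operations described above can be illustrated in pure Python. This is a deliberately simplified sketch: the update follows the GIN-style aggregation h_v <- (1 + eps) * h_v + sum of neighbor vectors, but the learned MLP of an actual GIN layer is omitted, and the graph, features, and function names are toy assumptions.

```python
def gin_style_layer(features, neighbors, eps=0.0):
    """One simplified message-passing step (no learned weights)."""
    updated = []
    for v, h_v in enumerate(features):
        agg = [(1.0 + eps) * x for x in h_v]
        for u in neighbors[v]:
            agg = [a + b for a, b in zip(agg, features[u])]
        updated.append(agg)
    return updated


def readout(features, nodes, mode="sum"):
    """Pool the vectors of the given nodes into one vector (sum or max)."""
    columns = list(zip(*(features[v] for v in nodes)))
    if mode == "sum":
        return [sum(c) for c in columns]
    if mode == "max":
        return [max(c) for c in columns]
    raise ValueError(f"unknown readout mode: {mode}")


# Toy molecule: a chain 0-1-2 with one-hot atom-type features.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
neighbors = [[1], [0, 2], [1]]

hidden = gin_style_layer(features, neighbors)
whole = readout(hidden, nodes=[0, 1, 2], mode="sum")   # readout over all nodes
peak = readout(hidden, nodes=[0, 1, 2], mode="max")
diff_only = readout(hidden, nodes=[0], mode="sum")     # readout limited to a difference structure
```

The last call corresponds to the variant in which convolution runs over the whole compound and only the nodes of the difference structure are pooled, so the difference vector still carries information about its neighboring atoms.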

In addition to the common structure and difference structures of the selected compounds in the training data, the types of atoms constituting the two selected compounds and the bonding states between the atoms may also be input as supervised training data for the machine learning. The teacher data may be the properties of the two selected compounds, or the difference between the properties of the two compounds.

In this way, the compound property prediction model is trained so that it takes as input training data including at least the common structure and difference structures of the selected compounds in the training data and outputs the properties of those compounds. Further, using the validation data selected in step S12, that is, the common structure and difference structures of the selected compounds in the validation data and the property data of those compounds, the hyperparameters of the resulting compound property prediction models are determined and the best model is selected. In addition, using the evaluation data selected in step S12, the resulting model is evaluated as to whether, given the common structure and difference structures of the selected compounds in the evaluation data, it outputs the properties actually measured for those compounds.

When cross-validation is applied, machine learning is repeated while changing the verification data. For example, as shown in learning processes 1 to 3 in FIG. 6, machine learning is repeated while changing, in turn, the optimization program from which the verification data is extracted.
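
The rotation of the verification source can be sketched as below. The description states only that the optimization program supplying the verification data is changed in turn; treating the remaining programs as the source of training data in each round is an assumption of this sketch, and the program names are hypothetical.

```python
from typing import Iterator, List, Tuple


def rotate_verification_program(
        programs: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (training programs, verification program) for each learning round."""
    for held_out in programs:
        training = [p for p in programs if p != held_out]
        yield training, held_out


# Three hypothetical optimization programs give three learning rounds.
rounds = list(rotate_verification_program(["program_1", "program_2", "program_3"]))
```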

In step S18, the properties of the compound to be predicted are predicted. Through the processing of this step, the compound property prediction device 100 functions as property prediction means. First, the device accepts input of structural data for a compound whose properties are to be predicted in one of the optimization programs. The structure of the compound to be predicted may be received via the input unit 14 or may be stored in the storage unit 12 in advance. Next, the processing unit 10 selects from the compound database one compound belonging to the same optimization program as the compound to be predicted, and extracts and vectorizes the common structure and differential structures between the structure of that compound and the structure of the compound to be predicted. For example, the common structure of the two compounds can be extracted using rdFMCS.FindMCS() of rdkit. For each of the two compounds, the structure other than the common structure is then extracted as a differential structure. Finally, the vectorized common structure and differential structures are input to the compound property prediction model obtained in step S16, and the predicted properties of the compound are obtained as output.
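
The extraction of differential structures can be sketched in plain Python. The sketch assumes that a maximum common substructure routine has already reported which atom indices of each compound belong to the common structure; it does not call rdkit. The function names, the dictionary layout, and the assignment of "differential_1" to the compound to be predicted are hypothetical.

```python
from typing import Dict, List, Tuple


def split_structures(num_atoms: int,
                     common_atoms: List[int]) -> Tuple[List[int], List[int]]:
    """Split a compound's atom indices into common and differential parts."""
    common_set = set(common_atoms)
    common = sorted(common_set)
    differential = [i for i in range(num_atoms) if i not in common_set]
    return common, differential


def build_model_input(target: Dict, reference: Dict) -> Dict[str, List[int]]:
    """Collect the atom sets handed to the vectorization step.

    `target` is the compound to be predicted and `reference` is the compound
    chosen from the same optimization program.
    """
    common, diff_target = split_structures(target["num_atoms"],
                                           target["common_atoms"])
    _, diff_reference = split_structures(reference["num_atoms"],
                                         reference["common_atoms"])
    return {"common": common,
            "differential_1": diff_target,
            "differential_2": diff_reference}


# Hypothetical atom counts and common-structure atom indices.
target = {"num_atoms": 6, "common_atoms": [0, 1, 2, 3]}
reference = {"num_atoms": 5, "common_atoms": [1, 2, 3, 4]}
model_input = build_model_input(target, reference)
```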

If, in step S16, the types of atoms constituting the two selected compounds and the bonding states between those atoms were input as supervised training data in addition to the common structure and differential structures of the selected compounds, then the types of atoms constituting the compounds and the bonding states between atoms should likewise be input to the compound property prediction model at prediction time.

As described above, the compound property prediction device 100 of this embodiment performs machine learning with training data that includes the common structure and differential structures of compounds, which makes it possible to build a compound property prediction model that predicts the properties of compounds with unknown properties more appropriately. In addition, by arranging the data used for machine learning in chronological order, dividing it, and using the data that falls in the later group as evaluation data or verification data, a compound property prediction model that predicts the properties of such compounds still more appropriately can be built.
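
The chronological division can be sketched as follows. The record fields "program", "date", and "id", the ISO date strings, and the train_fraction parameter are all hypothetical; the description does not state the proportion at which the data is divided.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def time_ordered_split(records: List[Dict],
                       train_fraction: float = 0.5
                       ) -> Tuple[List[Dict], List[Dict]]:
    """Split each optimization program's records by date.

    The earlier part of each program becomes training data and the later
    part is kept for verification or evaluation.
    """
    by_program = defaultdict(list)
    for record in records:
        by_program[record["program"]].append(record)

    earlier, later = [], []
    for program in sorted(by_program):
        ordered = sorted(by_program[program], key=lambda r: r["date"])
        cut = int(len(ordered) * train_fraction)
        earlier.extend(ordered[:cut])
        later.extend(ordered[cut:])
    return earlier, later


# Hypothetical records from two optimization programs.
records = [
    {"id": "c1", "program": "P1", "date": "2019-03-01"},
    {"id": "c2", "program": "P1", "date": "2019-01-15"},
    {"id": "c3", "program": "P1", "date": "2019-02-10"},
    {"id": "c4", "program": "P1", "date": "2019-04-20"},
    {"id": "d1", "program": "P2", "date": "2019-05-01"},
    {"id": "d2", "program": "P2", "date": "2019-02-01"},
]
training, held_back = time_ordered_split(records)
training_ids = [r["id"] for r in training]
held_back_ids = [r["id"] for r in held_back]
```

Holding back the later compounds mirrors actual use, where the model is asked about compounds designed after those it was trained on.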

In the compound property prediction device 100 of this embodiment, the data division means, compound selection means, structure analysis means, property learning means, and property prediction means are all implemented in a single device, but these means may instead be implemented in different devices or by different execution entities. For example, some of these means may be implemented on a server computer and the remaining means on a client computer.

10 processing unit, 12 storage unit, 14 input unit, 16 output unit, 18 communication unit, 20 processing unit, 22 storage unit, 24 input unit, 26 output unit, 28 communication unit, 100 compound property prediction device.

Claims

A compound property prediction device for predicting properties of a compound, comprising:
storage means storing a compound database in which, for each of a plurality of compounds, a compound obtained in a lead compound optimization program in drug discovery research is associated with properties actually measured for that compound;
property learning means for constructing a compound property prediction model for predicting properties of a compound to be predicted, the model being machine-learned with supervised training data that takes two compounds selected from the compound database as selected compounds and includes at least a combination of the common structure and differential structure of the selected compounds and the properties of the selected compounds, wherein the data contained in the compound database is arranged in time series and divided for each optimization program, the data of the earlier part being used as the supervised training data and the data of the later part being used as verification data or evaluation data; and
property prediction means for obtaining, as an output of the compound property prediction model, a prediction result of the properties of the compound to be predicted by inputting to the compound property prediction model the common structure and differential structure between the compound to be predicted and a compound selected from the compound database.

The compound property prediction device according to claim 1,
wherein the compound property prediction model uses a graph neural network (GNN) that takes, as the supervised training data, the common structure as a common graph structure and the differential structure as a differential graph structure.

The compound property prediction device according to claim 1 or 2,
wherein the common structure of the selected compounds is determined by maximum common substructure (MCS) analysis.

The compound property prediction device according to any one of claims 1 to 3,
wherein the property is at least one ADMET attribute of the compound.

The compound property prediction device according to claim 1,
wherein the compound database includes compounds obtained in a plurality of the optimization programs for lead compounds and properties actually measured for those compounds, and the machine learning is performed repeatedly while the verification data is selected from each optimization program in turn.

A compound property prediction program for predicting properties of a compound,
the program causing a computer that can access storage means storing a compound database in which, for each of a plurality of compounds, a compound obtained in a lead compound optimization program in drug discovery research is associated with properties actually measured for that compound, to function as:
property learning means for constructing a compound property prediction model for predicting properties of a compound to be predicted, the model being machine-learned with supervised training data that takes two compounds selected from the compound database as selected compounds and includes at least a combination of the common structure and differential structure of the selected compounds and the properties of the selected compounds, wherein the data contained in the compound database is arranged in time series and divided for each optimization program, the data of the earlier part being used as the supervised training data and the data of the later part being used as verification data or evaluation data; and
property prediction means for obtaining, as an output of the compound property prediction model, a prediction result of the properties of the compound to be predicted by inputting to the compound property prediction model the common structure and differential structure between the compound to be predicted and a compound selected from the compound database.

A compound property prediction method for predicting properties of a compound,
the method using a computer that can access storage means storing a compound database in which, for each of a plurality of compounds, a compound obtained in a lead compound optimization program in drug discovery research is associated with properties actually measured for that compound, and comprising:
a property learning step of constructing a compound property prediction model for predicting properties of a compound to be predicted, the model being machine-learned with supervised training data that takes two compounds selected from the compound database as selected compounds and includes at least a combination of the common structure and differential structure of the selected compounds and the properties of the selected compounds, wherein the data contained in the compound database is arranged in time series and divided for each optimization program, the data of the earlier part being used as the supervised training data and the data of the later part being used as verification data or evaluation data; and
a property prediction step of obtaining, as an output of the compound property prediction model, a prediction result of the properties of the compound to be predicted by inputting to the compound property prediction model the common structure and differential structure between the compound to be predicted and a compound selected from the compound database.